Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Regular expression to match a #

Reply
Thread Tools

Regular expression to match a #

 
 
John Machin
Guest
Posts: n/a
 
      08-11-2005
Duncan Booth wrote:
> John Machin wrote:
>
>
>>>Alternatively for C style #includes search for r'^\s*#\s*include\b'.

>>
>>Search for r'^something' can never be better/faster than match for
>>r'something', and with a dopey implementation of search [which
>>Python's re is NOT] it could be much worse. So please don't tell
>>newbies to search for r'^something'.
>>

>
> Search for r'^something' is always better than searching for r'something'
> when the spec requires the search to match only at the start of a line (on
> the principle that code that works is better than code which doesn't).
>
> It appears that this may be something the original poster wanted, so I
> stand by my suggestion.



We could well be lost in a semantic fog where at least one of us is
using "match" to mean "the match() method" and at least one of us is
using match to mean soemthing like "the outcome of using a search()
method [or a match() method]".

So I'll stand by my suggestion, too.

 
Reply With Quote
 
 
 
 
John Machin
Guest
Posts: n/a
 
      08-11-2005
Aahz wrote:
> In article <42fb45d7$(E-Mail Removed)>,
> John Machin <(E-Mail Removed)> wrote:
>
>>Search for r'^something' can never be better/faster than match for
>>r'something', and with a dopey implementation of search [which Python's
>>re is NOT] it could be much worse. So please don't tell newbies to
>>search for r'^something'.

>
>
> You're somehow getting mixed up in thinking that "^" is some kind of
> "not" operator -- it's the start of line anchor in this context.


I can't imagine where you got that idea from.

If I change "[which Python's re is NOT]" to "[Python's re's search() is
not dopey]", does that help you?

The point was made in a context where the OP appeared to be reading a
line at a time and parsing it, and re.compile(r'something').match()
would do the job; re.compile(r'^something').search() will do the job too
-- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
very inefficiently in the failing case with dopey implementations of
search() (which apply match() at offsets 0, 1, 2, .....).

 
Reply With Quote
 
 
 
 
Devan L
Guest
Posts: n/a
 
      08-11-2005
John Machin wrote:
> Aahz wrote:
> > In article <42fb45d7$(E-Mail Removed)>,
> > John Machin <(E-Mail Removed)> wrote:
> >
> >>Search for r'^something' can never be better/faster than match for
> >>r'something', and with a dopey implementation of search [which Python's
> >>re is NOT] it could be much worse. So please don't tell newbies to
> >>search for r'^something'.

> >
> >
> > You're somehow getting mixed up in thinking that "^" is some kind of
> > "not" operator -- it's the start of line anchor in this context.

>
> I can't imagine where you got that idea from.
>
> If I change "[which Python's re is NOT]" to "[Python's re's search() is
> not dopey]", does that help you?
>
> The point was made in a context where the OP appeared to be reading a
> line at a time and parsing it, and re.compile(r'something').match()
> would do the job; re.compile(r'^something').search() will do the job too
> -- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
> very inefficiently in the failing case with dopey implementations of
> search() (which apply match() at offsets 0, 1, 2, .....).


I don't see much difference.
Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]
on win32
Type "copyright", "credits" or "license()" for more information.

************************************************** **************
Personal firewall software may warn about the connection IDLE
makes to its subprocess using this computer's internal loopback
interface. This connection is not visible on any external
interface and no data is sent to or received from the Internet.
************************************************** **************

IDLE 1.1.1
>>> import timeit
>>> t1 = timeit.Timer('re.search("^\w"," will not work")','import re')
>>> t1.timeit()

34.938577109660628
>>> t2 = timeit.Timer('re.match("\w"," will not work")','import re')
>>> t2.timeit()

31.381461330979164
>>> 3.0/1000000

3.0000000000000001e-006
>>> t1.timeit()

35.282282524734228
>>> t2.timeit()

31.403153752781463

~4 second difference after a million times through seems to be trivial.
Then again, I haven't tested it for larger patterns and strings.

 
Reply With Quote
 
John Machin
Guest
Posts: n/a
 
      08-11-2005
Devan L wrote:
> John Machin wrote:
>
>>Aahz wrote:
>>
>>>In article <42fb45d7$(E-Mail Removed)>,
>>>John Machin <(E-Mail Removed)> wrote:
>>>
>>>
>>>>Search for r'^something' can never be better/faster than match for
>>>>r'something', and with a dopey implementation of search [which Python's
>>>>re is NOT] it could be much worse. So please don't tell newbies to
>>>>search for r'^something'.
>>>
>>>
>>>You're somehow getting mixed up in thinking that "^" is some kind of
>>>"not" operator -- it's the start of line anchor in this context.

>>
>>I can't imagine where you got that idea from.
>>
>>If I change "[which Python's re is NOT]" to "[Python's re's search() is
>>not dopey]", does that help you?
>>
>>The point was made in a context where the OP appeared to be reading a
>>line at a time and parsing it, and re.compile(r'something').match()
>>would do the job; re.compile(r'^something').search() will do the job too
>>-- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
>>very inefficiently in the failing case with dopey implementations of
>>search() (which apply match() at offsets 0, 1, 2, .....).

>
>
> I don't see much difference.


and I didn't expect that you would -- like I wrote above: "Python's re's
search() is not dopey".
 
Reply With Quote
 
Devan L
Guest
Posts: n/a
 
      08-11-2005

John Machin wrote:
> Devan L wrote:
> > John Machin wrote:
> >
> >>Aahz wrote:
> >>
> >>>In article <42fb45d7$(E-Mail Removed)>,
> >>>John Machin <(E-Mail Removed)> wrote:
> >>>
> >>>
> >>>>Search for r'^something' can never be better/faster than match for
> >>>>r'something', and with a dopey implementation of search [which Python's
> >>>>re is NOT] it could be much worse. So please don't tell newbies to
> >>>>search for r'^something'.
> >>>
> >>>
> >>>You're somehow getting mixed up in thinking that "^" is some kind of
> >>>"not" operator -- it's the start of line anchor in this context.
> >>
> >>I can't imagine where you got that idea from.
> >>
> >>If I change "[which Python's re is NOT]" to "[Python's re's search() is
> >>not dopey]", does that help you?
> >>
> >>The point was made in a context where the OP appeared to be reading a
> >>line at a time and parsing it, and re.compile(r'something').match()
> >>would do the job; re.compile(r'^something').search() will do the job too
> >>-- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
> >>very inefficiently in the failing case with dopey implementations of
> >>search() (which apply match() at offsets 0, 1, 2, .....).

> >
> >
> > I don't see much difference.

>
> and I didn't expect that you would -- like I wrote above: "Python's re's
> search() is not dopey".


Your wording makes it hard to distinguish what exactly is "dopey".

 
Reply With Quote
 
John Machin
Guest
Posts: n/a
 
      08-11-2005
John Machin wrote:
> Devan L wrote:
>
>> John Machin wrote:
>>
>>> Aahz wrote:
>>>
>>>> In article <42fb45d7$(E-Mail Removed)>,
>>>> John Machin <(E-Mail Removed)> wrote:
>>>>
>>>>
>>>>> Search for r'^something' can never be better/faster than match for
>>>>> r'something', and with a dopey implementation of search [which
>>>>> Python's
>>>>> re is NOT] it could be much worse. So please don't tell newbies to
>>>>> search for r'^something'.
>>>>
>>>>
>>>>
>>>> You're somehow getting mixed up in thinking that "^" is some kind of
>>>> "not" operator -- it's the start of line anchor in this context.
>>>
>>>
>>> I can't imagine where you got that idea from.
>>>
>>> If I change "[which Python's re is NOT]" to "[Python's re's search() is
>>> not dopey]", does that help you?
>>>
>>> The point was made in a context where the OP appeared to be reading a
>>> line at a time and parsing it, and re.compile(r'something').match()
>>> would do the job; re.compile(r'^something').search() will do the job too
>>> -- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
>>> very inefficiently in the failing case with dopey implementations of
>>> search() (which apply match() at offsets 0, 1, 2, .....).

>>
>>
>>
>> I don't see much difference.

>
>
> and I didn't expect that you would -- like I wrote above: "Python's re's
> search() is not dopey".


*ahem*

C:\junk>python
Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> t1 = timeit.Timer('re.search("^\w"," will not work")','import re')
>>> t2 = timeit.Timer('re.match("\w"," will not work")','import re')
>>> t3 = timeit.Timer('obj(" will not work")','import

re;obj=re.compile("^\w").s
earch')
>>> t4 = timeit.Timer('obj(" will not work")','import

re;obj=re.compile("\w").ma
tch')
>>> t5 = timeit.Timer('obj(" will not work

qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq")'
,'import re;obj=re.compile("^\w").search')
>>> t6 = timeit.Timer('obj(" will not work

qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq")'
,'import re;obj=re.compile("\w").match')
>>> ["%.3f" % t.timeit() for t in t1, t2, t3, t4]

['5.510', '4.835', '1.588', '1.178']
>>> ["%.3f" % t.timeit() for t in t1, t2, t3, t4]

['5.512', '4.808', '1.584', '1.170']

Observation: factoring out the compile step makes the difference much
more apparent.

>>> ["%.3f" % t.timeit() for t in t3, t4, t5, t6]

['1.578', '1.175', '2.283', '1.174']
>>> ["%.3f" % t.timeit() for t in t3, t4, t5, t6]

['1.582', '1.179', '2.284', '1.172']
>>>


Conclusion: search time depends on length of searched string.

Meta-conclusion: Either I have to retract my
based-on-hope-rather-than-on-experimentation assertion, or redefine "not
dopey" to mean "surely nobody would search for ^x when match x would do,
so it would be dopey to optimise re for that"

So, back to the original point:

If re.match("something") does the job you want, don't use
re.search("^something") instead.
 
Reply With Quote
 
John Machin
Guest
Posts: n/a
 
      08-11-2005
Devan L wrote:
> John Machin wrote:
>
>>Devan L wrote:
>>
>>>John Machin wrote:
>>>
>>>
>>>>Aahz wrote:
>>>>
>>>>
>>>>>In article <42fb45d7$(E-Mail Removed)>,
>>>>>John Machin <(E-Mail Removed)> wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Search for r'^something' can never be better/faster than match for
>>>>>>r'something', and with a dopey implementation of search [which Python's
>>>>>>re is NOT] it could be much worse. So please don't tell newbies to
>>>>>>search for r'^something'.
>>>>>
>>>>>
>>>>>You're somehow getting mixed up in thinking that "^" is some kind of
>>>>>"not" operator -- it's the start of line anchor in this context.
>>>>
>>>>I can't imagine where you got that idea from.
>>>>
>>>>If I change "[which Python's re is NOT]" to "[Python's re's search() is
>>>>not dopey]", does that help you?
>>>>
>>>>The point was made in a context where the OP appeared to be reading a
>>>>line at a time and parsing it, and re.compile(r'something').match()
>>>>would do the job; re.compile(r'^something').search() will do the job too
>>>>-- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
>>>>very inefficiently in the failing case with dopey implementations of
>>>>search() (which apply match() at offsets 0, 1, 2, .....).
>>>
>>>
>>>I don't see much difference.

>>
>>and I didn't expect that you would -- like I wrote above: "Python's re's
>>search() is not dopey".

>
>
> Your wording makes it hard to distinguish what exactly is "dopey".
>


"""
dopey implementations of search() (which apply match() at offsets 0, 1,
2, .....).
"""

The "dopiness" is that the ^ operator means that the pattern cannot
possibly match starting at 1, 2, 3, etc but a non-optimised search will
not recognise that and will try all possibilities, so the failing case
takes time dependant on the length of the string.
 
Reply With Quote
 
Bryan Olson
Guest
Posts: n/a
 
      08-12-2005
John Machin wrote:
[...]
> Observation: factoring out the compile step makes the difference much
> more apparent.
>
> >>> ["%.3f" % t.timeit() for t in t3, t4, t5, t6]

> ['1.578', '1.175', '2.283', '1.174']
> >>> ["%.3f" % t.timeit() for t in t3, t4, t5, t6]

> ['1.582', '1.179', '2.284', '1.172']
> >>>


To make it even more apparent, try:

import re
import profile

startsz = re.compile('^z')

for s in ('x' * 1000, 'x' * 100000, 'x'*10000000):
profile.run('startsz.search(s)')

Profile report is below.


> Conclusion: search time depends on length of searched string.
>
> Meta-conclusion: Either I have to retract my
> based-on-hope-rather-than-on-experimentation assertion, or redefine "not
> dopey" to mean "surely nobody would search for ^x when match x would do,
> so it would be dopey to optimise re for that"


No question, there's some dopiness to searching for the
beginning of the string at places other than beginning of the
string.

The tricky part would be optimizing '$'.


--
--Bryan


4 function calls in 0.003 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 :0(search)
1 0.003 0.003 0.003 0.003 :0(setprofile)
1 0.000 0.000 0.000 0.000 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 0.003 0.003 profile:0(startsz.search(s))


4 function calls in 0.002 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.002 0.002 0.002 0.002 :0(search)
1 0.000 0.000 0.000 0.000 :0(setprofile)
1 0.000 0.000 0.002 0.002 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 0.002 0.002 profile:0(startsz.search(s))


4 function calls in 0.228 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.228 0.228 0.228 0.228 :0(search)
1 0.000 0.000 0.000 0.000 :0(setprofile)
1 0.000 0.000 0.228 0.228 <string>:1(?)
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 0.228 0.228 profile:0(startsz.search(s))
 
Reply With Quote
 
Duncan Booth
Guest
Posts: n/a
 
      08-12-2005
John Machin wrote:

> The point was made in a context where the OP appeared to be reading a
> line at a time and parsing it, and re.compile(r'something').match()
> would do the job; re.compile(r'^something').search() will do the job too
> -- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and
> very inefficiently in the failing case with dopey implementations of
> search() (which apply match() at offsets 0, 1, 2, .....).


Answering the question you think should have been asked rather than the
question which was actually asked is a great newsnet tradition, and often
more helpful to the poster than a straight answer would have been. However,
you do have to be careful to make it clear that is what you are doing.

The OP did not use the word 'line' once in his post. He simply said he was
searching a string. You didn't use the word 'line' either. If you are going
to read more into the question than was actually asked, please try to say
what question it is you are actually answering.

If he is using individual lines and re.match then the presence or absence
of a leading ^ makes virtually no difference. If he is looking for all
occurences in a multiline string then re.search with an anchored match is a
correct way to do it (splitting the string into lines and using re.match is
an alternative which may or may not be appropriate).

Either way, putting the focus on the ^ was inappropriate: the issue is
whether to use re.search or re.match. If you assume that the search fails
on an 80 character line, then I get timings of 6.48uS (re.search), 4.68uS
(re.match with ^), 4.66uS (re.match without ^). A failing search on a
10,000 character line shows how performance will degrade (225uS for search,
no change for match), but notice that searching 1 10,000 character string
is more than twice as fast as matching 125 80 character lines.

I don't understand what you think an implementation of search() can do in
this case apart from trying for a match at offsets 0, 1, 2, ...? It could
find a match at any starting offset within the string, so it must scan the
string in some form. A clever regex implementation will use Boyer-Moore
where it can to avoid checking every index in the string, but for the
pattern I suggested it would suprise me if any implementations actually
manage much of an optimisation.
 
Reply With Quote
 
Duncan Booth
Guest
Posts: n/a
 
      08-12-2005
John Machin wrote:

>> Your wording makes it hard to distinguish what exactly is "dopey".
>>

>
> """
> dopey implementations of search() (which apply match() at offsets 0, 1,
> 2, .....).
> """
>
> The "dopiness" is that the ^ operator means that the pattern cannot
> possibly match starting at 1, 2, 3, etc but a non-optimised search will
> not recognise that and will try all possibilities, so the failing case
> takes time dependant on the length of the string.


The ^ operator can match at any position in the string if the preceding
character was a newline. 'Dopey' would be failing to take this into
account.
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Regular Expression - looking to match 'www' only if it is the start of a URL hooterbite@yahoo.com ASP .Net 0 07-20-2005 04:11 PM
Regular Expression - looking to match 'www' only if it the start of a URL hooterbite@yahoo.com ASP .Net 4 07-12-2005 01:01 PM
how to match regular expression from right to left Liang Perl 2 08-27-2004 10:03 PM
match three digit number using regular expression championsleeper Perl 6 04-06-2004 08:54 PM
Dynamically changing the regular expression of Regular Expression validator VSK ASP .Net 2 08-24-2003 02:47 PM



Advertisments