regex walkthrough

 
 
rh
Guest
Posts: n/a
 
      12-08-2012
Looking through some code, I found this and wondered what it does:
^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$

Here's my walk through:

1) ^ match at start of string
2) ?P<salsipuedes> if a match is found it will be accessible in a variable
salsipuedes
3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see below
4) + one or more from the preceding char class
5) () the grouping we want returned (see #2)
6) $ end of the string to match against but before any newline
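
A note on step 6: in Python's re module, without re.MULTILINE, $ matches at the very end of the string and also just before a single trailing newline, so this pattern accepts input that ends in "\n". The following is a minimal sketch assuming Python 3. The sample string is made up for illustration, and the \Z variant is not in the original code; it is shown only as one way to insist on the true end of the string.

```python
import re

# The pattern from the post, unchanged.
pattern = re.compile(r'^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$')

# Hypothetical stricter variant: \Z matches only at the very end of the string.
strict = re.compile(r'^(?P<salsipuedes>[0-9A-Za-z-_.//]+)\Z')

with_newline = 'some/path.txt\n'  # made-up sample input

m = pattern.match(with_newline)
# $ matches just before the trailing newline, so the capture excludes it.
loose_capture = m.group('salsipuedes') if m else None

# With \Z the trailing newline makes the match fail.
strict_match = strict.match(with_newline)

print(repr(loose_capture), strict_match)
```

Whether the trailing newline should be accepted depends on where the input comes from, so treat the stricter form as an option, not a correction.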


more on #3
the z-_ part looks wrong, and it seems that the - should be at the start
of the char set; otherwise we get another range z-_. Or does the a-z
preceding the z-_ stop the z-_ from becoming a range? The "."
might be ok inside a char set. The two slashes look wrong, but maybe
they have some special meaning in some case? I think only one slash is
needed.

I've looked at pydoc re, but it's cursory.

 
 
 
 
 
Hans Mulder
Guest
Posts: n/a
 
      12-08-2012
On 8/12/12 18:48:13, rh wrote:
> Looking through some code, I found this and wondered what it does:
> ^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$
>
> Here's my walk through:
>
> 1) ^ match at start of string
> 2) ?P<salsipuedes> if a match is found it will be accessible in a
> variable salsipuedes


I wouldn't call it a variable. If m is a match-object produced
by this regex, then m.group('salsipuedes') will return the part
that was captured.

I'm not sure, though, why you'd want to define a group that
effectively spans the whole regex. If there's a match, then
m.group(0) will return the matching substring, and
m.group('salsipuedes') will return the substring that matched
the parenthesized part of the pattern, and these two substrings
will be equal, since the only bits of the pattern outside the
parentheses are zero-width assertions.
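
The point above can be checked directly. This is a small sketch assuming Python 3; the input string is a made-up example, and the variable names are only for illustration.

```python
import re

pattern = re.compile(r'^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$')

m = pattern.match('docs/read-me_v2.txt')  # hypothetical sample input

whole = m.group(0)               # the entire match
named = m.group('salsipuedes')   # the named group
by_dict = m.groupdict()          # all named groups as a dict

# ^ and $ consume no characters, so the two substrings coincide.
print(whole == named)
```

A named group can still be convenient when the calling code looks values up by name (for instance through groupdict()), even if it adds nothing to the match itself.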

> 3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see below
> 4) + one or more from the preceding char class
> 5) () the grouping we want returned (see #2)
> 6) $ end of the string to match against but before any newline
>
> more on #3
> the z-_ part looks wrong, and it seems that the - should be at the start
> of the char set; otherwise we get another range z-_. Or does the a-z
> preceding the z-_ stop the z-_ from becoming a range?


The latter: a-z is a range, and the z that ends it cannot also start
a second range z-_. The - that follows is therefore taken literally.
Consequently, the -_ bit matches only - and _.
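
A quick way to see this, sketched for Python 3 with a reduced class [a-z-_] (the names below are illustrative only). The last part shows what would happen if z-_ really were parsed as a range: since z sorts after _, the re module rejects it.

```python
import re

char_class = re.compile(r'[a-z-_]')

# After the range a-z, the hyphen is a literal, and so is the underscore.
hyphen_ok = char_class.fullmatch('-') is not None
underscore_ok = char_class.fullmatch('_') is not None

# '{' is the character right after 'z' in ASCII and is not in the class.
brace_ok = char_class.fullmatch('{') is not None

# A genuine z-_ range is reversed ('z' is 122, '_' is 95) and fails to compile.
try:
    re.compile(r'[z-_]')
    reversed_range_error = None
except re.error as exc:
    reversed_range_error = str(exc)

print(hyphen_ok, underscore_ok, brace_ok, reversed_range_error)
```

Putting the hyphen first or last in the class, or escaping it, avoids the ambiguity for human readers.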

> The "." might be ok inside a char set.


It is. Most special characters lose their special meaning
inside a char set.

> The two slashes look wrong but maybe it has some special meaning
> in some case? I think only one slash is needed.


You're correct: there's no special meaning and only one slash
is needed. But then, a char set is a set and duplicates are
simply ignored, so it does no harm.
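
As a sanity check, the sketch below (Python 3, illustrative names) compares the class with one slash and with two over the ASCII range, and also confirms that a backslash is not accepted.

```python
import re

one_slash = re.compile(r'[0-9A-Za-z-_./]')
two_slashes = re.compile(r'[0-9A-Za-z-_.//]')

ascii_chars = [chr(i) for i in range(128)]

# Collect every ASCII character each class accepts.
accepted_one = {c for c in ascii_chars if one_slash.fullmatch(c)}
accepted_two = {c for c in ascii_chars if two_slashes.fullmatch(c)}

# Forward slashes do not make the class accept a backslash.
backslash_accepted = two_slashes.fullmatch('\\') is not None

print(accepted_one == accepted_two, len(accepted_two), backslash_accepted)
```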

Perhaps the person who wrote this was confusing slashes and
backslashes.

> I've looked at pydoc re, but it's cursory.


That's one way of putting it.


Hope this helps,

-- HansM


 
 
 
 
 
rh
Guest
Posts: n/a
 
      12-08-2012
On Sat, 08 Dec 2012 20:33:37 +0100
Hans Mulder wrote:

> On 8/12/12 18:48:13, rh wrote:
> > Looking through some code, I found this and wondered what it
> > does: ^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$
> >
> > Here's my walk through:
> >
> > 1) ^ match at start of string
> > 2) ?P<salsipuedes> if a match is found it will be accessible in a
> > variable salsipuedes

>
> I wouldn't call it a variable. If m is a match-object produced
> by this regex, then m.group('salsipuedes') will return the part
> that was captured.
>
> I'm not sure, though, why you'd want to define a group that
> effectively spans the whole regex. If there's a match, then
> m.group(0) will return the matching substring, and
> m.group('salsipuedes') will return the substring that matched
> the parenthesized part of the pattern and these two substrings
> will be equal, since the only bits of the pattern outside the
> parentheses are zero-width assertions.


Good point, it's making the re engine do extra work.
It's not my code and that's another gap in the author's proficiency.
(I don't know who the author is....FWIW)

>
> > 3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see
> > below
> > 4) + one or more from the preceding char class
> > 5) () the grouping we want returned (see #2)
> > 6) $ end of the string to match against but before any newline
> >
> > more on #3
> > the z-_ part looks wrong, and it seems that the - should be at the start
> > of the char set; otherwise we get another range z-_. Or does the a-z
> > preceding the z-_ stop the z-_ from becoming a range?

>
> The latter: a-z is a range and blocks the z-_ from being a range.
> Consequently, the -_ bit matches only - and _.
>
> > The "." might be ok inside a char set.

>
> It is. Most special characters lose their special meaning
> inside a char set.
>
> > The two slashes look wrong but maybe it has some special meaning
> > in some case? I think only one slash is needed.

>
> You're correct: there's no special meaning and only one slash
> is needed. But then, a char set is a set and duplicates are
> simply ignored, so it does no harm.


I wonder whether it hurts performance. Probably not,
but regexes are tricky and can be expensive even when written
well. For example, does this perform better than the original:
^(?P<salsipuedes>[-\w./]+)$

Not sure if the \w sequence includes the - or the . or the /
I think it does not.
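
One way to explore the question is sketched below for Python 3. It makes several assumptions not in the original code: the re.ASCII flag is added so that \w means [A-Za-z0-9_], the sample strings are made up, and time_pattern is a hypothetical helper. The timings depend on the machine and Python version, so no result is claimed here.

```python
import re
import timeit

original = re.compile(r'^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$')
# re.ASCII is an assumption: it restricts \w to [A-Za-z0-9_].
rewritten = re.compile(r'^(?P<salsipuedes>[-\w./]+)$', re.ASCII)

# Made-up sample inputs, some valid and some not.
samples = ['a/b-c_d.txt', 'no spaces', '', 'UPPER/lower.123', 'tab\there']

# Both patterns should accept and reject the same inputs.
same_verdicts = all(
    bool(original.match(s)) == bool(rewritten.match(s)) for s in samples
)

def time_pattern(compiled, text, number=10_000):
    """Return the seconds taken for `number` match attempts."""
    return timeit.timeit(lambda: compiled.match(text), number=number)

long_text = 'a/b-c_d.txt' * 10
t_original = time_pattern(original, long_text)
t_rewritten = time_pattern(rewritten, long_text)

print(same_verdicts, t_original, t_rewritten)
```

Any difference measured this way is likely to be small next to the cost of whatever the surrounding program does with the match, so readability may be the better reason to choose one form over the other.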

>
> Perhaps the person who wrote this was confusing slashes and
> backslashes.


Possibly.

>
> > I've looked at pydoc re, but it's cursory.

>
> That's one way of putting it.
>
>
> Hope this helps,


Does help, thanks.

>
> -- HansM
>
>



 
Hans Mulder
Guest
Posts: n/a
 
      12-08-2012
On 8/12/12 23:57:48, rh wrote:
> Not sure if the \w sequence includes the - or the . or the /
> I think it does not.


You guessed right:

>>> import re
>>> [c for c in 'x-./y' if re.match(r'\w', c)]
['x', 'y']


So x and y match \w and -, . and / do not.


Hope this helps,

-- HansM

 
 
MRAB
Guest
Posts: n/a
 
      12-09-2012
On 2012-12-08 23:34, Hans Mulder wrote:
> On 8/12/12 23:57:48, rh wrote:
>> Not sure if the \w sequence includes the - or the . or the /
>> I think it does not.

>
> You guessed right:
>
> >>> [ c for c in 'x-./y' if re.match(r'\w', c) ]
> ['x', 'y']

>
> So x and y match \w and -, . and / do not.
>

This is shorter:

>>> re.findall(r'\w', 'x-./y')
['x', 'y']

But remember that r"\w" is more than just r"[A-Za-z0-9_]" (unless
you're using ASCII).
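
To illustrate the difference, here is a short sketch assuming Python 3 and a str pattern, where \w is Unicode-aware by default and the re.ASCII flag narrows it. The test string is made up.

```python
import re

text = '\u00e9_x-'  # e with acute accent, underscore, x, hyphen

# Default for str patterns in Python 3: Unicode word characters.
unicode_words = re.findall(r'\w', text)

# With re.ASCII, \w is limited to [A-Za-z0-9_].
ascii_words = re.findall(r'\w', text, flags=re.ASCII)

# The explicit class behaves like the ASCII-only \w.
explicit = re.findall(r'[A-Za-z0-9_]', text)

print(unicode_words, ascii_words, explicit)
```

So swapping the explicit class for \w widens what the pattern accepts unless ASCII matching is requested.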
 