Velocity Reviews


rh 12-08-2012 10:19 PM

Re: regex walktrough
 
On Sat, 08 Dec 2012 18:08:36 +0000
MRAB <python@mrabarnett.plus.com> wrote:

> On 2012-12-08 17:48, rh wrote:
> > Looking through some code, I found this and wondered what it
> > does: ^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$
> >
> > Here's my walk through:
> >
> > 1) ^ match at start of string
> > 2) ?P<salsipuedes> if a match is found it will be accessible in a
> > variable salsipuedes
> > 3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see
> > below
> > 4) + one or more from the preceding char class
> > 5) () the grouping we want returned (see #2)
> > 6) $ end of the string to match against but before any newline
> >
> >
> > more on #3
> > the z-_ part looks wrong; it seems the - should be at the start
> > of the char set, otherwise we get another range z-_. Or does the a-z
> > preceding the z-_ keep the z-_ from becoming a range? The "."
> > might be ok inside a char set. The two slashes look wrong, but maybe
> > they have some special meaning in some case? I think only one slash is
> > needed.
> >
> > I've looked at pydoc re, but it's cursory.
> >

> Python itself will help you:
>
> >>> import re
> >>> re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$",
> ...            flags=re.DEBUG)
> at at_beginning
> subpattern 1
>   max_repeat 1 65535
>     in
>       range (48, 57)
>       range (65, 90)
>       range (97, 122)
>       literal 45
>       literal 95
>       literal 46
>       literal 47
>       literal 47
> at at_end
>
> Inside the character set: "0-9", "A-Z" and "a-z" are ranges; "-", "_",
> "." and "/" are literals. Doubling the "/" is unnecessary (it has no
> special meaning). "-" is a literal because it immediately follows a
> range, so it can't be defining another range (if it immediately
> followed a literal and wasn't immediately followed by an unescaped "]"
> then it would, so r"[a-]" is the same as r"[a\-]").
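
The hyphen rule described above can be checked directly. What follows is a minimal sketch, not part of the original posts; the variable names are made up for illustration, and it only looks at ASCII characters.

```python
import re

# The character set from the thread, and a variant with the hyphen escaped.
original = re.compile(r"[0-9A-Za-z-_.//]")
escaped = re.compile(r"[0-9A-Za-z\-_.//]")

ascii_chars = [chr(i) for i in range(128)]

# Collect the ASCII characters each set accepts; the two sets should agree,
# because the "-" right after the range "a-z" is already a literal.
accepted_original = {c for c in ascii_chars if original.fullmatch(c)}
accepted_escaped = {c for c in ascii_chars if escaped.fullmatch(c)}
print(accepted_original == accepted_escaped)

# A hyphen between two literals does define a range: here "+" through "/".
ranged = re.compile(r"[+-/]")
range_chars = sorted(c for c in ascii_chars if ranged.fullmatch(c))
print(range_chars)

# A hyphen just before the closing "]" is a literal, so [a-] accepts "a" and "-".
trailing = re.compile(r"[a-]")
print(bool(trailing.fullmatch("-")), bool(trailing.fullmatch("b")))
```

Putting the hyphen first in the set, or escaping it, avoids having to remember these rules when reading the pattern later.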


Handy tip there, thanks.

re.compile(r"^(?P<salsipuedes>[-\w./]+)$", flags=re.DEBUG)
at at_beginning
subpattern 1
  max_repeat 1 65535
    in
      literal 45
      category category_word
      literal 46
      literal 47
at at_end

I reduced the expression too. Now I wonder why re.DEBUG doesn't unroll
category_word. Some other re flag?

>
> As for "(?P<salsipuedes>...)", it won't be accessible in a variable
> "salsipuedes", but will be accessible as a named group in the match
> object:
>
> >>> m = re.match(r"(?P<foo>[a-z]+)", "xyz")
> >>> m.group("foo")
> 'xyz'
>


Ok, "named group" it is.
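
For reference, here is a small sketch of how the named group from the original pattern can be read back, and of what "$" does with a trailing newline (point 6 of the walk through). The sample strings are invented for illustration.

```python
import re

pattern = re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$")

m = pattern.match("docs/read-me_v2.txt")
print(m.group("salsipuedes"))  # access the group by name
print(m.group(1))              # the same group, by number
print(m.groupdict())           # all named groups as a dict

# "$" also matches just before a single trailing newline,
# so the newline is allowed but is not part of the captured group.
m_newline = pattern.match("docs/readme\n")
print(repr(m_newline.group("salsipuedes")))

# A character outside the set (here a space) means no match at all.
no_match = pattern.match("docs/read me")
print(no_match)
```

If the trailing newline should be rejected as well, `\Z` in place of `$` only matches at the very end of the string.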


Hans Mulder 12-08-2012 11:27 PM

Re: regex walktrough
 
On 8/12/12 23:19:40, rh wrote:
> I reduced the expression too. Now I wonder why re.DEBUG doesn't unroll
> category_word. Some other re flag?


The category word consists of the '_' character and the
characters for which .isalnum() returns True.

On my system there are 102158 characters matching '\w':

>>> import re, sys
>>> sum(1 for i in range(sys.maxunicode+1)
...     if re.match(r'\w', chr(i)))
102158


You wouldn't want to see the complete list.

-- HansM
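
The description of the word category given above can be spot-checked. This is a rough sketch assuming Python 3 str patterns with no flags; the helper names `is_word_char` and `find_mismatches` are hypothetical, not anything from the re module.

```python
import re

word = re.compile(r"\w")

def is_word_char(ch):
    """The rule as stated in the post: underscore, or str.isalnum() is true."""
    return ch == "_" or ch.isalnum()

def find_mismatches(limit):
    """Return code points below `limit` where \\w and the rule above disagree."""
    return [
        i for i in range(limit)
        if bool(word.match(chr(i))) != is_word_char(chr(i))
    ]

# Check the first 0x3000 code points; an empty list means the two agree there.
print(find_mismatches(0x3000))
```

Passing `sys.maxunicode + 1` as the limit checks every code point, at the cost of a slower run.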

MRAB 12-09-2012 12:56 AM

Re: regex walktrough
 
On 2012-12-08 23:27, Hans Mulder wrote:
> On 8/12/12 23:19:40, rh wrote:
>> I reduced the expression too. Now I wonder why re.DEBUG doesn't unroll
>> category_word. Some other re flag?

>
> The category word consists of the '_' character and the
> characters for which .isalnum() returns True.
>
> On my system there are 102158 characters matching '\w':
>

That would be because you're using Python 3, where strings are Unicode.

> >>> import re, sys
> >>> sum(1 for i in range(sys.maxunicode+1)
> ...     if re.match(r'\w', chr(i)))
> 102158

>
> You wouldn't want to see the complete list.
>

The number of such codepoints depends on which version of Unicode is
being supported (Unicode is evolving all the time).
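
To see the effect of Unicode on the count, here is a sketch that compares the default behaviour of \w on Python 3 strings with the re.ASCII flag. The helper `count_matches` is made up for this example, and the Unicode total is deliberately not stated because it varies with the Python build.

```python
import re
import sys
import unicodedata

def count_matches(pattern):
    """Count the code points that a compiled single-character pattern matches."""
    return sum(1 for i in range(sys.maxunicode + 1) if pattern.match(chr(i)))

unicode_word = re.compile(r"\w")          # Unicode-aware (Python 3 default for str)
ascii_word = re.compile(r"\w", re.ASCII)  # restricted to [a-zA-Z0-9_]

# The Unicode database version this interpreter was built with.
print(unicodedata.unidata_version)

unicode_count = count_matches(unicode_word)
ascii_count = count_matches(ascii_word)
print(unicode_count)  # depends on the Unicode version
print(ascii_count)    # 63: a-z, A-Z, 0-9 and "_"
```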

rh 12-09-2012 01:07 AM

Re: regex walktrough
 
On Sun, 09 Dec 2012 00:27:30 +0100
Hans Mulder <hansmu@xs4all.nl> wrote:

> On 8/12/12 23:19:40, rh wrote:
> > I reduced the expression too. Now I wonder why re.DEBUG doesn't
> > unroll category_word. Some other re flag?

>
> The category word consists of the '_' character and the
> characters for which .isalnum() returns True.
>
> On my system there are 102158 characters matching '\w':
>
> >>> import re, sys
> >>> sum(1 for i in range(sys.maxunicode+1)
> ...     if re.match(r'\w', chr(i)))
> 102158

>
> You wouldn't want to see the complete list.


No, and I also wouldn't want to use \w unless really needed.
So that answers my other question.

>
> -- HansM
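
As a footnote to the point about \w: the reduced pattern is not quite the same as the original one on Python 3, because \w also accepts non-ASCII letters and digits. The sketch below compares the two, plus a variant using re.ASCII; the sample strings are invented for illustration.

```python
import re

original = re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$")
reduced = re.compile(r"^(?P<salsipuedes>[-\w./]+)$")
reduced_ascii = re.compile(r"^(?P<salsipuedes>[-\w./]+)$", re.ASCII)

samples = ["docs/read-me_v2.txt", "café/menu.txt", "naïve", "a b"]

# For each sample, record whether each of the three patterns matches.
results = {
    s: (bool(original.match(s)), bool(reduced.match(s)), bool(reduced_ascii.match(s)))
    for s in samples
}
for s, flags in results.items():
    print(repr(s), flags)
```

With re.ASCII the reduced pattern agrees with the original on these samples, while the plain \w version additionally accepts the accented names.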



All times are GMT.
