Re: regex walkthrough

 
 
rh
      12-08-2012
On Sat, 08 Dec 2012 18:08:36 +0000
MRAB wrote:

> On 2012-12-08 17:48, rh wrote:
> > Looking through some code I found this and wondered about what it
> > does: ^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$
> >
> > Here's my walk through:
> >
> > 1) ^ match at start of string
> > 2) ?P<salsipuedes> if a match is found it will be accessible in a
> > variable salsipuedes
> > 3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see
> > below
> > 4) + one or more from the preceding char class
> > 5) () the grouping we want returned (see #2)
> > 6) $ end of the string to match against but before any newline
> >
> >
> > more on #3
> > the z-_ part looks wrong; it seems that the - should be at the start
> > of the char set, otherwise we get another range z-_. Or does the a-z
> > preceding the z-_ stop the z-_ from becoming a range? The "."
> > might be ok inside a char set. The two slashes look wrong but maybe
> > they have some special meaning in some case? I think only one slash is
> > needed.
> >
> > I've looked at pydoc re, but it's cursory.
> >
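Point 6 of the walk-through ("$ ... but before any newline") can be checked with a small sketch. The simplified pattern [a-z]+ and the sample strings are made up for illustration; they are not from the original code.

```python
import re

# "$" matches at the very end of the string AND just before a single
# trailing newline; "\Z" matches only at the very end.
with_dollar = re.compile(r"^[a-z]+$")
with_Z = re.compile(r"^[a-z]+\Z")

for text in ["abc", "abc\n", "abc\n\n"]:
    print(repr(text),
          bool(with_dollar.match(text)),
          bool(with_Z.match(text)),
          bool(re.fullmatch(r"[a-z]+", text)))
```

If a trailing newline must be rejected, \Z or re.fullmatch() is probably the safer choice than $.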

> Python itself will help you:
>
> >>> import re
> >>> re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$",
> ...            flags=re.DEBUG)
> at at_beginning
> subpattern 1
>   max_repeat 1 65535
>     in
>       range (48, 57)
>       range (65, 90)
>       range (97, 122)
>       literal 45
>       literal 95
>       literal 46
>       literal 47
>       literal 47
> at at_end
>
> Inside the character set: "0-9", "A-Z" and "a-z" are ranges; "-", "_",
> "." and "/" are literals. Doubling the "/" is unnecessary (it has no
> special meaning). "-" is a literal because it immediately follows a
> range, so it can't be defining another range (if it immediately
> followed a literal and wasn't immediately followed by an unescaped "]"
> then it would, so r"[a-]" is the same as r"[a\-]").
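The hyphen rule described above can be tried out directly. This is only a sketch: the tidied pattern (hyphen moved to the front, duplicate "/" dropped) and the sample strings are illustrative assumptions, not something from the code rh found.

```python
import re

# The class from the thread: "-" directly follows the range "a-z",
# so it is parsed as a literal hyphen rather than the start of "z-_".
original = re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$")

# Hypothetical tidied spelling: hyphen first, a single "/".
tidied = re.compile(r"^(?P<salsipuedes>[-0-9A-Za-z_./]+)$")

samples = ["a-b_c.d/e", "z-_", "-", "a b", "a[b", ""]
for s in samples:
    # Both spellings should accept and reject the same strings.
    assert bool(original.match(s)) == bool(tidied.match(s))
    print(repr(s), bool(original.match(s)))

# For contrast: when "-" sits between two literals it does define a range,
# and "z-_" is rejected because "z" (122) sorts after "_" (95).
try:
    re.compile(r"[z-_]")
    range_error = None
except re.error as exc:
    range_error = exc
print("compiling [z-_] failed with:", range_error)
```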


Handy tip there, thanks.

re.compile(r"^(?P<salsipuedes>[-\w./]+)$", flags=re.DEBUG)
at at_beginning
subpattern 1
  max_repeat 1 65535
    in
      literal 45
      category category_word
      literal 46
      literal 47
at at_end

I reduced the expression too. Now I wonder why re.DEBUG doesn't unroll
category_word. Some other re flag?
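One caveat about the reduced expression, as a hedged sketch: with str patterns in Python 3, \w is Unicode-aware, so [-\w./] accepts more than [0-9A-Za-z-_.//] does. No flag unrolls category_word in the debug output as far as I know, but re.ASCII restricts \w to [a-zA-Z0-9_]. The sample strings are made up for illustration.

```python
import re

original = re.compile(r"^[0-9A-Za-z-_.//]+$")
reduced = re.compile(r"^[-\w./]+$")                       # Unicode \w (Python 3 str)
reduced_ascii = re.compile(r"^[-\w./]+$", flags=re.ASCII)  # \w limited to ASCII

for s in ["docs/readme-v2.txt", "café"]:
    print(repr(s),
          bool(original.match(s)),
          bool(reduced.match(s)),
          bool(reduced_ascii.match(s)))
```

On "café" the original and the re.ASCII version should both fail to match, while the plain reduced version matches because "é" counts as a word character.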

>
> As for "(?P<salsipuedes>...)", it won't be accessible in a variable
> "salsipuedes", but will be accessible as a named group in the match
> object:
>
> >>> m = re.match(r"(?P<foo>[a-z]+)", "xyz")
> >>> m.group("foo")
> 'xyz'
>


Ok, "named group" it is.
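For the pattern from the start of the thread, the named group can be read back in a few equivalent ways. A minimal sketch; the input string "docs/readme-v2.txt" is an invented example.

```python
import re

pattern = re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$")
m = pattern.match("docs/readme-v2.txt")

# No variable called "salsipuedes" is created; the text lives in the match object.
by_name = m.group("salsipuedes")   # look the group up by name
by_index = m.group(1)              # named groups are also numbered
as_dict = m.groupdict()            # all named groups as a dict

print(by_name)
print(by_index)
print(as_dict)
```

Note that match() returns None when the string does not match, so real code would check m before calling group().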

 
Hans Mulder
      12-08-2012
On 8/12/12 23:19:40, rh wrote:
> I reduced the expression too. Now I wonder why re.DEBUG doesn't unroll
> category_word. Some other re flag?


The category word consists of the '_' character and the
characters for which .isalnum() returns True.

On my system there are 102158 characters matching '\w':

>>> import re, sys
>>> sum(1 for i in range(sys.maxunicode+1)
...     if re.match(r'\w', chr(i)))
102158
>>>


You wouldn't want to see the complete list.

-- HansM
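The count above can be reproduced, and compared against the ASCII-only meaning of \w, with a sketch along these lines. It assumes Python 3; the Unicode total will differ between interpreters, so no figure is claimed for it here. The four spot-check characters are arbitrary picks.

```python
import re
import sys

word_unicode = re.compile(r"\w")           # Unicode-aware for str patterns
word_ascii = re.compile(r"\w", re.ASCII)   # restricted to [a-zA-Z0-9_]

unicode_count = 0
ascii_count = 0
for i in range(sys.maxunicode + 1):
    ch = chr(i)
    if word_unicode.match(ch):
        unicode_count += 1
    if word_ascii.match(ch):
        ascii_count += 1

print("Unicode \\w:", unicode_count)   # depends on the Unicode database in use
print("ASCII \\w:", ascii_count)       # 26 + 26 + 10 + 1 = 63

# Spot check of the "underscore or isalnum()" description on a few characters.
for ch in ["é", "_", "٣", "-"]:
    print(repr(ch), bool(word_unicode.match(ch)), ch == "_" or ch.isalnum())
```

The loop visits every code point twice, so it may take a second or so to run.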
 
MRAB
      12-09-2012
On 2012-12-08 23:27, Hans Mulder wrote:
> On 8/12/12 23:19:40, rh wrote:
>> I reduced the expression too. Now I wonder why re.DEBUG doesn't unroll
>> category_word. Some other re flag?

>
> The category word consists of the '_' character and the
> characters for which .isalnum() returns True.
>
> On my system there are 102158 characters matching '\w':
>

That would be because you're using Python 3, where strings are Unicode.

> >>> sum(1 for i in range(sys.maxunicode+1)
> ...     if re.match(r'\w', chr(i)))
> 102158
> >>>
>
> You wouldn't want to see the complete list.
>

The number of such codepoints depends on which version of Unicode is
being supported (Unicode is evolving all the time).
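To see which Unicode version a given interpreter supports, the standard unicodedata module exposes it. A small sketch; the printed values depend on the Python build, so none are shown.

```python
import sys
import unicodedata

# Version of the Unicode database bundled with this interpreter.
unicode_version = unicodedata.unidata_version

print("Python:", ".".join(str(part) for part in sys.version_info[:3]))
print("Unicode database:", unicode_version)
```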
 
rh
      12-09-2012
On Sun, 09 Dec 2012 00:27:30 +0100
Hans Mulder wrote:

> On 8/12/12 23:19:40, rh wrote:
> > I reduced the expression too. Now I wonder why re.DEBUG doesn't
> > unroll category_word. Some other re flag?

>
> The category word consists of the '_' character and the
> characters for which .isalnum() returns True.
>
> On my system there are 102158 characters matching '\w':
>
> >>> sum(1 for i in range(sys.maxunicode+1)
> ...     if re.match(r'\w', chr(i)))
> 102158
> >>>
>
> You wouldn't want to see the complete list.


No, and I also wouldn't want to use \w unless really needed.
So that answers my other question.

>
> -- HansM


