Need help reading a perl regexp - someone clue me?

 
 
Don Bruder
      02-04-2004

I've got a "canned" regexp I'm trying to analyze that I can't quite
follow due to one of the constructs used in it. Can anyone
translate/verify my translation for me?

Here's the segment that's throwing me (It's a very small sub-section of
a rather large and complex regexp - We're talking something on the order
of 300+ characters worth of "rather large and complex")

[a-zA-Z]{2}[.,\;%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}

Now, if I'm reading rightly, and I'm not totally hopeless as far as my
understanding of perl regexps goes, this should be looking to match "any
two letters followed by pretty much any punctuation mark (including
parens, braces, and brackets of all flavors, but (seemingly) excluding
the "bar" (AKA "OR") character) or any of the digits 0, 1, 3, 4, 6, or
7, followed by any two letters."

How far off base am I with that interpretation?

Should I be ignoring any usual "special meaning" of the 'bar' character
when it appears as part of a square-bracketed set, and therefore taking
the overall regexp to mean that the "bar" character *IS NOT* being
excluded or used in its "special" capacity?

--
Don Bruder - <--- Preferred Email - SpamAssassinated.
Hate SPAM? See <http://www.spamassassin.org> for some seriously great info.
I will choose a path that's clear: I will choose Free Will! - N. Peart
Fly trap info pages: <http://www.sonic.net/~dakidd/Horses/FlyTrap/index.html>
 
Gunnar Hjalmarsson
      02-04-2004
Don Bruder wrote:
>
> [a-zA-Z]{2}[.,\;%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}


<snip>

> Should I be ignoring any usual "special meaning" of the 'bar'
> character when it appears as part of a square-bracketed set, and
> therefore taking the overall regexp to mean that the "bar"
> character *IS NOT* being excluded or used in its "special"
> capacity?


What happened when you tested it?

What you are calling a "square-bracketed set" is a character class, and
the answer is yes.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

 
J Krugman
      02-04-2004
In <A0aUb.12559$> Don Bruder writes:

>I've got a "canned" regexp I'm trying to analyze that I can't quite
>follow due to one of the constructs used in it. Can anyone
>translate/verify my translation for me?


>Here's the segment that's throwing me (It's a very small sub-section of
>a rather large and complex regexp - We're talking something on the order
>of 300+ characters worth of "rather large and complex")


>[a-zA-Z]{2}[.,\;%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}


>Now, if I'm reading rightly, and I'm not totally hopeless as far as my
>understanding of perl regexps goes, this should be looking to match "any
>two letters followed by pretty much any punctuation mark (including
>parens, braces, and brackets of all flavors, but (seemingly) excluding
>the "bar" (AKA "OR") character) or any of the digits 0, 1, 3, 4, 6, or
>7, followed by any two letters.


Why exclude "|"? It's right there in the character class, and
there's no ^ at the beginning of that class, so that regexp is
*supposed* to match "AB|CD".
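
A quick way to check this is to run the fragment against a few strings. The sketch below is only an illustration: the sample strings are made up for the test and do not come from the package the regexp belongs to.

```perl
use strict;
use warnings;

# The fragment exactly as posted; "|" sits inside the [...] class,
# so it is an ordinary member of the set, not alternation.
my $fragment = qr/[a-zA-Z]{2}[.,\;%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}/;

# Hypothetical sample strings, chosen only to exercise the class.
my @samples = ('AB|CD', 'AB.CD', 'AB3CD', 'AB2CD', 'AB CD', 'ABCD');

my %matched;
for my $s (@samples) {
    $matched{$s} = $s =~ $fragment ? 1 : 0;
    printf "%-6s %s\n", $s, $matched{$s} ? 'matches' : 'no match';
}
```

The strings with "|", "." and "3" in the middle should match, while "2" and a space should not, since neither is listed in the class.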

Most of those backslashes are superfluous, BTW. You only need the
ones before $ and ].


 
Ben Morrow
      02-04-2004

Don Bruder wrote:
>
> I've got a "canned" regexp I'm trying to analyze that I can't quite
> follow due to one of the constructs used in it. Can anyone
> translate/verify my translation for me?
>
> Here's the segment that's throwing me (It's a very small sub-section of
> a rather large and complex regexp - We're talking something on the order
> of 300+ characters worth of "rather large and complex")
>
> [a-zA-Z]{2}[.,\;%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}


Good God who wrote that?!
None of those backslashes are necessary except for the one before ].
Those [a-zA-Z] should almost certainly be [[:alpha:]].

I would strongly recommend breaking the regex up into bits as you
understand it. Assign each 'chunk' to a variable with qr//, and use /x
on the bits so you can separate things out decently. For instance,
that bit you have there can be written:

my $code = qw/[[:alpha:]]{2}/;
my $symbol = qr/[.,;:...<>"]/;

/$code $symbol $code/x;

(I'm making the entirely unjustified assumption that the two-letter
sequences are some sort of code, to illustrate that you want to give
the pieces names which reflect their function, rather than merely what
they match). See how much more readable that is?
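
A runnable version of that idea might look like the sketch below. It assumes qr// for both pieces and spells out the symbol class in full as posted (the "..." above is left as an elision, so this is one reading of it, not necessarily what was intended). The names $code, $symbol and $triple are illustrative only. Note that [[:alpha:]] is not strictly the same set as [a-zA-Z] once non-ASCII text is involved.

```perl
use strict;
use warnings;

# Each chunk is compiled on its own with qr//.
my $code   = qr/[[:alpha:]]{2}/;
my $symbol = qr/[.,\;%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"]/;

# Under /x the spaces between the pieces are ignored, so the
# combined pattern can be laid out readably.
my $triple = qr/$code $symbol $code/x;

# Hypothetical input, just to show the combined pattern at work.
my ($found) = 'see AB|CD here' =~ /($triple)/;
print defined $found ? "found: $found\n" : "no match\n";
```

Because each qr// object carries its own flags, the /x on the combined pattern does not change how the individual chunks are interpreted.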

> Should I be ignoring any usual "special meaning" of the 'bar' character
> when it appears as part of a square-bracketed set,


Yes, you should. Read perldoc perlre again. Nothing is significant in
a [] class except ] (except at the start), ^ (if at the start), -
(except at either end), and \.
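
One way to convince yourself of this is to compare the class as posted with a stripped-down version, character by character. This is a sketch, not anything from the original package: the variable names are made up, and the simplified class keeps only the backslashes before $ (to stop Perl interpolating a variable) and before ].

```perl
use strict;
use warnings;

# The class as posted, with all its backslashes.
my $as_posted  = qr/[.,\;%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"]/;

# Same class with only the backslashes before $ and ] kept.
my $simplified = qr/[.,;%!&+^~`'\$*=#|013467()[\]{}<>"]/;

# Compare the two over every 7-bit ASCII character.
my @differences;
for my $ord (0 .. 127) {
    my $ch  = chr $ord;
    my $old = $ch =~ $as_posted  ? 1 : 0;
    my $new = $ch =~ $simplified ? 1 : 0;
    push @differences, $ord if $old != $new;
}

# List the printable characters the class accepts.
my $accepted = join '', grep { $_ =~ $simplified } map { chr } 33 .. 126;

print "characters that differ: ", scalar(@differences), "\n";
print "accepted: $accepted\n";
```

If the two classes really are equivalent, the difference count is zero and "|" shows up in the accepted list like any other character.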

Ben

--
Musica Dei donum optimi, trahit homines, trahit deos. |
Musica truces mollit animos, tristesque mentes erigit. |
Musica vel ipsas arbores et horridas movet feras. |
 
Ben Morrow
      02-04-2004

Ben Morrow wrote:
> Don Bruder wrote:
>
> > [a-zA-Z]{2}[.,\;%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}

>
> Good God who wrote that?!
> None of those backslashes are necessary except for the one before
> ]...


...and the one before $.

> my $code = qw/[[:alpha:]]{2}/;
              ^ should be r (qr//, not qw//)
Apologies.

Ben

--
Heracles: Vulture! Here's a titbit for you / A few dried molecules of the gall
From the liver of a friend of yours. / Excuse the arrow but I have no spoon.
(Ted Hughes, [ Heracles shoots Vulture with arrow. Vulture bursts into ]
/Alcestis/) [ flame, and falls out of sight. ]
 
Don Bruder
      02-04-2004
In article <bvrh6d$21n$>,
Ben Morrow wrote:

> Don Bruder wrote:
> >
> > I've got a "canned" regexp I'm trying to analyze that I can't quite
> > follow due to one of the constructs used in it. Can anyone
> > translate/verify my translation for me?
> >
> > Here's the segment that's throwing me (It's a very small sub-section of
> > a rather large and complex regexp - We're talking something on the order
> > of 300+ characters worth of "rather large and complex")
> >
> > [a-zA-Z]{2}[.,\;%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}

>
> Good God who wrote that?!


Dunno. 'Tweren't me. I'm just trying to understand it.

> None of those backslashes are necessary except for the one before ].
> Those [a-zA-Z] should almost certainly be [[:alpha:]].


Tell the person who wrote it, not me! Extra backslashes aren't giving
me the problem, though. (FWIW, I'm not a Perl programmer, and the regexp
in question is part of another package, not (to my knowledge) Perl, but
questions regarding the syntax of such regexps are referred to the Perl
Regexp documentation - Either the package is written in Perl, and I'm
only seeing a piece of one of the plugins (very likely) or they coded it
to the Perl regexp standard since it was easier than "scratch-building"
their own regexp package)

> I would strongly recommend breaking the regex up into bits as you
> understand it.


Been doing pretty much that as I walked through it.

> Assign each 'chunk' to a variable with qr//, and use /x
> on the bits so you can separate things out decently. For instance,
> that bit you have there can be written:
>
> my $code = qw/[[:alpha:]]{2}/;
> my $symbol = qr/[.,;:...<>"]/;
>
> /$code $symbol $code/x;
>
> (I'm making the entirely unjustified assumption that the two-letter
> sequences are some sort of code, to illustrate that you want to give
> the pieces names which reflect their function, rather than merely what
> they match). See how much more readable that is?


Agreed on the readability. But since I'm not interested in trying to
"tweak" it or anything like that - only UNDERSTAND it - I'll be leaving
it "as-is".


> > Should I be ignoring any usual "special meaning" of the 'bar' character
> > when it appears as part of a square-bracketed set,

>
> Yes, you should.


Bingo. That's the answer I needed, and cleared a big part of the "fog" I
was stumbling around in. Now to figure out why only "013467" in the list
of digits... I could easily understand ALL digits, but having only that
particular sub-set of digits just doesn't seem to make any sense, either
on the surface, or in the context of what I know it's *SUPPOSED* to be
doing.

In case anybody's interested, here's the full regexp that I'm trying to
understand:

(Beware of line-wrap - there are no literal space/carriage
return/linefeed characters in the string other than the regulation CR/LF
pair at the very end, following the "/i")

/\s(?!(?:fn|re)(?:cc|to)=|(?:ma|qu|un)[`'"]|(?:dr|m[rst]|li|st|td)\.)[a
-zA-Z]{2}[.,\;%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}(?<!\.(?:
(?-i:[A-Z][a-z]{1})|a[eiu]|b[ebmrsz]|c[afhnrx]|d[bek]|es|f[ir]|g[uz]|h[kn
rtu]|i[elnqrst]|j[mops]|k[prwy]|m[ckx]|n[loz]|p[lmrty]|ru|s[eghm]|t[cnv]|
u[ksu]|v[gi])|:no|['`"](?:ed|ll|[rv]e))(?:[,'\?!]|\.?\s)/i

Its "advertised purpose" is to go through a block of text looking for a
string consisting of
"<space><alpha-char><alpha-char><period><alpha-char><alpha-char><space>",
with no interest in whether the two letters on either side of the
period are upper or lower case.

It appears (from my analysis - which may be in error) that several
two-character top level internet domain names (.us, .uk, .se, .cn, .br,
.ru, and quite a few others), a small handful of common filename
extensions (.db, .gz, .js, etc), and a few other two-letter combinations
(dr., mr./ms., etc) are special-cased to exclude them from causing a
match. It *DOES* work as advertised, so that's not at issue. I'm not
trying to debug it, tweak it, or otherwise mess with it. I just wanted
to know how/why it was doing what it did before I changed the behavior
that happens if/when it finds a match (and potentially broke something
due to not understanding exactly what was being matched).
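
For anyone who wants to poke at it, here is a rough sketch that rejoins the wrapped lines and runs the pattern against a few strings. Two assumptions: the line breaks were rejoined by hand (so a transcription slip is possible), and /x was added purely so the pattern can be split across lines here. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

# The posted pattern, rejoined by hand. Under /x, whitespace outside
# [...] is ignored, so the layout does not change what is matched.
my $full = qr/
    \s
    (?!(?:fn|re)(?:cc|to)=|(?:ma|qu|un)[`'"]|(?:dr|m[rst]|li|st|td)\.)
    [a-zA-Z]{2}[.,\;%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}
    (?<!\.(?:(?-i:[A-Z][a-z]{1})|a[eiu]|b[ebmrsz]|c[afhnrx]|d[bek]|es
            |f[ir]|g[uz]|h[knrtu]|i[elnqrst]|j[mops]|k[prwy]|m[ckx]|n[loz]
            |p[lmrty]|ru|s[eghm]|t[cnv]|u[ksu]|v[gi])
        |:no|['`"](?:ed|ll|[rv]e))
    (?:[,'\?!]|\.?\s)
/xi;

# Hypothetical samples: two plain hits, then one string for each kind
# of exclusion described above (domain suffix, title, contraction).
my @samples = (' ab.cd ', ' ab|cd ', ' go.uk ', ' dr.no ', " we've ");

my %hit;
for my $s (@samples) {
    $hit{$s} = $s =~ $full ? 1 : 0;
    printf "%-9s %s\n", "'$s'", $hit{$s} ? 'matches' : 'no match';
}
```

The first two samples should match. The other three should be rejected: ".uk" by the lookbehind list, "dr." by the lookahead, and "'ve" by the contraction branch of the lookbehind.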

--
Don Bruder - <--- Preferred Email - SpamAssassinated.
Hate SPAM? See <http://www.spamassassin.org> for some seriously great info.
I will choose a path that's clear: I will choose Free Will! - N. Peart
Fly trap info pages: <http://www.sonic.net/~dakidd/Horses/FlyTrap/index.html>
 