In article <bvrh6d$21n$>,
Ben Morrow <> wrote:
> Don Bruder <> wrote:
> >
> > I've got a "canned" regexp I'm trying to analyze that I can't quite
> > follow due to one of the constructs used in it. Can anyone
> > translate/verify my translation for me?
> >
> > Here's the segment that's throwing me (It's a very small sub-section of
> > a rather large and complex regexp - We're talking something on the order
> > of 300+ characters worth of "rather large and complex")
> >
> > [a-zA-Z]{2}[.,\;
> > %!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}
>
> Good God who wrote that?!
Dunno. 'Tweren't me. I'm just trying to understand it.
> None of those backslashes are necessary except for the one before ].
> Those [a-zA-Z] should almost certainly be [[:alpha:]].
Tell the person who wrote it, not me!

Extra backslashes aren't what's giving me the problem, though. (FWIW,
I'm not a Perl programmer, and the regexp in question is part of another
package, not (to my knowledge) Perl, but questions regarding the syntax
of such regexps are referred to the Perl Regexp documentation. Either
the package is written in Perl, and I'm only seeing a piece of one of
the plugins (very likely), or they coded it to the Perl regexp standard
since it was easier than "scratch-building" their own regexp package.)
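
For anyone who wants to check the "unnecessary backslashes" claim rather
than take it on faith, here is a minimal sketch in Perl. It compares an
escaped and an unescaped character class over printable ASCII. The two
classes are a small subset I picked for illustration, not the full class
from the regexp in question, and members() is a hypothetical helper name.

```perl
use strict;
use warnings;

# Two versions of the same (illustrative) character class:
# one with every symbol backslashed, one with none.
my $escaped = qr/[\;\*\=\|\(\)\{\}]/;
my $plain   = qr/[;*=|(){}]/;

# Return every printable ASCII character the class matches, in order.
sub members {
    my ($class) = @_;
    return join '', grep { /$class/ } map { chr($_) } 32 .. 126;
}

print members($escaped) eq members($plain)
    ? "same set\n"
    : "different sets\n";
```

Both classes should come out matching the same characters, which is why
the extra backslashes are noise rather than a behavior change.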
> I would strongly recommend breaking the regex up into bits as you
> understand it.
Been doing pretty much that as I walked through it.
> Assign each 'chunk' to a variable with qr//, and use /x
> on the bits so you can separate things out decently. For instance,
> that bit you have there can be written:
>
> my $code = qr/[[:alpha:]]{2}/;
> my $symbol = qr/[.,;:...<>"]/;
>
> /$code $symbol $code/x;
>
> (I'm making the entirely unjustified assumption that the two-letter
> sequences are some sort of code, to illustrate that you want to give
> the pieces names which reflect their function, rather than merely what
> they match). See how much more readable that is?
Agreed on the readability. But since I'm not interested in trying to
"tweak" it or anything like that - only UNDERSTAND it - I'll be leaving
it "as-is".
> > Should I be ignoring any usual "special meaning" of the 'bar' character
> > when it appears as part of a square-bracketed set,
>
> Yes, you should.
Bingo. That's the answer I needed, and cleared a big part of the "fog" I
was stumbling around in. Now to figure out why only "013467" in the list
of digits... I could easily understand ALL digits, but having only that
particular sub-set of digits just doesn't seem to make any sense, either
on the surface, or in the context of what I know it's *SUPPOSED* to be
doing.
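
The "bar is literal inside brackets" point can be checked with a short
sketch like the one below (Perl, assuming nothing beyond core regexp
syntax). The class is cut down to just the bar and the odd digit subset,
and in_class() is a hypothetical helper name.

```perl
use strict;
use warnings;

# Inside [...] the bar is an ordinary character, not alternation.
# Only the bar and the digits 0,1,3,4,6,7 are in this reduced class.
my $class = qr/[|013467]/;

# True if the single character $char is a member of the class.
sub in_class {
    my ($char) = @_;
    return $char =~ /\A$class\z/ ? 1 : 0;
}

for my $char ('|', '0', '2', '7') {
    printf "%-3s %s\n", $char, in_class($char) ? 'matches' : 'no match';
}
```

Here "|", "0" and "7" are members of the class, and "2" is not.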
In case anybody's interested, here's the full regexp that I'm trying to
understand:
(Beware of line-wrap - there are no literal space/carriage
return/linefeed characters in the string other than the regulation CR/LF
pair at the very end, following the "/i")
/\s(?!(?:fn|re)

(?:cc|to)=|(?:ma|qu|un)[`'"]|(?:dr|m[rst]|li|st|td)\.)[a
-zA-Z]{2}[.,\;

%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}(?<!\.(?:
(?-i:[A-Z][a-z]{1})|a[eiu]|b[ebmrsz]|c[afhnrx]|d[bek]|es|f[ir]|g[uz]|h[kn
rtu]|i[elnqrst]|j[mops]|k[prwy]|m[ckx]|n[loz]|p[lmrty]|ru|s[eghm]|t[cnv]|
u[ksu]|v[gi])|:no|['`"](?:ed|ll|[rv]e))(?:[,'\?!]|\.?\s)/i
Its "advertised purpose" is to go through a block of text looking for a
string consisting of
"<space><alpha-char><alpha-char><period><alpha-char><alpha-char><space>",
with no interest in whether the two letters on either side of the
period are upper or lower case.
It appears (from my analysis - which may be in error) that several
two-character top level internet domain names (.us, .uk, .se, .cn, .br,
.ru, and quite a few others), a small handful of common filename
extensions (.db, .gz, .js, etc), and a few other two-letter combinations
(dr., mr./ms., etc) are special-cased to exclude them from causing a
match. It *DOES* work as advertised, so that's not at issue. I'm not
trying to debug it, tweak it, or otherwise mess with it. I just wanted
to know how/why it was doing what it did before I change the behavior
that happens if/when it finds a match (and potentially break something
due to not understanding exactly what was being matched).
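
To illustrate how the special-casing seems to work, here is a heavily
cut-down sketch in Perl of the same shape: a negative lookahead that
rejects a few leading abbreviations, and a negative lookbehind that
rejects a few trailing domains/extensions. The exclusion lists are
shortened, the symbol class is reduced to a plain period, and the sample
strings and the flagged() helper are my own inventions, so treat it as
an approximation of the structure, not the real rule.

```perl
use strict;
use warnings;

# Lookahead: the two letters plus period must NOT be one of these.
my $not_abbrev = qr/(?!(?:dr|m[rst]|st)\.)/i;

# Two letters, either case.
my $pair = qr/[a-zA-Z]{2}/;

# Lookbehind: the period plus two letters just matched must NOT be
# one of these (shortened) domains/extensions.
my $not_tld = qr/(?<!\.(?:uk|us|ru|gz))/i;

# Same overall layout as the original, assembled from named pieces.
my $rule = qr/\s $not_abbrev $pair \. $pair $not_tld (?:[,'?!]|\.?\s)/xi;

# True if the text contains something the cut-down rule would flag.
sub flagged {
    my ($text) = @_;
    return $text =~ $rule ? 1 : 0;
}

for my $text ('see ab.cd now', 'at co.uk today', 'ask mr.ed today') {
    printf "%-18s %s\n", $text, flagged($text) ? 'match' : 'no match';
}
```

Only the first string is flagged. The second is stopped by the
lookbehind (".uk"), and the third by the lookahead ("mr.").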
--
Don Bruder -