# another help

giampiero

 09-25-2005
I am looking for three substrings of length 2 (each possibly repeated),
followed some way further on by the same substrings in reverse order
(also possibly repeated).

I use:
$a=~s/(.{2,})+(.{2,})+(.{2,})+.*\3{1,}\2{1,}\1{1,}/$1 $2 $3/o;

How can I make sure, inside the regular expression, that
length $1+$2+$3 is more than l?
Thanks a lot, from the depths of my soul.

Dr.Ruud

 09-25-2005
giampiero wrote:

> i find three substring of length 2 (also repeated) followed after a
> while to a reverse sequences (also repeated)

google is no excuse not to do that.

> i use:
> $a =~ s/(.{2,})+(.{2,})+(.{2,})+.*\3{1,}\2{1,}\1{1,}/$1 $2 $3/o;

The {2,} means two or more, is that what you want?
The {1,} means 1 or more, so is the same as '+'.

If you meant exactly 2:

$a =~ s/(..)+(..)+(..)+.*(\3)+(\2)+(\1)+/\1 \2 \3/o;

(untested)

> how to be sure in regular expression that length $1+$2+$3 must be
> more l?

That will always be 3 * 2 = 6.

--
Affijn, Ruud

"Gewoon is een tijger."

Matt Garrish

 09-25-2005

"Dr.Ruud" wrote:
> giampiero wrote:
>
>> i find three substring of length 2 (also repeated) followed after a
>> while to a reverse sequences (also repeated)
>
> google is no excuse not to do that.
>
>> i use:
>> $a =~ s/(.{2,})+(.{2,})+(.{2,})+.*\3{1,}\2{1,}\1{1,}/$1 $2 $3/o;
>
> The {2,} means two or more, is that what you want?
> The {1,} means 1 or more, so is the same as '+'.
>
> If you meant exactly 2:
>
> $a =~ s/(..)+(..)+(..)+.*(\3)+(\2)+(\1)+/\1 \2 \3/o;
>
> (untested)

Capturing like that just isn't going to work. Something like the following
is probably what you wanted:

$a = 'AAAABBBBCCCCsometexthereCCCCBBBBAAAA';
$a =~ s/(..)\1*(..)\2*(..)\3*.*?\3+\2+\1+/$1 $2 $3/;
print $a;

Matt

Bob Walton

 09-26-2005
giampiero wrote:

> i find three substring of length 2 (also repeated) followed after a
> while to a reverse sequences (also repeated)
>
>
> i use:
> $a=~s/(.{2,})+(.{2,})+(.{2,})+.*\3{1,}\2{1,}\1{1,}/$1 $2 $3/o;

It seems doubtful that the above regex is actually what you want.
That's because the first (.{2,})+ will match any two or more
characters and assign them to $1, then any next two or more
characters and assign *them* to $1, etc. So portions of the
string which were matched (other than by the .*) will not be
present in $1, $2 or $3: each capture keeps only its last
repetition. If you want what I think you said, you need to place
the parenthetical groupings so they pick up the entire repeated
group, like:

$a=~s/((?:.{2,})+)
      ((?:.{2,})+)
      ((?:.{2,})+)
      .*
      \3{1,}\2{1,}\1{1,}
     /$1 $2 $3/xo;

Note that this regex is particularly inefficient, with huge
amounts of backtracking, so give it a while to execute if the
string has any complication at all. This could be improved
immensely by removing the redundant repeats with no change to
what is matched except for the improvement in efficiency. Example:

use warnings;
use strict;
my $a='qabczycdefxxxxxxxxxefcdabczynn';
my $b=$a;
if( #original regexp
    $a=~s/(.{2,})+(.{2,})+(.{2,})+.*\3{1,}\2{1,}\1{1,}/$1 $2 $3/o
){
    print "\$a matched.\n";
    print "\$1=$1\n";
    print "\$2=$2\n";
    print "\$3=$3\n";
}
print "\$a is now $a\n";

if( #suggested regexp
    $b=~s/(.{2,})
          (.{2,})
          (.{2,})
          .*
          \3+\2+\1+
         /$1 $2 $3/xo
){
    print "\$b matched.\n";
    print "\$1=$1\n";
    print "\$2=$2\n";
    print "\$3=$3\n";
}
print "\$b is now $b\n";

When run:

D:\junk>perl junk544.pl
$a matched.
$1=ef
$2=xx
$3=xx
$a is now ef xx xxcdabczynn
$b matched.
$1=abczy
$2=cd
$3=ef
$b is now qabczy cd efnn

D:\junk>
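
The point about a quantified capture group keeping only its last repetition can be seen on a much smaller string. This is a minimal sketch, not part of the original post; the string 'abcdef' and the variable names are made up for illustration:

```perl
use strict;
use warnings;

my $text = 'abcdef';

# Quantified capture group: every repetition overwrites $1,
# so only the last two characters survive.
my ($last_only) = $text =~ /^(..)+$/;

# Capture group around a quantified non-capturing group:
# $1 holds the whole repeated run.
my ($whole_run) = $text =~ /^((?:..)+)$/;

print "(..)+     captured: $last_only\n";   # ef
print "((?:..)+) captured: $whole_run\n";   # abcdef
```

The second form is the one used in the regex suggested above, which is why its $1, $2 and $3 contain the complete groups.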

>
> how to be sure in regular expression that length $1+$2+$3 must be more
> l?

Well, length $1+$2+$3 is parsed as length($1+$2+$3), so it will
always be 1 unless the strings are numeric. Assuming you actually
mean length($1)+length($2)+length($3), each of $1, $2 and $3 must
have matched at least two characters, so if the match succeeded then
length($1)+length($2)+length($3)>=6. Perhaps you should check to
see if the match succeeded, as per the example above. Don't ever
use $1 etc. unless you know the match succeeded.
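
Both remarks can be checked with a few lines. This is a small sketch under assumed sample values ('ab', 'cd', 'ef', 'alpha' and '12345' are invented here, and the variable names are hypothetical); it shows how `length` parses without parentheses and what $1 holds after a failed match:

```perl
use strict;
use warnings;

my ($x, $y, $z) = ('ab', 'cd', 'ef');

# length is a named unary operator and binds less tightly than +,
# so this is length($x + $y + $z): the strings are summed as numbers
# (0 here) and the length of "0" is 1.
my $parsed_as_sum = do { no warnings 'numeric'; length $x+$y+$z };

# What was presumably intended: add up the individual lengths.
my $total_length = length($x) + length($y) + length($z);

# A failed match leaves the capture variables from the previous
# successful match in place.
my $ok_first  = 'alpha' =~ /([a-z]+)/;   # succeeds, $1 is 'alpha'
my $ok_second = '12345' =~ /([a-z]+)/;   # fails, $1 is not reset
my $stale     = $1;

print "length \$x+\$y+\$z = $parsed_as_sum\n";   # 1
print "sum of lengths  = $total_length\n";       # 6
print "second match ", ($ok_second ? 'succeeded' : 'failed'),
      ", \$1 is still <$stale>\n";               # failed, <alpha>
```

Testing the return value of the match (as the `if` in the example program does) is what keeps a stale $1 from being used by accident.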
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

borges2003xx@yahoo.it

 09-29-2005
And what if I use

$a=~s/((?:.{0,})+)
      ((?:.{0,})+)
      ((?:.{0,})+)
      .*
      \3{1,}\2{1,}\1{1,}
     /$1 $2 $3/xo;

and want the total length of $1, $2 and $3 to be >= 12?

Thanks again.

Dr.Ruud

 09-29-2005
(E-Mail Removed) wrote:
> and if
> $a=~s/((?:.{0,})+)
> ((?:.{0,})+)
> ((?:.{0,})+)
> .*
> \3{1,}\2{1,}\1{1,}
> /$1 $2 $3/xo;
>
> and the total of length of $1+$2+$3>=12?
>
> thanx again

{0,} is the same as *
{1,} is the same as +

Something like ((.*)+) hurts (the mind too): one or more of something
that can be empty is not what was meant.
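
The two equivalences are easy to spot-check. The following is only a sketch: `first_match` is a hypothetical helper written for this note, and the sample strings are arbitrary. It compares what each long form and its shorthand match first:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first substring matched by a pattern
# given as a string, or 'none' if the pattern does not match.
sub first_match {
    my ($text, $pattern) = @_;
    return $text =~ /($pattern)/ ? "<$1>" : 'none';
}

my @differences;
for my $text ('', 'a', 'aaa', 'baa', 'aab') {
    for my $pair (['a{0,}', 'a*'], ['a{1,}', 'a+']) {
        my ($long, $short) = @$pair;
        push @differences, "[$text] $long vs $short"
            if first_match($text, $long) ne first_match($text, $short);
    }
}

print @differences ? "differences: @differences\n" : "no differences found\n";
```

A handful of strings proves nothing in general, of course; the equivalence itself follows from how the quantifiers are defined.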

Using (?: ... ) to group without capturing looks OK.

I remember that your data had a basic grouplength of 2, like
'1212123456xxxxxxxx56343412'
Is that still true? If so, try:

$a=~s/((?:..)+)
      ((?:..)+)
      ((?:..)+)
      .*
      \3+\2+\1+
     /$1 $2 $3/xo;

(untested)

--
Affijn, Ruud

"Gewoon is een tijger."

Bob Walton

 09-30-2005
(E-Mail Removed) wrote:
> and if
> $a=~s/((?:.{0,})+)
> ((?:.{0,})+)
> ((?:.{0,})+)
> .*
> \3{1,}\2{1,}\1{1,}
> /$1 $2 $3/xo;
>

Please note carefully that (?:.{0,})+ is exactly the same as .*,
with the exception that (?:.{0,})+ is grossly inefficient due to
the amount of backtracking it generates, particularly when
multiples of them appear in the same regexp. Also, note that
this regexp could match the null string. So you could
equivalently and much more efficiently write:

$a=~s/(.*)(.*)(.*).*\3+\2+\1+/$1 $2 $3/;

> and the total of length of $1+$2+$3>=12?

I interpret this to mean that a successful match is intended to
occur only if the sum of the lengths of the three strings is
twelve or more characters total. If so:

use warnings;
use strict;
my $a='qabczycfffdefxxxxxxxxxefcfffdabczynn';
if(
    $a=~s/(.*)
          (.*)
          (.*)
          .*
          \3+\2+\1+
          #Note: '`' x 100 is intended to refer to a sequence
          #of characters which will never occur in the matched
          #string.
          (??{length($1)+length($2)+length($3)>=12?
              '':'`' x 100})
         /$1 $2 $3/xo
){
    print "\$a matched.\n";
    print "\$1=$1\n";
    print "\$2=$2\n";
    print "\$3=$3\n";
}
print "\$a is now>$a<\n";

When run, this prints:

d:\junk>perl junk545.pl
$a matched.
$1=abczy
$2=cfffd
$3=ef
$a is now>qabczy cfffd efnn<

d:\junk>

If the two sequences of fff in $a are replaced with ff, the match
will fail because the sum of the string lengths is less than 12.

It can be instructive to add a print "$1:$2:$3\n"; before the
conditional statement in the (??{}). That prints the progress of
the match as it proceeds.
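
The same length test can be tried on a much smaller case. The sketch below is an illustration rather than the program above: it uses a single capture group, invented sample strings, a minimum length of 3, and the never-matching sub-pattern (?!) in place of the run of 100 backticks:

```perl
use strict;
use warnings;

my @results;
for my $str ('abcabc', 'abab') {
    # (??{ ... }) runs Perl code during the match and uses the returned
    # string as a sub-pattern: '' always matches, (?!) never does.
    if ($str =~ /(.+)(??{ length($1) >= 3 ? '' : '(?!)' })\1/) {
        push @results, "$str: repeated block is $1";
    } else {
        push @results, "$str: no match";
    }
}
print "$_\n" for @results;
# abcabc: repeated block is abc
# abab: no match
```

Without the length test, 'abab' would match with $1 set to 'ab'; the code block rejects that candidate and the engine finds no longer one. Summing several lengths, as in the twelve-character requirement, works the same way.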

....
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

giampiero

 10-07-2005
>Please note carefully that (?:.{0,})+ is exactly the same as .*,

???????????
(?:.{0,})+ equals (.*)+

Matt Garrish

 10-07-2005

"giampiero" wrote:
> >Please note carefully that (?:.{0,})+ is exactly the same as .*,
>
> ???????????
> (?:.{0,})+ equal (.*)+

You seem to be misunderstanding the fundamental concept of a greedy operator.
On its own, /.*/ will match nothing and everything. Consequently, writing
/(.*)+/ is a useless redundancy as it will always and only ever match once,
so the additional quantifier isn't doing anything (.*? and .*+ being
completely other beasts).

Moreover, /.*/ is equivalent to /.{0,}/ as the * quantifier means 0 or more
occurrences. There is a difference between writing /(?:.{0,})/ and /(.*)/,
and that is that the first will not result in any value being assigned to $1.
If you look closely at what was written above, it is only stated that the two
are the same without a grouping on .*.

Matt

Bob Walton

 10-09-2005
giampiero wrote:

>>Please note carefully that (?:.{0,})+ is exactly the same as .*,

>
>
> ???????????

Yes, the above is correct. Both will match any string of
characters (with a caveat around a newline depending on whether
the //s switch is active at the time the regexp is encountered --
but that behavior will be the same between the two). As to why
(?:.{0,})+ is the same as .* : {0,} is a longhand way of writing
*, so .{0,} is the same as .* . (?:.{0,}) is then also the same
as .* . Now, (?:.{0,}) will match any character string (see
caveat above), hence (?:.{0,})+ will also, with the + interpreted
as "once". Depending on the character string, it might also
match, say, half of the string followed by the other half, or a
quarter followed by the other three-fourths, etc etc. Note that
there are a whole bunch of ways (?:.{0,})+ can match a character
string -- but also note that the resulting match does in fact
match the entire character string, just as .* would have.

> (?:.{0,})+ equal (.*)+

This is incorrect. (.*)+ contains grouping parentheses which
will cause the last string matched by .* to be returned in $1 and
other side reactions to occur in the various other
regexp-grouping-related variables. (?:.{0,})+ does not contain
any grouping parentheses pairs. Hence these two, while they will
match the same strings (namely, all of them, subject to my caveat
above), are not the same because they do not cause the same
ultimate actions.
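
One way to see the difference is to count the capture groups Perl reports after each pattern has matched. This is a rough sketch with an assumed sample string ('hello') and hypothetical variable names; $& is the text matched by the whole pattern and $#+ is the number of capture groups in the last successful match:

```perl
use strict;
use warnings;
no warnings 'regexp';   # these patterns may warn about matching the null string many times

my $text = 'hello';
my (%groups, %matched);
for my $pattern ('(.*)+', '(?:.{0,})+') {
    $text =~ /$pattern/ or die "unexpected: $pattern did not match";
    $matched{$pattern} = $&;    # text matched by the whole pattern
    $groups{$pattern}  = $#+;   # number of capture groups in that match
    print "$pattern matched <$matched{$pattern}> with $groups{$pattern} capture group(s)\n";
}
```

Both patterns match the same text, but only the first one reports a capture group and therefore sets $1.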

You seem to be totally missing the idea of why one *never* wants
to do something like (?:.*)+ . It is not just that it takes more
time to type and to think about; it is that such an expression
causes an extreme amount of backtracking when something
subsequent to it fails to match in a regexp. That translates
into computer time -- potentially *years* of it -- spent doing
absolutely nothing worthwhile. Here is an example program that
shows the backtracking I'm talking about as the execution of the
regexps proceeds:

use warnings;
use strict;
my $s='aaaaaaaaaaaaaaaaaaaaaaaaa';
print "Matching re1:\n";
$s=~/(.*)(??{print "$1\n";''})\1/;
<>;
print "Matching re2:\n";
$s=~/((?:.*)+)(??{print "$1\n";''})\1/;

The result of running this should be most instructive as to why
one should avoid unneeded backtracking in regexps. Note that the
same result is achieved with both "re1" and "re2" above, but at
substantially higher computational cost in the case of "re2".

--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl