Match a number of repeated chars, but NO MORE.

 
 
usenet@DavidFilmer.com
      12-02-2005
One particular aspect of a question in another newsgroup
(http://tinyurl.com/cbakx) interested me; I played around with some
solutions but couldn't come up with one that I thought was elegant. So
I thought I would introduce the question to this group for further
enlightenment.

<paraphrase> of the OP's question:

Suppose I have a string of characters: "abCCCdefg". I want to match
three consecutive occurrences of any character in a class. In this
example, my expression would match 'CCC'. OK, that's easy:

#!/usr/bin/perl
use warnings; use strict;
my $string = "abCCCdefg";
print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
__END__

But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>

I thought I could propose an 'elegant' answer like this:

print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;

but that doesn't work (it seems that \1 gets "used up" somehow). Of
course, I could write a bunch of code to do it... that's trivial to do
(but ugly, IMHO).

Are there any elegant ideas?

 
Eric J. Roode
      12-02-2005
(E-Mail Removed) wrote in news:1133518600.589728.209460@g14g2000cwa.googlegroups.com:

> But, suppose I wanted to constrain the match so that it would match
> three consecutive occurrences, but NO MORE than three. In other words,
> 'abCCCCCdefg' would NOT match. </paraphrase>
>
> I thought I could propose an 'elegant' answer like this:
>
> print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;
>
> but that doesn't work (it seems that \1 gets "used up" somehow). Of
> course, I could write a bunch of code to do it... that's trivial to do
> (but ugly, IMHO).
>
> Are there any elegant ideas?


I can get you part of the way there. Perhaps someone better at regexes
can take you the rest of the way.

First, use a "negative lookahead assertion": (and use the x
modifier!)

$string =~ / ([\w\d_-]+) \1{2} (?!\1) /x;

But there's still a problem: Though it won't match the first or second
"CCC" in your above string, it will match the third "CCC". In other
words, it'll match the "CCC" that begins after "abCC".

So you'll need to use a negative lookbehind assertion, too:

$string =~ /([\w\d_-]+) # Your match
\1{2} # Two more of it
(?!\1) # But not another one
(?<!\1{4}) # Not preceded by 4 of \1 at this point
/x;

But there's a problem: since your match is variable-length (due to the +
quantifier), the negative lookbehind is variable-length, and that is
unfortunately not yet implemented in Perl.

I'm not sure where to take it from here, sorry.

--
Eric
`$=`;$_=\%!;($_)=/(.)/;$==++$|;($.,$/,$,,$\,$",$;,$^,$#,$~,$*,$:,@%)=(
$!=~/(.)(.).(.)(.)(.)(.)..(.)(.)(.)..(.)......(.)/,$"),$=++;$.++;$.++;
$_++;$_++;($_,$\,$,)=($~.$"."$;$/$%[$?]$_$\$,$:$%[$?]",$"&$~,$#,);$,++
;$,++;$^|=$";`$_$\$,$/$:$;$~$*$%[$?]$.$~$*${#}$%[$?]$;$\$"$^$~$*.>&$=`
 
it_says_BALLS_on_your_forehead
      12-02-2005

(E-Mail Removed) wrote:
> One particular aspect of a question in another newsgroup
> (http://tinyurl.com/cbakx) interested me; I played around with some
> solutions but couldn't come up with one that I thought was elegant. So
> I thought I would introduce the question to this group for further
> enlightenment.
>
> <paraphrase> of the OP's question:
>
> Suppose I have a string of characters: "abCCCdefg". I want to match
> three consecutive occurrences of any character in a class. In this
> example, my expression would match 'CCC'. OK, that's easy:
>
> #!/usr/bin/perl
> use warnings; use strict;
> my $string = "abCCCdefg";
> print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
> __END__
>
> But, suppose I wanted to constrain the match so that it would match
> three consecutive occurrences, but NO MORE than three. In other words,
> 'abCCCCCdefg' would NOT match. </paraphrase>
>
> I thought I could propose an 'elegant' answer like this:
>
> print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;
>
> but that doesn't work (it seems that \1 gets "used up" somehow). Of
> course, I could write a bunch of code to do it... that's trivial to do
> (but ugly, IMHO).
>
> Are there any elegant ideas?


i believe that \w includes \d as well as '_', [\w-] would be the char
class you want.
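
A quick way to check this claim is to compare the two classes character by character. The sketch below assumes plain ASCII input, and the helper name class_differences is made up for illustration; it simply lists any code point on which [\w\d_-] and [\w-] disagree.

```perl
use strict;
use warnings;

# Return the ASCII code points on which the two classes disagree.
# (class_differences is a hypothetical helper, not from the thread.)
sub class_differences {
    my @diff;
    for my $code (0 .. 127) {
        my $ch    = chr $code;
        my $long  = $ch =~ /\A[\w\d_-]\z/ ? 1 : 0;   # the OP's class
        my $short = $ch =~ /\A[\w-]\z/    ? 1 : 0;   # the shorter class
        push @diff, $code if $long != $short;
    }
    return @diff;
}

my @diff = class_differences();
print @diff ? "classes differ at: @diff\n" : "classes agree on ASCII\n";
# prints "classes agree on ASCII"
```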

 
it_says_BALLS_on_your_forehead
      12-02-2005

Eric J. Roode wrote:
> (E-Mail Removed) wrote in news:1133518600.589728.209460
> @g14g2000cwa.googlegroups.com:
>
> > But, suppose I wanted to constrain the match so that it would match
> > three consecutive occurrences, but NO MORE than three. In other words,
> > 'abCCCCCdefg' would NOT match. </paraphrase>
> >
> > I thought I could propose an 'elegant' answer like this:
> >
> > print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;
> >
> > but that doesn't work (it seems that \1 gets "used up" somehow). Of
> > course, I could write a bunch of code to do it... that's trivial to do
> > (but ugly, IMHO).
> >
> > Are there any elegant ideas?

>
> I can get you part of the way there. Perhaps someone better at regexes
> can take you the rest of the way.
>
> First, use a "negative lookahead assertion": (and use the x
> modifier!)
>
> $string =~ / ([\w\d_-]+) \1{2} (?!\1) /x;
>
> But there's still a problem: Though it won't match the first or second
> "CCC" in your above string, it will match the third "CCC". In other
> words, it'll match the "CCC" that begins after "abCC".
>
> So you'll need to use a negative lookbehind assertion, too:
>
> $string =~ /([\w\d_-]+) # Your match
> \1{2} # Two more of it
> (?!\1) # But not another one
> (?<!\1{4}) # Not preceeded by 4 of \1 at this point
> /x;
>
> But there's a problem: since your match is variable-length (due to the +
> quantifier), the negative lookbehind is variable-length, and that is
> unfortunately not yet implemented in Perl.
>
> I'm not sure where to take it from here, sorry.


hmm, i'm aware of that constraint with lookbehinds. maybe it's too
early in the morning, but would you need lookbehinds? don't the matches
on the string occur from left to right, so you only need the negative
lookahead?

 
Anno Siegel
      12-02-2005
<(E-Mail Removed)> wrote in comp.lang.perl.misc:
> One particular aspect of a question in another newsgroup
> (http://tinyurl.com/cbakx) interested me; I played around with some
> solutions but couldn't come up with one that I thought was elegant. So
> I thought I would introduce the question to this group for further
> enlightenment.
>
> <paraphrase> of the OP's question:
>
> Suppose I have a string of characters: "abCCCdefg". I want to match
> three consecutive occurrences of any character in a class. In this
> example, my expression would match 'CCC'. OK, that's easy:
>
> #!/usr/bin/perl
> use warnings; use strict;
> my $string = "abCCCdefg";
> print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
> __END__


That regex isn't quite correct, it should only capture one occurrence
of the repeated character, not more. Also, \w already matches digits
and underscore:

/(\w)\1{2}/;
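
To see why the multi-character capture matters, here is a minimal sketch comparing the two patterns. The sub names are invented for illustration, and the outer parentheses are added only so the matched text can be returned.

```perl
use strict;
use warnings;

# "+" inside the group: the group may capture several characters.
sub matches_multi  { return $_[0] =~ /(([\w-]+)\2{2})/ ? $1 : undef }

# Single-character group: only a repeated character can match.
sub matches_single { return $_[0] =~ /((\w)\2{2})/ ? $1 : undef }

for my $s ("abCCCdefg", "abcabcabc") {
    printf "%-10s multi: %-10s single: %s\n", $s,
        matches_multi($s)  // "no match",
        matches_single($s) // "no match";
}
# abCCCdefg  multi: CCC        single: CCC
# abcabcabc  multi: abcabcabc  single: no match
```

The second string contains no character repeated three times in a row, yet the original pattern accepts it because the group captures "abc".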

> But, suppose I wanted to constrain the match so that it would match
> three consecutive occurrences, but NO MORE than three. In other words,
> 'abCCCCCdefg' would NOT match. </paraphrase>
>
> I thought I could propose an 'elegant' answer like this:
>
> print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;
>
> but that doesn't work (it seems that \1 gets "used up" somehow). Of
> course, I could write a bunch of code to do it... that's trivial to do
> (but ugly, IMHO).


\1 doesn't get consumed, it is interpolated as a character escape, not
a backreference. [^\1] matches all characters except chr(1).
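
A small sketch of that behaviour, using a single-character group for simplicity (the sub name is hypothetical). The first string has five "C"s and still matches, because the fourth "C" is simply "some character other than chr(1)"; the second string fails only because its run is followed by chr(1) itself.

```perl
use strict;
use warnings;

# Inside [...] the sequence \1 is an octal character escape for chr(1),
# so [^\1] means "any character except chr(1)".
sub op_pattern_matches { return $_[0] =~ /(\w)\1{2}[^\1]/ ? 1 : 0 }

print op_pattern_matches("abCCCCCdefg"), "\n";   # 1: a fourth C satisfies [^\1]
print op_pattern_matches("CCC\x01"), "\n";       # 0: chr(1) is the one excluded character
```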

A negative lookahead works as intended, but still doesn't solve the
problem:

qr/([\w\d_-])\1{2}(?!\1)/;

This forces the following character to be different from \1, but
then the regex just moves on and matches the last three "C" in
"abCCCCCdefg". I don't see a way to force it to match only if
the preceding character is different from the repeated one.
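
The position of the match can be inspected with $-[0], the offset where the last successful match began. This is only an illustrative sketch with a single-character group; lookahead_offset is a made-up name.

```perl
use strict;
use warnings;

# Return the offset at which the lookahead pattern matches, or -1.
sub lookahead_offset {
    my ($string) = @_;
    return $string =~ /(\w)\1{2}(?!\1)/ ? $-[0] : -1;
}

print lookahead_offset("abCCCdefg"), "\n";     # 2: the run of exactly three
print lookahead_offset("abCCCCCdefg"), "\n";   # 4: the last three of five "C"s
```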

Following this vein leads to something like this

my $re = qr/
(.) # any character
(?!\1) # ...followed by a different character
(\w) # ...which is a word character
\2{2} # ...followed by exactly two copies of itself
(?!\2) # ...followed by a different character
/x;

That works with the given examples, but only if there is actual text
before and after the repeated group, not if the repetitions appear
in the beginning or end of the string. Not to mention elegance...
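
One possible way around the start-of-string limitation, offered as a sketch rather than a definitive answer: make the preceding character optional with an alternation. It relies on the documented Perl behaviour that a backreference to a group that did not take part in the match fails, so (?!\1) succeeds when the ^ branch is taken. As far as I can tell the end of the string needs no special handling, since (?!\2) succeeds when nothing is left for \2 to match. The sub name has_exact_triple is hypothetical.

```perl
use strict;
use warnings;

sub has_exact_triple {
    my ($string) = @_;
    return $string =~ /
        (?: ^ | (.) )   # start of string, or capture the preceding character
        (?! \1 )        # the run must not repeat that preceding character
        (\w)            # first character of the run
        \2{2}           # two more copies of it
        (?! \2 )        # and no further copy
    /xs ? 1 : 0;
}

for my $s ("abCCCdefg", "abCCCCCdefg", "CCCab", "abCCC", "CCCC") {
    printf "%-12s %s\n", $s, has_exact_triple($s) ? "match" : "no match";
}
# abCCCdefg    match
# abCCCCCdefg  no match
# CCCab        match
# abCCC        match
# CCCC         no match
```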

Conclusion: It probably can be done in a single regex, but I doubt it
is worth the effort.

/((\w)\2{2,})/ and length( $1) == 3
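
Note that the one-liner above looks only at the first run of three or more characters in the string. If later runs should be considered too, a loop over all runs is one option. This is a minimal sketch; runs_of_exactly is an invented helper name, and \w is assumed to be the character class of interest.

```perl
use strict;
use warnings;

# Collect every maximal run of a repeated word character whose
# length is exactly $n.
sub runs_of_exactly {
    my ($string, $n) = @_;
    my @runs;
    while ($string =~ /((\w)\2*)/g) {   # $1 is one maximal run
        push @runs, $1 if length($1) == $n;
    }
    return @runs;
}

print join(",", runs_of_exactly("abCCCdefg", 3)), "\n";        # CCC
print join(",", runs_of_exactly("abCCCCCdefgHHH", 3)), "\n";   # HHH
print scalar(() = runs_of_exactly("abCCCCCdefg", 3)), "\n";    # 0
```

Because each iteration consumes a whole maximal run, a run of five can never be mistaken for a run of three.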

Anno
--
If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers.
 
Anno Siegel
      12-02-2005
Eric J. Roode <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> (E-Mail Removed) wrote in news:1133518600.589728.209460
> @g14g2000cwa.googlegroups.com:
>
> > But, suppose I wanted to constrain the match so that it would match
> > three consecutive occurrences, but NO MORE than three. In other words,
> > 'abCCCCCdefg' would NOT match. </paraphrase>


[...]

> First, use a "negative lookahead assertion": (and use the x
> modifier!)
>
> $string =~ / ([\w\d_-]+) \1{2} (?!\1) /x;
>
> But there's still a problem: Though it won't match the first or second
> "CCC" in your above string, it will match the third "CCC". In other
> words, it'll match the "CCC" that begins after "abCC".
>
> So you'll need to use a negative lookbehind assertion, too:
>
> $string =~ /([\w\d_-]+) # Your match
> \1{2} # Two more of it
> (?!\1) # But not another one
> (?<!\1{4}) # Not preceeded by 4 of \1 at this point
> /x;
>
> But there's a problem: since your match is variable-length (due to the +
> quantifier), the negative lookbehind is variable-length, and that is
> unfortunately not yet implemented in Perl.


Capturing multiple characters isn't right anyway, the "+" ought to
be outside the parentheses. (With 6 or more "C", the difference shows.)
But that doesn't solve the problem with variable-length lookbehind.
It complains if you try to interpolate a backreference, even if the
backreference can logically only have one definite length.
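
The difference with six "C"s can be reproduced with the sketch below. The sub names are made up, [\w-] stands in for the original class, and the outer parentheses exist only to return the matched text.

```perl
use strict;
use warnings;

# "+" inside the group: the group may capture "CC" and repeat that.
sub match_plus_inside { return $_[0] =~ /(([\w-]+)\2{2}(?!\2))/ ? $1 : undef }

# Single-character group.
sub match_single_char { return $_[0] =~ /((\w)\2{2}(?!\2))/ ? $1 : undef }

for my $s ("abCCCCCd", "abCCCCCCd") {
    printf "%-10s plus inside: %-7s single char: %s\n", $s,
        match_plus_inside($s) // "none",
        match_single_char($s) // "none";
}
# abCCCCCd   plus inside: CCC     single char: CCC
# abCCCCCCd  plus inside: CCCCCC  single char: CCC
```

With six "C"s the multi-character group captures "CC" and the pattern swallows the whole run, which is clearly not "three and no more".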

Anno

Anno Siegel
      12-02-2005
<(E-Mail Removed)> wrote in comp.lang.perl.misc:
> One particular aspect of a question in another newsgroup
> (http://tinyurl.com/cbakx) interested me; I played around with some
> solutions but couldn't come up with one that I thought was elegant. So
> I thought I would introduce the question to this group for further
> enlightenment.
>
> <paraphrase> of the OP's question:
>
> Suppose I have a string of characters: "abCCCdefg". I want to match
> three consecutive occurrences of any character in a class. In this
> example, my expression would match 'CCC'. OK, that's easy:
>
> #!/usr/bin/perl
> use warnings; use strict;
> my $string = "abCCCdefg";
> print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
> __END__


That regex isn't quite correct, it should only capture one occurrence
of the repeated character, not more. Also, \w already matches digits
and underscore:

[Later correction: It doesn't match underscore. I'm not correcting the
code, it doesn't matter to the discussion]

/(\w)\1{2}/;

> But, suppose I wanted to constrain the match so that it would match
> three consecutive occurrences, but NO MORE than three. In other words,
> 'abCCCCCdefg' would NOT match. </paraphrase>
>
> I thought I could propose an 'elegant' answer like this:
>
> print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;
>
> but that doesn't work (it seems that \1 gets "used up" somehow). Of
> course, I could write a bunch of code to do it... that's trivial to do
> (but ugly, IMHO).


\1 doesn't get consumed, it is interpolated as a character escape, not
a backreference. [^\1] matches all characters except chr(1).

A negative lookahead works as intended, but still doesn't solve the
problem:

qr/([\w\d_-])\1{2}(?!\1)/;

This forces the following character to be different from \1, but
then the regex just moves on and matches the last three "C" in
"abCCCCCdefg". I don't see a way to force it to match only if
the preceding character is different from the repeated one.

Following this vein leads to something like this

my $re = qr/
(.) # any character
(?!\1) # ...followed by a different character
(\w) # ...which is a word character
\2{2} # ...followed by exactly two copies of itself
(?!\2) # ...followed by a different character
/x;

That works with the given examples, but only if there is actual text
before and after the repeated group, not if the repetitions appear
in the beginning or end of the string. Not to mention elegance...

Conclusion: It probably can be done in a single regex, but I doubt it
is worth the effort.

/((\w)\2{2,})/ and length( $1) == 3

Anno

it_says_BALLS_on_your forehead
      12-02-2005

Anno Siegel wrote:
> <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> > One particular aspect of a question in another newsgroup
> > (http://tinyurl.com/cbakx) interested me; I played around with some
> > solutions but couldn't come up with one that I thought was elegant. So
> > I thought I would introduce the question to this group for further
> > enlightenment.
> >
> > <paraphrase> of the OP's question:
> >
> > Suppose I have a string of characters: "abCCCdefg". I want to match
> > three consecutive occurrences of any character in a class. In this
> > example, my expression would match 'CCC'. OK, that's easy:
> >
> > #!/usr/bin/perl
> > use warnings; use strict;
> > my $string = "abCCCdefg";
> > print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
> > __END__

>
> That regex isn't quite correct, it should only capture one occurrence
> of the repeated character, not more. Also, \w already matches digits
> and underscore:
>
> [Later correction: It doesn't match underscore. I'm not correcting the
> code, id doesn't matter to the discussion]


are you sure it doesn't match underscore?

my $string2 = '_';
if ( $string2 =~ m/\w/ ) {
    print "underscore matched.\n";
}
else {
    print "underscore did not match.\n";
}

__OUTPUT__
underscore matched.

 
it_says_BALLS_on_your forehead
      12-02-2005

Anno Siegel wrote:
> <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> > One particular aspect of a question in another newsgroup
> > (http://tinyurl.com/cbakx) interested me; I played around with some
> > solutions but couldn't come up with one that I thought was elegant. So
> > I thought I would introduce the question to this group for further
> > enlightenment.
> >
> > <paraphrase> of the OP's question:
> >
> > Suppose I have a string of characters: "abCCCdefg". I want to match
> > three consecutive occurrences of any character in a class. In this
> > example, my expression would match 'CCC'. OK, that's easy:
> >
> > #!/usr/bin/perl
> > use warnings; use strict;
> > my $string = "abCCCdefg";
> > print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
> > __END__

>
> That regex isn't quite correct, it should only capture one occurrence
> of the repeated character, not more. Also, \w already matches digits
> and underscore:
>
> [Later correction: It doesn't match underscore. I'm not correcting the
> code, id doesn't matter to the discussion]
>
> /(\w)\1{2}/;
>
> > But, suppose I wanted to constrain the match so that it would match
> > three consecutive occurrences, but NO MORE than three. In other words,
> > 'abCCCCCdefg' would NOT match. </paraphrase>
> >
> > I thought I could propose an 'elegant' answer like this:
> >
> > print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;
> >
> > but that doesn't work (it seems that \1 gets "used up" somehow). Of
> > course, I could write a bunch of code to do it... that's trivial to do
> > (but ugly, IMHO).

>
> \1 doesn't get consumed, it is interpolated as a character escape, not
> a backreference. [^\1] matches all characters except chr(1).
>
> A negative lookahead works as intended, but still doesn't solve the
> problem:
>
> qr/([\w\d_-])\1{2}(?!\1)/;
>
> This forces the following character to be different from \1, but
> then the regex just moves on and matches the last three "C" in
> "abCCCCCdefg". I don't see a way to force it to match only if
> the preceding character is different from the repeated one.
>


actually, does the negative lookahead even work? it doesn't seem to. i
appear to get the same results as the OP, although for a different
reason perhaps, since you say that in the context of a character class,
\1 simply is an escaped 1, which is the same as the number 1. when
using the negative lookahead, it appears that the \1 is 'consumed'
already.
(in the example below, it would be \2).

my $testString = "abCCCCd";
if ($testString =~ m/((\w)\2{2})(?!\2)/) {
    print "$1. matched\n";
}
else {
    print "no match\n";
}

__OUTPUT__
CCC. matched

 
Anno Siegel
      12-02-2005
it_says_BALLS_on_your forehead <(E-Mail Removed)> wrote in comp.lang.perl.misc:
>
> Anno Siegel wrote:
> > <(E-Mail Removed)> wrote in comp.lang.perl.misc:


[...]

> > > <paraphrase> of the OP's question:
> > >
> > > Suppose I have a string of characters: "abCCCdefg". I want to match
> > > three consecutive occurrences of any character in a class. In this
> > > example, my expression would match 'CCC'. OK, that's easy:


[...]

> > > But, suppose I wanted to constrain the match so that it would match
> > > three consecutive occurrences, but NO MORE than three. In other words,
> > > 'abCCCCCdefg' would NOT match. </paraphrase>
> > >
> > > I thought I could propose an 'elegant' answer like this:
> > >
> > > print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;
> > >
> > > but that doesn't work (it seems that \1 gets "used up" somehow). Of
> > > course, I could write a bunch of code to do it... that's trivial to do
> > > (but ugly, IMHO).

> >
> > \1 doesn't get consumed, it is interpolated as a character escape, not
> > a backreference. [^\1] matches all characters except chr(1).
> >
> > A negative lookahead works as intended, but still doesn't solve the
> > problem:
> >
> > qr/([\w\d_-])\1{2}(?!\1)/;
> >
> > This forces the following character to be different from \1, but
> > then the regex just moves on and matches the last three "C" in
> > "abCCCCCdefg". I don't see a way to force it to match only if
> > the preceding character is different from the repeated one.
> >

>
> actually, does the negative lookahead even work? it doesn't seem to. i
> appear to get the same results as the OP, although for a different
> reason perhaps, since you say that in the context of a character class,
> \1 simply is an escaped 1, which is the same as the number 1. when


No, it is a character escape. In a non-regex double-quotish string, as in
the interior of [] in a regex, "\1" is the character chr(1), etc.

> using the negative lookahead, it appears that the \1 is 'consumed'
> already.
> (in the example below, it would be \2).
>
> my $testString = "abCCCCd";
> if ($testString =~ m/((\w)\2{2})(?!\2)/) {
> print "$1. matched\n";
> }
> else {
> print "no match\n";
> }
>
> __OUTPUT__
> CCC. matched


So? It matched the last three "C" before "d", as enforced by the
lookahead:

my $testString = "abCCCCd";
if ($testString =~ m/((\w)\2{2})(?!\2)(.*)/) {
    print "$1. matched before $3\n";
}
else {
    print "no match\n";
}

CCC. matched before d

Anno