Interpolation of qr-regexes containing backreferences

 
 
Haakon Riiser
 
      01-22-2004
I just noticed that backreferences in qr-regexes behave differently
from what I expected when they are interpolated into a new regex.
I expected that the meaning of the backreference shouldn't change
when interpolated into a new regex. I.e., one should be able to
do things like:

$re1 = qr{(.)\1};
$re2 = qr{($re1$re1)};

which I would expect to be equivalent to

$re2 = qr{((.)\2(.)\3)};

Perl 5.8.3 instead does this:

$re2 = qr{((.)\1(.)\1)};

I searched for the problem on Google, and found that it has been
known for at least three years. Since it's still here, does that
mean that there's another solution that does not require me to
drop the interpolation and write the entire regex as one big chunk?

Thanks in advance for any replies.
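The difference can be checked directly. This is a minimal sketch; the test strings and the name $by_hand are illustrative and not from the original post. Interpolating a qr object pastes its text into the new pattern, so \1 in $re2 ends up pointing at the outer group, which is still open when the backreference is tried. The interpolated version therefore rejects a string that the hand-numbered version accepts.

```perl
use strict;
use warnings;

my $re1     = qr{(.)\1};           # a doubled character, on its own
my $re2     = qr{($re1$re1)};      # \1 now refers to the new outer group
my $by_hand = qr{((.)\2(.)\3)};    # the numbering the post expected

# Stringifying shows that the text of $re1 was pasted in unchanged.
print "interpolated pattern: $re2\n";

for my $s (qw(aabb abab)) {
    printf "%-4s  interpolated: %-3s  hand-numbered: %s\n", $s,
        ($s =~ /\A$re2\z/     ? 'yes' : 'no'),
        ($s =~ /\A$by_hand\z/ ? 'yes' : 'no');
}
```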

--
Haakon
 
 
 
 
 
Ben Morrow
 
      01-22-2004

Haakon Riiser <(E-Mail Removed)> wrote:
> I just noticed that backreferences in qr-regexes behave differently
> from what I expected when they are interpolated into a new regex.
> I expected that the meaning of the backreference shouldn't change
> when interpolated into a new regex. I.e., one should be able to
> do things like:
>
> $re1 = qr{(.)\1};
> $re2 = qr{($re1$re1)};
>
> which I would expect to be equivalent to
>
> $re2 = qr{((.)\2(.)\3)};
>
> Perl 5.8.3 instead does this:
>
> $re2 = qr{((.)\1(.)\1)};


You could try (untested):

my $re1 = qr[(.)(??{$^N})];
my $re2 = qr[($re1$re1)];

Ben

--
perl -e'print map {/.(.)/s} sort unpack "a2"x26, pack "N"x13,
qw/1632265075 1651865445 1685354798 1696626283 1752131169 1769237618
1801808488 1830841936 1886550130 1914728293 1936225377 1969451372
2047502190/' # (E-Mail Removed)
 
 
 
 
 
Haakon Riiser
 
      01-22-2004
[Ben Morrow]

> Haakon Riiser <(E-Mail Removed)> wrote:
>> [...] I.e., one should be able to do things like:
>>
>> $re1 = qr{(.)\1};
>> $re2 = qr{($re1$re1)};
>>
>> which I would expect to be equivalent to
>>
>> $re2 = qr{((.)\2(.)\3)};
>>

>
> You could try (untested):
>
> my $re1 = qr[(.)(??{$^N})];
> my $re2 = qr[($re1$re1)];


Thanks, this works great! I've usually tried to avoid "highly
experimental" regex features such as (??{ ... }), but it's been
marked highly experimental for a few years now, so how dangerous
could it be?

I should probably reread that section of the regex manual since
I didn't pay too much attention to it the first time, it being
experimental and all.
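
For readers who have not used the construct: (??{ code }) runs the code during matching and treats its return value as a pattern to match at that point, and $^N holds the text of the most recently closed capture group. A small sketch of Ben's suggestion follows. The quotemeta call and the test strings are additions, assumed here so that a captured metacharacter such as '.' is matched literally rather than as a pattern.

```perl
use strict;
use warnings;

# (.) captures one character; the postponed subexpression then has to
# match that same character again, whatever group number (.) ends up with.
my $re1 = qr[(.)(??{ quotemeta $^N })];
my $re2 = qr[($re1$re1)];

for my $s (qw(aabb .abb ..bb)) {
    printf "%-4s %s\n", $s, ($s =~ /\A$re2\z/ ? 'matches' : 'does not match');
}
```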

--
Haakon
 
 
Haakon Riiser
 
      01-22-2004
[Ben Morrow]

> You could try (untested):
>
> my $re1 = qr[(.)(??{$^N})];
> my $re2 = qr[($re1$re1)];


One question regarding the behavior of (??{ ... }):
Take the following code: (Notice that there are two versions of
the $quoted_literal regex. The first one uses (??{ ... }) and $^N
and the other one uses the delimiter directly.)

use warnings;

$quoted_literal = qr/
(")
(??{ "[^$^N]*$^N" })
/x;

$quoted_literal = qr/
"
[^"]*
"
/x;

$data = 'this is "hello" world';
@list = $data =~ /($quoted_literal|[^"]*)/g;
for ($i = 0; $i < @list; $i++) {
printf "[$i] '\%s'\n", defined $list[$i] ? $list[$i] : "UNDEFINED";
}

If I run this program as it is (using the simple direct version of
$quoted_literal) the output is

[0] 'this is '
[1] '"hello"'
[2] ' world'
[3] ''

If the simple version of $quoted_literal is removed, i.e. making the
script use the (??{ ... }) / $^N version, the result is completely
different:

[0] 'this is '
[1] 'UNDEFINED'
[2] '"hello"'
[3] '"'
[4] ' world'
[5] 'UNDEFINED'
[6] ''
[7] 'UNDEFINED'

As I understood it, the two versions of $quoted_literal should
match exactly the same text, so I can't figure out why the results
aren't the same. Any help in understanding why this happens,
and preferably fixing it, is greatly appreciated.

--
Haakon
 
 
Ben Morrow
 
      01-22-2004

Haakon Riiser <(E-Mail Removed)> wrote:
> [Ben Morrow]
>
> > You could try (untested):
> >
> > my $re1 = qr[(.)(??{$^N})];
> > my $re2 = qr[($re1$re1)];

>
> One question regarding the behavior of (??{ ... }):
> Take the following code: (Notice that there are two versions of
> the $quoted_literal regex. The first one uses (??{ ... }) and $^N
> and the other one uses the delimiter directly.)
>
> use warnings;
>
> $quoted_literal = qr/
> (")
> (??{ "[^$^N]*$^N" })
> /x;
>
> $quoted_literal = qr/
> "
> [^"]*
> "
> /x;
>
> $data = 'this is "hello" world';
> @list = $data =~ /($quoted_literal|[^"]*)/g;
> for ($i = 0; $i < @list; $i++) {
> printf "[$i] '\%s'\n", defined $list[$i] ? $list[$i] : "UNDEFINED";
> }
>
> If I run this program as it is (using the simple direct version of
> $quoted_literal) the output is
>
> [0] 'this is '
> [1] '"hello"'
> [2] ' world'
> [3] ''
>
> If the simple version of $quoted_literal is removed, i.e. making the
> script use the (??{ ... }) / $^N version, the result is completely
> different:
>
> [0] 'this is '
> [1] 'UNDEFINED'
> [2] '"hello"'
> [3] '"'
> [4] ' world'
> [5] 'UNDEFINED'
> [6] ''
> [7] 'UNDEFINED'
>
> As I understood it, the two versions of $quoted_literal should
> match exactly the same text, so I can't figure out why the results
> aren't the same. Any help in understanding why this happens,
> and preferably fixing it, is greatly appreciated.


The regex with (??{}) in it has an extra set of parentheses. If you
take the second output again, and number the rows:

> [0] 'this is '    $1
> [1] 'UNDEFINED'   $2
> [2] '"hello"'     $1
> [3] '"'           $2
> [4] ' world'      $1
> [5] 'UNDEFINED'   $2
> [6] ''            $1
> [7] 'UNDEFINED'   $2


it should be clear. BTW, you would almost certainly be better off
using Text::Balanced for this sort of thing.
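
To make the group counting concrete: in list context, m//g returns every capture group for every match, so the inner (") contributes a second element each time, undefined whenever the quoted alternative was not taken. One way to keep only the token text is to iterate in scalar context and read $1 alone. This is a sketch under the assumption that only $1 is wanted, not necessarily the fix the poster ended up using; the quotemeta call is also an addition.

```perl
use strict;
use warnings;

# The opening delimiter is captured, then read back through the
# "last closed group" variable inside the postponed subexpression.
my $quoted_literal = qr/
    (")
    (??{ my $q = quotemeta $^N; "[^$q]*$q" })
/x;

my $data = 'this is "hello" world';

# List context: two elements per match, the outer group and the inner one.
my @all = $data =~ /($quoted_literal|[^"]*)/g;

# Scalar context: one iteration per match, keeping only the outer group.
my @tokens;
while ($data =~ /($quoted_literal|[^"]*)/g) {
    push @tokens, $1;
}

printf "list context: %d elements, loop: %d tokens\n",
    scalar @all, scalar @tokens;
print map { "'$_'\n" } @tokens;
```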

Ben

--
EAT
KIDS (...er, whoops...)
FOR (E-Mail Removed)
99p
 
 
Haakon Riiser
 
      01-23-2004
[Ben Morrow]

> The regex with (??{}) in it has an extra set of parentheses. If
> you take the second output again, and number the rows:
>
>> [0] 'this is ' $1
>> [1] 'UNDEFINED' $2
>> [2] '"hello"' $1
>> [3] '"' $2
>> [4] ' world' $1
>> [5] 'UNDEFINED' $2
>> [6] '' $1
>> [7] 'UNDEFINED' $2

>
> it should be clear.


Argh, I can't believe I didn't spot that one. Time to take a
break I guess.

> BTW, you would almost certainly be better off using
> Text::Balanced for this sort of thing.


That would require me to totally rewrite my tokenizer. I was
working on a small parser (using the wonderful Parse::Yapp),
and did the entire tokenizing with a single regex-match.

@tokens = $raw_data =~ m{
$comment | ( $quoted_literal | $special | $op | $unquoted_literal )
}gx;
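
A runnable sketch of that single-match tokenizer shape may help. The five token classes below are hypothetical stand-ins for a toy language, since the real definitions are not shown in the thread. It also illustrates why internal capture groups matter here: every group adds an element per match, and a comment match leaves the one outer group undefined.

```perl
use strict;
use warnings;

# Hypothetical token classes; none of them contains a capture group.
my $comment          = qr/\#[^\n]*/;
my $quoted_literal   = qr/"[^"]*"|'[^']*'/;
my $special          = qr/[{}();]/;
my $op               = qr/[-+*\/=]/;
my $unquoted_literal = qr/[^\s"'{}();\#+*\/=-]+/;

my $raw_data = qq{name = "some value" ; # trailing comment\n};

# A comment matches without setting the outer group, so it shows up as
# undef in the list and is filtered out here.
my @tokens = grep { defined } $raw_data =~ m{
    $comment | ( $quoted_literal | $special | $op | $unquoted_literal )
}gx;

print join(' | ', @tokens), "\n";
```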

The language is quite simple, so it is possible to do every regex
without using internal capturing. The only construct that would
be simplified with backreferences was $quoted_literal, which
supports three types of strings: double quoted, single quoted,
and user-defined delimiter.

" ... "
' ... '
^c ... c

where c can be any character, and the delimiters can be escaped
by putting two of them next to each other:

'foo ''bar'' baz' == foo 'bar' baz

Since the third string type supports any character as a delimiter,
it would be nice if I could use backreferences. Now that that's
out of the question, I chose instead to generate a bunch of regexes
(one for each ASCII character) using sprintf. Not as elegant,
but it works, and it's probably faster than the equivalent solution
with backreferences would have been.
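
The thread does not show the generated patterns, so the following is only a guess at what such a generator could look like. It assumes printable, non-space ASCII delimiters (0x21 to 0x7E) and builds one non-capturing alternative per delimiter; the names @alternatives and $user_delimited are made up for this sketch.

```perl
use strict;
use warnings;

# For each candidate delimiter c, emit a pattern for  ^c ... c  in which
# a doubled delimiter (cc) inside the body stands for one literal c.
my @alternatives = map {
    my $c = quotemeta chr $_;
    sprintf '\^%s(?:[^%s]|%s%s)*%s', $c, $c, $c, $c, $c;
} 0x21 .. 0x7E;

# Join the alternatives into one regex with no capture groups.
my $user_delimited = do {
    my $joined = join '|', @alternatives;
    qr/(?:$joined)/;
};

for my $s ('^|a||b|', '^xfoox', '^xfoo') {
    printf "%-8s %s\n", $s, ($s =~ /\A$user_delimited\z/ ? 'ok' : 'rejected');
}
```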

--
Haakon
 
 
Ilya Zakharevich
 
      01-23-2004
[A complimentary Cc of this posting was sent to
Haakon Riiser
<(E-Mail Removed)>], who wrote in article <(E-Mail Removed)>:
> $re1 = qr{(.)\1};
> $re2 = qr{($re1$re1)};
>
> which I would expect to be equivalent to
>
> $re2 = qr{((.)\2(.)\3)};


What makes you expect this? qr() is an analogue of qq() etc...

> Perl 5.8.3 instead does this:
>
> $re2 = qr{((.)\1(.)\1)};


As designed...

Hope this helps,
Ilya
 
 
Haakon Riiser
 
      01-25-2004
[Ilya Zakharevich]

>>> What makes you expect this? qr() is an analogue of qq() etc...

>
>> That's not how I would design it.

>
> Who cares? What is important is how it *is* designed.


Who cares how it is designed? You asked me what made *me* expect
that qr regexes can be interpolated with predictable behavior.
The answer was, of course, that this would make sense to me,
while the current design makes no sense, since you can accomplish
the same thing by interpolating a string representation of the
regex, while the more useful case of localized regex scope without
capturing side effects is impossible to achieve.

>> I think that if you need the
>> regex to be interpolated exactly as written, use q// or qq//.

>
> "Exactly as written"??? And which do you think it would be, q// or qq//?
> (One cannot replace qr() by qq(), any more than replace qq() by q().)


I shouldn't have to explain what I mean by "exactly as written".
In the case with q, that means character-by-character. With qq,
it means that the result of the string processing (translation
of character escapes such as \n and \t, and variable interpolation)
is interpolated directly.

>> Interpolation of qr// should rewrite the regex, if necessary,
>> so that it matches the same text as it would match when used on
>> its own.

>
> That's (??{}). Why do you want to merge two different cases into one?


As I said in the previous post,

Interpolation of qr// should rewrite the regex, if necessary,
so that it matches the same text as it would match when used on
its own. This is much more useful, since you can then build
up a large regex from several small qr chunks, without having
to worry that modifications to one of the building blocks will
suddenly break regexes interpolated after it.

I think that the string type interpolation of qr that you think
is so well designed is an ugly kludge that makes big regexes
hard to maintain. I can't see *any* reason as to why you can't
simply create the regex as a regular string and interpolate that,
if you so desperately need separate regex building blocks that
can refer to each other. qr regexes could then be used when you
need the regexes to be completely shielded from each other (which
in my experience is *much* more common than wanting spaghetti
code regexes), and we wouldn't have to resort to (??{}) to get
something as common as backreferences.
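
For what it is worth, Perl 5.10 and later (newer than the 5.8.3 discussed in this thread) provide relative backreferences, \g{-1}, which count backwards from the point of use and so keep their meaning when a qr chunk is interpolated elsewhere. A minimal sketch, with illustrative variable names and test strings:

```perl
use strict;
use warnings;
use 5.010;    # relative backreferences need Perl 5.10 or later

# \g{-1} refers to the capture group that opened most recently before
# this point, independent of absolute group numbers.
my $pair  = qr/(.)\g{-1}/;
my $pairs = qr/($pair$pair)/;

for my $s (qw(aabb abab)) {
    say "$s: ", ($s =~ /\A$pairs\z/ ? "matches, \$1 is '$1'" : 'no match');
}
```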

I sure hope Ben Morrow was right when he said that qr interpolation
works the way I like it in Perl 6.

--
Haakon
 
 
Ilya Zakharevich
 
      01-25-2004
[A complimentary Cc of this posting was sent to
Haakon Riiser
<(E-Mail Removed)>], who wrote in article <(E-Mail Removed)>:
> >>> What makes you expect this? qr() is an analogue of qq() etc...


> >> That's not how I would design it.


> > Who cares? What is important is how it *is* designed.


> Who cares how it is designed? You asked me what made *me* expect
> that qr regexes can be interpolated with predictable behavior.


Do not put words in my mouth, please.

> The answer was, of course, that this would make sense to me,


So what the documentation says does not matter, right?

> while the current design makes no sense since you can accomplish
> the same thing by interpolating a string representation of the
> regex, while the more useful case of localized regex scope w/o
> capturing side effects is impossible to achieve.


I see that you not only do not read the docs, but also do not read the
answers to your questions on this newsgroup.

[Omitting meaningless suggestions already refuted in the preceding
discussion.]

Hope this helps,
Ilya
 
 
gnari
 
      01-25-2004
"Ilya Zakharevich" <(E-Mail Removed)> wrote in message
news:bv13d0$18g2$(E-Mail Removed)...
> [A complimentary Cc of this posting was sent to
> Haakon Riiser
> <(E-Mail Removed)>], who wrote in article <(E-Mail Removed)>:
> > >>> What makes you expect this? qr() is an analogue of qq() etc...

>
> > >> That's not how I would design it.

>
> > > Who cares? What is important is how it *is* designed.

>
> > Who cares how it is designed? You asked me what made *me* expect
> > that qr regexes can be interpolated with predictable behavior.

>
> Do not put words in my mouth, please.
>
> > The answer was, of course, that this would make sense to me,

>
> So what the documentation says does not matter, right?
>
> > while the current design makes no sense since you can accomplish
> > the same thing by interpolating a string representation of the
> > regex, while the more useful case of localized regex scope w/o
> > capturing side effects is impossible to achieve.

>
> I see that you not only do not read the docs, but also do not read the
> answers to your questions on this newsgroup.


hey. no need to let this degenerate into a flame war.

looked to me like the OP was familiar with the way it works,
but was expressing his view that he would have expected it to
be implemented differently than it is. some of the follow-ups
have been interesting, actually, and the original question was not
without merit.

gnari.



 