Variable length lookbehind not implemented

 
 
fmassion@web.de
 
      08-21-2013
Hi folks:

My text (sample):

saddle stitcher: <font color="#008080"><b>repl. of 8 saddle stitcher</b></font> <font color="#8000FF">

Goal:
I want to put numbers in square brackets, but only if they do not occur within tags.

My code:

#!/usr/bin/perl -w
open(IN,'sample.txt') || die("Datei kann nicht geöffnet werden!\n");
my $number = '(?<!<.*?)\d+(?!.*?>)';
while(<IN>) {
$_ =~ s/$number/\[$number\]/g;
print "$_\n";
}
close (IN);

Error message:

Variable length lookbehind not implemented in regex m/(?<!<.*?)\d+(?!.*?>)/ at D:\Perl\test.pl line 5, <IN> line 1.

I couldn't find an explanation for this error message. Has anyone an idea?
 
 
 
 
 
Charles DeRykus
 
      08-21-2013
On 8/21/2013 10:14 AM, (E-Mail Removed) wrote:
> Hi folks:
>
> My text (sample):
>
> saddle stitcher: <font color="#008080"><b>repl. of 8 saddle stitcher</b></font> <font color="#8000FF">
>
> Goal:
> I want to put numbers in square brackets, but only if they do not occur within tags.
>
> My code:
>
> #!/usr/bin/perl -w
> open(IN,'sample.txt') || die("Datei kann nicht geöffnet werden!\n");
> my $number = '(?<!<.*?)\d+(?!.*?>)';
> while(<IN>) {
> $_ =~ s/$number/\[$number\]/g;
> print "$_\n";
> }
> close (IN);
>
> Error message:
>
> Variable length lookbehind not implemented in regex m/(?<!<.*?)\d+(?!.*?>)/ at D:\Perl\test.pl line 5, <IN> line 1.
>
> I couldn't find an explanation for this error message. Has anyone an idea?
>


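See "negative look-behind" in perlre. The explanation is "works only for
fixed-width look-behind". The .*? inside your (?<!...) can match any
number of characters, so the pattern is rejected when it is compiled.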
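
To see the restriction in isolation, here is a minimal sketch. The helper name compiles() and the two sample patterns are only illustrations, not part of the original script. The patterns are kept in strings so that they are compiled at run time and eval can trap the failure. The exact wording of the error depends on the perl version (newer perls accept some bounded variable-length lookbehinds), but as far as I know an unbounded .*? is still rejected.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the pattern string compiles as a regex.
# The pattern is interpolated so that compilation happens at run time,
# where eval can catch the error.
sub compiles {
    my ($pat) = @_;
    my $re = eval { qr/$pat/ };
    return defined($re) ? 1 : 0;
}

my $fixed    = '(?<!<)\d+';      # lookbehind of exactly one character
my $variable = '(?<!<.*?)\d+';   # unbounded lookbehind, as in the question

for my $pat ($fixed, $variable) {
    print compiles($pat) ? "compiles: " : "rejected: ", $pat, "\n";
}
```

Note that a fixed-width lookbehind such as (?<!<) only checks the single character before the number, so it cannot answer "am I inside a tag?" by itself. That is why the approach below splits the text into tag and non-tag parts instead.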

A quick, probably fragile, alternative:

my $text;
{ undef $/; $text = <IN>;}

while ( $text =~ /\G ([^<]*?) (<.*?>) /sgx ) {
my($out, $in) = ($1,$2);
$out =~ s/(\d+)/[$1]/ag;
print $out, $in;
}



--
Charles DeRykus


 
 
 
 
 
Charles DeRykus
 
      08-21-2013
On 8/21/2013 2:11 PM, Charles DeRykus wrote:
> ....
>
> my $text;
> { undef $/; $text = <IN>;}
>


Better written: { local $/; $text = <IN>}

--
Charles DeRykus

 
 
Uri Guttman
 
      08-22-2013
>>>>> "CD" == Charles DeRykus <(E-Mail Removed)> writes:

CD> On 8/21/2013 2:11 PM, Charles DeRykus wrote:
>> ....
>>
>> my $text;
>> { undef $/; $text = <IN>;}
>>


CD> Better written: { local $/; $text = <IN>}

even better:

use File::Slurp ;
my $text = read_file( $file ) ;

uri
 
 
fmassion@web.de
 
      08-22-2013
Thanks to all of you for the explanations.

This code does the trick:

use File::Slurp ;
my $text = read_file( 'testfile.txt' ) ;
while ( $text =~ /\G ([^<]*?) (<.*?>) /sgx ) {
my($out, $in) = ($1,$2);
$out =~ s/(\d+)/[$1]/ag;
print $out, $in;
}

It also works with these lines:
my $text;
{ undef $/; $text = <IN>;}

This is the result of the test:

saddle stitcher </font><font color="#008080"><b>repl. of [2] saddle stitcher</b></font> <font color="#8000FF">Mishandled paper </font><font color="#008080"><b>repl. of mishandled paper</b></font><br>Please add [8] staples .... (only numbers outside the tags have been processed.)
Francois
 
 
fmassion@web.de
 
      08-22-2013
Sorry, I found a flaw in the expression:

while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {

If the text doesn't end with a tag, the last $out is not printed in:
print $out, $in;

The last printed character is a ">"
We need somehow to find an expression which prints the remaining characters.
 
 
Rainer Weikusat
 
      08-22-2013
Charles DeRykus <(E-Mail Removed)> writes:
> On 8/21/2013 2:11 PM, Charles DeRykus wrote:
>> ....
>>
>> my $text;
>> { undef $/; $text = <IN>;}
>>

>
> Better written: { local $/; $text = <IN>}


Adding the reason for that: local $/ creates a new binding for $/
which is dynamically scoped to the enclosing block (it has dynamic
extent and indefinite scope[*]). This implies that $/ reverts to its
former value after the enclosing block has finished executing. Except
in very 'controlled and limited' circumstances, this is preferable to
overwriting whatever the current value happens to be at the moment and
'leaking' this 'local policy decision' to all the code executing
after the block.

[*] The Lisp terminology[**] is somewhat lacking here because the
newly established binding is only visible to code which is reachable
via an execution path starting in the block, and this will usually only
be a subset of all of the program code (in the absence of travesties like
'execute a random function found via the symbol table of a random
package').

[**]

http://www.cs.cmu.edu/Groups/AI/html...lm/node43.html
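
The difference can be observed directly. The following is only a sketch: the sub names and the in-memory "file" in $data are made up for the demonstration, and the real script would read from its own filehandle instead.

```perl
use strict;
use warnings;

my $data = "line 1\nline 2\n";   # stand-in for the contents of a file

# Slurp with local: $/ is undefined only until the do block exits.
sub slurp_local {
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    my $text = do { local $/; <$fh> };
    return $text;
}

# Slurp with undef: $/ stays undefined for all code that runs afterwards.
sub slurp_undef {
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    undef $/;
    my $text = <$fh>;
    return $text;
}

sub separator_state { return defined($/) ? 'still set' : 'undefined' }

slurp_local();
print 'after local $/: separator is ', separator_state(), "\n";

slurp_undef();
print 'after undef $/: separator is ', separator_state(), "\n";
```

Any later line-by-line read in the same program would silently turn into a whole-file read after slurp_undef(), which is the 'leak' described above.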
 
 
Rainer Weikusat
 
      08-22-2013
(E-Mail Removed) writes:
> Sorry, I found a flaw in the expression:
>
> while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {
>
> If the text doesn't end with a tag, the last $out is not printed in:
> print $out, $in;
>
> The last printed character is a ">"


You could use a proper 'lexer' for HTML.

NB: This is something I just wrote down because I thought it couldn't
be that difficult. It is assumed that numbers which are part of a word
shouldn't be bracketed.

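--------------
# Slurp all of STDIN into $_.
{
    local $/;
    $_ = <STDIN>;
}

my $in_tag;

# Bare block used as a loop: each branch consumes some input at \G and
# restarts the block with redo; the block ends once nothing matches.
{
    unless ($in_tag) {
        # Start of a tag.
        /\G</gc && do {
            ++$in_tag;
            print('<');
            redo;
        };

        # A number standing on its own gets bracketed.
        /\G\b(\d+)\b/gc && do {
            print("[$1]");
            redo;
        };

        # Digits that are part of a word, or any other text, pass through.
        (/\G(\d+)/gc
         || /\G([^\d<]+)/gc) && do {
            print($1);
            redo;
        };
    } else {
        # End of a tag.
        /\G>/gc && do {
            print('>');
            --$in_tag;
            redo;
        };

        # A '<' inside a tag increases the nesting depth.
        /\G</gc && do {
            print('<');
            ++$in_tag;
            redo;
        };

        # Tag content is copied unchanged.
        /\G([^<>]+)/gc && do {
            print($1);
            redo;
        };
    }
}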
 
 
Charles DeRykus
 
      08-22-2013
On 8/22/2013 6:05 AM, (E-Mail Removed) wrote:
> Sorry, I found a flaw in the expression:
>
> while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {
>
> If the text doesn't end with a tag, the last $out is not printed in:
> print $out, $in;
>
> The last printed character is a ">"
> We need somehow to find an expression which prints the remaining characters.




This might be a quick fix.. but again it's probably fragile
in many cases.

while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
my($out, $in) = ($1 // '', $2 // '');
$out =~ s/(\d+)/[$1]/ag;
print $out,$in;
}

If unfamiliar with any of the above replacement regex items, see:

perldoc perlre   # (?: ) and/or \z
perldoc perlop   # \G and/or //

Also see perlre for the /a modifier.
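
As a small illustration of why the // '' defaults are there, this sketch collects the raw captures of each match. The helper name tag_pairs() and the sample string are made up. When the \z branch of the alternation is taken, the second group does not take part in the match and $2 is undefined.

```perl
use strict;
use warnings;

# Hypothetical helper: return [$1, $2] for every match of the loop regex.
sub tag_pairs {
    my ($text) = @_;
    my @pairs;
    while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

# Sample input that ends with plain text rather than a tag.
for my $pair ( tag_pairs('a <b>c') ) {
    my ($out, $in) = @$pair;
    printf "out=%-6s in=%s\n", "'$out'", defined($in) ? "'$in'" : 'undef';
}
```

Without the defaults, printing $in for the trailing text would trigger an "uninitialized value" warning under -w.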

--
Charles DeRykus



 
 
Rainer Weikusat
 
      08-22-2013
Charles DeRykus <(E-Mail Removed)> writes:
> On 8/22/2013 6:05 AM, (E-Mail Removed) wrote:
>> Sorry, I found a flaw in the expression:
>>
>> while ( $text =~ /\G([^<]*?)(<.*?>)/sgx ) {
>>
>> If the text doesn't end with a tag, the last $out is not printed in:
>> print $out, $in;


[...]

> This might be a quick fix.. but again it's probably fragile
> in many cases.
>
> while ( $text =~ /\G ([^<]*) (?: (<.*?>) | \z ) /sgx ) {
> my($out, $in) = ($1 // '', $2 // '');
> $out =~ s/(\d+)/[$1]/ag;
> print $out,$in;
> }


It will also replace numbers in words (which may or may not be
desired). Also, according to a quick test, using

while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {

works, too.
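
Putting the pieces together, a sketch might look like this. The sub name bracket_numbers() and its $whole_only switch are my own invention for illustration. It is still regex-based and therefore fragile for real HTML (comments, '>' inside attribute values and so on). The /a modifier is kept from the loops above, so a perl that supports it is assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: bracket digit runs that occur outside <...> tags.
# With $whole_only true, digits glued to letters (e.g. "A4") are left alone.
sub bracket_numbers {
    my ($text, $whole_only) = @_;
    my $result = '';
    while ( $text =~ /\G ([^<]*) (<.*?>)? /sgx ) {
        my ($out, $in) = ($1, defined($2) ? $2 : '');
        last if $out eq '' && $in eq '';   # nothing consumed: stop
        if ($whole_only) {
            $out =~ s/\b(\d+)\b/[$1]/ag;
        } else {
            $out =~ s/(\d+)/[$1]/ag;
        }
        $result .= $out . $in;
    }
    # Copy whatever was not consumed (e.g. an unclosed '<') unchanged.
    $result .= substr($text, pos($text));
    return $result;
}

print bracket_numbers("repl. of 8 <font size=\"2\">A4 pages</font> and 12 more\n");
```

Returning a string instead of printing inside the loop makes it easier to check the trailing-text case separately.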
 