Velocity Reviews - Computer Hardware Reviews

Apparent bug in Perl 5.10 regexes w. UTF-8 expression

 
 
Ben Bullock
      07-13-2008
I've found a place where Perl seems to behave differently depending on
whether something is marked as UTF-8 or not, regardless of the fact that
it is just ASCII.

In the following code snippet,

#!/usr/local/bin/perl -lw
use strict;
use Encode 'decode';
use Lingua::JA::FindDates 'subsjdate';
binmode STDERR,"utf8";
binmode STDOUT,"utf8";
print STDERR "first try\n";
my $test = "ABCDEFG";
print subsjdate($test);
print STDERR "now try again\n";
$test = decode ('utf8', $test);
print subsjdate($test);

the output is like this:

ben ~ 541 $ ./test2.pl
first try

Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
ABCDEFG
now try again

ABCDEFG
ben ~ 542 $

But, if I

use utf8;

and call the routine with a non-ascii string, like 平成, I don't get the
error messages.

What's more, after about one hour of exhaustive checking, I'm fairly sure
that there is no uninitialized value in the pattern match in question. In
fact I can remove the error message by removing a variable which is
initialized, called $kanjidigits, from the pattern match, but that seems
even more weird.

I think the above-described behaviour, regardless of any errors in the
module, indicates an error in Perl. Also, I think there is nothing wrong
with the module. Does anybody have any other opinions?
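For anyone who wants to see the "marked as UTF-8" state directly, here is a minimal sketch. It is not taken from the test script above: it uses the core function utf8::upgrade instead of Encode::decode to switch on the internal flag (an assumption made so the effect does not depend on the Encode version), and the variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two copies of the same ASCII text.
my $plain    = "ABCDEFG";
my $upgraded = $plain;

# Switch the copy to Perl's internal UTF-8 representation.
# The characters are unchanged; only the internal flag differs.
utf8::upgrade($upgraded);

printf "plain:    flag=%d\n", utf8::is_utf8($plain)    ? 1 : 0;
printf "upgraded: flag=%d\n", utf8::is_utf8($upgraded) ? 1 : 0;
printf "equal:    %s\n",      $plain eq $upgraded ? "yes" : "no";
```

The two strings compare equal with eq, so any difference in how a regex treats them comes from the internal representation and not from the text itself.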

 
Peter J. Holzer
      07-13-2008
On 2008-07-13 14:14, Ben Bullock <(E-Mail Removed)> wrote:
> I've found a place where Perl seems to behave differently depending on
> whether something is marked as UTF-8 or not, regardless of the fact that
> it is just ASCII.
>
> In the following code snippet,
>
> #!/usr/local/bin/perl -lw
> use strict;
> use Encode 'decode';
> use Lingua::JA::FindDates 'subsjdate';
> binmode STDERR,"utf8";
> binmode STDOUT,"utf8";
> print STDERR "first try\n";
> my $test = "ABCDEFG";
> print subsjdate($test);
> print STDERR "now try again\n";
> $test = decode ('utf8', $test);
> print subsjdate($test);
>
> the output is like this:
>
> ben ~ 541 $ ./test2.pl
> first try
>
> Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
> site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.

[...]
> What's more, after about one hour of exhaustive checking, I'm fairly sure
> that there is no uninitialized value in the pattern match in question.


Right. Your problem can be reproduced with this script:

#!/usr/bin/perl
use warnings;
use strict;

my $regex =
"([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";
my $test = "ABCDEFG";
if ($test =~ /($regex)/) {
print "m:<$1>\n";
}
__END__

If the last character ("\x{5e74}") is removed from the regexp, the
warning vanishes. But if the capturing () group is removed (leaving just
"\\s*\x{5e74}"), the warning vanishes, too. So it's not \x{5e74} alone
that triggers the warning, only \x{5e74} combined with something else.
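The narrowing-down described above can be scripted. The following is only a sketch: the helper count_warnings and the variant names are made up for illustration, and the pattern is assembled from the same code points as in the script above. It counts warnings for each variant and makes no claim about what a particular Perl version will report.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Kanji digits from the reported pattern (code points copied from the post).
my $kanji = "\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}"
          . "\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}";

# The capturing "year" group: four digits, or a kanji number containing 千.
my $year = "([\x{ff10}-\x{ff19}0-9]{4}|[${kanji}]?\x{5343}[${kanji}]*)";

# Hypothetical variant names, one per experiment described above.
my %variant = (
    full       => "$year\\s*\x{5e74}",   # the full reported pattern
    no_nen     => "$year\\s*",           # trailing \x{5e74} removed
    no_capture => "\\s*\x{5e74}",        # capturing group removed
);

# Run one match and return how many warnings it emitted.
sub count_warnings {
    my ($pattern, $subject) = @_;
    my $count = 0;
    local $SIG{__WARN__} = sub { $count++ };
    my $matched = $subject =~ /($pattern)/;
    return $count;
}

my %warnings_for;
for my $name (sort keys %variant) {
    $warnings_for{$name} = count_warnings($variant{$name}, "ABCDEFG");
    print "$name: $warnings_for{$name} warning(s)\n";
}
```

Running it under several Perl versions side by side would show which ones emit the spurious warning for which variant.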

hp
 
Ben Morrow
      07-13-2008

Quoth "Peter J. Holzer" <(E-Mail Removed)>:
> On 2008-07-13 14:14, Ben Bullock <(E-Mail Removed)> wrote:
> > I've found a place where Perl seems to behave differently depending on
> > whether something is marked as UTF-8 or not, regardless of the fact that
> > it is just ASCII.

>
> Right. Your problem can be reproduced with this script:
>
> #!/usr/bin/perl
> use warnings;
> use strict;
>
> my $regex =
> "([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";


Using utf8 in regexen is not well-supported in 5.8; in particular, the
regex engine is not consistent about when to apply utf8 semantics and
when to apply byte semantics. Some of the bugs have been fixed in 5.10;
I don't know if they all have.
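As a small illustration of the utf8-versus-byte semantics split (a sketch, separate from the bug reported in this thread): the same one-character string can be classified differently by \w depending on its internal representation. This relies on the default matching rules described in perlunicode under "The Unicode Bug"; if the unicode_strings feature or the /u modifier is in effect, both matches use Unicode rules.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# U+00E9 (e with acute accent) held as a single native byte.
my $native   = "\xe9";
my $upgraded = $native;

# Same character, but stored internally as UTF-8.
utf8::upgrade($upgraded);

# Under the default rules the native copy is matched with byte
# semantics and the upgraded copy with Unicode semantics.
my $native_is_word   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_is_word = $upgraded =~ /\w/ ? 1 : 0;

print "strings equal:    ", ($native eq $upgraded ? "yes" : "no"), "\n";
print "native   =~ /\\w/: $native_is_word\n";
print "upgraded =~ /\\w/: $upgraded_is_word\n";
```

If the two match results differ while the strings still compare equal, that is the inconsistency being described.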

Ben

--
For far more marvellous is the truth than any artists of the past imagined!
Why do the poets of the present not speak of it? What men are poets who can
speak of Jupiter if he were like a man, but if he is an immense spinning
sphere of methane and ammonia must be silent? [Feynman] (E-Mail Removed)
 
Ben Bullock
      07-13-2008
On Sun, 13 Jul 2008 19:46:14 +0100, Ben Morrow wrote:

> Quoth "Peter J. Holzer" <(E-Mail Removed)>:
>> On 2008-07-13 14:14, Ben Bullock <(E-Mail Removed)> wrote:
>> > I've found a place where Perl seems to behave differently depending on
>> > whether something is marked as UTF-8 or not, regardless of the fact that
>> > it is just ASCII.

>>
>> Right. Your problem can be reproduced with this script:
>>
>> #!/usr/bin/perl
>> use warnings;
>> use strict;
>>
>> my $regex =
>> "([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";
>
> Using utf8 in regexen is not well-supported in 5.8; in particular, the
> regex engine is not consistent about when to apply utf8 semantics and
> when to apply byte semantics. Some of the bugs have been fixed in 5.10;
> I don't know if they all have.


The problem I described is the behaviour of Perl 5.10:

ben ~ 501 $ perl --version

This is perl, v5.10.0 built for i686-linux

Copyright 1987-2007, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

ben ~ 502 $ ./test2.pl
first try

Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.

etc.

Should I report this as a bug?


 
Ben Bullock
      07-13-2008
On Sun, 13 Jul 2008 22:18:43 +0000, Ben Bullock wrote:

> Should I report this as a bug?


Never mind, I reported it anyway.
 