utf8 in regexp (perl 5.8.1)

 
 
Wes Groleau
 
      04-11-2005
I have a file containing thousands of Spanish words, encoded (AFAIK)
in UTF-8. I also have a perl script in UTF-8, which says (hope
pasting works):

#!/usr/bin/perl -w -CSD
#
# NOTE: The extra space in the bang line is mandatory (bug in perl 5.
use warnings;
use strict;
use utf8;

while (<>)
{
print if ( /ñ/ )
}

What is in the regexp is supposed to be "small n with tilde"
and I verified with od -xc that it is hex C3 B1 as is every
place in the file where that letter appears.
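
For reference, the same byte check can be done from Perl itself. This is a minimal sketch, assuming the Encode module is available (it ships with perl 5.8); the variable names are illustrative, and the letter is written as an escape so the example does not depend on how the file is saved:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00F1 is "small n with tilde"; written as an escape so this file stays ASCII.
my $enye  = "\x{F1}";
my $bytes = encode('UTF-8', $enye);

# Hex dump of the encoded bytes, comparable to what od shows.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;

print length($enye), " character, ", length($bytes), " bytes: $hex\n";
```

This prints "1 character, 2 bytes: C3 B1", which agrees with the od output described above.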

The script is intended to find all words containing that
letter. But it finds nothing. After wading through gallons
of text (man encoding, man utf8, man perlunicode, etc.),
I still had no reason to think it was wrong. But I added

use encoding "utf8";

and ran it again, getting only:

Ze-Admins-Computer:~/Desktop wgroleau$ char-find palabras.utf8
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xde) at
/Volumes/Parents/wgroleau/bin/char-find line 12.

?!? According to 'od -xc' the script does NOT contain any
byte that is 0xde. In fact, the ONLY bytes in the script that
are not ASCII are the bytes for the "enye" which are on line
twelve, but neither of them is a DE and NO bytes are 00.

Have I found a bug in perl or is my ignorance just getting
the best of me?

Oh, yeah, I also tried a few things with 'binmode' that didn't
work either.

WWG
 
Wes Groleau
 
      04-12-2005
Wes Groleau wrote:
> [problems with]
> use utf8;
>
> while (<>)
> {
> print if ( /ñ/ )
> }


I removed "use utf8" and it worked. So I think it's
a bug, especially since

> use encoding "utf8";


caused

> Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> immediately after start byte 0xde) at
> /Volumes/Parents/wgroleau/bin/char-find line 12.
>
> ?!? According to 'od -xc' the script does NOT contain any
> byte that is 0xde. In fact, the ONLY bytes in the script that
> are not ASCII are the bytes for the "enye" which are on line
> twelve, but neither of them is a DE and NO bytes are 00.
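
One way to see why dropping "use utf8" could change the result is to compare a byte pattern and a character pattern side by side. The sketch below is only an illustration, under the assumption that the input lines reach the regexp as undecoded bytes; whether that is what really happens under 5.8.1 with -CSD is not established here. The sample word and all variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample word "manana" with an enye, spelled as raw UTF-8 bytes (C3 B1).
my $raw     = "ma\xC3\xB1ana";
my $decoded = decode('UTF-8', $raw);   # the same word as a character string

my $byte_pat = qr/\xC3\xB1/;   # two bytes: how a UTF-8 literal is seen without "use utf8"
my $char_pat = qr/\x{F1}/;     # one character: how the literal is seen with "use utf8"

my @cases = (
    [ 'raw bytes, byte pattern',    $raw,     $byte_pat ],
    [ 'raw bytes, char pattern',    $raw,     $char_pat ],
    [ 'decoded text, byte pattern', $decoded, $byte_pat ],
    [ 'decoded text, char pattern', $decoded, $char_pat ],
);

# Try each string against each pattern and record the outcome.
my @results = map { $_->[1] =~ $_->[2] ? 'match' : 'no match' } @cases;

printf "%-28s %s\n", $cases[$_][0], $results[$_] for 0 .. $#cases;
```

Only the first and last cases match: bytes against the byte pattern, and decoded text against the character pattern. The mixed cases find nothing, which looks like the symptom above, but that resemblance is not proof of the cause.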


--
Wes Groleau

A pessimist says the glass is half empty.

An optimist says the glass is half full.

An engineer says somebody made the glass
twice as big as it needed to be.
 