Malformed utf8; where's the null byte coming from?

 
 
bill_mckinnon@interloper.net
 
      06-28-2006
I've spent some time trying to understand Perl's Unicode support and
its nuances, and I think I actually understand some amount of it. But
the behavior of this snippet of code is puzzling me at the moment:

--
#!/usr/local/bin/perl -w

use Encode qw(decode);

$s = decode('utf8', "Version"); # String w/utf8 flag set
$s =~ s/v\xc3\x83//i;
--

Running this with Perl 5.8.6 on Linux (and Windows) produces this
warning:

$ ./test.pl
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xc3) in substitution (s///) at ./test.pl
line 7.
$

Granted, what I'm trying to do is to match the literal utf8 bytes
for a Unicode character against a Unicode string, which may not be a
reasonable thing to do. But the way this fails doesn't make any sense
to me; I don't have a null byte after (or before) the \xc3 byte in my
regex. Also, if the regex string were being upgraded to Unicode
(presumably from iso-latin-1), I can see it not doing what I intended,
but that shouldn't cause this error; it should just fail to match the
way I want. And if the \x sequences were taken to be code points instead
of literal bytes, that's fine: it may not do what I want, but it
still shouldn't cause this warning.
Does anyone know why this warning is coming up? It makes me think
there's more going on under the surface than just an extra iso-latin-1
-> utf8 conversion. Thanks in advance for any insight.

- Bill

P.S. - I can do the match I want by using the results of
encode('utf8', $s) to do the match; since it's a byte
string everything works fine. But I want to understand
what the issue was with the warning.
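
A minimal sketch of the workaround described in the P.S., assuming the data decodes to "V", U+00C3, "ersion" (the sample input and the variable names are made up for illustration, not taken from the real program). The substitution runs against the re-encoded byte string, so the \x escapes in the regex line up with literal bytes:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample data: a character string with the utf8 flag set.
my $s = decode('UTF-8', "V\xc3\x83ersion");   # V, U+00C3, e, r, s, i, o, n

# Turn it back into a byte string before matching literal UTF-8 bytes.
my $octets = encode('UTF-8', $s);             # V, 0xC3, 0x83, e, r, ...

# Work on a copy so the original bytes stay available.
(my $stripped = $octets) =~ s/v\xc3\x83//i;

print "$stripped\n";
```

The result is still a byte string; it would need another decode() before being treated as text again.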

 
 
 
 
 
Ben Morrow
 
      06-28-2006

Quoth (E-Mail Removed):
> I've spent some time trying to understand Perl's Unicode support and
> its nuances, and I think I actually understand some amount of it. But
> the behavior of this snippet of code is puzzling me at the moment:
>
> --
> #!/usr/local/bin/perl -w
>
> use Encode qw(decode);
>
> $s = decode('utf8', "Version"); # String w/utf8 flag set
> $s =~ s/v\xc3\x83//i;
> --
>
> Running this with Perl 5.8.6 on Linux (and Windows) produces this
> warning:
>
> $ ./test.pl
> Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> immediately after start byte 0xc3) in substitution (s///) at ./test.pl
> line 7.
> $


Some more data points: 5.8.7 i686-linux

1. There is no need for Encode.

my $s = "foo";
utf8::upgrade($s);

works fine (in the sense that it fails).

2. It only fails if the first character matches. This makes sense...

3. It only fails if there are zero or one characters after the \xc3.
Adding a second one stops the warning.

4. It still fails if the \xc3 is the first character (and the string is
modified to match, obviously).

5. The match does not have to be at the start of the string.

6. \xf3 behaves the same way (the number of expected continuation bytes
doesn't matter).

I believe this is a bug: anyone else?

Of course, what you are trying to do is completely wrong. /v\xc3\x83/
is a regex which matches three characters, not two. The fact that those
three, if expressed as bytes in iso8859-1, happen to look like the utf8
for two characters is irrelevant. It seems perl is having something of
the same confusion you are.
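
To make the three-versus-two point concrete, here is a small sketch (variable names are illustrative only). It compares the length of the string the regex spells out with the length of the same bytes after decoding them as UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode);

# What the regex /v\xc3\x83/ describes: three separate characters.
my $pattern_text = "v\xc3\x83";               # v, U+00C3, U+0083

# The same three bytes read as UTF-8 collapse to two characters.
my $decoded = decode('UTF-8', "v\xc3\x83");   # v, U+00C3

printf "as written: %d characters\n", length($pattern_text);
printf "decoded:    %d characters\n", length($decoded);
```

A string with the utf8 flag set is compared character by character, so a two-character target cannot satisfy a three-character pattern.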

Ben

--
I must not fear. Fear is the mind-killer. I will face my fear and
I will let it pass through me. When the fear is gone there will be
nothing. Only I will remain.
(E-Mail Removed) Frank Herbert, 'Dune'
 
 
 
 
 
bill_mckinnon@interloper.net
 
      06-28-2006
Ben Morrow wrote:
> Of course, what you are trying to do is completely wrong. /v\xc3\x83/
> is a regex which matches three characters, not two. The fact that those
> three, if expressed as bytes in iso8859-1, happen to look like the utf8
> for two characters is irrelevant. It seems perl is having something of
> the same confusion you are.


Yep, agreed...I was initially feeding the s/// data that DIDN'T have
the utf8 flag set even though it was real utf8 data, and this of course
works ok. At some point the regex got string data that did have the
utf8 flag set, and then it didn't work right and got this warning...and
I wondered what was up with the warning. : )
Also, interestingly enough the regex was trying to match a utf8 byte
stream that had been incorrectly interpreted as iso-8859-1 and then
re-encoded as utf8. : ) Funny how these things happen...
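
A rough sketch of how that kind of double encoding produces the \xc3\x83 byte pair, assuming (purely for illustration) that the original character was U+00E9; the actual data in the program is not shown in this thread:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $char    = "\x{e9}";                        # hypothetical original character
my $utf8    = encode('UTF-8', $char);          # correct UTF-8 bytes: C3 A9
my $misread = decode('iso-8859-1', $utf8);     # wrongly read as two Latin-1 characters
my $double  = encode('UTF-8', $misread);       # each of those re-encoded as UTF-8

# Show the resulting bytes in hex.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $double;
print "$hex\n";
```

The lead byte 0xC3 of the first encoding becomes the character U+00C3 when misread, and that character encodes to the bytes C3 83 the second time around.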

- Bill

 
 
Mumia W.
 
      06-29-2006
(E-Mail Removed) wrote:
> I've spent some time trying to understand Perl's Unicode support and
> its nuances, and I think I actually understand some amount of it. But
> the behavior of this snippet of code is puzzling me at the moment:
>
> --
> #!/usr/local/bin/perl -w
>
> use Encode qw(decode);
>
> $s = decode('utf8', "Version"); # String w/utf8 flag set
> $s =~ s/v\xc3\x83//i;
> --
> [...]


I was able to eliminate the warning by using "use encoding 'utf8'," but
there is a problem with the substitution.

use Encode qw(decode);
use encoding 'utf8';
my $s;

# rx is "vÃ"
my $rx = qq{"v\xc3\x83"};
$s = decode('utf8', "V\x{c3}\x{83}ersion"); # String w/utf8 flag set
print 'rx : ', $rx, "\n";
print 'before: ', $s, "\n";
$s =~ s/v\xc3\x83//i;
print 'after : ', $s, "\n";

__END__

This prints this:
rx : "vÃ"
before: V�ersion
after : �ersion


Notice that the "�" wasn't substituted even though the 'V' was. Why?


 
 
Ben Morrow
 
      06-29-2006

Quoth Ben Morrow <(E-Mail Removed)>:
>
> Quoth (E-Mail Removed):
> > I've spent some time trying to understand Perl's Unicode support and
> > its nuances, and I think I actually understand some amount of it. But
> > the behavior of this snippet of code is puzzling me at the moment:
> >
> > --
> > #!/usr/local/bin/perl -w
> >
> > use Encode qw(decode);
> >
> > $s = decode('utf8', "Version"); # String w/utf8 flag set
> > $s =~ s/v\xc3\x83//i;
> > --
> >
> > Running this with Perl 5.8.6 on Linux (and Windows) produces this
> > warning:
> >
> > $ ./test.pl
> > Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> > immediately after start byte 0xc3) in substitution (s///) at ./test.pl
> > line 7.
> > $

>
> Some more data points: 5.8.7 i686-linux
>
> 1. There is no need for Encode.
>
> my $s = "foo";
> utf8::upgrade($s);
>
> works fine (in the sense that it fails).
>
> 2. It only fails if the first character matches. This makes sense...
>
> 3. It only fails if there are zero-or-one characters after the \xc3.
> Putting a second stops the warning.
>
> 4. It still fails if the \xc3 is the first character (and the string is
> modified to match, obviously).
>
> 5. The match does not have to be at the start of the string.
>
> 6. \xf3 behaves the same way (the number of expected continuation bytes
> doesn't matter).


Sorry, one more:

7. The warning only occurs when the /i flag is used.
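
For anyone who wants to check their own perl, here is a sketch of a test harness for data points 1 and 7. The helper name warnings_for is made up, and passing the pattern in as a qr// object is an assumption that may not exercise exactly the same code path as a literal s///i; the script only counts warnings and makes no claim about what a given Perl version will report:

```perl
use strict;
use warnings;

# Run a substitution on an upgraded copy of $text and collect any warnings.
# (warnings_for is a hypothetical helper, not part of any module.)
sub warnings_for {
    my ($text, $regex) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    utf8::upgrade($text);          # set the utf8 flag, as in data point 1
    $text =~ s/$regex//;
    return @warnings;
}

my @with_i    = warnings_for("Version", qr/v\xc3\x83/i);
my @without_i = warnings_for("Version", qr/v\xc3\x83/);

printf "with /i:    %d warning(s)\n", scalar @with_i;
printf "without /i: %d warning(s)\n", scalar @without_i;
```

If the two counts differ, the /i flag is what triggers the message on that build.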

> I believe this is a bug: anyone else?


Ben

--
And if you wanna make sense / Whatcha looking at me for? (Fiona Apple)
* (E-Mail Removed) *
 
 
Ben Morrow
 
      06-29-2006

Quoth "Mumia W." <(E-Mail Removed)>:
> (E-Mail Removed) wrote:
> > I've spent some time trying to understand Perl's Unicode support and
> > its nuances, and I think I actually understand some amount of it. But
> > the behavior of this snippet of code is puzzling me at the moment:
> >
> > --
> > #!/usr/local/bin/perl -w
> >
> > use Encode qw(decode);
> >
> > $s = decode('utf8', "Version"); # String w/utf8 flag set
> > $s =~ s/v\xc3\x83//i;
> > --
> > [...]

>
> I was able to eliminate the warning by using "use encoding 'utf8'," but
> there is a problem with the substitution.
>
> use Encode qw(decode);
> use encoding 'utf8';
> my $s;
>
> # rx is "vÃ"
> my $rx = qq{"v\xc3\x83"};
> $s = decode('utf8', "V\x{c3}\x{83}ersion"); # String w/utf8 flag set


These two do not match. The regex matches a 3-char string; $s (after
decoding) has only one char between the V and the e.

> print 'rx : ', $rx, "\n";
> print 'before: ', $s, "\n";
> $s =~ s/v\xc3\x83//i;
> print 'after : ', $s, "\n";
>
> __END__
>
> This prints this:
> rx : "vÃ"
> before: V�ersion
> after : �ersion
>
> Notice that the "�" wasn't substituted even though the 'V' was. Why?


Again, I think it's a bug. No substitution should have occurred, as the
regex didn't match.
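
A small sketch of a check for this, written without the encoding pragma so that only Encode is involved (an assumption that changes the setup slightly: here the decoded string holds U+00C3 between the V and the e). It records whether s///i reports a match and whether the string was altered:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $s      = decode('UTF-8', "V\xc3\x83ersion");   # V, U+00C3, e, r, s, i, o, n
my $before = $s;

# The pattern needs v, U+00C3, U+0083; the string has no U+0083 after U+00C3.
my $matched   = ($s =~ s/v\xc3\x83//i) ? 1 : 0;
my $unchanged = ($s eq $before)        ? 1 : 0;

printf "length: %d, matched: %d, unchanged: %d\n",
    length($before), $matched, $unchanged;
```

A perl that handles this correctly should report no match and leave the string untouched; a match or a changed string would point to the bug discussed here.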

Ben

--
I touch the fire and it freezes me, [(E-Mail Removed)]
I look into it and it's black.
Why can't I feel? My skin should crack and peel---
I want the fire back... Buffy, 'Once More With Feeling'
 
 
bill_mckinnon@interloper.net
 
      06-29-2006
Ben Morrow wrote:

> Again, I think it's a bug. No substitution should have occurred, as the
> regex didn't match.


Lacking any reasonable explanation to the contrary, this is my
theory too. : ) It looks like "perlbug" is the recommended way of
reporting bugs in Perl...I'll try to run through this at some point (I
should probably confirm it happens on the latest and greatest Perl,
etc). Thanks for the responses...

- Bill

 
 
 
 