On Tue, 29 Sep 2009 00:46:54 +0200, Ben Morrow <> wrote:
>> :encoding(utf
does validate on output though, or doesn't it?
>
> What do you mean, 'validate'?
To raise an error or at least display some warning about it when it
encounters
invalid bytes in an utf8 flagged string, like perl does at many other
places.
> Perl strings are (logically) sequences of
> Unicode characters, and any sequence of Unicode characters can be
> represented in utf8. If you end up with a perl string with a corrupted
> internal representation you've got bigger problems than invalid output
> encoding.
Or I might simply be using some code which raises the utf8 flag on strings
that
are not.
Example:
> cat test.pl
#!/usr/bin/perl -w
use strict;
use Encode;
my $validUTF8 = "\x{1010}";
my $bytes = encode("utf8",$validUTF

; # e1 80 90
my $invalidUTF8 = substr($bytes,0,length($bytes)-1); # e1 80
open(TMP,">invalidutf8.txt") or die;
print TMP $invalidUTF8;
close TMP;
open(TMP,"<invalidutf8.txt") or die;
binmode TMP,":utf8";
my $invalid2 = <TMP>;
close TMP;
print STDERR "is_utf8: '".utf8::is_utf8($invalid2).
"' valid: '".utf8::valid($invalid2)."'\n".
"length: ".length($invalid2)."\n";
print $invalid2;
binmode STDOUT,":utf8";
print $invalid2;
binmode STDOUT,":encoding(utf

";
print $invalid2;
> perl --version
This is perl, v5.8.8 built for i486-linux-gnu-thread-multi
....
> perl test.pl | hexdump -C
utf8 "\xE1" does not map to Unicode at test.pl line 16, <TMP> line 1.
Malformed UTF-8 character (unexpected end of string) in length at test.pl
line 19.
is_utf8: '1' valid: ''
length: 0
Wide character in print at test.pl line 23.
00000000 e1 80 e1 80 e1 80 |......|
00000006
> hexdump -C invalidutf8.txt
00000000 e1 80 |..|
00000002
The first line ("...does not map...") comes from reading the "binmode
:utf8" handle.
Note that $invalid2 contains exactly those two broken bytes from
invalidutf8.txt, anyway.
The next validation message comes from length($invalid2) ("...Malformed
UTF-8...").
Note that the string is indeed utf8 flagged, though perl has noticed that
it is invalid.
This example is just to provide a quick test case for invalid utf8 in an
utf8 string
in perl. My point from the previous post was that I assumed
:encoding(utf

on the
output handle would at least give another "...malformed..." message, or,
better yet,
would not output anything. It does not, though, it silently and happily
outputs the
broken utf8, just like printing to a non-utf8 handle (hence the "Wide
character in print")
or a ":utf8" handle.
You are right then - "encoding(utf

" seems only to differ from "utf8"
when used on an
input handle. If the 'binmode TMP,":encoding(utf

";' is used when reading
the broken
bytes in, then everything works fine ($invalid2==undef, in that case).
BTW, this is not purely hypotetical for me; I have to work with some
broken modules
which I cannot easily change, which in some weird cases produce such
invalid utf8
strings.
It would be interesting to see whether newer perls behaves the same. Is
there someone
who would like to run the test script through 5.10?