Perl 5.8.x, Unicode and In-memory Filehandles

 
 
Bernard Chan
      03-01-2006
Hello all,

I have just started experimenting with the Unicode capabilities of Perl.
I am currently working on a Web development project involving both
output buffering with in-memory filehandles (via Perl's open()) and
Unicode handling. Separately they work fine, but I have spent a lot of
time trying to make them work together. Hopefully the experts here can
give me some insight into what I have missed.

I have written a module IO::OutputBuffer which is expected to be used as
follows:

$buf_ctx = IO::OutputBuffer::start(\*STDOUT); # start in-memory buffer
# now STDOUT points to the in-memory buffer
print "blablabla"; # Everything goes to in-memory buffer
# Content verified; commit to real STDOUT
IO::OutputBuffer::flush($buf_ctx);
# Stop buffering
IO::OutputBuffer::end($buf_ctx);
# STDOUT reverted to original

Because stray output is likely to make Apache-CGI complain, I would like
to capture all the output, validate it and then eventually commit to the
actual output stream before the script exits (there is also a similar
facility for capturing STDERR to log file, but not shown).

As a next step, I would like to use PerlIO layers to convert the
encoding for clients that do not support UTF-8. Otherwise I may need to
use Text::Iconv, but if PerlIO can do the job I would rather stay with
that. For instance, if the user profile (or an HTTP request header)
indicates a preference for Big5, I will do a UTF-8->Big5 conversion.

As a test, I added some code inside the buffered section that reads a
Chinese file encoded in UTF-8. I would like to output its content to the
client, with a simulated conversion to Big5 before it is returned.

I have reduced the problem to the following short script:

================================================

#!/usr/bin/perl -w

binmode(STDOUT, ":encoding(big5)") or die "$!"; # Output encoding

BEGIN {
require "require.pl";
}

#use IO::OutputBuffer;
#$b_out = IO::OutputBuffer::start(\*STDOUT);
my ($io_sys, $BUF);
open $io_sys, ">&", \*STDOUT; close STDOUT;
open STDOUT, ">", \$BUF;


open FILE, "<:encoding(utf8)", "utf8_1.txt";
@lines = <FILE>;
close FILE;

print (join("<br>\n", @lines));

#IO::OutputBuffer::flush($b_out);
my $buffered_content = $BUF;
$BUF = '';
seek STDOUT, 0, 0;

print $io_sys $buffered_content;

====================================

However, I cannot get the file content to display as proper Big5.
Instead, I get what appear to be escaped code points, as follows:

Wide character in print at output_minimal.pl line 20.
"\x{00e7}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b9}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0081}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e9}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ab}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0094}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e4}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ad}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0096}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0087}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ac}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a9}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e4}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00bb}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00bc}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0085}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00bc}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a2}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ad}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0097}" does not map to big5-eten at output_minimal.pl line 28.
UTF-8:
\x{00e7}\x{00b9}\x{0081}\x{00e9}\x{00ab}\x{0094}\x{00e4}\x{00b8}\x{00ad}\x{00e6}\x{0096}\x{0087}
<br>

<br>
\x{00e6}\x{00b8}\x{00ac}\x{00e8}\x{00a9}\x{00a6}\x{00e4}\x{00bb}\x{00a5}
UTF-8
\x{00e8}\x{00bc}\x{00b8}\x{00e5}\x{0085}\x{00a5}\x{00e6}\x{00bc}\x{00a2}\x{00e5}\x{00ad}\x{0097}

My guess is that Perl wrongly treats the buffered content as non-Unicode
and therefore tries to convert the individual bytes to Big5 as if they
were ISO-8859-1 characters. I have tried inserting
utf8::upgrade($buffered_content) and then verified with utf8::is_utf8()
that the string is indeed flagged as UTF-8.

Can anyone help me? Thank you.

Regards,
Bernard Chan.


MSG
      03-01-2006
It seems suspicious that you set your STDOUT to "big5" at the very
beginning and then open and close STDOUT many times afterwards.
By the time you print, your STDOUT has already reverted to
"standard".
Anyway, the "Wide character" warning indicates that you are outputting
Unicode to a non-Unicode filehandle.
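
The warning can be reproduced in isolation. The sketch below is illustrative and not taken from the script in this thread: it prints the same two-character string to an in-memory filehandle opened without a layer and to one opened with an explicit :encoding(UTF-8) layer. The variable names and the @warnings collector are assumptions made for the example.

```perl
use strict;
use warnings;

# Two characters above U+00FF, so any byte-oriented handle sees "wide" data.
my $text = "\x{4e2d}\x{6587}";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# 1. In-memory handle with no encoding layer.
my $plain = '';
open my $fh1, '>', \$plain or die "open: $!";
print {$fh1} $text;
close $fh1;

# 2. In-memory handle with an explicit UTF-8 encoding layer.
my $layered = '';
open my $fh2, '>:encoding(UTF-8)', \$layered or die "open: $!";
print {$fh2} $text;
close $fh2;

printf "warnings seen: %d\n", scalar @warnings;
printf "plain:   length %d, UTF-8 flag %s\n",
    length $plain,   utf8::is_utf8($plain)   ? 'on' : 'off';
printf "layered: length %d, UTF-8 flag %s\n",
    length $layered, utf8::is_utf8($layered) ? 'on' : 'off';
```

Only the first print should warn. In both cases the scalar behind the handle should end up holding encoded octets rather than characters, which matters when that scalar is later printed to a handle that has its own encoding layer.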

 
Bernard Chan
      03-01-2006

I am inclined to think this may be related to the in-memory nature of
the filehandle. In the latest revision of the test script I have tried this:

================================================
#!/usr/bin/perl -w

BEGIN {
require "require.pl";
}

my ($io_sys, $BUF);
open $io_sys, ">&", \*STDOUT; close STDOUT;
open STDOUT, ">:utf8", \$BUF;

open FILE, "<:encoding(utf8)", "utf8_1.txt";
@lines = <FILE>;
close FILE;

my $buffered_content2 = (join("<br>\n", @lines)); # (1)
print (join("<br>\n", @lines));

my $buffered_content = $BUF;
$BUF = '';
seek STDOUT, 0, 0;

binmode($io_sys, ":encoding(big5)");
print $io_sys $buffered_content2; # (2)
================================================

The modifications are labelled (1) and (2). Line (1) is the only added
line. In this program, when I print $buffered_content on line (2) as
before, I see the same output as quoted previously. However, when I
change line (2) to print $buffered_content2, the output is exactly what
I wanted (Big5). So there seems to be a difference, even though the
join() expression is identical in both cases. The only difference is
that one string is read from the variable backing the in-memory buffer,
while the other comes directly from the join().

I checked that the two strings are byte-for-byte identical, and that
after utf8::upgrade($buffered_content) both strings are valid UTF-8 with
the UTF-8 flag set, yet comparing them with "eq" still returns false.
There must be something subtle going on here.

Can anyone explain why this is so? Thank you in advance.

Regards,
Bernard Chan.
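
One likely explanation for the "eq" result is that the scalar behind an in-memory filehandle holds the encoded octets that the :utf8 layer wrote, not the original characters. The sketch below is an illustration under that assumption, not the poster's module: it uses Encode::decode to turn the buffer back into characters and compares that with what utf8::upgrade produces. The sample string is the first four characters implied by the byte values in the warnings quoted earlier, and all variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Four characters: U+7E41 U+9AD4 U+4E2D U+6587.
my $original = "\x{7e41}\x{9ad4}\x{4e2d}\x{6587}";

# Write the characters through a UTF-8 layer into an in-memory buffer.
my $buf = '';
open my $mem, '>:encoding(UTF-8)', \$buf or die "open: $!";
print {$mem} $original;
close $mem;

# utf8::upgrade only changes the internal storage of the string.
# Each octet is still treated as one Latin-1 character.
my $upgraded = $buf;
utf8::upgrade($upgraded);

# decode() interprets the octets as UTF-8 and yields characters.
my $decoded = decode('UTF-8', $buf);

printf "buffer length:   %d\n", length $buf;
printf "upgraded length: %d, eq original: %s\n",
    length $upgraded, ($upgraded eq $original ? 'yes' : 'no');
printf "decoded length:  %d, eq original: %s\n",
    length $decoded,  ($decoded  eq $original ? 'yes' : 'no');

# The decoded string can then be converted to Big5 octets,
# either explicitly or by printing to a handle with :encoding(big5-eten).
my $big5 = encode('big5-eten', $decoded);
printf "Big5 octets:     %d\n", length $big5;
```

If this is what is happening in the test script, decoding the buffer before printing it to the Big5 handle (or comparing with "eq" after decoding) should make the two strings behave the same way.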

Bernard Chan
      03-01-2006
MSG wrote:
>
> It seems suspicious that you set your STDOUT to "big5" at the very
> beginning and then open and close STDOUT many times afterwards. By
> the time you print, your STDOUT has already resumed to be "standard'.
>

That is because I want to simulate the output-buffering trick I would
normally do with the module described in my first post: I want to hide
from later scripts the fact that they are printing to an in-memory
filehandle. If there is a more elegant way to do this without all this
trouble, please tell me. Thank you.

I have removed the initial binmode() from my latest test script (see my
other post, which I am posting in a few minutes). The original intent was
to set the PerlIO layer on the real STDOUT (not the in-memory one). I
may be able to avoid this.

I would also like to ask: if I call binmode(STDOUT, "....."), will the
PerlIO layers installed be lost when I dup the handle (>&)? I am just
duping filehandles around so that other routines are unaware of the
extra buffering layer. If the layers are lost in the duped filehandle,
then you are right, but I could not find anything in the docs about this
behaviour.

> Anyway "wide character" warning indicates that you are outputing
> unicode to an non-unicode file handle.


I have eliminated the wide character warning in the later test, after I
added ":utf8" to the open() that creates the in-memory filehandle. But
the problem remains.

Regards,
Bernard Chan.
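
The question about layers on a duped handle can be checked empirically with PerlIO::get_layers. The sketch below is only a way to inspect the behaviour on a given Perl build; it makes no claim about what the dup inherits. It uses a temporary file from File::Temp as a stand-in for the real STDOUT, and the handle names are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A temporary file stands in for the real output stream.
my ($orig, $path) = tempfile(UNLINK => 1);
binmode($orig, ':encoding(UTF-8)') or die "binmode: $!";

# Duplicate the handle the same way the test script does.
open my $dup, '>&', $orig or die "dup: $!";

# Report which layers each handle actually carries.
my @orig_layers = PerlIO::get_layers($orig);
my @dup_layers  = PerlIO::get_layers($dup);
print "original: @orig_layers\n";
print "dup:      @dup_layers\n";

# Whatever the dup inherited, setting the layer explicitly removes the doubt.
binmode($dup, ':encoding(UTF-8)') or die "binmode: $!"
    unless grep { /^encoding/ } @dup_layers;

my @final_layers = PerlIO::get_layers($dup);
print "dup now:  @final_layers\n";
```

Calling binmode() on the duped handle right after the dup, as the later test script does with $io_sys, avoids depending on whether layers are carried over.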
