Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > UTF-8 read & print?


UTF-8 read & print?

 
 
Tuxedo
 
      11-25-2012
To read a file that may contain UTF-8 characters and print it to a web
browser, my first attempt is:

#!/usr/bin/perl -w

use warnings;
use strict;
use CGI qw(:standard);

print "Content-type: text/plain; charset=UTF-8\n\n";

open my $fh, "<:encoding(UTF-8)", 'UTF-8-demo.txt';
binmode STDOUT, ':utf-8';
while (my $line = <$fh>) {
print $line;
}

The example file is this one:
http://www.cl.cam.ac.uk/~mgk25/ucs/e...UTF-8-demo.txt

Of course, different browsers and systems give different results depending
on which characters in the UTF-8 range they support (I guess). While most
characters in the above UTF-8-demo.txt display when reading the file as
above, some characters towards the end of the page, namely the ones
following the lowercase basic Latin alphabet, i.e. the British pound sign,
the copyright symbol and the remaining 9 characters on that same line, do
not display in an up-to-date web browser with the above read-and-print
procedure, while they do display as they should when accessing the
UTF-8-demo.txt file directly in the same browser via the above URL. If,
however, I omit the ":encoding(UTF-8)" part after my $fh, I find that those
particular characters print correctly.

While I guess UTF-8 compatibility is generally a broad topic, what are the
better or worse ways to read and print UTF-8 for maximum success in typical
web browsers?

Sorry if the question is a bit basic and has been asked times before, but
any comments and examples are always much appreciated.

Many thanks,
Tuxedo
 
 
 
 
 
Helmut Richter
 
      11-25-2012
On Sun, 25 Nov 2012, Tuxedo wrote:

> The example file is this one:
> http://www.cl.cam.ac.uk/~mgk25/ucs/e...UTF-8-demo.txt
>
> Of course, different browsers and systems have different result depending
> on supported characters in the UTF-8 range (I guess) and while most
> characters in the above UTF-8-demo.txt display when reading the file as
> above, some characters towards the end of the page, being the ones
> following the lowercase basic Latin alphabet, i.e. the British pound sign,
> the copyright symbol and the remaining 9 characters on that same line, do
> not display in an up-to-date web browser with the above read and print
> procedure, while they do display as they should when accessing the
> UTF-8-demo.txt file directly in a same browser via the above URL. If
> however I omit the ":encoding(UTF-8)" part after my $fh I find that those
> particular characters print correctly.


So you read the demo file and print it out again. If you print it to a
file, why not do a diff of the two files and see what has changed, if
anything? If the printing goes to HTTP output, why not give us the URL so
that we all can see whether your server serves exactly the same text as
the URL you gave us. We can hardly guess what happens when we are denied
access to the difference of the two versions.

--
Helmut Richter
 
 
 
 
 
Rainer Weikusat
 
      11-26-2012
Ben Morrow <(E-Mail Removed)> writes:
> Quoth Tuxedo <(E-Mail Removed)>:


[...]

> If you're just copying a file, it's better to do it in blocks than
> line-by-line.
>
> local $/ = \4096;
> while (...) { ... }


As soon as an application starts to do any explicit buffer management,
using the supposedly transparent buffer management embedded in the
buffered I/O subsystem is not only pointless but actually a bad idea
(one would assume that it should be self-evident that reading data
into a buffer of size x, copying it into a buffer of size y, copying
it into another buffer of size x and finally 'writing' it out isn't a
particularly sensible thing to do ...)

NB: It is interesting to observe the effect of using a larger buffer
size. For the test I made, 8192 seemed to be the best choice and this
improves the 'blocks' version significantly but the fread version only
marginally (in the first case, the speed increase was 34% of the
slower speed, for the second, it was only 6%).

---------
use strict;
use warnings;
use Benchmark;

open(my $out, '>', '/dev/null') or die "open /dev/null: $!";

timethese(-5, {
    lines => sub {
        my $line;

        seek(STDIN, 0, 0);
        print $out ($line) while $line = <>;
    },

    fread => sub {
        my $block;
        local $/ = \4096;   # make readline() return fixed-size blocks

        seek(STDIN, 0, 0);
        print $out ($block) while $block = <>;
    },

    blocks => sub {
        my $block;

        seek(STDIN, 0, 0);
        syswrite($out, $block) while sysread(STDIN, $block, 4096);
    }});
 
 
Tuxedo
 
      11-26-2012
Helmut Richter wrote:

> On Sun, 25 Nov 2012, Tuxedo wrote:


[...]

> So you read the demo file and print it out again. If you print it to a
> file, why not do a diff of the two files and see what has changed, if
> anything? If the printing goes to HTTP output, why not give us the URL so
> that we all can see whether your server serves exactly the same text as
> the URL you gave us. We can hardly guess what happens when we are denied
> access to the difference of the two versions.


No denial intended. I have no online version, although you are right, a
header sent by different servers may vary, for example. I'm just trying to
gain a better understanding of the various issues in submitting, writing,
reading and printing UTF-8, and I have some difficulty doing all of that in
my localhost environment. However, I now understand that at least the most
basic part is to set the charset. Thereafter, I'm not sure if encoding and
decoding user input is always necessary, at least not for simply echoing
some UTF-8 user input. For this, the below seems to work OK:

use strict;
use warnings;
use CGI ':standard';

print header(-charset => 'UTF-8'),
start_html,
start_form,
textfield('unicode'),
submit,
end_form;

print param('unicode');
print end_html;
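That said, as far as I can tell the value comes back as bytes unless it is
decoded, which changes what length() and friends report. A quick test with
the core Encode module (the byte string is just illustrative, standing in
for what a browser would submit):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative raw UTF-8 bytes, as a browser would submit them
my $raw = "h\xc3\xa9llo";            # "héllo" encoded as UTF-8

print length($raw), "\n";            # 6 -- counted as bytes

my $text = decode('UTF-8', $raw);    # now a Perl character string
print length($text), "\n";           # 5 -- counted as characters
```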


 
 
Tuxedo
 
      11-26-2012
Ben Morrow wrote:

>
> Quoth Tuxedo <(E-Mail Removed)>:
> > In reading and printing a file that may contain UTF-8 characters and
> > print it into a web browser, my first attempt is:
> >
> > #!/usr/bin/perl -w

>
> You don't need -w if you use warnings.
>
> >
> > use warnings;
> > use strict;
> > use CGI qw(:standard);
> >
> > print "Content-type: text/plain; charset=UTF-8\n\n";
> >
> > open my $fh, "<:encoding(UTF-8)", 'UTF-8-demo.txt';
> > binmode STDOUT, ':utf-8';

>
> binmode STDOUT, ':utf8';
>
> You should have got a warning about this. If you had been using autodie,
> you would have got an error (which is better, IMHO).
>
> > while (my $line = <$fh>) {
> > print $line;
> > }

>
> If you're just copying a file, it's better to do it in blocks than
> line-by-line.
>
> local $/ = \4096;
> while (...) { ... }
>
> Ben
>


Thanks for these comments. I must have misunderstood utf-8 vs. utf8,
thinking utf-8 caters to a broader spectrum of unicode charsets. I don't
know what I'm doing with the file yet, as I'm just learning by testing.

I will look into autodie as well as skip the -w flag from now on.

Tuxedo
 
 
Rainer Weikusat
 
      11-26-2012
Tuxedo <(E-Mail Removed)> writes:
> Helmut Richter wrote:
>
>> On Sun, 25 Nov 2012, Tuxedo wrote:

>
> [...]
>
>> So you read the demo file and print it out again. If you print it to a
>> file, why not do a diff of the two files and see what has changed, if
>> anything? If the printing goes to HTTP output, why not give us the URL so
>> that we all can see whether your server serves exactly the same text as
>> the URL you gave us. We can hardly guess what happens when we are denied
>> access to the difference of the two versions.

>
> No denial intended. I have no online version, although you are right, a
> header sent by different servers may vary for example. I'm just trying gain
> a better understanding of the various issues in submitting, writing,
> reading and printing utf-8 and have some difficultly doing all of that in
> my localhost environment. However, I now understand that at least the most
> basic part is to set the charset. Thereafter, I'm not sure if encoding and
> decoding user input is always necessary, at least not for simply echoing
> some UTF-8 user input for example.


Practically, encoding or decoding UTF-8 explicitly is not necessary
because perl was designed to work with UTF-8 encoded Unicode strings,
which are supposed to be decoded (and possibly re-encoded) when and if
this has to be done because of a processing step which needs it.
Theoretically, this is considered too difficult to implement correctly,
and hence users of the language are encouraged to behave as if Perl
weren't capable of working with UTF-8 and to always use the three-pass
algorithm:

1. Decode all of the input into some internal representation the
processing code can work with.
2. Perform whatever processing is necessary.
3. Re-encode all of the processed data into whatever output format
happens to be desired.
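A minimal sketch of that three-pass scheme using the core Encode module
(the sample data is made up):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# 1. Decode: raw UTF-8 bytes -> internal character string
my $bytes = "caf\xc3\xa9";             # UTF-8 bytes for "café"
my $chars = decode('UTF-8', $bytes);   # length($chars) == 4

# 2. Process: character semantics, so Unicode case mapping applies
my $upper = uc $chars;                 # "CAFÉ"

# 3. Re-encode: character string -> UTF-8 bytes for output
my $out = encode('UTF-8', $upper);     # "CAF\xc3\x89"
```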

The plan9 paper on UTF-8 support contains the following nice statement:

To decide whether to compute using runes or UTF-encoded byte
strings requires balancing the cost of converting the data
when read and written against the cost of converting relevant
text on demand. For programs such as editors that run a long
time with a relatively constant dataset, runes are the better
choice.

http://plan9.bell-labs.com/sys/doc/utf.html

Since most Perl programs run a relatively short time with a highly
variable data set, the statement above suggests that the implementation
choice to do on-demand decoding was sensible. E.g., let's assume someone
is using some Perl code to do log file analysis. Log files are often big,
and since this will usually involve doing regexp matches on all input
lines, decoding the input while trying to match the regexp in a single
processing loop will possibly be a lot cheaper than first decoding
everything and then looking for matches: when a line of input is
discarded as not being of interest, the hitherto undecoded remainder
doesn't need to be touched anymore.
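Sketched with invented log data, that pattern looks like this: filter on
the raw bytes first and decode only the lines that survive.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Invented sample log lines, as raw UTF-8 bytes straight off disk
my @raw = (
    "GET /caf\xc3\xa9 200\n",   # interesting: status 200
    "GET /index 404\n",         # discarded without ever being decoded
);

# ASCII patterns match byte-for-byte against UTF-8 data,
# so the filter can run on the undecoded input
my @kept = grep { /\b200\b/ } @raw;

# Decode only the survivors for further character-level processing
my @lines = map { decode('UTF-8', $_) } @kept;

printf "%d of %d lines kept\n", scalar @lines, scalar @raw;
```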
 
 
Tuxedo
 
      11-26-2012
Rainer Weikusat wrote:

> Tuxedo <(E-Mail Removed)> writes:
> > Helmut Richter wrote:
> >
> >> On Sun, 25 Nov 2012, Tuxedo wrote:

> >
> > [...]
> >
> >> So you read the demo file and print it out again. If you print it to a
> >> file, why not do a diff of the two files and see what has changed, if
> >> anything? If the printing goes to HTTP output, why not give us the URL
> >> so that we all can see whether your server serves exactly the same text
> >> as the URL you gave us. We can hardly guess what happens when we are
> >> denied access to the difference of the two versions.

> >
> > No denial intended. I have no online version, although you are right, a
> > header sent by different servers may vary for example. I'm just trying to
> > gain a better understanding of the various issues in submitting,
> > writing, reading and printing utf-8 and have some difficulty doing all
> > of that in my localhost environment. However, I now understand that at
> > least the most basic part is to set the charset. Thereafter, I'm not
> > sure if encoding and decoding user input is always necessary, at least
> > not for simply echoing some UTF-8 user input for example.

>
> Practically, encoding or decoding UTF-8 explicitly is not necessary
> because perl was designed to work with UTF-8 encoded Unicode strings
> which are supposed to be decoded (and possibly, re-encoded) when and
> if this has to be done because of a processing step which needs
> this. Theoretically, this is considered to be too difficult to
> implement correctly and hence, users of the language are encouraged to
> behave as if Perl wasn't capable of working with UTF-8 and always use
> the three pass algorithm 1. Decode all of the input into some internal
> representation the processing code can work with. 2. Perform whatever
> processing is necessary. 3. Re-encode all of the processed data into
> whatever output format happens to be desired.
>
> The plan9 paper on UTF-8 support contains the following, nice
> statement:
>
> To decide whether to compute using runes or UTF-encoded byte
> strings requires balancing the cost of converting the data
> when read and written against the cost of converting relevant
> text on demand. For programs such as editors that run a long
> time with a relatively constant dataset, runes are the better
> choice.
>
> http://plan9.bell-labs.com/sys/doc/utf.html
>
> Since most Perl programs run a relatively short time with a highly
> variable data set, the statement above suggests that the
> implementation choice to do on-demand decoding was sensible. Eg, let's
> assume someone is using some Perl code to do log file analysis. Log
> files are often big and since this will usually involve doing regexp
> matches on all input lines, decoding the input while trying to match
> the regexp in a single processing loop will possibly be a lot cheaper
> than first decoding everything and then looking for matches: When a
> line of input is discarded as not being of interest, the hitherto
> undecoded remainder doesn't need to be touched anymore.


Thanks for the intel including the plan9 link, adding to my must-read-about
list of subjects....

Tuxedo

 