Velocity Reviews

Josef Feit 02-22-2009 02:57 PM

utf8 and chomp
 
Hi,

I have run across a Perl behaviour which I do not
understand.

I am trying to analyze some text with UTF-8 characters,
e.g. a file containing "nXlXx", where the 'X' stands for
some UTF-8 encoded character, e.g. "náláx"
(not sure whether it gets through).

Please replace the 'X' in %ascii with some
UTF-8 character (it should be 'á').


#!/usr/bin/perl
# -----------------------------------------------------------
use warnings;
use strict;
use encoding 'utf-8';
use 5.010;

my %ascii = (
'X' => 'a',
);

my $line = <>;
chomp $line; # to chomp or not to chomp
print length($line), ": ";
for( my $i = 0; $i < length($line); $i++ ){
    my $znak = substr($line, $i, 1);
    if( exists( $ascii{$znak} ) ){
        print "+";
    }else{
        print "-";
    }
}
print "\n";

---
The problem is with the chomp:

If I chomp $line, the output is as
expected: 5: -+-+-

If I comment out the chomp, the result is
8: --------
so Perl does not treat $line as
UTF-8 encoded.

Is this a side effect of chomp, or have I got it
wrong? I need to skip the chomp and still get UTF-8.

perl -v
This is perl, v5.10.0 built for x86_64-linux-thread-multi

Thanks
Josef

Eric Pozharski 02-23-2009 01:47 AM

Re: utf8 and chomp
 
On 2009-02-22, Josef Feit <jfeit@ics.muni.cz> wrote:
*SKIP*
> The problem is with the chomp:
>
> In case I chomp the $line, the output is as
> expected: 5: -+-+-
>
> If I comment out the chomp, the result is
> 8: --------
> so the Perl does not consider the $line to be
> utf8 encoded.
>
> Is this a side effect of chomp or do I have it
> wrong? I need not to chomp and get the utf8.


Just checked -- I can't recreate that. I have C<5: -+-+-> with B<chomp>
and C<6: -+-+--> without. Consider forcing I<$line> to be utf8
(C<perldoc Encode> has more).
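
A minimal sketch of the "force it to utf8" suggestion, assuming the input really is UTF-8. The byte string stands in for the line read from the file, the %ascii table mirrors the one in the original script, and the variable names are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "náláx\n" as it arrives from an undecoded filehandle: 8 raw bytes.
my $bytes = "n\xC3\xA1l\xC3\xA1x\n";

# Turn the bytes into a character string explicitly.
my $chars = decode('UTF-8', $bytes);

# Same lookup table as in the script, keyed by the character U+00E1.
my %ascii = ("\x{E1}" => 'a');

my $marks = join '', map { exists $ascii{$_} ? '+' : '-' } split //, $chars;
print length($bytes), " bytes\n";       # 8 bytes
print length($chars), ": $marks\n";     # 6: -+-+--
```

The trailing "-" is the newline, which matches the unchomped C<6: -+-+--> result reported above.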

p.s. And rewrite your C in Perl.


--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom

Josef Feit 02-23-2009 05:05 PM

Re: utf8 and chomp
 
Utf8 and chomp problem:

Thank you for the replies.
I tried to rewrite the script, but the problem seems
to persist.
The UTF-8 displayed OK, so I am sending the improved script.

I tried it on my OpenSuse 11.0 Linux under the cs_CZ.UTF-8
locale and on the server (Debian, I think, with
LANG=en_US.UTF-8 etc. and Perl v5.8.8).

The results are the same: the strings produced
are different. I will try to force the utf8 etc.,
but it seems strange anyway.

Josef


#!/usr/bin/perl
# ----------------------------
# echo "náláx" >text.txt
# thisscript text.txt
# ----------------------------
use warnings;
use strict;
use encoding 'utf-8';

my %ascii = (
'á' => 'a',
);

my $line = <>;
my $linech = $line;
chomp $linech;

for my $l ( $line, $linech ){
    print length($l), ": ";
    for my $char (split //, $l){
        if( exists( $ascii{$char} ) ){
            print "+";
        }else{
            print "-";
        }
    }
    print "\n";
}

Output (orig/chomped):
8: --------
5: -+-+-


Andrzej Adam Filip 02-23-2009 05:23 PM

Re: utf8 and chomp
 
Josef Feit <jfeit@ics.muni.cz> wrote:

> Utf8 and chomp problem:
>
> Thank you for replies.
> I tried to rewrite the script, but the problem seems
> to persist.
> UTF8 displayed OK, so I am sending the improved script.
>
> I tried it on my OpenSuse 11.0 Linux under cs_CZ.UTF-8
> locale and on the server (Debian I think, with
> LANG=en_US.UTF-8 etc. (and v5.8.8 Perl).
>
> The results are the same: the strings produced
> are different. I will try to force the utf8 etc,
> but it seems strange anyway.
>
> Josef
>
>
> #!/usr/bin/perl
> # ----------------------------
> # echo "náláx" >text.txt
> # thisscript text.txt
> # ----------------------------
> use warnings;
> use strict;
> use encoding 'utf-8';
>
> my %ascii = (
> 'á' => 'a',
> );
>
> my $line = <>;
> my $linech = $line;
> chomp $linech;
>
> for my $l ( $line, $linech ){
> print length($l), ": ";
> for my $char (split //, $l){
> if( exists( $ascii{$char} ) ){
> print "+";
> }else{
> print "-";
> }
> }
> print "\n";
> }
>
> Output (orig/chomped):
> 8: --------
> 5: -+-+-


Have you tried marking STDIN as a UTF-8 stream?

thisscript < text.txt

binmode( STDIN, ':utf8') or die;
my $line = <STDIN>;

--
[pl>en Andrew] Andrzej Adam Filip : anfi@onet.eu : anfi@xl.wp.pl
We have met the enemy, and he is us.
-- Walt Kelly

Josef Feit 02-23-2009 08:31 PM

Re: utf8 and chomp
 
Andrzej Adam Filip wrote:
> Josef Feit <jfeit@ics.muni.cz> wrote:
>
>> Utf8 and chomp problem:
>>
>> Thank you for replies.
>> I tried to rewrite the script, but the problem seems
>> to persist.
>> UTF8 displayed OK, so I am sending the improved script.
>>
>> I tried it on my OpenSuse 11.0 Linux under cs_CZ.UTF-8
>> locale and on the server (Debian I think, with
>> LANG=en_US.UTF-8 etc. (and v5.8.8 Perl).
>>
>> The results are the same: the strings produced
>> are different. I will try to force the utf8 etc,
>> but it seems strange anyway.
>>
>> Josef
>>
>>
>> #!/usr/bin/perl
>> # ----------------------------
>> # echo "náláx" >text.txt
>> # thisscript text.txt
>> # ----------------------------
>> use warnings;
>> use strict;
>> use encoding 'utf-8';
>>
>> my %ascii = (
>> 'á' => 'a',
>> );
>>
>> my $line = <>;
>> my $linech = $line;
>> chomp $linech;
>>
>> for my $l ( $line, $linech ){
>> print length($l), ": ";
>> for my $char (split //, $l){
>> if( exists( $ascii{$char} ) ){
>> print "+";
>> }else{
>> print "-";
>> }
>> }
>> print "\n";
>> }
>>
>> Output (orig/chomped):
>> 8: --------
>> 5: -+-+-

>
> Have you tried to use STDIN marked as utf8 stream?
>
> thisscript < text.txt
>
> binmode( STDIN, ':utf8') or die;
> my $line = <STDIN>;
>

I have tried it now - no change in the output.
However, when $line is set directly in the program,
the results are as expected (my $line = "náláx";)

And if I run it as
thisscript < text.txt

(with <), it works OK as well, even without the binmode setting:

thisscript < text.txt
6: -+-+--
5: -+-+-

thisscript text.txt
8: --------
5: -+-+-


Regards
Josef


Eric Pozharski 02-23-2009 10:52 PM

Re: utf8 and chomp
 
On 2009-02-23, Josef Feit <jfeit@ics.muni.cz> wrote:
> Utf8 and chomp problem:
>
> Thank you for replies.
> I tried to rewrite the script, but the problem seems
> to persist.
> UTF8 displayed OK, so I am sending the improved script.
>
> I tried it on my OpenSuse 11.0 Linux under cs_CZ.UTF-8
> locale and on the server (Debian I think, with
> LANG=en_US.UTF-8 etc. (and v5.8.8 Perl).
>
> The results are the same: the strings produced
> are different. I will try to force the utf8 etc,
> but it seems strange anyway.
>
> Josef
>
>
> #!/usr/bin/perl
> # ----------------------------
> # echo "náláx" >text.txt
> # thisscript text.txt
> # ----------------------------


Snap! That's the problem -- everyone here is just way too lazy to dump
the string into a file, and runs your script through something like this
instead:

echo someutf8 | thisscript

I've just gone through your original script with the debugger, and found out
that after C<$line = <>;> I<$line> is a pure byte string. And then after
C<chomp $line;> it automagically decodes into a utf8 character(!) string.
Should I keep on explaining? (No, no spoiler this time.)

*CUT*

--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom

Peter J. Holzer 02-24-2009 12:03 AM

Re: utf8 and chomp
 
On 2009-02-23 17:05, Josef Feit <jfeit@ics.muni.cz> wrote:
> The results are the same: the strings produced
> are different. I will try to force the utf8 etc,
> but it seems strange anyway.
>
> Josef
>
>
> #!/usr/bin/perl
> # ----------------------------
> # echo "náláx" >text.txt
> # thisscript text.txt
> # ----------------------------
> use warnings;
> use strict;
> use encoding 'utf-8';


I already wanted to advise against using "use encoding", because it
behaves rather unintuitively. But I couldn't see what was wrong until you
mentioned that reading from stdin works for you.

Then it became clear.

From perldoc encoding:

The encoding pragma also modifies the filehandle layers of STDIN
and STDOUT to the specified encoding.

If you call your script like

> # thisscript text.txt


it does *not* read from STDIN, so the file will *not* automatically be
decoded from UTF-8. You should either explicitly open the file with the
correct encoding layer, or use "use open".

hp
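
A small sketch of the first option (opening the file explicitly with an encoding layer), assuming the file holds UTF-8. The temporary file stands in for text.txt, and first_line is a hypothetical helper, not something from the thread:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for text.txt: "náláx\n" written as 8 raw UTF-8 bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "n\xC3\xA1l\xC3\xA1x\n";
close $tmp or die "close: $!";

# Hypothetical helper: read the first line of a file through a given layer.
sub first_line {
    my ($file, $layer) = @_;
    open my $fh, "<$layer", $file or die "open $file: $!";
    my $line = <$fh>;
    close $fh;
    return $line;
}

my $bytes = first_line($path, ':raw');               # no decoding at all
my $chars = first_line($path, ':encoding(UTF-8)');   # decoded on read

print length($bytes), "\n";   # 8
print length($chars), "\n";   # 6
```

The two lengths correspond to the "8:" and "6:" results seen earlier in the thread: the same file, read once as bytes and once as characters.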

Marc Lucksch 02-24-2009 05:17 AM

Re: utf8 and chomp
 
Eric Pozharski wrote:
> I've just gone through your original script with debugger, and found out
> that after C<$line = <>;> I<$line> is pure byte string. And then after
> C<chomp $line;> it automagically decodes into utf8 character(!) string.
> Should I keep on explaining? (No, no spoiler this time.)


Ok now I am confused, do please explain.

Marc "Maluku" Lucksch

Josef Feit 02-24-2009 08:08 PM

Re: utf8 and chomp
 
Marc Lucksch wrote:
> Eric Pozharski wrote:
>> I've just gone through your original script with debugger, and found out
>> that after C<$line = <>;> I<$line> is pure byte string. And then after
>> C<chomp $line;> it automagically decodes into utf8 character(!) string.
>> Should I keep on explaining? (No, no spoiler this time.)

>
> Ok now I am confused, do please explain.
>
> Marc "Maluku" Lucksch


----

Please spoil us... :-)

Yes, in the docs (encoding) is:
Sets the script encoding to I<ENCNAME>. And unless ${^UNICODE}
exists and non-zero, PerlIO layers of STDIN and STDOUT are set to
":encoding(I<ENCNAME>)".

Note that STDERR WILL NOT be changed.

Also note that non-STD file handles remain unaffected. Use C<use
open> or C<binmode> to change layers of those.

---

I tried utf8::is_utf8 (Encode has an equivalent, Encode::is_utf8):
print "UTFline: ", utf8::is_utf8($line), "\n";
print "UTFlinech: ", utf8::is_utf8($linech), "\n";

and indeed $linech is flagged as utf8, while $line is not.

Combination of

use encoding 'utf-8';
use open IO => ':encoding(utf8)';

solves the problem, thank you all.

---
But still:
1. Why does chomp change the string to utf8 as a side effect?
2. Can I tell Perl that <> is utf8 when it is not reading STDIN?
(I cannot figure out the syntax - OK, getting the file
name through @ARGV should be possible.)


Thank you
Josef
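
For question 2, one way that avoids relying on how <> opens its files is to walk the file names yourself, as hinted above. This is only a sketch: read_decoded_lines is a hypothetical helper, the temporary file stands in for text.txt, and the input is assumed to be UTF-8:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read all lines of the named files, decoding UTF-8.
sub read_decoded_lines {
    my @files = @_;
    my @lines;
    for my $file (@files) {
        open my $fh, '<:encoding(UTF-8)', $file
            or die "Cannot open $file: $!";
        push @lines, <$fh>;
        close $fh;
    }
    return @lines;
}

# Stand-in for text.txt: "náláx\n" as raw UTF-8 bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "n\xC3\xA1l\xC3\xA1x\n";
close $tmp or die "close: $!";

# In a real script this would be read_decoded_lines(@ARGV).
my @lines = read_decoded_lines($path);
my $flag  = utf8::is_utf8($lines[0]) ? 'yes' : 'no';
print length($lines[0]), " chars, utf8 flag: $flag\n";   # 6 chars, utf8 flag: yes
```

Unlike the implicit <>, every handle here gets its layer at open time, so the result does not depend on whether the data came from STDIN or from a named file.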






Eric Pozharski 02-24-2009 08:49 PM

Re: utf8 and chomp
 
On 2009-02-24, Marc Lucksch <perl@marc-s.de> wrote:
> Eric Pozharski wrote:
>> I've just gone through your original script with debugger, and found out
>> that after C<$line = <>;> I<$line> is pure byte string. And then after
>> C<chomp $line;> it automagically decodes into utf8 character(!) string.
>> Should I keep on explaining? (No, no spoiler this time.)

>
> Ok now I am confused, do please explain.


A long and boring way -- C<perldoc perlvar> then look for section
C<ARGV> (it's the first one among many), read 4 of them thoroughly.
Then return to C<perldoc encoding> and C<perldoc Encode> (it seems to be
used internally by B<encoding> pragma anyway). Then think a lot and
finally see the light.

p.s. A quick and dirty way --

perl -wle '
while(<>) {
system qq|ls -l /proc/$$/fd|;
exit;
};
' /etc/passwd
total 0
lrwx------ 1 whynot whynot 64 2009-02-24 22:47 0 -> /dev/pts/0
lrwx------ 1 whynot whynot 64 2009-02-24 22:47 1 -> /dev/pts/0
lrwx------ 1 whynot whynot 64 2009-02-24 22:47 2 -> /dev/pts/0
lr-x------ 1 whynot whynot 64 2009-02-24 22:47 3 -> /etc/passwd
lr-x------ 1 whynot whynot 64 2009-02-24 22:47 4 -> pipe:[7056143]
l-wx------ 1 whynot whynot 64 2009-02-24 22:47 5 -> pipe:[7056143]

Pay a bit of attention to I<fileno> #3
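
The same observation can be made from inside Perl, without /proc. A rough sketch, assuming a temporary file in place of /etc/passwd; it only reports what it finds, since the exact descriptor numbers and layer names depend on the platform and on settings such as PERL_UNICODE:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Demo input: "náláx\n" written as raw UTF-8 bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "n\xC3\xA1l\xC3\xA1x\n";
close $tmp or die "close: $!";

# Pretend the script was started as "thisscript text.txt".
@ARGV = ($path);
my $line = <>;

my $argv_fd  = fileno(ARGV);     # descriptor of the file that <> opened
my $stdin_fd = fileno(STDIN);    # usually 0; undef if STDIN is closed
my @layers   = PerlIO::get_layers(*ARGV);

print "ARGV fd: $argv_fd, layers: @layers\n";
print "STDIN fd: ", (defined $stdin_fd ? $stdin_fd : 'closed'), "\n";
print "length of line read via <>: ", length($line), "\n";
```

The point is that <> reads through the ARGV handle, which is a separate handle from STDIN, so layers applied to STDIN do not carry over to it.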

--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom

