utf8 and chomp

 
 
Josef Feit
      02-22-2009
Hi,

I have run across a Perl behaviour which I do not
understand:

I am trying to analyze some text with UTF-8 characters,
e.g. a file containing "nXlXx", where the 'X' stands for
some UTF-8 encoded character, e.g. "náláx"
(not sure whether it gets through).

Please replace the 'X' in %ascii with some
UTF-8 character (it should be 'á').


#!/usr/bin/perl
# -----------------------------------------------------------
use warnings;
use strict;
use encoding 'utf-8';
use 5.010;

my %ascii = (
'X' => 'a',
);

my $line = <>;
chomp $line; # to chomp or not to chomp
print length($line), ": ";
for( my $i = 0; $i < length($line); $i++ ){
    my $znak = substr($line, $i, 1);
    if( exists( $ascii{$znak} ) ){
        print "+";
    }else{
        print "-";
    }
}
print "\n";

---
The problem is with the chomp:

If I chomp $line, the output is as
expected: 5: -+-+-

If I comment out the chomp, the result is
8: --------
so Perl does not consider $line to be
UTF-8 encoded.

Is this a side effect of chomp, or do I have it
wrong? I need the UTF-8 handling without chomping.

perl -v
This is perl, v5.10.0 built for x86_64-linux-thread-multi

Thanks
Josef
 
Eric Pozharski
      02-23-2009
On 2009-02-22, Josef Feit <(E-Mail Removed)> wrote:
*SKIP*
> The problem is with the chomp:
>
> In case I chomp the $line, the output is as
> expected: 5: -+-+-
>
> If I comment out the chomp, the result is
> 8: --------
> so the Perl does not consider the $line to be
> utf8 encoded.
>
> Is this a side effect of chomp or do I have it
> wrong? I need not to chomp and get the utf8.


Just checked -- I can't recreate that. I have C<5: -+-+-> with B<chomp>
and C<6: -+-+--> without. Consider forcing I<$line> to be utf8
(C<perldoc Encode> has more).
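
One way to do the forcing mentioned above is Encode's decode() function. This is only a minimal sketch, assuming the input really is UTF-8; the variable names are illustrative, and the byte string stands in for what <> returns when no decoding layer is active on the handle:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "náláx\n", written with escapes so the example
# does not depend on the encoding of this source file.
my $bytes = "n\xC3\xA1l\xC3\xA1x\n";

# decode() turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

# Prints "bytes: 8, characters: 6"
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

The 8 and the 6 are the two lengths reported in this thread for the unchomped line, which suggests the 8 comes from counting undecoded bytes.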

p.s. And rewrite your C in Perl.


--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
 
Josef Feit
      02-23-2009
Utf8 and chomp problem:

Thank you for the replies.
I tried to rewrite the script, but the problem seems
to persist.
UTF-8 displayed OK, so I am sending the improved script.

I tried it on my OpenSuse 11.0 Linux under the cs_CZ.UTF-8
locale and on the server (Debian, I think, with
LANG=en_US.UTF-8 etc. and Perl v5.8.8).

The results are the same: the strings produced
are different. I will try to force the utf8 etc.,
but it seems strange anyway.

Josef


#!/usr/bin/perl
# ----------------------------
# echo "náláx" >text.txt
# thisscript text.txt
# ----------------------------
use warnings;
use strict;
use encoding 'utf-8';

my %ascii = (
'á' => 'a',
);

my $line = <>;
my $linech = $line;
chomp $linech;

for my $l ( $line, $linech ){
    print length($l), ": ";
    for my $char (split //, $l){
        if( exists( $ascii{$char} ) ){
            print "+";
        }else{
            print "-";
        }
    }
    print "\n";
}

Output (orig/chomped):
8: --------
5: -+-+-

 
Andrzej Adam Filip
      02-23-2009
Josef Feit <(E-Mail Removed)> wrote:

> Utf8 and chomp problem:
>
> Thank you for replies.
> I tried to rewrite the script, but the problem seems
> to persist.
> UTF8 displayed OK, so I am sending the improved script.
>
> I tried it on my OpenSuse 11.0 Linux under cs_CZ.UTF-8
> locale and on the server (Debian I think, with
> LANG=en_US.UTF-8 etc. (and v5.8.8 Perl).
>
> The results are the same: the strings produced
> are different. I will try to force the utf8 etc,
> but it seems strange anyway.
>
> Josef
>
>
> #!/usr/bin/perl
> # ----------------------------
> # echo "náláx" >text.txt
> # thisscript text.txt
> # ----------------------------
> use warnings;
> use strict;
> use encoding 'utf-8';
>
> my %ascii = (
> 'á' => 'a',
> );
>
> my $line = <>;
> my $linech = $line;
> chomp $linech;
>
> for my $l ( $line, $linech ){
> print length($l), ": ";
> for my $char (split //, $l){
> if( exists( $ascii{$char} ) ){
> print "+";
> }else{
> print "-";
> }
> }
> print "\n";
> }
>
> Output (orig/chomped):
> 8: --------
> 5: -+-+-


Have you tried marking STDIN as a utf8 stream?

thisscript < text.txt

binmode( STDIN, ':utf8') or die;
my $line = <STDIN>;

--
[pl>en Andrew] Andrzej Adam Filip
We have met the enemy, and he is us.
-- Walt Kelly
 
Josef Feit
      02-23-2009
Andrzej Adam Filip wrote:
> Josef Feit <(E-Mail Removed)> wrote:
>
>> Utf8 and chomp problem:
>>
>> Thank you for replies.
>> I tried to rewrite the script, but the problem seems
>> to persist.
>> UTF8 displayed OK, so I am sending the improved script.
>>
>> I tried it on my OpenSuse 11.0 Linux under cs_CZ.UTF-8
>> locale and on the server (Debian I think, with
>> LANG=en_US.UTF-8 etc. (and v5.8.8 Perl).
>>
>> The results are the same: the strings produced
>> are different. I will try to force the utf8 etc,
>> but it seems strange anyway.
>>
>> Josef
>>
>>
>> #!/usr/bin/perl
>> # ----------------------------
>> # echo "náláx" >text.txt
>> # thisscript text.txt
>> # ----------------------------
>> use warnings;
>> use strict;
>> use encoding 'utf-8';
>>
>> my %ascii = (
>> 'á' => 'a',
>> );
>>
>> my $line = <>;
>> my $linech = $line;
>> chomp $linech;
>>
>> for my $l ( $line, $linech ){
>> print length($l), ": ";
>> for my $char (split //, $l){
>> if( exists( $ascii{$char} ) ){
>> print "+";
>> }else{
>> print "-";
>> }
>> }
>> print "\n";
>> }
>>
>> Output (orig/chomped):
>> 8: --------
>> 5: -+-+-

>
> Have you tried to use STDIN marked as utf8 stream?
>
> thisscript < text.txt
>
> binmode( STDIN, ':utf8') or die;
> my $line = <STDIN>;
>

I have tried it now - no change in the output.
However, when $line is set directly in the program,
the results are as expected (my $line = "náláx").

And if I run it as
thisscript < text.txt

(with <) it works OK as well, even without the binmode setting:

thisscript < text.txt
6: -+-+--
5: -+-+-

thisscript text.txt
8: --------
5: -+-+-


Regards
Josef

 
Eric Pozharski
      02-23-2009
On 2009-02-23, Josef Feit <(E-Mail Removed)> wrote:
> Utf8 and chomp problem:
>
> Thank you for replies.
> I tried to rewrite the script, but the problem seems
> to persist.
> UTF8 displayed OK, so I am sending the improved script.
>
> I tried it on my OpenSuse 11.0 Linux under cs_CZ.UTF-8
> locale and on the server (Debian I think, with
> LANG=en_US.UTF-8 etc. (and v5.8.8 Perl).
>
> The results are the same: the strings produced
> are different. I will try to force the utf8 etc,
> but it seems strange anyway.
>
> Josef
>
>
> #!/usr/bin/perl
> # ----------------------------
> # echo "náláx" >text.txt
> # thisscript text.txt
> # ----------------------------


Snap! That's the problem -- everyone here is just too lazy to dump the
string into a file, and runs your script through something like this
instead:

echo someutf8 | thisscript

I've just gone through your original script with the debugger, and found
out that after C<$line = <>;> I<$line> is a pure byte string. And then after
C<chomp $line;> it automagically decodes into a utf8 character(!) string.
Should I keep on explaining? (No, no spoiler this time.)

*CUT*

--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
 
Peter J. Holzer
      02-24-2009
On 2009-02-23 17:05, Josef Feit <(E-Mail Removed)> wrote:
> The results are the same: the strings produced
> are different. I will try to force the utf8 etc,
> but it seems strange anyway.
>
> Josef
>
>
> #!/usr/bin/perl
> # ----------------------------
> # echo "náláx" >text.txt
> # thisscript text.txt
> # ----------------------------
> use warnings;
> use strict;
> use encoding 'utf-8';


I had already wanted to advise against using "use encoding", because it
behaves rather unintuitively. But I couldn't see what was wrong until you
mentioned that reading from stdin works for you.

Then it became clear.

From perldoc encoding:

The encoding pragma also modifies the filehandle layers of STDIN
and STDOUT to the specified encoding.

If you call your script like

> # thisscript text.txt


it does *not* read from STDIN, so the file will *not* automatically be
decoded from UTF-8. You should either explicitly open the file with the
correct encoding layer, or use "use open".
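
A minimal sketch of the first option, opening the file explicitly with a decoding layer. The helper name first_line_utf8 is made up for illustration, and the file is assumed to be UTF-8 encoded:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line of a UTF-8 encoded file
# as a character string (the :encoding layer decodes while reading).
sub first_line_utf8 {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my $line = <$fh>;
    close $fh;
    return $line;
}

# Usage: perl thisscript text.txt
if (@ARGV) {
    my $line = first_line_utf8( $ARGV[0] );
    print defined $line ? length($line) : 0, "\n";
}
```

With the text.txt from this thread, length() should then count characters (including the newline) rather than bytes.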

hp
 
Marc Lucksch
      02-24-2009
Eric Pozharski wrote:
> I've just gone through your original script with debugger, and found out
> that after C<$line = <>;> I<$line> is pure byte string. And then after
> C<chomp $line;> it automagically decodes into utf8 character(!) string.
> Should I keep on explaining? (No, no spoiler this time.)


Ok now I am confused, do please explain.

Marc "Maluku" Lucksch
 
Josef Feit
      02-24-2009
Marc Lucksch wrote:
> Eric Pozharski wrote:
>> I've just gone through your original script with debugger, and found out
>> that after C<$line = <>;> I<$line> is pure byte string. And then after
>> C<chomp $line;> it automagically decodes into utf8 character(!) string.
>> Should I keep on explaining? (No, no spoiler this time.)

>
> Ok now I am confused, do please explain.
>
> Marc "Maluku" Lucksch


----

Please spoil us...

Yes, the docs (perldoc encoding) say:
Sets the script encoding to I<ENCNAME>. And unless ${^UNICODE}
exists and non-zero, PerlIO layers of STDIN and STDOUT are set to
":encoding(I<ENCNAME>)".

Note that STDERR WILL NOT be changed.

Also note that non-STD file handles remain unaffected. Use C<use
open> or C<binmode> to change layers of those.

---

I tried to use (from Encode):
print "UTFline: ", utf8::is_utf8($line), "\n";
print "UTFlinech: ", utf8::is_utf8($linech), "\n";

and indeed $linech is utf8, while $line is not.

The combination of

use encoding 'utf-8';
use open IO => ':encoding(utf8)';

solves the problem, thank you all.

---
But still:
1. Why does chomp change the string to utf8 as a side effect?
2. Can I tell Perl that <> is utf8 if it is not STDIN?
(I cannot figure out the syntax. OK, getting the file
name through @ARGV should be possible.)
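
For question 2, a rough sketch of the @ARGV route: open each named file yourself with a decoding layer instead of relying on the magic <> handle. The sub name mark_chars is made up for illustration, the files are assumed to be UTF-8, and 'á' is written as an escape so the example does not depend on the source file's encoding:

```perl
use strict;
use warnings;

# Characters to flag; "\x{E1}" is U+00E1 (a with acute).
my %ascii = ( "\x{E1}" => 'a' );

# Return one "+" per character found in %ascii and one "-" otherwise.
sub mark_chars {
    my ($text) = @_;
    return join '', map { exists $ascii{$_} ? '+' : '-' } split //, $text;
}

# Open every file named on the command line with a decoding layer.
for my $path (@ARGV) {
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    while ( my $line = <$fh> ) {
        print length($line), ': ', mark_chars($line), "\n";
    }
    close $fh;
}
```

Run as "thisscript text.txt", this should print 6: -+-+-- for the unchomped line, the same as the redirected-STDIN case above.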


Thank you
Josef





 
Eric Pozharski
      02-24-2009
On 2009-02-24, Marc Lucksch <(E-Mail Removed)> wrote:
> Eric Pozharski wrote:
>> I've just gone through your original script with debugger, and found out
>> that after C<$line = <>;> I<$line> is pure byte string. And then after
>> C<chomp $line;> it automagically decodes into utf8 character(!) string.
>> Should I keep on explaining? (No, no spoiler this time.)

>
> Ok now I am confused, do please explain.


A long and boring way -- C<perldoc perlvar> then look for section
C<ARGV> (it's the first one among many), read 4 of them thoroughly.
Then return to C<perldoc encoding> and C<perldoc Encode> (it seems to be
used internally by B<encoding> pragma anyway). Then think a lot and
finally see the light.

p.s. A quick and dirty way --

perl -wle '
while(<>) {
system qq|ls -l /proc/$$/fd|;
exit;
};
' /etc/passwd
total 0
lrwx------ 1 whynot whynot 64 2009-02-24 22:47 0 -> /dev/pts/0
lrwx------ 1 whynot whynot 64 2009-02-24 22:47 1 -> /dev/pts/0
lrwx------ 1 whynot whynot 64 2009-02-24 22:47 2 -> /dev/pts/0
lr-x------ 1 whynot whynot 64 2009-02-24 22:47 3 -> /etc/passwd
lr-x------ 1 whynot whynot 64 2009-02-24 22:47 4 -> pipe:[7056143]
l-wx------ 1 whynot whynot 64 2009-02-24 22:47 5 -> pipe:[7056143]

Pay a bit of attention to I<fileno> #3

--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
 