Why "Wide character in print"?

 
 
tcgo
      09-30-2012
Hi!
I just wrote some test code in Perl, using the Pi symbol with Unicode/UTF-8. Here's the code:

#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";

And it gives me a "warning" message: "Wide character in print at ./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the warning disappears, but why was it showing before I added the binmode?

Thanks!
~tcgo~
 
 
Rainer Weikusat
      09-30-2012
tcgo <(E-Mail Removed)> writes:
> I just made a test code with Perl, using the Pi symbol with
> Unicode/UTF-8. That's the code:
>
> #!/usr/bin/perl
> use utf8;
> my $cosa = "Here is my ☺ résúmé \x{2639}!";
> print "$cosa\n";
>
> And it gives me a "warning" message: "Wide character in print at
> ./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
> warning disappears, but why was it showing before of adding the
> binmode?


Because the people who nowadays work on perl unicode support have
decided that it should behave as if the encoding used by it was some
super secret sauce shrouded in eternal mystery: All data flowing into
a Perl program is supposed to be converted to this super secret
internal mystery encoding before being used and all data flowing out
of a Perl program is supposed to be converted to something software
other than perl understands beforehand. De facto, the situation is
such that everything is fine when perl is used in an environment where
UTF-8 is the 'native' method for supporting wide characters because
this is also what perl uses itself, and anyone using something
else is essentially ****ed. De jure, perl is supposed to be nasty to
everyone, or at least try as hard as possible without breaking
backwards compatibility.
 
 
Alan Curry
      09-30-2012
In article <(E-Mail Removed)>,
tcgo <(E-Mail Removed)> wrote:
>Hi!
>I just made a test code with Perl, using the Pi symbol with
>Unicode/UTF-8. That's the code:
>
>#!/usr/bin/perl
>use utf8;
>my $cosa = "Here is my ☺ résúmé \x{2639}!";
>print "$cosa\n";
>
>And it gives me a "warning" message: "Wide character in print at
>./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the warning
>disappears, but why was it showing before of adding the binmode?


The binmode documents your assumption that nobody will ever run your program
on a non-UTF8-mode terminal.

--
Alan Curry
 
 
Peter J. Holzer
      09-30-2012
On 2012-09-30 17:57, tcgo <(E-Mail Removed)> wrote:
> I just made a test code with Perl, using the Pi symbol with
> Unicode/UTF-8. That's the code:
>
> #!/usr/bin/perl
> use utf8;
> my $cosa = "Here is my ☺ résúmé \x{2639}!";
> print "$cosa\n";
>
> And it gives me a "warning" message: "Wide character in print at
> ./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
> warning disappears, but why was it showing before of adding the
> binmode?


Because, unless you tell it with binmode, Perl doesn't know what
encoding it is supposed to use. It could get the encoding from the
locale settings, but that would only work for text written to a
terminal, not for arbitrary data written to a file, so perl doesn't
make assumptions and asks you to set the encoding explicitly.

(If you want to get the encoding from the locale, use I18N::Langinfo.
Unfortunately this doesn't work on all platforms; at least it didn't
work on Windows last time I looked, but that was a few years ago.)
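
A minimal sketch of the locale-based approach mentioned above, assuming a platform where I18N::Langinfo is available. The fallback to UTF-8 when Encode does not recognise the locale's codeset name is an assumption added for illustration, not something the post prescribes:

```perl
use strict;
use warnings;
use utf8;                                   # this source file is UTF-8
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Ask the locale which character encoding it uses, e.g. "UTF-8".
my $codeset = langinfo(CODESET);

# Assumption: fall back to UTF-8 if the name is empty or unknown to Encode.
my $enc = ( defined $codeset && length $codeset && find_encoding($codeset) )
        ? $codeset
        : 'UTF-8';

# Tell perl how to turn characters into bytes on STDOUT.
binmode( STDOUT, ":encoding($enc)" );
print "Here is my ☺ résúmé \x{2639}!\n";
```

If the locale's encoding cannot represent a character, the encoding layer will complain about that character instead, which is arguably the more accurate diagnosis.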

hp


--
_ | Peter J. Holzer | Deprecating human carelessness and
|_|_) | Sysadmin WSR | ignorance has no successful track record.
| | | (E-Mail Removed) |
__/ | http://www.hjp.at/ | -- Bill Code on (E-Mail Removed)
 
 
johndelacour@gmail.com
      10-23-2012
On Sunday, September 30, 2012 6:57:38 PM UTC+1, tcgo wrote:

> #!/usr/bin/perl
> use utf8;
> my $cosa = "Here is my ☺ résúmé \x{2639}!";
> print "$cosa\n";
>
> And it gives me a "warning" message: "Wide character in print at
> ./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
> warning disappears, but why was it showing before of adding the
> binmode?


“use utf8” means only that the script file itself is UTF-8-encoded;
it doesn’t say how to manage the output to STDOUT.
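
To make the distinction concrete, here is a small sketch (the in-memory filehandle and variable names are only for illustration). `use utf8` controls how the literal is decoded when the source is parsed; the `:encoding(UTF-8)` layer on the handle controls which bytes are written:

```perl
use strict;
use warnings;
use utf8;    # literals below are decoded from UTF-8 into characters

my $text = "☺";    # one character, U+263A

# Write through an explicit encoding layer into an in-memory buffer
# so that the resulting bytes can be inspected.
open( my $out, '>:encoding(UTF-8)', \my $bytes ) or die "open failed: $!";
print {$out} $text;
close($out) or die "close failed: $!";

printf "%d character, %d bytes\n", length($text), length($bytes);
```

This should report one character and three bytes (E2 98 BA), with no "Wide character" warning, because the handle was told which encoding to use.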

JD


 
 
C.DeRykus
      10-24-2012
On Sunday, September 30, 2012 10:57:38 AM UTC-7, tcgo wrote:
> Hi!
> I just made a test code with Perl, using the Pi symbol with Unicode/UTF-8. That's the code:
>
> #!/usr/bin/perl
> use utf8;
> my $cosa = "Here is my ☺ résúmé \x{2639}!";
> print "$cosa\n";
> ...


(Here's a follow-on with an observation/question for someone more knowledgeable about Perl Unicode.)

I don't know how 'use locale' affects this but I
only see the OP's expected display of characters
by using the "\N{U+...}" notation to force character
semantics:

#use utf8;
my $cosa = "Here is my \N{U+263A} résúmé \N{U+03C0}!";

Output: Here is my ☺ résúmé π!

--
Charles DeRykus
 
 
Eric Pozharski
      10-27-2012
with <(E-Mail Removed)> Ben Morrow wrote:

*SKIP*

> (In theory you can 'use encoding' to specify a different source
> character encoding, but in practice that pragma has always been buggy
> and is better avoided.)


Stop spreading FUD. They need

use encoding ENCNAME Filter => 1;

(what could I<ENCNAME> possibly be?) but

* "use utf8" is implicitly declared so you no longer have to "use
utf8" to "${"\x{4eba}"}++".

which pretty much defeats the purpose of C<use encoding;>.

*SKIP*

> The lexer converts the "å" into a 1-character string which eventually
> gets passed to 'say', which appends a newline (that is, a character
> with ordinal 0a) and passes it to the STDOUT filehandle for writing.


That's not the whole story.

{2754:13} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "а" ; Dump $aa'
SV = PV(0x927a750) at 0x9295fac
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x9291a08 "\320\260"\0 [UTF8 "\x{430}"]
  CUR = 2
  LEN = 12
{2936:14} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "å" ; Dump $aa'
SV = PV(0x9af4750) at 0x9b0ffac
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x9b0ba08 "\303\245"\0 [UTF8 "\x{e5}"]
  CUR = 2
  LEN = 12

At first glance I wondered what the heck was going on with your
C<use warnings;>. Now I feel much better.

*CUT*

--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
 
 
Eric Pozharski
      10-28-2012
with <(E-Mail Removed)> Ben Morrow wrote:
> Quoth Eric Pozharski <(E-Mail Removed)>:
>> with <(E-Mail Removed)> Ben Morrow wrote:
>>
>>> (In theory you can 'use encoding' to specify a different source
>>> character encoding, but in practice that pragma has always been
>>> buggy and is better avoided.)

>>
>> Stop spreading FUD.

>
> That was certainly not my intention. My understanding is that 'use
> encoding' is liable to cause incorrect behaviour and segfaults; see
> for instance
>
> https://rt.perl.org/rt3/Public/Bug/D....html?id=31923


C<use threads;> and C<use encoding 'utf8';>. Unexpected(?) edge case?

> https://rt.perl.org/rt3/Public/Bug/D....html?id=36248


C<use utf8;>, C<use encoding 'utf8';>, and C<use Encode;>. Panic mode?

> https://rt.perl.org/rt3/Public/Bug/D....html?id=37526


Double encoding.

> http://www.xray.mpe.mpg.de/mailing-l.../msg00669.html


Monkey wrench.

> http://www.xray.mpe.mpg.de/mailing-l.../msg00255.html


Works just as expected, see below.

> which suggests that 'use utf8' is also broken; I didn't know that
> until just now, and I'm not sure I entirely believe it. If you have
> newer information than me, I'd be happy to change my opinion.


It's probably not safe to say things like the following in public,
but:

not perl->isa( 'fool-proof' ) or die

(I'm trying to speak Perl here). IOW, Perl has an entry level, and
it's quite high. One of the steps to get over it is the ability to
read. I don't mean the ability to read code, I mean the ability to
RTFM. The first three examples are clearly (to me) of that type. I
have a couple of scripts that have C<use encoding 'utf8';> (I<STDIN>,
I<STDOUT>, and quote-like operators) and C<use open ':locale';> (other
filehandles; quite risky, but those scripts are not for distribution,
thus I'm safe here). Those scripts were started 4.5 years ago
(according to the logs; I can't believe it was sarge (thus 5.8.8?)).
Anyway, 5.10.0, 5.10.1, 5.14.2 -- they keep working, because I've made
them right. Because I've carefully read all the Unicode documentation
that comes with perl (namely perluniintro.pod, perlunicode.pod,
utf8.pod, encoding.pm, Encode.pm (perlunifaq.pod, perlunitut, and
perluniprops.pod weren't distributed five years ago; I should read
them too)). I've found that I don't need utf8.pm (those scripts and
modules should be us-ascii anyway).

I feel utf8-safe because, first of all, I can read. If I can, they can
too, can't they? Apparently, they don't, maybe because they can't.

>> They need
>>
>> use encoding ENCNAME Filter => 1;

>
> That installs a source filter; I'm not sure what the effects of that
> are, but I wouldn't be surprised if you get the union of any bugs in
> 'use encoding' and any bugs in 'use utf8'.
>
>> (what I<ENCNAME> could possibly be?) but
>>
>> * "use utf8" is implicitly declared so you no longer have to
>> "use utf8" to "${"\x{4eba}"}++".


BTW, I've checked. There's no C<use utf8>. It's B<require utf8> and no
import. A whole different story.

> I don't believe this is safe either. The pad code (which handles 'my'
> variables) isn't utf8-safe, so you can't create 'my' variables with
> Unicode names. (The above is a symref to a global; I don't know if the
> code handling the names of globals is utf8-safe, but even if it is
> that isn't terribly useful.)


Let me rephrase one famous proverb:

If the answer you've got is 'filter', you're probably asking the wrong
question.

*SKIP*
> In any case, the result is exactly what I said: the string contains
> one (logical) character. If you apply length() to that string it will
> return 1. (This character happens to be represented internally as two
> bytes; that is none of your business.) What do you think I omitted
> from the story?


Right. And that's closely related to your last example (the one about
utf8.pm being unsafe). I've tried to make a point that *characters*
from different *ranges* happen to be of different length in bytes.

{9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
SV = PV(0xa06f750) at 0xa08afac
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
  CUR = 5
  LEN = 12

*Characters* of latin1 aren't wide (even if they are characters, they
are still one byte long)

{10406:65} [0:0]% perl -Mutf8 -wle 'print "[à]"'
[à]
{10415:66} [0:0]% perl -Mutf8 -wle 'print "[а]"'
Wide character in print at -e line 1.
[а]

I had to add those brackets, because:

{10421:67} [0:0]% perl -wle 'print "à"' # no problems, just a byte
à
{10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops

{10520:69} [0:0]% perl -Mutf8 -wle 'print "à "' # stupid
à
{10522:70} [0:0]% perl -Mutf8 -wle 'print "\x{E0}"' # oops

{10532:71} [0:0]% perl -Mutf8 -wle 'print "\x{E0} "' # stupid
à
{10602:79} [0:0]% perl -Mutf8 -wle 'print "\N{U+00E0}"' # oops

{10608:80} [0:0]% perl -Mutf8 -wle 'print "\N{U+00E0} "' # stupid
à

But watch this:

{10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
à
{10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
�
{10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hooray!
à

Except for the middle one (which I should think about), I think
encoding.pm wins again.

--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
 
 
Peter J. Holzer
      10-28-2012
On 2012-10-27 23:37, Ben Morrow <(E-Mail Removed)> wrote:
> Quoth Eric Pozharski <(E-Mail Removed)>:
>> with <(E-Mail Removed)> Ben Morrow wrote:
>>
>> > (In theory you can 'use encoding' to specify a different source
>> > character encoding, but in practice that pragma has always been buggy
>> > and is better avoided.)

>>
>> Stop spreading FUD.

>
> That was certainly not my intention. My understanding is that 'use
> encoding' is liable to cause incorrect behaviour and segfaults; see for
> instance
>
> https://rt.perl.org/rt3/Public/Bug/D....html?id=31923
> https://rt.perl.org/rt3/Public/Bug/D....html?id=36248
> https://rt.perl.org/rt3/Public/Bug/D....html?id=37526
> http://www.xray.mpe.mpg.de/mailing-l.../msg00669.html
>
> Incidentally, while looking for those I also found
>
> http://www.xray.mpe.mpg.de/mailing-l.../msg00255.html
>
> which suggests that 'use utf8' is also broken; I didn't know that until
> just now, and I'm not sure I entirely believe it.


That doesn't look like a bug in "use utf8" to me, but like a bug in the
code which generates the warnings.

It doesn't help that Tom just dumped a load of gibberish into his mail
without specifying which encoding he was using. I had to guess that he
was using CP1252.

Anyway, with use utf8, the qw[] section of his program is parsed correctly as

("élite", "Ævar", "μῦθος", "m*o")

In the error message each character (even those in the printable ASCII
range U+0020 ... U+007E) is "helpfully" given in hex, which I agree is
... suboptimal.


> If you have newer information than me, I'd be happy to change my opinion.


Me too, although frankly I see no reason to use encoding even if it
works. It mixes up encoding of the source code and the I/O, which is not
a good idea, IMSHO, and my editor handles UTF-8 just fine, so I don't
see why I should write my perl scripts in a different encoding than
UTF-8. I/O can be handled explicitly by I/O layers or implicitly by
"use open".


>> (what I<ENCNAME> could possibly be?) but
>>
>> * "use utf8" is implicitly declared so you no longer have to "use
>> utf8" to "${"\x{4eba}"}++".

>
> I don't believe this is safe either. The pad code (which handles 'my'
> variables) isn't utf8-safe, so you can't create 'my' variables with
> Unicode names. (The above is a symref to a global; I don't know if the
> code handling the names of globals is utf8-safe, but even if it is that
> isn't terribly useful.)


I'm puzzled about this part of the documentation, too. Why would anybody
want to use a variable ${"\x{4eba}"} ? I am guessing that the variable
is really supposed to be $人, i.e., there is a Han character in the
source code, not a symref.

Is this unsafe? I have occasionally used non-ascii characters in
variable names (mostly Greek characters in physical formulas) together
with use utf8 since 5.8.x and I never noticed a problem. (The only
"problem" I noticed is that the euro sign isn't a word character, so you
can't have a variable $amount_in_€. But then you can't have a variable
$amount_in_$ either, so I guess this is fair.)

hp


--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | (E-Mail Removed) | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
 
 
Peter J. Holzer
      10-28-2012
On 2012-10-28 11:45, Eric Pozharski <(E-Mail Removed)> wrote:
> with <(E-Mail Removed)> Ben Morrow wrote:
>> In any case, the result is exactly what I said: the string contains
>> one (logical) character. If you apply length() to that string it will
>> return 1. (This character happens to be represented internally as two
>> bytes; that is none of your business.) What do you think I omitted
>> from the story?

>
> Right. And that's closely related to your last example (the one about
> utf8.pm being unsafe). I've tried to make a point that *characters*
> from different *ranges* happen to be of different length in bytes.


Then maybe you shouldn't have chosen two examples which are both the
same length in bytes.

>
> {9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
> SV = PV(0xa06f750) at 0xa08afac
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
>   CUR = 5
>   LEN = 12
>
> *Characters* of latin1 aren't wide (even if they are characters, they
> are still one byte long)


In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".

But this isn't what "wide character" in the warning means. In the
warning, it means a string element with a code > 255. For string
elements <= 255, perl can assume that they are supposed to be bytes, not
characters, when you try to write them to a byte stream. It could be
argued that this assumption is a mistake, but for better or worse we are
stuck with that decision. But for string elements > 255, that just isn't
possible. It can't be a byte, it must be a character, and to convert a
character into bytes, the encoding needs to be known.
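
The rule described above can be checked with a short sketch. The in-memory handle and the `__WARN__` hook are only there to capture what happens; they are illustrative choices and not part of the original discussion:

```perl
use strict;
use warnings;

# Collect warnings instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# A plain handle with no encoding layer, i.e. a byte stream.
open( my $raw, '>', \my $bytes ) or die "open failed: $!";

print {$raw} chr(0xE0);     # code <= 255: written as the single byte 0xE0
print {$raw} chr(0x430);    # code > 255: warns, written as UTF-8 bytes D0 B0
close($raw) or die "close failed: $!";

print "warnings: ", scalar(@warnings), "\n";
print "bytes: ", join( ' ', map { sprintf '%02X', ord } split //, $bytes ), "\n";
```

Only the second print should trigger "Wide character in print". The first one is silently treated as a byte, which is exactly the assumption that could be argued to be a mistake.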


> {10406:65} [0:0]% perl -Mutf8 -wle 'print "[à]"'
> [à]
> {10415:66} [0:0]% perl -Mutf8 -wle 'print "[а]"'
> Wide character in print at -e line 1.
> [а]


... as these examples demonstrate.


> I had to add those brackets, because:
>
> {10421:67} [0:0]% perl -wle 'print "à"' # no problems, just a byte
> à


Assuming you use a UTF-8 terminal here: No, this isn't one byte. These are
two bytes, \303\240.

> {10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops
>


Now you have one character (because of -Mutf8, the two bytes \303\240
are decoded to the character U+00e0), but you are trying to write it to a byte
stream without specifying the encoding. Perl writes the single byte
0xE0, which your UTF-8 terminal cannot interpret. (Mine displays a
question mark in a dark circle)


> {10520:69} [0:0]% perl -Mutf8 -wle 'print "à "' # stupid
> à


Huh? What version of Perl on what platform is this? The string is
"\x{E0}\x{20}". All elements of the string are <= 255, so the string is
output as a byte string. This isn't valid UTF-8, and your terminal
shouldn't be able to interpret it as "à" any more than it was able to
interpret "\x{E0}\x{0A}" above.

[more equivalent examples snipped]

If your program does character I/O, you *need* to specify the encoding
of the I/O channels. For one-liners, the -C option is sufficient:

hrunkner:~/tmp 20:40 195% perl -CS -Mutf8 -wle 'print "à"'
à

For scripts you would use binmode or 'use open'.
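
As a rough sketch of the script variant, here is the `use open` form; calling `binmode(STDOUT, ':encoding(UTF-8)')` would cover STDOUT alone. The temporary directory and the file name are hypothetical and exist only so the effect on a freshly opened file can be seen:

```perl
use strict;
use warnings;
use utf8;                              # source literals are UTF-8
use open qw(:std :encoding(UTF-8));    # default layer for STDIN/STDOUT/STDERR and later open() calls
use File::Temp qw(tempdir);

my $dir  = tempdir( CLEANUP => 1 );
my $path = "$dir/demo.txt";            # hypothetical file name

# No layer given here, so the default set by "use open" applies.
open( my $out, '>', $path ) or die "open failed: $!";
print {$out} "à\n";
close($out) or die "close failed: $!";

print "à\n";                           # STDOUT is covered by :std
printf "%d bytes on disk\n", -s $path;
```

The pragma is lexically scoped, so it only affects open() calls in the file or block where it appears. Handles opened elsewhere, for example inside modules, keep their own settings.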

(Didn't you pride yourself on your ability to read? This is documented
and it has been repeated by several people in this newsgroup for years.)


> But watch this:
>
> {10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
> à
> {10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
> �
> {10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hooray!
> à
>
> Except for the middle one (which I should think about), I think
> encoding.pm wins again.


Excellent example, it shows exactly one of the pitfalls of using "use
encoding". One would expect "\x{E0}" to result in a string with a single
element with code 0xE0. At least you seem to have expected it, and for a
moment I was confused, too. But 'use encoding' doesn't work that way. It
was designed to convert string constants from the specified encoding to
Unicode, so it tries to interpret "\x{E0}" as UTF-8, but of course this
isn't valid UTF-8. So you get "\x{FFFD}" instead (U+FFFD is the
REPLACEMENT CHARACTER used to mark invalid characters).

If you use a correct UTF-8 encoded string, it works as expected (well,
expected by somebody who's read the documentation and remembers that
little pitfall):

hrunkner:~/tmp 20:47 197% perl -Mencoding=utf8 -wle 'print "\303\240"'
à
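
The substitution can also be reproduced without the encoding pragma. The sketch below calls Encode's `decode` directly, which is a stand-in chosen for illustration and not what `use encoding` does internally:

```perl
use strict;
use warnings;
use Encode qw(decode);

# A lone 0xE0 byte is not a complete UTF-8 sequence.
my $bad = decode( 'UTF-8', "\xE0" );

# \303\240 (0xC3 0xA0) is the UTF-8 encoding of U+00E0.
my $good = decode( 'UTF-8', "\303\240" );

printf "bad:  U+%04X\n", ord($bad);
printf "good: U+%04X\n", ord($good);
```

With Encode's default error handling, the malformed input should come back as U+FFFD and the well-formed input as U+00E0, matching the two one-liners above.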


For one-liners like this, using the same encoding for the script and the
I/O is useful ("-CS -Mutf8" is even shorter than "-Mencoding=utf8", but
maybe you don't have a UTF-8 capable terminal). However, for real
programs, I think tying the encoding of the source code to the encoding
of I/O-streams the script is supposed to handle is foolish. My scripts
are always encoded in UTF-8, but they frequently have to handle files in
CP-1252.

hp


--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | (E-Mail Removed) | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
 