how is the string encoded

 
 
dn.perl@gmail.com
      01-03-2012

I know the question must have been asked many times, there are many
web-pages which are supposed to help, but after going through many of
them, I still need help.
I am running a simple program on linux, perl 5.8.8 ;

use strict ;
use warnings ;
## use utf8 ;

my $str ;
$str = "" ;
print "str is $str\n" ;
---
Works well. But my question is: how do I know which encoding is being
used to read/write $str?

If I uncomment the 'use utf8' line, I get a warning: Malformed UTF-8
character. And the string no longer prints correctly. Why, and how do I
remove this warning and print the string correctly? I would have
guessed that 'use utf8' adds more power to the code and would not break
code which was otherwise running correctly.

 
Rainer Weikusat
      01-03-2012
"(E-Mail Removed)" <(E-Mail Removed)> writes:
> I know the question must have been asked many times, there are many
> web-pages which are supposed to help, but after going through many of
> them, I still need help.
> I am running a simple program on linux, perl 5.8.8 ;
>
> use strict ;
> use warnings ;
> ## use utf8 ;
>
> my $str ;
> $str = "" ;
> print "str is $str\n" ;
> ---
> Works well. But my question is: how do I know which encoding is being
> used to read/write $str?


According to the people who dabble in this area, you are not supposed
to know that. You are supposed to convert any data flowing into perl
from the encoding known to you into 'the super-secret, proprietary
internal Perl encoding' (patent pending) and any data flowing out of
perl from said 'super-secret internal Perl encoding' into whatever
encoding you'd like to have. Should the encoding you want to use (for
whatever reason) not be among the ones Perl supports natively, you're
****ed and advised to take your petty problems elsewhere. That's the
theory. Practically, Perl uses utf8 (which presumably causes a lot of
people sour bumpers because Microsoft [reportedly] uses UCS-2).

Another practical piece of advice: Stick to ASCII. That's the only
thing no American committee is going to uninvent tomorrow and thus, a
safe choice for all communication needs among educated people. Let all
those club-bearing natives draw their weird krikel-krakels to their
heart's content and ignore them.

 
Helmut Richter
      01-03-2012
On Tue, 3 Jan 2012, Rainer Weikusat wrote:

> "(E-Mail Removed)" <(E-Mail Removed)> writes:
> > I know the question must have been asked many times, there are many
> > web-pages which are supposed to help, but after going through many of
> > them, I still need help.
> > I am running a simple program on linux, perl 5.8.8 ;
> >
> > use strict ;
> > use warnings ;
> > ## use utf8 ;
> >
> > my $str ;
> > $str = "" ;
> > print "str is $str\n" ;
> > ---
> > Works well. But my question is: how do I know which encoding is being
> > used to read/write $str?

>
> According to the people who dabble in this area, you are not supposed
> to know that. You are supposed to convert any data flowing into perl
> from the encoding known to you into 'the super-secret, proprietary
> internal Perl encoding' (patent pending) and any data flowing out of
> perl from said 'super-secret internal Perl encoding' into whatever
> encoding you'd like to have.


You have to distinguish between the encoding used by *you* while you
are writing your perl program, and the encoding used by *perl* while it is
running your program.

You have to know what *you* are using. The answer has nothing to do with
perl. If you can look at your program in an environment where UTF-8 is
expected and you read it correctly there, then the program is in UTF-8.
Use "use utf8" to tell perl about it. It only affects how the source text
is read; it has no effect on what the program does with the strings in it.
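
A minimal sketch of what the pragma changes, assuming the file below is itself saved as UTF-8. The literal "ä" is only an illustrative character (not necessarily the one from the original post) and the sub names are made up for this example. The same literal is one character when the source is declared as UTF-8 and two bytes when it is not:

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so the literal below is
# stored on disk as the two bytes 0xC3 0xA4.

sub literal_length_with_utf8 {
    use utf8;                 # source text is decoded as UTF-8
    return length("ä");       # one character
}

sub literal_length_without_utf8 {
    no utf8;                  # source bytes are taken as they are
    return length("ä");       # two bytes
}

print "with use utf8:    ", literal_length_with_utf8(), "\n";
print "without use utf8: ", literal_length_without_utf8(), "\n";
```

If the file were saved as ISO-8859-1 instead, the 'use utf8' branch is the one that would produce the "Malformed UTF-8 character" warning from the original question.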

For the encoding used by *perl* while it is running your program, Rainer
Weikusat's comment applies. You should not try to know. As long as all
characters in a string are in the ISO-8859-1 character set, it is probable
that ISO-8859-1 is internally used; there is an additional flag in the
internal representation to indicate how the string is internally stored.
Don't mess around with the internal encoding. Rather, *you* have to know
how you meant the string: either as a sequence of bytes whose character
meaning only you know, or as a sequence of characters whose encoding as
bytes only perl knows. Do not try to share such knowledge between perl and
you. This is fairly well explained in perlunitut (e.g.
http://search.cpan.org/~flora/perl-5...perlunitut.pod).
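
The decode-on-input, encode-on-output pattern that perlunitut describes could look roughly like this. It is only a sketch: the two encodings (UTF-8 in, ISO-8859-1 out) and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: incoming data is known to be UTF-8 and outgoing data
# should be ISO-8859-1. Both choices are illustrative only.
my $bytes_in = "\xC3\xA4";                       # "ä" as UTF-8 bytes

my $chars = decode('UTF-8', $bytes_in);          # bytes -> characters
my $upper = uc $chars;                           # work on characters

my $bytes_out = encode('ISO-8859-1', $upper);    # characters -> bytes

printf "in: %d bytes, text: %d character, out: %d byte (0x%02X)\n",
    length($bytes_in), length($chars), length($bytes_out), ord($bytes_out);
```

Between decode and encode the program never needs to know how perl stores $chars internally, which is the point of the advice above.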

--
Helmut Richter
 
Rainer Weikusat
      01-03-2012
Helmut Richter <(E-Mail Removed)> writes:
> On Tue, 3 Jan 2012, Rainer Weikusat wrote:
>> "(E-Mail Removed)" <(E-Mail Removed)> writes:
>> > I know the question must have been asked many times, there are many
>> > web-pages which are supposed to help, but after going through many of
>> > them, I still need help.
>> > I am running a simple program on linux, perl 5.8.8 ;
>> >
>> > use strict ;
>> > use warnings ;
>> > ## use utf8 ;
>> >
>> > my $str ;
>> > $str = "" ;
>> > print "str is $str\n" ;
>> > ---
>> > Works well. But my question is: how do I know which encoding is being
>> > used to read/write $str?

>>
>> According to the people who dabble in this area, you are not supposed
>> to know that. You are supposed to convert any data flowing into perl
>> from the encoding known to you into 'the super-secret, proprietary
>> internal Perl encoding' (patent pending) and any data flowing out of
>> perl from said 'super-secret internal Perl encoding' into whatever
>> encoding you'd like to have.

>
> You have to distinguish between the encoding used by *you* while you
> are writing your perl program, and the encoding used by *perl* while it is
> running your program.


No. The people who *presently* work on Perl unicode support *want*
users of the language to pretend that 'the internal perl
encoding' is some magic secret beyond the realm of Perl code, *despite*
this being obviously at odds with the original design of 'unicode support
for Perl'. It also doesn't make much sense: At the very least, this
requires one additional copy of all data flowing into Perl and one
additional copy of all data going out of Perl. Given that one of the
main uses of Perl is as a so-called 'glue language' interconnecting
other pieces of software into a complex whole, this is a major pain in
the ass, and this solely for the hypothetical benefit of the people
working on the code. It is hypothetical because there is no way in
heaven or hell that all of the existing Perl code which wasn't written
based on the assumption that Perl strings are magic beasts with
intransigent properties is ever going to be changed just because this
would appeal to someone's completely impractical idea of theoretical
purity. The worst possible case is that - someday - a Perl 5 fork
is created which does break all this code, and this will then simply
become Perl 6 rev 0.5 --- something which exists for the private joy
of its developers and which nobody uses for anything.

 
Rainer Weikusat
      01-03-2012
Ben Morrow <(E-Mail Removed)> writes:
> Quoth "(E-Mail Removed)" <(E-Mail Removed)>:
>>
>> I know the question must have been asked many times, there are many
>> web-pages which are supposed to help, but after going through many of
>> them, I still need help.
>> I am running a simple program on linux, perl 5.8.8 ;

>
> That perl is very nearly six years old. You should upgrade to at least
> 5.12.
>
>> use strict ;
>> use warnings ;
>> ## use utf8 ;
>>
>> my $str ;
>> $str = "" ;
>> print "str is $str\n" ;
>> ---
>> Works well. But my question is: how do I know which encoding is being
>> used to read/write $str?

>
> If you don't 'use utf8', perl assumes your source is ISO8859-1. If
> you do, it assumes your source is in UTF-8. (In theory you can use other
> encodings with the 'use encoding' pragma, but AIUI this doesn't work
> reliably.)
>
> Output is completely unrelated. If you don't do anything special, perl
> will give you output in ISO8859-1.


This isn't quite correct: It will use 'the native 8 bit encoding' and
this may well be something other than ASCII/ ISO-8859-1, although
that's a case which rarely occurs in practice because most people
don't write code for IBM mainframes :->.

[...]

> If you attempt to print a character which can't be represented in
> ISO8859-1 you get a warning and the raw UTF-8 bytes representing
> that character: this is obviously something you need to avoid, since
> the output doesn't make any sense at that point.


An example I recently encountered where it did make sense was a web
interface with a Japanese localization: Since there were no characters
corresponding with codepoints from (128, 255), the generated output
was simply UTF-8 encoded Japanese which was exactly what it was
supposed to be.
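
A sketch of the difference between relying on that fallback and declaring the output encoding. The helper bytes_written is made up for this example and writes to an in-memory file so the resulting bytes can be counted. It assumes no default I/O layers are forced through the -C switch or the PERL_UNICODE environment variable.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text to an in-memory file through the
# given PerlIO layer spec and return the bytes that were written.
sub bytes_written {
    my ($layers, $text) = @_;
    my $buffer = '';
    open(my $out, ">$layers", \$buffer) or die "cannot open buffer: $!";
    print {$out} $text;
    close($out) or die "cannot close buffer: $!";
    return $buffer;
}

my $a_umlaut = chr(0xE4);             # code point below 256
my $hiragana = "\x{3042}\x{3044}";    # both code points are above 255

# No layer: a character below 256 goes out as one native byte,
# not as UTF-8.
printf "raw, U+00E4:   %d byte(s)\n",
    length(bytes_written('', $a_umlaut));

# Explicit layer: every character is encoded as UTF-8, no warning.
printf "UTF-8, U+00E4: %d byte(s)\n",
    length(bytes_written(':encoding(UTF-8)', $a_umlaut));
printf "UTF-8, kana:   %d byte(s)\n",
    length(bytes_written(':encoding(UTF-8)', $hiragana));
```

For real output the same layer can be put on STDOUT with binmode(STDOUT, ':encoding(UTF-8)'). Output that mixes characters from the 128-255 range with characters above 255 is where the fallback stops producing consistent UTF-8.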
 
dn.perl@gmail.com
      01-04-2012
On Jan 3, 10:25 am, Ben Morrow wrote:
>
> That perl (5.8.8) is very nearly six years old. You should
> upgrade to at least 5.12.
>


I wonder whether you realize how difficult (ranging to impossible) it
may be to achieve that. Say I am on a 3-month contract. The employer
has been managing for years with 5.8.8 and is unlikely to upgrade in
such a case. Once I was stuck with a MySQL server which was many years
old, but my boss was more concerned with preserving his own job than
with asking his BOSS to spend time and money on upgrading. Not that the
suggestion to upgrade is wrong or anything.

>
> If you don't 'use utf8', perl assumes your source is ISO8859-1. If
> you do, it assumes your source is in UTF-8. (In theory you can use other
> encodings with the 'use encoding' pragma, but AIUI this doesn't work
> reliably.)
> ...
> What did you expect to happen? perldoc utf8 quite clearly says
>     Do not use this pragma for anything else than telling Perl that your
>     script is written in UTF-8.
> so if you 'use utf8' and your source isn't, in fact, *in* UTF-8, you
> must expect warnings and misbehaviour.
>


It is very useful to know that perl assumes the source to be
ISO8859-1. That 'use utf8' arguably works counter-intuitively. Since
my code is ASCII and all ASCII is automatically utf8, I tend to wonder
why I would ever write non-ascii code. It may not be a logical thing
to do but I daresay it is an instinctive thing to do. Now if I want to
dabble in utf8 or databases, what do I do? I think of 'use utf8' or
'use DataBaseInterface DBI'.

What I needed was 'use Encode' which is what I am doing now.
Thanks for all the responses.

 
Peter J. Holzer
      01-04-2012
On 2012-01-04 07:38, (E-Mail Removed) <(E-Mail Removed)> wrote:
> On Jan 3, 10:25 am, Ben Morrow wrote:
>> That perl (5.8.8) is very nearly six years old. You should
>> upgrade to at least 5.12.

[...]
>> If you don't 'use utf8', perl assumes your source is ISO8859-1. If
>> you do, it assumes your source is in UTF-8. (In theory you can use other
>> encodings with the 'use encoding' pragma, but AIUI this doesn't work
>> reliably.)
>> ...
>> What did you expect to happen? perldoc utf8 quite clearly says
>>     Do not use this pragma for anything else than telling Perl that your
>>     script is written in UTF-8.
>> so if you 'use utf8' and your source isn't, in fact, *in* UTF-8, you
>> must expect warnings and misbehaviour.
>>

>
> It is very useful to know that perl assumes the source to be
> ISO8859-1.


This is not quite correct. Without 'use utf8', perl assumes your source
is an unspecified superset of ASCII, not ISO-8859-1. The character codes
are the same, but the semantics are different. For example, if your
script was encoded in ISO-8859-1, "ä" would result in a string consisting
of a single byte with the value 0xE4, but that byte is not equivalent to
the character "ä" - it doesn't match \w, [:lower:] or any of the other
classes "LATIN SMALL LETTER A WITH DIAERESIS" should match. It cannot be
uppercased. It is just a meaningless byte, not a character.
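
A small sketch of that difference, written with "\xE4" instead of a literal so it does not depend on how the file is saved. It assumes neither 'use locale' nor the later 'unicode_strings' feature is in effect, since both change how the plain byte is treated. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\xE4";                         # the single byte from the example
my $char = decode('ISO-8859-1', $byte);    # the character U+00E4

# The plain byte has no character properties; the decoded character does.
my $byte_is_word = ($byte =~ /\w/) ? 1 : 0;
my $char_is_word = ($char =~ /\w/) ? 1 : 0;

printf "byte: matches \\w = %d, uc gives 0x%02X\n",
    $byte_is_word, ord(uc $byte);
printf "char: matches \\w = %d, uc gives 0x%02X\n",
    $char_is_word, ord(uc $char);
```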


> That 'use utf8' arguably works counter-intuitively. Since
> my code is ASCII


No, your code isn't ASCII. It contained the line

| $str = "" ;

"" is not an ASCII character.

> and all ASCII is automatically utf8, I tend to wonder
> why I would ever write non-ascii code.


Well, why did you?


> What I needed was 'use Encode' which is what I am doing now.


Please don't unless you really understand what it does. Encode does a
couple of different things and it isn't entirely consistent. It seemed
like a good idea at the time and it may have been useful for converting
pre-5.8-code, but I really wouldn't use it for new code.

hp

--
_ | Peter J. Holzer | Deprecating human carelessness and
|_|_) | Sysadmin WSR | ignorance has no successful track record.
| | | (E-Mail Removed) |
__/ | http://www.hjp.at/ | -- Bill Code on (E-Mail Removed)
 
Rainer Weikusat
      01-04-2012
"Peter J. Holzer" <(E-Mail Removed)> writes:
> On 2012-01-04 07:38, (E-Mail Removed) <(E-Mail Removed)> wrote:
>> On Jan 3, 10:25 am, Ben Morrow wrote:
>>> That perl (5.8.8) is very nearly six years old. You should
>>> upgrade to at least 5.12.

> [...]
>>> If you don't 'use utf8', perl assumes your source is ISO8859-1. If
>>> you do, it assumes your source is in UTF-8. (In theory you can use other
>>> encodings with the 'use encoding' pragma, but AIUI this doesn't work
>>> reliably.)
>>> ...
>>> What did you expect to happen? perldoc utf8 quite clearly says
>>>     Do not use this pragma for anything else than telling Perl that your
>>>     script is written in UTF-8.
>>> so if you 'use utf8' and your source isn't, in fact, *in* UTF-8, you
>>> must expect warnings and misbehaviour.
>>>

>>
>> It is very useful to know that perl assumes the source to be
>> ISO8859-1.

>
> This is not quite correct. Without 'use utf8', perl assumes your source
> is an unspecified superset of ASCII, not ISO-8859-1.
> The character codes are the same, but the semantics are different.


This is also not quite correct: When 'use locale' is in effect, Perl
assumes that anything beyond ASCII is supposed to have a meaning in
the locale which happens to be in effect when the script is
executed. Otherwise, the default is equivalent to the default POSIX
locale (corresponding with LANG=C) which means bytes with value in the
range (0, 127) will be interpreted as ASCII characters belonging to
some of the different character classes and bytes with values from
(128, 255) are just 'bytes with certain values' and no further
properties.

Eg, assuming the text included below

----------------
$a = chr(0xe4);

{
use locale;
print 'locale: ', $a =~ /\w/, "\n";
}

print 'no locale: ', $a =~ /\w/, "\n";
----------------

is saved to a file on a system where locale-information for ISO-8859-1
based German is available, the command (a.pl being the name of the
file)

LANG=de_DE perl a.pl

will print

locale: 1
no locale:

and

LANG=C perl a.pl

will print

locale:
no locale:

 
Rainer Weikusat
      01-05-2012
Ben Morrow <(E-Mail Removed)> writes:
> Quoth "Peter J. Holzer" <(E-Mail Removed)>:
>> On 2012-01-04 07:38, (E-Mail Removed) <(E-Mail Removed)> wrote:


[...]


>> > What I needed was 'use Encode' which is what I am doing now.

>>
>> Please don't unless you really understand what it does. Encode does a
>> couple of different things and it isn't entirely consistent. It seemed
>> like a good idea at the time and it may have been useful for converting
>> pre-5.8-code, but I really wouldn't use it for new code.

>
> Are you (either of you, in fact) thinking of 'use encoding'? That pragma
> is, as I said originally, a Bad Idea.


This would then be another documented Perl feature which managed to run afoul
of someone's opinions. Is there actually any other reason than "it's a
convenient way to do what shalt not be done"?
 
Rainer Weikusat
      01-05-2012
Shmuel (Seymour J.) Metz <(E-Mail Removed)> writes:
> at 01:23 AM, Rainer Weikusat <(E-Mail Removed)> said:
>
>>This would then be another documented Perl feature which managed to run
>>afoul of someone's opinions.

>
> No.


It's documented:

[rw@sapphire]/tmp $whatis encoding
encoding (3perl) - allows you to write your script in non-ascii or non-utf8

But according to the opinion of someone, it shouldn't be used.

>>Is there actually any other reason than "it's a
>>convenient way to do what shalt not be done"?

>
> Yes.


And - as usual - no reasons beyond 'thou shalt do as I bid you and not
ask silly questions' are given.
 