tr/// broken?

 
 
Ilya Zakharevich
      04-11-2006

I'm trying to use tr/// operator (instead of RExen), and do not think
it works... The simplified example is

>perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"

UTF-16 surrogate 0xdfff at -e line 1.
Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
abcdefg

The original code contained something like

perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
Unicode character 0x1fffff is illegal at -e line 1.
________

That spurious warning can be worked around, but I think the behaviour
is not up to documentation; is it?

Thanks,
Ilya
 
Guest
      04-11-2006
Ilya Zakharevich said on Tue, 11 Apr 2006 02:53:58 +0000 (UTC):
>I'm trying to use tr/// operator (instead of RExen), and do not think
>it works... The simplified example is
>
> >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"

> UTF-16 surrogate 0xdfff at -e line 1.
> Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
> abcdefg
>

[...]
>That spurious warning can be worked around, but I think the behaviour
>is not up to documentation; is it?


It's in the perldiag manpage:

UTF-16 surrogate %s
(W utf8) You tried to generate half of an UTF-16 surrogate by requesting a
Unicode character between the code points 0xD800 and 0xDFFF (inclusive). That
range is reserved exclusively for the use of UTF-16 encoding (by having two
16-bit UCS-2 characters); but Perl encodes its characters in UTF-8, so what you
got is a very illegal character. If you really know what you are doing you can
turn off this warning by "no warnings 'utf8';".

 
Ilya Zakharevich
      04-11-2006
[A complimentary Cc of this posting was sent to

<(E-Mail Removed)>], who wrote in article <443b8741$0$5170$(E-Mail Removed)>:
> > >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"

> > UTF-16 surrogate 0xdfff at -e line 1.
> > Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
> > abcdefg


> >That spurious warning can be worked around, but I think the behaviour
> >is not up to documentation; is it?


> It's in the perldiag manpage:
>
> UTF-16 surrogate %s
> (W utf8) You tried to generate half ...


First of all, I assume that "its" is this broken warning (actually,
one of two [duplicate] warnings). Since it does not apply to the
situation I discuss, I can hardly consider your finding this message
in the list of warnings relevant.

Second, what I was discussing was not the warning, but the ACTION. Do
you think the RESULT ('abcdefg') is "correct"?

Thanks anyway,
Ilya

P.S. Actually, the text in perldiag is also wrong:

> of an UTF-16 surrogate by requesting a Unicode character between the
> code points 0xD800 and 0xDFFF (inclusive). That range is reserved
> exclusively for the use of UTF-16 encoding (by having two 16-bit
> UCS-2 characters); but Perl encodes its characters in UTF-8, so what
> you got is a very illegal character. If you really know what you
> are doing you can turn off this warning by "no warnings 'utf8';".


Perl (the language) does not encode its characters in UTF-8.
Characters are not encoded in any way, they just "are". And, if you
consider the implementation, the internal encoding is not UTF-8 either
(in the Perl world it is called "utf8", and it is a proper superset of
UTF-8). Sigh...
 
Dr.Ruud
      04-11-2006
Ilya Zakharevich wrote:

> I'm trying to use tr/// operator (instead of RExen), and do not think
> it works... The simplified example is
>
> >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"

> UTF-16 surrogate 0xdfff at -e line 1.
> Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
> abcdefg
>
> The original code contained something like
>
> perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
> tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
> Unicode character 0x1fffff is illegal at -e line 1.
> ________
>
> That spurious warning can be worked around,


Is it a "spurious warning"?

perl -MO=Deparse -e '$_ = qq(\x{d7ff}\x{d800})'

perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'


> but I think the behaviour
> is not up to documentation; is it?


It isn't.

--
Affijn, Ruud

"Gewoon is een tijger."

 
thundergnat
      04-11-2006
Ilya Zakharevich wrote:
> I'm trying to use tr/// operator (instead of RExen), and do not think
> it works... The simplified example is
>
> >perl5.8.7 -wle "$_ = q(abcdefg); tr/\x{e000}-\x{e0ff}/ /c; print"

> UTF-16 surrogate 0xdfff at -e line 1.
> Malformed UTF-8 character (UTF-16 surrogate 0xdfff) at -e line 1.
> abcdefg
>
> The original code contained something like
>
> perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
> tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
> Unicode character 0x1fffff is illegal at -e line 1.
> ________
>
> That spurious warning can be worked around, but I think the behaviour
> is not up to documentation; is it?
>


It /does/ appear to be a bug in tr. Not in that it has a problem with
characters in the range D800–DFFF, that doesn't surprise me much. Those
/aren't/ legal utf-8 character codes. The thing that DOES surprise me is
that tr considers \x{e000} (and \x{d7ff}!) to be in the range
\x{d800}-\x{dfff}. Seems like tr is confused about the surrogates range.


no error:
perl -wle "$_ = q(abcdefg); tr/\x{e001}-\x{e0ff}/ /c; print"

error:
perl -wle "$_ = q(abcdefg); tr/\x{e000}/ /c; print"

error:
perl -wle "$_ = q(abcdefg); tr/\x{d7ff}/ /c; print"

no error:
perl -wle "$_ = q(abcdefg); tr/\x{d7fe}/ /c; print"
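
The four boundary cases above can be probed from a single script rather than four one-liners. This is only a sketch: the helper name probe is made up for illustration, the tr/// is built through a string eval so that compile-time warnings can be captured, and what it reports will depend on the perl version it runs under (the behaviour described in this thread was seen on 5.8.7).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): apply tr/<cp>/ /c to the sample
# string 'abcdefg' and report the resulting string plus the number of
# warnings raised while compiling and running the tr///.
sub probe {
    my ($cp) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    local $_ = 'abcdefg';
    # tr/// is compiled once, so build it as a string and eval it per code point.
    my $code = sprintf 'tr/\\x{%04x}/ /c; 1', $cp;
    my $ok = eval $code;
    return { result => $_, warnings => scalar(@warnings), ok => $ok ? 1 : 0 };
}

# The same boundary code points as in the one-liners above.
for my $cp (0xd7fe, 0xd7ff, 0xe000, 0xe001) {
    my $r = probe($cp);
    printf "U+%04X: result='%s' warnings=%d\n",
        $cp, $r->{result}, $r->{warnings};
}
```

A result that still reads 'abcdefg', or a non-zero warning count, for a code point outside D800-DFFF would reproduce the confusion described here.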







 
Ilya Zakharevich
      04-11-2006
[A complimentary Cc of this posting was sent to
Dr.Ruud
<(E-Mail Removed)>], who wrote in article <(E-Mail Removed)>:
> > The original code contained something like
> >
> > perl5.8.7 -wle "$_ = qq(abcd\x{e155}efg);
> > tr/\x{e100}-\x{e1ff}\x00-\x{1FFFFF}/\x00-\xFF_/; print"
> > Unicode character 0x1fffff is illegal at -e line 1.
> > ________
> >
> > That spurious warning can be worked around,

>
> Is it a "spurious warning"?


Looks so. What makes you doubt it? I'm working with Perl characters,
not Unicode characters; and IIRC, even Unicode goes up to 0x1fffff...
Or is it 0x10ffff?
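
For reference, the Unicode code space ends at 0x10FFFF, and 0xD800-0xDFFF is reserved for UTF-16 surrogates; 0x1FFFFF lies beyond both. A small sketch that sorts the code points mentioned in this thread into those classes (the constant and function names are made up for illustration):

```perl
use strict;
use warnings;

# Illustrative names, not taken from any Perl module.
use constant {
    UNICODE_MAX     => 0x10FFFF,   # last code point defined by Unicode
    SURROGATE_FIRST => 0xD800,     # start of the UTF-16 surrogate range
    SURROGATE_LAST  => 0xDFFF,     # end of the UTF-16 surrogate range
};

# Classify a code point purely by its numeric value.
sub classify {
    my ($cp) = @_;
    return 'surrogate'      if $cp >= SURROGATE_FIRST && $cp <= SURROGATE_LAST;
    return 'beyond Unicode' if $cp > UNICODE_MAX;
    return 'Unicode scalar value';
}

printf "0x%X: %s\n", $_, classify($_)
    for 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x1FFFFF;
```

By this classification 0xD7FF and 0xE000 are ordinary scalar values sitting directly next to the surrogate range, and 0x1FFFFF is outside Unicode altogether.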

> perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'


What is your point? I do not see which output makes you think this is
relevant... Did you try

perl -MO=Deparse -e 'tr/\x{7ff}\x{800}//'

Thanks,
Ilya
 
Ilya Zakharevich
      04-11-2006
[A complimentary Cc of this posting was sent to
Dr.Ruud
<(E-Mail Removed)>], who wrote in article <(E-Mail Removed)>:
> Is it a "spurious warning"?


> perl -MO=Deparse -e 'tr/\x{d7ff}\x{d800}//'


Oops, ignore my preceding message; I was using the wrong quotes... So I
see now where the Perl bug is:

>perl -MO=Deparse -e "tr/\x{0000}-\x{ffff}//"

Malformed UTF-8 character (character 0xffff) at -e line 1.
Malformed UTF-8 character (character 0xffff) at -e line 1.
use utf8 ();
tr/\000//;
-e syntax OK

>perl -MO=Deparse -e "tr/\x{0000}-\x{fff0}//"

use utf8 ();
tr/\000-\x{fff0}//;
-e syntax OK

So some Perl developer thought that Perl characters == Unicode
characters, and mangles the pattern without reporting errors...

A lot of thanks,
Ilya
 
Ilya Zakharevich
      04-11-2006
[A complimentary Cc of this posting was sent to
thundergnat
<(E-Mail Removed)>], who wrote in article <(E-Mail Removed)>:
> It /does/ appear to be a bug in tr. Not in that it has a problem with
> characters in the range D800–DFFF, that doesn't surprise me much. Those
> /aren't/ legal utf-8 character codes.


Let me disagree. First, I know of no such thing as utf-8. Second, if
you mean utf8, legal codes are 0..MAX_UV (since the size of UV is
specific to Perl build, this depends on the build of Perl executable).

Some codes would not appear in Unicode strings; but one should be able
to treat "binary" data freely (including 0..31 and 0x80..0x9F ranges,
and other characters which have no Unicode-consortium-assigned
cultural information).
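
A minimal sketch of that point: a Perl string can hold a code point above the Unicode maximum next to control characters and still behave as an ordinary string of characters. The variable names are illustrative only, and the warning category is switched off because newer perls may warn about non-Unicode code points in some operations.

```perl
use strict;
use warnings;
no warnings 'utf8';   # silence warnings about non-Unicode code points

my $cp  = 0x110000;   # one past the Unicode maximum 0x10FFFF
my $str = chr($cp) . "\x{9F}" . "\x00";   # plus a C1 control and a NUL

# The string has three characters, and the first keeps its ordinal.
printf "length=%d first=0x%X\n", length($str), ord($str);
# prints: length=3 first=0x110000
```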

Thanks,
Ilya
 
Guest
      04-12-2006
Ilya Zakharevich said on Tue, 11 Apr 2006 16:17:49 +0000 (UTC):
> Since it does not apply to the
>situation I discuss, I can hardly find your finding this message in
>the list of warnings relevant.
>
>Second, what I was discussing was not the warning, but the ACTION. Do
>you think the RESULT ('abcdefg') is "correct"?


The warning seems relevant, as avoiding the 0xD800-0xDFFF range seems to
give a good result:


$ perl -wle '$_ = q(abcdefg); tr/\x{d7ff}-\x{e0ff}/ /c; print'
 
Ben Bacarisse
      04-13-2006
On Tue, 11 Apr 2006 22:11:32 +0000, Ilya Zakharevich wrote:

> Let me disagree. First, I know of no such thing as utf-8. Second, if
> you mean utf8


The proper form is UTF-8 (i.e. with caps) so your correction (further from
the accepted form) seems rather harsh!

Refs:
http://www.unicode.org/versions/Unicode3.0.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

--
Ben.
 