Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > utf8

George Mpouras
 
      05-13-2013
Is there any easy way to decide if a string is valid UTF-8?
 
 
Manfred Lotz
 
      05-13-2013
On Mon, 13 May 2013 14:05:00 +0300
George Mpouras wrote:

> Is there any easy way to decide if a string is valid UTF-8?


Minimal example:

#! /usr/bin/perl

use strict;
use warnings;

use utf8;
use Encode;

my $string = 'Hä';

Encode::is_utf8($string) or die "bad string";

my $bad_string = 0x123456;
Encode::is_utf8($bad_string) or die "bad string";


--
Manfred

 
 
George Mpouras
 
      05-13-2013
On 13/5/2013 15:51, Manfred Lotz wrote:
> On Mon, 13 May 2013 14:05:00 +0300
> George Mpouras wrote:
>
>> Is there any easy way to decide if a string is valid UTF-8?

>
> Minimal example:
>
> #! /usr/bin/perl
>
> use strict;
> use warnings;
>
> use utf8;
> use Encode;
>
> my $string = 'Hä';
>
> Encode::is_utf8($string) or die "bad string";
>
> my $bad_string = 0x123456;
> Encode::is_utf8($bad_string) or die "bad string";
>
>




Thanks, it works.
I had tried the same thing; my mistake was leaving out the
line "use utf8;".


 
 
Manfred Lotz
 
      05-13-2013
On Mon, 13 May 2013 16:22:36 +0300
George Mpouras wrote:

> On 13/5/2013 15:51, Manfred Lotz wrote:
> > On Mon, 13 May 2013 14:05:00 +0300
> > George Mpouras wrote:
> >
> >> Is there any easy way to decide if a string is valid UTF-8?

> >
> > Minimal example:
> >
> > #! /usr/bin/perl
> >
> > use strict;
> > use warnings;
> >
> > use utf8;
> > use Encode;
> >
> > my $string = 'Hä';
> >
> > Encode::is_utf8($string) or die "bad string";
> >
> > my $bad_string = 0x123456;
> > Encode::is_utf8($bad_string) or die "bad string";
> >
> >

>
>
>
> Thanks, it works.
> I had tried the same thing; my mistake was leaving out the
> line "use utf8;".
>
>


Yes, that is important.


--
Manfred





 
 
Peter J. Holzer
 
      05-13-2013
On 2013-05-13 12:51, Manfred Lotz wrote:
> On Mon, 13 May 2013 14:05:00 +0300
> George Mpouras wrote:
>> Is there any easy way to decide if a string is valid UTF-8?

>
> Minimal example:
>
> #! /usr/bin/perl
>
> use strict;
> use warnings;
>
> use utf8;
> use Encode;
>
> my $string = 'Hä';


This string is not UTF-8 in any useful sense. It consists of two
characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would consist
of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former string has
length 2, the latter has length 3.
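
The difference in length can be checked directly. This is a minimal sketch, assuming the character string is written with an escape (so it does not depend on how the source file is saved); the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "H\x{E4}";                 # two characters: H and ä
my $bytes = encode('UTF-8', $chars);   # the same text as UTF-8 bytes

printf "characters: %d\n", length $chars;                  # 2
printf "bytes:      %d\n", length $bytes;                  # 3
printf "hex:        %s\n", join ' ', unpack('(H2)*', $bytes);   # 48 c3 a4
```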


> Encode::is_utf8($string) or die "bad string";


This tests whether the internal representation of the string is
utf-8-like, which you almost never want to know in a Perl program. It
also tells you whether the string has character semantics (unless you
use a rather new version of perl with the unicode_strings feature),
which is sometimes useful.
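
To illustrate that point, here is a small sketch (variable names are assumptions for the example): a byte string holding perfectly valid UTF-8 is not flagged, while the decoded character string is. So the flag says nothing about whether the data is well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

my $bytes = "H\xC3\xA4";               # valid UTF-8 bytes, not yet decoded
my $chars = decode('UTF-8', $bytes);   # two characters: H and ä

# is_utf8() reports perl's internal flag, not the validity of the data.
printf "bytes flagged: %s\n", is_utf8($bytes) ? 'yes' : 'no';   # no
printf "chars flagged: %s\n", is_utf8($chars) ? 'yes' : 'no';   # yes
```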

If you want to know whether a string is a correctly encoded UTF-8
sequence, try to decode it:

$decoded = eval { decode('UTF-8', $string, FB_CROAK) };

(decode(..., FB_CROAK) will die if $string is not UTF-8, so you need to
catch that. All other check parameters are even less convenient).
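
Wrapped up as a predicate, this might look as follows. It is only a sketch: is_valid_utf8 is a made-up name, not part of Encode, and it expects a byte string (for example, data read from a file or socket without decoding). LEAVE_SRC is added because decode() may otherwise modify its input when a check value is given.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $bytes is a well-formed UTF-8 byte sequence.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # With FB_CROAK, decode() dies on malformed input; eval catches that
    # and yields undef.
    return defined eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
}

print is_valid_utf8("H\xC3\xA4") ? "valid\n" : "invalid\n";   # H + ä: valid
print is_valid_utf8("H\xE4")     ? "valid\n" : "invalid\n";   # lone 0xE4: invalid
```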

hp


--
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR       | Man feilt solange an seinen Text um, bis
| |   |                    | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
 
 
George Mpouras
 
      05-14-2013
>
> If you want to know whether a string is a correctly encoded UTF-8
> sequence, try to decode it:
>
> $decoded = eval { decode('UTF-8', $string, FB_CROAK) };
>
> (decode(..., FB_CROAK) will die if $string is not UTF-8, so you need to
> catch that. All other check parameters are even less convenient).
>


nice !

 
 
Manfred Lotz
 
      05-14-2013
On Tue, 14 May 2013 01:10:59 +0200
"Peter J. Holzer" wrote:

> On 2013-05-13 12:51, Manfred Lotz wrote:
> > On Mon, 13 May 2013 14:05:00 +0300
> > George Mpouras wrote:
> >> Is there any easy way to decide if a string is valid UTF-8?

> >
> > Minimal example:
> >
> > #! /usr/bin/perl
> >
> > use strict;
> > use warnings;
> >
> > use utf8;
> > use Encode;
> >
> > my $string = 'Hä';

>
> This string is not UTF-8 in any useful sense. It consists of two
> characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
> LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
> consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
> string has length 2, the latter has length 3.
>


This is only the email. In my test script it is this:

00000050  20 27 48 c3 a4 27 3b 0a  0a 45 6e 63 6f 64 65 3a  | 'H..';..Encode:|




> > Encode::is_utf8($string) or die "bad string";

>
> This tests whether the internal representation of the string is
> utf-8-like, which you almost never want to know in a Perl program. It
> also tells you whether the string has character semantics (unless you
> use a rather new version of perl with the unicode_strings feature),
> which is sometimes useful.
>
> If you want to know whether a string is a correctly encoded UTF-8
> sequence, try to decode it:
>
> $decoded = eval { decode('UTF-8', $string, FB_CROAK) };
>
> (decode(..., FB_CROAK) will die if $string is not UTF-8, so you need
> to catch that. All other check parameters are even less convenient).
>


Aaah, thanks. Didn't know that.

#! /usr/bin/perl
use strict;
use warnings;

use utf8;
use 5.010;

use Encode qw( decode FB_CROAK );

my $string = 'Hä'; # = 0x48c3a4


my $decoded = decode('utf8', $string, FB_CROAK);


Nevertheless, I'm confused. The script above, where 'Hä' is definitely
0x48c3a4 (verified by hexdump), croaks. Why?

At any rate I have to read perlunitut, perluniintro etc. to understand
what's going on.


--
Manfred

 
 
Manfred Lotz
 
      05-15-2013
On Tue, 14 May 2013 21:27:49 +0100
Ben Morrow wrote:

>
> Quoth Manfred Lotz:
> > On Tue, 14 May 2013 01:10:59 +0200
> > "Peter J. Holzer" wrote:
> > > On 2013-05-13 12:51, Manfred Lotz wrote:
> > > >
> > > > use utf8;
> > > > use Encode;
> > > >
> > > > my $string = 'Hä';
> > >
> > > This string is not UTF-8 in any useful sense. It consists of two
> > > characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
> > > LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
> > > consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
> > > string has length 2, the latter has length 3.

> [...]
> >
> > use utf8;
> > use 5.010;
> >
> > use Encode qw( decode FB_CROAK );
> >
> > my $string = 'Hä'; # = 0x48c3a4
> >
> >
> > my $decoded = decode('utf8', $string, FB_CROAK);
> >
> >
> > Nevertheless, I'm confused. The script above, where 'Hä' is definitely
> > 0x48c3a4 (verified by hexdump), croaks. Why?

>
> That is exactly what Peter was trying to explain. Because of the 'use
> utf8', perl has already decoded the UTF-8 in the source code file into
> Unicode characters, so $string does *not* contain "\x48\xc3\xa4":


My mistake was that I believed that perl's internal representation is
utf8 rather than Unicode code points. I thought I had read this in some
perl man page.


--
Manfred

 
 
Manfred Lotz
 
      05-15-2013
On Tue, 14 May 2013 21:27:49 +0100
Ben Morrow wrote:

>
> Quoth Manfred Lotz:
> > On Tue, 14 May 2013 01:10:59 +0200
> > "Peter J. Holzer" wrote:
> > > On 2013-05-13 12:51, Manfred Lotz wrote:
> > > >
> > > > use utf8;
> > > > use Encode;
> > > >
> > > > my $string = 'Hä';
> > >
> > > This string is not UTF-8 in any useful sense. It consists of two
> > > characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
> > > LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
> > > consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
> > > string has length 2, the latter has length 3.

> [...]
> >
> > use utf8;
> > use 5.010;
> >
> > use Encode qw( decode FB_CROAK );
> >
> > my $string = 'Hä'; # = 0x48c3a4
> >
> >
> > my $decoded = decode('utf8', $string, FB_CROAK);
> >
> >
> > Nevertheless, I'm confused. The script above, where 'Hä' is definitely
> > 0x48c3a4 (verified by hexdump), croaks. Why?

>
> That is exactly what Peter was trying to explain. Because of the 'use
> utf8', perl has already decoded the UTF-8 in the source code file into
> Unicode characters, so $string does *not* contain "\x48\xc3\xa4":
> instead it contains "\x48\xe4". The e4 is because 'ä', as a Unicode
> character, has ordinal 0xe4. This string, which happens to contain
> only bytes though it could easily not have done, is not valid UTF-8,
> so decode croaks.
>


Ok, I agree that perl decodes 'ä' (which is utf8 x'c3a4' in the file) to
unicode \x{e4}.
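
The effect of "use utf8" on the literal can be seen side by side in one file, since the pragma is lexically scoped. This is a sketch under the assumption that the script itself is saved as UTF-8; the variable names are illustrative. It also suggests that the decode check is meant for bytes (for example, raw input), not for a string perl has already decoded.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK LEAVE_SRC);

my ($chars, $raw);
{ use utf8; $chars = 'Hä'; }   # literal decoded to characters: H, U+00E4
{ no utf8;  $raw   = 'Hä'; }   # literal kept as source bytes: 48 C3 A4

printf "with use utf8: length %d\n", length $chars;   # 2
printf "without:       length %d\n", length $raw;     # 3

# The raw bytes are valid UTF-8; the already-decoded characters are not.
my $ok_raw   = defined eval { decode('UTF-8', $raw,   FB_CROAK | LEAVE_SRC) };
my $ok_chars = defined eval { decode('UTF-8', $chars, FB_CROAK | LEAVE_SRC) };
printf "raw bytes decode:  %s\n", $ok_raw   ? 'yes' : 'no';
printf "characters decode: %s\n", $ok_chars ? 'yes' : 'no';

# Encoding the characters gives back the bytes that are in the source file.
print "round trip matches\n" if encode('UTF-8', $chars) eq $raw;
```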

Nevertheless the ä is a valid utf8 char.

This means that the test to check for valid utf8 which Peter proposed
is wrong as it croaks.

The following snippet:

#!/usr/bin/perl

use strict;
use warnings;

use utf8;

use Test::utf8;

binmode STDOUT, ":utf8";

my $ae = 'ä';

show_char($ae);

sub show_char {
    my $ch = shift;

    print '-' x 80;
    print "\n";
    print "Char: $ch\n";
    is_valid_string($ch);   # check the string is valid
    is_sane_utf8($ch);      # check not double encoded

    # check the string has certain attributes
    is_flagged_utf8($ch);   # has utf8 flag set
    is_within_ascii($ch);   # only has ascii chars in it
    is_within_latin_1($ch); # only has latin-1 chars in it

}

yields:
--------------------------------------------------------------------------------
Char: ä
ok 1 - valid string test
ok 2 - sane utf8
ok 3 - flagged as utf8
not ok 4 - within ascii
# Failed test 'within ascii'
# at ./unicode04.pl line 27.
# Char 1 not ASCII (it's 228 dec / e4 hex)
ok 5 - within latin-1
# Tests were run but no plan was declared and done_testing() was not seen.

which is what I would have assumed.


--
Manfred

 
 
Rainer Weikusat
 
      05-15-2013
Manfred Lotz writes:
> On Tue, 14 May 2013 21:27:49 +0100


[...]

> My mistake was that I believed that perl's internal representation is
> utf8 rather than Unicode code points.


perl's internal representation is utf8, which is supposed to be decoded
on demand as necessary. That's not an uncommon implementation choice
for software supposed to interact with 'the real world' (here meaning
'everything out there on the internet'; have a look at the Mozilla Rust
FAQ for a cogent and succinct explanation of why this makes sense). But
it's an implementation choice the people who presently work on this
code strongly disagree with. They would prefer a model where, prior to
each internal processing step, a pass over the complete input data has
to be made in order to transform it into 'the super-secret internal
perl encoding', and, after any internal processing has been completed,
a second pass over all of the data has to be made in order to turn the
'super-secret internal perl encoding' back into something which is
useful for anything except being 'super secret' and 'internal to Perl'.

This sort of makes sense when assuming that perl is an island located
in strange waters and that it will usually keep mostly to itself
(figuratively speaking), and it makes absolutely no sense when 'some
perl code' performs one step of a multi-stage processing pipeline which
may possibly even include other perl code (since not even 'output of
perl' is supposed to be suitable to become 'input of perl').
 