Convert UTF-8 to Unicode?

 
 
Andreas Schmidt
      08-06-2003
Hi,

my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the Euro
symbol.

However, the unicode for this symbol is 0x20AC.

How can I convert from UTF-8 to Unicode?

I'd like to do sth like:

if( $str =~ m/\x{20AC}/ ){
print "used euro";
}

but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of course...

Thanks for every hint!
Andi

 
Jürgen Exner
      08-06-2003
Andreas Schmidt wrote:
> my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the
> Euro symbol.
> However, the unicode for this symbol is 0x20AC.
> How can I convert from UTF-8 to Unicode?


Text::Iconv does a good job of converting between pretty much any pair of encodings.

jue


 
Bart Lateur
      08-06-2003
Andreas Schmidt wrote:

>my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the Euro
>symbol.


I assume you mean the UTF-8 looks like "\xE2\x82\xAC"?

>However, the unicode for this symbol is 0x20AC.
>
>How can I convert from UTF-8 to Unicode?
>
>
>I'd like to do sth like:
>
>if( $str =~ m/\x{20AC}/ ){
> print "used euro";
>}
>
>but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of course...


If you're sure the string contains valid UTF-8, all you have to do is
enable the UTF-8 flag of the string. If you're using Perl 5.8.0 or
above, you have the Encode module at your disposal. See the last
section in its POD, "The UTF-8 flag" and "Messing with Perl's
Internals". You'll see the function _utf8_on($scalar) mentioned there.

<http://www.perldoc.com/perl5.8.0/lib/Encode.html>


If you're using a Perl 5.6.x, you can emulate that function using pack()
(most likely it will work for 5.8.x, too):

$utf8 = pack "U0a*", $bytes;

$utf8 will contain a string with exactly the same bytes as $bytes, but
having the UTF-8 flag on.
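
For comparison, here is a minimal sketch of the other route that Encode offers in Perl 5.8 and later: decode() checks the bytes and returns a character string, rather than only switching on the internal flag. The variable names are illustrative and the input is hard-coded; a real CGI script would take $bytes from the request.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three UTF-8 octets for the euro sign, as the script would receive them.
my $bytes = "\xE2\x82\xAC";

# decode() turns the octets into a Perl character string.
my $chars = decode('UTF-8', $bytes);

# The match from the original question now works on characters.
print "used euro\n" if $chars =~ m/\x{20AC}/;

printf "%d octets became %d character(s)\n", length($bytes), length($chars);
```

Unlike the flag-only approach, this does not depend on the input already being well-formed UTF-8.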

--
Bart.
 
 
Alan J. Flavell
      08-06-2003
On Wed, Aug 6, Andreas Schmidt inscribed on the eternal scroll:

> my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the Euro
> symbol.


Then it would be best to use at least Perl 5.8.0 ...

> However, the unicode for this symbol is 0x20AC.


"Unicode" is an abstract concept - an identification of particular
characters with particular integer numbers ("code points" in the
Unicode character set). In order to actually _use_ those abstract
Unicode characters, it's necessary to have a way of representing them.
utf-8 is one particular way of representing them (and it just happens
to be Perl's own internal representation of Unicode, although you
don't need to know that in order to use it). Writing 0x20AC (or,
as the Unicode folks would write it, U+20AC) is just another way of
giving a concrete representation to the abstract character. None of
them is "Unicode" per se: all of them are representations of Unicode.
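
The distinction between one code point and its several representations can be made concrete with a short sketch, assuming Perl 5.8 or later with the Encode module. The helper hex_octets() is a hypothetical name introduced here only for display.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex octets.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
}

my $euro = "\x{20AC}";                      # one abstract character

my $as_utf8  = encode('UTF-8',    $euro);   # three octets
my $as_utf16 = encode('UTF-16BE', $euro);   # two octets

printf "code point: U+%04X\n", ord $euro;
print  "utf-8:      ", hex_octets($as_utf8),  "\n";   # E2 82 AC
print  "utf-16BE:   ", hex_octets($as_utf16), "\n";   # 20 AC
```

The same character comes out as different octet sequences depending on the encoding chosen, while the code point stays U+20AC throughout.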

> How can I convert from UTF-8 to Unicode?


utf-8 already _is_ (a representation of) Unicode.

> I'd like to do sth like:
>
> if( $str =~ m/\x{20AC}/ ){


Yup, that's another way of representing Unicode: it's Perl's way of
writing a "wide character" in source code.

Perhaps you could be a bit more precise about how this script
"receives" Unicode characters. Is it reading them directly from a
file (then it's easy in 5.8.0, you just open the file with :utf8), or
is it that you've decoded some HTML form submission data, and got
yourself a string of bytes which contains some utf-8 representations
of characters?

If it's the latter, and you really have to handle this yourself by
hand (it appears that recent versions of CGI.pm handle it for you, but
I have to admit to not trying that myself yet), then I think you want
pack() with a template of U0, as others have said.
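
For the file case, a rough sketch of what opening with an encoding layer looks like in Perl 5.8 and later. It writes a small temporary file first so that it is self-contained. The :encoding(UTF-8) layer is used here as the stricter relative of the :utf8 layer mentioned above; that substitution, and all file and variable names, are assumptions of this sketch.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file holding the raw UTF-8 octets for the euro sign.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xE2\x82\xAC\n";
close $out or die "Cannot close $path: $!";

# Reopen it with a decoding layer: Perl hands back characters, not octets.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

print "used euro\n" if $line =~ m/\x{20AC}/;
```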

> but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of course...


Sort-of; but I'd still recommend taking a bit of time out to study
relevant parts of
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html and then
http://www.perldoc.com/perl5.8.0/pod/perlunicode.html

to get a firmer understanding of what's going on, and how it's meant
to be used.
 
 
Alan J. Flavell
      08-06-2003
On Wed, Aug 6, Alan J. Flavell inscribed on the eternal scroll:

> On Wed, Aug 6, Andreas Schmidt inscribed on the eternal scroll:
>
> > How can I convert from UTF-8 to Unicode?

>
> utf-8 already _is_ (a representation of) Unicode.


I'm glad to see now that you got much the same answer to this point
when you posted the same question to the German-language Perl group.

But it's not nice to post the same question in several places without
informing the respective participants that you are doing that. It
leads to pointless duplication of effort by people who were trying to
help you.
 
 
Ted Zlatanov
      08-06-2003
On Wed, 06 Aug 2003, (E-Mail Removed) wrote:
> my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the
> Euro symbol.
>
> However, the unicode for this symbol is 0x20AC.
>
> How can I convert from UTF-8 to Unicode?
>
> I'd like to do sth like:
>
> if( $str =~ m/\x{20AC}/ ){
> print "used euro";
>}
>
> but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of
> course...


Start with "perldoc utf8" and "perldoc perlunicode." That will
probably do a large chunk of what you need.

The detailed answer depends a *lot* on your Perl version, your goals,
and your Unicode programming experience. You can also check CPAN for
UTF-8 modules that may be helpful:

http://search.cpan.org/search?query=UTF-8&mode=all

Ted
 
 
Alan J. Flavell
      08-06-2003
On Wed, Aug 6, Nigel Horne inscribed on the eternal scroll:

> I have been sent a file in UTF format,


If we're to believe your subject header, it's utf-8 (as opposed to
utf-16LE or utf-16BE or whatever...)

> that is a file with UTF characters.


utf-8 is a representation of Unicode characters. I don't know what
the term "UTF characters" would mean.

> If I cat(1) the file I correctly see the Japanese characters.


It sounds as if you have a utf-8-capable terminal, then.

> How do I display the same characters in Perl?


Not to be too trite, but you'd read them in and then you'd print them
out. Just where are you experiencing a problem?

> An "od -x" of the file looks like
> this:
>
> 0000000 a4e6 e79c a2b4


I think that's OK; I'm not too good with doing utf-8 in my head.

I don't grasp your problem yet. Where did you get so far? Are you
using Perl 5.8 ? Have you read the relevant perldoc pages? Are you
opening output and input with ":utf8"?
 
 
Alan J. Flavell
      08-07-2003
On Thu, Aug 7, Alan J. Flavell inscribed on the eternal scroll:

> On Wed, Aug 6, Nigel Horne inscribed on the eternal scroll:


> > An "od -x" of the file looks like
> > this:
> >
> > 0000000 a4e6 e79c a2b4

>
> I think that's OK; I'm not too good with doing utf-8 in my head.


The only octets in there which could be the first octet of a utf-8
character are the "e6" and "e7", and, since they are both of the form
"1110xxxx", each would be followed by two non-first octets (see the
utf-8 spec if you don't get this). Non-first octets have to be of the
form "10xxxxxx", i.e. one of 8x, 9x, ax or bx. The bytes appear to be in
the wrong order for that.
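
The bit-pattern check described above can be sketched in a few lines. The sub octet_kind() is a hypothetical helper, and it classifies by leading bits only; it does not reject every octet that the utf-8 spec forbids.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one octet by its leading bits.
sub octet_kind {
    my ($o) = @_;
    return 'single (ASCII)' if $o < 0x80;               # 0xxxxxxx
    return 'continuation'   if ($o & 0xC0) == 0x80;     # 10xxxxxx
    return 'lead of 2'      if ($o & 0xE0) == 0xC0;     # 110xxxxx
    return 'lead of 3'      if ($o & 0xF0) == 0xE0;     # 1110xxxx
    return 'lead of 4'      if ($o & 0xF8) == 0xF0;     # 11110xxx
    return 'invalid';
}

# The octets in the order that "od -x" appeared to show them.
for my $o (0xa4, 0xe6, 0xe7, 0x9c, 0xa2, 0xb4) {
    printf "%02x: %s\n", $o, octet_kind($o);
}
```

Read in that order, the sequence starts with a continuation octet and has one lead octet directly after another, which is why it cannot be valid as it stands.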

I think this is because od printed little-endian 16-bit units instead
of printing bytes in sequence. Could it be that the actual byte
sequence in question is:

e6 a4 9c , e7 b4 a2

If so, then that could indeed be a legal utf-8 sequence, representing
two CJK-unified characters, namely U+691C and U+7D22.
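
That guess can be checked mechanically. The sketch below treats each group in the od output as a little-endian 16-bit unit, writes the octets back out in file order, and decodes the result as utf-8. It assumes the od output came from a little-endian machine, as suggested above; variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The 16-bit units exactly as "od -x" printed them.
my @units = qw(a4e6 e79c a2b4);

# pack 'v' emits each unit low byte first, restoring the on-disk order.
my $octets = join '', map { pack 'v', hex $_ } @units;

print join(' ', map { sprintf '%02x', $_ } unpack 'C*', $octets), "\n";
# e6 a4 9c e7 b4 a2

# Decode strictly: die on malformed input, and leave $octets untouched.
my $chars = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "U+%04X\n", ord($_) for split //, $chars;
# U+691C
# U+7D22
```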

http://www.unicode.org/cgi-bin/GetUn...codepoint=691C
http://www.unicode.org/cgi-bin/GetUn...codepoint=7d22

I don't read CJK myself, sorry, so this is sheer guesswork, I have no
idea whether it makes sense in the original.

But I come back to the original question. It looks as if in Perl 5.8
you can simply read this in and print it out (having opened the files
with :utf8 if you hope for the data to make any kind of sense in the
program). So, at which point are you experiencing a problem?
 
 
Philip Newton
      08-31-2003
On Thu, 7 Aug 2003 14:09:18 +0200, "Alan J. Flavell"
<(E-Mail Removed)> wrote:

> If so, then that could indeed be a legal utf-8 sequence, representing
> two CJK-unified characters, namely U+691c and u+7d22.
>
> http://www.unicode.org/cgi-bin/GetUn...codepoint=691C
> http://www.unicode.org/cgi-bin/GetUn...codepoint=7d22
>
> I don't read CJK myself, sorry, so this is sheer guesswork, I have no
> idea whether it makes sense in the original.


In Japanese, that would make the word "kensaku", which EDICT translates
as "retrieval (vs), looking up (a word in a dictionary), searching for,
referring to", which looks very sensical to me.

Cheers,
Philip
--
Philip Newton <(E-Mail Removed)>
That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.
 
 
Alan J. Flavell
      09-01-2003
On Sun, Aug 31, Philip Newton inscribed on the eternal scroll:

> On Thu, 7 Aug 2003 14:09:18 +0200, "Alan J. Flavell"

^^^

> > I don't read CJK myself, sorry, so this is sheer guesswork, I have no
> > idea whether it makes sense in the original.

>
> In Japanese, that would make the word "kensaku", which EDICT translates
> as "retrieval (vs), looking up (a word in a dictionary), searching for,
> referring to", which looks very sensical to me.


Well, that was a real slow-burner of a thread, but thanks!

The O.P. never did come back with any further details, as far
as I can see. I hope he got a workable solution.

cheers

--
The following corrective action will be
taken in 0 milliseconds: No action
- seen in Win2K event viewer
 