How to decode this unicode-hex string

 
 
* Tong *
      02-25-2005
Hi,

When I select from non-English web sites and paste into my emacs,
sometimes I get a unicode-hex string like this: \u82f1\u6587, which was
"English" in Big5 encoding.

I'm wondering how I can decode such strings and return the 8-bit character.

So far I've been looking into the following Perl modules' man pages and
tried each one of them: Unicode::UTF8simple, Unicode::String,
Unicode::Lite. None of them seems to be able to do that. They handle
unicode-hex strings like this: "U+00d6 U+00d0 U+00b9 U+00fa". The
difference between the above representation is that, the \u82f1 represent
one 8-bit character, while in Perl it is represented in two U+00xx values.

I had also played with tcl decodings, but wasn't successful. Please help.

Thanks a lot!

tong

--
Tong (remove underscore(s) to reply)
*niX Power Tools Project: http://xpt.sourceforge.net/
- All free contribution & collection
 
 
phaylon
      02-25-2005
* Tong * wrote:

> I'm wondering how I can decode such strings and return the 8-bit
> character.


Sometimes I think all some people read from this group before posting is
the name. Look at the thread right before yours.

--
http://www.dunkelheit.at/

The eternal mistake of mankind is to set up an attainable ideal.
-- Aleister Crowley

 
 
* Tong *
      02-25-2005
On Fri, 25 Feb 2005 17:42:09 +0100, phaylon wrote:

>> I'm wondering how I can decode such strings and return the 8-bit
>> character.

>
> Sometimes I think all some people read from this group before posting is
> the name. Look at the thread right before yours.


Can you at least specify the thread subject if you want to help? Did you
mean the thread "How to convert latin1 to utf8"? Did you see that I've tried the
Unicode::String (and much more) before the posting? After all, have you
read the two threads carefully and seen the giant difference between them?


 
 
phaylon
      02-25-2005
* Tong * wrote:

> Can you at least specify the thread subject if you want to help?


No, that's your job. My job is to code. But sometimes I take breaks. And,
I'm sorry if this is offensive to you, but I'm not willing to spend my
breaks doing someone else's work.

> Did you mean the thread "How to convert latin1 to utf8"?


Bingo.

> Did you see that I've tried the Unicode::String (and much more) before
> the posting?


Yeah. And I said there I would try out Encode, have you done that?

> After all, have you read the two threads carefully and seen the giant
> difference between them?


Nope, clear me up.

--
http://www.dunkelheit.at/
That is not dead, which can eternal lie,
and with strange aeons even death may die.
-- H.P. Lovecraft

 
 
RedGrittyBrick
      02-25-2005
* Tong * wrote:
> Hi,
>
> When I select from non-English web sites and paste into my emacs,
> sometimes I get a unicode-hex string like this: \u82f1\u6587, which was
> "English" in Big5 encoding.


I'm confused. Unicode and Big5 are completely different, aren't they? For
one thing, Unicode is a character set for which there are several
encodings, such as UTF-8.

u82f1 and u6587 are Chinese characters in Unicode. They are within the
CJK Unified Ideographs 4E00-9FAF.
http://www.unicode.org/charts/PDF/U4E00.pdf
Together they form the Chinese word whose English translation is the
word "English".

> I'm wondering how I can decode such strings and return the 8-bit character.


An 8-bit character set would surely not be large enough to contain a
usable subset of the Chinese ideographs. Big 5 has 13,000 ideographs. An
8-bit character set has room for 256 at most.

When you say "the 8 bit character" are you thinking of something like
the ISO 8859-1 Latin-1 character set?

Without a Chinese-English dictionary, there's no way to "decode" the two
Chinese ideograms u82f1 u6587 into the seven English letters u0045 u006e
u0067 u006c u0069 u0073 u0068.

> So far I've been looking into the following Perl modules' man pages and
> tried each one of them: Unicode::UTF8simple, Unicode::String,
> Unicode::Lite. None of them seems to be able to do that. They handle
> unicode-hex strings like this: "U+00d6 U+00d0 U+00b9 U+00fa". The
> difference between the above representation is that,
> the \u82f1 represent one 8-bit character,

No it doesn't!

> while in Perl it is represented in two U+00xx values.

Two U+00xx values represent *TWO* Latin-1 characters.
 
 
Alan J. Flavell
      02-25-2005
On Fri, 25 Feb 2005, * Tong * wrote:

> the \u82f1 represent one Chinese character,


Yes

> which is in two 8-bit characters


No way. As written, it's six *characters*. Encoded, it might be
two *bytes* (depends on the encoding).
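The distinction can be checked with a minimal sketch, assuming the core Encode module is available and using "big5-eten", the Big5 variant that ships with it. The variable names are illustrative only and do not come from any post in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The escape as pasted: a backslash, a "u", and four hex digits.
my $escaped = '\u82f1';

# The single Unicode character that the escape denotes.
my $char = chr(0x82f1);

# How many bytes that one character needs depends on the encoding.
# A fresh copy is passed each time so the original is never touched.
my %bytes;
for my $enc (qw(big5-eten UTF-8)) {
    my $copy = $char;
    $bytes{$enc} = length(encode($enc, $copy));
}

printf "escaped text: %d characters\n", length($escaped);
printf "decoded text: %d character\n",  length($char);
printf "%-10s %d bytes\n", $_, $bytes{$_} for sort keys %bytes;
```

On a stock Perl this should report six characters for the escape, one character once decoded, two bytes in Big5 and three in UTF-8.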

> Anyway, I figured out a way to do it, without any of the
> aforementioned unicode packages.


But you're not going to tell us what it is?
 
 
* Tong *
      02-27-2005
On Fri, 25 Feb 2005 21:42:38 +0000, Alan J. Flavell wrote:

>> Anyway, I figured out a way to do it, without any of the
>> aforementioned unicode packages.

>
> But you're not going to tell us what it is?


Well, it actually has nothing to do with unicode. Here is what I did to
decode such a string:

perl -pe 's / \\u([0-9a-f]+) / chr(hex($1)) /giex;' 2>/dev/null;


 
 
Alan J. Flavell
      02-28-2005
On Sun, 27 Feb 2005, * Tong * wrote:

> > But you're not going to tell us what it is?

>
> Well, it actually has nothing to do with unicode.


Actually, it has a great deal to do with Unicode...

> Here is what I did to decode such string:
>
> perl -pe 's / \\u([0-9a-f]+) / chr(hex($1)) /giex;' 2>/dev/null;


Fine. chr(hex($1)) is the Unicode character in question - in Perl's
native representation.
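For anyone who wants the same idea as a script rather than a one-liner, here is a rough sketch. It assumes each escape has exactly four hex digits, and it encodes the result explicitly on output rather than discarding the "Wide character in print" warning with 2>/dev/null. The helper name decode_u_escapes is made up for this example, and UTF-8 is only one possible output encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn \uXXXX escapes (exactly four hex digits)
# into the characters they name. All other text passes through unchanged.
sub decode_u_escapes {
    my ($text) = @_;
    $text =~ s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge;
    return $text;
}

my $decoded = decode_u_escapes('\u82f1\u6587 means English');

# $decoded now holds characters, not bytes. Pick the byte encoding
# at the moment of output; 'big5-eten' could be used here instead.
binmode(STDOUT, ':raw');
print encode('UTF-8', $decoded), "\n";
```

Limiting the match to four digits keeps the pattern from swallowing hex-looking letters that happen to follow an escape. Characters outside the Basic Multilingual Plane, which some sources write as a pair of surrogate escapes, are not handled by this sketch.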

Thanks. It just goes to show how seamless Perl's Unicode
implementation is, when one can use it without even believing in it.

Perhaps our questioner on another thread, who's determined to prevent
Perl's unicode from working for him, could take a lesson from this.

all the best
 