Converting UTF-8 characters to &#xxx;

 
 
Hemant Shah
      02-25-2004


Folks,

I need to convert UTF-8 characters into their ordinal numbers (&#xxx;).
Is there a module to do it or do I have to write something?

How do I get started on it? I am new to Unicode encoding and I am still
trying to understand how UTF-8 characters are encoded.

Thanks.


--
Hemant Shah                           /"\  ASCII ribbon campaign
E-mail: (E-Mail Removed)              \ /  ---------------------
                                       X     against HTML mail
TO REPLY, REMOVE NoJunkMail           / \      and postings
FROM MY E-MAIL ADDRESS.
-----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
I haven't lost my mind,                Above opinions are mine only.
it's backed up on tape somewhere.      Others can have their own.
 
Ben Morrow
      02-25-2004

(E-Mail Removed) wrote:
> I need to convert UTF-8 characters into their ordinal numbers (&#xxx;).
> Is there a module to do it or do I have to write something?
>
> How do I get started on it? I am new to Unicode encoding and I am still
> trying to understand how UTF-8 characters are encoded.


Firstly, use Perl 5.8.

Next, read perldoc perluniintro. Basically, you don't need to worry
about how perl encodes its characters: you just make sure you mark each
data source correctly with its encoding, and perl'll handle the rest.

For finding ordinal numbers, perldoc -f ord.
For converting them to hex, perldoc -f sprintf.
For an easier way to do what you (probably) want to do, perldoc
PerlIO::encoding and perldoc Encode (the section on fallbacks).
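
A minimal sketch of the ord/sprintf route, assuming the text has already been decoded into Perl characters. The helper names char_ref_dec and char_ref_hex are made up for illustration and do not come from any module:

```perl
use strict;
use warnings;

# Hypothetical helper: decimal character reference for a single character.
sub char_ref_dec {
    my ($char) = @_;
    return sprintf '&#%d;', ord $char;
}

# Hypothetical helper: the same reference in hexadecimal form.
sub char_ref_hex {
    my ($char) = @_;
    return sprintf '&#x%X;', ord $char;
}

my $char = "\x{65E5}";              # U+65E5, a Japanese character
print char_ref_dec($char), "\n";    # prints "&#26085;"
print char_ref_hex($char), "\n";    # prints "&#x65E5;"
```

Both forms refer to the same code point; which one to store is a matter of taste.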

Ben

--
Joy and Woe are woven fine,
A Clothing for the Soul divine William Blake
Under every grief and pine 'Auguries of Innocence'
Runs a joy with silken twine. (E-Mail Removed)
 
Hemant Shah
      02-26-2004
While stranded on information super highway Ben Morrow wrote:
>
> (E-Mail Removed) wrote:
>> I need to convert UTF-8 characters into their ordinal numbers (&#xxx;).
>> Is there a module to do it or do I have to write something?
>>
>> How do I get started on it? I am new to Unicode encoding and I am still
>> trying to understand how UTF-8 characters are encoded.

>
> Firstly, use Perl 5.8.


I am using perl 5.8
>
> Next, read perldoc perluniintro. Basically, you don't need to worry
> about how perl encodes its characters: you just make sure you mark each
> data source correctly with its encoding, and perl'll handle the rest.


I am not worried about how perl stores the characters. The goal is to store
the characters in an ASCII format in the file.

Here is what we are trying to do. We will be translating our help/error
messages into Spanish, French, Japanese, etc.

I have written a perl script that reads an English sentence from the
database, connects to our translation software and gets the sentence
translated (the translated text is in UTF-8). I want to store this
in a database or a flat file in XML. This file
could contain English, Spanish, French and Japanese text, and I
want it to be in an 8-bit character set (ISO-8859-1). If I can convert
the Japanese characters into their ordinal numbers I can store the text
in "&#xxx;" format. I would write the perl script to convert the text
between UTF-8 and ordinal and back. Spanish and French characters can
be stored in the ISO-8859-1 character set without any problem using
the Encode module.
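
For the return trip mentioned here (from "&#xxx;" back to UTF-8), a rough sketch follows. It assumes the stored data is ISO-8859-1 bytes containing decimal or hexadecimal numeric references; expand_char_refs is a hypothetical helper name, and named entities such as &amp; are deliberately left untouched:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: expand &#NNN; and &#xHHHH; into characters in a
# single pass, so that text produced by one expansion is never rescanned.
sub expand_char_refs {
    my ($text) = @_;
    $text =~ s/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/chr(defined($1) ? hex($1) : $2)/ge;
    return $text;
}

# Bytes as they might sit in the ISO-8859-1 file (0xE9 is e-acute).
my $stored = "caf\xE9 &#26085;&#26412;";
my $chars  = expand_char_refs(decode('iso-8859-1', $stored));
my $utf8   = encode('UTF-8', $chars);    # UTF-8 bytes, ready to write out

printf "%d characters\n", length $chars;    # prints "7 characters"
```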



>
> For finding ordinal numbers, perldoc -f ord.
> For converting them to hex, perldoc -f sprintf.


I will take a look at the above docs.

Thanks.

> For an easier way to do what you (probably) want to do, perldoc
> PerlIO::encoding and perldoc Encode (the section on fallbacks).
>
> Ben
>
> --
> Joy and Woe are woven fine,
> A Clothing for the Soul divine William Blake
> Under every grief and pine 'Auguries of Innocence'
> Runs a joy with silken twine. (E-Mail Removed)


Alan J. Flavell
      02-26-2004
On Thu, 26 Feb 2004, Hemant Shah wrote:

> could contain English, Spanish, French and Japanese text, and I
> want it to be in an 8-bit character set (ISO-8859-1). If I can convert
> the Japanese characters into their ordinal numbers I can store the text
> in "&#xxx;" format. I would write the perl script to convert the text
> between UTF-8 and ordinal and back.


See the discussion here a few days ago. Subject was (unbelievable as
it might seem) "replace unicode characters by &#number; representation".

> Spanish and French characters can
> be stored in the ISO-8859-1 character set without any problem using
> the Encode module.


They can, indeed, but you said in the earlier part of your posting
that you want to use ASCII. Best be sure what it is that you want.

good luck

(And don't quote sigs, and other material not germane to your
followup. thanks.)
 
Hemant Shah
      02-26-2004
While stranded on information super highway Alan J. Flavell wrote:
> On Thu, 26 Feb 2004, Hemant Shah wrote:
>
>> could contain English, Spanish, French and Japanese text, and I
>> want it to be in an 8-bit character set (ISO-8859-1). If I can convert
>> the Japanese characters into their ordinal numbers I can store the text
>> in "&#xxx;" format. I would write the perl script to convert the text
>> between UTF-8 and ordinal and back.


I looked at the thread, but I do not think it can deal with double byte
characters.

>
> See the discussion here a few days ago. Subject was (unbelievable as
> it might seem) "replace unicode characters by &#number; representation".
>
>> Spanish and French characters can
>> be stored in the ISO-8859-1 character set without any problem using
>> the Encode module.


Yes, that is what I am doing.

>
> They can, indeed, but you said in the earlier part of your posting
> that you want to use ASCII. Best be sure what it is that you want.
>
> good luck
>
> (And don't quote sigs, and other material not germane to your
> followup. thanks.)


I am new to this and still reading various docs, so please bear with me if
I miss obvious things. Maybe if I explain what I am trying to do,
someone may have a better solution than what I am thinking of.

We are trying to translate all of our help/error messages to other
languages, currently ES, FR and JA.

The translations come back to us in an XML file with UTF-8 encoding (Open
Office doc). I use XML::Parser to parse the file.

I need to take the translations of each sentence and store them in the same file
with #ifdef around them, and also store them in a DB2 database which is
using the ISO-8859-1 character set.

The flat file is also in XML format. Based on the specified language our
pre-processor will extract XML code for English and the specified language
from it.

The file is also controlled by RCS. To keep things simple in the flat file and
database I am trying to convert everything to extended ASCII characters
(ISO-8859-1). ES and FR do not pose any problems; I am trying to figure out
how to store Japanese characters.

Example of the flat file:

#ifdef H5829
<?xml version='1.0' encoding='UTF-8'?>
<!-- **__**__**__**__**__**__**__**__**__**__**__**__**__**__**__** -->
<!-- Program: sent.1100 -->
<!-- Author: Name of the Author -->
<!-- Purpose: To describe content of sent 1100 -->
<!-- Project: H5829 -->
<!-- Version: XML 1.0 -->
<!-- Notes: -->
<!-- **_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_* _*_*_*_*__** -->
<!DOCTYPE sentsource SYSTEM "sent">
<sentsource>
<comment mod = 'H5829'
author = 'myself'
date = '20020624'
type = 'doconly' >
Initial programming.
</comment>
<filekey>1100</filekey>
<xinfo type = 'EN1'>
<sentence>
A master record is not associated with this entry so the suspense
number entered will not be verified.
</sentence>
</xinfo>
#ifdef H3436
<xinfo type = 'ES1'>
<sentence>
Un registro maestro no se asocia a esta entrada así que el número del suspenso
incorporado no será
</sentence>
</xinfo>
#endif H3436
#ifdef H3906
<xinfo type = 'FR1'>
<sentence>
French translation goes here.
</sentence>
</xinfo>
#endif H3906
#ifdef H4906
<xinfo type = 'JA1'>
<sentence>
Japanese translation goes here. I am thinking of putting "&#xxx;" here.
</sentence>
</xinfo>
#endif H4906
</sentsource>
#endif H5829




Thanks for your help.
 
Alan J. Flavell
      02-26-2004
On Thu, 26 Feb 2004, Hemant Shah wrote:

> I looked at the thread, but I do not think it can deal with double byte
> characters.


Perl (5.8 upwards) doesn't have "double byte characters", it has
"characters". How they are stored internally shouldn't concern you.

In other words, it's simpler than you imagine. But it can be helpful
to take a look at the complexity of what happens "under the covers" if
it helps to appreciate the simplicity of what you get on the surface.
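
A small illustration of the "characters, not bytes" point, assuming the input arrives as UTF-8 bytes. The helper name utf8_char_count is hypothetical. Note that this particular Japanese character takes three bytes in UTF-8, not two, yet after decoding it is simply one character:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: number of characters in a string of UTF-8 bytes.
sub utf8_char_count {
    my ($bytes) = @_;
    return length decode('UTF-8', $bytes);
}

# The UTF-8 encoding of U+65E5 is the three bytes E6 97 A5.
my $bytes = "\xE6\x97\xA5";
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d, code point: U+%04X\n",
    length($bytes), utf8_char_count($bytes), ord($chars);
# prints "bytes: 3, characters: 1, code point: U+65E5"
```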

> I need to take the translations of each sentence and store them in the same file
> with #ifdef around them, and also store them in a DB2 database which is
> using the ISO-8859-1 character set.


Uh-uh, so it really comes down to - not a Perl problem as such - but
dealing with a database that doesn't understand utf-8.

But yes, if you see any benefit in it, you _could_ retain iso-8859-1
characters as themselves, while turning non-iso-8859-1 characters into
their &#number; representations.
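
Done by hand, that idea might look roughly like the sketch below. It assumes UTF-8 bytes on input and ISO-8859-1 bytes on output; utf8_to_latin1_refs is a made-up name, and the sketch does not escape any literal "&" already present in the text:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: keep characters up to U+00FF as themselves and write
# everything above that range as a decimal &#number; reference.
sub utf8_to_latin1_refs {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);
    $chars =~ s/([^\x{00}-\x{FF}])/sprintf('&#%d;', ord $1)/ge;
    return encode('iso-8859-1', $chars);
}

# "cafe" with e-acute, a space, then two Japanese characters, as UTF-8 bytes.
my $latin1 = utf8_to_latin1_refs("caf\xC3\xA9 \xE6\x97\xA5\xE6\x9C\xAC");
print $latin1, "\n";    # the e-acute is the single byte 0xE9; the rest is ASCII
```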

The catch here is that if you do something which implies to Perl that
you are going beyond iso-8859-1, then it will "upgrade" your data from
8-bit bytes to utf-8 characters, and so your iso-8859-1 characters
will then, internally, be two bytes wide.

Perhaps this will become clearer as you gain familiarity with the
contents of the perluniintro and perlunicode documentation - much of
which probably goes way beyond what you need, but parts of which are
critical to your purpose.

But maybe there's a module that packages this away and does the work
for you. I'm looking at this just at the character-representation
level at the moment, and responding on that basis. Maybe others (or
on a group dedicated to XML such as comp.lang.xml) can offer
more-practical insights into available solutions.

> The file is also controlled by RCS. To keep things simple in the flat file and
> database I am trying to convert everything to extended ASCII characters
> (ISO-8859-1). ES and FR do not pose any problems; I am trying to figure out
> how to store Japanese characters.


Your plan to represent them as &#number; representations sounds OK to
me. Of course if you need to sort data, or process it in similar
ways, then you'll need to think carefully what you're doing.

hope this helps a bit.
 
Alan J. Flavell
      02-26-2004
On Thu, 26 Feb 2004, Alan J. Flavell wrote:

> on a group dedicated to XML such as comp.lang.xml)


Make that comp.text.xml - excuse me.

 
Ben Morrow
      02-26-2004

"Alan J. Flavell" <(E-Mail Removed)> wrote:
> Uh-uh, so it really comes down to - not a Perl problem as such - but
> dealing with a database that doesn't understand utf-8.
>
> But yes, if you see any benefit in it, you _could_ retain iso-8859-1
> characters as themselves, while turning non-iso-8859-1 characters into
> their &#number; representations.
>
> The catch here is that if you do something which implies to Perl that
> you are going beyond iso-8859-1, then it will "upgrade" your data from
> 8-bit bytes to utf-8 characters, and so your iso-8859-1 characters
> will then, internally, be two bytes wide.


The answer here is still to use Encode with FB_HTMLCREF: simply wrap all
calls to the database with subs that encode the data. You will have to
map & to &amp;amp; or &amp;#38; yourself.

I would say a good rule-of-thumb when dealing with 5.8 and Unicode is
'*never* read or write data to or from some external source without
running it through the Encode module'. Then you'll always know where you
stand.
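
A minimal sketch of such a wrapper, assuming the data is already held as Perl characters and the target is ISO-8859-1. The name to_db_bytes is hypothetical; encode() and Encode::FB_HTMLCREF are from the Encode module:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical wrapper for text on its way to the ISO-8859-1 database.
sub to_db_bytes {
    my ($chars) = @_;
    $chars =~ s/&/&amp;/g;    # escape literal ampersands before encoding
    # Characters that ISO-8859-1 cannot represent come out as &#NNN; (decimal).
    return encode('iso-8859-1', $chars, Encode::FB_HTMLCREF);
}

my $bytes = to_db_bytes("caf\x{E9} \x{65E5}\x{672C} & co");
print $bytes, "\n";    # the e-acute is the single byte 0xE9; the rest is ASCII
```

Escaping the ampersand first matters: done afterwards, it would also mangle the references that the fallback has just produced. Encode also provides FB_XMLCREF, which writes hexadecimal &#xHHHH; references instead.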

Ben

--
It will be seen that the Erewhonians are a meek and long-suffering people,
easily led by the nose, and quick to offer up common sense at the shrine of
logic, when a philosopher convinces them that their institutions are not based
on the strictest morality. [Samuel Butler, paraphrased] (E-Mail Removed)
 
Alan J. Flavell
      02-26-2004
On Thu, 26 Feb 2004, Ben Morrow wrote:

[quoting ajf:]
> > The catch here is that if you do something which implies to Perl that
> > you are going beyond iso-8859-1, then it will "upgrade" your data from
> > 8-bit bytes to utf-8 characters, and so your iso-8859-1 characters
> > will then, internally, be two bytes wide.

>
> The answer here is still to use Encode with FB_HTMLCREF: simply wrap all
> calls to the database with subs that encode the data.


Looks to be excellent advice to me. Which was why I referred back to
the previous thread for details...

> You will have to map & to &amp;amp; or &amp;#38; yourself.


Good point.

> I would say a good rule-of-thumb when dealing with 5.8 and Unicode is
> '*never* read or write data to or from some external source without
> running it through the Encode module'.


Where "external" also includes the database that the hon Usenaut is
using, right?

> Then you'll always know where you stand.


Once the questioner is up to speed on dealing with the data internally
to Perl, sure.

all the best
 