how to convert all invalid UTF-8 sequences to numeric equivalent?

 
 
Shambo
      06-25-2003
Hey folks,

I've been grappling with this for days, and can see no option but to
use brute force.

We have a ton of text files from all over the world, oftentimes
including invalid UTF-8 characters such as ø or £ (that was an o with
a line thru it, a la Scandinavian letters, and a British pound
sterling symbol). When I convert these text files to XML, the
resulting XML is not valid because it contains these characters. I can
map individual characters to their numerical equivalent (&#248; and
&#163; in this case), but I'm wary about performing such a conversion
for each and every non UTF-8 valid sequence I may find.

So my question is, has someone found a way to automate conversion of
these characters to their numerical equivalent without having to list
every single character? I searched for scripts and modules that might
do this, but didn't see any that jumped out at me.

Secondly, I had been doing brute-force checking for every non-UTF-8
valid sequence, and I might be doing it incorrectly. For example, if I
searched for the hex string \xA3, I was expecting to match on the £
symbol. Not so. I have to explicitly search for the symbol, not the
hex equivalent, because that's how it is in the text file.

To re-iterate:

$line =~ s/\xA3/\&#163\;/g;
does not work when the literal symbol is in the text. I thought
forcing Perl to find the hex version of any character would work. I
guess I'm missing something.

Any insight would be most appreciated.

thanks very much,
Shambo
 
Alan J. Flavell
      06-25-2003
On Wed, Jun 25, Shambo inscribed on the eternal scroll:

[Oh dear, this _is_ getting to be more like some
hypothetical comp.encoding group...]

> We have a ton of text files from all over the world, often times
> including invalid UTF-8 characters such as ø or £


Well, your posting was encoded in iso-8859-1, so if that's to be
taken seriously, then you haven't got utf-8. So what's the point of
trying to read it as utf-8? It doesn't even remotely resemble it
(aside from the characters that are us-ascii anyway...).

> (that was an o with
> a line thru it, a la Scandinavian letters, and a British pound
> sterling symbol).


In iso-8859-1 (or Windows-1252, not that I'd encourage that), they
would indeed be.

> When I convert these text files to XML, the
> resulting XML is not valid because it contains these characters.


This is because you're not telling XML what your character coding is.

> I can
> map individual characters to their numerical equivalent (&#248; and
> &#163; in this case),


It's a valid choice. But why the hell? If you want to represent them
in utf-8, then do so.

In Perl 5.8 you just tell the input file handle that its encoding is
iso-8859-1, and the output file handle that its encoding is utf-8, and
the job is done.

In earlier Perls you'd use the Encode module explicitly...
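A minimal sketch of that file-handle approach, assuming the input really is iso-8859-1. The sample bytes and the in-memory handles are illustrative stand-ins for real files, not anything taken from the original poster's data:

```perl
use strict;
use warnings;

# Hypothetical sample: "ø £" as iso-8859-1 bytes (0xF8 and 0xA3).
my $latin1_bytes = "\xF8 \xA3\n";
my $utf8_bytes   = '';

# In-memory handles stand in for real files; with files, pass a path
# instead of a scalar reference.
open my $in,  '<:encoding(iso-8859-1)', \$latin1_bytes or die "open in: $!";
open my $out, '>:encoding(UTF-8)',      \$utf8_bytes   or die "open out: $!";

while (my $line = <$in>) {
    print {$out} $line;    # characters in, UTF-8 bytes out
}
close $in;
close $out or die "close out: $!";

# $utf8_bytes now holds the bytes C3 B8 20 C2 A3 0A
```

The input layer turns bytes into characters and the output layer turns characters back into bytes, so no per-character mapping table is needed.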

> but I'm wary about performing such a conversion
> for each and every non UTF-8 valid sequence I may find.


Your mental model is way adrift, I'm afraid. This talk of "non utf-8
valid sequences" strikes me as a bit like counting what you've been
told is a stack of pound notes and then being surprised that the stack
doesn't contain US dollars.

> So my question is, has someone found a way to automate conversion of
> these characters to their numerical equivalent without having to list
> every single character?


Well yes, it's called an XML normaliser, and it's got nothing to do
with Perl. You'd tell it that it was getting iso-8859-1 input, and
that you wanted us-ascii output, and that's what it would do.

But why would you want to do that, when XML likes to get utf-8 anyway?

You have the choice of either delivering utf-8 as XML likes it as
default, or telling XML that it's getting iso-8859-1. Nothing to do
with Perl there, though.
 
Shambo
      06-26-2003
> Your mental model is way adrift, I'm afraid. This talk of "non utf-8
> valid sequences" strikes me as a bit like counting what you've been
> told is a stack of pound notes and then being surprised that the stack
> doesn't contain US dollars.


You're sort of correct. I am believing what I'm being told. After
checking the converted XML against the Xerces parser, it reports
errors as "invalid utf-8 sequence". When I look at the character it's
referring to, it's something along the lines of £.

> You have the choice of either delivering utf-8 as XML likes it as
> default, or telling XML that it's getting iso-8859-1. Nothing to do
> with Perl there, though.


It has everything to do with Perl since I'm using Perl to convert the
text files to XML. I'd like to take care of all my needs in this one
script instead of having to run all the files thru several steps.

I will take your advice and figure out how to tell Perl to write the
proper encoding on output.

thanks,
S
 
Shambo
      06-26-2003
File disciplines, encode_utf8 and Encode::String functions don't seem
to work. They will simply remove any character they don't like, or
replace it with a question mark.

The reason I asked about numeric equivalents (&#163;) is 'cause the
character gets properly represented when viewed in a web browser, and
the XML validates.

After MUCH education about character sets, encoding and modules, I see
why my previous post could be confusing.

Still, the problem remains. I need to preserve these characters
somehow.

many thanks for your help.
-S
 
Alan J. Flavell
      06-26-2003
On Thu, Jun 26, Shambo inscribed on the eternal scroll:

> File disciplines, encode_utf8 and Encode::String functions don't seem
> to work.


That doesn't get us anywhere. Sure they work.

> They will simply remove any character they don't like, or
> replace it with a question mark.


Where's your simple test script to demonstrate that assertion?

> The reason I asked about numeric equivalents (&#163;) is 'cause the
> character gets properly represented when viewed in a web browser, and
> the XML validates.


Sure, but the reason I didn't encourage you to follow that approach
and only that approach, was that you've given no clear idea of what
material you're going to be dealing with, and that could be a very
inefficient representation, even though, as you imply (and as my
character coding checklist points out), it's the safest way for people
who don't really understand what they're doing.

> Still, the problem remains. I need to preserve these characters
> somehow.


Isn't that what we've been working at all this time?

You don't need me to tell you that you can concatenate a & with a #
with ord($_) with a ; - that's elementary stuff. But if you didn't
tell Perl what you were reading-in in the first place (maybe it's
sometimes iso-8859-2, or koi8-r, we just don't know because you're
keeping us guessing) then you'll get the wrong answer. And if you
_do_ tell Perl correctly what you got, there should be no problem with
outputting utf-8 if that's what you wanted.
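The "elementary" concatenation could be sketched roughly as follows. This assumes the bytes are known to be iso-8859-1; the sample string is made up for illustration, and a different source encoding would need a different name in the decode step:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: bytes known (not guessed) to be iso-8859-1.
my $bytes = "caf\xE9 costs \xA35";

# Step 1: tell Perl what the bytes mean, yielding a character string.
my $text = decode('iso-8859-1', $bytes);

# Step 2: replace every non-ASCII character with &#NNN;, via ord().
(my $ascii = $text) =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;

print "$ascii\n";    # caf&#233; costs &#163;5
```

If the decode step names the wrong encoding, ord() still returns a number, just not the right one, which is the "wrong answer" warned about above.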

So do you want to make any progress with this or not?

> many thanks for your help.


You don't seem to have used much of it yet, but I'm hopeful that it
might be of some use to the occasional lurkers anyway.

 
Alan J. Flavell
      06-27-2003
On Thu, Jun 26, Alan J. Flavell inscribed on the eternal scroll:

> > You're sort of correct. I am believing what I'm being told. After
> > checking the converted XML against the Xerces parser, it reports
> > errors as "invalid utf-8 sequence".

>
> You must have either told it, or at least implied, that it was to
> expect utf-8 on input.


If you're still reading this thread:

http://xml.apache.org/xerces-c/faq-parse.html#faq-20

| I keep getting an error: "invalid UTF-8 character". What's wrong?

Sounds rather applicable, doesn't it?

> > When I look at the character it's
> > referring to, it's something along the lines of £.

>
> As I said, you're not correctly describing the input that you're
> giving it.


The FAQ says:

Most commonly, the XML encoding = declaration is either incorrect or
missing. Without a declaration, XML defaults to the use utf-8
character encoding, which is not compatible with the default text
file encoding on most systems.

The XML declaration should look something like this:

<?xml version="1.0" encoding="iso-8859-1"?>

Make sure to specify the encoding that is actually used by file. The
encoding for "plain" text files depends both on the operating system
and the locale (country and language) in use.
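As a rough illustration of keeping the declaration and the actual bytes in agreement, the sketch below builds a tiny document and encodes it with the same name that appears in the declaration. The element name and sample text are invented for the example:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $encoding = 'iso-8859-1';        # must match how the bytes are written
my $text     = "price: \x{A3}5";    # character string containing U+00A3

# The declaration names the same encoding that is used to produce the bytes.
my $xml = qq{<?xml version="1.0" encoding="$encoding"?>\n}
        . "<item>$text</item>\n";

my $bytes = encode($encoding, $xml);    # in iso-8859-1 the pound sign is byte A3
```

Changing `$encoding` to `'UTF-8'` changes the declaration and the bytes together, so the two cannot drift apart.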

Clear?

> > It has everything to do with Perl since I'm using Perl to convert the
> > text files to XML.


Didn't I say that it wasn't Perl-related? _Now_ would you believe me?

FAQs are good for you: take some frequently, and especially when
the symptoms occur. (SCNR).

have fun.
 
Shambo
      06-27-2003
I guess I should start over.

When we try to validate our XML, it tells us it doesn't like
characters like £, calling them "invalid UTF-8 sequences." I thought
if I could get Perl to translate characters like that to numeric
equivalent, the XML parser would not complain. These files will
eventually be displayed as HTML, so those characters would need to be
represented as numeric equivalent anyway.

So I was trying to identify the character set for all characters like
these, and I assumed that stuff like £ was out of the UTF-8 character
set range. I admit I was getting confused on the encoding issue.

And to answer one of your questions, I was telling Perl to output
utf8.

open(FILE, ">utf8", "$myfile");

Using this method would simply remove any character like £, leading me
to believe something like £ is a non-UTF-8 character.

I have no idea what the input format is, and after lots of
experimentation with :latin1, :text and the like, I let it go to the
default.

I now think I'll simply have to build my own mapping table to convert
these characters to their numeric equivalent so they will validate.

>>thanks for all your help.

> You don't seem to have used much of it yet, but I'm hopeful that it
> might be of some use to the occasional lurkers anyway.


I'm not sure why you say that, I've been reading your replies over and
over to make sure I get what you're saying. This experience has been
very informative, and I do sincerely appreciate it.

best,
S
 
Shambo
      07-09-2003
After MUCH self-educating on encoding, XML and good old Perl, I've
gained a lot of ground. Since these XML files will ultimately be
displayed in a web browser, I realized that ASCII was the best
encoding, and all non-ASCII characters would have to be mapped to
their numeric equivalent.

I did find a module which would do exactly what I was looking for
(more on that below), but could not get it to work properly, so I've
resorted to searching for all non-ASCII characters, and mapping them
myself. Not that hard. Still will try to get those modules working.

"Alan J. Flavell" wrote in message
> - convert the data to utf-8 coding before feeding it to the parser,
> since that's evidently what the parser expects by default.


This is where I was getting hung up first, not knowing really what
encoding meant, and completely missing the fact that symbols such as
£ can be represented in UTF-8.

> | Unknown open() mode '>utf8' [...]
>
> Something wrong, see?


Ouch, duh, yes I do see it. Should be ">:utf8" instead of ">utf8".

> Did you ever confirm that you really _are_ using Perl 5.8 ?


Perl 5.8 is in use. All modules are up to date as well.

> I'm confident that Perl already has the mapping table waiting for you
> to use it, if only you'd try to focus in on the issues.


I've found this to be true with the XML::UM module. It will take an
input stream and convert what it can to ASCII. Whatever doesn't
convert to ASCII, it converts to the numeric equivalent, based on the
XML::Encoding maps.

From the XML::UM synopsis:
# Create the encoding routine
my $encode = XML::UM::get_encode (
    Encoding => 'US-ASCII',
    EncodeUnmapped => \&XML::UM::encode_unmapped_dec);

# Convert a string from UTF-8 to the specified Encoding
my $encoded_str = $encode->($utf8_str);

However, the module seemed to have difficulty finding the paths to the
XML::Encoding maps, even tho I declared it in the script just as the
module instructed. I will continue to troubleshoot that particular
problem.
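For comparison, and not as a description of how XML::UM works, the core Encode module can produce a similar result without external mapping files. The sketch below uses its FB_HTMLCREF fallback, which emits decimal references for anything the target encoding cannot represent (FB_XMLCREF is the hexadecimal variant). The sample string is invented:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical character string: ø, £ and the euro sign.
my $text = "\x{F8}re: \x{A3}5, \x{20AC}6";

# LEAVE_SRC keeps $text intact; without it, encode() may consume its input
# when a CHECK value is given.
my $check = Encode::FB_HTMLCREF() | Encode::LEAVE_SRC();

# 'ascii' is Encode's name for US-ASCII.
my $ascii = encode('ascii', $text, $check);

print "$ascii\n";    # &#248;re: &#163;5, &#8364;6
```

This only works on a string Perl already treats as characters, so the input still has to be decoded from its real encoding first.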

> You're just not giving us enough concrete detail here to be able to
> advise you with actual code. Can't you put a sample of your input on
> a web page or something, so that we at least know what we're talking
> about?


So the code I've resorted to using looks like:

$string =~ s/\xA3/\&#163\;/g;

which would convert a £ to its numeric equivalent. This gets past the
parser, and also allows the character to be displayed in a web
browser.

I found a vastly helpful tutorial on encoding within Perl at
http://www.xml.com/pub/a/2000/04/26/...ngs/index.html. Along with
explaining lots and lots about encoding, and how to encode within
Perl, it highlights modules such as XML:OM, XML::UM and XML::Code,
all of which seem to be able to do what I (think I) want to do.

From the XML::Code synopsis:
This module is an experimental module, encoding various XML strings
from UTF-8 to ASCII + Unicode entities. Everything that is not pure
ASCII (US) is encoded as &#<nnn>;

Still trying to get these modules to work, but I at least have a
solution to work with. I do intend to get these modules working.
 
Alan J. Flavell
      07-09-2003
On Wed, Jul 9, Eric Schwartz inscribed on the eternal scroll:

> Shambo writes:
> > After MUCH self-educating on encoding, XML and good old Perl, I've
> > gained a lot of ground. Since these XML files will ultimately be
> > displayed in a web browser, I realized that ASCII was the best
> > encoding, and all non-ASCII characters would have to be mapped to
> > their numeric equivalent.


This is a total non-sequitur. Web browsers support a whole range of
document codings; while it's certainly a _legal_ option to represent
all characters by means of &-notation (e.g. &#number;) using nothing
more interesting than us-ascii, there is surely no _need_ to do so.
Indeed, XML is perfectly happy with utf-8, and so is any halfways
decent current web browser.

> One of the big advantages of XML is that it's completely independent
> of display format. Optimising for one presentation format might well
> make it more difficult to implement another later on.


I've no argument with that, but I don't see what relevance it has to
the above. The hon Usenaut is talking about how individual unicode
characters might be represented in source code, not about any detail
of their visual presentation.

Come to that, neither of the issues are closely on-topic for
comp.lang.perl.misc, so I won't pursue that avenue.

cheers
 
Alan J. Flavell
      07-09-2003
On Wed, Jul 9, Shambo inscribed on the eternal scroll:

> > You're just not giving us enough concrete detail here to be able to
> > advise you with actual code. Can't you put a sample of your input on
> > a web page or something, so that we at least know what we're talking
> > about?

>
> So the code I've resorted to using looks like:
>
> $string =~ s/\xA3/\&#163\;/g;


You haven't addressed the question, though. Here you're showing what
you reckon to be part of a solution, but you still haven't shown us
what your input data is like.

Is it encoded in utf-8? iso-8859-1? Windows-1252 (shudder),
utf-16LE or what?? If you won't show us, and you're not sure
yourself, it's hard to advise.
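When the encoding genuinely is unknown, one rough heuristic (an assumption of this sketch, not something proposed in the thread) is to try a strict UTF-8 decode and fall back to iso-8859-1 if it fails. The helper name `guess_decode` is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Returns (encoding name, decoded text). Falling back to iso-8859-1 is an
# assumption; it cannot tell iso-8859-1 from iso-8859-2, koi8-r, etc.
sub guess_decode {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode may modify its input when CHECK is set
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return ('UTF-8', $text) if defined $text;
    return ('iso-8859-1', decode('iso-8859-1', $bytes));
}

my ($enc1) = guess_decode("\xC2\xA3");    # valid UTF-8 for U+00A3
my ($enc2) = guess_decode("\xA3");        # lone A3 byte: not valid UTF-8
print "$enc1 $enc2\n";                    # UTF-8 iso-8859-1
```

Pure ASCII input also passes the UTF-8 check, which is harmless since the two encodings agree on those bytes. Knowing the real source encoding is still far better than guessing.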

> I found a vastly helpful tutorial on encoding within Perl at
> http://www.xml.com/pub/a/2000/04/26/...ngs/index.html. Along with
> explaining lots and lots about encoding, and how to encode within
> Perl,


But that's targeted at Perl 5.6, where you still had to invoke
the encoding modules explicitly. You're only making things (a bit)
more complicated for yourself by doing that, when with Perl 5.8
you can do it with the i/o encoding layers.

As the article says: both XML and Perl are quite happy to work
with unicode characters. The possible motivation for resorting to
&-notations would be when you have to tangle with non-XML applications
which might not be unicode-capable. If you have such a constraint, I
must admit I don't recall you saying so. And XML-based tools can map
between unicode characters and &-notation for you without fuss, if the
need arises.

> However, the module seemed to have difficulty finding the paths to
> the XML::Encoding maps, even tho I declared it in the script just as
> the module instructed.


I'm not personally familiar with that module, but in the 3-year-old
article that you cited, there are some notes on that very problem, did
you see?

> it highlights modules such as XML:OM, XML::UM and XML::Code,
> all of which seem to be able to do what I (think I) want to do.
>
> From the XML::Code synopsis:
> This module is an experimental module, encoding various XML strings
> from UTF-8 to ASCII + Unicode entities. Everything that is not pure
> ASCII (US) is encoded as &#<nnn>;


Well, if you're more comfortable with that, and can get it to work,
it's not technically wrong. I just don't think it's the way I'd want
to do it myself, and particularly with the features that 5.8 contains.

But maybe there are still features of your situation that you haven't
shown yet, that make it a preferable approach for you.

good luck
 