Velocity Reviews - Computer Hardware Reviews

Errors parsing Japanese chars

 
 
Sriv Chakravarthy
 
      07-08-2003
I am trying to use the xerces-c SAX parser to parse Japanese characters. I
have a <?xml... utf-8> line in the XML file. When the parser
encounters the Japanese characters it throws a UTFDataFormatException.
I am quite new to XML and I am not sure how to deal with this
situation.
Is there a way to parse the Japanese characters? Or should the Japanese
characters be escaped in the XML file (e.g. &#1234;) for this to work?
 
Alan J. Flavell
 
      07-08-2003
On Tue, Jul 8, Sriv Chakravarthy inscribed on the eternal scroll:

> I am trying to use the xerces-c SAX parser to parse Japanese characters. I
> have a <?xml... utf-8> line in the XML file. When the parser
> encounters the Japanese characters it throws a UTFDataFormatException.


That seems to indicate that the Japanese characters are not in fact
encoded in utf-8, then.
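
One way to test that suspicion outside the parser is to try decoding the raw bytes as UTF-8 and see where it fails. The following is only a rough sketch in Python, not anything xerces-c does internally; the helper name `first_invalid_utf8` and the sample document are made up for illustration.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, reason) for the first byte sequence that is not
    valid UTF-8, or None if the whole input decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, err.reason
    return None


# Hypothetical sample: a document that claims utf-8 in its declaration
# but whose Japanese text was actually saved as Shift_JIS.
prefix = b'<?xml version="1.0" encoding="utf-8"?><t>'
honest = prefix + "日本語".encode("utf-8") + b"</t>"
mislabelled = prefix + "日本語".encode("shift_jis") + b"</t>"

print(first_invalid_utf8(honest))       # None
print(first_invalid_utf8(mislabelled))  # offset of the first Japanese byte
```

If the reported offset lines up with the first Japanese character in the file, the declaration and the actual encoding most likely disagree.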

> I am quite new to xml and I am not sure how to deal with this
> situation.


Irrespective of xml or not xml, any text file needs to be accompanied
with information on its encoding if it's to be reliably read. (Modulo
some heuristics which claim to auto-recognise a limited number of
encodings[1]).
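
As a small illustration of the BOM case from footnote [1], the sketch below checks the first few bytes of a file against the byte-order marks defined in Python's standard `codecs` module. The function name `sniff_bom` is invented here, and a missing BOM tells you nothing either way, since UTF-8 files are often written without one.

```python
import codecs

# Longer marks are listed first: the UTF-32 LE mark starts with the
# same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the Unicode encoding implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"<a/>"))  # utf-8
print(sniff_bom(b"<a/>"))                    # None
```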

> Is there a way to parse the Japanese characters?


If I've understood what you're reporting, it's not a matter of
_parsing_ them, it's a matter of understanding them in the first
place.

> Or should the Japanese
> characters be escaped in the XML file (e.g. &#1234;) for this to work?


Not necessarily. And indeed it's a most inefficient way to represent
them if a large quantity of CJK text is involved. But yes, it's
certainly a legal possibility.
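
To put a rough number on that inefficiency, here is a short Python sketch using the built-in `xmlcharrefreplace` error handler, which turns every non-ASCII character into a decimal character reference. The three-character sample is arbitrary; note that this handler does not escape `&` or `<`, so it is not a complete XML escaper.

```python
sample = "日本語"

# Every character outside ASCII becomes a reference of the form &#NNNNN;
escaped = sample.encode("ascii", "xmlcharrefreplace")
utf8 = sample.encode("utf-8")

print(escaped)                  # b'&#26085;&#26412;&#35486;'
print(len(utf8), len(escaped))  # 9 24
```

For these characters each reference costs eight bytes against three in UTF-8.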

Can you view your data (e.g. as plain text) in a web browser? (Or if
you haven't got a web browser, try MSIE...) Which character encoding
does the browser need to be set to in order to make sense of the
Japanese? (You might try its auto-recognition options and, if that's
successful, check to see which encoding it has chosen.)

Then, if the encoding is one that's supported by the parser software,
just nominate it on the <?xml... thingy.
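
If no browser is to hand, a crude stand-in is to try a few likely encodings and see which ones decode without error. This Python sketch assumes a short candidate list of common Japanese encodings (my choice, not anything stated in the thread), and the helper name `decodable_as` is made up. More than one candidate can succeed, so a human still has to check that the decoded text makes sense.

```python
# Python codec names; the usual names for the XML declaration would be
# UTF-8, Shift_JIS, EUC-JP and ISO-2022-JP respectively.
CANDIDATES = ["utf-8", "shift_jis", "euc_jp", "iso2022_jp"]


def decodable_as(data: bytes, candidates=CANDIDATES):
    """Return the candidate encodings under which data decodes cleanly."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Hypothetical input: Japanese text that was saved as EUC-JP.
raw = "日本語".encode("euc_jp")
print(decodable_as(raw))

# Alternative to changing the declaration: transcode so that a utf-8
# declaration becomes true.
as_utf8 = raw.decode("euc_jp").encode("utf-8")
```

Either route works in principle: declare the encoding the file really uses, provided the parser supports it, or convert the file to match what it already declares.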

hope this helps.

[1] or of course the BOM, if you know for a fact that it's
a Unicode encoding that you're dealing with.
 