Help needed parsing a UTF-8 XML file

 
 
Huzefa
 
      09-04-2004
I have an XML file encoded in UTF-8. The parser works fine when
there are only English characters in the file.

However, when I put some Chinese characters in the file, I get the
following error:

org.xml.sax.SAXParseException: Content is not allowed in prolog.
org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
com.xyz.pqr.ParseXmlFile.<init>(ParseXmlFile.java:34)
org.apache.jsp.index3_jsp._jspService(index3_jsp.java:59)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

I am setting the character encoding of the InputSource.
My code for doing so looks like this:

InputSource input = new InputSource(file); // file is a FileReader
input.setEncoding("UTF-8");

DOMParser parser = new DOMParser();
parser.parse(input);

How can I get it to read Chinese/Japanese characters?

Any help would be appreciated.

Thanks

Huzefa Khalil
 
 
 
 
 
Keith M. Corbett
 
      09-04-2004
"Huzefa" wrote:
> I have an XML file encoded in UTF-8. The parser works fine when
> there are only English characters in the file.
>
> However, when I put some Chinese characters in the file, I get the
> following error:
>
> org.xml.sax.SAXParseException: Content is not allowed in prolog.
> org.apache.xerces.parsers.DOMParser.parse(Unknown Source)


The error suggests the XML may not be well-formed.

It would be easier to diagnose this by looking at a set of sample XML files.
Can you upload some samples to a server somewhere with public access? Or
send me a zip file, email to kmc(at)world.std.com.

/kmc


 
 
 
 
 
Malcolm Dew-Jones
 
      09-04-2004
Huzefa wrote:
: I have an XML file encoded in UTF-8. The parser works fine when
: there are only English characters in the file.

: However, when I put some Chinese characters in the file, I get the
: following error:

: org.xml.sax.SAXParseException: Content is not allowed in prolog.

Perhaps you put some white space at the top of the file. The <? must be
the very first thing, and perhaps no white space before the first tag's <
either.

 
 
Keith M. Corbett
 
      09-05-2004
"Malcolm Dew-Jones" wrote:
> Huzefa wrote:
> : I have an XML file encoded in UTF-8. The parser works fine when
> : there are only English characters in the file.
>
> : However, when I put some Chinese characters in the file, I get the
> : following error:
>
> : org.xml.sax.SAXParseException: Content is not allowed in prolog.
>
> Perhaps you put some white space at the top of the file. The <? must be
> the very first thing, [snip]


I believe a Unicode Byte Order Mark (BOM) may precede the XML declaration.
Per the XML 1.1 TR:

"Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin
with the Byte Order Mark described in ISO/IEC 10646" etc.
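
To see whether a particular file really starts with a UTF-8 BOM, one can inspect its first three bytes (EF BB BF). A minimal sketch only: the class and method names (`BomCheck`, `startsWithUtf8Bom`) are made up for illustration, and the file path is assumed to come from the command line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of any library.
class BomCheck {
    // True if the data begins with the UTF-8 encoding of U+FEFF (bytes EF BB BF).
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of the XML file to inspect.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(startsWithUtf8Bom(data) ? "UTF-8 BOM present" : "no UTF-8 BOM");
    }
}
```

Editors that save "UTF-8 with signature" typically write these three bytes, which would explain a file that changes behaviour after being re-saved with non-ASCII text.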

> and perhaps no white space before the first tag's <
> either.


I believe white space may appear in the prolog, after the XML declaration
and before or after the document type declaration.

[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?

[27] Misc ::= Comment | PI | S
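
The two productions can be checked against a parser. The sketch below is an illustration, not a definitive test: it uses the JDK's built-in JAXP parser rather than the Xerces `DOMParser` from the original post, and the class name `PrologWhitespace` and the sample strings are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class; the sample documents are made up for illustration.
class PrologWhitespace {
    // True if the default JAXP parser accepts the text as well-formed XML.
    static boolean parses(String xml) throws Exception {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Fatal well-formedness errors surface as SAXParseException.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // White space and a comment after the declaration: allowed by Misc*.
        System.out.println(parses("<?xml version=\"1.0\"?>\n\n<!-- note -->\n<root/>"));
        // White space before the declaration: the declaration is no longer first.
        System.out.println(parses(" <?xml version=\"1.0\"?><root/>"));
        // No declaration at all, leading white space only: XMLDecl is optional.
        System.out.println(parses("\n<root/>"));
    }
}
```

The first and third samples should be accepted and the second rejected, which matches the grammar: white space is fine anywhere in the prolog except in front of an XML declaration.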

/kmc


 
 
Chris Uppal
 
      09-05-2004
Huzefa wrote:

> InputSource input = new InputSource(file); // file is a FileReader
> input.setEncoding("UTF-8");


From the JavaDoc for org.xml.sax.InputSource.setEncoding():

This method has no effect when the application provides a character stream.

which may be your problem, since you are providing a character stream in your
constructor. There's more information in the intro to the class in the same
JavaDoc.
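
One way to act on this is to hand the parser a byte stream instead of a `FileReader`, so that the parser itself works out the encoding from the BOM or the XML declaration. A minimal sketch under some assumptions: it uses the JDK's JAXP `DocumentBuilder` in place of the Xerces `DOMParser` from the original post (the `InputSource` part is the same for both), and `ByteStreamParse` is a made-up name.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical wrapper for illustration.
class ByteStreamParse {
    // Parses raw bytes; the parser detects the encoding itself.
    static Document parse(InputStream bytes) throws Exception {
        // InputSource(InputStream) supplies a byte stream, not a character
        // stream, so no platform-default decoding happens before parsing.
        InputSource input = new InputSource(bytes);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(input);
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path of the UTF-8 XML file.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Document doc = parse(in);
            System.out.println("root element: " + doc.getDocumentElement().getNodeName());
        }
    }
}
```

A `FileReader` decodes with the platform default charset on the JVMs of that era, so UTF-8 bytes (including a BOM, if present) could reach the parser as stray characters ahead of `<?xml`, which would be consistent with the "Content is not allowed in prolog" message.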

BTW, on the subject of the BOM (which someone mentioned elsewhere in this
thread) the JavaDoc for that constructor states:

The character stream shall not include a byte order mark.
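
If a character stream is really required, the bytes can be decoded explicitly as UTF-8 and a leading U+FEFF dropped before the reader is given to `InputSource`. This is only a sketch: `BomSkippingReader` and `open` are invented names, and it relies on Java's UTF-8 decoder passing a BOM through as the character U+FEFF rather than removing it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration.
class BomSkippingReader {
    // Decodes the bytes as UTF-8 and discards one leading U+FEFF, if present.
    static Reader open(InputStream bytes) throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(bytes, StandardCharsets.UTF_8));
        int first = reader.read();
        if (first != -1 && first != 0xFEFF) {
            reader.unread(first); // not a BOM: put the character back
        }
        return reader;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path of the UTF-8 XML file.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(open(in)));
            System.out.println("root element: " + doc.getDocumentElement().getNodeName());
        }
    }
}
```

Passing the byte stream directly is usually simpler; a reader like this is mainly useful when some other API insists on a `Reader`.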

HTH.

-- chris



 
 
 
 