Detect XML document encodings with SAX

 
 
Arne Vajhøj
Guest
Posts: n/a
 
      11-24-2012
Sebastian wrote:
> I discovered this post:
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
> and implemented both approaches (SAX and Xerces XNI).
>
> Unfortunately, for the attached XML file, both methods
> output an encoding of UTF-8, while looking at the file


I tried, and I cannot get it to work either.

SAX reports UTF-8 no matter what the encoding really is.

StAX never seems to report an encoding, and W3C DOM always
seems to report the correct one.

I cannot offer an explanation. Obviously the parsers need to
detect the encoding correctly internally. Otherwise they
could not parse correctly.

Code below.

Arne

====

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.helpers.DefaultHandler;

public class XmlEncodingDectect {
    private static final String FNM1 = "/work/foobar1.xml";
    private static final String FNM2 = "/work/foobar2.xml";
    private static final String FNM3 = "/work/foobar3.xml";

    private static void gen1() throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(FNM1));
        pw.println("<?xml version='1.0' encoding='UTF-8'?>");
        pw.println("<root/>");
        pw.close();
    }

    private static void gen2() throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(FNM2));
        pw.println("<?xml version='1.0' encoding='ISO-8859-1'?>");
        pw.println("<root/>");
        pw.close();
    }

    private static void gen3() throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(FNM3));
        pw.println("<?xml version='1.0'?>");
        pw.println("<root/>");
        pw.close();
    }

    private static String encoding;

    private static String detectSAX(String fnm) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            private Locator2 locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                if (locator instanceof Locator2) {
                    this.locator = (Locator2) locator;
                } else {
                    encoding = "Unknown";
                }
            }

            @Override
            public void startDocument() throws SAXException {
                if (locator != null) {
                    encoding = locator.getEncoding();
                }
            }
        });
        parser.parse(new InputSource(new FileInputStream(fnm)));
        return encoding;
    }

    private static String detectW3CDOM(String fnm) throws ParserConfigurationException,
            FileNotFoundException, SAXException, IOException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(new InputSource(new FileInputStream(fnm)));
        String encoding = doc.getXmlEncoding();
        return encoding != null ? encoding : "Unknown";
    }

    private static String detectStAX(String fnm) throws FileNotFoundException,
            XMLStreamException {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileInputStream(fnm));
        String encoding = null;
        while (xsr.hasNext()) {
            xsr.next();
            switch (xsr.getEventType()) {
                case XMLStreamReader.START_DOCUMENT:
                    encoding = xsr.getEncoding();
                    break;
                default:
                    break;
            }
        }
        return encoding != null ? encoding : "Unknown";
    }

    public static void main(String[] args) throws IOException, SAXException,
            ParserConfigurationException, XMLStreamException {
        gen1();
        System.out.println(detectSAX(FNM1));
        System.out.println(detectW3CDOM(FNM1));
        System.out.println(detectStAX(FNM1));
        gen2();
        System.out.println(detectSAX(FNM2));
        System.out.println(detectW3CDOM(FNM2));
        System.out.println(detectStAX(FNM2));
        gen3();
        System.out.println(detectSAX(FNM3));
        System.out.println(detectW3CDOM(FNM3));
        System.out.println(detectStAX(FNM3));
    }
}
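
A possible reason the StAX variant above never reports anything: an XMLStreamReader is already positioned on START_DOCUMENT when it is created, so a loop that calls next() before checking the event type never sees that event. The sketch below is not from the thread. It queries the reader before advancing, and it uses getCharacterEncodingScheme(), which the StAX API documents as returning the encoding declared in the XML declaration (null if none is declared). The class name StaxDeclaredEncoding and the in-memory sample document are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxDeclaredEncoding {
    /** Returns the encoding named in the XML declaration, or null if none is declared. */
    static String declaredEncoding(InputStream in) throws XMLStreamException {
        XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            // A new reader starts on START_DOCUMENT, so query it before calling next().
            if (xsr.getEventType() == XMLStreamConstants.START_DOCUMENT) {
                return xsr.getCharacterEncodingScheme();
            }
            return null;
        } finally {
            xsr.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hypothetical sample document, held in memory instead of /work/foobar2.xml.
        byte[] doc = "<?xml version='1.0' encoding='ISO-8859-1'?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(new ByteArrayInputStream(doc)));
    }
}
```

Because only the start of the stream is consumed, this should also be usable on large files, although how much a given StAX implementation buffers up front is not something the thread establishes.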

 
 
 
 
 
Arne Vajhøj
Guest
Posts: n/a
 
      11-24-2012
On 11/21/2012 2:31 PM, Lew wrote:
> Sebastian wrote:
>> I discovered this post:
>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>>
>> and implemented both approaches (SAX and Xerces XNI).
>>
>> Unfortunately, for the attached XML file, both methods

>
> Don't do attachments on Usenet.
>
>> output an encoding of UTF-8, while looking at the file

>
> as they should.


No.

If the XML prolog specifies an encoding other than UTF-8,
then it should not return UTF-8.

> XML should be encoded in UTF-8 nearly always.


XML allows for other encodings.

And Java XML parsers support it.

So it should always work.

> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?


Output usually means System.out.println - that works fine with a parser.

> If your problem is with reading the file, then the encoding in the XML declaration
> should suffice to guide the parser. But then why do you talk about methods that
> "output an encoding"?


Because he wants to know what it is.

> However, according to
> http://xmlwriter.net/xml_guide/xml_d...shtml#Encoding
> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
> and EUC-JP,
> as you would have learned had you researched your question.
>
> So it looks like you must not accept XML documents with such a
> non-standard encoding.


Those who have researched would know that the XML spec does not
limit the encodings at all. An XML processor must support UTF-8
and UTF-16, but is free to support others.

Arne





 
 
 
 
 
Arne Vajhøj
Guest
Posts: n/a
 
      11-24-2012
On 11/22/2012 2:18 AM, markspace wrote:
> On 11/21/2012 10:41 PM, Sebastian wrote:
>> The answer cannot be that windows-1250 is non-standard. In fact, the
>> declared encoding of the XML file does not seem to matter. The code will
>> always output "UTF-8".
>>

>
> Maybe this quote from the article will help you out:
>
> "This approach works 90 percent of the time, maybe a little more. But
> SAX parsers aren't required to support the Locator interface, much less
> Locator2, and a few don't. A second option, if you know you're using
> Xerces, is to work with XNI"
>
> Since the output of the program is "unknown", I'd guess that this
> particular SAX parser doesn't support Locator2, like it says.


Except that it does not return Unknown - it returns UTF-8.

Arne


 
 
Arne Vajhøj
Guest
Posts: n/a
 
      11-24-2012
On 11/23/2012 6:13 PM, Peter J. Holzer wrote:
> On 2012-11-22 11:24, Roedy Green <> wrote:
>> On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian
>> <> wrote, quoted or indirectly quoted
>> someone who said :
>>> Does anyone have an idea why that is so? And how I could
>>> go about making some XML parser determine the correct encoding?

>>
>> See http://mindprod.com/products2.html#ENCODINGRECOGNISER
>>
>> This is a manual assist tool to help you guess the encoding.

>
> No need to guess.
>
>> Encodings are not embedded in any way in files. You just have to know.

>
> Not true for XML. The file Sebastian posted starts with
>
> <?xml version="1.0" encoding="windows-1250"?>


New around here?

Don't expect Roedy's posts to relate that much to what he is
replying to.

Arne


 
 
Lew
Guest
Posts: n/a
 
      11-24-2012
Arne Vajhøj wrote:
> Lew wrote:
>> Sebastian wrote:

[snip]
>>> output an encoding of UTF-8, while looking at the file

>> as they should.

>
> No.
>
> If the XML prolog specifies an encoding other than UTF-8,
> then it should not return UTF-8.


True, but I'm saying they should specify UTF-8 in the prolog.

>> XML should be encoded in UTF-8 nearly always.


See?

> XML allows for other encodings.


So? You should use UTF-8 nearly always, i.e., unless there's a compelling
reason not to.

> And Java XML parsers support it.


For those rare times when you deviate from the usual UTF-8.

> So it should always work.


>> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?

>
> Output usually means System.out.println - that works fine with a parser.


His phrasing wasn't clear to me. That's why I asked for clarification.

I could have guessed, too.

>> If your problem is with reading the file, then the encoding in the XML declaration


See? You're preaching to the choir.

>> should suffice to guide the parser. But then why do you talk about methods that


>> "output an encoding"?

>
> Because he wants to know what it is.
>
>> However, according to
>> http://xmlwriter.net/xml_guide/xml_d...shtml#Encoding
>> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
>> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
>> and EUC-JP,
>> So it looks like you must not accept XML documents with such a
>> non-standard encoding.

>
> Those who have researched would know that the XML spec does not
> limit the encodings at all. An XML processor must support UTF-8
> and UTF-16, but is free to support others.


Perhaps the OP's parser doesn't exercise that freedom, judging by the
symptoms.

'sall I'm sayin'.

Obviously I don't know the answer, but he's asking for suggestions
to investigate, AIUI. He's having encoding problems. His XML is apparently
encoded in Windows-1252, a notoriously funky encoding especially for
the variety of characters with which one might wish to deal. So why not
investigate obtaining material that isn't in such a notoriously funky
encoding, like, oh, say, the old reliable standard UTF-8?

Perhaps that isn't feasible, for reasons as yet unstated, but that's
the nature of brainstorming.

--
Lew
 
 
Sebastian
Guest
Posts: n/a
 
      11-24-2012
Sebastian wrote:
> I discovered this post:
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
> and implemented both approaches (SAX and Xerces XNI).
>
> Unfortunately, for the attached XML file, both methods
> output an encoding of UTF-8, while looking at the file


On 24.11.2012 11:14, Lew wrote:
[snip]
>
> Obviously I don't know the answer, but he's asking for suggestions
> to investigate, AIUI. He's having encoding problems. His XML is apparently
> encoded in Windows-1252, a notoriously funky encoding especially for
> the variety of characters with which one might wish to deal. So why not
> investigate obtaining material that isn't in such a notoriously funky
> encoding, like, oh, say, the old reliable standard UTF-8?
>
> Perhaps that isn't feasible, for reasons as yet unstated, but that's
> the nature of brainstorming.


Here's the background to my question:
I am dealing with other people's code that processes XML files.
Unfortunately, that code, which I have no control over, seems to use
some home-grown parsing algorithm, which DOES NOT always detect
encodings correctly, but expects to be told them.

The XML files come from several sources in different encodings, and I
cannot dictate anything there either.

So I thought, well, why don't I add a little preprocessor to discover
the encoding to give to that terrible file processor I'm stuck with.
Shouldn't be that hard, because, as Arne said:

> On 24.11.2012 03:11, Arne Vajhøj wrote:
> Obviously the parsers need to
> detect the encoding correctly internally. Otherwise they
> could not parse correctly.


The only approach that seems to work (at least for Arne), namely
W3C DOM, is out of the question for me, because the files are
potentially huge and I cannot keep a complete document model in memory.
I need something along the lines of SAX. I'll have to look around some more.

-- Sebastian

PS: The author of that article from which I took the code isn't just
anyone. Elliotte Rusty Harold hosts the XML web site
http://www.cafeconleche.org/ and is affiliated with the University of
North Carolina. Perhaps I could try to get in touch with him.


 
 
Arne Vajhøj
Guest
Posts: n/a
 
      11-24-2012
On 11/24/2012 4:18 PM, Sebastian wrote:
> On 24.11.2012 11:14, Lew wrote:
> [snip]
>>
>> Obviously I don't know the answer, but he's asking for suggestions
>> to investigate, AIUI. He's having encoding problems. His XML is
>> apparently
>> encoded in Windows-1252, a notoriously funky encoding especially for
>> the variety of characters with which one might wish to deal. So why not
>> investigate obtaining material that isn't in such a notoriously funky
>> encoding, like, oh, say, the old reliable standard UTF-8?
>>
>> Perhaps that isn't feasible, for reasons as yet unstated, but that's
>> the nature of brainstorming.

>
> Here's the background to my question:
> I am dealing with other people's code that processes XML files.
> Unfortunately, that code, which I have no control over, seems to use
> some home-grown parsing algorithm, which DOES NOT always detect
> encodings correctly, but expects to be told them.
>
> The XML files come from several sources in different encodings, and I
> cannot dictate anything there either.


I would consider it tempting to rewrite that app to use a standard
XML parser.

It would solve this problem and possibly also some future problems.

> So I thought, well, why don't I add a little preprocessor to discover
> the encoding to give to that terrible file processor I'm stuck with.
> Shouldn't be that hard, because, as Arne said:
>
> > On 24.11.2012 03:11, Arne Vajhøj wrote:
> > Obviously the parsers need to
> > detect the encoding correctly internally. Otherwise they
> > could not parse correctly.

>
> The only approach that seems to work (at least for Arne), namely
> W3C DOM, is out of the question for me, because the files are
> potentially huge and I cannot keep a complete document model in memory.
> I need something along the lines of SAX. I'll have to look around some
> more.


What about just reading the first few lines until you have the
XML declaration?

Parsing the encoding out of that should be simple.

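import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern encpat =
    Pattern.compile("encoding\\s*=\\s*['\"]([^'\"]+)['\"]");

private static String detectSimple(String fnm) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(fnm));
    String firstpart = "";
    String line;
    // Read until the first '>' (end of the XML declaration) or end of file.
    while (!firstpart.contains(">") && (line = br.readLine()) != null) {
        firstpart += line;
    }
    br.close();
    Matcher m = encpat.matcher(firstpart);
    if (m.find()) {
        return m.group(1);
    } else {
        return "Unknown";
    }
}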

I do not like the solution, but given the restrictions in your
situation, maybe it is what you need.
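
One caveat with reading the declaration as text is that a Reader must already assume some encoding, which breaks for UTF-16 input. The sketch below is an illustration, not code from the thread. It works on the raw leading bytes instead: it checks for a byte order mark or a UTF-16 byte pattern first, decodes the head provisionally, and only then looks for the encoding pseudo-attribute. The class XmlDeclSniffer, the method sniff and the fallback names are hypothetical, and it assumes the declared encoding is either ASCII-compatible or UTF-16.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlDeclSniffer {
    private static final Pattern ENC =
            Pattern.compile("encoding\\s*=\\s*['\"]([A-Za-z][A-Za-z0-9._-]*)['\"]");

    /** Guesses the encoding from the first bytes of an XML file (a few hundred bytes suffice). */
    static String sniff(byte[] head) {
        Charset provisional = StandardCharsets.ISO_8859_1; // ASCII-compatible first guess
        String fallback = "UTF-8";                         // XML default without BOM or declaration
        int skip = 0;
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            skip = 3;                                      // UTF-8 byte order mark
        } else if (startsWith(head, 0xFE, 0xFF)) {
            provisional = StandardCharsets.UTF_16BE; fallback = "UTF-16"; skip = 2;
        } else if (startsWith(head, 0xFF, 0xFE)) {
            provisional = StandardCharsets.UTF_16LE; fallback = "UTF-16"; skip = 2;
        } else if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) {
            provisional = StandardCharsets.UTF_16BE; fallback = "UTF-16BE"; // "<?" without BOM
        } else if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) {
            provisional = StandardCharsets.UTF_16LE; fallback = "UTF-16LE"; // "<?" without BOM
        }
        String text = new String(head, skip, head.length - skip, provisional);
        int end = text.indexOf("?>");
        if (text.startsWith("<?xml") && end > 0) {
            // Only search inside the declaration, not in the document body.
            Matcher m = ENC.matcher(text.substring(0, end));
            if (m.find()) {
                return m.group(1);
            }
        }
        return fallback;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version='1.0' encoding='windows-1250'?><root/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(doc));
    }
}
```

Since only the head of the file is inspected, the size of the document does not matter here.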

> PS: The author of that article from which I took the code isn't just
> anyone. Elliotte Rusty Harold hosts the XML web site
> http://www.cafeconleche.org/ and is affiliated with the University of
> North Carolina. Perhaps I could try to get in touch with him.


Teaching at a university is no guarantee of good practical
programming skills.

Arne


 
 
Arne Vajhøj
Guest
Posts: n/a
 
      11-24-2012
On 11/24/2012 5:14 AM, Lew wrote:
> Arne Vajhøj wrote:
>> Lew wrote:
>>> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?

>>
>> Output usually means System.out.println - that works fine with a parser.

>
> His phrasing wasn't clear to me. That's why I asked for clarification.


Then maybe we need "How to ask for clarifications the smart way".

>>> However, according to
>>> http://xmlwriter.net/xml_guide/xml_d...shtml#Encoding
>>> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
>>> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
>>> and EUC-JP,
>>> So it looks like you must not accept XML documents with such a
>>> non-standard encoding.

>>
>> Those who have researched would know that the XML spec does not
>> limit the encodings at all. An XML processor must support UTF-8
>> and UTF-16, but is free to support others.

>
> Perhaps the OP's parser doesn't exercise that freedom, judging by the
> symptoms.


There is nothing in the OP's symptoms that indicates a lack of support
for encodings.

The OP's symptom is that the parser parses fine with encoding XYZ, but
when asked by the caller it wrongly claims to be using UTF-8.
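
One thing that may be worth checking, stated here as an assumption and not something established in the thread: startDocument() may fire before the parser has read the XML declaration, in which case the locator could only report an auto-detected default at that point. The sketch below asks Locator2 again at the first startElement() and then abandons the parse, so a huge file is not read to the end. Whether the later call yields the declared encoding depends on the parser implementation. The names SaxLateEncoding, Handler, Done and probe are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

public class SaxLateEncoding {
    /** Thrown to abandon the parse once the root element has been seen. */
    static class Done extends SAXException {
        Done() { super("root element reached"); }
    }

    static class Handler extends DefaultHandler {
        Locator locator;
        String atStartDocument = "Unknown";
        String atRootElement = "Unknown";

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override
        public void startDocument() {
            if (locator instanceof Locator2) {
                atStartDocument = ((Locator2) locator).getEncoding();
            }
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (locator instanceof Locator2) {
                atRootElement = ((Locator2) locator).getEncoding();
            }
            throw new Done(); // stop here; the rest of the document is not needed
        }
    }

    /** Parses up to the root start tag and returns what the locator reported at both points. */
    static Handler probe(InputStream in) throws Exception {
        Handler h = new Handler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, h);
        } catch (Done expected) {
            // parse stopped deliberately after the root start tag
        }
        return h;
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = "<?xml version='1.0' encoding='ISO-8859-1'?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Handler h = probe(new ByteArrayInputStream(doc));
        System.out.println("startDocument: " + h.atStartDocument);
        System.out.println("startElement:  " + h.atRootElement);
    }
}
```

Comparing the two printed values on the parser actually in use would show whether the timing of the query is the issue.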

Arne

 
 
Arne Vajhøj
Guest
Posts: n/a
 
      11-24-2012
On 11/24/2012 5:14 AM, Lew wrote:
> Obviously I don't know the answer, but he's asking for suggestions
> to investigate, AIUI. He's having encoding problems. His XML is apparently
> encoded in Windows-1252, a notoriously funky encoding especially for
> the variety of characters with which one might wish to deal.


CP-1252 is just another encoding. It is no more or less funky than
any other encoding.

In fact it is identical to ISO-8859-1 for all characters except
128-159, which are control characters/unmapped in ISO-8859-1 but have
various extra characters assigned in CP-1252.
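
The difference can be seen by decoding single bytes with both charsets. This is a small illustration with a made-up class name (Cp1252VsLatin1), using the JDK charset name windows-1252 for CP-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252VsLatin1 {
    /** Decodes a single byte value with the given charset. */
    static String decode(int b, Charset cs) {
        return new String(new byte[] { (byte) b }, cs);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Outside 128-159 the two agree: 0xE9 is U+00E9 in both.
        System.out.println(decode(0xE9, cp1252).equals(decode(0xE9, latin1))); // prints true

        // Inside 128-159 they differ: 0x80 is the euro sign in windows-1252
        // and the control character U+0080 in ISO-8859-1.
        System.out.printf("U+%04X vs U+%04X%n",
                (int) decode(0x80, cp1252).charAt(0),
                (int) decode(0x80, latin1).charAt(0)); // prints U+20AC vs U+0080
    }
}
```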

> So why not
> investigate obtaining material that isn't in such a notoriously funky
> encoding, like, oh, say, the old reliable standard UTF-8?


If one can choose the data files and the software, then life is easy.

Arne


 
 
markspace
Guest
Posts: n/a
 
      11-25-2012
On 11/24/2012 1:18 PM, Sebastian wrote:
> I am dealing with other people's code that processes XML files.
> Unfortunately, that code, which I have no control over, seems to use
> some home-grown parsing algorithm, which DOES NOT always detect
> encodings correctly, but expects to be told them.



That's not a big deal. Several of the Java components work this way.
Open the file with an assumed encoding, and test the encoding. If you
are wrong, throw an exception, which causes the stream to be re-opened
with the correct encoding (now that the correct encoding has been detected).
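
The "assume, test, re-open" idea can be approximated without exceptions by peeking at the start of a stream and rewinding it. The following is a minimal sketch under stated assumptions, not a description of how any JDK component actually does it: the class ReopenWithEncoding, the method open and the 1024-byte peek size are invented here, and it only handles ASCII-compatible encodings (no UTF-16).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReopenWithEncoding {
    private static final int PEEK = 1024; // assumed to be enough to cover the declaration
    private static final Pattern ENC =
            Pattern.compile("^<\\?xml[^>]*?encoding\\s*=\\s*['\"]([^'\"]+)['\"]");

    /** Peeks at the declaration, rewinds, and returns a Reader using the declared charset. */
    static Reader open(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw, PEEK);
        in.mark(PEEK);
        byte[] head = new byte[PEEK];
        int n = 0;
        while (n < PEEK) {
            int r = in.read(head, n, PEEK - n);
            if (r < 0) break;
            n += r;
        }
        in.reset(); // rewind so the caller sees the stream from the beginning

        // First pass: assume an ASCII-compatible encoding just to read the declaration.
        String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENC.matcher(text);
        // Charset.forName throws an unchecked exception for names the JVM does not support.
        Charset cs = m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
        return new InputStreamReader(in, cs);
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<?xml version='1.0' encoding='ISO-8859-1'?><root>\u00E9</root>"
                .getBytes(StandardCharsets.ISO_8859_1);
        try (Reader r = open(new ByteArrayInputStream(doc))) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) {
                sb.append((char) c);
            }
            System.out.println(sb);
        }
    }
}
```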

Be careful you're not subverting an established, working process here.

I personally am still looking for an SSCCE, as your last one didn't
reproduce the error for me.


 