Detect XML document encodings with SAX

 
 
Sebastian
      11-21-2012
Hello there,

I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/

and implemented both approaches (SAX and Xerces XNI).

Unfortunately, for the attached XML file, both methods
output an encoding of UTF-8, while looking at the file
makes it clear that it is not UTF-8 encoded (all characters,
including the umlaut and the Euro-sign, take one byte, and the
declared encoding also is not UTF-8).

Does anyone have an idea why that is so? And how I could
go about making some XML parser determine the correct encoding?

-- Sebastian
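
The one-byte observation can be checked without any XML parser. The following is a minimal sketch, assuming the two characters in question are U+00FC (u-umlaut) and U+20AC (euro sign) and that the JVM provides the windows-1250 charset; the class name `ByteLengthCheck` is made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
final class ByteLengthCheck {

    // Number of bytes the text occupies when encoded in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumed sample: u-umlaut (U+00FC) followed by the euro sign (U+20AC).
        String sample = "\u00FC\u20AC";
        Charset cp1250 = Charset.forName("windows-1250");
        System.out.println("windows-1250: " + byteLength(sample, cp1250));
        System.out.println("UTF-8:        " + byteLength(sample, StandardCharsets.UTF_8));
    }
}
```

In windows-1250 each of the two characters occupies a single byte, while UTF-8 needs two bytes for the umlaut and three for the euro sign, so a file in which both take one byte cannot be UTF-8.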

 
Lew
      11-21-2012
Sebastian wrote:
> I discovered this post:
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
> and implemented both approaches (SAX and Xerces XNI).
>
> Unfortunately, for the attached XML file, both methods


Don't do attachments on Usenet.

> output an encoding of UTF-8, while looking at the file


as they should. XML should be encoded in UTF-8 nearly always.

But SAX is a parser, so it doesn't output, it inputs. What are you telling us?

> makes it clear that it is not UTF-8 encoded (all characters,
> including the umlaut and the Euro-sign, take one byte, and the
> declared encoding also is not UTF-8).


http://sscce.org/

> Does anyone have an idea why that is so? And how I could


You used the default encoding in your Writer.

> go about making some XML parser determine the correct encoding?


Your problem is writing the file, no? That has nothing to do with parsing.

If your problem is with reading the file, then the encoding in the XML declaration
should suffice to guide the parser. But then why do you talk about methods that
"output an encoding"?

However, according to
http://xmlwriter.net/xml_guide/xml_d...shtml#Encoding
supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, and EUC-JP,
as you would have learned had you researched your question.

So it looks like you must not accept XML documents with such a non-standard
encoding.

Show us the code, or at least an SSCCE of it.

--
Lew
 
Sebastian
      11-21-2012
On 21.11.2012 20:31, Lew wrote:
> Sebastian wrote:
>> I discovered this post:
>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>>
>> and implemented both approaches (SAX and Xerces XNI).

[snip]

>
> Your problem is writing the file, no? That has nothing to do with parsing.

No, it is with parsing the file. Parsing with the purpose of detecting
the encoding.

> If your problem is with reading the file, then the encoding in the XML declaration
> should suffice to guide the parser.

My question is exactly why in this case this does not suffice.

>But then why do you talk about methods that
> "output an encoding"?

I meant the System.out.println() statements in the code.

[snip]

> Show us the code, or at least an SSCCE of it.
>

I was referring to the code in the IBM developerworks article that I
linked to. Perhaps I should simply have copied out that code into my
original post. So here goes:

import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;

import java.io.IOException;

public class SAXEncodingDetector extends DefaultHandler {

    /**
     * Prints the encodings of all URLs given on the command line.
     */
    public static void main(String[] args) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        SAXEncodingDetector handler = new SAXEncodingDetector();
        parser.setContentHandler(handler);
        for (int i = 0; i < args.length; i++) {
            try {
                parser.parse(args[i]);
            }
            catch (SAXException ex) {
                // startDocument() aborts the parse with a SAXException,
                // so the result is printed here
                System.out.println(handler.encoding);
            }
        }
    }

    private String encoding;
    private Locator2 locator;

    @Override
    public void setDocumentLocator(Locator locator) {
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
        else {
            // the parser does not provide a Locator2
            this.encoding = "unknown";
        }
    }

    @Override
    public void startDocument() throws SAXException {
        if (locator != null) {
            this.encoding = locator.getEncoding();
        }
        // stop parsing; only the encoding is of interest
        throw new SAXException("Early termination");
    }

}

 
Lew
      11-22-2012
Sebastian wrote:
> Lew wrote:
>> Sebastian wrote:
>>> I discovered this post:
>>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>>>
>>> and implemented both approaches (SAX and Xerces XNI).

>
> [snip]
>
>> Your problem is writing the file, no? That has nothing to do with parsing.

>
> No, it is with parsing the file. Parsing with the purpose of detecting
> the encoding.


Not clear from your phrasing.

>> If your problem is with reading the file, then the encoding in the XML declaration
>> should suffice to guide the parser.

>
> My question is exactly why in this case this does not suffice.


Did my answer to that question not suffice?

I notice you didn't address my answer in your response; in fact you snipped it.

--
Lew
 
Sebastian
      11-22-2012
On 22.11.2012 01:37, Lew wrote:
> Sebastian wrote:
>> Lew wrote:
>>> Sebastian wrote:
>>>> I discovered this post:
>>>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>>>>
>>>> and implemented both approaches (SAX and Xerces XNI).

>>
>> [snip]
>>
>>> Your problem is writing the file, no? That has nothing to do with parsing.

>>
>> No, it is with parsing the file. Parsing with the purpose of detecting
>> the encoding.

>
> Not clear from your phrasing.
>
>>> If your problem is with reading the file, then the encoding in the XML declaration
>>> should suffice to guide the parser.

>>
>> My question is exactly why in this case this does not suffice.

>
> Did my answer to that question not suffice?
>
> I notice you didn't address my answer in your response; in fact you snipped it.


The answer cannot be that windows-1250 is non-standard. In fact, the
declared encoding of the XML file does not seem to matter. The code will
always output "UTF-8".

I am using Java 7 on Windows XP.

-- Sebastian
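
Whether a given parser actually honours a windows-1250 declaration can be tested in isolation. The sketch below assumes the parser bundled with the JDK and a JVM that provides the windows-1250 charset; the class name `DeclaredEncodingParseCheck` and the sample document are made up for illustration and are not the file from this thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical check, not part of the original post.
final class DeclaredEncodingParseCheck {

    // Parses raw bytes and returns all character data the parser reports.
    static String parseText(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xmlBytes), handler);
        } catch (Exception ex) {
            throw new IllegalStateException("parse failed", ex);
        }
        return text.toString();
    }

    // Made-up sample document, written out as windows-1250 bytes.
    static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"windows-1250\"?>"
                + "<a>\u00FC\u20AC</a>";
        return xml.getBytes(Charset.forName("windows-1250"));
    }

    public static void main(String[] args) {
        String text = parseText(sampleBytes());
        System.out.println(text.equals("\u00FC\u20AC")
                ? "decoded as declared"
                : "decoded differently: " + text);
    }
}
```

If the character data comes back intact, the parser has read the bytes as windows-1250, which would mean the declaration is being used for decoding even when the reported encoding name says otherwise.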

 
markspace
      11-22-2012
On 11/21/2012 10:41 PM, Sebastian wrote:

>
> The answer cannot be that windows-1250 is non-standard. In fact, the
> declared encoding of the XML file does not seem to matter. The code will
> always output "UTF-8".
>


Maybe this quote from the article will help you out:

"This approach works 90 percent of the time, maybe a little more. But
SAX parsers aren't required to support the Locator interface, much less
Locator2, and a few don't. A second option, if you know you're using
Xerces, is to work with XNI"


Since the output of the program is "unknown", I'd guess that this
particular SAX parser doesn't support Locator2, like it says.


 
Steven Simpson
      11-22-2012
On 22/11/12 07:18, markspace wrote:
> On 11/21/2012 10:41 PM, Sebastian wrote:
>>
>> The answer cannot be that windows-1250 is non-standard. In fact, the
>> declared encoding of the XML file does not seem to matter. The code will
>> always output "UTF-8".
>>

>
> Maybe this quote from the article will help you out:
>
> "This approach works 90 percent of the time, maybe a little more. But
> SAX parsers aren't required to support the Locator interface, much
> less Locator2, and a few don't. A second option, if you know you're
> using Xerces, is to work with XNI"
>
>
> Since the output of the program is "unknown", I'd guess that this
> particular SAX parser doesn't support Locator2, like it says.


Like the OP, I'm getting "UTF-8", and tracing in the code shows that it
is getting a Locator2.


--
ss at comp dot lancs dot ac dot uk
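
One possible explanation, which nobody in this thread confirms, is that the parser has not yet read the XML declaration when it calls startDocument(), so the Locator2 still reports the encoding inferred from the first bytes. The sketch below is a way to test that guess: it records the encoding both at startDocument() and at the first startElement(). The class name `LateEncodingDetector` and the sample document are hypothetical, and the result may differ between parser implementations.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variant of the detector, not taken from the article.
final class LateEncodingDetector extends DefaultHandler {

    private Locator2 locator;            // stays null if no Locator2 is offered
    String atStartDocument = "unknown";
    String atFirstElement = "unknown";

    @Override
    public void setDocumentLocator(Locator locator) {
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
    }

    @Override
    public void startDocument() {
        if (locator != null) {
            atStartDocument = locator.getEncoding();
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (locator != null) {
            atFirstElement = locator.getEncoding();
        }
        // Stop here; the root element is as far as we need to go.
        throw new SAXException("Early termination");
    }

    static LateEncodingDetector detect(byte[] xmlBytes) {
        LateEncodingDetector handler = new LateEncodingDetector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xmlBytes), handler);
        } catch (SAXException stop) {
            // Normally the early termination above; a sketch, so real
            // parse errors are not distinguished from it.
        } catch (Exception ex) {
            throw new IllegalStateException("could not run the parser", ex);
        }
        return handler;
    }

    public static void main(String[] args) {
        // Made-up sample document, written out as windows-1250 bytes.
        String xml = "<?xml version=\"1.0\" encoding=\"windows-1250\"?>"
                + "<a>\u00FC\u20AC</a>";
        LateEncodingDetector d = detect(xml.getBytes(Charset.forName("windows-1250")));
        System.out.println("at startDocument: " + d.atStartDocument);
        System.out.println("at first element: " + d.atFirstElement);
    }
}
```

If the two values differ, reading the encoding in startElement() rather than startDocument() would be the simpler fix for the SAX approach.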

 
Roedy Green
      11-22-2012
On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian wrote, quoted or
indirectly quoted someone who said:

>Does anyone have an idea why that is so? And how I could
>go about making some XML parser determine the correct encoding?


See http://mindprod.com/products2.html#ENCODINGRECOGNISER

This is a manual assist tool to help you guess the encoding.

Encodings are not embedded in any way in files. You just have to know.

ARGHHH!

See http://mindprod.com/jgloss/encoding.html
for how to use native2ascii to interconvert encodings.

The XML world likes UTF-8. Using anything else is just asking for
trouble.
--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.
 
markspace
      11-22-2012
On 11/21/2012 11:53 PM, Steven Simpson wrote:

> Like the OP, I'm getting "UTF-8", and tracing in the code shows that it
> is getting a Locator2.



Oh, well mine doesn't. I guess we have two different implementations.
Sorry can't guess what is up with yours.


 
Peter J. Holzer
      11-23-2012
On 2012-11-22 11:24, Roedy Green wrote:
> On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian wrote, quoted or
> indirectly quoted someone who said:
>>Does anyone have an idea why that is so? And how I could
>>go about making some XML parser determine the correct encoding?

>
> See http://mindprod.com/products2.html#ENCODINGRECOGNISER
>
> This is a manual assist tool to help you guess the encoding.


No need to guess.

> Encodings are not embedded in any way in files. You just have to know.


Not true for XML. The file Sebastian posted starts with

<?xml version="1.0" encoding="windows-1250"?>

hp
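
Since the declaration sits at the very start of the file, it can also be read without a parser. This is a simplified sketch, assuming an ASCII-compatible encoding and no byte order mark; it does not handle UTF-16, UTF-32 or EBCDIC input, which a real parser detects from the first bytes. The class name `XmlDeclSniffer` is made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
final class XmlDeclSniffer {

    // Matches an XML declaration at the start of the input and captures
    // the value of its encoding pseudo-attribute (either quote style).
    private static final Pattern DECL = Pattern.compile(
            "^<\\?xml\\s+version\\s*=\\s*(?:\"[^\"]*\"|'[^']*')"
            + "\\s+encoding\\s*=\\s*"
            + "(?:\"([A-Za-z][A-Za-z0-9._-]*)\"|'([A-Za-z][A-Za-z0-9._-]*)')");

    // Returns the declared encoding name, or null if none is declared.
    static String declaredEncoding(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so an
        // ASCII-compatible declaration survives this decoding intact.
        int n = Math.min(head.length, 200);
        String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(prefix);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    public static void main(String[] args) {
        byte[] head = "<?xml version=\"1.0\" encoding=\"windows-1250\"?><a/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(head));
    }
}
```

Passing the first couple of hundred bytes of a file to `declaredEncoding` is enough, because the declaration has to be the first thing in the document.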


--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
 