Wed, 21 Nov 2012 15:32:19 +0100, /Sebastian/:
> I discovered this post:
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
> and implemented both approaches (SAX and Xerces XNI).
>
> Unfortunately, for the attached XML file, both methods
> output an encoding of UTF-8, while looking at the file
> makes it clear that it is not UTF-8 encoded (all characters,
> including the umlaut and the Euro-sign, take one byte, and the
> declared encoding also is not UTF-8).
>
> Does anyone have an idea why that is so? And how I could
> go about making some XML parser determine the correct encoding?
Sorry if this has been answered already elsewhere in the thread.
The XML specification has a guideline for detecting the source encoding:
http://www.w3.org/TR/xml/#sec-guessing
and this is basically what parsers do. One-byte encodings are
largely indistinguishable from each other, so they can be reliably
detected only in the presence of explicit encoding information,
such as an encoding declaration.
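
To illustrate, here is a rough sketch of the kind of guessing that
section of the spec describes. It is not how Xerces or any particular
parser implements it. The class name XmlEncodingSniffer and the method
sniff() are made up for this example, and it covers only a subset of
the cases: three byte order marks, then an encoding declaration in an
ASCII-compatible file, then the UTF-8 default. UTF-32, EBCDIC and
UTF-16 without a BOM are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any XML library.
final class XmlEncodingSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration
    // that starts at the very first byte of the input.
    private static final Pattern DECLARED = Pattern.compile(
        "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // head: the first bytes of the document (a few hundred are plenty).
    static String sniff(byte[] head) {
        int b0 = at(head, 0), b1 = at(head, 1), b2 = at(head, 2);

        // 1. A byte order mark settles the question.
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";

        // 2. No BOM: assume an ASCII-compatible encoding and read the
        //    declaration. ISO-8859-1 maps every byte to one char, so
        //    decoding with it cannot fail whatever the real encoding is.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(text);
        if (m.find()) return m.group(1);

        // 3. No BOM and no declaration: the spec's default is UTF-8.
        //    The bytes themselves cannot tell one single-byte
        //    encoding from another.
        return "UTF-8";
    }

    // Unsigned byte at index i, or -1 if the array is too short.
    private static int at(byte[] b, int i) {
        return i < b.length ? (b[i] & 0xFF) : -1;
    }
}
```

With a declaration of encoding="ISO-8859-15" and no BOM, sniff()
returns "ISO-8859-15". Without the declaration, the same bytes give
"UTF-8", since nothing in the data says otherwise: byte 0xA4, for
instance, is the Euro sign in ISO-8859-15 but the generic currency
sign in ISO-8859-1.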
--
Stanimir