How to know the encoding of XML file?
Hi All,
I'm a newbie to the XML world. My problem is identifying the encoding of an XML file at runtime. Currently I check whether a BOM is present in the XML and identify the encoding from it. The problem is that some UTF-8 encoded files don't have a BOM at the start, so I identify such a file as ISO-8859-1 when it is actually encoded in UTF-8. I don't know much about encoding technology either. Is there any way to identify the encoding of an XML file programmatically? I can use the Xerces C++ library or any other free library to identify the correct encoding. Any other workaround is also welcome.

Thanks & Regards
Re: How to know the encoding of XML file?
In <1126607691.002066.44350@g47g2000cwa.googlegroups.com>, on 09/13/2005 at 04:01 AM, davisjoseph@postmark.net said:

>Here is the problem, some type of UTF-8 encoded file
>doesn't have BOM in the starting.

Why would any UTF-8 file have a BOM? That's for encodings with 16-bit bytes, such as UTF-16. UTF-8 uses 8-bit bytes.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>
Unsolicited bulk E-mail subject to legal action. I reserve the right to publicly post or ridicule any abusive E-mail. Reply to domain Patriot dot net user shmuel+news to contact me. Do not reply to spamtrap@library.lspace.org
Re: How to know the encoding of XML file?
davisjoseph@postmark.net wrote:

> I'm newbie to this XML world. My problem is to identify the encoding
> type of XML at runtime. What currently I'm doing is checking whether
> BOM is available in the XML; based on the BOM I'm identifying the
> encoding type. Here is the problem, some type of UTF-8 encoded file
> doesn't have BOM in the starting. So I'm identifying the file as
> iso-8859-1 encoded which is actually encoded in UTF-8.

Well, for XML there are clear rules: if there is no XML declaration specifying the encoding, then the document can only be UTF-8 or UTF-16 encoded, and you can decide between those two by the presence or absence of the BOM (UTF-16 always needs one; the UTF-8 BOM is optional). So look at the BOM and the XML declaration (that <?xml version="version.number" encoding="encoding-is-here"?>) to find the encoding for XML:
<http://www.w3.org/TR/REC-xml/#charencoding>

Of course, what you really do with the above is detect the encoding the XML document is supposed to be in; an XML parser then has to check that the whole document complies with that encoding. For example, if you read an XML declaration saying encoding="ISO-8859-1", that means the XML is supposed to be in that encoding, and a parser then checks whether any byte sequences are encountered which can't be decoded properly using that encoding.

In general there needs to be a declaration of the encoding associated with a document (e.g. in XML in the XML declaration, in HTML in a <meta> element, or for resources accessed via HTTP in the response header), as there is no general algorithm to detect any encoding that exists. For instance, you cannot detect whether a document is meant to be ISO-8859-1 encoded or ISO-8859-15 encoded; the document author has to declare the encoding, because the same bytes are just interpreted as different characters.

--
Martin Honnen
http://JavaScript.FAQTs.com/
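The rule described in this reply (check the BOM, then the XML declaration, otherwise assume UTF-8) can be sketched in a few lines of Python. This is only an illustration, not what Xerces or any other parser actually does: the function name `sniff_xml_encoding`, the `BOMS` table, and the regular expression are all made up for this example, and the sketch deliberately ignores UTF-32, EBCDIC, and BOM-less UTF-16, which Appendix F of the XML recommendation also covers.

```python
import re

# BOM byte signatures for the encodings discussed in this thread.
# (Hypothetical table for illustration; UTF-32 is deliberately left out.)
BOMS = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
]

# Matches encoding="..." inside an ASCII-compatible XML declaration
# at the very start of the document.
DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(head: bytes) -> str:
    """Guess the intended encoding from the first bytes of an XML document."""
    # 1. A BOM, if present, settles the question.
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    # 2. No BOM: look for an encoding declaration readable as ASCII.
    match = DECL_RE.match(head)
    if match:
        return match.group(1).decode("ascii")
    # 3. Neither BOM nor declaration: the XML rules leave only UTF-8.
    return "UTF-8"


print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_encoding(b"<a/>"))
```

Passing the first hundred or so bytes of the file is enough for this check. The result is only the encoding the document claims to use; a real parser still has to verify that the rest of the bytes decode under it.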
Re: How to know the encoding of XML file?
Shmuel (Seymour J.) Metz wrote:

> In <1126607691.002066.44350@g47g2000cwa.googlegroups.com>, on 09/13/2005
> at 04:01 AM, davisjoseph@postmark.net said:
>
>>Here is the problem, some type of UTF-8 encoded file
>>doesn't have BOM in the starting.
>
> Why would any UTF-8 file have a BOM? That's for encodings with 16-bit
> bytes, such as UTF-16. UTF-8 uses 8-bit bytes.

In mixed Unicode/non-Unicode environments the BOM helps to discriminate between Unicode/UTF-8 files and simpler ASCII/ISO-8859-x/... text files.

--
To reply by e-mail, please remove the extra dot in the given address: m.collado -> mcollado
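When a file has no BOM, one common heuristic for telling UTF-8 apart from ISO-8859-x text is to try a strict UTF-8 decode. Nobody in the thread proposes this; it is offered here as a minimal sketch, with the helper name `looks_like_utf8` invented for the example. It works because bytes above 0x7F only form valid UTF-8 in specific multi-byte patterns, which ordinary ISO-8859-x text rarely produces by accident.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "caf\u00e9"  # "café": the last letter is outside ASCII

print(looks_like_utf8(text.encode("utf-8")))       # the two-byte sequence C3 A9 is valid
print(looks_like_utf8(text.encode("iso-8859-1")))  # a lone E9 byte is not valid UTF-8
print(looks_like_utf8(b"plain ASCII"))             # pure ASCII is also valid UTF-8
```

This is a heuristic, not a proof: a short ISO-8859-x file could happen to contain only valid UTF-8 sequences. The same trick also cannot separate ISO-8859-1 from ISO-8859-15, since every byte sequence decodes without error in both.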
Re: How to know the encoding of XML file?
On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:
> Why would any UTF-8 file have a BOM?

FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29

> That's for encodings with 16-bit bytes, such as UTF-16.

Except that the encoding schemes utf-16BE and utf-16LE use 16-bit code units (I'd avoid using the term "bytes"), but don't need a BOM, because their endian-ness is specified by the name of the encoding scheme.
Re: How to know the encoding of XML file?
In <43270386$1@news.victoria.tc.ca>, on 09/13/2005 at 09:51 AM, yf110@vtn1.victoria.tc.ca (Malcolm Dew-Jones) said:

>: > Why would any UTF-8 file have a BOM?
>: FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29

Note that the file doesn't contain a BOM, but rather the UTF-8 encoding of a BOM. An actual BOM would not be valid UTF-8.

>(I'm still waiting for hardware that increases character sizes.

For most hardware, character size is irrelevant. Some devices deal with large blocks of data. Some deal with graphical data rather than text. Some deal with individual bits. Keyboards deal with scan codes rather than conventional character representations. The only common PC peripherals that I can think of that actually deal with characters as characters are a display adapter or printer in text mode, and those are essentially obsolete.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>
Re: How to know the encoding of XML file?
Alan J. Flavell (flavell@ph.gla.ac.uk) wrote:
: On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:
: > Why would any UTF-8 file have a BOM?
: FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29
: > That's for encodings with 16-bit bytes, such as UTF-16.
: Except that the encoding schemes utf-16BE and utf-16LE use 16-bit code
: units (I'd avoid using the term "bytes"), but don't need a BOM,
: because their endian-ness is specified by the name of the encoding
: scheme.

utf-16BE and utf-16LE must be using 8 bit bytes, because if they were using true 16-bit code units then there would be no endian-ness to consider.

(I'm still waiting for hardware that increases character sizes. They've done it for all other elementary units on the computer, integers, memory pointers, etc, but for some reason not this one.)

--
This programmer available for rent.
Re: How to know the encoding of XML file?
On Tue, 13 Sep 2005, Malcolm Dew-Jones wrote:
> utf-16BE and utf-16LE must be using 8 bit bytes,

That's the distinction (as set out in recent Unicode terminologies) between the Character Encoding Form (which in all these three cases is designated utf-16, consisting of 16-bit code units), and its Character Encoding Schemes (of which there are the three: utf-16 with BOM, utf-16LE, and utf-16BE) for representing the 16-bit code units as an octet stream.

See chapter 2, sections 2.5 and 2.6, e.g. http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf as well as the previously-cited FAQs.

> because if they were using true 16-bit code units then there would
> be no endian-ness to consider.

It's unfortunate that when one reads "utf-16", without context, it is unclear whether it's meant to refer to the C.E.F. (and thus to comprise all three C.E.S.s), or only to the one C.E.S. Perhaps it's a pity they didn't devise different designations for the CEF and for the CES (maybe "utf-16BOM" for the CES).

(This isn't a problem for utf-8, since there is only one CES for that particular CEF, with the BOM being optional.)

> (I'm still waiting for hardware that increases character sizes.

Historically, there has been at least one machine with 36-bit words that could be used as four 9-bit units; but that's past rather than future!

> They've done it for all other elementary units on the computer,
> integers, memory pointers, etc, but for some reason not this one.)

I suspect you're more interested in raising it to 16 bits (or 32) than to some non-multiple of 8, though.

best
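The distinction between the encoding form (16-bit code units) and the encoding schemes (octet streams) can be made concrete with a small Python sketch. The variable names are invented for illustration, and the sample text is restricted to characters in the Basic Multilingual Plane so that each character is exactly one code unit; that restriction is an assumption of the example, not a property of UTF-16.

```python
import struct

text = "A\u20ac"  # 'A' and the euro sign, both in the Basic Multilingual Plane

# Character Encoding Form: a sequence of 16-bit code units (plain integers).
# Taking ord() per character is valid only because no surrogates are needed here.
code_units = [ord(ch) for ch in text]

# Character Encoding Schemes: the same code units serialized as octets.
utf16be = b"".join(struct.pack(">H", unit) for unit in code_units)
utf16le = b"".join(struct.pack("<H", unit) for unit in code_units)
# The "utf-16" scheme: a BOM announces the byte order (big-endian chosen here).
utf16_with_bom = b"\xfe\xff" + utf16be

print([hex(unit) for unit in code_units])
print(utf16be.hex(), utf16le.hex(), utf16_with_bom.hex())
```

The code units are identical in all three cases; only the octet order, and whether a BOM is needed to announce it, differs between the schemes.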
Re: How to know the encoding of XML file?
On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:
> >: FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29
>
> Note that the file doesn't contain a BOM, but rather the UTF-8
> encoding of a BOM.

*No* data stream ever literally "contains" a BOM, any more than it "contains" a copyright sign, or the letter "A" (the BOM, just like any Unicode character, is an abstract concept): what a data stream contains is the BOM encoded according to the appropriate "Character Encoding Scheme".

That's the whole point of the BOM, so that the character encoding scheme can be recognised by inspecting the encoding. So there were no surprises there.

> An actual BOM would not be valid UTF-8.

An "actual BOM" is an abstract concept! The idea of dumping the hexadecimal number x'FEFF' into a utf-8 data stream - if that was what you had in mind - would make no sense, any more than dumping x'00A9' into it would make any sense to represent the copyright sign. Isn't that obvious?

Let's cut them some slack: when they say that it "contains a BOM", they are taking it for granted that it means "appropriately encoded". You can't put an abstract concept into a data stream *without* an appropriate encoding, after all.
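The point that a stream holds the BOM character (U+FEFF) only in encoded form is easy to check. The sketch below is illustrative only, with names such as `encoded_bom` invented for the example; it encodes the same abstract character under three schemes and does the same for the copyright sign mentioned above.

```python
BOM = "\ufeff"        # the abstract character U+FEFF
COPYRIGHT = "\u00a9"  # the copyright sign, U+00A9

# The same abstract character yields different octets under each scheme.
encoded_bom = {
    scheme: BOM.encode(scheme)
    for scheme in ("utf-8", "utf-16-be", "utf-16-le")
}
for scheme, octets in encoded_bom.items():
    print(scheme, octets.hex())

# The copyright sign is likewise two octets in UTF-8, not the raw value 00A9.
print(COPYRIGHT.encode("utf-8").hex())

# The raw octets FE FF, dropped into a UTF-8 stream, are not valid UTF-8.
try:
    b"\xfe\xff".decode("utf-8")
    raw_is_valid = True
except UnicodeDecodeError:
    raw_is_valid = False
print(raw_is_valid)
```

These byte patterns at the start of a file are what BOM-based encoding detection looks for.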