How do xml parsers handle encoding?

 
 
billsahiker@yahoo.com
 
      04-30-2008

If an XML file specifies an encoding, e.g., UTF-16, do XML browsers and
XML editors read and verify each character in the file to make sure it
is UTF-16, and throw an error if it is not? Or do they automatically
filter/convert to UTF-16, or do they do something else?

Do they default to UTF-8 if the XML file does not specify an encoding?

Bill
 
Martin Honnen
 
      04-30-2008
(E-Mail Removed) wrote:
> If an XML file specifies an encoding, e.g., UTF-16, do XML browsers and
> XML editors read and verify each character in the file to make sure it
> is UTF-16, and throw an error if it is not? Or do they automatically
> filter/convert to UTF-16, or do they do something else?
>
> Do they default to UTF-8 if the XML file does not specify an encoding?


If there is no XML declaration declaring an encoding, an XML parser
checks for a BOM (byte order mark) to find out whether the document is
UTF-8 or UTF-16. With neither a declaration nor a BOM, it must assume
UTF-8.
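
To make the BOM check concrete, here is a minimal Python sketch of the idea. The function name `sniff_encoding` is made up for illustration, and the logic is simplified: a real parser also looks at the first bytes of `<?xml` and handles other encodings such as UTF-32, which this sketch ignores.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding family of an XML entity from its first bytes.

    Simplified: only UTF-8 and UTF-16 are considered, as in the post above.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM and (by assumption) no encoding declaration: assume UTF-8.
    return "utf-8"


if __name__ == "__main__":
    # Python's "utf-16" codec writes a BOM, so this is detected as UTF-16.
    print(sniff_encoding("<doc/>".encode("utf-16")))
    print(sniff_encoding(b"<doc/>"))
```

The result only tells the parser how to start reading; it still has to decode the rest of the document strictly in that encoding.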

XML parsers are also required to check that documents are properly
encoded. However, browsers like Firefox or Opera, I think, might not
report such a violation. For instance, I saved an XML document as UTF-8
but with an XML declaration saying encoding="UTF-16", then loaded it in
Firefox 2.0 and Opera 9. Neither reported an error; both treated the
document as UTF-8. IE 6 reported an error.
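
The same experiment can be repeated outside a browser. The sketch below uses Python's standard `xml.etree.ElementTree` (which wraps the expat parser); the helper name `parses_ok` is invented here. To my knowledge expat rejects a UTF-8 byte stream whose declaration claims UTF-16, but treat that as an observation about one parser, not a guarantee about others.

```python
import xml.etree.ElementTree as ET


def parses_ok(data: bytes) -> bool:
    """Return True if ElementTree accepts the bytes, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# UTF-8 (here plain ASCII) bytes whose declaration claims UTF-16.
mislabelled = b'<?xml version="1.0" encoding="UTF-16"?><doc/>'
# The same document with a truthful declaration.
honest = b'<?xml version="1.0" encoding="UTF-8"?><doc/>'

if __name__ == "__main__":
    print("honest:", parses_ok(honest))
    print("mislabelled:", parses_ok(mislabelled))
```

Note that the document must be passed as bytes; if it is passed as an already-decoded string, the parser never sees the original encoding problem.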



--

Martin Honnen
http://JavaScript.FAQTs.com/
 
Martin Honnen
 
      04-30-2008
Martin Honnen wrote:

> XML parsers are also required to check that documents are properly
> encoded. However, browsers like Firefox or Opera, I think, might not
> report such a violation. For instance, I saved an XML document as UTF-8
> but with an XML declaration saying encoding="UTF-16", then loaded it in
> Firefox 2.0 and Opera 9. Neither reported an error; both treated the
> document as UTF-8. IE 6 reported an error.


For Mozilla, the FAQ
http://developer.mozilla.org/en/docs...l_documents.3F
says:
"Most well-formedness constraints are enforced. (Currently Mozilla
does not catch character encoding errors, because the document is
re-encoded using a lenient encoding converter before the document
reaches the XML parser. This is a bug.)"
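
A rough illustration of why a lenient converter in front of the parser hides encoding errors, written in Python rather than anything Mozilla actually uses. The helper names are hypothetical. Invalid bytes are replaced with U+FFFD during the lenient pass, so the re-encoded stream is perfectly valid by the time a strict check sees it.

```python
def lenient_reencode(raw: bytes) -> bytes:
    """Decode leniently, then re-encode: invalid bytes become U+FFFD."""
    text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Strict check, standing in for what a conforming parser should do."""
    try:
        raw.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


# 0xE9 is 'e acute' in Latin-1; on its own it is not a valid UTF-8 sequence.
broken = b"<doc>caf\xe9</doc>"

if __name__ == "__main__":
    print(is_valid_utf8(broken))
    print(is_valid_utf8(lenient_reencode(broken)))
```

After the lenient pass the error is gone from the byte stream, and all that remains is a replacement character in the text.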



--

Martin Honnen
http://JavaScript.FAQTs.com/
 
Joseph J. Kesselman
 
      04-30-2008
The rules for how they're *supposed* to handle it are spelled out in the
XML Recommendation. Not all parsers are in strict compliance with all
parts of the Recommendation, alas. Bugs happen.

If you're asking whether you can get away with cheating: the brief
answer is that it's extremely bad practice to try. If you're asking
whether you can be certain a particular parser will or won't let
something through, you can ask its development/user community... but be
aware that the next release may fix this, and it's a very bad idea to
write code that depends on bugs in specific versions.
 
billsahiker@yahoo.com
 
      04-30-2008
On Apr 30, 8:20 am, Martin Honnen <(E-Mail Removed)> wrote:
> Martin Honnen wrote:
> > And XML parsers are required to check that documents are properly
> > encoded.


So how do they do that? Do they check every character, or do they just
convert? If the encoding attribute is UTF-8 and the file has a
character that is not UTF-8, does the browser report an error, convert
it, or what? For example, if a Korean character is in a file that says
it is UTF-8.

Bill
 
Richard Tobin
 
      04-30-2008
In article <(E-Mail Removed)>,
<(E-Mail Removed)> wrote:

>> > And XML parsers are required to check that documents are properly
>> > encoded.


>So how do they do that? do they check every character?


Yes.

>Like if a Korean character is in a file that says it is utf8.


UTF-8 covers all of Unicode, so it includes Korean characters.

A parser has to check two things: that the data is legal for the
encoding (for example, some sequences of bytes are not legal in
UTF-8), and that the character it encodes is allowed in XML.
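
The two checks can be sketched in a few lines of Python. The function names are made up for illustration; the character ranges are the Char production of XML 1.0, and strict decoding stands in for the "legal for the encoding" test.

```python
def is_xml_char(ch: str) -> bool:
    """True if the character matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def check_entity(raw: bytes, encoding: str = "utf-8") -> str:
    """Return 'ok', 'bad encoding', or 'bad character'."""
    try:
        text = raw.decode(encoding)  # check 1: bytes are legal for the encoding
    except UnicodeDecodeError:
        return "bad encoding"
    if not all(is_xml_char(ch) for ch in text):  # check 2: allowed in XML
        return "bad character"
    return "ok"


if __name__ == "__main__":
    print(check_entity("<a>\ud55c</a>".encode("utf-8")))  # Korean syllable U+D55C
    print(check_entity(b"<a>\xff</a>"))  # 0xFF never appears in UTF-8
    print(check_entity(b"<a>\x00</a>"))  # NUL decodes fine but is not an XML char
```

A Korean character passes both checks; a stray 0xFF byte fails the first, and a NUL byte passes the first but fails the second.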

-- Richard
--
:wq
 
billsahiker@yahoo.com
 
      04-30-2008
On Apr 30, 9:49 am, (E-Mail Removed) (Richard Tobin) wrote:
> In article <(E-Mail Removed)>,
>
> <(E-Mail Removed)> wrote:
> >> > And XML parsers are required to check that documents are properly
> >> > encoded.

> >So how do they do that? do they check every character?

>
> Yes.
>
> >Like if a Korean character is in a file that says it is utf8.

>
> utf-8 covers all of Unicode, so it includes Korean characters.
>
> A parser has to check two things: that the data is legal for the
> encoding (for example, some sequences of bytes are not legal in
> UTF-8), and that the character it encodes is allowed in XML.
>
> -- Richard
> --
> :wq


OK. I don't know if you are a .NET programmer or not (Martin is, so maybe
he can respond to this too), but if I use StreamReader to read an XML
file with the encoding specified as UTF-8, and I set the StreamReader's
encoding to UTF-8, will StreamReader throw an exception if a character
is not valid UTF-8? Or do I have to parse every character and check its
value to see if it is in the UTF-8 range?

Bill
 
Martin Honnen
 
      04-30-2008
(E-Mail Removed) wrote:

> OK. I don't know if you are a .NET programmer or not (Martin is, so maybe
> he can respond to this too), but if I use StreamReader to read an XML
> file with the encoding specified as UTF-8, and I set the StreamReader's
> encoding to UTF-8, will StreamReader throw an exception if a character
> is not valid UTF-8? Or do I have to parse every character and check its
> value to see if it is in the UTF-8 range?


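As far as I know, StreamReader does not throw an exception by default;
with the standard UTF-8 encoding object, invalid byte sequences are
replaced with the replacement character U+FFFD.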
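
The general point is that a text reader's error policy is usually a setting, so there is no need to check each character by hand. The sketch below shows this with Python's `io.TextIOWrapper` as an analogue, not with the .NET API itself; `read_text` is a hypothetical helper. (In .NET, I believe the corresponding knob is the `throwOnInvalidBytes` argument of the `UTF8Encoding` constructor, but that is not shown here.)

```python
import io


def read_text(raw: bytes, strict: bool) -> str:
    """Read bytes through a text reader with a chosen error policy."""
    errors = "strict" if strict else "replace"
    with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", errors=errors) as reader:
        return reader.read()


if __name__ == "__main__":
    data = b"ok \xff"  # 0xFF is never valid in UTF-8
    print(repr(read_text(data, strict=False)))  # lenient: bad byte is replaced
    try:
        read_text(data, strict=True)
    except UnicodeDecodeError as exc:
        print("strict reader raised:", exc.reason)
```

With the lenient policy the bad byte silently becomes U+FFFD; with the strict policy the reader raises as soon as it hits the bad byte.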


--

Martin Honnen
http://JavaScript.FAQTs.com/
 
Joseph J. Kesselman
 
      04-30-2008
(E-Mail Removed) wrote:
> So how do they do that? do they check every character? or do they just
> convert?


Most hand it off to an appropriate encoding-aware stream reader library
and let that code do the work. Why build a wheel when you can buy one?
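
As a sketch of what such a library does for the parser, here is Python's standard incremental decoder fed with arbitrary chunks. The helper `decode_chunks` is invented for this example. The decoder keeps track of multi-byte sequences that straddle chunk boundaries and raises if the input is invalid or ends mid-character.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks strictly, as a stream reader would."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True makes the decoder complain about a truncated last character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


if __name__ == "__main__":
    korean = "\ud55c".encode("utf-8")  # three bytes for one Korean syllable
    # Split the character across two chunks; the decoder reassembles it.
    chunks = [b"<a>" + korean[:1], korean[1:] + b"</a>"]
    print(decode_chunks(chunks))
```

The parser itself then only ever sees characters, and an encoding error surfaces as an exception from the decoder.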
 