Velocity Reviews - Computer Hardware Reviews

Serializing XML with JAXP - help needed

 
 
Michael
      02-22-2004
Hi all,

I'm trying to serialize an xml document with JAXP. The xml may or may not
contain international characters, and so I want any text elements to be
UTF-8 encoded. Consider the following (a brief summary is included below the
code):

---- code begin ----

org.w3c.dom.Document doc =
    javax.xml.parsers.DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();

org.w3c.dom.Element el = doc.createElement("element");
el.setAttribute("attr1", "attr1value");
el.appendChild(doc.createTextNode("Danish < æøå > characters!"));
doc.appendChild(el);

javax.xml.transform.TransformerFactory transformerFactory =
    javax.xml.transform.TransformerFactory.newInstance();
javax.xml.transform.Transformer transformer =
    transformerFactory.newTransformer();

transformer.setOutputProperty(javax.xml.transform.OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

java.io.StringWriter xmlout = new java.io.StringWriter();
javax.xml.transform.stream.StreamResult result =
    new javax.xml.transform.stream.StreamResult(xmlout);
transformer.transform(new javax.xml.transform.dom.DOMSource(doc), result);

System.out.println(xmlout.getBuffer());

---- code end ----

So, I'm creating a document (DOM), setting an attribute and appending a text
node with international characters (and a couple of brackets just for fun).
Then I create a transformer instance, I ask it to indent the output nicely
and finally to actually serialize my DOM into xml.

When I run this code (in a JSP file on a Tomcat 4.1.x server with the latest
Xerces2-J version installed) I get this output:

<?xml version="1.0" encoding="UTF-8"?>
<element attr1="attr1value">Danish &lt; æøå &gt; characters!</element>

Okay. So I got the < and > converted as I expected. However, the
international characters do not appear to have been encoded to UTF-8 or
anything else for that matter. In fact, the above isn't even a valid XML
document, and several parsers I tried (including Microsoft XML) reject it
because of the illegal character data. Clearly there is a mismatch between
what the encoding in the XML declaration specifies and what actually appears
in the text nodes of the document. It's very curious that JAXP will transform
a DOM into a result that isn't valid.
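
To make the mismatch concrete, here is a small sketch comparing the bytes that "æøå" produces in two encodings. It assumes the output was written in a single-byte encoding such as ISO-8859-1; that is only a guess, since the post doesn't say which encoding the server actually used. The class and method names (EncodingMismatch, hex, isValidUtf8) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Returns true if the bytes decode cleanly as UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "\u00E6\u00F8\u00E5"; // the three Danish letters
        // Assumed encoding of the broken output (not confirmed by the post).
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1: " + hex(latin1));
        System.out.println("UTF-8:      " + hex(utf8));
        System.out.println("ISO-8859-1 bytes valid as UTF-8? " + isValidUtf8(latin1));
    }
}
```

The single-byte form is three bytes (E6 F8 E5) while the UTF-8 form is six (C3 A6 C3 B8 C3 A5). A parser that trusts the encoding="UTF-8" declaration and then meets the three-byte form sees a malformed sequence, which would explain the "illegal character" rejections.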

Interestingly, when I run the same code interactively inside my WebSphere
Studio Application Developer 5 (using what is known as a scrapbook page), I
get this:

<?xml version="1.0" encoding="UTF-8"?>
<element attr1="attr1value">Danish &lt; &#230;&#248;&#229; &gt; characters!</element>

Well. I'm not sure that #230 is a correct UTF-8 encoding of "æ" (in fact I'm
sure it isn't), but at least the document is now valid and even Microsoft
XML will parse it without complaints.
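
On the "#230" point: that form is an XML numeric character reference, not a UTF-8 encoding. It spells out the Unicode code point (230 decimal, U+00E6) in plain ASCII, which is why the document parses whatever the byte encoding is. A rough sketch of the difference follows; the class and method names (CharRefVsUtf8, charRef, utf8Hex) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class CharRefVsUtf8 {

    // Builds the decimal numeric character reference for a code point.
    static String charRef(int codePoint) {
        return "&#" + codePoint + ";";
    }

    // Returns the UTF-8 bytes of a code point as space-separated hex.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
            .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int ae = "\u00E6".codePointAt(0); // the letter ae
        System.out.println("code point:  " + ae);
        System.out.println("char ref:    " + charRef(ae));
        System.out.println("UTF-8 bytes: " + utf8Hex(ae));
    }
}
```

So the reference names code point 230, while the UTF-8 encoding of that same character is the two bytes C3 A6. Both are legitimate ways to carry the character in a document declared as UTF-8.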

I am hoping that someone out there can shed some light on this problem and
tell me what I am doing wrong. Exactly how do I instruct JAXP to encode the
text nodes in my DOM so that it doesn't break my XML parser?

Regards,
Michael Berg
www.hyperpal.com


 
Michael Berg
      02-22-2004
Hi all,

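The problem is related to the use of a StringWriter to collect the XML
output. A StringWriter collects characters, not bytes, so no UTF-8 encoding
is ever applied to it; the bytes are produced later, in whatever encoding the
code that prints the string happens to use. So use an OutputStreamWriter with
an explicit encoding instead - like this, for example: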

java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
javax.xml.transform.stream.StreamResult result =
    new javax.xml.transform.stream.StreamResult(
        new java.io.OutputStreamWriter(baos, "UTF-8"));
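
For completeness, here is a self-contained sketch of the whole thing with the OutputStreamWriter in place. The class name SerializeUtf8 and the serialize() helper are made up for illustration. Setting OutputKeys.ENCODING explicitly is my addition, not part of the original code (UTF-8 is the usual default anyway), and the indent settings are left out to keep it short.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SerializeUtf8 {

    // Builds the test document from the post and serializes it to UTF-8 bytes.
    static byte[] serialize() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element el = doc.createElement("element");
        el.setAttribute("attr1", "attr1value");
        el.appendChild(doc.createTextNode("Danish < \u00E6\u00F8\u00E5 > characters!"));
        doc.appendChild(el);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Assumption: stated explicitly so the declaration and the writer agree.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // Closing the writer flushes any buffered characters into baos.
        try (Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8)) {
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = serialize();
        // Decode with the same charset the bytes were written in.
        System.out.println(new String(xml, StandardCharsets.UTF_8));
    }
}
```

The point is that the bytes in baos and the encoding named in the XML declaration now agree, so whoever receives those bytes can parse them. Another option that should work is handing the ByteArrayOutputStream straight to StreamResult, which lets the transformer do the encoding itself according to its output properties.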

/Michael
www.hyperpal.com
 