Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   XML (http://www.velocityreviews.com/forums/f32-xml.html)
-   -   Strangeness with Japanese, XML, Java (http://www.velocityreviews.com/forums/t169159-strangeness-with-japanese-xml-java.html)

Robert M. Gary 04-14-2005 11:58 PM

Strangeness with Japanese, XML, Java
 
I'm using JRE 1.5 on Solaris Japanese (SPARC). The JVM claims its default
character set is EUC-JP.
I'm seeing two strange things when using Japanese character sets...

1) If I write a program that does
System.out.println("$^%$%^^"); //assume those are Japanese characters that
are multibyte under EUC-JP
the resulting output looks NOTHING like the characters I typed in.
Apparently the character set used to read the literal is different from
the default.

2) If I create an XML document using the built-in DOM, with element values
in Japanese, I get strangeness when I transform that into an XML document.
If I do not set the character set in the transformer, the document will say
it's in UTF-8 (the XML header will). However, the actual document is NOT
UTF-8. I downloaded IBM's ICU character set utilities (they know nothing of
XML, just character sets), and when I tell uconv the document is UTF-8, it
claims the document is invalid UTF-8. However, if I tell it the document is
EUC-JP, it says it's good.
Also, when I change the transformer to use EUC-JP, it creates the same
document bit-for-bit (other than changing the XML header to say EUC-JP).
Other character sets (UTF-16, etc.) result in a different document.
So, my conclusion is that by default the XML DOM says UTF-8 in the header,
but ALWAYS uses the platform default unless you specify something else
(UTF-16, for example).

Has anyone else seen this??
Here is my transformer...

import java.io.StringWriter;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

//declarations implied by the snippet:
DocumentBuilder documentBuilder =
    DocumentBuilderFactory.newInstance().newDocumentBuilder();
Transformer transformer =
    TransformerFactory.newInstance().newTransformer();

Document new_document = documentBuilder.parse("japan2.xml");
System.out.println("I just read japan2.xml");
DOMSource new_source = new DOMSource(new_document);
StringWriter new_writer = new StringWriter();
StreamResult new_result = new StreamResult(new_writer);

Properties p = transformer.getOutputProperties();
//try explicit EUC-JP
//p.setProperty(OutputKeys.ENCODING, "EUC-JP");

//try the platform default (EUC-JP)
//p.setProperty(OutputKeys.ENCODING,
//    new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());

//try UTF-8 explicitly
//p.setProperty(OutputKeys.ENCODING, "UTF-8");

transformer.setOutputProperties(p);
Properties p2 = transformer.getOutputProperties();
p2.list(System.out);

transformer.transform(new_source, new_result);

String new_text_doc = new_writer.toString();
System.out.println("XML doc is " + new_text_doc);


Resulting document...
XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq
confirmed="true"
invokeId="2"><AlertList><Alert><Name>ja_alert-とちつなのに</Name><AffectedObjects
type="Obj"><Obj><Name>ja_mo-あえいおう</Name></Obj></AffectedObjects><Properties><Property><Name>Severity</Name><Value>major</Value></Property><Property><Name>Manager</Name><Value>NetExpert</Value></Property></Properties></Alert></AlertList><AttrList><Attr
name="TOD"><Int32>1112980583</Int32></Attr><Attr
name="DMPAlarmObject"><Str>ja_mo-あえいおう</Str></Attr><Attr
name="CLASS"><Str>NetExpert</Str></Attr><Attr
name="MANAGER"><Str>NetExpert</Str></Attr><Attr
name="DMPAlarmName"><Str>ja_alert-とちつなのに</Str></Attr><Attr
name="ARCHIVE_LENGTH"><Int32>0</Int32></Attr><Attr
name="DMPAlarmSeverity"><Str>major</Str></Attr><Attr
name="MsgType"><Str>Alarm</Str></Attr><Attr
name="MGR_PORT_KEY"><Int32>93</Int32></Attr><Attr
name="ARCHIVE_OFFSET"><Int32>0</Int32></Attr></AttrList></GenAlertsReq>

When I try to read it using IBM's ICU character set tool uconv I get the
following...
=> uconv -f UTF-8 ~/test/xml/japan.xml
Conversion to Unicode from codepage failed at input byte position 116.
Bytes: a4 Error: Illegal character found
<?xml version="1.0" encoding="UTF-8"?>
<GenAlertsReq confirmed="true"
invokeId="1"><AlertList><Alert><Name>ja_alert-

However, when I tell it the document is EUC-JP it works...
=> uconv -f EUC-JP ~/test/xml/japan.xml
<?xml version="1.0" encoding="UTF-8"?>
<GenAlertsReq confirmed="true" invokeId=......

So, the document appears to be EUC-JP even though the Java DOM says it's
UTF-8.
-Robert



Soren Kuula 04-15-2005 08:02 PM

Re: Strangeness with Japanese, XML, Java
 
Hi
Robert M. Gary wrote:
> I'm using JRE 1.5 on Solaris Japanese (Sparc). The JVM claims its default
> character set is EUC-JP
> I'm seeing two strange things when using Japanese character sets...


> 1) If I write a program that does
> System.out.println("$^%$%^^" ); //assume those are Japanese characters that
> are multibyte under EUC-JP
> The resulting output looks NOTHING like the characters I typed in.
> Apparently the character set being used to read the literal is different
> from the default.


1) Find out which encoding your Java source editor uses when it saves your
source files, and check the result.

2) javac -encoding <whatever you found above> ...java
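To illustrate (class name made up; \u3042 is HIRAGANA LETTER A): Unicode
escapes are plain ASCII on disk, so they sidestep the source-encoding
question entirely, and an explicit charset on the output stream sidesteps
the platform default on the way out:

```java
import java.io.PrintStream;
import java.nio.charset.Charset;

public class LiteralCheck {
    public static void main(String[] args) throws Exception {
        // This is the charset javac assumes for source files when no
        // -encoding flag is given, and the one System.out encodes with:
        System.out.println("platform default: " + Charset.defaultCharset());

        // \u escapes survive any source-encoding mix-up because they
        // are pure ASCII in the .java file:
        String a = "\u3042";

        // Print through a stream with an explicit charset instead of
        // trusting the platform default:
        new PrintStream(System.out, true, "UTF-8").println(a);
    }
}
```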

> 2) If I create an XML document using the built in DOM which contains
> elements with values in Japanese, I get strangeness when I transform that
> into an XML document. If I do not set the character set in the transformer
> the document will say its in UTF-8 (the XML header will). However, the
> actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities
> (it knows nothing of XML, just character sets) and when I try to read the
> document when telling uconv it is UTF-8 it claims it is invalid UTF-8.
> However, if I try to read it telling it the document is EUC-JP it says its
> good.


How do you serialize your DOMs? I guess you will have
UTF-8-decode(EUC-JP-encode(UTF-8-decode(EUC-JP-encode(literals))))
if you edit in EUC-JP, compile as UTF-8, and run your data through a
Writer that takes the platform default encoding ... that's a mess :)

Check that you override the platform default encoding and really go
UTF-8 when you serialize.
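Concretely, a sketch of what I mean (class name made up): hand the
Transformer an OutputStream rather than a Writer. With an OutputStream the
Transformer performs the byte encoding itself, so the bytes really match
the header; with a StringWriter it only produces characters, and whatever
later turns them into bytes decides the real encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class Utf8Serialize {
    // Serialize to bytes, letting the Transformer do the encoding.
    static byte[] toUtf8Bytes(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = toUtf8Bytes("<a>\u3042</a>");
        // The bytes now really are UTF-8, matching the header:
        System.out.println(new String(bytes, "UTF-8").contains("\u3042")); // true
    }
}
```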

> Also, when I change the transformer to use EUC-JP it creates the same
> document bit-for-bit (other than changing the XML header to say EUC-JP).


Problem is where you serialize the document, not where you construct,
modify or transform it. And possibly in the decoding (by javac) of your
program text literals.

> Other character sets (UTF-16, etc.) result in a different document.


Probably the document is read in correctly .. anything other than Unicode
or EUC-JP will not be able to represent all the Japanese, and will break.

> So, my conclusion is that by default the XML DOM says it's UTF-8 in the
> header, but ALWAYS uses the platform default unless you specify something
> else (UTF-16, for example).


I'm pretty sure the error is where you output the data (you haven't
shown it..)

> Has anyone else seen this??


All the time...

> Document new_document = documentBuilder.parse("japan2.xml");


Verify until you are bloody sure what the encoding is of your input
document, and that it really matches with what the header says.
I think a mismatch will not result in an exception or anything, only bad
contents...
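One way to be bloody sure: do from Java exactly what uconv does, a strict
decode of the raw bytes against the charset the header claims. A sketch
(class name made up; a4 a2 is HIRAGANA LETTER A in EUC-JP, much like the
byte that tripped uconv at position 116):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

public class EncodingProbe {
    // True iff the bytes are valid in the given charset. A fresh
    // CharsetDecoder reports (throws on) malformed input by default,
    // so this is a strict check, not a replacing one.
    static boolean decodesAs(byte[] data, String charset) {
        try {
            Charset.forName(charset).newDecoder().decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] euc = "\u3042".getBytes("EUC-JP"); // a4 a2
        System.out.println(decodesAs(euc, "UTF-8"));  // false: 0xa4 is illegal as a UTF-8 lead byte
        System.out.println(decodesAs(euc, "EUC-JP")); // true
    }
}
```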
> System.out.println("I just read japan2.xml");
> DOMSource new_source = new DOMSource(new_document);
> StringWriter new_writer = new StringWriter();
> StreamResult new_result = new StreamResult(new_writer);
> Properties p = transformer.getOutputProperties();
> //try explicit EUC
> //p.setProperty(OutputKeys.ENCODING, "EUC-JP");
>
> //try the platform default (EUC-JP)
> //p.setProperty(OutputKeys.ENCODING,
> //    new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());
>
> //try UTF-8 explicitly
> //p.setProperty(OutputKeys.ENCODING, "UTF-8");
>
> transformer.setOutputProperties(p);
> Properties p2 = transformer.getOutputProperties();
> p2.list(System.out);
>
> transformer.transform(new_source, new_result);
>
> String new_text_doc = new_writer.toString();
> System.out.println("XML doc is "+new_text_doc );


Please show us how it got into that file.
> Resulting document...
> XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq
> confirmed="true"

....

Soren


Soren Kuula 04-15-2005 09:19 PM

Re: Strangeness with Japanese, XML, Java
 
Hi, Robert and myself,
Soren Kuula wrote:

>> Also, when I change the transformer to use EUC-JP it creates the same
>> document bit-for-bit (other than changing the XML header to say EUC-JP).

>
>
> Problem is where you serialize the document, not where you construct,
> modify or transform it. And possibly in the decoding (by javac) of your
> program text literals.
>
>> Other character sets (UTF-16, etc.) result in a different document.

>
> Probably the document is read in correctly .. anything other than Unicode
> or EUC-JP will not be able to represent all the Japanese, and will break.


Sorry, I misunderstood you there .. you mean, the OUTput is identical
except for the header?

I would take that as an indication that whatever you use for serializing
the DOM to a byte sequence (a file) does not look at what you set the
transformer to. You will have to control that elsewhere.

Are you by any chance instantiating your own Writers when serializing?
Have you tried giving them different encoding settings?
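A small sketch of the effect (class name made up): when the result is a
StringWriter, the Transformer only produces characters, so OutputKeys.ENCODING
can change nothing but the header label. That would explain the
bit-for-bit-identical output you saw, with only the header differing:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class HeaderOnly {
    // Serialize to a String; the ENCODING property is only advisory here,
    // because no bytes are ever produced.
    static String serializeToString(String xml, String encoding) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, encoding);
        StringWriter w = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(w));
        return w.toString();
    }

    public static void main(String[] args) throws Exception {
        String utf8 = serializeToString("<a>\u3042</a>", "UTF-8");
        String euc  = serializeToString("<a>\u3042</a>", "EUC-JP");
        // Identical characters after the XML declaration; only the
        // declared encoding in the header differs:
        System.out.println(utf8.substring(utf8.indexOf("?>"))
                               .equals(euc.substring(euc.indexOf("?>"))));
    }
}
```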

Soren


