Andy Fish wrote:
> Hi,
>
> I have a servlet (running under tomcat 4.1, java 1.4.2) that sends XML in
> the HTTP body. I want the XML to be encoded in UTF-8.
>
> when I run Tomcat on windows 2000, the XML appears fine on the client end,
> but running Tomcat on debian woody linux, accented characters don't appear
> correctly. In the XML output stream, each accented character comes out as
> two characters, so obviously the fact that it's supposed to be UTF-8 is
> being lost.
No, that's not obvious at all. Not from the information you have given.
Unicode provides for logical characters to be composed of two or more
characters; for instance, a lowercase u with an umlaut could be
represented as the latin lowercase 'u' followed by the umlaut "combining
character". Many of the more common combinations also have
single-character representations, including the u-umlaut example, and
pretty much all the "diacriticalized" characters used in Western
European languages. The alternative representations are equivalent as
far as Unicode is concerned, and Unicode processors are permitted to
freely substitute one for another. They should be displayed or printed
the same by a conformant processor.
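To make the u-umlaut example concrete, here is a minimal sketch in Java. It assumes Java 6 or later, because it uses java.text.Normalizer, which is not available on the 1.4.2 runtime mentioned in the question. The class name CombiningDemo and the helper toNfc are made up for illustration.

```java
import java.text.Normalizer;

public class CombiningDemo {
    // Hypothetical helper: returns the composed (NFC) form of a string.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00fc"; // U+00FC, u with diaeresis as one code point
        String decomposed = "u\u0308"; // plain 'u' followed by U+0308 COMBINING DIAERESIS

        System.out.println(precomposed.length());                  // prints 1
        System.out.println(decomposed.length());                   // prints 2
        // A plain comparison sees two different strings ...
        System.out.println(precomposed.equals(decomposed));        // prints false
        // ... but after normalization they are identical.
        System.out.println(toNfc(decomposed).equals(precomposed)); // prints true
    }
}
```

Both strings should render as the same glyph, yet their character counts differ, which is why a count of two says nothing by itself about the encoding.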
Moreover, the fact that you are making judgements about the "UTF-8ness"
of the stream based on the character count leads me to wonder whether
perhaps you are confusing characters with bytes / octets, or whether you
misunderstand the nature of character encodings. The character count
has little to do with whether the characters are encoded in UTF-8;
rather it has everything to do with which character or characters have
been encoded. The byte count has more relation to the encoding, but is
still closely tied to the characters that have been encoded.
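The distinction between characters and bytes can be sketched in Java as follows. This is only an illustration, not a diagnosis of the original problem: the class name EncodingDemo and its helpers are hypothetical, and the String/Charset overloads it uses assume Java 6 or later.

```java
import java.nio.charset.Charset;

public class EncodingDemo {
    // Hypothetical helper: number of bytes a string occupies in a given encoding.
    static int byteCount(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    // Hypothetical helper: interpret raw bytes using a given encoding.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // one character: e with acute accent
        byte[] utf8 = text.getBytes(Charset.forName("UTF-8"));

        System.out.println(text.length());                       // prints 1 (characters)
        System.out.println(byteCount(text, "UTF-8"));            // prints 2 (bytes)
        System.out.println(byteCount(text, "ISO-8859-1"));       // prints 1 (bytes)

        // The same two bytes yield one character or two, depending
        // solely on which encoding the reader assumes.
        System.out.println(decode(utf8, "UTF-8").length());      // prints 1
        System.out.println(decode(utf8, "ISO-8859-1").length()); // prints 2
    }
}
```

The encoded bytes are identical in the last two lines. Only the interpretation changes, so counting what a viewer displays does not reveal how the stream was encoded.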
> here's how I'm streaming the XML:
>
> response.setContentType("text/xml");
Better would probably be "text/xml; charset=UTF-8", so that the response
header tells the client explicitly which encoding the body uses.
> OutputStream os = response.getOutputStream();
> OutputStreamWriter osw = new OutputStreamWriter(os , "UTF-8");
> PrintWriter pw = new PrintWriter(osw);
> pw.print("..all the xml..");
>
> If, instead of writing to the response object, I write to a
> FileOutputStream, the accented characters appear OK in the file.
As judged how? Which tool did you view the file with, and which encoding
did that tool assume?
> I'm a bit stuck here because when I wrote this code, I read up all about
> character encoding and did what I thought was right, and it all worked on my
> Win2000 test system. I can't figure out what could be going wrong on the
> linux box.
The output part looks okay to me. I suspect you have a different
problem than you think you have.
John Bollinger