Velocity Reviews

Wolfgang 06-09-2004 02:25 AM

"Mangled" Servlet Unicode Output Characters
 
I have a very simple servlet, see code at

http://www.alexandria.ucsb.edu/~rnott/tmp/Test.java

The servlet reads lines from a file tmp1.txt and just writes them back
to a Web page.

The lines in tmp1.txt contain UTF-8 encoded text, including some
special characters, such as the Norwegian 'ø' as in Magerøya, or the
German 'ö' as in Sömmerda.

The servlet generates the Web page ok, listing the lines from file
tmp1.txt.

However, the special characters like 'ø' and 'ö' don't show up on the
Web page, instead they are mangled, like 'Ã¶' instead of 'ö' (other,
regular characters are fine).

Why are the special characters mangled, and what do I do to have them
show up properly on the Web page?

Thanks for your help and advice.

Wolfgang,
Santa Barbara, CA

John C. Bollinger 06-09-2004 07:42 PM

Re: "Mangled" Servlet Unicode Output Characters
 
Wolfgang wrote:

> I have a very simple servlet, see code at
>
> http://www.alexandria.ucsb.edu/~rnott/tmp/Test.java
>
> The servlet reads lines from a file tmp1.txt and just writes them back
> to a Web page.
>
> The lines in tmp1.txt contain UTF-8 encoded text, including some
> special characters, such as the Norwegian 'ø' as in Magerøya, or the
> German 'ö' as in Sömmerda.
>
> The servlet generates the Web page ok, listing the lines from file
> tmp1.txt.
>
> However, the special characters like 'ø' and 'ö' don't show up on the
> Web page, instead they are mangled, like 'Ã¶' instead of 'ö' (other,
> regular characters are fine).
>
> Why are the special characters mangled, and what do I do to have them
> show up properly on the Web page?


These are typical symptoms of a character encoding mismatch. I see in
your source code that you use the system's default encoding to read the
text file. If the system default is not UTF-8 then that will be a
problem. You almost have it right in that regard: just pass the string
"UTF-8" as an additional parameter to your InputStreamReader's constructor
(you will also have to add a handler for an additional checked exception,
UnsupportedEncodingException).
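
The mismatch described above can be reproduced without the servlet. The following is a minimal sketch, not the original Test.java: the class and method names are hypothetical, and an in-memory byte array stands in for tmp1.txt. It also swaps in the InputStreamReader constructor that takes a Charset (with StandardCharsets, Java 7 and later) rather than the string "UTF-8", which avoids the checked exception.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from the original servlet.
final class ReadUtf8Demo {

    // Reads the first line of the stream, decoding its bytes with the given charset.
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // One UTF-8 encoded line, standing in for the contents of tmp1.txt.
        // \u00f6 is o-umlaut; it occupies two bytes in UTF-8.
        byte[] fileBytes = "S\u00f6mmerda".getBytes(StandardCharsets.UTF_8);

        // Wrong charset: each of the two bytes is decoded as its own character.
        String wrong = readFirstLine(
            new ByteArrayInputStream(fileBytes), StandardCharsets.ISO_8859_1);

        // Matching charset: the two bytes are decoded back into one character.
        String right = readFirstLine(
            new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8);

        System.out.println(wrong.length()); // 9
        System.out.println(right.length()); // 8
    }
}
```

With a real file, the same reader would wrap a FileInputStream instead of the ByteArrayInputStream used here.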

Your output HTML is also a bit funky, as you are declaring an XML
document with the HTML 4 Transitional DTD. Transitional HTML 4 is not
necessarily well-formed XML. You should probably either drop the XML
declaration or (if you can) go all the way to XHTML. This is probably
not the cause of your current problem, but either way, I recommend that
you specify the charset in the response's content-type, rather than
relying on the XML declaration. To do so, use

    response.setContentType("text/html; charset=UTF-8");
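
Why the declared charset matters on the output side can be sketched with plain java.io, without a servlet container. In a servlet, the charset given to setContentType selects the encoding used by the writer from response.getWriter(), provided it is set before getWriter() is called. The sketch below is only a stand-in for that: an OutputStreamWriter plays the role of the response writer, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it imitates a response writer with a plain Writer.
final class WriteCharsetDemo {

    // Encodes the text with the given charset and returns the bytes produced.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) before the bytes are read.
        return out.toByteArray();
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String oUmlaut = "\u00f6";
        System.out.println(hex(encode(oUmlaut, StandardCharsets.UTF_8)));      // C3 B6
        System.out.println(hex(encode(oUmlaut, StandardCharsets.ISO_8859_1))); // F6
    }
}
```

The same character yields different bytes under each encoding, so the browser can only display it correctly when the charset announced in the content type matches the one the writer actually used.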


John Bollinger
jobollin@indiana.edu

Wolfgang 06-09-2004 08:48 PM

Re: "Mangled" Servlet Unicode Output Characters
 
Thanks, John, for your corrections to my code. This makes things work.

For those interested, I also found essentially the same advice (with
more detail) at
http://www.jorendorff.com/articles/unicode/java.html

Wolfgang



