
Java - UTF-8 problems with windows

 
08-10-2009, 10:29 PM   #1
UTF-8 problems with windows


I have the following code fragment in a tiny webserver:

...
os = sock.socket().getOutputStream();
osr = new PrintWriter(new PrintStream(os, true, "UTF-8"));
osr.println("HTTP/1.1 200 OK");
osr.println("Content-Type: text/html; charset=utf-8");
osr.println();
osr.println(test());
...

private String test() {
String ret = null;
try {
StringBuffer tmpl = new StringBuffer
("<html><head></head><body>H\u00e2n</body></html>");
ret = tmpl.toString();
}
catch (Exception e) {
e.printStackTrace();
}
System.out.println(ret);
return ret;
}

With Linux, firefox and opera there is no problem and
the a with circumflex is printed nicely.

On Windows xp I get neither firefox nor IE to work correctly.

Firefox shows some FFFD square, but when I change from the (detected)
UTF-8 encoding to ISO-8859-1, it displays things correctly. But that
would be the wrong encoding!?

IE shows some empty rectangle in the main browser window, but when
looking at the page source, everything is shown correctly!?

I have seen the correct output, but don't remember how I got it; so
it's not missing glyphs.

This is probably not a Java question, as I suspect some windows magic
to happen here. Maybe it has something to do with the infamous BOM?
(I tried setting "file.encoding" to "UTF-8" for what it's worth. And
the cmd prompt then shows an o with circumflex from the
System.out.println, but that's due to the Windows legacy encoding, I
think.)

Michael


Michael Jung

08-11-2009, 04:11 AM   #2
Knute Johnson
Re: UTF-8 problems with windows
Michael Jung wrote:
> I have the following code fragment in a tiny webserver:
>
> ...
> os = sock.socket().getOutputStream();
> osr = new PrintWriter(new PrintStream(os, true, "UTF-8"));
> osr.println("HTTP/1.1 200 OK");
> osr.println("Content-Type: text/html; charset=utf-8");
> osr.println();
> osr.println(test());
> ...
>
> private String test() {
> String ret = null;
> try {
> StringBuffer tmpl = new StringBuffer
> ("<html><head></head><body>H\u00e2n</body></html>");
> ret = tmpl.toString();
> }
> catch (Exception e) {
> e.printStackTrace();
> }
> System.out.println(ret);
> return ret;
> }
>
> With Linux, firefox and opera there is no problem and
> the a with circumflex is printed nicely.
>
> On Windows xp I get neither firefox nor IE to work correctly.
>
> Firefox shows some FFFD square, but when I change from the (detected)
> UTF-8 encoding to ISO-8859-1, it displays things correctly. But that
> would be the rwong encoding!?
>
> IE shows some empty rectangle in the main browser window, but when
> looking at the page source, everything is shown correctly!?
>
> I have seen the correct output, but don't remember how I got it; so
> it's not missing glyphs.
>
> This is probably not a Java question, as I suspect some windows magic
> to happen here. Maybe it has something to do with the infamous BOM?
> (I tried setting "file.encoding" to "UTF-8" for what it's worth. And
> the cmd prompt from the out.println then o with circumflex, but that's
> due to the windows legacy encoding, I think.)
>
> Michael


Michael:

I've been playing around with this and I can't get it to work correctly
on Windows or Linux. I tried just putting a file with the 0xE2
character on my web server (which is set to default to UTF-8) and I get
a black square rotated 45 degrees with a white ? in it. If I reset the
character encoding to ISO-8859-1 in the browser the character appears
correctly. There is something I don't understand here and hopefully you
will get a better answer.

--

Knute Johnson
email s/nospam/knute2009/



08-11-2009, 07:07 AM   #3
Roedy Green
Re: UTF-8 problems with windows
On Mon, 10 Aug 2009 23:29:04 +0200, Michael Jung
<> wrote, quoted or indirectly quoted someone
who said :

>With Linux, firefox and opera there is no problem and
>the a with circumflex is printed nicely.



0x00e2 is supposed to be &acirc; in Unicode and ISO-8859-1. In UTF-8
that code point is not sent as the single byte 0xE2 but as the two
bytes 0xC3 0xA2.

However, in a proprietary windows encoding, it could be anything. What
encoding is your System.out.println using?

To find out, dump a set of chars 0 .. 255 to System.out and redirect
them to a file. Then look at the file with the EncodingRecogniser
utility.
See http://mindprod.com/jgloss/encoding.html

You might find windows-1252, Cp437, Cp850...

Also try dumping out the character as hex. You will see it is likely
just fine. It is just System.out screwing it up.
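
A minimal sketch of the hex dump suggested above, for anyone who wants to try it. The class name HexDump and the helper hex() are made up for illustration; only standard JDK calls are used. It prints the code point of the character and the bytes it becomes in ISO-8859-1 and in UTF-8, without going through the console's own encoding for the character itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Hypothetical helper: encode s in the given charset and
    // return the resulting bytes as space-separated hex.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00e2"; // a with circumflex
        // Only ASCII reaches System.out, so the console encoding cannot mangle it.
        System.out.println("code point: U+" + String.format("%04X", (int) s.charAt(0)));
        System.out.println("ISO-8859-1: " + hex(s, StandardCharsets.ISO_8859_1)); // E2
        System.out.println("UTF-8:      " + hex(s, StandardCharsets.UTF_8));      // C3 A2
    }
}
```

If the dump shows E2 for ISO-8859-1 and C3 A2 for UTF-8, the string itself is fine and the damage happens later, either in the writer chain or in the console.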

--
Roedy Green Canadian Mind Products
http://mindprod.com

"You can have quality software, or you can have pointer arithmetic; but you cannot have both at the same time."
~ Bertrand Meyer (born: 1950 age: 59) 1989, creator of design by contract and the Eiffel language.


08-11-2009, 07:12 AM   #4
Roedy Green
Re: UTF-8 problems with windows
On Mon, 10 Aug 2009 23:29:04 +0200, Michael Jung
<> wrote, quoted or indirectly quoted someone
who said :

>
>On Windows xp I get neither firefox nor IE to work correctly.


Some other things to try:

1. Use Wireshark to snoop on the messages your server is sending. See
whether the problem is in the server or in the client browser. Make
sure your headers and body are encoded as you intended.

See http://mindprod.com/jgloss/wireshark.html

2. Check the font. If your font does not support &acirc; it won't
support an embedded 0x00e2. Try embedding &acirc; (the entity, not
the hex) in your text body.
Use http://mindprod.com/jgloss/fontshower.html to make sure the font
supports &acirc;
--
Roedy Green Canadian Mind Products
http://mindprod.com



08-11-2009, 11:25 AM   #5
Michael Jung
Re: UTF-8 problems with windows
Steven Simpson <> writes:
> Michael Jung wrote:
>> I have the following code fragment in a tiny webserver:
>> ...
>> os = sock.socket().getOutputStream();
>> osr = new PrintWriter(new PrintStream(os, true, "UTF-8"));


> This looks rather strange. I'd prefer to go for something like this:
>
> new PrintWriter(new OutputStreamWriter(os, "UTF-8"))


I used to have new PrintWriter(os), but wanted to enforce the encoding
and PrintWriter doesn't take one. *That* is the convenience
constructor that would be needed.

> Here's what I suspect:
>
> * PrintStream is an OutputStream, so most of its methods just take
> bytes, and it happens to have a few more which take chars and
> Strings. These extra methods will do the char->UTF-8 conversion
> (an internal OutputStreamWriter is created), but the byte-based
> methods can't - they're already bytes.
> * PrintWriter can take an OutputStream. If it does so, it will also
> insert its own OutputStreamWriter (using the local system's charset).
> * Chars passed to the PrintWriter are converted using its
> OutputStreamWriter, and never get passed on to the
> char/String-based methods of the PrintStream, so its charset
> encoder does not get used.
>
> Result: you're writing using the native encoding of your server,
> regardless of what you tell the PrintStream.
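
The chain described above can be tried out against an in-memory stream instead of a socket. This is only a sketch: the class name WriterChain and both helper methods are invented for illustration. With the JDKs current at the time of this thread, the first chain encodes with the platform default charset, so it yields the single byte E2 on a windows-1252 machine and C3 A2 where the default is UTF-8 (which would match the Linux observation). Newer JDK releases may behave differently here, so the first result is printed rather than assumed.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintStream;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterChain {
    // The chain from the original post: the PrintWriter sees the PrintStream
    // only as an OutputStream and hands it bytes it has already encoded itself.
    static byte[] viaPrintStream(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            PrintWriter w = new PrintWriter(new PrintStream(buf, true, "UTF-8"));
            w.print(s);
            w.flush();
            return buf.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // The suggested chain: the OutputStreamWriter does the encoding, in UTF-8.
    static byte[] viaOutputStreamWriter(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintWriter w = new PrintWriter(new OutputStreamWriter(buf, StandardCharsets.UTF_8));
        w.print(s);
        w.flush();
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("default charset:        " + Charset.defaultCharset());
        System.out.println("via PrintStream:        " + viaPrintStream("\u00e2").length + " byte(s)");
        System.out.println("via OutputStreamWriter: " + viaOutputStreamWriter("\u00e2").length + " byte(s)");
    }
}
```

The second chain does not depend on the machine it runs on, which is the point of the suggestion.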


Now that you mention it, this is what I found in the PrintStream Javadoc:

"All characters printed by a PrintStream are converted into bytes
using the platform's default character encoding. The PrintWriter class
should be used in situations that require writing characters rather
than bytes."

It even says so in the Javadoc of the constructor I used. *blush*

Thank you very much.

Bonus question: what is the encoding parameter good for in the
constructor of the PrintStream? It actually led me down the wrong
track.
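
Going by the explanation quoted above, the encoding parameter applies when text goes through PrintStream's own char and String methods rather than through a wrapping PrintWriter. A minimal sketch of that case, with the class name PrintStreamEncoding and the helper printDirect() made up for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class PrintStreamEncoding {
    // Hypothetical helper: print a String directly on a PrintStream that was
    // constructed with an explicit encoding and return the bytes written.
    static byte[] printDirect(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            PrintStream ps = new PrintStream(buf, true, "UTF-8");
            ps.print(s); // String-based method, so the stream's UTF-8 encoder is used
            ps.flush();
            return buf.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        byte[] bytes = printDirect("\u00e2");
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF); // C3 A2
        }
        System.out.println();
    }
}
```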

Michael


08-11-2009, 08:24 PM   #6
jolz
Re: UTF-8 problems with windows
> osr.println("HTTP/1.1 200 OK");
> osr.println("Content-Type: text/html; charset=utf-8");
> osr.println();
> osr.println(test());


> With Linux, firefox and opera there is no problem and
> the a with circumflex is printed nicely.


I don't think it is required to work even with plain ASCII, especially
on Linux:

1.
public void println()

Terminate the current line by writing the line separator string.
The line separator string is defined by the system property
line.separator, and is not necessarily a single newline character ('\n').

2.
Response = Status-Line ; Section 6.1
*(( general-header ; Section 4.5
| response-header ; Section 6.2
| entity-header ) CRLF) ; Section 7.1
CRLF
[ message-body ] ; Section 7.2

CRLF = CR LF

CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
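
One way to satisfy the grammar above regardless of line.separator is to write the CRLF explicitly instead of calling println(). A minimal sketch, assuming the body is already available as a String; the class name CrlfResponse and both methods are invented for illustration, and the sketch leaves out Content-Length and connection handling.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class CrlfResponse {
    static final String CRLF = "\r\n";

    // Writes status line, one header, the empty line and the body,
    // ending each header line with an explicit CR LF.
    static void writeResponse(OutputStream out, String body) throws IOException {
        Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        w.write("HTTP/1.1 200 OK" + CRLF);
        w.write("Content-Type: text/html; charset=utf-8" + CRLF);
        w.write(CRLF);
        w.write(body);
        w.flush();
    }

    // Convenience for trying it out without a socket.
    static byte[] render(String body) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            writeResponse(buf, body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = render("<html><head></head><body>H\u00e2n</body></html>");
        System.out.println(raw.length + " bytes written");
    }
}
```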


08-11-2009, 08:26 PM   #7
Michael Jung
Re: UTF-8 problems with windows
Thomas Pornin <> writes:
> In the Javadoc of JDK-1.1.8, PrintStream was documented as
> being deprecated. Both public constructors include the comment:
> "Note: PrintStream() is deprecated." and go on to state that
> PrintWriter should be used.
>
> In JDK-1.3.1, the comments about deprecation are gone (I do not have
> the Javadoc for JDK-1.2, so I cannot check there). PrintStream got
> "reprecated". At some point between 1.1.8 and 1.3.1, Sun realized
> that explicit deprecation is not enough to get rid of a troublesome
> class, and that too much code was using PrintStream to allow for
> a simple removal (it would break too much existing code).


It would not be enough, but it would help. Or does the danger of
refactoring wrongly (by people trying to get rid of every warning in
sight) outweigh the benefits of a cleaner interface with deprecated
parts?

Michael


08-12-2009, 01:07 AM   #8
Lew
Re: UTF-8 problems with windows
Thomas Pornin wrote:
> Backward compatibility goes to
> a great extent to explain why Java is as it is nowadays. Examples
> of quirks include the following:

....
> -- There are both java.net.URI and java.net.URL, with oh-so-slightly
> different handlings of nominally invalid URLs (especially when there
> are spaces in the string).


That one doesn't belong on your list. The classes exist to handle the
functional differences between URIs generally and URLs specifically. As the
URI Javadocs state:
> The conceptual distinction between URIs and URLs is reflected in the
> differences between this class and the URL class.


--
Lew


08-12-2009, 09:50 AM   #9
Mike Schilling
Re: UTF-8 problems with windows
Lew wrote:
> Thomas Pornin wrote:
>> Backward compatibility goes to
>> a great extent to explain why Java is as it is nowadays. Examples
>> of quirks include the following:
>> Strings consist in sequences of 'char', not 'int'.

I'd put this one as "chars are fixed at 16 bits rather than simply
'big enough to hold all Unicode characters'". 24 bits would be
sufficient to get rid of surrogates.
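
To make the surrogate point concrete, here is a small sketch. The class name Surrogates and the helper charsNeeded() are made up; the code point U+1D11E (musical G clef) is just one example of a character outside the 16-bit range.

```java
public class Surrogates {
    // Hypothetical helper: how many 16-bit chars Java needs for one code point.
    static int charsNeeded(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // 2 chars
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));
        System.out.println(charsNeeded(0x00E2));                    // 1 char
    }
}
```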

And I'd add:
NullPointerExceptions in a language that insists it doesn't have
pointers.

In DOM, the null namespace is represented by a null String. In SAX,
by an empty string.

>> -- There are both java.net.URI and java.net.URL, with
>> oh-so-slightly
>> different handlings of nominally invalid URLs (especially when
>> there
>> are spaces in the string).

>
> That one doesn't belong on your list. The classes exist to handle
> the
> functional differences between URIs generally and URLs specifically.
> As the URI Javadocs state:
>> The conceptual distinction between URIs and URLs is reflected in
>> the
>> differences between this class and the URL class.


It belongs on a different list, one where Java accurately models a
historical quirk in a different domain.




08-12-2009, 12:39 PM   #10
Michael Jung
Re: UTF-8 problems with windows
jolz <> writes:
>> osr.println("HTTP/1.1 200 OK");
>> osr.println("Content-Type: text/html; charset=utf-8");
>> osr.println();
>> osr.println(test());

[...]

> I don't think it is required to work even with plain ASCII, especially
> on linux.:


> public void println()
> Terminate the current line by writing the line separator
> string. The line separator string is defined by the system property
> line.separator, and is not necessarily a single newline character
> ('\n').


> Response = Status-Line ; Section 6.1
> *(( general-header ; Section 4.5
> | response-header ; Section 6.2
> | entity-header ) CRLF) ; Section 7.1
> CRLF
> [ message-body ] ; Section 7.2


1. What I described would have been a strange phenomenon of this error
indeed.
2. You are right.
3. I have yet to meet a client to complain.

Michael

