On 12/10/2010 11:12 AM, cs_professional wrote:
> I understand that Java Strings are Unicode (charset), but how are Java
> Strings stored in memory? As UTF-16 encoding or using the platform's
> default charset?
Strings are internally sequences of chars, which are unsigned 16-bit
integers representing UTF-16 code units (a character outside the Basic
Multilingual Plane takes two of them).
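To see the 16-bit units for yourself, here is a minimal sketch. The class
and method names (Utf16Units, codeUnits, codePoints) are made up for
illustration; only String.length() and String.codePointCount() are real
JDK calls.

```java
final class Utf16Units {
    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points the string represents.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String pi = "\u03C0";         // Greek small letter pi, inside the BMP
        String clef = "\uD834\uDD1E"; // U+1D11E musical G clef, outside the BMP

        System.out.println(codeUnits(pi) + " " + codePoints(pi));     // 1 1
        System.out.println(codeUnits(clef) + " " + codePoints(clef)); // 2 1
    }
}
```

The G clef needs a surrogate pair, so length() reports 2 even though it
is a single character to the user.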
> There seems to be conflicting information on this, the official String
> javadoc says platform's default charset:
> http://download.oracle.com/javase/6/...ml#String(byte[])
> "Constructs a new String by decoding the specified array of bytes
> using the platform's default charset."
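That constructor is about decoding bytes into a String, not about how
the String is stored. When converting to or from a byte stream, Strings
use the platform default charset unless you specify one.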
> On my windows machine the above calls return Windows-1252 or CP-1252
> (they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
> So does this mean all Java Strings are encoded and stored in memory in
> this Windows-1252 or CP-1252 format?
It can't be, since you can store, say, π in a Java string, which is not
a character in CP-1252. On the other hand, if your default charset is
CP-1252, you can't serialize that character (you'll get ? instead).
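A quick way to check this, as a sketch: the class name LossyEncode and
the roundTrip helper are hypothetical, and I am assuming your JDK ships
the windows-1252 charset (the common ones do, but only a handful of
charsets are guaranteed by the spec).

```java
import java.nio.charset.Charset;

final class LossyEncode {
    // Encode to bytes in the given charset, then decode those bytes back.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available in this JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        String pi = "\u03C0";   // not representable in CP-1252
        String euro = "\u20AC"; // representable in CP-1252 (byte 0x80)

        System.out.println(roundTrip(pi, cp1252).equals("?"));    // true
        System.out.println(roundTrip(euro, cp1252).equals(euro)); // true
    }
}
```

getBytes() silently substitutes the charset's replacement byte for
anything it can't map, so the loss shows up only when you read the data
back.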
> Btw, I'm trying to understand this so I know what to expect in a more
> complex i18n Browser-Servlet scenario.
What you have to be concerned about is the translation between byte
arrays (or any input/output that reads/writes bytes, possibly
autoconverting (!) characters) and character arrays (or Strings or other
containers implementing CharSequence).
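In practice that usually means naming the charset at every byte/char
boundary instead of relying on the default. A rough sketch (the class
ExplicitCharset and its encode/decode helpers are invented for this
example, and it again assumes windows-1252 is available):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharset {
    // chars -> bytes, with the charset stated explicitly
    static byte[] encode(String text, Charset cs) {
        return text.getBytes(cs);
    }

    // bytes -> chars, with the charset stated explicitly
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String original = "\u03C0"; // Greek small letter pi
        byte[] wire = encode(original, StandardCharsets.UTF_8); // bytes CF 80

        // Same charset on both sides: the text survives.
        System.out.println(decode(wire, StandardCharsets.UTF_8).equals(original)); // true

        // Mismatched charset on the reading side: two wrong characters.
        Charset cp1252 = Charset.forName("windows-1252"); // assumed available
        System.out.println(decode(wire, cp1252).equals(original)); // false
    }
}
```

The same rule applies to readers and writers wrapped around streams:
if both ends agree on the charset, what is in memory doesn't matter.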
--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth