String default encoding: UTF-16 or Platform's default charset?

 
 
cs_professional
Guest
Posts: n/a
 
      12-10-2010
I understand that Java Strings are Unicode (charset), but how are Java
Strings stored in memory? As UTF-16 encoding or using the platform's
default charset?

There seems to be conflicting information on this; the official String
javadoc says platform's default charset:
http://download.oracle.com/javase/6/...ml#String(byte[])
"Constructs a new String by decoding the specified array of bytes
using the platform's default charset."

I assume the platform's default charset is what you can get by
calling:
System.getProperty("file.encoding") OR
http://java.sun.com/javase/6/docs/ap...efaultCharset()
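
For reference, the two lookups mentioned above can be compared with a few lines of Java. This is only a sketch; the class name DefaultCharsetProbe is made up for illustration, and the printed values vary by platform and JDK version.

```java
import java.nio.charset.Charset;

class DefaultCharsetProbe {
    /** Returns the charset the JVM falls back to when no charset is given explicitly. */
    static Charset platformDefault() {
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        // Both values describe byte-level conversions only; neither one
        // reveals how a String keeps its characters in memory.
        System.out.println("Charset.defaultCharset() = " + platformDefault());
        System.out.println("file.encoding            = " + System.getProperty("file.encoding"));
    }
}
```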

On my windows machine the above calls return Windows-1252 or CP-1252
(they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
So does this mean all Java Strings are encoded and stored in memory in
this Windows-1252 or CP-1252 format?

However, the "Java Internationalization FAQ" says UTF-16:
http://java.sun.com/javase/technolog...mended-charset
"... internal representation in Java, which is UTF-16".

So, what is the correct answer? Are Java Strings stored in memory as
UTF-16 or the platform's default charset?

Btw, I'm trying to understand this so I know what to expect in a more
complex i18n Browser-Servlet scenario.
 
Arne Vajhøj
Guest
Posts: n/a
 
      12-10-2010
On 10-12-2010 11:12, cs_professional wrote:
> I understand that Java Strings are Unicode (charset), but how are Java
> String's stored in memory? As UTF-16 encoding or using the platform's
> default charset?
>
> There seems to be conflicting information this, the official String
> javadoc says platform's default charset:
> http://download.oracle.com/javase/6/...ml#String(byte[])
> "Constructs a new String by decoding the specified array of bytes
> using the platform's default charset."
>
> I assume the platform's default charset is what you can get by
> calling:
> System.getProperty("file.encoding") OR
> http://java.sun.com/javase/6/docs/ap...efaultCharset()
>
> On my windows machine the above calls return Windows-1252 or CP-1252
> (they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
> So does this mean all Java Strings are encoded and stored in memory in
> this Windows-1252 or CP-1252 format?
>
> However, the "Java Internationalization FAQ" says UTF-16:
> http://java.sun.com/javase/technolog...mended-charset
> "... internal representation in Java, which is UTF-16".
>
> So, what is it correct answer? Are Java Strings stored in memory as
> UTF-16 or the platform's default charset?
>
> Btw, I'm trying to understand this so I know what to expect in a more
> complex i18n Browser-Servlet scenario.


Strings are stored as UTF-16.

The default char set applies to external representations.

Arne

 
Joshua Cranmer
Guest
Posts: n/a
 
      12-10-2010
On 12/10/2010 11:12 AM, cs_professional wrote:
> I understand that Java Strings are Unicode (charset), but how are Java
> String's stored in memory? As UTF-16 encoding or using the platform's
> default charset?


Strings internally are stored as chars, which are unsigned 16-bit integers
representing UTF-16 code units.
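
A small sketch of what "16-bit chars" means in practice: a character outside the Basic Multilingual Plane occupies two chars (a surrogate pair), so length() counts code units rather than characters. The class and helper names here are made up for illustration.

```java
class CodeUnitsDemo {
    // Greek small letter pi, U+03C0: fits in a single 16-bit char.
    static final String PI = "\u03C0";
    // U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
    static final String SUPPLEMENTARY = "\uD83D\uDE00";

    /** Number of 16-bit chars (UTF-16 code units) in the string. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(PI) + " unit(s), " + codePoints(PI) + " code point(s)");
        System.out.println(codeUnits(SUPPLEMENTARY) + " unit(s), "
                + codePoints(SUPPLEMENTARY) + " code point(s)");
        // charAt() hands back one code unit, here the high surrogate.
        System.out.println(Character.isHighSurrogate(SUPPLEMENTARY.charAt(0)));
        // codePointAt() reassembles the pair into the full code point.
        System.out.println(Integer.toHexString(SUPPLEMENTARY.codePointAt(0)));
    }
}
```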

> There seems to be conflicting information this, the official String
> javadoc says platform's default charset:
> http://download.oracle.com/javase/6/...ml#String(byte[])
> "Constructs a new String by decoding the specified array of bytes
> using the platform's default charset."


For serialization as a byte stream, Strings by default use the platform
default charset.

> On my windows machine the above calls return Windows-1252 or CP-1252
> (they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
> So does this mean all Java Strings are encoded and stored in memory in
> this Windows-1252 or CP-1252 format?


It can't be, since you can store, say, π in a Java string, which is not
a character in CP-1252. On the other hand, if your default charset is
CP-1252, you can't serialize that character (you'll get ? instead).
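
The π case is easy to try out. The sketch below passes the charset explicitly instead of relying on the platform default, so it behaves the same on any machine; it assumes windows-1252 is available in the JVM and falls back to ISO-8859-1 (which also lacks π) if it is not. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnmappableDemo {
    /** windows-1252 if this JVM ships it, otherwise ISO-8859-1 as a stand-in. */
    static Charset legacyCharset() {
        return Charset.isSupported("windows-1252")
                ? Charset.forName("windows-1252")
                : StandardCharsets.ISO_8859_1;
    }

    /** Encodes the one-character string pi (U+03C0) with the given charset. */
    static byte[] encodePi(Charset cs) {
        return "\u03C0".getBytes(cs);
    }

    public static void main(String[] args) {
        String pi = "\u03C0";
        // In memory the character is intact: this prints 960 (0x3C0).
        System.out.println((int) pi.charAt(0));

        // Encoding to a charset without pi substitutes the replacement byte '?'.
        byte[] legacy = encodePi(legacyCharset());
        System.out.println(legacy.length + " byte: " + (char) legacy[0]);

        // UTF-8 can represent pi; it takes two bytes.
        System.out.println(encodePi(StandardCharsets.UTF_8).length + " bytes in UTF-8");
    }
}
```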

> Btw, I'm trying to understand this so I know what to expect in a more
> complex i18n Browser-Servlet scenario.


What you have to be concerned about is the translation between byte
arrays (or any input/output that reads/writes bytes, possibly
autoconverting (!) characters) and character arrays (or Strings or other
containers implementing CharSequence).
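
As a rough illustration of that byte/char boundary, the sketch below encodes a string to bytes and decodes it twice: once with the matching charset and once with a mismatched one. The names are made up for illustration; the point is only that the charset is named explicitly on both sides of the conversion.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // "café", with the e-acute written as an escape so the source file encoding does not matter.
    static final String ORIGINAL = "caf\u00E9";

    /** chars -> bytes with an explicit charset. */
    static byte[] encode(String text, Charset cs) {
        return text.getBytes(cs);
    }

    /** bytes -> chars with an explicit charset. */
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        byte[] wire = encode(ORIGINAL, StandardCharsets.UTF_8);

        // Same charset on both sides: the text survives the round trip.
        System.out.println(decode(wire, StandardCharsets.UTF_8).equals(ORIGINAL));

        // Mismatched charset: each of the two UTF-8 bytes of the e-acute
        // is decoded as its own Latin-1 character.
        String garbled = decode(wire, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length() + " chars instead of " + ORIGINAL.length());
    }
}
```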

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
 
Roedy Green
Guest
Posts: n/a
 
      12-10-2010
On Fri, 10 Dec 2010 08:12:13 -0800 (PST), cs_professional wrote, quoted
or indirectly quoted someone who said:

>I understand that Java Strings are Unicode (charset), but how are Java
>String's stored in memory? As UTF-16 encoding or using the platform's
>default charset?


The spec allows the implementor to do anything he pleases internally,
including 8-bit encodings. However, they behave as if they were
encoded as 16-bit Unicode chars.

They are converted to the default local encoding when you use a
PrintWriter for example without specifying an explicit encoding.
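
A sketch of the difference, assuming a PrintWriter layered over an OutputStreamWriter and an in-memory sink rather than a file. The helper name write is made up for illustration. Only the explicit-charset result is predictable; the default-charset result depends on the machine and JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {
    /** Writes text through a PrintWriter that encodes with the given charset. */
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(sink, cs))) {
            out.print(text);
        } // closing the writer flushes the encoded bytes into the sink
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String pi = "\u03C0";

        // Explicit charset: the same bytes on every platform.
        System.out.println("UTF-8:   " + write(pi, StandardCharsets.UTF_8).length + " byte(s)");

        // Platform default: what a writer uses when no charset is specified.
        Charset dflt = Charset.defaultCharset();
        System.out.println(dflt + ": " + write(pi, dflt).length + " byte(s)");
    }
}
```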

You can experiment writing files, then feeding them to the encoding
recognizer to figure out what encoding was actually used. Local
encodings are often 8-bit.
http://mindprod.com/applet/encodingrecogniser.html
--
Roedy Green Canadian Mind Products
http://mindprod.com

Doubling the size of a team will probably make it produce even more slowly.
The problem is the more team members, the more secrets, the less each team
member understands about how it all fits together and how his changes may
adversely affect others.
 
Roedy Green
Guest
Posts: n/a
 
      12-10-2010
On Fri, 10 Dec 2010 12:52:32 -0500, Joshua Cranmer wrote, quoted or
indirectly quoted someone who said:

>For serialization as a byte stream, Strings by default use the platform
>default charset


I don't think so. They use UTF-8 with a leading count field, like
DataOutputStream. Otherwise such files would not be portable. I use
serialised streams all the time as resources. They would not work if
they were read back differently by different clients.

--
Roedy Green Canadian Mind Products
http://mindprod.com

 
Mike Schilling
Guest
Posts: n/a
 
      12-10-2010


"Roedy Green" <> wrote in message
news:...
> On Fri, 10 Dec 2010 12:52:32 -0500, Joshua Cranmer
> <> wrote, quoted or indirectly quoted someone
> who said :
>
>>For serialization as a byte stream, Strings by default use the platform
>>default charset

>
> I don't think so. They use UTF-8 with lead count field, like
> DataOutputStream. Otherwise such files would not be portable. I use
> serialised streams all the time as resources. They would not work if
> they read back differently by different clients.


It's a complicated area, so we need to speak precisely.

DataOutputStream's writeChar() and writeChars() methods write characters as
UTF-16 code units (two bytes each, high byte first). Its writeUTF() method
writes a string in (Java's version of) UTF-8. None of these are affected by
the platform's default encoding.

Java object serialization uses these methods. Again, its output is
unaffected by the platform's default encoding.

The platform's default charset does affect other places where chars are
converted to bytes and no encoding is specified. These include
String.getBytes() and the various Writer methods that output strings (e.g.,
write(String)) if no encoding was specified when the Writer was created.
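
To make the DataOutputStream part concrete, here is a small sketch that dumps the bytes produced by writeChar() and writeUTF(). The helper names (hex, writeCharHex, writeUtfHex) are made up for illustration. The NUL case shows one place where Java's variant differs from standard UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class DataOutputDemo {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    /** Bytes produced by DataOutputStream.writeChar(). */
    static String writeCharHex(char c) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeChar(c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hex(sink.toByteArray());
    }

    /** Bytes produced by DataOutputStream.writeUTF(). */
    static String writeUtfHex(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hex(sink.toByteArray());
    }

    public static void main(String[] args) {
        // One 16-bit code unit, high byte first.
        System.out.println(writeCharHex('\u03C0'));
        // Two-byte length prefix, then the encoded characters.
        System.out.println(writeUtfHex("\u03C0"));
        // NUL is written as two bytes rather than a single zero byte.
        System.out.println(writeUtfHex("\0"));
    }
}
```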


 
Robert Klemme
Guest
Posts: n/a
 
      12-10-2010
On 12/10/2010 06:52 PM, Joshua Cranmer wrote:
> On 12/10/2010 11:12 AM, cs_professional wrote:


>> There seems to be conflicting information this, the official String
>> javadoc says platform's default charset:
>> http://download.oracle.com/javase/6/...ml#String(byte[])
>>
>> "Constructs a new String by decoding the specified array of bytes
>> using the platform's default charset."

>
> For serialization as a byte stream, Strings by default use the platform
> default charset.


Please don't call String's getBytes() "serialization". Serialization is
a completely different mechanism (see [1]) and we don't really have to
bother about what that format looks like, because this is a Java-only story
and instances are guaranteed to come back as they were written.
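
A quick way to check the round-trip claim, sketched with in-memory streams. The class and method names are made up for illustration. No charset appears anywhere in the code, because object serialization does not consult the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

class SerializationRoundTrip {
    /** Serializes a value to a byte array and reads it straight back. */
    static Object roundTrip(Object value) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(sink.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Pi plus a character outside the BMP; neither exists in CP-1252.
        String original = "\u03C0 and \uD83D\uDE00";
        System.out.println(original.equals(roundTrip(original)));
    }
}
```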

Kind regards

robert


[1] http://download.oracle.com/javase/6/...ializable.html
 
David
Guest
Posts: n/a
 
      12-10-2010
On 10 dic, 12:52, Joshua Cranmer <Pidgeo...@verizon.invalid> wrote:
> On 12/10/2010 11:12 AM, cs_professional wrote:
>
> > I understand that Java Strings are Unicode (charset), but how are Java
> > String's stored in memory? As UTF-16 encoding or using the platform's
> > default charset?

>
> Strings internally are stored as chars, which a unsigned 16 bit integers
> representing UTF-16 codepoints.


Strictly speaking, strings could be stored in some other format, like
UTF-32, or arrays of double where the integer part represents a
Unicode codepoint, or Perl's SvPV type (that carries a flag and can be
either ISO-8859-1 or UTF-8 internally). However, the Sun reference
implementation uses UTF-16 on all platforms, and some of the methods
in String are easier to implement efficiently when that's the case.

--
DLL
 
Mike Schilling
Guest
Posts: n/a
 
      12-11-2010


"David" <> wrote in message
news:4e5cd164-ad39-4fe9-b95e-...
> On 10 dic, 12:52, Joshua Cranmer <Pidgeo...@verizon.invalid> wrote:
>> On 12/10/2010 11:12 AM, cs_professional wrote:
>>
>> > I understand that Java Strings are Unicode (charset), but how are Java
>> > String's stored in memory? As UTF-16 encoding or using the platform's
>> > default charset?

>>
>> Strings internally are stored as chars, which a unsigned 16 bit integers
>> representing UTF-16 codepoints.

>
> Strictly speaking, strings could be stored in some other format, like
> UTF-32, or arrays of double where the integer part represents a
> Unicode codepoint, or Perl's SvPV type (that carries a flag and can be
> either ISO-8859-1 or UTF-8 internally). However, the Sun reference
> implementation uses UTF-16 on all platforms, and some of the methods
> in String are easier to implement efficiently when that's the case.


I'm wondering whether there's any guarantee that String.charAt() is O(0),
which would be next to impossible if the String were an array of UTF-32.

 
Tom Anderson
Guest
Posts: n/a
 
      12-11-2010
On Fri, 10 Dec 2010, Mike Schilling wrote:

> "David" <> wrote in message
> news:4e5cd164-ad39-4fe9-b95e-...
>
>> Strictly speaking, strings could be stored in some other format, like
>> UTF-32, or arrays of double where the integer part represents a Unicode
>> codepoint, or Perl's SvPV type (that carries a flag and can be either
>> ISO-8859-1 or UTF-8 internally).

>
> I'm wondering whether there's any guarantee that String.charAt() is O(0),
> which would be next to impossible if the String were an array of UTF-32.


O(0)?

tom

--
william gibson said that the future has already happened, it just isn't
evenly distributed. he was talking specifically about finsbury park. --
andy
 