Velocity Reviews - Computer Hardware Reviews

Unicode chinese

 
 
Crouchez
      08-29-2007
String chinese = "\u4e2d\u5c0f";
System.out.println(chinese.getBytes().length);

Why does this return 2?


 
Knute Johnson
      08-29-2007
Crouchez wrote:
> String chinese = "\u4e2d\u5c0f";
> System.out.println(chinese.getBytes().length);
>
> Why does this return 2?
>
>


The font on the console may not be able to draw it. Try it with an
appropriate font in a JComponent of some variety.

--

Knute Johnson
email s/nospam/knute/
 
sadiruddin@gmail.com
      08-29-2007
It returns 6 for me.

 
bugbear
      08-29-2007
Crouchez wrote:
> String chinese = "\u4e2d\u5c0f";
> System.out.println(chinese.getBytes().length);
>
> Why does this return 2?
>
>


http://java.sun.com/j2se/1.4.2/docs/...html#getBytes()

"The behavior of this method when this string cannot be encoded in the default charset is unspecified."

BugBear
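
The silent substitution that the documentation leaves unspecified can be sidestepped by encoding through a CharsetEncoder, which reports unmappable characters by default instead of replacing them. What follows is only a minimal sketch: ISO-8859-1 is assumed as a stand-in for a Western default charset, and the class and method names (StrictEncode, encodeStrict) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StrictEncode {
    /** Encodes s in cs and throws, rather than substituting '?', when a character is unmappable. */
    static byte[] encodeStrict(String s, Charset cs) throws CharacterCodingException {
        // A freshly created encoder reports unmappable characters by throwing.
        ByteBuffer buf = cs.newEncoder().encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f";
        Charset latin1 = StandardCharsets.ISO_8859_1; // assumed stand-in charset

        // Ask up front whether the charset can represent the string.
        System.out.println(latin1.newEncoder().canEncode(chinese)); // false

        try {
            encodeStrict(chinese, latin1);
        } catch (CharacterCodingException e) {
            System.out.println("cannot encode: " + e);
        }
    }
}
```

Checking canEncode first, or catching the exception, makes the failure visible instead of leaving question marks in the output.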
 
Thomas Fritsch
      08-29-2007
Crouchez wrote:
> String chinese = "\u4e2d\u5c0f";
> System.out.println(chinese.getBytes().length);
>
> Why does this return 2?
>
>

String.getBytes() uses the platform's default charset. See
<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#getBytes()>

If the platform's default charset is "Cp1252" (as on my system, and maybe
on Crouchez's), then chinese.getBytes() returns 2 bytes. By the way:
the 2 bytes are {63,63}, which is just {'?','?'}, because Cp1252
can't encode these two characters.

If the platform's default charset is "UTF-8" (like probably on
sadiruddin's system), then chinese.getBytes() returns 6 bytes.


--
Thomas
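
The two results reported in this thread can be reproduced on any machine by naming the charset explicitly instead of relying on the platform default. This is a sketch, assuming the JRE ships the "windows-1252" charset (the standard name for Cp1252); the class name CharsetLengths and the helper encode are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class CharsetLengths {
    /** Encodes s with the named charset; unmappable characters become the charset's replacement byte. */
    static byte[] encode(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f";

        byte[] cp1252 = encode(chinese, "windows-1252");
        byte[] utf8 = encode(chinese, "UTF-8");

        System.out.println(Arrays.toString(cp1252)); // [63, 63], i.e. "??"
        System.out.println(utf8.length);             // 6
    }
}
```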
 
Andreas Leitgeb
      08-29-2007
bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
> Crouchez wrote:
>> String chinese = "\u4e2d\u5c0f";
>> System.out.println(chinese.getBytes().length);
>> Why does this return 2?

> http://java.sun.com/j2se/1.4.2/docs/...html#getBytes()
> "The behavior of this method when this string cannot be encoded in the default charset is unspecified."


While it's not specified, and could theoretically change over time,
the current implementation seems to encode your string as two
question marks, which accounts for length==2.

The other poster, who got 6, likely has a UTF-8-based system
encoding (or UTF-8 itself).

On Unix-systems, the system-encoding generally depends on the
environment variable LANG (and possibly overridden by certain
LC_... variables whose names I never remember).
For Windows, I don't know.
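
Whatever the operating system, the JVM can be asked which default it ended up with. The sketch below prints it; the class name DefaultCharsetInfo and the method describe are invented for illustration, and the output naturally differs from machine to machine.

```java
import java.nio.charset.Charset;

class DefaultCharsetInfo {
    /** Summarises the default charset the JVM picked up from its environment. */
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
             + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
        // Number of bytes the thread's string occupies in that default charset.
        System.out.println("\u4e2d\u5c0f".getBytes(Charset.defaultCharset()).length);
    }
}
```

Since JDK 18 the default charset is UTF-8 regardless of locale, so on a recent JVM this should report UTF-8 and 6 unless the default has been overridden.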
 
Roedy Green
      08-29-2007
On Wed, 29 Aug 2007 03:47:16 GMT, "Crouchez"
<(E-Mail Removed)> wrote, quoted or indirectly
quoted someone who said :

>String chinese = "\u4e2d\u5c0f";
>System.out.println(chinese.getBytes().length);
>
>Why does this return 2?


I modified your code a little, so it will make the problem clear:

public class Chinese
{
    /**
     * test harness
     *
     * @param args not used
     */
    public static void main ( String[] args )
    {
        // the default encoding that getBytes() will use
        System.out.println( System.getProperty( "file.encoding" ) );
        String chinese = "\u4e2d\u5c0f";
        byte[] b = chinese.getBytes();
        for ( int i = 0; i < b.length; i++ )
        {
            System.out.println( b[i] );
        }
        // prints
        // Cp1252
        // 63
        // 63
        // in other words ??. Those two chars are not available in your
        // default encoding.
    }
}


I further modified your code to choose the encoding explicitly:

import java.io.UnsupportedEncodingException;

public class Chinese
{
    /**
     * test harness
     *
     * @param args not used
     */
    public static void main ( String[] args ) throws UnsupportedEncodingException
    {
        System.out.println( System.getProperty( "file.encoding" ) );
        String chinese = "\u4e2d\u5c0f";
        // explicit choice of encoding, designed to support Chinese.
        byte[] b = chinese.getBytes( "Big5-HKSCS" );
        for ( int i = 0; i < b.length; i++ )
        {
            // mask with 0xff to print the byte as an unsigned value 0..255
            System.out.println( 0xff & b[i] );
        }
        // prints
        // Cp1252
        // 164
        // 164
        // 164
        // 112
        // more like you would expect.
    }
}




--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
 
Crouchez
      08-29-2007
cheers.

If I do

byte[] b = chinese.getBytes( "UTF-8" );

b.length is 6. But why 6, when I thought Chinese characters take up 2 bytes
per character?
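
The two-bytes-per-character figure holds for UTF-16, which is how Java stores a String in memory, and for some legacy encodings such as the Big5 output shown earlier in the thread. UTF-8 is a variable-length encoding and spends three bytes on every code point from U+0800 to U+FFFF, which covers both characters here. A small sketch to compare the two; the class name Utf8VsUtf16 and the helper hex are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8VsUtf16 {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f";

        // UTF-16BE: two bytes per character, no byte-order mark.
        System.out.println(hex(chinese.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D 5C 0F

        // UTF-8: three bytes per character for this range.
        System.out.println(hex(chinese.getBytes(StandardCharsets.UTF_8)));    // E4 B8 AD E5 B0 8F
    }
}
```

UTF_16BE is used rather than plain UTF_16 because the latter prepends a two-byte byte-order mark when encoding, which would also give 6 bytes and confuse the comparison.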


 
Crouchez
      08-29-2007

"Roedy Green" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> On Wed, 29 Aug 2007 03:47:16 GMT, "Crouchez"
> <(E-Mail Removed)> wrote, quoted or indirectly
> quoted someone who said :
>
>>String chinese = "\u4e2d\u5c0f";
>>System.out.println(chinese.getBytes().length);
>>
>>Why does this return 2?



Why have you done an AND on this?
System.out.println( 0xff & b[i]);


 
Roedy Green
      08-30-2007
On Wed, 29 Aug 2007 16:50:41 GMT, "Crouchez"
<(E-Mail Removed)> wrote, quoted or indirectly
quoted someone who said :

>Why have you done an AND on this?
>System.out.println( 0xff & b[i]);


see http://mindprod.com/jgloss/unsigned.html
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
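
In short, Java's byte type is signed, so a byte holding 0xA4 prints as a negative number unless it is widened with a mask. The following sketch illustrates that; the class name UnsignedByteDemo and the helper toUnsigned are made up for the example.

```java
class UnsignedByteDemo {
    /** Widens a signed byte to an int in the range 0..255. */
    static int toUnsigned(byte b) {
        // b is sign-extended to int first; the mask keeps only the low 8 bits.
        return 0xff & b;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xA4; // first Big5 byte from the output above

        System.out.println(b);                     // -92
        System.out.println(toUnsigned(b));         // 164
        System.out.println(Byte.toUnsignedInt(b)); // 164, library equivalent since Java 8
    }
}
```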
 