Read utf-8 file return utf-16 coding hex string ?

 
 
moonhkt
Guest
Posts: n/a
 
      01-29-2010
Hi All
Why, when I read the file as UTF-8, does the hex value come back as 51cc and 6668?

od -cx utf8_file01.text

22e5 878c e699 a822 (with " before and after)

http://www.fileformat.info/info/unic...51cc/index.htm
http://www.fileformat.info/info/unic...6668/index.htm

Output
......
101 ? 20940 HEX=51cc BIN=101000111001100
102 ? 26216 HEX=6668 BIN=110011001101000

Java program

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class read_utf_line {
    public static void main(String[] args) {
        File aFile = new File("utf8_file01.text");
        System.out.println(aFile);
        // Decode the file's bytes as UTF-8 while reading.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(aFile), "UTF-8"))) {
            String str;
            while ((str = in.readLine()) != null) {
                int stlen = str.length();
                System.out.println(stlen);
                // Step by code point so a surrogate pair is not visited twice.
                for (int i = 0; i < stlen; ) {
                    int val = str.codePointAt(i);
                    int next = i + Character.charCount(val);
                    System.out.println(i + " " + str.substring(i, next)
                            + " " + val
                            + " HEX=" + Integer.toHexString(val)
                            + " BIN=" + Integer.toBinaryString(val));
                    i = next;
                }
            }
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException, a subclass of IOException.
            e.printStackTrace();
        }
    }
}
 
 
 
 
 
moonhkt
Guest
Posts: n/a
 
      01-29-2010
On Jan 29, 3:59 pm, Peter Duniho <(E-Mail Removed)> wrote:
> moonhkt wrote:
> > Hi All
> > Why using utf-8, the hex value return 51cc and 6668 ?
>
> > od -cx utf8_file01.text
>
> > 22e5 878c e699 a822 with " befor and after
>
> I don't understand the above. Are you trying to suggest that the text
> 'with " befor and after' is part of the output of the "od" program? If
> so, why does it not appear to match up with the binary values written
> out? And if the characters you're concerned with are at index 101 and
> 102, why only eight bytes in the file? And if the file is UTF-8, why
> are you dumping its contents as shorts? Why not just bytes?
>
> Frankly, the whole question doesn't make much sense to me. That said,
> the basic answer to your question is, I believe: UTF-8 and UTF-16 are
> different, so of course the bytes used to represent a character in a
> UTF-8 file are going to look different from the bytes used to represent
> the same character in a UTF-16 data structure.
>
> Pete


System : AIX 5.3

The text file has just two UTF-8 Chinese characters.
cat out_utf.text
凌晨

od -cx out_utf.text
0000000 207 214 231 \n
e587 8ce6 99a8 0a00
0000007

I use Java to build the UTF-8 data, entering the characters by their UTF-16
values, because I do not know how to enter the UTF-8 hex values.
My question is: if I enter UTF-16 hex values and write to the file with the UTF8
codepage, will the data be encoded as UTF-8?
Do you know how to enter the UTF-8 hex values? I tried \0xe5, but it does not work.


import java.io.*;

public class build_utf01 {
    public static void main(String[] args)
            throws UnsupportedEncodingException {

        // I want console output in UTF-8
        PrintStream sysout = new PrintStream(System.out, true, "UTF-8");
        File oFile = new File("out_utf.text");
        // The writer encodes the String's UTF-16 chars as UTF-8 bytes.
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(oFile), "UTF-8"))) {

            /* http://www.fileformat.info/info/unic...51cc/index.htm
               UTF-8 (hex) 0xe5 0x87 0x8c (e5878c)
               UTF-16 (hex) 0x51CC (51cc)
               http://www.fileformat.info/info/unic...6668/index.htm
               UTF-16 (hex) U+6668
               UTF-8 (hex) 0xe6 0x99 0xa8 (e699a8)
            */
            String a = "\u51cc\u6668";

            int n = a.length();
            sysout.println("GIVEN STRING IS=" + a);
            sysout.printf("Length of string is %d%n", n);
            sysout.printf("CodePoints in string is %d%n", a.codePointCount(0, n));
            for (int i = 0; i < n; i++) {
                sysout.printf("Character[%d] is %s%n", i, a.charAt(i));
                out.write(a.charAt(i));
            }
            out.newLine();
        } catch (IOException e) {
            // Report the failure instead of silently ignoring it.
            e.printStackTrace();
        }
    }
}


Output on a UTF-8 enabled terminal
java build_utf01
GIVEN STRING IS=凌晨
Length of string is 2
CodePoints in string is 2
Character[0] is 凌
Character[1] is 晨

 
 
 
 
 
John B. Matthews
Guest
Posts: n/a
 
      01-29-2010
In article
<(E-Mail Removed)>,
moonhkt <(E-Mail Removed)> wrote:

[...]
> My Question is input utf-16 hex value, when write to file with UTF8
> codepage, the data will encode to UTF-8 ?


When I run your program, I get this file content:

$ hd out_utf.text
000000: e5 87 8c e6 99 a8 0a ?..?.?.

> Do you know hwo to input hex value of utf-8?


Do you mean like this?

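String a = "\u51cc\u6668";
// Name the charset explicitly: new String(byte[]) on its own decodes
// with the platform default encoding, which is not always UTF-8.
String b = new String(new byte[] {
    (byte) 0xe5, (byte) 0x87, (byte) 0x8c,
    (byte) 0xe6, (byte) 0x99, (byte) 0xa8
}, java.nio.charset.StandardCharsets.UTF_8);
System.out.println("a.equals(b) is " + a.equals(b));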

This prints "a.equals(b) is true".

For reference: $ cat ~/bin/hd
#!/usr/bin/hexdump -f
"%06.6_ax: " 16/1 "%02x " " "
16/1 "%_p" "\n"

--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
 
 
RedGrittyBrick
Guest
Posts: n/a
 
      01-29-2010

moonhkt wrote:
> Hi All
> Why using utf-8, the hex value return 51cc and 6668 ?


Because those are the Unicode codepoints of the characters in the file.

>
> od -cx utf8_file01.text
>


These are the byte values of the UTF8 encoding of the characters.

> 22e5 878c e699 a822 with " befor and after
    ^^ ^^^^
    e5 87 8c = U+51CC

            ^^^^ ^^
            e6 99 a8 = U+6668


As shown here:

> http://www.fileformat.info/info/unic...51cc/index.htm
> http://www.fileformat.info/info/unic...6668/index.htm




>
> Output
> .....
> 101 ? 20940 HEX=51cc BIN=101000111001100
> 102 ? 26216 HEX=6668 BIN=110011001101000
                  ^^^^ Unicode *CodePoint*

> System.out.println(i + " " + str.substring(i,i+1)
> + " " + str.codePointAt(i)
              ^^^^^^^^^^^ you retrieve a *CodePoint*
> + " HEX=" + hexstr
> + " BIN=" + bystr
> );



--
RGB
 
 
moonhkt
Guest
Posts: n/a
 
      01-29-2010
On Jan 29, 8:09 pm, RedGrittyBrick <(E-Mail Removed)> wrote:
> moonhkt wrote:
> > Hi All
> > Why using utf-8, the hex value return 51cc and 6668 ?
>
> Because those are the Unicode codepoints of the characters in the file.
>
> > od -cx utf8_file01.text
>
> These are the byte values of the UTF8 encoding of the characters.
>
> > 22e5 878c e699 a822 with " befor and after
>     ^^ ^^^^
>     e5 87 8c = U+51CC
>
>             ^^^^ ^^
>             e6 99 a8 = U+6668
>
> As shown here:
>
> > http://www.fileformat.info/info/unic...51cc/index.htm
> > http://www.fileformat.info/info/unic...6668/index.htm
>
> > Output
> > .....
> > 101 ? 20940 HEX=51cc BIN=101000111001100
> > 102 ? 26216 HEX=6668 BIN=110011001101000
>                   ^^^^ Unicode *CodePoint*
>
> > System.out.println(i + " " + str.substring(i,i+1)
> > + " " + str.codePointAt(i)
>               ^^^^^^^^^^^ you retrieve a *CodePoint*
>
> > + " HEX=" + hexstr
> > + " BIN=" + bystr
> > );
>
> --
> RGB


But I want to print out the UTF-8 hex value. How do I print it? E.g. U+51CC
as e5 87 8c.
What code can handle this?

 
 
markspace
Guest
Posts: n/a
 
      01-29-2010
moonhkt wrote:

> But, I want Print out UTF-8 hex value How to Print ? e.g U+51CC to e5
> 87 8c.
> What coding can handle this ?



Oh, I see.

Try this:


package test;
import java.io.UnsupportedEncodingException;

public class UtfOut {
    public static void main( String[] args )
            throws UnsupportedEncodingException
    {
        String a = "\u51cc\u6668";

        // Encode the String's chars as UTF-8 bytes.
        byte [] buf = a.getBytes( "UTF-8" );

        // Print each byte as two hex digits.
        for( byte b : buf ) {
            System.out.printf( "%02X ", b );
        }
        System.out.println( );
    }
}


You could also use a ByteArrayOutputStream.
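
The ByteArrayOutputStream route might look roughly like the sketch below. The class name UtfOutStream and the helper utf8Hex are made up for illustration; the idea is simply to point an OutputStreamWriter at an in-memory stream instead of a file and then dump whatever bytes the writer produced.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class UtfOutStream {
    // Hypothetical helper: encode text through a Writer, return the bytes as hex.
    static String utf8Hex(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // The writer converts the String's UTF-16 chars to UTF-8 bytes.
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and flushed) here, so all bytes are available.
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes.toByteArray()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u51cc\u6668")); // prints "e5 87 8c e6 99 a8"
    }
}
```

This is the same path the file-writing program takes, so it shows exactly what would land on disk without having to run od afterwards.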
 
 
Roedy Green
Guest
Posts: n/a
 
      01-30-2010
On Thu, 28 Jan 2010 23:40:07 -0800 (PST), moonhkt <(E-Mail Removed)>
wrote, quoted or indirectly quoted someone who said :

>Hi All
>Why using utf-8, the hex value return 51cc and 6668 ?


UTF-8 is a mixture of 8 bit chars, and magic 8-bit sequences that turn
into 16 bit and 32 bit code sequences.

To see how the algorithm works see
http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/codepoint.html
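
For anyone who wants to see the bit shuffling without following the links, here is a minimal sketch of the standard UTF-8 bit layout, done by hand. The class Utf8ByHand and its methods are illustrative names only; in real code String.getBytes with a UTF-8 charset does this for you.

```java
class Utf8ByHand {
    // Encode one Unicode code point as UTF-8 using the standard bit patterns.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value");
        }
        if (cp < 0x80) {             // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {     // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {   // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    // Format bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x51CC))); // prints "e5 87 8c"
        System.out.println(hex(encode(0x6668))); // prints "e6 99 a8"
    }
}
```

U+51CC falls in the three-byte range, which is why the 16-bit value 51cc comes out as the three bytes seen in the od dump.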
--
Roedy Green Canadian Mind Products
http://mindprod.com
Computers are useless. They can only give you answers.
~ Pablo Picasso (born: 1881-10-25 died: 1973-04-08 at age: 91)
 
 
moonhkt
Guest
Posts: n/a
 
      01-30-2010
On Jan 30, 5:51 pm, Roedy Green <(E-Mail Removed)> wrote:
> On Thu, 28 Jan 2010 23:40:07 -0800 (PST), moonhkt <(E-Mail Removed)>
> wrote, quoted or indirectly quoted someone who said :
>
> >Hi All
> >Why using utf-8, the hex value return 51cc and 6668 ?
>
> UTF-8 is a mixture of 8 bit chars, and magic 8-bit sequences that turn
> into 16 bit and 32 bit code sequences.
>
> To see how the algorithm works see
> http://mindprod.com/jgloss/utf.html
> http://mindprod.com/jgloss/codepoint.html
> --
> Roedy Green Canadian Mind Products
> http://mindprod.com
> Computers are useless. They can only give you answers.
> ~ Pablo Picasso (born: 1881-10-25 died: 1973-04-08 at age: 91)


Hi All
Thanks for the documents on UTF-8. Actually, my company wants to use an
ISO8859-1 database to store UTF-8 data. Currently, our EDI handles only the
ISO8859-1 codepage. We want to test importing UTF-8 data. One type of EDI
with UTF-8 data could be imported, processed and loaded into our database.
When we then exported the data in the default codepage, IBM850, we found
e5 87 8c e6 99 a8 in the file. The export file is a mix of ISO8859-1
characters and UTF-8 characters.

The next test is to load all possible UTF-8 characters into our database,
then export the loaded data to a file and compare the two files. If the two
do not differ, that may prove that UTF-8 can be loaded into an ISO8859-1
database without any bad effect.

Our database is a Progress database in character mode, running on an AIX 5.3
machine.

The next task is to build a file containing all possible UTF-8 characters,
for the loading test.
Any suggestions?


 
 
RedGrittyBrick
Guest
Posts: n/a
 
      01-30-2010

moonhkt wrote:

> Actually, My company want using
> ISO8859-1 database to store UTF-8 data.


Your company should use a Unicode database to store Unicode data. The
Progress DBMS supports Unicode.

> Currently, our EDI just handle
> ISO8859-1 codepage. We want to test import UTF-8 data. One type EDI
> with UTF-8 Data can be import and processed loading to our database.
> Then export the data to default codepage, IBM850, we found e5 87 8c
> e6 99 a8 in the file.


This seems crazy to me. The DBMS functions for working with CHAR
datatypes will do bad things if you have misled the DBMS into treating
UTF-8 encoded data as if it were ISO 8859-1. You will no longer be able
to fit 10 chars in a CHAR(10) field, for example.
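
The CHAR(10) problem can be seen from Java alone. This is only a sketch of what a system that believes the bytes are ISO 8859-1 would count; the class CharCountDemo and its helper are made-up names, and how Progress itself measures field length is not something shown here.

```java
import java.nio.charset.StandardCharsets;

class CharCountDemo {
    // Hypothetical helper: how many "characters" a Latin-1 reader sees
    // when handed the UTF-8 encoding of the given text.
    static int lengthSeenAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // ISO-8859-1 decoding maps every single byte to exactly one char.
        return new String(utf8, StandardCharsets.ISO_8859_1).length();
    }

    public static void main(String[] args) {
        String twoChars = "\u51cc\u6668";
        System.out.println(twoChars.length());            // prints 2
        System.out.println(lengthSeenAsLatin1(twoChars)); // prints 6
    }
}
```

Two characters become six as far as the Latin-1 view is concerned, so ten such characters would need room for thirty.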

> The Export file are mix ISO8859-1 chars and UTF-8 character.


Sorry to be so negative, but this seems a recipe for disaster.


> The next test is loading all possible UTF-8 character to our database
> then export the loaded data into a file, for compare two file. If two
> different, we may be proof that loading UTF-8 into ISO8859-1 database
> without any of bad effect.


I think you'll have a false sense of optimism and discover bad effects
later.


> Our Database is Progress Database for Character mode run on AIX 5.3
> Machine.


A 1998 vintage document suggests the Progress DBMS can support Unicode.
http://unicode.org/iuc/iuc13/c12/slides.ppt. Though there are a few items
in that presentation that I find troubling.


> Next Task, try to build all possible UTF-8 Bit into file,for Loading
> test.


Unicode contains combining characters; not all sequences of Unicode
characters are valid.


> Any suggestion ?


Reconsider

--
RGB
 
 
Lew
Guest
Posts: n/a
 
      01-30-2010
moonhkt wrote:

> Thank for documents for UTF-8. Actually, My company want using
> ISO8859-1 database to store UTF-8 data. Currently, our EDI just handle


That statement doesn't make sense. What makes sense would be, "My company
wants to store characters with an ISO8859-1 encoding". There is not any such
thing, really, as "UTF-8 data". What there is is character data. Others
upthread have explained this; you might wish to review what people told you
about how data in a Java 'String' is always UTF-16. You read it into the
'String' using an encoding argument to the 'Reader' to understand the encoding
of the source, and you write it to the destination using whatever encoding in
the 'Writer' that you need.
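
The read-with-one-encoding, write-with-another pattern could be sketched as follows. In-memory streams stand in for the real files so the example is self-contained; the class Transcode and its methods are illustrative names, not anything from the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    // Decode the input bytes with one charset, re-encode them with another.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(input), from);
             Writer out = new OutputStreamWriter(result, to)) {
            char[] buf = new char[1024]; // UTF-16 chars, whatever the source encoding
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toByteArray();
    }

    // Format bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xe5, (byte) 0x87, (byte) 0x8c,
                        (byte) 0xe6, (byte) 0x99, (byte) 0xa8 };
        // Same two characters, different bytes: prints "51 cc 66 68"
        System.out.println(hex(transcode(utf8,
                StandardCharsets.UTF_8, StandardCharsets.UTF_16BE)));
        // ISO-8859-1 has no such characters; the writer substitutes '?': prints "3f 3f"
        System.out.println(hex(transcode(utf8,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)));
    }
}
```

The second call is the interesting one for this thread: when the target encoding cannot represent a character, OutputStreamWriter quietly replaces it, so the original data is lost rather than rejected.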

> ISO8859-1 codepage. We want to test import UTF-8 data. One type EDI


The term "UTF-8 data" has no meaning.

> with UTF-8 Data can be import and processed loading to our database.
> Then export the data to default codepage, IBM850, we found e5 87 8c
> e6 99 a8 in the file. The Export file are mix ISO8859-1 chars and
> UTF-8 character.


You simply map the 'String' data to the database column using JDBC. The
connection and JDBC driver handle the encoding, AIUI.
<http://java.sun.com/javase/6/docs/api/java/sql/PreparedStatement.html#setString(int,%20java.lang.String)>

> The next test is loading all possible UTF-8 character to our database
> then export the loaded data into a file, for compare two file. If two
> different, we may be proof that loading UTF-8 into ISO8859-1 database
> without any of bad effect.


There are an *awful* lot of UTF-encoded characters, over 107,000. Most are
not encodable with ISO-8859-1, which only handles 256 characters.
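
Java can check this directly with CharsetEncoder.canEncode. A small sketch, with Latin1Coverage as a made-up class name:

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Latin1Coverage {
    // Count the char values that ISO-8859-1 can represent.
    // Only the 16-bit range is scanned; nothing above it is encodable either.
    static int countEncodableChars() {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        int count = 0;
        for (int c = 0; c <= 0xFFFF; c++) {
            if (latin1.canEncode((char) c)) {
                count++;
            }
        }
        return count;
    }

    // True if every character of the text survives encoding to ISO-8859-1.
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(countEncodableChars());      // prints 256
        System.out.println(fitsLatin1("caf\u00e9"));     // prints true
        System.out.println(fitsLatin1("\u51cc\u6668")); // prints false
    }
}
```

A check like fitsLatin1 before writing is one way to find out up front which incoming strings the ISO-8859-1 side cannot hold.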

> Our Database is Progress Database for Character mode run on AIX 5.3
> Machine.
>
> Next Task, try to build all possible UTF-8 Bit into file,for Loading
> test.
> Any suggestion ?


That'll be a rather large file.
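
To put a rough number on it, the sketch below adds up the UTF-8 length of every code point from U+0000 to U+10FFFF, skipping the surrogate range, which is not valid on its own. It assumes "all possible" means every such code point, assigned or not; the class name AllCodePoints is made up.

```java
import java.nio.charset.StandardCharsets;

class AllCodePoints {
    // Total bytes needed to write every non-surrogate code point once in UTF-8.
    static long utf8SizeOfAllScalarValues() {
        long total = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue; // lone surrogates cannot be encoded as UTF-8
            }
            String s = new String(Character.toChars(cp));
            total += s.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(utf8SizeOfAllScalarValues()); // prints 4382592
    }
}
```

Writing the same strings through an OutputStreamWriter with a UTF-8 charset would produce the test file itself. Note that such a file includes control characters and unassigned code points, and it says nothing about combining sequences.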

Why don't you Google for character encoding and what different encodings can
handle?

Also:
<http://en.wikipedia.org/wiki/Unicode>
<http://en.wikipedia.org/wiki/ISO-8859-1>

--
Lew
 