Velocity Reviews


moonhkt 05-19-2010 06:40 AM

change ISO8859-1 to GB2312
 
Hi All

Our database codepage is ISO8859-1, but some of the data was entered as GB2312.
When we export that data, it comes out in ISO8859-1 format with the GB2312
bytes in it. Is it possible to convert it from ISO8859-1 to GB2312?

Machine: AIX.


I tried the code below, but it does not work.

import java.nio.charset.Charset ;
import java.io.*;
import java.lang.String;
public class read_iso {
public static void main(String[] args) {
File aFile = new File("abc.txt");
try {
String str = "";
BufferedReader in = new BufferedReader(
new InputStreamReader(new FileInputStream(aFile),
"iso8859-1"));

while (( str = in.readLine()) != null )
{
System.out.println(str);
System.out.println(new String (str.getBytes("iso8859-1")));
System.out.println(new String
(str.getBytes("iso-8859-1"),"GB2312")); /* not */
}
} catch (UnsupportedEncodingException e) {
} catch (IOException e) {
}

}
}

Lew 05-19-2010 04:50 PM

Re: change ISO8859-1 to GB2312
 
On 05/19/2010 02:40 AM, moonhkt wrote:
> Our database codepage is iso8859-1. Some data input with GB2312 data.
> When export data to iso8859-1 format with GB2312 data, Is it possible
> to change iso8859-1 to GB2312 format ?
>
> Machine AIX.
>
>
> I try below coding not work.
>
> import java.nio.charset.Charset ;
> import java.io.*;
> import java.lang.String;
> public class read_iso {


You should follow the Java naming conventions.

> public static void main(String[] args) {
> File aFile = new File("abc.txt");
> try {


.... and indentation conventions.

> String str = "";


And not initialize to values that are never used, only discarded.

> BufferedReader in = new BufferedReader(
> new InputStreamReader(new FileInputStream(aFile),
> "iso8859-1"));
>
> while (( str = in.readLine()) != null )
> {
> System.out.println(str);
> System.out.println(new String (str.getBytes("iso8859-1")));


Didn't you say the data was input in GB2312 encoding?

Whatever, this constructs a string using the platform native encoding from
bytes encoded using ISO-8859-1. If that isn't the native encoding, you got
worries.

> System.out.println(new String
> (str.getBytes("iso-8859-1"),"GB2312")); /* not */


Now you're decoding bytes using GB2312 from bytes encoded using ISO-8859-1.
That can't work.

System.out always uses the platform default string encoding.
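
A minimal sketch of how the default encoding could be inspected, and how output can be given an explicit encoding instead of relying on it. The class name DefaultEncodingCheck and the helper printerFor are made up for this illustration; Charset.defaultCharset() and the "file.encoding" system property are standard, though the encoding System.out actually uses can vary between Java versions and platforms.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class DefaultEncodingCheck
{
  /** Wraps a stream so that text printed to it is encoded with the named charset. */
  static PrintStream printerFor( OutputStream out, String charsetName )
    throws UnsupportedEncodingException
  {
    return new PrintStream( out, true, charsetName );
  }

  public static void main( String[] args ) throws UnsupportedEncodingException
  {
    // What the JVM picked up from the locale and environment.
    System.out.println( "default charset: " + Charset.defaultCharset() );
    System.out.println( "file.encoding:   " + System.getProperty( "file.encoding" ) );

    // Encode explicitly rather than trusting the default.
    PrintStream utf8Out = printerFor( System.out, "UTF-8" );
    utf8Out.println( "\u6d4b\u8bd5" );  // the two Chinese characters from the test file
  }
}
```

Whether the characters then display correctly still depends on the terminal being set to the same encoding.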

> }
> } catch (UnsupportedEncodingException e) {
> } catch (IOException e) {
> }


Don't silently eat exceptions.

> }
> }


My approach to the encoding would be a lot more straightforward. None of this
wacky "new String()" stuff.

<sscce source="eegee/FooCoder.java">
package eegee;

import java.io.*;
import org.apache.log4j.Logger;
import static org.apache.log4j.Logger.getLogger;

public class FooCoder
{
  private transient final Logger logger = getLogger( FooCoder.class );

  public static void main( String[] args )
  {
    new FooCoder().recode();
  }

  public void recode()
  {
    final BufferedReader rin;
    final BufferedWriter owt;
    try
    {
      rin = new BufferedReader( new InputStreamReader(
        getClass().getResourceAsStream( "temp.txt" ),
        "ISO-8859-1" ));
      owt = new BufferedWriter( new OutputStreamWriter(
        System.out, "GB2312" ));
    }
    catch ( IOException exc )
    {
      logger.error( exc );
      return;
    }
    try
    {
      for ( String str; (str = rin.readLine()) != null; )
      {
        owt.write( str );
        owt.newLine();
      }
      owt.flush();
    }
    catch ( IOException exc )
    {
      logger.error( exc );
    }
    finally
    {
      try
      {
        rin.close();
        owt.close();
      }
      catch ( IOException exc )
      {
        logger.error( exc );
      }
    }
  }
}
</sscce>

--
Lew

moonhkt 05-20-2010 02:12 AM

Re: change ISO8859-1 to GB2312
 


Hi Lew
Thanks a lot.
How do I check the platform native encoding?

I changed your code as below. My test file can now be converted to UTF-8; viewed in
Reflection UTF-8 Emulation, the characters display correctly.
Viewed in IE, the characters are also correct.
temp.txt file
| 10 TEST1 |测试1
| |
| 11 TEST2 |测试2
| |
| 12 TEST3 |测试3
| |
| 13 TEST4 |测试4
| |
| 14 TEST5 |测试5
| |


import java.io.*;
public class conv_ig
{
  public static void main( String[] args )
  {
    new conv_ig().recode();
  }
  public void recode()
  {
    final BufferedReader rin;
    final BufferedWriter owt;
    try
    {
      rin = new BufferedReader( new InputStreamReader(
        /* getClass().getResourceAsStream( "temp.txt" ),
        "ISO-8859-1" ));
        owt = new BufferedWriter( new OutputStreamWriter(System.out,
        "GB2312" ));
        */
        getClass().getResourceAsStream( "temp.txt" ),"GB2312" ));
      owt = new BufferedWriter( new OutputStreamWriter(
        System.out, "UTF-8" ));
    }
    catch ( IOException exc )
    {
      /* logger.error( exc ); */
      return;
    }
    try
    {
      for ( String str; (str = rin.readLine()) != null; )
      {
        owt.write( str );
        owt.newLine();
      }
      owt.flush();
    }
    catch ( IOException exc )
    {
      /* logger.error( exc ); */
    }
    finally
    {
      try
      {
        rin.close();
        owt.close();
      }
      catch ( IOException exc )
      {
        /* logger.error( exc ); */
      }
    }
  }
}

Lew 05-20-2010 03:58 AM

Re: change ISO8859-1 to GB2312
 
moonhkt wrote:
> Change your code as below. My test file can conv to UTF-8, view in
> Reflection UTF-8 Emulation, the font is ok.


What is "Reflection UTF-8"?

Not a bad job there, but I have to wonder why you ruined the indentation and
still are flouting the naming conventions. Code should be readable.

Also, it is exceedingly bad that you eliminated logging. You should keep the
logging. Switch to java.util.logging if you don't like log4j or don't care to
add the JAR, but for Pete's sake keep the logging. Yikes.
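
For illustration, a minimal sketch of the java.util.logging route, assuming a hypothetical class LoggedOpen with a helper named open; only the Logger calls are the standard API. The point is just that the catch block records the exception instead of swallowing it.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.logging.Level;
import java.util.logging.Logger;

class LoggedOpen
{
  private static final Logger logger = Logger.getLogger( LoggedOpen.class.getName() );

  /** Opens a file for reading in the given encoding, or returns null after logging the failure. */
  static BufferedReader open( String fileName, Charset charset )
  {
    try
    {
      return new BufferedReader( new InputStreamReader(
        new FileInputStream( fileName ), charset ));
    }
    catch ( IOException exc )
    {
      // Record the problem, with its stack trace, instead of eating it.
      logger.log( Level.SEVERE, "cannot open " + fileName, exc );
      return null;
    }
  }

  public static void main( String[] args ) throws IOException
  {
    // "GB2312" is assumed to be available in the running JRE.
    BufferedReader rin = open( "temp.txt", Charset.forName( "GB2312" ));
    if ( rin != null )
    {
      rin.close();
    }
  }
}
```

With no extra configuration, java.util.logging writes SEVERE messages to the console, so no JAR has to be added.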

Here's a pop quiz for you - given that few code examples I've seen use the
idiom I did of a separate try block for opening the Reader and Writer from the
one for using them, why do you think I bothered?

Is it better or worse than the common idiom, or simply a matter of style and
more power to you for whichever?
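
One way to look at that question, offered as a sketch rather than an answer key: Java 7 and later have try-with-resources, which closes both streams without the nested close logic and without null checks. The class Recoder and the signature of its recode method are assumptions for this illustration, not code from the thread, and "GB2312" is assumed to be available in the running JRE.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

class Recoder
{
  /** Copies text line by line, decoding with one charset and encoding with another. */
  static void recode( InputStream in, Charset from, OutputStream out, Charset to )
    throws IOException
  {
    // Both resources are closed automatically, in reverse order, even if a write fails.
    try ( BufferedReader rin = new BufferedReader( new InputStreamReader( in, from ));
          BufferedWriter owt = new BufferedWriter( new OutputStreamWriter( out, to )) )
    {
      for ( String str; (str = rin.readLine()) != null; )
      {
        owt.write( str );
        owt.newLine();
      }
    }
  }

  public static void main( String[] args ) throws IOException
  {
    // Reads GB2312 text from standard input and writes UTF-8 to standard output.
    recode( System.in, Charset.forName( "GB2312" ), System.out, Charset.forName( "UTF-8" ));
  }
}
```

Passing the streams in as parameters also leaves the decision about what to do when opening fails with the caller.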




--
Lew

moonhkt 05-21-2010 02:18 AM

Re: change ISO8859-1 to GB2312
 


Sorry about this. This is a quick and dirty method to test the code. Reflection
is Telnet software using UTF-8 Emulation to check the string
encoding.
I will look into how to use java.util.logging.

Can you give an example of where I "ruined the indentation"? And what
about the naming conventions?

Lew 05-21-2010 05:03 AM

Re: change ISO8859-1 to GB2312
 
moonhkt wrote:
>>> public class conv_ig
>>> {
>>> public static void main( String[] args )
>>> {
>>> new conv_ig().recode();
>>> }
>>> public void recode()
>>> {

....
>> --
>> Lew


Please do not quote sigs.

> Sorry about this. This is dirty method to test the code. Reflection
> is Telnet software using UTF-8 Emulation to check the the string
> encoding.


Oh, THAT Reflection.

> I will check How to using java.util.logging .
>
> Can you give some example where "ruined the indentation " ? and what
> about the the naming conventions ?


I apologize about the indentation comment - apparently I was seeing an
artifact of word wrap imposed by the posting software and not something that
you did.

As for the naming conventions:
<http://java.sun.com/docs/codeconv/index.html>

You named the class:
>>> public class conv_ig


The convention is to name a class with an initial upper-case letter and camel
case (mixed case, first letter of each word within the compound capitalized
and the rest lower-case), as explained in the Java Code Conventions document.

Methods and non-constant variables (or, more conventionally, non-final
variables) begin with a lower-case letter and are otherwise in camel case.

Underscores should only be used in names that comprise all upper-case letters,
namely those of constant (or more conventionally, final) member variables.
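
Put together, the conventions look roughly like this. The names ConvIg, SOURCE_ENCODING, lineCount and countLine are invented for this sketch; only the capitalization pattern matters.

```java
class ConvIg                                        // class: upper camel case, no underscores
{
  static final String SOURCE_ENCODING = "GB2312";   // constant: all upper case, underscores allowed

  private int lineCount;                            // non-constant variable: lower camel case

  /** Counts one more line and returns the new total. */
  int countLine()                                   // method: lower camel case
  {
    lineCount++;
    return lineCount;
  }

  public static void main( String[] args )
  {
    ConvIg converter = new ConvIg();
    converter.countLine();
    System.out.println( SOURCE_ENCODING + " lines: " + converter.countLine() );
  }
}
```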

--
Lew

RedGrittyBrick 05-21-2010 08:35 AM

Re: change ISO8859-1 to GB2312
 
On 21/05/2010 03:18, moonhkt wrote:
> On May 20, 11:58 AM, Lew<no...@lewscanon.com> wrote:
>> moonhkt wrote:
>>> Change your code as below. My test file can conv to UTF-8, view in
>>> Reflection UTF-8 Emulation, the font is ok.

>>
>> What is "Reflection UTF-8"?

>
> Sorry about this. This is dirty method to test the code. Reflection
> is Telnet software using UTF-8 Emulation to check the the string
> encoding.


There's much wrong in the above.

Reflection is a *terminal-emulator* marketed by Attachmate (who
presumably absorbed WRQ, its original developers).

Reflection does not *emulate* UTF-8, Reflection handles several
character encodings amongst which is UTF-8. Reflection doesn't *check*
the encoding (AFAIK), it just *uses* the configured encoding to
determine which glyph to display for a received byte sequence.

What Reflection *does* emulate is a variety of serial character-mode
terminals such as VT220, Wyse-50 and varieties of ANSI "standard" terminals.

Telnet is only one of several application layers supported by Reflection
for host communication, though I suppose it is the principal one. FTP
and SSH are others.

--
RGB

moonhkt 05-21-2010 04:38 PM

Re: change ISO8859-1 to GB2312
 


Hi All
Thanks for explaining how Reflection works.

Our database is in ISO8859-1 format with some GB2312 and other non-ISO8859-1
data. Now we want to print GB2312 characters in work order routing.
We are planning to purchase a Chinese line printer for printing GB2312. The
line printer can print the file under UNIX. Why does the output file not
need to be converted to GB2312 format before printing?
Any suggestions? Also, the Java conversion program can convert my output to
UTF-8.

moonhkt

moonhkt 05-24-2010 02:30 AM

Re: change ISO8859-1 to GB2312
 
On May 22, 6:23 AM, RedGrittyBrick <RedGrittyBr...@SpamWeary.invalid>
wrote:
> On 21/05/2010 17:38, moonhkt wrote:
>
>
>
> > Our database is ISO8859-1 format with some GB2312 and other non
> > ISO8859-1 data. Now, we want print GB2312 code in work order routing.
> > We planing to purchase a Chinese line printer for printing GB2312. The
> > line printer can print the file under UNIX. Why the output file no
> > need to convert GB2312 format before printing ?

>
> You don't provide any details so I can only guess. My guess is that the
> Database thinks it has (for example) six European letters when in fact
> it has three Chinese characters. The database is happy to store and
> retrieve the bytes sequences that would, under 8859-1 encoding represent
> six European letters. When the retrieved byte sequences are sent to the
> printer, because the printer is configured to use the GB2312 encoding,
> it interprets those same byte sequences, not as six European letters but
> as three Chinese characters.
>
> On the other hand, so far as I know, Unix/Linux printing systems like
> CUPS allow you to specify a character encoding as an option to commands
> like lp. They also pick them up from the locale (see environment
> variables). This allows CUPS to do whatever is needed to print those
> characters correctly.
>
> > Any Suggestion ? And Java Conversion program can convert my output to
> > UTF-8.

>
> I'm sure it can. If a Java program knows what encodings are to be used
> for data input and data output then the standard classes allow you to
> handle data correctly*. How that would help in your situation I don't
> know. If your database thinks it is handing 8859-1 encoded European
> characters to your Java program when in fact some of that needs to be
> interpreted as GB2312 then I expect you will have to do something ugly
> in Java. UTF-8 is, in general, a good thing. Configuring your database,
> your programs, your locale and your printer for UTF-8 might well be a
> good thing to do.
>
> --
> RGB
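
The guess quoted above can be tried out in a few lines. This is only a sketch under the assumption that the database and its drivers hand the stored bytes through unchanged; the class ByteReinterpretation and its two helpers are invented for the illustration, and "GB2312" is assumed to be available in the running JRE. It relies on ISO-8859-1 mapping every byte value to exactly one character, so going to ISO-8859-1 and back loses nothing.

```java
import java.nio.charset.Charset;

class ByteReinterpretation
{
  private static final Charset LATIN_1 = Charset.forName( "ISO-8859-1" );
  private static final Charset GB2312 = Charset.forName( "GB2312" );

  /** What an ISO-8859-1 database "sees" when handed GB2312 bytes: one letter per byte. */
  static String asLatin1( String chinese )
  {
    return new String( chinese.getBytes( GB2312 ), LATIN_1 );
  }

  /** Takes the same bytes back out and decodes them as GB2312. */
  static String recover( String storedAsLatin1 )
  {
    return new String( storedAsLatin1.getBytes( LATIN_1 ), GB2312 );
  }

  public static void main( String[] args )
  {
    String original = "\u6d4b\u8bd5";             // two Chinese characters
    String stored = asLatin1( original );          // four Latin-1 characters, same bytes
    System.out.println( stored.length() );
    System.out.println( recover( stored ).equals( original ));
  }
}
```

A printer configured for GB2312 does the equivalent of recover() simply by interpreting the bytes it receives.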


Hi All
Today our printer vendor suggested that we provide Hanzi EBCDIC code for
testing Chinese printing,
because IBM hosts all support Hanzi EBCDIC code.
How do I convert GB2312/UTF-8 to EBCDIC?

I tried cp1047 on cp1838; all the ASCII codes are the same as before. I compared the files using
diff to check the differences.



RedGrittyBrick 05-24-2010 09:57 AM

Re: change ISO8859-1 to GB2312
 
On 24/05/2010 03:30, moonhkt wrote:
> Hi All
> Today, Our printer vendor suggest us provide Hanzi EBCDIC code for
> testing Chinese printing.
> Due to IBM Hosts All support Hanzi EBCDIC code.


You have an IBM System z?

Throwing EBCDIC code-pages into the mix with 8859-1, GB2312 and UTF-8
seems to me to be making your life more complex when you need to make it
simpler. Still, presumably your printer vendor's salesman has your best
interests at heart.


> How to Convert GB2312/UTF-8 to EBCDIC


<http://stackoverflow.com/questions/771054/utf-8-to-ebcdic-in-java>
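
In outline, and only as a sketch: Java treats an EBCDIC code page like any other charset, so the conversion is the same getBytes call with a different name. The class EbcdicSketch and its encode helper are invented here. "IBM1047" is the Latin-1 EBCDIC code page; which Hanzi code page the printer expects, and whether the running JRE ships it, are open questions that Charset.isSupported can at least partly answer.

```java
import java.nio.charset.Charset;

class EbcdicSketch
{
  /** Encodes text in the named code page, for example "IBM1047". */
  static byte[] encode( String text, String codePage )
  {
    if ( ! Charset.isSupported( codePage ))
    {
      throw new IllegalArgumentException( "this JRE has no charset " + codePage );
    }
    return text.getBytes( Charset.forName( codePage ));
  }

  public static void main( String[] args )
  {
    // Print the EBCDIC bytes for a plain ASCII string as hexadecimal.
    for ( byte b : encode( "TEST1", "IBM1047" ))
    {
      System.out.printf( "%02X ", b );
    }
    System.out.println();
  }
}
```

Even plain letters come out as different byte values than in ASCII, so a byte-level comparison of the two files should show differences on every line.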


> I try cp1047 on cp1838, All ASCII code like before. By compare using
> diff to check the different.


What JCL did you use to run diff?


--
RGB

