converting from one charset encoding to another ...

 
 
Albretch Mueller
 
      11-23-2009


Some time ago I coded some methods for charset re-encoding. Say you get
files in Cyrillic, “KOI8-R”, and you want them as UTF-8.

What I did was basically open an InputStreamReader(FileInputStream
FIS, String aEncoding1) and an OutputStreamWriter(FOS, “UTF-8”), then
call InputStreamReader.read(char[] chrBffr) and
OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop until it
hit EOF.
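
The loop described above can be sketched roughly as follows. This is a minimal sketch, not the poster's actual code: the class name Recode, the method signature, the buffer size, and the command-line file arguments are all assumptions made for illustration, and it assumes the JRE supports KOI8-R.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name and signature are illustrative only.
final class Recode {

    /** Decodes bytes from 'in' using 'from' and re-encodes them to 'out' using 'to'. */
    static void recode(InputStream in, Charset from, OutputStream out, Charset to)
            throws IOException {
        char[] buffer = new char[8192];
        try (Reader reader = new InputStreamReader(in, from);
             Writer writer = new OutputStreamWriter(out, to)) {
            int charsRead; // number of chars (not bytes) placed in the buffer
            while ((charsRead = reader.read(buffer)) != -1) { // -1 signals EOF
                writer.write(buffer, 0, charsRead);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder usage: java Recode.java input-koi8r.txt output-utf8.txt
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            recode(in, Charset.forName("KOI8-R"), out, StandardCharsets.UTF_8);
        }
    }
}
```

Note that the value returned by read(char[]) counts chars, not bytes, so a name like iRdByts is slightly misleading.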

That works just fine, yet I wonder if there are better/faster ways to
do that using channels/memory-mapped files.
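
For comparison, a channel-based variant might look like the sketch below. It is only an illustration under stated assumptions: the class name MappedRecode and the file arguments are hypothetical, the whole decoded file is held in memory, files larger than 2 GB cannot be mapped in one call this way, and nothing here shows that it is actually faster than the Reader/Writer loop.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class; names are illustrative only.
final class MappedRecode {

    /** Maps 'source' into memory, decodes it with 'from', and writes it to 'target' encoded with 'to'. */
    static void recode(Path source, Charset from, Path target, Charset to) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.TRUNCATE_EXISTING,
                     StandardOpenOption.WRITE)) {
            // Map the whole file read-only; throws if the file exceeds Integer.MAX_VALUE bytes.
            MappedByteBuffer mapped = in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
            // Charset.decode replaces malformed input instead of throwing.
            CharBuffer chars = from.decode(mapped);
            ByteBuffer encoded = to.encode(chars);
            while (encoded.hasRemaining()) {
                out.write(encoded); // a single write may not drain the buffer
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder usage: java MappedRecode.java input-koi8r.txt output-utf8.txt
        recode(Paths.get(args[0]), Charset.forName("KOI8-R"),
               Paths.get(args[1]), StandardCharsets.UTF_8);
    }
}
```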

Also, where can you get actual files with different types of encodings
to test these methods?

Thanks
lbrtchx
{comp.lang.java.programmer}
 
Mike Schilling
 
      11-23-2009
Albretch Mueller wrote:
> Some time ago I coded some methods for charset re-encoding. Say you
> get files in Cyrillic, “KOI8-R”, and you want them as UTF-8.
>
> What I did was basically open an InputStreamReader(FileInputStream
> FIS, String aEncoding1) and an OutputStreamWriter(FOS, “UTF-8”), then
> call InputStreamReader.read(char[] chrBffr) and
> OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop until
> it hit EOF.
>
> That works just fine, yet I wonder if there are better/faster ways
> to do that using channels/memory-mapped files.
>
> Also, where can you get actual files with different types of
> encodings to test these methods?


You can create them easily enough by writing through an
OutputStreamWriter of the desired encoding wrapped around a
FileOutputStream.
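
A minimal sketch of generating such test files, assuming an OutputStreamWriter wrapped around a FileOutputStream. The class name SampleFiles, the sample text, the output file names, and the list of charsets are all made up for illustration; charsets the JRE does not support are skipped.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper class; names are illustrative only.
final class SampleFiles {

    /** Writes 'text' to 'file', encoding the characters with 'charset'. */
    static void writeSample(File file, String text, Charset charset) throws IOException {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file), charset)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // "Privet, mir!" in Cyrillic letters, written with Unicode escapes.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440!\n";
        String[] charsetNames = {"KOI8-R", "windows-1251", "UTF-8", "UTF-16LE"};
        for (String name : charsetNames) {
            if (!Charset.isSupported(name)) {
                System.err.println("Skipping unsupported charset: " + name);
                continue;
            }
            File file = new File("sample-" + name + ".txt");
            writeSample(file, text, Charset.forName(name));
            System.out.println(file + ": " + file.length() + " bytes");
        }
    }
}
```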


 
Albretch Mueller
 
      11-23-2009
On Nov 23, 5:54 am, "Mike Schilling" wrote:
> Albretch Mueller wrote:
> > Some time ago I coded some methods for charset re-encoding. Say you
> > get files in Cyrillic, “KOI8-R”, and you want them as UTF-8.
> >
> > What I did was basically open an InputStreamReader(FileInputStream
> > FIS, String aEncoding1) and an OutputStreamWriter(FOS, “UTF-8”), then
> > call InputStreamReader.read(char[] chrBffr) and
> > OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop until
> > it hit EOF.
> >
> > That works just fine, yet I wonder if there are better/faster ways
> > to do that using channels/memory-mapped files.
> >
> > Also, where can you get actual files with different types of
> > encodings to test these methods?
>
> You can create them easily enough by writing through an
> OutputStreamWriter of the desired encoding wrapped around a
> FileOutputStream.

~
After checking the API I don't see what the difference would be
between a plain reader and a FileOutputStream. What is it?

Thank you
lbrtchx
 
Lew
 
      11-23-2009
Albretch Mueller wrote:
> After checking the API I don't see what the difference would be
> between a plain reader and a FileOutputStream. What is it?


I'll assume you either meant a "plain writer" or a 'FileInputStream', but the
question remains what you mean by a "plain reader/writer".

'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.

--
Lew
 
Mike Schilling
 
      11-23-2009
Albretch Mueller wrote:
>> You can create them easily enough by writing through an
>> OutputStreamWriter of the desired encoding wrapped around a
>> FileOutputStream.
>
> After checking the API I don't see what the difference would be
> between a plain reader and a FileOutputStream. What is it?


A Writer converts from characters (Unicode) to whatever encoding it
was created with. An OutputStream just outputs bytes, with no
conversion being done.
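
To make the distinction concrete, the sketch below pushes the same characters through OutputStreamWriters created with different encodings and compares the resulting byte counts. The class name EncodedBytes and the sample word are assumptions for illustration, and it assumes the JRE supports KOI8-R.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative only.
final class EncodedBytes {

    /** Returns the bytes a Writer with the given charset emits for 'text'. */
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // raw bytes, no conversion
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text); // characters are converted to bytes here
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Six Cyrillic letters ("Privet"), written with Unicode escapes.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442";
        System.out.println(encode(text, Charset.forName("KOI8-R")).length);  // 6
        System.out.println(encode(text, StandardCharsets.UTF_8).length);     // 12
        System.out.println(encode(text, StandardCharsets.UTF_16BE).length);  // 12
    }
}
```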




 
Roedy Green
 
      11-23-2009
On Sun, 22 Nov 2009 19:02:36 -0800 (PST), Albretch Mueller wrote:
> That works just fine, yet I wonder if there are better/faster ways to
> do that using channels/memory-mapped files


The thing I don't understand is that NIO uses ordinary file I/O
underneath. So how could it be faster, provided you don't do something
stupid with ordinary file I/O, in a case where caching would not help?
--
Roedy Green Canadian Mind Products
http://mindprod.com
Finding a bug is a sign you were asleep at the switch when coding. Stop debugging, and go back over your code line by line.
 
Albretch Mueller
 
      11-23-2009
> I'll assume you either meant a "plain writer" or a 'FileInputStream'
~

~
> 'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.

~
But once you write to a file, as I am doing, it all becomes a stream of
bytes anyway, until you eventually reopen the file using a Reader and
specify the charset used to interpret chunks of bytes as they are being
read into an array of chars, as specified by the API:
~
http://java.sun.com/javase/6/docs/ap...Character.html
~
"The Java 2 platform uses the UTF-16 representation in char arrays
and in the String and StringBuffer classes."
~
So I think there is nothing fancy about converting streams from one
charset to another, as long as your OS/Java supports both encodings; it
is by nature a serial process.
~
Thank you
lbrtchx
 
Lew
 
      11-24-2009
Lew wrote:
>> 'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.


Albretch Mueller wrote:
> But once you write to a file, as I am doing, it all becomes a stream of
> bytes anyway, until you eventually reopen the file using a Reader and
> specify the charset used to interpret chunks of bytes as they are being
> read into an array of chars, as specified by the API:


The exact bytes written through a Writer depend on the encoding used. If you
use a Reader with a different encoding, you'll get garbage.
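
A small sketch of that failure mode: text is encoded with one charset and decoded with another. The class name Mojibake and the choice of charsets are assumptions for illustration; the String constructor is used as a stand-in for a Reader created with the given charset, and KOI8-R support in the JRE is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative only.
final class Mojibake {

    /** Encodes 'text' with 'writtenAs', then decodes the same bytes with 'readAs'. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs); // what a Writer would put in the file
        return new String(bytes, readAs);        // what a Reader would hand back
    }

    public static void main(String[] args) {
        // Six Cyrillic letters ("Privet"), written with Unicode escapes.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442";
        Charset koi8 = Charset.forName("KOI8-R");

        // Same encoding on both sides: the text survives.
        System.out.println(text.equals(reinterpret(text, koi8, koi8)));                        // true
        // Different encoding on the reading side: the text is corrupted.
        System.out.println(text.equals(reinterpret(text, koi8, StandardCharsets.ISO_8859_1))); // false
    }
}
```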

--
Lew
 
Albretch Mueller
 
      11-25-2009
On Nov 24, 1:45 am, Lew wrote:
> Lew wrote:
> >> 'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.
>
> Albretch Mueller wrote:
> > But once you write to a file, as I am doing, it all becomes a stream of
> > bytes anyway, until you eventually reopen the file using a Reader and
> > specify the charset used to interpret chunks of bytes as they are being
> > read into an array of chars, as specified by the API:
>
> The exact bytes written through a Writer depend on the encoding used. If you
> use a Reader with a different encoding, you'll get garbage.
>
> --
> Lew


OK, you have made me wonder what to do when you don't know the
encoding of a file you got. As far as I know, this is not taken care
of by Readers, even though some heuristics may be used.

So, what do you do in those situations?

Thank you
lbrtchx



 
Lew
 
      11-25-2009
Albretch Mueller wrote:
>> Lew wrote:
>> The exact bytes written through a Writer depend on the encoding used. If you
>> use a Reader with a different encoding, you'll get garbage.
>>
>> --
>> Lew


Don't quote sigs.

> OK, you have made me wonder what to do when you don't know the
> encoding of a file you got. As far as I know, this is not taken care
> of by Readers, even though some heuristics may be used.
>
> So, what do you do in those situations?


The editor in Rational Software Architect, an IDE built on Eclipse, simply
reports that the file is not in the specified encoding. I haven't looked at
its source, but I guess it notices illegal code points. Other editors just
display the wrong thing.
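
One way such a check could work is strict decoding, which reports malformed byte sequences instead of silently replacing them. This is only a sketch and makes no claim about how that editor is implemented; the class name EncodingCheck is hypothetical, and KOI8-R support in the JRE is assumed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative only.
final class EncodingCheck {

    /** Returns true if 'bytes' decode cleanly in 'charset', false on any malformed or unmappable input. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of substituting
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Six Cyrillic letters ("Privet"), written with Unicode escapes.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442";
        byte[] koi8Bytes = text.getBytes(Charset.forName("KOI8-R"));
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(decodesCleanly(utf8Bytes, StandardCharsets.UTF_8)); // true
        System.out.println(decodesCleanly(koi8Bytes, StandardCharsets.UTF_8)); // false
    }
}
```

A check like this can only rule encodings out. Many single-byte charsets accept almost any byte sequence, so passing the check does not prove the guess was right, which is why other tools fall back on heuristics.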

--
Lew
Don't quote sigs.
 