How to identify File encoding in Java?

 
 
Perma
      04-17-2007
Hi,
I have a Java program which polls a directory for incoming files
(zipped and text).
When a new file arrives, I read it and post its outcome.

Here I have some encoding problems. The text files are usually UTF-8,
so I hard-code the encoding to UTF-8:

Code extract:
....
// trying to read the file "myFile"
FileInputStream fi = new FileInputStream(myFile);
// hardcoded UTF-8, how can I do this dynamically?
InputStreamReader ir = new InputStreamReader(fi, "UTF8");
....

I was expecting the zipped files to be UTF-8 as well, but it turned
out not to be, so I get an:
MalformedInputException at
sun.io.ByteToCharUTF8.convert([BII[CII)I(ByteToCharUTF8

So I have to handle the two separately and it troubles my code.

I guess there's a smart way of doing this.
Hope someone can give me some hint on this!

Regards, Per Magnus

 
Gordon Beaton
      04-17-2007
On 17 Apr 2007 09:25:47 -0700, Perma wrote:
> I was expecting the zipped files to be UTF-8 as well, but it turned
> out not to be, so I get an:


> MalformedInputException at
> sun.io.ByteToCharUTF8.convert([BII[CII)I(ByteToCharUTF8
>
> So I have to handle the two separately and it troubles my code.


Character encoding is an attribute that describes how characters are
represented as bytes in *text* files. Classes that are "readers" read
from a stream and apply the specified encoding to obtain the text.
This operation is only meaningful when applied to text files.

Zipped files are binary, not text, so they don't contain characters
and there is no character encoding. To read a zipped file, use an
InputStream (to read raw bytes), or e.g. a ZipInputStream to get the
unzipped contents.
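
A minimal sketch of that approach, assuming Java 7 or later: ZipInputStream yields each entry's unzipped bytes, and only at that point does a charset come into play. The class and method names (ZipTextReader, readEntries, toText) are made up for illustration and do not come from the original posts.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: separates "unzip to bytes" from "decode bytes to text".
final class ZipTextReader {

    /** Reads every file entry of a zip stream as raw bytes, keyed by entry name. */
    static Map<String, byte[]> readEntries(InputStream in) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(in)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zin.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        }
        return entries;
    }

    /** Turns unzipped bytes into text; which charset to use is the caller's decision. */
    static String toText(byte[] unzipped, Charset charset) {
        return new String(unzipped, charset);
    }
}
```

The zip container itself is never passed through a Reader; only the bytes of an entry that is known to be text are decoded.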

/gordon

Kai Schwebke
      04-17-2007
Perma wrote:
> I was expecting the zipped files to be UTF-8 as well, but it turned
> out not to be, so I get an:
> MalformedInputException at
> sun.io.ByteToCharUTF8.convert([BII[CII)I(ByteToCharUTF8
>
> So I have to handle the two separately and it troubles my code.
>
> I guess there's a smart way of doing this.
> Hope someone can give me some hint on this!


There is no 100% solution if you have to guess the encoding.
But you can exploit the fact that input which can be interpreted
as UTF-8 without error is, in almost all cases, actually
UTF-8 encoded.


So just try to apply UTF-8, catch the exception or search for
the replacement character (which defaults to \uFFFD), and apply
an alternate charset on error.
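
One way this could look in code, as a sketch rather than a definitive implementation: a CharsetDecoder configured with CodingErrorAction.REPORT throws on the first invalid byte sequence instead of silently substituting the replacement character. The class name Utf8OrFallback and the idea of passing the fallback charset as a parameter are assumptions made for illustration; the thread does not say which alternate charset applies.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 first, caller-chosen charset on failure.
final class Utf8OrFallback {

    static String decode(byte[] bytes, Charset fallback) {
        // REPORT makes the decoder throw instead of inserting U+FFFD
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // not valid UTF-8: decode the same bytes again with the fallback
            return new String(bytes, fallback);
        }
    }
}
```

A call such as Utf8OrFallback.decode(bytes, StandardCharsets.ISO_8859_1) returns the UTF-8 text when the bytes are valid UTF-8 and the Latin-1 interpretation otherwise. Because the bytes are held in memory, nothing has to be reopened for the second attempt.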



Kai
 
Mike Schilling
      04-17-2007

Gordon Beaton wrote:
>
> Zipped files are binary, not text, so they don't contain characters
> and there is no character encoding.


That is, zip files are binary. Zipped files (the files that were processed
to create the zip file) may be binary or text.


 
Oliver Wong
      04-17-2007

Perma wrote:
> I was expecting the zipped files to be UTF-8 as well, but it turned
> out not to be, so I get an:
> MalformedInputException at
> sun.io.ByteToCharUTF8.convert([BII[CII)I(ByteToCharUTF8
>
> So I have to handle the two separately and it troubles my code.
>
> I guess there's a smart way of doing this.
> Hope someone can give me some hint on this!


Maybe I'm misunderstanding something, but zip files are NOT text files
encoded via the UTF-8 encoding. In fact, they're not text files at all,
but binary files. Thus the question of "which encoding?" never has a
chance to come up at all.

- Oliver


 
Perma
      04-24-2007
I am aware that a zip file is not a text file, but it contains a
text file.
When unzipping the text file, I need to figure out what encoding it
is in.

From the postings above, it sounds to me as if there is no method
which can tell me whether it is "UTF-8" or not, so perhaps the best
solution is to try to read the file as UTF-8, and to handle non-UTF-8
files by re-reading them in the exception handler.
Example:

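InputStreamReader ir;
try {
    // trying to read as UTF-8; note that the constructor does not decode
    // anything yet, so a decoding problem can only show up in a later read()
    ir = new InputStreamReader(fi, "UTF8");
} catch (Exception e) {
    // couldn't read the file as UTF-8, therefore reading it with the default
    // encoding; fi must be reopened first if any bytes were already consumed
    ir = new InputStreamReader(fi);
}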

Thank you for the responses!
And please post if you have alternative solutions to this.

-Per Magnus

 
Martin Gregorie
      04-24-2007
Perma wrote:
> I am aware that a zip file is not a text file, but it contains a
> text file. When unzipping the text file, I need to figure out what encoding it
> is in.
>

Fire up your ZIP utility and take a good look at the file headers it
lists from the zip archive. What you see there is about all it knows
about compressed files. I don't think it knows or cares what's in the
file apart from what's implied by the filename extension.

I think you'll be better off extracting the file as a bytestream
because I'm 99% certain that's what the zip archiver thought it had
compressed.

Don't forget that ASCII text, Unicode wordprocessor files,
JPEG images and binary executables are all the same to the ZIP archiver.
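
To see this from Java rather than from a ZIP utility, the per-entry header fields can be listed with java.util.zip.ZipFile. This is an illustrative sketch (the class name ZipHeaderLister is hypothetical). ZipEntry exposes the name, the sizes and the compression method, but it has no accessor for the character encoding of an entry's content.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: one description line per entry in the archive.
final class ZipHeaderLister {

    static List<String> describe(File zip) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                String method = (e.getMethod() == ZipEntry.DEFLATED)
                        ? "DEFLATED" : "not deflated";
                // sizes are in bytes; nothing here describes a text encoding
                lines.add(e.getName()
                        + " size=" + e.getSize()
                        + " compressed=" + e.getCompressedSize()
                        + " method=" + method);
            }
        }
        return lines;
    }
}
```

Any decision about how to decode an entry therefore has to come from outside the archive: a convention agreed with the sender, or a check on the extracted bytes.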


--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
 
jitmer@hotmail.com
      08-04-2010
Use the code below to read a pipe-delimited (|) text file and print each token:

// needs java.io.BufferedReader, java.io.FileReader, java.io.IOException
// and java.util.StringTokenizer; "file" is the File to read
// note: FileReader decodes with the default charset, it does not detect one
try (BufferedReader bufRdr = new BufferedReader(new FileReader(file))) {
    String line;
    // read each line of the text file
    while ((line = bufRdr.readLine()) != null) {
        // split the line on the pipe delimiter
        StringTokenizer st = new StringTokenizer(line, "|");
        while (st.hasMoreTokens()) {
            System.out.println("NextToken " + st.nextToken());
        }
    }
} catch (IOException e) {
    // also covers FileNotFoundException when the file cannot be opened
    e.printStackTrace();
}
 