TextStreamReader with transparent unicode BOM Support

 
 
X_AWemner_X
 
      07-02-2003
OK, here is a teaser for all Java I/O coders. Make us all happy and create a
FilterReader with proper Unicode BOM support.

As you know, _we_ have to tell InputStreamReader which Unicode charset to use
for read operations (UTF-8, UTF-16, ...). The reader does honor the BOM when
given the "UTF-16" charset name and skips those first bytes, but we still have
to tell it to use UTF-16, and it fails with UTF-8 files that start with a BOM.

Win2k Notepad writes a BOM at the start of UTF-8 files, and currently
InputStreamReader cannot read it properly.

http://www.unicode.org/unicode/faq/utf_bom.html#22
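
The behavior described above can be reproduced with a few in-memory bytes. The following is only a minimal sketch: the class name BomLeakDemo and the helper firstChar are made up for illustration and are not part of the JDK or of this thread. It decodes a BOM-prefixed "A" once as UTF-8 and once as UTF-16 and shows the first character the reader hands back.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical demo class, not from the original thread.
class BomLeakDemo {
    /** Decodes the bytes with the named charset and returns the first char read, or -1 at EOF. */
    static int firstChar(byte[] bytes, String charsetName) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charsetName)) {
            return r.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8WithBom  = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A' };
        byte[] utf16WithBom = { (byte) 0xFE, (byte) 0xFF, 0x00, 'A' };

        // The UTF-8 decoder passes the BOM through as the character U+FEFF.
        System.out.printf("UTF-8:  U+%04X%n", firstChar(utf8WithBom, "UTF-8"));
        // The UTF-16 decoder consumes the BOM and returns the letter itself.
        System.out.printf("UTF-16: U+%04X%n", firstChar(utf16WithBom, "UTF-16"));
    }
}
```

So with UTF-8 the caller sees a stray U+FEFF in front of the text, which is the problem a BOM-aware reader would have to hide.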

Now, does anyone have a stream reader that supports BOMs fully transparently,
something like this?

String defaultEnc = "UTF-8"; // used when no BOM is found
Reader in = new BestUnicodeTextStreamReader(
        new FileInputStream("myfile.txt"), defaultEnc);

This class would recognize all BOMs automatically and use the matching
encoding. If no BOM were found, it would use the given defaultEnc value.

I am sure we n00b coders would love to use such a reader implementation.

 
 
 
 
 
Thomas Weidenfeller
 
      07-02-2003
"X_AWemner_X" writes:
> Ok, here is a teaser for all java io coders. Make us all happy and create a
> filterreader with proper unicode bom support.

[...]
>
> I am sure we n00b coders would love to use such reader implementation.
>


Doesn't your company (that's ZenPark, isn't it?) have its own software
development department that can do such work? Please tell me where I
should send the bill for the following rough and inefficient sketch.

I have left out all exception handling and minor details:

class UnicodeReader extends Reader { // Reader is an abstract class, not an interface
    PushbackInputStream internalIn;
    InputStreamReader internalOut = null;
    String defaultEnc;

    private static final int BOM_SIZE = 3; // enough for UTF-8 and UTF-16

    UnicodeReader(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    protected void init() {
        if (internalOut != null) {
            return;
        }

        // Read ahead up to BOM_SIZE bytes, then push all of them back.
        byte bom[] = new byte[BOM_SIZE];
        int n;
        int pos = 0;
        while (pos < BOM_SIZE &&
               (n = internalIn.read(bom, pos, BOM_SIZE - pos)) != -1)
        {
            pos += n;
        }
        internalIn.unread(bom, 0, pos);
        String encoding = ... // evaluate the content of bom[] here
                              // revert to defaultEnc if nothing found
        internalOut = new InputStreamReader(internalIn, encoding);
    }

    //
    // For all methods in class Reader, implement each method as:
    //
    // method(...) {
    //     init();
    //     internalOut.method(...);
    // }
    //
}


/Thomas
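
The core trick in the sketch above is the read-ahead followed by unread(). As a small stand-alone illustration (the class PushbackPeekDemo and its peek helper are hypothetical names, not taken from the post), this shows that bytes pushed back into a PushbackInputStream are delivered again by the next read:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class illustrating read-ahead with PushbackInputStream.
class PushbackPeekDemo {
    /**
     * Reads up to max bytes, pushes them back, and returns a copy of what was seen.
     * max must not exceed the pushback buffer size given to the stream's constructor.
     */
    static byte[] peek(PushbackInputStream in, int max) {
        try {
            byte[] buf = new byte[max];
            int n = 0, r;
            // A single read() may deliver fewer bytes than requested, so loop.
            while (n < max && (r = in.read(buf, n, max - n)) != -1) {
                n += r;
            }
            if (n > 0) {
                in.unread(buf, 0, n); // the stream will return these bytes again
            }
            return Arrays.copyOf(buf, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = { 1, 2, 3, 4, 5 };
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), 3);

        System.out.println(Arrays.toString(peek(in, 3))); // the first three bytes
        System.out.println(in.read());                    // still the first byte
    }
}
```

Because the peeked bytes go back into the stream, whatever wraps it afterwards (an InputStreamReader in the sketch) still sees the input from the very beginning.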
 
 
 
 
 
Roedy Green
 
      07-02-2003
On Wed, 2 Jul 2003 12:27:56 +0300, "X_AWemner_X" wrote or quoted:

>Win2k Notepad stores BOM mark at the start of UTF-8 files, and currently ISR
>cannot read it properly.


See http://mindprod.com/jgloss/encoding.html

What happens if you use UTF-8 or UTF-16 encoding with the code
suggested by the File I/O Amanuensis at
http://mindprod.com/fileio.html?

Java is not smart enough to flip between 8-bit and 16-bit automatically,
but it is smart enough to deal with endian markers, both BE and LE.
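
The endian-marker handling can be checked quickly. This is a minimal sketch with a made-up class name (Utf16EndianDemo); it decodes the same two letters from big-endian and little-endian bytes using the plain "UTF-16" charset, which picks the byte order from the BOM.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not from the original thread.
class Utf16EndianDemo {
    /** Decodes bytes with the generic UTF-16 charset, which inspects a leading BOM. */
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] bigEndian    = { (byte) 0xFE, (byte) 0xFF, 0, 'H', 0, 'i' };
        byte[] littleEndian = { (byte) 0xFF, (byte) 0xFE, 'H', 0, 'i', 0 };

        // Both lines print the same text; the BOM itself is not part of the result.
        System.out.println(decodeUtf16(bigEndian));
        System.out.println(decodeUtf16(littleEndian));
    }
}
```

What the decoder cannot do is decide between UTF-8 and UTF-16 on its own, which is the gap this thread is about.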

Ideally this should be implemented as yet another encoding:
Unicode-8-16. Does anyone know how you insert your own encoding into
the official list? You can't pass any parameters to the encoding such
as your preferred default big/little endian, so you must create
variant names for all the combinations.
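
On the question of adding your own encoding: since Java 1.4 the JDK has a service-provider hook for this, java.nio.charset.spi.CharsetProvider. The following is only a rough sketch of the registration side. The names AliasCharset, SketchCharsetProvider and "X-Unicode-8-16" are invented for illustration, and the charset here simply delegates to UTF-8 instead of doing any BOM sniffing; a real implementation would need its own CharsetDecoder.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical charset that only forwards to an existing one (no BOM detection).
class AliasCharset extends Charset {
    private final Charset delegate;

    AliasCharset(String name, Charset delegate) {
        super(name, null); // canonical name, no aliases
        this.delegate = delegate;
    }

    @Override public boolean contains(Charset cs) { return delegate.contains(cs); }
    @Override public CharsetDecoder newDecoder()  { return delegate.newDecoder(); }
    @Override public CharsetEncoder newEncoder()  { return delegate.newEncoder(); }
}

// Hypothetical provider; the JDK asks providers like this for names it does not know.
class SketchCharsetProvider extends CharsetProvider {
    static final String NAME = "X-Unicode-8-16"; // assumed name, taken loosely from the post

    private final Charset charset = new AliasCharset(NAME, StandardCharsets.UTF_8);

    @Override public Iterator<Charset> charsets() {
        return Collections.singletonList(charset).iterator();
    }

    @Override public Charset charsetForName(String name) {
        return NAME.equalsIgnoreCase(name) ? charset : null;
    }
}
```

For Charset.forName("X-Unicode-8-16") to find it, the provider class name would have to be listed in a file named META-INF/services/java.nio.charset.spi.CharsetProvider on the class path. As noted above, a charset name carries no parameters, so each default (for example big or little endian) would still need its own name.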


--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
 
 
NoName NoName
 
      07-03-2003
Thanks for the good tip; I was not aware of the PushbackInputStream class. It
made everything really simple to do. Here is an implementation of what you
suggested.


/**
Original pseudocode : Thomas Weidenfeller
Implementation tweaked: Aki Nieminen

http://www.unicode.org/unicode/faq/utf_bom.html
BOMs:
00 00 FE FF = UTF-32, big-endian
FF FE 00 00 = UTF-32, little-endian
FE FF = UTF-16, big-endian
FF FE = UTF-16, little-endian
EF BB BF = UTF-8

Win2k Notepad:
Unicode format = UTF-16LE
***/

import java.io.*;

/**
* Generic unicode textreader, which will use BOM mark
* to identify the encoding to be used.
*/
public class UnicodeReader extends Reader {
    PushbackInputStream internalIn;
    InputStreamReader internalIn2 = null;
    String defaultEnc;

    private static final int BOM_SIZE = 4;

    public UnicodeReader(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    public String getDefaultEncoding() {
        return defaultEnc;
    }

    /** Returns the encoding in use, or null if nothing has been read yet. */
    public String getEncoding() {
        if (internalIn2 == null) return null;
        return internalIn2.getEncoding();
    }

    /**
     * Read-ahead four bytes and check for BOM marks. Extra bytes are
     * unread back to the stream, only BOM bytes are skipped.
     */
    protected void init() throws IOException {
        if (internalIn2 != null) return;

        String encoding;
        byte bom[] = new byte[BOM_SIZE];
        int n = 0, r, unread;
        // A single read() may return fewer bytes than requested, so loop
        // until the buffer is full or the stream ends.
        while (n < bom.length &&
               (r = internalIn.read(bom, n, bom.length - n)) != -1) {
            n += r;
        }

        // The UTF-32 checks come first: the UTF-32LE BOM (FF FE 00 00)
        // starts with the UTF-16LE BOM (FF FE) and would otherwise never match.
        // The n >= ... guards keep short streams from matching zero padding.
        if ( n >= 4 && (bom[0] == (byte)0x00) && (bom[1] == (byte)0x00) &&
                (bom[2] == (byte)0xFE) && (bom[3] == (byte)0xFF) ) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ( n >= 4 && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) &&
                (bom[2] == (byte)0x00) && (bom[3] == (byte)0x00) ) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ( n >= 3 && (bom[0] == (byte)0xEF) && (bom[1] == (byte)0xBB) &&
                (bom[2] == (byte)0xBF) ) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ( n >= 2 && (bom[0] == (byte)0xFE) && (bom[1] == (byte)0xFF) ) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ( n >= 2 && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) ) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // Unicode BOM mark not found, unread all bytes
            encoding = defaultEnc;
            unread = n;
        }

        // Push back everything that was read after the BOM.
        if (unread > 0) internalIn.unread(bom, (n - unread), unread);

        // Use given encoding, or the platform default if none was given
        if (encoding == null) {
            internalIn2 = new InputStreamReader(internalIn);
        } else {
            internalIn2 = new InputStreamReader(internalIn, encoding);
        }
    }

    public void close() throws IOException {
        init();
        internalIn2.close();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        init();
        return internalIn2.read(cbuf, off, len);
    }

}
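
The BOM table from the header comment can also be checked in isolation. Below is a small stand-alone sketch, separate from the class above (BomSniffer, detect and startsWith are hypothetical names). It tests the longer signatures first, because the UTF-32LE BOM FF FE 00 00 begins with the UTF-16LE BOM FF FE, and it never looks past the number of bytes actually read.

```java
// Hypothetical helper, not part of the posted class.
class BomSniffer {
    /** Returns the charset name implied by a BOM in head[0..n), or defaultEnc if none matches. */
    static String detect(byte[] head, int n, String defaultEnc) {
        // Four-byte signatures first, then three, then two.
        if (startsWith(head, n, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, n, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, n, 0xEF, 0xBB, 0xBF))       return "UTF-8";
        if (startsWith(head, n, 0xFE, 0xFF))             return "UTF-16BE";
        if (startsWith(head, n, 0xFF, 0xFE))             return "UTF-16LE";
        return defaultEnc;
    }

    /** True if the first sig.length of the n valid bytes equal the signature. */
    private static boolean startsWith(byte[] head, int n, int... sig) {
        if (n < sig.length) return false;
        for (int i = 0; i < sig.length; i++) {
            if ((head[i] & 0xFF) != sig[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = { (byte) 0xFF, (byte) 0xFE, 0x00, 0x00 };
        System.out.println(detect(head, 4, "ISO-8859-1")); // four valid bytes
        System.out.println(detect(head, 2, "ISO-8859-1")); // only two valid bytes
    }
}
```

One ambiguity remains by nature: a UTF-16LE file whose first character is U+0000 also starts with FF FE 00 00, so any detector has to pick one reading, and this sketch picks UTF-32LE.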


> I have left out all exception handling and minor details:
>
> class UnicodeReader implements Reader {
> PushbackInputStream internalIn;
> InputStreamReader internalOut = null;
> String defaultEnc;


<...clip clip...>
 