unbending wrote:
> I'm having trouble using the example method (to extract text from an
> HTML document I found on Sun's site). It works fine for standard
> ANSI-based files, but when I convert them to Unicode or UTF-8, it
> doesn't work right (it includes a bunch of strange characters).
There is no such thing as a "standard ANSI-based file". ANSI
standardizes (or jointly standardizes) a lot of things, including a good
number of very different character encodings. If you mean ASCII, then
say ASCII. If you mean something else, then say what you mean.
There is also no such character encoding as "Unicode". I'll assume you
mean one of UCS-2BE, UCS-2LE, UTF-16LE or UTF-16BE. The difference
between UCS-2 and UTF-16 is probably not critical for you, unless you're
using characters outside of the Basic Multilingual Plane (BMP). The difference
between big-endian and little-endian is very important, though, and
you'll need to know which one you are using.
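To see why byte order matters, here is a minimal sketch that prints the bytes Java produces for the same text under several encodings. The class name EncodingBytes and the helper hex are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original poster's code.
class EncodingBytes {
    // Render a byte array as space-separated two-digit uppercase hex values.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "A"; // U+0041, inside the BMP
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16LE))); // 41 00
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_8)));    // 41

        // U+1D11E lies outside the BMP, so UTF-16 needs a surrogate pair.
        // UCS-2 has no way to represent it at all.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(hex(clef.getBytes(StandardCharsets.UTF_16BE))); // D8 34 DD 1E
    }
}
```

A reader that guesses the wrong byte order sees 0x4100 instead of 0x0041, which is an entirely different character.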
You then wrote:
> Reader rd = new InputStreamReader(conn.getInputStream());
If you're having character encoding problems, this is almost certainly
the source. The constructor you've used for InputStreamReader uses the
platform default encoding. Because I don't know what platform you're
working on, I can't tell you what that is. Apparently, though, it is
(or is a superset of) the same encoding you used in the first document,
but is not compatible with UTF-8 or whatever other Unicode encoding you
tried.
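As a rough illustration of the failure, the sketch below decodes UTF-8 bytes with the wrong charset. I'm assuming ISO-8859-1 as a stand-in for your platform default, since I don't know what yours really is; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; ISO-8859-1 stands in for an unknown platform default.
class DefaultCharsetDemo {
    // Encode text as UTF-8, then decode those bytes with a different charset.
    public static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        // This is the encoding the one-argument InputStreamReader constructor uses.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // Pure ASCII text decodes identically, which is why the first document worked.
        System.out.println(misdecode("cafe", StandardCharsets.ISO_8859_1).equals("cafe")); // prints "true"

        // U+00E9 becomes the two bytes C3 A9 in UTF-8, and each byte is then
        // decoded as its own character (U+00C3 and U+00A9): the "strange characters".
        String garbled = misdecode("caf\u00E9", StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals("caf\u00C3\u00A9")); // prints "true"
    }
}
```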
There is another constructor for InputStreamReader which allows you to
specify the encoding of the stream. You should use that instead.
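Something along these lines should do it. This is only a sketch: the ByteArrayInputStream stands in for your conn.getInputStream(), I'm assuming the document really is UTF-8, and the class and method names are made up. On older JDKs without StandardCharsets, the InputStreamReader(InputStream, String) constructor with "UTF-8" does the same job.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the original poster's code.
class ExplicitCharsetReader {
    // Read the whole stream as text, decoding with the given charset
    // instead of the platform default.
    public static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader rd = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = rd.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for conn.getInputStream(): UTF-8 bytes held in memory.
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        System.out.println(text.equals("caf\u00E9")); // prints "true"
    }
}
```

The charset you pass must match what the server actually sends, so check the Content-Type header or the document's own declaration rather than guessing.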
--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation