Extract text from HTML (unicode)

 
 
unbending
Guest
Posts: n/a
 
      01-29-2005
I'm having trouble with an example method I found on Sun's site for
extracting text from an HTML document. It works fine for standard
ANSI-based files, but when I convert them to Unicode or UTF-8, the
output includes a bunch of strange characters.

I think the problem has to do with 2-byte vs. 1-byte encoding, but I
have no idea how to fix it. Any ideas?

Here's my code:
// Imports needed by this fragment
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

final StringBuffer buf = new StringBuffer(1000);
try {
    // Create an HTML document that appends all text to buf
    HTMLDocument doc = new HTMLDocument() {
        public HTMLEditorKit.ParserCallback getReader(int pos) {
            return new HTMLEditorKit.ParserCallback() {
                // This method is called whenever text is encountered
                // in the HTML file
                public void handleText(char[] data, int pos) {
                    // Append the characters themselves; data + "\n" would
                    // append the array's toString() instead of its contents
                    buf.append(data).append('\n');
                }
            };
        }
    };

    // Create a reader on the HTML content
    // URL url = new URI(location).toURL();
    URL url = location.toURL();
    URLConnection conn = url.openConnection();
    Reader rd = new InputStreamReader(conn.getInputStream());

    // Parse the HTML
    HTMLEditorKit kit = new HTMLEditorKit();
    kit.read(rd, doc, 0);
} catch (MalformedURLException mue) {
    System.out.println(mue.getLocalizedMessage());
} catch (BadLocationException ble) {
    System.out.println(ble.getLocalizedMessage());
} catch (IOException ioe) {
    System.out.println(ioe.getLocalizedMessage());
}
parsed = buf.toString();

 
Chris Smith
Guest
Posts: n/a
 
      01-29-2005
unbending wrote:
> I'm having trouble with an example method I found on Sun's site for
> extracting text from an HTML document. It works fine for standard
> ANSI-based files, but when I convert them to Unicode or UTF-8, the
> output includes a bunch of strange characters.


There is no such thing as a "standard ANSI-based file". ANSI
standardizes (or jointly standardizes) a lot of things, including a good
number of very different character encodings. If you mean ASCII, then
say ASCII. If you mean something else, then say what you mean.

There is also no such character encoding as "Unicode". I'll assume you
mean one of UCS-2BE, UCS-2LE, UTF-16LE or UTF-16BE. The difference
between UCS-2 and UTF-16 is probably not critical for you, unless you're
using characters outside of Unicode's Basic Multilingual Plane. The difference
between big-endian and little-endian is very important, though, and
you'll need to know which one you are using.
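
To make the byte-order point concrete, here is a minimal sketch that encodes the same two characters three ways and prints the bytes. It assumes Java 7 or later (for StandardCharsets); the class name EndiannessDemo, the hex helper, and the sample string are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EndiannessDemo {
    // Formats bytes as space-separated uppercase hex, e.g. "00 41".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so bytes above 0x7F are not printed as negative values.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: 'A' (U+0041) followed by e-acute (U+00E9).
        String text = "A\u00E9";

        // Same characters, different byte order per 16-bit unit.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE))); // 00 41 00 E9
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE))); // 41 00 E9 00

        // UTF-8 uses one byte for 'A' and two for e-acute.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));    // 41 C3 A9
    }
}
```

A decoder that guesses the wrong byte order will pair the bytes the other way round and produce entirely different characters, which is why it matters which variant the file was saved in.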

You then wrote:

> Reader rd = new InputStreamReader(conn.getInputStream());


If you're having character encoding problems, this is almost certainly
the source. The constructor you've used for InputStreamReader uses the
platform default encoding. Because I don't know what platform you're
working on, I can't tell you what that is. Apparently, though, it is
(or is a superset of) the same encoding you used in the first document,
but is not compatible with UTF-8 or whatever other Unicode encoding you
tried.
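
As a rough illustration of that mismatch, the sketch below encodes text as UTF-8 and then decodes the bytes with a different charset. ISO-8859-1 is only a stand-in for the platform default here (the real default on your machine is unknown), and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    // Encodes text as UTF-8, then decodes those bytes with another charset,
    // which is what happens when a reader assumes the wrong encoding.
    static String decodeUtf8As(String text, Charset assumed) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, assumed);
    }

    public static void main(String[] args) {
        // Pure ASCII survives, because both encodings agree on those bytes.
        String ascii = decodeUtf8As("cafe", StandardCharsets.ISO_8859_1);
        System.out.println(ascii.equals("cafe")); // true

        // e-acute (U+00E9) is two bytes in UTF-8 (C3 A9); read as ISO-8859-1
        // they become two separate characters, U+00C3 and U+00A9.
        String garbled = decodeUtf8As("caf\u00E9", StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());                     // 5
        System.out.println(garbled.equals("caf\u00C3\u00A9"));    // true
    }
}
```

This matches the symptom described: plain ASCII content looks fine, while anything outside ASCII turns into pairs of strange characters.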

There is another constructor for InputStreamReader which allows you to
specify an encoding for the file. You should use that instead.
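
A minimal sketch of that constructor in use, assuming the document is known to be UTF-8 and Java 7 or later. An in-memory byte array stands in for conn.getInputStream() so the example runs without a network; the class name and the readAll helper are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetReader {
    // Reads the whole stream as text, decoding with the given charset
    // instead of the platform default.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader rd = new InputStreamReader(in, charset)) {
            char[] chunk = new char[1024];
            int n;
            while ((n = rd.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep this sketch short.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes a server would send for a UTF-8 page.
        byte[] utf8Html = "<p>caf\u00E9</p>".getBytes(StandardCharsets.UTF_8);

        String text = readAll(new ByteArrayInputStream(utf8Html), StandardCharsets.UTF_8);
        System.out.println(text.equals("<p>caf\u00E9</p>")); // true
    }
}
```

In the posted code the equivalent change would be along the lines of new InputStreamReader(conn.getInputStream(), "UTF-8"), with the charset name swapped for whichever encoding the file was actually saved in. That String-based constructor can throw UnsupportedEncodingException, which is a subclass of IOException, so the existing catch block already covers it.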

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 