Damo wrote:
> Hi
> I'm trying to extract text from an HTML page using DOM. I used JTidy
> first on it. The HTML itself is not very descriptive. There are no
> standout tags around the text I need to extract. The way I was
> thinking of doing it was accessing the attributes, but I keep getting a
> NullPointerException. This is the HTML:
>
>
> <div class="mb16">
> <div id="r_t0" class="prel">
> <a id="r0_t" class="L4"href="http://java.sun.com/"">
> <b>Java</b> Technology</a></div>
> <div class="T1" id="r0_a">Sun's home for <b>Java</b>. Offers
> Windows, Solaris, and Linux <b>Java</b> Development Kits (JDKs),
> extensions, news, tutorials, and product information.</div>
> <div id="r_b0" class="prel T11"><a id="r0_b"
> href="http://java.sun.com/">
> <img src="http://sp.ask.com/sh/i/icon_bins.gif" border="0"class="bb"
> /></a>
> <span id="r0_u" class="T10">java.sun.com/</span>
> <strong>·</strong> <a class="L5 nw"
> href="http://www.askcache.com">
> Cached</a> 1f40 <strong>·</strong>
> <a class="L5 L5V" href="javascript:void(0)">Save</a>
> </div>
> </div>
>
>
> This is the part I want to skip to in order to extract text. It's buried
> in loads of other HTML. Can anyone please help me do this?
The example HTML is a good start, but it would help to see the code that
throws the NPE and the output you expect.
Also, if it's a well-formed XML document, consider using XPath. It
selects data based on the path to that data (including selections based
on element names, attributes, order, etc.).
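
To illustrate the XPath suggestion, here is a minimal sketch using the
javax.xml.xpath API from the standard library. It assumes the markup has
already been cleaned into well-formed XML; the class name XPathExtract, the
helper extractText, and the trimmed-down sample string are made up for this
example and are not taken from your code.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class XPathExtract {

    // Evaluates an XPath expression against a well-formed XML string and
    // returns the string value of the first matching node (or "" if none).
    static String extractText(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Shortened, well-formed version of the posted fragment (assumption:
        // the stray quote after the first href has been cleaned up).
        String xml = "<div class=\"mb16\">"
            + "<div id=\"r_t0\" class=\"prel\">"
            + "<a id=\"r0_t\" class=\"L4\" href=\"http://java.sun.com/\">"
            + "<b>Java</b> Technology</a></div>"
            + "<div class=\"T1\" id=\"r0_a\">Sun's home for <b>Java</b>.</div>"
            + "</div>";

        // Link text, selected by the id attribute of the anchor.
        System.out.println(extractText(xml, "//a[@id='r0_t']"));
        // The href attribute of the same anchor.
        System.out.println(extractText(xml, "//a[@id='r0_t']/@href"));
        // The description, selected by its class attribute.
        System.out.println(extractText(xml, "//div[@class='T1']"));
    }
}
```

If you already hold a DOM Document, the same expressions can be passed to
the evaluate(String, Object) overload with the Document (or any Node) as the
second argument, so there is no need to serialize the tree back to a string
first.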