Velocity Reviews - Computer Hardware Reviews

Accessing attributes in HTML with DOM

 
 
Damo
Guest
Posts: n/a
 
      01-16-2007
Hi,
I'm trying to extract text from an HTML page using DOM. I ran JTidy
on it first. The HTML itself is not very descriptive: there are no
standout tags around the text I need to extract. I was thinking of
locating it through the attributes, but I keep getting a
NullPointerException. This is the HTML:


<div class="mb16">
<div id="r_t0" class="prel">
<a id="r0_t" class="L4" href="http://java.sun.com/">
<b>Java</b> Technology</a></div>
<div class="T1" id="r0_a">Sun's home for <b>Java</b>. Offers
Windows, Solaris, and Linux <b>Java</b> Development Kits (JDKs),
extensions, news, tutorials, and product information.</div>
<div id="r_b0" class="prel T11"><a id="r0_b"
href="http://java.sun.com/">
<img src="http://sp.ask.com/sh/i/icon_bins.gif" border="0" class="bb"
/></a>
<span id="r0_u" class="T10">java.sun.com/</span>
<strong>&middot;</strong> <a class="L5 nw"
href="http://www.askcache.com">
Cached</a> <strong>&middot;</strong>
<a class="L5 L5V" href="javascript:void(0)">Save</a>
</div>
</div>


This is the part I want to skip to in order to extract the text. It's
buried in loads of other HTML. Can anyone please help me do this?

 
 
 
 
 
Daniel Pitts
Guest
Posts: n/a
 
      01-16-2007

The example HTML is a good start. Perhaps you should also give us
the code that produces the NPE, and what you expect the output to be.
Also, if it's a valid XML document, consider using XPath: it selects
data based on the path to that data (including selections based on
element names, attributes, order, etc.).
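
A minimal sketch of what the XPath suggestion could look like with the standard javax.xml.xpath API. It assumes the tidied page is well-formed XML that a normal (non-namespace-aware) JAXP parser accepts. The class name, the helper names and the trimmed-down SAMPLE markup are made up for illustration and are not from the original page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSelectSketch {
    // Hypothetical, shortened stand-in for the tidied page.
    static final String SAMPLE =
        "<html><body>"
        + "<div class=\"other\">noise</div>"
        + "<div class=\"mb16\">"
        + "<div id=\"r_t0\" class=\"prel\">"
        + "<a id=\"r0_t\" href=\"http://java.sun.com/\"><b>Java</b> Technology</a></div>"
        + "<div class=\"T1\" id=\"r0_a\">Sun's home for <b>Java</b>.</div>"
        + "</div>"
        + "</body></html>";

    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates an XPath expression and returns the matching nodes.
    static NodeList select(Document doc, String expr) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        // Direct <div> children of the div whose class attribute is exactly "mb16".
        NodeList children = select(doc, "//div[@class='mb16']/div");
        for (int i = 0; i < children.getLength(); i++) {
            System.out.println(children.item(i).getTextContent());
        }
    }
}
```

The predicate [@class='mb16'] only matches when the whole attribute value is "mb16"; a div with several class names would need something like contains(@class, 'mb16') instead.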

 
 
 
 
 
Damo
Guest
Posts: n/a
 
      01-16-2007
If I can get at the first div, I can get its child nodes. How would one
use XPath to get it?
The code below is what I have:




NodeList sections = document.getElementsByTagName("div");
for (int i = 0; i < sections.getLength(); i++)
{
    Element section = (Element) sections.item(i);

    Attr attr = (Attr) section.getAttributeNode("class");
    boolean wasSpecified = attr != null && attr.getSpecified();

    String at = attr.getValue();
    if (at == "mb16")
    {
        // I have a recursive method to get the text nodes for here
        // if I can get at the child nodes of that particular div
    }
}
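
The NullPointerException in a loop like this most likely comes from calling attr.getValue() on a div that has no class attribute: getAttributeNode returns null in that case, and the wasSpecified check is computed but never used as a guard. A hedged sketch of a null-safe variant follows, using Element.getAttribute, which returns an empty string rather than null when the attribute is absent. The class name, method names and sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NullSafeClassLookup {
    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns every <div> whose class attribute is exactly the wanted value.
    static List<Element> divsWithClass(Document doc, String wanted) {
        List<Element> hits = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            // getAttribute yields "" (never null) for a div without a class.
            String cls = div.getAttribute("class");
            // Compare string contents, not object references.
            if (wanted.equals(cls)) {
                hits.add(div);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample: the first div has no class attribute at all.
        Document doc = parse(
            "<body><div>no class here</div>"
            + "<div class=\"mb16\"><div class=\"prel\">x</div></div></body>");
        System.out.println(divsWithClass(doc, "mb16").size());
    }
}
```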

 
 
Damo
Guest
Posts: n/a
 
      01-16-2007
Oh, and the output I want is:

Java Technology

Sun's home for Java. Offers
Windows, Solaris, and Linux Java Development Kits (JDKs),
extensions, news, tutorials, and product information.

java.sun.com/

all stored as 3 different Strings.
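
One possible way to arrive at those three strings, sketched under the assumption that the ids r0_t, r0_a and r0_u from the posted HTML can be relied on and that the tidied document parses as XML. The XPath function normalize-space collapses the line breaks inside each element's text. The class name, the textOf helper and the shortened SAMPLE are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ThreeStringsSketch {
    // Shortened copy of the posted fragment (image and links omitted).
    static final String SAMPLE =
        "<div class=\"mb16\">\n"
        + "<div id=\"r_t0\" class=\"prel\">\n"
        + "<a id=\"r0_t\" class=\"L4\" href=\"http://java.sun.com/\">\n"
        + "<b>Java</b> Technology</a></div>\n"
        + "<div class=\"T1\" id=\"r0_a\">Sun's home for <b>Java</b>. Offers\n"
        + "Windows, Solaris, and Linux <b>Java</b> Development Kits (JDKs),\n"
        + "extensions, news, tutorials, and product information.</div>\n"
        + "<div id=\"r_b0\" class=\"prel T11\">"
        + "<span id=\"r0_u\" class=\"T10\">java.sun.com/</span></div>\n"
        + "</div>";

    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Whitespace-normalized text of the first element carrying the given id.
    static String textOf(Document doc, String id) {
        try {
            String expr = "normalize-space(//*[@id='" + id + "'])";
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        String title = textOf(doc, "r0_t");
        String description = textOf(doc, "r0_a");
        String url = textOf(doc, "r0_u");
        System.out.println(title);
        System.out.println(description);
        System.out.println(url);
    }
}
```

If the numeric part of the ids changes from one result to the next, the ids would have to be generated per result, or the lookup switched to class-based paths.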

 
 
Damo
Guest
Posts: n/a
 
      01-17-2007
I'm now using this code. It finds the div nodes with the class attribute
"prel", but it will not get their child nodes.

if(attr.getValue()=="prel"): Is there something wrong with this line?




NodeList sections = document.getElementsByTagName("div");
System.out.println(sections.getLength());
for (int i = 0; i < sections.getLength(); i++)
{
    Element section = (Element) sections.item(i);
    Attr attr = (Attr) section.getAttributeNode("class");
    if (attr == null)
    {
        System.out.println("false");
    }
    else
    {
        System.out.println(attr.getValue());
        if (attr.getValue() == "prel")
        {
            NodeList name = section.getChildNodes();

            System.out.println(name.getLength());
            for (int j = 0; j < name.getLength(); j++)
            {
                Element list = (Element) name.item(j);
                String title = getText(list.getFirstChild());
                System.out.println(title);
            }
        }
    }
}

 
 
Andrew Thompson
Guest
Posts: n/a
 
      01-17-2007

Damo wrote:
> I'm now using this code. It finds the div nodes with the class attribute
> "prel", but it will not get their child nodes.

// compares references to the two strings
> if(attr.getValue()=="prel"): Is there something wrong with this line?

// compares contents of strings
if(attr.getValue().equals("prel"))

Andrew T.
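
A sketch of how the loop might look with that fix applied, plus two further guards that the posted code appears to need. First, a class attribute can hold several names (the posted HTML has class="prel T11"), so the value is split into tokens. Second, getChildNodes also returns text nodes, so casting every child to Element can throw a ClassCastException. The getText helper from the post is not shown in the thread, so getTextContent stands in for it here; all class and method names and the sample markup are hypothetical, and a standard JAXP-parsed document is assumed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PrelChildrenSketch {
    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // True if the element's class attribute contains the given name as a token.
    static boolean hasClass(Element e, String token) {
        for (String t : e.getAttribute("class").trim().split("\\s+")) {
            if (t.equals(token)) {   // content comparison, not ==
                return true;
            }
        }
        return false;
    }

    // Trimmed text of each element child of every <div> carrying the class.
    static List<String> childTexts(Document doc, String token) {
        List<String> out = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (!hasClass(div, token)) {
                continue;
            }
            NodeList kids = div.getChildNodes();
            for (int j = 0; j < kids.getLength(); j++) {
                Node kid = kids.item(j);
                // Skip whitespace and other non-element children.
                if (kid.getNodeType() != Node.ELEMENT_NODE) {
                    continue;
                }
                out.add(kid.getTextContent().trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<body><div class=\"prel\">\n<a>Java Technology</a></div>"
            + "<div class=\"prel T11\"><span>java.sun.com/</span> <strong>x</strong></div>"
            + "<div>none</div></body>");
        for (String s : childTexts(doc, "prel")) {
            System.out.println(s);
        }
    }
}
```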

 