wrote:
> I'm slowly discovering the world of JavaScript, so I'm not sure I'm
> attacking this problem in the right manner, thus if I'm in the wrong
> newsgroup, my apologies.
>
> What I'm trying to do is extract some news items from a web site. To
> do this, I'm using Microsoft Word VBA and using the following bit of
> script:
>
> '// Open web site
> IeApp.Navigate "http://www.radioaustralia.net.au/francais/stories/s1776501.htm"
> Do: Loop Until IeApp.ReadyState = READYSTATE_COMPLETE
>
> '// Find text to extract
> txtTitle = IeApp.Document.GetElementByID("a2title").innerhtml
> txt = IeApp.Document.GetElementByID("a2copy").innerhtml
>
> When extracting the text (ie. "txt") I seem to get more than just the
> text of the body that I'm after, and the resulting junk is difficult to
> remove.

So you are not using JavaScript at all; you are automating Internet
Explorer with VBA. The IE object model for HTML documents is documented
here:
<http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/dhtml_reference_entry.asp>

You might be after the |innerText| property instead of the |innerHTML|
property of element objects: |innerHTML| returns the markup inside the
element, tags included, while |innerText| returns only the rendered
text. Or you might want to look at specific child or descendant nodes
of an element you have found with getElementById. For instance

  IeApp.Document.getElementById("a2copy")

gives you a div element object which then has other nodes (e.g. a table
element) as child nodes. Once you have an element node you can access
its |firstChild| and |lastChild| properties and its |childNodes|
collection, and you can call |getElementsByTagName| on the element to
find descendant elements of a certain tag name.
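
As a rough sketch of both suggestions in VBA (untested against that
page): the first part reads |innerText| for the title, the second
collects text from descendant elements of the "a2copy" div. The Sub
name, the late-bound Object variables and the choice of "p" as the tag
name are assumptions for illustration only; I have not checked which
elements actually hold the story text, so substitute whatever tag the
markup uses.

```vba
' Sketch only: ExtractStoryText and the "p" tag name are hypothetical.
Sub ExtractStoryText()
    Dim IeApp As Object
    Dim copyDiv As Object
    Dim paras As Object
    Dim i As Long
    Dim txtTitle As String
    Dim txt As String

    Set IeApp = CreateObject("InternetExplorer.Application")
    IeApp.Navigate "http://www.radioaustralia.net.au/francais/stories/s1776501.htm"

    ' With late binding the READYSTATE_COMPLETE constant is not defined,
    ' so its value 4 is used directly. DoEvents keeps Word responsive.
    Do While IeApp.Busy Or IeApp.ReadyState <> 4
        DoEvents
    Loop

    ' innerText returns the rendered text without any tags
    txtTitle = IeApp.Document.getElementById("a2title").innerText

    ' Narrow the search to descendants of the div that holds the story
    Set copyDiv = IeApp.Document.getElementById("a2copy")

    ' Assumption: the body text sits in p elements inside that div
    Set paras = copyDiv.getElementsByTagName("p")
    For i = 0 To paras.Length - 1
        txt = txt & paras.Item(i).innerText & vbCrLf
    Next i

    IeApp.Quit
End Sub
```

If the text turns out to sit in table cells rather than paragraphs,
passing "td" to |getElementsByTagName| should work the same way.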
--
Martin Honnen
http://JavaScript.FAQTs.com/