Velocity Reviews - Computer Hardware Reviews

Parsing HTML?

 
 
Benjamin
      04-03-2008
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.
 
 
Daniel Fetchinson
      04-03-2008
> I'm trying to parse an HTML file. I want to retrieve all of the text
> inside a certain tag that I find with XPath. The DOM seems to make
> this available with the innerHTML element, but I haven't found a way
> to do it in Python.


Have you tried http://www.google.com/search?q=python+html+parser ?

HTH,
Daniel
 
 
benash@gmail.com
      04-03-2008
BeautifulSoup does what I need it to. Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM. Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.
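
For readers wondering what the event-driven, SAX-style approach looks like in practice, here is a minimal sketch. It assumes Python 3, where the parser lives in html.parser (the post refers to the older HTMLParser module), and the class name, helper function and sample markup are made up for illustration. The parser only reports start tags, end tags and text as it goes, so the code has to track for itself whether it is inside the target element.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside the first element with a given tag name and id.

    Hypothetical helper for illustration; not part of the standard library.
    """

    def __init__(self, tag, element_id):
        super().__init__()
        self.tag = tag
        self.element_id = element_id
        self.depth = 0      # nesting depth of `tag` within the target; 0 = outside
        self.done = False   # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Already inside the target: count nested tags of the same name.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        # Text is only kept while we are inside the target element.
        if self.depth:
            self.chunks.append(data)


def text_inside(html, tag, element_id):
    collector = TagTextCollector(tag, element_id)
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


sample = '<html><body><div id="main">Hello <b>bold</b> world</div><p>skip</p></body></html>'
result = text_inside(sample, "div", "main")
print(result)
```

The bookkeeping in depth and done is the part a tree-based API would otherwise handle, which is presumably why a DOM-like interface is more attractive for this task.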

On Wed, Apr 2, 2008 at 10:37 PM, Daniel Fetchinson
<(E-Mail Removed)> wrote:
> > I'm trying to parse an HTML file. I want to retrieve all of the text
> > inside a certain tag that I find with XPath. The DOM seems to make
> > this available with the innerHTML element, but I haven't found a way
> > to do it in Python.

>
> Have you tried http://www.google.com/search?q=python+html+parser ?
>
> HTH,
> Daniel
>

 
Paul Boddie
      04-03-2008
On 3 Apr, 06:59, Benjamin <(E-Mail Removed)> wrote:
> I'm trying to parse an HTML file. I want to retrieve all of the text
> inside a certain tag that I find with XPath. The DOM seems to make
> this available with the innerHTML element, but I haven't found a way
> to do it in Python.


With libxml2dom you'd do the following:

1. Parse the file using libxml2dom.parse with html set to a true
   value.
2. Use the xpath method on the document to select the desired
   element.
3. Use the toString method on the element to get the text of the
   element (including start and end tags), or the textContent
   property to get the text between the tags.

See the Package Index page for more details:

http://www.python.org/pypi/libxml2dom
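
The same parse, select and serialize steps can be sketched with the standard library alone. Note that this is not libxml2dom: it uses xml.etree.ElementTree as a stand-in, which only accepts well-formed (XHTML-style) markup and supports just a subset of XPath. The sample document and the id value are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; ElementTree needs well-formed markup, unlike an HTML parser.
SAMPLE = '<html><body><div id="content">Some <em>marked-up</em> text</div></body></html>'

# Step 1: parse the document.
root = ET.fromstring(SAMPLE)

# Step 2: select the desired element (ElementTree's limited XPath subset).
element = root.find(".//div[@id='content']")

# Step 3a: the element serialized with its start and end tags.
with_tags = ET.tostring(element, encoding="unicode")

# Step 3b: only the text between the tags, with nested markup dropped.
text_only = "".join(element.itertext())

print(with_tags)
print(text_only)
```

The two results correspond roughly to the toString and textContent options described above: one keeps the markup, the other flattens it to text.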

Paul
 
7stud
      04-04-2008
On Apr 3, 12:39 am, (E-Mail Removed) wrote:
> BeautifulSoup does what I need it to. Though, I was hoping to find
> something that would let me work with the DOM the way JavaScript can
> work with web browsers' implementations of the DOM. Specifically, I'd
> like to be able to access the innerHTML element of a DOM element.
> Python's built-in HTMLParser is SAX-based, so I don't want to use
> that, and the minidom doesn't appear to implement this part of the
> DOM.
>


innerHTML has never been part of the DOM. It is, however, a de facto
browser standard. That's probably why you aren't having any luck
using a Python module that implements the DOM.
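
As a rough illustration of the point, the sketch below checks that a minidom element has no innerHTML attribute and then emulates it by serializing the element's children. The helper name inner_markup and the sample markup are hypothetical, and minidom only parses well-formed XML, so this assumes XHTML-style input rather than arbitrary HTML.

```python
from xml.dom import minidom


def inner_markup(node):
    """Serialize the children of `node`, roughly what browsers expose as innerHTML.

    Hypothetical helper; minidom itself provides no such property.
    """
    return "".join(child.toxml() for child in node.childNodes)


# Made-up sample; must be well-formed XML for minidom to accept it.
doc = minidom.parseString('<div id="box">Hello <b>bold</b> world</div>')
box = doc.documentElement

has_inner_html = hasattr(box, "innerHTML")   # minidom elements lack this attribute
markup = inner_markup(box)                   # child markup without the outer <div>

print(has_inner_html)
print(markup)
```

Joining toxml() over childNodes keeps nested tags intact while leaving out the element's own start and end tags, which is the main difference from calling toxml() on the element itself.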
 
Stefan Behnel
      04-07-2008
Benjamin wrote:
> I'm trying to parse an HTML file. I want to retrieve all of the text
> inside a certain tag that I find with XPath. The DOM seems to make
> this available with the innerHTML element, but I haven't found a way
> to do it in Python.


import lxml.html as h
tree = h.parse("somefile.html")
text = tree.xpath("string( some/element[@condition] )")

http://codespeak.net/lxml

Stefan
 
Benjamin
      04-26-2008
On Apr 3, 9:10 pm, 7stud <(E-Mail Removed)> wrote:
> On Apr 3, 12:39 am, (E-Mail Removed) wrote:
>
> > BeautifulSoup does what I need it to. Though, I was hoping to find
> > something that would let me work with the DOM the way JavaScript can
> > work with web browsers' implementations of the DOM. Specifically, I'd
> > like to be able to access the innerHTML element of a DOM element.
> > Python's built-in HTMLParser is SAX-based, so I don't want to use
> > that, and the minidom doesn't appear to implement this part of the
> > DOM.
>
> innerHTML has never been part of the DOM. It is, however, a de facto
> browser standard. That's probably why you aren't having any luck
> using a Python module that implements the DOM.


That makes sense.
 
Benjamin
      04-26-2008
On Apr 6, 11:03 pm, Stefan Behnel <(E-Mail Removed)> wrote:
> Benjamin wrote:
> > I'm trying to parse an HTML file. I want to retrieve all of the text
> > inside a certain tag that I find with XPath. The DOM seems to make
> > this available with the innerHTML element, but I haven't found a way
> > to do it in Python.
>
>     import lxml.html as h
>     tree = h.parse("somefile.html")
>     text = tree.xpath("string( some/element[@condition] )")
>
> http://codespeak.net/lxml
>
> Stefan


I actually had trouble getting this to work. I guess only newer versions
of lxml have the html module, and I couldn't get it installed. lxml
does look pretty cool, though.
 
Stefan Behnel
      04-26-2008
Benjamin wrote:
> On Apr 6, 11:03 pm, Stefan Behnel <(E-Mail Removed)> wrote:
>> Benjamin wrote:
>>> I'm trying to parse an HTML file. I want to retrieve all of the text
>>> inside a certain tag that I find with XPath. The DOM seems to make
>>> this available with the innerHTML element, but I haven't found a way
>>> to do it in Python.

>> import lxml.html as h
>> tree = h.parse("somefile.html")
>> text = tree.xpath("string( some/element[@condition] )")
>>
>> http://codespeak.net/lxml
>>
>> Stefan

>
> I actually had trouble getting this to work. I guess only newer versions
> of lxml have the html module, and I couldn't get it installed. lxml
> does look pretty cool, though.


Yes, the above code requires lxml 2.x. However, older versions should allow
you to do this:

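import lxml.etree as et

# Use the HTML parser explicitly; lxml.html is not needed for this.
parser = et.HTMLParser()
tree = et.parse("somefile.html", parser)
# string() returns the text content of the first matching element.
text = tree.xpath("string( some/element[@condition] )")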

lxml.html is just a dedicated package that makes HTML handling beautiful. It's
not required for parsing HTML and doing general XML stuff with it.

Stefan
 