Any equivalent to Ruby's 'hpricot' html/xpath/css selector package?

 
 
Kenneth McDonald
      12-28-2008
Ruby has a package called 'hpricot' which can perform limited xpath
queries, and CSS selector queries. However, what makes it really
useful is that it does a good job of handling the "broken" html that
is so commonly found on the web. Does Python have anything similar,
i.e. something that will not only do XPath queries, but will do so on
imperfect HTML? (A good HTML neatener would also be fine, of course,
as I could then pass the result to a Python XPath package.)

And, what are people's favorite Python XPath solutions?

Thanks,
Ken McDonald
 
 
 
 
 
Bruno Desthuilliers
      12-29-2008
Kenneth McDonald wrote:
> Ruby has a package called 'hpricot' which can perform limited xpath
> queries,


ElementTree? (It's in the stdlib now.)

> and CSS selector queries.


PyQuery?
http://pypi.python.org/pypi/pyquery

> However, what makes it really useful
> is that it does a good job of handling the "broken" html that is so
> commonly found on the web.


BeautifulSoup?
http://pypi.python.org/pypi/BeautifulSoup/3.0.7a

Possibly with ElementSoup?
http://pypi.python.org/pypi/ElementSoup/rev452
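
A minimal sketch of the ElementTree suggestion above, assuming well-formed input. The sample markup and the example.com URLs are made up for illustration. It shows the limited XPath subset the stdlib module supports, and that the stdlib parser rejects broken markup instead of repairing it, which is why a lenient parser is still needed for real-world pages.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample document (ElementTree needs valid XML).
DOC = """<html><body>
<div class="post"><a href="http://example.com/a">first</a></div>
<div class="post"><a href="http://example.com/b">second</a></div>
<div class="ad"><a href="http://example.com/c">third</a></div>
</body></html>"""

root = ET.fromstring(DOC)

# ElementTree understands a small XPath subset: child and descendant
# steps plus simple attribute predicates.
links = [a.get("href") for a in root.findall(".//div[@class='post']/a")]
print(links)

# Tag soup (here, mismatched closing tags) raises ParseError rather
# than being tidied up.
try:
    ET.fromstring("<p>unclosed <b>bold</p>")
    parsed_broken = True
except ET.ParseError:
    parsed_broken = False
print(parsed_broken)
```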

 
 
 
 
 
Mark Thomas
      12-29-2008
On Dec 28, 6:22 pm, Kenneth McDonald wrote:
> Ruby has a package called 'hpricot' which can perform limited xpath
> queries, and CSS selector queries. However, what makes it really
> useful is that it does a good job of handling the "broken" html that
> is so commonly found on the web. Does Python have anything similar,
> i.e. something that will not only do XPath queries, but will do so on
> imperfect HTML?


Hpricot is a fine package but I prefer Nokogiri (see
http://www.rubyinside.com/nokogiri-r...rser-1288.html)
because it is based on libxml2 and therefore is faster, conforms to
the full XPath 1.0 spec, works on imperfect HTML, and exposes the
Hpricot API.

In Python, the equivalent is lxml (http://codespeak.net/lxml/), which
is similarly based on libxml2, very fast, XPath-1.0 conformant, and
exposes the now-standard ElementTree API.

The main difference is that lxml doesn't have CSS selector syntax, but
IMHO that's a gimmick when you have a full XPath 1.0 engine at your
disposal.

-- Mark.
 
 
Stefan Behnel
      12-30-2008
Mark Thomas wrote:
> The main difference is that lxml doesn't have CSS selector syntax


Feel free to read the docs:

http://codespeak.net/lxml/cssselect.html

Stefan
 
 
Stefan Behnel
      12-30-2008
Bruno Desthuilliers wrote:
>> However, what makes it really useful is that it does a good job of
>> handling the "broken" html that is so commonly found on the web.
>
> BeautifulSoup ?
> http://pypi.python.org/pypi/BeautifulSoup/3.0.7a
>
> possibly with ElementSoup ?
> http://pypi.python.org/pypi/ElementSoup/rev452


It's actually debatable whether BeautifulSoup is any better than
lxml/libxml2 at parsing broken HTML, as lxml tends to tidy things up
pretty well. The only major difference is in encoding detection, for
which you can also use a separate tool like chardet:

http://chardet.feedparser.org/

Stefan
 
 
Stefan Behnel
      12-30-2008
Kenneth McDonald wrote:
> Ruby has a package called 'hpricot' which can perform limited xpath
> queries, and CSS selector queries. However, what makes it really useful
> is that it does a good job of handling the "broken" html that is so
> commonly found on the web. Does Python have anything similar, i.e.
> something that will not only do XPath queries, but will do so on
> imperfect HTML?


lxml.html is your friend.

http://codespeak.net/lxml/lxmlhtml.html

Stefan
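
A minimal sketch of what that looks like, assuming lxml is installed. The markup is an invented example of typical tag soup (unclosed <p> tags, unquoted attributes, missing </body> and </html>); lxml.html.fromstring() repairs it, and the resulting tree takes ordinary XPath queries.

```python
import lxml.html

# Invented "broken" page: no <p> is closed, two attributes are unquoted,
# and the closing </body> and </html> tags are missing.
BROKEN = """<html><body>
<div id=main>
<p>First paragraph
<p>Second, with <a href="one.html">a link</a>
<p>Third, with <a href=two.html>another link</a>
</div>
"""

doc = lxml.html.fromstring(BROKEN)

# After parsing, each <p> is a proper child of the <div>.
paragraphs = doc.xpath("//div[@id='main']/p")
print(len(paragraphs))

# Attribute values can be selected directly with XPath.
hrefs = doc.xpath("//div[@id='main']/p/a/@href")
print(hrefs)
```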
 
 
Mark Thomas
      12-30-2008
On Dec 30, 8:20 am, Stefan Behnel wrote:
> Mark Thomas wrote:
> > The main difference is that lxml doesn't have CSS selector syntax
>
> Feel free to read the docs:
>
> http://codespeak.net/lxml/cssselect.html


Don't know how I missed that...

So lxml is pretty much an exact equivalent to what Ruby has to offer
(Hpricot or Nokogiri). Nice.

 