Velocity Reviews


bruce 07-01-2006 04:02 AM

beautifulsoup .vs tidy
 
hi...

never used perl, but i have an issue trying to resolve some html that
appears to be "dirty/malformed" regarding the overall structure. in
researching validators, i came across the beautifulsoup app and wanted to
know if anybody could give me pros/cons of the app as it relates to any of
the other validation apps...

the issue i'm facing involves parsing some websites, so i'm trying to
extract information based on the DOM/XPath functions.. i'm using perl to
handle the extraction....

thanks

-bruce
bedouglas@earthlink.net


Ravi Teja 07-01-2006 05:23 AM

Re: beautifulsoup .vs tidy
 
bruce wrote:
> hi...
>
> never used perl, but i have an issue trying to resolve some html that
> appears to be "dirty/malformed" regarding the overall structure. in
> researching validators, i came across the beautifulsoup app and wanted to
> know if anybody could give me pros/cons of the app as it relates to any of
> the other validation apps...
>
> the issue i'm facing involves parsing some websites, so i'm trying to
> extract information based on the DOM/XPath functions.. i'm using perl to
> handle the extraction....


1.) XPath is not a good idea at all with "malformed" HTML or perhaps
web pages in general.
2.) BeautifulSoup is not a validator but works well with bad HTML. Also
look at Mechanize and ClientForm.
3.) XMLStarlet is a good XML validator
(http://xmlstar.sourceforge.net/). It's not Python but you don't need
to care about the language it is written in.
4.) For a simple HTML validator, just use http://validator.w3.org/


Paddy 07-01-2006 08:09 AM

Re: beautifulsoup .vs tidy
 

bruce wrote:
> hi...
>
> never used perl, but i have an issue trying to resolve some html that
> appears to be "dirty/malformed" regarding the overall structure. in
> researching validators, i came across the beautifulsoup app and wanted to
> know if anybody could give me pros/cons of the app as it relates to any of
> the other validation apps...
>

I'm not too sure of what you are after. You mention tidy in the subject
which made me think that maybe you were trying to generate well-formed
HTML from malformed webpages that nonetheless browsers can interpret.
If that is the case then try HTML tidy:
http://www.w3.org/People/Raggett/tidy/

- Pad.


Fredrik Lundh 07-01-2006 03:26 PM

Re: beautifulsoup .vs tidy
 
bruce wrote:

> that's exactly what i'm trying to accomplish... i've used tidy, but it seems
> to still generate warnings...
>
> initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
>
> the xpath/libxml functions in the perl app complain regarding the file.


what exactly do they complain about ?

</F>


Paul Boddie 07-01-2006 04:43 PM

Re: beautifulsoup .vs tidy
 
Ravi Teja wrote:
>
> 1.) XPath is not a good idea at all with "malformed" HTML or perhaps
> web pages in general.


import libxml2dom
import urllib
f = urllib.urlopen("http://wiki.python.org/moin/")
s = f.read()
f.close()
# s contains HTML not XML text
d = libxml2dom.parseString(s, html=1)
# get the community-related links
for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
    print label.nodeValue

Of course, lxml should be able to do this kind of thing as well. I'd be
interested to know why this "is not a good idea", though.

Paul


Matt Good 07-01-2006 10:22 PM

Re: beautifulsoup .vs tidy
 
bruce wrote:
> that's exactly what i'm trying to accomplish... i've used tidy, but it seems
> to still generate warnings...
>
> initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
>
> the xpath/libxml functions in the perl app complain regarding the file. my
> thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
> functions are too strict!


Clean HTML is not valid XML. If you want to process the output with an
XML library you'll need to tell Tidy to output XHTML. Then it should
be valid for XML processing.
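
A minimal sketch of the difference, using only Python's standard library rather than the Perl/libxml stack from the original question. The two sample strings and the helper name is_well_formed are made up for illustration. An unclosed <br> is perfectly clean HTML, yet a strict XML parser rejects it; the self-closed <br/> form, which is what Tidy should emit when run with its -asxhtml option, parses fine.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample fragments: the same content in HTML and XHTML style.
html_style = "<p>first line<br>second line</p>"
xhtml_style = "<p>first line<br/>second line</p>"


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(html_style))   # False: <br> is never closed
print(is_well_formed(xhtml_style))  # True: <br/> closes itself
```

If the XML library still complains after the XHTML conversion, the parser's error message usually names the offending line and construct, which is the quickest way to see what is left to clean.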

Of course BeautifulSoup is also a very nice library if you need to
extract some information, but don't necessarily require XML processing
to do it.

-- Matt Good


Ravi Teja 07-01-2006 10:53 PM

Re: beautifulsoup .vs tidy
 

Paul Boddie wrote:
> Ravi Teja wrote:
> >
> > 1.) XPath is not a good idea at all with "malformed" HTML or perhaps
> > web pages in general.

>
> import libxml2dom
> import urllib
> f = urllib.urlopen("http://wiki.python.org/moin/")
> s = f.read()
> f.close()
> # s contains HTML not XML text
> d = libxml2dom.parseString(s, html=1)
> # get the community-related links
> for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
>     print label.nodeValue


I wasn't aware that your module does html as well.

> Of course, lxml should be able to do this kind of thing as well. I'd be
> interested to know why this "is not a good idea", though.


No reason that you don't know already.

http://www.boddie.org.uk/python/HTML.html

"If the document text is well-formed XML, we could omit the html
parameter or set it to have a false value."

XML parsers are not required to be forgiving to be regarded compliant.
And much HTML out there is not well formed.


Fredrik Lundh 07-02-2006 07:22 AM

Re: beautifulsoup .vs tidy
 
Ravi Teja wrote:

>> Of course, lxml should be able to do this kind of thing as well. I'd be
>> interested to know why this "is not a good idea", though.

>
> No reason that you don't know already.
>
> http://www.boddie.org.uk/python/HTML.html
>
> "If the document text is well-formed XML, we could omit the html
> parameter or set it to have a false value."
>
> XML parsers are not required to be forgiving to be regarded compliant.
> And much HTML out there is not well formed.


so? once you run it through an HTML-aware parser, the *resulting*
structure is well formed.

a site generator->converter->xpath approach is no less reliable than any
other HTML-scraping approach.
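
a rough sketch of that idea, assuming nothing beyond the python standard library: a forgiving HTML parser builds a tree, and path queries then run against the tree rather than the raw text. the class name ForgivingBuilder, the VOID set and the sample markup are all made up for illustration, and the tree it builds is cruder than what libxml2's HTML mode, TagSoup or BeautifulSoup would produce (an unclosed <li> ends up nested inside the previous one, for instance). ElementTree also supports only a subset of XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content (illustrative subset, not the full list).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class ForgivingBuilder(HTMLParser):
    """Build an ElementTree from HTML, tolerating missing or stray end tags."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attrib = {name: value or "" for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # an end tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


# Malformed input: neither <li> is closed, nor is the second <a>.
messy = '<ul><li><a href="/a">One</a><li><a href="/b">Two</ul>'

builder = ForgivingBuilder()
builder.feed(messy)
builder.close()
root = builder.root

# The tree can be queried with path expressions...
hrefs = [a.get("href") for a in root.findall(".//li/a")]
print(hrefs)

# ...and serializing it gives text that a strict XML parser accepts.
reparsed = ET.fromstring(ET.tostring(root))
print(reparsed.tag)
```

the scraping logic only ever sees the repaired tree, so how broken the source markup was stops mattering once the converter has run.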

</F>


uche.ogbuji@gmail.com 07-03-2006 02:28 AM

Re: beautifulsoup .vs tidy
 
bruce wrote:
> hi paddy...
>
> that's exactly what i'm trying to accomplish... i've used tidy, but it seems
> to still generate warnings...
>
> initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
>
> the xpath/libxml functions in the perl app complain regarding the file. my
> thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
> functions are too strict!
>
> which is why i decided to see if anyone on the python side has
> experienced/solved this problem..


FWIW here's my usual approach:

http://copia.ogbuji.net/blog/2005-07-22/Beyond_HTM

Personally, I avoid Tidy. I've too often seen it crash or hang on
really bad HTML. TagSoup seems to be built like a tank. I've also
never seen BeautifulSoup choke, but I don't use it as much as TagSoup.

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://fourthought.com
http://copia.ogbuji.net http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/



All times are GMT.
