I'm looking for html cleaner. Example : convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>…

 
 
Stéphane Klein
      03-29-2010
Hi,

I am working on an HTML cleaner.

I export OpenOffice.org documents to HTML.
Next, I would like to clean up the exported HTML files:

* remove comments
* remove styles
* remove dispensable tags
* ...

Some difficulties:

* convert <p>my text <span>foo</span> bar</p> => <p>my text foo bar</p>
* convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>

To do this, I use lxml and pyquery.

Question:

* Are there any XML helper tools in Python for this kind of processing? I've
looked on PyPI and found nothing.

If you confirm that such tools don't exist, I may publish a helper
package to do this "clean" processing.

Thanks for your help,
Stephane

 
Harishankar
      03-29-2010
On Mon, 29 Mar 2010 10:12:09 +0200, Stéphane Klein wrote:

> Hi,
>
> I am working on an HTML cleaner.
>
> I export OpenOffice.org documents to HTML. Next, I would like to clean up
> the exported HTML files:
>
> * remove comments
> * remove styles
> * remove dispensable tags
> * ...
>
> Some difficulties:
>
> * convert <p>my text <span>foo</span> bar</p> => <p>my text foo bar</p>
> * convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>
>
> To do this, I use lxml and pyquery.
>
> Question:
>
> * Are there any XML helper tools in Python for this kind of processing? I've
> looked on PyPI and found nothing.
>
> If you confirm that such tools don't exist, I may publish a helper
> package to do this "clean" processing.
>
> Thanks for your help,
> Stephane



Take a look at htmllib and HTMLParser (two different modules) in the
Python standard library.

In Python 3.x there is one called html.parser.

You can use these to pick specific tags out of HTML documents. If you
want something more advanced, consider using an XML parser.
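
For what it's worth, here is a minimal sketch of how html.parser could handle the two conversions from the original post. It is only an illustration, not a finished cleaner: the names TagStripper, clean(), UNWRAP, DROP_WITH_CONTENT and KEEP_ATTRS are made up for this example, and it assumes that every attribute outside KEEP_ATTRS can be thrown away.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical settings for this sketch; adjust to the export being cleaned.
UNWRAP = {"span", "font"}            # drop the tag, keep what is inside it
DROP_WITH_CONTENT = {"style"}        # drop the tag and everything inside it
KEEP_ATTRS = {"href", "src", "alt"}  # every other attribute is discarded


class TagStripper(HTMLParser):
    """Re-emit the HTML it is fed, minus unwanted tags, attributes and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_WITH_CONTENT element

    def _open_tag(self, tag, attrs):
        kept = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name in KEEP_ATTRS
        )
        return "<%s%s>" % (tag, kept)

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif not self.skip_depth and tag not in UNWRAP:
            self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are written once, without an end tag.
        if not self.skip_depth and tag not in UNWRAP | DROP_WITH_CONTENT:
            self.out.append(self._open_tag(tag, attrs))

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag not in UNWRAP:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    # handle_comment is not overridden, so comments are silently dropped.


def clean(markup):
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

With this sketch, clean("<h1><span><font>my title</font></span></h1>") gives "<h1>my title</h1>", and clean("<p>my text <span>foo</span> bar</p>") gives "<p>my text foo bar</p>". Because it works on the event stream and never builds a tree, it cannot repair badly nested markup.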





--
Harishankar (http://harishankar.org http://literaryforums.org)
 
John Nagle
      03-30-2010
Stéphane Klein wrote:
> Hi,
>
> I am working on an HTML cleaner.
>
> I export OpenOffice.org documents to HTML.
> Next, I would like to clean up the exported HTML files:
>
> * remove comments
> * remove styles
> * remove dispensable tags
> * ...


Try parsing with html5lib ("http://code.google.com/p/html5lib/"), which
is the closest thing to a good parser available for Python. It's basically
a reference implementation of the HTML5 parsing algorithm, including all the
handling of bad HTML.

Once you have a tree, write something to go through the tree and remove
empty tags from a list of tags which do nothing when empty. Then
regenerate HTML from the tree.
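
A rough sketch of that tree walk, under some assumptions: it uses the standard library's xml.etree.ElementTree on a well-formed snippet so that it runs without extra packages (as far as I know, html5lib can build the same kind of tree with treebuilder="etree"). The names UNWRAP, REMOVE_IF_EMPTY, clean_tree() and clean_html() are made up for the example. Besides removing empty tags it also unwraps span and font, which is what the original post asks for.

```python
import xml.etree.ElementTree as ET

# Hypothetical lists for this sketch.
UNWRAP = {"span", "font"}                                # replace by own content
REMOVE_IF_EMPTY = {"p", "b", "i", "u", "em", "strong"}   # drop when empty


def _add_text(parent, index, text):
    """Attach text just before child position `index` of parent."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def clean_tree(parent):
    """Clean all descendants of parent in place (parent itself is kept)."""
    i = 0
    while i < len(parent):
        child = parent[i]
        clean_tree(child)  # children first, so emptied wrappers are caught
        is_empty = len(child) == 0 and not (child.text or "").strip()
        if child.tag in UNWRAP:
            grandchildren = list(child)
            del parent[i]
            _add_text(parent, i, child.text)
            for offset, grandchild in enumerate(grandchildren):
                parent.insert(i + offset, grandchild)
            i += len(grandchildren)  # already cleaned, skip over them
            _add_text(parent, i, child.tail)
        elif child.tag in REMOVE_IF_EMPTY and is_empty:
            del parent[i]
            _add_text(parent, i, child.tail)  # keep the text that followed
        else:
            i += 1


def clean_html(markup):
    # ET.fromstring needs well-formed input and discards comments by default.
    root = ET.fromstring(markup)
    clean_tree(root)
    return ET.tostring(root, encoding="unicode", method="html")
```

For example, clean_html("<p>my text <span>foo</span> bar</p>") returns "<p>my text foo bar</p>". The fiddly part is the text bookkeeping: ElementTree stores the text that follows an element in that element's tail, so removing an element means moving its text and tail onto the parent or the previous sibling.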

Or just use HTML Tidy: "http://www.w3.org/People/Raggett/tidy/"

John Nagle
 