convert xhtml back to html

 
 
Tim Arnold
      04-24-2008
hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
create CHM files. That application really hates xhtml, so I need to convert
self-ending tags (e.g. <br />) to plain html (e.g. <br>).

Seems simple enough, but I'm having some trouble with it. regexps trip up
because I also have to take into account 'img', 'meta', 'link' tags, not
just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
enough of a regexp pro to figure out that lookahead stuff.

I'm not sure where to start now; I looked at BeautifulSoup and
BeautifulStoneSoup, but I can't see how to modify the actual tag.

thanks,
--Tim Arnold


 
Gary Herron
      04-24-2008
Tim Arnold wrote:
> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
> create CHM files. That application really hates xhtml, so I need to convert
> self-ending tags (e.g. <br />) to plain html (e.g. <br>).
>
> Seems simple enough, but I'm having some trouble with it. regexps trip up
> because I also have to take into account 'img', 'meta', 'link' tags, not
> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
> that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
> enough of a regexp pro to figure out that lookahead stuff.
>
> I'm not sure where to start now; I looked at BeautifulSoup and
> BeautifulStoneSoup, but I can't see how to modify the actual tag.
>
> thanks,
> --Tim Arnold
>
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>

Whether or not you can find an application that does what you want, I
don't know, but at the very least I can say this much.

You should not be reading and parsing the text yourself! XHTML is valid
XML, and there are lots of ways to read and parse XML with Python.
(ElementTree is what I use, but other choices exist.) Once you use an
existing package to read your files into an internal tree structure
representation, it should be a relatively easy job to traverse the tree
to emit the tags and text you want.


Gary Herron
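
The tree-based approach described above can be sketched with the standard library alone. This is only a minimal sketch, not Gary's own code: it assumes a Python 3 xml.etree.ElementTree, whose tostring() accepts method="html", and the helper name xhtml_to_html is made up for illustration. The XHTML namespace is stripped first, because ElementTree's HTML serializer only recognises empty elements such as br and img by their bare names.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def xhtml_to_html(source):
    """Parse well-formed XHTML and serialize it with HTML rules (hypothetical helper)."""
    root = ET.fromstring(source)
    for elem in root.iter():
        # Namespaced tags look like "{uri}br"; reduce them to the bare "br".
        if isinstance(elem.tag, str) and elem.tag.startswith(XHTML_NS):
            elem.tag = elem.tag[len(XHTML_NS):]
    # method="html" writes <br> and <img ...> without a closing slash or end tag.
    return ET.tostring(root, encoding="unicode", method="html")

print(xhtml_to_html('<p>hello <img src="/img.png"/> spam <br/> bye </p>'))
# -> <p>hello <img src="/img.png"> spam <br> bye </p>
```

Note that the default parser drops comments, so this sketch is not suitable as-is for pages whose comments carry meaning.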

 
Arnaud Delobelle
      04-24-2008
"Tim Arnold" <(E-Mail Removed)> writes:

> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
> create CHM files. That application really hates xhtml, so I need to convert
> self-ending tags (e.g. <br />) to plain html (e.g. <br>).
>
> Seems simple enough, but I'm having some trouble with it. regexps trip up
> because I also have to take into account 'img', 'meta', 'link' tags, not
> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
> that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
> enough of a regexp pro to figure out that lookahead stuff.


Hi, I'm not sure if this is very helpful but the following works on
the very simple example below.

>>> import re
>>> xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'
>>> xtag = re.compile(r'<([^>]*?)/>')
>>> xtag.sub(r'<\1>', xhtml)

'<p>hello <img src="/img.png"> spam <br> bye </p>'


--
Arnaud
 
Walter Dörwald
      04-24-2008
Arnaud Delobelle wrote:
> "Tim Arnold" <(E-Mail Removed)> writes:
>
>> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
>> create CHM files. That application really hates xhtml, so I need to convert
>> self-ending tags (e.g. <br />) to plain html (e.g. <br>).
>>
>> Seems simple enough, but I'm having some trouble with it. regexps trip up
>> because I also have to take into account 'img', 'meta', 'link' tags, not
>> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
>> that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
>> enough of a regexp pro to figure out that lookahead stuff.

>
> Hi, I'm not sure if this is very helpful but the following works on
> the very simple example below.
>
>>>> import re
>>>> xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'
>>>> xtag = re.compile(r'<([^>]*?)/>')
>>>> xtag.sub(r'<\1>', xhtml)

> '<p>hello <img src="/img.png"> spam <br> bye </p>'


You might try XIST (http://www.livinglogic.de/Python/xist):

Code looks like this:

from ll.xist import parsers
from ll.xist.ns import html

xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'

doc = parsers.parsestring(xhtml)
print doc.bytes(xhtml=0)

This outputs:

<p>hello <img src="/img.png"> spam <br> bye </p>

(and a warning that the alt attribute is missing in the img)

Servus,
Walter

 
Tim Arnold
      04-24-2008
"Gary Herron" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> Tim Arnold wrote:
>> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop
>> to create CHM files. That application really hates xhtml, so I need to
>> convert self-ending tags (e.g. <br />) to plain html (e.g. <br>).
>>
>> Seems simple enough, but I'm having some trouble with it. regexps trip up
>> because I also have to take into account 'img', 'meta', 'link' tags, not
>> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to
>> do that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work.
>> I'm not enough of a regexp pro to figure out that lookahead stuff.
>>
>> I'm not sure where to start now; I looked at BeautifulSoup and
>> BeautifulStoneSoup, but I can't see how to modify the actual tag.
>>
>> thanks,
>> --Tim Arnold
>>
>>

> Whether or not you can find an application that does what you want, I
> don't know, but at the very least I can say this much.
>
> You should not be reading and parsing the text yourself! XHTML is valid
> XML, and there are lots of ways to read and parse XML with Python.
> (ElementTree is what I use, but other choices exist.) Once you use an
> existing package to read your files into an internal tree structure
> representation, it should be a relatively easy job to traverse the tree to
> emit the tags and text you want.
>
>
> Gary Herron
>

I agree and I'd really rather not parse it myself. However, ET will clean up
the file which in my case includes some comments required as metadata, so
that won't work. Oh, I could get ET to read it and write a new parser--I see
what you mean. I think I need to subclass so I could get ET to honor those
comments too.
That's one way to go, I was just hoping for something easier.
thanks,
--Tim
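
For the comment problem, more recent Python versions may make subclassing unnecessary. The following is a hedged sketch that assumes Python 3.8 or later, where xml.etree.ElementTree.TreeBuilder accepts insert_comments=True; the function name parse_keeping_comments and the sample markup are invented for illustration.

```python
import xml.etree.ElementTree as ET

def parse_keeping_comments(source):
    """Parse XML text while keeping comment nodes in the tree (hypothetical helper)."""
    # insert_comments=True (Python 3.8+) tells the builder to keep comments
    # instead of silently discarding them.
    builder = ET.TreeBuilder(insert_comments=True)
    parser = ET.XMLParser(target=builder)
    return ET.fromstring(source, parser=parser)

root = parse_keeping_comments('<div><!-- topic-id: 42 --><p>text<br /></p></div>')
print(ET.tostring(root, encoding="unicode", method="html"))
# -> <div><!-- topic-id: 42 --><p>text<br></p></div>
```

Comments that sit outside the root element may still be lost with this approach, so it is worth checking against a real page before relying on it.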


 
Tim Arnold
      04-24-2008
"Arnaud Delobelle" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> "Tim Arnold" <(E-Mail Removed)> writes:
>
>> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop
>> to
>> create CHM files. That application really hates xhtml, so I need to
>> convert
>> self-ending tags (e.g. <br />) to plain html (e.g. <br>).
>>
>> Seems simple enough, but I'm having some trouble with it. regexps trip up
>> because I also have to take into account 'img', 'meta', 'link' tags, not
>> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to
>> do
>> that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm
>> not
>> enough of a regexp pro to figure out that lookahead stuff.

>
> Hi, I'm not sure if this is very helpful but the following works on
> the very simple example below.
>
>>>> import re
>>>> xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'
>>>> xtag = re.compile(r'<([^>]*?)/>')
>>>> xtag.sub(r'<\1>', xhtml)

> '<p>hello <img src="/img.png"> spam <br> bye </p>'
>
>
> --
> Arnaud


Thanks for that. It is helpful--I guess I had a brain malfunction. Your
example will work for me I'm pretty sure, except in some cases where the IMG
alt text contains a gt sign. I'm not sure that's even possible, so maybe
this will do the job.
thanks,
--Tim
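
A literal ">" is in fact allowed inside an XML attribute value, so the concern about alt text is real. Below is a sketch of a stricter pattern, under the assumption that only a fixed list of empty elements needs converting; the names VOID_TAGS, SELF_CLOSING and unclose_void_tags are made up here. It skips over quoted attribute values as whole units, so a ">" or "/" inside quotes does not end the match. For comparison, the original attempt <img[^(/>)]+/> fails because [^(/>)] is a character class that excludes "(", "/", ">" and ")" individually, so the match stops at the first "/" in a path such as src="/img.png".

```python
import re

# Assumed list of empty elements; extend it to whatever the pages actually use.
VOID_TAGS = "br|hr|img|meta|link|input"

# Group 1: tag name. Group 2: attributes, where quoted values are consumed
# as a unit so that ">" or "/" inside them cannot terminate the tag early.
SELF_CLOSING = re.compile(
    r"""<(%s)\b((?:[^>"']|"[^"]*"|'[^']*')*?)\s*/>""" % VOID_TAGS
)

def unclose_void_tags(xhtml):
    """Rewrite <br /> style tags as <br> for the listed elements only."""
    return SELF_CLOSING.sub(r"<\1\2>", xhtml)

print(unclose_void_tags('<p>hello <img src="/img.png"/> spam <br/> bye </p>'))
# -> <p>hello <img src="/img.png"> spam <br> bye </p>
print(unclose_void_tags('<img alt="a > b" src="x.png" />'))
# -> <img alt="a > b" src="x.png">
```

This still treats the file as plain text, so a self-closing tag that appears inside a comment or CDATA section would be rewritten as well.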


 
bryan rasmussen
      04-24-2008
I'll second the recommendation to use xsl-t, set the output to html.


The code for an XSL-T to do it would be basically:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html" />
<xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
</xsl:stylesheet>

you would probably want to do other stuff than just copy it out but
that's another case.

Also, as I recall, the way to make XHTML br elements behave correctly
in CHM was to write <br /> rather than <br/>. At any rate, I've done
projects generating CHM and my output markup was well-formed XML on
all occasions.

Cheers,
Bryan Rasmussen

On Thu, Apr 24, 2008 at 5:34 PM, Tim Arnold <(E-Mail Removed)> wrote:
> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
> create CHM files. That application really hates xhtml, so I need to convert
> self-ending tags (e.g. <br />) to plain html (e.g. <br>).
>
> Seems simple enough, but I'm having some trouble with it. regexps trip up
> because I also have to take into account 'img', 'meta', 'link' tags, not
> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
> that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
> enough of a regexp pro to figure out that lookahead stuff.
>
> I'm not sure where to start now; I looked at BeautifulSoup and
> BeautifulStoneSoup, but I can't see how to modify the actual tag.
>
> thanks,
> --Tim Arnold
>
>

 
Stefan Behnel
      04-24-2008
Tim Arnold wrote:
> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
> create CHM files. That application really hates xhtml, so I need to convert
> self-ending tags (e.g. <br />) to plain html (e.g. <br>).


This should do the job in lxml 2.x:

from lxml import etree

tree = etree.parse("thefile.xhtml")
tree.write("thefile.html", method="html")

http://codespeak.net/lxml

Stefan
 
bryan rasmussen
      04-24-2008
wow, that's pretty nice there.

Just to know: what's the performance like on XML instances of 1 GB?

Cheers,
Bryan Rasmussen


On Thu, Apr 24, 2008 at 9:55 PM, Stefan Behnel <(E-Mail Removed)> wrote:
> Tim Arnold wrote:
> > hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
> > create CHM files. That application really hates xhtml, so I need to convert
> > self-ending tags (e.g. <br />) to plain html (e.g. <br>).

>
> This should do the job in lxml 2.x:
>
> from lxml import etree
>
> tree = etree.parse("thefile.xhtml")
> tree.write("thefile.html", method="html")
>
> http://codespeak.net/lxml
>
> Stefan
>
>

 
Stefan Behnel
      04-25-2008
bryan rasmussen top-posted:
> On Thu, Apr 24, 2008 at 9:55 PM, Stefan Behnel <(E-Mail Removed)> wrote:
>> from lxml import etree
>>
>> tree = etree.parse("thefile.xhtml")
>> tree.write("thefile.html", method="html")
>>
>> http://codespeak.net/lxml

>
> wow, that's pretty nice there.
>
> Just to know: what's the performance like on XML instances of 1 GB?


That's a pretty big file, although you didn't mention what kind of XML
language you want to handle and what you want to do with it.

lxml is pretty conservative in terms of memory:

http://blog.ianbicking.org/2008/03/3...r-performance/

But the exact numbers depend on your data. lxml holds the XML tree in memory,
which is a lot bigger than the serialised data. So, for example, if you have
2GB of RAM and want to parse a serialised 1GB XML file full of little
one-element integers into an in-memory tree, get prepared for lunch. With a
lot of long text string content instead, it might still fit.

However, lxml also has a couple of step-by-step and stream parsing APIs:

http://codespeak.net/lxml/parsing.ht...rser-interface
http://codespeak.net/lxml/parsing.ht...rser-interface
http://codespeak.net/lxml/parsing.ht...e-and-iterwalk

They might do what you want.

Stefan
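
The incremental style of parsing mentioned above can be illustrated with the standard library's xml.etree.ElementTree.iterparse; lxml provides an etree.iterparse that is used in much the same way. This is a minimal sketch rather than a benchmark: the function name count_elements and the tiny in-memory sample are assumptions for illustration, and a real 1 GB input would be opened from disk instead.

```python
import io
import xml.etree.ElementTree as ET

def count_elements(stream, tag):
    """Count elements named `tag` while parsing the stream incrementally."""
    count = 0
    # An "end" event fires once an element and all its children are parsed.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Release the element's text, attributes and children once handled.
        # Emptied elements stay attached to their parent, so memory use
        # still grows slowly with the number of elements.
        elem.clear()
    return count

sample = io.BytesIO(b"<root><item>1</item><item>2</item><item>3</item></root>")
print(count_elements(sample, "item"))
# -> 3
```

The point of the design is that each element is processed and discarded as soon as its end tag is seen, instead of building the full tree before doing any work.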
 