Problem with xml.dom parser and xmlns attribute

 
 
Peter Maas
      04-22-2004
Hi,

I have a problem parsing HTML text with xml.dom. The following code
runs well:

--------------------------------------------
from xml.dom.ext.reader import HtmlLib
from xml.dom.ext import PrettyPrint

r = HtmlLib.Reader()
doc = r.fromString(
'''
<html>
<head>
</head>
<body>
<p>hallo welt
</body>
</html>
''')
PrettyPrint(doc)
--------------------------------------------

But if I replace <html> with <html xmlns="http://www.w3.org/1999/xhtml">,
I get the following error:

Traceback (most recent call last):
  File "xhtml.py", line 5, in ?
    doc = r.fromString(
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 69, in fromString
    return self.fromStream(stream, ownerDoc, charset)
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 27, in fromStream
    self.parser.parse(stream)
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 57, in parse
    self._parser.parse(stream.read())
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 160, in finish_starttag
    unicode(value, self._charset))
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\Element.py", line 177, in setAttributeNS
    attr = self.ownerDocument.createAttributeNS(namespaceURI, qualifiedName)
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\Document.py", line 139, in createAttributeNS
    raise NamespaceErr()
xml.dom.NamespaceErr: Invalid or illegal namespace operation
>Exit code: 1


A lot of HTML documents on the Internet have this xmlns=... attribute.
Are they wrong, or is this a PyXML bug?

Kind regards,

Peter Maas

--
-------------------------------------------------------------------
Peter Maas, M+R Infosysteme, D-52070 Aachen, Hubert-Wienen-Str. 24
Tel +49-241-93878-0 Fax +49-241-93878-20 eMail (E-Mail Removed)
-------------------------------------------------------------------

 
Richard Brodie
      04-22-2004

"Peter Maas" <(E-Mail Removed)> wrote in message news:c682uu$sco$(E-Mail Removed)...

> but if I replace <html> by <html xmlns="http://www.w3.org/1999/xhtml">


> A lot of HTML documents on Internet have this xmlns=.... Are
> they wrong or is this a PyXML bug?


If they are genuine XHTML documents, they should be well-formed XML,
so you should be able to use an XML rather than an SGML parser.

from xml.dom.ext.reader import Sax2
r = Sax2.Reader()
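
A minimal sketch of the same idea using only the standard library's
xml.dom.minidom instead of PyXML's Sax2 reader (that substitution is an
assumption, chosen so the example runs without PyXML). The sample
document and the helper name parse_xhtml are made up for illustration.
Note that the <p> element has to be closed here: an XML parser rejects
the unclosed <p> from the original test document.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Well-formed XHTML: every element is closed, including <p>.
XHTML = '''<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>test</title>
</head>
<body>
<p>hallo welt</p>
</body>
</html>'''


def parse_xhtml(text):
    """Parse text as namespace-aware XML and return a DOM document."""
    return minidom.parseString(text)


doc = parse_xhtml(XHTML)
root = doc.documentElement
# The default namespace declared by xmlns= ends up on the element.
print(root.tagName, root.namespaceURI)

# Tag soup is rejected outright instead of being repaired.
try:
    parse_xhtml("<html><body><p>hallo welt</body></html>")
except ExpatError as exc:
    print("not well-formed:", exc)
```

The trade-off is strictness: this only works when the input really is
well-formed XML.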



 
Peter Maas
      04-22-2004
Richard Brodie wrote:
> "Peter Maas" <(E-Mail Removed)> wrote in message news:c682uu$sco$(E-Mail Removed)...

[...]
>>but if I replace <html> by <html xmlns="http://www.w3.org/1999/xhtml">

[...]
>>A lot of HTML documents on Internet have this xmlns=.... Are
>>they wrong or is this a PyXML bug?

>
>
> If they are genuine XHTML documents, they should be well-formed XML,
> so you should be able to use an XML rather than an SGML parser.
>
> from xml.dom.ext.reader import Sax2
> r = Sax2.Reader()


Thanks, Richard. But on the Internet, most of the time I don't know
what kind of document I'm dealing with when I start parsing. I guess
I should use HTMLParser (?).
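
A rough sketch of what the HTMLParser route could look like. It assumes
a current Python, where the module is html.parser (in Python 2 it was
named HTMLParser); the class TagCollector and the function
collect_start_tags are hypothetical names for this example. The parser
is event-based and does not build a DOM, so it treats xmlns as an
ordinary attribute and does not mind the unclosed <p>.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        self.start_tags.append((tag, dict(attrs)))


def collect_start_tags(text):
    parser = TagCollector()
    parser.feed(text)
    parser.close()
    return parser.start_tags


SAMPLE = '''<html xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
<p>hallo welt
</body>
</html>'''

for tag, attrs in collect_start_tags(SAMPLE):
    print(tag, attrs)
```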

Kind regards,

Peter Maas

--
-------------------------------------------------------------------
Peter Maas, M+R Infosysteme, D-52070 Aachen, Hubert-Wienen-Str. 24
Tel +49-241-93878-0 Fax +49-241-93878-20 eMail (E-Mail Removed)
-------------------------------------------------------------------
 
Richard Brodie
      04-23-2004

"Peter Maas" <(E-Mail Removed)> wrote in message news:c68jai$g85$(E-Mail Removed)...

> Thanks, Richard. But in the Internet most of the time I don't know
> what kind of document I'm dealing with when I start parsing. I guess
> I should use HTMLParser (?).


If you're dealing with a wide range of web pages, chances are they
will have all manner of rubbish in them. I would probably feed the
stuff through Tidy (or uTidyLib) first, to convert to cleanish XHTML,
then use an XML parser.
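
The two-step pipeline could be sketched roughly as follows. The cleaner
is passed in as a plain function so that Tidy, uTidyLib, or anything
else can be plugged in. The function close_open_paragraph is a toy
stand-in made up for this example, not Tidy's behaviour, and
parse_via_cleaner is likewise a hypothetical name. The XML parser is
the standard library's xml.dom.minidom, which is an assumption as well.

```python
from xml.dom import minidom


def close_open_paragraph(text):
    # Toy stand-in for Tidy: it only repairs the one defect in SAMPLE
    # (a <p> that is never closed). A real cleaner does far more.
    return text.replace("\n</body>", "</p>\n</body>")


def parse_via_cleaner(text, cleaner):
    # Step 1: convert the tag soup to well-formed XHTML.
    xhtml = cleaner(text)
    # Step 2: hand the result to a strict XML parser.
    return minidom.parseString(xhtml)


SAMPLE = '''<html xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
<p>hallo welt
</body>
</html>'''

doc = parse_via_cleaner(SAMPLE, close_open_paragraph)
print(doc.getElementsByTagName("p")[0].firstChild.data)
```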


 
Uche Ogbuji
      05-10-2004
Peter Maas <(E-Mail Removed)> wrote in message news:<c682uu$sco$(E-Mail Removed)>...
> Hi,
>
> I have a problem parsing html text with xmldom. The following code
> runs well:
>
> --------------------------------------------
> from xml.dom.ext.reader import HtmlLib
> from xml.dom.ext import PrettyPrint
>
> r = HtmlLib.Reader()
> doc = r.fromString(
> '''
> <html>
> <head>
> </head>
> <body>
> <p>hallo welt
> </body>
> </html>
> ''')
> PrettyPrint(doc)
> --------------------------------------------
>
> but if I replace <html> by <html xmlns="http://www.w3.org/1999/xhtml">
> I get the error
>
> Traceback (most recent call last):
> File "xhtml.py", line 5, in ?
> doc = r.fromString(
> File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 69, in fromString
> return self.fromStream(stream, ownerDoc, charset)
> File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 27, in fromStream
> self.parser.parse(stream)
> File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 57, in parse
> self._parser.parse(stream.read())
> File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 160, in finish_starttag
> unicode(value, self._charset))
> File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\Element.py", line 177, in setAttributeNS
> attr = self.ownerDocument.createAttributeNS(namespaceURI, qualifiedName)
> File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\Document.py", line 139, in createAttributeNS
> raise NamespaceErr()
> xml.dom.NamespaceErr: Invalid or illegal namespace operation
> >Exit code: 1

>
> A lot of HTML documents on Internet have this xmlns=.... Are
> they wrong or is this a PyXML bug?


This looks like a 4DOM bug. What are you hoping to do once you've
parsed these documents? If we know, we can either suggest an
alternative tool to use or perhaps a workaround.
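
For background, a sketch of the DOM Level 2 rule that the traceback
appears to be hitting: createAttributeNS must raise a namespace error
when the qualified name is "xmlns" but the namespace URI is not
http://www.w3.org/2000/xmlns/. Whether the HTML reader really passes
an empty namespace for the xmlns attribute is a guess from the
traceback and is not confirmed in this thread. The function
check_attribute_name is hypothetical and is not 4DOM's code; it only
covers the bare "xmlns" case.

```python
from xml.dom import NamespaceErr, XMLNS_NAMESPACE


def check_attribute_name(namespace_uri, qualified_name):
    """Raise NamespaceErr if 'xmlns' is used outside the xmlns namespace."""
    if qualified_name == "xmlns" and namespace_uri != XMLNS_NAMESPACE:
        raise NamespaceErr("'xmlns' requires namespace " + XMLNS_NAMESPACE)


# An ordinary HTML attribute with no namespace passes the check.
check_attribute_name(None, "class")

# xmlns in its reserved namespace passes as well.
check_attribute_name(XMLNS_NAMESPACE, "xmlns")

# xmlns with no namespace is rejected.
try:
    check_attribute_name(None, "xmlns")
except NamespaceErr as exc:
    print("NamespaceErr:", exc)
```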

--Uche
 