Hi Joe,
OK, I guess I'll go back to programmatically performing the conversion
with a utility. I haven't yet figured out for sure where the
namespaces are actually coming from. I'll have to look into it.
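
For reference, here is a minimal sketch of the kind of SAX-driven DOM builder under discussion. It is not the Xalan SAX2DOM class; the name SaxDomBuilder and its parse() helper are made up for illustration. It assumes a namespace-aware SAX parser that leaves the namespace-prefixes feature at its default (off), so xmlns declarations are not reported as attributes. It ignores comments, processing instructions and ignorable whitespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: builds a DOM tree from SAX events, keeping namespaces.
class SaxDomBuilder extends DefaultHandler {
    private final Document doc;
    private Node current; // node that receives the next child

    SaxDomBuilder() throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        doc = f.newDocumentBuilder().newDocument();
        current = doc;
    }

    Document getDocument() {
        return doc;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // SAX reports "no namespace" as an empty string; DOM expects null.
        String name = qName.isEmpty() ? localName : qName;
        Element e = doc.createElementNS(uri.isEmpty() ? null : uri, name);
        for (int i = 0; i < atts.getLength(); i++) {
            String attUri = atts.getURI(i);
            String attName = atts.getQName(i).isEmpty() ? atts.getLocalName(i) : atts.getQName(i);
            e.setAttributeNS(attUri.isEmpty() ? null : attUri, attName, atts.getValue(i));
        }
        current.appendChild(e);
        current = e;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        current = current.getParentNode();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        current.appendChild(doc.createTextNode(new String(ch, start, length)));
    }

    // Drives any XMLReader (for example an HTML-tolerant SAX parser) into a DOM.
    static Document parse(XMLReader reader, InputSource in) throws Exception {
        SaxDomBuilder builder = new SaxDomBuilder();
        reader.setContentHandler(builder);
        reader.parse(in);
        return builder.getDocument();
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><body><p>Hello</p></body></html>";
        Document d = parse(reader, new InputSource(new StringReader(xhtml)));
        System.out.println(d.getDocumentElement().getNamespaceURI());
    }
}
```

A builder like this only puts a namespace on an element if the SAX events carry one, so printing the root's namespace URI as above is one way to check whether the namespaces originate in the parser's output.
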
While I agree with you that stripping namespaces out would have
problematic consequences if I were parsing general-purpose XML (and if
I cared about the element type in which a certain bit of data was
found), in this particular case it really is safe to ignore them
because of the nature of what I'm doing. I'm parsing HTML to scrape
out textual data. Namespaces aren't normally used in HTML--in fact,
not even in XHTML--to distinguish one element type from another. You
could conceivably use namespaces in XHTML, but there would be no
practical purpose in doing so. If you did so in a way that assigned an
element to a namespace other than http://www.w3.org/1999/xhtml, no user
agent would know what to do with it.
Even if namespaces were customarily used by web browsers to distinguish
between elements (such as might happen with inline SVG content), it
still might not make a difference to me because I don't actually care
what element type the data comes from. I'm really just using XPath and
XSLT as a more powerful alternative to fishing stuff out of the stream
using Perl scripting with regular expressions.
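
To make the XPath side of this concrete, here is a small sketch using the standard javax.xml.xpath API on a namespace-aware DOM. The class name XPathNamespaceDemo and the sample markup are invented for illustration. It shows that an unprefixed step such as //p selects nothing once the elements carry the XHTML namespace, while a local-name() test matches regardless of namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: compares unprefixed and local-name() XPath steps.
class XPathNamespaceDemo {
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Evaluates an XPath expression that yields a number, such as count(...).
    static double count(Document doc, String expr) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        return (Double) xp.evaluate(expr, doc, XPathConstants.NUMBER);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html xmlns='http://www.w3.org/1999/xhtml'>"
                + "<body><p>one</p><p>two</p></body></html>");
        // Unprefixed names in XPath 1.0 mean "no namespace", so this finds nothing.
        System.out.println(count(doc, "count(//p)"));
        // Matching on the local name ignores the namespace entirely.
        System.out.println(count(doc, "count(//*[local-name()='p'])"));
    }
}
```

The local-name() form works without touching the DOM, but it is exactly the sort of extra noise in every location step that I would rather not impose on people writing configuration files.
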
I'm generally pretty anal about this type of thing. Sloppiness and
ignorance in technical matters drive me crazy. It's one reason I hate
Microsoft. But in this case, it's more important to me that users of
my framework be able to write XPath expressions into the configuration
files without having to specify the same namespace prefix in all their
location steps. As long as I can write an XPath expression to identify
navigational elements and XSLT templates to scrape out the content, I'm
happy.
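
One way the stripping could be done, sketched under the assumption that a namespace-aware SAX parser sits upstream: a SAX filter that re-reports every element with an empty namespace URI and its bare local name, and swallows prefix mappings. NamespaceStripper and elementNames() are hypothetical names, not part of Xalan or JAXP. Attributes are passed through untouched here, so a namespaced attribute (xml:lang, say) would keep its namespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: drops element namespaces before events reach a DOM builder.
class NamespaceStripper extends XMLFilterImpl {
    NamespaceStripper(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Report the element as unqualified: empty URI, local name as qName.
        super.startElement("", localName, localName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement("", localName, localName);
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        // Swallowed so downstream handlers never see a namespace declaration.
    }

    @Override
    public void endPrefixMapping(String prefix) {
        // Swallowed to stay balanced with startPrefixMapping.
    }

    // Parses xml through the filter and records "{uri}qName" for each element.
    static List<String> elementNames(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // needed so localName is populated
        XMLReader filter = new NamespaceStripper(spf.newSAXParser().getXMLReader());
        List<String> seen = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                seen.add("{" + uri + "}" + qName);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames(
                "<h:html xmlns:h='http://www.w3.org/1999/xhtml'><h:body><h:p>Hi</h:p></h:body></h:html>"));
    }
}
```

With a filter like this between the parser and whatever builds the DOM, the elements arrive with no namespace, so configuration files can use plain steps such as //p. The trade-off is the one Joe describes: the resulting DOM no longer means the same thing as the input document.
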
Thanks for your help.
--Erik
Joseph Kesselman wrote:
> wrote:
> > Actually, I have tried SAX2DOM from the Xalan project. It works, but
> > this utility seems to want to add namespaces to my DOM, and can't turn
> > this feature off. Correct though the namespaces may be, they add
> > needless complexity to the required XPath expressions and XSLT files
> > that are used to configure the framework to scrape a site. I'm trying
> > to make my framework as easy to use as possible.
>
> SAX2DOM shouldn't be adding namespaces unless the namespaces are present
> in the SAX input -- in which case leaving them out is Absolutely
> Incorrect; you'd be changing the meaning of the document (since the
> namespaces are part of the document's semantics) and this bad practice
> *WILL* eventually turn around and bite your kneecaps off.
>
> Everything should be as simple as possible... but not simpler!
>
> > But if there is no easy
> > way of setting a system property to tell the standard JAXP DOM
> > implementation what SAX parser to use
>
> The JAXP DOM path may not be using a SAX parser under the covers -- for
> example, Xerces drives both SAX and DOM output off a lower-level
> representation -- so there really isn't a plug-in point that maps to
> what you're asking for. Using a separate SAX-driven DOM builder really
> is likely to be the most portable solution. It's a pretty simple piece
> of code, and since it's based entirely on the SAX and DOM specs it's
> highly portable.
>
> --
> Joe Kesselman / Beware the fury of a patient man. -- John Dryden