How to parse a XML doc with HTML tags within the texts

 
 
Francesco Moi
 
      02-20-2005
Hi.

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi<br>My name is Jerry</message>
</item>
</doc>
--------

When I try to get the 'message' value by using:
getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;

I get only:
Hi

Any suggestion to get the whole text? I'm using Xerces+Perl.
Thank you very much.
 
Martin Honnen
 
      02-20-2005


Francesco Moi wrote:


> I must parse this XML document:
> --------------
> <doc>
> <item>
> <name>Jerry</name>
> <message>Hi<br>My name is Jerry</message>
> </item>
> </doc>


That is not XML because it is not well-formed: the <br> needs a closing
</br> tag.

> When I try to get the 'message' value by using:
> getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;
>
> I get only:
> Hi


That is odd, if you really parse with an XML parser then you shouldn't
get to a DOM at all, parsing should throw an error.
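
A minimal sketch of that behaviour, using Python's standard-library xml.dom.minidom (an expat-based parser) as a stand-in for the Xerces+Perl setup in the question, so the error type and the helper name try_parse are assumptions of this example and not anything from the original post. It only shows that a conforming parser rejects the unclosed <br> before any DOM exists:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# The document from the question, with the unclosed <br>.
BROKEN = (
    "<doc><item><name>Jerry</name>"
    "<message>Hi<br>My name is Jerry</message>"
    "</item></doc>"
)


def try_parse(xml_text):
    """Return the parsed DOM document, or None if the text is not well-formed."""
    try:
        return minidom.parseString(xml_text)
    except ExpatError as err:
        # A conforming parser must stop here; no partial DOM is handed back.
        print("not well-formed:", err)
        return None


result = try_parse(BROKEN)
```

If a DOM does come back from input like this, the file being parsed is probably not the one shown, or the parser is running in some lenient HTML mode.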

--

Martin Honnen
http://JavaScript.FAQTs.com/
 
Andy Dingley
 
      02-20-2005
On 20 Feb 2005 06:32:22 -0800, (E-Mail Removed) (Francesco Moi)
wrote:

>I must parse this XML document:
>--------------
><doc>
><item>
><name>Jerry</name>
><message>Hi<br>My name is Jerry</message>
></item>
></doc>
>--------


That's not a well-formed XML document.

I assume that <message> is from your own schema, and that you want to
embed some HTML fragment within it. At this point I usually start
wondering if I can use RSS instead, and save myself a lot of effort.

Your failure here is that the HTML fragment isn't a well-formed XML
fragment. You have several choices:

- Use XHTML instead of HTML. This _might_ work, but you still need to
only include balanced and well-formed fragments. If it's generated
within your own system it might be workable, but it's not a general
solution to reading other people's content (which will always break
sometime).

- Write a parser that can handle tag soup. This is what you need to do
when reading other people's RSS feeds, because they're so often
mis-formed.

- Use HTML, but mangle it into well-formed XML (i.e. <br> becomes
<br />). This is ugly, worse than using XHTML and has nothing to
commend it.

- Embed the HTML into the XML, either by encoding it, or by using a
CDATA section.
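
The last option in the list above (encoding the HTML or wrapping it in a CDATA section) can be sketched as follows. This is an illustration in Python's standard library, not the Xerces+Perl code from the question; the helper names wrap_escaped, wrap_cdata and message_text are made up for the example. Either way the parser sees plain character data, and the HTML comes back out as an unparsed string:

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

HTML_FRAGMENT = "Hi<br>My name is Jerry"


def wrap_escaped(html):
    """Embed the HTML as escaped character data (&lt;br&gt; and so on)."""
    return "<message>" + escape(html) + "</message>"


def wrap_cdata(html):
    """Embed the HTML in a CDATA section.

    A literal "]]>" inside the HTML would end the section early,
    so it is split across two adjacent CDATA sections.
    """
    safe = html.replace("]]>", "]]]]><![CDATA[>")
    return "<message><![CDATA[" + safe + "]]></message>"


def message_text(xml_text):
    """Parse a <message> element and join its text and CDATA children."""
    message = minidom.parseString(xml_text).documentElement
    return "".join(child.data for child in message.childNodes)


from_escaped = message_text(wrap_escaped(HTML_FRAGMENT))
from_cdata = message_text(wrap_cdata(HTML_FRAGMENT))
```

Both calls should return the original fragment, <br> included. The consumer then has to treat that string as HTML itself; the XML parser no longer knows anything about its structure.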


Read the infamous RSS versions note
http://diveintomark.org/archives/200...compatible-rss
It gives some useful background on these issues.

As well as tag / element formation issues, watch out for HTML entity
references that aren't in core XML (like &eacute;) and for embedded
CDATA sections too.
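
A small sketch of the entity problem, again in Python's standard library as an assumed stand-in for the poster's parser (the helper name parses is hypothetical). Without a DTD that declares it, &eacute; is an undefined entity in XML, while the numeric character reference for the same character is always allowed:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parses(xml_text):
    """Return True if the text is accepted by the XML parser."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False


# HTML named entity: not predefined in XML, so this is rejected.
named = "<message>caf&eacute;</message>"

# Numeric character reference for the same character: accepted.
numeric = "<message>caf&#233;</message>"

named_ok = parses(named)
numeric_ok = parses(numeric)
```

Only &lt;, &gt;, &amp;, &apos; and &quot; are predefined in XML, so HTML content has to be converted to numeric references (or literal characters in the document encoding) before it is embedded.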

--
Smert' spamionam
 
 
Malte
 
      02-20-2005
Francesco Moi wrote:
> Hi.
>
> I must parse this XML document:
> --------------
> <doc>
> <item>
> <name>Jerry</name>
> <message>Hi<br>My name is Jerry</message>
> </item>
> </doc>
> --------
>
> When I try to get the 'message' value by using:
> getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;
>
> I get only:
> Hi
>
> Any suggestion to get the whole text? I'm using Xerces+Perl.
> Thank you very much.


We have an application that outputs this kind of rubbish (rubbish being
!xhtml).
I had to take out all the unbalanced tags before being able to parse the
results.
Much easier if you can enforce XHTML, IMHO.
 
 
francescomoi@europe.com
 
      02-20-2005
Sorry, it's a <br/> instead of <br>.
-----------------------
<doc>
<item>
<name>Jerry</name>
<message>Hi<br/>My name is Jerry</message>
</item>
</doc>
----------------------

 
 
William Park
 
      02-21-2005
(E-Mail Removed) wrote:
> Sorry, it's a <br/> instead of <br>.
> -----------------------
> <doc>
> <item>
> <name>Jerry</name>
> <message>Hi<br/>My name is Jerry</message>
> </item>
> </doc>
> ----------------------


sed 's,<br/>,,g'

--
William Park <(E-Mail Removed)>, Toronto, Canada
Slackware Linux -- because I can type.

 
 
Andy Dingley
 
      02-21-2005
On 20 Feb 2005 14:00:03 -0800, (E-Mail Removed) wrote:

>Sorry, it's a <br/> instead of <br>.


It's not a parsing problem either, it's a DOM problem.

"Hi" is the first child of <message>, that's what you asked for,
that's what you got.

item(0) & getFirstChild are effectively duplicates here. So instead
of getting the content of the first <message>, you're getting the
first member (one text node) of this content.

To get "the whole text" is a common requirement, but not particularly
meaningful in a pure XML sense. So it's not part of the standard DOM.
You can usually use a .text property, or else you'll have to iterate /
collect all the text nodes yourself and concatenate them.
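
The "iterate and concatenate" approach could look roughly like this. It is a sketch in Python's xml.dom.minidom, not the Xerces+Perl binding from the question, though the DOM calls (getElementsByTagName, item, firstChild, nodeValue, childNodes) correspond to the ones used there; collect_text is a helper name invented for the example:

```python
from xml.dom import minidom

# The corrected document from the thread, with <br/> instead of <br>.
DOC = """<doc>
<item>
<name>Jerry</name>
<message>Hi<br/>My name is Jerry</message>
</item>
</doc>"""


def collect_text(node):
    """Concatenate every text node below `node`, in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            # Descend into nested elements such as <b>...</b>.
            parts.append(collect_text(child))
    return "".join(parts)


message = minidom.parseString(DOC).getElementsByTagName("message").item(0)

first = message.firstChild.nodeValue  # only the first text node: "Hi"
whole = collect_text(message)         # all text nodes joined
```

Note that the empty <br/> element contributes no text, so the two pieces are joined with nothing between them. If the line break matters, the element branch would need to map br to a newline explicitly.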

--
Die Gotterspammerung - Junkmail of the Gods
 
 
Johannes Koch
 
      02-21-2005
Andy Dingley wrote:

> To get "the whole text" is a common requirement, but not particularly
> meaningful in a pure XML sense. So it's not part of the standard DOM.


In DOM3 Core (W3C Recommendation since 07 April 2004) there is
textContent
<http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Node3-textContent>.
But I don't know about its implementation in XML parsers.

--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)
 
 
Martin Honnen
 
      02-21-2005


Johannes Koch wrote:


> In DOM3 Core (W3C Recommendation since 07 April 2004) there is
> textContent
> <http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Node3-textContent>.
> But I don't know about its implementation in XML parsers.


The XML parser in Java 1.5 (alias Java 5) has support for that, and I
think it is based on Xerces Java from Apache.
Mozilla has no DOM Level 3 Core support in general but has textContent
support.
Not sure whether the Xerces C++ that the OP uses with Perl is also up to
DOM Level 3 Core.


--

Martin Honnen
http://JavaScript.FAQTs.com/
 