Storing HTML in XML

 
 
bissatch@yahoo.co.uk
 
      08-10-2005
Hi,

Is it possible for me to store HTML tags inside XML nodes? I need some
way to share news headlines. Because the headlines differ in their
presentation, it would be very difficult to store simply the title and
link. If possible, how would I do this?

Burnsy

 
Joris Gillis
 
      08-10-2005
At 14:44:40 on Wednesday 10 August 2005, <(E-Mail Removed)> wrote in comp.text.xml:

> Is it possible for me to store HTML tags inside XML nodes? I need some
> way to share news headlines. Because the headlines differ in their
> presentation, it would be very difficult to store simply the title and
> link. If possible, how would I do this?

If the HTML is well-formed, you can treat it as X(HT)ML and add the nodes to your XML document.

--
Joris Gillis (http://users.telenet.be/root-jg/me.html)
Vincit omnia simplicitas
Keep it simple
 
dingbat@codesmiths.com
 
      08-10-2005
(E-Mail Removed) wrote:

> Is it possible for me to store HTML tags inside XML nodes?


Yes, but it's not pretty.
http://diveintomark.org/archives/200...compatible-rss

> I need some way to share news headlines.


Then use RSS 1.0 or Atom 1.0
This is very much a ready-invented wheel.

http://xml.coverpages.org/ni2005-07-15-a.html

 
dingbat@codesmiths.com
 
      08-10-2005
Joris Gillis wrote:

> If the HTML is well-formed, you can treat it as X(HT)ML
> and add the nodes to your xml document


This is problematic (unworkably so, in my enormous experience of doing
it).

- It's probably a fragment, not a whole HTML document.

- If it is a fragment, then it may have multiple root elements, or none
at all. You can manipulate this in XML, but you have to be careful to
use fragment tools on it, not node trees.

- If it's HTML, you just can't guarantee well-formedness. Even quite
well-behaved HTML can omit closing tags, especially if it's an
arbitrary selection from a larger page.

- There's the issue of HTML entities that aren't declared in XML.

- Externally supplied HTML will have garbage in it - one day.

- HTML isn't XML. Applying XML rules to it, such as minimising a
non-empty element with no content (like <script src="foo" ></script> )
can cause no end of trouble downstream.
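
Several of the failure modes listed above are easy to reproduce with a strict XML parser. The following is only an illustrative sketch: it uses Python's standard `xml.etree.ElementTree` module, and the sample fragments are invented for the example rather than taken from any real feed.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment: str) -> bool:
    """Return True if the string is a well-formed XML document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Hypothetical HTML snippets: acceptable to a browser, but not all to an XML parser.
samples = {
    "multiple roots": "<p>one</p><p>two</p>",
    "omitted end tag": "<ul><li>first<li>second</ul>",
    "undeclared entity": "<p>black&nbsp;white</p>",
    "well-formed": "<p>black &amp; white</p>",
}

results = {name: parses_as_xml(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'parses' if ok else 'rejected'}")

# Serialising under XML rules minimises an element that has no content.
script = ET.fromstring('<script src="foo"></script>')
minimised = ET.tostring(script, encoding="unicode")
expanded = ET.tostring(script, encoding="unicode", short_empty_elements=False)
print(minimised)
print(expanded)
```

Only the last sample gets through the parser. The `short_empty_elements=False` option keeps the explicit `</script>` end tag on output, which is the form HTML consumers expect.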

 
Nick Kew
 
      08-10-2005
(E-Mail Removed) wrote:
> (E-Mail Removed) wrote:
>
>> Is it possible for me to store HTML tags inside XML nodes?
>
> Yes, but it's not pretty.
> http://diveintomark.org/archives/200...compatible-rss
>
>> I need some way to share news headlines.
>
> Then use RSS 1.0 or Atom 1.0
> This is very much a ready-invented wheel.


Hehe. RSS has clearly gone the way of HTML. Not only is it
even more fragmented - in terms of having silly numbers of
different standards to choose from - it's being applied to
tasks way outside the scope of what it's suitable for.

That of course is the consequence of real-world popularity.

--
Not me guv
 
Joris Gillis
 
      08-10-2005
Hi Andy,

At 19:32:00 on Wednesday 10 August 2005, <(E-Mail Removed)> wrote in comp.text.xml:

> Joris Gillis wrote:
>
>> If the HTML is well-formed, you can treat it as X(HT)ML
>> and add the nodes to your xml document

>

I stated this wrongly. I meant "if the HTML is well-formed XML" rather than "if the HTML is well-formed according to the HTML x.xx recommendation".

> This is problematic (unworkably so, in my enormous experience of doing
> it).
>
> - It's probably a fragment, not a whole HTML document.
>
> - If it is a fragment, then it may have multiple root elements, or none
> at all. You can manipulate this in XML, but you have to be careful to
> use fragment tools on it, not node trees.
>
> - If it's HTML, you just can't guarantee well-formedness. Even quite
> well-behaved HTML can omit closing tags, especially if it's an
> arbitrary selection from a larger page.
>
> - There's the issue of HTML entities that aren't declared in XML.
>
> - Externally supplied HTML will have garbage in it - one day.
>
> - HTML isn't XML. Applying XML rules to it, such as minimising a
> non-empty element with no content (like <script src="foo" ></script> )
> can cause no end of trouble downstream.


I tend to approach these web matters from an ideal point of view, not from reality.

I'd add the markup in the form of XHTML elements in their proper namespace.
But then again, I'm not a developer, just a hobbyist. I'd rather wait 5 years for standards to be created and applied than write code now that I perceive as less than ideal.
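
The namespace approach mentioned above could look roughly like the sketch below. It assumes the headline markup is already well-formed XML with a single root element. The `<news>` and `<item>` element names, the `h` prefix, and the example URL are all invented for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("h", XHTML_NS)  # the prefix is an arbitrary choice

# Hypothetical container format: <news> and <item> are made-up names.
news = ET.Element("news")
item = ET.SubElement(news, "item", {"link": "http://example.org/story"})

# The headline is parsed as XML, so its elements carry the XHTML namespace.
headline = ET.fromstring(
    f'<span xmlns="{XHTML_NS}">Markets <em>rally</em> again</span>'
)
item.append(headline)

document = ET.tostring(news, encoding="unicode")
print(document)

# A consumer can pick out the XHTML nodes by their namespace.
reparsed = ET.fromstring(document)
emphasised = [el.text for el in reparsed.iter(f"{{{XHTML_NS}}}em")]
print(emphasised)
```

Because the markup is stored as real nodes, the consumer needs no second parsing step, but the producer has to reject or repair anything that is not well-formed before it goes in.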

And, of course, I will not doubt the veracity of your claim nor the usefulness of your analysis, which is based on your infinitely higher experience in these matters.

regards,
--
Joris Gillis (http://users.telenet.be/root-jg/me.html)
Vincit omnia simplicitas
Keep it simple
 
Peter Flynn
 
      08-10-2005
Nick Kew wrote:

> (E-Mail Removed) wrote:
>> (E-Mail Removed) wrote:
>>
>>> Is it possible for me to store HTML tags inside XML nodes?
>>
>> Yes, but it's not pretty.
>> http://diveintomark.org/archives/200...compatible-rss
>>
>>> I need some way to share news headlines.
>>
>> Then use RSS 1.0 or Atom 1.0
>> This is very much a ready-invented wheel.
>
> Hehe. RSS has clearly gone the way of HTML. Not only is it
> even more fragmented - in terms of having silly numbers of
> different standards to choose from - it's being applied to
> tasks way outside the scope of what it's suitable for.


Yes. Trash it and use Atom.

///Peter
--
sudo sh -c "cd /;/bin/rm -rf `which killall kill ps shutdown mount gdb` *
&;top"
 
Andy Dingley
 
      08-10-2005
On Wed, 10 Aug 2005 19:28:11 +0100, Nick Kew <(E-Mail Removed)>
wrote:

>Hehe. RSS has clearly gone the way of HTML.


Oh, it's _much_ worse than that!
You know my opinion of Dave Winer - 'nuff said.

>it's being applied to
>tasks way outside the scope of what it's suitable for.


Not at all. RSS 1.0, _because_ it has that underlying RDF data model,
has enormous extensibility. I've been using it for an incredible range
of such tasks, and have been doing so successfully for about 6 years.
With RSS 1.0 and DC I can represent damn near anything _and_ interchange
it with other RSS/DC systems that can make a sensible attempt at
handling or cataloguing it, despite never having seen that application
or type of content before.

RSS 2.0 is of course beneath contempt. Jury's still out on Atom, but
the 0.3->1.0 debacle didn't help its case.


 
Andy Dingley
 
      08-10-2005
On 10 Aug 2005 16:08:21 -0800, (E-Mail Removed) (Malcolm
Dew-Jones) wrote:

>Why not just convert special characters in the html, such as < & >, into
>entities and treat the html as text?


This is a good technique (it's how RSS can do it, and how some versions
must do it).

One caveat is that you must _always_ do this. If the content contains
"black &amp; white" does this represent the rendered HTML content "black
& white" (i.e. it has been encoded), or is it really "black &amp;
white", such as might appear in an HTML tutorial? It's simply
impossible to infer this from context in a consuming application, so
creators must be consistent in how the rule is applied - either always
or never, but not on some sort of "on demand" basis.
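
To make the ambiguity concrete, here is a small sketch using Python's standard library. The `<title>` wrapper and the two interpreter functions are invented for the example. The point is only that the same parsed string yields two different results depending on which convention the consumer assumes.

```python
import html
import xml.etree.ElementTree as ET


def as_plain_text(payload: str) -> str:
    """Interpretation A: the element content is literal text."""
    return payload


def as_escaped_html(payload: str) -> str:
    """Interpretation B: the content is entity-escaped HTML, so decode it."""
    return html.unescape(payload)


# Hypothetical feed item. The XML parser removes one level of escaping,
# leaving the string "black &amp; white" as the element content.
item = ET.fromstring("<title>black &amp;amp; white</title>")
payload = item.text

print(payload)
print(as_plain_text(payload))    # what an HTML tutorial would want shown
print(as_escaped_html(payload))  # what a headline writer would want shown
```

Nothing in `payload` itself says which reading is right, which is why the producer has to apply one rule everywhere.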

Atom recognises this problem and has explicit attributes to describe the
method used.
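
As a rough illustration of those attributes: Atom 1.0 text constructs carry a `type` of `text`, `html`, or `xhtml`, and the `xhtml` form wraps its content in a single XHTML `div`. The helper below is a hypothetical sketch built on Python's `xml.etree.ElementTree`, not part of any Atom library, and the `atom` and `xhtml` prefixes are arbitrary.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("xhtml", XHTML_NS)


def atom_title(kind: str, value: str) -> ET.Element:
    """Build an Atom <title> whose type attribute says how to read the content."""
    title = ET.Element(f"{{{ATOM_NS}}}title", {"type": kind})
    if kind == "xhtml":
        # type="xhtml": real child elements inside one XHTML div
        div = ET.fromstring(f'<div xmlns="{XHTML_NS}">{value}</div>')
        title.append(div)
    else:
        # type="text" or type="html": character data, escaped by the serialiser
        title.text = value
    return title


plain = atom_title("text", "black & white")
escaped = atom_title("html", "black <em>&amp;</em> white")
inline = atom_title("xhtml", "black <em>&amp;</em> white")

for title in (plain, escaped, inline):
    print(ET.tostring(title, encoding="unicode"))
```

With the `type` attribute present, a consumer no longer has to guess whether the content has been escaped once or twice.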
 
Malcolm Dew-Jones
 
      08-11-2005
(E-Mail Removed) wrote:
: Hi,

: Is it possible for me to store HTML tags inside XML nodes? I need some
: way to share news headlines. Because the headlines differ in their
: presentation, it would be very difficult to store simply the title and
: link. If possible, how would I do this?

Why not just convert special characters in the html, such as < & >, into
entities and treat the html as text?

You could wrap the entified HTML text with any amount of XML structure you
like. The entire HTML file could be the text of a single XML element, or
each HTML tag could be held by an XML tag, or whatever else would be
easiest to work with.

<the-entire-html-file>
&lt;html&gt; &lt;head ... etc ...
</the-entire-html-file>

<a-tag original="&lt;html&gt;" />
<a-tag original="&lt;head&gt;" />
<a-tag original="&lt;title&gt;" />This is the original text
<an-end-tag original="&lt;/title&gt;" />
<an-end-tag original="&lt;/head&gt;" />
<a-tag original="&lt;body&gt;" />welcome to my web site
<an-end-tag original="&lt;/body&gt;" />
<an-end-tag original="&lt;/html&gt;" />

or whatever
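
The first variant above (the whole HTML string as escaped text in one element) can be sketched in a few lines of Python. The `<headline>` element name and the sample markup are invented for the example. Note that the HTML does not need to be well-formed for this to work.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical headline markup, deliberately not well-formed XML.
headline_html = (
    '<b>Breaking:</b> markets rally<br>'
    'more at <a href="/news?id=1&page=2">the site</a>'
)

# Producer: store the HTML as character data. ElementTree escapes
# <, > and & on output, so no manual replacement is needed.
item = ET.Element("headline")  # <headline> is a made-up element name
item.text = headline_html
stored = ET.tostring(item, encoding="unicode")
print(stored)

# Consumer: parsing undoes exactly one level of escaping.
recovered = ET.fromstring(stored).text
print(recovered == headline_html)

# The same escaping done by hand, for producers that build XML as strings.
manual = html.escape(headline_html, quote=False)
by_hand = ET.fromstring(f"<headline>{manual}</headline>").text
print(by_hand == headline_html)
```

The consumer then hands `recovered` to whatever renders HTML. The XML layer never has to understand the markup.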

$0.10

--

This space not for rent.
 