Converting HTML elements into XML/RSS

 
 
mickjames@gmail.com
 
      01-06-2005
Hi,

I'd like to include the whole web page content (as opposed to just the
headlines) into RSS/XML to enable people to read them via rss feed
readers.

Question: how to convert HTML elements such as href, img, b, p, etc
into XML?
I've seen someone use the following in their RSS feed but I don't like
it because <pre> doesn't produce a nice format:

<content:encoded><![CDATA[
<PRE>
blah blah blah..

Here is a sample HTML code. What would be the best way to put it into
XML, more specifically, convert those HTML elements.

----------------
<b>CAESAR</b> Et tu, Brute! Then fall,
<a href=http://www.epilepsiemuseum.de/raum6/caesar.jpg>Caesar</a>.<br>
Dies
<p>
<b>CINNA</b> Liberty! Freedom! Tyranny is dead!
Run hence, proclaim, cry it about the streets.
<a href=http://www.shakespeare-online.com/>Read more</a>.
-----------------

Thanks for all the help!

Mick James

 
Andy Dingley
 
      01-06-2005
On 6 Jan 2005 13:43:19 -0800, (E-Mail Removed) wrote:

>I'd like to include the whole web page content (as opposed to just the
>headlines) into RSS/XML to enable people to read them via rss feed
>readers.


Read this
http://diveintomark.org/archives/200...compatible-rss

Ask again if anything is unclear.
 
mickjames@gmail.com
 
      01-06-2005
Thanks.

So all the HTML needs to be enclosed in <description> and tags need to
be escaped with &amp;lt; and &amp;gt;?

 
Nick Kew
 
      01-07-2005
In article <(E-Mail Removed)>, (E-Mail Removed) writes:

> I'd like to include the whole web page content (as opposed to just the
> headlines) into RSS/XML to enable people to read them via rss feed
> readers.


Uh, that's a lot of content for what users are expecting to be a summary.
Why use a feed if it doesn't save your users anything?

> Question: how to convert HTML elements such as href, img, b, p, etc
> into XML?


Bearing in mind the above, freely mix it, just using namespaces to
distinguish the elements. Since you're already breaking the purpose
of a feed, working normally with conventional client software presumably
isn't an issue.

> Here is a sample HTML code. What would be the best way to put it into


Looks more like tag-soup to me.

--
Nick Kew
 
mickjames@gmail.com
 
      01-07-2005
Thanks for your reply. Yes, I understand that RSS is meant for summary,
not the whole content, but a lot of readers ask for the whole thing.
One imagines, they prefer to read using an rss feed reader instead of
using a web browser.

One question I didn't get the answer to in all my searching is: how to
code HTML tags such as href, img, p, b, etc when converting an HTML
page to .rss page?

Putting everything in CDATA or is there a better way?
A short example would be helpful.

Thanks a lot!
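
For the CDATA route asked about here, a minimal sketch in Python might look like the following. It assumes the HTML is already held in a string, and the function name cdata_section is made up for illustration. Inside a CDATA section the XML parser does not interpret tags or entity references, so the only sequence that needs special handling is the section terminator "]]>".

```python
import xml.etree.ElementTree as ET


def cdata_section(html_text):
    """Wrap text in a CDATA section, splitting any "]]>" it contains.

    The terminator is split across two adjacent sections so that the
    parser never sees it inside one section.
    """
    return "<![CDATA[" + html_text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Illustrative input: tag soup, an HTML-only entity and a stray "]]>".
sample = "<a href=foo.html>Bar</a> &nbsp; x ]]> y"
document = "<description>" + cdata_section(sample) + "</description>"

# Parsing the result gives back exactly the original string.
recovered = ET.fromstring(document).text
print(recovered == sample)
```

The alternative is to escape the markup characters instead of wrapping them. Both forms should give a feed reader the same string once the XML has been parsed.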

 
Andy Dingley
 
      01-07-2005
On 6 Jan 2005 15:15:54 -0800, (E-Mail Removed) wrote:

>So all the HTML needs to be enclosed in <description> and tags need to
>be escaped with &amp;lt; and &amp;gt;?


Yes. Ampersands might also cause problems and should already have been
escaped, but it's common in HTML that they aren't.

You should also "fix" any entity references that are in the HTML,
such as &eacute; or &nbsp;. This needs to be done whether there are
tags involved or not: they're one of the most common intermittent
reasons for an RSS feed to become invalid. Such entities are defined
in HTML, but aren't predefined in XML or RSS.

"Fixing" them can be either replacing the initial ampersand with &amp;
or replacing the "named" form of the entity reference with the
corresponding numeric form. The numeric form is probably best to use,
because that will render correctly even if the consumer doesn't
properly expand the encoded entities.
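
As a rough sketch of the steps described above, the following Python function escapes an HTML fragment for use inside <description> and rewrites HTML-only named entity references in numeric form. The function name encode_for_description is hypothetical. One assumption goes beyond the post: references to the markup characters themselves (&lt;, &amp;, &quot; and so on) get their ampersand escaped instead, so they cannot turn back into live markup after the XML layer is decoded.

```python
import re
from html.entities import name2codepoint
from xml.sax.saxutils import escape

# Matches named references such as &eacute; or &nbsp; (numeric ones are not matched).
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Code points that are significant in markup; references to these stay escaped.
MARKUP_CHARS = {ord(c) for c in "&<>\"'"}


def encode_for_description(html_text):
    """Escape an HTML fragment once so it can be the text of <description>."""
    parts = []
    pos = 0
    for match in NAMED_REF.finditer(html_text):
        # Everything between references: escape &, < and >.
        parts.append(escape(html_text[pos:match.start()]))
        codepoint = name2codepoint.get(match.group(1))
        if codepoint is None or codepoint in MARKUP_CHARS:
            # Unknown name or a markup character: escape the ampersand.
            parts.append(escape(match.group(0)))
        else:
            # HTML-only entity: use the numeric form, which XML understands.
            parts.append("&#%d;" % codepoint)
        pos = match.end()
    parts.append(escape(html_text[pos:]))
    return "".join(parts)


print(encode_for_description("<b>Caf&eacute;</b> &nbsp;R&D"))
```

With this approach the XML parser itself resolves &#233; to the character, so the text should come out right even in a consumer that treats the description as plain text.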

--
Smert' spamionam
 
Andy Dingley
 
      01-07-2005
On Fri, 7 Jan 2005 01:25:36 +0000, (E-Mail Removed) (Nick Kew)
wrote:

>Why use a feed if it doesn't save your users anything?


Why do you assume the function of my RSS feed? I've built many
feeds that are anything but "newsfeeds". I think my record was 20MB
content size in a <description> element, for a very
application-specific intranet task. However, it's still perfectly
compliant RSS 1.0.

>> Question: how to convert HTML elements such as href, img, b, p, etc
>> into XML?

>
>Bearing in mind the above, freely mix it, just using namespaces to
>distinguish the elements.


You can't use namespacing, because the content is HTML rather than
XHTML. Apart from the standards-based argument and the fact that
namespacing just doesn't make sense for HTML, it's also impractical to
expect the incoming HTML content to be well-formed as an XML fragment
(or even valid HTML!).
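
One way to see the well-formedness problem is to hand typical HTML to an XML parser. The sketch below is only an illustration: the helper name is_well_formed_fragment and the sample strings are made up, and wrapping the fragment in a dummy root element is an assumption made so that the parser has a single document element to work with.

```python
import xml.etree.ElementTree as ET


def is_well_formed_fragment(markup):
    """Return True if the markup parses as XML inside a dummy root element."""
    try:
        ET.fromstring("<wrapper>" + markup + "</wrapper>")
    except ET.ParseError:
        return False
    return True


# Everyday HTML that an XML parser rejects.
samples = [
    "<a href=foo.html>Bar</a>",      # unquoted attribute value
    '<img src="baz.jpg">',           # element never closed
    "caf&eacute;",                   # entity that XML does not define
    '<a href="foo.html">Bar</a>',    # the only well-formed one here
]
for sample in samples:
    print(is_well_formed_fragment(sample), sample)
```

Only the last sample passes, which is why mixing incoming HTML straight into the feed's XML is fragile unless it has been converted to XHTML first.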

Remember that RSS is a _feed_, not a one-off document (I wish Winer
would recognise this). Like all layered protocols you have to be very
careful that your implementations are not only correct for one
demonstration example, they have to be demonstrably correct for all
possible inputs.


> Since you're already breaking the purpose of a feed,


Rubbish. RSS does _NOT_ define any notion of "purpose", or what's
"appropriate" to use it for. Besides which, the notion of content
encoding HTML fragments within the <description> element is very well
established.


--
Smert' spamionam
 
Nick Kew
 
      01-07-2005
In article <(E-Mail Removed)>, (E-Mail Removed) writes:
> One imagines, they prefer to read using an rss feed reader instead of
> using a web browser.


Hmmm. I think it should be the job of the Client to present it
sensibly. An RSS feed is to the Web as a newsgroup or mail folder
listing (from, subject, date) is to Usenet or Email. IMHO.

(you've presumably seen how Opera presents RSS feeds?)

> One question I didn't get the answer to in all my searching is: how to
> code HTML tags such as href, img, p, b, etc when converting an HTML
> page to .rss page?


The core Site Valet tools offer options to present reports as RDF.
Since these are markup analysis tools, the more verbose options
embed the original markup, so all system messages can be properly
referenced to it. This uses a namespace to describe it, and
looks a little like XSLT with things like:
<ml:element name="a">
<ml:attribute name="href">foo</ml:attribute>

> Putting everything in CDATA or is there a better way?
> A short example would be helpful.


I don't think the above reply is really relevant to your question:
I was solving a different problem! But you already have Andy's reply.

--
Nick Kew
 
Colin
 
      01-07-2005
Hey,

>I'd like to include the whole web page content (as opposed to just the
>headlines) into RSS/XML to enable people to read them via rss feed
>readers.
>
>Question: how to convert HTML elements such as href, img, b, p, etc
>into XML?


Why don't you just use software to create the feed that will convert it for you,
so that you don't have to worry about it? There are a couple of options; I know
FeedForAll http://www.feedforall.com has a WYSIWYG editor that will do this.

Best,
Colin

 
mickjames@gmail.com
 
      01-07-2005
WYSIWYG is not an option. I need to do it via a script on Linux.

Would someone tell me how the following HTML snippet should be encoded
in an RSS file:

<b>This is a test.</a>
<a href=foo.html>Bar</a>.
<img src=baz.jpg>
<p>

I tried using &amp;lt; etc. but RSS readers simply display the
equivalent HTML, rather than rendering it.
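
A minimal sketch of one scripted approach, in Python, assuming the standard-library ElementTree module is acceptable on the Linux box. The title and link values are placeholders and the function name build_item is made up. The snippet is copied as posted, tag soup included. The idea is to assign the raw HTML as element text and let the XML serializer do the escaping, so it happens exactly once.

```python
import xml.etree.ElementTree as ET

# The snippet from the post, copied as written.
SNIPPET = "<b>This is a test.</a>\n<a href=foo.html>Bar</a>.\n<img src=baz.jpg>\n<p>"


def build_item(title, link, html_text):
    """Return one RSS <item> as a string, with the HTML escaped exactly once."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    # Assign the raw HTML; the serializer turns <, > and & into &lt;, &gt; and &amp;.
    ET.SubElement(item, "description").text = html_text
    return ET.tostring(item, encoding="unicode")


# Placeholder title and link, for illustration only.
item_xml = build_item("Test entry", "http://example.com/test.html", SNIPPET)
print(item_xml)
```

In the resulting feed source the tags should appear as &lt;b&gt; and so on. If the source instead contains &amp;lt;b&amp;gt;, the markup has probably been escaped twice, which would make a reader show the tags as text. Whether escaped HTML is rendered at all is up to the individual reader, and relative URLs such as foo.html may not resolve outside the original page, so absolute URLs are likely the safer choice.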

 