Velocity Reviews - Computer Hardware Reviews

Parsing XML RSS feed byte stream for <item> tag

 
 
darrel.rendell@gmail.com
 
      02-07-2013


I'm attempting to parse an RSS feed for the first instance of an <item> element.

import urllib2

def pageReader(url):
    try:
        readPage = urllib2.urlopen(url)
    except urllib2.URLError, e:
        # print 'We failed to reach a server.'
        # print 'Reason: ', e.reason
        return 404
    except urllib2.HTTPError, e:
        # print('The server couldn\'t fulfill the request.')
        # print('Error code: ', e.code)
        return 404
    else:
        outputPage = readPage.read()
        return outputPage

Assume the arguments being passed are correct. The function returns a str object whose value is simply an entire RSS feed. I've confirmed the type with:

a = isinstance(value, str)
if not a:
    return -1
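
For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the fetch step above might look roughly like the sketch below. The name page_reader and the choice to return None on failure (instead of the 404 sentinel) are assumptions made for illustration, not part of the original post. One detail worth noting: HTTPError is a subclass of URLError, so it has to be caught first, otherwise its handler is never reached.

```python
import urllib.error
import urllib.request


def page_reader(url):
    """Return the raw response body as bytes, or None if the request fails.

    Hypothetical Python 3 counterpart of pageReader() from the post.
    """
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.HTTPError:
        # The server answered, but with an error status.
        # HTTPError subclasses URLError, so this clause must come first.
        return None
    except urllib.error.URLError:
        # The server (or resource) could not be reached at all.
        return None
```

Returning None keeps failures distinguishable from a feed body by a simple `is None` check. Note that on Python 3 the body is bytes rather than str.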

So, an entire RSS feed has been returned from the function call. It's at this point that I hit a brick wall: I've tried parsing the feed with BeautifulSoup, lxml and various other libs, but with no success (I had some success with BeautifulSoup, but it wasn't able to pull certain child elements from the parent). I'm just about ready to resort to writing my own parser, but I'd like to know if anybody has any suggestions.

To recreate my error, simply call the above function with an argument similar to:

http://www.cert.org/nav/cert_announcements.rss

You'll see I'm trying to return the first child.

<item>
<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title>
<link>http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link>
<description>This sixteenth of 19 blog posts about the fourth edition of the Common Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.</description>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>

As I've said, BeautifulSoup fails to find both pubDate and Link, which are crucial to my app.

Any advice would be greatly appreciated.
 
 
John Gordon
 
      02-07-2013
In <(E-Mail Removed)>, (E-Mail Removed) writes:

> def pageReader(url):
>     try:
>         readPage = urllib2.urlopen(url)
>     except urllib2.URLError, e:
>         # print 'We failed to reach a server.'
>         # print 'Reason: ', e.reason
>         return 404
>     except urllib2.HTTPError, e:
>         # print('The server couldn\'t fulfill the request.')
>         # print('Error code: ', e.code)
>         return 404
>     else:
>         outputPage = readPage.read()
>         return outputPage


> To recreate my error, simply call the above function with an argument
> similar to:


> http://www.cert.org/nav/cert_announcements.rss


> You'll see I'm trying to return the first child.


The above code produces no output at all. The pageReader() function is
defined but never called.

If we add a few lines at the bottom:

if __name__ == '__main__':
    print pageReader('http://www.cert.org/nav/cert_announcements.rss')

Then we get some output:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">

<channel>
<title>CERT Announcements</title>
<link>http://www.cert.org/nav/whatsnew.html</link>
<language>en-us</language>
<description>Announcements: What's New on the CERT web site</description>

<item>
<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title>
<link>http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link>
<description>This sixteenth of 19 blog posts about the fourth edition of the Common Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.</description>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>

....

> As I've said, BeautifulSoup fails to find both pubDate and Link, which are
> crucial to my app.


> Any advice would be greatly appreciated.


You haven't included the BeautifulSoup code which attempts to parse the XML,
so it's impossible to say exactly what the error is.

However, I have a guess: you said you're trying to return the first
child. Based on the above output, the first child is the <channel>
element, not an <item> element. Perhaps that's the issue?

--
John Gordon A is for Amy, who fell down the stairs
(E-Mail Removed) B is for Basil, assaulted by bears
-- Edward Gorey, "The Gashlycrumb Tinies"
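
The guess above can be checked with the standard library's xml.etree.ElementTree, which the thread does not mention; the following is a minimal Python 3 sketch rather than the poster's code. FEED is a trimmed copy of the output shown above (the <language> and <description> elements are left out), and first_item is a hypothetical helper name. It shows that the first child of the root is <channel>, and that the first <item> can be reached by path instead of by position.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed output quoted in the thread.
FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>CERT Announcements</title>
<link>http://www.cert.org/nav/whatsnew.html</link>
<item>
<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title>
<link>http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>
</channel>
</rss>
"""


def first_item(feed):
    """Return (title, link, pubDate) of the first <item>, or None if absent."""
    root = ET.fromstring(feed)        # root is the <rss> element
    item = root.find("channel/item")  # first <item> directly under <channel>
    if item is None:
        return None
    return tuple(item.findtext(tag) for tag in ("title", "link", "pubDate"))


if __name__ == "__main__":
    print(ET.fromstring(FEED)[0].tag)  # tag of the root's first child
    print(first_item(FEED))
```

Tag names are matched case-sensitively here, so "pubDate" has to be spelled exactly as it appears in the feed.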

 
 
xDog Walker
 
      02-08-2013
On Thursday 2013 February 07 12:36, (E-Mail Removed) wrote:
> As I've said, BeautifulSoup fails to find both pubDate and Link, which are
> crucial to my app
> Any advice would be greatly appreciated.


http://packages.python.org/feedparser

--
Yonder nor sorghum stenches shut ladle gulls stopper torque wet
strainers.

 