newbie read.scan (?) question

 
 
Bruce D'Arcus
      06-06-2005
Hi,

I'm trying to get my feet wet with Ruby by tackling a manageable, but
real, issue I'd like to solve.

I'm an academic, and subscribe to some RSS feeds of journals I read.
However, the feeds are really bad, and only contain lists of authors
and titles (with no markup), and links to the issue urls.

So, I want a script that takes those feeds, goes to the issue pages,
grabs the links for the articles, and then from there extracts author
and title information.

For some reason I don't understand, the fragment below works, except
that the author attribute is always blank. The problem is not with my
regular expression pattern.

Could someone explain what I'm doing wrong?

Bruce

# journals is an array of rss feed urls and titles
journals.each do |journal|
  open(journal[1]) do |http|
    response = http.read
    result = RSS::Parser.parse(response, false)

    # grab first issue url listed from each journal
    issue_url = result.items[0].link

    # regular expression patterns to use below
    article_page = /<a href="(.*?)">Article Description<\/a>/
    title_match = /<span class="article-title">(.*?)<\/span>/
    author_match = /<strong>Author:<\/strong><\/td><td class="rightcol">(.*?)</

    articles = open(issue_url)
    # find each article url by screen-scraping
    articles.read.scan(article_page).each do |url|
      article_url = "#{base_url}#{url}"
      open(article_url) do |article|
        # screen-scrape for article author and title
        title = article.read.scan(title_match)
        # for whatever reason, author never returns anything
        author = article.read.scan(author_match)
        # create new article object
        list.append(Article.new(title, author, article_url))
      end
    end
  end
end

 
Pit Capitain
      06-06-2005
Bruce D'Arcus wrote:
> For some reason I don't understand, the below fragment all works,
> except for the author attribute is always blank. The problem is not
> with my regular expression pattern.
>
> Could someone explain what I'm doing wrong?


Hi Bruce,

I don't know which libraries you're using, but could it be that you can
only read once from article, like reading from a file?

Instead of

> open(article_url) do |article|
> # screen-scrap for article author and title
> title = article.read.scan(title_match)
> # for whatever reason, author never returns anything
> author = article.read.scan(author_match)


try something like

open(article_url) do |article|
  # screen-scrape for article author and title
  article_text = article.read   # read the stream once, then reuse the string
  title = article_text.scan(title_match)
  author = article_text.scan(author_match)
  # ... rest of the block unchanged

HTH

Regards,
Pit


 
Dominik Bathon
      06-06-2005
article is a stream, and you are trying to read it twice; that doesn't
work the way you expect. I guess the second article.read just returns "",
so "".scan(...) returns an empty array.
Try the following:

articles.read.scan(article_page).each do |url|
  article_url = "#{base_url}#{url}"
  open(article_url) do |article|
    articletxt = article.read   # changed: read the stream once
    # screen-scrape for article author and title
    title = articletxt.scan(title_match)    # changed
    author = articletxt.scan(author_match)  # changed
    # create new article object
    list.append(Article.new(title, author, article_url))
  end
end
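
The read-once behaviour described above can be reproduced without any network access. This is only a sketch: StringIO stands in for the object that open(article_url) yields, and the HTML snippet is made up for illustration.

```ruby
require "stringio"

# Made-up page content; the real article pages are not shown in the thread.
html = '<span class="article-title">A Title</span>' \
       '<strong>Author:</strong></td><td class="rightcol">Doe, Jane</td>'

title_match  = /<span class="article-title">(.*?)<\/span>/
author_match = /<strong>Author:<\/strong><\/td><td class="rightcol">(.*?)</

# StringIO is an assumption: a stand-in for the stream yielded by open(...).
article = StringIO.new(html)

first  = article.read   # consumes the whole stream
second = article.read   # the stream is already at EOF, so this is ""

p second                      # ""
p second.scan(author_match)   # []

# Reading once and reusing the string finds both values.
article.rewind
text   = article.read
title  = text.scan(title_match)
author = text.scan(author_match)
p title    # [["A Title"]]
p author   # [["Doe, Jane"]]
```

The last two lines also show that scan wraps each capture in a nested array, which matters once the values are stored or dumped.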



Dominik


 
Bruce D'Arcus
      06-06-2005
Yes, that solved the problem. I had a feeling it was something pretty
simple.

Thanks!

Bruce

 
Bruce D'Arcus
      06-06-2005
One followup.

Why if I dump my list of article objects to YAML, do I end up with
this:

- !ruby/object:Article
  author:
  -
    - "Hovorka, Alice J."
  title:
  -
    - "The (Re) Production of Gendered Positionality in Botswana's Commercial Urban Agriculture Sector"
  url: http://journals.ohiolink.edu/cgi-bin...urnal=00045608

I'm referring to the fact that author and title content aren't
represented the same way as url (which is what I was expecting).

I have these two classes:

class Article

  include Journals

  attr_reader :title, :author, :description, :url
  def initialize(title, author, url)
    @title = title
    @author = author
    @url = url
  end

  def to_s
    "#@title, #@author"
  end

  def abstract
    #
  end

  def refer
    Journals::const_get(:BASE_URL) + "/" +
      @url + "&form=refer&file=file.txt"
  end

  def pdf
    Journals::const_get(:BASE_URL) + "/" +
      @url + "&form=pdf&file=file.pdf"
  end
end

class Articles
  #
  attr_reader :articles

  def initialize
    @articles = Array.new
  end

  def append(article)
    @articles.push(article)
    self
  end

  def [](index)
    @articles[index]
  end
end

.... and then:

list = Articles.new

... and at the end:

File.open("articles.yaml", "w") {|f| YAML.dump(list.articles, f)}

Or is everything fine?

Bruce

 
Ghislain Mary
      06-06-2005
Hi,

Bruce D'Arcus wrote:
> Why if I dump my list of article objects to YAML, do I end up with
> this:
>
> - !ruby/object:Article
> author:
> -
> - "Hovorka, Alice J."
> title:
> -
> - "The (Re) Production of Gendered Positionality in Botswana's
> Commercial Urban
> Agriculture Sector"
> url:
> http://journals.ohiolink.edu/cgi-bin...urnal=00045608
>
> I'm referring to the fact that article and title content aren't
> represented the same as url (which is what I was expecting).


Because your author and title probably aren't strings, as you expect them
to be, but arrays. Try adding puts @title.inspect somewhere to see what
each one actually holds.
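
A small sketch of why the values come out nested, and one way to get a plain string instead. The HTML snippet is made up for illustration; String#scan and String#[] are standard Ruby.

```ruby
# Made-up snippet; the real page content is not shown in the thread.
html = '<span class="article-title">A Title</span>'
title_match = /<span class="article-title">(.*?)<\/span>/

# With a capture group, scan returns one inner array per match,
# holding one element per group.
nested = html.scan(title_match)
p nested   # [["A Title"]]

# String#[] with a regexp and a group number returns the captured
# String itself (or nil when there is no match).
title = html[title_match, 1]
p title    # "A Title"
```

Passing the plain string to Article.new should make title and author serialize the same way url does.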

> I have these two classes:
>
> class Article
>
> include Journals
>
> attr_reader :title, :author, :description, :url
> def initialize(title, author, url)
> @title = title
> @author = author
> @url = url
> end
>
> def to_s
> "#@title, #@author"
> end
>
> def abstract
> #
> end
>
> def refer
> Journals::const_get(:BASE_URL) + "/" +
> @url + "&form=refer&file=file.txt"
> end
>
> def pdf
> Journals::const_get(:BASE_URL) + "/" +
> @url + "&form=pdf&file=file.pdf"
> end
> end
>
> class Articles
> #
> attr_reader :articles
>
> def initialize
> @articles = Array.new
> end
>
> def append(article)
> @articles.push(article)
> self
> end
>
> def [](index)
> @articles[index]
> end
> end


Why create both an Article class and an Articles class? You could move
everything in your Articles class into the Article class, but at the class
level instead of the instance level. You would just change your @articles
variable to @@articles and define your append and [] methods as
self.append and self.[].

Another thing: I don't think you need to use
Journals::const_get(:BASE_URL). You could simply use Journals::BASE_URL.
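
A quick check that the two spellings give the same constant. The module body here is an assumption: the real Journals module and its BASE_URL value are not shown in the thread.

```ruby
module Journals
  # Placeholder value, for illustration only.
  BASE_URL = "http://example.org"
end

direct  = Journals::BASE_URL             # constant referenced by name
dynamic = Journals.const_get(:BASE_URL)  # constant looked up from a symbol

p direct.equal?(dynamic)   # true: both refer to the same String object
```

const_get is mainly useful when the constant name is only known at run time; for a fixed name the direct form is shorter.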

HTH

Ghislain


 
Bruce D'Arcus
      06-06-2005


Ghislain Mary wrote:

> Because your author and title probably aren't strings as you expect them
> to be but rather arrays.


Ah, right. Using scan returns an array. On this ...

> > I have these two classes:
> >
> > class Article
> >
> > include Journals
> >
> > attr_reader :title, :author, :description, :url
> > def initialize(title, author, url)
> > @title = title
> > @author = author
> > @url = url
> > end
> >
> > def to_s
> > "#@title, #@author"
> > end
> >
> > def abstract
> > #
> > end
> >
> > def refer
> > Journals::const_get(:BASE_URL) + "/" +
> > @url + "&form=refer&file=file.txt"
> > end
> >
> > def pdf
> > Journals::const_get(:BASE_URL) + "/" +
> > @url + "&form=pdf&file=file.pdf"
> > end
> > end
> >
> > class Articles
> > #
> > attr_reader :articles
> >
> > def initialize
> > @articles = Array.new
> > end
> >
> > def append(article)
> > @articles.push(article)
> > self
> > end
> >
> > def [](index)
> > @articles[index]
> > end
> > end

>
> Why create an Article class and an Articles class?


Because I'm a *real* newbie! My only programming background is with
XSLT. So I'm trying to also understand basic OO design in this
example.

> You could make all
> the content of your Articles class also content of the Article class but
> at the class level instead of the instance level. So you just have to
> transform your @articles variable into @@articles and define your append
> and [] methods as self.append and self.[].


Can you give me an abbreviated example of how to actually do this?
For example, how do I define @@articles under the Article class, and
how would I then define the append method there.

> An other thing: I don't think you need to use
> Journals::const_get(:BASE_URL). You could simply use Journals::BASE_URL.


Ah thanks. It took me awhile just to get that far!

Bruce

 
Ghislain Mary
      06-06-2005
Bruce D'Arcus wrote:
>>Why create an Article class and an Articles class?

>
>
> Because I'm a *real* newbie! My only programming background is with
> XSLT. So I'm trying to also understand basic OO design in this
> example.


So welcome to the Ruby community!
I still consider myself a newbie too, and I don't often reply to posts
on this list because I often think I can't contribute much to the
discussions. But I learn a lot by reading what happens here.

> Can you give me an abbreviated example of how to actually do this?
> For example, how do I define @@articles under the Article class, and
> how would I then define the append method there.


You could do something like:

class Article

  include Journals

  attr_reader :title, :author, :description, :url

  # Create the Array containing the articles.
  @@articles = Array.new

  def initialize(title, author, url)
    @title, @author, @url = title, author, url

    # Add the new Article to the articles array.
    @@articles << self
  end

  def to_s
    "#@title, #@author"
  end

  def refer
    Journals::BASE_URL + "/" + @url + "&form=refer&file=file.txt"
  end

  def pdf
    Journals::BASE_URL + "/" + @url + "&form=pdf&file=file.pdf"
  end

  # Add a class method to get an Article by its index in the @@articles Array.
  def self.[](index)
    @@articles[index]
  end

  # Add a method to get the number of articles.
  # Call it how you want it to be called.
  def self.count
    @@articles.size
  end

end
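
For illustration, here is how a trimmed-down version of that class might be used. The Journals mixin and the refer/pdf methods are left out so the sketch runs on its own, and the titles, authors, and urls are made up.

```ruby
# Trimmed-down sketch of the class-level registry idea.
class Article
  attr_reader :title, :author, :url

  @@articles = []   # one array shared by the whole class

  def initialize(title, author, url)
    @title, @author, @url = title, author, url
    @@articles << self   # every new Article registers itself
  end

  def self.[](index)
    @@articles[index]
  end

  def self.count
    @@articles.size
  end
end

# Made-up sample data.
Article.new("First title", "Doe, Jane", "a.html")
Article.new("Second title", "Roe, Richard", "b.html")

p Article.count      # 2
p Article[1].title   # "Second title"
```

Note that every Article ever created lands in the same array, so this design only supports a single, program-wide list.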

Good luck,

Ghislain


 
Ghislain Mary
      06-06-2005
Oh... I was forgetting.

You don't even need an append method anymore since when you create a new
Article it is automatically pushed into the @@articles Array.

Ghislain


 
Brian Schröder
      06-06-2005
On 06/06/05, Bruce D'Arcus wrote:
>
> Ghislain Mary wrote:
>
> > Because your author and title probably aren't strings as you expect them
> > to be but rather arrays.
>
> Ah, right. Using scan returns an array. On this ...
>
> > > I have these two classes:
> > >
> > > class Article
> > >
> > > include Journals
> > >
> > > attr_reader :title, :author, :description, :url
> > > def initialize(title, author, url)
> > > @title = title
> > > @author = author
> > > @url = url
> > > end
> > >
> > > def to_s
> > > "#@title, #@author"
> > > end
> > >
> > > def abstract
> > > #
> > > end
> > >
> > > def refer
> > > Journals::const_get(:BASE_URL) + "/" +
> > > @url + "&form=refer&file=file.txt"
> > > end
> > >
> > > def pdf
> > > Journals::const_get(:BASE_URL) + "/" +
> > > @url + "&form=pdf&file=file.pdf"
> > > end
> > > end
> > >
> > > class Articles
> > > #
> > > attr_reader :articles
> > >
> > > def initialize
> > > @articles = Array.new
> > > end
> > >
> > > def append(article)
> > > @articles.push(article)
> > > self
> > > end
> > >
> > > def [](index)
> > > @articles[index]
> > > end
> > > end

> >
> > Why create an Article class and an Articles class?

>
> Because I'm a *real* newbie! My only programming background is with
> XSLT. So I'm trying to also understand basic OO design in this
> example.
>
> > You could make all
> > the content of your Articles class also content of the Article class but
> > at the class level instead of the instance level. So you just have to
> > transform your @articles variable into @@articles and define your append
> > and [] methods as self.append and self.[].
>
> Can you give me an abbreviated example of how to actually do this?
> For example, how do I define @@articles under the Article class, and
> how would I then define the append method there.
>


I have not followed this thread in depth, but I think it is a good
idea to distinguish between a set of articles and an article. I don't
see how you would benefit from mixing these two. If I understand the
proposal correctly, you would no longer be able to maintain two
independent sets of articles, because the ArticleSet would be part of
the article class.

Anyhow, here is how to define a class variable and class methods.

class Klass
  @@foo = []

  def self.add(bar)
    @@foo << bar
  end

  def self.foo
    @@foo
  end
end

Klass.add(1)
Klass.add(2)
p Klass.foo   # prints [1, 2]
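
By contrast, a separate collection class keeps the option of several independent sets open. The sketch below is hypothetical: the ArticleSet name is borrowed from the wording above, the methods mirror the Articles class posted earlier, and plain strings stand in for Article objects.

```ruby
# Hypothetical collection class; each instance owns its own array.
class ArticleSet
  def initialize
    @articles = []
  end

  # Returns self so calls can be chained.
  def append(article)
    @articles.push(article)
    self
  end

  def [](index)
    @articles[index]
  end

  def size
    @articles.size
  end
end

# Strings stand in for Article objects here.
geography = ArticleSet.new.append("article a").append("article b")
history   = ArticleSet.new.append("article c")

p geography.size   # 2
p history.size     # 1
p history[0]       # "article c"
```

With a class variable there would be only one shared list, so the two sets above could not be kept apart.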

good luck with ruby,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/


 