str.scan

 
 
Colin Summers
      06-15-2007
I have a page of html, the usual thing. It has an ordered list. So it has
<ol>
<li>item</li>
<li>item</li>
<li>item</li>
<li>item</li>
</ol>

Well, I am going through this my usual way, which is just brute force
string manipulation. It's still my first day with Ruby. Then I see

str.scan
Both forms iterate through str, matching the pattern (which may be a
Regexp or a String). For each match, a result is generated and either
added to the result array or passed to the block. If the pattern
contains no groups, each individual result consists of the matched
string, $&. If the pattern contains groups, each individual result is
itself an array containing one entry per group.
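
A minimal sketch of the result shapes that description mentions, assuming a small made-up string (the sample data and variable names are illustrative, not from the page in question). It also shows that a String pattern is matched literally rather than as a regular expression.

```ruby
# Illustrative sample data; not taken from the original page.
page = "<li>one</li><li>two</li>"

# No groups: each result is the whole matched string.
whole = page.scan(/<li>[a-z]+<\/li>/)
# whole is ["<li>one</li>", "<li>two</li>"]

# One group: each result is an array with one entry per group.
grouped = page.scan(/<li>([a-z]+)<\/li>/)
# grouped is [["one"], ["two"]]

# Block form: each result is passed to the block instead of being collected.
seen = []
page.scan(/<li>([a-z]+)<\/li>/) { |(text)| seen << text }
# seen is ["one", "two"]

# String pattern: matched literally, so the * is just an asterisk here.
literal = page.scan("<li>*</li>")
# literal is []
```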



And I think, oooh, I bet that would be cool to use here. But my regexp
is rusty and I'm not sure how I would set it up
items = page.scan('<li>*</li>')
something like that? Then items would be an array of the text in the items?

Looked cool, anyway. I love how terse it can be.

There's probably also an html/xml parsing library, but I don't have
THAT much of this stuff to do, so I think a little manual work is
probably simpler/easier to learn.

--Colin

 
Peter Szinek
      06-15-2007
Colin,

> But my regexp
> is rusty and I'm not sure how I would set it up
> items = page.scan('<li>*</li>')
> something like that? Then items would be an array of the text in the items?


Yes, they will be.

However, first things first:

1) items = page.scan('<li>*</li>')

scan matches a String pattern literally, so use a Regexp literal here
(with the slash in </li> escaped). What I believe you want instead is

items = page.scan(/<li>.*<\/li>/)

(or maybe items = page.scan(/<li>.+<\/li>/) if you are not interested in
empty <li>s)

2) What I really believe you want is

items = page.scan(/<li>.*?<\/li>/)

? adds greediness to your regexp - so instead of matching the first
<li>, then matching as much as possible of anything, then matching the
*last* </li>, 2) will match as little as possible.

Let's try:

stuff = <<HTML
<li>aaa</li>
<li>bbb</li>
HTML

>> stuff.scan(/<li>.*?<\/li>/)

=> ["<li>aaa</li>", "<li>bbb</li>"]

3) Maybe you want even this:

>> stuff.scan(/<li>(.*?)<\/li>/)

=> [["aaa"], ["bbb"]]

or, even more friendly:

>> stuff.scan(/<li>(.*?)<\/li>/).flatten

=> ["aaa", "bbb"]
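
One caveat worth sketching, assuming some items in the real page span more than one line (the sample string below is made up for illustration): by default `.` does not match a newline, so such an item is skipped unless the `m` flag is added to the regexp.

```ruby
# Hypothetical input where the second item wraps onto a new line.
wrapped = "<li>aaa</li>\n<li>bbb\nccc</li>\n"

# Without /m, "." stops at the newline, so the wrapped item never matches.
single_line = wrapped.scan(/<li>(.*?)<\/li>/).flatten
# single_line is ["aaa"]

# With /m, "." also matches newlines, so both items are captured.
multi_line = wrapped.scan(/<li>(.*?)<\/li>/m).flatten
# multi_line is ["aaa", "bbb\nccc"]
```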

HTH,
Peter
_
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.


 
Peter Szinek
      06-15-2007
> hpricot, mechanize, rexml, rubyful_soup

and if you decide you need something advanced, you could check out
scRUBYt! as well.

Cheers,
Peter


 
Phrogz
      06-15-2007
On Jun 15, 12:13 am, Peter Szinek <(E-Mail Removed)> wrote:
> 2) What I really believe you want is
>
> items = page.scan('<li>.*?</li>')
>
> ? adds greediness to your regexp - so instead of matching the first
> <li>. then matching as much as possible of anything, then matching the
> *last* </li>, 2) will match as less as possible.


Minor pedantic correction: .* is greedy (it grabs as much as it can).
The question mark makes it non-greedy (stop as soon as you've found a
match).
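
A small sketch of the difference, assuming both items sit on a single line (the sample string is made up for illustration). On one line the greedy form swallows everything up to the last </li>, while the non-greedy form stops at the first one.

```ruby
# Illustrative input: two list items on the same line.
line = "<li>aaa</li><li>bbb</li>"

# Greedy: .* grabs as much as it can, so there is one big match.
greedy = line.scan(/<li>.*<\/li>/)
# greedy is ["<li>aaa</li><li>bbb</li>"]

# Non-greedy: .*? stops at the first </li>, so each item matches separately.
lazy = line.scan(/<li>.*?<\/li>/)
# lazy is ["<li>aaa</li>", "<li>bbb</li>"]
```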

 
Colin Summers
      06-15-2007
stuff.scan(/<li>(.*?)<\/li>/).flatten

is exactly what I was hoping for. I peeked at scRUBYt and I know that
I am duplicating work in there, but I am trying to do a bunch of things
at once and one is learning Ruby. scRUBYt is doing so much work for me
that I wouldn't learn very much.

The tcl code that stuff.scan(/<li>(.*?)<\/li>/).flatten replaces is so long.
That's great.

Day 2: Have my pickaxe. Bought Pine's book because it was fun to read
on the web and I like having books. Bought another copy of Lenz' Rails
book because a friend liked it so much he took it. 115 lines and I am
ahead of where the professional consultant was with the .NET
application (after a month of programming).

Thanks,
--Colin

 