Page crawling and URL grabbing

 
 
Patrick L.
 
      01-27-2009
Hey guys,
I'm trying to write an application that goes onto a website (istockphoto
specifically), opens up istockphoto.com/file_browse.php and grabs the
URLs of the photos that appear there.

It's my first time doing something like this. I'm reading some
documentation right now...but a hand would be greatly appreciated. I'm
not really sure how to do regex on an HTML file...or even find the right
stuff within that file. I'm guessing it's something like this:

open('http://www.istockphoto.com/file_browse.php/') do |f|
f.find # dot something something
end

but I really have no idea. Any help would be great - thanks in advance!
--
Posted via http://www.ruby-forum.com/.

 
Jesús Gabriel y Galán
 
      01-27-2009
On Tue, Jan 27, 2009 at 1:55 AM, Patrick L. <(E-Mail Removed)> wrote:
> Hey guys,
> I'm trying to write an application that goes onto a website (istockphoto
> specifically), opens up istockphoto.com/file_browse.php and grabs the
> URLs of the photos that appear there.
>
> It's my first time doing something like this. I'm reading some
> documentation right now...but a hand would be greatly appreciated. I'm
> not really sure how to do regex on an html file...or even find the right
> stuff within that file. I'm guessing its..


Generally speaking, regular expressions are not the best tool to extract
information from HTML. Take a look at these other tools:

Mechanize
Hpricot
Scrubyt
Nokogiri
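
To illustrate why regular expressions tend to be brittle here, this is a minimal sketch using only plain Ruby. The markup in SAMPLE and the NAIVE pattern are made up for the illustration and are not taken from the real istockphoto page. Both tags describe the same kind of image, but the pattern only matches the one written in the exact attribute order and quoting style it expects:

```ruby
# Hypothetical markup: two equivalent <img> tags written differently.
SAMPLE = <<~HTML
  <img class="searchImg" src="/thumbs/1.jpg">
  <img src='/thumbs/2.jpg' class='searchImg'>
HTML

# A naive pattern that assumes attribute order and double quotes.
NAIVE = /<img class="searchImg" src="([^"]+)"/

# scan returns one array per match (one entry per capture group).
naive_matches = SAMPLE.scan(NAIVE).flatten

puts naive_matches.inspect
# prints ["/thumbs/1.jpg"] (the second image is silently missed)
```

An HTML parser looks at the parsed attributes rather than the raw text, so it is not affected by attribute order or quoting.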

This is an example that might get you started, although I recommend taking
a look at the above tools:

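require 'open-uri'
require 'hpricot'

# Fetch the page and parse it into an Hpricot document
h = Hpricot(open("http://www.istockphoto.com/file_browse.php"))
# Select every element whose class attribute is searchImg
imgs = h.search("//[@class = searchImg]")
# Collect the src attribute (the image URL) of each match
imgs.map {|img| img["src"]}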

# => ["http://www2.istockphoto.com/file_thumbview_approve/8137463/1/istockphoto_8137463-budapest-by-night.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8139472/1/istockphoto_8139472-four-antique-wood-tennis-racquets.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/6731990/1/istockphoto_6731990-two-female-lovers.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8308377/1/istockphoto_8308377-beauty.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/6349299/1/istockphoto_6349299-lovers-interested-in-smth.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8322403/1/istockphoto_8322403-happy-piggy-bank.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8138976/1/istockphoto_8138976-tower-guard-of-cetara-little-town-in-amalfi-coast-italy.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8322394/1/istockphoto_8322394-yellow-red-paper.jpg",
"http://www1.istockphoto.com/file_thumbview_approve/4660654/1/istockphoto_4660654-the-art-of-eye-shadows.jpg",
"http://www1.istockphoto.com/file_thumbview_approve/8301075/1/istockphoto_8301075-3d-render-of-the-olive-tree.jpg",
"http://www1.istockphoto.com/file_thumbview_approve/6921717/1/istockphoto_6921717-manicure.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8322391/1/istockphoto_8322391-pomegranate.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8138975/1/istockphoto_8138975-junger-mann-seitlich.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8139815/1/istockphoto_8139815-winter.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8137153/1/istockphoto_8137153-beadworkafrican_pictureframe_p3406-jpg.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8139787/1/istockphoto_8139787-statue-of-liberty.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8322388/1/istockphoto_8322388-cold-winter-day.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8139602/1/istockphoto_8139602-statue-of-liberty.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8137801/1/istockphoto_8137801-litchi.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8139406/1/istockphoto_8139406-statue-of-liberty.jpg",
"http://www1.istockphoto.com/file_thumbview_approve/6850893/1/istockphoto_6850893-polka-dot-wedding-cake.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8139802/1/istockphoto_8139802-snow-woman.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8322364/1/istockphoto_8322364-white-cherry-blossom.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8139808/1/istockphoto_8139808-airport.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8322357/1/istockphoto_8322357-ciruit.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8139597/1/istockphoto_8139597-cheese-and-wine.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8138075/1/istockphoto_8138075-employee-of-office.jpg"]


You should customize the criteria used to choose the images. In my little
example I selected all tags with the class searchImg, which at a quick
glance seemed to be what you wanted, but double check.

I recall reading somewhere that Nokogiri has better XPath support than
Hpricot, so check it out.

Jesus.

 
Miroslaw Niegowski
 
      01-27-2009
2009/1/27 Patrick L. <(E-Mail Removed)>:
> Hey guys,
> I'm trying to write an application that goes onto a website (istockphoto
> specifically), opens up istockphoto.com/file_browse.php and grabs the
> URLs of the photos that appear there.
>
> It's my first time doing something like this. I'm reading some
> documentation right now...but a hand would be greatly appreciated. I'm
> not really sure how to do regex on an html file...or even find the right
> stuff within that file. I'm guessing its..
>
> open('http://www.istockphoto.com/file_browse.php/') do |f|
> f.find # dot something something
> end



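Try Mechanize.
It's easy:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'  # identify as a common browser
page = agent.get('http://www.istockphoto.com/file_browse.php')
page.links.text(/jpg/)                 # links whose text matches /jpg/
...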

 
Patrick L.
 
      01-27-2009
Miroslaw Niegowski wrote:
> 2009/1/27 Patrick L. <(E-Mail Removed)>:
>> open('http://www.istockphoto.com/file_browse.php/') do |f|
>> f.find # dot something something
>> end

>
>
> Try Mechanize.
> It's easy :
>
> agent = WWW::Mechanize.new
> agent.user_agent_alias='Mac Safari'
> page = agent.get('http://www.istockphoto.com/file_browse.php');
> page.links.text(/jpg/)
> ...


That's great, or it sounds great. Is there any documentation aside from
blog posts and this: http://mechanize.rubyforge.org/mechanize/ ? What
did you use to learn it?

--
Posted via http://www.ruby-forum.com/.

 
Tsunami Script
 
      01-27-2009
Mechanize is very easy and intuitive. You could basically learn to use
it just by playing with it in irb. Combine that with reading the docs,
and you're good to go.
--
Posted via http://www.ruby-forum.com/.

 