Velocity Reviews - Computer Hardware Reviews


listing all the html links

 
 
Dado
 
      05-03-2006
How can I use Ruby to list all the HTML links on a site?

Thanks


 
Dado
 
      05-03-2006
After running this code I get:


:~$ ruby list.rb
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:12: Invalid char `\302' in expression
list.rb:12: Invalid char `\240' in expression
list.rb:12: Invalid char `\302' in expression
list.rb:12: Invalid char `\240' in expression
list.rb:12: Invalid char `\302' in expression
list.rb:12: Invalid char `\240' in expression

Jeffrey Schwab wrote:

> Dado wrote:
>> how can I use ruby to list all the html links on a site, ?
>
> require 'open-uri'
>
> def scrape(url)
>   open(url) do |uri|
>     href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
>     m = href.match(uri.read)
>     while m
>       puts m[1]
>       m = href.match(m.post_match)
>     end
>   end
> end
>
> scrape('http://www.ruby-lang.org/en/')
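
A note on the errors above: \302 and \240 are the octal bytes 0xC2 0xA0, which is the UTF-8 encoding of U+00A0 (no-break space). One likely cause, and this is only an assumption about how list.rb was created, is that the indentation was pasted from a web page or mail client that had turned ordinary spaces into no-break spaces. A minimal sketch of a clean-up helper follows; the name strip_nbsp and the sample string are made up for illustration.

```ruby
# Replace every no-break space (U+00A0, bytes \302\240 in UTF-8)
# with an ordinary ASCII space.
def strip_nbsp(source)
  source.gsub("\u00A0", " ")
end

# Hypothetical usage, cleaning the script in place:
#   File.write("list.rb", strip_nbsp(File.read("list.rb", encoding: "UTF-8")))

# Made-up sample: two lines of code indented with no-break spaces.
dirty = "def scrape(url)\n\u00A0\u00A0open(url) do |uri|\n"
puts strip_nbsp(dirty)
```

Retyping the indentation by hand in a plain-text editor would have the same effect.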


 
anne001
 
      05-05-2006
require 'open-uri'

def scrape(url)
  open(url) do |uri|
    href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
    m = href.match(uri.read)
    while m
      puts m[1]
      m = href.match(m.post_match)
    end
  end
end

scrape('http://www.ruby-lang.org/en/')

This works for me.

About the regular expression: href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
What is it saying? \s is whitespace, () captures a group, [] identifies
a character set...

How does the loop work?
I found post_match in Programming Ruby, page 538.

I added some puts calls. The first time around, m and m[1] print:
href="(E-Mail Removed)"
"(E-Mail Removed)"
Why is the second line m[1]? Is it because of the set of
parentheses?

Thanks for your help.
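
The questions above can be checked with a small experiment. In a MatchData object, m[0] is the whole match, m[1] is the text captured by the first pair of capturing parentheses (here the outer pair, which includes the quotes), and m[2] is the inner pair (the text between the quotes); (?:...) groups without capturing. post_match is whatever follows the match, so the loop keeps searching the remainder until nothing matches. The sketch below uses the regexp from the thread unchanged; the helper name trace_matches and the sample HTML are made up for illustration.

```ruby
# The regexp from the thread, unchanged.
HREF = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/

# Walk the matches the same way the loop in scrape does, but collect
# [m[0], m[1], m[2]] for each match instead of printing m[1].
def trace_matches(text)
  rows = []
  m = HREF.match(text)
  while m
    rows << [m[0], m[1], m[2]]
    # Search again, starting from the text after the current match.
    m = HREF.match(m.post_match)
  end
  rows
end

sample = '<a href="a.html">A</a> <a href="b.html">B</a>' # made-up input
trace_matches(sample).each { |row| p row }
```

For the first link this gives href="a.html" as m[0], "a.html" with its quotes as m[1], and a.html as m[2], so printing m[2] would drop the quotes. One thing to watch: the unquoted branch [^>\s] matches a single character, so an attribute written as href=foo would yield only f.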

 
Ross Bamford
 
      05-05-2006
On Wed, 03 May 2006 22:27:35 +0100, Dado <(E-Mail Removed)> wrote:

> how can I use ruby to list all the html links on a site, ?
>


An alternative to the regexp approach, if you don't mind using external
libraries:

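# Option 1: Rubyful Soup
require 'open-uri'
require 'rubyful_soup' # [1]
page = BeautifulSoup.new(URI('http://ruby-lang.org').read)
# Print the href attribute of every <a> tag
page.find_all('a').each { |l| puts l['href'] }

# Option 2: WWW::Mechanize (later Mechanize releases name the class Mechanize)
require 'mechanize' # [2]
m = WWW::Mechanize.new
page = m.get('http://ruby-lang.org')
# Print the target of every link Mechanize found on the page
page.links.each { |l| puts l.href }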

--
[1] http://www.crummy.com/software/RubyfulSoup/
[2] http://mechanize.rubyforge.org/

Ross Bamford - (E-Mail Removed)
 
Vincent Foley
 
      05-05-2006
require 'open-uri'
URI.extract(open(<url>).read)

 
Ross Bamford
 
      05-05-2006
On Fri, 05 May 2006 19:16:05 +0100, Vincent Foley <(E-Mail Removed)> wrote:

> require 'open-uri'
> URI.extract(open(<url>).read)
>


Unfortunately, that pulls in a lot of false positives, and it doesn't
differentiate between links and other URIs (e.g. link src elements, DTD
refs, etc.).

pp URI.extract(URI('http://www.google.com').read)
["font-family:arial,sans-serif;",
"font-size:",
"color:#0000cc;",
"http://www.google.co.uk/ig%3Fhl%3Den",
"https://www.google.com/accounts/Login?continue=http://www.google.co.uk/&hl=en",
"http://groups.google.co.uk/grphp?hl=en&tab=wg&ie=UTF-8",
"http://news.google.co.uk/nwshp?hl=en&tab=wn&ie=UTF-8",
"http://froogle.google.co.uk/frghp?hl=en&tab=wf&ie=UTF-8",
"Search:",
"http://www.google.com/ncr"]


--
Ross Bamford - (E-Mail Removed)
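
One way to cut down the false positives shown above, offered as a sketch rather than something proposed in the thread, is to pass a list of schemes to the extractor so that fragments such as "font-size:" or "Search:" are skipped. It still cannot tell an anchor's href from any other URI in the page. The helper name extract_web_uris and the sample markup are made up; the parser class used is URI::RFC2396_Parser from Ruby's standard uri library.

```ruby
require 'uri'

# Keep only http/https URIs found in a chunk of text.
# The second argument to extract is an optional list of schemes.
def extract_web_uris(text)
  URI::RFC2396_Parser.new.extract(text, %w[http https])
end

# Made-up sample modelled on the kind of markup behind the output above.
sample = '<font style="color:#0000cc;">' \
         '<a href="http://www.google.com/ncr">Google</a></font>'
p extract_web_uris(sample)
```

With the scheme filter in place, the style fragment in the sample is ignored and only the http URI is returned.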
 
anne001
 
      05-06-2006
Thank you for your clear explanations.

 