Velocity Reviews


Dado 05-03-2006 09:27 PM

listing all the html links
 
How can I use Ruby to list all the HTML links on a site?

Thanks



Dado 05-03-2006 11:25 PM

Re: listing all the html links
 
After running this code I get:


:~$ ruby list.rb
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
[... the same pair of errors repeats several more times for each of lines 6 through 12 of list.rb ...]

Jeffrey Schwab wrote:

> Dado wrote:
>> how can I use ruby to list all the html links on a site, ?

>
> require 'open-uri'
>
> def scrape(url)
>   open(url) do |uri|
>     href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
>     m = href.match(uri.read)
>     while m
>       puts m[1]
>       m = href.match(m.post_match)
>     end
>   end
> end
>
> scrape('http://www.ruby-lang.org/en/')
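
A note on the errors above: `\302` and `\240` are the octal forms of the bytes 0xC2 0xA0, which together are the UTF-8 encoding of a non-breaking space (U+00A0). A likely cause, assuming the code was copied from a web page, is that the indentation was pasted as non-breaking spaces, which the Ruby parser of that era rejected. A minimal sketch of one way to clean such a file follows; `strip_nbsp` is a hypothetical helper name, not something from the thread.

```ruby
# Hypothetical helper (not from the thread): replace non-breaking
# spaces with ordinary spaces so the parser sees normal indentation.
NBSP = "\u00A0" # encoded in UTF-8 as the bytes 0xC2 0xA0 (octal 302 240)

def strip_nbsp(source)
  source.gsub(NBSP, " ")
end

# Simulated paste: the indentation on the second line is two NBSPs.
pasted  = "def hello\n\u00A0\u00A0puts 'hi'\nend\n"
cleaned = strip_nbsp(pasted)
puts cleaned

# To fix a file in place (assuming it is saved as UTF-8):
# File.write("list.rb", strip_nbsp(File.read("list.rb", encoding: "UTF-8")))
```

Retyping the indentation by hand in a plain-text editor would have the same effect.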



anne001 05-05-2006 01:14 PM

Re: listing all the html links
 
require 'open-uri'

def scrape(url)
  open(url) do |uri|
    href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
    m = href.match(uri.read)
    while m
      puts m[1]
      m = href.match(m.post_match)
    end
  end
end

scrape('http://www.ruby-lang.org/en/')

works for me

About the regular expression: href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
What is it saying? \s is whitespace, () captures a group, [] identifies
a character set...

How does the loop work?
I found post_match in Programming Ruby, page 538.

I added some puts calls. The first time around, m and m[1] print as:
href="mailto:webmaster@ruby-lang.org"
"mailto:webmaster@ruby-lang.org"
Why is the second line m[1]? Is it because of the set of
parentheses?

Thanks for your help
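
The questions above can be explored without a network connection. The sketch below runs the thread's regular expression against a small hard-coded HTML string (the sample string and the `extract_hrefs` name are made up for illustration). It shows that `m[0]` is the whole match, `m[1]` is the outer parenthesised group (quotes included), and `m[2]` is the inner group, and that `post_match` returns the text after the current match so the loop can keep searching from there.

```ruby
# The regular expression from the thread, unchanged.
HREF = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/

# Hypothetical helper: collect the inner group (the URL without quotes)
# by repeatedly matching against whatever text follows the last match.
def extract_hrefs(html)
  links = []
  m = HREF.match(html)
  while m
    links << m[2]                 # nil when the attribute value is unquoted
    m = HREF.match(m.post_match)  # search only the text after this match
  end
  links
end

sample = '<a href="mailto:webmaster@ruby-lang.org">mail</a> <a href="/en/">home</a>'

m = HREF.match(sample)
puts m[0]  # whole match:  href="mailto:webmaster@ruby-lang.org"
puts m[1]  # outer group:  "mailto:webmaster@ruby-lang.org"
puts m[2]  # inner group:  mailto:webmaster@ruby-lang.org

p extract_hrefs(sample)
```

Printing `m` itself shows the whole match, which is why the first line of the output quoted above starts with `href=`. Note also that on Ruby 3.0 and later, open-uri no longer overrides `Kernel#open`, so the thread's `open(url)` would need to be written as `URI.open(url)`.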


Ross Bamford 05-05-2006 02:25 PM

Re: listing all the html links
 
On Wed, 03 May 2006 22:27:35 +0100, Dado <digi@lycos.com> wrote:

> how can I use ruby to list all the html links on a site, ?
>


An alternative to the regexp approach, if you don't mind using external
libraries:

# Option 1: Rubyful Soup
require 'open-uri'
require 'rubyful_soup' # [1]
page = BeautifulSoup.new(URI('http://ruby-lang.org').read)
page.find_all('a').each { |l| puts l['href'] }

# Option 2: WWW::Mechanize
require 'mechanize' # [2]
m = WWW::Mechanize.new
page = m.get('http://ruby-lang.org')
page.links.each { |l| puts l.href }

--
[1] http://www.crummy.com/software/RubyfulSoup/
[2] http://mechanize.rubyforge.org/

Ross Bamford - rosco@roscopeco.remove.co.uk

Vincent Foley 05-05-2006 06:16 PM

Re: listing all the html links
 
require 'open-uri'
URI.extract(open(<url>).read)


Ross Bamford 05-05-2006 06:38 PM

Re: listing all the html links
 
On Fri, 05 May 2006 19:16:05 +0100, Vincent Foley <vfoley@gmail.com> wrote:

> require 'open-uri'
> URI.extract(open(<url>).read)
>


Unfortunately, that pulls in a lot of false positives, and it doesn't
differentiate between links and other URIs (e.g. link src elements, DTD
refs, etc.).

pp URI.extract(URI('http://www.google.com').read)
["font-family:arial,sans-serif;",
"font-size:",
"color:#0000cc;",
"http://www.google.co.uk/ig%3Fhl%3Den",
"https://www.google.com/accounts/Login?continue=http://www.google.co.uk/&hl=en",
"http://groups.google.co.uk/grphp?hl=en&tab=wg&ie=UTF-8",
"http://news.google.co.uk/nwshp?hl=en&tab=wn&ie=UTF-8",
"http://froogle.google.co.uk/frghp?hl=en&tab=wf&ie=UTF-8",
"Search:",
"http://www.google.com/ncr"]
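
Some of the false positives above (the CSS fragments and "Search:") can be dropped by passing a list of schemes as the optional second argument to the extractor. The sketch below assumes a small hard-coded string rather than a live page, and it calls `URI::RFC2396_Parser#extract` directly, since `URI.extract` is marked obsolete in current Ruby versions; `http_uris` is a hypothetical helper name. It still cannot tell an anchor's href from any other URI in the page.

```ruby
require 'uri'

# Hypothetical helper: keep only URIs whose scheme is http or https.
def http_uris(text)
  URI::RFC2396_Parser.new.extract(text, %w[http https])
end

# Made-up sample containing a CSS declaration and a label that look
# like "scheme:rest" as well as one real link.
sample = '<p style="font-family:arial">Search: ' \
         '<a href="http://www.google.com/ncr">Google</a></p>'

p http_uris(sample)
```

Filtering by scheme narrows the output, but an HTML parser such as the ones suggested earlier in the thread is still needed to list only the links.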


--
Ross Bamford - rosco@roscopeco.remove.co.uk

anne001 05-06-2006 09:26 PM

Re: listing all the html links
 
Thank you for your clear explanations.



All times are GMT.
