listing all the html links
How can I use Ruby to list all the HTML links on a site?
Thanks |
Re: listing all the html links
After running this code I get:

:~$ ruby list.rb
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
[the same pair of errors repeats several times for each of list.rb:5 through list.rb:12]

Jeffrey Schwab wrote:
> Dado wrote:
>> how can I use ruby to list all the html links on a site, ?
>
> require 'open-uri'
>
> def scrape(url)
>   open(url) do |uri|
>     href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
>     m = href.match(uri.read)
>     while m
>       puts m[1]
>       m = href.match(m.post_match)
>     end
>   end
> end
>
> scrape('http://www.ruby-lang.org/en/') |
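The bytes `\302` and `\240` (octal) are the UTF-8 encoding of a non-breaking space, U+00A0, so a likely cause is that the indentation of the pasted code arrived as non-breaking spaces rather than ordinary ones. A minimal sketch of cleaning that up, assuming the source is read as UTF-8; the helper name `strip_nbsp` and the sample string are made up for illustration:

```ruby
# U+00A0 (non-breaking space) is the byte pair 0xC2 0xA0 in UTF-8,
# which old Ruby parsers report in octal as \302 and \240.
NBSP = "\u00A0"

# Replace every non-breaking space with an ordinary space.
def strip_nbsp(source)
  source.gsub(NBSP, ' ')
end

# Hypothetical pasted snippet whose indentation is made of NBSPs.
pasted = "def scrape(url)\n#{NBSP}#{NBSP}open(url)\nend\n"
puts strip_nbsp(pasted)
```

The same helper could be applied to a whole file, for example `File.write('list.rb', strip_nbsp(File.read('list.rb')))`, though retyping the indentation by hand works just as well.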
Re: listing all the html links
require 'open-uri'

def scrape(url)
  open(url) do |uri|
    href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
    m = href.match(uri.read)
    while m
      puts m[1]
      m = href.match(m.post_match)
    end
  end
end

scrape('http://www.ruby-lang.org/en/')

Works for me.

About the regular expression:

href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/

What is it saying? \s is whitespace, () captures a group, [] identifies a character set...

How does the loop work? I found post_match in Programming Ruby, page 538.

I added some puts calls. The first time around, m and m[1] print:

href="mailto:webmaster@ruby-lang.org"
"mailto:webmaster@ruby-lang.org"

Why is the second line m[1]? Is it because of the set of parentheses?

Thanks for your help |
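On the m[1] question: yes, it comes from the parentheses. m[0] is the whole match, and capturing groups are numbered by their opening parenthesis from left to right, while (?:...) groups without capturing. In this pattern the outer group (number 1) wraps the quoted value including its quotes, and the inner group (number 2) holds the text between the quotes. The sketch below runs the thread's pattern and loop over a small sample string instead of a downloaded page; `SAMPLE_HTML` and the helper name `href_matches` are assumptions for illustration.

```ruby
# Hypothetical page fragment standing in for downloaded HTML.
SAMPLE_HTML = <<~HTML
  <a href="mailto:webmaster@ruby-lang.org">Webmaster</a>
  <a href="http://www.ruby-lang.org/en/">Ruby</a>
HTML

# The pattern from the thread, unchanged.
HREF = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/

# Same loop as in the thread: match, record, then search again in
# whatever text follows the match (post_match) until nothing is found.
def href_matches(html)
  found = []
  m = HREF.match(html)
  while m
    found << [m[0], m[1], m[2]] # whole match, outer group, inner group
    m = HREF.match(m.post_match)
  end
  found
end

href_matches(SAMPLE_HTML).each do |whole, outer, inner|
  puts "m[0] = #{whole}"
  puts "m[1] = #{outer}"
  puts "m[2] = #{inner}"
end
```

Printing m[2] instead of m[1] would give the link without the surrounding quotes. Note that the unquoted branch, [^>\s], has no quantifier, so for an unquoted attribute value the pattern as written captures only its first character. To try this against a live page on a current Ruby, the HTML could come from `URI.open(url).read` after `require 'open-uri'`, since a bare `open(url)` stopped accepting URLs in Ruby 3.0.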
Re: listing all the html links
On Wed, 03 May 2006 22:27:35 +0100, Dado <digi@lycos.com> wrote:
> how can I use ruby to list all the html links on a site, ?
>

An alternative to the regexp approach, if you don't mind using external libraries:

require 'open-uri'
require 'rubyful_soup' # [1]

page = BeautifulSoup.new(URI('http://ruby-lang.org').read)
page.find_all('a').each { |l| puts l['href'] }

require 'mechanize' # [2]

m = WWW::Mechanize.new
page = m.get('http://ruby-lang.org')
page.links.each { |l| puts l.href }

--
[1] http://www.crummy.com/software/RubyfulSoup/
[2] http://mechanize.rubyforge.org/

Ross Bamford - rosco@roscopeco.remove.co.uk |
Re: listing all the html links
require 'open-uri'
URI.extract(open(<url>).read) |
Re: listing all the html links
On Fri, 05 May 2006 19:16:05 +0100, Vincent Foley <vfoley@gmail.com> wrote:
> require 'open-uri'
> URI.extract(open(<url>).read)
>

Unfortunately, you pull a lot of false positives, and it doesn't differentiate between links and other URIs (e.g. link src elements, DTD refs, etc.).

pp URI.extract(URI('http://www.google.com').read)
["font-family:arial,sans-serif;",
 "font-size:",
 "color:#0000cc;",
 "http://www.google.co.uk/ig%3Fhl%3Den",
 "https://www.google.com/accounts/Login?continue=http://www.google.co.uk/&hl=en",
 "http://groups.google.co.uk/grphp?hl=en&tab=wg&ie=UTF-8",
 "http://news.google.co.uk/nwshp?hl=en&tab=wn&ie=UTF-8",
 "http://froogle.google.co.uk/frghp?hl=en&tab=wf&ie=UTF-8",
 "Search:",
 "http://www.google.com/ncr"]

--
Ross Bamford - rosco@roscopeco.remove.co.uk |
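Some of those false positives (the CSS fragments and "Search:") can be filtered out by passing a list of schemes as the second argument to URI.extract. This is only a partial fix: it still cannot tell an anchor's href from any other http URI on the page, such as a DTD reference, and URI.extract is marked obsolete in current Ruby versions. A small sketch, where `SAMPLE_PAGE` and the helper name `http_uris` are made up for illustration:

```ruby
require 'uri'

# Hypothetical fragment mixing a real link with inline CSS.
SAMPLE_PAGE = '<a href="http://www.google.com/ncr" style="font-family:arial;">Google</a>'

# Keep only URIs whose scheme is http or https.
def http_uris(text)
  URI.extract(text, %w[http https])
end

puts http_uris(SAMPLE_PAGE)
```

For listing actual links rather than every URI in the document, an HTML-aware approach like the ones shown earlier in the thread is the more reliable route.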
Re: listing all the html links
Thank you for your clear explanations.
|