hpricot - can't extract what I want from page

 
 
Adam Akhtar
 
      03-28-2008
Hi, I'm starting to use hpricot and I'm having problems extracting
descriptions of various concert events from a page. Here is a sample of
the HTML:


<p>
<a name="concerts"/>
<span class="heading">Concerts</span>
<br/>
<span class="subheading">POPULAR</span>
<br/>
<br/>
<span class="textbold">Middle Field! Vol.4</span >
<br/>
Featuring electric-pop band The Stealth, Mac and Masaru, and others. Mar
28, 7pm, ¥2,500 (adv)/ ¥3,000 (door). Shibuya O-Nest. Tel: 03-3498-9999.
<br/>
<br/>
<span class="textbold">Philip Woo featuring Brenda Vaughn</span>
<br/>
Japanese pianist and soul singer performing with Andy Wulf and Kaori
Kobayashi. Mar 28 & 29, 7 & 9:30pm, ¥3,150. Cotton Club, Marunouchi.
Tel: 03-3215-1555.
<br/>
...
...
...
etc

I can get the artist band names fine using
names = doc.search("//span[@class='textbold']")

but I can't get the descriptions. The descriptions aren't
individually wrapped in any tags; they are just clumped together
under the paragraph tag, separated by line breaks (<br/>).

So I thought I'd just try
descriptions = doc.search("/html/body/div/table/tbody/tr[4]/td/table/tbody/tr/td[2]/table/tbody/tr/td/span/p")
but when I try to puts descriptions, nothing is printed to the screen.

How would I go about getting this info? Any tips or ideas?

Thanks
--
Posted via http://www.ruby-forum.com/.

 
Adam Akhtar
 
      03-28-2008
More info:

The original website can be found at
http://metropolis.co.jp/tokyo/recent/listings.asp

I used Firebug to retrieve the XPath of the desired paragraph
(excerpted above). When I put it in doc.search, it doesn't retrieve
anything at all.


Does anyone know why? I'm banging my head against the wall.
 
Todd Benson
 
      03-28-2008
On Fri, Mar 28, 2008 at 2:11 AM, Adam Akhtar wrote:
> Hi im starting to use hrpicot and im having problems extracting
> descriptions of various concert events from a page. Here is a sample of
> the html
>
>
> <p>
> <a name="concerts"/>
> <span class="heading">Concerts</span>
> <br/>
> <span class="subheading">POPULAR</span>
> <br/>
> <br/>
> <span class="textbold">Middle Field! Vol.4</span >
> <br/>
> Featuring electric-pop band The Stealth, Mac and Masaru, and others. Mar
> 28, 7pm, ¥2,500 (adv)/ ¥3,000 (door). Shibuya O-Nest. Tel: 03-3498-9999.
> <br/>
> <br/>
> <span class="textbold">Philip Woo featuring Brenda Vaughn</span>
> <br/>
> Japanese pianist and soul singer performing with Andy Wulf and Kaori
> Kobayashi. Mar 28 & 29, 7 & 9:30pm, ¥3,150. Cotton Club, Marunouchi.
> Tel: 03-3215-1555.
> <br/>
> ...
> ...
> ...
> etc
>
> I can get the artist band names fine using
> names = doc.search("//span[@class='textbold']")
>
> but i cant get teh descriptions. In fact the descriptions aren't
> indvidually wrapped up in any tags but rather just clumped together
> under the paragraph tab with line breaks <br/>
>
> So I thought id just try
> descriptions =
> doc.search("/html/body/div/table/tbody/tr[4]/td/table/tbody/tr/td[2]/table/tbody/tr/td/span/p")
> but when i try to puts descriptions nothing is printed to the screen.
>
> How would i go about getting this info??? any tips or ideas?
>
> Thanks


Wow! The page looks nice, but the HTML is really ugly. This would be
pretty hard to scrape on a regular basis. For artists, there is a
mix of <strong></strong> tags, <span class="textbold"></span> tags,
and I noticed one artist with no surrounding tags at all (Ex-press
Ver.2).

It can be really hard to work with inconsistent HTML, but I suppose it
could be done to some degree of accuracy. Any hpricot masters out
there? I'm sure you'd have to attack with regexps as well. Maybe
turning it into text and then parsing is a better idea after all.

Todd
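
To illustrate the "turn it into text and parse it" idea, here is a minimal sketch that uses only a regexp, with no hpricot. It assumes the fragment from the original post, where every event is a textbold span followed by a <br/> and then free text up to the next <br/>. The names html, event_pattern and events are made up for this example. It would not catch the <strong> artists or the untagged one mentioned above.

```ruby
# Sample fragment copied from the original post.
html = <<~HTML
  <span class="textbold">Middle Field! Vol.4</span >
  <br/>
  Featuring electric-pop band The Stealth, Mac and Masaru, and others. Mar
  28, 7pm, ¥2,500 (adv)/ ¥3,000 (door). Shibuya O-Nest. Tel: 03-3498-9999.
  <br/>
  <br/>
  <span class="textbold">Philip Woo featuring Brenda Vaughn</span>
  <br/>
  Japanese pianist and soul singer performing with Andy Wulf and Kaori
  Kobayashi. Mar 28 & 29, 7 & 9:30pm, ¥3,150. Cotton Club, Marunouchi.
  Tel: 03-3215-1555.
  <br/>
HTML

# Assumed layout: textbold span, <br/>, description text, <br/>.
# The /m flag lets the description span several lines.
event_pattern = %r{<span class="textbold">(.*?)</span\s*>\s*<br\s*/?>\s*(.*?)\s*<br\s*/?>}m

events = html.scan(event_pattern).map do |name, description|
  # Collapse the line breaks inside a description into single spaces.
  { name: name.strip, description: description.gsub(/\s+/, " ") }
end

events.each do |event|
  puts event[:name]
  puts event[:description]
  puts
end
```

On the real page the pattern would need extra alternatives for the other markup styles, so treat it as a starting point rather than a complete scraper.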

 
Adam Akhtar
 
      03-28-2008
Thanks Todd for the reply. Yes, even I thought that it was badly designed,
and I don't have any web design experience at all. In fact I learned the
basics of HTML, XML and XPath just for this.

Although those inconsistencies will prove to be a problem in the future,
the one I'm having right now is getting any information at all. Surely
when I pass the XPath for the paragraph element which contains
all the artists' names and event descriptions, it should return something
rather than nothing. Is that right? Every time I try to print the result
of the search to the screen it just comes up blank. Does anyone know
why?



Thomas Wieczorek
 
      03-28-2008
On Fri, Mar 28, 2008 at 11:42 AM, Dan Diebolt wrote:
> Firebug inserts tbody steps into XPaths that reach into tables even if the <tbody> tag is not in the HTML source. Try removing the tbody steps from the path, and debug using shorter XPaths that initially address content further up in the hierarchy.
>


Yes, Firefox does it to make the markup more (X)HTML-conformant. It took me a
while to get the hang of it. You might download the page using
open-uri, open it with your favourite editor, search for the text and
work your way up through the tags.
Most sites don't use <tbody>, so just try it without it.
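
As a small sketch of the "try it without tbody" advice, the tbody steps can be stripped from a Firebug-generated path before it is passed to doc.search. The helper name strip_tbody is made up for this example, and it assumes the page source really contains no <tbody> tags.

```ruby
# Remove every tbody step (with an optional [n] index) from an XPath string.
# strip_tbody is a hypothetical helper name, not part of hpricot.
def strip_tbody(xpath)
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, "")
end

# The path Firebug produced in the original post.
firebug_xpath = "/html/body/div/table/tbody/tr[4]/td/table/tbody/tr/td[2]/table/tbody/tr/td/span/p"

cleaned_xpath = strip_tbody(firebug_xpath)
puts cleaned_xpath
```

The cleaned path can then be shortened step by step, as suggested above, to see at which level the search stops returning anything.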

 
Adam Akhtar
 
      03-28-2008
OK, I have tried taking out the tbody tags completely and got some of the
text back. I'll experiment to see if I can get all of it.

Re: Tidy

I installed the gem and I got the example code:

require 'tidy'
Tidy.path = '/usr/lib/libtidy.so'
html = '<html><title>title</title>Body</html>'
xml = Tidy.open(:show_warnings=>true) do |tidy|
  tidy.options.output_xml = true
  puts tidy.options.show_warnings
  xml = tidy.clean(html)
  puts tidy.errors
  puts tidy.diagnostics
  xml
end
puts xml

Now I have to change the path to wherever the lib is. I found
tidy's folder in my lib directory and changed the above to this:

Tidy.path = 'C:\ruby\lib\ruby\gems\1.8\gems\tidy-1.1.2\lib\tidy\tidylib'

and it's complaining, saying no such file. I tried

Tidy.path = 'C:\ruby\lib\ruby\gems\1.8\gems\tidy-1.1.2\lib\tidy\tidylib.rb'

as that's the proper extension of the tidylib file, but again it won't
work.

I can't find any tidylib file with a .so extension.

Banging my head even more now.

Adam Akhtar
 
      03-28-2008
Just downloaded a DLL which I needed. Why doesn't that come with the
******* gem?
 
daniel hoey
 
      03-31-2008
On Mar 28, 6:11 pm, Adam Akhtar wrote:
> Hi im starting to use hrpicot and im having problems extracting
> descriptions of various concert events from a page. Here is a sample of
> the html
>
> <p>
> <a name="concerts"/>
> <span class="heading">Concerts</span>
> <br/>
> <span class="subheading">POPULAR</span>
> <br/>
> <br/>
> <span class="textbold">Middle Field! Vol.4</span >
> <br/>
> Featuring electric-pop band The Stealth, Mac and Masaru, and others. Mar
> 28, 7pm, ¥2,500 (adv)/ ¥3,000 (door). Shibuya O-Nest. Tel: 03-3498-9999.
> <br/>
> <br/>
> <span class="textbold">Philip Woo featuring Brenda Vaughn</span>
> <br/>
> Japanese pianist and soul singer performing with Andy Wulf and Kaori
> Kobayashi. Mar 28 & 29, 7 & 9:30pm, ¥3,150. Cotton Club, Marunouchi.
> Tel: 03-3215-1555.
> <br/>
> ..
> ..
> ..
> etc
>
> I can get the artist band names fine using
> names = doc.search("//span[@class='textbold']")
>
> but i cant get teh descriptions. In fact the descriptions aren't
> indvidually wrapped up in any tags but rather just clumped together
> under the paragraph tab with line breaks <br/>
>
> So I thought id just try
> descriptions =
> doc.search("/html/body/div/table/tbody/tr[4]/td/table/tbody/tr/td[2]/table/tbody/tr/td/span/p")
> but when i try to puts descriptions nothing is printed to the screen.
>
> How would i go about getting this info??? any tips or ideas?
>
> Thanks
> --
> Posted via http://www.ruby-forum.com/.


Once you have the 'name' node, you can use next_node to walk to the nodes
that follow it in the document.
This method should work for your example:

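def print_names_and_descriptions(hpricot_doc)
  names = hpricot_doc.search("//span[@class='textbold']")

  names.each do |name|
    # Walk forward from the name span to the first text node that has word characters.
    node = name.next_node
    node = node.next_node until node.nil? || (node.text? && node.inner_text =~ /\w+/)
    next if node.nil? # no description text follows this name

    puts name.inner_text
    puts node.to_s.strip
    puts
  end
end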
 