Velocity Reviews

Jason Hsu 12-03-2012 01:23 AM

Using Xpath to parse a Yahoo Finance page
 
I'm trying to extract the data on "total assets" from Yahoo Finance using Python 2.7 and lxml.

Here is a special test script I set up to work on this issue:

import urllib
import lxml
import lxml.html

url_local1 = "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView"
result1 = urllib.urlopen(url_local1)
element_html1 = result1.read()
doc1 = lxml.html.document_fromstring (element_html1)
list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
print list_row1

url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
result2 = urllib.urlopen(url_local2)
element_html2 = result2.read()
doc2 = lxml.html.document_fromstring (element_html2)
list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
print list_row2

I'm able to get the row of data on total assets from the Smartmoney page, but I get just an empty list when I try to parse the Yahoo Finance page.

MRAB 12-03-2012 02:25 AM

Re: Using Xpath to parse a Yahoo Finance page
 
On 2012-12-03 01:23, Jason Hsu wrote:
> I'm trying to extract the data on "total assets" from Yahoo Finance using Python 2.7 and lxml.
>
> Here is a special test script I set up to work on this issue:
>
> import urllib
> import lxml
> import lxml.html
>
> url_local1 = "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView"
> result1 = urllib.urlopen(url_local1)
> element_html1 = result1.read()
> doc1 = lxml.html.document_fromstring (element_html1)
> list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
> print list_row1
>
> url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
> result2 = urllib.urlopen(url_local2)
> element_html2 = result2.read()
> doc2 = lxml.html.document_fromstring (element_html2)
> list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
> print list_row2
>
> I'm able to get the row of data on total assets from the Smartmoney page, but I get just an empty list when I try to parse the Yahoo Finance page.
>

The problem is that you're asking it to look for an exact match.

If you look at the HTML itself, you'll see that there's whitespace
around the "Total Assets" part.

This should work:

list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')

(Although I tested it in Python 3.2.)
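To illustrate the whitespace issue without fetching the live page, here is a minimal sketch. The HTML string is a made-up stand-in for the Yahoo Finance row (the real markup may differ), and 111 and 222 are placeholder values. It compares the exact match with contains(), plus normalize-space() as a stricter alternative that was not suggested in this thread.

```python
import lxml.html

# Hypothetical markup: the label is padded with newlines and spaces,
# as described above. 111 and 222 are placeholder values.
HTML = """
<html><body><table>
<tr>
  <td><strong>
    Total Assets
  </strong></td>
  <td><strong>111</strong></td>
  <td><strong>222</strong></td>
</tr>
</table></body></html>
"""

doc = lxml.html.document_fromstring(HTML)

# Exact comparison: fails because the text node includes the padding.
exact = doc.xpath(
    './/td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')

# Substring test: succeeds despite the surrounding whitespace.
partial = doc.xpath(
    './/td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')

# Alternative: strip and collapse whitespace, then compare exactly.
normalized = doc.xpath(
    './/td[strong[normalize-space(text())="Total Assets"]]/following-sibling::td/strong/text()')

print(exact)
print(partial)
print(normalized)
```

With this sample, the exact match returns an empty list while the other two return both cell values. Note that contains() would also match a longer label that merely includes "Total Assets", so normalize-space() may be the safer choice if such rows exist.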

Jason Hsu 12-03-2012 03:32 AM

Re: Using Xpath to parse a Yahoo Finance page
 
On Sunday, December 2, 2012 8:25:45 PM UTC-6, MRAB wrote:
>
> list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')
>

Thanks, MRAB. Your suggestion works!

Stefan Behnel 12-03-2012 06:44 AM

Re: Using Xpath to parse a Yahoo Finance page
 
MRAB, 03.12.2012 03:25:
> On 2012-12-03 01:23, Jason Hsu wrote:
>> I'm trying to extract the data on "total assets" from Yahoo Finance using
>> Python 2.7 and lxml.
>>
>> Here is a special test script I set up to work on this issue:
>>
>> import urllib
>> import lxml
>> import lxml.html
>>
>> url_local1 = "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView"
>>
>> result1 = urllib.urlopen(url_local1)
>> element_html1 = result1.read()
>> doc1 = lxml.html.document_fromstring (element_html1)


The last three lines are unnecessarily complicated. Just use

doc = lxml.html.parse(url_local1)
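
As a rough sketch of how that call behaves: lxml.html.parse() accepts a filename, a URL, or a file-like object and returns an ElementTree rather than a root element. To keep the example independent of the network, it parses an in-memory page instead of url_local1. The markup is a made-up imitation of the Smartmoney row, and 111 and 222 are placeholder values.

```python
import io

import lxml.html

# Hypothetical page content standing in for the downloaded document.
PAGE = (b"<html><body><table><tr>"
        b"<th><div>Total Assets</div></th><td>111</td><td>222</td>"
        b"</tr></table></body></html>")

# parse() reads and parses in one step; a URL string could be passed
# here instead of the file-like object.
tree = lxml.html.parse(io.BytesIO(PAGE))

# The returned ElementTree supports xpath() directly.
row = tree.xpath('.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
print(row)

# The root element is available if element methods are needed.
root = tree.getroot()
print(root.tag)
```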


>> list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
>> print list_row1
>>
>> url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
>> result2 = urllib.urlopen(url_local2)
>> element_html2 = result2.read()
>> doc2 = lxml.html.document_fromstring (element_html2)
>> list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
>> print list_row2
>>
>> I'm able to get the row of data on total assets from the Smartmoney page,
>> but I get just an empty list when I try to parse the Yahoo Finance page.
>>

> The problem is that you're asking it to look for an exact match.
>
> If you look at the HTML itself, you'll see that there's whitespace
> around the "Total Assets" part.
>
> This should work:
>
> list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')


Something like contains(text(),"Total Assets") is better expressed as
contains(.,"Total Assets") because it considers the complete text content
of the element instead of only its first text node.

Stefan
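
The difference between the two forms can be seen in a small sketch. The markup below is hypothetical: the label is split by an inline tag, so the first text node of the strong element holds only "Total ". The value 111 is a placeholder.

```python
import lxml.html

# Hypothetical markup: the label text is split across an <em> element.
HTML = """
<html><body><table>
<tr>
  <td><strong>Total <em>Assets</em></strong></td>
  <td><strong>111</strong></td>
</tr>
</table></body></html>
"""

doc = lxml.html.document_fromstring(HTML)

# contains(text(), ...) converts the node set to a string using only
# the first text node ("Total "), so nothing matches.
first_text_node = doc.xpath(
    './/td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')

# contains(., ...) uses the full string value of <strong>,
# which is "Total Assets", so the row is found.
whole_content = doc.xpath(
    './/td[strong[contains(.,"Total Assets")]]/following-sibling::td/strong/text()')

print(first_text_node)
print(whole_content)
```

On a page where the label is a single text node, both expressions give the same result. The second form only matters once extra markup appears inside the element.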




All times are GMT.