Using Xpath to parse a Yahoo Finance page
I'm trying to extract the data on "total assets" from Yahoo Finance using Python 2.7 and lxml.
Here is a special test script I set up to work on this issue:

import urllib
import lxml
import lxml.html

url_local1 = "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1 &framework.view=smi_emptyView"
result1 = urllib.urlopen(url_local1)
element_html1 = result1.read()
doc1 = lxml.html.document_fromstring (element_html1)
list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
print list_row1

url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
result2 = urllib.urlopen(url_local2)
element_html2 = result2.read()
doc2 = lxml.html.document_fromstring (element_html2)
list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
print list_row2

I'm able to get the row of data on total assets from the Smartmoney page, but I get just an empty list when I try to parse the Yahoo Finance page.
Re: Using Xpath to parse a Yahoo Finance page
On 2012-12-03 01:23, Jason Hsu wrote:
> I'm trying to extract the data on "total assets" from Yahoo Finance using Python 2.7 and lxml.
>
> Here is a special test script I set up to work on this issue:
>
> import urllib
> import lxml
> import lxml.html
>
> url_local1 = "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1 &framework.view=smi_emptyView"
> result1 = urllib.urlopen(url_local1)
> element_html1 = result1.read()
> doc1 = lxml.html.document_fromstring (element_html1)
> list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
> print list_row1
>
> url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
> result2 = urllib.urlopen(url_local2)
> element_html2 = result2.read()
> doc2 = lxml.html.document_fromstring (element_html2)
> list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
> print list_row2
>
> I'm able to get the row of data on total assets from the Smartmoney page, but I get just an empty list when I try to parse the Yahoo Finance page.
>

The problem is that you're asking it to look for an exact match.

If you look at the HTML itself, you'll see that there's whitespace around the "Total Assets" part.

This should work:

list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')

(Although I tested it in Python 3.2.)
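The difference between the exact match and contains() can be reproduced without fetching the live page. The following is only a sketch: SAMPLE_HTML is a made-up fragment that assumes the label is padded with whitespace as described above, the figures are placeholders, and the normalize-space() variant is an additional option that was not proposed in the thread.

```python
import lxml.html

# Made-up fragment: the label is padded with whitespace, as described above.
# The figures are placeholders, not real balance-sheet data.
SAMPLE_HTML = """
<table>
  <tr>
    <td><strong>
      Total Assets
    </strong></td>
    <td><strong>1,000</strong></td>
    <td><strong>2,000</strong></td>
  </tr>
</table>
"""

doc = lxml.html.document_fromstring(SAMPLE_HTML)

# Shared tail of the expression: the value cells to the right of the label.
TAIL = '/following-sibling::td/strong/text()'

# Exact comparison: the text node includes the surrounding whitespace.
exact = doc.xpath('.//td[strong[text()="Total Assets"]]' + TAIL)

# Substring test: the surrounding whitespace no longer matters.
loose = doc.xpath('.//td[strong[contains(text(),"Total Assets")]]' + TAIL)

# Alternative (not from the thread): trim the text, then compare exactly.
trimmed = doc.xpath('.//td[strong[normalize-space(text())="Total Assets"]]' + TAIL)

print(exact)
print(loose)
print(trimmed)
```

On this fragment the exact comparison gives an empty list, while the other two expressions return the two value cells.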
Re: Using Xpath to parse a Yahoo Finance page
On Sunday, December 2, 2012 8:25:45 PM UTC-6, MRAB wrote:
> list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')

Thanks, MRAB. Your suggestion works!
Re: Using Xpath to parse a Yahoo Finance page
MRAB, 03.12.2012 03:25:
> On 2012-12-03 01:23, Jason Hsu wrote:
>> I'm trying to extract the data on "total assets" from Yahoo Finance using Python 2.7 and lxml.
>>
>> Here is a special test script I set up to work on this issue:
>>
>> import urllib
>> import lxml
>> import lxml.html
>>
>> url_local1 = "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1 &framework.view=smi_emptyView"
>> result1 = urllib.urlopen(url_local1)
>> element_html1 = result1.read()
>> doc1 = lxml.html.document_fromstring (element_html1)

The last three lines are unnecessarily complicated code. Just use

doc = lxml.html.parse(url_local1)

>> list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
>> print list_row1
>>
>> url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
>> result2 = urllib.urlopen(url_local2)
>> element_html2 = result2.read()
>> doc2 = lxml.html.document_fromstring (element_html2)
>> list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
>> print list_row2
>>
>> I'm able to get the row of data on total assets from the Smartmoney page, but I get just an empty list when I try to parse the Yahoo Finance page.
>>
> The problem is that you're asking it to look for an exact match.
>
> If you look at the HTML itself, you'll see that there's whitespace around the "Total Assets" part.
>
> This should work:
>
> list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')

Something like "contains(text(),"Total Assets")" is better expressed as "contains(.,"Total Assets")" because it considers the complete text content instead of just one text node.

Stefan
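Both suggestions can be tried on a small in-memory document. This is a sketch under stated assumptions: SAMPLE_HTML is a made-up fragment in which the label is split by an <em> tag (the real Yahoo page may not look like this), the figure is a placeholder, and io.BytesIO stands in for the URL, since lxml.html.parse() also accepts a file-like object.

```python
import io
import lxml.html

# Made-up fragment: the label is split across two nodes by an <em> tag.
# The figure is a placeholder, not real balance-sheet data.
SAMPLE_HTML = b"""
<html><body><table>
  <tr>
    <td><strong>Total <em>Assets</em></strong></td>
    <td><strong>1,000</strong></td>
  </tr>
</table></body></html>
"""

# lxml.html.parse() takes a filename, a URL, or a file-like object and
# returns an ElementTree; getroot() gives the <html> element.
tree = lxml.html.parse(io.BytesIO(SAMPLE_HTML))
root = tree.getroot()

# Shared tail of the expression: the value cells to the right of the label.
TAIL = '/following-sibling::td/strong/text()'

# contains(text(), ...) only looks at the first text node of <strong>,
# which here is just "Total ".
first_node_only = root.xpath('.//td[strong[contains(text(),"Total Assets")]]' + TAIL)

# contains(., ...) uses the whole text content of <strong>, "Total Assets".
whole_content = root.xpath('.//td[strong[contains(.,"Total Assets")]]' + TAIL)

print(first_node_only)
print(whole_content)
```

With this fragment the text() form finds nothing and the "." form returns the value cell. Note that parse() returns an ElementTree rather than an element, which is why the sketch calls getroot() before running the queries.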