tidy to convert google scholar page in xml
Dear friends,
I am trying to convert a Google Scholar page to XML. First, I fetch the page with this script:

    #!/usr/bin/python
    from HTMLParser import HTMLParser
    import urllib2

    # Request the results page with a browser-like User-Agent header.
    response = urllib2.urlopen(urllib2.Request(
        "http://scholar.google.co.uk/scholar?q=albert+einstein%2B1905&btnG=&hl=en&as_sdt=0%2C5&as_sdtp=",
        headers={"User-Agent": "Mozilla/5.0 Cheater/1.0"}))

    # Save the raw HTML; the with block closes the file when done.
    with open('sch.html', 'w') as f:
        f.write(response.read())

This gives sch.html, which starts as:

    <!doctype html><html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="viewport" content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2"><meta name="format-detection" content="telephone=no">

If I run tidy on this HTML page to convert it to XML, I get:

    $ tidy <sch.html |more
    line 3 column 40 - Warning: <style> isn't allowed in <div> elements
    line 3 column 23 - Info: <div> previously mentioned
    /**************************
    AND MANY MORE WARNINGS
    **************************/
    Info: Document content looks like HTML 4.01 Transitional
    Info: No system identifier in emitted doctype
    131 warnings, 0 errors were found!

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <meta name="generator" content=
    "HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
    <meta http-equiv="Content-Type" content=
    "text/html; charset=us-ascii">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <meta name="viewport" content=
    "width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2">
    <meta name="format-detection" content="telephone=no">
    <title>albert einstein+1905 - Google Scholar</title>

    <script type="text/javascript">
    var gs_ts=Number(new Date());
    </script>
    <style type="text/css">
    html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:
    0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}#gs_
    top{position:relative;min-width:980px;_width:expression(document.documentElement
    .clientWidth<982?"980px":"auto");}.gs_el_ph #gs_top,.gs_el_ta
    #gs_top{min-width:
    300px;_width:expression(document.documentElement.clientWidth<302?"300px":"auto")
    ;}body,td{font-size:13px;font-family:Arial,sans-serif;line-height:1.24}body{back

So this is still HTML, not XML. How can I convert the page to XML?
Re: tidy to convert google scholar page in xml
On 10/08/2012 07:11 AM, রুদ্র ব্যাণার্জী wrote:
> Dear friends,
> I am trying to convert a google scholar page to xml.
> [...]
> So, this is still in html, not in xml. How can I convert the page to
> xml?

What makes you think it's possible? (Possible automatically, that is.) There is no mapping from HTML to XML, so a program that tries this is just guessing in many places. Further, many, if not most, web pages are not even valid HTML, just good enough to work with most browsers. Now, if the page were valid XHTML, then it would already be valid XML.

Do you have a license from Google? If not, better read their terms of service. While they probably won't pursue the occasional page scraping, you should consider the costs before spending too much effort. Besides, they have APIs for most of their services, and there might be one that'll be much easier to use than trying to scrape the HTML.

Do you have a plan for what to do when the page layout changes?

You should look into Beautiful Soup; it's designed for parsing sloppily written HTML. I've no direct experience with it, but it gets recommended a lot.

--
DaveA
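
The guessing described in the reply above can be made concrete with a small sketch. It is not Beautiful Soup (a separate third-party package) and not what tidy does internally; it only uses Python 3's standard library (`html.parser` and `xml.etree.ElementTree`), whereas the script in the question is Python 2. The names `SoupToXml`, `html_to_xml`, and the `document` wrapper element are hypothetical, chosen for illustration. It assumes the tag and attribute names in the input are also legal XML names, and that the text contains no characters forbidden in XML; neither is checked.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SoupToXml(HTMLParser):
    """Build an ElementTree from possibly sloppy HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Hypothetical wrapper so that the result always has a single root.
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) become empty strings.
        attrib = {name: (value or "") for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Guess: close everything back to the nearest open element with
        # this name; a stray end tag that matches nothing is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text that follows a child element is stored as its tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html):
    """Return a well-formed XML string for the given HTML text."""
    parser = SoupToXml()
    parser.feed(html)
    parser.close()
    # ElementTree escapes <, > and & in text and attribute values.
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    # Unclosed <p> and <b>, plus a void <br>: the parser has to guess.
    print(html_to_xml("<div><p>Hello<br>world<b>bold</div>"))
```

Comments and the doctype are silently dropped, and elements that were never closed stay open until an enclosing end tag or the end of input, which is one guess among several a browser or tidy might make. That is the sense in which any such conversion is only approximate.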