Velocity Reviews - Computer Hardware Reviews

tidy to convert google scholar page in xml

 
 
রুদ্র ব্যাণার্জী
 
      10-08-2012
Dear friends,
I am trying to convert a Google Scholar page to XML.
First, I fetch the page using this script:
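#!/usr/bin/python
from HTMLParser import HTMLParser
import urllib2

# Fetch the results page, sending a browser-like User-Agent header.
url = "http://scholar.google.co.uk/scholar?q=albert+einstein%2B1905&btnG=&hl=en&as_sdt=0%2C5&as_sdtp="
request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 Cheater/1.0"})
response = urllib2.urlopen(request)

# Save the raw HTML; the with-block closes the file when done.
with open('sch.html', 'w') as f:
    f.write(response.read())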

This gives sch.html, which starts as:
<!doctype html><html><head><meta http-equiv="Content-Type"
content="text/html;charset=UTF-8"><meta http-equiv="X-UA-Compatible"
content="IE=Edge"><meta name="viewport"
content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2"><meta name="format-detection" content="telephone=no">

If I try tidy to convert this HTML page to XML, I get:
$ tidy <sch.html |more
line 3 column 40 - Warning: <style> isn't allowed in <div> elements
line 3 column 23 - Info: <div> previously mentioned
/**************************
AND MANY MORE WARNINGS
**************************/
Info: Document content looks like HTML 4.01 Transitional
Info: No system identifier in emitted doctype
131 warnings, 0 errors were found!

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta name="viewport" content=
"width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2">
<meta name="format-detection" content="telephone=no">
<title>albert einstein+1905 - Google Scholar</title>

<script type="text/javascript">
var gs_ts=Number(new Date());
</script>
<style type="text/css">
html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:
0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}#gs_
top{position:relative;min-width:980px;_width:expression(document.documentElement
.clientWidth<982?"980px":"auto");}.gs_el_ph #gs_top,.gs_el_ta
#gs_top{min-width:
300px;_width:expression(document.documentElement.clientWidth<302?"300px":"auto")
;}body,td{font-size:13px;font-family:Arial,sans-serif;line-height:1.24}body{back
;}body,td{font-size:13px;font-family:Arial,sans-serif;line-height:1.24}body{back


So this is still HTML, not XML. How can I convert the page to
XML?
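
For reference, tidy writes HTML unless it is asked for something else, which matches the output above. The following is only a sketch of one way to request XHTML (which is well-formed XML) from Python. It assumes the installed tidy build accepts the -asxml, -numeric, -quiet and -output options, it is written for Python 3 rather than the Python 2 of the script above, and the helper names tidy_command and html_to_xhtml are made up for illustration.

```python
import shutil
import subprocess


def tidy_command(src, dst):
    """Build a tidy command line that asks for XHTML output (hypothetical helper)."""
    return [
        "tidy",
        "-asxml",        # emit XHTML instead of plain HTML
        "-numeric",      # numeric entities, so no HTML DTD is needed to read them
        "-quiet",        # suppress the banner and the summary
        "-output", dst,  # write the result to a file instead of stdout
        src,
    ]


def html_to_xhtml(src="sch.html", dst="sch.xml"):
    """Run tidy on src and return the output path (hypothetical helper)."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    result = subprocess.run(tidy_command(src, dst))
    # tidy exits with 1 when it only printed warnings and 2 when it hit errors,
    # so a status of 1 is not treated as a failure here.
    if result.returncode >= 2:
        raise RuntimeError("tidy reported errors for %s" % src)
    return dst


if __name__ == "__main__":
    print(" ".join(tidy_command("sch.html", "sch.xml")))
```

Even when this succeeds, the result is tidy's best guess at repairing the markup, so the structure should be checked before relying on it.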

 
 
 
 
 
Dave Angel
 
      10-08-2012
On 10/08/2012 07:11 AM, রুদ্র ব্যাণার্জী wrote:
> So, this is still in html, not in xml. How can I convert the page to
> xml?
>


What makes you think it's possible? (Possible automatically, that is.)
There is no mapping from HTML to XML, so a program that tries this is
just guessing in many places. Further, many, if not most, web pages are
not even valid HTML, just good enough to work with most browsers. Now,
if the page were valid XHTML, then it would already be valid XML.
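
The difference between XHTML and browser-grade HTML can be seen by handing both to an XML parser. This is a small Python 3 sketch using the standard library's xml.etree.ElementTree; the two sample strings and the function name is_well_formed_xml are invented for illustration and are not taken from the Scholar page.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if an XML parser accepts text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# XHTML: every element is closed, so this is also well-formed XML.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head>"
    "<body><p>one<br/>two</p></body></html>"
)

# Typical browser-grade HTML: <meta>, <br> and <p> are never closed.
tag_soup = (
    '<html><head><meta name="viewport" content="x"><title>t</title></head>'
    "<body><p>one<br>two</body></html>"
)

if __name__ == "__main__":
    print(is_well_formed_xml(xhtml))
    print(is_well_formed_xml(tag_soup))
```

Browsers accept the second string without complaint, but an XML parser rejects it at the first mismatched tag.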

Do you have a license from Google? If not, better read their terms of
service. While they probably won't pursue occasional page scraping,
you should consider the costs before spending too much effort. Besides,
they have APIs for most of their services, and there might be one
that'll be much easier to use than trying to scrape the HTML.

Do you have a plan for what to do when the page layout changes?

You should look into Beautiful Soup; it's designed for parsing sloppily
written HTML. I've no direct experience with it, but it gets
recommended a lot.
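
To give an idea of what tolerant parsing looks like, here is a Python 3 sketch that uses the standard library's html.parser as a stand-in. It is not Beautiful Soup and does not use its API. The class name HeadingCollector, the sample markup, and the assumption that the interesting text sits in <h3> elements are all made up for illustration.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every <h3> element, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self._parts = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._flush()  # an unclosed <h3> ends where the next one begins
            self._in_h3 = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._flush()

    def handle_data(self, data):
        if self._in_h3:
            self._parts.append(data)

    def close(self):
        super().close()
        self._flush()  # pick up a heading left open at end of input

    def _flush(self):
        text = "".join(self._parts).strip()
        if text:
            self.headings.append(text)
        self._parts = []
        self._in_h3 = False


# Sloppy input: unquoted attribute, unclosed <p>, unclosed second <h3>.
sloppy = "<div><h3><a href=x>First <b>result</b></a></h3><p>snippet<h3>Second result</div>"

collector = HeadingCollector()
collector.feed(sloppy)
collector.close()

if __name__ == "__main__":
    print(collector.headings)
```

The parser never raises on the malformed input; it simply reports the tags it sees, and the collector decides how to repair the structure. Those decisions are exactly the guessing described above, and they break as soon as the page layout changes.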


--

DaveA

 