Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > help with link parsing?


help with link parsing?

 
 
Littlefield, Tyler
12-20-2010
Hello all,
I have a question. I guess this worked pre-2.6; I don't remember the
last time I used it, but it was a while ago, and now it's failing.
Anyone mind looking at it and telling me what's going wrong? Also, is
there a quick way to match on a certain site, e.g. only output the
links that point at google.com?
#!/usr/bin/env python

#This program is free software: you can redistribute it and/or modify it
#under the terms of the GNU General Public License as published
#by the Free Software Foundation, either version 3 of the License, or
#(at your option) any later version.
#
#This program is distributed in the hope that it will be useful, but
#WITHOUT ANY WARRANTY; without even the implied warranty of
#MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
#General Public License for more details.
#
#You should have received a copy of the GNU General Public License along
#with this program. If not, see http://www.gnu.org/licenses/.

"""
This script will parse out all the links in an html document and write
them to a textfile.
"""
import sys,optparse
import htmllib,formatter

#program class declarations:
class Links(htmllib.HTMLParser):
    def __init__(self,formatter):
        htmllib.HTMLParser.__init__(self, formatter)
        self.links=[]
    def start_a(self, attrs):
        for a in attrs:
            if a[0]=="href":
                self.links.append(a[1])
                print a[1]
                break

def main(argv):
    if (len(argv)!=3):
        print("Error:\n"+argv[0]+" <input> <output>.\nParses <input> for all links and saves them to <output>.")
        return 1
    lcount=0
    format=formatter.NullFormatter()
    html=Links(format)
    print "Retrieving data:"
    page=open(argv[1],"r")
    print "Feeding data to parser:"
    html.feed(page.read())
    page.close()
    print "Writing links:"
    output=open(argv[2],"w")
    for i in html.links:
        output.write(i+"\n")
        lcount+=1
    output.close()
    print("Wrote "+str(lcount)+" links to "+argv[2]+".")
    print("done.")

if (__name__ == "__main__"):
    #we call the main function passing a list of args, and exit with
    #the return code passed back.
    sys.exit(main(sys.argv))
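For the second part of the question (only keeping links from a certain site), one option is to filter the collected hrefs by their host with `urlsplit`. A minimal sketch, assuming a list of absolute URLs like the `html.links` list the script above builds; `links_for_site` and the sample URLs are made up for illustration:

```python
# Keep only the links whose host is `site` or a subdomain of it.
# Relative hrefs have an empty netloc, so they are filtered out here.
try:
    from urlparse import urlsplit        # Python 2
except ImportError:
    from urllib.parse import urlsplit    # Python 3

def links_for_site(links, site):
    matches = []
    for link in links:
        host = urlsplit(link).netloc
        if host == site or host.endswith("." + site):
            matches.append(link)
    return matches

# Hypothetical sample data standing in for html.links:
found = links_for_site(["http://www.google.com/imghp",
                        "http://maps.google.com/maps",
                        "http://example.org/other"], "google.com")
print(found)
```

The `endswith("." + site)` check matches subdomains like www.google.com without also matching look-alikes such as notgoogle.com.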

--

Thanks,
Ty

 
Jon Clements
12-21-2010
On Dec 20, 7:14 pm, "Littlefield, Tyler" <(E-Mail Removed)> wrote:
> Hello all,
> I have a question. I guess this worked pre 2.6; I don't remember the
> last time I used it, but it was a while ago, and now it's failing.
> Anyone mind looking at it and telling me what's going wrong? Also, is
> there a quick way to match on a certain site? like links from google.com
> and only output those?
> [code snipped]
>
> --
>
> Thanks,
> Ty


This doesn't answer your original question, but excluding the command
line handling, how does this do for you?:

import lxml.html
from urlparse import urlsplit

doc = lxml.html.parse('http://www.google.com')
print map(urlsplit, doc.xpath('//a/@href'))

[SplitResult(scheme='http', netloc='www.google.co.uk', path='/imghp',
query='hl=en&tab=wi', fragment=''), SplitResult(scheme='http',
netloc='video.google.co.uk', path='/', query='hl=en&tab=wv',
fragment=''), SplitResult(scheme='http', netloc='maps.google.co.uk',
path='/maps', query='hl=en&tab=wl', fragment=''),
SplitResult(scheme='http', netloc='news.google.co.uk', path='/nwshp',
query='hl=en&tab=wn', fragment=''), ...]

Much nicer IMHO, plus lxml.html has iterlinks() and other
convenience functions for handling HTML.
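To tie it back to the original question, here's a sketch of iterlinks() on a small inline document; the snippet and its hosts are invented for illustration, and lxml is a third-party package that has to be installed separately:

```python
import lxml.html

# A made-up document standing in for a fetched page:
snippet = """
<html><body>
  <a href="http://www.google.com/imghp">Images</a>
  <a href="http://maps.google.com/maps">Maps</a>
  <img src="/logo.png">
</body></html>
"""

doc = lxml.html.fromstring(snippet)
# iterlinks() yields (element, attribute, link, pos) for every
# link-carrying attribute (href, src, ...), not just <a href>:
for element, attribute, link, pos in doc.iterlinks():
    print("%s %s -> %s" % (element.tag, attribute, link))
```

Note that iterlinks() picks up img src and similar attributes too, so filter on `attribute == "href"` if you only want anchors.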

hth

Jon.

 
Colin J. Williams
12-22-2010
On 21-Dec-10 12:22 PM, Jon Clements wrote:
> import lxml.html
> from urlparse import urlsplit
>
> doc = lxml.html.parse('http://www.google.com')
> print map(urlsplit, doc.xpath('//a/@href'))
>
> [SplitResult(scheme='http', netloc='www.google.co.uk', path='/imghp',
> query='hl=en&tab=wi', fragment=''), SplitResult(scheme='http',
> netloc='video.google.co.uk', path='/', query='hl=en&tab=wv',
> fragment=''), SplitResult(scheme='http', netloc='maps.google.co.uk',
> path='/maps', query='hl=en&tab=wl', fragment=''),
> SplitResult(scheme='http', netloc='news.google.co.uk', path='/nwshp',
> query='hl=en&tab=wn', fragment=''), ...]


Jon,

What version of Python was used to run this?

Colin W.
 
 
Jon Clements
12-22-2010
On Dec 22, 4:24 pm, "Colin J. Williams" <(E-Mail Removed)> wrote:
> On 21-Dec-10 12:22 PM, Jon Clements wrote:
>
> > [code snipped]
>
> Jon,
>
> What version of Python was used to run this?
>
> Colin W.


2.6.5 - the lxml library is not a standard module though and needs to
be installed.
 