web crawling.

 
 
S Borg
01-19-2006

Hello,

I have been writing very simple Python programs that parse HTML and
such, mainly just to get a better feel for the language. Here is my
question: If I parsed an HTML page into all of the image files listed
on that page, how could I request all of those images and download them
into some specified folder? I am sure this is quite easy, but I am
stuck.

Thank you very much.
Burgeoning Pythonista

 
Alex Martelli
01-19-2006

S Borg wrote:

> If I parsed an HTML page into all of the image files listed on that
> page, how could I request all of those images and download them into
> some specified folder? I am sure this is quite easy, but I am stuck.


There's a good crawler in the Demo directory of the Python source
distribution, so download and unpack said sources and look there.


Alex
 
gene tani
01-19-2006

S Borg wrote:

> If I parsed an HTML page into all of the image files listed on that
> page, how could I request all of those images and download them into
> some specified folder? I am sure this is quite easy, but I am stuck.


http://sig.levillage.org/?p=588

 
Fuzzyman
01-19-2006

Use BeautifulSoup to get all the image tags out of the HTML.

You'll need to join the URLs of the images to the URL of the page
(urlparse.urljoin, off the top of my head). If you look at the
BeautifulSoup documentation, you will see how to get the 'src'
reference of each image tag.
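
A minimal sketch of the steps described above, under a few assumptions: it targets Python 3, where urlparse.urljoin lives at urllib.parse.urljoin, and it swaps in the standard library's html.parser for BeautifulSoup so that nothing needs to be installed. The names ImageSrcParser, extract_image_urls, local_name and download_images are made up for this illustration and do not come from the thread.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; <img .../> also ends up here.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_urls(html, page_url):
    """Return absolute image URLs, resolving each src against page_url."""
    parser = ImageSrcParser()
    parser.feed(html)
    parser.close()
    return [urljoin(page_url, src) for src in parser.sources]


def local_name(image_url, index):
    """Pick a file name from the URL path, with a fallback if it is empty."""
    name = os.path.basename(urlsplit(image_url).path)
    return name or "image_%d" % index


def download_images(image_urls, folder):
    """Fetch each URL and save it in folder; returns the paths written.

    Images that share a file name overwrite each other in this sketch.
    """
    os.makedirs(folder, exist_ok=True)
    saved = []
    for index, url in enumerate(image_urls):
        target = os.path.join(folder, local_name(url, index))
        with urlopen(url) as response, open(target, "wb") as out:
            out.write(response.read())
        saved.append(target)
    return saved
```

Given the page source in a string html fetched from page_url, something like download_images(extract_image_urls(html, page_url), "images") would save the files into an images folder. Error handling, timeouts and politeness toward the server are left out on purpose.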

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

 
John M. Gabriele
01-20-2006

Alex Martelli wrote:

> There's a good crawler in the Demo directory of the Python source
> distribution, so download and unpack said sources and look there.


Hm. Looks like that's:

Python-2.4.2/Tools/webchecker

See 'pydoc ./webchecker.py' for more info.

---J


--
(remove zeez if demunging email address)
 