Velocity Reviews


S Borg 01-19-2006 05:18 AM

web crawling.
 

Hello,

I have been writing very simple Python programs that parse HTML and
such, mainly just to get a better feel for the language. Here is my
question: If I parsed an HTML page into all of the image files listed
on that page, how could I request all of those images and download
them into some specified folder? I am sure this is quite easy, but I
am stuck.

Thank you very much.
Burgeoning Pythonista


Alex Martelli 01-19-2006 07:07 AM

Re: web crawling.
 
S Borg <spwpreston@gmail.com> wrote:

> Hello,
>
> I have been writing very simple Python programs that parse HTML and
> such, mainly just to get
> a better feel for the language. Here is my question: If I parsed an
> HTML page into all of the image
> files listed on that page, how could I request all of those images and
> download them into some specified folder? I am sure this is quite easy,
> but I am stuck.


There's a good crawler in the Demo directory of the Python source
distribution, so download and unpack said sources and look there.


Alex

gene tani 01-19-2006 07:57 AM

Re: web crawling.
 

S Borg wrote:
> Hello,
>
> I have been writing very simple Python programs that parse HTML and
> such, mainly just to get
> a better feel for the language. Here is my question: If I parsed an
> HTML page into all of the image
> files listed on that page, how could I request all of those images and
> download them into some specified folder? I am sure this is quite easy,
> but I am stuck.
>
> Thank you very much.
> Burgeoning Pythonista


http://sig.levillage.org/?p=588


Fuzzyman 01-19-2006 09:40 AM

Re: web crawling.
 
Use BeautifulSoup to get all the image tags out of the HTML.

You'll need to join the URLs of the images to the URL of the page
(urlparse.urljoin, off the top of my head). If you look at the
BeautifulSoup documentation you will see how to get the 'src'
reference of each image tag.
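
A rough sketch of those two steps plus the download, in case it helps.
It makes a few assumptions: it targets Python 3, where urljoin lives in
urllib.parse rather than the urlparse module, and it swaps in the
standard library's HTMLParser for BeautifulSoup so that it runs without
installing anything. The names ImgSrcCollector, image_urls and
download_images are made up for this example.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


class ImgSrcCollector(HTMLParser):
    """Collect the 'src' attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; skip <img> without src.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(html, page_url):
    """Return absolute URLs for all images referenced in the HTML text."""
    collector = ImgSrcCollector()
    collector.feed(html)
    # urljoin resolves relative references against the page's own URL
    # and leaves already-absolute URLs unchanged.
    return [urljoin(page_url, src) for src in collector.sources]


def download_images(urls, folder):
    """Fetch each URL into folder and return the list of saved paths.

    Files with the same base name overwrite each other; a real crawler
    would want to handle that, plus errors and timeouts.
    """
    os.makedirs(folder, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        # Fall back to a generated name if the URL path has no file part.
        name = os.path.basename(urlparse(url).path) or "image_%d" % index
        target = os.path.join(folder, name)
        urlretrieve(url, target)
        saved.append(target)
    return saved
```

With the page text in hand you would call something like
download_images(image_urls(html, page_url), "images"), where "images"
is whatever folder you want the files to land in.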

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml


John M. Gabriele 01-20-2006 04:44 AM

Re: web crawling.
 
Alex Martelli wrote:
> S Borg <spwpreston@gmail.com> wrote:
>
>
>> Hello,
>>
>> I have been writing very simple Python programs that parse HTML and
>> such, mainly just to get
>> a better feel for the language. Here is my question: If I parsed an
>> HTML page into all of the image
>> files listed on that page, how could I request all of those images and
>> download them into some specified folder? I am sure this is quite easy,
>> but I am stuck.

>
>
> There's a good crawler in the Demo directory of the Python source
> distribution, so download and unpack said sources and look there.
>
>
> Alex


Hm. Looks like that's:

Python-2.4.2/Tools/webchecker

See 'pydoc ./webchecker.py' for more info.

---J



