Web get command (wget) to download all icons/pics on a web page (skipping too-large or too-small files)

Discussion in 'Digital Photography' started by barb, Aug 4, 2006.

  1. barb

    barb Guest

    How do I get the Windows/Linux web get (wget) command to ignore all files that are too small or too large?

    Like everyone, I often use the Free Software Foundation's web get (wget)
    command on Windows and Linux to download all the PDFs, GIFs, or JPEGs on a
    web site to my hard disk.

    The basic command we all use is:

    EXAMPLE FOR WINDOWS:
    c:\> wget -prA.gif http://machine/path

    EXAMPLE FOR LINUX:
    % wget -prA.jpg http://machine/path

    This famous wget command works great, except it downloads ALL the JPG & GIF
    icons and photos at the targeted web site - large or small.

    How do we tell wget to skip files of a certain size?

    For example, assume we wish to skip anything smaller than, say, 10KB and
    anything larger than, say, 100KB.

    Can we get wget to skip files that are too small or too large?

    barb
     
    barb, Aug 4, 2006
    #1

  2. barb

    barb Guest

    On Fri, 04 Aug 2006 12:00:45 -0400, Marvin wrote:
    >> % wget -prA.jpg http://machine/path
    >> Can we get wget to skip files that are too small or too large?

    > I don't know how to do that, but it would be easy to erase
    > all the small files. When the images have been downloaded
    > to a directory, sort the directory by file size and erase
    > those below the minimum size you want to keep.


    Hi Marvin,

    Thank you for your help. I thought of this but I was kind of hoping that
    wget would have a "size" range option that handled this.

    Something like:

    wget -prA.pdf http://www.consumerreports.com --size<min:max>

    What I do today is sort by file size and then delete the too-large files
    and the too-small files, but that is obviously not optimal.
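
    I suppose the deleting part could at least be scripted. If I have the
    syntax right, something like this (assuming GNU find; I have not tested
    it) would remove everything under 10KB or over 100KB after the download:

    find . -type f \( -size -10240c -o -size +102400c \) -delete

    But I was really hoping wget could skip those files before downloading
    them, since the too-large ones waste a lot of bandwidth.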

    barb
     
    barb, Aug 4, 2006
    #2

  3. barb

    barb Guest

    On Fri, 04 Aug 2006 11:28:07 -0500, Dances With Crows wrote:

    > I think if you really want this, you're going to have to hack
    > wget such that it takes another option, --size-range or something.
    > Then wget would have to parse the server's 200 responses and either
    > halt the download if the 200 said the file wasn't in --size-range,
    > or unlink() the file after it finished. The exact approach you'd
    > take depends on the wget code itself, and your level of C skill.


    Hi Dances With Crows,

    Thank you for your kind help. As you surmised, I do not have the skill set
    to "hack" the venerable wget command so that it downloads only files
    within a certain size range.

    I had also read the man page and searched beforehand, but I did not see
    that anyone had done this yet. I am kind of surprised, since it seems like
    such a basic thing to want.

    For example, let's say we went to a free icon site that is updated
    periodically with little web-page bitmaps, better icons usable for
    PowerPoint slides, and too-big images suitable for photo sessions.

    Let's say you had a scheduled wget job go to that site daily and download
    all the icons automatically from that web page, but not the large ones or
    the really small ones. Let's say there were thousands of these. Of course,
    FTP would be a pain; you likely wouldn't even have FTP access anyway. And
    downloading them manually isn't in the cards.

    What I'd want to schedule is:
    wget -prA.gif,jpg,bmp http://that/freeware/icon/web/page --size:<low:high>
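
    Until an option like that exists, maybe I could approximate the "halt the
    download" idea you described by asking the server for the size up front.
    The man page mentions --spider and -S (--server-response), which print the
    response headers without saving the body, so perhaps a rough sketch like
    this would work for a plain list of URLs (urls.txt is just a made-up name
    here, it is not a recursive crawl, and it only helps when the server
    actually reports a Content-Length):

    # keep only URLs whose reported size is between 10 KB and 100 KB
    while read -r url; do
        size=$(wget --spider -S "$url" 2>&1 | tr -d '\r' |
               awk 'tolower($1) == "content-length:" {print $2; exit}')
        if [ -n "$size" ] && [ "$size" -ge 10240 ] && [ "$size" -le 102400 ]; then
            wget "$url"
        fi
    done < urls.txt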

    barb
     
    barb, Aug 4, 2006
    #3
  4. barb

    barb Guest

    On 4 Aug 2006 10:43:53 -0700, poddys wrote:

    > I'm just wondering why you need to do this... You might be getting
    > into copyright issues here....


    Hi poddys,

    Thank you very much for asking the right questions. Let's say I went to
    http://www.freeimages.co.uk or http://www.bigfoto.com or
    http://www.freefoto.com/index.jsp or any of a zillion sites which supply
    royalty free images or GIFs or bitmaps or PDFs or HTML files etc.

    Why wouldn't I want to use wget to obtain all the images, PDFs, Word
    documents, PowerPoint templates, whatever ... that this site offers?

    Even for sites I PAY for, such as Consumer Reports and technical data
    sites ... why wouldn't I want to just use wget to download every single
    PDF or Microsoft Office document or graphic at that web site?

    There's no copyright infringement in that, is there?

    I can do all that today with wget.
    The only problem I have is that the really large (too large) files get
    downloaded too, and the really small (too small) files are just useless
    clutter.

    barb
     
    barb, Aug 4, 2006
    #4
  5. barb

    barb Guest

    On Fri, 04 Aug 2006 13:34:13 -0500, Dances With Crows wrote:

    > barb never stated what barb was doing with the images. It's a legit and
    > semi-interesting question, though, regardless of what the final purpose
    > is. Too bad there's nothing in wget that does what barb wants. barb
    > will have to either hack wget or write a small script to remove all
    > files outside the size range X to Y after wget's finished.


    Hi Dances With Crows,

    I don't know yet what I want to do with the images or PDFs or PowerPoint
    templates. For example, I recently found a page of royalty-free PowerPoint
    calendar templates. The web page had scores and scores of them.

    Nobody in their right mind is going to click link by link when they can
    run a simple wget command and get them all in one fell swoop (are they?)

    wget -prA.ppt http://that/web/page

    My older brother pointed me to one of his Yahoo web pages, which contained
    hundreds of photos. I picked them all up in seconds using:
    wget -prA.jpg http://that/web/page

    I wouldn't THINK of downloading a hundred photos manually (would you?).

    Do people REALLY download documents MANUALLY nowadays? Oh my. They're crazy
    in my opinion (although I did write and file this letter manually myself
    :p)

    barb
     
    barb, Aug 4, 2006
    #5
  6. barb

    barb Guest

    On Fri, 04 Aug 2006 18:51:11 GMT, Ben Dover wrote:

    > You could probably write a script or batchfile to process the results of
    > the wget download based on filesize.


    Hi Ben Dover,
    Thank you very much for your kind advice.

    I am not a programmer, but I guess it could look like this (in DOS)?

    REM wget.bat -- download, then delete anything under 10 KB or over 100 KB
    wget -prA.ppt,jpg,doc,pdf,gif http://some/web/page
    REM %%~zF expands to the size of each downloaded file in bytes
    for /R %%F in (*.ppt *.jpg *.doc *.pdf *.gif) do (
        if %%~zF LSS 10240 del "%%F"
        if %%~zF GTR 102400 del "%%F"
    )

    And, in Linux, maybe something like this (adapted from a script I found on
    the web):

    # wget.sh -- download, then delete anything under 10 KB or over 100 KB
    wget -prA.ppt,jpg,doc,pdf,gif http://some/web/page
    find . -type f | while read -r file; do
        size=$(wc -c < "$file")    # size of the file in bytes
        if [ "$size" -lt 10240 ] || [ "$size" -gt 102400 ]; then
            rm "$file"
        fi
    done

    Is this a good start? (Which newsgroup could we ask?)
    barb
     
    barb, Aug 4, 2006
    #6
