Velocity Reviews - Computer Hardware Reviews

Web get command (wget) to download all icons/pics on a web page (too large or too small)

 
 
barb (Guest)
08-04-2006
How do I get wget, on Windows or Linux, to skip files that are too small (or too large)?

Like everyone, I often use the GNU wget command (from the Free Software
Foundation) on Windows or Linux to download all the PDFs, GIFs, or JPEGs on a
web site to my hard disk.

The basic command we all use is:

EXAMPLE FOR WINDOWS:
c:\> wget -prA.gif http://machine/path

EXAMPLE FOR LINUX:
% wget -prA.jpg http://machine/path

This famous wget command works great, except it downloads ALL the JPG & GIF
icons and photos at the targeted web site - large or small.

How do we tell wget to skip files of a certain size?

For example, assume we wish to skip anything smaller than, say, 10 KB and
anything larger than, say, 100 KB.

Can we get wget to skip files that are too small or too large?

barb


 
barb (Guest)
08-04-2006
On Fri, 04 Aug 2006 12:00:45 -0400, Marvin wrote:
>> % wget -prA.jpg http://machine/path
>> Can we get wget to skip files that are too small or too large?

> I don't know how to do that, but it would be easy to erase
> all the small files. When the images have been downloaded
> to a directory, sort the directory by file size and erase
> those below the minimum size you want to keep.


Hi Marvin,

Thank you for your help. I thought of this but I was kind of hoping that
wget would have a "size" range option that handled this.

Something like:

wget -prA.pdf http://www.consumerreports.com --size<min:max>

What I do today is sort by file size and then delete the too-large files
and the too-small files but that is obviously not optimal.
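That sort-and-delete step could at least be scripted. Here is a sketch of the cleanup (assuming GNU find is available; the demo files below just stand in for a real download directory, and the 10 KB / 100 KB bounds are the example values from above):

```shell
# Create a throwaway directory with files of known sizes, standing in
# for the result of a wget run (names and sizes are arbitrary examples).
mkdir -p demo
head -c 5000   /dev/zero > demo/tiny.gif   # ~5 KB  -- below the minimum
head -c 50000  /dev/zero > demo/keep.jpg   # ~50 KB -- inside the range
head -c 200000 /dev/zero > demo/huge.jpg   # ~200 KB -- above the maximum

# Delete everything under 10 KB or over 100 KB in one pass.
find demo -type f \( -size -10k -o -size +100k \) -delete

ls demo   # prints: keep.jpg
```

Wired into the real workflow, that one find line would simply run right after the wget command finishes, instead of sorting and deleting by hand.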

barb
 
barb (Guest)
08-04-2006
On Fri, 04 Aug 2006 11:28:07 -0500, Dances With Crows wrote:

> If you really want this, I think you're going to have to hack
> wget so that it takes another option, --size-range or something.
> Then wget would have to parse the server's 200 responses and either
> halt the download if the 200 said the file wasn't in --size-range,
> or unlink() the file after it finished. The exact approach you'd
> take depends on the wget code itself, and your level of C skill.


Hi Dances with Crows,

Thank you for your kind help. As you surmised, I do not have the skill set
to "hack" the venerable wget command so that it selects to download only
files of a certain range in size.

I had also read the man page and searched beforehand, but I did not see that
anyone had done this yet. I am kind of surprised, since it's one of the most
basic things you'd want to do.

For example, let's say we went to a free icon site, and let's say they
updated that site periodically with little web-page bitmaps, larger icons
usable for PowerPoint slides, and too-big images suitable for photo work.

Let's say you had a scheduled wget go to that site daily and download all
the icons automatically from that http web page but not the large ones or
the really really small ones. Let's say there were thousands of these. Of
course, ftp would be a pain. You likely wouldn't even have FTP access
anyway. And, downloading them manually isn't in the cards.

What I'd want to schedule is:
wget -prA.gif,jpg,bmp http://that/freeware/icon/web/page --size:<low:high>
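Until wget grows an option like that (the --size:<low:high> switch above is wishful thinking, not a real flag), the closest thing to this scheduled job is a cron entry that runs wget and then prunes by size afterwards. A sketch, where the path, URL, and the 10 KB / 100 KB bounds are placeholders:

```shell
# crontab entry (sketch): every day at 03:00, fetch the icons, then
# delete anything under 10 KB or over 100 KB from the download tree.
0 3 * * * cd /home/barb/icons && wget -q -prA.gif,jpg,bmp http://that/freeware/icon/web/page && find . -type f \( -size -10k -o -size +100k \) -delete
```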

barb
 
barb (Guest)
08-04-2006
On 4 Aug 2006 10:43:53 -0700, poddys wrote:

> I'm just wondering why you need to do this... You might be getting
> into copyright issues here....


Hi poddys,

Thank you very much for asking the right questions. Let's say I went to
http://www.freeimages.co.uk or http://www.bigfoto.com or
http://www.freefoto.com/index.jsp or any of a zillion sites which supply
royalty free images or GIFs or bitmaps or PDFs or HTML files etc.

Why wouldn't I want to use wget to obtain all the images, PDFs, Word
documents, PowerPoint templates, whatever ... that this site offers?

Even for sites I PAY for, such as Consumer Reports and technical data sites ...
why wouldn't I want to just use wget to download every single PDF,
Microsoft Office document, or graphic at that web site?

There's no copyright infringement in that is there?

I can do all that today with wget.
The only problem I have is that the really large files get downloaded too,
and the really small ones seem to be useless clutter.

barb
 
barb (Guest)
08-04-2006
On Fri, 04 Aug 2006 13:34:13 -0500, Dances With Crows wrote:

> barb never stated what barb was doing with the images. It's a legit and
> semi-interesting question, though, regardless of what the final purpose
> is. Too bad there's nothing in wget that does what barb wants. barb
> will have to either hack wget or write a small script to remove all
> files between sizes X and Y after wget's finished.


Hi Dances with crows,

I don't know yet what I want to do with the images or PDFs or PowerPoint
templates. For example, recently I found a page of royalty-free PowerPoint
calendar templates. The web page had scores and scores of them.

Nobody in their right mind is going to click on a link-by-link basis when
they can run a simple wget command and get them all in one fell swoop (are
they?)

wget -prA.ppt http://that/web/page

My older brother pointed me to one of his Yahoo web pages, which contained
hundreds of photos. I picked them all up in seconds using:
wget -prA.jpg http://that/web/page

I wouldn't THINK of downloading a hundred photos manually (would you?).

Do people REALLY download documents MANUALLY nowadays? Oh my. They're crazy
in my opinion (although I did write and file this letter manually myself).

barb
 
barb (Guest)
08-04-2006
On Fri, 04 Aug 2006 18:51:11 GMT, Ben Dover wrote:

> You could probably write a script or batchfile to process the results of
> the wget download based on filesize.


Hi Ben Dover,
Thank you very much for your kind advice.

I am not a programmer, but I guess it could look like this (in DOS)?

REM wget.bat
wget -prA.ppt,jpg,doc,pdf,gif http://some/web/page
REM delete anything smaller than 10 KB or larger than 100 KB
for %%f in (*) do (
    if %%~zf LSS 10240 del "%%f"
    if %%~zf GTR 102400 del "%%f"
)

And, in linux, maybe something like this (I found on the web):

# wget.sh
wget -prA.ppt,jpg,doc,pdf,gif http://some/web/page
# delete anything smaller than 10 KB or larger than 100 KB
for file in *; do
    size=$(wc -c < "$file")
    if [ "$size" -lt 10000 ] || [ "$size" -gt 100000 ]; then
        rm "$file"
    fi
done

Is this a good start? (Which newsgroup should we ask?)
barb
 
