Velocity Reviews - Computer Hardware Reviews

Re: funny (strange behavior)link

 
 
Harlan Messinger
 
      07-15-2008
Eric wrote:
> I'm using a script to download Intel architecture specs every couple of months
> so I'll always have the current docs.
> Go here:
> http://www.intel.com/products/processor/manuals/
> Scroll down to the bottom of the page; the link for
> "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
> points to "http://www.intel.com/design/processor/manuals/248966.pdf",
> which I can download by clicking on the link, but in my script using wget
> I always get a 403 error. Why?
>
> Example:
> wget http://www.intel.com/design/processo...als/248966.pdf -O
> Intel_64_and_IA-32_Architectures_Optimization_Reference_Manual_248966.pdf
> --10:10:04-- http://www.intel.com/design/processo...als/248966.pdf
> =>
> `Intel_64_and_IA-32_Architectures_Optimization_Reference_Manual_248966.pdf'
> Resolving www.intel.com... 64.209.118.114, 64.209.118.105
> Connecting to www.intel.com|64.209.118.114|:80... connected.
> HTTP request sent, awaiting response... 403 Forbidden
> 10:10:04 ERROR 403: Forbidden.


I'm guessing that they filter out requests from automated or automatable
tools like wget, googlebot, linkchecker, and so on to conserve on server
load and bandwidth.
 
 
Chris F.A. Johnson
 
      07-15-2008
On 2008-07-15, Eric wrote:
> Harlan Messinger wrote:
>
>> I'm guessing that they filter out requests from automated or automatable
>> tools like wget, googlebot, linkchecker, and so on to conserve on server
>> load and bandwidth.

>
> I wonder how they distinguish wget from a real browser, or if I can get around
> it.


You can change the user agent string:

-U agent-string
--user-agent=agent-string
Identify as agent-string to the HTTP server.

The HTTP protocol allows the clients to identify themselves using a
"User-Agent" header field. This enables distinguishing the WWW
software, usually for statistical purposes or for tracing of protocol
violations. Wget normally identifies as Wget/version, version
being the current version number of Wget.

However, some sites have been known to impose the policy of tailoring
the output according to the "User-Agent"-supplied information.
While this is not such a bad idea in theory, it has been abused by
servers denying information to clients other than (historically)
Netscape or, more frequently, Microsoft Internet Explorer. This
option allows you to change the "User-Agent" line issued by Wget.
Use of this option is discouraged, unless you really know what you
are doing.

Specifying empty user agent with --user-agent="" instructs Wget not
to send the "User-Agent" header in HTTP requests.
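
Applied to the download in question, a minimal sketch might look like the following. The user-agent string, the fetch_as_browser function name, and the DRY_RUN switch are illustrative assumptions, not part of wget, and nothing in this thread confirms that Intel's server will accept the request once the header is changed:

```shell
#!/bin/sh
# Sketch only: the UA value and the names below are illustrative assumptions.

# A browser-like identifier. Any string can be sent; this one is just an example.
UA='Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'

# fetch_as_browser URL OUTFILE
# Downloads URL to OUTFILE, sending $UA instead of the default Wget/version.
# With DRY_RUN=1 it only prints the command it would run.
fetch_as_browser() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf 'wget --user-agent=%s -O %s %s\n' "$UA" "$2" "$1"
    else
        wget --user-agent="$UA" -O "$2" "$1"
    fi
}
```

For example: fetch_as_browser http://www.intel.com/design/processor/manuals/248966.pdf 248966.pdf
If the server still answers 403, it is presumably filtering on something other than the User-Agent header.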



--
Chris F.A. Johnson, webmaster <http://Woodbine-Gerrard.com>
===================================================================
Author:
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
 
 
mynameisnobodyodyssea@googlemail.com
 
      07-17-2008
On Jul 15, 11:32 pm, Eric <Scor...@gordinator.org> wrote:
> Harlan Messinger wrote:
> > I'm guessing that they filter out requests from automated or automatable
> > tools like wget, googlebot, linkchecker, and so on to conserve on server
> > load and bandwidth.

>
> I wonder how they distinguish wget from a real browser, or if I can get around
> it.


Yes, they can see from the User-Agent header that the request does not
come from a browser, and they can block wget or any other type of HTTP
request that looks like it comes from a bot.
But did you try contacting the intel.com site, in case they offer an API
or RSS feeds that can be downloaded with wget or other automated
HTTP requests?
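
To make the first point concrete, here is a rough sketch of the kind of check a server could run on the User-Agent value. The classify_agent name and the patterns are purely hypothetical; nobody in this thread knows what rules intel.com actually applies:

```shell
#!/bin/sh
# Hypothetical server-side filter; the patterns are illustrative, not Intel's rules.

# classify_agent UA
# Prints "blocked" for an empty value or one that looks like a known tool
# or crawler, and "allowed" for anything else.
classify_agent() {
    case "$1" in
        Wget/*|curl/*|*[Bb]ot*|"") echo blocked ;;
        *) echo allowed ;;
    esac
}
```

A filter like this sees only the string the client chooses to send, which is why changing the user agent can get past it.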

 