On 2008-07-15, Eric wrote:
> Harlan Messinger wrote:
>
>> Eric wrote:
>>> I'm using a script to download Intel architecture specs every couple of
>>> months so I'll always have the current docs.
>>> Go here:
>>> http://www.intel.com/products/processor/manuals/
>>> scroll down to the bottom of the page, the link for
>>> "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
>>> points to "http://www.intel.com/design/processor/manuals/248966.pdf"
>>> which I can download by clicking on the link, but in my script using wget,
>>> I always get a 403 error - why?
>>>
>>> Example:
>>> wget http://www.intel.com/design/processo...als/248966.pdf -O
>>> Intel_64_and_IA-32_Architectures_Optimization_Reference_Manual_248966.pdf
>>> --10:10:04-- http://www.intel.com/design/processo...als/248966.pdf
>>> =>
>>> `Intel_64_and_IA-32_Architectures_Optimization_Reference_Manual_248966.pdf'
>>> Resolving www.intel.com... 64.209.118.114, 64.209.118.105
>>> Connecting to www.intel.com|64.209.118.114|:80... connected.
>>> HTTP request sent, awaiting response... 403 Forbidden
>>> 10:10:04 ERROR 403: Forbidden.
>>
>> I'm guessing that they filter out requests from automated or automatable
>> tools like wget, googlebot, linkchecker, and so on to conserve on server
>> load and bandwidth.
>
> I wonder how they distinguish wget from a real browser, or if I can get around
> it.

You can change the user agent string:

    -U agent-string
    --user-agent=agent-string
        Identify as agent-string to the HTTP server.

        The HTTP protocol allows the clients to identify themselves using a
        "User-Agent" header field. This enables distinguishing the WWW
        software, usually for statistical purposes or for tracing of
        protocol violations. Wget normally identifies as Wget/version,
        version being the current version number of Wget.

        However, some sites have been known to impose the policy of
        tailoring the output according to the "User-Agent"-supplied
        information. While this is not such a bad idea in theory, it has
        been abused by servers denying information to clients other than
        (historically) Netscape or, more frequently, Microsoft Internet
        Explorer. This option allows you to change the "User-Agent" line
        issued by Wget. Use of this option is discouraged, unless you
        really know what you are doing.

        Specifying empty user agent with --user-agent="" instructs Wget not
        to send the "User-Agent" header in HTTP requests.
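
For a script, something along these lines might do. It is only a sketch: the UA value is a made-up browser-like string (an assumption, not something the server is known to accept), fetch_with_agent is a hypothetical helper name, and there is no guarantee the 403 is actually caused by the User-Agent header. The URL and output file name in the usage comment are the ones from the original post.

```shell
#!/bin/sh
# Sketch: download a file while identifying as a browser instead of Wget/version.

# Illustrative browser-like string (an assumption); adjust as needed.
UA="Mozilla/5.0 (X11; Linux x86_64)"

# WGET can be overridden (e.g. WGET=echo) to preview the command
# without touching the network.
WGET="${WGET:-wget}"

# fetch_with_agent URL OUTFILE
# Runs wget with the custom User-Agent and writes the result to OUTFILE.
fetch_with_agent() {
    "$WGET" --user-agent="$UA" -O "$2" "$1"
}

# Usage (left commented out so that sourcing this file downloads nothing):
# fetch_with_agent "http://www.intel.com/design/processor/manuals/248966.pdf" \
#     "Intel_64_and_IA-32_Architectures_Optimization_Reference_Manual_248966.pdf"
```

Setting WGET=echo before calling the function prints the arguments that would be passed to wget, which is a cheap way to check the quoting before running it for real.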
--
Chris F.A. Johnson, webmaster <http://Woodbine-Gerrard.com>
===================================================================
Author:
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)