Velocity Reviews - Computer Hardware Reviews



Automating Searches

 
 
Chris Uppal
 
      01-04-2007
John Ersatznom wrote:

> Add in a deliberate request of the
> front page before doing the search query, some random delays, and a
> spoofed user-agent, and I'm guessing the only way Google could figure
> out you weren't just a surfer using Mozilla 4.0 (compatible; MSIE 4.0)
> would be by using a tool like EtherSniffer to analyze your incoming
> requests and discovering that Java sends the HTTP headers in an
> idiosyncratic sequence. And they won't do that unless your IP generates
> an eyebrow-raising amount of traffic.
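(For concreteness, the technique John describes would look something like the sketch below in plain `java.net` code; the user-agent string is the one from his post, and the delay bounds are purely illustrative, not values known to evade anything:)

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Random;

public class SpoofedFetch {
    // Build a Google query URL with the search terms properly encoded.
    static String buildSearchUrl(String query) {
        try {
            return "http://www.google.com/search?q="
                    + URLEncoder.encode(query, "UTF-8");
        } catch (java.io.UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    // A random delay between min (inclusive) and max (exclusive) ms.
    static long randomDelayMs(Random rnd, long min, long max) {
        return min + (long) (rnd.nextDouble() * (max - min));
    }

    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL(buildSearchUrl("java newsgroup")).openConnection();
        // Spoofed user-agent, per the post above.
        conn.setRequestProperty("User-Agent",
                "Mozilla/4.0 (compatible; MSIE 4.0)");
        // Random pause before the request, also per the post above.
        Thread.sleep(randomDelayMs(new Random(), 2000, 8000));
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        for (String line; (line = in.readLine()) != null; )
            System.out.println(line);
        in.close();
    }
}
```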


Google can and does have more intelligence than that.

The simplest thing to look for is the originating IP address of the request (at
the TCP/IP level). A suspicious pattern of requests from one IP (e.g. too many
in one time period), and Google will stop serving queries from that IP address.
(The originating IP /can/ be spoofed, but not many Java programmers will have
the necessary skills, and in any case it's hardly worth the effort.) That
criterion can also give false positives; for instance, when an organisation
works behind a NAT, one person from that organisation being detected abusing
Google's services means the entire organisation will be blocked. Does
Google care? Why should it?

Then, too, Google has available /all/ the data which enters its data-centres;
from low-level fingerprinting of IP packets, up through checking HTTP headers,
extending all the way to historical and cross-site access patterns (I would be
very surprised if they didn't use a custom TCP/IP stack implementation for
their HTTP servers). How much of that information it actually uses (or even
collects) I don't know -- but I'd guess that it collects most of it, and uses
as much as it feels it has to in order to prevent abuse.

And they do actively work to prevent abuse. There are many kinds of possible
abuse, and I imagine Google work to prevent most of them, but I doubt if there
are many things they dislike more than people attempting to steal their data.

-- chris



 
Andrew Thompson
 
      01-04-2007
nowwho wrote:
> Hey,
> Thanks for the information so far. I didn't realise there was so much
> legal stuff involved; it's for a once-off educational project.


You 'ivory tower' types are *so* naive. It's cute.

>...Didn't
> think it would amount to spamming.


I am not sure I would use that term for it.

Spamming is generally pushing an advertising
related message out to people who do not want it.

This (when done the 'wrong way') simply amounts
to a bit of theft of the resources of others.

& for my part, while I might hassle the thieves,
I'll bludgeon the spammers.

>...The program would only be run about
> 50 times in total.


I think you might be well placed to use the 'legal
and free' APIs currently offered! Surely even the
small number of queries Google offers for free
would cover your requirement?

(In any case, from what I understand, Google simply
refuses further requests for the day if the limit
is struck - no hard feelings, and back tomorrow..)
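(A client can also track that daily limit itself, so it stops cleanly rather than being refused; a trivial sketch, with the actual quota value left as a constructor parameter since it's whatever Google's terms say at the time:)

```java
public class QuotaGuard {
    private final int dailyLimit;
    private int used = 0;

    public QuotaGuard(int dailyLimit) {
        this.dailyLimit = dailyLimit;
    }

    // Returns true (and consumes one slot) if another query
    // may still be issued today; false once the limit is hit.
    public boolean tryQuery() {
        if (used >= dailyLimit) return false;
        used++;
        return true;
    }

    public int remaining() {
        return dailyLimit - used;
    }
}
```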

>...There is a set number of queries, and a set number
> of results returned. As it's an educational project I never thought of
> the legal side!


Don't forget that there can be a few 'legalities' to the
educational side of things. Be careful of tripping over
using someone else's code without proper attribution
or accreditation: plagiarism/academic misconduct.
There was a classic thread on these groups from
a chap by the name of RoboCop - he found
out the hard way.

Andrew T.

 
nowwho
 
      01-04-2007

Andrew Thompson wrote:
> I am not sure I would use that term for it.


Fair enough; computers and technology aren't my main area of study.


> I think you might be well placed to use the 'legal
> and free' API's currently offered! Surely even the
> small numbers of queries Google offers for free
> would cover your requirement?


More than likely, but I would still need advice on how to incorporate
these into a Java program.


> Don't forget the there can be a few 'legalities' to the
> educational side of things. Be careful of tripping over
> using someone elses code without proper attribution
> or accreditation.. Plagiarism/academic misconduct.
> There was a classic thread on these groups from
> a chap by the name of RoboCop - he got to find
> out the hard way.


The use of other people's code is allowed; however, ALL work and ALL
sources of information used in any way required for the project have to
be detailed, and we were well warned about the consequences of plagiarism.
All websites accessed for the project, along with any copyright date,
must be included, together with the date each website was accessed,
etc.

 
NoNickName
 
      01-05-2007
> > Andrew Thompson wrote:
> ..


> BTW - nice to see you 'about the place' again..


Thanks. Been busy with end of year deadlines recently. Should be around
a bit more often now though.

--
TechBookReport Java - http://www.techbookreport.com/JavaIndex.html

 
John Ersatznom
 
      01-05-2007
nowwho wrote:
> Hey,
> Thanks for the information so far. I didn't realise there was so much
> legal stuff involved; it's for a once-off educational project. Didn't
> think it would amount to spamming. The program would only be run about
> 50 times in total. There is a set number of queries, and a set number
> of results returned. As it's an educational project I never thought of
> the legal side!


It's not spamming -- I don't know what the other guy was smoking when he
wrote the post you're replying to. There is NO DIFFERENCE discernible to
Google if you

a) do 10 searches during the day by typing in a Firefox window while
doing research or
b) have your computer do the searches with less/no typing on your part

Google is being "ripped off" iff you do something like:

a) use huge amounts of their bandwidth -- well in excess of a normal
user doing a bit of heavy research, say, generating large numbers of
searches or delving very deeply into the result set. Fetching 10
first-pages-of-results, one for each of 10 queries, whether done by one
mouse click or ten typed-in queries, has little impact on them; if
anything, the one-mouse-click case makes it 10 queries instead of 11,
because you mistyped one and had to do it again
b) or use google search results to populate your own rival "search
engine" site with revenue-generating ads or what-have-you, either by
scraping google's database or by just putting up a page with a script
that takes peoples' queries and passes them to google, then takes the
result page and replaces google's sponsored links with umpteen flashing
banner ads. Then you're using google's work output to actually compete
against google, rather than simply using google for research. That makes
a crucial difference.

Using code to drive Google lightly and for personal/educational/research
reasons rather than commercial ones doesn't seem to be evil to me,
especially if they cannot in practice distinguish it from "normal" use
anyway, as it isn't producing excessive traffic or being used to compete
against google in some way.

In fact, where do you draw the line? Firefox with manually-typed queries
is OK. Then we have Firefox with an MRU for queries; Firefox with query
guessing or autocompletion based on your current activities and
interests; Firefox with a plugin to take the result set too and
transform it e.g. to show 50 rather than 10 hits or to weed out
"supplemental results" that are usually MFA sites that really ARE
ripping off google; Firefox with a plugin to run the query of your
choice and bookmark the results every few days; ... Firefox with a
plugin to gradually build up a database of hits for various queries by
occasionally fetching the nth page of results for one of them, but you
don't publish these anywhere, just use them personally ...

I think the two things that mark a transition to being evil are causing
them excessive traffic and competing with them using their own data in
some way. (Also generating content-free MFA pages to generate revenue
via AdSense ads and SEOing them, but that's more using AdSense than
using the search engine proper, though the SEO will impact the latter
and pollute the results.)

I don't see any way to derive some kind of moral law that makes typing
something morally superior to doing it with one click, and scheduling
an automatic (infrequent) job somehow sinful.
There's no inherent virtue in inefficiency, and computers exist to
enable automating tasks. Hyperlinks automate looking up and finding that
dusty reference or whatever; librarians may complain that they rot young
brains but the actual upshot is a gain in productivity, rather than some
kind of evil decadence setting in.

 
John Ersatznom
 
      01-05-2007
Chris Uppal wrote:
> And they do actively work to prevent abuse. There are many kinds of possible
> abuse, and I imagine Google work to prevent most of them, but I doubt if there
> are many things they dislike more than people attempting to steal their data.


All of this depends on what constitutes "stealing" their data. Copying
it and publishing it? Sort of -- it's some kind of infringement but not
really "theft".

Merely doing with one mouse click or zero what you'd do anyway with
twenty keypresses? I don't see how the amount of clacking emanating from
someone's workstation at location A is in any way relevant to Google as
long as a) a single user isn't suddenly hogging their resources and b)
the user is using the results "normally" rather than to compete with
Google or whatever.

The red flags that would make them look into their logfiles would be a)
excessive bandwidth use and b) a Google clone or whatever springing up
all of a sudden and competing for their revenue streams.

Personal use of the search results isn't anything they can fault. Nor
is however a person chooses to generate the requests (so long as they
aren't excessively frequent), or however they choose to filter and use
the results, so long as they don't use them commercially.

I see no logical reason for them to care whether the 3 requests a given
IP gave them in a given day came from 30 typed characters and 3 mouse
clicks, 3 mouse clicks, or 0 mouse clicks at the requesting end, as long
as they don't consider 3 requests in one day from one source to be
excessive and as long as they aren't using those results in a way that
competes somehow with Google.

Unless, of course, the real intent is to enforce terms that let them use
a business model based on charging ordinary users a premium merely to
avoid tedium. I hope that isn't their intent; it would violate their
famous motto. A tiered "typed queries are free, bookmarked are a dime
each, and cron jobs require a monthly $59.99 subscription fee and
special account" service where it actually costs them exactly the same
amount (next to nil) to provide for all three use cases seems not merely
silly, but tantamount to fraudulent. A tiered "more than xx queries a
day requires a premium $10/month account" thing with xx in the dozens or
hundreds might not be considered evil -- after all, generating that many
queries actually scales up the amount serving you is costing them per
day. And of course disallowing commercial use of the results (meaning
selling the results themselves in some manner, rather than incidental
use like researching a purchase or new hire) without a licensing
arrangement where Google gets a percentage. That's only fair.


 
nowwho
 
      01-05-2007

John Ersatznom wrote:
> nowwho wrote:
> > Hey,
> > Thanks for the information so far. I didn't realise there was so much
> > legal stuff involved; it's for a once-off educational project. Didn't
> > think it would amount to spamming. The program would only be run about
> > 50 times in total. There is a set number of queries, and a set number
> > of results returned. As it's an educational project I never thought of
> > the legal side!

>
> It's not spamming -- I don't know what the other guy was smoking when he
> wrote the post you're replying to. There is NO DIFFERENCE discernible to
> Google if you
>
> a) do 10 searches during the day by typing in a Firefox window while
> doing research or
> b) have your computer do the searches with less/no typing on your part
>
> Google is being "ripped off" iff you do something like:
>
> a) use huge amounts of their bandwidth -- well in excess of a normal
> user doing a bit of heavy research say, generating large numbers of
> searches or delving very deeply into the result set. Fetching 10
> first-pages-of-results one for each of 10 queries, whether done by one
> mouse click or ten typed-in queries, has little impact on them, and of
> course the one mouse click case makes it actually 10 queries instead of
> 11 because you mistyped one and had to do it again
> b) or use google search results to populate your own rival "search
> engine" site with revenue-generating ads or what-have-you, either by
> scraping google's database or by just putting up a page with a script
> that takes peoples' queries and passes them to google, then takes the
> result page and replaces google's sponsored links with umpteen flashing
> banner ads. Then you're using google's work output to actually compete
> against google, rather than simply using google for research. That makes
> a crucial difference.


The point of the exercise is to get the URLs returned into an offline
database. It's an exercise purely to pull back the URLs from the
different search engines.
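(If it does come down to scraping rather than the API, the extraction step might look like the sketch below; the regex is deliberately naive and real result-page markup differs, so treat it as illustrative only:)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractor {
    // Naive pattern: any double-quoted absolute http URL in an href.
    private static final Pattern HREF =
            Pattern.compile("href=\"(http[^\"]+)\"");

    // Pull every absolute href out of a chunk of result-page HTML,
    // in document order; relative links are skipped.
    public static List<String> extract(String html) {
        List<String> urls = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find())
            urls.add(m.group(1));
        return urls;
    }
}
```

The returned list could then be written to whatever offline database the project uses.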

> Using code to drive Google lightly and for personal/educational/research
> reasons rather than commercial ones doesn't seem to be evil to me,
> especially if they cannot in practise distinguish it from "normal" use
> anyway, as it isn't producing excessive traffic or being used to compete
> against google in some way.


I don't think it's a question of good or evil; I think people are
worried that the code could be used for commercial reasons.

> In fact, where do you draw the line? Firefox with manually-typed queries
> is OK. Then we have Firefox with a MRU for queries; Firefox with query
> guessing or autocompletion based on your current activities and
> interests; Firefox with a plugin to take the result set too and
> transform it e.g. to show 50 rather than 10 hits or to weed out
> "supplemental results" that are usually MFA sites that really ARE
> ripping off google; Firefox with a plugin to run the query of your
> choice and bookmark the results every few days; ... Firefox with a
> plugin to gradually build up a database of hits for various queries by
> occasionally fetching the nth page of results for one of them, but you
> don't publish these anywhere, just use them personally ...
>
> I think the two things that mark a transition to being evil are causing
> them excessive traffic and competing with them using their own data in
> some way. (Also generating content-free MFA pages to generate revenue
> via AdSense ads and SEOing them, but that's more using AdSense than
> using the search engine proper, though the SEO will impact the latter
> and pollute the results.)


This is an educational project and, as computing is not my main area
of study, I don't know what MFA and SEO are. Can these be explained?


> I don't see any way to derive some kind of moral law that makes typing
> something morally superior to doing it with one click, and actually
> scheduling an automatic (infrequent) job or whatever actually sinful.
> There's no inherent virtue in inefficiency, and computers exist to
> enable automating tasks. Hyperlinks automate looking up and finding that
> dusty reference or whatever; librarians may complain that they rot young
> brains but the actual upshot is a gain in productivity, rather than some
> kind of evil decadence setting in.


Any help with using the Google API or other suggestions would be a
great help. I also assume that Google's API won't work with the other
search engines, so would I have to write a different class for each
search engine?
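(To the last question: broadly yes, each engine speaks its own query format, so the usual shape is a small interface with one implementation per engine. The query-URL formats below are illustrative guesses, not checked against any engine's current documentation:)

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

interface SearchEngine {
    // Build this engine's query URL for the given search terms.
    String queryUrl(String terms);
}

// Hypothetical per-engine implementations; only the base URL and
// parameter name differ, so each class stays tiny.
class GoogleEngine implements SearchEngine {
    public String queryUrl(String terms) {
        return "http://www.google.com/search?q=" + encode(terms);
    }
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }
}

class YahooEngine implements SearchEngine {
    public String queryUrl(String terms) {
        return "http://search.yahoo.com/search?p=" + GoogleEngine.encode(terms);
    }
}
```

The rest of the program (fetching, parsing, storing) can then work against `SearchEngine` without caring which engine is behind it.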

 
John Ersatznom
 
      01-06-2007
nowwho wrote:
>>I think you might be well placed to use the 'legal
>>and free' API's currently offered! Surely even the
>>small numbers of queries Google offers for free
>>would cover your requirement?

>
> The use of other people's code is allowed; however, ALL work and ALL
> sources of information used in any way required for the project have to
> be detailed, and we were well warned about the consequences of plagiarism.
> All websites accessed for the project along with any copyright date
> must be included along with the date that the website was accessed
> etc...


Oh what a tangled web we weave...what happened to the days when you
could just tinker and innovate without fear of lawyers or similar? Hmm?
Of course, wholesale copying of other stuff without permission and
misattributing it as your own original work is simply bad, but it's
because it's fraud and misrepresentation, not because it's copying, IMO.
Wheel-reinventing is supposed to be a bad thing. Let some attorneys get
involved and soon everyone is expecting you to get their permission to
copy anything. Then to *use* anything. Then to breathe or take a leak,
no doubt.

I think it's worth pointing out that unless you've signed something in
writing, you aren't in a binding agreement with Google about anything
(or anyone else) and only copyright, trademark, and patent law has any
true legal force. No matter what TOC boilerplate is on whose website.
Hell, they can't even prove that you *read* it, in any meaningful way,
even if your IP retrieved the page one day.

Of course the de facto law in the US isn't so rosy, thanks to a braindead
court system and a legislature that's long since been ritually auctioned
with great fanfare biannually to the highest bidder. I'd suggest a saner
country. Many in Europe and, I think, even Canada actually still have
sane legal systems, standards for when someone's actually entered into a
binding contract, standards of evidence to get subpoenas, warrants, and
judgments, and whatnot. Australia's as bad as the US or worse though. I
wonder how long it is before individuals have to jurisdiction-shop by
travel agent and $500 one-way airfare express just to do ordinary
victimless activities without legal repercussions and $50,000 in bogus
fines for phantom file sharing someone else on the neighborhood's cable
company internet service may or may not actually have done...
 
nowwho
 
      01-06-2007

John Ersatznom wrote:
> nowwho wrote:
> >>I think you might be well placed to use the 'legal
> >>and free' API's currently offered! Surely even the
> >>small numbers of queries Google offers for free
> >>would cover your requirement?

> ...[remainder of John's post snipped; quoted in full above]...


While the legal information is handy and can (more than likely will) be
included in the report, are there any suggestions on how to tackle the
coding of the problem, or pointers to where I can look for further
information?

 
Chris Uppal
 
      01-06-2007
John Ersatznom wrote:

> > The use of other people's code is allowed; however, ALL work and ALL
> > sources of information used in any way required for the project have to
> > be detailed, and we were well warned about the consequences of plagiarism.
> > All websites accessed for the project along with any copyright date
> > must be included along with the date that the website was accessed
> > etc...

>
> Oh what a tangled web we weave...what happened to the days when you
> could just tinker and innovate without fear of lawyers or similar?


I think the OP's problem here is not so much the legality (or otherwise) of
"borrowing" Google's data, but that this is work in an academic context where
all sources /must/ be declared for reasons of honesty in scholarship.

-- chris



 