Automating Searches

 
 
Chris Uppal
 
      01-06-2007
nowwho wrote:

> While the legal information is handy and can (more than likely will) be
> included in the report, are there any suggestions on how to tackle the
> coding of the problem or suggestions as to where I can look for further
> information?


Unfortunately, it appears that Google stopped issuing new keys for their SOAP
Search API last month (http://code.google.com/apis/soapsearch/), so unless you
already have a key you will probably have to use some sort of screen scraping.

If you want to do it in Java (rather than, say, by using command-line tools
such as wget or curl) then you'll need an HTTP client package. Java comes with
one (start with java.net.URL), but it has been said here that Google blocks
access via that, so you may be better off using a different, and more general,
package such as the Jakarta HTTP client
http://jakarta.apache.org/commons/httpclient/
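
For the java.net route, a minimal sketch might look like the following. It
assumes plain HttpURLConnection, a placeholder address
(http://www.example.com/), UTF-8 responses, and a made-up User-Agent string;
PageFetcher, prepare and fetch are illustrative names, not part of any
library. An explicit User-Agent is set because the default one identifies the
client as Java; whether that is what triggers the blocking mentioned above is
only a guess.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class PageFetcher {
    // Hypothetical User-Agent string; put something that identifies your own tool here.
    static final String USER_AGENT = "ExampleFetcher/0.1";

    // Builds a GET request for the given http(s) address. Nothing is sent yet.
    static HttpURLConnection prepare(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setConnectTimeout(10_000); // milliseconds
        conn.setReadTimeout(10_000);    // milliseconds
        return conn;
    }

    // Sends the request and returns the response body, decoded as UTF-8 (an assumption).
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = prepare(address);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }
}
```

Usage would be along the lines of
String html = PageFetcher.fetch("http://www.example.com/");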

Then, once you have worked out how to download data, you will need to parse it
to find the links you want. Parsing HTML with anything like reliability is not
easy (but you may not need much reliability in this case); you may find this
page of HTML parsers useful.
http://www.java-source.net/open-source/html-parsers
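
If reliability really is not a concern, a crude regular-expression pass can
pull out href values without a parser. This is a sketch only: LinkExtractor is
a hypothetical name, the pattern handles just quoted href attributes on <a>
tags, and it does no entity decoding or relative-URL resolution. One of the
parsers listed above is the better choice for anything messier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Matches an <a ...> tag up to a quoted href value; group 2 is the value itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the raw href values in document order (no decoding, no de-duplication).
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }
}
```

Unquoted attribute values, links built by script, and commented-out markup
are all ignored or mishandled by this pattern, which is the kind of
unreliability meant above.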

-- chris



 
 
 
 
 
Chris Uppal
 
      01-06-2007
John Ersatznom wrote:

[me:]
> > And they do actively work to prevent abuse. There are many kinds of
> > possible abuse, and I imagine Google work to prevent most of them, but
> > I doubt if there are many things they dislike more than people
> > attempting to steal their data.

>
> All of this depends on what constitutes "stealing" their data. Copying
> it and publishing it? Sort of -- it's some kind of infringement but not
> really "theft".


I don't particularly want to focus on what word(s) best fit the malefaction.
I'll stick with the general purpose "abuse" (which doesn't necessarily even
imply illegality).


> Merely doing with one mouse click or zero what you'd do anyway with
> twenty keypresses? I don't see how the amount of clacking emanating from
> someone's workstation at location A is in any way relevant to Google as
> long as a) a single user isn't suddenly hogging their resources and b)
> the user is using the results "normally" rather than to compete with
> Google or whatever.


Here you are mentioning only one aspect of the abuse (as it might appear to
Google) -- namely overuse of their resources. And I doubt if they are too
worried about that (within reason, of course). But almost /any/ automated
scanning of their database is an abuse in another sense: they make that data
available to people (not machines) in order to make money off it. Their (only,
as far as I know) source of cash is directly or indirectly from the advertising
they include with the search results. If you don't see the advertising then
you are using their resources and data without paying for them. How could they
/not/ want to minimise that?


> The red flags that would make them look into their logfiles would be a)
> excessive bandwidth use and b) a Google clone or whatever springing up
> all of a sudden and competing for their revenue streams.


Or anything else that suggests that the search results are not being read by a
human...

Of course, they own the servers, they pay the (probably massive) network costs
and other data-centre costs, so it's up to them what they consider "fair". If
they choose to object to people called "Chris" using their services, then
that's up to them -- I have no real right to complain -- they can be as
arbitrary as they like. Naturally, since they want to make money, they can't
be too very arbitrary (and aren't), but by the same token, they do have good
reasons to (try to) protect their services from freeloaders.

-- chris



 
 
 
 
 
Lew
 
      01-07-2007
Chris Uppal wrote:
> Of course, they own the servers, they pay the (probably massive) network costs
> and other data-centre costs, so it's up to them what they consider "fair". If
> they choose to object to people called "Chris" using their services, then
> that's up to them -- I have no real right to complain -- they can be as
> arbitrary as they like. Naturally, since they want to make money, they can't
> be too very arbitrary (and aren't), but by the same token, they do have good
> reasons to (try to) protect their services from freeloaders.


I am not sure if name-bigotry is covered, but in many countries discrimination
in the provision of goods or services for certain factors like race, religion,
national origin, physical or mental disabilities and some other like
attributes is illegal. The legal principle rests in part on whether a trait is
innate, like national origin, or voluntary, like whether to wear a beard (for
most). This in no wise invalidates points others have made in this thread
except to point out that legal niceties punch exceptions into many broad
generalizations about these topics.

The legal question of data ownership carries many perilous implications. Does
Google own the information, or merely its representation? Is that
representation limited to its appearance on the screen, or does its specific
storage in their databases qualify? What about the source whence came Google's
data - when they scraped information off foo.com to include it in their data,
did they violate foo.com's owner's intellectual property rights? If I scraped
foo.com and came up with similar information to Google's in a similar data
structure (because data structures are "obvious" to a competent software
engineer), have I violated any of Google's IP rights?

Larger jurisprudential question: what degree of data openness or private
ownership best benefits society?

Concomitant question: what constitutes fair use of another's data?

- Lew
 
 
Andrew Thompson
 
      01-07-2007
Lew wrote:

> ...What about the source whence came Google's
> data - when they scraped information off foo.com to include it in their data,
> did they violate foo.com's owner's intellectual property rights?


I assume they figure that complying with a 'robots.txt'*
gives them some justification that they were 'invited'
(or at the very least, not excluded or banned) from
the site in question.

* <http://www.robotstxt.org/>
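
To illustrate what complying involves, here is a rough sketch of the check
a crawler makes before requesting a path. It assumes only the original
User-agent and Disallow fields with simple prefix matching; later extensions
(Allow, wildcards, Crawl-delay) are ignored, and RobotsRules and isAllowed are
made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsRules {
    // True if no applicable Disallow prefix covers `path`.
    // Groups naming `agent` take precedence over "User-agent: *" groups.
    static boolean isAllowed(String robotsTxt, String agent, String path) {
        List<String> specific = new ArrayList<>(); // Disallow prefixes for this agent
        List<String> wildcard = new ArrayList<>(); // Disallow prefixes for "*"
        boolean matchedSpecific = false;
        boolean inSpecific = false;
        boolean inWildcard = false;
        boolean lastWasAgent = false;
        String me = agent.toLowerCase(Locale.ROOT);

        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) { // a User-agent line after rules starts a new group
                    inSpecific = false;
                    inWildcard = false;
                }
                String name = value.toLowerCase(Locale.ROOT);
                if (name.equals("*")) {
                    inWildcard = true;
                } else if (!name.isEmpty() && me.contains(name)) {
                    inSpecific = true;
                    matchedSpecific = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && !value.isEmpty()) {
                    if (inSpecific) specific.add(value);
                    if (inWildcard) wildcard.add(value);
                }
            }
        }
        for (String prefix : matchedSpecific ? specific : wildcard) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Given a file containing "User-agent: *" followed by "Disallow: /search",
isAllowed(file, "ExampleFetcher/0.1", "/search?q=java") returns false.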

Andrew T.

 
 
Andrew Thompson
 
      01-07-2007
Andrew Thompson wrote:
> Lew wrote:
>
> > ...What about the source whence came Google's
> > data - when they scraped information off foo.com to include it in their data,
> > did they violate foo.com's owner's intellectual property rights?

>
> I assume they figure that complying with a 'robots.txt'* ...


E.G. <http://www.google.com/robots.txt>

Andrew T.

 
 
John Ersatznom
 
      01-08-2007
Chris Uppal wrote:
> Here you are mentioning only one aspect of the abuse (as it might appear to
> Google) -- namely overuse of their resources. And I doubt if they are too
> worried about that (within reason, of course). But almost /any/ automated
> scanning of their database is an abuse in another sense: they make that data
> available to people (not machines) in order to make money off it. Their (only,
> as far as I know) source of cash is directly or indirectly from the advertising
> they include with the search results. If you don't see the advertising then
> you are using their resources and data without paying for them. How could they
> /not/ want to minimise that?


If accessing a site in such a way as to not see advertising is "wrong",
then using adblock plugins for your browser must be wrong. Using
Ad-Aware to wipe out those foo.doubleclick.com tracking cookies must be
wrong. Putting "127.0.0.1 foo.doubleclick.com" in your hosts file must be
wrong. Hell, walking into the kitchen to fix yourself a snack when your
TV show goes to an ad must be wrong! Maybe even avoiding spam or
deleting it unread...

There is such a thing as taking something too far.

> Of course, they own the servers, they pay the (probably massive) network costs
> and other data-centre costs, so it's up to them what they consider "fair". If
> they choose to object to people called "Chris" using their services, then
> that's up to them -- I have no real right to complain -- they can be as
> arbitrary as they like. Naturally, since they want to make money, they can't
> be too very arbitrary (and aren't), but by the same token, they do have good
> reasons to (try to) protect their services from freeloaders.


That's completely aside from any legal issues, and down to any business being
able to pick its customers selectively. And, of course, their ability to
do so is limited to the extent that they can detect whatever they don't
like. If they don't like people named "Chris" a Chris can use a phony
name and they won't know the difference unless they start demanding ID
verification to grant access, and they won't do that because it would be
a quick way to self-destruct in the search-engine business.

Automating some of your search usage is similarly something you can do
below their radar, but in doing so you will clearly have to avoid any
high levels of usage that would bother them and get their attention. But
below that threshold, it's also a case of "what they don't know can't
hurt them"...
 
 
John Ersatznom
 
      01-08-2007
Lew wrote:
> Larger jurisprudential question: what degree of data openness or private
> ownership best benefits society?


Complete openness, except for national security matters, and those have
to be things like non-stale battle plans that are of use to the enemy if
they get them in a timely fashion. Any other security-based secrecy is
security-through-obscurity; prefer a massive, well-understood defense to
one that depends on the enemy being totally incompetent at espionage.

So-called "intellectual property" may be the single biggest
legal/judicial mistake in history -- far from promoting innovation, all
it seems to do is promote monopolies and lock-in. Check out
againstmonopoly.org sometime. Bad patents are a recurring theme there
and at techdirt, slashdot and other tech sites, but they're just the tip
of the iceberg.

> Concomitant question: what constitutes fair use of another's data?


Any private, educational, or nonprofit use should IMO. Of course if I
had my druthers any use at all would. The only things "protectable"
would be personal information, which people would be able to insist
(with legal clout) companies like ChoicePoint delete or at least verify.
And, eventually, the person's actual mind itself, once the technology to
download or otherwise access it with the right tools is available. If I
don't want spammers pestering me at some email address I think I have
that right, but if I publish something nonpersonal by choice I don't
feel I should then try to dictate how others use it.
 
 
John Ersatznom
 
      01-08-2007
Andrew Thompson wrote:
> Andrew Thompson wrote:
>
>>Lew wrote:
>>
>>
>>>...What about the source whence came Google's
>>>data - when they scraped information off foo.com to include it in their data,
>>>did they violate foo.com's owner's intellectual property rights?

>>
>>I assume they figure that complying with a 'robots.txt'* ...

>
> E.G. <http://www.google.com/robots.txt>


Unfortunately, one de facto effect of this protocol is that a lot of
sites configure it to deny any automated access and then carve out a few
narrow exemptions for Google and a handful of other big names in search,
on the grounds that nobody else actually drives traffic and business to
their site in any real quantity. The logical outcome is to shut out
smaller search engines and private web-use automation, however. The
former means the current crop of big-name search engines now have a lock
on the market. The latter is simply dumb, since letting people automate
aspects of their web use makes the web (and your site) more useful to them.

Some potentially useful web services are especially likely to be badly
affected. Price comparators, for one. If you run an ecommerce site with
nine competitors, and they all let a price comparator site's bot have
access, and you do likewise, then 90% of the time it will forward people
to a competitor. Obviously as an ecommerce vendor you want to block
price comparator bots! Unfortunately, this is not beneficial to society,
since you are outnumbered by your market, and your market is harmed by
stifling access to information, and the additional ENTIRE market of
online price comparison is threatened if everyone behaves the same.

So search engine startups, price comparison engines and suchlike, and
personal automation all have strong incentives to ignore robots.txt
directives. Of course, accessing the file but then ignoring a
directive in it is detectable by the site admin who will block your IP,
and the ability to change IPs readily is much more available to the
bigger sites that don't need it than to the smaller sites and
individuals, so that means small-time bots have to not even access it
(and have to fly under the radar -- not too much bandwidth and "look
human").

The good side is that robots.txt does force non-bigname bots to run very
quietly and not use much bandwidth at all or otherwise call attention to
themselves, which serves part of the purpose anyway (one function of
robot directives is to help site admins prevent overuse of their bandwidth).
 