Web Spider

 
 
Chase Preuninger
Guest
Posts: n/a
 
      03-04-2008
If I was parsing a web page and extracting data from it in order to
make a search engine, what should I extract?
 
 
 
 
 
Lord Zoltar
Guest
Posts: n/a
 
      03-04-2008
On Mar 4, 11:10 am, Chase Preuninger <(E-Mail Removed)>
wrote:
> If I was parsing a web page and extracting data from it in order to
> make a search engine, what should I extract?


That depends on what you are interested in.
 
 
 
 
 
Daniel Pitts
Guest
Posts: n/a
 
      03-04-2008
Lord Zoltar wrote:
> On Mar 4, 11:10 am, Chase Preuninger <(E-Mail Removed)>
> wrote:
>> If I was parsing a web page and extracting data from it in order to
>> make a search engine, what should I extract?

>
> That depends on what you are interested in.

Instead of a "that depends" answer...

You extract the information you want!

--
Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
 
 
Jeff Higgins
Guest
Posts: n/a
 
      03-04-2008

Chase Preuninger wrote:
> If I was parsing a web page and extracting data from it in order to
> make a search engine, what should I extract?


<http://en.wikipedia.org/wiki/Search_engine_%28computing%29>


 
 
Chase Preuninger
Guest
Posts: n/a
 
      03-04-2008
On Mar 4, 4:23 pm, "Jeff Higgins" <(E-Mail Removed)> wrote:
> Chase Preuninger wrote:
> > If I was parsing a web page and extracting data from it in order to
> > make a search engine, what should I extract?

>
> <http://en.wikipedia.org/wiki/Search_engine_%28computing%29>


Just useful stuff for a web search.
 
 
Jeff Higgins
Guest
Posts: n/a
 
      03-04-2008

"Chase Preuninger" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> On Mar 4, 4:23 pm, "Jeff Higgins" <(E-Mail Removed)> wrote:
> > Chase Preuninger wrote:
> > > If I was parsing a web page and extracting data from it in order to
> > > make a search engine, what should I extract?
> >
> > <http://en.wikipedia.org/wiki/Search_engine_%28computing%29>
>
> just useful stuff for a web search

<http://en.wikipedia.org/wiki/Sitemaps>

Both the links I've provided,
I've found using a web search engine
and the search terms: search, engine, wiki.

You could try searching on:
"most frequent search query", or
"most interesting search query", or
"most useful data for a web search engine".


 
 
timjowers
Guest
Posts: n/a
 
      03-05-2008
On Mar 4, 5:30 pm, "Jeff Higgins" <(E-Mail Removed)> wrote:
> "Chase Preuninger" <(E-Mail Removed)> wrote in message
>
> news:(E-Mail Removed)...
> On Mar 4, 4:23 pm, "Jeff Higgins" <(E-Mail Removed)> wrote:
>
> > Chase Preuninger wrote:
> > > If I was parsing a web page and extracting data from it in order to
> > > make a search engine, what should I extract?

>
> > <http://en.wikipedia.org/wiki/Search_engine_%28computing%29>

>
> just useful stuff for a web search
>
> <http://en.wikipedia.org/wiki/Sitemaps>
>
> Both the links I've provided,
> I've found using a web search engine
> and the search terms: search, engine, wiki.
>
> You could try searching on:
> "most frequent search query", or
> "most interesting search query", or
> "most useful data for a web search engine".


Chase, the basic unit is the word. Then you correlate the words into
clusters. Historically these are called Information Retrieval Systems
("IRS", oh my). The simplest idea is that pages with words in common
must be like one another and like your topic. Imagine what would happen
if you had one list with all URLs, one list with all words, and one list
connecting the two. Then you could look up all matching URLs for each
word. These lists might be large, though! Then you could find the set of
pages matching the search phrase by intersecting the sets for each word.
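
A minimal sketch of the three lists described above, assuming a single
in-memory map is enough (a real index would not fit in memory like
this). The class name `InvertedIndexSketch` and its methods are
hypothetical, made up for illustration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: maps each word to the set of URLs that contain it. */
final class InvertedIndexSketch {
    // The "list connecting the two": word -> URLs of pages containing that word.
    private final Map<String, Set<String>> urlsByWord = new HashMap<>();

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /** Records every word of the page text under the page's URL. */
    void addPage(String url, String text) {
        for (String word : tokenize(text)) {
            urlsByWord.computeIfAbsent(word, k -> new HashSet<>()).add(url);
        }
    }

    /** Returns the URLs that contain every word of the query (set intersection). */
    Set<String> search(String query) {
        Set<String> result = null;
        for (String word : tokenize(query)) {
            Set<String> urls = urlsByWord.getOrDefault(word, Collections.<String>emptySet());
            if (result == null) {
                result = new HashSet<>(urls);   // first word: start from its URL set
            } else {
                result.retainAll(urls);         // later words: keep only common URLs
            }
        }
        return result == null ? new HashSet<String>() : result;
    }
}
```

After `addPage("http://a.example/", "Super computer news")` and
`addPage("http://b.example/", "Super market prices")`, a call to
`search("super computer")` returns only the first URL, while
`search("super")` returns both.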

The second thing to know is that words have forms, so maybe you'd work
off all lower case and reduce all words to a base form. Well, what about
"Fahrenheit 451"? Do you also store numbers? What about "(...)"? Can
you also search on computerese? So it starts to get complicated. One
idea is the "edit distance", or the number of changes needed to get from
the word entered to a base word. That might tell you whether it is the
same word. What about synonyms? (I haven't seen a search engine do
this.) What about bigrams and n-grams, that is, multi-word combinations?
If one types super computer, then maybe any occurrence of "super
computer" should be ranked higher than a page with just the word super
or computer.
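
Two of these ideas can be sketched in a few lines. The edit distance
below is the standard Levenshtein distance (insertions, deletions and
substitutions each cost one change), which is one common way to define
"number of changes"; the bigram helper just pairs up adjacent words.
The class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of two word-matching helpers. */
final class WordMatchingSketch {

    /** Levenshtein distance: fewest single-character inserts, deletes or substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                        // turning "" into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                        // turning a[0..i) into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,   // insert
                                 prev[j] + 1),      // delete
                        prev[j - 1] + cost);        // substitute or match
            }
            int[] tmp = prev;                   // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Joins each pair of adjacent words with a space, e.g. [a, b, c] gives [a b, b c]. */
    static List<String> bigrams(List<String> words) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            pairs.add(words.get(i) + " " + words.get(i + 1));
        }
        return pairs;
    }
}
```

For example, `editDistance("farenheit", "fahrenheit")` is 1 (one missing
letter), so a small threshold would treat the misspelling as the same
word. Indexing the bigrams alongside the single words is one way to let
the phrase "super computer" score above pages that only contain the two
words separately.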

OK, so a real search engine uses ranking and bases this on many
things: how long the site has been up, how many other sites link to
it, how stuffed full of links its pages are. Maybe whether they buy
ads from the search engine? Nah, that wouldn't be right. Etc. Also, by
clustering a person's past searches or areas of interest, you can
greatly increase your precision.
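
As a rough illustration of just one of those signals, the sketch below
counts how many other pages link to each page and sorts the matches by
that count. This is only an assumption about how such a signal might be
used, not how any real engine ranks; the link map, class name and
method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: rank pages by how many other pages link to them. */
final class InboundLinkRankSketch {

    /** links maps a page URL to the URLs it links to; self-links are ignored. */
    static Map<String, Integer> inboundCounts(Map<String, Set<String>> links) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : links.entrySet()) {
            counts.putIfAbsent(entry.getKey(), 0);      // pages nobody links to score 0
            for (String target : entry.getValue()) {
                if (!target.equals(entry.getKey())) {
                    counts.merge(target, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Orders matches by inbound count, highest first; ties are broken by URL. */
    static List<String> rank(Collection<String> matches, Map<String, Integer> counts) {
        List<String> ranked = new ArrayList<>(matches);
        ranked.sort(Comparator
                .comparingInt((String url) -> counts.getOrDefault(url, 0))
                .reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return ranked;
    }
}
```

The matches passed to `rank` would be the set you got from intersecting
the per-word URL sets; the other signals mentioned (site age, link
stuffing, personal search history) would each need their own data.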

In 2001 I took an IR course and we studied MSN, Google, and Yahoo.
Everyone found Google to have about the same recall (document
universe) but superior precision (accuracy). Now if I'd had the common
sense to buy stock!!!!
 
 
timjowers
Guest
Posts: n/a
 
      03-05-2008

Check this out: http://sourceforge.net/projects/webwordcnt/

Also see the Apache Lucene project. It's the base search engine used in
many commercial products.
 
 
Roedy Green
Guest
Posts: n/a
 
      03-06-2008
On Tue, 4 Mar 2008 08:10:53 -0800 (PST), Chase Preuninger
<(E-Mail Removed)> wrote, quoted or indirectly quoted someone
who said :

>If I was parsing a web page and extracting data from it in order to
>make a search engine, what should I extract?

You DON'T want the HTML tags.
You DON'T want the header.
You DON'T want header/footer info common to all pages at a website.
You DON'T want common words like: the, that, then, is, a ...
You DON'T want URLs.
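
A crude sketch of that filtering, with several assumptions: "the header"
is taken to mean the `<head>` element, the stop-word list is only the
five words named above, and regular expressions stand in for a real HTML
parser (they will mishandle malformed markup and do not decode entities
such as `&amp;`). Header/footer text shared across a site is not handled
here, since that needs a comparison of several pages. The class name is
hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: reduce an HTML page to the words worth indexing. */
final class PageTextSketch {
    // Only the examples from the list above; a real stop-word list is much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "that", "then", "is", "a"));

    static List<String> indexableWords(String html) {
        String text = html
                .replaceAll("(?is)<head\\b.*?</head>", " ")             // drop the <head> section
                .replaceAll("(?is)<(script|style)\\b.*?</\\1>", " ")    // drop scripts and styles
                .replaceAll("(?s)<[^>]*>", " ")                         // drop remaining tags
                .replaceAll("(?i)\\bhttps?://\\S+", " ");               // drop URLs left in the text

        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                words.add(w);
            }
        }
        return words;
    }
}
```

The tags are stripped before the URLs so that a URL sitting directly
against a closing tag does not swallow the tag along with it.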
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
 