Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > HTML > OT: Full text search
OT: Full text search

 
 
Jeff Thies
 
      08-24-2004
I think many of us use MySQL...

I notice that MySQL has a full text search. This matches a phrase
like: "full text search of website", and returns a list of results
ordered by the highest degree of matches. Minor words and very frequent
words are excluded.

Sounds very powerful, and since it's nearly trivial to spider a site
and stuff the pages into a table, it would be easy to implement.
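For anyone curious, here's roughly what that looks like in MySQL (the table and column names are made up, and FULLTEXT indexes require a MyISAM table):

```sql
-- hypothetical table a spider might populate
CREATE TABLE pages (
    url   VARCHAR(255) NOT NULL,
    title VARCHAR(255),
    body  TEXT,
    FULLTEXT (title, body)
) ENGINE=MyISAM;

-- natural-language mode: stopwords and very short words are ignored,
-- and rows come back ranked by relevance
SELECT url,
       MATCH (title, body) AGAINST ('full text search of website') AS score
FROM pages
WHERE MATCH (title, body) AGAINST ('full text search of website')
ORDER BY score DESC;
```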

So, anyone used this? Or something like it?

Jeff
 
 
 
 
 
Karl Groves
 
      08-24-2004

"Jeff Thies" <(E-Mail Removed)> wrote in message
news:b2yWc.9824$(E-Mail Removed) ink.net...
>
> So, anyone used this? Or something like it?


Search engine programming is way too complicated for even the most
experienced programmers. Dealing with things like misspellings, homophones,
and synonyms is just the tip of the iceberg. Then, when you get into
things like ranking results by relevance, you have yourself a major
nightmare.

-Karl


 
 
 
 
 
Jeff Thies
 
      08-24-2004
Karl Groves wrote:
> "Jeff Thies" <(E-Mail Removed)> wrote in message
> news:b2yWc.9824$(E-Mail Removed) ink.net...
>
>> So, anyone used this? Or something like it?

>
> Search engine programming is way too complicated for even the most
> experienced programmers.


Well, keyword/multi-term searches on shopping sites are very common and
not hard to implement. You wouldn't want a full-fledged search engine
there, and for most other site apps you wouldn't either.


> Dealing with things like misspellings, homophones,
> and synonyms is just the tip of the iceberg.


Why bother? Spellchecking isn't that hard, though: I posted a client-side
version a couple of years ago, and doing it server-side would be even
easier. And there are a lot of ways to work around no-match scenarios.
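For what it's worth, a server-side sketch (Python here just for brevity, using the standard library's difflib; the vocabulary list is made up and would really come from the site's index):

```python
import difflib

# hypothetical vocabulary harvested while spidering the site
vocabulary = ["search", "engine", "website", "text", "index", "product"]

def suggest(term, vocab=vocabulary):
    """Suggest close matches from the indexed vocabulary for a no-match term."""
    return difflib.get_close_matches(term.lower(), vocab, n=3, cutoff=0.6)

print(suggest("serch"))  # a misspelling of "search"
```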

> Then, when you get into things like ranking results
> by relevance, you have yourself a major nightmare.


But that's where MySQL does it for you. Besides, Google ranks on a lot
of criteria that would be less than helpful on a mid-size site.

So, your answer is no?

Jeff

 
 
Art Sackett
 
      08-24-2004
Jeff Thies <(E-Mail Removed)> wrote:

> So, anyone used this? Or something like it?


I toyed with it a while back, and it's slower than the rectification of
sin. I wouldn't use it at all on a large dataset or a busy server.

--
Art Sackett,
Patron Saint of Drunken Fornication
 
 
Toby Inkster
 
      08-24-2004
Karl Groves wrote:

> Search engine programming is way too complicated for even the most
> experienced programmers. Dealing with things like misspellings, homophones,
> and synonyms is just the tip of the iceberg. Then, when you get into
> things like ranking results by relevance, you have yourself a major
> nightmare.


Rankings aren't hard. I'm pretty happy with the rankings on my search
engine. I only have a handful of pages, but the algorithm should work fine
even with many thousands.

I don't deal with misspellings, etc. on my site -- if you misspell your
search term, you don't *deserve* a result. I can't imagine misspellings
would be that hard to handle, though, if you used some third-party software
to suggest corrections (e.g. ispell).

I've also not implemented boolean searches (just exact phrase) yet.

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

 
 
Jeff Thies
 
      08-24-2004
Art Sackett wrote:

> Jeff Thies <(E-Mail Removed)> wrote:
>
>
>> So, anyone used this? Or something like it?

>
>
> I toyed with it a while back, and it's slower than the rectification of
> sin. I wouldn't use it at all on a large dataset or a busy server.


I was afraid of that! Must have one hell of an index, or maybe not!

I figured it would be slow slurping data in. It's slow querying also?

Cheers,
Jeff

 
 
Jeff Thies
 
      08-24-2004
> Rankings aren't hard. I'm pretty happy with the rankings on my search
> engine. I only have a handful of pages, but the algorithm should work fine
> even with many thousands.
>

How are you going about that?

> I don't deal with misspellings, etc. on my site -- if you misspell your
> search term, you don't *deserve* a result. I can't imagine misspellings
> would be that hard to handle, though, if you used some third-party software
> to suggest corrections (e.g. ispell).
>
> I've also not implemented boolean searches (just exact phrase) yet.


I've been doing something like this:

AND search:

foreach my $keyword (@keywords) {
    # $dbh is the DBI database handle; quote() keeps keywords from
    # breaking (or injecting into) the SQL
    $sort .= ' AND search_field LIKE ' . $dbh->quote("%$keyword%");
}

Seems too easy... you just have to trim leading/trailing whitespace from
the keywords.

Jeff


 
 
Art Sackett
 
      08-24-2004
Karl Groves <(E-Mail Removed)> wrote:

> Search engine programming is way too complicated for even the most
> experienced programmers.


Who writes the software that drives the web's search engines?

> Dealing with things like misspellings, homophones,
> and synonyms is just the tip of the iceberg.


But are not all that hard to do, in my experience. The hardest part of
the submerged portion of the iceberg is thinking your way through the
task before writing any code.

> Then, when you get into things like ranking results
> by relevance, you have yourself a major nightmare.


Oh, I dunno. It's not as hard as it might seem. The hardest part is
coming up with a decent index from which to work. You have to spend a
lot of time thinking about your indexing algorithms, but I wouldn't go
so far as to call it a nightmare.

I like to do a hybrid sort of a thing, first indexing all of the words
as they appear, then applying the Porter Stemming Algorithm
(http://tartarus.org/~martin/PorterStemmer/ ) to derive their stems,
which receive a lower basic score so that whole words are viewed as
"more relevant". I then factor both based upon their position in the
"stream", their containing elements (h1...h5, bold, italic, etc.), and
their occurrence in any of the more interesting places (path/filename,
document title, META descriptions/keywords, etc.) Then off into the
monstrous database they go. In a single-site search, you don't have to
do any complex heuristics to detect spamdexing or doorway pages, punish
zero-timed redirects, etc. so that bit of nightmare doesn't count.
Still, those things are easily enough detected if you have need of
protective measures.
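The whole-word-versus-stem weighting above could be sketched roughly like so (Python for brevity; the suffix-stripper is a toy stand-in for the real Porter stemmer, and the weights are arbitrary):

```python
def crude_stem(word):
    """Toy stand-in for the Porter stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(words, word_score=2.0, stem_score=1.0):
    """Score whole words higher than derived stems, so exact hits rank first."""
    scores = {}
    for w in words:
        w = w.lower()
        scores[w] = scores.get(w, 0.0) + word_score
        stem = crude_stem(w)
        if stem != w:
            scores[stem] = scores.get(stem, 0.0) + stem_score
    return scores

scores = index_terms(["stems", "stem", "stemming"])
# "stem" collects a whole-word hit plus a lower-weighted hit via "stems"
```

In a real indexer you'd also multiply these base scores by the positional and structural factors (containing element, title, META, etc.) before storing them.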

The second thing you have to spend a lot of time thinking about is the
database. No matter how you optimize it, it's going to suck. Resources,
that is. Lots and lots of resources.

My favorite hand-rolled algorithm says that the document at the
(Porter Stemming Algorithm) URL above is most relevant to the
following "natural" keyword groups:

porter, stemming (most relevant)
common, ansi, encodings, published, errors (relevant but too common)

with the top five scored terms being version, algorithm, porter,
stemming (and) common. Using the first three, four, or all five of the
top-scored terms at Lycos lands the URL at number one. Using the first
two lands it at number two. (I use Lycos in this example because it
doesn't have a "PageRank" algorithm that would require web spidering
for validation.) It'd be kinda silly to even think of looking for it
in the results for the single term "version". Mixing and matching
any two from the list of "natural" search terms (at Lycos) brings the
site in at number one most of the time.

The least relevant natural group brings up that URL, at Lycos, in the
number one spot. Popping "errors" off the end moves it to number five.
Those terms are just way too common, even if they're what the document
appears to be "about." (It'd be really easy to knock that site out of
the number one spot, even for "porter stemming", as it obviously has
not been optimized for search engine ranking.)

In my retrieval algorithm, I first spellcheck, then generate a list of
synonyms, homonyms, and common abbreviations of the user-provided
search terms, giving the highest preference to the terms in the order
provided by the user, then the various permutations thereof working
down from best-fit to least-fit. A bit of heuristic manipulation (AKA
"magic") happens when I look at the results, to eliminate some that
might otherwise appear attractive, but for a single small or moderately
sized site these heuristics may be unimportant. If I get too few
"hits", I pop the last term off of the list, and reiterate to add more
hits after the first group, terminating either when I get a reasonable
set, or the relevance factor falls below some threshold.

If the site is all static content and has META descriptions/keywords,
it might be best to conserve resources by indexing only the path and
file name, title, and META description/keywords. The task gets far
simpler and the resource consumption falls off dramatically.

Having done it, albeit on a small scale (single sites of just tens of
thousands of documents, not hundreds of thousands or millions) I don't
consider search engine development to be "way too complicated for even
the most experienced programmers." Indexing the entire web would
require an experienced programmer (a la http://www.gigablast.com/ which
is one guy with just eight servers), but indexing a single site isn't.
It's a good stretching exercise even for moderately skilled programmers
who aren't betting their careers on the product, and lots of fun, too.

--
Art Sackett,
Patron Saint of Drunken Fornication
 
 
Art Sackett
 
      08-24-2004
Jeff Thies <(E-Mail Removed)> wrote:

> foreach my $keyword(@keywords){
> $sort .= ' AND search_field like ' . '\'%' . $keyword . '%\' ';
> }


You might consider using:

([[:<:]]|[[:punct:]])$keyword([[:punct:]]|[[:>:]])

to ensure you get 'em all. Just a thought...
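In a MySQL WHERE clause that would look something like this (the table name and the keyword 'widget' are made up; [[:<:]] and [[:>:]] are MySQL's word-boundary markers):

```sql
SELECT url
FROM pages
WHERE search_field REGEXP '([[:<:]]|[[:punct:]])widget([[:punct:]]|[[:>:]])';
```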

--
Art Sackett,
Patron Saint of Drunken Fornication
 
 
Art Sackett
 
      08-24-2004
Jeff Thies <(E-Mail Removed)> wrote:

> I figured it would be slow slurping data in. It's slow querying also?


Painfully so.

--
Art Sackett,
Patron Saint of Drunken Fornication
 