Velocity Reviews

julie_smith@operamail.com 01-18-2005 01:42 PM

similar articles algorithm based on numeric indexing of all rows via columns in a table
 
Hi,
I have an articles table containing columns like
id,name,author,section,creationdate,description,longmatter, etc.
I am using mysql.

Some of them are fixed-value fields (enumerations),

e.g. section will have news, sports, politics, etc.,

while description will be a text field with any amount of arbitrary
text.

Now I have 50000 articles under different sections.

I want to implement a "similar articles" feature.
By this I mean that when an article is shown,
I want to display all the articles similar to it (10
per page).

Now how do I calculate the similarity of one article to all the 50000
articles?

I don't want articles from the same section only.
Since the search has to be very fast,
can I create some algorithm that will look through all the fields in
each row of the
articles table and assign a weight/checksum to it?

And then, in the similar articles part, display all the articles within
+-5 of the
currently displayed article's checksum?

Thanks in advance,

Julie


Gunnar Hjalmarsson 01-18-2005 01:44 PM

Re: similar articles algorithm based on numeric indexing of all rows via columns in a table
 
julie_smith@operamail.com wrote:
> I want to implement a "similar articles" feature.
> By this I mean that when an article is shown,
> I want to display all the articles similar to it (10
> per page).
>
> Now how do I calculate the similarity of one article to all the 50000
> articles?
>
> I don't want articles from the same section only.
> Since the search has to be very fast,
> can I create some algorithm that will look through all the fields in
> each row of the
> articles table and assign a weight/checksum to it?


Check out the CPAN module Algorithm::Diff.
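
For readers who want to see what a diff-style comparison of two texts looks like, here is a minimal sketch. It is written in Python and uses the standard library's difflib as a stand-in; it is not Algorithm::Diff, and the matching algorithm is not identical. The function name text_similarity is made up for illustration.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two word sequences are identical."""
    words_a = a.lower().split()
    words_b = b.lower().split()
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent elements in long sequences.
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()
```

This compares one pair of texts at a time. Running it against all 50000 descriptions on every page view would probably be too slow, so the scores would likely have to be computed ahead of time.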

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Anno Siegel 01-18-2005 09:41 PM

Re: similar articles algorithm based on numeric indexing of all rows via columns in a table
 
<julie_smith@operamail.com> wrote in comp.lang.perl.misc:
> Hi,
> I have an articles table containing columns like
> id,name,author,section,creationdate,description,longmatter, etc.
> I am using mysql.
>
> Some of them are fixed-value fields (enumerations),
>
> e.g. section will have news, sports, politics, etc.,
>
> while description will be a text field with any amount of arbitrary
> text.
>
> Now I have 50000 articles under different sections.
>
> I want to implement a "similar articles" feature.


Okay. Given two articles, how do you decide if they are similar?

> By this I mean that when an article is shown,
> I want to display all the articles similar to it (10
> per page).


What you are going to do with the list of similar articles has
no bearing on how you select them.

> Now how do I calculate the similarity of one article to all the 50000
> articles?


First you have to tell us how to compare two individual articles, *then*
we can talk about ways to apply this to many pairs efficiently.

> I don't want articles from the same section only.
> Since the search has to be very fast,
> can I create some algorithm that will look through all the fields in
> each row of the
> articles table and assign a weight/checksum to it?
>
> And then, in the similar articles part, display all the articles within
> +-5 of the
> currently displayed article's checksum?


Since you mention all the different fields, I suppose they all play
a part in deciding whether two articles are similar or not. You can't
map that many dimensions onto a single number and have it work the way
you want. The best you can hope for is a numeric representation
of *each field*, which can be compared to decide if articles are similar
with respect to one particular field. With some of the fields being
text strings, that won't be possible for all fields either.
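
To make the per-field idea concrete, here is a minimal sketch in Python (not Perl). It assumes articles are plain dicts with the id, section, author and description columns from the original post. The weights, the function names and the use of word-set overlap (Jaccard) for the text field are illustrative assumptions, not something proposed in this thread. Note that the text field is compared pairwise here rather than reduced to a single number per article.

```python
import heapq

# Hypothetical weights: how much each field counts towards the total.
WEIGHTS = {"section": 0.2, "author": 0.2, "description": 0.6}


def field_scores(a: dict, b: dict) -> dict:
    """Compare two articles one field at a time; each score is in [0, 1]."""
    words_a = set(a["description"].lower().split())
    words_b = set(b["description"].lower().split())
    union = words_a | words_b
    return {
        # Enumerated fields either match or they don't.
        "section": 1.0 if a["section"] == b["section"] else 0.0,
        "author": 1.0 if a["author"] == b["author"] else 0.0,
        # Jaccard overlap: shared words divided by all distinct words.
        "description": len(words_a & words_b) / len(union) if union else 0.0,
    }


def similarity(a: dict, b: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the per-field scores for one pair of articles."""
    scores = field_scores(a, b)
    return sum(weights[field] * scores[field] for field in weights)


def most_similar(current: dict, articles: list, n: int = 10) -> list:
    """Return the n articles most similar to current, excluding itself."""
    others = (art for art in articles if art["id"] != current["id"])
    return heapq.nlargest(n, others, key=lambda art: similarity(current, art))
```

Calling most_similar(current, articles) scans every article once, so for 50000 rows it may be worth computing each article's top matches offline and storing them, rather than scoring on every page view.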

Anno

