similar articles algorithm based on numeric indexing of all rows via columns in a table

 
 
julie_smith@operamail.com
      01-18-2005
Hi,
I have an articles table containing columns like
id, name, author, section, creationdate, description, longmatter, etc.
I am using mysql.

Some of them are fixed-value fields (enumerations),

e.g. section will have news, sports, politics, etc.,

while description will be a text field with any amount of arbitrary
text.

Now I have 50000 articles under different sections.

I want to implement a "similar articles" feature.
By this I mean when an article is shown,
I want to display all the articles similar to it (10
per page).

Now how do I calculate the similarity of 1 article with all the 50000
articles?

I don't want articles from the same section only.
Since the search result has to be very fast,
can I create some algorithm that will look through all the fields in
each row of the
articles table and assign a weight/checksum to it?

And then, in the similar articles part, I display all the articles with a
checksum within +-5 of the
currently displayed article's checksum?

Thanks in advance,

Julie

 
Gunnar Hjalmarsson
      01-18-2005
julie_smith@operamail.com wrote:
> I want to implement a "similar articles" feature.
> By this I mean when an article is shown,
> I want to display all the articles similar to it (10
> per page).
>
> Now how do I calculate the similarity of 1 article with all the 50000
> articles?
>
> I don't want articles from the same section only.
> Since the search result has to be very fast,
> can I create some algorithm that will look through all the fields in
> each row of the
> articles table and assign a weight/checksum to it?


Check out the CPAN module Algorithm::Diff.
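
For readers outside Perl, here is a rough sketch of the same idea. Algorithm::Diff is built around longest common subsequences; Python's standard difflib.SequenceMatcher uses a related but not identical matching algorithm, so treat this as an analogue rather than a port. The function name text_similarity and the sample sentences are made up for illustration.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the word sequences are identical."""
    # Compare word by word rather than character by character, so that
    # shared phrases count for more than shared letters.
    words_a = a.lower().split()
    words_b = b.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio()


if __name__ == "__main__":
    d1 = "Local team wins the championship final"
    d2 = "Local team loses the championship final"
    d3 = "Parliament votes on new budget"
    print(text_similarity(d1, d2))  # one word differs, so the score is high
    print(text_similarity(d1, d3))  # no words in common, so the score is low
```

A diff-style score like this only covers the free-text fields; it says nothing about enumerated columns such as section.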

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
 
Anno Siegel
      01-18-2005
<julie_smith@operamail.com> wrote in comp.lang.perl.misc:
> Hi,
> I have an articles table containing columns like
> id, name, author, section, creationdate, description, longmatter, etc.
> I am using mysql.
>
> Some of them are fixed-value fields (enumerations),
>
> e.g. section will have news, sports, politics, etc.,
>
> while description will be a text field with any amount of arbitrary
> text.
>
> Now I have 50000 articles under different sections.
>
> I want to implement a "similar articles" feature.


Okay. Given two articles, how do you decide if they are similar?

> By this I mean when an article is shown,
> I want to display all the articles similar to it (10
> per page).


What you are going to do with the list of similar articles has
no bearing on how you select them.

> Now how do I calculate the similarity of 1 article with all the 50000
> articles?


First you have to tell us how to compare two individual articles, *then*
we can talk about ways to apply this to many pairs efficiently.
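
Once a pairwise measure exists, comparing one article against all the others is a ranking problem. A minimal sketch, assuming articles are held as dicts with id and description keys; the similarity function here (word overlap on the description) is only a stand-in for whatever measure is eventually chosen, and most_similar is a made-up name.

```python
import heapq


def similarity(a: dict, b: dict) -> float:
    """Stand-in measure (an assumption): Jaccard overlap of description words."""
    words_a = set(a["description"].lower().split())
    words_b = set(b["description"].lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def most_similar(current: dict, articles: list, k: int = 10) -> list:
    """Return the k articles most similar to `current`, best first."""
    # Skip the article being displayed, then keep only the k best scores
    # without sorting the whole collection.
    others = (art for art in articles if art["id"] != current["id"])
    return heapq.nlargest(k, others, key=lambda art: similarity(current, art))
```

This is one linear scan per displayed article. If that proves too slow for 50000 rows, the top matches for each article could be computed offline and stored, so that the page only has to look them up.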

> I don't want articles from the same section only.
> Since the search result has to be very fast,
> can I create some algorithm that will look through all the fields in
> each row of the
> articles table and assign a weight/checksum to it?
>
> And then, in the similar articles part, I display all the articles with a
> checksum within +-5 of the
> currently displayed article's checksum?


Since you mention all the different fields, I suppose they all play
a part in deciding whether two articles are similar or not. You can't
map that many dimensions onto a single number and have it work the way
you want. The best you can hope for is a numeric representation
of *each field*, which can be compared to decide if articles are similar
with respect to one particular field. With some of the fields being
text strings, that won't be possible for all fields either.
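
One way to read this advice is to score each field separately and only then combine the scores, instead of collapsing a whole row into one checksum. The sketch below is an illustration under stated assumptions, not something proposed in the thread: the names field_scores, combined_score and WEIGHTS, the weight values, the exact-match rule for section and author, and the word-overlap rule for description are all made up here.

```python
def field_scores(a: dict, b: dict) -> dict:
    """Compare two articles field by field; each score lies in [0, 1]."""
    words_a = set(a["description"].lower().split())
    words_b = set(b["description"].lower().split())
    union = words_a | words_b
    return {
        # Enumerated fields: either they match or they do not.
        "section": 1.0 if a["section"] == b["section"] else 0.0,
        "author": 1.0 if a["author"] == b["author"] else 0.0,
        # Free text: fraction of shared words (Jaccard index).
        "description": len(words_a & words_b) / len(union) if union else 0.0,
    }


# Hypothetical weights; they would need tuning against real articles.
WEIGHTS = {"section": 0.2, "author": 0.2, "description": 0.6}


def combined_score(a: dict, b: dict) -> float:
    """Weighted sum of the per-field scores, computed for a pair of articles."""
    scores = field_scores(a, b)
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Note that the combined number belongs to a pair of articles, not to a single article, which is why it cannot be stored as a per-row checksum and compared with a +-5 window.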

Anno
 