RFC: Text similarity

 
 
Tore Aursand
04-23-2004

Hi!

I have a large (more than 3,000 at the moment) set of documents in various
formats (mostly PDF and Word). I need to create a sort of (...) index of
these documents based on their similarity. I thought it would be nice to
gather some suggestions from the people in this group before I proceeded.

First of all: Converting the documents to a more sensible format (text in
my case) is not the problem. The problem is the indexing and how to store
the data which represents the similarity between the documents.

I've done a search on CPAN and found a few modules which are of interest,
primarily AI::Categorize and WordNet. I haven't used any of these before,
but it seems like WordNet is the most appropriate one; AI::Categorize
seems to require you to categorize some of the documents first (which I
don't have the opportunity to do).

Are there any other modules I should take a look at? Any suggestions on
how I should deal with this task? Something you think I might forget?
Some traps I should look out for?

Any comments are appreciated! Thanks.


--
Tore Aursand <(E-Mail Removed)>
"First get your facts; then you can distort them at your leisure."
(Mark Twain)
 
James Willmore
04-23-2004

On Fri, 23 Apr 2004 14:16:53 +0200, Tore Aursand wrote:

> First of all: Converting the documents to a more sensible format (text in
> my case) is not the problem. The problem is the indexing and how to store
> the data which represents the similarity between the documents.


Just an insight or two ...

I'd use a database to store information about each document. This way,
you can use SQL to do things like count the word occurrences and create
stats on each document. Plus, you're comparing apples with apples - raw
word count with raw word count. It doesn't have to be a "real" database
(like MySQL or PostgreSQL) - it could be a Sprite or SQLite database.
The advantages to this approach are 1) you can try different options
out without having to re-parse 3,000 documents; 2) if you have more
documents to add or some to remove, a simple SQL statement or two is
easier than a whole lot of re-coding or re-thinking the parsing part of
your code. In fact, you can split the various parts of your logic into
different scripts that act as filters - one to parse the documents, one
to populate the database, and maybe a few to determine similarities.
All too often we think in terms of "once and done" when a few scripts
might be a better solution.
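
Something along these lines (untested, assuming DBD::SQLite; the table
and column names are made up for illustration):

use strict;
use warnings;
use DBI;

# Open (or create) an SQLite database file for the document stats.
my $dbh = DBI->connect('dbi:SQLite:dbname=docs.db', '', '',
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE IF NOT EXISTS doc_word (
              doc_id INTEGER, word TEXT, freq INTEGER)');

# Store one parsed document's raw word counts (word => count hashref),
# so later experiments become new queries instead of new parses.
sub store_counts {
    my ($doc_id, $counts) = @_;
    my $sth = $dbh->prepare(
        'INSERT INTO doc_word (doc_id, word, freq) VALUES (?, ?, ?)');
    $sth->execute($doc_id, $_, $counts->{$_}) for keys %$counts;
}

# e.g. SELECT word, SUM(freq) FROM doc_word GROUP BY word;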

I'd also look over one (or more) of the Lingua modules to establish
criteria for what to put into the database. I doubt you want to put a
whole lot of "the" and "a" entries into the database; that would inflate
the data source to about 5 times what it needs to be. So, using
something like Lingua::StopWords might help.
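
For example (untested; Lingua::StopWords ships stopword lists for
several languages, and the word-splitting here is deliberately naive):

use strict;
use warnings;
use Lingua::StopWords qw(getStopWords);

# getStopWords('en') returns a hashref of stopword => 1.
my $stop = getStopWords('en');

my $text  = 'the quick brown fox jumps over the lazy dog';
my @words = grep { !$stop->{$_} } split /\W+/, lc $text;

print "@words\n";    # articles and other noise words are gone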

There are Statistics modules as well. You could perform tests against
two documents and get a statistical correlation between them to see
*how* similar they are. I'm rusty on Statistics 101, but my thinking is
that a t-test between the two documents might be the way to go. This
may be overkill for what you want, but worth thinking about (for maybe
a minute or two). There may even be something easier to do.
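
If a t-test turns out to be overkill, even a plain Pearson correlation
over the counts of the words two documents share gives you a number to
sort by. A rough, untested sketch (word => count hashrefs in, a value
between -1 and 1 out):

use strict;
use warnings;

# Pearson correlation of two documents' counts over their shared
# vocabulary; $p and $q are hashrefs of word => count.
sub correlation {
    my ($p, $q) = @_;
    my @shared = grep { exists $q->{$_} } keys %$p;
    return 0 if @shared < 2;

    my $n = @shared;
    my ($sx, $sy, $sxx, $syy, $sxy) = (0) x 5;
    for my $w (@shared) {
        my ($x, $y) = ($p->{$w}, $q->{$w});
        $sx  += $x;      $sy  += $y;
        $sxx += $x * $x; $syy += $y * $y;
        $sxy += $x * $y;
    }
    my $cov = $sxy - $sx * $sy / $n;
    my $vx  = $sxx - $sx * $sx / $n;
    my $vy  = $syy - $sy * $sy / $n;
    return 0 unless $vx > 0 && $vy > 0;
    return $cov / sqrt($vx * $vy);
}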

[ ... ]

Just my $0.02
HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. See http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
The rhino is a homely beast, For human eyes he's not a feast.
Farewell, farewell, you old rhinoceros, I'll stare at something
less prepoceros. -- Ogden Nash
 
Michele Dondi
04-23-2004

On Fri, 23 Apr 2004 14:16:53 +0200, Tore Aursand <(E-Mail Removed)>
wrote:

>I have a large (more than 3,000 at the moment) set of documents in various
>formats (mostly PDF and Word). I need to create a sort of (...) index of
>these documents based on their similarity. I thought it would be nice to
>gather some suggestions from the people in this group before I proceeded.


I know that this may seem naive, but in a popular science magazine I
read that a paper has been published about a technique that identifies
the (natural) language documents are written in by compressing them
(e.g. with LZW) along with sample text from a bunch of different
languages and comparing the resulting compressed sizes. You may try
some variation on this scheme...
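
Untested, but the idea is easy to prototype with Compress::Zlib (zlib
rather than LZW, though the principle is the same) as a "normalized
compression distance" - smaller means more similar:

use strict;
use warnings;
use Compress::Zlib;    # exports compress()

# NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
sub ncd {
    my ($x, $y) = @_;
    my $cx  = length compress($x);
    my $cy  = length compress($y);
    my $cxy = length compress($x . $y);
    my ($min, $max) = $cx < $cy ? ($cx, $cy) : ($cy, $cx);
    return ($cxy - $min) / $max;
}

# Similar texts compress well together, so their NCD is closer to 0.
printf "%.3f\n", ncd('the cat sat on the mat', 'the cat sat on a mat');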

I for one would be interested in the results, BTW!


Michele
--
you'll see that it shouldn't be so. AND, the writting as usuall is
fantastic incompetent. To illustrate, i quote:
- Xah Lee trolling on clpmisc,
"perl bug File::Basename and Perl's nature"
 
Malcolm Dew-Jones
04-23-2004

Tore Aursand ((E-Mail Removed)) wrote:
: Hi!

: I have a large (more than 3,000 at the moment) set of documents in various
: formats (mostly PDF and Word). I need to create a sort of (...) index of
: these documents based on their similarity. I thought it would be nice to
: gather some suggestions from the people in this group before I proceeded.

: First of all: Converting the documents to a more sensible format (text in
: my case) is not the problem. The problem is the indexing and how to store
: the data which represents the similarity between the documents.

: I've done a search on CPAN and found a few modules which are of interest,
: primarily AI::Categorize and WordNet. I haven't used any of these before,
: but it seems like WordNet is the most appropriate one; AI::Categorize
: seems to require you to categorize some of the documents first (which I
: don't have the opportunity to do).

: Are there any other modules I should take a look at? Any suggestions on
: how I should deal with this task? Something you think I might forget?
: Some traps I should look out for?

: Any comments are appreciated! Thanks.

There is a Bayesian filter - not for spam - I think it's called ifile.

It helps file email into folders based on categories.

I could imagine starting the process by creating a few categories by
hand, each with one document or two or three similar documents, and then
adding documents using ifile. Each time a document doesn't have a good
match in the existing categories, create a new category.
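
In rough, untested Perl, that first idea might look like this;
load_documents() and score() are stand-ins for whatever loader and
matcher (ifile, word overlap, ...) you end up with, and the threshold
is made up:

use strict;
use warnings;

# Stubs: load_documents() would return the plain-text documents, and
# score() would be the real matcher.
sub load_documents { return () }
sub score          { return rand }

my $THRESHOLD  = 0.5;               # made-up cut-off, tune by eye
my @categories;                     # each: { docs => [...] }
my @documents  = load_documents();

for my $doc (@documents) {
    my ($best, $best_score) = (undef, $THRESHOLD);
    for my $cat (@categories) {
        my $s = score($doc, $cat);
        ($best, $best_score) = ($cat, $s) if $s > $best_score;
    }
    if ($best) { push @{ $best->{docs} }, $doc }          # file it
    else       { push @categories, { docs => [$doc] } }   # new category
}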

Or do it in reverse, start with 2,999 categories (one for each document,
except the last), and take the last document (number 3,000) and try to
file it into one of the 2,999 categories. Do that for each document to
get a feel for the process, and then start merging the categories.

$0.02
 
Ala Qumsieh
04-24-2004

Tore Aursand wrote:

> Any comments are appreciated! Thanks.


I would suggest taking your question to the perlai mailing list. I
recall a discussion about a similar problem a while ago.

--Ala

 
ctcgag@hotmail.com
04-24-2004

Michele Dondi <(E-Mail Removed)> wrote:
> On Fri, 23 Apr 2004 14:16:53 +0200, Tore Aursand <(E-Mail Removed)>
> wrote:
>
> >I have a large (more than 3,000 at the moment) set of documents in
> >various formats (mostly PDF and Word). I need to create a sort of (...)
> >index of these documents based on their similarity. I thought it would
> >be nice to gather some suggestions from the people in this group before
> >I proceeded.

>
> I know that this may seem naive, but in a popular science magazine I
> read that a paper has been published about a technique that identifies
> the (natural) language documents are written in by compressing them
> (e.g. with LZW) along with sample text from a bunch of different
> languages and comparing the resulting compressed sizes. You may try
> some variation on this scheme...


I've tried this in various incarnations. It works well for very short
files, but for longer files it takes some sort of preprocessing. Most
compressors either operate chunk-wise, starting over again once the
code-book is full, or have some other mechanism that compresses only
locally. So if you just append documents, and they are long, the
compressor will have forgotten about the section of one document by the
time it gets to the corresponding part of the other document.
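
One crude workaround (untested, reusing the compression-distance idea
from upthread): compare fixed-size leading slices instead of whole
files, so both pieces stay within the compressor's window:

use strict;
use warnings;
use Compress::Zlib;

my $SLICE = 16 * 1024;    # arbitrary; small enough to stay "local"

# Same normalized compression distance as upthread, but on leading
# slices only, so the compressor never forgets one document before it
# sees the other.
sub slice_ncd {
    my ($x, $y) = map { substr($_, 0, $SLICE) } @_;
    my $cx  = length compress($x);
    my $cy  = length compress($y);
    my $cxy = length compress($x . $y);
    my ($min, $max) = $cx < $cy ? ($cx, $cy) : ($cy, $cx);
    return ($cxy - $min) / $max;
}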

Xho


>
> I for one would be interested in the results, BTW!
>
> Michele


 
Tore Aursand
04-26-2004

On Fri, 23 Apr 2004 21:50:52 +0200, Michele Dondi wrote:
>> I have a large (more than 3,000 at the moment) set of documents in
>> various formats (mostly PDF and Word). I need to create a sort of
>> (...) index of these documents based on their similarity. I thought it
>> would be nice to gather some suggestions from the people in this group
>> before I proceeded.


> I know that this may seem naive, but in a popular science magazine I
> read that a paper has been published about a technique that identifies
> the (natural) language documents are written in by compressing them
> (e.g. with LZW) along with sample text from a bunch of different
> languages and comparing the resulting compressed sizes. You may try
> some variation on this scheme...


I really don't have the opportunity to categorize any of the documents;
everything must be 100% automatic, without human interference.

I should also point out that the text is mainly in Norwegian, but there
might be occurrences of English text (as we're talking about technical
manuals).

> I for one would be interested in the results, BTW!


I will keep you updated!


--
Tore Aursand <(E-Mail Removed)>
"First, God created idiots. That was just for practice. Then He created
school boards." (Mark Twain)
 
Tore Aursand
04-26-2004

On Fri, 23 Apr 2004 11:46:44 -0400, James Willmore wrote:
>> First of all: Converting the documents to a more sensible format (text
>> in my case) is not the problem. The problem is the indexing and how to
>> store the data which represents the similarity between the documents.


> I'd use a database to store information about each document.


That has already been taken care of; I will use MySQL for this, and
already have a database up and running which holds meta information
about each document (title, description and where it is stored).

The next step will be to retrieve all the words from each document,
remove obvious stopwords, and then associate each document with its
words (and how many times each word appears in the document).

Based on this information I will create a script which tries to find
similar documents based on the associated words; if two documents hold
a majority of the same words, they are bound to be similar.
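
A sketch of what I have in mind (the stopword list is a tiny stand-in
for a real Norwegian one, and the tokenizer is deliberately naive):

use strict;
use warnings;

my %stop = map { $_ => 1 } qw(og i det en som er);   # stand-in list

# Tokenize (naively), drop stopwords and one-letter tokens, count.
sub word_counts {
    my ($text) = @_;
    my %count;
    $count{$_}++ for grep { length($_) > 1 && !$stop{$_} }
                    split /\W+/, lc $text;
    return \%count;
}

# Fraction of the combined vocabulary both documents share; closer to
# 1 means they hold a majority of the same words.
sub overlap {
    my ($p, $q) = @_;
    my %union  = (%$p, %$q);
    my $shared = grep { exists $q->{$_} } keys %$p;
    return keys(%union) ? $shared / keys(%union) : 0;
}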

The documents are in Norwegian, though, so I'm not able to rely on
some of the excellent Lingua- and Stem- modules out there. I'm aware
that there are a few modules for the Norwegian language, too, but I'm
not quite sure about their quality (and whether they rely too much on
the Danish language, which at least some of them do).

The whole application is - of course - split into more than one script:

* Processing: Converting the documents to text, and converting the
text into words (and how many times each word appears).
* Inserting into the database.
* Similarity checking: A script which checks every document in the
database against all the other documents (see the sketch after this
list). Quite expensive, this one, but easily run around 5 in the
morning when everyone is asleep.
* Web frontend for querying the database (i.e. selecting/reading the
documents and letting the user choose to see related documents).
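
For the expensive similarity pass I'm hoping to push most of the
pairwise work into MySQL itself; something like this, perhaps (table
and column names are still tentative - one row per document/word pair
with its frequency):

use strict;
use warnings;
use DBI;

# Connection details are placeholders.
my $dbh = DBI->connect('dbi:mysql:database=docs', 'user', 'password',
                       { RaiseError => 1 });

# Count shared words for every document pair in one self-join.
my $pairs = $dbh->selectall_arrayref(<<'SQL');
SELECT a.doc_id, b.doc_id, COUNT(*) AS shared
FROM   doc_word a
JOIN   doc_word b ON a.word = b.word AND a.doc_id < b.doc_id
GROUP  BY a.doc_id, b.doc_id
ORDER  BY shared DESC
SQL

printf "%d and %d share %d words\n", @$_ for @$pairs;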

> There are Statistics modules as well. You could perform tests against
> two documents and get a statistical correlation between them to see
> *how* similar they are.


Hmm. Do you have any module names? A "brief search" didn't yield any
useful hits.

> I'm rusty on Statistics 101, but my thinking is that a t-test between
> the two documents might be the way to go.


I don't even know what a "t-test" is, but googling for "t-test" may give
me the answer...? Or should I search for something else (specific)?

> Just my $0.02


Great! Thanks a lot!


--
Tore Aursand <(E-Mail Removed)>
"Then there was the man who drowned crossing a stream with an average
depth of six inches." (W.I.E. Gates)
 
Michele Dondi
04-27-2004

On Tue, 27 Apr 2004 00:37:29 +0200, Tore Aursand <(E-Mail Removed)>
wrote:

>> I know that this may seem naive, but in a popular science magazine I
>> read that a paper has been published about a technique that identifies
>> the (natural) language documents are written in by compressing them
>> (e.g. with LZW) along with sample text from a bunch of different
>> languages and comparing the resulting compressed sizes. You may try
>> some variation on this scheme...

>
>I really don't have the opportunity to categorize any of the documents;
>everything must be 100% automatic, without human interference.


Well, you may try matching limited-sized portions of the documents
(after having converted them to pure text) against each other (I mean
across documents, not within the *same* document) and averaging the
result over a document.
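
Something like this, perhaps (untested; the chunk size is arbitrary,
and ncd() is the compression distance sketched earlier in the thread,
repeated here so the snippet stands alone):

use strict;
use warnings;
use Compress::Zlib;

my $CHUNK = 4096;    # arbitrary portion size

# Compression distance, as sketched earlier in the thread.
sub ncd {
    my ($x, $y) = @_;
    my ($cx, $cy) = (length compress($x), length compress($y));
    my $cxy = length compress($x . $y);
    return ($cxy - ($cx < $cy ? $cx : $cy)) / ($cx > $cy ? $cx : $cy);
}

# Cut a document into fixed-size pieces.
sub chunks {
    my ($text) = @_;
    my @c;
    push @c, substr($text, 0, $CHUNK, '') while length $text;
    return @c;
}

# Score every cross-document pair of pieces and average.
sub avg_similarity {
    my ($x, $y) = @_;
    my ($sum, $n) = (0, 0);
    for my $cx (chunks($x)) {
        for my $cy (chunks($y)) {
            $sum += 1 - ncd($cx, $cy);
            $n++;
        }
    }
    return $n ? $sum / $n : 0;
}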


Just my 2x10^-12 Eur,
Michele
--
$\=q.,.,$_=q.print' ,\g,,( w,a'c'e'h,,map{$_-=qif/g/;chr
}107..q[..117,q)[map+hex,split//,join' ,2B,, w$ECDF078D3'
F9'5F3014$,$,];];$\.=$/,s,q,32,g,s,g,112,g,y,' , q,,eval;
 
Tore Aursand
04-27-2004

On Tue, 27 Apr 2004 23:12:40 +0200, Michele Dondi wrote:
>>> I know that this may seem naive, but in a popular science magazine I
>>> read that a paper has been published about a technique that identifies
>>> the (natural) language documents are written in by compressing them
>>> (e.g. with LZW) along with sample text from a bunch of different
>>> languages and comparing the resulting compressed sizes. You may try
>>> some variation on this scheme...


>> I really don't have the opportunity to categorize any of the documents;
>> everything must be 100% automatic, without human interference.


> Well, you may try matching limited-sized portions of the documents
> (after having converted them to pure text) against each other (I mean
> across documents, not within the *same* document) and averaging the
> result over a document.


Because there will be _a lot_ of documents - where each document can be
quite big - I have to keep two things in mind:

* Processing power is limited, so the matching must be as lightweight
as possible, but at the same time as good as possible. Yeah, I know
how that sentence sounds.

* Data storage is also limited; I can't store each document (and
all its contents) in the database. I can only store meta data and
data related to the task of finding related documents.

The latter brings me to the point of extracting all the words from each
document, removing single characters, stopwords and numbers, and then
storing these words (and their frequency) in a document/word-mapped data
table. Quite simple, really.
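
The table could look something like this in MySQL (names still
tentative):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=docs', 'user', 'password',
                       { RaiseError => 1 });

# One row per document/word pair; the composite key prevents
# duplicates, and the index on word speeds up word-based joins.
$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS doc_word (
    doc_id INT         NOT NULL,
    word   VARCHAR(64) NOT NULL,
    freq   INT         NOT NULL,
    PRIMARY KEY (doc_id, word),
    KEY (word)
)
SQL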


--
Tore Aursand <(E-Mail Removed)>
"To cease smoking is the easiset thing I ever did. I ought to know,
I've done it a thousand times." (Mark Twain)
 




