Deduping quotations

 
 
Roedy Green
 
      11-30-2009
Have you ever noticed how the quotation websites have the same
quotations with tiny variations, or the same quote attributed to
several different authors? Sometimes there is a short and a long version
of the same quotation.

I was wondering how you might detect these.


I thought you might do it by converting all to lower case, stripping
punctuation and normalising white space to a single space.

Then you would remove common words.

Then you need to match, where order matters, but precise matching does
not. Just how would that work?
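
The preprocessing steps described above could look roughly like this in Java. This is only a sketch: the class name QuoteNormalizer and the tiny COMMON_WORDS list are made up for illustration, and a real stop-word list would be much longer. Note that stripping punctuation this way also turns "don't" into "dont".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class QuoteNormalizer {
    // Hypothetical, deliberately short list of common words to drop.
    static final Set<String> COMMON_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "of", "to", "in", "is", "it", "that"));

    // Lower-case, strip punctuation, collapse runs of white space to one space.
    static String normalize(String quote) {
        String s = quote.toLowerCase(Locale.ROOT);
        // Remove everything that is not a letter, a digit or white space.
        s = s.replaceAll("[^\\p{L}\\p{Nd}\\s]", "");
        return s.replaceAll("\\s+", " ").trim();
    }

    // Normalised words in their original order, with common words removed.
    static List<String> significantWords(String quote) {
        List<String> words = new ArrayList<>();
        for (String w : normalize(quote).split(" ")) {
            if (!w.isEmpty() && !COMMON_WORDS.contains(w)) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String raw = "  The  Proof,\tis in the PUDDING! ";
        System.out.println(normalize(raw));
        System.out.println(significantWords(raw));
    }
}
```

Keeping the result as an ordered list rather than a set preserves word order, so a later matching step can still take order into account.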


--
Roedy Green Canadian Mind Products
http://mindprod.com
I mean the word proof not in the sense of the lawyers, who set two half proofs equal to a whole one, but in the sense of a mathematician, where half proof = 0, and it is demanded for proof that every doubt becomes impossible.
~ Carl Friedrich Gauss
 
 
 
 
 
Arne Vajhøj
 
      11-30-2009
Roedy Green wrote:
> Have you ever noticed how the quotation websites have the same
> quotations with tiny variations? or the same quote attributed to
> several different authors. Sometimes there is a short and long version
> of the same quotation.


It is common.

Poor quoting can easily spread such variations.

> I was wondering how you might detect these.
>
> I thought you might do it by converting all to lower case, stripping
> punctuation and normalising white space to a single space.
>
> Then you would remove common words.
>
> Then you need to match, where order matters, but precise matching does
> not. Just how would that work?


Maybe:
- only look at the very specific words
- convert those to a standard form
- test if all of those are present
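
One possible reading of the three steps above, as a rough Java sketch. Everything here is an assumption made for illustration: "very specific" is approximated by word length (MIN_LENGTH), "standard form" is a crude trailing-"s" strip rather than real stemming, and the names SpecificWordMatcher and probablySame are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class SpecificWordMatcher {
    // Hypothetical heuristic: treat longer words as the "very specific" ones.
    static final int MIN_LENGTH = 6;

    // Crude standard form: drop a trailing "s". A real stemmer would do better.
    static String standardForm(String word) {
        return (word.length() > 3 && word.endsWith("s"))
                ? word.substring(0, word.length() - 1)
                : word;
    }

    // Collect the standardised specific words of a quotation.
    static Set<String> specificWords(String quote) {
        Set<String> result = new HashSet<>();
        String cleaned = quote.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\s]", " ");
        for (String w : cleaned.trim().split("\\s+")) {
            if (w.length() >= MIN_LENGTH) {
                result.add(standardForm(w));
            }
        }
        return result;
    }

    // True if every specific word of one quotation is present in the other.
    static boolean probablySame(String a, String b) {
        Set<String> sa = specificWords(a);
        Set<String> sb = specificWords(b);
        Set<String> smaller = sa.size() <= sb.size() ? sa : sb;
        Set<String> larger = (smaller == sa) ? sb : sa;
        return !smaller.isEmpty() && larger.containsAll(smaller);
    }

    public static void main(String[] args) {
        String shortVersion = "Imagination is more important than knowledge.";
        String longVersion =
                "Imagination is more important than knowledge, for knowledge is limited.";
        System.out.println(probablySame(shortVersion, longVersion));
    }
}
```

Testing the smaller word set against the larger one is what lets a short version of a quotation match its long version. It ignores word order entirely, so it is likely to need a second check to weed out false positives.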

Arne
 
 
 
 
 
Tom Anderson
 
      11-30-2009
On Sun, 29 Nov 2009, Roedy Green wrote:

> Have you ever noticed how the quotation websites have the same
> quotations with tiny variations? or the same quote attributed to several
> different authors. Sometimes there is a short and long version of the
> same quotation.
>
> I was wondering how you might detect these.


> I thought you might do it by converting all to lower case, stripping
> punctuation and normalising white space to a single space.


Then computing edit distances between all pairs of quotations:

http://en.wikipedia.org/wiki/Levenshtein_distance

And reporting those with distances below a certain threshold. I would
guess that for a database of a few hundred quotations, the analysis would
take under five minutes - probably under one minute, and probably a matter
of seconds.

Lucene has an implementation of this algorithm, and I imagine it's a fast
one. If you weren't satisfied with the speed of your own implementation
(and it's really not difficult), you could try finding and using that.
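
For what it's worth, a hand-rolled version might look something like the sketch below. It is the textbook two-row dynamic-programming Levenshtein distance plus a brute-force loop over all pairs; it does not use Lucene. The class name EditDistanceDeduper, the closePairs helper and the threshold value are assumptions for illustration, and the input strings are assumed to be normalised already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EditDistanceDeduper {
    // Classic Levenshtein distance, keeping only two rows of the table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Index pairs (i, j), i < j, whose distance is below the threshold.
    static List<int[]> closePairs(List<String> quotes, int threshold) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < quotes.size(); i++) {
            for (int j = i + 1; j < quotes.size(); j++) {
                if (levenshtein(quotes.get(i), quotes.get(j)) < threshold) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> quotes = Arrays.asList(
                "to be or not to be",
                "to be or not to bee",
                "i think therefore i am");
        for (int[] p : closePairs(quotes, 3)) {
            System.out.println(quotes.get(p[0]) + "  <->  " + quotes.get(p[1]));
        }
    }
}
```

A fixed threshold treats short and long quotations the same, so scaling it by the length of the longer string may work better. A plain edit distance will also not pair a short version with a much longer version of the same quotation, since the extra text counts as insertions.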

tom

--
No, Charlie, Tottenham Court Road is the Midlands. -- Lola, 'Kinky Boots'
 
 
 
 