Velocity Reviews > Ruby > Levenshtein_distance and recreate the string

Levenshtein_distance and recreate the string

Rick DeNatale
Guest
Posts: n/a

 04-30-2008
On Wed, Apr 30, 2008 at 8:25 AM, Vidar Hokstad <(E-Mail Removed)> wrote:

> Several people have given great answers as to why this isn't possible,
> but let me give you another reason why, just
> because I feel like it, and it's a useful thing to keep in mind.
>
> Whenever you get a problem like this, a good way of approaching it if
> you don't understand the algorithm is to think:
>
> - If it _was_ possible, would that allow you to make impossibly
> efficient compression algorithms?
>
> In this case, the answer is that yes it would: You could pick a fixed
> S2, for example an empty string, calculate the
> Levensthein distance from S1 to the fixed S2. The distance would be
> proportional to the length of S1. In fact, if
> you choose S2 to be the empty string, then the distance would be equal
> to the length of S1 (it'd take one deletion
> per position in S1 to end up with an empty string)
>
> In other words, you'd "compress" your test string "BRUY" down to the
> number 4, but more importantly that "compression"
> method would compress any string of any length, which is obviously
> impossible (the reason this is impossible is that
> if it was possible you could apply it to it's own output over and over
> until you had compressed an arbitrary input
> down to a bit).
>
> In this case it's not a very hard problem to solve, but I find that
> for a large number of questions about data
> compression because the answers often become blatantly
> obvious once you restate them that way.

Another thing to consider is why it's called a distance rather than,
say, a difference.

Consider this analogy.

Given that the highway distance between Raleigh, NC, and Washington,
DC is 250 miles. Given Washington, DC, and 250 miles, can I uniquely
identify Raleigh, NC.

No since the highway distance between Washington, DC and New York, NY
(as well as many other places) is also 250 miles.

--
Rick DeNatale

My blog on Ruby

Ken Bloom
Guest
Posts: n/a

 05-01-2008
On Tue, 29 Apr 2008 20:56:30 -0500, Phillip Gawlowski wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> David A. Black wrote:
> |
> | But isn't it the same for RUBY -> RUBE and RUBA -> RUBE ? In which |
> case, if you had RUBE and the L. distance, you could not recreate |
> RUBY; you'd have a whole set of strings that were exactly that |
> distance from RUBE.
>
> Good point. I don't know if the L. distance is unique for each
> possibility or not.

It's not unique. Any single deletion has edit distance 1, any single
insertion has edit distance 1.

Thus, the strings that have edit distance 1 from "RUBY" include (but are
not limited to):

#delete a letter
RUB
RUY
RBY
UBY
ARUBY
RAUBY
RUABY
RUBAY
RUBYA
BRUBY
RBUBY
RUBBY
RUBYB
....

That's not to say there aren't reasons for recreating all of these. For
Mengmeng Du[1], recreating all of these and checking for their existance
in a Bloom filter was orders of magnitude faster than computing the edit
distance between a query and every entry in a 200,000 entry list of last
names. (Just a paper I happen to be reading this week.)

--Ken

[1] Du, Mengmeng (2005) "Approximate Name Matching" Master's Degree
Project, Royal Institute of Technology, Stockholm, Sweden

--
Ken (Chanoch) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/