Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > difflib-like library supporting moved blocks detection?

Reply
Thread Tools

difflib-like library supporting moved blocks detection?

 
 
Vlastimil Brom
Guest
Posts: n/a
 
      07-13-2011
Hi all,
I'd like to ask about the availability of a text diff library, like
difflib, which would support the detection of moved text blocks.
Currently I am almost happy with the results of
difflib.SequenceMatcher in my app (comparing different versions of
natural language texts), however the only drawback seems to be the
missing detection of moves of the text parts. I was thinking of simply
recomparing the deleted and inserted blocks using difflib again, but
this obviously won't be a general solution.
I found several algorithm discussions, but unfortunately no suitable
python implementation. (E.g. Medite -
http://www-poleia.lip6.fr/~ganascia/Medite_Project - seems to be
implemented in Python but it targets some rather special and probably
much more complex textological issues, than my current needs.)
Does maybe someone know such python library (or possibly a way of
enhancing difflib) for this task (i.e character-wise comparing of
texts - detecting insertion, deletion, substitution and move of text
blocks)?

Thanks in advance,
Vlastimil Brom
 
Reply With Quote
 
 
 
 
Chris Torek
Guest
Posts: n/a
 
      07-14-2011
In article <(E-Mail Removed)>
Vlastimil Brom <(E-Mail Removed)> wrote:
>I'd like to ask about the availability of a text diff library, like
>difflib, which would support the detection of moved text blocks.


If you allow arbitrary moves, the "minimal edit distance" problem
(string-to-string edit) becomes substantially harder. If you only
allow insert, delete, or in-place-substitute, you have what is
called the "Levenshtein distance" case. If you allow transpositions
you get "Damerau-Levenshtein". These are both solveable with a
dynamic programming algorithm. Once you allow move operations,
though, the problem becomes NP-complete.

See http://pages.cs.brandeis.edu/~shapir...s/JDAmoves.pdf
for instance. (They give an algorithm that produces "usually
acceptable" results in polynomial time.)
--
In-Real-Life: Chris Torek, Wind River Systems
Intel require I note that my opinions are not those of WRS or Intel
Salt Lake City, UT, USA (4039.22'N, 11150.29'W) +1 801 277 2603
email: gmail (figure it out) http://web.torek.net/torek/index.html
 
Reply With Quote
 
 
 
 
Vlastimil Brom
Guest
Posts: n/a
 
      07-15-2011
2011/7/14 Chris Torek <(E-Mail Removed)>:
> In article <(E-Mail Removed)>
> Vlastimil Brom *<(E-Mail Removed)> wrote:
>>I'd like to ask about the availability of a text diff library, like
>>difflib, which would support the detection of moved text blocks.

>
> If you allow arbitrary moves, the "minimal edit distance" problem
> (string-to-string edit) becomes substantially harder. *If you only
> allow insert, delete, or in-place-substitute, you have what is
> called the "Levenshtein distance" case. *If you allow transpositions
> you get "Damerau-Levenshtein". *These are both solveable with a
> dynamic programming algorithm. *Once you allow move operations,
> though, the problem becomes NP-complete.
>
> See http://pages.cs.brandeis.edu/~shapir...s/JDAmoves.pdf
> for instance. *(They give an algorithm that produces "usually
> acceptable" results in polynomial time.)
> --
> In-Real-Life: Chris Torek, Wind River Systems
>
>

Thanks for the references and explanation!
I do realise the added complexity with taking the moves into account;
given that, my current needs and the usually satisfying results
obtained easily with difflib, I am not going to try to implement some
more complex diffing algorithm.
However, it turns out that the mentioned naive approach with just
recomparing the text additions and removals may be partially viable -
with some luck, i.e. given, the relevant segments are identified as
deletions and inserts and isolated by difflib in the first place (and
not subsumed under larger changes or split).

For illustration, the rough simplified code is attached (sorry for the
style and possible quirks...)
Just now the more similar text segments are just collected, it would
be also possible to sort them on their similarity ratio; the current
approach also allows to highlight potentially multiple moved segments.

Comments and suggestions are, of course, welcome,
regards,
vbr

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# # # # # # # # # # # # # # # # # # # # #

#! Python
# -*- coding: utf-8 -*-

import difflib
import itertools

def compare_moves(a, b, similarity_threshold=0.6):
"""
Poor man's text comparison with simple moves check. Compares two
strings using difflib
and additionally tries to detect moved blocks
by comparing similar deleted and inserted segments with each other
- given the similarity_threshold.
"""

seq_matcher = difflib.SequenceMatcher(isjunk=None, a=a, b=b, autojunk=False)
diff_raw = [[tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]] for tag, i1,
i2, j1, j2 in seq_matcher.get_opcodes()]

deleted, inserted = {}, {}
for tag, i1, i2, j1, j2 in seq_matcher.get_opcodes():
if tag == 'delete':
deleted[(i1, i2)] = [tag, i1, i2, j1, j2, a[i1:i2]]
elif tag == 'insert':
inserted[(i1, i2)] = [tag, i1, i2, j1, j2, b[j1:j2]]

possibly_moved_blocks = []
for deleted_item, inserted_item in
itertools.product(deleted.values(), inserted.values()):
similarity_ratio = difflib.SequenceMatcher(isjunk=None,
a=deleted_item[5], b=inserted_item[5], autojunk=False).ratio()
if similarity_ratio >= similarity_threshold:
possibly_moved_blocks.append([deleted_item, inserted_item,
similarity_ratio])

print diff_raw
print possibly_moved_blocks


if __name__ == "__main__":
compare_moves("abcXYZdeABfghijklmnopABBCq",
"ABCDabcdeACfgXYXZhijklmnopq", similarity_threshold = 0.6)

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# # # # # #
# output: #
[['insert', 0, 0, 0, 4, '', 'ABCD'], ['equal', 0, 3, 4, 7, 'abc',
'abc'], ['delete', 3, 6, 7, 7, 'XYZ', ''], ['equal', 6, 9, 7, 10,
'deA', 'deA'], ['replace', 9, 10, 10, 11, 'B', 'C'], ['equal', 10, 12,
11, 13, 'fg', 'fg'], ['insert', 12, 12, 13, 17, '', 'XYXZ'], ['equal',
12, 21, 17, 26, 'hijklmnop', 'hijklmnop'], ['delete', 21, 25, 26, 26,
'ABBC', ''], ['equal', 25, 26, 26, 27, 'q', 'q']]

[[['delete', 21, 25, 26, 26, 'ABBC'], ['insert', 0, 0, 0, 4, 'ABCD'],
0.75], [['delete', 3, 6, 7, 7, 'XYZ'], ['insert', 12, 12, 13, 17,
'XYXZ'], 0.8571428571428571]]

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Methods and blocks - not that clear when blocks passed into Steven Taylor Ruby 9 04-27-2009 08:46 AM
Is there any library supporting set operation. kuangye C++ 3 10-11-2008 08:05 PM
"Building Blocks" are "Application Blocks" Arjen ASP .Net 3 02-27-2005 01:06 AM
procs/blocks - blocks with procs, blocks with blocks? matt Ruby 1 08-06-2004 01:33 AM



Advertisments