unexpected output from difflib.SequenceMatcher

 
 
Vlastimil Brom
      04-16-2010
Hi all,
Once in a while I stumble on some unexpected behaviour of
difflib.SequenceMatcher; in this case I hopefully managed to replicate
it on an illustrative sample text.
Both strings differ in a minimal way, each having one extra character
in a "strategic" position, which seems to hit some pathological case
for difflib.
Instead of just reporting the insertion and deletion of these single
characters (which works well in most cases, with most other
positions of the differing characters), SequenceMatcher decides to
delete a large part of the string between the differences and to
insert almost the same text after that.
I didn't find any mention of such cases in the documentation and,
honestly, I wasn't able to follow the source code of difflib well
enough to make it clearer, hence I would like to ask for some hints.
Can this behaviour be avoided or worked around in some way? (I thought
about repeatedly running SequenceMatcher on the replaced parts, but this
doesn't help if the opcodes contain an insertion and a deletion.)
Or is this maybe an inherent property of the algorithm, which
cannot be dealt with reasonably?

The attached code simply prints the results of the comparison with the
respective tags and substrings. No junk function is used.
I get the same results on Python 2.5.4, 2.6.5 and 3.1.1 on Windows XP SP3.

Thanks in advance for any hints,
Regards,
vbr

#############################################################

#! Python
# -*- coding: utf-8 -*-

import difflib

# Adjacent string literals inside the parentheses are joined into a
# single string, so each wrapped line keeps its trailing space.

# txt_a - extra character A at index 196
txt_a = ("Chapman: *I* don't know - Mr Wentworth just told me to come "
    "in here and say that there was trouble at the mill, that's all - I "
    "didn't expect a kind of Spanish Inquisition.[jarring chord] Ximinez: "
    "ANobody expects the Spanish Inquisition! Our chief weapon is "
    "surprise...surprise and fear...fear and surprise.... Our two weapons "
    "are fear and surprise...and ruthless efficiency.... Our *three* "
    "weapons are fear, surprise, and ruthless efficiency...and an almost "
    "fanatical devotion to the Pope.... Our *four*...no... *Amongst* our "
    "weapons.... Amongst our weaponry...are such elements as fear, "
    "surprise.... I'll come in again.")

# txt_b - extra character B at index 525
txt_b = ("Chapman: *I* don't know - Mr Wentworth just told me to come "
    "in here and say that there was trouble at the mill, that's all - I "
    "didn't expect a kind of Spanish Inquisition.[jarring chord] Ximinez: "
    "Nobody expects the Spanish Inquisition! Our chief weapon is "
    "surprise...surprise and fear...fear and surprise.... Our two weapons "
    "are fear and surprise...and ruthless efficiency.... Our *three* "
    "weapons are fear, surprise, and ruthless efficiency...and an almost "
    "fanatical devotion to the Pope.... Our *four*...no... *Amongst* our "
    "Bweapons.... Amongst our weaponry...are such elements as fear, "
    "surprise.... I'll come in again.")

# No junk function is passed (first argument is None).
seq_match = difflib.SequenceMatcher(None, txt_a, txt_b)

# One line per opcode: tag, slice indices and the substrings they cover.
print("\n".join("%7s a[%d:%d] (%s) b[%d:%d] (%s)" % (tag, i1, i2,
    txt_a[i1:i2], j1, j2, txt_b[j1:j2])
    for tag, i1, i2, j1, j2 in seq_match.get_opcodes()))
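
A possible workaround, sketched under one assumption: that the effect comes from difflib's automatic "popular element" heuristic, which for sequences of 200 or more items treats very frequent items as junk even when no junk function is given. Python 2.7.1+ and 3.2+ accept an autojunk keyword on SequenceMatcher to switch this heuristic off; the versions named in the post predate it, so the sketch below only applies to newer interpreters. BASE and changed_spans are illustrative names, not part of the original script.

```python
import difflib

# The sample text from the post, without either extra character.
BASE = (
    "Chapman: *I* don't know - Mr Wentworth just told me to come "
    "in here and say that there was trouble at the mill, that's all - I "
    "didn't expect a kind of Spanish Inquisition.[jarring chord] Ximinez: "
    "Nobody expects the Spanish Inquisition! Our chief weapon is "
    "surprise...surprise and fear...fear and surprise.... Our two weapons "
    "are fear and surprise...and ruthless efficiency.... Our *three* "
    "weapons are fear, surprise, and ruthless efficiency...and an almost "
    "fanatical devotion to the Pope.... Our *four*...no... *Amongst* our "
    "weapons.... Amongst our weaponry...are such elements as fear, "
    "surprise.... I'll come in again."
)

# Rebuild the two test strings: extra "A" at index 196, extra "B" at 525.
txt_a = BASE[:196] + "A" + BASE[196:]
txt_b = BASE[:525] + "B" + BASE[525:]


def changed_spans(a, b, autojunk):
    """Return (tag, a_substring, b_substring) for every non-equal opcode."""
    # autojunk requires Python 2.7.1+ / 3.2+ (assumption stated above).
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=autojunk)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


# Compare the default behaviour with the heuristic switched off.
for flag in (True, False):
    print("autojunk=%s" % flag)
    for tag, sub_a, sub_b in changed_spans(txt_a, txt_b, flag):
        print("%7s (%s) (%s)" % (tag, sub_a, sub_b))
```

With autojunk=False the matcher should report only the single deleted "A" and the single inserted "B". The trade-off is speed: the heuristic exists to keep long sequences with many repeated items fast, so turning it off may be noticeably slower on large inputs.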

 