Also sprach Veli-Pekka Tätilä:
> Hi,
> Here's a text processing problem that came to mind based on finding a
> certain set of sub strings. As I came up with this one on my own
> incidentally, I don't know any real good words for Googling useful
> information or solutions to the problem and thought I'd ask here then,
> having seen a number of basic string processing threads during my short-time
> presence.
>
> The Problem:
>
> Suppose we have a set S of character strings. Each string in the set is
> unique (well sets are), may be of arbitrary length and is composed of some
> known alphabet (e.g. a to z). The task is to write a function that takes an
> element E in S and returns all sub strings of E that uniquely identify E in
> S. Put another way, all the returned strings are sub strings of E but they
> may *not* be sub strings of any other element in S.
>
> Some Thoughts:
>
> Upon closer inspection, I also found out that the set of sub strings may
> contain an element that is equal to E if E differs from some other string in
> S by only one character. In that case, E as a whole is the only substring
> identifying E.
>
> Now, I must confess the problem is not inherently related to Perl in any way
> thus the Semi OT tag in the subject-line. Perl would be the language of
> choice for implementation personally, however, and thus some questions about
> that, too.
>
> There are bound to be better methods than the obvious brute-force tac of
> trying everything. Howabout the string matching, is substr the way to go or
> am I better off trying to hack regular expressions to do the job? If I'm
> aiming at low temporal complexity, what would be the worst case complexity
> for this problem? Surely not O(n). Performance is not an issue but I'm just
> curious and not guru enough in math or computer science to figure it out
> myself.
With a set of $n words, I would assume the complexity of a brute-force
approach to be
O( $n * ($n-1) * ($avg * ($avg-1) / 2) )
where $avg is the average word length, the term ($avg * ($avg-1) / 2)
being the number of substrings for an average word and ($n-1) the number
of words to test each substring against. As the average number of
substrings doesn't depend on the size of the set, it can be assumed to
be some constant in which case the algorithm is in O($n**2) for all
words and subsequently O($n) for retrieving the unique substrings for
one word.
Note that the question whether to use a pattern match or index() or
whatever doesn't really change the algorithm. In the one below I used
index() but could have done it just as easily using a pattern match.
Here's an implementation of such a brute-force algorithm which gave me
the opportunity to make use of Eric J. Roode's Iterator module recently
uploaded to the CPAN:
#!/usr/bin/perl -w
use strict;
use Iterator;
use List::MoreUtils qw/all/;
chomp( my @words = <DATA> );
for (@words) {
print "Unique ids for $_:\n ";
print join ", ", uniq_id($_, \@words);
print "\n";
}
sub uniq_id {
my ($word, $set) = @_;
my $iter = do {
my ($len, $off) = (1, 0);
my $i = Iterator->new(
sub {
Iterator::is_done() if $len > length $word;
my $sub = substr $word, $off++, $len;
($len, $off) = ($len + 1, 0)
if $off + $len > length $word;
return $sub;
}
);
$i;
};
my @ids;
while ($iter->isnt_exhausted) {
my $pat = $iter->value;
push @ids, $pat
if all { index($_, $pat) == -1 } grep { $_ ne $word } @$set;
}
return @ids;
}
__DATA__
abc
bcd
foobar
foo
bar
pearl
perl
modperl
One optimization immediately springs to my mind (unfortunately after
I've written the above):
The iterator returns the substrings in ascending order length-wise.
However, it's always so that if there is no unique substring of length
$m, there cannot possibly be one of $m-1. Therefore, have the iterator
return the substrings in descending order and rewrite the 'all { } grep
{ } @$set' condition into a real loop that exits early as soon as none
of the substrings of a given length are unique.
This does not really change the complexity of the algorithm as it only
makes the constant ($avg * ($avg-1) / 2) smaller. For small sets though
this constant is the dominant parameter.
> Motivation:
>
> By the way, the thing that got me into pondering this problem in the first
> place is somewhat related. I was writing a real crude mini-shell for a
> textmode app of mine and in one command the user had to choose a very long
> element (voice name) from a long list. To make the process easier, I allowed
> selection based on a partial match provided that only one element contained
> the sub string in question. Otherwise the matching elements would be listed.
> Then I started to think if I could generate all the strings that select a
> given element programatically, read lazily, and here's this post you're
> reading.
Something like TAB-completion offered by many shells? That would
decrease the complexity somewhat as you'd only have to test substrings
with offset zero (that is, at the beginning of the word).
> PS: I had a hard time titling this post appropriately. Feel free to change
> the subject to something more descriptive.
Your Subject was fine. You can't normally squeeze the whole content of a
posting into the Subject. If that was possible, postings wouldn't need a
body and we'd all communicate here by exchanging Subjects.
Tassilo
--
use bigint;
$n=71423350343770280161397026330337371139054411854 220053437565440;
$m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($ m+=

<=200);