Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > Semi OT: Uniquely Identifying Substrings for an Elem in a Set: substr, Sets and Complexity

Reply
Thread Tools

Semi OT: Uniquely Identifying Substrings for an Elem in a Set: substr, Sets and Complexity

 
 
Veli-Pekka Tštilš
Guest
Posts: n/a
 
      08-20-2005
Hi,
Here's a text processing problem that came to mind based on finding a
certain set of sub strings. As I came up with this one on my own
incidentally, I don't know any real good words for Googling useful
information or solutions to the problem and thought I'd ask here then,
having seen a number of basic string processing threads during my short-time
presence.

The Problem:

Suppose we have a set S of character strings. Each string in the set is
unique (well sets are), may be of arbitrary length and is composed of some
known alphabet (e.g. a to z). The task is to write a function that takes an
element E in S and returns all sub strings of E that uniquely identify E in
S. Put another way, all the returned strings are sub strings of E but they
may *not* be sub strings of any other element in S.

Some Thoughts:

Upon closer inspection, I also found out that the set of sub strings may
contain an element that is equal to E if E differs from some other string in
S by only one character. In that case, E as a whole is the only substring
identifying E.

Now, I must confess the problem is not inherently related to Perl in any way
thus the Semi OT tag in the subject-line. Perl would be the language of
choice for implementation personally, however, and thus some questions about
that, too.

There are bound to be better methods than the obvious brute-force tac of
trying everything. Howabout the string matching, is substr the way to go or
am I better off trying to hack regular expressions to do the job? If I'm
aiming at low temporal complexity, what would be the worst case complexity
for this problem? Surely not O(n). Performance is not an issue but I'm just
curious and not guru enough in math or computer science to figure it out
myself.

I'm certain the solution is relatively simple. I didn't find anything that
useful in the Perl Cookbook. The problem is not exactly a straight
intersection, difference etc... of two arrays but rather the set will
include elements that are not directly in the array being processed, though
they are based on an element of the array.

Motivation:

By the way, the thing that got me into pondering this problem in the first
place is somewhat related. I was writing a real crude mini-shell for a
textmode app of mine and in one command the user had to choose a very long
element (voice name) from a long list. To make the process easier, I allowed
selection based on a partial match provided that only one element contained
the sub string in question. Otherwise the matching elements would be listed.
Then I started to think if I could generate all the strings that select a
given element programatically, read lazily, and here's this post you're
reading.

PS: I had a hard time titling this post appropriately. Feel free to change
the subject to something more descriptive.

--
With kind regards Veli-Pekka Tštilš ((E-Mail Removed))
Accessibility, game music, synthesizers and programming:
http://www.student.oulu.fi/~vtatila/


 
Reply With Quote
 
 
 
 
Veli-Pekka Tštilš
Guest
Posts: n/a
 
      08-20-2005
Hi again,
First mistake spotted. When talking about the perl functions I naturally
ment index rather than substr. I think this does come down to my English
skills and the semantics. Sub string means any part of a larger original
string. yet the substr function won't tell if a given substring is found but
rather extracts a substring based on character indeces. Maybe the names
cutString, piece or contains would be more descriptive. Well, my bad as
usual.

PS: Please reply to my original message in most cases. If someone wants to
discuss Perl's naming further, do at least change the subject.

With kind regards Veli-Pekka Tštilš ((E-Mail Removed))
Accessibility, game music, synthesizers and programming:
http://www.student.oulu.fi/~vtatila/


 
Reply With Quote
 
 
 
 
xhoster@gmail.com
Guest
Posts: n/a
 
      08-20-2005
"Veli-Pekka Tštilš" <(E-Mail Removed)> wrote:
> Hi,
> Here's a text processing problem that came to mind based on finding a
> certain set of sub strings. As I came up with this one on my own
> incidentally, I don't know any real good words for Googling useful
> information or solutions to the problem and thought I'd ask here then,
> having seen a number of basic string processing threads during my
> short-time presence.
>
> The Problem:
>
> Suppose we have a set S of character strings. Each string in the set is
> unique (well sets are), may be of arbitrary length and is composed of
> some known alphabet (e.g. a to z). The task is to write a function that
> takes an element E in S and returns all sub strings of E that uniquely
> identify E in S. Put another way, all the returned strings are sub
> strings of E but they may *not* be sub strings of any other element in S.


This was discussed here two months ago, and someone claimed Text::Ngrams
was the/a solution. I don't whether it actually is or not, but you may
want to look up the original thread.


> Upon closer inspection, I also found out that the set of sub strings may
> contain an element that is equal to E if E differs from some other string
> in S by only one character. In that case, E as a whole is the only
> substring identifying E.


What if E itself is a substring of some other element of S? Then not even
E as a whole identifies E!

Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
Reply With Quote
 
Tassilo v. Parseval
Guest
Posts: n/a
 
      08-21-2005
Also sprach Veli-Pekka Tštilš:

> Hi,
> Here's a text processing problem that came to mind based on finding a
> certain set of sub strings. As I came up with this one on my own
> incidentally, I don't know any real good words for Googling useful
> information or solutions to the problem and thought I'd ask here then,
> having seen a number of basic string processing threads during my short-time
> presence.
>
> The Problem:
>
> Suppose we have a set S of character strings. Each string in the set is
> unique (well sets are), may be of arbitrary length and is composed of some
> known alphabet (e.g. a to z). The task is to write a function that takes an
> element E in S and returns all sub strings of E that uniquely identify E in
> S. Put another way, all the returned strings are sub strings of E but they
> may *not* be sub strings of any other element in S.
>
> Some Thoughts:
>
> Upon closer inspection, I also found out that the set of sub strings may
> contain an element that is equal to E if E differs from some other string in
> S by only one character. In that case, E as a whole is the only substring
> identifying E.
>
> Now, I must confess the problem is not inherently related to Perl in any way
> thus the Semi OT tag in the subject-line. Perl would be the language of
> choice for implementation personally, however, and thus some questions about
> that, too.
>
> There are bound to be better methods than the obvious brute-force tac of
> trying everything. Howabout the string matching, is substr the way to go or
> am I better off trying to hack regular expressions to do the job? If I'm
> aiming at low temporal complexity, what would be the worst case complexity
> for this problem? Surely not O(n). Performance is not an issue but I'm just
> curious and not guru enough in math or computer science to figure it out
> myself.


With a set of $n words, I would assume the complexity of a brute-force
approach to be

O( $n * ($n-1) * ($avg * ($avg-1) / 2) )

where $avg is the average word length, the term ($avg * ($avg-1) / 2)
being the number of substrings for an average word and ($n-1) the number
of words to test each substring against. As the average number of
substrings doesn't depend on the size of the set, it can be assumed to
be some constant in which case the algorithm is in O($n**2) for all
words and subsequently O($n) for retrieving the unique substrings for
one word.

Note that the question whether to use a pattern match or index() or
whatever doesn't really change the algorithm. In the one below I used
index() but could have done it just as easily using a pattern match.

Here's an implementation of such a brute-force algorithm which gave me
the opportunity to make use of Eric J. Roode's Iterator module recently
uploaded to the CPAN:

#!/usr/bin/perl -w

use strict;
use Iterator;
use List::MoreUtils qw/all/;

chomp( my @words = <DATA> );

for (@words) {
print "Unique ids for $_:\n ";
print join ", ", uniq_id($_, \@words);
print "\n";
}

sub uniq_id {
my ($word, $set) = @_;
my $iter = do {
my ($len, $off) = (1, 0);
my $i = Iterator->new(
sub {
Iterator::is_done() if $len > length $word;
my $sub = substr $word, $off++, $len;
($len, $off) = ($len + 1, 0)
if $off + $len > length $word;
return $sub;
}
);
$i;
};
my @ids;
while ($iter->isnt_exhausted) {
my $pat = $iter->value;
push @ids, $pat
if all { index($_, $pat) == -1 } grep { $_ ne $word } @$set;
}
return @ids;
}

__DATA__
abc
bcd
foobar
foo
bar
pearl
perl
modperl

One optimization immediately springs to my mind (unfortunately after
I've written the above):

The iterator returns the substrings in ascending order length-wise.
However, it's always so that if there is no unique substring of length
$m, there cannot possibly be one of $m-1. Therefore, have the iterator
return the substrings in descending order and rewrite the 'all { } grep
{ } @$set' condition into a real loop that exits early as soon as none
of the substrings of a given length are unique.

This does not really change the complexity of the algorithm as it only
makes the constant ($avg * ($avg-1) / 2) smaller. For small sets though
this constant is the dominant parameter.

> Motivation:
>
> By the way, the thing that got me into pondering this problem in the first
> place is somewhat related. I was writing a real crude mini-shell for a
> textmode app of mine and in one command the user had to choose a very long
> element (voice name) from a long list. To make the process easier, I allowed
> selection based on a partial match provided that only one element contained
> the sub string in question. Otherwise the matching elements would be listed.
> Then I started to think if I could generate all the strings that select a
> given element programatically, read lazily, and here's this post you're
> reading.


Something like TAB-completion offered by many shells? That would
decrease the complexity somewhat as you'd only have to test substrings
with offset zero (that is, at the beginning of the word).

> PS: I had a hard time titling this post appropriately. Feel free to change
> the subject to something more descriptive.


Your Subject was fine. You can't normally squeeze the whole content of a
posting into the Subject. If that was possible, postings wouldn't need a
body and we'd all communicate here by exchanging Subjects.

Tassilo
--
use bigint;
$n=71423350343770280161397026330337371139054411854 220053437565440;
$m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($ m+=<=200);
 
Reply With Quote
 
Anno Siegel
Guest
Posts: n/a
 
      08-22-2005
Veli-Pekka Tštilš <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> Hi,
> Here's a text processing problem that came to mind based on finding a
> certain set of sub strings. As I came up with this one on my own
> incidentally, I don't know any real good words for Googling useful
> information or solutions to the problem and thought I'd ask here then,
> having seen a number of basic string processing threads during my short-time
> presence.
>
> The Problem:
>
> Suppose we have a set S of character strings. Each string in the set is
> unique (well sets are), may be of arbitrary length and is composed of some
> known alphabet (e.g. a to z). The task is to write a function that takes an
> element E in S and returns all sub strings of E that uniquely identify E in
> S. Put another way, all the returned strings are sub strings of E but they
> may *not* be sub strings of any other element in S.


Here is a brute-force solution. Generate all substrings of all strings
in the set and remember the string each substring was generated from.
(That happens in the hash %coll.) Then the substrings that were generated
from exactly one string are the ones you're looking for.

If you have to find solutions to your problem repeatedly (with the same
basic set of strings), my solution may even be practical. Its main
virtue is that the algortithm is so simple, it is almost certainly correct.

With a bow to Tassilo for providing a good example set:

#!/usr/bin/perl
use strict; use warnings; $| = 1;

my %coll;
for my $str ( <DATA> ) {
chomp $str;
push @{ $coll{ $_}}, $str for substrings( $str);
}

# postprocessing for nice output
my %abbrev;
@{ $coll{ $_}} == 1 and push @{ $abbrev{ $coll{ $_}->[ 0]}}, $_ for
keys %coll;

print "$_ -> @{ $abbrev{ $_}}\n" for sort keys %abbrev;

################################################## ###############

sub substrings {
my $str = shift;
map substr_n( $str, $_), 0 .. length $str;
}

sub substr_n {
my ($str, $l) = @_;
map substr( $str, $_, $l), 0 .. length( $str) - $l;
}

__DATA__
abc
bcd
foobar
foo
bar
pearl
perl
modperl

Anno
--
If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers.
 
Reply With Quote
 
xhoster@gmail.com
Guest
Posts: n/a
 
      08-22-2005
http://www.velocityreviews.com/forums/(E-Mail Removed)-berlin.de (Anno Siegel) wrote:
> >
> > The Problem:
> >
> > Suppose we have a set S of character strings. Each string in the set is
> > unique (well sets are), may be of arbitrary length and is composed of
> > some known alphabet (e.g. a to z). The task is to write a function that
> > takes an element E in S and returns all sub strings of E that uniquely
> > identify E in S. Put another way, all the returned strings are sub
> > strings of E but they may *not* be sub strings of any other element in
> > S.

>
> Here is a brute-force solution. Generate all substrings of all strings
> in the set and remember the string each substring was generated from.
> (That happens in the hash %coll.) Then the substrings that were
> generated from exactly one string are the ones you're looking for.
>
> If you have to find solutions to your problem repeatedly (with the same
> basic set of strings), my solution may even be practical. Its main
> virtue is that the algortithm is so simple, it is almost certainly
> correct.




If you add 'mississippi' to the list, it misses 'iss' as a identifying
substring (and I suppose others), due to the internal duplication.

> With a bow to Tassilo for providing a good example set:
>
> #!/usr/bin/perl
> use strict; use warnings; $| = 1;
>
> my %coll;
> for my $str ( <DATA> ) {
> chomp $str;
> push @{ $coll{ $_}}, $str for substrings( $str);
> }


You should either use a hash of hash rather than a hash of array, (which
seems to me like it would be a more natural anyway) or change substrings so
that it only returns one copy of each substring for any given invocation.


Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
 
Reply With Quote
 
Anno Siegel
Guest
Posts: n/a
 
      08-23-2005
<(E-Mail Removed)> wrote in comp.lang.perl.misc:
> (E-Mail Removed)-berlin.de (Anno Siegel) wrote:
> > >
> > > The Problem:
> > >
> > > Suppose we have a set S of character strings. Each string in the set is
> > > unique (well sets are), may be of arbitrary length and is composed of
> > > some known alphabet (e.g. a to z). The task is to write a function that
> > > takes an element E in S and returns all sub strings of E that uniquely
> > > identify E in S. Put another way, all the returned strings are sub
> > > strings of E but they may *not* be sub strings of any other element in
> > > S.

> >
> > Here is a brute-force solution. Generate all substrings of all strings
> > in the set and remember the string each substring was generated from.
> > (That happens in the hash %coll.) Then the substrings that were
> > generated from exactly one string are the ones you're looking for.
> >
> > If you have to find solutions to your problem repeatedly (with the same
> > basic set of strings), my solution may even be practical. Its main
> > virtue is that the algortithm is so simple, it is almost certainly
> > correct.

>
>
>
> If you add 'mississippi' to the list, it misses 'iss' as a identifying
> substring (and I suppose others), due to the internal duplication.


Oh dear! Good thing I said "almost certainly".

[Advice snipped but incorporated below]

Anno

#!/usr/bin/perl
use strict; use warnings; $| = 1;

my %coll;
for my $str ( <DATA> ) {
chomp $str;
$coll{ $_}->{ $str} = 1 for substrings( $str);
}

# postprocessing for nice output
my %abbrev;

for ( keys %coll ) {
if ( ( my ( $str) = keys %{ $coll{ $_}}) == 1 ) {
push @{ $abbrev{ $str}}, $_;
}
}

@$_ = sort { length( $a) <=> length( $b) || $a cmp $b } @$_ for values %abbrev;

print "$_ -> @{ $abbrev{ $_}}\n" for sort keys %abbrev;

################################################## ###############

sub substrings {
my $str = shift;
map substr_n( $str, $_), 0 .. length $str;
}

sub substr_n {
my ($str, $l) = @_;
map substr( $str, $_, $l), 0 .. length( $str) - $l;
}

__DATA__
abc
bcd
foobar
foo
bar
pearl
perl
modperl
mississippi
--
If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers.
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Uniquely identifying each & every html template Ferrous Cranus Python 58 01-24-2013 12:53 AM
py2app semi-standalone semi-works James Stroud Python 2 10-04-2006 08:35 PM
could not determine which columns uniquely identify the rows for ... bazzer ASP .Net 0 04-10-2006 11:09 AM
How do I uniquely identify a control? Alan Silver ASP .Net 6 02-24-2005 06:31 PM
The best way to uniquely identify anonymous visitors muser8@hotmail.com ASP .Net 2 07-26-2004 11:47 PM



Advertisments