Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Perl Misc (http://www.velocityreviews.com/forums/f67-perl-misc.html)
-   -   longest common substring (http://www.velocityreviews.com/forums/t900424-longest-common-substring.html)

Henry Townsend 11-01-2006 03:28 PM

longest common substring
 
Is there a standard algorithm or module which finds the N longest common
substrings in a set of text files?

Here's the use case: I'm trying to clean up a very old, very large, and
very ugly build system which has thousands of unparameterized
compile/link commands in hundreds of Makefiles. I want to search them
for frequently-occurring long substrings. Hopefully this will turn up
phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
"-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for semantic
meaning, make up reasonable names like $(SYS_LIBS) and $(UNIX_DEFINES),
and do a global replace. Then repeat until satisfied.

I've done some searching but haven't found anything close to the above.

Thanks,
HT

xhoster@gmail.com 11-01-2006 04:46 PM

Re: longest common substring
 
Henry Townsend <henry.townsend@not.here> wrote:
> Is there a standard algorithm or module which finds the N longest common
> substrings in a set of text files?


I would probably just go implement it. It would be probably be faster than
looking for a pre-existing module and learning how to use it.


> Here's the use case: I'm trying to clean up a very old, very large, and
> very ugly build system which has thousands of unparameterized
> compile/link commands in hundreds of Makefiles. I want to search them
> for frequently-occurring long substrings. Hopefully this will turn up
> phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
> "-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for semantic
> meaning, make up reasonable names like $(SYS_LIBS) and $(UNIX_DEFINES),
> and do a global replace. Then repeat until satisfied.


I would probably do this with a series of perl one liners and gnu text
utils, rather than all in Perl, but it would follow about the same pattern
as below. For memory/performance reasons, I would pick a "reasonable" upper
bound on the length of the common substrings. Once you have a list of
candidates which are over 50 (say), then it should be easy enough to go
back by hand and see which one is the absolute longest, but for present
purposes that probably isn't even necessary.


use strict;
use warnings;
my $max=50; # the longest string which is "long enough"
my $hash;
while (my $line=<>) { chomp $line;
foreach (0..(length $line)-1) {
$hash->{substr $line, $_, $max}++;
}
};


my $count=0;
while(1) {
my @common= grep {$hash->{$_}>1 and length == $max} keys %$hash;
$count+=@common;
print length $_, "\t$hash->{$_}\t$_\n" foreach @common;
last if $count>10 or $max==0;
my %hash2;
$max--;
$hash2{substr $_, 0, $max}++ foreach keys %$hash;
$hash=\%hash2;
}


Xho

--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB

Ted Zlatanov 11-01-2006 05:33 PM

Re: longest common substring
 
On 1 Nov 2006, henry.townsend@not.here wrote:

> Is there a standard algorithm or module which finds the N longest
> common substrings in a set of text files?
>
> Here's the use case: I'm trying to clean up a very old, very large,
> and very ugly build system which has thousands of unparameterized
> compile/link commands in hundreds of Makefiles. I want to search them
> for frequently-occurring long substrings. Hopefully this will turn up
> phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
> "-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for
> semantic meaning, make up reasonable names like $(SYS_LIBS) and
> $(UNIX_DEFINES), and do a global replace. Then repeat until satisfied.


Seems like what you really want is to find what words commonly follow
each other. In other words, it would be good to know that -lcurses
usually follows -ltermlib. That will let you build sets of associated
words (so, for instance, "-lcurses -ldl -lnsl" and "-lnsl -lcurses
-ldl" will be noticed).

To do that is fairly easy. Here I will show you with a script that
reads from standard input or a set of files, and prints the contents
of %h at the end. So you should be able to filter this by number of
words that follow each other. You can also make sets of words that
are strongly associated, meaning that they are likely to appear
together. This will be much more useful, I think, than common
substrings, which would break on space and order of options.

I figured that order doesn't matter so I always sort the list of the
two words. That way, "a b" and "b a" will count the same.

Ted

#!/usr/bin/perl

use warnings;
use strict;
use Data::Dumper;

my @words = split ' ', join('', <>);

my %h; # this will hold the frequencies
while (exists $words[1])
{
my ($w1, $w2) = sort @words[0,1];

$h{$w1}->{$w2}++;

shift @words;
}

print Dumper \%h;

xhoster@gmail.com 11-01-2006 08:05 PM

Re: longest common substring
 
xhoster@gmail.com wrote:
>
> my $count=0;
> while(1) {
> my @common= grep {$hash->{$_}>1 and length == $max} keys %$hash;
> $count+=@common;
> print length $_, "\t$hash->{$_}\t$_\n" foreach @common;
> last if $count>10 or $max==0;
> my %hash2;
> $max--;
> $hash2{substr $_, 0, $max}++ foreach keys %$hash;


But this should be:

$hash2{substr $_, 0, $max}+=$hash->{$_} foreach keys %$hash;

> $hash=\%hash2;
> }
>
> Xho


--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB

Peter Scott 11-02-2006 02:24 PM

Re: longest common substring
 
On Wed, 01 Nov 2006 10:28:05 -0500, Henry Townsend wrote:
> Is there a standard algorithm or module which finds the N longest common
> substrings in a set of text files?


http://search.cpan.org/~gray/Tree-Su...Tree/Suffix.pm

--
Peter Scott
http://www.perlmedic.com/
http://www.perldebugged.com/



All times are GMT. The time now is 08:08 AM.

Powered by vBulletin®. Copyright ©2000 - 2014, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.