![]() |
longest common substring
Is there a standard algorithm or module which finds the N longest common
substrings in a set of text files? Here's the use case: I'm trying to clean up a very old, very large, and very ugly build system which has thousands of unparameterized compile/link commands in hundreds of Makefiles. I want to search them for frequently-occurring long substrings. Hopefully this will turn up phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or "-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for semantic meaning, make up reasonable names like $(SYS_LIBS) and $(UNIX_DEFINES), and do a global replace. Then repeat until satisfied. I've done some searching but haven't found anything close to the above. Thanks, HT |
Re: longest common substring
Henry Townsend <henry.townsend@not.here> wrote:
> Is there a standard algorithm or module which finds the N longest common > substrings in a set of text files? I would probably just go implement it. It would be probably be faster than looking for a pre-existing module and learning how to use it. > Here's the use case: I'm trying to clean up a very old, very large, and > very ugly build system which has thousands of unparameterized > compile/link commands in hundreds of Makefiles. I want to search them > for frequently-occurring long substrings. Hopefully this will turn up > phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or > "-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for semantic > meaning, make up reasonable names like $(SYS_LIBS) and $(UNIX_DEFINES), > and do a global replace. Then repeat until satisfied. I would probably do this with a series of perl one liners and gnu text utils, rather than all in Perl, but it would follow about the same pattern as below. For memory/performance reasons, I would pick a "reasonable" upper bound on the length of the common substrings. Once you have a list of candidates which are over 50 (say), then it should be easy enough to go back by hand and see which one is the absolute longest, but for present purposes that probably isn't even necessary. use strict; use warnings; my $max=50; # the longest string which is "long enough" my $hash; while (my $line=<>) { chomp $line; foreach (0..(length $line)-1) { $hash->{substr $line, $_, $max}++; } }; my $count=0; while(1) { my @common= grep {$hash->{$_}>1 and length == $max} keys %$hash; $count+=@common; print length $_, "\t$hash->{$_}\t$_\n" foreach @common; last if $count>10 or $max==0; my %hash2; $max--; $hash2{substr $_, 0, $max}++ foreach keys %$hash; $hash=\%hash2; } Xho -- -------------------- http://NewsReader.Com/ -------------------- Usenet Newsgroup Service $9.95/Month 30GB |
Re: longest common substring
On 1 Nov 2006, henry.townsend@not.here wrote:
> Is there a standard algorithm or module which finds the N longest > common substrings in a set of text files? > > Here's the use case: I'm trying to clean up a very old, very large, > and very ugly build system which has thousands of unparameterized > compile/link commands in hundreds of Makefiles. I want to search them > for frequently-occurring long substrings. Hopefully this will turn up > phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or > "-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for > semantic meaning, make up reasonable names like $(SYS_LIBS) and > $(UNIX_DEFINES), and do a global replace. Then repeat until satisfied. Seems like what you really want is to find what words commonly follow each other. In other words, it would be good to know that -lcurses usually follows -ltermlib. That will let you build sets of associated words (so, for instance, "-lcurses -ldl -lnsl" and "-lnsl -lcurses -ldl" will be noticed). To do that is fairly easy. Here I will show you with a script that reads from standard input or a set of files, and prints the contents of %h at the end. So you should be able to filter this by number of words that follow each other. You can also make sets of words that are strongly associated, meaning that they are likely to appear together. This will be much more useful, I think, than common substrings, which would break on space and order of options. I figured that order doesn't matter so I always sort the list of the two words. That way, "a b" and "b a" will count the same. Ted #!/usr/bin/perl use warnings; use strict; use Data::Dumper; my @words = split ' ', join('', <>); my %h; # this will hold the frequencies while (exists $words[1]) { my ($w1, $w2) = sort @words[0,1]; $h{$w1}->{$w2}++; shift @words; } print Dumper \%h; |
Re: longest common substring
xhoster@gmail.com wrote:
> > my $count=0; > while(1) { > my @common= grep {$hash->{$_}>1 and length == $max} keys %$hash; > $count+=@common; > print length $_, "\t$hash->{$_}\t$_\n" foreach @common; > last if $count>10 or $max==0; > my %hash2; > $max--; > $hash2{substr $_, 0, $max}++ foreach keys %$hash; But this should be: $hash2{substr $_, 0, $max}+=$hash->{$_} foreach keys %$hash; > $hash=\%hash2; > } > > Xho -- -------------------- http://NewsReader.Com/ -------------------- Usenet Newsgroup Service $9.95/Month 30GB |
Re: longest common substring
On Wed, 01 Nov 2006 10:28:05 -0500, Henry Townsend wrote:
> Is there a standard algorithm or module which finds the N longest common > substrings in a set of text files? http://search.cpan.org/~gray/Tree-Su...Tree/Suffix.pm -- Peter Scott http://www.perlmedic.com/ http://www.perldebugged.com/ |
| All times are GMT. The time now is 03:16 AM. |
Powered by vBulletin®. Copyright ©2000 - 2013, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.