Thanks for your help.
My script now looks like this:
#!/usr/bin/perl
# Perl script to find most common CS
use strict;
use warnings;
my $infile = "/home/martin/DATABASE/large.txt";
open INFILE, $infile or die "****! Couldn't open file $infile: $!\n";
my %count;
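# Read one line at a time; a do {} while loop would run the body once
# before the first line has been read, leaving $_ undefined.
while (<INFILE>) {
s/^(\S+\s+){2}//;
$count{$_}++;
}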
print "$count{$_} $_" for keys %count;
__END__
So I'm feeding the file into the %count hash by removing the first two
columns with the identifier information and then counting the keys.
How can I still keep the identifier part of the line linked to each count?
That is the part I'm really interested in.
I can't keep the identifier in
the %count keys, since this would screw up the "for keys" part.
I checked perldoc -q and found how to remove duplicates but I don't think
I can rewrite this to do what I want.
The "for keys" method is brilliant but I'm losing the identifier.
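One possible way to keep the identifier attached, as a rough sketch only: key a hash on the sequence and store a list of identifiers as the value, capturing the identifier with the regex instead of deleting it. This assumes the first two whitespace-separated columns are the identifier and everything after them is the sequence. The names group_by_sequence and %ids_for are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: group identifiers by the sequence that follows them.
# Assumes columns 1 and 2 are the identifier and the rest is the sequence.
sub group_by_sequence {
    my @lines = @_;
    my %ids_for;
    for my $line (@lines) {
        # Capture identifier and sequence; skip lines that do not fit.
        my ($id, $seq) = $line =~ /^\s*(\S+\s+\S+)\s+(.*\S)/ or next;
        push @{ $ids_for{$seq} }, $id;
    }
    return \%ids_for;
}

my $ids_for = group_by_sequence(
    "810 141-2_1_2 4 10 21 37 58 83 111 145 184 226\n",
    "879 141_1_2 4 10 21 37 58 83 111 145 184 226\n",
    "811 141-2_1_6 4 12 24 42 64 92 124 162 204 252\n",
);
for my $seq (sort keys %$ids_for) {
    my @ids = @{ $ids_for->{$seq} };
    print scalar(@ids), " x $seq: ", join(", ", @ids), "\n";
}
```

The number of identifiers stored under a key is the count, so the "for keys" loop still works and nothing is lost.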
So I'm back to my original script which looks like this.
#!/usr/bin/perl
# Perl script to find most common CS
use strict;
use warnings;
my $infile = "/home/martin/DATABASE/large.txt";
open INFILE, $infile or die "****! Couldn't open file $infile: $!\n";
my @array = <INFILE>;
print "There are ", $#array+1, " lines in the large array\n";
my (@table);
foreach my $array (@array) {
push(@table, [split(/\s/, $array) ]);
}
for (my $k =0; $k<=$#array; $k++) {
print "$table[$k][1] $table[$k][2] occurs ";
my $matched=0;
for (my $h =0; $h<=$#array; $h++) { # $no_lines was never declared; use the last index of @array
my $match=0;
for (my $j =2; $j<=11; $j++ ) {
if ($table[$k][$j] == $table[$h][$j]){
$match++;
}
}
if ($match==10) {
$matched++;
}
}
print "$matched times\n";
} # end of large loop
But this sad-looking script is not very smart and is very slow, because it
compares every line against every other line. I would like the script to read
through the file once and identify each unique sequence. If there are duplicate
sequences in the file, it should print out how many and not revisit a line
that has already been counted as a duplicate.
My data file looks like this (a small section only):
810 141-2_1_2 4 10 21 37 58 83 111 145 184 226
811 141-2_1_6 4 12 24 42 64 92 124 162 204 252
812 141-2_1_7 4 11 23 44 67 95 134 168 215 271
879 141_1_2 4 10 21 37 58 83 111 145 184 226
880 141_1_6 4 12 24 42 64 92 124 162 204 252
881 141_1_7 4 11 23 44 67 95 134 168 215 271
882 152_1_15 4 12 26 44 72 104 138 178 228 282
883 152_1_23 4 10 21 40 65 96 134 180 230 286
884 152_1_24 4 10 21 40 65 96 134 180 230 286
885 152_1_3 4 12 22 40 66 102 128 168 218 268
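For data laid out like the lines above, a single-pass version might look something like the sketch below. It assumes column 1 is a record number, column 2 is the name, and the remaining columns are the sequence. The helper name duplicate_report and the output format are my own inventions, not anything from the replies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read each line once, then return one report line
# per distinct sequence, most frequent first.
sub duplicate_report {
    my @lines = @_;
    my (%names_for, @order);
    for my $line (@lines) {
        # Limit of 3: record number, name, and the rest of the line.
        my ($num, $name, $seq) = split ' ', $line, 3;
        next unless defined $seq;
        $seq = join ' ', split ' ', $seq;    # normalise spacing, drop newline
        push @order, $seq unless exists $names_for{$seq};
        push @{ $names_for{$seq} }, $name;
    }
    # Remember first-seen position so ties keep the file order.
    my %rank;
    @rank{@order} = 0 .. $#order;
    my @sorted = sort {
        @{ $names_for{$b} } <=> @{ $names_for{$a} }
            or $rank{$a} <=> $rank{$b}
    } @order;
    return map {
        my @names = @{ $names_for{$_} };
        scalar(@names) . " x [" . join(", ", @names) . "] $_";
    } @sorted;
}

print "$_\n" for duplicate_report(
    "883 152_1_23 4 10 21 40 65 96 134 180 230 286\n",
    "884 152_1_24 4 10 21 40 65 96 134 180 230 286\n",
    "885 152_1_3 4 12 22 40 66 102 128 168 218 268\n",
);
```

Each line is looked at once and each distinct sequence is reported once, so a duplicate is never revisited. The cost is one hash lookup per line instead of a nested loop over the whole file.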
Again many thanks for your help. I still don't get why you say
this newsgroup has been deleted. What is the url for the replacement
newsgroup?
wrote in message news:<. com>...
> (Martin Foster) wrote:
>
> > wrote:
> >
> > > I shall assume that you really want to count the number of times each
> > > distinct line appears in a file.
>
> > > perl -ne '$c{$_}++; END { print "$c{$_} $_" for keys %c }'
>
> > > Or as a script:
>
> > > $count{$_}++ while <>;
>
> > This is amazing, I don't understand how it works but it's very
> > powerful.
>
> If you look in the newsgroup that replaced this one when this one was
> deleted, you'll find every couple of months someone posts a script
> substantially like the one above and says "I found this - how does it
> work?".
>
> You could look at one of those threads.
>
> I believe it is also an example that is used in most Perl tutorials.
>
> > Can I use this script to compare n columns of a file, not the entire
> > file?
>
> No you can't use this _script_. But you can use the technique.
>
> Rather than keying %count on the whole line you can use some sort of
> string manipulation to extract just part of the line to consider. The
> most normal way to manipulate strings in Perl is the m// and s///
> operators.
>
> > I've got an identifier for each line at the beginning, for example
> >
> > 1666237 4 10 23 16 and so on. The identifier is an id to link to
> > something else and so on. I just want to compare the 10 columns with
> > the numbers.
>
> Well if, for example, we say the first 3 whitespace delimited columns
> are the identifier you could remove them thus:
>
> s/^(\S+\s+){3}// and $count{$_}++ while <>;
>
> > > I also suggest you post to newsgroups that still exist (this one
> > > doesn't, see FAQ). Your post will then be seen by many more people.
>
> > BTW where is the FAQ, which says this newsgroup no longer exists?
>
> The Perl FAQ is part of the standard Perl documentation that can be
> found on any computer on which Perl has been installed and also on
> various Perl-related web sites.