Velocity Reviews > Perl > Difficulty manipulating hashes

# Difficulty manipulating hashes

limitz
Guest
Posts: n/a

 07-11-2007
Here is my script, followed by my question at the end:

print "\nThere are a total of 18990 bases in the entire sequence\n";
print "\nWhat is the starting base number? Keep in mind that Perl
begins the tally";
print "\nwith 0. So if you wanted to start from base 30, input in
29\n";

#Here we can define where we want the sequence to begin and end
print "Input Starting Number:\n";
\$beginning_of_sequence = <STDIN>;
print "Input Ending Base Number:\n";
\$end_of_sequence = <STDIN>;
\$length_of_sequence = \$end_of_sequence-\$beginning_of_sequence+1;

%dinucleotidepair = (
AT => 0,
AC => 0,
AG => 0,
AA => 0,

TA => 0,
TC => 0,
TG => 0,
TT => 0,

CA => 0,
CT => 0,
CG => 0,
CC => 0,

GA => 0,
GT => 0,
GC => 0,
GG => 0,
);

#\$AC = 'AC';
#\$ACcountss = 0;
#for my \$i (0 .. \$length_of_sequence-1) {
# my \$dinuc = substr(\$fastasequence, \$i, 2);
# if (\$dinuc =~ \$AC) {
# \$ACcountss++;
# }
#}

for my \$i (0 .. \$length_of_sequence-1) {
foreach (keys %dinucleotidepair) {
\$dinucleotidepair{\$_}++ if substr(\$fastasequence, \$i,
2) =~ /\$_/;
}

}

while ( my(\$keys,\$values) = each(%dinucleotidepair) ) {
print "\$keys \$values\n";

}

#print "The Fasta sequence segment has \$ACcountss AC's in
\$beginning_of_sequence to \$end_of_sequence",
#printf "for a relative frequency of %f\n", \$ACcountss/
\$length_of_sequence;

My new problem is this. I have to calculate the relative frequencies
for everything. So, what that means, is that if one of the keys in my
hash has an occurence, I have to divide that by \$length_of_sequence
to
find the relative frequency. Then, that relative frequency will be
used in another Perl script.

My question is this:
How do I manipulate individual elements in a hash
given the value of the key is not zero?

For example. In \$fastasequence, the first 30 nucleotide bases contain
4 occurences of the variable "AC" yet no occurences of the base
combination "TG".

Thus, the variable frequency of AC is 0.133.

Secondly, how do I save 0.133 as a variable that can be carried over
and used by another
Perl script for calcuation purposes?

Thanks!

~Frank

Dr.Ruud
Guest
Posts: n/a

 07-11-2007
limitz schreef:

> Here is my script, followed by my question at the end:

Missing:

#!/usr/bin/perl
use strict;
use warnings;

Never leave those out!

> print "\nThere are a total of 18990 bases in the entire sequence\n";
> print "\nWhat is the starting base number? Keep in mind that Perl
> begins the tally";
> print "\nwith 0. So if you wanted to start from base 30, input in
> 29\n";

Why not just subtract 1 from the input?

BTW, where does \$fastasequence get filled?

\$seq =~ s/[^ACGT]+//g; # remove any garbage like whitespace or
newlines

my \$seq_len = length \$seq;

print <<"EOT";

There are a total of \${seq_len} bases in the entire sequence

What is the starting base number?
EOT

> #Here we can define where we want the sequence to begin and end
> print "Input Starting Number:\n";
> \$beginning_of_sequence = <STDIN>;

my \$seq_begin = <STDIN>;

> print "Input Ending Base Number:\n";
> \$end_of_sequence = <STDIN>;

my \$seq_end = <STDIN>;

> \$length_of_sequence = \$end_of_sequence-\$beginning_of_sequence+1;

\$seq_len = \$seq_end - \$seq_begin + 1;

> %dinucleotidepair = (
> AT => 0,
> [...]
> GG => 0,
> );

my %dinucleotidepair;
\$dinucleotidepair{
qw( AA AC AG AT
CA CC CG CT
GA GC GG GT
TA TC TG TT ) } = (0) x 16;

> for my \$i (0 .. \$length_of_sequence-1) {
> foreach (keys %dinucleotidepair) {
> \$dinucleotidepair{\$_}++ if substr(\$fastasequence, \$i,
> 2) =~ /\$_/;
> }

Because you step through it by 1 character, a ATA will count an AT and a
TA, is that what you want?

You could then also replace your loops by

my %pair;

\$pair{\$_}++ for
(substr(\$seq, \$seq_begin - 1, \$seq_len) =~ /(..)/g);

\$pair{\$_}++ for
(substr(\$seq, \$seq_begin, \$seq_len - 1) =~ /(..)/g);

> while ( my(\$keys,\$values) = each(%dinucleotidepair) ) {
> print "\$keys \$values\n";
> }

print "\$_ \$pair{\$_}\n" for keys %pair;

There are in total (\$seq_len - 1) pairs, so just divide by that.

#!/usr/bin/perl
use strict;
use warnings;

my \$seq = do { local \$/; <DATA> };

# remove any garbage like whitespace or newlines
\$seq =~ s/[^ACGT]+//g;

my \$seq_len = length \$seq;

print qq(
There are a total of \${seq_len} bases in the entire sequence.\n
Input Starting Number: );
my \$seq_begin = <>;

print qq(
Input Ending Number : );
my \$seq_end = <>;

\$seq_len = \$seq_end - \$seq_begin + 1;
print qq(
Sequence length : \$seq_len\n\n);

my %pair;
@pair{ qw( AA AC AG AT
CA CC CG CT
GA GC GG GT
TA TC TG TT ) } = (0) x 16;

\$pair{\$_}++ for
(substr(\$seq, \$seq_begin - 1, \$seq_len) =~ /(..)/g);
\$pair{\$_}++ for
(substr(\$seq, \$seq_begin, \$seq_len - 1) =~ /(..)/g);

printf "%d pairs:\n--\n", scalar keys %pair;

for my \$k (sort keys %pair) {
my \$v = \$pair{\$k};
printf "%s %5d %5.3f\n", \$k, \$v, \$v / (\$seq_len - 1);
}

__DATA__
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG

--
Affijn, Ruud

"Gewoon is een tijger."

Guest
Posts: n/a

 07-12-2007
Dr.Ruud <(E-Mail Removed)> wrote:

> \$seq =~ s/[^ACGT]+//g; # remove any garbage like whitespace or newlines

\$seq =~ tr/ACGT//cd; # same thing, only much faster

--
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

John W. Krahn
Guest
Posts: n/a

 07-12-2007
Dr.Ruud wrote:
> limitz schreef:
>
>> Here is my script, followed by my question at the end:

>
> Missing:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> Never leave those out!
>
>
>> print "\nThere are a total of 18990 bases in the entire sequence\n";
>> print "\nWhat is the starting base number? Keep in mind that Perl
>> begins the tally";
>> print "\nwith 0. So if you wanted to start from base 30, input in
>> 29\n";

>
> Why not just subtract 1 from the input?
>
> BTW, where does \$fastasequence get filled?
>
>
> \$seq =~ s/[^ACGT]+//g; # remove any garbage like whitespace or
> newlines

Or do it more efficiently:

\$seq =~ tr/ACGT//cd; # remove any garbage like whitespace or newlines

> my \$seq_len = length \$seq;
>
> print <<"EOT";
>
> There are a total of \${seq_len} bases in the entire sequence

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

Guest
Posts: n/a

 07-12-2007
merl the perl <(E-Mail Removed)> wrote:

> So the arrays don't associate as in a *(b*c)=(a*b)*c, but the values and
> keys associate,

So far, so good.

> as in, if you have one, you can get the other.

Not so good.

If you have the key you can get the value, but not the other
way around (because there may be multiple keys with the same value).

> I find the
> semantics of perl quite challenging.

The semantics of a hash data structure are what you find challenging.

Hashes were around long before Perl, try googling "hash data structure".

--
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

limitz
Guest
Posts: n/a

 07-13-2007
This is my complete code. I'm at the very last step. Basically, I need
to multiply the variable \$relafreq as found near the end of the code
to the hash %dinucleotidescorepair and then sum the multiplied values
of the hash all together to return the score.

For example. If \$relafreq = .5, .4, .09, .01 and the values of
%dinucleotidescorepair is 5 8 3 6. I need to .5*5 + .4*8 + 3*.09 + .
01*6 then print the value out.

How would I go about doing this?

Thanks guy!

~Frank

#!/usr/bin/perl -w

#The filename of the file containing the e-coli sequence data

#This step is to open the file, if the file cannot be opened, an error
message is printed and the program exits
print "Could not open file \$fastadata!\n";
exit;
}

#Read the protein sequence into an array. This array will be used to
generate the Markov Transitional Matrices
@fasta = <FASTAFILE>;

#Close the FASTA data now as all the data has been read into the array
close FASTAFILE;

#Extracting the FASTA file from the array and putting it into a
sequence data

sub extract_sequence_from_fasta_data {
my(@fasta) =@_;

use strict;
use warnings;

#declaing the variables
my \$fastasequence = '';

foreach my \$line (@fasta) {

if (\$line =~ /^\s*\$/) {
next;

} elsif(\$line =~ /^\s*#/) {
next;

} elsif(\$line =~ /^>/) {
next;
#Keep line, add to sequence string
} else {
\$fastasequence .= \$line;
}
}

#remove non-sequence data from the \$sequence string
\$fastasequence =~ s/\s//g;

return \$fastasequence;
}

\$fastasequence = extract_sequence_from_fasta_data (@fasta);

#Split up the FASTA sequence into individual elements
@fastasequence = split( '' , \$fastasequence);

#extract data and remove fasta headers from the file to be scored
against the original fasta file
print "What is the filename?\n";
\$scoredata = <STDIN>;
unless ( open(SCOREFILE,\$scoredata) ) {
print "could not open file \$scoredata!\n";
exit;
}

@score = <SCOREFILE>;
close SCOREFILE;

#The sequence to be scored is now in identical format as our original
fasta sequence
\$scoresequence = extract_sequence_from_fasta_data (@score);

print "\nThere are a total of 18990 bases in the entire sequence\n";
print "\nWhat is the starting base number? Keep in mind that Perl
begins the tally";
print "\nwith 0. So if you wanted to start from base 30, input in
29\n";

#Here we can define where we want the sequence to begin and end
print "Input Starting Number:\n";
\$beginning_of_sequence = <STDIN>;
print "Input Ending Base Number:\n";
\$end_of_sequence = <STDIN>;
\$length_of_sequence = \$end_of_sequence-\$beginning_of_sequence+1;

#Counting the occurences for the file to be scored
%dinucleotidescorepair = (
AT => 0,
AC => 0,
AG => 0,
AA => 0,

TA => 0,
TC => 0,
TG => 0,
TT => 0,

CA => 0,
CT => 0,
CG => 0,
CC => 0,

GA => 0,
GT => 0,
GC => 0,
GG => 0,
);

for my \$i (0 .. \$length_of_sequence-1) {
foreach (keys %dinucleotidescorepair) {
\$dinucleotidescorepair{\$_}++ if substr(\$scoresequence, \$i, 2) =~ /
\$_/;
}
}
print "These are the occurences for the file to be scored\n";
while ( my(\$keys,\$values) = each(%dinucleotidescorepair) ) {
print "\$keys \$values\n";
}

#Counting and finding the frequencies for the original fasta file
%dinucleotidepair = (
AT => 0,
AC => 0,
AG => 0,
AA => 0,

TA => 0,
TC => 0,
TG => 0,
TT => 0,

CA => 0,
CT => 0,
CG => 0,
CC => 0,

GA => 0,
GT => 0,
GC => 0,
GG => 0,
);

for my \$i (0 .. \$length_of_sequence-1) {
foreach (keys %dinucleotidepair) {
\$dinucleotidepair{\$_}++ if substr(\$fastasequence, \$i, 2) =~ /\$_/;
}
}

print "\n";
print "These are the occurences for the original file (verified.fa)
\n";

while ( my(\$keys,\$values) = each(%dinucleotidepair) ) {
print "\$keys \$values\n";
}

print "\n";
print "These are the relative frequencies for the original file\n";

while ( my(\$keys,\$values) = each(%dinucleotidepair) ) {
\$relafreq = \$values/\$length_of_sequence;
printf "\$keys %f\n", \$relafreq;
}

#The calculation of the final score

Dr.Ruud
Guest
Posts: n/a

 07-13-2007
merl the perl schreef:
> "Dr.Ruud":

>> __DATA__
>> ACGTGCATGCACGTAATGATCCATG
>> ACGTGCATGCACGTAATGATCCATG

>
> Why do you want the same gene over and over again?

Just to get a quick first run of test data, of course. Is that really
hard for you to understand?
Just replace it by your favourite sequence.

code that limitz published.

--
Affijn, Ruud

"Gewoon is een tijger."