Parsing/sorting big file problem

 
 
mcvallet@hotmail.com
 
      02-24-2006
Hi,
I am writing a program that parses a 370 MB file. As long as I keep this
number below 1000 in this portion:
# basically tells me when to stop reading the file
if ($ligne =~ m/^.*1000>>>(\w+).*/){
$stop= 1;
}
it works, but as soon as I increase the number (the maximum being
2225), so that I am not even reading half of the file, the program stops
responding. Does anybody have a suggestion for this?
thank you,
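
For readers following along, here is a minimal sketch of the stop condition described above, written so that the cut-off is a number rather than part of the pattern. The sample lines and the $limit variable are assumptions made for illustration; only the "N>>> name" header shape comes from the thread. Note that a pattern such as .*1000>>> would also match a header numbered 21000, which a numeric comparison avoids.

```perl
use strict;
use warnings;

# Hypothetical sample lines; only the "N>>> name" header shape is from the thread.
my @lines = (
    "1>>> d1tima_ 244 fragments - 244 aa",
    "1dqzB0 ( 277) 4276 20.6 99",
    "1200>>> d2abcd_ 100 fragments - 100 aa",
    "1hbnC0 ( 244) 4193 20.4 1e+02",
);

my $limit = 1200;    # assumed cut-off, replacing the number hard-coded in the regex
my @seen;            # header numbers processed before stopping

for my $ligne (@lines) {
    # Capture the record number and the protein name from a header line.
    if ( $ligne =~ /^\s*(\d+)>>>\s*(\w+)/ ) {
        my ( $number, $name ) = ( $1, $2 );
        last if $number >= $limit;    # numeric comparison, no pattern edit needed
        push @seen, $number;
    }
}

print "headers processed: @seen\n";
```

Unlike the posted loop, this sketch stops before handling the header that reaches the limit.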


##############################################################################
$#complete = 4000000;

open(OUTPUTFILE, $outPut)
|| die "cannot open file";

#variable initialisation
my $countTotPositive = 0;
my $countTotNegative = 0;
my $stop= 0;
my $countTotProt = 0;
my @start = times();


while(($ligne = <OUTPUTFILE> ) && $stop == 0){
#identifying the protein being compared
if ($ligne =~ m/^.+(\d*)+>>>\s*(\w+).*/){
#the next commented lignes are here for test purposes
if ($ligne =~ m/^.*1200>>>(\w+).*/){
$stop= 1;
}
$protName1 = $2;
$protName1 =~ s/_//g;
$count = 0;
}
#parsing the results
else{
$_=$ligne ;
my $evalue= 0;
/^\s?(\w+).*\s+\(\s*(\d+)\)\W+(\d+)\W+(\d*)\.?(\d*) \W+(\d*)\.?(\d*)e?\+?(\d{1,2})$/so;
my $protName2=$1;
my $nbAa=$2;
my $eval3=$3;
my $eval4=$4;
my $eval5=$5;
$eval[0]="$6";
$eval[1]=$7;
my $eval8=$8;
$protName2 =~ s/_//g;
#finding out what is the evalue for this result
if ($ligne =~ m/e\+(\d{2,2})$/so){
$evalue = $eval[0].".".@eval;
for ($i = 0; $i < $eval8; $i++){
$evalue = $evalue * 10;
}
}else{
if ($eval[0] =~ m/^0/){
$evalue = $eval[0].".".$eval[1].$eval8;
}else{
$evalue = $eval[0].$eval[1].$eval8;
}
}

@sortedCouple = sort($protName1,$protName2);

if ($complete{"$sortedCouple[0]-$sortedCouple[1]"}[0]
|| $sortedCouple[0] =~ m/$sortedCouple[1]/i){

$evalue2 = $evalue;
#modifying the evalue 1 if the identical couple
if($sortedCouple[0] =~ m/$sortedCouple[1]/i){
$evalue1 = $evalue;
$identical =1;
$countTotPositive++;
}else{
$evalue1 = $complete{"$sortedCouple[0]-$sortedCouple[1]"}[0];
$identical =$complete{"$sortedCouple[0]-$sortedCouple[1]"}[1];
}
$complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,
$protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
$count++;
}
# temporaly saving the partial results
else{
$class1 = $classes{$protName1};
$class2 = $classes{$protName2};
$identical = ( $class1=~ m/$class2/ ? 1 : 0);
if ($identical == 1){
$countTotPositive++;
}else{
$countTotNegative++;
}
$complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$evalue,
$identical];
}

}

}
close OUTPUTFILE;
#variable initialisation
$countPositive = 0;
$countNegative = 0;
foreach $complete (sort{$complete{$a}[2]<=> $complete{$b}[2]} keys
%complete) {
if ($complete{$complete}[3] == 1){
$countPositive++;
}else{
$countNegative++;
}
$newLigne =
$complete{$complete}[0]."\t".$complete{$complete}[1]."\t".$complete{$complete}[2]."\t".$complete{$complete}[3]."\t".$countPositive/$countTotPositive."\t".$countNegative/$countTotNegative."\t".$complete{$complete}[4]."\t".$complete{$complete}[5]."\n";
push @results,$newLigne;

}

@end = times();
# ============= Analyse results

print "Reading and parsing file took ",$end[0]-$start[0]," cpu seconds\n";

# creation du document
print "\n";
@start = times();
open (F,">results/5out.test");
print F "@results";
close F;
@end = times();
# ============= Analyse results

print "Writing the file results/5out.test took ",$end[0]-$start[0]," cpu seconds\n";


}
##############################################################################

 
John W. Krahn
 
      02-24-2006
(E-Mail Removed) wrote:
> I am coding a program that parses a file 370Mb. As long as I keep this
> number less than a 1000 in this portion :
> # basicly tells me until when i should continue to read the file)
> if ($ligne =~ m/^.*1000>>>(\w+).*/){
> $stop= 1;
> }
> it works, but as soon as I increase the number (the max number being
> 2225) so I am not even reading 1/2 of it, the program does not respond.
> Does anybody have a suggestion for this ?
> thank you,
>
>
> ################################################## ############################"
> $#complete = 4000000;


You are expanding the array @complete to contain 4,000,001 elements but it
doesn't look like you are using that array anywhere. Perhaps it is causing
your problem?


John
--
use Perl;
program
fulfillment
 
mcvallet@hotmail.com
 
      02-24-2006
The only thing I know is that the array will contain 2225*2225 =
4,950,625 elements, and I thought I was using this array here:
$complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,
$protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
Did I mix up the $ and @?

Furthermore, at the beginning I was not expanding the array to this
size, but it was not working either, which is why I tried to expand the
array.

mc

 
John W. Krahn
 
      02-24-2006
(E-Mail Removed) wrote:
> The only thing I know is that the array will contain 2225*2225 = 4 950
> 625 and I thought I was using this array here
> $complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,


That is using the hash %complete, not the array @complete.
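
A small sketch of that distinction, using a made-up key and small sizes: @complete and %complete are two unrelated variables that only share a name, so growing one does nothing for the other. The last line uses the documented lvalue form of keys to pre-size a hash, which is probably what the $#complete line was meant to do; whether it helps with this data set is not something the thread establishes.

```perl
use strict;
use warnings;

my @complete;    # an array
my %complete;    # a hash with the same name: a completely separate variable

$#complete = 9;                            # pre-extends the ARRAY to 10 elements
$complete{"1dqzB0-d1tima"} = [ 99, 0 ];    # stores into the HASH (made-up key)

my $array_size = scalar @complete;
my $hash_size  = scalar keys %complete;
print "array elements: $array_size\n";
print "hash keys:      $hash_size\n";

# Pre-allocating hash buckets uses keys() as an lvalue; it adds no keys.
keys(%complete) = 16;
```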


John
--
use Perl;
program
fulfillment
 
MSG
 
      02-24-2006

(E-Mail Removed) wrote:
> The only thing I know is that the array will contain 2225*2225 = 4 950
> 625 and I thought I was using this array here
> $complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,
> $protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
> Did I mix up the $ and @ ?
>
> Furthermore, at the beginning I was not expanding the array to this
> size, but it was not working either this is why I tried to expand the
> array.
>
> mc


Where are 'use strict' and 'use warnings'?!
You can catch a lot of problems simply by using those, such as your
use of both $complete{ } and $#complete (hash / array).
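
To illustrate the point, here is a hedged sketch that compiles a one-line snippet under 'use strict' with a string eval. Because @complete is never declared, the snippet should be rejected at compile time, and having to declare @complete and %complete separately makes it obvious that they are two variables. The snippet text and the small size are made up for the demonstration.

```perl
use strict;
use warnings;

# Compile a tiny snippet under strict; the undeclared @complete should be rejected.
my $snippet = 'use strict; $#complete = 10; 1;';
my $ok      = eval $snippet;
my $error   = $@;

print $ok ? "compiled\n" : "rejected: $error";
```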

 
January Weiner
 
      02-24-2006
(E-Mail Removed) wrote:
> Hi,


Hello,
first of all: I think you are parsing output of some sequence comparison
program. Maybe you could describe in more detail what you are trying to
do? Your code is long, incomplete, with messy indentation and
practically uncommented, so it is hard to see what you are doing. For
example, what about the %classes hash? Where does it come from, where is
it defined?

> 2225) so I am not even reading 1/2 of it, the program does not respond.
> Does anybody have a suggestion for this ?
> thank you,


Hm. From my experience with large protein data sets -- looks like your
program exhausts all of the memory. A couple of suggestions:

1) As far as I can tell, you do the following: you first parse the search
results (I assume these are search results) and evaluate them at the
same time, then you sort them according to e-value, then you save them
in a file. You can do the following:

- first do the parsing, and save the data on the fly to a temporary
file

- then open the temporary file, make the evaluation, sort the
results, remove redundant etc.

- how long are the protein names? Maybe that is the problem? If you
have hundreds of thousands of fasta-style descriptions, using them
for a hash table in Perl (your "%complete" hash) may be very
inefficient. Try to use only short ids.

- if everything else fails, instead of spending weeks on correcting
your program (and there is, methinks, a lot to correct), try to get
your hands on a machine with more memory or a better OS and run
your calculations there.

- clean up your code, comment it, post it again here.

2) if I am correct in my assumption and you are writing a parser for
blast or ssearch or the results of a similar program, why don't you
use Bioperl?
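
The two-pass idea in point 1 could look roughly like the sketch below. It assumes the parsing step already yields (protein A, protein B, e-value) triples; the @parsed list stands in for the real parse loop, and File::Temp (a core module) supplies the temporary file. This is an illustration of the approach, not a drop-in replacement for the posted script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical pre-parsed records: (protein A, protein B, e-value).
my @parsed = (
    [ 'd1tima', '1cxpD0', 200 ],
    [ 'd1tima', '1dqzB0', 99 ],
    [ 'd1tima', '1hbnC0', 100 ],
);

# Pass 1: write one tab-separated record per line instead of building
# nested structures in memory while parsing.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
for my $rec (@parsed) {
    print {$tmp_fh} join( "\t", @$rec ), "\n";
}
close $tmp_fh or die "cannot close $tmp_name: $!";

# Pass 2: reread the compact file, then evaluate and sort.
open my $in, '<', $tmp_name or die "cannot open $tmp_name: $!";
my @rows;
while ( my $line = <$in> ) {
    chomp $line;
    push @rows, [ split /\t/, $line ];
}
close $in;

my @sorted = sort { $a->[2] <=> $b->[2] } @rows;
print join( "\t", @$_ ), "\n" for @sorted;
```

If even the second pass does not fit in memory, the flat file can be handed to an external sort tool instead of Perl's sort.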

(snip the code fragment)

j.

--
------------ January Weiner 3 -------------------------------------
Division of Bioinformatics, University of Muenster
 
January Weiner
 
      02-24-2006
(E-Mail Removed) wrote:
> The only thing I know is that the array will contain 2225*2225 = 4 950
> 625 and I thought I was using this array here
> $complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,


This is a hash. When you write $blah{foo}, you access the hash %blah and
get the value stored for the key 'foo'.

> $protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
> Did I mix up the $ and @ ?


You mixed up the % and the @.

However, I think that your problem is rather the size of your data. You
have a hash with 5 million elements, right? Try to roughly estimate how
much memory this will take. You need to store 5 million keys, right? Each
key being at least some 10 characters, right? Not to mention the arrays
that you store in the hash, correct?
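
A rough back-of-the-envelope version of that estimate follows. The protein count is from the thread; the bytes-per-entry figure is purely an assumption chosen for illustration (real Perl overhead depends on the build and on what each stored array holds). Because the posted code sorts each couple before building the key, a-b and b-a share one entry, so the hash holds at most n(n+1)/2 keys rather than n*n.

```perl
use strict;
use warnings;

my $proteins = 2225;                                # from the thread
my $ordered  = $proteins * $proteins;               # a-b and b-a counted separately
my $couples  = $proteins * ( $proteins + 1 ) / 2;   # sorted couples, self-pairs included

# ASSUMPTION: about 500 bytes per hash entry (key + array ref + six scalars).
# This is an illustrative guess, not a measured value.
my $bytes_per_entry = 500;
my $estimate_mb     = $couples * $bytes_per_entry / ( 1024 * 1024 );

printf "ordered pairs:  %d\n", $ordered;
printf "sorted couples: %d\n", $couples;
printf "rough estimate: %.0f MB at %d bytes per entry\n", $estimate_mb, $bytes_per_entry;
```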

1) Make the hash keys as short as possible.

2) Maybe instead of using protein names as keys, encode the file with
results (protein name1 = 0; protein name2 = 1, etc.). And instead of
using a hash, use a two-dimensional array:

my $matrix = [ ] ;

while( <INPUT_FILE> ) {
    ... # do your stuff

    my ($prot_a, $prot_b) ; # these will be numerical IDs, and not names

    if($prot_a > $prot_b) { # sort
        ($prot_a, $prot_b) = ($prot_b, $prot_a) ;
    }

    $result = [ ] ;
    ... # do some more stuff
    # fill up $result

    # store the $result in the matrix
    $matrix->[$prot_a][$prot_b] = $result ;
}
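
One way the "numerical IDs" part might be filled in is sketched below. The helper id_for and the tables %id_of and @name_of are hypothetical names introduced for this example; the hits are made-up data. Each name gets the next free integer the first time it is seen, and the pair of IDs is ordered so that both directions of a comparison land in the same cell.

```perl
use strict;
use warnings;

my %id_of;      # protein name -> numeric ID (hypothetical helper table)
my @name_of;    # numeric ID   -> protein name

# Return the ID for a name, assigning the next free integer on first sight.
sub id_for {
    my ($name) = @_;
    unless ( exists $id_of{$name} ) {
        $id_of{$name} = scalar @name_of;
        push @name_of, $name;
    }
    return $id_of{$name};
}

my $matrix = [];
for my $hit ( [ 'd1tima', '1dqzB0', 99 ], [ '1dqzB0', 'd1tima', 120 ] ) {
    my $id_a = id_for( $hit->[0] );
    my $id_b = id_for( $hit->[1] );
    ( $id_a, $id_b ) = ( $id_b, $id_a ) if $id_a > $id_b;    # smaller ID first

    # Collect the e-values of both directions in the same cell.
    push @{ $matrix->[$id_a][$id_b] }, $hit->[2];
}

print "$name_of[0] / $name_of[1]: @{ $matrix->[0][1] }\n";
```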

j.

--
------------ January Weiner 3 ---------------------+---------------
Division of Bioinformatics, University of Muenster
 
mcvallet@hotmail.com
 
      02-24-2006
The entire code is not here, but you were correct: I was not using them.
thanks,
mc

 
mcvallet@hotmail.com
 
      02-24-2006

> first of all: I think you are parsing output of some sequence comparison
> program.

Exactly.

> Maybe you could describe in more detail what you are trying to
> do? Your code is long, incomplete, with messy indentation and
> practically uncommented, so it is hard to see what you are doing.

Sorry.

> For example, what about the %classes hash? Where does it come from, where is
> it defined?

%classes is a hash that contains the structural family (class) of each
protein. It is at the beginning of my code, which I did not post because it
works correctly.



> 1) As far as I can tell, you do the following: you first parse the search
> results (I assume these are search results) and evaluate them at the
> same time, then you sort them according to e-value, then you save them
> in a file. You can do the following:
> - first do the parsing, and save the data on the fly to a temporary
> file


Not exactly. The results are already pre-parsed, but there are still
things in them that are not necessary. The file looks a bit like this:
1>>> d1tima_ 244 fragments - 244 aa
1dqzB0 ( 277) 4276 20.6 99
1hbnC0 ( 244) 4193 20.4 1e+02
1cxpD0 ( 463) 4140 20.3 2e+02
......
2225>>> another protein
the last 2225 results....
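
For a layout like this, a parsing sketch might look as follows. It assumes each hit sits on a single line with the e-value as the last field; the meaning of the middle columns is not stated in the thread, so they get neutral names here. Perl converts a string such as "1e+02" to 100 by itself in numeric context, so the exponent does not need to be expanded by hand.

```perl
use strict;
use warnings;

# Sample lines modelled on the excerpt, each hit assumed to be on one line.
my @lines = (
    '1>>> d1tima_ 244 fragments - 244 aa',
    '1dqzB0 ( 277) 4276 20.6 99',
    '1hbnC0 ( 244) 4193 20.4 1e+02',
    '1cxpD0 ( 463) 4140 20.3 2e+02',
);

my $query;
my @hits;
for my $line (@lines) {
    if ( $line =~ /^\s*\d+>>>\s*(\w+)/ ) {
        # Header line: remember the query name, underscores stripped
        # as in the posted code.
        ( $query = $1 ) =~ tr/_//d;
    }
    elsif ( $line =~ /^\s*(\w+)\s+\(\s*(\d+)\)\s+(\d+)\s+([\d.]+)\s+(\S+)\s*$/ ) {
        # Column meanings other than name and e-value are assumptions.
        my ( $name, $nb_aa, $field3, $field4, $evalue ) = ( $1, $2, $3, $4, $5 );
        push @hits, [ $query, $name, $evalue + 0 ];    # "1e+02" + 0 gives 100
    }
}

printf "%s\t%s\t%g\n", @$_ for @hits;
```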

> - first do the parsing, and save the data on the fly to a temporary
> file

> - then open the temporary file, make the evaluation, sort the
> results, remove redundant etc.

> - how long are the protein names? Maybe that is the problem? If you
> have hundreds of thousands of fasta-style descriptions, using them
> for a hash table in Perl (your "%complete" hash) may be very
> inefficient. Try to use only short ids.

5 letters long

> - if everything else fails, instead of spending weeks on correcting
> your program (and there is, methinks, a lot to correct), try to get
> your hands on a machine with more memory or a better OS and run
> your calculations there.

> - clean up your code, comment it, post it again here.

OK.
Thanks again,
mc

 
mcvallet@hotmail.com
 
      02-24-2006
> Maybe you could describe in more detail what you are trying to
> do?

I want to get all the couples a-b and the sum of their e-values,
eval_ab + eval_ba, and sort the results according to that sum.
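
Stated that way, the core of the task fits in a few lines. The sketch below uses made-up hits and a hypothetical %sum_for hash: each couple is keyed by its two names in sorted order, so a-b and b-a add into the same slot, and the keys are then sorted by the accumulated sum.

```perl
use strict;
use warnings;

# Hypothetical directed hits: (query, hit, e-value).
my @hits = (
    [ 'a', 'b', 99 ], [ 'b', 'a', 120 ],
    [ 'a', 'c', 5 ],  [ 'c', 'a', 7 ],
);

my %sum_for;    # "x-y" with x le y  ->  eval_xy + eval_yx
for my $hit (@hits) {
    my ( $first, $second ) = sort @{$hit}[ 0, 1 ];    # order the two names
    $sum_for{"$first-$second"} += $hit->[2];
}

# Rank the couples by their summed e-value, smallest first.
my @ranked = sort { $sum_for{$a} <=> $sum_for{$b} } keys %sum_for;
print "$_\t$sum_for{$_}\n" for @ranked;
```

Storing one number per couple instead of a six-element array also keeps the hash considerably smaller.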

 