Handling Huge Data

 
 
Vishal G
Guest
Posts: n/a
 
      09-30-2008
Hi Guys,

I am trying to modify a bioinformatics package written in Perl. It was
written to handle DNA sequences about 500,000 bases long (a string
containing 500,000 characters).

I have to enhance it to handle DNA 100 million bases long...

Each base in the DNA carries this information: base (A, C, G or T), qual
(0-99), position (1 to length).

There is one main DNA sequence and, on average, 500,000 parts (each at most
2000 characters long, with the same set of information)...

The program first creates an alignment like
<code>

                                     *
Main - .....ACCCTTTGTCTAGTCGTATCGTCGATCGTCGCTAGCTCTGCT....
Part -                  GTCGTATCGTCGAACGTCGCTAGCTC
Part -         CTTTGTCTAGTCGTATCGTCGATCGTCGCT
Part -                           TCGAACGTCGCTAGC
</code>
Now, let's say I have to go through each position and find how many
variations are present at that position (along with their original
position and quality).

Look at the * position: there is a T-A variation.

Right now they are using hashes to capture this:

%A, %C, %G, %T

Loop For Main DNA {
    $T{$pos} = $qual;
    # this tells me that there is a T base at this position with some qual
}

Then update the qual by adding the qual of the parts:

Loop For Parts {
    $A{$pos} += $qual;   # for A parts
    $T{$pos} += $qual;   # for T parts
}

But because the dataset is huge, this consumes a lot of memory...

So basically I am trying to figure out a way to store this information
without using much memory.

If you don't understand the above problem, don't worry....

Just tell me how to handle huge data which needs to be accessed frequently
using the least possible memory..

Thanks in advance
 
 
 
 
 
xhoster@gmail.com
Guest
Posts: n/a
 
      09-30-2008
Vishal G <(E-Mail Removed)> wrote:
> Hi Guys,
>
> I am trying to edit some bioinformatic package written in perl which
> was written to handle DNA sequence of about 500,000 base long (a
> string containg 500000 chrs)..
>
> I have to enhance it to handle 100 million base long DNA...
>
> Each base in DNA has this information, base (A, C, G or T), qual
> (0-99), position (1-length)
>
> there is one main DNA sequence and on average 500,000 parts (max 2000
> chrs long with the same set of information)...


How is this data stored? Is it all in memory at once?

>
> The program first creates an alignment like
> <code>
>
> *
> Main - .....ACCCTTTGTCTAGTCGTATCGTCGATCGTCGCTAGCTCTGCT... .
> Part -
> GTCGTATCGTCGAACGTCGCTAGCTC
> Part - CTTTGTCTAGTCGTATCGTCGATCGTCGCT
> Part
> -
> TCGAACGTCGCTAGC
> </code>


It looks like your alignment was line-wrapped into oblivion. Anyway,
how was the alignment on such a large dataset done? Couldn't your quality
summarization be best implemented by pushing it into the aligner code?


> Now, lets say I have to go thorugh each position and find how many
> variations are present at certain position (with their original
> position and quality).
>
> Look at * position, there is T-A variation
>
> Right now they are using hash to caputure this
>
> %A, %C, %G, %T
>
> Loop For Main DNA {
> $T{$pos} = $qual;
> # this tells me that there is T base at certain position with some
> qual


Since $pos is an integer and seems to be dense (every or almost every
position from 0 up to the length-1 will be occupied), then you should
consider using an array rather than a hash. That might save some memory.
On the other hand, it might take more memory if most positions are
unanimous, meaning that 3 of the 4 base-hashes would not have a value for
any given position.
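
To make the array suggestion concrete, here is a minimal sketch. The toy reads and the name %qual_for are made up for illustration; it only shows that a dense integer $pos can index an array directly where the original code used hash keys.

```perl
use strict;
use warnings;

# Hypothetical toy data: [position, base, quality] triples.
my @reads = ( [ 0, 'T', 30 ], [ 1, 'A', 20 ], [ 1, 'A', 15 ], [ 2, 'T', 40 ] );

# One array per base, indexed by position, instead of %A, %C, %G, %T.
my %qual_for = map { $_ => [] } qw(A C G T);

for my $read (@reads) {
    my ( $pos, $base, $qual ) = @$read;
    $qual_for{$base}[$pos] += $qual;    # extends the array as needed
}

# Positions a base never occupied stay undef, but still cost an array slot.
my $a_at_1 = $qual_for{A}[1] // 0;
my $t_at_1 = $qual_for{T}[1] // 0;
print "pos 1: A=$a_at_1 T=$t_at_1\n";
```

The undef slots are the trade-off described above: if most positions are unanimous, three of the four arrays are mostly empty slots.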

Also, where is $qual coming from? Obviously it isn't a constant over the
life of the loop, like you have it shown. Doesn't it have to draw from
something in RAM to obtain its value?

>
> }
>
> Update the qual by adding the qual of parts
>
> Loop For Parts {
> $A{$pos} += $qual # for A parts
>
> $T{$pos} += $qual $ for T parts
> }


Is there another loop over $pos? If so, is it inside the Loop for parts
or outside of it? Again, where does $qual come from?

>
> But because the dataset is huge, it consumes lot of memory...
>
> so basically I am trying to figure out a way to store this information
> without using much memory


You could "pack" the numbers into strings and manipulate them with
"substr". I think there are even some Tie modules that do this for you, but
the speed decrease might be substantial.
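
A minimal sketch of the pack-and-substr idea, assuming 16-bit unsigned counters (2 bytes per position) are wide enough for the accumulated quality. The helper names get_qual and add_qual are invented for this example.

```perl
use strict;
use warnings;

my $BYTES = 2;        # assumed counter width: 16-bit unsigned
my $len   = 1_000;    # toy length; the real case is 100 million

# One packed buffer per base, every counter initialised to zero.
my %buf = map { $_ => "\0" x ( $BYTES * $len ) } qw(A C G T);

sub get_qual {
    my ( $base, $pos ) = @_;
    return unpack 'n', substr( $buf{$base}, $pos * $BYTES, $BYTES );
}

sub add_qual {
    my ( $base, $pos, $qual ) = @_;
    my $sum = get_qual( $base, $pos ) + $qual;
    die "counter overflow at $base/$pos\n" if $sum > 0xFFFF;

    # 4-argument substr replaces the bytes in place, so the buffer never grows.
    substr( $buf{$base}, $pos * $BYTES, $BYTES, pack( 'n', $sum ) );
    return $sum;
}

add_qual( 'T', 42, 30 );
add_qual( 'T', 42, 55 );
add_qual( 'A', 42, 20 );
printf "T=%d A=%d bytes=%d\n", get_qual( 'T', 42 ), get_qual( 'A', 42 ), length $buf{T};
```

At 2 bytes per counter, 100 million positions would cost about 200 MB per base, at the price of a pack/unpack call on every access.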

What I would probably do is use Inline::C and have the data be accumulated
in a C float or double array, rather than a perl structure.

Or maybe you can address one $pos at a time, and output the results of that
$pos to disk before moving on to the next one, rather than accumulating
into memory.
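
A sketch of the one-position-at-a-time idea, assuming the per-base records can be produced in position order (the thread does not confirm that they can). The record layout and the name flush_pos are invented for the example, and an in-memory filehandle stands in for the output file.

```perl
use strict;
use warnings;

# Hypothetical input, already sorted by position: [pos, base, qual].
my @records = (
    [ 10, 'T', 30 ], [ 10, 'A', 20 ], [ 10, 'T', 25 ],
    [ 11, 'C', 40 ], [ 12, 'G', 35 ], [ 12, 'G', 10 ],
);

my $report = '';
open my $out, '>', \$report or die "cannot open in-memory file: $!";

my ( $current, %sum );

# Write the totals for one finished position, then forget them.
sub flush_pos {
    return unless defined $current;
    print {$out} join( "\t", $current, map( "$_=$sum{$_}", sort keys %sum ) ), "\n";
    %sum = ();
}

for my $rec (@records) {
    my ( $pos, $base, $qual ) = @$rec;
    if ( !defined $current or $pos != $current ) {
        flush_pos();
        $current = $pos;
    }
    $sum{$base} += $qual;
}
flush_pos();
close $out;
print $report;
```

Only the totals for the current position are ever held in memory, so the footprint no longer depends on the length of the sequence.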

>
> If you dont understand the above problem, dont worry....
>
> just tell me how to handle huge data which need to accessed frequently
> using least possible memory..


"Don't worry about what disease I actually have, doc, just give me the cure."
I'm afraid that isn't likely to work well. The details of the solution
are likely to depend on the details of the problem.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
 
 
 
 
Eric Pozharski
Guest
Posts: n/a
 
      10-01-2008
Vishal G <(E-Mail Removed)> wrote:
*SKIP*
> If you dont understand the above problem, dont worry....


You first...

> just tell me how to handle huge data which need to accessed frequently
> using least possible memory..


Free your mind of slurping (quite impossible if you come from a world
where cycles are cheap, memory is cheap, disks are cheap, etc.). Then
use C<use DBI> (I prefer B<DBD::SQLite>, it's fscking fast).

p.s. And a piece of advice. If you're not going to show your code that
"clearly exhibits your problem" -- don't wait for help here.

--
Torvalds' goal for Linux is very simple: World Domination
 
 
Vishal G
Guest
Posts: n/a
 
      10-02-2008
Hello Guys,

Thanks for your advice and sorry for being so vague...

In simple words if I have this code...

my $unitlength = 3;
my $dnaLength = 100000000;

my $A = sprintf("%3d", 0) x $dnaLength;
my $C = sprintf("%3d", 0) x $dnaLength;
my $G = sprintf("%3d", 0) x $dnaLength;
my $T = sprintf("%3d", 0) x $dnaLength;
my $I = sprintf("%3d", 0) x $dnaLength;

# Assign quality information of DNA
print "DNA Processing";
my ($num, $qual);
for (my $i = 0; $i < $dnaLength; $i++) {
    $num = int(rand(5)) + 1;
    $qual = int(rand(99)) + 1;

    if ($num == 1) {
        # Base A at position $i with base quality $qual
        substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 2) {
        substr($C, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 3) {
        substr($G, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 4) {
        substr($T, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 5) {
        substr($I, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } else {
    }
}

print "Member Processing\n";
my ($start, $stop);
for (my $j = 0; $j < 50000; $j++) {
    # Start and stop of member with respect to DNA
    $start = int(rand($dnaLength - 2000)) + 1; # Member start with respect to DNA
    $stop = $dnaLength; # Finish at end

    for (my $i = $start; $i <= $stop; $i++) {
        $num = int(rand(5)) + 1;
        $qual = int(rand(99)) + 1;
        if ($num == 1) {
            $qual = $qual + int( substr($A, $i * $unitlength, $unitlength) );
            substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 2) {
            $qual = $qual + int( substr($C, $i * $unitlength, $unitlength) );
            substr($C, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 3) {
            $qual = $qual + int( substr($G, $i * $unitlength, $unitlength) );
            substr($G, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 4) {
            $qual = $qual + int( substr($T, $i * $unitlength, $unitlength) );
            substr($T, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 5) {
            $qual = $qual + int( substr($I, $i * $unitlength, $unitlength) );
            substr($I, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } else {
        }
    }
}

I ran this code and it consumes around 3.0 GB of memory...

I have also run this same code using hashes (%A, %C, ...) (8.0+ GB) and
with arrays (5.0+ GB).

Is there any other way to store the information using less memory?

Thanks


 
 
John W. Krahn
Guest
Posts: n/a
 
      10-02-2008
Vishal G wrote:
> Hello Guys,
>
> Thanks for your advice and sorry for being so vague...
>
> In simple words if I have this code...
>
> my $unitlength = 3;
> my $dnaLength = 100000000;
>
> my $A = sprintf("%3d", 0) x $dnaLength;
> my $C = sprintf("%3d", 0) x $dnaLength;
> my $G = sprintf("%3d", 0) x $dnaLength;
> my $T = sprintf("%3d", 0) x $dnaLength;
> my $I = sprintf("%3d", 0) x $dnaLength;


Why not just:

my $A = '000' x $dnaLength;
my $C = '000' x $dnaLength;
my $G = '000' x $dnaLength;
my $T = '000' x $dnaLength;
my $I = '000' x $dnaLength;

Or even:

my $A = my $C = my $G = my $T = my $I = '000' x $dnaLength;


> # Assign quality information of DNA
> print "DNA Processing";
> my ($num, $qual);
> for (my $i = 0; $i < $dnaLength; $i++) {
>     $num = int(rand(5)) + 1;
>     $qual = int(rand(99)) + 1;
>
>     if ($num == 1) {
>         # Base A at position $i with base quality $qual
>         substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
>     } elsif ($num == 2) {
>         substr($C, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
>     } elsif ($num == 3) {
>         substr($G, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
>     } elsif ($num == 4) {
>         substr($T, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
>     } elsif ($num == 5) {
>         substr($I, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
>     } else {
>     }
> }


If you wanted, you *could* write that loop as:

for my $i ( 0 .. $dnaLength - 1 ) {
    substr ${\( $A, $C, $G, $T, $I )[ rand 5 ]}, $i * $unitlength,
        $unitlength, sprintf '%*d', $unitlength, 1 + int rand 99;
}


> print "Member Processing\n";
> my ($start, $stop);
> for (my $j = 0; $j < 50000; $j++) {
>     # Start and stop of member with respect to DNA
>     $start = int(rand($dnaLength - 2000)) + 1; # Member start with respect to DNA
>     $stop = $dnaLength; # Finish at end
>
>     for (my $i = $start; $i <= $stop; $i++) {
>         $num = int(rand(5)) + 1;
>         $qual = int(rand(99)) + 1;
>         if ($num == 1) {
>             $qual = $qual + int( substr($A, $i * $unitlength, $unitlength) );
>             substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
>         } elsif ($num == 2) {
>             $qual = $qual + int( substr($C, $i * $unitlength, $unitlength) );
>             substr($C, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
>         } elsif ($num == 3) {
>             $qual = $qual + int( substr($G, $i * $unitlength, $unitlength) );
>             substr($G, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
>         } elsif ($num == 4) {
>             $qual = $qual + int( substr($T, $i * $unitlength, $unitlength) );
>             substr($T, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
>         } elsif ($num == 5) {
>             $qual = $qual + int( substr($I, $i * $unitlength, $unitlength) );
>             substr($I, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
>         } else {
>         }
>     }
> }
>
> I ran this code and it consumes around 3.0 GB of memory...


You are running out of memory because when you add the numbers together
they are sometimes longer than $unitlength which causes the strings to
expand.

$ perl -le'printf "%3d\n", 900 + 800'
1700


> I have also ran this same code using Hash (%A, %C,....) (8.0+ GB) and
> with Array (5.0+ GB)
>
> Is there any other way to store the information using less memory.


If you want to keep the substrings at only $unitlength you could use
either modulus:

$ perl -le'printf "%3d\n", ( 900 + 800 ) % 1000'
700

Or a truncating sprintf format:

$ perl -le'printf "%3.3s\n", 900 + 800'
170
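
For comparison, a small sketch of what each option does to the stored value. The wider-field variant is an added assumption for contrast, not something proposed in this post.

```perl
use strict;
use warnings;

my $sum = 900 + 800;    # 1700 no longer fits in three characters

my $wrapped   = sprintf '%3d',   $sum % 1000;    # keeps only the low digits
my $truncated = sprintf '%3.3s', $sum;           # keeps only the leading characters
my $widened   = sprintf '%5d',   $sum;           # assumed 5-character field

print "wrapped='$wrapped' truncated='$truncated' widened='$widened'\n";
```

The first two keep every field at three characters but no longer store 1700; only the wider field preserves the sum, at the cost of more bytes per position.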



John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall
 
 
xhoster@gmail.com
Guest
Posts: n/a
 
      10-02-2008
"John W. Krahn" <(E-Mail Removed)> wrote:
> Vishal G wrote:
> > Hello Guys,
> >
> > Thanks for your advice and sorry for being so vague...
> >
> > In simple words if I have this code...
> >
> > my $unitlength = 3;
> > my $dnaLength = 100000000;
> >
> > my $A = sprintf("%3d", 0) x $dnaLength;
> > my $C = sprintf("%3d", 0) x $dnaLength;
> > my $G = sprintf("%3d", 0) x $dnaLength;
> > my $T = sprintf("%3d", 0) x $dnaLength;
> > my $I = sprintf("%3d", 0) x $dnaLength;

>
> Why not just:
>
> my $A = '000' x $dnaLength;
> my $C = '000' x $dnaLength;
> my $G = '000' x $dnaLength;
> my $T = '000' x $dnaLength;
> my $I = '000' x $dnaLength;
>
> Or even:
>
> my $A = my $C = my $G = my $T = my $I = '000' x $dnaLength;


Or better yet:

my %h;
$h{$_}='000' x $dnaLength foreach qw/A C G T I/;

Or, because $num is a number:

$h{$_}='000' x $dnaLength foreach 1..5;


This cuts the memory use almost in half, as each literal instance of
'000' x $dnaLength takes up memory and doesn't seem to release it.


> > # Assign quality information of DNA
> > print "DNA Processing";
> > my ($num, $qual);
> > for (my $i = 0; $i < $dnaLength; $i++) {
> > $num = int(rand(5)) + 1;
> > $qual = int(rand(99)) + 1;
> >
> > if ($num == 1) {
> >     # Base A at position $i with base quality $qual
> >     substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));


replace the ugly switch statement with:

substr($h{$num}, $i * $unitlength, #....


> > print "Member Processing\n";
> > my ($start, $stop);
> > for (my $j = 0; $j < 50000; $j++) {
> > # Start and stop of member with respect to DNA
> > $start = int(rand($dnaLength - 2000)) + 1; # Member start with respect to DNA
> > $stop = $dnaLength; # Finish at end


Shouldn't it finish at its own end, $start+2000-1, not at the main sequence
end?


> > if ($num == 1) {
> >     $qual = $qual + int( substr($A, $i * $unitlength, $unitlength) );


This too could be replaced by $h{$num} in the substr and getting rid of
the big if blocks.
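
Putting these pieces together, a scaled-down sketch might look like the following. It assumes a 4-character field (so sums up to 9999 fit), a toy $dnaLength of 1,000, and 50 members of 20 positions each. Those numbers and the helper name add_at are illustrative, not from the original program.

```perl
use strict;
use warnings;

my $unitlength = 4;        # assumed: one character wider than the original 3
my $dnaLength  = 1_000;    # toy size; the thread uses 100_000_000
my $memberLen  = 20;       # toy size; the thread uses 2000

# One fixed-width string per base code 1..5 (A, C, G, T, I).
my %h;
my $zero = sprintf '%*d', $unitlength, 0;
$h{$_} = $zero x $dnaLength for 1 .. 5;

# Add $qual to the field for base code $num at position $i, in place.
sub add_at {
    my ( $num, $i, $qual ) = @_;
    my $sum = $qual + substr( $h{$num}, $i * $unitlength, $unitlength );
    die "field overflow at position $i\n" if length($sum) > $unitlength;
    substr( $h{$num}, $i * $unitlength, $unitlength, sprintf( '%*d', $unitlength, $sum ) );
    return;
}

# Main sequence: one base per position.
add_at( 1 + int( rand(5) ), $_, 1 + int( rand(99) ) ) for 0 .. $dnaLength - 1;

# Members: each one stops at its own end, not at the end of the main sequence.
for ( 1 .. 50 ) {
    my $start = int( rand( $dnaLength - $memberLen ) );
    my $stop  = $start + $memberLen - 1;
    add_at( 1 + int( rand(5) ), $_, 1 + int( rand(99) ) ) for $start .. $stop;
}

printf "each string is %d bytes\n", length $h{1};
```

The overflow check turns the silent string growth into an immediate error, which makes it easier to tell whether the chosen field width is sufficient.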

....
> >
> > I ran this code and it consumes around 3.0 GB of memory...

>
> You are running out of memory because when you add the numbers together
> they are sometimes longer than $unitlength which causes the strings to
> expand.
>
> $ perl -le'printf "%3d\n", 900 + 800'
> 1700


This is truly a problem, but it is a correctness problem. In my hands
it leads to almost no size inflation. The way he stores data, the minimum
possible size would be 1.5e9 bytes (5*3*1e8), and the way the x operator
works inflates that to 3e9 bytes if you have 5 literal instances of it.

> >
> > Is there any other way to store the information using less memory.


I've shown how to cut it almost in half (but you will need to increase
$unitlength unless you want to get wrong answers or lose data, which will
cost you more space).

But the real answer is not to store the entire set in RAM at all.

Xho

 
 
J. Gleixner
Guest
Posts: n/a
 
      10-02-2008
Vishal G wrote:
> Hi Guys,
>
> I am trying to edit some bioinformatic package written in perl which
> was written to handle DNA sequence of about 500,000 base long (a
> string containg 500000 chrs)..

[...]

If you haven't read it yet, this might be useful:

http://www.perl.com/pub/a/2003/09/10...formatics.html
 
 
Ilya Zakharevich
Guest
Posts: n/a
 
      10-03-2008
[A complimentary Cc of this posting was sent to
John W. Krahn
<(E-Mail Removed)>], who wrote in article <_pZEk.3778$(E-Mail Removed)>:

> my $A = my $C = my $G = my $T = my $I = '000' x $dnaLength;


For best results, use

my $I = '000';
$I x= $dnaLength;
my $A = my $C = my $G = my $T = $I;

(otherwise '000' x $dnaLength is computed at compile time, and remains
in the compiled tree).

And do not have anything "large" as a last statement of a subroutine -
unless you want it to be duplicated to create a return value of the
subroutine.

Hope this helps,
Ilya
 