Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > What strategy for random accession of records in massive FASTA file?

Reply
Thread Tools

What strategy for random accession of records in massive FASTA file?

 
 
Jeff Shannon
Guest
Posts: n/a
 
      01-13-2005
Chris Lasher wrote:

>>Given that the information content is 2 bits per character
>>that is taking up 8 bits of storage, there must be a good reason
>>for storing and/or transmitting them this way? I.e., it it easy
>>to think up a count-prefixed compressed format packing 4:1 in
>>subsequent data bytes (except for the last byte which have
>>less than 4 2-bit codes).

>
> My guess for the inefficiency in storage size is because it is
> human-readable, and because most in-silico molecular biology is just a
> bunch of fancy string algorithms. This is my limited view of these
> things at least.


Yeah, that pretty much matches my guess (not that I'm involved in
anything related to computational molecular biology or genetics).
Given the current technology, the cost of the extra storage size is
presumably lower than the cost of translating into/out of a packed
format. Heck, hard drives cost less than $1/GB now.

And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.

Jeff Shannon
Technician/Programmer
Credit International

 
Reply With Quote
 
 
 
 
Chris Lasher
Guest
Posts: n/a
 
      01-13-2005
>And besides, for long-term archiving purposes, I'd expect that zip et
>al on a character-stream would provide significantly better
>compression than a 4:1 packed format, and that zipping the packed
>format wouldn't be all that much more efficient than zipping the
>character stream.


This 105MB FASTA file is 8.3 MB gzip-ed.

 
Reply With Quote
 
 
 
 
Jeff Shannon
Guest
Posts: n/a
 
      01-14-2005
Chris Lasher wrote:

>>And besides, for long-term archiving purposes, I'd expect that zip et
>>al on a character-stream would provide significantly better
>>compression than a 4:1 packed format, and that zipping the packed
>>format wouldn't be all that much more efficient than zipping the
>>character stream.

>
> This 105MB FASTA file is 8.3 MB gzip-ed.


And a 4:1 packed-format file would be ~26MB. It'd be interesting to
see how that packed-format file would compress, but I don't care
enough to write a script to convert the FASTA file into a
packed-format file to experiment with...

Short version, then, is that yes, size concerns (such as they may be)
are outweighed by speed and conceptual simplicity (i.e. avoiding a
huge mess of bit-masking every time a single base needs to be
examined, or a human-(semi-)readable display is needed).

(Plus, if this format might be used for RNA sequences as well as DNA
sequences, you've got at least a fifth base to represent, which means
you need at least three bits per base, which means only two bases per
byte (or else base-encodings split across byte-boundaries).... That
gets ugly real fast.)

Jeff Shannon
Technician/Programmer
Credit International

 
Reply With Quote
 
Robert Kern
Guest
Posts: n/a
 
      01-14-2005
Jeff Shannon wrote:

> (Plus, if this format might be used for RNA sequences as well as DNA
> sequences, you've got at least a fifth base to represent, which means
> you need at least three bits per base, which means only two bases per
> byte (or else base-encodings split across byte-boundaries).... That gets
> ugly real fast.)


Not to mention all the IUPAC symbols for incompletely specified bases
(e.g. R = A or G).

http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html

--
Robert Kern
http://www.velocityreviews.com/forums/(E-Mail Removed)

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
 
Reply With Quote
 
John Lenton
Guest
Posts: n/a
 
      01-14-2005
On Thu, Jan 13, 2005 at 12:19:49AM +0100, Fredrik Lundh wrote:
> Chris Lasher wrote:
>
> > Since the file I'm working with contains tens of thousands of these
> > records, I believe I need to find a way to hash this file such that I
> > can retrieve the respective sequence more quickly than I could by
> > parsing through the file request-by-request. However, I'm very new to
> > Python and am still very low on the learning curve for programming and
> > algorithms in general; while I'm certain there are ubiquitous
> > algorithms for this type of problem, I don't know what they are or
> > where to look for them. So I turn to the gurus and accost you for help
> > once again. If you could help me figure out how to code a solution
> > that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
> > keep it in Python only, even though I know interaction with a
> > relational database would provide the fastest method--the group I'm
> > trying to write this for does not have access to a RDBMS.)

>
> keeping an index in memory might be reasonable. the following class
> creates an index file by scanning the FASTA file, and uses the "marshal"
> module to save it to disk. if the index file already exists, it's used as is.
> to regenerate the index, just remove the index file, and run the program
> again.


the problem caught my interest, and the way you used a non-mmaped file
in a place where using mmap was pretty much obvious (IMVHO) bothered
me enough to overcome my laziness. It didn't overcome it by *much*,
mind you, so the following probably only works on python 2.3, on
Linux, in Argentina, and with FASTA data that looks like the sample I
was able to download.

However, having said that, the sample I downloaded was one 46MiB file,
and reading it in on my notebook was fast enough that I ripped out the
saving/reloading of the index and just reindex it every time. Adding
back the persistant index is trivial. [ time passes ] In fact, I just
found a several-gigabyte fasta file, and I guess you'd want the index
for that one; I put the code back in. And now I should really go to
bed, because this is very interesting but won't pay the bills.

import os, mmap, marshal
from UserDict import DictMixin

class FASTA(DictMixin):
def __init__(self, filename):
self.file = f = open(filename)
fno = f.fileno()
stat = os.fstat(fno)
self.map = mmap.mmap(fno, stat.st_size, access=mmap.ACCESS_COPY)
self.checkindex()

def __getitem__(self, key):
p0, pf = self.index[key]
m = self.map
return m[p0f]

def keys(self):
return self.index.keys()
def __contains__(self, key):
return self.index.__contains__(key)

def checkindex(self):
indexfile = self.file.name + ".idx"
if os.path.exists(indexfile):
# and os.stat(indexfile).st_mtime > os.stat(self.file.name).st_mtime:
self.index = marshal.load(open(indexfile, "rb"))
else:
print 'generating index...'
self.genindex()
marshal.dump(self.index, open(indexfile, "wb"))
print 'done.'

def genindex(self):
index = {}
m = self.map
last = None
while 1:
pos = m.find('>')
if last is not None:
index[last] = (index[last], pos)
if pos == -1:
break
m.seek(pos)
line = m.readline()
pos = m.tell()
# this is the bit that probably only works with FASTA
# files like I was able to find on the 'net.
sep = line.index(' ')
if sep == -1:
name = line[1:].strip()
else:
name = line[1:sep].strip()
index[name] = pos
last = name
self.index = index

db = FASTA("/home/john/tmp/uniprot_sprot.fasta")
print db["104K_THEPA"]


--
John Lenton ((E-Mail Removed)) -- Random fortune:
"Those who believe in astrology are living in houses with foundations of
Silly Putty."
- Dennis Rawlins, astronomer

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFB52VWgPqu395ykGsRAtiTAJ9sooa2folCJp3beGzY3P enuuMJJgCfQEnz
5ZSQezYJ5R0X6vB+Aj7FnOQ=
=n0xF
-----END PGP SIGNATURE-----

 
Reply With Quote
 
Neil Benn
Guest
Posts: n/a
 
      01-14-2005
Jeff Shannon wrote:

> Chris Lasher wrote:
>
>>> And besides, for long-term archiving purposes, I'd expect that zip et
>>> al on a character-stream would provide significantly better
>>> compression than a 4:1 packed format, and that zipping the packed
>>> format wouldn't be all that much more efficient than zipping the
>>> character stream.

>>
>>
>> This 105MB FASTA file is 8.3 MB gzip-ed.

>
>
> And a 4:1 packed-format file would be ~26MB. It'd be interesting to
> see how that packed-format file would compress, but I don't care
> enough to write a script to convert the FASTA file into a
> packed-format file to experiment with...
>
> Short version, then, is that yes, size concerns (such as they may be)
> are outweighed by speed and conceptual simplicity (i.e. avoiding a
> huge mess of bit-masking every time a single base needs to be
> examined, or a human-(semi-)readable display is needed).
>
> (Plus, if this format might be used for RNA sequences as well as DNA
> sequences, you've got at least a fifth base to represent, which means
> you need at least three bits per base, which means only two bases per
> byte (or else base-encodings split across byte-boundaries).... That
> gets ugly real fast.)
>
> Jeff Shannon
> Technician/Programmer
> Credit International
>

Hello,

Just to clear up a few things on the topic :

If the file denotes DNA sequences there are five basic identifiers

AGCT and X (where X means 'dunno!').

If the files denoites RNA sequences, you will still only need five
basic indentifiers the issue is that the T is replaced by a U.

One very good way I have found to parse large files of this nature
(I've done it with many a use case) is to write a sax parser for the
file. Therefore you can register a content handler, receive events from
the sax parser and do whatever you like with it. Basically, using the
sax framework to read the files - if your write the sax parser carefully
then you stream the files and remove old lines from memory, therefore
you have a scalable solution (rather than keeping everything in memory).

As an aside, I would seriously consider parsing your files and
putting this information in a small local db - it's really not much work
to do and the 'pure' python thing is a misnomer, whichever persistence
mechanism you use (file,DB,etching it on the floor with a small robot
accepting logo commands,etc) is unlikely to be pure python.

The advantage of putting it in a DB will show up later when you have
fast and powerful retrieval capability.

Cheers,

Neil

--

Neil Benn
Senior Automation Engineer
Cenix BioScience
BioInnovations Zentrum
Tatzberg 47
D-01307
Dresden
Germany

Tel : +49 (0)351 4173 154
e-mail : (E-Mail Removed)
Cenix Website : http://www.cenix-bioscience.com

 
Reply With Quote
 
Michael Maibaum
Guest
Posts: n/a
 
      01-14-2005
On Thu, Jan 13, 2005 at 04:41:45PM -0800, Robert Kern wrote:
>Jeff Shannon wrote:
>
>>(Plus, if this format might be used for RNA sequences as well as DNA
>>sequences, you've got at least a fifth base to represent, which means
>>you need at least three bits per base, which means only two bases per
>>byte (or else base-encodings split across byte-boundaries).... That gets
>>ugly real fast.)

>
>Not to mention all the IUPAC symbols for incompletely specified bases
>(e.g. R = A or G).
>
>http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html


Or, for those of us working with proteins as well, all the single letter codes for proteins:

http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html

lots more bits.

I have a db with approx 3 million proteins in it and would not want to be using a pure python approach

Michael

 
Reply With Quote
 
Chris Lasher
Guest
Posts: n/a
 
      01-14-2005
Forgive my ignorance, but what does using mmap do for the script? My
guess is that it improves performance, but I'm not sure how. I read the
module documentation and the module appears to be a way to read out
information from memory (RAM maybe?).

 
Reply With Quote
 
Paul Rubin
Guest
Posts: n/a
 
      01-14-2005
"Chris Lasher" <(E-Mail Removed)> writes:
> Forgive my ignorance, but what does using mmap do for the script? My
> guess is that it improves performance, but I'm not sure how. I read the
> module documentation and the module appears to be a way to read out
> information from memory (RAM maybe?).


Mmap lets you treat a disk file as an array, so you can randomly
access the bytes in the file without having to do seek operations.
Just say a[234]='x' and you've changed byte 234 of the file to the
letter x. It works through the OS's virtual memory system and the
computer's MMU hardware, and so it has lower overhead than doing
system calls for every access.
 
Reply With Quote
 
Steve Holden
Guest
Posts: n/a
 
      01-14-2005
Bengt Richter wrote:

> On 12 Jan 2005 14:46:07 -0800, "Chris Lasher" <(E-Mail Removed)> wrote:
>

[...]
> Others have probably solved your basic problem, or pointed
> the way. I'm just curious.
>
> Given that the information content is 2 bits per character
> that is taking up 8 bits of storage, there must be a good reason
> for storing and/or transmitting them this way? I.e., it it easy
> to think up a count-prefixed compressed format packing 4:1 in
> subsequent data bytes (except for the last byte which have
> less than 4 2-bit codes).
>
> I'm wondering how the data is actually used once records are
> retrieved. (but I'm too lazy to explore the biopython.org link).
>

Revealingly honest.

Of course, adopting an encoding that only used two bits per base would
make it impossible to use the re module to search for patterns in them,
for example. So the work of continuously translating between
representations might militate against more efficient representations.
Or, of course, it might not

it's-only-storage-ly y'rs - steve
--
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Math.random() and Math.round(Math.random()) and Math.floor(Math.random()*2) VK Javascript 15 05-02-2010 03:43 PM
concatenate fasta file PeroMHC Python 3 02-13-2010 03:14 PM
random.random(), random not defined!? globalrev Python 4 04-20-2008 08:12 AM
convert protein fasta stream into harsh table zhong.huang@gmail.com Perl Misc 9 03-01-2006 07:57 PM
match muliple header records to associated detail records Luke Airig XML 0 12-31-2003 12:06 AM



Advertisments