Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > What strategy for random accession of records in massive FASTA file?

Reply
Thread Tools

What strategy for random accession of records in massive FASTA file?

 
 
Chris Lasher
Guest
Posts: n/a
 
      01-12-2005
Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:

>CW127_A01

TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGG GTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTA ATACCCCATA
GCATTAAACAT
>CW127_A02

TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGG GTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTA ATACCCCATA
GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAG GAATAGACGG
>CW127_A03

TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGG GTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTA ATACCCCATA
GCATTAAACATTCCGCCTGGG
....

Since the file I'm working with contains tens of thousands of these
records, I believe I need to find a way to hash this file such that I
can retrieve the respective sequence more quickly than I could by
parsing through the file request-by-request. However, I'm very new to
Python and am still very low on the learning curve for programming and
algorithms in general; while I'm certain there are ubiquitous
algorithms for this type of problem, I don't know what they are or
where to look for them. So I turn to the gurus and accost you for help
once again. If you could help me figure out how to code a solution
that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
keep it in Python only, even though I know interaction with a
relational database would provide the fastest method--the group I'm
trying to write this for does not have access to a RDBMS.)
Thanks very much in advance,
Chris

 
Reply With Quote
 
 
 
 
Fredrik Lundh
Guest
Posts: n/a
 
      01-12-2005
Chris Lasher wrote:

> Since the file I'm working with contains tens of thousands of these
> records, I believe I need to find a way to hash this file such that I
> can retrieve the respective sequence more quickly than I could by
> parsing through the file request-by-request. However, I'm very new to
> Python and am still very low on the learning curve for programming and
> algorithms in general; while I'm certain there are ubiquitous
> algorithms for this type of problem, I don't know what they are or
> where to look for them. So I turn to the gurus and accost you for help
> once again. If you could help me figure out how to code a solution
> that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
> keep it in Python only, even though I know interaction with a
> relational database would provide the fastest method--the group I'm
> trying to write this for does not have access to a RDBMS.)


keeping an index in memory might be reasonable. the following class
creates an index file by scanning the FASTA file, and uses the "marshal"
module to save it to disk. if the index file already exists, it's used as is.
to regenerate the index, just remove the index file, and run the program
again.

import os, marshal

class FASTA:

def __init__(self, file):
self.file = open(file)
self.checkindex()

def __getitem__(self, key):
try:
pos = self.index[key]
except KeyError:
raise IndexError("no such item")
else:
f = self.file
f.seek(pos)
header = f.readline()
assert ">" + header + "\n"
data = []
while 1:
line = f.readline()
if not line or line[0] == ">":
break
data.append(line)
return data

def checkindex(self):
indexfile = self.file.name + ".index"

try:
self.index = marshal.load(open(indexfile, "rb"))
except IOError:
print "building index..."

index = {}

# scan the file
f = self.file
f.seek(0)
while 1:
pos = f.tell()
line = f.readline()
if not line:
break
if line[0] == ">":
# save offset to header line
header = line[1:].strip()
index[header] = pos

# save index to disk
f = open(indexfile, "wb")
marshal.dump(index, f)
f.close()

self.index = index

db = FASTA("myfastafile.dat")
print db["CW127_A02"]

['TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAG...

tweak as necessary.

</F>



 
Reply With Quote
 
 
 
 
John Lenton
Guest
Posts: n/a
 
      01-12-2005
> If you could help me figure out how to code a solution
> that won't be a resource whore, I'd be _very_ grateful. (I'd prefer

to
> keep it in Python only, even though I know interaction with a
> relational database would provide the fastest method--the group I'm
> trying to write this for does not have access to a RDBMS.)

You don't need a RDBMS; I'd put it in a DBM or CDB myself.

 
Reply With Quote
 
Larry Bates
Guest
Posts: n/a
 
      01-12-2005
You don't say how this will be used, but here goes:

1) Read the records and put into dictionary with key
of sequence (from header) and data being the sequence
data. Use shelve to store the dictionary for subsequent
runs (if load time is excessive).

2) Take a look at Gadfly (gadfly.sourceforge.net). It
provides you with Python SQL-like database and may be
better solution if data is basically static and you
do lots of processing.

All depends on how you use the data.

Regards,
Larry Bates
Syscon, Inc.

Chris Lasher wrote:
> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order. The FASTA format is a standard format
> for storing molecular biological sequences. Each record contains a
> header line for describing the sequence that begins with a '>'
> (right-angle bracket) followed by lines that contain the actual
> sequence data. Three example FASTA records are below:
>
>
>>CW127_A01

>
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGG GTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTA ATACCCCATA
> GCATTAAACAT
>
>>CW127_A02

>
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGG GTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTA ATACCCCATA
> GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAG GAATAGACGG
>
>>CW127_A03

>
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGG GTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTA ATACCCCATA
> GCATTAAACATTCCGCCTGGG
> ...
>
> Since the file I'm working with contains tens of thousands of these
> records, I believe I need to find a way to hash this file such that I
> can retrieve the respective sequence more quickly than I could by
> parsing through the file request-by-request. However, I'm very new to
> Python and am still very low on the learning curve for programming and
> algorithms in general; while I'm certain there are ubiquitous
> algorithms for this type of problem, I don't know what they are or
> where to look for them. So I turn to the gurus and accost you for help
> once again. If you could help me figure out how to code a solution
> that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
> keep it in Python only, even though I know interaction with a
> relational database would provide the fastest method--the group I'm
> trying to write this for does not have access to a RDBMS.)
> Thanks very much in advance,
> Chris
>

 
Reply With Quote
 
David E. Konerding DSD staff
Guest
Posts: n/a
 
      01-13-2005
In article <(E-Mail Removed). com>, Chris Lasher wrote:
> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order. The FASTA format is a standard format
> for storing molecular biological sequences. Each record contains a
> header line for describing the sequence that begins with a '>'
> (right-angle bracket) followed by lines that contain the actual
> sequence data. Three example FASTA records are below:


Use biopython. They have dictionary-style classes which wrap FASTA files using indexes.

http://www.biopython.org

Dave
 
Reply With Quote
 
John Machin
Guest
Posts: n/a
 
      01-13-2005

Chris Lasher wrote:
> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to
> access records in a random order. The FASTA format is a standard

format
> for storing molecular biological sequences. Each record contains a
> header line for describing the sequence that begins with a '>'
> (right-angle bracket) followed by lines that contain the actual
> sequence data. Three example FASTA records are below:
>
> >CW127_A01

> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGG GTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTA ATACCCCATA
> GCATTAAACAT

[snip]
> Since the file I'm working with contains tens of thousands of these
> records, I believe I need to find a way to hash this file such that I
> can retrieve the respective sequence more quickly than I could by
> parsing through the file request-by-request. However, I'm very new to
> Python and am still very low on the learning curve for programming

and
> algorithms in general; while I'm certain there are ubiquitous
> algorithms for this type of problem, I don't know what they are or
> where to look for them. So I turn to the gurus and accost you for

help
> once again. If you could help me figure out how to code a

solution
> that won't be a resource whore, I'd be _very_ grateful. (I'd prefer

to
> keep it in Python only, even though I know interaction with a
> relational database would provide the fastest method--the group I'm
> trying to write this for does not have access to a RDBMS.)
> Thanks very much in advance,
> Chris


Before you get too carried away, how often do you want to do this and
how grunty is the box you will be running on? Will the data be on a
server? If the server is on a WAN or at the other end of a radio link
between buildings, you definitely need an index so that you can access
the data randomly!

By way of example, to read all of a 157MB file into memory from a local
(i.e. not networked) disk using readlines() takes less than 4 seconds
on a 1.4Ghz Athlon processor (see below). The average new corporate
desktop box is about twice as fast as that. Note that Windows Task
Manager showed 100% CPU utilisation for both read() and readlines().

My guess is that you don't need anything much fancier than the effbot's
index method -- which by now you have probably found works straight out
of the box and is more than fast enough for your needs.

BTW, you need to clarify "don't have access to an RDBMS" ... surely
this can only be due to someone stopping them from installing good free
software freely available on the Internet.

HTH,
John

C:\junk>python -m timeit -n 1 -r 6 "print
len(file('bigfile.csv').read())"
157581595
157581595
157581595
157581595
157581595
157581595
1 loops, best of 6: 3.3e+006 usec per loop

C:\junk>python -m timeit -n 1 -r 6 "print
len(file('bigfile.csv').readlines())"
1118870
1118870
1118870
1118870
1118870
1118870
1 loops, best of 6: 3.57e+006 usec per loop

 
Reply With Quote
 
Bengt Richter
Guest
Posts: n/a
 
      01-13-2005
On 12 Jan 2005 14:46:07 -0800, "Chris Lasher" <(E-Mail Removed)> wrote:

>Hello,
>I have a rather large (100+ MB) FASTA file from which I need to
>access records in a random order. The FASTA format is a standard format
>for storing molecular biological sequences. Each record contains a
>header line for describing the sequence that begins with a '>'
>(right-angle bracket) followed by lines that contain the actual
>sequence data. Three example FASTA records are below:
>

Others have probably solved your basic problem, or pointed
the way. I'm just curious.

Given that the information content is 2 bits per character
that is taking up 8 bits of storage, there must be a good reason
for storing and/or transmitting them this way? I.e., it it easy
to think up a count-prefixed compressed format packing 4:1 in
subsequent data bytes (except for the last byte which have
less than 4 2-bit codes).

I'm wondering how the data is actually used once records are
retrieved. (but I'm too lazy to explore the biopython.org link).

>>CW127_A01

>TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACG GGTGAGTAATG
>TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCT AATACCCCATA
>GCATTAAACAT
>>CW127_A02

>TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACG GGTGAGTAATG
>TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCT AATACCCCATA
>GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAA GGAATAGACGG
>>CW127_A03

>TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACG GGTGAGTAATG
>TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCT AATACCCCATA
>GCATTAAACATTCCGCCTGGG
>...



Regards,
Bengt Richter
 
Reply With Quote
 
Chris Lasher
Guest
Posts: n/a
 
      01-13-2005
>Before you get too carried away, how often do you want to do this and
>how grunty is the box you will be running on?


Oops, I should have specified this. The script will only need to be run
once every three or four months, when the sequences are updated. I'll
be running it on boxes that are 3GHz/100GB Ram, but others may not be
so fortunate, and so I'd like to keep that in mind.

>BTW, you need to clarify "don't have access to an RDBMS" ... surely
>this can only be due to someone stopping them from installing good
>free software freely available on the Internet.


I understand your and others' sentiment on this. I agree, the
open-source database systems are wonderful. However, keeping it
Python-only saves me hassle by only having to assist in instances where
others need help downloading and installing Python. I suppose if I keep
it in Python, I can even use Py2exe to generate an executable that
wouldn't even require them to install Python. A solution using
interaction with a database is much sexier, but, for the purposes of
the script, seems unnecesary. However, I certainly appreciate the
suggestions.

>My guess is that you don't need anything much fancier than the
>effbot's index method -- which by now you have probably found works
>straight out of the box and is more than fast enough for your needs.


I think Mr. Lundh's code will work very well for these purposes. Thanks
very much to him for posting it. Many thanks for posting that! You'll
have full credit for that part of the code. Thanks very much to all who
replied!

Chris

 
Reply With Quote
 
Chris Lasher
Guest
Posts: n/a
 
      01-13-2005
Thanks for your reply, Larry. I thought about this, but I'm worried the
dictionary will consume a lot of resources. I think my 3GHz/1GB RAM box
could handle the load fine, but I'm not sure about others' systems.
Chris

 
Reply With Quote
 
Chris Lasher
Guest
Posts: n/a
 
      01-13-2005
>Others have probably solved your basic problem, or pointed
>the way. I'm just curious.


>Given that the information content is 2 bits per character
>that is taking up 8 bits of storage, there must be a good reason
>for storing and/or transmitting them this way? I.e., it it easy
>to think up a count-prefixed compressed format packing 4:1 in
>subsequent data bytes (except for the last byte which have
>less than 4 2-bit codes).


My guess for the inefficiency in storage size is because it is
human-readable, and because most in-silico molecular biology is just a
bunch of fancy string algorithms. This is my limited view of these
things at least.

>I'm wondering how the data is actually used once records are
>retrieved.


This one I can answer. For my purposes, I'm just organizing the
sequences at hand, but there are all sorts of things one could actually
do with sequences: alignments, BLAST searches, gene annotations, etc.

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Math.random() and Math.round(Math.random()) and Math.floor(Math.random()*2) VK Javascript 15 05-02-2010 03:43 PM
concatenate fasta file PeroMHC Python 3 02-13-2010 03:14 PM
random.random(), random not defined!? globalrev Python 4 04-20-2008 08:12 AM
convert protein fasta stream into harsh table zhong.huang@gmail.com Perl Misc 9 03-01-2006 07:57 PM
match muliple header records to associated detail records Luke Airig XML 0 12-31-2003 12:06 AM



Advertisments