Organize large DNA txt files

 
 
thomasvangurp@gmail.com
 
      03-20-2009
Dear Fellow programmers,

I'm using Python scripts to organize some rather large datasets
describing DNA variation. Information is read, processed and written
to a file in sequential order, like this:
1+
1-
2+
2-

etc. The files that I created contain positional information
(nucleotide position) and some other info, like this:

file 1+:
--------------------------------------------
1 73 0 1 0 0
1 76 1 0 0 0
1 77 0 1 0 0
--------------------------------------------
file 1-
--------------------------------------------
1 74 0 0 6 0
1 78 0 0 4 0
1 89 0 0 0 2

Now the trick is that I want this:

File 1+ AND File 1-
--------------------------------------------
1 73 0 1 0 0
1 74 0 0 6 0
1 76 1 0 0 0
1 77 0 1 0 0
1 78 0 0 4 0
1 89 0 0 0 2
-------------------------------------------

So the information should be sorted by position. Right now I've
written some very complicated scripts that read a number of lines from
file 1- and 1+ and then combine this output. The problem is of course
that the running position in file 1- can be lower than in 1+, resulting
in an incorrect order. Since both files are too large to load into a
dictionary at once (both are 100 MB+), I need some sort of
alternative that can quickly sort everything without crashing my PC.

Your thoughts are appreciated.
Kind regards,
Thomas


 
 
 
 
 
MRAB
 
      03-20-2009

Here's my attempt:

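# input_1, input_2 and output are already-open file objects;
# each input must already be sorted by position (second column).
line_1 = input_1.readline()
line_2 = input_2.readline()
# Merge while both files still have lines left.
while line_1 and line_2:
    pos_1 = int(line_1.split(None, 2)[1])
    pos_2 = int(line_2.split(None, 2)[1])
    if pos_1 < pos_2:
        output.write(line_1)
        line_1 = input_1.readline()
    else:
        output.write(line_2)
        line_2 = input_2.readline()
# Copy whatever remains of the file that was not exhausted.
while line_1:
    output.write(line_1)
    line_1 = input_1.readline()
while line_2:
    output.write(line_2)
    line_2 = input_2.readline()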
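
For comparison, the same two-way merge can probably be written more compactly with the standard library's heapq.merge, which accepts a key function from Python 3.5 onwards. This is only a sketch under the same assumption as the loop above (each input is already sorted by position); the helper names position and merge_sorted are made up for illustration, and io.StringIO objects stand in for the real files.

```python
import heapq
import io


def position(line):
    # The second whitespace-separated column holds the nucleotide position.
    return int(line.split(None, 2)[1])


def merge_sorted(input_1, input_2, output):
    # heapq.merge is lazy: it keeps only one pending line per input,
    # so memory use does not grow with the size of the files.
    for line in heapq.merge(input_1, input_2, key=position):
        output.write(line)


# In-memory stand-ins for files 1+ and 1- (sample rows from the thread).
plus = io.StringIO("1 73 0 1 0 0\n1 76 1 0 0 0\n1 77 0 1 0 0\n")
minus = io.StringIO("1 74 0 0 6 0\n1 78 0 0 4 0\n1 89 0 0 0 2\n")
merged = io.StringIO()
merge_sorted(plus, minus, merged)
print(merged.getvalue(), end="")
```

With real data, the StringIO objects would be replaced by files opened with open(); heapq.merge also takes more than two inputs, should more files need combining.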

 
 
 
 
 
thomasvangurp@gmail.com
 
      03-20-2009
Thanks,
This works great!
I did not know that it is possible to iterate through the lines of a
file with a while loop that's conditional on additional lines being
present or not.
 
 
MRAB
 
      03-20-2009
thomasvangurp@gmail.com wrote:
> Thanks,
> This works great!
> I did not know that it is possible to iterate through the lines of a
> file with a while loop that's conditional on additional lines being
> present or not.
>

It relies on file.readline() returning an empty string when it's at the
end of the file (and that's the only time it does) and empty strings
being treated as False by 'while' (and non-empty strings being treated
as True). It's all in the docs!
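
A small sketch of that behaviour, using an in-memory io.StringIO as a stand-in for a real file (the sample content is made up): a blank line inside the file comes back as '\n', which is non-empty and therefore true, so only the real end of the file stops the loop.

```python
import io

# Stand-in for a real file; note the blank line in the middle.
handle = io.StringIO("1 73 0 1 0 0\n\n1 76 1 0 0 0\n")

lines = []
line = handle.readline()
while line:  # readline() returns '' only at the end of the file
    lines.append(line)
    line = handle.readline()

# The blank line was read as '\n', so the loop did not stop early.
print(lines)
# Reading again at the end of the file still gives an empty string.
print(repr(handle.readline()))
```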
 
 
Daniel Fetchinson
 
      03-20-2009

Have you considered using a lightweight database solution? SQLite is a
really simple, zero-configuration, server-less database, and a Python
binding for it comes with Python itself. I'd give it a try; it will
simplify tasks like these a great deal.

http://docs.python.org/library/sqlite3.html
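
As a rough illustration of the idea, the rows from both files could be loaded into one table and read back ordered by position. This is a sketch, not a tuned solution: the table name variation and the column names are invented here (the thread only identifies the second column as the position), and ":memory:" is used just to keep the example self-contained.

```python
import io
import sqlite3


def load(conn, handle):
    # Split each non-blank line into its six integer columns and bulk-insert.
    rows = (
        tuple(int(field) for field in line.split())
        for line in handle
        if line.strip()
    )
    conn.executemany("INSERT INTO variation VALUES (?, ?, ?, ?, ?, ?)", rows)


# Hypothetical schema: only "pos" has a meaning stated in the thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variation (col1 INTEGER, pos INTEGER, "
    "a INTEGER, b INTEGER, c INTEGER, d INTEGER)"
)

# In-memory stand-ins for files 1+ and 1- (sample rows from the thread).
load(conn, io.StringIO("1 73 0 1 0 0\n1 76 1 0 0 0\n1 77 0 1 0 0\n"))
load(conn, io.StringIO("1 74 0 0 6 0\n1 78 0 0 4 0\n1 89 0 0 0 2\n"))
conn.commit()

# Let the database do the sorting.
ordered = conn.execute("SELECT * FROM variation ORDER BY pos").fetchall()
for row in ordered:
    print(" ".join(str(value) for value in row))
conn.close()
```

For files of 100 MB or more, passing a file path to sqlite3.connect instead of ":memory:" keeps the data on disk, and the input order of the rows no longer matters.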

Cheers,
Daniel


 
 
Paul McGuire
 
      03-20-2009
On Mar 20, 11:59 am, Daniel Fetchinson wrote:
> Have you considered using a lightweight database solution? Sqlite is a
> really simple, zero configuration, server-less db and a python binding
> for it comes with python itself. I'd give it a try, it will simplify
> tasks like these a great deal.
>
> http://docs.python.org/library/sqlite3.html
>


I second Daniel's recommendation of using SQLite. Just as easily as
you create output files 1+ and 1-, you can work with an SQLite database
file. If you are worried that your database file may not be as easy
to read as using Notepad on 1+ and 1-, you can download the freeware
SQLite Database Browser (http://sqlitebrowser.sourceforge.net/) - think
of it as the Notepad for SQLite database files.

-- Paul

 
 
 
 