extracting substrings from a file

 
 
sofiafig@gmail.com
      09-11-2006
Hi,

I have a file with several entries in the form:

AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
dethiobiotin synthetase (bioD), complete cds.

1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
/CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
/DEF=Mus musculus chaperonin subunit 8 (theta) (Cct, mRNA.
/PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1

and I would like to create a file that has only the following:

AFFX-BioB-5_at /GEN=bioB /gb:J04423.1

1415785_a_at /gb:NM_009840.1 /GEN=Cct8

Could anyone please tell me how can I do it?

Many thanks in advance
Sofia

 
Tim Chase
      09-11-2006
> I have a file with several entries in the form:
>
> AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
> corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
> 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
> 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
> dethiobiotin synthetase (bioD), complete cds.
>
> 1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
> /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
> /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct, mRNA.
> /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
>
> and I would like to create a file that has only the following:
>
> AFFX-BioB-5_at /GEN=bioB /gb:J04423.1
>
> 1415785_a_at /gb:NM_009840.1 /GEN=Cct8
>
> Could anyone please tell me how can I do it?


The following seems to do it for me...

outfile = open('out.txt', 'w')
for line in open('in.txt'):
    # only process lines that carry both qualifiers
    if '/GEN' in line and '/gb:' in line:
        newline = []
        for index, item in enumerate(line.split()):
            # keep the first word (the ID) plus any /GEN and /gb: words
            if (index == 0 or item.startswith('/GEN')
                    or item.startswith('/gb:')):
                newline.append(item)
        outfile.write('\t'.join(newline))
        outfile.write('\n')
outfile.close()


Some conditions are underspecified. I presume that both /GEN and
/gb: have to appear in the line. If only one of them is
required, change the "and" to an "or".
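
The "and"/"or" choice can also be expressed as a flag rather than an edit to the code. The following is only a sketch under that assumption; the names filter_line and require_all are hypothetical and do not appear in the code above.

```python
def filter_line(line, require_all=True):
    """Return the kept words joined by tabs, or None if the line is skipped.

    require_all=True  mirrors the "and" test (both markers must be present).
    require_all=False mirrors the "or" test (either marker is enough).
    """
    markers = ('/GEN', '/gb:')
    present = [marker in line for marker in markers]
    wanted = all(present) if require_all else any(present)
    if not wanted:
        return None
    words = line.split()
    # keep the first word (the ID) plus every word starting with a marker
    kept = [w for i, w in enumerate(words)
            if i == 0 or w.startswith(markers)]
    return '\t'.join(kept)


line = "AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF"
print(filter_line(line))
```

Like the loop above, this works one physical line at a time, so it assumes the ID and the wanted qualifiers sit on the same line.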

-tkc



 
John Machin
      09-11-2006
(E-Mail Removed) wrote:
> Hi,
>
> I have a file with several entries in the form:
>
> AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
> corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
> 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
> 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
> dethiobiotin synthetase (bioD), complete cds.
>
> 1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
> /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
> /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct, mRNA.
> /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
>
> and I would like to create a file that has only the following:
>
> AFFX-BioB-5_at /GEN=bioB /gb:J04423.1
>
> 1415785_a_at /gb:NM_009840.1 /GEN=Cct8
>
> Could anyone please tell me how can I do it?
>
> Many thanks in advance
> Sofia


Here's my first iteration:
C:\junk>type sofia.py
prefixes = ['/GEN=', '/gb:']

def extract(fname):
    f = open(fname, 'r')
    # collect the words of each blank-line-separated entry into one chunk
    chunks = [[]]
    for line in f:
        words = line.split()
        if words:
            chunks[-1].extend(words)
        else:
            chunks.append([])
    f.close()
    for chunk in chunks:
        if not chunk:
            continue
        # the first word of an entry is its ID
        output = [chunk[0]]
        for word in chunk:
            for prefix in prefixes:
                if word.startswith(prefix):
                    output.append(word)
                    break
        print(' '.join(output))

if __name__ == "__main__":
    import sys
    extract(sys.argv[1])

C:\junk>sofia.py sofia.txt
AFFX-BioB-5_at /GEN=bioB /gb:J04423.1 /gb:J04423.1
1415785_a_at /gb:NM_009840.1 /GEN=Cct8 /gb:BC009007.1

Before I fix the duplicate in the first line, you need to say whether
you really want the /gb:BC009007.1 in the second line thrown away.
In other words, what's the rule? For each prefix, either (1) get the
first "word" that starts with that prefix or (2) get all unique such
words. You choose.
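
To make the two candidate rules concrete, here is a rough sketch that applies either one to the word list of a single entry. The function name select and the first_only flag are hypothetical, and the sample entries are shortened versions of the ones in the question.

```python
PREFIXES = ['/GEN=', '/gb:']


def select(words, first_only=True):
    """Pick the ID plus qualifier words from one entry's word list."""
    output = [words[0]]          # the first word is the ID
    seen_prefixes = set()
    for word in words[1:]:
        for prefix in PREFIXES:
            if word.startswith(prefix):
                if first_only:
                    # rule (1): keep only the first word for each prefix
                    if prefix not in seen_prefixes:
                        seen_prefixes.add(prefix)
                        output.append(word)
                elif word not in output:
                    # rule (2): keep every distinct matching word
                    output.append(word)
                break
    return output


entry1 = ("AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF "
          "corresponding to nucleotides 2032-2305 of /gb:J04423.1").split()
entry2 = ("1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 "
          "/FL=/gb:NM_009840.1 /gb:BC009007.1").split()

for entry in (entry1, entry2):
    print(' '.join(select(entry, first_only=True)))
    print(' '.join(select(entry, first_only=False)))
```

On these two entries, rule (1) reproduces the output asked for in the question, while rule (2) additionally keeps /gb:BC009007.1 for the second entry.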

Cheers,
John

 
Larry Bates
      09-11-2006
(E-Mail Removed) wrote:
> Hi,
>
> I have a file with several entries in the form:
>
> AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
> corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
> 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
> 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
> dethiobiotin synthetase (bioD), complete cds.
>
> 1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
> /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
> /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct, mRNA.
> /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
>
> and I would like to create a file that has only the following:
>
> AFFX-BioB-5_at /GEN=bioB /gb:J04423.1
>
> 1415785_a_at /gb:NM_009840.1 /GEN=Cct8
>
> Could anyone please tell me how can I do it?
>
> Many thanks in advance
> Sofia
>

What have you tried so far?

Hint: split the line on spaces. The first piece is the first item you want;
then iterate over the remaining pieces looking for the /GEN= and /gb: pieces
that you are interested in keeping. I am assuming that the /GEN= and /gb:
values don't contain any spaces. If they do, you will need to use
regular expressions instead of split.
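
For comparison, here is a minimal regular-expression sketch of the same extraction. It still assumes the values contain no whitespace; values with spaces would need an explicit end-of-value rule, which the thread does not define. The pattern and the function name extract are illustrative assumptions, not a settled solution.

```python
import re

# A qualifier must start at the beginning of the text or right after
# whitespace, so the "/gb:" embedded in "/FL=/gb:..." is not matched.
QUALIFIER = re.compile(r'(?<!\S)/(?:GEN=|gb:)\S+')


def extract(entry):
    """Return the entry ID followed by its /GEN= and /gb: words."""
    entry_id = entry.split()[0]
    return [entry_id] + QUALIFIER.findall(entry)


entry = ("1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8\n"
         "/PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 "
         "/gb:BC009007.1")
print(' '.join(extract(entry)))
```

Because the pattern runs over the whole entry text, it also picks up qualifiers on continuation lines, so every /gb: word in the entry is returned, not just the first.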

-Larry Bates
 
Paul McGuire
      09-11-2006
(E-Mail Removed) wrote:
> Hi,
>
> I have a file with several entries in the form:
>
> AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
> corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
> 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
> 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
> dethiobiotin synthetase (bioD), complete cds.
>
> 1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
> /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
> /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct, mRNA.
> /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
>
> and I would like to create a file that has only the following:
>
> AFFX-BioB-5_at /GEN=bioB /gb:J04423.1
>
> 1415785_a_at /gb:NM_009840.1 /GEN=Cct8


Here's a pyparsing solution that will address your immediate question, and
also gives you some leeway for adding other "/" options to your search.
Pyparsing's home page is at pyparsing.wikispaces.com.

-- Paul


data = """
AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
dethiobiotin synthetase (bioD), complete cds.

1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
/CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
/DEF=Mus musculus chaperonin subunit 8 (theta) (Cct, mRNA.
/PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
"""

from pyparsing import *

# create expression we are looking for:
#   name [ junk word... ] /qualifier...
name = Word(alphanums, printables).setResultsName("name")
junkWord = ~(Literal("/")) + Word(printables)
qualifier = ("/" + Word(alphas + "_-.").setResultsName("key") +
             oneOf("= :") +
             Word(printables).setResultsName("value"))
expr = name + ZeroOrMore(junkWord) + \
    Dict(ZeroOrMore(qualifier)).setResultsName("quals")

# use parse action to repackage qualifier data to support "dict"-like
# access to qualifiers
qualifier.setParseAction(lambda t: (t.key, "".join(t)))

# use this parse action instead if you just want whatever is
# after the '=' or ':' delimiter in the qualifier
# qualifier.setParseAction( lambda t: (t.key,t.value) )

# parse data strings, showing returned data structure
# (just to show what pyparsing results structure looks like)
for d in data.split("\n\n"):
    res = expr.parseString(d)
    print(res.dump())
    print("")
    print("")

# now just do what the OP wanted in the first place
for d in data.split("\n\n"):
    res = expr.parseString(d)
    print("%s %s %s" % (res.name, res.quals["gb"], res.quals["GEN"]))


Gives these results:
['AFFX-BioB-5_at', 'E.', 'coli', [('GEN', '/GEN=bioB'), ('gb',
'/gb:J04423.1')]]
- name: AFFX-BioB-5_at
- quals: [('GEN', '/GEN=bioB'), ('gb', '/gb:J04423.1')]
  - GEN: /GEN=bioB
  - gb: /gb:J04423.1

['1415785_a_at', [('gb', '/gb:NM_009840.1'), ('DB_XREF',
'/DB_XREF=gi:6753327'), ('GEN', '/GEN=Cct8'), ('FEA', '/FEA=FLmRNA'),
('CNT', '/CNT=482'), ('TID', '/TID=Mm.17989.1'), ('TIER', '/TIER=FL+Stack'),
('STK', '/STK=281'), ('UG', '/UG=Mm.17989'), ('LL', '/LL=12469'), ('DEF',
'/DEF=Mus')]]
- name: 1415785_a_at
- quals: [('gb', '/gb:NM_009840.1'), ('DB_XREF', '/DB_XREF=gi:6753327'),
('GEN', '/GEN=Cct8'), ('FEA', '/FEA=FLmRNA'), ('CNT', '/CNT=482'), ('TID',
'/TID=Mm.17989.1'), ('TIER', '/TIER=FL+Stack'), ('STK', '/STK=281'), ('UG',
'/UG=Mm.17989'), ('LL', '/LL=12469'), ('DEF', '/DEF=Mus')]
  - CNT: /CNT=482
  - DB_XREF: /DB_XREF=gi:6753327
  - DEF: /DEF=Mus
  - FEA: /FEA=FLmRNA
  - GEN: /GEN=Cct8
  - LL: /LL=12469
  - STK: /STK=281
  - TID: /TID=Mm.17989.1
  - TIER: /TIER=FL+Stack
  - UG: /UG=Mm.17989
  - gb: /gb:NM_009840.1


AFFX-BioB-5_at /gb:J04423.1 /GEN=bioB
1415785_a_at /gb:NM_009840.1 /GEN=Cct8


 