yet another text parser...

 
 
Marc Hoeppner
      07-18-2007
Hi,

I have yet another question about how to write a specific text parser in
Ruby...
So, without further ado, this is what the source file looks like:

Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface
antigen precursor [Plasmodium falciparum 3D7]
(1085 letters)

Database: KOG
112,920 sequences; 47,500,486 total letters

Searching..................................................done



                                                         Score    E
Sequences producing significant alignments:              (bits) Value

At2g21510                                                   96   3e-19
At4g39150                                                   95   1e-18
At1g76700

and so on...

What I want to do is the following:
Read the source file, and if a line starts with "Query=", strip
everything from the line except the expression "gi|xxxxx". That part was no
problem with gsub, mind you. But now comes the tricky part (or not, I
guess...).
Go on from there until a line starting with "Sequence" is found, skip that
line and the following one, and print (puts) the third line together with the
"gi|xxxxx".
So for the above example it would look like this:

gi|23510597 At2g21510

Now, ideally I wouldn't have to include this skip-lines part, but I can't
find a regexp that lets me reliably identify the first line of the
results block (not all possible results start with At...).

How I tried to do it:

def stripname line
  s = line.gsub(/Query=/, '')
  u = s.gsub(/\|emb.*/, '')
end

count = 0 # initializing variables
t = nil
v = nil

ARGF.each do |l|

  puts l unless count.zero?
  count = [0, count-1].max

  if l.match(/^Query=/)
    t = stripname l
  elsif l.match(/^Sequences/)
    l = $1
    count = 2
    puts "#{t}#{l}"
  else
  end
end

But the output looks terrible:
gi|23510597

At2g21510 96 3e-19
gi|23510599

At5g14980 58 3e-08
gi|23510600

And no matter what I try, I can't get the gi|xxxx and the corresponding
"best hit" on the same line. I tried it with hashes, but frankly I don't know
enough about those yet.
So if anyone has a helpful comment or solution, I would be extremely
grateful!

Cheers,

Marc

--
Posted via http://www.ruby-forum.com/.

 
Andreas Schwarz
      07-18-2007
I'd throw it all into one big ugly regex:
s.match(/Query= (.+?\|.+?)\|.+?\(bits\)\s+Value\s+(.+?)\s+/m).to_a[1..2].join(' ')
=> "gi|23510597 At2g21510"
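For a report that holds several queries, the same pattern could be applied with String#scan. This is only a sketch: it assumes the whole file fits in one string, uses \s* after "Query=" instead of a literal space, and the helper name best_hits, the constants, and the second sample header are invented for illustration.

```ruby
# Andreas' pattern, with \s* after "Query=" (an adaptation for this sketch).
PATTERN = /Query=\s*(.+?\|.+?)\|.+?\(bits\)\s+Value\s+(.+?)\s+/m

# Hypothetical helper: one "gi|... hit" string per Query block in text.
def best_hits(text)
  text.scan(PATTERN).map { |gi, hit| "#{gi} #{hit}" }
end

# Sample report in the thread's format; the second header is invented.
SAMPLE = <<~REPORT
  Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface
  antigen precursor [Plasmodium falciparum 3D7]
  (1085 letters)

  Score E
  Sequences producing significant alignments: (bits) Value

  At2g21510 96 3e-19
  At4g39150 95 1e-18

  Query= gi|23510599|emb|XXXXXXXX| placeholder description
  (100 letters)

  Score E
  Sequences producing significant alignments: (bits) Value

  At5g14980 58 3e-08
REPORT

puts best_hits(SAMPLE)
```

One caveat: if a query has no hit table at all, the lazy .+? would run on into the next query's table, so a line-by-line approach is safer for such files.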


 
Marc Hoeppner
      07-18-2007
Andreas Schwarz wrote:
> I'd throw it all into one big ugly regex:
> s.match(/Query= (.+?\|.+?)\|.+?\(bits\)\s+Value\s+(.+?)\s+/m).to_a[1..2].join(' ')
> => "gi|23510597 At2g21510"


Thanks for the suggestion! However, if someone has a suggestion
regarding the following code and how to fix it, I'd be happy... it's
almost working and I just need to understand why it is behaving a bit
oddly. So here is the code.

def stripname line
  s = line.gsub(/Query=/, '')
  u = s.gsub(/\|emb.*/, '')
end


count = 0
gene = nil
store = Array.new

ARGF.each do |l|

  store.push(l) unless count.zero?
  count = [0, count-1].max

  if l.match(/^Query=/)
    gene = stripname l

  elsif l.match(/^Sequences/)
    count = 2
    puts "#{gene.strip} #{store.last.to_s.strip}"
  else

  end
end



Problem:

The code reads: if a line is found that starts with "Query=", call the method
stripname on it and store the result in the variable "gene". Go further, and if
you find a line that starts with "Sequence", start the "count" countdown
shown above. Now this is where the problem is. Since I wasn't able
to figure out how to get the formatting right, I decided to stick to the
skip-line approach and, instead of printing the lines, to store them in an
array. From there I simply read the last entry.

BUT: instead of printing every stored hit next to the corresponding "gene",
it shifts the whole thing by one line, so that each "gene" is associated with
the "best hit" of the previous match to "Query=".

gi|23510597
gi|23510599 At2g21510
gi|23510600 At5g14980

Now, I could solve that easily with a capable text editor, but I think
there must be an easy solution to this...right?

Cheers,
Marc


 
Robert Dober
      07-18-2007
On 7/18/07, Marc Hoeppner wrote:
> And no matter what I try, I cant get the gi|xxxx and the corresponding
> "best hit" in the same line.

It is a terrible thing that happens to me all the time: one tends to forget
these \n's.
Fortunately we have #chomp, but maybe you want to use #strip,
which removes trailing (and leading) whitespace, \n included.
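To make the difference concrete, here is a small sketch. The two strings are shaped like what the loop in the question would be holding (the exact column spacing of the hit line is made up), and the constant names are only for this example.

```ruby
# Lines read via ARGF.each keep their trailing newline.
GENE = " gi|23510597\n"              # what stripname returns for the Query= line
HIT  = "At2g21510   96   3e-19\n"    # a hit line; spacing is invented

p GENE.chomp   # only the trailing newline is removed
p GENE.strip   # leading and trailing whitespace are removed

# Joining the stripped pieces gives one output line per query.
puts "#{GENE.strip} #{HIT.split.first}"
```

Using split.first on the hit line is an extra step not mentioned above; it keeps only the hit ID and drops the score and E value.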

HTH
Robert



--
I always knew that one day Smalltalk would replace Java.
I just didn't know it would be called Ruby
-- Kent Beck

 
Robert Dober
      07-18-2007
On 7/18/07, Marc Hoeppner wrote:
> BUT: instead of printing every stored hit to the corresponding "gene",
> it shifts the whole thing 1 line. So that each "gene" is associated with
> the "best hit" of the previous match to "Query=".

Are you pushing before or after you use the last element of the array?
But you should go back to your original idea, which works just fine
now that you have discovered #strip (which you did before my post).
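To spell out the ordering issue: in the quoted loop, puts runs on the "Sequences" line, before the two following lines have been pushed, so store.last still holds the hit from the previous query. A possible fix, sketched below, is to emit the pair when the hit line itself is read. The method name pair_hits, the StringIO input, and the assumption that the gi identifier is numeric are all mine, not from the original code.

```ruby
require "stringio"

# Hypothetical rewrite of the countdown loop. It records the pair when the
# hit line itself is read, not when the "Sequences" header is seen.
def pair_hits(io)
  results = []
  gene = nil
  count = 0
  io.each_line do |l|
    if count > 0
      count -= 1
      # count hits zero on the second line after the header: the best hit
      results << "#{gene} #{l.split.first}" if count.zero?
    elsif l =~ /^Query=\s*(gi\|\d+)/   # assumes a numeric gi identifier
      gene = $1
    elsif l =~ /^Sequences/
      count = 2
    end
  end
  results
end

# Made-up input in the thread's format (headers shortened).
REPORT = StringIO.new(<<~TEXT)
  Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface
  Sequences producing significant alignments: (bits) Value

  At2g21510 96 3e-19
  At4g39150 95 1e-18
  Query= gi|23510599|emb|XXXXXXXX| placeholder description
  Sequences producing significant alignments: (bits) Value

  At5g14980 58 3e-08
TEXT

puts pair_hits(REPORT)
```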

Now this is a Ruby ML, right, so maybe you would accept that I make
the code a little more Rubyish:

gi = nil
ARGF.each do |line|
  case line
  when /Query=\s*(gi\|.*?)\|/
    gi = $1
  when /Sequence/
    puts gi.strip << " " << (1..2).map{ ARGF.readline }.last.strip
  end
end
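Since hashes came up earlier in the thread, here is a variant of the same loop that collects the pairs in a Hash instead of printing them. It is a sketch under assumptions: the method name best_hit_table, the StringIO input, and the choice to keep only the first word of the hit line are invented for illustration.

```ruby
require "stringio"

# Hypothetical helper: maps each gi identifier to the ID of its first hit.
def best_hit_table(io)
  table = {}
  gi = nil
  io.each_line do |line|
    case line
    when /Query=\s*(gi\|.*?)\|/
      gi = $1
    when /^Sequences/
      io.gets                           # skip the blank line after the header
      hit = io.gets                     # first hit line, or nil at end of input
      table[gi] = hit.to_s.split.first  # keep only the hit ID
    end
  end
  table
end

# Made-up input in the thread's format (headers shortened).
INPUT = <<~TEXT
  Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface
  Sequences producing significant alignments: (bits) Value

  At2g21510 96 3e-19
  Query= gi|23510599|emb|XXXXXXXX| placeholder description
  Sequences producing significant alignments: (bits) Value

  At5g14980 58 3e-08
TEXT

best_hit_table(StringIO.new(INPUT)).each { |id, hit| puts "#{id} #{hit}" }
```

A Hash keeps each query paired with its own hit, so the pairs can be looked up or printed in any order later.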

HTH
Robert

 
Marc Hoeppner
      07-18-2007
Very nice, thank you!


 