Velocity Reviews - Computer Hardware Reviews

regexp with gaps

 
 
egrasso
 
      07-14-2008
Hi, I need to find the positions of some substrings inside a long
string. For this I'm using a loop that calls str.index(pattern,
last_found_position + 1), so I find every position where the pattern
matches. The pattern is a string of 20 chars, different each time I
run the script. That worked perfectly. The problem is that now I need to
find all positions where at least 12 of the pattern's 20 chars match.
For example, for the pattern "aaaaaa", find substrings such as "aaaaaa",
"aaabaa", "baaaaa", "ababaa", etc.

First I thought that I could create all possible patterns (with \w)
and check each of them, but I realized that there would be a lot of
different patterns to check (more than a few hundred, I think).
Is there any way to do this without checking a lot of patterns?
Thanks
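
To put a rough number on the "lot of patterns" concern: if each candidate pattern keeps 12 of the 20 positions fixed and turns the other 8 into a wildcard, there is one pattern per choice of 8 wildcard positions. A minimal Ruby sketch of that count follows; `binomial` is a helper name made up for this illustration, not something from the thread or from Ruby's core library.

```ruby
# Number of ways to choose k positions out of n (binomial coefficient).
# `binomial` is a hypothetical helper written for this sketch.
def binomial(n, k)
  (1..k).inject(1) { |acc, i| acc * (n - k + i) / i }
end

pattern_length = 20
max_mismatches = pattern_length - 12   # up to 8 positions may differ

# One wildcard pattern per choice of exactly 8 "don't care" positions.
puts binomial(pattern_length, max_mismatches)   # prints 125970
```

Under that assumption the count is far beyond a few hundred, which is why comparing each window of the string against the single original pattern tends to be the more practical route.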

 
phlip
 
      07-14-2008
> The problem is that now I need to
> find all positions where the pattern matches 12 or more chars.
> For example: For the pattern "aaaaaa", find substrings "aaaaaa",
> "aaabaa", "baaaaa", "ababaa", etc
>
> First I thought that I could create all possible patterns (with \w)


\w{12,}

Right?

Either that or \w{12}\w*
 
egrasso.rb@eng2.net
 
      07-15-2008
Mmmmm... no. I think I didn't explain the idea very well. I'm writing a
script to find specific DNA sequences (binding sites) inside a large
DNA sequence (for those who don't know, DNA sequences are made of 4
different bases: A, T, C and G). The problem is that a binding site doesn't
need to match 100% to work. For example, the binding site for a protein X
is "AAATTT", but the protein can also bind to the sequence "AAAGTT"
or "AACGTT" and work fine. I need to find all these sites, but the only data
I have is that "Protein X binds to AAATTT".
I finally solved the problem without using str.index or a regexp; basically,
I search manually:

(Note: variables are in Spanish! buscarBS = find binding site,
patron = pattern, semejanza = minimal similarity (from 0 to 1), cadena = string,
respuesta = answer, largo = length)

def buscarBS(patron, semejanza = 0.6, cadena = @secuencia)
  respuesta = ""
  largoc = cadena.length
  largop = patron.length
  # Mismatches tolerated per window, rounded up so the early exit below
  # never discards a window that the final similarity check would accept.
  max_fallos = (largop * (1 - semejanza)).ceil

  (0..largoc - largop).each do |i|
    puntos = 0   # matching positions in this window
    fallos = 0   # mismatching positions in this window
    j = 0
    # Compare the window with the pattern; stop early once it cannot qualify.
    while j < largop && fallos <= max_fallos
      if cadena[i + j] == patron[j]
        puntos += 1
      else
        fallos += 1
      end
      j += 1
    end

    similitud = puntos.to_f / largop
    if similitud >= semejanza
      # Report 1-based start ("desde") and end ("hasta") plus similarity in %.
      respuesta << "desde: #{i + 1} hasta: #{i + largop} - similitud: - #{similitud * 100}%\n"
    end
  end

  if respuesta.empty?
    # "No similar sequence was found"
    "No se encontro ninguna secuencia similar (similitud: #{semejanza} - #{patron})"
  else
    # "The following similarities were found:"
    "\nSe encontraron las siguientes similitudes:\n\n" + respuesta
  end
end

I still need to polish and optimize the code, but it finds all possible
sites with at least a given similarity and tells me how similar they
are. If anyone has another idea, needs more details about the code, or is
interested in bioinformatics with Ruby, tell me.
Thanks
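
The same sliding-window idea can also be phrased as "count the mismatches between the pattern and every window of the sequence". The sketch below is one possible restatement, not the poster's code: the names `hamming` and `approximate_matches` and the test sequence "GGAAAGTTCC" are made up for illustration, and it returns the hits as an array instead of building a report string.

```ruby
# Count the positions at which two equal-length strings differ.
def hamming(a, b)
  a.chars.zip(b.chars).count { |x, y| x != y }
end

# Return [start, window, matches] for every window of `sequence` that agrees
# with `pattern` in at least `min_matches` positions (start is 0-based).
# `approximate_matches` is a hypothetical name chosen for this sketch.
def approximate_matches(sequence, pattern, min_matches)
  len = pattern.length
  hits = []
  (0..sequence.length - len).each do |start|
    window = sequence[start, len]
    matches = len - hamming(window, pattern)
    hits << [start, window, matches] if matches >= min_matches
  end
  hits
end

approximate_matches("GGAAAGTTCC", "AAATTT", 5).each do |start, window, matches|
  puts "#{start}: #{window} (#{matches}/6)"   # prints "2: AAAGTT (5/6)"
end
```

Returning data rather than formatted text leaves sorting or filtering by similarity to the caller. The cost is still one comparison per position per window, so it is no faster than the loop above.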

Axel Etzold
 
      07-15-2008

Hi,

You could use the McIlroy-Hunt longest common subsequence (LCS) algorithm,
which gives you the longest common subsequences and also information of the type

'sequence AAATTT is transformed into AAAGTT by changing T to G at the fourth entry.'

You can find a Ruby gem implementation here: http://raa.ruby-lang.org/project/diff-lcs/

Best regards,

Axel
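
As a rough illustration of the two kinds of information mentioned in this reply, here is a small self-contained sketch. It uses a plain textbook dynamic-programming table, not the diff-lcs gem's API, and the names `lcs_length` and `substitutions` are assumptions made for this example.

```ruby
# Length of the longest common subsequence of a and b (textbook DP table).
# This is NOT the diff-lcs gem; it only illustrates the idea.
def lcs_length(a, b)
  table = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  a.each_char.with_index(1) do |x, i|
    b.each_char.with_index(1) do |y, j|
      table[i][j] =
        if x == y
          table[i - 1][j - 1] + 1
        else
          [table[i - 1][j], table[i][j - 1]].max
        end
    end
  end
  table[a.length][b.length]
end

# For equal-length strings, list the substitutions as [position, from, to],
# using 1-based positions.
def substitutions(a, b)
  a.chars.zip(b.chars).each_with_index
   .reject { |(x, y), _| x == y }
   .map { |(x, y), i| [i + 1, x, y] }
end

puts lcs_length("AAATTT", "AAAGTT")      # prints 5
p substitutions("AAATTT", "AAAGTT")      # prints [[4, "T", "G"]]
```

Unlike a position-by-position comparison, an LCS also tolerates insertions and deletions. Whether that is desirable for binding sites is a separate question that the thread does not settle.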
