Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Need a specific sort of string modification. Can someone help?

Reply
Thread Tools

Need a specific sort of string modification. Can someone help?

 
 
Sia
Guest
Posts: n/a
 
      01-05-2013
I have strings such as:

tA.-2AG.-2AG,-2ag
or
..+3ACG.+5CAACG.+3ACG.+3ACG

The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:

tA..,
and
....

How can I do that?
Thanks.
 
Reply With Quote
 
 
 
 
Frank Millman
Guest
Posts: n/a
 
      01-05-2013
On 05/01/2013 10:35, Sia wrote:
> I have strings such as:
>
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG
>
> The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
>
> tA..,
> and
> ...
>
> How can I do that?
> Thanks.
>


Here is a naive solution (I am sure there are more elegant ones) -

def strip(old_string):
new_string = ''
max = len(old_string)
pos = 0
while pos < max:
char = old_string[pos]
if char in ('-', '+'):
num_pos = pos+1
num_str = ''
while old_string[num_pos].isdigit():
num_str += old_string[num_pos]
num_pos += 1
pos = num_pos + int(num_str)
else:
new_string += old_string[pos]
pos += 1
return new_string

It caters for the possibility that the number following the +/- could be
greater than 9 - I don't know if you need that.

It works with your examples, except that the second one returns '....',
which I think is correct - there are 4 dots in the original string.

HTH

Frank Millman


 
Reply With Quote
 
 
 
 
Chris Angelico
Guest
Posts: n/a
 
      01-05-2013
On Sat, Jan 5, 2013 at 7:35 PM, Sia <(E-Mail Removed)> wrote:
> I have strings such as:
>
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG
>
> The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
>
> tA..,
> and
> ...
>
> How can I do that?


Interesting. Are you guaranteed that there are no other plus or minus
signs? Is the number after the sign just one digit? Assuming the
answers to both are "yes", here's a one-liner:

s=".+3ACG.+5CAACG.+3ACG.+3ACG"

result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])

Split on either - or +, then trim off that many characters from each
(that's the list comp), then join them back into a string.

ChrisA
 
Reply With Quote
 
Roy Smith
Guest
Posts: n/a
 
      01-05-2013
In article <(E-Mail Removed)>,
Sia <(E-Mail Removed)> wrote:

> I have strings such as:
>
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG


Some kind of DNA binding site?

A couple of questions. Are the numbers always single digits? How much
data is there? Are we talking a few hundred 20-character strings, or
all of Genbank?

> The plus and minus signs are always followed by a number (say, i). I want
> python to find each single plus or minus, remove the sign, the number after
> it and remove i characters after that. So the two strings above become:
>
> tA..,
> and
> ...


If I follow your description properly, the last output should be "...."
(4 dots), right? This looks like it should work. I'm sure there's more
efficient ways to do it, but for small inputs, this should be ok.

The general pattern here is a state machine. It's a good pattern to
learn if you're going to be doing any kind of sequence analysis. See,
for example, http://en.wikipedia.org/wiki/State_machine.

# Build up the new string as a list (for efficiency)
new = []

# Keep track of what state we're in. The three possible states
# are 1) scanning for a region to be deleted, 2) looking for the
# number, and 3) reading past the letters to be deleted.
SCANNING = 1
NUMBER = 2
DROPPING = 3
state = SCANNING

# If we are in state DROPPING, dropcount is the number of
# letters remaining to be dropped.
dropcount = 0

old = '.+3ACG.+5CAACG.+3ACG.+3ACG'
for c in old:
if state == SCANNING:
if c in '+-':
state = NUMBER
else:
new.append(c)

elif state == NUMBER:
# Assume the counts are all single digits. If not, then
# we'll need a 4th state for accumulating the digits.
dropcount = int(c)
state = DROPPING

else:
assert state == DROPPING
dropcount -= 1
if dropcount == 0:
state = SCANNING

print ''.join(new)
 
Reply With Quote
 
Roy Smith
Guest
Posts: n/a
 
      01-05-2013
In article <(E-Mail Removed)>,
Chris Angelico <(E-Mail Removed)> wrote:

> result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])


That's exceedingly clever. But bordering on line noise. At the very
least, I would break it up into a couple of lines to make it easier to
understand (plus you can print out the intermediate values to see what's
going on):

chunks = ("0"+s).replace("-","+").split("+")
result = "".join([x[int(x[0])+1:] for x in chunks]
 
Reply With Quote
 
Chris Angelico
Guest
Posts: n/a
 
      01-05-2013
On Sun, Jan 6, 2013 at 1:30 AM, Roy Smith <(E-Mail Removed)> wrote:
> In article <(E-Mail Removed)>,
> Chris Angelico <(E-Mail Removed)> wrote:
>
>> result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])

>
> That's exceedingly clever. But bordering on line noise. At the very
> least, I would break it up into a couple of lines to make it easier to
> understand (plus you can print out the intermediate values to see what's
> going on):
>
> chunks = ("0"+s).replace("-","+").split("+")
> result = "".join([x[int(x[0])+1:] for x in chunks]


Sure. You can always split a one-liner to taste, doesn't much matter
where. You could split it majorly into half a dozen lines if you want,
doesn't make a lot of diff. It'll work the same way

ChrisA
 
Reply With Quote
 
Roy Smith
Guest
Posts: n/a
 
      01-05-2013
In article <(E-Mail Removed)>,
Chris Angelico <(E-Mail Removed)> wrote:

> On Sun, Jan 6, 2013 at 1:30 AM, Roy Smith <(E-Mail Removed)> wrote:
> > In article <(E-Mail Removed)>,
> > Chris Angelico <(E-Mail Removed)> wrote:
> >
> >> result = "".join([x[int(x[0])+1:] for x in
> >> ("0"+s).replace("-","+").split("+")])

> >
> > That's exceedingly clever. But bordering on line noise. At the very
> > least, I would break it up into a couple of lines to make it easier to
> > understand (plus you can print out the intermediate values to see what's
> > going on):
> >
> > chunks = ("0"+s).replace("-","+").split("+")
> > result = "".join([x[int(x[0])+1:] for x in chunks]

>
> Sure. You can always split a one-liner to taste, doesn't much matter
> where.


Well, there are better and worse places to split things. Consider these
three statements:

result = f1().f2().f3().f4().f5()
result = f1(f2.f3(f4().f5()))
result = f1(f2(3(f4(f5()))))

Same number of function calls, but the middle one is clearly the most
difficult to understand. The reason is because the scan direction keeps
changing. The first one is strictly left-to-right. The last one is
strictly inside-to-outside. The middle one is all over the place.

That's why I chose to split this where I did. It was where the scan
direction changed.

Readability matters.
 
Reply With Quote
 
Chris Angelico
Guest
Posts: n/a
 
      01-05-2013
On Sun, Jan 6, 2013 at 2:03 AM, Roy Smith <(E-Mail Removed)> wrote:
> That's why I chose to split this where I did. It was where the scan
> direction changed.


Ah, good point. In any case, this is a fairly simple and clear way of
doing things; it may or may not run faster than the explicit state
machine, but IMHO it's a lot clearer to read a split() than something
that changes state when a particular character is found.

ChrisA
 
Reply With Quote
 
Roy Smith
Guest
Posts: n/a
 
      01-05-2013
In article <(E-Mail Removed)>,
Chris Angelico <(E-Mail Removed)> wrote:

> it may or may not run faster than the explicit state machine,


Hmmm, hard to say. Both are O(n), but yours makes several more copies
of the data than mine (the string addition, the replace(), the split(),
the string slicing). We both make copies as we put fragments into a
list, and when we do the join().

On the other hand, yours does all that inside the library, which is
generally faster than random python code.

Let's see:

My way:
real 0m2.097s
user 0m2.078s
sys 0m0.016s

Your way:
real 0m0.649s
user 0m0.622s
sys 0m0.025s

You got me by a factor of 3 or 4. Not bad. And not terribly
surprising. Once you've got things to the same O() group, pushing as
much data manipulation as you can down into the library is usually a
pretty effective optimization technique.

> but IMHO it's a lot clearer to read a split() than something
> that changes state when a particular character is found.


Maybe. But being familiar with state machines is still a handy skill.
DNA sequence analysis has lots of problems like "find a start codon
which is within about 50 bases of a binding site, and then copy
everything up until you find a stop codon". Things like that often map
well to state machines. Especially if you're trying to do it in
parallel in all three reading frames.
 
Reply With Quote
 
Chris Angelico
Guest
Posts: n/a
 
      01-05-2013
On Sun, Jan 6, 2013 at 2:38 AM, Roy Smith <(E-Mail Removed)> wrote:
> In article <(E-Mail Removed)>,
> Chris Angelico <(E-Mail Removed)> wrote:
>
>> it may or may not run faster than the explicit state machine,

>
> You got me by a factor of 3 or 4. Not bad.


You miss my point, though. I went for simple Pythonic code, and never
measured its performance, on the expectation that it's "good enough".
Written in C, the state machine is probably WAY faster than splitting
and then iterating. My C++ MUD client uses code similar to that to
parse TELNET and ANSI codes from a stream of bytes in a socket (and
one of its "states" is that there's no more data available, so wait on
the socket); the rewrite in a high level language divides the string
on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
then afterward splits on "\n" to divide into lines. The code's much
less convoluted, it's easier to test different parts (because I can
simply call the ANSI parser with a block of text), and on a modern
computer, you can't see the performance difference (since you spend
most of your time waiting for socket data anyway).

But it's gratifying to know that the obvious and brief way to do
things is fast too

>> but IMHO it's a lot clearer to read a split() than something
>> that changes state when a particular character is found.

>
> Maybe. But being familiar with state machines is still a handy skill.
> DNA sequence analysis has lots of problems like "find a start codon
> which is within about 50 bases of a binding site, and then copy
> everything up until you find a stop codon". Things like that often map
> well to state machines. Especially if you're trying to do it in
> parallel in all three reading frames.


Sure. And if you're working with petabytes of data, these
considerations become fairly important. When that happens, you start
rewriting your algorithms in C, or using Cython, or something; at very
least, you start rewriting clear and simple algorithms into more
complex ones. But all this happens *after* the code has been tested
and proven. All the rewrites can be verified as being identical to
their reference implementations; you can test one piece at a time as
you change them. It's ever so much easier to work that way.

<anecdote>At work, we had one employee whose code was, shall we say,
less than stellar. At one point, he embarked on a months-long rewrite
of one of his modules; meanwhile, I was unable to adequately test code
that called on it. Once the rewrite was finally complete, I discovered
myriad bugs in my own code, ones that would have been found and fixed
instantly if I'd had even a slow version of the code to work against.
Starting with something you can easily debug helps enormously with
that, because debugging doesn't demand mega-TPS throughput.</anecdote>

ChrisA
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
HELP!! anyone ??can help me about my project "quick sort implemented with shell sort? comsciepartner General Computer Support 0 10-06-2008 01:02 PM
can someone explain line 7 and 8 of the insertion sort Troy Java 0 02-16-2008 04:50 PM
Can someone help me sort this one out and quick? Brent White ASP .Net 2 01-30-2008 05:28 PM
can Perl Sort do this, unix sort breaks on it (muliple spaces as demiliter) colin_lyse Perl Misc 1 02-03-2005 01:13 AM
Ado sort error-Ado Sort -Relate, Compute By, or Sort operations cannot be done on column(s) whose key length is unknown or exceeds 10 KB. Navin ASP General 1 09-09-2003 07:16 AM



Advertisments