Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Python (http://www.velocityreviews.com/forums/f43-python.html)
-   -   Need a specific sort of string modification. Can someone help? (http://www.velocityreviews.com/forums/t956150-need-a-specific-sort-of-string-modification-can-someone-help.html)

 Sia 01-05-2013 08:35 AM

Need a specific sort of string modification. Can someone help?

I have strings such as:

tA.-2AG.-2AG,-2ag
or
..+3ACG.+5CAACG.+3ACG.+3ACG

The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:

tA..,
and
....

How can I do that?
Thanks.

 Frank Millman 01-05-2013 09:15 AM

Re: Need a specific sort of string modification. Can someone help?

On 05/01/2013 10:35, Sia wrote:
> I have strings such as:
>
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG
>
> The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
>
> tA..,
> and
> ...
>
> How can I do that?
> Thanks.
>

Here is a naive solution (I am sure there are more elegant ones) -

def strip(old_string):
new_string = ''
max = len(old_string)
pos = 0
while pos < max:
char = old_string[pos]
if char in ('-', '+'):
num_pos = pos+1
num_str = ''
while old_string[num_pos].isdigit():
num_str += old_string[num_pos]
num_pos += 1
pos = num_pos + int(num_str)
else:
new_string += old_string[pos]
pos += 1
return new_string

It caters for the possibility that the number following the +/- could be
greater than 9 - I don't know if you need that.

It works with your examples, except that the second one returns '....',
which I think is correct - there are 4 dots in the original string.

HTH

Frank Millman

 Chris Angelico 01-05-2013 09:27 AM

Re: Need a specific sort of string modification. Can someone help?

On Sat, Jan 5, 2013 at 7:35 PM, Sia <hossein.asgharian@gmail.com> wrote:
> I have strings such as:
>
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG
>
> The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
>
> tA..,
> and
> ...
>
> How can I do that?

Interesting. Are you guaranteed that there are no other plus or minus
signs? Is the number after the sign just one digit? Assuming the
answers to both are "yes", here's a one-liner:

s=".+3ACG.+5CAACG.+3ACG.+3ACG"

result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])

Split on either - or +, then trim off that many characters from each
(that's the list comp), then join them back into a string.

ChrisA

 Roy Smith 01-05-2013 02:12 PM

Re: Need a specific sort of string modification. Can someone help?

Sia <hossein.asgharian@gmail.com> wrote:

> I have strings such as:
>
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG

Some kind of DNA binding site?

A couple of questions. Are the numbers always single digits? How much
data is there? Are we talking a few hundred 20-character strings, or
all of Genbank?

> The plus and minus signs are always followed by a number (say, i). I want
> python to find each single plus or minus, remove the sign, the number after
> it and remove i characters after that. So the two strings above become:
>
> tA..,
> and
> ...

If I follow your description properly, the last output should be "...."
(4 dots), right? This looks like it should work. I'm sure there's more
efficient ways to do it, but for small inputs, this should be ok.˜

The general pattern here is a state machine. It's a good pattern to
learn if you're going to be doing any kind of sequence analysis. See,
for example, http://en.wikipedia.org/wiki/State_machine.

# Build up the new string as a list (for efficiency)
new = []

# Keep track of what state we're in. The three possible states
# are 1) scanning for a region to be deleted, 2) looking for the
# number, and 3) reading past the letters to be deleted.
SCANNING = 1
NUMBER = 2
DROPPING = 3
state = SCANNING

# If we are in state DROPPING, dropcount is the number of
# letters remaining to be dropped.
dropcount = 0

old = '.+3ACG.+5CAACG.+3ACG.+3ACG'
for c in old:
if state == SCANNING:
if c in '+-':
state = NUMBER
else:
new.append(c)

elif state == NUMBER:
# Assume the counts are all single digits. If not, then
# we'll need a 4th state for accumulating the digits.
dropcount = int(c)
state = DROPPING

else:
assert state == DROPPING
dropcount -= 1
if dropcount == 0:
state = SCANNING

print ''.join(new)

 Roy Smith 01-05-2013 02:30 PM

Re: Need a specific sort of string modification. Can someone help?

In article <mailman.109.1357378077.2939.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:

> result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])

That's exceedingly clever. But bordering on line noise. At the very
least, I would break it up into a couple of lines to make it easier to
understand (plus you can print out the intermediate values to see what's
going on):

chunks = ("0"+s).replace("-","+").split("+")
result = "".join([x[int(x[0])+1:] for x in chunks]

 Chris Angelico 01-05-2013 02:47 PM

Re: Need a specific sort of string modification. Can someone help?

On Sun, Jan 6, 2013 at 1:30 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.109.1357378077.2939.python-list@python.org>,
> Chris Angelico <rosuav@gmail.com> wrote:
>
>> result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])

>
> That's exceedingly clever. But bordering on line noise. At the very
> least, I would break it up into a couple of lines to make it easier to
> understand (plus you can print out the intermediate values to see what's
> going on):
>
> chunks = ("0"+s).replace("-","+").split("+")
> result = "".join([x[int(x[0])+1:] for x in chunks]

Sure. You can always split a one-liner to taste, doesn't much matter
where. You could split it majorly into half a dozen lines if you want,
doesn't make a lot of diff. It'll work the same way :)

ChrisA

 Roy Smith 01-05-2013 03:03 PM

Re: Need a specific sort of string modification. Can someone help?

In article <mailman.120.1357397255.2939.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:

> On Sun, Jan 6, 2013 at 1:30 AM, Roy Smith <roy@panix.com> wrote:
> > In article <mailman.109.1357378077.2939.python-list@python.org>,
> > Chris Angelico <rosuav@gmail.com> wrote:
> >
> >> result = "".join([x[int(x[0])+1:] for x in
> >> ("0"+s).replace("-","+").split("+")])

> >
> > That's exceedingly clever. But bordering on line noise. At the very
> > least, I would break it up into a couple of lines to make it easier to
> > understand (plus you can print out the intermediate values to see what's
> > going on):
> >
> > chunks = ("0"+s).replace("-","+").split("+")
> > result = "".join([x[int(x[0])+1:] for x in chunks]

>
> Sure. You can always split a one-liner to taste, doesn't much matter
> where.

Well, there are better and worse places to split things. Consider these
three statements:

result = f1().f2().f3().f4().f5()
result = f1(f2.f3(f4().f5()))
result = f1(f2(3(f4(f5()))))

Same number of function calls, but the middle one is clearly the most
difficult to understand. The reason is because the scan direction keeps
changing. The first one is strictly left-to-right. The last one is
strictly inside-to-outside. The middle one is all over the place.

That's why I chose to split this where I did. It was where the scan
direction changed.

 Chris Angelico 01-05-2013 03:09 PM

Re: Need a specific sort of string modification. Can someone help?

On Sun, Jan 6, 2013 at 2:03 AM, Roy Smith <roy@panix.com> wrote:
> That's why I chose to split this where I did. It was where the scan
> direction changed.

Ah, good point. In any case, this is a fairly simple and clear way of
doing things; it may or may not run faster than the explicit state
machine, but IMHO it's a lot clearer to read a split() than something
that changes state when a particular character is found.

ChrisA

 Roy Smith 01-05-2013 03:38 PM

Re: Need a specific sort of string modification. Can someone help?

In article <mailman.121.1357398573.2939.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:

> it may or may not run faster than the explicit state machine,

Hmmm, hard to say. Both are O(n), but yours makes several more copies
of the data than mine (the string addition, the replace(), the split(),
the string slicing). We both make copies as we put fragments into a
list, and when we do the join().

On the other hand, yours does all that inside the library, which is
generally faster than random python code.

Let's see:

My way:
real 0m2.097s
user 0m2.078s
sys 0m0.016s

real 0m0.649s
user 0m0.622s
sys 0m0.025s

You got me by a factor of 3 or 4. Not bad. And not terribly
surprising. Once you've got things to the same O() group, pushing as
much data manipulation as you can down into the library is usually a
pretty effective optimization technique.

> but IMHO it's a lot clearer to read a split() than something
> that changes state when a particular character is found.

Maybe. But being familiar with state machines is still a handy skill.
DNA sequence analysis has lots of problems like "find a start codon
which is within about 50 bases of a binding site, and then copy
everything up until you find a stop codon". Things like that often map
well to state machines. Especially if you're trying to do it in
parallel in all three reading frames.

 Chris Angelico 01-05-2013 03:57 PM

Re: Need a specific sort of string modification. Can someone help?

On Sun, Jan 6, 2013 at 2:38 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.121.1357398573.2939.python-list@python.org>,
> Chris Angelico <rosuav@gmail.com> wrote:
>
>> it may or may not run faster than the explicit state machine,

>
> You got me by a factor of 3 or 4. Not bad.

You miss my point, though. I went for simple Pythonic code, and never
measured its performance, on the expectation that it's "good enough".
Written in C, the state machine is probably WAY faster than splitting
and then iterating. My C++ MUD client uses code similar to that to
parse TELNET and ANSI codes from a stream of bytes in a socket (and
one of its "states" is that there's no more data available, so wait on
the socket); the rewrite in a high level language divides the string
on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
then afterward splits on "\n" to divide into lines. The code's much
less convoluted, it's easier to test different parts (because I can
simply call the ANSI parser with a block of text), and on a modern
computer, you can't see the performance difference (since you spend
most of your time waiting for socket data anyway).

But it's gratifying to know that the obvious and brief way to do
things is fast too :)

>> but IMHO it's a lot clearer to read a split() than something
>> that changes state when a particular character is found.

>
> Maybe. But being familiar with state machines is still a handy skill.
> DNA sequence analysis has lots of problems like "find a start codon
> which is within about 50 bases of a binding site, and then copy
> everything up until you find a stop codon". Things like that often map
> well to state machines. Especially if you're trying to do it in
> parallel in all three reading frames.

Sure. And if you're working with petabytes of data, these
considerations become fairly important. When that happens, you start
rewriting your algorithms in C, or using Cython, or something; at very
least, you start rewriting clear and simple algorithms into more
complex ones. But all this happens *after* the code has been tested
and proven. All the rewrites can be verified as being identical to
their reference implementations; you can test one piece at a time as
you change them. It's ever so much easier to work that way.

<anecdote>At work, we had one employee whose code was, shall we say,
less than stellar. At one point, he embarked on a months-long rewrite
of one of his modules; meanwhile, I was unable to adequately test code
that called on it. Once the rewrite was finally complete, I discovered
myriad bugs in my own code, ones that would have been found and fixed
instantly if I'd had even a slow version of the code to work against.
Starting with something you can easily debug helps enormously with
that, because debugging doesn't demand mega-TPS throughput.</anecdote>

ChrisA

All times are GMT. The time now is 02:11 PM.