![]() |
Need a specific sort of string modification. Can someone help?
I have strings such as:
tA.-2AG.-2AG,-2ag or ..+3ACG.+5CAACG.+3ACG.+3ACG The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become: tA.., and .... How can I do that? Thanks. |
Re: Need a specific sort of string modification. Can someone help?
On 05/01/2013 10:35, Sia wrote:
> I have strings such as: > > tA.-2AG.-2AG,-2ag > or > .+3ACG.+5CAACG.+3ACG.+3ACG > > The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become: > > tA.., > and > ... > > How can I do that? > Thanks. > Here is a naive solution (I am sure there are more elegant ones) - def strip(old_string): new_string = '' max = len(old_string) pos = 0 while pos < max: char = old_string[pos] if char in ('-', '+'): num_pos = pos+1 num_str = '' while old_string[num_pos].isdigit(): num_str += old_string[num_pos] num_pos += 1 pos = num_pos + int(num_str) else: new_string += old_string[pos] pos += 1 return new_string It caters for the possibility that the number following the +/- could be greater than 9 - I don't know if you need that. It works with your examples, except that the second one returns '....', which I think is correct - there are 4 dots in the original string. HTH Frank Millman |
Re: Need a specific sort of string modification. Can someone help?
On Sat, Jan 5, 2013 at 7:35 PM, Sia <hossein.asgharian@gmail.com> wrote:
> I have strings such as: > > tA.-2AG.-2AG,-2ag > or > .+3ACG.+5CAACG.+3ACG.+3ACG > > The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become: > > tA.., > and > ... > > How can I do that? Interesting. Are you guaranteed that there are no other plus or minus signs? Is the number after the sign just one digit? Assuming the answers to both are "yes", here's a one-liner: s=".+3ACG.+5CAACG.+3ACG.+3ACG" result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")]) Split on either - or +, then trim off that many characters from each (that's the list comp), then join them back into a string. ChrisA |
Re: Need a specific sort of string modification. Can someone help?
In article <e480480d-f3b4-4491-969c-7d1843bf9e33@googlegroups.com>,
Sia <hossein.asgharian@gmail.com> wrote: > I have strings such as: > > tA.-2AG.-2AG,-2ag > or > .+3ACG.+5CAACG.+3ACG.+3ACG Some kind of DNA binding site? A couple of questions. Are the numbers always single digits? How much data is there? Are we talking a few hundred 20-character strings, or all of Genbank? > The plus and minus signs are always followed by a number (say, i). I want > python to find each single plus or minus, remove the sign, the number after > it and remove i characters after that. So the two strings above become: > > tA.., > and > ... If I follow your description properly, the last output should be "...." (4 dots), right? This looks like it should work. I'm sure there's more efficient ways to do it, but for small inputs, this should be ok.˜ The general pattern here is a state machine. It's a good pattern to learn if you're going to be doing any kind of sequence analysis. See, for example, http://en.wikipedia.org/wiki/State_machine. # Build up the new string as a list (for efficiency) new = [] # Keep track of what state we're in. The three possible states # are 1) scanning for a region to be deleted, 2) looking for the # number, and 3) reading past the letters to be deleted. SCANNING = 1 NUMBER = 2 DROPPING = 3 state = SCANNING # If we are in state DROPPING, dropcount is the number of # letters remaining to be dropped. dropcount = 0 old = '.+3ACG.+5CAACG.+3ACG.+3ACG' for c in old: if state == SCANNING: if c in '+-': state = NUMBER else: new.append(c) elif state == NUMBER: # Assume the counts are all single digits. If not, then # we'll need a 4th state for accumulating the digits. dropcount = int(c) state = DROPPING else: assert state == DROPPING dropcount -= 1 if dropcount == 0: state = SCANNING print ''.join(new) |
Re: Need a specific sort of string modification. Can someone help?
In article <mailman.109.1357378077.2939.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote: > result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")]) That's exceedingly clever. But bordering on line noise. At the very least, I would break it up into a couple of lines to make it easier to understand (plus you can print out the intermediate values to see what's going on): chunks = ("0"+s).replace("-","+").split("+") result = "".join([x[int(x[0])+1:] for x in chunks] |
Re: Need a specific sort of string modification. Can someone help?
On Sun, Jan 6, 2013 at 1:30 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.109.1357378077.2939.python-list@python.org>, > Chris Angelico <rosuav@gmail.com> wrote: > >> result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")]) > > That's exceedingly clever. But bordering on line noise. At the very > least, I would break it up into a couple of lines to make it easier to > understand (plus you can print out the intermediate values to see what's > going on): > > chunks = ("0"+s).replace("-","+").split("+") > result = "".join([x[int(x[0])+1:] for x in chunks] Sure. You can always split a one-liner to taste, doesn't much matter where. You could split it majorly into half a dozen lines if you want, doesn't make a lot of diff. It'll work the same way :) ChrisA |
Re: Need a specific sort of string modification. Can someone help?
In article <mailman.120.1357397255.2939.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote: > On Sun, Jan 6, 2013 at 1:30 AM, Roy Smith <roy@panix.com> wrote: > > In article <mailman.109.1357378077.2939.python-list@python.org>, > > Chris Angelico <rosuav@gmail.com> wrote: > > > >> result = "".join([x[int(x[0])+1:] for x in > >> ("0"+s).replace("-","+").split("+")]) > > > > That's exceedingly clever. But bordering on line noise. At the very > > least, I would break it up into a couple of lines to make it easier to > > understand (plus you can print out the intermediate values to see what's > > going on): > > > > chunks = ("0"+s).replace("-","+").split("+") > > result = "".join([x[int(x[0])+1:] for x in chunks] > > Sure. You can always split a one-liner to taste, doesn't much matter > where. Well, there are better and worse places to split things. Consider these three statements: result = f1().f2().f3().f4().f5() result = f1(f2.f3(f4().f5())) result = f1(f2(3(f4(f5())))) Same number of function calls, but the middle one is clearly the most difficult to understand. The reason is because the scan direction keeps changing. The first one is strictly left-to-right. The last one is strictly inside-to-outside. The middle one is all over the place. That's why I chose to split this where I did. It was where the scan direction changed. Readability matters. |
Re: Need a specific sort of string modification. Can someone help?
On Sun, Jan 6, 2013 at 2:03 AM, Roy Smith <roy@panix.com> wrote:
> That's why I chose to split this where I did. It was where the scan > direction changed. Ah, good point. In any case, this is a fairly simple and clear way of doing things; it may or may not run faster than the explicit state machine, but IMHO it's a lot clearer to read a split() than something that changes state when a particular character is found. ChrisA |
Re: Need a specific sort of string modification. Can someone help?
In article <mailman.121.1357398573.2939.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote: > it may or may not run faster than the explicit state machine, Hmmm, hard to say. Both are O(n), but yours makes several more copies of the data than mine (the string addition, the replace(), the split(), the string slicing). We both make copies as we put fragments into a list, and when we do the join(). On the other hand, yours does all that inside the library, which is generally faster than random python code. Let's see: My way: real 0m2.097s user 0m2.078s sys 0m0.016s Your way: real 0m0.649s user 0m0.622s sys 0m0.025s You got me by a factor of 3 or 4. Not bad. And not terribly surprising. Once you've got things to the same O() group, pushing as much data manipulation as you can down into the library is usually a pretty effective optimization technique. > but IMHO it's a lot clearer to read a split() than something > that changes state when a particular character is found. Maybe. But being familiar with state machines is still a handy skill. DNA sequence analysis has lots of problems like "find a start codon which is within about 50 bases of a binding site, and then copy everything up until you find a stop codon". Things like that often map well to state machines. Especially if you're trying to do it in parallel in all three reading frames. |
Re: Need a specific sort of string modification. Can someone help?
On Sun, Jan 6, 2013 at 2:38 AM, Roy Smith <roy@panix.com> wrote:
> In article <mailman.121.1357398573.2939.python-list@python.org>, > Chris Angelico <rosuav@gmail.com> wrote: > >> it may or may not run faster than the explicit state machine, > > You got me by a factor of 3 or 4. Not bad. You miss my point, though. I went for simple Pythonic code, and never measured its performance, on the expectation that it's "good enough". Written in C, the state machine is probably WAY faster than splitting and then iterating. My C++ MUD client uses code similar to that to parse TELNET and ANSI codes from a stream of bytes in a socket (and one of its "states" is that there's no more data available, so wait on the socket); the rewrite in a high level language divides the string on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and then afterward splits on "\n" to divide into lines. The code's much less convoluted, it's easier to test different parts (because I can simply call the ANSI parser with a block of text), and on a modern computer, you can't see the performance difference (since you spend most of your time waiting for socket data anyway). But it's gratifying to know that the obvious and brief way to do things is fast too :) >> but IMHO it's a lot clearer to read a split() than something >> that changes state when a particular character is found. > > Maybe. But being familiar with state machines is still a handy skill. > DNA sequence analysis has lots of problems like "find a start codon > which is within about 50 bases of a binding site, and then copy > everything up until you find a stop codon". Things like that often map > well to state machines. Especially if you're trying to do it in > parallel in all three reading frames. Sure. And if you're working with petabytes of data, these considerations become fairly important. When that happens, you start rewriting your algorithms in C, or using Cython, or something; at very least, you start rewriting clear and simple algorithms into more complex ones. But all this happens *after* the code has been tested and proven. All the rewrites can be verified as being identical to their reference implementations; you can test one piece at a time as you change them. It's ever so much easier to work that way. <anecdote>At work, we had one employee whose code was, shall we say, less than stellar. At one point, he embarked on a months-long rewrite of one of his modules; meanwhile, I was unable to adequately test code that called on it. Once the rewrite was finally complete, I discovered myriad bugs in my own code, ones that would have been found and fixed instantly if I'd had even a slow version of the code to work against. Starting with something you can easily debug helps enormously with that, because debugging doesn't demand mega-TPS throughput.</anecdote> ChrisA |
| All times are GMT. The time now is 11:49 AM. |
Powered by vBulletin®. Copyright ©2000 - 2013, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.