Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Need a specific sort of string modification. Can someone help?

Reply
Thread Tools

Need a specific sort of string modification. Can someone help?

 
 
Tim Chase
Guest
Posts: n/a
 
      01-05-2013
On 01/05/13 02:35, Sia wrote:
> I have strings such as:
>
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG
>
> The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
>
> tA..,
> and
> ...


With the same caveat as Frank posted about the second one being
"...." (4 dots), I don't know how this version times out:

import re
r = re.compile(r"[-+](\d+)([^-+]*)")
def modify(m):
result = m.group(2)[int(m.group(1)):]
return result
for test, expected in (
("tA.-2AG.-2AG,-2ag", "tA..,"),
(".+3ACG.+5CAACG.+3ACG.+3ACG", "...."),
):
s = r.sub(modify, test)
print "%r -> %r (%r)" % (
test, s, expected
)
assert s == expected, "Nope"

(it passes the tests as modified to "....")

-tkc





 
Reply With Quote
 
 
 
 
Tim Chase
Guest
Posts: n/a
 
      01-05-2013
On 01/05/13 11:24, Tim Chase wrote:
> I don't know how this version times out:
>
> import re
> r = re.compile(r"[-+](\d+)([^-+]*)")
> def modify(m):
> result = m.group(2)[int(m.group(1)):]
> return result


Doh, I intended to change this after testing, making it just

returm m.group(2)[int(m.group(1)):]

rather than expending an intermediate variable I used for testing

-tkc


 
Reply With Quote
 
 
 
 
Ian Kelly
Guest
Posts: n/a
 
      01-05-2013
On Sat, Jan 5, 2013 at 8:57 AM, Chris Angelico <(E-Mail Removed)> wrote:
> You miss my point, though. I went for simple Pythonic code, and never
> measured its performance, on the expectation that it's "good enough".
> Written in C, the state machine is probably WAY faster than splitting
> and then iterating. My C++ MUD client uses code similar to that to
> parse TELNET and ANSI codes from a stream of bytes in a socket (and
> one of its "states" is that there's no more data available, so wait on
> the socket); the rewrite in a high level language divides the string
> on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
> then afterward splits on "\n" to divide into lines. The code's much
> less convoluted, it's easier to test different parts (because I can
> simply call the ANSI parser with a block of text), and on a modern
> computer, you can't see the performance difference (since you spend
> most of your time waiting for socket data anyway).


Anecdotally and somewhat off-topic, when I wrote my own MUD client in
Python, I implemented both TELNET and ANSI parsing in Python using a
state machine processing one byte at a time (actually two state
machines - one at the protocol layer and one at the client layer; the
telnet module is a modified version of the twisted.conch.telnet
module). I was worried that the processing would be too slow in
Python. When I got it running, it turned out that there was a
noticeable lag between input being received and displayed. But when I
profiled the issue it actually turned out that the rich text control I
was using, which was written in C++, wasn't able to handle a large
buffer gracefully. The parsing code in Python did just fine and
didn't contribute to the lag issue at all.
 
Reply With Quote
 
Chris Angelico
Guest
Posts: n/a
 
      01-05-2013
On Sun, Jan 6, 2013 at 7:04 AM, Ian Kelly <(E-Mail Removed)> wrote:
> On Sat, Jan 5, 2013 at 8:57 AM, Chris Angelico <(E-Mail Removed)> wrote:
>> You miss my point, though. I went for simple Pythonic code, and never
>> measured its performance, on the expectation that it's "good enough".
>> Written in C, the state machine is probably WAY faster than splitting
>> and then iterating. My C++ MUD client uses code similar to that to
>> parse TELNET and ANSI codes from a stream of bytes in a socket (and
>> one of its "states" is that there's no more data available, so wait on
>> the socket); the rewrite in a high level language divides the string
>> on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
>> then afterward splits on "\n" to divide into lines. The code's much
>> less convoluted, it's easier to test different parts (because I can
>> simply call the ANSI parser with a block of text), and on a modern
>> computer, you can't see the performance difference (since you spend
>> most of your time waiting for socket data anyway).

>
> Anecdotally and somewhat off-topic, when I wrote my own MUD client in
> Python, I implemented both TELNET and ANSI parsing in Python using a
> state machine processing one byte at a time (actually two state
> machines - one at the protocol layer and one at the client layer; the
> telnet module is a modified version of the twisted.conch.telnet
> module). I was worried that the processing would be too slow in
> Python. When I got it running, it turned out that there was a
> noticeable lag between input being received and displayed. But when I
> profiled the issue it actually turned out that the rich text control I
> was using, which was written in C++, wasn't able to handle a large
> buffer gracefully. The parsing code in Python did just fine and
> didn't contribute to the lag issue at all.


The opposite decision, the same result. Performance was *good enough*.
With Gypsum, my early performance problems were almost completely
solved by properly handling the expose-event and only painting the
parts that changed (I don't use a rich text control, I use a
GTK2.DrawingArea). Text processing of something that's come over a
network is seldom a problem; if it were, we'd see noticeably slower
downloads over HTTPS than HTTP, and that just doesn't happen. I think
(though I can't prove) that crypto written in C is probably more
expensive than ANSI parsing written in Python.

ChrisA
 
Reply With Quote
 
Roy Smith
Guest
Posts: n/a
 
      01-05-2013
In article <(E-Mail Removed)>,
Chris Angelico <(E-Mail Removed)> wrote:

> On Sun, Jan 6, 2013 at 7:04 AM, Ian Kelly <(E-Mail Removed)> wrote:
> > On Sat, Jan 5, 2013 at 8:57 AM, Chris Angelico <(E-Mail Removed)> wrote:
> >> You miss my point, though. I went for simple Pythonic code, and never
> >> measured its performance, on the expectation that it's "good enough".
> >> Written in C, the state machine is probably WAY faster than splitting
> >> and then iterating. My C++ MUD client uses code similar to that to
> >> parse TELNET and ANSI codes from a stream of bytes in a socket (and
> >> one of its "states" is that there's no more data available, so wait on
> >> the socket); the rewrite in a high level language divides the string
> >> on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
> >> then afterward splits on "\n" to divide into lines. The code's much
> >> less convoluted, it's easier to test different parts (because I can
> >> simply call the ANSI parser with a block of text), and on a modern
> >> computer, you can't see the performance difference (since you spend
> >> most of your time waiting for socket data anyway).

> >
> > Anecdotally and somewhat off-topic, when I wrote my own MUD client in
> > Python, I implemented both TELNET and ANSI parsing in Python using a
> > state machine processing one byte at a time (actually two state
> > machines - one at the protocol layer and one at the client layer; the
> > telnet module is a modified version of the twisted.conch.telnet
> > module). I was worried that the processing would be too slow in
> > Python. When I got it running, it turned out that there was a
> > noticeable lag between input being received and displayed. But when I
> > profiled the issue it actually turned out that the rich text control I
> > was using, which was written in C++, wasn't able to handle a large
> > buffer gracefully. The parsing code in Python did just fine and
> > didn't contribute to the lag issue at all.

>
> The opposite decision, the same result. Performance was *good enough*.
> With Gypsum, my early performance problems were almost completely
> solved by properly handling the expose-event and only painting the
> parts that changed (I don't use a rich text control, I use a
> GTK2.DrawingArea). Text processing of something that's come over a
> network is seldom a problem; if it were, we'd see noticeably slower
> downloads over HTTPS than HTTP, and that just doesn't happen. I think
> (though I can't prove) that crypto written in C is probably more
> expensive than ANSI parsing written in Python.
>
> ChrisA


It's rare to find applications these days that are truly CPU bound.
Once you've used some reasonable algorithm, i.e. not done anything in
O(n^2) that could have been done in O(n) or O(n log n), you will more
often run up against I/O speed, database speed, network latency, memory
exhaustion, or some such as the reason your code is too slow.

The upshot of this is that for most things, Python (even though it runs
an order of magnitude slower than C), will always be *good enough*.

I'm sure I've mentioned this before, but the application code for
Songza.com is 100% Python. We pretty much can't even measure how much
CPU time is spent running Python code. Everything is database, network,
and (occasionally) disk throughput.
 
Reply With Quote
 
Mitya Sirenef
Guest
Posts: n/a
 
      01-06-2013
On 01/05/2013 03:35 AM, Sia wrote:
> I have strings such as:
>
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG
>
> The plus and minus signs are always followed by a number (say, i). I

want python to find each single plus or minus, remove the sign, the
number after it and remove i characters after that. So the two strings
above become:
>
> tA..,
> and
> ...
>
> How can I do that?
> Thanks.



I think it's a bit cleaner and nicer to do something similar to
itertools.takewhile but takewhile 'eats' a single next value.
I was actually doing some stuff that also needed this. I wonder if
there's a more elegant, robust way to do this?

Here's what I got for now:


class BIterator(object):
"""Iterator with 'buffered' takewhile."""

def __init__(self, seq):
self.seq = iter(seq)
self.buffer = []
self.end_marker = object()
self.last = None

def consume(self, n):
for _ in range(n): self.next()

def next(self):
val = self.buffer.pop() if self.buffer else next(self.seq,
self.end_marker)
self.last = val
return val

def takewhile(self, test):
lst = []
while True:
val = self.next()
if val is self.end_marker:
return lst
elif test(val):
lst.append(val)
else:
self.buffer.append(val)
return lst

def joined_takewhile(self, test):
return ''.join(self.takewhile(test))

def done(self):
return bool(self.last is self.end_marker)


s = ".+3ACG.+5CAACG.+3ACG.+3ACG"
not_plusminus = lambda x: x not in "+-"
isdigit = lambda x: x.isdigit()

def process(s):
lst = []
s = BIterator(s)

while True:
lst.extend(s.takewhile(not_plusminus))
if s.done(): break
s.next()
n = int(s.joined_takewhile(isdigit))
s.consume(n)

return ''.join(lst)


print(process(s))


Obviously it assumes the input is well-formed, but the logic would be
very easy to change to, for example, check for s.done() after each step.

- mitya



--
Lark's Tongue Guide to Python: http://lightbird.net/larks/

 
Reply With Quote
 
Roy Smith
Guest
Posts: n/a
 
      01-06-2013
In article <(E-Mail Removed)>,
Roy Smith <(E-Mail Removed)> wrote:

> It's rare to find applications these days that are truly CPU bound.
> Once you've used some reasonable algorithm, i.e. not done anything in
> O(n^2) that could have been done in O(n) or O(n log n), you will more
> often run up against I/O speed, database speed, network latency, memory
> exhaustion, or some such as the reason your code is too slow.


Well, I just found a counter-example

I've been doing some log analysis. It's been taking a grovelingly long
time, so I decided to fire up the profiler and see what's taking so
long. I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots
might be (looking up IP addresses in the geolocation database, or
producing some pretty pictures using matplotlib). It was just a matter
of figuring out which it was.

As with most attempts to out-guess the profiler, I was totally,
absolutely, and embarrassingly wrong.

It turns out we were spending most of the time parsing timestamps!
Since there's no convenient way (I don't consider strptime() to be
convenient) to parse isoformat strings in the standard library, our
habit has been to use the oh-so-simple parser from the third-party
dateutil package. Well, it turns out that's slow as all get-out
(probably because it's trying to be smart about auto-recognizing
formats). For the test I ran (on a few percent of the real data), we
spent 90 seconds in parse().

OK, so I dragged out the strptime() docs and built the stupid format
string (%Y-%m-%dT%H:%M:%S+00:00). That got us down to 25 seconds in
strptime().

But, I could also see it was spending a significant amount in routines
that looked like they were computing things like day of the week that we
didn't need. For what I was doing, we only really needed the hour and
minute. So I tried:

t_hour = int(date[11:13])
t_minute = int(date[14:16])

that got us down to 12 seconds overall (including the geolocation and
pretty pictures).

I think it turns out we never do anything with the hour and minute other
than print them back out, so just

t_hour_minute = date[11:16]

would probably be good enough, but I think I'm going to stop where I am
and declare victory
 
Reply With Quote
 
Mitya Sirenef
Guest
Posts: n/a
 
      01-06-2013
On 01/06/2013 01:32 AM, Mitya Sirenef wrote:
> On 01/05/2013 03:35 AM, Sia wrote:
>> I have strings such as:
> >
> > tA.-2AG.-2AG,-2ag
> > or
> > .+3ACG.+5CAACG.+3ACG.+3ACG
> >
> > The plus and minus signs are always followed by a number (say, i). I

> want python to find each single plus or minus, remove the sign, the
> number after it and remove i characters after that. So the two strings
> above become:
> >
> > tA..,
> > and
> > ...
> >
> > How can I do that?
> > Thanks.

>
>
> I think it's a bit cleaner and nicer to do something similar to
> itertools.takewhile but takewhile 'eats' a single next value.
> I was actually doing some stuff that also needed this. I wonder if
> there's a more elegant, robust way to do this?
>
> Here's what I got for now:
>
>
> class BIterator(object):
> """Iterator with 'buffered' takewhile."""
>
> def __init__(self, seq):
> self.seq = iter(seq)
> self.buffer = []
> self.end_marker = object()
> self.last = None
>
> def consume(self, n):
> for _ in range(n): self.next()
>
> def next(self):
> val = self.buffer.pop() if self.buffer else next(self.seq,
> self.end_marker)
> self.last = val
> return val
>
> def takewhile(self, test):
> lst = []
> while True:
> val = self.next()
> if val is self.end_marker:
> return lst
> elif test(val):
> lst.append(val)
> else:
> self.buffer.append(val)
> return lst
>
> def joined_takewhile(self, test):
> return ''.join(self.takewhile(test))
>
> def done(self):
> return bool(self.last is self.end_marker)
>
>
> s = ".+3ACG.+5CAACG.+3ACG.+3ACG"
> not_plusminus = lambda x: x not in "+-"
> isdigit = lambda x: x.isdigit()
>
> def process(s):
> lst = []
> s = BIterator(s)
>
> while True:
> lst.extend(s.takewhile(not_plusminus))
> if s.done(): break
> s.next()
> n = int(s.joined_takewhile(isdigit))
> s.consume(n)
>
> return ''.join(lst)
>
>
> print(process(s))
>
>
> Obviously it assumes the input is well-formed, but the logic would be
> very easy to change to, for example, check for s.done() after each step.
>
> - mitya
>
>
>


I've added some refinements:



class BIterator(object):
"""Iterator with 'buffered' takewhile and takeuntil."""

def __init__(self, seq):
self.seq = iter(seq)
self.buffer = []
self.end_marker = object()
self.last = None

def __bool__(self):
return self.last is not self.end_marker

def __next__(self):
val = self.buffer.pop() if self.buffer else next(self.seq,
self.end_marker)
self.last = val
return val

def consume(self, n):
for _ in range(n): next(self)

def takewhile(self, test):
lst = []
while True:
val = next(self)
if val is self.end_marker:
return lst
elif test(val):
lst.append(val)
else:
self.buffer.append(val)
return lst

def takeuntil(self, test):
negtest = lambda x: not test(x)
return self.takewhile(negtest)

def joined_takewhile(self, test):
return ''.join(self.takewhile(test))

def joined_takeuntil(self, test):
return ''.join(self.takeuntil(test))


def process(s):
s = BIterator(s)
lst = []
plusminus = lambda x: x in "+-"
isdigit = lambda x: x.isdigit()

while s:
lst.extend(s.takeuntil(plusminus))
next(s)
n = s.joined_takewhile(isdigit) or 0
s.consume(int(n))

return ''.join(lst)


s = ".+3ACG.+5CAACG.+3ACG.+3ACG"
print(process(s))




--
Lark's Tongue Guide to Python: http://lightbird.net/larks/

 
Reply With Quote
 
Steven D'Aprano
Guest
Posts: n/a
 
      01-06-2013
On Sun, 06 Jan 2013 12:28:55 -0500, Roy Smith wrote:

> I've been doing some log analysis. It's been taking a grovelingly long
> time, so I decided to fire up the profiler and see what's taking so
> long. I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots
> might be (looking up IP addresses in the geolocation database, or
> producing some pretty pictures using matplotlib). It was just a matter
> of figuring out which it was.
>
> As with most attempts to out-guess the profiler, I was totally,
> absolutely, and embarrassingly wrong.


+1 QOTW

Thank you for sharing this with us.


--
Steven
 
Reply With Quote
 
Nick Mellor
Guest
Posts: n/a
 
      01-07-2013
Hi Sia,

Here's another variation. It's within my tolerance for readability and also quick, if that's an issue. It does 100,000 of your longer string in a couple of seconds on my venerable laptop.

It handles only single-digit numbers. For multi-digit, I'd be inclined to have a look at takewhile and/or dropwhile, both in itertools (Python 2.7 and 3.x only.)


from string import maketrans

class redux:

def __init__(self):
intab = '+-'
outtab = ' ' # two spaces
self.trantab = maketrans(intab, outtab)


def reduce_plusminus(self, s):
list_form = [r[int(r[0]) + 1:] if r[0].isdigit() else r
for r
in s.translate(self.trantab).split()]
return ''.join(list_form)

if __name__ == "__main__":
p = redux()
print p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG")
print p.reduce_plusminus("tA.-2AG.-2AG,-2ag")

for n in range(100000): p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG")

On Saturday, 5 January 2013 19:35:26 UTC+11, Sia wrote:
> I have strings such as:
>
>
>
> tA.-2AG.-2AG,-2ag
>
> or
>
> .+3ACG.+5CAACG.+3ACG.+3ACG
>
>
>
> The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
>
>
>
> tA..,
>
> and
>
> ...
>
>
>
> How can I do that?
>
> Thanks.

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
HELP!! anyone ??can help me about my project "quick sort implemented with shell sort? comsciepartner General Computer Support 0 10-06-2008 01:02 PM
can someone explain line 7 and 8 of the insertion sort Troy Java 0 02-16-2008 04:50 PM
Can someone help me sort this one out and quick? Brent White ASP .Net 2 01-30-2008 05:28 PM
can Perl Sort do this, unix sort breaks on it (muliple spaces as demiliter) colin_lyse Perl Misc 1 02-03-2005 01:13 AM
Ado sort error-Ado Sort -Relate, Compute By, or Sort operations cannot be done on column(s) whose key length is unknown or exceeds 10 KB. Navin ASP General 1 09-09-2003 07:16 AM



Advertisments