- **Python**
(*http://www.velocityreviews.com/forums/f43-python.html*)

  - **Keeping track of the N largest values**
    (*http://www.velocityreviews.com/forums/t740840-keeping-track-of-the-n-largest-values.html*)

**Keeping track of the N largest values**

I'm processing a stream of N numbers and want to keep track of the K
largest. There are too many numbers in the stream (i.e. N is too large) to
keep in memory at once. K is small (100 would be typical).

From a theoretical point of view, I should be able to do this in N log K
time. What I'm doing now is essentially:

```python
top = [-1]  # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)
```

I can see pathological cases (say, all input values the same) where running
time would be N K log K, but on average (N >> K and a random distribution
of values), this should be pretty close to N.

Is there a better way to do this, either from a theoretical running-time
point of view, or just a nicer way to code this in Python?

**Re: Keeping track of the N largest values**

Roy Smith wrote:

> I'm processing a stream of N numbers and want to keep track of the K
> largest. [...] Is there a better way to do this, either from a
> theoretical running time point of view, or just a nicer way to code
> this in Python?

http://docs.python.org/library/heapq...heapq.nlargest

**Re: Keeping track of the N largest values**

On 12/25/10 10:42 AM, Roy Smith wrote:

> I'm processing a stream of N numbers and want to keep track of the K
> largest. [...] Is there a better way to do this, either from a
> theoretical running time point of view, or just a nicer way to code
> this in Python?

heapq.nlargest()

http://docs.python.org/library/heapq#heapq.nlargest

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

**Re: Keeping track of the N largest values**

In article <Xns9E59A44D7CC49duncanbooth@127.0.0.1>, Duncan Booth
<duncan.booth@invalid.invalid> wrote:

> Roy Smith <roy@panix.com> wrote:
>
> > I'm processing a stream of N numbers and want to keep track of the K
> > largest. [...] Is there a better way to do this, either from a
> > theoretical running time point of view, or just a nicer way to code
> > this in Python?
>
> How about:
>
>     from heapq import nlargest
>     top = nlargest(K, input())
>
> It uses a heap so avoids completely resorting each time.

Hmmm, that looks like it would solve the problem as stated, but there's
another twist. In addition to finding the K largest values, I *also* need
to do some other processing on all the values (which I didn't show in the
original example, incorrectly thinking it not germane to my question). The
problem with nlargest() is that it doesn't give me a hook to do that.

PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
proportional to n, and not to the iterator length. That's the only
reasonable conclusion, but the docs don't actually come out and say it.

**Re: Keeping track of the N largest values**

Roy Smith <roy@panix.com> writes:

> > from heapq import nlargest
> > top = nlargest(K, input())
>
> In addition to finding the K largest values, I *also* need to do some
> other processing on all the values [...] The problem with nlargest()
> is that it doesn't give me a hook to do that.

```python
def processed_input():
    for x in input():
        process(x)
        yield x

top = nlargest(K, processed_input())
```

You could also write that more concisely with genexps, but it's a bit
clumsy.

**Re: Keeping track of the N largest values**

On 12/25/2010 8:04 AM, Peter Otten wrote:

> Roy Smith wrote:
>
> > I'm processing a stream of N numbers and want to keep track of the K
> > largest. [...]
>
> http://docs.python.org/library/heapq...heapq.nlargest

Incidentally, if it happens that the data is already in a database, MySQL
will do that.

```sql
SELECT val FROM tab ORDER BY val DESC LIMIT N;
```

will, for small N, keep only N values. For large N, it sorts. That's for an
un-indexed field. If "val" is an index, this is a very fast operation, since
the tree building has already been done.

John Nagle

**Re: Keeping track of the N largest values**

Roy Smith wrote:

> Duncan Booth <duncan.booth@invalid.invalid> wrote:
>
> > How about:
> >
> >     from heapq import nlargest
> >     top = nlargest(K, input())
> >
> > It uses a heap so avoids completely resorting each time.
>
> Hmmm, that looks like it would solve the problem as stated, but there's
> another twist. In addition to finding the K largest values, I *also*
> need to do some other processing on all the values [...] The problem
> with nlargest() is that it doesn't give me a hook to do that.

If Paul's solution doesn't suffice -- the heapq module has the building
blocks for a custom solution, e.g.:

```python
import heapq
from functools import partial

class NLargest(object):
    def __init__(self, n):
        self.n = n
        self.heap = []

    def tally(self, item):
        heap = self.heap
        if len(heap) >= self.n:
            heapq.heappushpop(heap, item)
            self.tally = partial(heapq.heappushpop, heap)
        else:
            heapq.heappush(heap, item)

    def __str__(self):
        return str(sorted(self.heap, reverse=True))

if __name__ == "__main__":
    import random
    items = range(100)
    random.shuffle(items)
    accu = NLargest(10)
    for item in items:
        accu.tally(item)
        print item, accu
```

> PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
> proportional to n, and not to the iterator length. That's the only
> reasonable conclusion, but the docs don't actually come out and say it.

I would hope so.

**Re: Keeping track of the N largest values**

```python
from bisect import insort_left

K = 5
top = []
while 1:
    x = input()
    if len(top) < K:
        insort_left(top, x)
    elif x > top[0]:
        del top[0]
        insort_left(top, x)
    print top
```

will be enough
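Packaged as a function and cross-checked against heapq.nlargest on random data, this bisect approach does keep the same multiset of top values; note that `del top[0]` is an O(K) shift, which is cheap while K stays small (the function name and test harness are mine, not from the thread):

```python
import heapq
import random
from bisect import insort_left

def top_k_bisect(stream, k):
    """Keep the k largest values in a sorted list, smallest first."""
    top = []
    for x in stream:
        if len(top) < k:
            insort_left(top, x)
        elif x > top[0]:
            del top[0]          # O(k) shift, fine for small k
            insort_left(top, x)
    return top

random.seed(0)
data = [random.randrange(1000) for _ in range(200)]
assert top_k_bisect(data, 5) == sorted(heapq.nlargest(5, data))
```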

**Re: Keeping track of the N largest values**

In article
<e05e480b-8956-4984-b4cc-9a1666380eed@l32g2000yqc.googlegroups.com>,
n00m <n00m@narod.ru> wrote:

> from bisect import insort_left
> [...]
> will be enough

Hmmm, that's an interesting idea. Thanks.

**Re: Keeping track of the N largest values**

On 12/25/2010 16:42, Roy Smith wrote:

> I'm processing a stream of N numbers and want to keep track of the K
> largest. [...] Is there a better way to do this, either from a
> theoretical running time point of view, or just a nicer way to code
> this in Python?

Here is my version:

```python
l = []
K = 10
while 1:
    a = input()
    if len(l) == K:
        l.remove(min(l))
    l = [x for x in l if x < a] + [a] + [x for x in l if x > a]
    print l
```
