Velocity Reviews

Python forum thread: Keeping track of the N largest values (http://www.velocityreviews.com/forums/t740840-keeping-track-of-the-n-largest-values.html)

Roy Smith 12-25-2010 03:42 PM

Keeping track of the N largest values
 
I'm processing a stream of N numbers and want to keep track of the K
largest. There's too many numbers in the stream (i.e. N is too large)
to keep in memory at once. K is small (100 would be typical).

From a theoretical point of view, I should be able to do this in N log K
time. What I'm doing now is essentially:

top = [-1]  # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)

I can see pathological cases (say, all input values the same) where
running time would be N K log K, but on average (N >> K and random
distribution of values), this should be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?
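
For concreteness, the N log K bound comes from a size-K min-heap: heap[0] is always the smallest of the current top K, so each new value costs at most one O(log K) heap operation. A minimal sketch using the standard heapq module (a finite iterator stands in for the stream):

```python
import heapq

def k_largest(stream, k):
    # Min-heap of at most k items; heap[0] is the smallest
    # of the current top k.
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # Pop the smallest, push x: one O(log k) operation.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

# k_largest(iter([5, 1, 9, 3, 7]), 2) -> [9, 7]
```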

Peter Otten 12-25-2010 04:04 PM

Re: Keeping track of the N largest values
 
Roy Smith wrote:

> I'm processing a stream of N numbers and want to keep track of the K
> largest. There's too many numbers in the stream (i.e. N is too large)
> to keep in memory at once. K is small (100 would be typical).
>
> From a theoretical point of view, I should be able to do this in N log K
> time. What I'm doing now is essentially:
>
> top = [-1]  # Assume all x are >= 0
> for x in input():
>     if x <= top[0]:
>         continue
>     top.append(x)
>     if len(top) > K:
>         top.sort()
>         top.pop(0)
>
> I can see pathological cases (say, all input values the same) where
> running time would be N K log K, but on average (N >> K and random
> distribution of values), this should be pretty close to N.
>
> Is there a better way to do this, either from a theoretical running time
> point of view, or just a nicer way to code this in Python?


http://docs.python.org/library/heapq...heapq.nlargest

Robert Kern 12-25-2010 05:09 PM

Re: Keeping track of the N largest values
 
On 12/25/10 10:42 AM, Roy Smith wrote:
> I'm processing a stream of N numbers and want to keep track of the K
> largest. There's too many numbers in the stream (i.e. N is too large)
> to keep in memory at once. K is small (100 would be typical).
>
> From a theoretical point of view, I should be able to do this in N log K
> time. What I'm doing now is essentially:
>
> top = [-1]  # Assume all x are >= 0
> for x in input():
>     if x <= top[0]:
>         continue
>     top.append(x)
>     if len(top) > K:
>         top.sort()
>         top.pop(0)
>
> I can see pathological cases (say, all input values the same) where
> running time would be N K log K, but on average (N >> K and random
> distribution of values), this should be pretty close to N.
>
> Is there a better way to do this, either from a theoretical running time
> point of view, or just a nicer way to code this in Python?


heapq.nlargest()

http://docs.python.org/library/heapq#heapq.nlargest

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco


Roy Smith 12-25-2010 07:16 PM

Re: Keeping track of the N largest values
 
In article <Xns9E59A44D7CC49duncanbooth@127.0.0.1>,
Duncan Booth <duncan.booth@invalid.invalid> wrote:

> Roy Smith <roy@panix.com> wrote:
>
> >
> > I'm processing a stream of N numbers and want to keep track of the K
> > largest. There's too many numbers in the stream (i.e. N is too large)
> > to keep in memory at once. K is small (100 would be typical).
> > ...
> > Is there a better way to do this, either from a theoretical running
> > time point of view, or just a nicer way to code this in Python?

>
> How about:
>
> from heapq import nlargest
> top = nlargest(K, input())
>
> It uses a heap so avoids completely resorting each time.


Hmmm, that looks like it would solve the problem as stated, but there's
another twist. In addition to finding the K largest values, I *also*
need to do some other processing on all the values (which I didn't show
in the original example, incorrectly thinking it not germane to my
question). The problem with nlargest() is that it doesn't give me a
hook to do that.

PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
proportional to n, and not to the iterator length. That's the only
reasonable conclusion, but the docs don't actually come out and say it.
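
For what it's worth, CPython's nlargest(n, iterable) does consume its input one item at a time, keeping only a heap of n items internally. A quick (non-rigorous) way to see that it streams rather than buffers:

```python
import heapq

# A generator: the million values are produced one at a time and
# never held in a list.
stream = (i for i in range(10**6))

top = heapq.nlargest(3, stream)
print(top)  # [999999, 999998, 999997]
```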

Paul Rubin 12-25-2010 07:41 PM

Re: Keeping track of the N largest values
 
Roy Smith <roy@panix.com> writes:
>> from heapq import nlargest
>> top = nlargest(K, input())


> In addition to finding the K largest values, I *also* need to do some
> other processing on all the values .... The problem with nlargest()
> is that it doesn't give me a hook to do that.


def processed_input():
    for x in input():
        process(x)
        yield x

top = nlargest(K, processed_input())

You could also write that more concisely with genexps but it's a bit
clumsy.
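
One such genexp spelling leans on `or`, assuming process() returns None (hypothetical process(), and a finite iterator in place of input()):

```python
from heapq import nlargest

def process(x):
    # Stand-in for the real per-item work; assumed to return None.
    pass

K = 2
values = iter([3, 1, 4, 1, 5])  # stands in for input()

# process(x) runs for its side effect and returns None,
# so `process(x) or x` evaluates to x itself.
top = nlargest(K, (process(x) or x for x in values))
print(top)  # [5, 4]
```

It works, but hiding a side effect inside a genexp is arguably clumsier than the explicit generator function.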

John Nagle 12-25-2010 08:26 PM

Re: Keeping track of the N largest values
 
On 12/25/2010 8:04 AM, Peter Otten wrote:
> Roy Smith wrote:
>
>> I'm processing a stream of N numbers and want to keep track of the K
>> largest. There's too many numbers in the stream (i.e. N is too large)
>> to keep in memory at once. K is small (100 would be typical).

>
> http://docs.python.org/library/heapq...heapq.nlargest


Incidentally, if it happens that the data is already in
a database, MySQL will do that.

SELECT val FROM tab ORDER BY val DESC LIMIT N;

will, for small N, keep only N values. For large N,
it sorts.

That's for an un-indexed field. If "val" is an
index, this is a very fast operation, since the tree
building has already been done.

John Nagle


Peter Otten 12-26-2010 10:36 AM

Re: Keeping track of the N largest values
 
Roy Smith wrote:

> In article <Xns9E59A44D7CC49duncanbooth@127.0.0.1>,
> Duncan Booth <duncan.booth@invalid.invalid> wrote:
>
>> Roy Smith <roy@panix.com> wrote:
>>
>> >
>> > I'm processing a stream of N numbers and want to keep track of the K
>> > largest. There's too many numbers in the stream (i.e. N is too large)
>> > to keep in memory at once. K is small (100 would be typical).
>> > ...
>> > Is there a better way to do this, either from a theoretical running
>> > time point of view, or just a nicer way to code this in Python?

>>
>> How about:
>>
>> from heapq import nlargest
>> top = nlargest(K, input())
>>
>> It uses a heap so avoids completely resorting each time.

>
> Hmmm, that looks like it would solve the problem as stated, but there's
> another twist. In addition to finding the K largest values, I *also*
> need to do some other processing on all the values (which I didn't show
> in the original example, incorrectly thinking it not germane to my
> question). The problem with nlargest() is that it doesn't give me a
> hook to do that.


If Paul's solution doesn't suffice, the heapq module has the building
blocks for a custom solution, e.g.:

import heapq
from functools import partial

class NLargest(object):
    def __init__(self, n):
        self.n = n
        self.heap = []

    def tally(self, item):
        heap = self.heap
        if len(heap) >= self.n:
            heapq.heappushpop(heap, item)
            # From now on every item goes straight through heappushpop.
            self.tally = partial(heapq.heappushpop, heap)
        else:
            heapq.heappush(heap, item)

    def __str__(self):
        return str(sorted(self.heap, reverse=True))

if __name__ == "__main__":
    import random
    items = range(100)
    random.shuffle(items)
    accu = NLargest(10)
    for item in items:
        accu.tally(item)
        print item, accu


> PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
> proportional to n, and not to the iterator length. That's the only
> reasonable conclusion, but the docs don't actually come out and say it.


I would hope so.

n00m 12-26-2010 01:56 PM

Re: Keeping track of the N largest values
 

from bisect import insort_left

K = 5
top = []
while 1:
    x = input()
    if len(top) < K:
        insort_left(top, x)
    elif x > top[0]:
        del top[0]
        insort_left(top, x)
    print top


will be enough
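
As a sanity check, the same logic wrapped in a function (a finite, made-up list in place of the input() loop) agrees with heapq.nlargest:

```python
import heapq
from bisect import insort_left

def k_largest_bisect(values, k):
    # top stays sorted ascending; top[0] is the smallest kept value.
    top = []
    for x in values:
        if len(top) < k:
            insort_left(top, x)
        elif x > top[0]:
            del top[0]
            insort_left(top, x)
    return top

data = [5, 1, 9, 3, 7, 8, 2, 6, 4, 0]
print(k_largest_bisect(data, 3))        # [7, 8, 9]
print(sorted(heapq.nlargest(3, data)))  # [7, 8, 9]
```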


Roy Smith 12-26-2010 03:12 PM

Re: Keeping track of the N largest values
 
In article
<e05e480b-8956-4984-b4cc-9a1666380eed@l32g2000yqc.googlegroups.com>,
n00m <n00m@narod.ru> wrote:

> from bisect import insort_left
>
> K = 5
> top = []
> while 1:
>     x = input()
>     if len(top) < K:
>         insort_left(top, x)
>     elif x > top[0]:
>         del top[0]
>         insort_left(top, x)
>     print top
>
>
> will be enough


Hmmm, that's an interesting idea. Thanks.

Stefan Sonnenberg-Carstens 12-26-2010 06:51 PM

Re: Keeping track of the N largest values
 
Am 25.12.2010 16:42, schrieb Roy Smith:
> I'm processing a stream of N numbers and want to keep track of the K
> largest. There's too many numbers in the stream (i.e. N is too large)
> to keep in memory at once. K is small (100 would be typical).
>
> From a theoretical point of view, I should be able to do this in N log K
> time. What I'm doing now is essentially:
>
> top = [-1]  # Assume all x are >= 0
> for x in input():
>     if x <= top[0]:
>         continue
>     top.append(x)
>     if len(top) > K:
>         top.sort()
>         top.pop(0)
>
> I can see pathological cases (say, all input values the same) where
> running time would be N K log K, but on average (N >> K and random
> distribution of values), this should be pretty close to N.
>
> Is there a better way to do this, either from a theoretical running time
> point of view, or just a nicer way to code this in Python?

Here is my version:

l = []
K = 10

while 1:
    a = input()
    if len(l) == K:
        l.remove(min(l))
    l = [x for x in l if x < a] + [a] + [x for x in l if x > a]
    print l

