Velocity Reviews > Keeping track of the N largest values

# Keeping track of the N largest values

Stefan Sonnenberg-Carstens
Guest
Posts: n/a

 12-27-2010
Am 26.12.2010 19:51, schrieb Stefan Sonnenberg-Carstens:
> l = []
> K = 10
>
> while 1:
> a = input()
> if len(l) == K:
> l.remove(min(l))
> l=[x for x in l if x < a] + [a] + [x for x in l if x > a]
> print l

A minor fault made it into my prog:

l = [0]
K = 10

while 1:
a = input()
l=[x for x in l if x < a] + [a] + [x for x in l if x > a]
if len(l) == K:
l.remove(min(l))
print l

Dan Stromberg
Guest
Posts: n/a

 12-31-2010
On Sat, Dec 25, 2010 at 7:42 AM, Roy Smith <(E-Mail Removed)> wrote:
> I'm processing a stream of N numbers and want to keep track of the K
> largest. *There's too many numbers in the stream (i.e. N is too large)
> to keep in memory at once. *K is small (100 would be typical).
>
> >From a theoretical point of view, I should be able to do this in N log K

> time. *What I'm doing now is essentially:
>
> top = [-1] * *# Assume all x are >= 0
> for x in input():
> * *if x <= top[0]:
> * * * *continue
> * *top.append(x)
> * *if len(top) > K:
> * * * *top.sort()
> * * * *top.pop(0)
>
> I can see pathological cases (say, all input values the same) where
> running time would be N K log K, but on average (N >> K and random
> distribution of values), this should be pretty close to N.
>
> Is there a better way to do this, either from a theoretical running time
> point of view, or just a nicer way to code this in Python?
> --
> http://mail.python.org/mailman/listinfo/python-list
>

heapq is probably fine, but I've empirically found a treap faster than
a heap under some conditions:

http://stromberg.dnsalias.org/~strombrg/treap/
http://stromberg.dnsalias.org/~strombrg/highest/