Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Sampling a population

Reply
Thread Tools

Sampling a population

 
 
Brian Quinlan
Guest
Posts: n/a
 
      06-02-2006
This is less a Python question and more a optimization/probability
question. Imaging that you have a list of objects and there frequency in
a population e.g.

lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]

and you want to drawn n items from that list (duplicates allowed), with
that probability distribution.

The fastest algorithm that I have been able to devise for doing so is:
O(n * log(len(lst))). Can anyone think or a solution with a better time
complexity? If not, is there an obviously better way to do this
(assuming n is big and the list size is small).

Here is the code:

from random import random
from bisect import bisect

def draw(n, lst):
ps = []
last = 0
for p in lst:
ps.append(last + p)
last += p

# ps = [0.01, 0.06, 0.56, 0.86, 0.90, 1.00]

chosen = [0] * len(lst) # track frequency
for i in range(n):
r = random()

chosen[bisect(ps, r)] += 1 # binary search and increment

result = [] # rescale probability based on frequency
for c in chosen:
result.append(float(c) / n)
return result

lst = [0.01, 0.05, 0.50, 0.30, 0.04, 0.10]
print draw(10000, lst)

 
Reply With Quote
 
 
 
 
Duncan Smith
Guest
Posts: n/a
 
      06-02-2006
Brian Quinlan wrote:
> This is less a Python question and more a optimization/probability
> question. Imaging that you have a list of objects and there frequency in
> a population e.g.
>
> lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]
>
> and you want to drawn n items from that list (duplicates allowed), with
> that probability distribution.
>
> The fastest algorithm that I have been able to devise for doing so is:
> O(n * log(len(lst))). Can anyone think or a solution with a better time
> complexity? If not, is there an obviously better way to do this
> (assuming n is big and the list size is small).
>
> Here is the code:
>
> from random import random
> from bisect import bisect
>
> def draw(n, lst):
> ps = []
> last = 0
> for p in lst:
> ps.append(last + p)
> last += p
>
> # ps = [0.01, 0.06, 0.56, 0.86, 0.90, 1.00]
>
> chosen = [0] * len(lst) # track frequency
> for i in range(n):
> r = random()
>
> chosen[bisect(ps, r)] += 1 # binary search and increment
>
> result = [] # rescale probability based on frequency
> for c in chosen:
> result.append(float(c) / n)
> return result
>
> lst = [0.01, 0.05, 0.50, 0.30, 0.04, 0.10]
> print draw(10000, lst)
>


I would do something like the following (maybe with an additional check
that the probabilities do not sum to less than 1),

>>> from random import random
>>> import operator
>>> def draw(n, lst):

lst.sort(key=operator.itemgetter(1), reverse=True)
cumprobs = []
this_cp = 0
for p in lst:
this_cp += p[1]
cumprobs.append(this_cp)
for _ in xrange(n):
rnd = random()
for i, cumprob in enumerate(cumprobs):
if rnd < cumprob:
yield lst[i][0]
break


>>> lst = [('a', 0.01), ('b', 0.05), ('c', 0.50), ('d', 0.30), ('e',

0.04), ('f', 0.10)]
>>> list(draw(8, lst))

['d', 'd', 'c', 'e', 'c', 'd', 'c', 'f']
>>>


The sorting of the list means that iterating over the cumulative
probabilities is minimised (for your density function the inner loop
will be broken out of after the first iteration 50% of the time). (I've
assumed that it doesn't matter to you that the list is sorted.) I'm not
sure that this is in any way optimal.

Duncan
 
Reply With Quote
 
 
 
 
Mitja Trampus
Guest
Posts: n/a
 
      06-02-2006
Brian Quinlan wrote:
> The fastest algorithm that I have been able to devise for doing so is:
> O(n * log(len(lst))). Can anyone think or a solution with a better time
> complexity? If not, is there an obviously better way to do this
> (assuming n is big and the list size is small).



If list is indeed short, I'd say common sense speaks against complicating and optimizing
further - you can only get log(len(lst))-fold speedup, which is in your case more or less
a small constant. _IF_ this part of code later turns out to be a bottleneck, you might
profit more by porting it to C than searching for an O(n) solution, if it even exists.
 
Reply With Quote
 
Paddy
Guest
Posts: n/a
 
      06-02-2006
Brian Quinlan wrote:
> This is less a Python question and more a optimization/probability
> question. Imaging that you have a list of objects and there frequency in
> a population e.g.
>
> lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]
>
> and you want to drawn n items from that list (duplicates allowed), with
> that probability distribution.
>
> The fastest algorithm that I have been able to devise for doing so is:
> O(n * log(len(lst))). Can anyone think or a solution with a better time
> complexity? If not, is there an obviously better way to do this
> (assuming n is big and the list size is small).
>


Any way I tried to slice and dice it, I could not get any faster.
draw2 and draw 3 generate code on the fly. draw4 sneakily tries to
trade memory and accuracy for speed but is even slower!

First the times, then the code:

$ ./timeit.py 'from probDistribution import draw as draw; draw(10000,
[0.01, 0.05, 0.50, 0.30, 0.04, 0.10])'
100 loops, best of 3: 13.4 msec per loop

$ ./timeit.py 'from probDistribution import draw2 as draw; draw(10000,
[0.01, 0.05, 0.50, 0.30, 0.04, 0.10])'
100 loops, best of 3: 15.2 msec per loop

$ ./timeit.py 'from probDistribution import draw3 as draw; draw(10000,
[0.01, 0.05, 0.50, 0.30, 0.04, 0.10])'
100 loops, best of 3: 16.2 msec per loop

$ ./timeit.py 'from probDistribution import draw4 as draw; draw(10000,
[0.01, 0.05, 0.50, 0.30, 0.04, 0.10])'
10 loops, best of 3: 30.5 msec per loop


=== CODE probDistribution.py ===
from random import random, randrange
from bisect import bisect

def draw(n, lst):
ps = []
last = 0
for p in lst:
ps.append(last + p)
last += p

# ps = [0.01, 0.06, 0.56, 0.86, 0.90, 1.00]

chosen = [0] * len(lst) # track frequency
for i in range(n):
r = random()

chosen[bisect(ps, r)] += 1 # binary search and increment

result = [] # rescale probability based on frequency
for c in chosen:
result.append(float(c) / n)
return result


def draw2(n, lst):
""" uses dynamicc code generation of this form:
chosen = [0] * 6
for i in xrange(10000):
r = random()
if r < 0.01: chosen[0]+=1
elif r < 0.06: chosen[1] +=1
...
elif r < 0.90: chosen[4] +=1
else chosen[5]+=1
"""
assert len(lst)>1, "Corner case NOT covered"

codestr = 'chosen = [0] * %i\n' % (len(lst),)
codestr += 'for i in xrange(%i):\n r = random()\n' % (n,)

last = 0.0
lstmax = len(lst)-1
for i,p in enumerate(lst):
last += p
if i==0: codestr += ' if r < %g: chosen[%i] +=1\n' %
(last, i)
elif i==lstmax: codestr += ' else: chosen[%i] +=1\n' % (i,)
else: codestr += ' elif r < %g: chosen[%i] +=1\n' %
(last, i)

exec codestr

result = [] # rescale probability based on frequency
for c in chosen:
result.append(float(c) / n)
return result


def draw3(n, lst):
""" uses dynamicc code generation of this form:
chosen = [0] * 6
for i in xrange(10000):
r = random()
chosen[-1+ (
((r<0.01) and 1)
or ((r<0.06) and 2)
...
or ((r<0.90) and 5)
or 6
)] +=1

"""
assert len(lst)>1, "Corner case NOT covered"

codestr = 'chosen = [0] * %i\n' % (len(lst),)
codestr += 'for i in xrange(%i):\n r = random()\n' % (n,)
codestr += ' chosen[-1+ (\n'

last = 0.0
lstmax = len(lst)-1
for i,p in enumerate(lst):
last += p
if i==0: codestr += ' ((r<%g) and 1)\n' % (last)
elif i==lstmax: codestr += ' or %i\n )] +=1\n' % (i+1,)
else: codestr += ' or ((r<%g) and %i)\n' % (last,
i+1)
#return codestr
exec codestr

result = [] # rescale probability based on frequency
for c in chosen:
result.append(float(c) / n)
return result


def draw4(n, lst, precision = 0.01, maxbins=10000):
""" Memory/speed tradeoff by coarse quantizing of frequency values.

"""
assert len(lst)>1, "Corner case NOT covered"
assert 0.0 < precision < 1.0 and (1.0/precision) < maxbins

binmax = int(1.0/precision)
chosenbin = [0] * binmax
for i in xrange(n):
chosenbin[randrange(binmax)] +=1

left, right = 0, 0 # extract bin range for summation
chosen = [0] * len(lst)
last = 0.0
for i,p in enumerate(lst):
last += p
right = int(last/precision)
chosen[i] = sum( chosenbin[left:right] )
left = right

result = [] # rescale probability based on frequency
for c in chosen:
result.append(float(c) / n)
return result

=== END CODE ===

 
Reply With Quote
 
Terry Reedy
Guest
Posts: n/a
 
      06-02-2006

"Paddy" <> wrote in message
news: ups.com...
> Brian Quinlan wrote:
>> This is less a Python question and more a optimization/probability
>> question. Imaging that you have a list of objects and there frequency in
>> a population e.g.
>>
>> lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]
>>
>> and you want to drawn n items from that list (duplicates allowed), with
>> that probability distribution.


For a set of probabilities that uneven, you *might* do better to sort high
to low and do a linear search. Or check the two common cases and then do a
binary search on the rest.

tjr



 
Reply With Quote
 
Roger Miller
Guest
Posts: n/a
 
      06-02-2006
For your example, since the probabilities are all multiples of 0.01 you
could make a list of 100 elements. Set one of them to a, 5 of them to
b, 50 of them to c, etc. Then just randomly sample from the table
(which is O(1)). Of course if the probabilities can be arbitrary
floating point values then this won't work. But if you can live with
rounding to 3 or 4 digits this is probably the fastest and easiest
approach.

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
What sound format and sampling to use? Ramon F Herrera VOIP 2 07-21-2005 09:01 PM
Stratified random sampling with random_shuffle ? steflhermitte C++ 1 04-21-2005 12:54 AM
Over-Sampling ALuPin VHDL 5 03-12-2005 03:49 AM
sampling algoritm based on indexed constrains Albretch Java 0 11-29-2004 08:56 PM
I have a series of CDs that I have ripper to MP3 - there speach - so I used a low sampling rate... Marc Computer Support 22 01-22-2004 04:13 PM



Advertisments