Velocity Reviews > Tough sorting problem: or, I'm confusing myself

# Tough sorting problem: or, I'm confusing myself

david jensen
Guest
Posts: n/a

 04-09-2010
Hi all,

I'm trying to find a good way of doing the following:

Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
value n-tuple (call them "scores" for clarity later). I'm currently
storing them in a dictionary, by doing:

----
res={}
for i in itertools.combinations( range( 2**m ) , n):
res[ i ] = getValues( i ) # getValues() is computationally
expensive
----

For each (n-1)-tuple, I need to find the two numbers that have the
highest scores versus them. I know this isn't crystal clear, but
hopefully an example will help: with m=n=3:

Looking at only the (1, 3) case, assuming:
getValues( (1, 2, 3) ) == ( -200, 125, 75 ) # this contains the
highest "other" score, where 2 scores 125
getValues( (1, 3, 4) ) == ( 50, -50, 0 )
getValues( (1, 3, 5) ) == ( 25, 300, -325 )
getValues( (1, 3, 6) ) == ( -100, 0, 100 ) # this contains the
second-highest, where 6 scores 100
getValues( (1, 3, 7) ) == ( 80, -90, 10 )
getValues( (1, 3, ) == ( 10, -5, -5 )

I'd like to return ( (2, 125), (6, 100) ).

The most obvious (to me) way to do this would be not to generate the
res dictionary at the beginning, but just to go through each
combinations( range( 2**m), n-1) and try every possibility... this
will test each combination n times, however, and generating those
values is expensive. [e.g. (1,2,3)'s scores will be generated when
finding the best possibilities for (1,2), (1,3) and (2,3)]

What I'm doing now is ugly, and i think is where i'm confusing myself:

----
best2={}
for i in itertools.combinations( range( 2**m), n-1):
scorelist=[]
for j in range( 2**m ):
if j not in i:
k=list(i)
k.append(j)
k=tuple(sorted(k)) #gets the key for looking up the
scores in res
scorelist.append((j,res[k][k.index(j)]))
best2[i]=sorted(scorelist,key=lambda x: -x[1])[:2]
----

Am I missing an obviously better way?

Many thanks!

David

Raymond Hettinger
Guest
Posts: n/a

 04-11-2010
On Apr 9, 8:03*am, david jensen <(E-Mail Removed)> wrote:
> Hi all,
>
> I'm trying to find a good way of doing the following:
>
> Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
> value n-tuple (call them "scores" for clarity later). I'm currently
> storing them in a dictionary, by doing:
>
> ----
> res={}
> for i in itertools.combinations( range( 2**m ) , n):
> * * res[ i ] = getValues( i ) * *# getValues() is computationally
> expensive
> ----
>
> For each (n-1)-tuple, I need to find the two numbers that have the
> highest scores versus them. I know this isn't crystal clear, but
> hopefully an example will help: with m=n=3:
>
> Looking at only the (1, 3) case, assuming:
> getValues( (1, 2, 3) ) == ( -200, 125, 75 ) * *# this contains the
> highest "other" score, where 2 scores 125
> getValues( (1, 3, 4) ) == ( 50, -50, 0 )
> getValues( (1, 3, 5) ) == ( 25, 300, -325 )
> getValues( (1, 3, 6) ) == ( -100, 0, 100 ) * *# this contains the
> second-highest, where 6 scores 100
> getValues( (1, 3, 7) ) == ( 80, -90, 10 *)
> getValues( (1, 3, ) == ( 10, -5, -5 )
>
> I'd like to return ( (2, 125), (6, 100) ).
>
> The most obvious (to me) way to do this would be not to generate the
> res dictionary at the beginning, but just to go through each
> combinations( range( 2**m), n-1) and try every possibility... this
> will test each combination n times, however, and generating those
> values is expensive. [e.g. (1,2,3)'s scores will be generated when
> finding the best possibilities for (1,2), (1,3) and (2,3)]
>
> What I'm doing now is ugly, and i think is where i'm confusing myself:
>
> ----
> best2={}
> for i in itertools.combinations( range( 2**m), n-1):
> * * scorelist=[]
> * * for j in range( 2**m ):
> * * * * if j not in i:
> * * * * * * k=list(i)
> * * * * * * k.append(j)
> * * * * * * k=tuple(sorted(k)) * *#gets the key for looking up the
> scores in res
> * * * * * * scorelist.append((j,res[k][k.index(j)]))
> * * best2[i]=sorted(scorelist,key=lambda x: -x[1])[:2]
> ----
>
> Am I missing an obviously better way?

The overall algorithm looks about right.
The inner-loop could be tighted-up a bit.
And you could replace the outer sort with a heap.

best2 = {}
for i in itertools.combinations(range( 2**m), n-1):
scorelist = []
for j in range( 2**m ):
if j not in i:
k = tuple(sorted(i + (j,)))
scorelist.append((j, res[k][k.index(j)]))
best2[i] = heapq.nlargest(2, scorelist,
key=operator.itemgetter(1))

Raymond

Paul McGuire
Guest
Posts: n/a

 04-11-2010
On Apr 9, 10:03*am, david jensen <(E-Mail Removed)> wrote:
> Hi all,
>
> I'm trying to find a good way of doing the following:
>
> Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
> value n-tuple (call them "scores" for clarity later). I'm currently
> storing them in a dictionary, by doing:
>
> ----
> res={}
> for i in itertools.combinations( range( 2**m ) , n):
> * * res[ i ] = getValues( i ) * *# getValues() is computationally
> expensive
> ----
>
> For each (n-1)-tuple, I need to find the two numbers that have the
> highest scores versus them. I know this isn't crystal clear, but
> hopefully an example will help: with m=n=3:
>
> Looking at only the (1, 3) case, assuming:
> getValues( (1, 2, 3) ) == ( -200, 125, 75 ) * *# this contains the
> highest "other" score, where 2 scores 125
> getValues( (1, 3, 4) ) == ( 50, -50, 0 )
> getValues( (1, 3, 5) ) == ( 25, 300, -325 )
> getValues( (1, 3, 6) ) == ( -100, 0, 100 ) * *# this contains the
> second-highest, where 6 scores 100
> getValues( (1, 3, 7) ) == ( 80, -90, 10 *)
> getValues( (1, 3, ) == ( 10, -5, -5 )
>
> I'd like to return ( (2, 125), (6, 100) ).
>
> The most obvious (to me) way to do this would be not to generate the
> res dictionary at the beginning, but just to go through each
> combinations( range( 2**m), n-1) and try every possibility... this
> will test each combination n times, however, and generating those
> values is expensive. [e.g. (1,2,3)'s scores will be generated when
> finding the best possibilities for (1,2), (1,3) and (2,3)]
>

Add a memoizing decorator to getValues, so that repeated calls will do

-- Paul

david jensen
Guest
Posts: n/a

 04-13-2010
On Apr 11, 9:39*pm, Raymond Hettinger <(E-Mail Removed)> wrote:

> The overall algorithm looks about right.
> The inner-loop could be tighted-up a bit.
> And you could replace the outer sort with a heap.
>
> best2 = {}
> for i in itertools.combinations(range( 2**m), n-1):
> * * scorelist = []
> * * for j in range( 2**m ):
> * * * * if j not in i:
> * * * * * * k = tuple(sorted(i + (j,)))
> * * * * * * scorelist.append((j, res[k][k.index(j)]))
> * * best2[i] = heapq.nlargest(2, scorelist,
> key=operator.itemgetter(1))
>
> Raymond

Thanks for the ideas... I should have seen the k = tuple(sorted(i +
(j,))). I'm not sure a heap will help much, and at least to me,

Thanks for taking a look, I appreciate it!

David

david jensen
Guest
Posts: n/a

 04-13-2010
On Apr 12, 1:22*am, Paul McGuire <(E-Mail Removed)> wrote:
> On Apr 9, 10:03*am, david jensen <(E-Mail Removed)> wrote:
>
>
>
> > Hi all,

>
> > I'm trying to find a good way of doing the following:

>
> > Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
> > value n-tuple (call them "scores" for clarity later). I'm currently
> > storing them in a dictionary, by doing:

>
> > ----
> > res={}
> > for i in itertools.combinations( range( 2**m ) , n):
> > * * res[ i ] = getValues( i ) * *# getValues() is computationally
> > expensive
> > ----

>
> > For each (n-1)-tuple, I need to find the two numbers that have the
> > highest scores versus them. I know this isn't crystal clear, but
> > hopefully an example will help: with m=n=3:

>
> > Looking at only the (1, 3) case, assuming:
> > getValues( (1, 2, 3) ) == ( -200, 125, 75 ) * *# this contains the
> > highest "other" score, where 2 scores 125
> > getValues( (1, 3, 4) ) == ( 50, -50, 0 )
> > getValues( (1, 3, 5) ) == ( 25, 300, -325 )
> > getValues( (1, 3, 6) ) == ( -100, 0, 100 ) * *# this contains the
> > second-highest, where 6 scores 100
> > getValues( (1, 3, 7) ) == ( 80, -90, 10 *)
> > getValues( (1, 3, ) == ( 10, -5, -5 )

>
> > I'd like to return ( (2, 125), (6, 100) ).

>
> > The most obvious (to me) way to do this would be not to generate the
> > res dictionary at the beginning, but just to go through each
> > combinations( range( 2**m), n-1) and try every possibility... this
> > will test each combination n times, however, and generating those
> > values is expensive. [e.g. (1,2,3)'s scores will be generated when
> > finding the best possibilities for (1,2), (1,3) and (2,3)]

>
> Add a memoizing decorator to getValues, so that repeated calls will do
> lookups instead of repeated calculations.
>
> -- Paul

Thanks for the idea... I'd thought of that, but didn't use it because
I don't think it improves complexity or readability: just makes it a
little more intuitive.

David

Paul Rubin
Guest
Posts: n/a

 04-14-2010
Raymond Hettinger <(E-Mail Removed)> writes:
> Not sure what the readability issue is. The phrase "nlargest(2,
> iterable)" does exactly what it says, finds the 2 largest elements
> from an iterable. That makes the programmer's intent more clear than
> the slower, but semanticly equivalent form: sorted(iterable)[:2].

I think you meant

sorted(iterable, reverse=True)[:2]

Raymond Hettinger
Guest
Posts: n/a

 04-14-2010
> I'm not sure a heap will help much, and at least to me,

nlargest() should save quite a few comparisons and run much faster
than sorted().

Not sure what the readability issue is. The phrase "nlargest(2,
iterable)" does exactly what it says, finds the 2 largest elements
from an iterable. That makes the programmer's intent more clear than
the slower, but semanticly equivalent form: sorted(iterable)[:2].

The running time for your algorithm can be very long, depending on the
inputs. Do you need an exact answer or can you approximate it with
sampling random combinations (i.e. choose the best two scores out of
10000 samples).

Raymond

Raymond Hettinger
Guest
Posts: n/a

 04-14-2010
> > Not sure what the readability issue is. *The phrase "nlargest(2,
> > iterable)" does exactly what it says, finds the 2 largest elements
> > from an iterable. *That makes the programmer's intent more clear than
> > the slower, but semanticly equivalent form: *sorted(iterable)[:2].

>
> I think you meant
>
> * *sorted(iterable, reverse=True)[:2]

Raymond

david jensen
Guest
Posts: n/a

 04-15-2010
On Apr 14, 6:06*pm, Raymond Hettinger <(E-Mail Removed)> wrote:
> > I'm not sure a heap will help much, and at least to me,

>
> nlargest() should save quite a few comparisons and run much faster
> than sorted().
>
> Not sure what the readability issue is. *The phrase "nlargest(2,
> iterable)" does exactly what it says, finds the 2 largest elements
> from an iterable. *That makes the programmer's intent more clear than
> the slower, but semanticly equivalent form: *sorted(iterable)[:2].
>
> The running time for your algorithm can be very long, depending on the
> inputs. *Do you need an exact answer or can you approximate it with
> sampling random combinations (i.e. choose the best two scores out of
> 10000 samples).
>
> Raymond

You are of course correct. I was running on lack of sleep. Thanks
again for having taken the time to help!
I need an exact answer, but don't anticipate 2**m or n to be huge so
even with e.g. m=10 and n=7 it should still be feasible. I can see how
sampling would become necessary if m is large though. Much larger, and
i'd have to think about optimizing the getValues function first
though.

thank you,
David

Lie Ryan
Guest
Posts: n/a

 04-16-2010
On 04/15/10 02:03, Paul Rubin wrote:
> Raymond Hettinger <(E-Mail Removed)> writes:
>> Not sure what the readability issue is. The phrase "nlargest(2,
>> iterable)" does exactly what it says, finds the 2 largest elements
>> from an iterable. That makes the programmer's intent more clear than
>> the slower, but semanticly equivalent form: sorted(iterable)[:2].

>
> I think you meant
>
> sorted(iterable, reverse=True)[:2]

or sorted(iterable)[-2:]