Velocity Reviews > Feature request: sorting a list slice

# Feature request: sorting a list slice

George Sakkis
Guest
Posts: n/a

 05-18-2006
It would be useful if list.sort() accepted two more optional
parameters, start and stop, so that you can sort a slice in place. In
other words,

x = range(1000000)
x.sort(start=3, stop=-1)

would be equivalent to

x[3:-1] = sorted(x[3:-1])

but more efficient and without repeating the slice twice. If it's not
too late for feature additions, I'd love to see this in 2.5.

George

John Machin
Guest
Posts: n/a

 05-18-2006
Use case?

Fredrik Lundh
Guest
Posts: n/a

 05-18-2006
George Sakkis wrote:

> It would be useful if list.sort() accepted two more optional
> parameters

useful for what? what's the use case ?

</F>

Raymond Hettinger
Guest
Posts: n/a

 05-18-2006
George Sakkis wrote:
> It would be useful if list.sort() accepted two more optional
> parameters, start and stop, so that you can sort a slice in place. In
> other words,
>
> x = range(1000000)
> x.sort(start=3, stop=-1)
>
> would be equivalent to
>
> x[3:-1] = sorted(x[3:-1])
>
> but more efficient and without repeating the slice twice. If it's not
> too late for feature additions, I'd love to see this in 2.5.

This is a false optimization. The slicing steps are O(n) and the sort
step is O(n log n) unless the data has some internal structure that
Timsort can use to get closer to O(n).

If you implemented this and timed in it real apps, I would be surprised
to find that the slice extraction and assignments were the performance
bottleneck. IMO, a worthwhile performance gain is unlikely.

Raymond

George Sakkis
Guest
Posts: n/a

 05-18-2006
Fredrik Lundh wrote:

> George Sakkis wrote:
>
>> It would be useful if list.sort() accepted two more optional
>> parameters

>
>
> useful for what? what's the use case ?
>
> </F>
>

Although there is a use case (see below), I don't think it's strictly
necessary in this case. Arguments to add it:
- Don't Repeat Yourself.
- It doesn't introduce new syntax, builtin or standard library
function; only two optional keywords in an existing method.
- These keywords are already used in several list and string methods
(index,find,..) and their meaning is instantly obvious.
- Increased (even if small) efficiency. Yes, I know that in practice,
and especially in apps that involve I/O, network, DB connections and so
on, it's unlikely to be the bottleneck but this isn't a general
argument for being sloppy. The fact that having a "for i in
xrange(100000)ass" at the top of every file costs me only a few msec
is not a reason to keep it there.

IMO the only reason not to add it would be if it makes sort's
implementation much more complicate that it already is. I don't know
Timsort's details, but I doubt that this is the case; please correct me
if I am wrong.

And here's (a cut-down version of) the use case: a recursive
generalization of groupby. On each recursive call, the argument list is
sorted and regrouped. If sort() accepted a (start,end) range I could
pass this over instead of creating a slice of the input each time. It's
not hard to see that there can be an exponential number of calls wrt to
the input's length, so cutting down the average cost of each call may
be worthwhile for medium-largish sized inputs (of course it's still
exponential asymptotically, so the constant factor does not make much
difference for really large inputs).

import itertools as it

def groupBy(iterable, *keys):
return _groupBy(list(iterable), keys, len(keys))

def _groupBy(lst, keys, depth):
if depth == 0:
return lst
key = keys[-depth]
# sort group before subgrouping
lst.sort(key=key)
# group by key and then recursively do the same for each subgroup
return [(k, _groupBy(list(subgroup),keys,depth-1))
for k,subgroup in it.groupby(lst,key)]

#==== example ========================================

class Struct(object):
def __init__(self, args):
self.args = args
def __getitem__(self,index):
return self.args[index]

data = map(Struct, [
('a', 2, True),
('b', 1, False),
('a', 1, True),
('a', 1, False),
('b', 2, True),
('b', 2, True),
('a', 2, True),
])

from pprint import pprint
from operator import itemgetter
pprint(groupBy(data, itemgetter(2), itemgetter(0), itemgetter(1)))

#======= output =================================

[(False,
[('a', [(1, [<__main__.Struct object at 0xb7dc128c>])]),
('b', [(1, [<__main__.Struct object at 0xb7dc124c>])])]),
(True,
[('a',
[(1, [<__main__.Struct object at 0xb7dc126c>]),
(2,
[<__main__.Struct object at 0xb7dc122c>,
<__main__.Struct object at 0xb7dc12ec>])]),
('b',
[(2,
[<__main__.Struct object at 0xb7dc12ac>,
<__main__.Struct object at 0xb7dc12cc>])])])]

George

Heiko Wundram
Guest
Posts: n/a

 05-18-2006
Am Donnerstag 18 Mai 2006 22:13 schrieb Raymond Hettinger:
> This is a false optimization. The slicing steps are O(n) and the sort
> step is O(n log n) unless the data has some internal structure that
> Timsort can use to get closer to O(n).
>
> If you implemented this and timed in it real apps, I would be surprised
> to find that the slice extraction and assignments were the performance
> bottleneck. IMO, a worthwhile performance gain is unlikely.

I personally wouldn't find this to be a false optimization, at least not
memory-wise. If you sort really large lists, and only want to sort part of
the list, the memory gains of not having to do a slice should be enormous,
and there should be some time-gains too. And, additionally, implementing this
(if I understand timsort correctly, just had a look at the sources) isn't
hard.

Rather, I'd pose the usage question: why are you only sorting a part of a
list? lists are basically meant for homogenous data. And sorting only a part
of that should be a pretty meaningless operation, mostly...

Anyway, I've just written up a patch to extend sort() with a start/stop
parameter (which behaves just like the default slice notation). Generally,
this will make sort() setup slightly slower in the "standard" case (because
there's some more pointer arithmetic involved, but this should be negligible,
and is O(1)), but for the actual use case, the following numbers can be seen:

modelnine@phoenix ~/mercurial/python-modelnine \$ ./python test.py
New average time: 13.7254650593 ms per loop
Old average time: 14.839854002 ms per loop
[10198 refs]
modelnine@phoenix ~/mercurial/python-modelnine \$

This is just a very, very simple test (testing the case of a list with one
million entries, where 99000 are sorted):

>>>

from random import randrange
from time import time

x = [randrange(256) for x in xrange(1000000)]
timesnew, timesold = [], []

for reps in range(1000):
y = x[:]
timesnew.append(time())
y.sort(start=1000,stop=10000)
timesnew[-1] = time() - timesnew[-1]

for reps in range(1000):
y = x[:]
timesold.append(time())
y[1000:10000] = sorted(y[1000:10000])
timesold[-1] = time() - timesold[-1]

print "New average time:", sum(timesnew), "ms per loop"
print "Old average time:", sum(timesold), "ms per loop"
>>>

I'll post the patch to the SF bugtracker tomorrow, it's too late today for me
to review it, but generally, I wouldn't find it to be a bad addition, if
there's actually a general use case to only sort parts of a list.

--- Heiko.

Harold Fellermann
Guest
Posts: n/a

 05-19-2006
Fredrik Lundh wrote:
> George Sakkis wrote:
>
> > It would be useful if list.sort() accepted two more optional
> > parameters

+1

> useful for what? what's the use case ?

Actually, I was in need of such a construct only few weeks ago. The
of an mp3 player. I wanted to sort future entries but not past ones
(according to an index in the
list that seperates past from future). I can imagine many more
situations in which sorting part of
a list is usefull.

While I agree that the improvement of the additional keywords is minor,
load they impose on the sort function is also minor. In my oppinion,
this is a straight-forward
extension of what we already find in other places in the library. I
would like to see it in a future version.

- harold -

Heiko Wundram
Guest
Posts: n/a

 05-19-2006
Am Donnerstag 18 Mai 2006 19:27 schrieb George Sakkis:
> It would be useful if list.sort() accepted two more optional
> parameters, start and stop, so that you can sort a slice in place.

I've just submitted:

http://sourceforge.net/tracker/index...70&atid=305470

to the bugtracker, which extends the (start, stop) keyword arguments to
list.reverse() (which I've needed more than once). The patch updates the test
suite, documentation, list object, and sorted() builtin to accept (or
specify) the new arguments.

Any comment/feedback would be appreciated.

--- Heiko.

George Sakkis
Guest
Posts: n/a

 05-19-2006
Heiko Wundram wrote:

> Am Donnerstag 18 Mai 2006 19:27 schrieb George Sakkis:
> > It would be useful if list.sort() accepted two more optional
> > parameters, start and stop, so that you can sort a slice in place.

>
> I've just submitted:
>
> http://sourceforge.net/tracker/index...70&atid=305470
>
> to the bugtracker, which extends the (start, stop) keyword arguments to
> list.reverse() (which I've needed more than once). The patch updates the test
> suite, documentation, list object, and sorted() builtin to accept (or
> specify) the new arguments.
>
> Any comment/feedback would be appreciated.
>
> --- Heiko.

This is great, thanks Heiko ! Any idea on the chances of being
considered for inclusion in 2.5 ?

George

Heiko Wundram
Guest
Posts: n/a

 05-19-2006
Am Freitag 19 Mai 2006 23:24 schrieb George Sakkis:
> This is great, thanks Heiko ! Any idea on the chances of being
> considered for inclusion in 2.5 ?

Don't ask me, I'm not one of the core developers... But, anyway, the
people on python-dev are doing their best to review patches. Just: I rather
write them, than review them...

--- Heiko.