Velocity Reviews > RE: outline-style sorting algorithm

# RE: outline-style sorting algorithm

Delaney, Timothy C (Timothy)
Guest
Posts: n/a

 04-29-2004
Thorsten Kampe wrote:

>>>>> foo

>> ['1.0', '1.0.1', '1.1.1', '1.2', '1.9', '1.10', '1.11', '1.20',
>> '1.20.1', '1.30']
>>>>> foo.sort()
>>>>> foo

>> ['1.0', '1.0.1', '1.1.1', '1.10', '1.11', '1.2', '1.20', '1.20.1',
>> '1.30', '1.9']
>>
>> Obviously 1.20.1 should be after 1.9 if we look at this as
>> dot-delimited integers, not as decimal numbers.

>
> You need some general approach to avoid the DSU thing:
>
> def funcsort(seq, func):
> """ sort seq by func(item) """
> seq = seq[:]
> seq.sort(lambda x, y: cmp(func(x), func(y)))
> return seq
>
> funcsort(foo, lambda x: map(int, x.split('.')))

I've seen you give this advice several times, today, and IMO it's
completely wrong.

DSU is *exactly* what you want to do. If you want to wrap it inside a
general function, that's fine, but DSU is in almost all cases preferred
to passing a comparison function - it's much faster.

def sorted (seq, cmp=None, key=None):
""" sort seq by func(item) """
if key is None:
seq = seq[:]
else:
seq = [(key(e), i, e) for i, e in enumerate(seq)]

seq.sort(cmp)

if key is None:
return seq
else:
return [i[-1] for i in seq]

foo = ['1.0', '1.0.1', '1.1.1', '1.2', '1.9', '1.10', '1.11',
'1.20', '1.20.1', '1.30']

foo = sorted(foo, key=lambda x: map(int, x.split('.')))
print foo

Note that Python 2.4 will have DSU as a built-in idiom to `list.sort`
i.e. `list.sort(key=func)` but will be somewhat more efficient than the
above. Likewise there will be a new builtin `sorted` which has exactly
the same semantics as the above - it is stable, and it does not ever
compare the actual value if a key function is supplied - only the key
value (the above also compares the position, but that's an
implementation detail and has no visible effect).

Tim Delaney

Thorsten Kampe
Guest
Posts: n/a

 04-29-2004
* Delaney, Timothy C (Timothy) (2004-04-29 02:52 +0100)
> Thorsten Kampe wrote:
>>>>>> foo
>>> ['1.0', '1.0.1', '1.1.1', '1.2', '1.9', '1.10', '1.11', '1.20',
>>> '1.20.1', '1.30']
>>>>>> foo.sort()
>>>>>> foo
>>> ['1.0', '1.0.1', '1.1.1', '1.10', '1.11', '1.2', '1.20', '1.20.1',
>>> '1.30', '1.9']
>>>
>>> Obviously 1.20.1 should be after 1.9 if we look at this as
>>> dot-delimited integers, not as decimal numbers.

>>
>> You need some general approach to avoid the DSU thing:
>>
>> def funcsort(seq, func):
>> """ sort seq by func(item) """
>> seq = seq[:]
>> seq.sort(lambda x, y: cmp(func(x), func(y)))
>> return seq
>>
>> funcsort(foo, lambda x: map(int, x.split('.')))

>
> I've seen you give this advice several times, today, and IMO it's
> completely wrong.
>
> DSU is *exactly* what you want to do. If you want to wrap it inside a
> general function, that's fine, but DSU is in almost all cases preferred
> to passing a comparison function - it's much faster.

My advice was to use a more general approach like 'funcsort' to avoid
reiventing the wheel for every problem. DSU is okay for funcsort.

> def sorted (seq, cmp=None, key=None):

What is 'cmp=None' for?

> """ sort seq by func(item) """
> if key is None:
> seq = seq[:]
> else:
> seq = [(key(e), i, e) for i, e in enumerate(seq)]
>
> seq.sort(cmp)

Are you passing a second comparison function? And if, isn't that the
same as I did and to which you said "preferred to passing a comparison
function"? Are you shadowing the builtin cmp function?

> if key is None:
> return seq
> else:
> return [i[-1] for i in seq]

What is 'key=None' for? Is that for speeding up if someone wants to
pass the indentity function (f(x)=x; lambda x: x)?

> Note that Python 2.4 will have DSU as a built-in idiom to `list.sort`
> i.e. `list.sort(key=func)` but will be somewhat more efficient than the
> above. Likewise there will be a new builtin `sorted` which has exactly
> the same semantics as the above - it is stable, and it does not ever
> compare the actual value if a key function is supplied - only the key
> value (the above also compares the position, but that's an
> implementation detail and has no visible effect).

If it has no visible effects than it would be useless. In my opinion
it has the effect that two items that have the same func(x) stay in
the same position as before. Is that desirable? Was that your goal?

>>> sorted([4, 2], None, lambda x: x % 2)

[4, 2], but [2, 4] if the index was omitted

Thorsten

Thorsten Kampe
Guest
Posts: n/a

 04-29-2004
* Delaney, Timothy C (Timothy) (2004-04-29 02:52 +0100)
> Thorsten Kampe wrote:
>>>>>> foo
>>> ['1.0', '1.0.1', '1.1.1', '1.2', '1.9', '1.10', '1.11', '1.20',
>>> '1.20.1', '1.30']
>>>>>> foo.sort()
>>>>>> foo
>>> ['1.0', '1.0.1', '1.1.1', '1.10', '1.11', '1.2', '1.20', '1.20.1',
>>> '1.30', '1.9']
>>>
>>> Obviously 1.20.1 should be after 1.9 if we look at this as
>>> dot-delimited integers, not as decimal numbers.

>>
>> You need some general approach to avoid the DSU thing:
>>
>> def funcsort(seq, func):
>> """ sort seq by func(item) """
>> seq = seq[:]
>> seq.sort(lambda x, y: cmp(func(x), func(y)))
>> return seq
>>
>> funcsort(foo, lambda x: map(int, x.split('.')))

>
> I've seen you give this advice several times, today, and IMO it's
> completely wrong.
>
> DSU is *exactly* what you want to do. If you want to wrap it inside a
> general function, that's fine, but DSU is in almost all cases preferred
> to passing a comparison function - it's much faster.

If made some tests with lists of one million items:

The first test was with a randomized sequence of numbers from 1 -
1000000. Duplicate numbers were allowed. The comparing function was
the identity (lambda x: x).

You DSU approach took about 43 s and mine took about 86 s.
Than I tested range(1000000) in reverse order:

So you cannot simply state that DSU is faster than passing a function
if the structure of the input you want to sort is unknown...

Thorsten

Anton Vredegoor
Guest
Posts: n/a

 04-29-2004
"Delaney, Timothy C (Timothy)" <(E-Mail Removed)> wrote:

> foo = sorted(foo, key=lambda x: map(int, x.split('.')))
> print foo

as possible and still provide minimal functionality. For example
dropping the "integer" assumption:

def _cmp(x,y):
L,R = x.split('.'),y.split('.')
for a,b in zip(L,R):
r = cmp(len(a),len(b)) or cmp(a,b)
if r:
return r
return cmp(len(L),len(R))

def test():
L = ['1.0', '1.0.1', '1.1.1', '1.2', '1.10',
'1.11', '1.20', '1.20.1', '1.30', '1.9']
L.sort(_cmp)
print L

if __name__=='__main__':
test()

Anton

Terry Reedy
Guest
Posts: n/a

 04-29-2004

"Thorsten Kampe" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
> If made some tests with lists of one million items:
>
> The first test was with a randomized sequence of numbers from 1 -
> 1000000. Duplicate numbers were allowed. The comparing function was
> the identity (lambda x: x).

I do not understand what you did, and therefore cannot judge results. A
compare function takes two args (the objects to be compared) and
returns -1, 0, 1. (I believe 0, 1 for <= or >t may be sufficient
currently). How did you use the one-param identity?

Terry J. Reedy

Thorsten Kampe
Guest
Posts: n/a

 04-29-2004
* Terry Reedy (2004-04-29 20:03 +0100)
> "Thorsten Kampe" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed)...
>> If made some tests with lists of one million items:
>>
>> The first test was with a randomized sequence of numbers from 1 -
>> 1000000. Duplicate numbers were allowed. The comparing function was
>> the identity (lambda x: x).

>
> I do not understand what you did, and therefore cannot judge results. A
> compare function takes two args (the objects to be compared) and
> returns -1, 0, 1. (I believe 0, 1 for <= or >t may be sufficient
> currently).

That's the "traditional" approach (and it doesn't make much sense -
see the "Searching and Sorting FAQ" in the Python Cookbook from Tim
Peters: "Why does Python use the three out-come cmp for sorting? Why
doesn't it use a simple less-than comparison instead?". It better to
abstract the function from the comparison.

> How did you use the one-param identity?

I assembled my "pass function" approach and TCD's DSU approach into a
single program[1]

Then I generated two lists[2]:
>>> a = range(1000000); a.reverse()

and
>>> b = randseq(1, 1000000, 1000000, 'repeat')

(meaning: create a list with one million integers 1 <= n <= 1000000;
allow duplicate numbers)

Then I timed [3]
timer(funcsort, a, lambda x: x, 'dsu')) -> 43 sec
timer(funcsort, a, lambda x: x, 'pass')) -> 86 sec
timer(funcsort, b, lambda x: x, 'dsu')) -> 24 sec
timer(funcsort, b, lambda x: x, 'pass')) -> 5 sec

[1]
def funcsort(seq, func, modus = 'dsu'):
""" sort seq by func(item) """
if func is None:
seq = seq[:]
seq.sort()
return seq
elif modus == 'dsu':
seq = [(func(item), index, item) for index, item in enumerate(seq)]
seq.sort()
return [item[2] for item in seq]
elif modus == 'pass':
seq = seq[:]
seq.sort(lambda x, y: cmp(func(x), func(y)))
return seq
[2]
def randseq(range_start, range_end = 0, count = 0, modus = 'norepeat'):
from random import randint

if modus == 'repeat':
return [randint(range_start, range_end) for index in range(count)]

elif isinstance(range_start, int):
return randseq(range(range_start, range_end + 1))[:count]

else: # 'range_start' is a seq, so shuffle
range_start = range_start[:]
randomseq = []

for i in range(len(range_start)):
itemindex = randint(0, len(range_start) - 1)
randomseq.append(range_start[itemindex])
del range_start[itemindex]
return randomseq
[3]
def timer(iteration, *func_and_args):
"""
print the time elapsed (in seconds) evaluating func iteration
times
(default is '1') """
import gc, \
time
from colour import ltyellow, \
reset_color

if isinstance(iteration, int):
func, args = func_and_args[0], func_and_args[1:]
else:
# if first argument is not a number, set func to iteration and iteration
# to '1'
iteration, func, args = 1, iteration, func_and_args

iteration = range(iteration)
gc.collect() # force garbage collection
start_time_cpu = time.clock()
start_time_total = time.time()

for i in iteration:
func(*args)
print ltyellow, 'cpu: %(cpu).3f,' % {'cpu': time.clock() - start_time_cpu}, \
'total: %(total).3f' % {'total': time.time() - start_time_total}, reset_color