Velocity Reviews > Newsgroups > Programming > Python > Re: Performance of list.index - how to speed up a silly algorithm?
Re: Performance of list.index - how to speed up a silly algorithm?

 
 
MRAB
04-29-2010
Laszlo Nagy wrote:
> I have some ten thousand rows in a table. E.g.
>
> columns = ["color","size","weight","value"]
> rows = [
>     [ "Yellow", "Big", 2, 4 ],
>     [ "Blue", "Big", 3, -4 ],
>     [ "Blue", "Small", 10, 55 ],
>     ...
> ]
>
> Some columns are dimensions, others are measures. I want to convert this
> to an indexed version, that looks like this:
>
> dimension_names = ["color","size"] # List of dimension names
> dimension_cols = [0,1] # Column indexes for dimension values
> dimension_values = { # Dimension value occurrences for each dimension
>     0: ["Yellow","Blue",....],
>     1: ["Big","Small",...],
> }
> measure_names = ["weight","value"] # List of measure names
> measure_cols = [2,3] # List of measure columns
> facts = [ # Facts: tuples of (dimension_value_indices, measure_values)
>     ( (0,0) , (2,4) ),
>     ( (1,0) , (3,-4) ),
>     ( (1,1) , (10,55) ),
>     ...
> ]
>
> This is how I try to convert to the indexed version:
>
> #1. Iterate over all rows, and collect possible dimension values.
>
> cnt = 0
> for row in iterator_factory():
>     cnt += 1
>     for dimension_idx, col_idx in enumerate(dimension_cols):
>         dimension_values[col_idx].append(row[col_idx])
>     if cnt % 10000 == 0:
>         for col_idx in dimension_cols:
>             dimension_values[col_idx] = list(set(dimension_values[col_idx]))
>
> #2. Index facts by dimension values
>
> facts = []
> for row in iterator_factory():
>     dv = []
>     for dimension_idx, col_idx in enumerate(dimension_cols):
>         dv.append(dimension_values[col_idx].index(row[col_idx]))  # THIS IS THE PROBLEMATIC LINE!
>     mv = [row[col_idx] for col_idx in measure_cols]
>     facts.append((dv, mv))
>
> (In reality, rows and facts are not stored in memory, because there can
> be 100 000+ facts. I did not want to bore you with the full source code.)
>
> And finally, here is my problem. If a dimension has many possible
> values, then the list.index() code above slows down everything. In my
> test case, the problematic dimension had about 36 000 different values,
> and the number of rows was about 60 000. Calling index() on a list of 36
> 000 values, times 60 000, is too slow.
>
> Test performance was 262 rows/sec. If I use dv.append(0) instead of "
> dv.append( dimension_values[col_idx].index(row[col_idx]) ) " then it
> goes up to 11520 rows/sec. If I exclude the problematic dimension, and
> only leave the others (with 1000 or less values) then goes up to 3000
> rows/sec.
>
> Maybe this should be implemented in C. But I believe that the algorithm
> itself must be wrong (regardless of the language). I really think that
> I'm doing something wrong. It looks like my algorithm's processing time
> is not linear in the number of rows - not even O(n log n). There should
> be a more effective way to do this. But how?
>
> I had the idea of sorting the rows by a given dimension. Then it would
> be obvious to speed up the indexing part - for that dimension. PROBABLY
> sorting all rows would be faster than calling list.index() for each row.
> But... it does not seem very efficient either. What if I have 1 million
> rows and 10 dimensions? Do I sort 1 million rows on the disk 10 times?
> Some of you might have run into the same problem before, and can tell me
> which is the most efficient way to do this.
>

The .index method does a linear search, checking on average 1/2 of the
items in the list. That's why it's so slow.
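To put rough numbers on that, using the figures from the original post, a back-of-the-envelope count of equality comparisons for the 36 000-value dimension alone:

```python
# Average-case work done by list.index() over the whole run:
# each of the 60 000 rows scans about half of the 36 000-item list.
rows = 60000
distinct_values = 36000
avg_comparisons = rows * distinct_values // 2
# on the order of a billion comparisons, for this single dimension
```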

In order to avoid that you could build a dict mapping each value in
dimension_values[col_idx] to its index, in a single pass, so that each
lookup becomes a quick O(1) operation.
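A minimal sketch of that idea (the names here are illustrative, not from the original code): invert each dimension's value list into a dict once, then indexing a row is a constant-time dict access instead of a linear scan.

```python
# One-off inversion per dimension column: value -> index.
dimension_values = {
    0: ["Yellow", "Blue"],
    1: ["Big", "Small"],
}
value_to_index = {
    col: dict((value, i) for i, value in enumerate(values))
    for col, values in dimension_values.items()
}

# Indexing a row is now O(1) per dimension instead of O(n).
row = ["Blue", "Small", 10, 55]
dimension_cols = [0, 1]
dv = [value_to_index[col][row[col]] for col in dimension_cols]
# dv == [1, 1]
```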
 
Peter Otten
04-30-2010
MRAB wrote:

> The .index method does a linear search, checking on average 1/2 of the
> items in the list. That's why it's so slow.
>
> In order to avoid that you could build a dict of each value in
> dimension_values[col_idx] and its index in a single pass so that it
> becomes a quick lookup.


For example:

from itertools import count, izip

def iterator_factory():
    # columns = ["color", "size", "weight", "value"]
    rows = [
        ["Yellow", "Big", 2, 4],
        ["Blue", "Big", 3, -4],
        ["Blue", "Small", 10, 55],
        #...
    ]
    return iter(rows)

class Indexer(object):
    def __init__(self):
        self.lookup = {}
        self.counter = count()

    def __call__(self, value):
        try:
            return self.lookup[value]
        except KeyError:
            result = self.lookup[value] = next(self.counter)
            return result

    def to_list(self):
        d = self.lookup
        reverse = dict(izip(d.itervalues(), d.iterkeys()))
        assert len(reverse) == len(d)
        return [reverse[i] for i in xrange(len(reverse))]

def pairs(rows, dimension_cols, measure_cols, indexers):
    for row in rows:
        dv = [indexer(row[col])
              for indexer, col in izip(indexers, dimension_cols)]
        mv = [row[col] for col in measure_cols]
        yield dv, mv

def main():
    # dimension_names = ["color", "size"]
    dimension_cols = [0, 1]

    # measure_names = ["weight", "value"]
    measure_cols = [2, 3]

    indexers = [Indexer() for _ in dimension_cols]

    facts = pairs(iterator_factory(),
                  dimension_cols, measure_cols, indexers)
    facts = list(facts)

    print facts
    for i, indexer in enumerate(indexers):
        print "%d: %s" % (i, indexer.to_list())

if __name__ == "__main__":
    main()
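For readers on Python 3: the code above is Python 2 (izip, itervalues/iterkeys, print statements). A sketch of the same Indexer idea in Python 3 syntax:

```python
from itertools import count

class Indexer:
    """Assigns a stable integer index to each distinct value seen."""
    def __init__(self):
        self.lookup = {}
        self.counter = count()

    def __call__(self, value):
        try:
            return self.lookup[value]
        except KeyError:
            result = self.lookup[value] = next(self.counter)
            return result

    def to_list(self):
        # Invert the mapping (index -> value) and emit values in index order.
        reverse = {i: v for v, i in self.lookup.items()}
        return [reverse[i] for i in range(len(reverse))]

ix = Indexer()
codes = [ix(v) for v in ["Yellow", "Blue", "Blue"]]
# codes == [0, 1, 1]; ix.to_list() == ["Yellow", "Blue"]
```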

 
Laszlo Nagy
04-30-2010

>> The .index method does a linear search, checking on average 1/2 of the
>> items in the list. That's why it's so slow.
>>
>> In order to avoid that you could build a dict of each value in
>> dimension_values[col_idx] and its index in a single pass so that it
>> becomes a quick lookup.
>>

>
> For example:
>
> from itertools import count, izip
>
>

....

Whew! Thank you for taking the time. I'm not very familiar with
itertools yet, so I need some time to understand this *beautiful* code.

L

 
