On Apr 29, 5:21*pm, Laszlo Nagy <(E-Mail Removed)> wrote:

> I have some ten thousand rows in a table. E.g.

>

> columns = ["color","size","weight","value"]

> rows = [

> * * *[ "Yellow", "Big", 2, 4 ],

> * * *[ "Blue", "Big", 3, -4 ],

> * * *[ "Blue", "Small", 10, 55 ],

> * * *...

> ]

>

> Some columns are dimensions, others are measures. I want to convert this

> to an indexed version, that looks like this:

>

> dimension_names = ["color","size"] # List of dimension names

> dimension_cols = [0,1] # Column indexes for dimension values

> dimension_values = { # Dimension value occurences for each dimension

> * * *0: ["Yellow","Blue",....],

> * * *1: ["Big","Small",...],}

>

> measure_names = ["weight","value"] # List of measure names

> measure_cols = [2,3] # List of measure columns

> facts = [ # Facts, containing tuples of (dimension_value_incides,

> measure_values )

> * * *( (0,0) , (2,4) ),

> * * *( (1,0) , (3,-4) ),

> * * *( (1,1) , (10,55) ),

> * * *...

> ]

>

> This is how I try to convert to the indexed version:

>

> #1. Iterate over all rows, and collect possible dimension values.

>

> cnt = 0

> for row in iterator_factory():

> * * *cnt += 1

> * * *for dimension_idx,col_idx in enumerate(dimension_cols):

> * * * * *dimension_values[colidx].append(row[cold_idx])

> * * * * *if cnt%10000:

> * * * * * * *dimension_values[colidx] = list(set(dimension_values[colidx]))

>

> #2. Index facts by dimension values

>

> facts = []

> for row in iterator_factory():

> * * *dv = []

> * * *for dimension_idx,col_idx in enumerate(dimension_cols):

> * * * * *dv.append( dimension_values[col_idx].index(row[col_idx]) ) #

> THIS IS THE PROBLEMATIC LINE!

> * * *mv = [ row[col_idx] for col_idx in measure_cols ]

> * * *facts.append( dv,mv )

>

> (In reality, rows and facts are not stored in memory, because there can

> be 100 000+ facts. I did not want to bore you with the full source code.)

>

> And finally, here is my problem. If a dimension has many possible

> values, then the list.index() code above slows down eveything. In my

> test case, the problematic dimension had about 36 000 different values,

> and the number of rows was about 60 000. Calling index() on a list of 36

> 000 values, times 60 000, is too slow.

>

> Test performance was 262 rows/sec. If I use dv.append(0) instead of "

> dv.append( dimension_values[col_idx].index(row[col_idx]) ) " then it

> goes up to 11520 rows/sec. If I exclude the problematic dimension, and

> only leave the others (with 1000 or less values) then goes up to 3000

> rows/sec.

>

> Maybe this should be implemented in C. But I believe that the algorithm

> itself must be wrong (regardless of the language). I really think that

> I'm doing something wrong. Looks like my algorithm's processing time is

> not linear to the number of rows. Not even log(n)*n. There should be a

> more effective way to do this. But how?

>

> I had the idea of sorting the rows by a given dimension. Then it would

> be obvious to speed up the indexing part - for that dimension. PROBABLY

> sorting all rows would be faster than calling list.index() for each row.

> But... it does not seem very efficient either. What if I have 1 million

> rows and 10 dimensions? Do I sort 1 million rows on the disk 10 times?

> Some of you might have ran into the same problem before, and can tell me

> which is the most efficient way to do this.

>

> Thanks,

Have you considered using a SQL database? Personally, I would use

MS-Access and link it to Python via ODBC. That way, I could use

the Access drag-and-drop design tools and either

- copy the SQL code of working query designs to Python

or

- use the ODBC to link to said queries rather than directly

to the raw tables

Of course, Python has SQLlight now, I just don't know SQL that well.

>

> * * Laszlo