Velocity Reviews > please help with optimisation of this code - update of given table according to another table

# please help with optimisation of this code - update of given table according to another table

Farraige
Guest
Posts: n/a

 11-08-2006
I am implementing the method that updates given table (table is
represented as list of lists of strings) according to other table (some
kind of merging)...

This method takes following arguments:
t1 - table we would like to update
t2 - table we would like to take data
from
keyColumns - list of key indexes e.g. [0,1]
columnsToBeUpdated - list of column indexes we would like to update in
our table T1 e.g [2,4]

Let's say we have a table T1:

A B C D E
---------------
1 4 5 7 7
3 4 0 0 0

and we call a method mergeTable(T1, T2, [0,1], [2,4])

It means that we would like to update columns C and E of table T1 with
data from table T2 but only in case the key columns A and B are equal
in both tables.... I grant that the given key is unique in both tables
so if I find a row with the same key in table T2 I do merging, stop and
go to next row in table T1...

Let's say T2 looks following:

A B C D E
---------------
2 2 8 8 8
1 4 9 9 9

So after execution of our mergeTable method, the table T1 should look
like :

A B C D E
1 4 9 7 9
3 4 0 0 0

The 2nd row ['3', '4', '0' ,'0', '0'] didn't change because there was
no row in table T2 with key = 3 ,4

The main part of my algorithm now looks something like ...

merge(t1, t2, keyColumns, columnsToBeUpdated)

........

for row_t1 in t1:
for row_t2 in t2:
if [row_t1[i] for i in keyColumns] == [row_t2[j] for j
in keyColumns]:
# the keys are the same
for colName in columnsToBeUpdated:
row_t1[colName] = row_t2[colName]

# go outside the inner loop - we found a row with
# the same key in the table
break

In my algorithm I have 2 for loops and I have no idea how to optimise
it (maybe with map? )
I call this method for very large data and the performance is a
critical issue for me

I will be grateful for any ideas

Marc 'BlackJack' Rintsch
Guest
Posts: n/a

 11-08-2006
In <(E-Mail Removed). com>, Farraige wrote:

> Let's say we have a table T1:
>
> A B C D E
> ---------------
> 1 4 5 7 7
> 3 4 0 0 0
>
> and we call a method mergeTable(T1, T2, [0,1], [2,4])
>
> It means that we would like to update columns C and E of table T1 with
> data from table T2 but only in case the key columns A and B are equal
> in both tables.... I grant that the given key is unique in both tables
> so if I find a row with the same key in table T2 I do merging, stop and
> go to next row in table T1...
>
> Let's say T2 looks following:
>
> A B C D E
> ---------------
> 2 2 8 8 8
> 1 4 9 9 9
>
> So after execution of our mergeTable method, the table T1 should look
> like :
>
> A B C D E
> 1 4 9 7 9
> 3 4 0 0 0
>
> The 2nd row ['3', '4', '0' ,'0', '0'] didn't change because there was
> no row in table T2 with key = 3 ,4
>
> The main part of my algorithm now looks something like ...
>
> merge(t1, t2, keyColumns, columnsToBeUpdated)
>
> .......
>
> for row_t1 in t1:
> for row_t2 in t2:
> if [row_t1[i] for i in keyColumns] == [row_t2[j] for j
> in keyColumns]:
> # the keys are the same
> for colName in columnsToBeUpdated:
> row_t1[colName] = row_t2[colName]
>
> # go outside the inner loop - we found a row with
> # the same key in the table
> break
>
> In my algorithm I have 2 for loops and I have no idea how to optimise
> it (maybe with map? )
> I call this method for very large data and the performance is a
> critical issue for me

Just go through the first table once and build a mapping key->row and then
go through the second table once and look for each row if the key is in
the mapping. If yes: update columns. This runs in O(2*rows) instead if
O(rows**2).

def update_table(table_a, table_b, key_columns, columns_to_be_updated):
def get_key(row):
return tuple(row[x] for x in key_columns)

key2row = dict((get_key(row), row) for row in table_a)
for row in table_b:
row_to_be_updated = key2row.get(get_key(row))
if row_to_be_updated is not None:
for column in columns_to_be_updated:
row_to_be_updated[column] = row[column]

def main():
table_a = [[1, 4, 5, 7, 7],
[3, 4, 0, 0, 0]]
table_b = [[2, 2, 8, 8, 8],
[1, 4, 9, 9, 9]]
update_table(table_a, table_b, (0, 1), (2, 4))
for row in table_a:
print row

Ciao,
Marc 'BlackJack' Rintsch

Antoon Pardon
Guest
Posts: n/a

 11-08-2006
On 2006-11-08, Farraige <(E-Mail Removed)> wrote:
>
> ...
>
> The main part of my algorithm now looks something like ...
>
> merge(t1, t2, keyColumns, columnsToBeUpdated)
>
> .......
>
> for row_t1 in t1:
> for row_t2 in t2:
> if [row_t1[i] for i in keyColumns] == [row_t2[j] for j
> in keyColumns]:
> # the keys are the same
> for colName in columnsToBeUpdated:
> row_t1[colName] = row_t2[colName]
>
> # go outside the inner loop - we found a row with
> # the same key in the table
> break
>
> In my algorithm I have 2 for loops and I have no idea how to optimise
> it (maybe with map? )
> I call this method for very large data and the performance is a
> critical issue for me
>
> I will be grateful for any ideas

One idea would be to precompute the list comprehensions in the if test.

p2 = [[row_t2[i] for i in keyColums] for row_t2 in t2]
for row_t1 in t1:
proj1 = [row_t1[i] for i in keyColumns]
for row_t1, proj2 in izip(t2, p2):
if proj1 == proj2:
...

--
Antoon Pardon

Gabriel Genellina
Guest
Posts: n/a

 11-08-2006
At Wednesday 8/11/2006 07:18, Farraige wrote:

> for row_t1 in t1:
> for row_t2 in t2:
> if [row_t1[i] for i in keyColumns] == [row_t2[j] for j
>in keyColumns]:
> # the keys are the same
> for colName in columnsToBeUpdated:
> row_t1[colName] = row_t2[colName]
>
> # go outside the inner loop - we found a row with
> # the same key in the table
> break
>
>In my algorithm I have 2 for loops and I have no idea how to optimise
>it (maybe with map? )
>I call this method for very large data and the performance is a
>critical issue for me

def getkey(row):
return tuple([row[i] for i in keyColumns])

Sort both tables using .sort(key=getkey); then traverse both in
paralell like in classic merge; in 1 pass you get the job done (not
counting the two initial sorts)

--
Gabriel Genellina
Softlab SRL

__________________________________________________
Correo Yahoo!
Espacio para todos tus mensajes, antivirus y antispam ˇgratis!
ˇAbrí tu cuenta ya! - http://correo.yahoo.com.ar

Farraige
Guest
Posts: n/a

 11-08-2006
Thank you all for your very valuable clues!
Thanks to you I got nearly 97% perfomance improvement !!!!!!!!!!!!!!! I
can't believe it ))
Once again thank you

Best wishes
Farraige

 Posting Rules You may not post new threads You may not post replies You may not post attachments You may not edit your posts BB code is On Smilies are On [IMG] code is On HTML code is OffTrackbacks are On Pingbacks are On Refbacks are Off Forum Rules

 Similar Threads Thread Thread Starter Forum Replies Last Post Casey Hawthorne C Programming 385 04-04-2010 02:11 AM 2Barter.net C++ 0 12-13-2006 02:56 AM Lord0 Java 1 04-19-2006 04:54 PM Mayank Kaushik C Programming 11 11-13-2005 03:55 AM chiara C Programming 6 10-06-2005 01:43 AM