please help with optimisation of this code - update of given table according to another table

 
 
Farraige
      11-08-2006
Hi, I need your help...
I am implementing a method that updates a given table (a table is
represented as a list of lists of strings) according to another table
(a kind of merge)...

The method takes the following arguments:
t1 - the table we would like to update
t2 - the table we would like to take data from
keyColumns - a list of key column indexes, e.g. [0, 1]
columnsToBeUpdated - a list of column indexes we would like to update in
table T1, e.g. [2, 4]

Let's say we have a table T1:

A B C D E
---------------
1 4 5 7 7
3 4 0 0 0

and we call the method mergeTable(T1, T2, [0, 1], [2, 4])

This means we would like to update columns C and E of table T1 with
data from table T2, but only where the key columns A and B are equal
in both tables... I guarantee that the key is unique in both tables,
so if I find a row with the same key in table T2 I do the merge, stop and
go on to the next row of table T1...

Let's say T2 looks like this:

A B C D E
---------------
2 2 8 8 8
1 4 9 9 9

So after execution of our mergeTable method, table T1 should look
like:

A B C D E
---------------
1 4 9 7 9
3 4 0 0 0

The 2nd row ['3', '4', '0', '0', '0'] didn't change because there was
no row in table T2 with key = 3, 4.

The main part of my algorithm now looks something like ...

def merge(t1, t2, keyColumns, columnsToBeUpdated):
    ........
    for row_t1 in t1:
        for row_t2 in t2:
            if [row_t1[i] for i in keyColumns] == [row_t2[j] for j in keyColumns]:
                # the keys are the same
                for colName in columnsToBeUpdated:
                    row_t1[colName] = row_t2[colName]

                # go outside the inner loop - we found a row with
                # the same key in the table
                break

In my algorithm I have two for loops and I have no idea how to optimise
it (maybe with map?).
I call this method on very large data sets and performance is a
critical issue for me.
I would be grateful for any ideas.
Thanks in advance!

 
Marc 'BlackJack' Rintsch
 
      11-08-2006
In <(E-Mail Removed). com>, Farraige wrote:

> Let's say we have a table T1:
>
> A B C D E
> ---------------
> 1 4 5 7 7
> 3 4 0 0 0
>
> and we call a method mergeTable(T1, T2, [0,1], [2,4])
>
> It means that we would like to update columns C and E of table T1 with
> data from table T2 but only in case the key columns A and B are equal
> in both tables.... I grant that the given key is unique in both tables
> so if I find a row with the same key in table T2 I do merging, stop and
> go to next row in table T1...
>
> Let's say T2 looks following:
>
> A B C D E
> ---------------
> 2 2 8 8 8
> 1 4 9 9 9
>
> So after execution of our mergeTable method, the table T1 should look
> like :
>
> A B C D E
> 1 4 9 7 9
> 3 4 0 0 0
>
> The 2nd row ['3', '4', '0' ,'0', '0'] didn't change because there was
> no row in table T2 with key = 3 ,4
>
> The main part of my algorithm now looks something like ...
>
> def merge(t1, t2, keyColumns, columnsToBeUpdated):
>     ........
>     for row_t1 in t1:
>         for row_t2 in t2:
>             if [row_t1[i] for i in keyColumns] == [row_t2[j] for j in keyColumns]:
>                 # the keys are the same
>                 for colName in columnsToBeUpdated:
>                     row_t1[colName] = row_t2[colName]
>
>                 # go outside the inner loop - we found a row with
>                 # the same key in the table
>                 break
>
> In my algorithm I have 2 for loops and I have no idea how to optimise
> it (maybe with map? )
> I call this method for very large data and the performance is a
> critical issue for me


Just go through the first table once and build a mapping key->row, then
go through the second table once and check for each row whether its key
is in the mapping. If yes: update the columns. This runs in O(2*rows)
instead of O(rows**2).

def update_table(table_a, table_b, key_columns, columns_to_be_updated):
    def get_key(row):
        return tuple(row[x] for x in key_columns)

    # Map key tuple -> row of table_a, built in one pass.
    key2row = dict((get_key(row), row) for row in table_a)
    for row in table_b:
        row_to_be_updated = key2row.get(get_key(row))
        if row_to_be_updated is not None:
            for column in columns_to_be_updated:
                row_to_be_updated[column] = row[column]


def main():
    table_a = [[1, 4, 5, 7, 7],
               [3, 4, 0, 0, 0]]
    table_b = [[2, 2, 8, 8, 8],
               [1, 4, 9, 9, 9]]
    update_table(table_a, table_b, (0, 1), (2, 4))
    for row in table_a:
        print row


if __name__ == '__main__':
    main()
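
For these sample tables, running the script should print the merged
table_a, with the first row updated in columns 2 and 4 and the second
row left untouched:

[1, 4, 9, 7, 9]
[3, 4, 0, 0, 0]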

Ciao,
Marc 'BlackJack' Rintsch
 
Antoon Pardon
 
      11-08-2006
On 2006-11-08, Farraige <(E-Mail Removed)> wrote:
>
> ...
>
> The main part of my algorithm now looks something like ...
>
> def merge(t1, t2, keyColumns, columnsToBeUpdated):
>     ........
>     for row_t1 in t1:
>         for row_t2 in t2:
>             if [row_t1[i] for i in keyColumns] == [row_t2[j] for j in keyColumns]:
>                 # the keys are the same
>                 for colName in columnsToBeUpdated:
>                     row_t1[colName] = row_t2[colName]
>
>                 # go outside the inner loop - we found a row with
>                 # the same key in the table
>                 break
>
> In my algorithm I have 2 for loops and I have no idea how to optimise
> it (maybe with map? )
> I call this method for very large data and the performance is a
> critical issue for me
>
> I will be grateful for any ideas


One idea would be to precompute the list comprehensions in the if test.

from itertools import izip

p2 = [[row_t2[i] for i in keyColumns] for row_t2 in t2]
for row_t1 in t1:
    proj1 = [row_t1[i] for i in keyColumns]
    for row_t2, proj2 in izip(t2, p2):
        if proj1 == proj2:
            ...
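
For reference, a complete version of that precomputation idea might look
like the sketch below (merge_precomputed is an illustrative name, not from
the post; it assumes the same arguments and the unique-key guarantee the
original poster describes):

from itertools import izip

# NOTE: merge_precomputed is a hypothetical helper name used only to
# illustrate the precomputation idea above.
def merge_precomputed(t1, t2, keyColumns, columnsToBeUpdated):
    # Project the key columns of every t2 row once, up front, instead of
    # rebuilding that list on every comparison.
    t2_keys = [[row[i] for i in keyColumns] for row in t2]
    for row_t1 in t1:
        key_t1 = [row_t1[i] for i in keyColumns]
        for row_t2, key_t2 in izip(t2, t2_keys):
            if key_t1 == key_t2:
                for col in columnsToBeUpdated:
                    row_t1[col] = row_t2[col]
                # the key is unique, so stop scanning t2
                break

This still compares rows pairwise in the worst case, so it mainly saves
the cost of rebuilding the key lists, not the O(rows**2) scan itself.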


--
Antoon Pardon
 
Gabriel Genellina
 
      11-08-2006
At Wednesday 8/11/2006 07:18, Farraige wrote:

> for row_t1 in t1:
>     for row_t2 in t2:
>         if [row_t1[i] for i in keyColumns] == [row_t2[j] for j in keyColumns]:
>             # the keys are the same
>             for colName in columnsToBeUpdated:
>                 row_t1[colName] = row_t2[colName]
>
>             # go outside the inner loop - we found a row with
>             # the same key in the table
>             break
>
> In my algorithm I have 2 for loops and I have no idea how to optimise
> it (maybe with map? )
> I call this method for very large data and the performance is a
> critical issue for me


def getkey(row):
    return tuple([row[i] for i in keyColumns])

Sort both tables using .sort(key=getkey); then traverse both in
parallel as in a classic merge; in one pass you get the job done (not
counting the two initial sorts).
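
A minimal sketch of that sort-and-merge approach might look like this
(merge_sorted and the index variables are illustrative names, not from the
post; it assumes both tables may be re-sorted in place and that keys are
unique, as the original poster guarantees):

# NOTE: merge_sorted is a hypothetical name used only to illustrate the
# sort-and-merge idea described above.
def merge_sorted(t1, t2, keyColumns, columnsToBeUpdated):
    def getkey(row):
        return tuple([row[i] for i in keyColumns])

    # Sort both tables on the key columns, then walk them in parallel.
    t1.sort(key=getkey)
    t2.sort(key=getkey)
    i = j = 0
    while i < len(t1) and j < len(t2):
        key1, key2 = getkey(t1[i]), getkey(t2[j])
        if key1 == key2:
            for col in columnsToBeUpdated:
                t1[i][col] = t2[j][col]
            i += 1
            j += 1
        elif key1 < key2:
            i += 1
        else:
            j += 1

Note that this reorders the rows of t1, and the two sorts cost
O(n log n), so the dictionary approach shown earlier is usually at least
as fast unless the tables are already sorted on the key.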


--
Gabriel Genellina
Softlab SRL

 
Farraige
 
      11-08-2006
Thank you all for your very valuable clues!
Thanks to you I got nearly a 97% performance improvement!!! I can't
believe it ))
Once again, thank you.

Best wishes
Farraige

 