# looping in array vs looping in a dic

giuseppe.amatulli@gmail.com
Guest
Posts: n/a

 09-20-2012
Hi,
I have this Python script that I need to apply to very large arrays (arrays coming from satellite images).
The script works great, but I would like to speed up the process.
Most of the computation time is spent in the for loop.
Is there a way to improve that part?
Would it be better to use a dict instead of an np.ndarray for saving the results?
And if so, how can I do the sum in a dict (like the corresponding matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col])?
If a dict is the solution, why is it faster?

Thanks
Giuseppe

import numpy as np
import sys
from time import clock, time

# create the arrays

start = time()
valuesRaster = np.random.random_integers(0, 100, 100).reshape(10, 10)
valuesCategory = np.random.random_integers(1, 10, 100).reshape(10, 10)

elapsed = (time() - start)
print(elapsed , "create the data")

start = time()

categories = np.unique(valuesCategory)
matrix = np.c_[ categories , np.zeros(len(categories))]

elapsed = (time() - start)
print(elapsed , "create the matrix and append a column of zeros ")

rows = 10
cols = 10

start = time()

for col in range(0,cols):
    for row in range(0,rows):
        for row_c in range(0,len(matrix)) :
            if valuesCategory[row,col] == matrix[row_c,0] :
                matrix[row_c,1] = matrix[row_c,1] + valuesRaster[row,col]
                break
elapsed = (time() - start)
print(elapsed , "loop in the data ")

print (matrix)

MRAB
Guest
Posts: n/a

 09-20-2012

If I understand the code correctly, 'matrix' contains the categories in
column 0 and the totals in column 1.

What you're doing is performing a linear search through the categories
and then adding to the corresponding total.

Linear searches are slow because on average you have to search through
half of the list. Using a dict would be much faster (although you
should of course measure it!).

Try something like this:

import numpy as np
from time import time

# Create the arrays.

start = time()

valuesRaster = np.random.random_integers(0, 100, 100).reshape(10, 10)
valuesCategory = np.random.random_integers(1, 10, 100).reshape(10, 10)

elapsed = time() - start
print(elapsed, "Create the data.")

start = time()

categories = np.unique(valuesCategory)
totals = dict.fromkeys(categories, 0)

elapsed = time() - start
print(elapsed, "Create the totals dict.")

rows = 10
cols = 10

start = time()

for col in range(cols):
    for row in range(rows):
        cat = valuesCategory[row, col]
        ras = valuesRaster[row, col]
        totals[cat] += ras

elapsed = time() - start
print(elapsed, "Loop in the data.")

print(totals)

Ian Kelly
Guest
Posts: n/a

 09-20-2012
On Thu, Sep 20, 2012 at 1:09 PM, MRAB <(E-Mail Removed)> wrote:
> for col in range(cols):
>     for row in range(rows):
>         cat = valuesCategory[row, col]
>         ras = valuesRaster[row, col]
>         totals[cat] += ras

Expanding on what MRAB wrote, since you probably have far fewer
categories than pixels, you may be able to take better advantage of
numpy's vectorized operations (which are pretty much the whole point
of using numpy in the first place) by looping over the categories:

for cat in categories:
    totals[cat] += np.sum(valuesRaster * (valuesCategory == cat))
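As a rough, self-contained check of the per-category idea, the sketch below compares it against the per-pixel dict loop on small made-up arrays. The random data, the seed, and the names totals_loop and totals_mask are assumptions for illustration only; it also selects pixels with a boolean mask rather than multiplying by one, which should give the same sums.

```python
import numpy as np

# Small made-up arrays standing in for the real rasters (assumption).
rng = np.random.RandomState(0)
valuesRaster = rng.randint(0, 101, size=(10, 10))
valuesCategory = rng.randint(1, 11, size=(10, 10))

categories = np.unique(valuesCategory)

# Per-pixel loop with a dict, as in the earlier reply.
totals_loop = dict.fromkeys(categories.tolist(), 0)
for row in range(valuesCategory.shape[0]):
    for col in range(valuesCategory.shape[1]):
        cat = int(valuesCategory[row, col])
        totals_loop[cat] += int(valuesRaster[row, col])

# One vectorized pass per category: the boolean mask picks the pixels
# that belong to the category, and numpy sums the raster values there.
totals_mask = {}
for cat in categories.tolist():
    totals_mask[cat] = int(valuesRaster[valuesCategory == cat].sum())

print(totals_loop == totals_mask)
```

With only a handful of categories, the Python-level loop runs a few times instead of once per pixel, which is where the speed-up would be expected to come from; it is still worth timing on real data.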

giuseppe.amatulli@gmail.com
Guest
Posts: n/a

 09-20-2012
Hi Ian and MRAB,
thanks to your input I have improved the speed of my code. Reading into a dict is definitely faster. I have one more question.
In the dict I calculate the sum of the values, but I also want to count the number of observations, in order to calculate the average at the end.
Should I create a new dict, or is it possible to do it in the same dict?
Here is the final code.
Thanks, Giuseppe

rows = dsCategory.RasterYSize
cols = dsCategory.RasterXSize

print("Generating output file %s" %(dst_file))

start = time()

unique=dict()

for irows in xrange(rows):
    for icols in xrange(cols):
        if ( valuesRaster[0,icols] != no_data_Raster ) and ( valuesCategory[0,icols] != no_data_Category ) :
            row = valuesCategory[0, icols],valuesRaster[0, icols]
            if row[0] in unique :
                unique[row[0]] += row[1]
            else:
                unique[row[0]] = 0+row[1] # this 0 was add if not the first observation was considered = 0

MRAB
Guest
Posts: n/a

 09-20-2012
On 2012-09-21 00:35, (E-Mail Removed) wrote:
> Hi Ian and MRAB,
> thanks to your input I have improved the speed of my code. Reading into a dict is definitely faster. I have one more question.
> In the dict I calculate the sum of the values, but I also want to count the number of observations, in order to calculate the average at the end.
> Should I create a new dict, or is it possible to do it in the same dict?
> Here is the final code.
> Thanks, Giuseppe
>

Keep it simple. Use 2 dicts.

>
>
> rows = dsCategory.RasterYSize
> cols = dsCategory.RasterXSize
>
> print("Generating output file %s" %(dst_file))
>
> start = time()
>
> unique=dict()
>
> for irows in xrange(rows):
>     for icols in xrange(cols):
>         if ( valuesRaster[0,icols] != no_data_Raster ) and ( valuesCategory[0,icols] != no_data_Category ) :
>             row = valuesCategory[0, icols],valuesRaster[0, icols]
>             if row[0] in unique :
>                 unique[row[0]] += row[1]
>             else:
>                 unique[row[0]] = 0+row[1] # this 0 was add if not the first observation was considered = 0
>

from collections import defaultdict

unique = defaultdict(int)
....
category, raster = valuesCategory[0, icols], valuesRaster[0, icols]
unique[category] += raster
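
The "use 2 dicts" suggestion above could look roughly like the sketch below: one defaultdict for the running sums and a second for the observation counts, with the averages computed at the end. The pixels list and the names sums, counts and averages are made up for illustration; in the real script the pairs would come from the two rasters.

```python
from collections import defaultdict

# Hypothetical (category, value) pairs standing in for the pixels read
# from the two rasters; the real values come from the image files.
pixels = [(1, 10), (2, 4), (1, 20), (3, 7), (2, 6)]

sums = defaultdict(int)    # category -> running total
counts = defaultdict(int)  # category -> number of observations

for category, raster in pixels:
    sums[category] += raster
    counts[category] += 1

# Average per category; float() keeps true division on Python 2 as well.
averages = {cat: float(sums[cat]) / counts[cat] for cat in sums}
print(averages)
```

Keeping the sums and counts in separate dicts means each update stays a single += on a plain number, which is easier to read than packing both values into one dict entry.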