PyPK
Guest
Posts: n/a

 06-01-2005
I was testing this piece of code and it takes about 24-30 seconds to do
a look up on a list(m) of size 1000x1000
m -> list of size 1000x1000
import time
print time.ctime()
box = {}
for r,row in enumerate(m):
for c,e in enumerate(row):
if box.has_key(e):
params = box[e]
box[e] = ( min(c, params[0]), min(r, params[1]),
max(c, params[2]), max(r, params[3] ) )
else:
box[e] = [c, r, c, r]
print time.ctime()

Can some one tell me what exactly is taking more time here. Is it
because I am using dictionaries or what is the problem. Can some one
help me improve this .Is there a better way to write this.

John Machin
Guest
Posts: n/a

 06-02-2005
PyPK wrote:
> I was testing this piece of code and it takes about 24-30 seconds to do
> a look up on a list(m) of size 1000x1000
> m -> list of size 1000x1000
> import time
> print time.ctime()
> box = {}
> for r,row in enumerate(m):
> for c,e in enumerate(row):
> if box.has_key(e):
> params = box[e]
> box[e] = ( min(c, params[0]), min(r, params[1]),
> max(c, params[2]), max(r, params[3] ) )
> else:
> box[e] = [c, r, c, r]
> print time.ctime()
>
> Can some one tell me what exactly is taking more time here. Is it
> because I am using dictionaries or what is the problem. Can some one
> help me improve this .Is there a better way to write this.
>

Without gross changes to the algorithm, and in no particular order:
0. Stop worrying. Find something else to do during the 30 seconds.
1. Use psyco, if your [unspecified] platform supports it.
2. Why has_key()? Try "if e in box:" instead, if your [unspecified]
version of Python supports it. If it doesn't, (a) consider upgrading (b)
try doing box_has_key = box.has_key outside the loops and using the
result inside the loops.
3. Ensure this code is inside a function, not at global level in a
script [not apparent from your post].
4. Outside the loops, put "local_min = min", ditto for max. Then use
box[e] = (local_min(c, etc etc
6. Use try/except, if indicated by the [unspecified] percentage of times
"e in box" is true. Hint: printing len(box) at the end might be useful.
7. Consider using pyrex.
8. Consider using numarray/mumeric.

Info req'd for discussing algorithm changes:
1. How much free memory do you have?
2. What can you tell us about the distribution of "e"?
3. Is "m" a rectangle, or are the rows of variable length?
4. Do you have control over "m" i.e. are you stuck with that data
structure, or can we design a different one?

PyPK
Guest
Posts: n/a

 06-02-2005
>Info req'd for discussing algorithm changes:
>1. How much free memory do you have?

- Memory is not a problem but I am working on some timing
constraints.Thats is the reason I am a bit concerned abt these 30
seconds

> 2. What can you tell us about the distribution of "e"?

- the distribution of e is not fixed.It changes based on what comes
first in the scan of the list.

> 3. Is "m" a rectangle, or are the rows of variable length?

- m is not a restnagle .its of varied dimensions.

> 4. Do you have control over "m" i.e. are you stuck with that data

structure, or can we design a different one?

-I can use 'm' as an array or a list. Other than that I am stuck.

I am suspecious abput dictionary approach here. Is this slower than if
i do it with arrays or lists with index of array as my key ?

John Machin
Guest
Posts: n/a

 06-02-2005
PyPK wrote:

>
> I am suspecious abput dictionary approach here. Is this slower than if
> i do it with arrays or lists with index of array as my key ?
>

"with index of array as my key" is either redundant -- indicating that
"e is an integer that can range from blah0 to blah_max" -- or not
understandable.

In any case, seeing that the code in question is only a few lines, why
don't *you* write an alternative version and see how fast it runs? And
let us know?

Peter Otten
Guest
Posts: n/a

 06-02-2005
PyPK wrote:

> I was testing this piece of code and it takes about 24-30 seconds to do
> a look up on a list(m) of size 1000x1000
> m -> list of size 1000x1000
> import time
> print time.ctime()
> box = {}
> for r,row in enumerate(m):
> for c,e in enumerate(row):
> if box.has_key(e):
> params = box[e]
> box[e] = ( min(c, params[0]), min(r, params[1]),
> max(c, params[2]), max(r, params[3] ) )
> else:
> box[e] = [c, r, c, r]
> print time.ctime()
>
> Can some one tell me what exactly is taking more time here. Is it
> because I am using dictionaries or what is the problem. Can some one
> help me improve this .Is there a better way to write this.

AFAICT the row index 'r' can only grow. Therefore one min() and one max()
should be superfluous (untested):

def make_box(m):
box = {}
for r, row in enumerate(m):
for c, e in enumerate(row):
if e in box:
left, top, right, bottom = box[e]
box[e] = (min(c, left), top, max(c, right), r)
else:
box[e] = (c, r, c, r)
return box

Peter

PyPK
Guest
Posts: n/a

 06-02-2005
Yep that improved the speed by about 50% now it takes about 10 secs
instead of 24 seconds..Thanks much. I guess that is the best we could
do right.It would be really helpful if I could get it less than 5
seconds. Any suggestions on that??

Michael Spencer
Guest
Posts: n/a

 06-02-2005
PyPK wrote:
> Yep that improved the speed by about 50% now it takes about 10 secs
> instead of 24 seconds..Thanks much. I guess that is the best we could
> do right.It would be really helpful if I could get it less than 5
> seconds. Any suggestions on that??
>

Things to try:
* in-lining the min and max expressions
* depending on the distribution of e, it may be faster to catch KeyErrors

def search1(m):
box = {}
for r,row in enumerate(m):
for c,e in enumerate(row):
try:
minc, minr, maxc, maxr = box[e]
box[e] = ( c < minc and c or minc,
r < minr and r or minr,
c > maxc and c or maxc,
r > maxr and r or maxr)
except KeyError:
box[e] = (c, r, c, r)
return box

Michael

Michael Spencer
Guest
Posts: n/a

 06-02-2005
Michael Spencer wrote:
> def search1(m):
> box = {}
> for r,row in enumerate(m):
> for c,e in enumerate(row):
> try:
> minc, minr, maxc, maxr = box[e]
> box[e] = ( c < minc and c or minc,
> r < minr and r or minr,
> c > maxc and c or maxc,
> r > maxr and r or maxr)
> except KeyError:
> box[e] = (c, r, c, r)
> return box
>
> Michael
>

Sorry, my earlier solution was incorrect since:
c < minc and c or minc # evaluates to minc if c == 0
Two of the tests are unnecessary, as pointed out by Peter
The remaining test:
c > maxc and c or maxc
does not suffer from the same problem, since c cannot both be 0 and greater than
maxc

The following version, still untested has the correction:

def search2(m):
box = {}
for r,row in enumerate(m):
for c,e in enumerate(row):
try: # use this form if e will be found 90%+ of the time
# otherwise, use: if e in box:
minc, minr, maxc, maxr = box[e]
# note correction for c == 0
# also Peter's simplification
box[e] = ( c and (c < minc and c or minc),
minr,
c > maxc and c or maxc,
r)

except KeyError:
box[e] = (c, r, c, r)
return box

Michael

John Machin
Guest
Posts: n/a

 06-02-2005
Michael Spencer wrote:

> minc, minr, maxc, maxr = box[e]
> # note correction for c == 0
> # also Peter's simplification
> box[e] = ( c and (c < minc and c or minc),
> minr,
> c > maxc and c or maxc,
> r)
>

This may be slightly faster (when c > 0), and slightly less opaque:

minc, minr, maxc, maxr = box[e]
# note correction for c == 0
# also Peter's simplification
if c < minc:
minc = c
box[e] = ( minc,
minr,
c > maxc and c or maxc,
r)

John Machin
Guest
Posts: n/a

 06-02-2005
Michael Spencer wrote:

> minc, minr, maxc, maxr = box[e]
> # note correction for c == 0
> # also Peter's simplification
> box[e] = ( c and (c < minc and c or minc),
> minr,
> c > maxc and c or maxc,
> r)
>

This may be slightly faster (when c > 0), and slightly less opaque:

minc, minr, maxc, maxr = box[e]
# note correction for c == 0
# also Peter's simplification
if c < minc:
minc = c
box[e] = ( minc,
minr,
c > maxc and c or maxc,
r)