Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Python (http://www.velocityreviews.com/forums/f43-python.html)
-   -   help with making my code more efficient (http://www.velocityreviews.com/forums/t955701-help-with-making-my-code-more-efficient.html)

Larry.Martell@gmail.com 12-21-2012 12:19 AM

help with making my code more efficient
 
I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.

Here is how I am currently doing this:

# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata: # tool, time, message
if self.message in row[2]:
messageTimes[row[0], row[1]] = 1

# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves

def determine(tup):
if self.message in tup[2]: return True # matched message

for (tool, date_time) in messageTimes:
if tool == tup[0]:
if abs(date_time-tup[1]) <= tdiff:
return True

return False

cdata[:] = [tup for tup in cdata if determine(tup)]

This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.

Can anyone give me some suggestions on speeding this up?

TIA!

Chris Angelico 12-21-2012 12:38 AM

Re: help with making my code more efficient
 
On Fri, Dec 21, 2012 at 11:19 AM, Larry.Martell@gmail.com
<Larry.Martell@gmail.com> wrote:
> This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
>
> Can anyone give me some suggestions on speeding this up?
>


It sounds like you may have enough data to want to not keep it all in
memory. Have you considered switching to a database? You could then
execute SQL queries against it.

ChrisA

Larry.Martell@gmail.com 12-21-2012 12:43 AM

Re: help with making my code more efficient
 
On Thursday, December 20, 2012 5:38:03 PM UTC-7, Chris Angelico wrote:
> On Fri, Dec 21, 2012 at 11:19 AM, Larry.Martell@gmail.com
>
> <Larry.Martell@gmail.com> wrote:
>
> > This code works, but it takes way too long to run - e.g. when cdata has600,000 elements (which is typical for my app) it takes 2 hours for this to run.

>
> >

>
> > Can anyone give me some suggestions on speeding this up?

>
> >

>
>
>
> It sounds like you may have enough data to want to not keep it all in
>
> memory. Have you considered switching to a database? You could then
>
> execute SQL queries against it.


It came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the datawithin the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python.


Larry.Martell@gmail.com 12-21-2012 12:43 AM

Re: help with making my code more efficient
 
On Thursday, December 20, 2012 5:38:03 PM UTC-7, Chris Angelico wrote:
> On Fri, Dec 21, 2012 at 11:19 AM, Larry.Martell@gmail.com
>
> <Larry.Martell@gmail.com> wrote:
>
> > This code works, but it takes way too long to run - e.g. when cdata has600,000 elements (which is typical for my app) it takes 2 hours for this to run.

>
> >

>
> > Can anyone give me some suggestions on speeding this up?

>
> >

>
>
>
> It sounds like you may have enough data to want to not keep it all in
>
> memory. Have you considered switching to a database? You could then
>
> execute SQL queries against it.


It came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the datawithin the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python.


Dave Angel 12-21-2012 01:17 AM

Re: help with making my code more efficient
 
On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote:
> I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
>
> Here is how I am currently doing this:


No, it's not. This is a fragment of code, without enough clues as to
what else is going. We can guess, but that's likely to make a mess.

First question is whether this code works exactly correctly? Are you
only concerned about speed, not fixing features? As far as I can tell,
the logic that includes the time comparison is bogus. You don't do
anything there to worry about the value of tup[2], just whether some
item has a nearby time. Of course, I could misunderstand the spec.

Are you making a global called 'self' ? That name is by convention only
used in methods to designate the instance object. What's the attribute
self?

Can cdata have duplicates, and are they significant? Are you including
the time building that as part of your 2 hour measurement? Is the list
sorted in any way?

Chances are your performance bottleneck is the doubly-nested loop. You
have a list comprehension at top-level code, and inside it calls a
function that also loops over the 600,000 items. So the inner loop gets
executed 360 billion times. You can cut this down drastically by some
judicious sorting, as well as by having a map of lists, where the map is
keyed by the tool.

>
> # record time for each message matching the specified message for each tool
> messageTimes = {}


You're building a dictionary; are you actually using the value (1), or
is only the key relevant? A set is a dict without a value. But more
importantly, you never look up anything in this dictionary. So why
isn't it a list? For that matter, why don't you just use the
messageTimes list?

> for row in cdata: # tool, time, message
> if self.message in row[2]:
> messageTimes[row[0], row[1]] = 1
>
> # now pull out each message that is within the time diff for each matched message
> # as well as the matched messages themselves
>
> def determine(tup):
> if self.message in tup[2]: return True # matched message
>
> for (tool, date_time) in messageTimes:
> if tool == tup[0]:
> if abs(date_time-tup[1]) <= tdiff:
> return True
>
> return False
>
> cdata[:] = [tup for tup in cdata if determine(tup)]


As the code exists, there's no need to copy the list. Just do a simple
bind.

>
> This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
>
> Can anyone give me some suggestions on speeding this up?
>
> TIA!



--

DaveA


Larry.Martell@gmail.com 12-21-2012 01:46 AM

Re: help with making my code more efficient
 
On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote:
>
> > I have a list of tuples that contains a tool_id, a time, and a message.I want to select from this list all the elements where the message matchessome string, and all the other elements where the time is within some diffof any matching message for that tool.

>
> > Here is how I am currently doing this:

>
> No, it's not. This is a fragment of code, without enough clues as to
>
> what else is going. We can guess, but that's likely to make a mess.


Of course it's a fragment - it's part of a large program and I was just showing the relevant parts.

> First question is whether this code works exactly correctly?


Yes, the code works. I end up with just the rows I want.

> Are you only concerned about speed, not fixing features?


Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.

> As far as I can tell, the logic that includes the time comparison is bogus.


Not at all.

> You don't do anything there to worry about the value of tup[2], just whether some
> item has a nearby time. Of course, I could misunderstand the spec.


The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta()

> Are you making a global called 'self' ? That name is by convention only
> used in methods to designate the instance object. What's the attribute
> self?


Yes, self is my instance object. self.message contains the string of interest that I need to look for.

> Can cdata have duplicates, and are they significant?


No, it will not have duplicates.

> Are you including the time building that as part of your 2 hour measurement?


No, the 2 hours is just the time to run the

cdata[:] = [tup for tup in cdata if determine(tup)]

> Is the list sorted in any way?


Yes, the list is sorted by tool and datetime.

> Chances are your performance bottleneck is the doubly-nested loop. You
> have a list comprehension at top-level code, and inside it calls a
> function that also loops over the 600,000 items. So the inner loop gets
> executed 360 billion times. You can cut this down drastically by some
> judicious sorting, as well as by having a map of lists, where the map is
> keyed by the tool.


Thanks. I will try that.

> > # record time for each message matching the specified message for each tool

>
> > messageTimes = {}

>
> You're building a dictionary; are you actually using the value (1), or
> is only the key relevant?


Only the keys.

> A set is a dict without a value.


Yes, I could use a set, but I don't think that would make it measurably faster.

> But more mportantly, you never look up anything in this dictionary. So why
> isn't it a list? For that matter, why don't you just use the
> messageTimes list?


Yes, it could be a list too.

> > for row in cdata: # tool, time, message

>
> > if self.message in row[2]:

>
> > messageTimes[row[0], row[1]] = 1

>
> >

>
> > # now pull out each message that is within the time diff for each matched message

>
> > # as well as the matched messages themselves

>
> >

>
> > def determine(tup):

>
> > if self.message in tup[2]: return True # matched message

>
> >

>
> > for (tool, date_time) in messageTimes:

>
> > if tool == tup[0]:

>
> > if abs(date_time-tup[1]) <= tdiff:

>
> > return True

>
> >

>
> > return False

>
> >

>
> > cdata[:] = [tup for tup in cdata if determine(tup)]

>
>
>
> As the code exists, there's no need to copy the list. Just do a simple
> bind.


This statement is to remove the items from cdata that I don't want. I don'tknow what you mean by bind. I'm not familiar with that python function.

>
>
>
> >

>
> > This code works, but it takes way too long to run - e.g. when cdata has600,000 elements (which is typical for my app) it takes 2 hours for this to run.

>
> >

>
> > Can anyone give me some suggestions on speeding this up?


Larry.Martell@gmail.com 12-21-2012 01:46 AM

Re: help with making my code more efficient
 
On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote:
>
> > I have a list of tuples that contains a tool_id, a time, and a message.I want to select from this list all the elements where the message matchessome string, and all the other elements where the time is within some diffof any matching message for that tool.

>
> > Here is how I am currently doing this:

>
> No, it's not. This is a fragment of code, without enough clues as to
>
> what else is going. We can guess, but that's likely to make a mess.


Of course it's a fragment - it's part of a large program and I was just showing the relevant parts.

> First question is whether this code works exactly correctly?


Yes, the code works. I end up with just the rows I want.

> Are you only concerned about speed, not fixing features?


Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.

> As far as I can tell, the logic that includes the time comparison is bogus.


Not at all.

> You don't do anything there to worry about the value of tup[2], just whether some
> item has a nearby time. Of course, I could misunderstand the spec.


The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta()

> Are you making a global called 'self' ? That name is by convention only
> used in methods to designate the instance object. What's the attribute
> self?


Yes, self is my instance object. self.message contains the string of interest that I need to look for.

> Can cdata have duplicates, and are they significant?


No, it will not have duplicates.

> Are you including the time building that as part of your 2 hour measurement?


No, the 2 hours is just the time to run the

cdata[:] = [tup for tup in cdata if determine(tup)]

> Is the list sorted in any way?


Yes, the list is sorted by tool and datetime.

> Chances are your performance bottleneck is the doubly-nested loop. You
> have a list comprehension at top-level code, and inside it calls a
> function that also loops over the 600,000 items. So the inner loop gets
> executed 360 billion times. You can cut this down drastically by some
> judicious sorting, as well as by having a map of lists, where the map is
> keyed by the tool.


Thanks. I will try that.

> > # record time for each message matching the specified message for each tool

>
> > messageTimes = {}

>
> You're building a dictionary; are you actually using the value (1), or
> is only the key relevant?


Only the keys.

> A set is a dict without a value.


Yes, I could use a set, but I don't think that would make it measurably faster.

> But more mportantly, you never look up anything in this dictionary. So why
> isn't it a list? For that matter, why don't you just use the
> messageTimes list?


Yes, it could be a list too.

> > for row in cdata: # tool, time, message

>
> > if self.message in row[2]:

>
> > messageTimes[row[0], row[1]] = 1

>
> >

>
> > # now pull out each message that is within the time diff for each matched message

>
> > # as well as the matched messages themselves

>
> >

>
> > def determine(tup):

>
> > if self.message in tup[2]: return True # matched message

>
> >

>
> > for (tool, date_time) in messageTimes:

>
> > if tool == tup[0]:

>
> > if abs(date_time-tup[1]) <= tdiff:

>
> > return True

>
> >

>
> > return False

>
> >

>
> > cdata[:] = [tup for tup in cdata if determine(tup)]

>
>
>
> As the code exists, there's no need to copy the list. Just do a simple
> bind.


This statement is to remove the items from cdata that I don't want. I don'tknow what you mean by bind. I'm not familiar with that python function.

>
>
>
> >

>
> > This code works, but it takes way too long to run - e.g. when cdata has600,000 elements (which is typical for my app) it takes 2 hours for this to run.

>
> >

>
> > Can anyone give me some suggestions on speeding this up?


MRAB 12-21-2012 02:08 AM

Re: help with making my code more efficient
 
On 2012-12-21 00:19, Larry.Martell@gmail.com wrote:
> I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
>
> Here is how I am currently doing this:
>
> # record time for each message matching the specified message for each tool
> messageTimes = {}
> for row in cdata: # tool, time, message
> if self.message in row[2]:
> messageTimes[row[0], row[1]] = 1
>

It looks like 'messageTimes' is really a set of tool/time pairs.

You could make it a dict of sets of time; in other words, a set of
times for each tool:

messageTimes = defaultdict(set)
for row in cdata: # tool, time, message
if self.message in row[2]:
messageTimes[row[0]].add(row[1])

> # now pull out each message that is within the time diff for each matched message
> # as well as the matched messages themselves
>
> def determine(tup):
> if self.message in tup[2]: return True # matched message
>
> for (tool, date_time) in messageTimes:
> if tool == tup[0]:
> if abs(date_time-tup[1]) <= tdiff:
> return True
>
> return False
>

def determine(tup):
if self.message in tup[2]: return True # matched message

# Scan through the times for the tool given by tup[0].
for date_time in messageTimes[tup[0]]:
if abs(date_time - tup[1]) <= tdiff:
return True

return False

> cdata[:] = [tup for tup in cdata if determine(tup)]
>
> This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
>
> Can anyone give me some suggestions on speeding this up?
>



Mitya Sirenef 12-21-2012 02:39 AM

Re: help with making my code more efficient
 
On 12/20/2012 08:46 PM, Larry.Martell@gmail.com wrote:
> On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
>> On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote:
>>
>>> I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
>>> Here is how I am currently doing this:

>> No, it's not. This is a fragment of code, without enough clues as to
>>
>> what else is going. We can guess, but that's likely to make a mess.

> Of course it's a fragment - it's part of a large program and I was just showing the relevant parts.
>
>> First question is whether this code works exactly correctly?

> Yes, the code works. I end up with just the rows I want.
>
>> Are you only concerned about speed, not fixing features?

> Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long.
>
>> As far as I can tell, the logic that includes the time comparison is bogus.

> Not at all.
>
>> You don't do anything there to worry about the value of tup[2], just whether some
>> item has a nearby time. Of course, I could misunderstand the spec.

> The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta()
>
>> Are you making a global called 'self' ? That name is by convention only
>> used in methods to designate the instance object. What's the attribute
>> self?

> Yes, self is my instance object. self.message contains the string of interest that I need to look for.
>
>> Can cdata have duplicates, and are they significant?

> No, it will not have duplicates.
>
>> Are you including the time building that as part of your 2 hour measurement?

> No, the 2 hours is just the time to run the
>
> cdata[:] = [tup for tup in cdata if determine(tup)]
>
>> Is the list sorted in any way?

> Yes, the list is sorted by tool and datetime.
>
>> Chances are your performance bottleneck is the doubly-nested loop. You
>> have a list comprehension at top-level code, and inside it calls a
>> function that also loops over the 600,000 items. So the inner loop gets
>> executed 360 billion times. You can cut this down drastically by some
>> judicious sorting, as well as by having a map of lists, where the map is
>> keyed by the tool.

> Thanks. I will try that.
>
>>> # record time for each message matching the specified message for each tool
>>> messageTimes = {}

>> You're building a dictionary; are you actually using the value (1), or
>> is only the key relevant?

> Only the keys.
>
>> A set is a dict without a value.

> Yes, I could use a set, but I don't think that would make it measurably faster.
>
>> But more mportantly, you never look up anything in this dictionary. So why
>> isn't it a list? For that matter, why don't you just use the
>> messageTimes list?

> Yes, it could be a list too.
>
>>> for row in cdata: # tool, time, message
>>> if self.message in row[2]:
>>> messageTimes[row[0], row[1]] = 1
>>> # now pull out each message that is within the time diff for each matched message
>>> # as well as the matched messages themselves
>>> def determine(tup):
>>> if self.message in tup[2]: return True # matched message
>>> for (tool, date_time) in messageTimes:
>>> if tool == tup[0]:
>>> if abs(date_time-tup[1]) <= tdiff:
>>> return True
>>> return False
>>>
>>> cdata[:] = [tup for tup in cdata if determine(tup)]

>>
>>
>> As the code exists, there's no need to copy the list. Just do a simple
>> bind.

> This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function.
>
>>
>>
>>> This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
>>> Can anyone give me some suggestions on speeding this up?



This code probably is not faster but it's simpler and may be easier for
you to work with
to experiment with speed-improving changes:


diffrng = 1

L = [
# id, time, string
(1, 5, "ok"),
(1, 6, "ok"),
(1, 7, "no"),
(1, 8, "no"),
]

match_times = [t[1] for t in L if "ok" in t[2]]

def in_range(timeval):
return bool( min([abs(timeval-v) for v in match_times]) <= diffrng )

print([t for t in L if in_range(t[1])])


But it really sounds like you could look into optimizing the db
query and db indexes, etc.


--
Lark's Tongue Guide to Python: http://lightbird.net/larks/


Mitya Sirenef 12-21-2012 02:49 AM

Re: help with making my code more efficient
 
On 12/20/2012 09:39 PM, Mitya Sirenef wrote:
> On 12/20/2012 08:46 PM, Larry.Martell@gmail.com wrote:
>> On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
>>> On 12/20/2012 07:19 PM, Larry.Martell@gmail.com wrote:
>>>
>>>> I have a list of tuples that contains a tool_id, a time, and a
>>>> message. I want to select from this list all the elements where the
>>>> message matches some string, and all the other elements where the
>>>> time is within some diff of any matching message for that tool.
>>>> Here is how I am currently doing this:
>>> No, it's not. This is a fragment of code, without enough clues as to
>>>
>>> what else is going. We can guess, but that's likely to make a mess.

>> Of course it's a fragment - it's part of a large program and I was
>> just showing the relevant parts.
>>
>>> First question is whether this code works exactly correctly?

>> Yes, the code works. I end up with just the rows I want.
>>
>>> Are you only concerned about speed, not fixing features?

>> Don't know what you mean by 'fixing features'. The code does what I
>> want, it just takes too long.
>>
>>> As far as I can tell, the logic that includes the time comparison is
>>> bogus.

>> Not at all.
>>
>>> You don't do anything there to worry about the value of tup[2],
>>> just whether some
>>> item has a nearby time. Of course, I could misunderstand the spec.

>> The data comes from a database. tup[2] is a datetime column. tdiff
>> comes from a datetime.timedelta()
>>
>>> Are you making a global called 'self' ? That name is by convention only
>>> used in methods to designate the instance object. What's the attribute
>>> self?

>> Yes, self is my instance object. self.message contains the string of
>> interest that I need to look for.
>>
>>> Can cdata have duplicates, and are they significant?

>> No, it will not have duplicates.
>>
>>> Are you including the time building that as part of your 2 hour
>>> measurement?

>> No, the 2 hours is just the time to run the
>>
>> cdata[:] = [tup for tup in cdata if determine(tup)]
>>
>>> Is the list sorted in any way?

>> Yes, the list is sorted by tool and datetime.
>>
>>> Chances are your performance bottleneck is the doubly-nested loop. You
>>> have a list comprehension at top-level code, and inside it calls a
>>> function that also loops over the 600,000 items. So the inner loop
>>> gets
>>> executed 360 billion times. You can cut this down drastically by some
>>> judicious sorting, as well as by having a map of lists, where the
>>> map is
>>> keyed by the tool.

>> Thanks. I will try that.
>>
>>>> # record time for each message matching the specified message for
>>>> each tool
>>>> messageTimes = {}
>>> You're building a dictionary; are you actually using the value (1), or
>>> is only the key relevant?

>> Only the keys.
>>
>>> A set is a dict without a value.

>> Yes, I could use a set, but I don't think that would make it
>> measurably faster.
>>
>>> But more mportantly, you never look up anything in this dictionary.
>>> So why
>>> isn't it a list? For that matter, why don't you just use the
>>> messageTimes list?

>> Yes, it could be a list too.
>>>> for row in cdata: # tool, time, message
>>>> if self.message in row[2]:
>>>> messageTimes[row[0], row[1]] = 1
>>>> # now pull out each message that is within the time diff for each
>>>> matched message
>>>> # as well as the matched messages themselves
>>>> def determine(tup):
>>>> if self.message in tup[2]: return True # matched message
>>>> for (tool, date_time) in messageTimes:
>>>> if tool == tup[0]:
>>>> if abs(date_time-tup[1]) <= tdiff:
>>>> return True
>>>> return False
>>>> cdata[:] = [tup for tup in cdata if determine(tup)]
>>>
>>>
>>> As the code exists, there's no need to copy the list. Just do a simple
>>> bind.

>> This statement is to remove the items from cdata that I don't want. I
>> don't know what you mean by bind. I'm not familiar with that python
>> function.
>>
>>>
>>>
>>>> This code works, but it takes way too long to run - e.g. when cdata
>>>> has 600,000 elements (which is typical for my app) it takes 2 hours
>>>> for this to run.
>>>> Can anyone give me some suggestions on speeding this up?

>
>
> This code probably is not faster but it's simpler and may be easier
> for you to work with
> to experiment with speed-improving changes:
>
>
> diffrng = 1
>
> L = [
> # id, time, string
> (1, 5, "ok"),
> (1, 6, "ok"),
> (1, 7, "no"),
> (1, 8, "no"),
> ]
>
> match_times = [t[1] for t in L if "ok" in t[2]]
>
> def in_range(timeval):
> return bool( min([abs(timeval-v) for v in match_times]) <= diffrng )
>
> print([t for t in L if in_range(t[1])])
>
>
> But it really sounds like you could look into optimizing the db
> query and db indexes, etc.
>
>


Actually, it might be slower.. this version of in_range should be better:

def in_range(timeval):
return any( abs(timeval-v) <= diffrng for v in match_times )



--
Lark's Tongue Guide to Python: http://lightbird.net/larks/



All times are GMT. The time now is 05:34 AM.

Powered by vBulletin®. Copyright ©2000 - 2014, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.