Velocity Reviews > Newsgroups > Programming > Python
help with making my code more efficient

 
 
Dave Angel
 
      12-22-2012
On 12/21/2012 11:47 PM, (E-Mail Removed) wrote:
> On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
>> On 12/21/2012 03:36 PM, (E-Mail Removed) wrote:
>>
>>>> <snip>

> I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rows that contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold.
>
> cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look through cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match.
>
> Here's my original code:
>
> # record time for each message matching the specified message for each tool
> messageTimes = {}
> for row in cdata:  # tool, time, message
>     if self.message in row[2]:
>         messageTimes[row[0], row[1]] = 1
>
> # now pull out each message that is within the time diff for each matched message
> # as well as the matched messages themselves
>
> def determine(tup):
>     if self.message in tup[2]: return True  # matched message
>
>     for (tool, date_time) in messageTimes:
>         if tool == tup[0]:
>             if abs(date_time-tup[1]) <= tdiff:
>                 return True
>
>     return False
>
> cdata[:] = [tup for tup in cdata if determine(tup)]
>
> Here's the code now:
>
> # Parse data and make a list of the time for each message matching the specified message for each tool
> messageTimes = defaultdict(list)  # a dict with sorted lists
>
> for row in cdata:  # tool, time, message
>     if self.message in row[2]:
>         messageTimes[row[0]].append(row[1])
>
> # now pull out each message that is within the time context for each matched message
> # as well as the matched messages themselves
>
> # return True if we should keep this message
> def determine(tup):
>     if self.message in tup[2]: return True  # matched message
>     if seconds == 0: return False  # no time context specified
>
>     times = messageTimes[tup[0]]  # get the list of matched times for this tool
>
>     le = bisect.bisect_right(times, tup[1])  # find time less than or equal to tup[1]
>     ge = bisect.bisect_left(times, tup[1])   # find time greater than or equal to tup[1]
>     return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
>
> cdata = [tup for tup in cdata if determine(tup)]
>
> In my test case, cdata started with 600k rows, 30k matched the user's string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold.
>
> I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold.
>
> I hope I've explained this better. Thanks again.


That is better, and the point I missed noticing before is that
messageTimes is substantially smaller than cdata; it's already been
filtered down by looking for self.message in its row[2]. The code was
there, but I didn't make the connection. Remember I was bothered that
you didn't look at tup[2] when you were comparing for time-similarity.
Well, you did that implicitly, since messageTimes was already filtered.
Sorry about that.

That also lowers my expectations for improvement ratio, since instead of
600,000 * 600,000, we're talking "only" 600,000 * 30,000, 5% as much. So
now my expectations are only 4:1 to 10:1.

Still, there's room for improvement. (1) You should only need one
bisect in determine, and (2) if you remember the last result for each
tool, you could speed that one up some.

(1) Instead of getting both an le and a ge, get just one, by searching
for tup[1]-tdiff. Then by comparing that row's value against
tup[1]+tdiff, you can return the boolean immediately, the expression
being about half of the one you've now got.
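A minimal sketch of suggestion (1), using the thread's names (messageTimes, tdiff, determine); the sample data here is made up for illustration:

```python
import bisect

messageTimes = {"tool_a": [10, 50, 90]}   # sorted matched times per tool
tdiff = 5                                  # user's time threshold

def determine(tup):
    tool, t = tup[0], tup[1]
    times = messageTimes.get(tool, [])
    # One bisect: find the first matched time >= t - tdiff ...
    i = bisect.bisect_left(times, t - tdiff)
    # ... and keep the row iff that time is also <= t + tdiff.
    return i != len(times) and times[i] <= t + tdiff

print(determine(("tool_a", 53, "msg")))  # True: 50 is within 5 of 53
print(determine(("tool_a", 70, "msg")))  # False: no matched time in [65, 75]
```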

(2) Make a dict of ints, keyed by the tool, and initialized to zero.
Call that dict "found." Each time you do a bisect, specify a range
starting at found[tool]. Similarly, store the result of the bisect in
found[tool]. This should gradually restrict the range searched by
bisect, which COULD save some time. It works because everything's sorted.
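A sketch of suggestion (2), again with the thread's names plus the proposed "found" dict; this only helps (and is only correct) if cdata is processed in ascending time order within each tool, as the "everything's sorted" caveat says:

```python
import bisect
from collections import defaultdict

messageTimes = {"tool_a": [10, 50, 90, 130]}  # sorted matched times per tool
tdiff = 5
found = defaultdict(int)   # per-tool low bound for bisect, starts at 0

def determine(tup):
    tool, t = tup[0], tup[1]
    times = messageTimes.get(tool, [])
    # Search only from the index where the previous call for this tool left off.
    i = bisect.bisect_left(times, t - tdiff, lo=found[tool])
    found[tool] = i        # narrow the range for the next call
    return i != len(times) and times[i] <= t + tdiff
```

Because t - tdiff never decreases when rows arrive in time order, the stored index is always a safe lower bound for the next search.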

Naturally, make these changes independently, and time the changes. In
particular, it's possible that #2 will degrade performance instead of
improving it. But #1 should double performance, pretty close.

I hope these were clear enough. I don't want to write the changes
myself, since with no test data, I could easily make a mess of it.
.....

I can think of other changes which are less likely to make substantial
improvement, and which might degrade things. For one drastic example,
instead of any messageTimes storage, you could have ("flags") a list of
bool, same size as ctimes, which tracked the state of each line. You
loop through cdata, identifying and marking any line whose tup[2]
matches. And for each match, you scan backwards and forwards till the
time gets out of range (in one direction stopping at time-tdiff or 0 or
tool change, and in the other direction at time+tdiff or len, or
toolchange), marking each one within tdiff.

Then after one pass through cdata, you can have a very simple list
comprehension, something like:

cdata = [tup for index, tup in enumerate(cdata) if flags[index]]

Will it be faster? Depends on the number of expected matches (eg.
30,000 out of 300,000 is 10%), and how much data forward and backwards
you need to scan.
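The "flags" alternative described above can be sketched as follows; it assumes cdata is sorted by (tool, time), and the search string and rows here are invented for illustration:

```python
message = "ERROR"   # the user's string
tdiff = 5           # the time threshold

cdata = [
    ("tool_a", 10, "ok"),        # within 5 of the match below -> kept
    ("tool_a", 12, "ERROR x"),   # match -> kept
    ("tool_a", 15, "ok"),        # within 5 of 12 -> kept
    ("tool_a", 30, "ok"),        # too far -> dropped
    ("tool_b", 13, "ok"),        # different tool -> dropped
]

flags = [False] * len(cdata)
for i, (tool, t, text) in enumerate(cdata):
    if message not in text:
        continue
    flags[i] = True
    # scan backwards until the time goes out of range or the tool changes
    j = i - 1
    while j >= 0 and cdata[j][0] == tool and t - cdata[j][1] <= tdiff:
        flags[j] = True
        j -= 1
    # scan forwards likewise
    j = i + 1
    while j < len(cdata) and cdata[j][0] == tool and cdata[j][1] - t <= tdiff:
        flags[j] = True
        j += 1

cdata = [tup for i, tup in enumerate(cdata) if flags[i]]
print(cdata)  # [('tool_a', 10, 'ok'), ('tool_a', 12, 'ERROR x'), ('tool_a', 15, 'ok')]
```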
.....

A very simple difference? Perhaps using implicit unpacking instead of
using tup[0] etc. will be faster. It'll certainly be easier to read.

for tool, time, text in cdata:
    if self.message in text:
        messageTimes[tool].append(time)

def determine(tool, time, text):

and call it with determine(*tup)


--

DaveA

 
Larry.Martell@gmail.com
      12-24-2012
On Friday, December 21, 2012 11:47:10 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 11:47 PM, (E-Mail Removed) wrote:
> > <snip>
> Still, there's room for improvement. (1) You should only need one
> bisect in determine, and (2) if you remember the last result for each
> tool, you could speed that one up some.
>
> (1) Instead of getting both an le and a ge, get just one, by searching
> for tup[1]-tdiff. Then by comparing that row's value against
> tup[1]+tdiff, you can return the boolean immediately, the expression
> being about half of the one you've now got.


Dave, I cannot thank you enough. With this change it went from 20 minutes to 10.

> (2) Make a dict of ints, keyed by the tool, and initialized to zero.
> Call that dict "found." Each time you do a bisect, specify a range
> starting at found[tool]. Similarly, store the result of the bisect in
> found[tool]. This should gradually restrict the range searched by
> bisect, which COULD save some time. It works because everything's sorted.


And with this change it took 3 minutes. WOW!

Merry Christmas and Happy New Year!!!


 