| Home | Forums | Reviews | Guides | Newsgroups | Register | Search |
![]() |
| Thread Tools |
| Chris Angelico |
|
|
|
| |
|
Roy Smith
Guest
Posts: n/a
|
In article <cc869959-c568-4490-b45f->,
"" <> wrote: > On Thursday, December 20, 2012 5:38:03 PM UTC-7, Chris Angelico wrote: > > On Fri, Dec 21, 2012 at 11:19 AM, > > > > <> wrote: > > > > > This code works, but it takes way too long to run - e.g. when cdata has > > > 600,000 elements (which is typical for my app) it takes 2 hours for this > > > to run. > > > > > > > > > > Can anyone give me some suggestions on speeding this up? > > > > > > > > > > > > > It sounds like you may have enough data to want to not keep it all in > > > > memory. Have you considered switching to a database? You could then > > > > execute SQL queries against it. > > It came from a database. Originally I was getting just the data I wanted > using SQL, but that was taking too long also. I was selecting just the > messages I wanted, then for each one of those doing another query to get the > data within the time diff of each. That was resulting in tens of thousands of > queries. So I changed it to pull all the potential matches at once and then > process it in python. If you're doing free-text matching, an SQL database may not be the right tool. I suspect you want to be looking at some kind of text search engine, such as http://lucene.apache.org/ or http://xapian.org/. |
|
|
|
|
|||
|
|||
| Roy Smith |
|
|
|
| |
|
Dave Angel
Guest
Posts: n/a
|
On 12/20/2012 08:46 PM, wrote:
> On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote: >> <snip> > Of course it's a fragment - it's part of a large program and I was just showing the relevant parts. But it seems these are methods in a class, or something, so we're missing context. And you use self without it being an argument to the function. Like it's a global. > <snip> > Yes, the code works. I end up with just the rows I want. >> Are you only concerned about speed, not fixing features? > Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long. > >> As far as I can tell, the logic that includes the time comparison is bogus. > Not at all. > >> You don't do anything there to worry about the value of tup[2], just whether some >> item has a nearby time. Of course, I could misunderstand the spec. > The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta() I thought that tup[1] was the datetime. In any case, the loop makes no sense to me, so I can't really optimize it, just make suggestions. > >> Are you making a global called 'self' ? That name is by convention only >> used in methods to designate the instance object. What's the attribute >> self? > Yes, self is my instance object. self.message contains the string of interest that I need to look for. > >> Can cdata have duplicates, and are they significant? > No, it will not have duplicates. > >> Is the list sorted in any way? > Yes, the list is sorted by tool and datetime. > >> Chances are your performance bottleneck is the doubly-nested loop. You >> have a list comprehension at top-level code, and inside it calls a >> function that also loops over the 600,000 items. So the inner loop gets >> executed 360 billion times. You can cut this down drastically by some >> judicious sorting, as well as by having a map of lists, where the map is >> keyed by the tool. > Thanks. I will try that. So in your first loop, you could simply split the list into separate lists, one per tup[0] value, and the lists as dictionary items, keyed by that tool string. Then inside the determine() function, make a local ref to the particular list for the tool. recs = messageTimes[tup[0]] Instead of a for loop over recs, use a binary search to identify the first item that's >= date_time-tdiff. Then if it's less than date_time+tdiff, return True, otherwise False. Check out the bisect module. Function bisect_left() should do what you want in a sorted list. >>> cdata[:] = [tup for tup in cdata if determine(tup)] >> >> >> As the code exists, there's no need to copy the list. Just do a simple >> bind. > This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function. Every "assignment" to a simple name is really a rebinding of that name. cdata = [tup for tup in cdata if determine(tup)] will rebind the name to the new object, much quicker than copying. If this is indeed a top-level line, it should be equivalent. But if in fact this is inside some other function, it may violate some other assumptions. In particular, if there are other names for the same object, then you're probably stuck with modifying it in place, using slice notation. BTW, a set is generally much more memory efficient than a dict, when you don't use the "value". But since I think you'll be better off with a dict of lists, it's a moot point. -- DaveA |
|
|
|
|
|||
|
|||
| Dave Angel |
|
Larry.Martell@gmail.com
Guest
Posts: n/a
|
On Thursday, December 20, 2012 8:31:18 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 08:46 PM, wrote: > > > On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote: > > >> <snip> > > > Of course it's a fragment - it's part of a large program and I was just showing the relevant parts. > > But it seems these are methods in a class, or something, so we're > missing context. And you use self without it being an argument to the > function. Like it's a global. I didn't show the entire method, only what I thought was relevant to my issue. The method is declared as: def generate_data(self): > > > <snip> > > > Yes, the code works. I end up with just the rows I want. > > >> Are you only concerned about speed, not fixing features? > > > Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long. > > > > > >> As far as I can tell, the logic that includes the time comparison is bogus. > > > Not at all. > > > > > >> You don't do anything there to worry about the value of tup[2], just whether some > > >> item has a nearby time. Of course, I could misunderstand the spec. > > > The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta() > > I thought that tup[1] was the datetime. In any case, the loop makes no > sense to me, so I can't really optimize it, just make suggestions. Yes, tup[1] is the datetime. I mistyped last night. > >> Are you making a global called 'self' ? That name is by convention only > > >> used in methods to designate the instance object. What's the attribute > > >> self? > > > Yes, self is my instance object. self.message contains the string of interest that I need to look for. > > > > > >> Can cdata have duplicates, and are they significant? > > > No, it will not have duplicates. > > > > > >> Is the list sorted in any way? > > > Yes, the list is sorted by tool and datetime. > > > > > >> Chances are your performance bottleneck is the doubly-nested loop. You > > >> have a list comprehension at top-level code, and inside it calls a > > >> function that also loops over the 600,000 items. So the inner loop gets > > >> executed 360 billion times. You can cut this down drastically by some > > >> judicious sorting, as well as by having a map of lists, where the map is > > >> keyed by the tool. > > > Thanks. I will try that. > > > > So in your first loop, you could simply split the list into separate > lists, one per tup[0] value, and the lists as dictionary items, keyed by > that tool string. > > Then inside the determine() function, make a local ref to the particular > > list for the tool. > > recs = messageTimes[tup[0]] I made that change ant went from taking over 2 hours to 54 minutes. A dramatic improvement, but still not adequate for my app. > Instead of a for loop over recs, use a binary search to identify the > first item that's >= date_time-tdiff. Then if it's less than > date_time+tdiff, return True, otherwise False. Check out the bisect > module. Function bisect_left() should do what you want in a sorted list. Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. This was the code I had: times = messageTimes[tup[0]] le = bisect.bisect_right(times, tup[1]) ge = bisect.bisect_left(times, tup[1]) return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff) > >>> cdata[:] = [tup for tup in cdata if determine(tup)] > > >> As the code exists, there's no need to copy the list. Just do a simple > >> bind. > > > This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function. > > Every "assignment" to a simple name is really a rebinding of that name. > cdata = [tup for tup in cdata if determine(tup)] > > will rebind the name to the new object, much quicker than copying. If > this is indeed a top-level line, it should be equivalent. But if in > fact this is inside some other function, it may violate some other > assumptions. In particular, if there are other names for the same > object, then you're probably stuck with modifying it in place, using > slice notation. The slice notation was left over when when cdata was a tuple. Now that it's a list I don't need that any more. > BTW, a set is generally much more memory efficient than a dict, when you > don't use the "value". But since I think you'll be better off with a > dict of lists, it's a moot point. I'm going back to square 1 and try and do all from SQL. |
|
|
|
|
|||
|
|||
| Larry.Martell@gmail.com |
|
Larry.Martell@gmail.com
Guest
Posts: n/a
|
On Thursday, December 20, 2012 8:31:18 PM UTC-7, Dave Angel wrote:
> On 12/20/2012 08:46 PM, wrote: > > > On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote: > > >> <snip> > > > Of course it's a fragment - it's part of a large program and I was just showing the relevant parts. > > But it seems these are methods in a class, or something, so we're > missing context. And you use self without it being an argument to the > function. Like it's a global. I didn't show the entire method, only what I thought was relevant to my issue. The method is declared as: def generate_data(self): > > > <snip> > > > Yes, the code works. I end up with just the rows I want. > > >> Are you only concerned about speed, not fixing features? > > > Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long. > > > > > >> As far as I can tell, the logic that includes the time comparison is bogus. > > > Not at all. > > > > > >> You don't do anything there to worry about the value of tup[2], just whether some > > >> item has a nearby time. Of course, I could misunderstand the spec. > > > The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta() > > I thought that tup[1] was the datetime. In any case, the loop makes no > sense to me, so I can't really optimize it, just make suggestions. Yes, tup[1] is the datetime. I mistyped last night. > >> Are you making a global called 'self' ? That name is by convention only > > >> used in methods to designate the instance object. What's the attribute > > >> self? > > > Yes, self is my instance object. self.message contains the string of interest that I need to look for. > > > > > >> Can cdata have duplicates, and are they significant? > > > No, it will not have duplicates. > > > > > >> Is the list sorted in any way? > > > Yes, the list is sorted by tool and datetime. > > > > > >> Chances are your performance bottleneck is the doubly-nested loop. You > > >> have a list comprehension at top-level code, and inside it calls a > > >> function that also loops over the 600,000 items. So the inner loop gets > > >> executed 360 billion times. You can cut this down drastically by some > > >> judicious sorting, as well as by having a map of lists, where the map is > > >> keyed by the tool. > > > Thanks. I will try that. > > > > So in your first loop, you could simply split the list into separate > lists, one per tup[0] value, and the lists as dictionary items, keyed by > that tool string. > > Then inside the determine() function, make a local ref to the particular > > list for the tool. > > recs = messageTimes[tup[0]] I made that change ant went from taking over 2 hours to 54 minutes. A dramatic improvement, but still not adequate for my app. > Instead of a for loop over recs, use a binary search to identify the > first item that's >= date_time-tdiff. Then if it's less than > date_time+tdiff, return True, otherwise False. Check out the bisect > module. Function bisect_left() should do what you want in a sorted list. Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. This was the code I had: times = messageTimes[tup[0]] le = bisect.bisect_right(times, tup[1]) ge = bisect.bisect_left(times, tup[1]) return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff) > >>> cdata[:] = [tup for tup in cdata if determine(tup)] > > >> As the code exists, there's no need to copy the list. Just do a simple > >> bind. > > > This statement is to remove the items from cdata that I don't want. I don't know what you mean by bind. I'm not familiar with that python function. > > Every "assignment" to a simple name is really a rebinding of that name. > cdata = [tup for tup in cdata if determine(tup)] > > will rebind the name to the new object, much quicker than copying. If > this is indeed a top-level line, it should be equivalent. But if in > fact this is inside some other function, it may violate some other > assumptions. In particular, if there are other names for the same > object, then you're probably stuck with modifying it in place, using > slice notation. The slice notation was left over when when cdata was a tuple. Now that it's a list I don't need that any more. > BTW, a set is generally much more memory efficient than a dict, when you > don't use the "value". But since I think you'll be better off with a > dict of lists, it's a moot point. I'm going back to square 1 and try and do all from SQL. |
|
|
|
|
|||
|
|||
| Larry.Martell@gmail.com |
|
Larry.Martell@gmail.com
Guest
Posts: n/a
|
On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wrote:
> On Thursday, December 20, 2012 8:31:18 PM UTC-7, Dave Angel wrote: > > On 12/20/2012 08:46 PM, wrote: > > > On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote: > > >> <snip> > > > Of course it's a fragment - it's part of a large program and I was just showing the relevant parts. > > But it seems these are methods in a class, or something, so we're > > missing context. And you use self without it being an argument to the > > function. Like it's a global. > I didn't show the entire method, only what I thought was relevant to my issue. The method is declared as: > > def generate_data(self): > > > <snip> > > > Yes, the code works. I end up with just the rows I want. > > >> Are you only concerned about speed, not fixing features? > > > Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long. > > >> As far as I can tell, the logic that includes the time comparison isbogus. > > > Not at all. > > >> You don't do anything there to worry about the value of tup[2], just whether some > > >> item has a nearby time. Of course, I could misunderstand the spec. > > > The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta() > > I thought that tup[1] was the datetime. In any case, the loop makes no > > sense to me, so I can't really optimize it, just make suggestions. > Yes, tup[1] is the datetime. I mistyped last night. > > >> Are you making a global called 'self' ? That name is by convention only > > >> used in methods to designate the instance object. What's the attribute > > >> self? > > > Yes, self is my instance object. self.message contains the string of interest that I need to look for. > > >> Can cdata have duplicates, and are they significant? > > > No, it will not have duplicates. > > >> Is the list sorted in any way? > > > Yes, the list is sorted by tool and datetime. > > >> Chances are your performance bottleneck is the doubly-nested loop. You > > >> have a list comprehension at top-level code, and inside it calls a > > >> function that also loops over the 600,000 items. So the inner loop gets > > >> executed 360 billion times. You can cut this down drastically by some > > >> judicious sorting, as well as by having a map of lists, where the map is > > >> keyed by the tool. > > > Thanks. I will try that. > > So in your first loop, you could simply split the list into separate > > lists, one per tup[0] value, and the lists as dictionary items, keyed by > > that tool string. > > Then inside the determine() function, make a local ref to the particular > > list for the tool. > > recs = messageTimes[tup[0]] > I made that change ant went from taking over 2 hours to 54 minutes. A dramatic improvement, but still not adequate for my app. > > Instead of a for loop over recs, use a binary search to identify the > > first item that's >= date_time-tdiff. Then if it's less than > > date_time+tdiff, return True, otherwise False. Check out the bisect > > module. Function bisect_left() should do what you want in a sorted list. > Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused allthe memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. Thanks very much for all your help. > This was the code I had: > > times = messageTimes[tup[0]] > > le = bisect.bisect_right(times, tup[1]) > > ge = bisect.bisect_left(times, tup[1]) > > return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff) > > > > > >>> cdata[:] = [tup for tup in cdata if determine(tup)] > > > > > > >> As the code exists, there's no need to copy the list. Just do a simple > > > >> bind. > > > > > > > This statement is to remove the items from cdata that I don't want. Idon't know what you mean by bind. I'm not familiar with that python function. > > > > > > Every "assignment" to a simple name is really a rebinding of that name. > > > cdata = [tup for tup in cdata if determine(tup)] > > > > > > will rebind the name to the new object, much quicker than copying. If > > > this is indeed a top-level line, it should be equivalent. But if in > > > fact this is inside some other function, it may violate some other > > > assumptions. In particular, if there are other names for the same > > > object, then you're probably stuck with modifying it in place, using > > > slice notation. > > > > The slice notation was left over when when cdata was a tuple. Now that it's a list I don't need that any more. > > > > > BTW, a set is generally much more memory efficient than a dict, when you > > > don't use the "value". But since I think you'll be better off with a > > > dict of lists, it's a moot point. > > > > I'm going back to square 1 and try and do all from SQL. |
|
|
|
|
|||
|
|||
| Larry.Martell@gmail.com |
|
Larry.Martell@gmail.com
Guest
Posts: n/a
|
On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wrote:
> On Thursday, December 20, 2012 8:31:18 PM UTC-7, Dave Angel wrote: > > On 12/20/2012 08:46 PM, wrote: > > > On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote: > > >> <snip> > > > Of course it's a fragment - it's part of a large program and I was just showing the relevant parts. > > But it seems these are methods in a class, or something, so we're > > missing context. And you use self without it being an argument to the > > function. Like it's a global. > I didn't show the entire method, only what I thought was relevant to my issue. The method is declared as: > > def generate_data(self): > > > <snip> > > > Yes, the code works. I end up with just the rows I want. > > >> Are you only concerned about speed, not fixing features? > > > Don't know what you mean by 'fixing features'. The code does what I want, it just takes too long. > > >> As far as I can tell, the logic that includes the time comparison isbogus. > > > Not at all. > > >> You don't do anything there to worry about the value of tup[2], just whether some > > >> item has a nearby time. Of course, I could misunderstand the spec. > > > The data comes from a database. tup[2] is a datetime column. tdiff comes from a datetime.timedelta() > > I thought that tup[1] was the datetime. In any case, the loop makes no > > sense to me, so I can't really optimize it, just make suggestions. > Yes, tup[1] is the datetime. I mistyped last night. > > >> Are you making a global called 'self' ? That name is by convention only > > >> used in methods to designate the instance object. What's the attribute > > >> self? > > > Yes, self is my instance object. self.message contains the string of interest that I need to look for. > > >> Can cdata have duplicates, and are they significant? > > > No, it will not have duplicates. > > >> Is the list sorted in any way? > > > Yes, the list is sorted by tool and datetime. > > >> Chances are your performance bottleneck is the doubly-nested loop. You > > >> have a list comprehension at top-level code, and inside it calls a > > >> function that also loops over the 600,000 items. So the inner loop gets > > >> executed 360 billion times. You can cut this down drastically by some > > >> judicious sorting, as well as by having a map of lists, where the map is > > >> keyed by the tool. > > > Thanks. I will try that. > > So in your first loop, you could simply split the list into separate > > lists, one per tup[0] value, and the lists as dictionary items, keyed by > > that tool string. > > Then inside the determine() function, make a local ref to the particular > > list for the tool. > > recs = messageTimes[tup[0]] > I made that change ant went from taking over 2 hours to 54 minutes. A dramatic improvement, but still not adequate for my app. > > Instead of a for loop over recs, use a binary search to identify the > > first item that's >= date_time-tdiff. Then if it's less than > > date_time+tdiff, return True, otherwise False. Check out the bisect > > module. Function bisect_left() should do what you want in a sorted list. > Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused allthe memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. Thanks very much for all your help. > This was the code I had: > > times = messageTimes[tup[0]] > > le = bisect.bisect_right(times, tup[1]) > > ge = bisect.bisect_left(times, tup[1]) > > return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff) > > > > > >>> cdata[:] = [tup for tup in cdata if determine(tup)] > > > > > > >> As the code exists, there's no need to copy the list. Just do a simple > > > >> bind. > > > > > > > This statement is to remove the items from cdata that I don't want. Idon't know what you mean by bind. I'm not familiar with that python function. > > > > > > Every "assignment" to a simple name is really a rebinding of that name. > > > cdata = [tup for tup in cdata if determine(tup)] > > > > > > will rebind the name to the new object, much quicker than copying. If > > > this is indeed a top-level line, it should be equivalent. But if in > > > fact this is inside some other function, it may violate some other > > > assumptions. In particular, if there are other names for the same > > > object, then you're probably stuck with modifying it in place, using > > > slice notation. > > > > The slice notation was left over when when cdata was a tuple. Now that it's a list I don't need that any more. > > > > > BTW, a set is generally much more memory efficient than a dict, when you > > > don't use the "value". But since I think you'll be better off with a > > > dict of lists, it's a moot point. > > > > I'm going back to square 1 and try and do all from SQL. |
|
|
|
|
|||
|
|||
| Larry.Martell@gmail.com |
|
Dave Angel
Guest
Posts: n/a
|
On 12/21/2012 03:36 PM, wrote:
> On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wrote: >> <snip> >> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. > The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. > > Thanks very much for all your help. > > Congratulations. But you're not done. A fourfold improvement isn't nearly as much as I'd expect. When you go from a n-squared algorithm to an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold improvement. Assuming we're talking just about time spent in the cdata = list-comprehension list. It's also probably possible to simplify, and get some speedup from other parts of the code other than that one loop. For example, if the bisect is really working right, it's probably unnecessary to even copy the original cdata. You said it was sorted. So bisect it directly, and skip the messageTimes intermediate data structure. It may not help, but i suspect it will. > >> This was the code I had: >> >> times = messageTimes[tup[0]] >> >> le = bisect.bisect_right(times, tup[1]) >> >> ge = bisect.bisect_left(times, tup[1]) >> >> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff) I'm stymied making further suggestions to this fragment. Without seeing the changes you apparently made to messageTimes, I can't even be sure this is equivalent. And I wonder if even looking up tup[1] is worthwhile, since your caller already knows the index in cdata. If you used cdata directly, you might not need a sort at all. I wish you could post some sample data, and identify which records are intended to be True. Obviously you could simplify, and just use ints where it's using datetimes here. But if your rule is simply accept all records that come in a cluster with no delta more than a threshold, then you don't even need any search at all. Just pass the index in, and compare the datetime with its predecessor and successor. Could we see the whole fragment as you originally had it, but with the optimizations you've done so far? The more I look at that determine() function, the more futile it seems. I just don't understand what the criteria are supposed to be. Your list-comp loops through all of cdata, and for each item, passes that item to determine(). determine() loops through a copy of cdata, checking to see if any of them is within tdiff of tup. But won't that always return True, since you'll encounter the record and compare it to itself? -- DaveA |
|
|
|
|
|||
|
|||
| Dave Angel |
|
Larry.Martell@gmail.com
Guest
Posts: n/a
|
On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 03:36 PM, wrote: > > > On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wrote: > > >> <snip> > > >> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. > > > The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow causedall the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. > > > > > > Thanks very much for all your help. > > Congratulations. But you're not done. A fourfold improvement isn't > nearly as much as I'd expect. When you go from a n-squared algorithm to > an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold > improvement. Assuming we're talking just about time spent in the cdata > = list-comprehension list. > > It's also probably possible to simplify, and get some speedup from other > parts of the code other than that one loop. For example, if the bisect > is really working right, it's probably unnecessary to even copy the > original cdata. You said it was sorted. So bisect it directly, and > skip the messageTimes intermediate data structure. It may not help, but > i suspect it will. > > > > >> This was the code I had: > >> > >> times = messageTimes[tup[0]] > > >> > > >> le = bisect.bisect_right(times, tup[1]) > >> ge = bisect.bisect_left(times, tup[1]) > >> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff) > > > > I'm stymied making further suggestions to this fragment. Without seeing > the changes you apparently made to messageTimes, I can't even be sure > this is equivalent. And I wonder if even looking up tup[1] is > worthwhile, since your caller already knows the index in cdata. If you > used cdata directly, you might not need a sort at all. > > I wish you could post some sample data, and identify which records are > intended to be True. Obviously you could simplify, and just use ints > where it's using datetimes here. But if your rule is simply accept all > records that come in a cluster with no delta more than a threshold, then > you don't even need any search at all. Just pass the index in, and > compare the datetime with its predecessor and successor. > > Could we see the whole fragment as you originally had it, but with the > optimizations you've done so far? The more I look at that determine() > function, the more futile it seems. I just don't understand what the > criteria are supposed to be. Your list-comp loops through all of cdata, > and for each item, passes that item to determine(). determine() loops > through a copy of cdata, checking to see if any of them is within tdiff > of tup. But won't that always return True, since you'll encounter the > record and compare it to itself? I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rowsthat contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold. cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look though cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match. Here's my original code: # record time for each message matching the specified message for each tool messageTimes = {} for row in cdata: # tool, time, message if self.message in row[2]: messageTimes[row[0], row[1]] = 1 # now pull out each message that is within the time diff for each matched message # as well as the matched messages themselves def determine(tup): if self.message in tup[2]: return True # matched message for (tool, date_time) in messageTimes: if tool == tup[0]: if abs(date_time-tup[1]) <= tdiff: return True return False cdata[:] = [tup for tup in cdata if determine(tup)] Here's the code now: # Parse data and make a list of the time for each message matching the specified message for each tool messageTimes = defaultdict(list) # a dict with sorted lists for row in cdata: # tool, time, message if self.message in row[2]: messageTimes[row[0]].append(row[1]) # now pull out each message that is within the time context for each matched message # as well as the matched messages themselves # return true is we should keep this message def determine(tup): if self.message in tup[2]: return True # matched message if seconds == 0: return False # no time context specified times = messageTimes[tup[0]] # get the list of matched messages for this tool le = bisect.bisect_right(times, tup[1]) # find time less than or equal to tup[1] ge = bisect.bisect_left(times, tup[1]) # find time greaterthen to equal to tup[1] return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff) cdata = [tup for tup in cdata if determine(tup)] In my test case, cdata started with 600k rows, 30k matched the users string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold. I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold. I hope I've explained this better. Thanks again. |
|
|
|
|
|||
|
|||
| Larry.Martell@gmail.com |
|
Larry.Martell@gmail.com
Guest
Posts: n/a
|
On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 03:36 PM, wrote: > > > On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry....@gmail.com wrote: > > >> <snip> > > >> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. > > > The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow causedall the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. > > > > > > Thanks very much for all your help. > > Congratulations. But you're not done. A fourfold improvement isn't > nearly as much as I'd expect. When you go from a n-squared algorithm to > an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold > improvement. Assuming we're talking just about time spent in the cdata > = list-comprehension list. > > It's also probably possible to simplify, and get some speedup from other > parts of the code other than that one loop. For example, if the bisect > is really working right, it's probably unnecessary to even copy the > original cdata. You said it was sorted. So bisect it directly, and > skip the messageTimes intermediate data structure. It may not help, but > i suspect it will. > > > > >> This was the code I had: > >> > >> times = messageTimes[tup[0]] > > >> > > >> le = bisect.bisect_right(times, tup[1]) > >> ge = bisect.bisect_left(times, tup[1]) > >> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff) > > > > I'm stymied making further suggestions to this fragment. Without seeing > the changes you apparently made to messageTimes, I can't even be sure > this is equivalent. And I wonder if even looking up tup[1] is > worthwhile, since your caller already knows the index in cdata. If you > used cdata directly, you might not need a sort at all. > > I wish you could post some sample data, and identify which records are > intended to be True. Obviously you could simplify, and just use ints > where it's using datetimes here. But if your rule is simply accept all > records that come in a cluster with no delta more than a threshold, then > you don't even need any search at all. Just pass the index in, and > compare the datetime with its predecessor and successor. > > Could we see the whole fragment as you originally had it, but with the > optimizations you've done so far? The more I look at that determine() > function, the more futile it seems. I just don't understand what the > criteria are supposed to be. Your list-comp loops through all of cdata, > and for each item, passes that item to determine(). determine() loops > through a copy of cdata, checking to see if any of them is within tdiff > of tup. But won't that always return True, since you'll encounter the > record and compare it to itself? I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rowsthat contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold. cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look though cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match. Here's my original code: # record time for each message matching the specified message for each tool messageTimes = {} for row in cdata: # tool, time, message if self.message in row[2]: messageTimes[row[0], row[1]] = 1 # now pull out each message that is within the time diff for each matched message # as well as the matched messages themselves def determine(tup): if self.message in tup[2]: return True # matched message for (tool, date_time) in messageTimes: if tool == tup[0]: if abs(date_time-tup[1]) <= tdiff: return True return False cdata[:] = [tup for tup in cdata if determine(tup)] Here's the code now: # Parse data and make a list of the time for each message matching the specified message for each tool messageTimes = defaultdict(list) # a dict with sorted lists for row in cdata: # tool, time, message if self.message in row[2]: messageTimes[row[0]].append(row[1]) # now pull out each message that is within the time context for each matched message # as well as the matched messages themselves # return true is we should keep this message def determine(tup): if self.message in tup[2]: return True # matched message if seconds == 0: return False # no time context specified times = messageTimes[tup[0]] # get the list of matched messages for this tool le = bisect.bisect_right(times, tup[1]) # find time less than or equal to tup[1] ge = bisect.bisect_left(times, tup[1]) # find time greaterthen to equal to tup[1] return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff) cdata = [tup for tup in cdata if determine(tup)] In my test case, cdata started with 600k rows, 30k matched the users string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold. I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold. I hope I've explained this better. Thanks again. |
|
|
|
|
|||
|
|||
| Larry.Martell@gmail.com |
|
|
|
| |
![]() |
| Thread Tools | |
|
|
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Making code more efficient and effective | cokofreedom@gmail.com | Python | 3 | 06-26-2008 03:53 PM |
| I could use some help making this Python code run faster using only Python code. | Python Maniac | Python | 24 | 09-23-2007 05:48 PM |
| Writing more efficient code | gonzlobo | Python | 13 | 01-03-2007 06:27 AM |
| Re: More efficient RFC 1867 uploads? | Richard K Bethell | ASP .Net | 0 | 07-14-2003 05:32 PM |
| Re: More efficient RFC 1867 uploads? | John Timney \(Microsoft MVP\) | ASP .Net | 1 | 07-11-2003 08:16 PM |
Powered by vBulletin®. Copyright ©2000 - 2013, vBulletin Solutions, Inc..
SEO by vBSEO ©2010, Crawlability, Inc. |




