Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Python (http://www.velocityreviews.com/forums/f43-python.html)
-   -   Interesting list() un-optimization (http://www.velocityreviews.com/forums/t958442-interesting-list-un-optimization.html)

Roy Smith 03-07-2013 03:20 AM

Interesting list() un-optimization
 
I stumbled upon an interesting bit of trivia concerning lists and list
comprehensions today.

We use mongoengine as a database model layer. A mongoengine query
returns an iterable object called a QuerySet. The "obvious" way to
create a list of the query results would be:

my_objects = list(my_query_set)

and, indeed, that works. But, then I found this code:

my_objects = [obj for obj in my_query_set]

which seemed a bit silly. I called over the guy who wrote it and asked
him why he didn't just write it using list(). I was astounded when it
turned out there's a good reason!

Apparently, list() has an "optimization" where it calls len() on its
argument to try and discover the number of items it's going to put into
the list. Presumably, list() uses this information to pre-allocate the
right amount of memory the first time, without any resizing. If len()
fails, it falls back to just iterating and resizing as needed.
Normally, this would be a win.

The problem is, QuerySets have a __len__() method. Calling it is a lot
faster than iterating over the whole query set and counting the items,
but it does result in an additional database query, which is a lot
slower than the list resizing! Writing the code as a list comprehension
prevents list() from trying to optimize when it shouldn't!

Dave Angel 03-07-2013 03:38 AM

Re: Interesting list() un-optimization
 
On 03/06/2013 10:20 PM, Roy Smith wrote:
> I stumbled upon an interesting bit of trivia concerning lists and list
> comprehensions today.
>
> We use mongoengine as a database model layer. A mongoengine query
> returns an iterable object called a QuerySet. The "obvious" way to
> create a list of the query results would be:
>
> my_objects = list(my_query_set)
>
> and, indeed, that works. But, then I found this code:
>
> my_objects = [obj for obj in my_query_set]
>
> which seemed a bit silly. I called over the guy who wrote it and asked
> him why he didn't just write it using list(). I was astounded when it
> turned out there's a good reason!
>
> Apparently, list() has an "optimization" where it calls len() on its
> argument to try and discover the number of items it's going to put into
> the list. Presumably, list() uses this information to pre-allocate the
> right amount of memory the first time, without any resizing. If len()
> fails, it falls back to just iterating and resizing as needed.
> Normally, this would be a win.
>
> The problem is, QuerySets have a __len__() method. Calling it is a lot
> faster than iterating over the whole query set and counting the items,
> but it does result in an additional database query, which is a lot
> slower than the list resizing! Writing the code as a list comprehension
> prevents list() from trying to optimize when it shouldn't!
>


That is very interesting. list() assumes the __len__() method would be
very quick.

Perhaps list() should take an optional second argument that specifies
the initial length to allocate. That way code that either doesn't want
__len__() to be used, or that already knows a reasonable number to use,
can supply the value to preallocate.

--
DaveA

Tim Chase 03-07-2013 03:57 AM

Re: Interesting list() un-optimization
 
On 2013-03-06 22:20, Roy Smith wrote:
> I stumbled upon an interesting bit of trivia concerning lists and
> list comprehensions today.


I agree with Dave Angel that this is interesting. A little testing
shows that this can be rewritten as

my_objects = list(iter(my_query_set))

which seems to then skip the costly __len__ call. Performance geeks
are welcome to time it against the list-comprehension version :-)

-tkc


class Foo(object):
def __init__(self):
self.items = range(10)
def __iter__(self):
return iter(self.items)
def __len__(self):
print "Calling costly __len__"
return len(self.items)

print "Ensuring we can iterate over it:"
for x in Foo():
print x

print "\nJust call list():"
lst = list(Foo())
print lst

print "\nCall list(iter())"
lst = list(iter(Foo()))
print lst

Kev Dwyer 03-07-2013 07:31 AM

Re: Interesting list() un-optimization
 
Roy Smith wrote:

> I stumbled upon an interesting bit of trivia concerning lists and list
> comprehensions today.
>
> We use mongoengine as a database model layer. A mongoengine query
> returns an iterable object called a QuerySet. The "obvious" way to
> create a list of the query results would be:
>
> my_objects = list(my_query_set)
>
> and, indeed, that works. But, then I found this code:
>
> my_objects = [obj for obj in my_query_set]
>
> which seemed a bit silly. I called over the guy who wrote it and asked
> him why he didn't just write it using list(). I was astounded when it
> turned out there's a good reason!
>
> Apparently, list() has an "optimization" where it calls len() on its
> argument to try and discover the number of items it's going to put into
> the list. Presumably, list() uses this information to pre-allocate the
> right amount of memory the first time, without any resizing. If len()
> fails, it falls back to just iterating and resizing as needed.
> Normally, this would be a win.
>
> The problem is, QuerySets have a __len__() method. Calling it is a lot
> faster than iterating over the whole query set and counting the items,
> but it does result in an additional database query, which is a lot
> slower than the list resizing! Writing the code as a list comprehension
> prevents list() from trying to optimize when it shouldn't!



Interesting discovery. Yet isn't this as much an issue with the mongoengine
library as with list()? Queryset.count() can be called if the "length" of a
resultset needs to be retrieved, so the __len__() methd seems redundant.
And given that it's not unheard of to call list() on iterables, perhaps the
library designers should either optimise the __len__() method, or document
the performance implications of calling list on the queryset?

Anyway, thanks for this thought-provoking post.

Cheers,

Kev


Wolfgang Maier 03-07-2013 11:22 AM

Re: Interesting list() un-optimization
 
Tim Chase <python.list <at> tim.thechases.com> writes:

> On 2013-03-06 22:20, Roy Smith wrote:
> > I stumbled upon an interesting bit of trivia concerning lists and
> > list comprehensions today.

>
> A little testing
> shows that this can be rewritten as
>
> my_objects = list(iter(my_query_set))
>
> which seems to then skip the costly __len__ call. Performance geeks
> are welcome to time it against the list-comprehension version
>
> class Foo(object):
> def __init__(self):
> self.items = range(10)
> def __iter__(self):
> return iter(self.items)
> def __len__(self):
> print "Calling costly __len__"
> return len(self.items)
>


Well, it skips the costly len() call because your iter(Foo()) returns
iter(range()) under the hood and list() uses that object's __len__() method. In
most cases, such a workaround will not be feasible. Why should iter(QuerySet())
have a faster __len__() method defined than QuerySet() itself. Most likely,
iter(QuerySet()) just returns self anyway?
Best,
Wolfgang




Ian Kelly 03-07-2013 04:00 PM

Re: Interesting list() un-optimization
 
On Thu, Mar 7, 2013 at 4:22 AM, Wolfgang Maier
<wolfgang.maier@biologie.uni-freiburg.de> wrote:
> Well, it skips the costly len() call because your iter(Foo()) returns
> iter(range()) under the hood and list() uses that object's __len__() method.


Iterators do not generally have __len__ methods.

>>> len(iter(range(10)))

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: object of type 'range_iterator' has no len()

> In
> most cases, such a workaround will not be feasible. Why should iter(QuerySet())
> have a faster __len__() method defined than QuerySet() itself.


iter(QuerySet()) should not have any __len__ method defined at all,
which is why the optimization would be skipped.

> Most likely,
> iter(QuerySet()) just returns self anyway?


But on this point, you are correct. The mongoengine QuerySet.__iter__
method is defined as:

def __iter__(self):
self.rewind()
return self

This is unfortunate design. Not only does it mean that the iterator's
__len__ method cannot be trusted (what should the __len__ of a
partially exhausted iterator return?), but it also means that requesting
an iterator over the QuerySet will also silently invalidate any
existing iterators.

Ian Kelly 03-07-2013 05:31 PM

Re: Interesting list() un-optimization
 
On Thu, Mar 7, 2013 at 9:20 AM, Christian Heimes <christian@python.org> wrote:
> Am 07.03.2013 17:00, schrieb Ian Kelly:
>> On Thu, Mar 7, 2013 at 4:22 AM, Wolfgang Maier
>> <wolfgang.maier@biologie.uni-freiburg.de> wrote:
>>> Well, it skips the costly len() call because your iter(Foo()) returns
>>> iter(range()) under the hood and list() uses that object's __len__() method.

>>
>> Iterators do not generally have __len__ methods.
>>
>>>>> len(iter(range(10)))

>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> TypeError: object of type 'range_iterator' has no len()

>
> But iterators have a length hint method that are used for some
> optimizations and preallocations, too.
>
>>>> i = iter(range(10))
>>>> i.__length_hint__()

> 10
>
> See http://www.python.org/dev/peps/pep-0424/


Didn't know about that, thanks. Presumably a proper iter(QuerySet())
object could implement __length_hint__ in an efficient manner rather
than by just calling the __len__ of the underlying QuerySet,

Ian Kelly 03-07-2013 08:26 PM

Re: Interesting list() un-optimization
 
On Thu, Mar 7, 2013 at 12:19 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
>> Didn't know about that, thanks. Presumably a proper iter(QuerySet())
>> object could implement __length_hint__ in an efficient manner rather
>> than by just calling the __len__ of the underlying QuerySet,

>
> And how exactly would it do that, without either doing what __len__ does or
> reading the whole result set into memory?


If the underlying cursor provides its own efficient length hint, it
could return that. Or if the query is result-limited, use the limit
as a length hint, provided it's not absurdly large. And if you really
can't efficiently determine anything about the length of the result
set at all, you can always fall back on returning NotImplemented.

Terry Reedy 03-07-2013 08:29 PM

Re: Interesting list() un-optimization
 
On 3/7/2013 11:00 AM, Ian Kelly wrote:

> But on this point, you are correct. The mongoengine QuerySet.__iter__
> method is defined as:
>
> def __iter__(self):
> self.rewind()
> return self
>
> This is unfortunate design. Not only does it mean that the iterator's
> __len__ method cannot be trusted (what should the __len__ of a
> partially exhausted iterator return?), but it also means that requesting
> an iterator over the QuerySet will also silently invalidate any
> existing iterators.


I view that design as a violation of the iterator protocol and hence a
program bug. __iter__ should either *just* return self (if the self is
an iterator) or return a new object (if self is a non-iterator
iterable). File objects are iterators and .__iter__ does not rewind.

>>> f = open("f:/python/mypy/tem.py")
>>> next(f)

'class myit(list):\n'
>>> f2 = iter(f)
>>> f2 is f

True
>>> next(f2)

" def __bytes__(self): return b'hello'\n"

--
Terry Jan Reedy


Terry Reedy 03-07-2013 08:34 PM

Re: Interesting list() un-optimization
 
On 3/7/2013 11:20 AM, Christian Heimes wrote:

> But iterators have a length hint method that are used for some
> optimizations and preallocations, too.


This is easy when the base iterable has a length method, as do range
objects.

>>>> i = iter(range(10))
>>>> i.__length_hint__()

> 10


And the length_hint can (should be) decremented with each next call.

>>> next(i); next(i)

0
1
>>> i.__length_hint__()

8

--
Terry Jan Reedy



All times are GMT. The time now is 01:53 PM.

Powered by vBulletin®. Copyright ©2000 - 2014, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.