Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > set and dict iteration

Reply
Thread Tools

set and dict iteration

 
 
Aaron Brady
Guest
Posts: n/a
 
      08-16-2012
Hello,

I observed an inconsistency in the behavior of 'set' and 'dict' iterators. It is "by design" according to the docs.

'''
http://docs.python.org/dev/library/s...tml#dict-views

iter(dictview). Iterating views while adding or deleting entries in the dictionary may raise a RuntimeError or fail to iterate over all entries.
'''

The 'set' has the same behavior. Iteration might also complete successfully.

The inconsistency is, if we remove an element from a set and add another during iteration, the new element might appear later in the iteration, and might not, depending on the hash code; therefore comparing the size of the set between iterations isn't adequate. Example:

http://home.comcast.net/~castironpi-...20iterators.py

'''
# py: { 'ver': '3' }

set0= set( ( 1, 2 ) )

iter0= iter( set0 )
print( next( iter0 ) )

set0.add( 3 )
set0.remove( 2 )
print( next( iter0 ) )


print( )

set0= set( ( 6, 7 ) )

iter0= iter( set0 )
print( next( iter0 ) )

set0.add( 8 )
set0.remove( 7 )
print( next( iter0 ) )
'''

Output:

'''
1
3

6
Traceback (most recent call last):
File [...] line 22, in <module>
print( next( iter0 ) )
StopIteration
'''

Iteration should behave the same regardless of the contents of the set. Continuing iteration over sets and dicts after a modification isn't defined; it should unconditionally raise an error.

What's going on, is '8' is added before the position of the iterator due tohashing in the second part, but the size doesn't change, so the iterator reaches the end of the set after '7' is removed.

The inconsistency isn't easily solved. One possibility is to use a timestamp or other serial index in the object and iterators, and compare them on every iteration to determine if a modification has occurred.

Another possibility which the author prefers, is to maintain a secondary collection of the iterators of an object, and invalidate them upon modification. The applicable collection structure is a doubly-linked linked list, informally depicted:

http://home.comcast.net/~castironpi-...0iterators.png

Upon modification, the set traverses its iterators, setting an 'invalid' flag on each; and subsequent calls to any of them raise an 'IterationError'. Adding and removing iterators to and from the secondary list is performed in O( 1 ) time with no penalty.

The above example depicted a 'Set'. 'Dicts' have the same anomaly, but thesolution is ambiguous, since dict values can be changed meaningfully without altering the structure of the object. In the author's opinion, the dictshould not raise an 'IterationError' on value changes, only key changes like the set, but the argument isn't conclusive.
 
Reply With Quote
 
 
 
 
Dave Angel
Guest
Posts: n/a
 
      08-16-2012
On 08/16/2012 02:00 PM, Aaron Brady wrote:
> Hello,
>
> I observed an inconsistency in the behavior of 'set' and 'dict' iterators. It is "by design" according to the docs.
>
> '''
> http://docs.python.org/dev/library/s...tml#dict-views
>
> iter(dictview). Iterating views while adding or deleting entries in the dictionary may raise a RuntimeError or fail to iterate over all entries.
> '''
>
> The 'set' has the same behavior. Iteration might also complete successfully.
>
> The inconsistency is, if we remove an element from a set and add another during iteration, the new element might appear later in the iteration, and might not, depending on the hash code; therefore comparing the size of the set between iterations isn't adequate. Example:
> <SNIP>
>
>
> Iteration should behave the same regardless of the contents of the set. Continuing iteration over sets and dicts after a modification isn't defined; it should unconditionally raise an error.


Why is it the iterator's job to protect against the user's bug? The
doc is clear enough. If you don't change the collection, you won't have
a problem.

> <SNIP more details>.


Everything else is implementation defined. Why should an implementation
be forced to have ANY extra data structure to detect a static bug in the
caller's code?



--

DaveA

 
Reply With Quote
 
 
 
 
Paul Rubin
Guest
Posts: n/a
 
      08-16-2012
Dave Angel <> writes:
> Everything else is implementation defined. Why should an implementation
> be forced to have ANY extra data structure to detect a static bug in the
> caller's code?


For the same reason the interpreter checks for type errors at runtime
and raises TypeError, instead of letting the program go into the weeds.
 
Reply With Quote
 
Ian Kelly
Guest
Posts: n/a
 
      08-16-2012
On Thu, Aug 16, 2012 at 4:55 PM, Ian Kelly <> wrote:
> On Thu, Aug 16, 2012 at 12:00 PM, Aaron Brady <> wrote:
>> The inconsistency is, if we remove an element from a set and add anotherduring iteration, the new element might appear later in the iteration, andmight not, depending on the hash code; therefore comparing the size of theset between iterations isn't adequate. Example:

>
> It can be more than just the new element. For example, here the
> entire set is repeated (Python 3.2):
>
>>>> s = set(range(8, 13))
>>>> it = iter(s)
>>>> from itertools import islice
>>>> list(islice(it, 5)) # avoid exhausting the iterator

> [8, 9, 10, 11, 12]
>>>> s.add(13)
>>>> s.remove(13)
>>>> list(it)

> [8, 9, 10, 11, 12]
>
> This occurs because the addition of the sixth item triggers a resize
> of the underlying hash table, and the existing items, which were
> originally in slots 0-4, are now in slots 8-12.


Another curious example:

>>> s = set(range(8, 48, )
>>> s

{8, 16, 40, 24, 32}
>>> it = iter(s)
>>> from itertools import islice
>>> list(islice(it, 4))

[8, 16, 40, 24]
>>> s.add(4
>>> s.remove(4
>>> list(it)

[8, 16, 40, 24]

Hey, what happened to 32?
 
Reply With Quote
 
Dave Angel
Guest
Posts: n/a
 
      08-16-2012
On 08/16/2012 05:26 PM, Paul Rubin wrote:
> Dave Angel <> writes:
>> Everything else is implementation defined. Why should an implementation
>> be forced to have ANY extra data structure to detect a static bug in the
>> caller's code?

> For the same reason the interpreter checks for type errors at runtime
> and raises TypeError, instead of letting the program go into the weeds.


There's an enormous difference between type errors, which affect the low
level dispatch, and checking for whether a dict has changed and may have
invalidated the iterator. If we were really going to keep track of what
iterators are tracking a given dict or set, why stop there? Why not
check if another process has changed a file we're iterating through? Or ...


 
Reply With Quote
 
Ian Kelly
Guest
Posts: n/a
 
      08-16-2012
On Thu, Aug 16, 2012 at 5:11 PM, Dave Angel <> wrote:
> There's an enormous difference between type errors, which affect the low
> level dispatch, and checking for whether a dict has changed and may have
> invalidated the iterator. If we were really going to keep track of what
> iterators are tracking a given dict or set, why stop there? Why not
> check if another process has changed a file we're iterating through? Or ...


How does this affect low-level dispatch (Python 2.7)?

>>> class Foo(object):

.... def bar(self):
.... return self
....
>>> Foo().bar()

<__main__.Foo object at 0x00CBEAB0>
>>> Foo.bar(Foo())

<__main__.Foo object at 0x00CC9390>
>>> Foo.bar(object())

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unbound method bar() must be called with Foo instance as
first argument (got object instance instead)

There is no low-level need for this TypeError -- it's purely a case of
not letting the developer shoot himself in the foot. Although to be
honest the interpreter doesn't give quite enough rope (to mix
metaphors) in this case, and I'm glad for the sake of duck typing that
they removed this particular error in Python 3.

With regard to key insertion and deletion while iterating over a dict
or set, though, there is just no good reason to be doing that
(especially as the result is very implementation-specific), and I
wouldn't mind a more complete low-level check against it as long as
it's not too expensive (which is not clearly the case with the current
suggestion at all).
 
Reply With Quote
 
Paul Rubin
Guest
Posts: n/a
 
      08-17-2012
Ian Kelly <> writes:
> With regard to key insertion and deletion while iterating over a dict
> or set, though, there is just no good reason to be doing that
> (especially as the result is very implementation-specific), and I
> wouldn't mind a more complete low-level check against it as long as
> it's not too expensive (which is not clearly the case with the current
> suggestion at all).


One possible approach is to freeze the dictionary against modification
while any iterator is open on it. You could keep a count of active
iterators in the dict structure, adjusting it whenever an iterator is
created or closed/destroyed.
 
Reply With Quote
 
Steven D'Aprano
Guest
Posts: n/a
 
      08-17-2012
On Thu, 16 Aug 2012 19:11:19 -0400, Dave Angel wrote:

> On 08/16/2012 05:26 PM, Paul Rubin wrote:
>> Dave Angel <> writes:
>>> Everything else is implementation defined. Why should an
>>> implementation be forced to have ANY extra data structure to detect a
>>> static bug in the caller's code?

>> For the same reason the interpreter checks for type errors at runtime
>> and raises TypeError, instead of letting the program go into the weeds.

>
> There's an enormous difference between type errors, which affect the low
> level dispatch, and checking for whether a dict has changed and may have
> invalidated the iterator. If we were really going to keep track of what
> iterators are tracking a given dict or set, why stop there? Why not
> check if another process has changed a file we're iterating through? Or
> ...


Which is why Python doesn't do it -- because it is (claimed to be)
excessively expensive for the benefit that you would get.

Not because it is a matter of principle that data integrity is
unimportant. Data integrity *is* important, but in the opinion of the
people who wrote these particular data structures, the effort required to
guarantee correct iteration in the face of mutation is too expensive for
the benefit.

Are they right? I don't know. I know that the list sort method goes to a
lot of trouble to prevent code from modifying lists while they are being
sorted. During the sort, the list temporarily appears to be empty to
anything which attempts to access it. So at least sometimes, the Python
developers spend effort to ensure data integrity.

Luckily, Python is open source. If anyone thinks that sets and dicts
should include more code protecting against mutation-during-iteration,
they are more than welcome to come up with a patch. Don't forget unit and
regression tests, and also a set of timing results which show that the
slow-down isn't excessive.


--
Steven
 
Reply With Quote
 
Paul Rubin
Guest
Posts: n/a
 
      08-17-2012
Steven D'Aprano <steve+> writes:
> Luckily, Python is open source. If anyone thinks that sets and dicts
> should include more code protecting against mutation-during-iteration,
> they are more than welcome to come up with a patch. Don't forget unit and
> regression tests, and also a set of timing results which show that the
> slow-down isn't excessive.


It could be a debugging option, in which case even a fairly significant
slowdown is acceptable.
 
Reply With Quote
 
Ulrich Eckhardt
Guest
Posts: n/a
 
      08-17-2012
Am 17.08.2012 03:01, schrieb Paul Rubin:
> Ian Kelly <> writes:
>> With regard to key insertion and deletion while iterating over a dict
>> or set, though, there is just no good reason to be doing that
>> (especially as the result is very implementation-specific), and I
>> wouldn't mind a more complete low-level check against it as long as
>> it's not too expensive (which is not clearly the case with the current
>> suggestion at all).

>
> One possible approach is to freeze the dictionary against modification
> while any iterator is open on it. You could keep a count of active
> iterators in the dict structure, adjusting it whenever an iterator is
> created or closed/destroyed.


What if there is an iterator left over from a loop that was terminated
early? That could block access to the sequence even though nothing is
/really/ iterating over it.

I personally prefer a reliable error, at least when __debug__ is set.
Someone suggested a timestamp or a list of active iterators, which both
sound reasonable. The two should be O(1) and O(#iterators) in complexity
on all mutating operations and O(1) on iteration, so they should be
acceptable. With a C implementation it probably boils down to very few
cycles (checking a pointer/incrementing an integer). I can't say if this
is feasible without compromising performance though, at the very least
it requires an additional member in all dicts and iterators.

Uli

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
dict.keys() and dict.values() are always the same order, is it? Menghan Zheng Python 1 04-20-2010 03:51 AM
Modify dict/set during iteration? metal Python 5 10-30-2009 02:21 PM
Struts - Problem with nested iteration or double iteration Rudi Java 5 10-01-2008 03:30 AM
Grabbing previous iteration in a dict dannywebster@googlemail.com Python 7 05-09-2008 03:44 PM
dict.items() vs dict.iteritems and similar questions Drew Python 19 03-15-2007 09:23 PM



Advertisments
 



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57