Velocity Reviews > itertools.flatten()? and copying generators/iterators.

# itertools.flatten()? and copying generators/iterators.

Francis Avila
Guest
Posts: n/a

 10-28-2003
Below is an implementation of a 'flattening' recursive generator (take a
nested iterator and remove all its nesting). Is this possibly general and
useful enough to be included in itertools? (I know *I* wanted something
like it...)

Very basic examples:

>>> rl = [1, [2, 3, [4, 5], '678', 9]]
>>> list(flatten(rl))
[1, 2, 3, 4, 5, '6', '7', '8', 9]
>>> notstring = lambda obj: not isinstance(obj, type(''))
>>> list(flatten(rl, notstring))
[1, 2, 3, 4, 5, '678', 9]
>>> isstring = lambda obj: not notstring(obj)
>>> list(flatten(rl, isstring))
[1, [2, 3, [4, 5], '678', 9]]
>>> # The string is within a list, so we never descend that far.
>>> car_is_2 = lambda obj: isinstance(obj, type([])) and obj[0] == 2
>>> list(flatten(rl, car_is_2))
[1, 2, 3, [4, 5], '678', 9]
>>> rls = ['Here', 'are', ['some', ['nested'], 'strings']]
>>> list(flatten(rls))
['H', 'e', 'r', 'e', 'a', 'r', 'e', 's', 'o', 'm', 'e', 'n', 'e', 's', 't',
'e', 'd', 's', 't', 'r', 'i', 'n', 'g', 's']
>>> list(flatten(rls, notstring))
['Here', 'are', 'some', 'nested', 'strings']
>>> rli = iter([1, 2, iter(['abc', iter('ABC')]), 4])
>>> list(flatten(rli))
[1, 2, 'a', 'b', 'c', 'A', 'B', 'C', 4]
>>> list(flatten(rli, notstring))
[]
>>> # rli is an iterator, remember!
>>> rli = iter([1, 2, iter(['abc', iter('ABC')]), 4])
>>> list(flatten(rli, notstring))
[1, 2, 'abc', 'A', 'B', 'C', 4]
>>> # The following I'm not sure what to do about...
>>> empty = [1, [], 3]
>>> emptyiter = [1, iter([]), 3]
>>> list(flatten(empty))
[1, [], 3]
>>> list(flatten(emptyiter))
[1, 3]

I tried hard to get it to work with iterator and generator objects, too, and
it mostly does. However, I'm having some problems determining whether a
given object will iterate infinitely, if that object is already a
generator/iterator. Basically, I'm unable to copy an iterator (why?). See
isemptyorself() below for details. Aside from that, I'm just generally
unsure what the proper behavior should be when an iterator/generator is
encountered.

Also, why is the iterator type not included in the types module or described
in the language reference (Python 2.2)?

--- Code ---

def isiterable(obj):
    try: iter(obj)
    except TypeError: return False
    else: return True

def isemptyorself(iterable):
    """True if iterable yields nothing or itself."""
    it = iter(iterable)

    # KLUDGE! This test must modify the object in order to test
    # it. This means that a value of iterable will be discarded!
    # Most likely, iterable is itself an iterator or generator,
    # because id(iter(GENR or ITER)) == id(GENR or ITER).
    # Unfortunately, we can't copy generators and iterators using
    # the copy module, so we must just assume that this iterator
    # doesn't yield itself or nothing....

    if it is iterable:
        return False

    try:
        res = it.next()
    except StopIteration:  # yields nothing
        return True
    else:
        if res == iterable:  # yields itself
            return True
    return False

def flatten(iterable, isnested=None):
    """Iterate items in iterable, descending into nested items.

    isnested is a function that returns true if the element of
    iterable should be descended into. The default is to consider
    nested anything that iter() thinks is iterable (unless doing
    so would cause an infinite recursion).

    """
    if isnested is None:
        isnested = lambda obj: True  # always descend

    for item in iterable:
        if isiterable(item) and not isemptyorself(item) \
           and isnested(item):
            for subitem in flatten(item, isnested):
                yield subitem
        else:
            yield item

--
Francis Avila

Raymond Hettinger

 10-28-2003
[Francis Avila]
> Below is an implementation of a 'flattening' recursive generator (take a
> nested iterator and remove all its nesting). Is this possibly general and
> useful enough to be included in itertools? (I know *I* wanted something
> like it...)

Interesting post!

My first take is that it is more likely to be added to the "recipes" in
the examples section of the documentation.

Core itertools should be primitive building blocks that combine with one
another to make other tools. Also, I'm trying to keep the toolset as small
as possible by excluding new tools that can be easily and efficiently coded
in pure python.

I would code it like this:

def flatten(s):
    try:
        iter(s)
    except TypeError:
        yield s
    else:
        for elem in s:
            for subelem in flatten(elem):
                yield subelem

As your examples show, it does have some surprising behavior in that
strings get split apart instead of staying intact. That is easily taken
care of by an AtomicString subclass:

class AtomicString(str):
    def __iter__(self):
        raise TypeError

a = [1, 2, AtomicString('abc'), 4]
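A quick sanity check of the two snippets above, rendered in Python 3 (yield from in place of the nested loop; the semantics are unchanged):

```python
# Raymond's flatten(): recursion bottoms out where iter() raises TypeError.
def flatten(s):
    try:
        iter(s)
    except TypeError:
        yield s  # non-iterable: emit as a leaf
    else:
        for elem in s:
            yield from flatten(elem)

class AtomicString(str):
    def __iter__(self):
        raise TypeError  # make flatten() treat this string as atomic

a = [1, 2, AtomicString('abc'), 4]
print(list(flatten(a)))   # [1, 2, 'abc', 4]
```

One caveat worth noting: a bare, non-atomic string never bottoms out here, because a one-character string iterates to itself, so flatten(['abc']) recurses until it hits the recursion limit. That is exactly the case the isemptyorself() test in the original post guards against.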

> >>> # The following I'm not sure what to do about...
> >>> empty = [1, [], 3]
> >>> emptyiter = [1, iter([]), 3]

The definition above handles these without a problem.

> I'm having some problems determining whether a
> given object will iterate infinitely, if that object is already a
> generator/iterator.

In general, that is not knowable.

> Basically, I'm unable to copy an iterator (why?).

Alex Martelli just wrote a PEP about making many iterators copyable.
See www.python.org/peps/pep-0323.html

from itertools import count  # needed by the recipe below

def tee(iterable):
    "Return two independent iterators from a single iterable"
    def gen(next, data={}, cnt=[0]):
        dpop = data.pop
        for i in count():
            if i == cnt[0]:
                item = data[i] = next()
                cnt[0] += 1
            else:
                item = dpop(i)
            yield item
    next = iter(iterable).next
    return (gen(next), gen(next))
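This recipe was later added to the standard library as itertools.tee (Python 2.4). A quick demonstration of the library version's behavior:

```python
from itertools import tee

it = iter([1, 2, 3])
a, b = tee(it)            # two independent iterators over the same data
print(next(a), next(a))   # a advances: 1 2
print(next(b))            # ...without consuming b: 1
print(list(a), list(b))   # [3] [2, 3]
```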

> Also, why is the iterator type not included in the types module or described
> in the language reference (Python 2.2)?

There is no iterator type. Iterators can be any object that supports the
iterator protocol: __iter__() and next(). Generators, container iterators,
and each of the itertools define their own type.

This was a long but highly instructive pair of posts. It covered some
interesting territory and touched on a number of frontier issues like
copyability, determining whether something is iterable, the various
iterator types, and the deep mathematical subject of determining whether
a given process is finite.

Raymond Hettinger

Peter Otten

 10-28-2003
Raymond Hettinger wrote:

> Core itertools should be primitive building blocks that combine with one
> another to make other tools. Also, I'm trying to keep the toolset as
> small as possible by excluding new tools that can be easily and
> efficiently coded in pure python.
>
> I would code it like this:
>
> def flatten(s):
>     try:
>         iter(s)
>     except TypeError:
>         yield s
>     else:
>         for elem in s:
>             for subelem in flatten(elem):
>                 yield subelem
>
> As your examples show, it does have some surprising behavior in that
> strings get split apart instead of staying intact. That is easily taken
> care of by an AtomicString subclass:
>
> class AtomicString(str):
>     def __iter__(self):
>         raise TypeError
>
> a = [1, 2, AtomicString('abc'), 4]
>
>
>
>> >>> # The following I'm not sure what to do about...
>> >>> empty = [1, [], 3]
>> >>> emptyiter = [1, iter([]), 3]

>

I suggest a minor modification:

def flatten(s, toiter=iter):
    try:
        it = toiter(s)
    except TypeError:
        yield s
    else:
        for elem in it:
            for subelem in flatten(elem, toiter):
                yield subelem

def keepstrings(seq):
    if isinstance(seq, basestring):
        raise TypeError
    return iter(seq)

sample = [1, 2, [3, "abc def".split()], 4]
print sample
print list(flatten(sample, keepstrings))
print list(flatten([1, [], 3]))
print list(flatten([1, iter([]), 3]))

The above handles strings in a way that is nonintrusive on client code.
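For reference, a Python 3 rendering of the snippet (basestring becomes str, print becomes a function, and yield from replaces the inner loop):

```python
def flatten(s, toiter=iter):
    try:
        it = toiter(s)
    except TypeError:
        yield s          # toiter refused: treat s as a leaf
    else:
        for elem in it:
            yield from flatten(elem, toiter)

def keepstrings(seq):
    # refuse to hand back an iterator for strings, so they stay atomic
    if isinstance(seq, str):
        raise TypeError
    return iter(seq)

sample = [1, 2, [3, "abc def".split()], 4]
print(list(flatten(sample, keepstrings)))            # [1, 2, 3, 'abc', 'def', 4]
print(list(flatten([1, [], 3], keepstrings)))        # [1, 3]
print(list(flatten([1, iter([]), 3], keepstrings)))  # [1, 3]
```

Note that with the default toiter=iter a bare string never bottoms out (a one-character string iterates to itself), so the keepstrings hook, or an AtomicString-style subclass, is what keeps the recursion finite on string-bearing data.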

Peter

Alex Martelli

 10-28-2003
Peter Otten wrote:
...
> def keepstrings(seq):
>     if isinstance(seq, basestring):
>         raise TypeError
>     return iter(seq)

...
> The above handles strings in a way that is nonintrusive on client code.

Yes, very nice. I'd make keepstrings the default (one RARELY wants
to treat strings as nonatomic) AND replace the typetest with an
attempt to do a seq+'' (if it doesn't raise, seq ain't a string),
but that's just me.
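Alex's seq+'' duck test can be packaged as a predicate. The point of preferring it over the type test is that a string-alike which merely supports concatenation (the hypothetical MyStr below, standing in for UserString-style classes) is also caught:

```python
def isstringlike(obj):
    # Alex's duck test: anything that concatenates with '' is "a string"
    try:
        obj + ''
    except TypeError:
        return False
    return True

class MyStr:  # hypothetical string-alike that does NOT derive from str
    def __init__(self, s):
        self.s = s
    def __add__(self, other):
        return self.s + other

print(isstringlike('abc'))         # True
print(isstringlike(MyStr('abc')))  # True: caught, though not a str subclass
print(isstringlike([1, 2]))        # False
```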

Alex

Francis Avila

 10-28-2003

"Alex Martelli" <(E-Mail Removed)> wrote in message
news:Tesnb.57573\$(E-Mail Removed)...
> Peter Otten wrote:
> ...
> > def keepstrings(seq):
> >     if isinstance(seq, basestring):
> >         raise TypeError
> >     return iter(seq)

> ...
> > The above handles strings in a way that is nonintrusive on client code.

>
> Yes, very nice. I'd make keepstrings the default (one RARELY wants
> to treat strings as nonatomic) AND replace the typetest with an
> attempt to do a seq+'' (if it doesn't raise, seq ain't a string),
> but that's just me.
>
> Alex
>

The try: seq+'' test seems more expensive. Is it just to handle the case
where someone makes a string-alike that doesn't derive from str? (Actually,
I guess that's about all pre-2.2 code.)

Is it expensive enough to even matter?
--
Francis Avila

Francis Avila

 10-28-2003
"Raymond Hettinger" <(E-Mail Removed)> wrote in message
news:niqnb.17753\$(E-Mail Removed)...
> [Francis Avila]

> > Also, why is the iterator type not included in the types module or
> > described in the language reference (Python 2.2)?

>
> There is no iterator type. Iterators can be any object that supports the
> iterator protocol: __iter__() and next(). Generators, container
> iterators, and each of the itertools define their own type.

Actually (Python 2.2):

>>> IteratorType = type(iter(()))
>>> IteratorType
<type 'iterator'>
>>> from types import GeneratorType
>>> GeneratorType
<type 'generator'>
>>> GeneratorType == IteratorType
0
>>> isinstance(IteratorType, GeneratorType)
0
>>> isinstance(GeneratorType, IteratorType)
0
>>> dir(GeneratorType)
['__class__', '__delattr__', '__doc__', '__getattribute__', '__hash__',
'__init__', '__iter__', '__new__', '__reduce__', '__repr__', '__setattr__',
'__str__', 'gi_frame', 'gi_running', 'next']
>>> dir(IteratorType)
['__class__', '__delattr__', '__doc__', '__getattribute__', '__hash__',
'__init__', '__iter__', '__new__', '__reduce__', '__repr__', '__setattr__',
'__str__', 'next']
>>> def g():
...     yield None
...
>>> G = g()
>>> type(G)
<type 'generator'>
>>> G
<generator object at 0x0156F1C0>
>>> iter(G)
<generator object at 0x0156F1C0>
>>> I = iter(())
>>> type(I)
<type 'iterator'>
>>> I
<iterator object at 0x014BAC80>
>>> iter(I)
<iterator object at 0x014BAC80>

The generator seems to be an actual execution frame, but an iterator is
only a type that supports the iterator protocol. I think from this that an
iterator needs to store all its elements prior to yielding them, whereas a
generator does not. For this reason, an iterator will always terminate
eventually. For these two reasons, I think it's useful to distinguish
iterators and generators.

I think I've clarified it a bit more to myself. An iterator iterates
infinitely if it yields itself; otherwise it doesn't. For a generator, it
is unknowable.
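One counterexample to keep in mind: the two-argument form of iter() builds a builtin iterator object that never terminates and never yields itself, so neither termination nor infinite looping can be read off from an object's type. A sketch in modern Python:

```python
from itertools import islice

# iter(callable, sentinel): keeps calling the callable until it returns
# the sentinel. A callable that never returns None gives an infinite,
# non-self-yielding builtin iterator.
it = iter(lambda: 1, None)
print(type(it).__name__)      # callable_iterator
print(list(islice(it, 5)))    # take five items; the iterator itself never ends
```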

Are classes supporting the iterator protocol always generators? Would they
*ever* be an iterator (according to type(), I mean)? Is there any other
way to get a strict iterator than applying iter() to a sequence? (I
noticed you can't subclass the iterator type.)

The language reference (or maybe it's the module reference?) states that
generators yield iterator objects. That's not exactly true: they yield
generator objects which support the iterator protocol, but not iterator
objects. And again, iterator types are not mentioned as a separate type in
the language reference, which strikes me as odd.

Perhaps a guru can clarify the relationship between generators and
iterators?
--
Francis Avila

Peter Otten

 10-28-2003
Francis Avila wrote:

> try: seq+'' seems more expensive. Is it just to handle the case where
> someone makes a string-alike that doesn't derive from str? (Actually, I
> guess that's about all pre-2.2 code.)
>
> Is it expensive enough to even matter?
> --
> Francis Avila

If you think that the concatenation is costly, as I did when I read your
remark - it's not:

>>> smartSnakes = "don't rattle"
>>> smartSnakes is (smartSnakes + "")
True

Peter

Francis Avila

 10-28-2003

"Peter Otten" <(E-Mail Removed)> wrote in message
news:bnlp0m\$d9l\$01\$(E-Mail Removed)-online.com...
> Francis Avila wrote:
>
> > try: seq+'' seems more expensive. Is it just to handle the case where
> > someone makes a string-alike that doesn't derive from str? (Actually, I
> > guess that's about all pre-2.2 code.)
> >
> > Is it expensive enough to even matter?
> > --
> > Francis Avila

>
> If you think that the concatenation is costly, as I did when I read your
> remark - it's not:
>
> >>> smartSnakes = "don't rattle"
> >>> smartSnakes is (smartSnakes + "")
> True
>
> Peter

Actually, I meant the exception handling. Isn't Alex suggesting something
like this?

try:
    seq + ''
except:
    seq is not a string
else:
    seq is a string

--
Francis Avila

Alex Martelli

 10-28-2003
Francis Avila wrote:
...
> Actually, I meant the exception testing. Isn't Alex suggesting something
> like?:
>
> try:
>     seq + ''
> except:
>     seq is not a string
> else:
>     seq is a string

Right. Yes, if you have a lot of nonstrings this IS going to be
substantially slower than isinstance tests. As usual, setting up
a representative benchmark and measuring the effect is best.

Of course, it's not easy to decide what IS representative, but
let's give it a try:

# Peter Otten's suggestion (simplified/flattened out)
def flatten_po(s):
    try:
        if isinstance(s, basestring):
            raise TypeError
        else:
            it = iter(s)
    except TypeError:
        yield s
    else:
        for elem in it:
            for subelem in flatten_po(elem):
                yield subelem

# my slight preference
def flatten_am(s):
    try:
        s + ''
    except TypeError:
        try:
            it = iter(s)
        except TypeError:
            yield s
        else:
            for elem in s:
                for subelem in flatten_am(elem):
                    yield subelem
    else:
        yield s

# an example tree -- a mix of strings, items, some nesting
subtree = ['foo']*18 + [1,2,3]*6
tree = [ subtree*10, [ subtree * 8 ] ]

# functions to benchmark
def process(flatten, tree=tree):
    return list(flatten(tree))

def pro_po(): return process(flatten_po)
def pro_am(): return process(flatten_am)

Now, with this setup, I measure:

[alex@lancelot bo]\$ timeit.py -c -s'import fl' 'fl.pro_po()'
100 loops, best of 3: 9.4e+03 usec per loop
[alex@lancelot bo]\$ timeit.py -c -s'import fl' 'fl.pro_am()'
100 loops, best of 3: 8.3e+03 usec per loop

...which is a reminder of WHY it's best to always measure --
I fully expected pro_am to be slower, as it's more general
(handles UserString instances etc), and I was just trying
to check _how_ much slower it was...!

Upon reflection, the problem is with flatten_po's elegant
'raise TypeError', which merges strings with noniterables
BUT only at the cost of an exception. Changing the
preamble of flatten_po to:

try:
    if isinstance(s, basestring):
        yield s
        return
    ...

improves its timeit.py-measured performance to 5.9e+03 usec.
So, on this sample of data, the _po version is either 13%
slower or 29% faster than the _am.
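Spelled out in full (filling in the preamble change described above), the faster variant reads as follows, in Python 3 terms with str standing in for basestring:

```python
def flatten_po2(s):
    # Short-circuit strings before calling iter(), so the common string
    # case costs an isinstance() check instead of a raised exception.
    if isinstance(s, str):
        yield s
        return
    try:
        it = iter(s)
    except TypeError:
        yield s
    else:
        for elem in it:
            yield from flatten_po2(elem)

print(list(flatten_po2([1, ['foo', [2, 'bar']], 3])))  # [1, 'foo', 2, 'bar', 3]
```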

Now, fortunately, it's an EXTREMELY unlikely event that
performance differences of this order of magnitude should
deeply influence our choice of algorithms! Twice as slow,
well, one wants to keep an eye on that; 30% or less, it's
unlikely to matter except at the very heart of some key
bottleneck -- and, honestly, how often will flattening be that
bottleneck? Note that in this case we're basically doing no processing with
the items in the flattened sequence, just stuffing them into
flat lists. Do any processing whatsoever, and that will
probably dwarf the infrastructural overhead of the flattening
procedure itself.

The key thing is watching for difference in big-O behavior --
and, here, we have none... just somewhat different (NOT
by whole orders of magnitude!) multiplicative constants.
THAT kind of performance issue is one we can live with
more often than not! Which, in turn, is what makes the
ordinary use of exceptions generally tolerable from the
point of view of performance, except in the very hottest
bottlenecks...

Alex

Alex Martelli

 10-28-2003
Francis Avila wrote:
...
> Perhaps a guru can clarify the relationship between generators and
> iterators?

Sure: the use of "iterators" or "iterator objects" in the Python docs
does NOT mean "objects that are instance of <type 'iterator'>", it
means "objects that satisfy the iterator protocol".

The situation may be a bit less confusing in today's Python, just
because the typenames involved have been changed a bit in a few
important special cases:

>>> iter(())
<tupleiterator object at 0x402deecc>
>>> iter([])
<listiterator object at 0x402dedec>
>>> iter('')
<iterator object at 0x402dee8c>

These types are unrelated -- each only inherits from object, as
you can check by viewing their __bases__ -- and besides the small
performance improvement that results from each of these iterator
objects 'knowing' the exact type of sequence it's iterating on,
the name diversification may help avoid the understandable kind
of confusion you may have been led into by older Python versions.
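The same claims are easy to verify interactively. On a current CPython (where the string iterator has since been renamed str_iterator as well), the three types are still distinct and each inherits only from object:

```python
t_tuple = type(iter(()))
t_list = type(iter([]))
t_str = type(iter(''))

print(t_tuple.__name__, t_list.__name__, t_str.__name__)
print(len({t_tuple, t_list, t_str}))   # 3: three unrelated types
print(t_list.__bases__)                # (<class 'object'>,)
```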

Alex