Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > itertools.groupby usage to get structured data

Reply
Thread Tools

itertools.groupby usage to get structured data

 
 
Slafs
Guest
Posts: n/a
 
      02-04-2011
Hi there!

I'm having trouble to wrap my brain around this kind of problem:

What I have :
1) list of dicts
2) list of keys that i would like to be my grouping arguments of
elements from 1)
3) list of keys that i would like do "aggregation" on the elements
of 1) with some function e.g. sum

For instance i got:
1) [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
{ 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
{'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}, ... ]
2) ['g1', 'g2']
3) ['s_v1', 's_v2']

To be precise 1) is a result of a values_list method from a QuerySet
in Django; 2) is the arguments for that method; 3) those are the
annotation keys. so 1) is a result of:
qs.values_list('g1', 'g2').annotate(s_v1=Sum('v1'), s_v2=Sum('v2'))

What i want to have is:
a "big" nested dictionary with 'g1' values as 1st level keys and a
dictionary of aggregates and "subgroups" in it.

In my example it would be something like this:
{
1 : {
's_v1' : 7.0,
's_v2' : 6.5,
'g2' :{
8 : {
's_v1' : 5.0,
's_v2' : 3.5 },
9 : {
's_v1' : 2.0,
's_v2' : 3.0 }
}
},
2 : {
's_v1' : 6.0,
's_v2' : 8.0,
'g2' : {
8 : {
's_v1' : 6.0,
's_v2' : 8.0}
}
},
....
}

# notice the summed values of s_v1 and s_v2 when g1 == 1

I was looking for a solution that would let me do that kind of
grouping with variable lists of 2) and 3) i.e. having also 'g3' as
grouping element so the 'g2' dicts could also have their own
"subgroup" and be even more nested then.
I was trying something with itertools.groupby and updating nested
dicts, but as i was writing the code it started to feel too verbose to
me :/

Do You have any hints maybe? because i'm kind of stucked :/

Regards

Sławek
 
Reply With Quote
 
 
 
 
Steven D'Aprano
Guest
Posts: n/a
 
      02-05-2011
On Fri, 04 Feb 2011 15:14:24 -0800, Slafs wrote:

> Hi there!
>
> I'm having trouble to wrap my brain around this kind of problem:


Perhaps you should consider backing up and staring from somewhere else
with different input data, or changing the requirements. Just a thought.


> What I have :
> 1) list of dicts
> 2) list of keys that i would like to be my grouping arguments of
> elements from 1)
> 3) list of keys that i would like do "aggregation" on the elements
> of 1) with some function e.g. sum



You start with data:

dicts = [ {'g1': 1, 'g2': 8, 's_v1': 5.0, 's_v2': 3.5},
{'g1': 1, 'g2': 9, 's_v1': 2.0, 's_v2': 3.0},
{'g1': 2, 'g2': 8, 's_v1': 6.0, 's_v2': 8.0} ]


It sometimes helps me to think about data structures by drawing them out.
In this case, you have what is effectively a two-dimensional table:

g1 g2 s_v1 s_v2
=== === ===== ====
1 8 5.0 3.5
1 9 2.0 3.0
2 8 6.0 8.0


Nice and simple. But the result you want is a bit more complex -- it's a
dict of dicts of dicts:

{1: {'s_v1': 7.0, 's_v2': 6.5,
'g2': {8: {'s_v1': 5.0, 's_v2': 3.5},
9: {'s_v1': 2.0, 's_v2': 3.0}
}},
2: {'s_v1': 6.0, 's_v2': 8.0,
'g2': {8: {'s_v1' : 6.0, 's_v2': 8.0}
}}}


(I quote from the Zen of Python: "Flat is better than nested." Hmmm.)

which is equivalent to a *four* dimensional table, which is a bit hard to
write out

Here's a two-dimensional projection of a single slice with key = 1:

s_v1 s_v2 g2
===== ===== =====
7.0 6.5 | s_v1 s_v2
---------------
8 | 5.0 3.5
9 | 2.0 3.0


Does this help you to either (1) redesign your data structures, or (2)
work out how to go from there?

[...]
> I was looking for a solution that would let me do that kind of grouping
> with variable lists of 2) and 3) i.e. having also 'g3' as grouping
> element so the 'g2' dicts could also have their own "subgroup" and be
> even more nested then. I was trying something with itertools.groupby and
> updating nested dicts, but as i was writing the code it started to feel
> too verbose to me :/


I don't think groupby is the tool you want. It groups *consecutive* items
in sequences:


>>> from itertools import groupby
>>> for key, it in groupby([1,1,1,2,3,4,3,3,3,5,1]):

.... print(key, list(it))
....
1 [1, 1, 1]
2 [2]
3 [3]
4 [4]
3 [3, 3, 3]
5 [5]
1 [1]


Except for the name, I don't see any connection between this and what you
want to do.

The approach I would take is a top-down approach:

dicts = [ ... ] # list of dicts, as above.
result = {}
for d in dicts:
# process each dict in isolation
temp = process(d)
merge(result, temp)


merge() hopefully should be straight forward, and process only needs to
look at one dict at a time.



--
Steven
 
Reply With Quote
 
 
 
 
Paul Rubin
Guest
Posts: n/a
 
      02-05-2011
Slafs <(E-Mail Removed)> writes:
> What i want to have is:
> a "big" nested dictionary with 'g1' values as 1st level keys and a
> dictionary of aggregates and "subgroups" in it....
>
> I was looking for a solution that would let me do that kind of
> grouping with variable lists of 2) and 3) i.e. having also 'g3' as
> grouping element so the 'g2' dicts could also have their own
> "subgroup" and be even more nested then.
> I was trying something with itertools.groupby and updating nested
> dicts, but as i was writing the code it started to feel too verbose to
> me :/
>
> Do You have any hints maybe? because i'm kind of stucked :/


I'm not sure I understood the problem and it would help if you gave
sample data with the deeper nesting that you describe. But the
following messy code matches the sample that you did give:

from pprint import pprint
from itertools import groupby

x1 = [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
{ 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
{ 'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}
]
x2 = ['g1', 'g2']
x3 = ['s_v1', 's_v2']

def agg(xdata, group_keys, agg_keys):
if not group_keys:
return {}
k0, ks = group_keys[0], group_keys[1:]
r = {}
def gk(d): return d[k0]
for k, g in groupby(sorted(xdata, key=gk), gk):
gs = list(g)
aggs = dict((ak,sum(d[ak] for d in gs)) for ak in agg_keys)
r[k] = aggs
if ks:
r[k][ks[0]] = agg(gs,group_keys[1:], agg_keys)
return r

pprint (agg(x1, x2, x3))
 
Reply With Quote
 
Slafs
Guest
Posts: n/a
 
      02-05-2011
On 5 Lut, 05:58, Paul Rubin <(E-Mail Removed)> wrote:
> Slafs <(E-Mail Removed)> writes:
> > What i want to have is:
> > a "big" nested dictionary with 'g1' values as 1st level keys and a
> > dictionary of aggregates and "subgroups" in it....

>
> > I was looking for a solution that would let me do that kind of
> > grouping with variable lists of 2) and 3) i.e. having also 'g3' as
> > grouping element so the 'g2' dicts could also have their own
> > "subgroup" and be even more nested then.
> > I was trying something with itertools.groupby and updating nested
> > dicts, but as i was writing the code it started to feel too verbose to
> > me :/

>
> > Do You have any hints maybe? because i'm kind of stucked :/

>
> I'm not sure I understood the problem and it would help if you gave
> sample data with the deeper nesting that you describe. *But the
> following messy code matches the sample that you did give:
>
> * * from pprint import pprint
> * * from itertools import groupby
>
> * * x1 = [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
> * * * * * * * { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
> * * * * * * * { 'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}
> * * * * * * * ]
> * * x2 = ['g1', 'g2']
> * * x3 = ['s_v1', 's_v2']
>
> * * def agg(xdata, group_keys, agg_keys):
> * * * * if not group_keys:
> * * * * * * return {}
> * * * * k0, ks = group_keys[0], group_keys[1:]
> * * * * r = {}
> * * * * def gk(d): return d[k0]
> * * * * for k, g in groupby(sorted(xdata, key=gk), gk):
> * * * * * * gs = list(g)
> * * * * * * aggs = dict((ak,sum(d[ak] for d in gs)) for ak in agg_keys)
> * * * * * * r[k] = aggs
> * * * * * * if ks:
> * * * * * * * * r[k][ks[0]] = agg(gs,group_keys[1:], agg_keys)
> * * * * return r
>
> * * pprint (agg(x1, x2, x3))


Thank you both Steven and Paul for your replies.

@Steven:
> Perhaps you should consider backing up and staring from somewhere else
> with different input data, or changing the requirements. Just a thought.


I think it's not the issue. The data as you noticed i well structured
(as a table for instance) and I don't think I can go better than that.

> I don't think groupby is the tool you want. It groups *consecutive* items
> in sequences:


I was using groupby just like in Paul's code.

@Paul:
OMG. I think this is it! (getting my jaw from the floor...)
The funny part is that I was kind of close to this solution . I was
considering the use of recursion for this.

Thank You so much!
 
Reply With Quote
 
Peter Otten
Guest
Posts: n/a
 
      02-05-2011
Slafs wrote:

> Hi there!
>
> I'm having trouble to wrap my brain around this kind of problem:
>
> What I have :
> 1) list of dicts
> 2) list of keys that i would like to be my grouping arguments of
> elements from 1)
> 3) list of keys that i would like do "aggregation" on the elements
> of 1) with some function e.g. sum
>
> For instance i got:
> 1) [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
> { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
> {'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}, ... ]
> 2) ['g1', 'g2']
> 3) ['s_v1', 's_v2']
>
> To be precise 1) is a result of a values_list method from a QuerySet
> in Django; 2) is the arguments for that method; 3) those are the
> annotation keys. so 1) is a result of:
> qs.values_list('g1', 'g2').annotate(s_v1=Sum('v1'), s_v2=Sum('v2'))
>
> What i want to have is:
> a "big" nested dictionary with 'g1' values as 1st level keys and a
> dictionary of aggregates and "subgroups" in it.
>
> In my example it would be something like this:
> {
> 1 : {
> 's_v1' : 7.0,
> 's_v2' : 6.5,
> 'g2' :{
> 8 : {
> 's_v1' : 5.0,
> 's_v2' : 3.5 },
> 9 : {
> 's_v1' : 2.0,
> 's_v2' : 3.0 }
> }
> },
> 2 : {
> 's_v1' : 6.0,
> 's_v2' : 8.0,
> 'g2' : {
> 8 : {
> 's_v1' : 6.0,
> 's_v2' : 8.0}
> }
> },
> ...
> }
>
> # notice the summed values of s_v1 and s_v2 when g1 == 1
>
> I was looking for a solution that would let me do that kind of
> grouping with variable lists of 2) and 3) i.e. having also 'g3' as
> grouping element so the 'g2' dicts could also have their own
> "subgroup" and be even more nested then.
> I was trying something with itertools.groupby and updating nested
> dicts, but as i was writing the code it started to feel too verbose to
> me :/
>
> Do You have any hints maybe? because i'm kind of stucked :/
>
> Regards
>
> Sławek


Not super-efficient, but simple:

$ cat python sumover.py
cat: python: No such file or directory
data = [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
{ 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
{'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}]
sum_over = ["s_v1", "s_v2"]
group_by = ["g1", "g2"]

wanted = {
1 : {
's_v1' : 7.0,
's_v2' : 6.5,
'g2' :{
8 : {
's_v1' : 5.0,
's_v2' : 3.5 },
9 : {
's_v1' : 2.0,
's_v2' : 3.0 }
}
},
2 : {
's_v1' : 6.0,
's_v2' : 8.0,
'g2' : {
8 : {
's_v1' : 6.0,
's_v2' : 8.0}
}
},
}

def calc(data, group_by, sum_over):
tree = {}
group_by = group_by + [None]
for item in data:
d = tree
for g in group_by:
for so in sum_over:
d[so] = d.get(so, 0.0) + item[so]
if g:
d = d.setdefault(g, {}).setdefault(item[g], {})
return tree

got = calc(data, group_by, sum_over)[group_by[0]]
assert got == wanted
$ python sumover.py
$

Untested.
 
Reply With Quote
 
nn
Guest
Posts: n/a
 
      02-07-2011
On Feb 5, 7:12*am, Peter Otten <(E-Mail Removed)> wrote:
> Slafs wrote:
> > Hi there!

>
> > I'm having trouble to wrap my brain around this kind of problem:

>
> > What I have :
> > * 1) list of dicts
> > * 2) list of keys that i would like to be my grouping arguments of
> > elements from 1)
> > * 3) list of keys that i would like do "aggregation" on the elements
> > of 1) with some function e.g. sum

>
> > For instance i got:
> > 1) [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
> > * * * { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
> > * * * {'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}, .... ]
> > 2) ['g1', 'g2']
> > 3) ['s_v1', 's_v2']

>
> > To be precise 1) is a result of a values_list method from a QuerySet
> > in Django; 2) is the arguments for that method; 3) those are the
> > annotation keys. so 1) is a result of:
> > * *qs.values_list('g1', 'g2').annotate(s_v1=Sum('v1'), s_v2=Sum('v2'))

>
> > What i want to have is:
> > a "big" nested dictionary with 'g1' values as 1st level keys and a
> > dictionary of aggregates and "subgroups" in it.

>
> > In my example it would be something like this:
> > {
> > * 1 : {
> > * * * * * 's_v1' : 7.0,
> > * * * * * 's_v2' : 6.5,
> > * * * * * 'g2' :{
> > * * * * * * * * * *8 : {
> > * * * * * * * * * * * * * 's_v1' : 5.0,
> > * * * * * * * * * * * * * 's_v2' : 3.5 },
> > * * * * * * * * * *9 : *{
> > * * * * * * * * * * * * * 's_v1' : 2.0,
> > * * * * * * * * * * * * * 's_v2' : 3.0 }
> > * * * * * * * * }
> > * * * *},
> > * 2 : {
> > * * * * * *'s_v1' : 6.0,
> > * * * * * *'s_v2' : 8.0,
> > * * * * * *'g2' : {
> > * * * * * * * * * * 8 : {
> > * * * * * * * * * * * * * 's_v1' : 6.0,
> > * * * * * * * * * * * * * 's_v2' : 8.0}
> > * * * * * *}
> > * * * *},
> > ...
> > }

>
> > # notice the summed values of s_v1 and s_v2 when g1 == 1

>
> > I was looking for a solution that would let me do that kind of
> > grouping with variable lists of 2) and 3) i.e. having also 'g3' as
> > grouping element so the 'g2' dicts could also have their own
> > "subgroup" and be even more nested then.
> > I was trying something with itertools.groupby and updating nested
> > dicts, but as i was writing the code it started to feel too verbose to
> > me :/

>
> > Do You have any hints maybe? because i'm kind of stucked :/

>
> > Regards

>
> > Sławek

>
> Not super-efficient, but simple:
>
> $ cat python sumover.py
> cat: python: No such file or directory
> data = [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
> * * * * *{ 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
> * * * * *{'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}]
> sum_over = ["s_v1", "s_v2"]
> group_by = ["g1", "g2"]
>
> wanted = {
> * 1 : {
> * * * * * 's_v1' : 7.0,
> * * * * * 's_v2' : 6.5,
> * * * * * 'g2' :{
> * * * * * * * * * *8 : {
> * * * * * * * * * * * * * 's_v1' : 5.0,
> * * * * * * * * * * * * * 's_v2' : 3.5 },
> * * * * * * * * * *9 : *{
> * * * * * * * * * * * * * 's_v1' : 2.0,
> * * * * * * * * * * * * * 's_v2' : 3.0 }
> * * * * * * * * }
> * * * *},
> * 2 : {
> * * * * * *'s_v1' : 6.0,
> * * * * * *'s_v2' : 8.0,
> * * * * * *'g2' : {
> * * * * * * * * * * 8 : {
> * * * * * * * * * * * * * 's_v1' : 6.0,
> * * * * * * * * * * * * * 's_v2' : 8.0}
> * * * * * *}
> * * * *},
>
> }
>
> def calc(data, group_by, sum_over):
> * * tree = {}
> * * group_by = group_by + [None]
> * * for item in data:
> * * * * d = tree
> * * * * for g in group_by:
> * * * * * * for so in sum_over:
> * * * * * * * * d[so] = d.get(so, 0.0) + item[so]
> * * * * * * if g:
> * * * * * * * * d = d.setdefault(g, {}).setdefault(item[g], {})
> * * return tree
>
> got = calc(data, group_by, sum_over)[group_by[0]]
> assert got == wanted
> $ python sumover.py
> $
>
> Untested.


Very clever. I didn't understand how it worked until I rewrote it like
this:

def calc(data, group_by, sum_over):
tree = {}
group_by = [None] + group_by
for item in data:
d = tree
for g in group_by:
if g:
d = d.setdefault(g, {}).setdefault(item[g], {})
for so in sum_over:
d[so] = d.get(so, 0.0) + item[so]
return tree


Processing "None" in the last round of the loop was throwing me off.
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
transmitting raw data vs. tree-structured data Max You Ruby 2 06-19-2010 09:06 PM
RSS extensions for hierarchical structured data Mike J XML 4 12-14-2005 11:55 PM
fastest way of storing structured data Vince C++ 6 09-01-2005 01:47 AM
What about a C compiler with structured data management INSIDE ? HALLES C Programming 3 06-26-2005 04:54 AM
Using xml source in datagrid -- possible with non-structured data? KatB ASP .Net 3 10-29-2003 07:09 AM



Advertisments