Velocity Reviews > Pivot Table/Groupby/Sum question

# Pivot Table/Groupby/Sum question

patrick.waldo@gmail.com
Guest
Posts: n/a

 12-27-2007
Hi all,

on the same subject, but it didn't work for me. I'm trying to learn
how to make pivot tables from some excel sheets and I am trying to
abstract this into a simple sort of example. Essentially I want to
take input data like this:

Name Time of day Amount
Bob Morn 240
Bob Aft 300
Joe Morn 70
Joe Aft 80
Jil Morn 100
Jil Aft 150

And output it as:

Name Total Morning Afternoon
Bob 540 240 300
Joe 150 70 80
Jil 250 100 150
Total 940 410 530

The writing the output part is the easy part. However, I have a
couple problems. 1) Grouping by name seems to work perfectly, but
working by time does not. ie

I will get:
Bob
240
300
Joe
70
80
Jil
100
150
which is great but...
Morn
240
Aft
300
Morn
70
Aft
80
Morn
100
Aft
150
And not
Morn
240
70
100
Aft
300
80
150

2) I can't figure out how to sum these values because of the
iteration. I always get an error like: TypeError: iteration over non-
sequence

Here's the code:

from itertools import groupby

data = [['Bob', 'Morn', 240],['Bob', 'Aft', 300],['Joe', 'Morn', 70],
['Joe', 'Aft', 80],\
['Jil', 'Morn', 100],['Jil', 'Aft', 150]]

NAME, TIME, AMOUNT=range(3)
for k, g in groupby(data, key=lambda r: r[NAME]):
print k
for record in g:
print "\t", record[AMOUNT]
for k, g in groupby(data, key=lambda r: r[TIME]):
print k
for record in g:
print "\t", record[AMOUNT]

John Machin
Guest
Posts: n/a

 12-27-2007
On Dec 28, 4:56 am, (E-Mail Removed) wrote:

> from itertools import groupby

You seem to have overlooked this important sentence in the
documentation: "Generally, the iterable needs to already be sorted on
the same key function"

patrick.waldo@gmail.com
Guest
Posts: n/a

 12-27-2007
On Dec 27, 10:59 pm, John Machin <(E-Mail Removed)> wrote:
> On Dec 28, 4:56 am, (E-Mail Removed) wrote:
>
> > from itertools import groupby

>
> You seem to have overlooked this important sentence in the
> documentation: "Generally, the iterable needs to already be sorted on
> the same key function"

Yes, but I imagine this shouldn't prevent me from using and
manipulating the data. It also doesn't explain why the names get
sorted correctly and the time does not.

I was trying to do this:

count_tot = []
for k, g in groupby(data, key=lambda r: r[NAME]):
for record in g:
count_tot.append((k,record[SALARY]))
for i in count_tot:
here I want to say add all the numbers for each person, but I'm
missing something.

If you have any ideas about how to solve this pivot table issue, which
seems to be scant on Google, I'd much appreciate it. I know I can do
this in Excel easily with the automated wizard, but I want to know how
to do it myself and format it to my needs.

John Machin
Guest
Posts: n/a

 12-28-2007
On Dec 28, 10:05 am, (E-Mail Removed) wrote:
> On Dec 27, 10:59 pm, John Machin <(E-Mail Removed)> wrote:
>
> > On Dec 28, 4:56 am, (E-Mail Removed) wrote:

>
> > > from itertools import groupby

>
> > You seem to have overlooked this important sentence in the
> > documentation: "Generally, the iterable needs to already be sorted on
> > the same key function"

>
> Yes, but I imagine this shouldn't prevent me from using and
> manipulating the data.

You imagine correctly (and pointlessly) in general; however in
particular it prevents you using itertools.groupby simplistically to
manipulate the data in the way you want to manipulate it.

> It also doesn't explain why the names get
> sorted correctly and the time does not.

The names in your example were NOT sorted, "correctly" or otherwise.
The output order is the same as the input order: Bob, Joe, Jil.

>>> seq = ['Bob', 'Joe', 'Jil']
>>> sorted(seq)

['Bob', 'Jil', 'Joe']
>>> seq == sorted(seq)

False
>>>

>
> I was trying to do this:
>
> count_tot = []
> for k, g in groupby(data, key=lambda r: r[NAME]):
> for record in g:
> count_tot.append((k,record[SALARY]))
> for i in count_tot:
> here I want to say add all the numbers for each person, but I'm
> missing something.
>
> If you have any ideas about how to solve this pivot table issue, which
> seems to be scant on Google, I'd much appreciate it. I know I can do
> this in Excel easily with the automated wizard, but I want to know how
> to do it myself and format it to my needs.

Watch this space.

John Machin
Guest
Posts: n/a

 12-28-2007
On Dec 28, 11:48 am, John Machin <(E-Mail Removed)> wrote:
> On Dec 28, 10:05 am, (E-Mail Removed) wrote:
>
>
> > If you have any ideas about how to solve this pivot table issue, which
> > seems to be scant on Google, I'd much appreciate it. I know I can do
> > this in Excel easily with the automated wizard, but I want to know how
> > to do it myself and format it to my needs.

>
> Watch this space.

Tested as much as you see:

8<---
class SimplePivotTable(object):

def __init__(
self,
row_order=None, col_order=None, # see example
missing=0, # what to return for an empty cell. Alternatives:
'', 0.0, None, 'NULL'
):
self.row_order = row_order
self.col_order = col_order
self.missing = missing
self.cell_dict = {}
self.row_total = {}
self.col_total = {}
self.grand_total = 0

self.grand_total += value
try:
self.col_total[col_key] += value
except KeyError:
self.col_total[col_key] = value
try:
self.cell_dict[row_key][col_key] += value
self.row_total[row_key] += value
except KeyError:
try:
self.cell_dict[row_key][col_key] = value
self.row_total[row_key] += value
except KeyError:
self.cell_dict[row_key] = {col_key: value}
self.row_total[row_key] = value

return
list(sorted(self.row_total.keys()))
list(sorted(self.col_total.keys()))

def generate_row_info(self):
row_dict = self.cell_dict[row_key]
row_vals = [row_dict.get(col_key, self.missing) for
yield row_key, self.row_total[row_key], row_vals

def get_col_totals(self):
row_dict = self.col_total
row_vals = [row_dict.get(col_key, self.missing) for col_key in
return self.grand_total, row_vals

if __name__ == "__main__":

data = [
['Bob', 'Morn', 240],
['Bob', 'Aft', 300],
['Joe', 'Morn', 70],
['Joe', 'Aft', 80],
['Jil', 'Morn', 100],
['Jil', 'Aft', 150],
['Bob', 'Aft', 40],
['Bob', 'Aft', 5],
['Dozy', 'Aft', 1], # Dozy doesn't show up till lunch-time
]
NAME, TIME, AMOUNT = range(3)

print
ptab = SimplePivotTable(
col_order=['Morn', 'Aft'],
missing='uh-oh',
)
for s in data:
value=s[AMOUNT])
for x in ptab.generate_row_info():
print x
print 'Tots', ptab.get_col_totals()
8<---

patrick.waldo@gmail.com
Guest
Posts: n/a

 12-28-2007
Wow, I did not realize it would be this complicated! I'm fairly new
to Python and somehow I thought I could find a simpler solution. I'll
have to mull over this to fully understand how it works for a bit.

Thanks a lot!

On Dec 28, 4:03 am, John Machin <(E-Mail Removed)> wrote:
> On Dec 28, 11:48 am, John Machin <(E-Mail Removed)> wrote:
>
> > On Dec 28, 10:05 am, (E-Mail Removed) wrote:

>
> > > If you have any ideas about how to solve this pivot table issue, which
> > > seems to be scant on Google, I'd much appreciate it. I know I can do
> > > this in Excel easily with the automated wizard, but I want to know how
> > > to do it myself and format it to my needs.

>
> > Watch this space.

>
> Tested as much as you see:
>
> 8<---
> class SimplePivotTable(object):
>
> def __init__(
> self,
> row_order=None, col_order=None, # see example
> missing=0, # what to return for an empty cell. Alternatives:
> '', 0.0, None, 'NULL'
> ):
> self.row_order = row_order
> self.col_order = col_order
> self.missing = missing
> self.cell_dict = {}
> self.row_total = {}
> self.col_total = {}
> self.grand_total = 0
>
> def add_item(self, row_key, col_key, value):
> self.grand_total += value
> try:
> self.col_total[col_key] += value
> except KeyError:
> self.col_total[col_key] = value
> try:
> self.cell_dict[row_key][col_key] += value
> self.row_total[row_key] += value
> except KeyError:
> try:
> self.cell_dict[row_key][col_key] = value
> self.row_total[row_key] += value
> except KeyError:
> self.cell_dict[row_key] = {col_key: value}
> self.row_total[row_key] = value
>
> return
> list(sorted(self.row_total.keys()))
> list(sorted(self.col_total.keys()))
>
>
> def generate_row_info(self):
> row_dict = self.cell_dict[row_key]
> row_vals = [row_dict.get(col_key, self.missing) for
> yield row_key, self.row_total[row_key], row_vals
>
> def get_col_totals(self):
> row_dict = self.col_total
> row_vals = [row_dict.get(col_key, self.missing) for col_key in
> return self.grand_total, row_vals
>
> if __name__ == "__main__":
>
> data = [
> ['Bob', 'Morn', 240],
> ['Bob', 'Aft', 300],
> ['Joe', 'Morn', 70],
> ['Joe', 'Aft', 80],
> ['Jil', 'Morn', 100],
> ['Jil', 'Aft', 150],
> ['Bob', 'Aft', 40],
> ['Bob', 'Aft', 5],
> ['Dozy', 'Aft', 1], # Dozy doesn't show up till lunch-time
> ]
> NAME, TIME, AMOUNT = range(3)
>
> print
> ptab = SimplePivotTable(
> col_order=['Morn', 'Aft'],
> missing='uh-oh',
> )
> for s in data:
> value=s[AMOUNT])
> for x in ptab.generate_row_info():
> print x
> print 'Tots', ptab.get_col_totals()
> 8<---

petr.jakes.tpc@gmail.com
Guest
Posts: n/a

 12-28-2007
What about to let SQL to work for you.
HTH

Petr Jakes

Tested on Python 2.5.1
8<----------------------

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sqlite3
con = sqlite3.connect(":memory:")
cur = con.cursor()

inputData=(
('Bob', 'Morn', 240),
('Bob', 'Aft', 300),
('Joe', 'Morn', 70),
('Joe', 'Aft', 80),
('Jil', 'Morn', 100),
('Jil', 'Aft', 150),
('Jil', 'Aft', 150),
('Jil', 'Aft', 150))

def data_generator(dataSet):
for dataSetRow in dataSet:
yield dataSetRow

pivotSelect='''
SELECT
NAME,
sum (AMOUNT) as TOTAL,
sum (case when (TIME_OF_DAY) = 'Morn' then AMOUNT else 0 END) as
MORN,
sum (case when (TIME_OF_DAY) = 'Aft' then AMOUNT else 0 END) as AFT
FROM MY_NAMES
GROUP BY 1'''

cur.execute("create table MY_NAMES(NAME, TIME_OF_DAY, AMOUNT)")
cur.executemany("""insert into MY_NAMES(NAME, TIME_OF_DAY, AMOUNT)
values (?,?,?)""", data_generator(inputData))
cur.execute(pivotSelect)
for row in cur.fetchall():
print row

John Machin
Guest
Posts: n/a

 12-29-2007
On Dec 29, 9:58 am, (E-Mail Removed) wrote:
> What about to let SQL to work for you.

The OP is "trying to learn how to make pivot tables from some excel
sheets". You had better give him a clue on how to use ODBC on an
"excel sheet"

[snip]
> SELECT
> NAME,
> sum (AMOUNT) as TOTAL,
> sum (case when (TIME_OF_DAY) = 'Morn' then AMOUNT else 0 END) as
> MORN,
> sum (case when (TIME_OF_DAY) = 'Aft' then AMOUNT else 0 END) as AFT

This technique requires advance knowledge of what the column key
values are (the hard-coded 'Morn' and 'Aft').

<rant>
It is the sort of thing that one sees when %SQL% is the *only*
language used to produce end-user reports. Innocuous when there are
only 2 possible columns, but bletchworthy when there are more than 20
and the conditions are complex and the whole thing is replicated
several times in the %SQL% script because either %SQL% doesn't support
temporary procedures/functions or the BOsFH won't permit their use...
not in front of the newbies, please!
</rant>

patrick.waldo@gmail.com
Guest
Posts: n/a

 12-29-2007
Petr, thanks for the SQL suggestion, but I'm having enough trouble in
Python.

John would you mind walking me through your class in normal speak? I
only have a vague idea of why it works and this would help me a lot to
get a grip on classes and this sort of particular problem. The next
step is to imagine if there was another variable, like departments and
add up the information by name, department, and time, and so on...that
will come another day.

Thanks.

On Dec 29, 1:00 am, John Machin <(E-Mail Removed)> wrote:
> On Dec 29, 9:58 am, (E-Mail Removed) wrote:
>
> > What about to let SQL to work for you.

>
> The OP is "trying to learn how to make pivot tables from some excel
> sheets". You had better give him a clue on how to use ODBC on an
> "excel sheet"
>
> [snip]
>
> > SELECT
> > NAME,
> > sum (AMOUNT) as TOTAL,
> > sum (case when (TIME_OF_DAY) = 'Morn' then AMOUNT else 0 END) as
> > MORN,
> > sum (case when (TIME_OF_DAY) = 'Aft' then AMOUNT else 0 END) as AFT

>
> This technique requires advance knowledge of what the column key
> values are (the hard-coded 'Morn' and 'Aft').
>
> <rant>
> It is the sort of thing that one sees when %SQL% is the *only*
> language used to produce end-user reports. Innocuous when there are
> only 2 possible columns, but bletchworthy when there are more than 20
> and the conditions are complex and the whole thing is replicated
> several times in the %SQL% script because either %SQL% doesn't support
> temporary procedures/functions or the BOsFH won't permit their use...
> not in front of the newbies, please!
> </rant>

John Machin
Guest
Posts: n/a

 12-29-2007
On Dec 29, 11:51 am, (E-Mail Removed) wrote:

> John would you mind walking me through your class in normal speak?

Yes.

> I
> only have a vague idea of why it works and this would help me a lot to
> get a grip on classes and this sort of particular problem.

It's about time you got a *concrete* idea of how something works. Grab
a pencil and a large piece of paper, pretend you are python.exe and
follow through what happens when it executes
ptab = SimplePivotTable(.......)
and the ptab.add_item(......) loop with this set of data:
data = [
['Bob', 'Morn', 240],
['Bob', 'Aft', 300],
['Bob', 'Morn', 40],
]
with the goal of being able to say what is in ptab.cell_dict and
understanding how it got there.

Cheers,
John