Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Learning Python via a little word frequency program

Reply
Thread Tools

Learning Python via a little word frequency program

 
 
Andrew Savige
Guest
Posts: n/a
 
      01-09-2008
I'm learning Python by reading David Beazley's "Python Essential Reference"
book and writing a few toy programs. To get a feel for hashes and sorting,
I set myself this little problem today (not homework, BTW):

Given a string containing a space-separated list of names:

names = "freddy fred bill jock kevin andrew kevin kevin jock"

produce a frequency table of names, sorted descending by frequency.
then ascending by name. For the above data, the output should be:

kevin : 3
jock : 2
andrew : 1
bill : 1
fred : 1
freddy : 1

Here's my first attempt:

names = "freddy fred bill jock kevin andrew kevin kevin jock"
freq = {}
for name in names.split():
freq[name] = 1 + freq.get(name, 0)
deco = zip([-x for x in freq.values()], freq.keys())
deco.sort()
for v, k in deco:
print "%-10s: %d" % (k, -v)

I'm interested to learn how more experienced Python folks would solve
this little problem. Though I've read about the DSU Python sorting idiom,
I'm not sure I've strictly applied it above ... and the -x hack above to
achieve a descending sort feels a bit odd to me, though I couldn't think
of a better way to do it.

I also have a few specific questions. Instead of:

for name in names.split():
freq[name] = 1 + freq.get(name, 0)

I might try:

for name in names.split():
try:
freq[name] += 1
except KeyError:
freq[name] = 1

Which is preferred?

Ditto for:

deco = zip([-x for x in freq.values()], freq.keys())

versus:

deco = zip(map(operator.neg, freq.values()), freq.keys())

Finally, I might replace:

for v, k in deco:
print "%-10s: %d" % (k, -v)

with:

print "\n".join("%-10s: %d" % (k, -v) for v, k in deco)

Any feedback on good Python style, performance tips, good books
to read, etc. is appreciated.

Thanks,
/-\


Make the switch to the world's best email. Get the new Yahoo!7 Mail now. www.yahoo7.com.au/worldsbestemail


 
Reply With Quote
 
 
 
 
Ant
Guest
Posts: n/a
 
      01-09-2008
> I'm interested to learn how more experienced Python folks would solve
> this little problem.


I think I'd do the following:

from collections import defaultdict

names = "freddy fred bill jock kevin andrew kevin kevin jock"
freq = defaultdict(lambda: 0)

for name in names.split():
freq[name] += 1

pairs = [(v, k) for k, v in freq.iteritems()]

for v, k in reversed(sorted(pairs)):
print "%-10s: %d" % (k, v)


defaultdict makes the frequency accumulation neater.

reversed(sorted(pairs)) avoids the little -v hack and makes it more
obvious what you are doing. Of course this could also be achieved by
doing pairs.sort() and pairs.reverse() before iterating over the pairs
list.

Cheers,

--
Ant.
 
Reply With Quote
 
 
 
 
Peter Otten
Guest
Posts: n/a
 
      01-09-2008
Andrew Savige wrote:

> I'm learning Python by reading David Beazley's "Python Essential
> Reference" book and writing a few toy programs. To get a feel for hashes
> and sorting, I set myself this little problem today (not homework, BTW):
>
> Given a string containing a space-separated list of names:
>
> names = "freddy fred bill jock kevin andrew kevin kevin jock"
>
> produce a frequency table of names, sorted descending by frequency.
> then ascending by name. For the above data, the output should be:
>
> kevin : 3
> jock : 2
> andrew : 1
> bill : 1
> fred : 1
> freddy : 1
>
> Here's my first attempt:
>
> names = "freddy fred bill jock kevin andrew kevin kevin jock" freq = {}
> for name in names.split():
> freq[name] = 1 + freq.get(name, 0)
> deco = zip([-x for x in freq.values()], freq.keys()) deco.sort() for v,
> k in deco:
> print "%-10s: %d" % (k, -v)
>
> I'm interested to learn how more experienced Python folks would solve
> this little problem. Though I've read about the DSU Python sorting
> idiom, I'm not sure I've strictly applied it above ... and the -x hack
> above to achieve a descending sort feels a bit odd to me, though I
> couldn't think of a better way to do it.


You can specify a reverse sort with

deco.sort(reverse=True)

Newer versions of Python have the whole idiom built in:

>>> items = freq.items()
>>> from operator import itemgetter
>>> items.sort(key=itemgetter(1), reverse=True)
>>> for item in items:

.... print "%-10s: %d" % item
....
kevin : 3
jock : 2
bill : 1
andrew : 1
fred : 1
freddy : 1

You can pass an arbitrary function as key. itemgetter(1) is equivalent to

def key(item): return item[1]

> I also have a few specific questions. Instead of:
>
> for name in names.split():
> freq[name] = 1 + freq.get(name, 0)
>
> I might try:
>
> for name in names.split():
> try:
> freq[name] += 1
> except KeyError:
> freq[name] = 1
>
> Which is preferred?


I have no strong opinion about that. Generally speaking try...except is
faster when you have many hits, i. e. the except suite is rarely invoked.
Starting with Python 2.5 you can alternatively use

from collections import defaultdict
freq = defaultdict(int)
for name in names.split():
freq[name] += 1

> Ditto for:
>
> deco = zip([-x for x in freq.values()], freq.keys())
>
> versus:
>
> deco = zip(map(operator.neg, freq.values()), freq.keys())


I think the list comprehension is slightly more readable.

> Finally, I might replace:
>
> for v, k in deco:
> print "%-10s: %d" % (k, -v)
>
> with:
>
> print "\n".join("%-10s: %d" % (k, -v) for v, k in deco)


Again, I find the explicit for loop more readable, but sometimes use the
genexp, too.

Peter
 
Reply With Quote
 
Bruno Desthuilliers
Guest
Posts: n/a
 
      01-09-2008
Andrew Savige a écrit :
> I'm learning Python by reading David Beazley's "Python Essential Reference"
> book and writing a few toy programs. To get a feel for hashes and sorting,
> I set myself this little problem today (not homework, BTW):
>
> Given a string containing a space-separated list of names:
>
> names = "freddy fred bill jock kevin andrew kevin kevin jock"
>
> produce a frequency table of names, sorted descending by frequency.
> then ascending by name. For the above data, the output should be:
>
> kevin : 3
> jock : 2
> andrew : 1
> bill : 1
> fred : 1
> freddy : 1
>
> Here's my first attempt:
>
> names = "freddy fred bill jock kevin andrew kevin kevin jock"
> freq = {}
> for name in names.split():
> freq[name] = 1 + freq.get(name, 0)
> deco = zip([-x for x in freq.values()], freq.keys())
> deco.sort()
> for v, k in deco:
> print "%-10s: %d" % (k, -v)
>
> I'm interested to learn how more experienced Python folks would solve
> this little problem.


For a one-shot Q&D script:

names = "freddy fred bill jock kevin andrew kevin kevin jock"
freqs = [(- names.count(name), name) for name in set(names.split())]
print "\n".join("%-10s : %s" % (n, -f) for f, n in sorted(freqs))


Now I might choose a very different solution for a more serious
application, depending on detailed specs and intended use of the
"frequency table".

> Though I've read about the DSU Python sorting idiom,
> I'm not sure I've strictly applied it above ...


Perhaps not "strictly" since you don't really "undecorate", but that's
another application of the same principle : provided the appropriate
data structure, sort() (or sorted()) will do the right thing.


> and the -x hack above to
> achieve a descending sort feels a bit odd to me, though I couldn't think
> of a better way to do it.


The "other" way would be to pass a custom comparison callback to sort,
which would be both slower and more complicated. Your solution is IMHO
the right thing to do here.

> I also have a few specific questions. Instead of:
>
> for name in names.split():
> freq[name] = 1 + freq.get(name, 0)
>
> I might try:
>
> for name in names.split():
> try:
> freq[name] += 1
> except KeyError:
> freq[name] = 1


or a couple other solutions, including a defaultdict (python >= 2.5).

> Which is preferred?


It's a FAQ - or it should be one. Globally: the second one tends to be
faster when there's no exception (ie the key already exists), but slower
when exceptions happen. So it mostly depends on what you expect your
dataset to be.

Now note that you don't necessarily need a dict here !-)

> Ditto for:
>
> deco = zip([-x for x in freq.values()], freq.keys())
>
> versus:
>
> deco = zip(map(operator.neg, freq.values()), freq.keys())


As far as I'm concerned, I'd favor the first solution here. Reads better
IMHO

> Finally, I might replace:
>
> for v, k in deco:
> print "%-10s: %d" % (k, -v)
>
> with:
>
> print "\n".join("%-10s: %d" % (k, -v) for v, k in deco)


That's what I'd do here too - but it depends on context (ie: for huge
datasets and/or complex formating, i'd use a for loop).

 
Reply With Quote
 
Bruno Desthuilliers
Guest
Posts: n/a
 
      01-09-2008
Ant a écrit :
>> I'm interested to learn how more experienced Python folks would solve
>> this little problem.

>
> I think I'd do the following:
>
> from collections import defaultdict
>
> names = "freddy fred bill jock kevin andrew kevin kevin jock"
> freq = defaultdict(lambda: 0)
>
> for name in names.split():
> freq[name] += 1
>
> pairs = [(v, k) for k, v in freq.iteritems()]
>
> for v, k in reversed(sorted(pairs)):
> print "%-10s: %d" % (k, v)
>
>
> defaultdict makes the frequency accumulation neater.
>
> reversed(sorted(pairs)) avoids the little -v hack and makes it more
> obvious what you are doing.


But fails to implement the specs (emphasis is mine):
"""
produce a frequency table of names, sorted descending by frequency.
*then ascending by name*. For the above data, the output should be:

kevin : 3
jock : 2
andrew : 1
bill : 1
fred : 1
freddy : 1
"""

With your solution, you get:

kevin : 3
jock : 2
freddy : 1
fred : 1
bill : 1
andrew : 1


 
Reply With Quote
 
MRAB
Guest
Posts: n/a
 
      01-09-2008
On Jan 9, 12:19 pm, Bruno Desthuilliers <bruno.
(E-Mail Removed)> wrote:
> Andrew Savige a écrit :
>
>
>
> > I'm learning Python by reading David Beazley's "Python Essential Reference"
> > book and writing a few toy programs. To get a feel for hashes and sorting,
> > I set myself this little problem today (not homework, BTW):

>
> > Given a string containing a space-separated list of names:

>
> > names = "freddy fred bill jock kevin andrew kevin kevin jock"

>
> > produce a frequency table of names, sorted descending by frequency.
> > then ascending by name. For the above data, the output should be:

>
> > kevin : 3
> > jock : 2
> > andrew : 1
> > bill : 1
> > fred : 1
> > freddy : 1

>
> > Here's my first attempt:

>
> > names = "freddy fred bill jock kevin andrew kevin kevin jock"
> > freq = {}
> > for name in names.split():
> > freq[name] = 1 + freq.get(name, 0)
> > deco = zip([-x for x in freq.values()], freq.keys())
> > deco.sort()
> > for v, k in deco:
> > print "%-10s: %d" % (k, -v)

>
> > I'm interested to learn how more experienced Python folks would solve
> > this little problem.

>
> For a one-shot Q&D script:
>
> names = "freddy fred bill jock kevin andrew kevin kevin jock"
> freqs = [(- names.count(name), name) for name in set(names.split())]
> print "\n".join("%-10s : %s" % (n, -f) for f, n in sorted(freqs))
>

[snip]
That actually prints:

kevin : 3
fred : 2
jock : 2
andrew : 1
bill : 1
freddy : 1

It says that "fred" occurs twice because of "freddy".

names = "freddy fred bill jock kevin andrew kevin kevin jock"
name_list = names.split()
freqs = [(- name_list.count(name), name) for name in set(name_list)]
print "\n".join("%-10s : %s" % (n, -f) for f, n in sorted(freqs))
 
Reply With Quote
 
Bruno Desthuilliers
Guest
Posts: n/a
 
      01-10-2008
MRAB a écrit :
> On Jan 9, 12:19 pm, Bruno Desthuilliers <bruno.
> (E-Mail Removed)> wrote:

(snip)
> That actually prints:
>
> kevin : 3
> fred : 2
> jock : 2
> andrew : 1
> bill : 1
> freddy : 1
>
> It says that "fred" occurs twice because of "freddy".


oops ! My bad, didn't spot that one

Thanks for pointing this out.
 
Reply With Quote
 
rent
Guest
Posts: n/a
 
      01-11-2008
import collections

names = "freddy fred bill jock kevin andrew kevin kevin jock"
freq = collections.defaultdict(int)
for name in names.split():
freq[name] += 1
keys = freq.keys()
keys.sort(key = freq.get, reverse = True)
for k in keys:
print "%-10s: %d" % (k, freq[k])

On Jan 9, 6:58 pm, Andrew Savige <(E-Mail Removed)> wrote:
> I'm learning Python by reading David Beazley's "Python Essential Reference"
> book and writing a few toy programs. To get a feel for hashes and sorting,
> I set myself this little problem today (not homework, BTW):
>
> Given a string containing a space-separated list of names:
>
> names = "freddy fred bill jock kevin andrew kevin kevin jock"
>
> produce a frequency table of names, sorted descending by frequency.
> then ascending by name. For the above data, the output should be:
>
> kevin : 3
> jock : 2
> andrew : 1
> bill : 1
> fred : 1
> freddy : 1
>
> Here's my first attempt:
>
> names = "freddy fred bill jock kevin andrew kevin kevin jock"
> freq = {}
> for name in names.split():
> freq[name] = 1 + freq.get(name, 0)
> deco = zip([-x for x in freq.values()], freq.keys())
> deco.sort()
> for v, k in deco:
> print "%-10s: %d" % (k, -v)
>
> I'm interested to learn how more experienced Python folks would solve
> this little problem. Though I've read about the DSU Python sorting idiom,
> I'm not sure I've strictly applied it above ... and the -x hack above to
> achieve a descending sort feels a bit odd to me, though I couldn't think
> of a better way to do it.
>
> I also have a few specific questions. Instead of:
>
> for name in names.split():
> freq[name] = 1 + freq.get(name, 0)
>
> I might try:
>
> for name in names.split():
> try:
> freq[name] += 1
> except KeyError:
> freq[name] = 1
>
> Which is preferred?
>
> Ditto for:
>
> deco = zip([-x for x in freq.values()], freq.keys())
>
> versus:
>
> deco = zip(map(operator.neg, freq.values()), freq.keys())
>
> Finally, I might replace:
>
> for v, k in deco:
> print "%-10s: %d" % (k, -v)
>
> with:
>
> print "\n".join("%-10s: %d" % (k, -v) for v, k in deco)
>
> Any feedback on good Python style, performance tips, good books
> to read, etc. is appreciated.
>
> Thanks,
> /-\
>
> Make the switch to the world's best email. Get the new Yahoo!7 Mail now.www.yahoo7.com.au/worldsbestemail


 
Reply With Quote
 
Paul Rubin
Guest
Posts: n/a
 
      01-11-2008
rent <(E-Mail Removed)> writes:
> keys = freq.keys()
> keys.sort(key = freq.get, reverse = True)
> for k in keys:
> print "%-10s: %d" % (k, freq[k])


I prefer (untested):

def snd((x,y)): return y # I wish this was built-in
sorted_freq = sorted(freq.iteritems(), key=snd, reverse=True)
for k,f in sorted_freq:
print "%-10s: %d" % (k, f)
 
Reply With Quote
 
Mike Meyer
Guest
Posts: n/a
 
      01-11-2008
On 11 Jan 2008 03:50:53 -0800 Paul Rubin <"http://phr.cx"@NOSPAM.invalid> wrote:

> rent <(E-Mail Removed)> writes:
> > keys = freq.keys()
> > keys.sort(key = freq.get, reverse = True)
> > for k in keys:
> > print "%-10s: %d" % (k, freq[k])

>
> I prefer (untested):
>
> def snd((x,y)): return y # I wish this was built-in


What's wrong with operator.itemgetter?

> sorted_freq = sorted(freq.iteritems(), key=snd, reverse=True)


(still untested)

from operator import itemgetter
sorted_freq = sorted(freq.iteritems(), key=itemgetter(2), reverse=True)

<mike

--
Mike Meyer <(E-Mail Removed)> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
FAQ 6.15 How can I print out a word-frequency or line-frequency summary? PerlFAQ Server Perl Misc 0 03-26-2011 04:00 AM
FAQ 6.15 How can I print out a word-frequency or line-frequency summary? PerlFAQ Server Perl Misc 0 02-01-2011 11:00 AM
1 little 2 little 3 little Kennedys dale Digital Photography 0 03-23-2008 01:03 PM
Counting Frequency of Values in an Array (And Sorting by Frequency?) x1 Ruby 9 10-12-2006 04:04 PM
faster data structure to count word frequency? Kevin Java 16 04-19-2005 05:13 PM



Advertisments