On Sun, 03 Aug 2003 00:28:53 GMT,
(John Machin) wrote:
>On Sat, 02 Aug 2003 12:41:19 GMT, "Raymond Hettinger"
><> wrote:
>
>
>>it-all-starts-with-a-good-data-structure-ly yours,
>
>Amen, brother.
>
>Even worse: I recall seeing code somewhere that had a data structure
>like this:
>
>{k1: [v1,v2], k2: v3, k3: None, ...}
>instead of
>{k1: [v1,v2], k2: [v3], k3: [], ...}
>
Page 1
>I liked the elegant code example for building a book index. However in
>practice the user requirement would be not to have duplicated page
>numbers when a word occurs more than once on the same page. If you can
>achieve that elegantly, please post it!
>
Page 2
Don't know about elegance, but I wanted to play with my new 2.3 version
Assuming the Timbot has special-cased small sets efficiently, and
assuming pageSeq below is a generator can yield a sequence of word-iterable
page objects (e.g. lists of words left after cleaning out punctuation
etc of a page, however page boundaries are detected, etc etc):
(FTHOI, the following pageSeq generator expects to find a "Page xxx"
line ending each page, with nothing else on the line). I'm going to use
this post as a "book"
Page 3
====< bookindex.py >=============================================== ==
keepalnum = ''.join([c.isalnum() and c or ' ' for c in [chr(i) for i in range(256)]])
def pageSeq(bookFile): # expect open bookFile ready to read
pageWords = []
pageId = '<prologue>'
while 1:
line = bookFile.readline()
if not line: break
lineToks = line.translate(keepalnum).split()
if not lineToks: continue
# expect single footer line with only "Page pgid" at end of page
if len(lineToks)==2 and lineToks[0]=='Page':
pageId = lineToks[1]
yield pageId, pageWords
pageWords = []; pageId = '(after %s)'% pageId
else: pageWords.extend([tok for tok in lineToks if tok.isalpha()]) # rej digits in words
if pageWords:
yield pageId, pageWords
def bookIndex(bookFile):
# (untested, requires 2.3 features)
from sets import Set
wordIndex = {}
for pgId, page in pageSeq(bookFile):
for word in page:
try:
wordIndex[word].add(pgId)
except KeyError:
wordIndex[word] = Set([pgId])
words = wordIndex.keys()
words.sort()
for word in words:
pageIds = list(wordIndex[word])
pnMaxWidth = max(map(len,pageIds))
pageIds = [pid.rjust(pnMaxWidth) for pid in pageIds]
pageIds.sort()
pageIds = map(str.strip, pageIds)
yield (word, ', '.join(pageIds))
if __name__ == '__main__':
import sys
args = sys.argv[1:]
if args:
arg = args.pop(0)
if arg == '-': f = sys.stdin
else: f = file(arg)
iCol=0
for wordOccLine in bookIndex(f):
print ('%14s: %-10s' % wordOccLine),
if iCol%3 == 2: print
iCol += 1
print
else:
raise SystemExit(
'Usage: python bookindex.py bookfilepath\n'
' a single minus for bookfilepath uses sys.stdin')
================================================== ===================
Page CODE
I started to post just the bookIndex generator (still marked untested, which is
almost right still

Then one think led to another
Page LAST
I'll copy above this line to the clip board and filter it through. Result:
[22:22] C:\pywk\clp>getclip | python bookindex.py - |more
Amen: 1 Assuming: 3 Aug: 1
Don: 3 Even: 1 FTHOI: 3
GMT: 1 Hettinger: 1 However: 2
I: 1, 2, 3, LAST If: 2 John: 1
KeyError: CODE Machin: 1 None: 1
On: 1 Page: 3, CODE Raymond: 1
Sat: 1 Set: CODE Sun: 1
SystemExit: CODE Then: LAST Timbot: 3
Usage: CODE a: 1, 2, 3, CODE about
: 3
achieve: 2 add: CODE after: 3, CODE
all: 1 almost: LAST and: 3, CODE
another: LAST are: 3 arg: CODE
args: CODE argv: CODE as: 3
assuming: 3 at: CODE be: 2
below: 3 book: 2, 3 bookFile: CODE
bookIndex: CODE, LAST bookfilepath: CODE bookindex: CODE
boundaries: 3 break: CODE brother: 1
building: 2 but: 3 c: CODE
can: 2, 3 cased: 3 chr: CODE
cleaning: 3 code: 1, 2 continue: CODE
data: 1 def: CODE detected: 3
digits: CODE duplicated: 2 e: 3
each: 3 efficiently: 3 elegance: 3
elegant: 2 elegantly: 2 else: 3, CODE
end: CODE ending: 3 etc: 3
example: 2 except: CODE expect: CODE
expects: 3 extend: CODE f: CODE
features: CODE file: CODE find: 3
following: 3 footer: CODE for: 2, CODE
from: CODE g: 3 generator: 3, LAST
going: 3 good: 1 had: 1
has: 3 have: 2 however: 3
i: CODE iCol: CODE if: CODE
import: CODE in: 2, CODE index: 2
instead: 1 is: 3, LAST isalnum: CODE
isalpha: CODE it: 1, 2 iterable: 3
join: CODE just: LAST keepalnum: CODE
keys: CODE know: 3 led: LAST
left: 3 len: CODE lexicon: 1
like: 1 liked: 2 line: 3, CODE
lineToks: CODE list: CODE lists: 3
ly: 1 m: 3 main: CODE
map: CODE marked: LAST max: CODE
minus: CODE more: 2 my: 3
n: CODE name: CODE net: 1
new: 3 not: 2, CODE nothing: 3
numbers: 2 objects: 3 occurs: 2
of: 1, 3, CODE on: 2, 3 once: 2
one: LAST only: CODE open: CODE
or: CODE out: 3 page: 2, 3, CODE
pageId: CODE pageIds: CODE pageSeq: 3, CODE
pageWords: CODE pgId: CODE pgid: CODE
pid: CODE play: 3 please: 2
pnMaxWidth: CODE pop: CODE post: 2, 3, LAST
practice: 2 print: CODE prologue: CODE
punctuation: 3 py: CODE python: CODE
raise: CODE range: CODE read: CODE
readline: CODE ready: CODE recall: 1
rej: CODE requirement: 2 requires: CODE
right: LAST rjust: CODE s: CODE
same: 2 seeing: 1 sequence: 3
sets: 3, CODE single: CODE sjmachin: 1
small: 3 somewhere: 1 sort: CODE
special: 3 split: CODE started: LAST
starts: 1 stdin: CODE still: LAST
str: CODE strip: CODE structure: 1
sys: CODE t: 3 than: 2
that: 1, 2 the: 2, 3, LAST think: LAST
this: 1, 3 to: 2, 3, CODE, LAST tok: CODE
translate: CODE try: CODE untested: CODE, LAST
use: 3 user: 2 uses: CODE
verizon: 1 version: 3 wanted: 3
when: 2 which: LAST while: CODE
with: 1, 3, CODE word: 2, 3, CODE wordIndex: CODE
wordOccLine: CODE words: 3, CODE worse: 1
would: 2 wrote: 1 xxx: 3
yield: 3, CODE you: 2 yours: 1
Regards,
Bengt Richter