Homegrown wordcount

 
 
Thomas Philips
      06-11-2004
I've coded a little word counting routine that handles a reasonably
wide range of inputs. How could it be made to cover more, though
admittedly more remote, possibilities such as nested lists of lists,
items for which the string representation is a string containing lists
etc. etc. without significantly increasing the complexity of the
program?

Thomas Philips

def wordcount(input):

    from string import whitespace

    #Treat iterable inputs differently
    if "__iter__" in dir(input):
        wordList =(" ".join([str(item) for item in input])).split()
    else:
        wordList = [str(input)]

    #Remove any words that are just whitespace
    for i,word in enumerate(wordList):
        while word and word[-1] in whitespace:
            word = word[:-1]
        wordList[i] = word
    wc = len(filter(None,wordList)) #Filter out any empty strings
    return wc
 
David Wilson
      06-11-2004
On Fri, Jun 11, 2004 at 11:05:32AM -0700, Thomas Philips wrote:
> I've coded a little word counting routine that handles a reasonably
> wide range of inputs. How could it be made to cover more, though
> admittedly more remote, possibilities such as nested lists of lists,
> items for which the string representation is a string containing lists
> etc. etc. without significantly increasing the complexity of the
> program?


Hello,

Such 'magical' behaviour is error-prone and causes many a headache when
debugging. Some might think that even this is too much:

> #Treat iterable inputs differently
> if "__iter__" in dir(input):
>     wordList =(" ".join([str(item) for item in input])).split()
> else:
>     wordList = [str(input)]


Myself included. Instead of increasing the complexity of this
function, why not write a few wrapper functions if you have the need?


David.

--
"Science is what we understand well enough to explain to a
computer. Art is everything else we do."
-- Donald Knuth
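
One way to read the wrapper-function suggestion is sketched below. It is written for Python 3 (the code posted in this thread is Python 2), and the names `count_words` and `count_words_in_all` are hypothetical, not taken from the thread. The core function accepts only a string; a separate wrapper handles a flat iterable of strings explicitly, so no type sniffing happens inside the counter itself.

```python
def count_words(text):
    """Count whitespace-separated words in a single string."""
    if not isinstance(text, str):
        raise TypeError("count_words expects a str, got %s" % type(text).__name__)
    return len(text.split())


def count_words_in_all(texts):
    """Wrapper (hypothetical name): sum the counts of a flat iterable of strings."""
    return sum(count_words(text) for text in texts)


print(count_words("This is a test"))                       # 4
print(count_words_in_all(["This is a test", "one more"]))  # 6
```

With this split, a caller who needs nested input would write another small wrapper rather than teaching `count_words` about more input types.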

 
Larry Bates
      06-11-2004
Something like this?

def wordcount(input, sep=" "):
    global words
    if isinstance(input, str):
        words+=len([x.strip() for x in input.split(sep)])
        return words
    else:
        for item in input:
            wordcount(item)

    return words

#
# Test with a string
#
words=0
print wordcount("This is a test") # String test
words=0
print wordcount(["This is a test", "This is a test"]) # List test
words=0
print wordcount([["This is a test","This is a test"],
                 ["This is a test","This is a test"]]) # List of lists
words=0
data=[["this is a test"],["this", "is", "a", "test"],"This is a test"]
print wordcount(data)

HTH,
Larry Bates
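
A detail worth checking in a function like the one above is the explicit `sep=" "` default. In Python, splitting on an explicit separator keeps the empty strings produced by repeated or trailing separators, whereas a bare `split()` collapses runs of whitespace. The short Python 3 sketch below illustrates the difference; the sample string is made up for the demonstration.

```python
text = "This  is a test "   # note the double space and the trailing space

pieces_explicit = text.split(" ")   # explicit separator keeps empty strings
pieces_default = text.split()       # no separator: runs of whitespace collapse

print(pieces_explicit)                            # ['This', '', 'is', 'a', 'test', '']
print(len(pieces_explicit), len(pieces_default))  # 6 4
```

So a count based on `split(" ")` can overstate the number of words whenever the input contains consecutive spaces.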


"Thomas Philips" <(E-Mail Removed)> wrote in message
> I've coded a little word counting routine that handles a reasonably
> wide range of inputs. How could it be made to cover more, though
> admittedly more remote, possibilities such as nested lists of lists,
> items for which the string representation is a string containing lists
> etc. etc. without significantly increasing the complexity of the
> program?
>
> Thomas Philips
>
> def wordcount(input):
>
>     from string import whitespace
>
>     #Treat iterable inputs differently
>     if "__iter__" in dir(input):
>         wordList =(" ".join([str(item) for item in input])).split()
>     else:
>         wordList = [str(input)]
>
>     #Remove any words that are just whitespace
>     for i,word in enumerate(wordList):
>         while word and word[-1] in whitespace:
>             word = word[:-1]
>         wordList[i] = word
>     wc = len(filter(None,wordList)) #Filter out any empty strings
>     return wc



 
Thomas Philips
      06-12-2004
An embarrassing mistake on my part: I should have typed
    #Treat iterable inputs differently
    if "__iter__" in dir(input):
        wordList =(" ".join([str(item) for item in input])).split()
    else:
        wordList = str(input).split()

I wish I knew how to treat all possible inputs in a uniform fashion,
but I'm nowhere near there as yet, hence the question. That said, it
addresses the situations that arise in practice fairly well, though I
am sure it can be sped up substantially.

Thomas Philips
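
The effect of this correction can be seen in isolation with a small Python 3 sketch (variable names are illustrative only). It also shows why relying on `str()` for nested containers is fragile: the tokens come from the printed representation, brackets and quotes included.

```python
value = "This is a test"

as_single_item = [str(value)]        # original version: one list element
as_split_words = str(value).split()  # corrected version: one element per word

print(len(as_single_item))   # 1
print(len(as_split_words))   # 4

# str() of a nested list yields its repr, so punctuation sticks to the tokens.
nested_tokens = str([["a b"], "c"]).split()
print(nested_tokens)
```

The last line prints tokens such as `[['a` rather than clean words, which is one reason the nested case needs explicit handling.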
 
Grégoire Dooms
      06-12-2004
Larry Bates wrote:
> Something like this?
>
> def wordcount(input, sep=" "):
>     global words
>     if isinstance(input, str):
>         words+=len([x.strip() for x in input.split(sep)])


What's the purpose of stripping the items in the list if you only count
them? Isn't this equivalent to
    words += len(input.split(sep))

>         return words
>     else:
>         for item in input:
>             wordcount(item)
>
>     return words


Removing the global statement and sep param, you get:

def wordcount(input):
    if isinstance(input, str):
        return len(input.split())
    else:
        return sum([wordcount(item) for item in input])

--
Grégoire Dooms
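
As a sketch of how the recursive version above could also cover non-string leaves (numbers, for instance), the Python 3 variant below falls back to `str()` for anything that is neither text nor iterable. The fallback and the `bytes` handling are assumptions added for illustration; they were not proposed in the thread.

```python
from collections.abc import Iterable


def wordcount(value):
    """Recursively count whitespace-separated words in nested containers."""
    if isinstance(value, (str, bytes)):
        # Text is split directly; checking it first also stops the recursion,
        # since iterating a str would only yield more strings.
        return len(value.split())
    if isinstance(value, Iterable):
        return sum(wordcount(item) for item in value)
    # Assumption: any other object counts via its string representation.
    return len(str(value).split())


data = [["this is a test"], ["this", "is", "a", "test"], "This is a test"]
print(wordcount(data))                                # 12
print(wordcount(["a b", 3.5, ("c", ["d e f"])]))      # 7
```

Note that iterating a dict visits only its keys, and a self-referencing list would recurse without end, so this remains a sketch rather than a treatment of every possible input.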
 
Keith P. Boruff
      06-13-2004
Grégoire Dooms wrote:


> What's the purpose of stripping the items in the list if you just count
> their number? Isn't this equivalent to
>     words += len(input.split(sep))
>
>>         return words
>>     else:
>>         for item in input:
>>             wordcount(item)
>>
>>     return words
>
> Removing the global statement and sep param, you get:
>
> def wordcount(input):
>     if isinstance(input, str):
>         return len(input.split())
>     else:
>         return sum([wordcount(item) for item in input])
>
> --
> Grégoire Dooms


After reading this thread, I decided to embark on a word counting
program of my own. One thing I like to do when learning new programming
languages is to try and emulate some of my favorite UNIX type programs.

That said, to get the count of words in a string, I merely did the
following:


# Beginning of program

import re
import sys

# Right now my simple wc program just reads piped data
if sys.stdin.isatty():
    sys.exit("no piped input")
input_data = sys.stdin.read()

print "number of words:", len(re.findall(r'[^\s]+', input_data))

# End of program

Though I've only done trivial tests on this up to now, the word count of
this script seems to match that of the wc on my system (RH Linux WS). I
ran some big RFC text files through this too.

There could be some flaws here; I don't know. I'll have to look at it
better when I get back from the gym. If anyone here finds a problem, I'd
be interested in hearing it.

Like I said, I love using these UNIX-type programs to learn a new
language. It helps me learn things like file I/O, command-line
arguments, string manipulation, etc.

Keith P. Boruff
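
For comparison, the regex count used in this post can be checked against a plain `split()` on a small sample. The Python 3 sketch below uses `\S+`, which matches the same runs of non-whitespace characters as `[^\s]+`; the sample text is invented, and the comparison is only claimed here for ordinary ASCII whitespace.

```python
import re

sample = "one  two\tthree\nfour   \n"

regex_count = len(re.findall(r"\S+", sample))  # runs of non-whitespace
split_count = len(sample.split())              # whitespace-delimited fields

print(regex_count, split_count)   # 4 4
```

If the two approaches agree on the inputs of interest, `split()` avoids the regex import entirely.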






 