How do I skip over multiple words in a file?

 
 
chad
Guest
Posts: n/a
 
      11-11-2010
Let's say that I have an article. What I want to do is read in this
file and have the program skip over ever instance of the words "the",
"and", "or", and "but". What would be the general strategy for
attacking a problem like this?
 
 
 
 
 
Tim Chase
Guest
Posts: n/a
 
      11-11-2010
On 11/11/10 09:07, chad wrote:
> Let's say that I have an article. What I want to do is read in
> this file and have the program skip over ever instance of the
> words "the", "and", "or", and "but". What would be the
> general strategy for attacking a problem like this?


I'd keep a file of "stop words" and read them into a set
(normalizing case in the process). Then, as I skim over each
word in the target file, I'd check whether the case-normalized
version of the word is in the stop-word set and skip it if so.
It might look something like this:

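def normalize_word(s):
    # Strip surrounding whitespace and fold case so "The" matches "the"
    return s.strip().upper()

# stop_words.txt holds one stop word per line
with open('stop_words.txt') as f:
    stop_words = set(normalize_word(word) for word in f)

with open('data.txt') as f:
    for line in f:
        for word in line.split():
            if normalize_word(word) in stop_words:
                continue
            process(word)  # process() stands in for whatever you do with each kept word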

-tkc



 
 
 
 
 
r0g
Guest
Posts: n/a
 
      11-11-2010
On 11/11/10 15:07, chad wrote:
> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over ever instance of the words "the",
> "and", "or", and "but". What would be the general strategy for
> attacking a problem like this?



If your files are not too big, I'd simply read them into a string and do
a string replace for each word you want to skip. If you want case
insensitivity, use re.sub() with the re.IGNORECASE flag instead of the
default str.replace() method. Neither is elegant or all that efficient,
but both are very easy. If your use case requires something high
performance, then best keep looking.
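
A minimal sketch of that replace approach, assuming Python 3. The function name remove_stop_words and the sample sentence are made up for illustration. One case-insensitive pattern with \b word boundaries removes the stop words without touching longer words that merely contain them, such as "band" or "orchestra":

```python
import re

STOP_WORDS = ("the", "and", "or", "but")

# \b word boundaries keep the pattern from matching inside longer words
STOP_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, STOP_WORDS)) + r")\b",
    re.IGNORECASE,
)

def remove_stop_words(text):
    """Return text with the stop words removed and runs of whitespace collapsed."""
    stripped = STOP_RE.sub("", text)
    return " ".join(stripped.split())

print(remove_stop_words("The band and the orchestra played, but nobody moved."))
```

Collapsing whitespace afterwards is a design choice for readability; drop that step if the original spacing matters.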

Roger.
 
 
Paul Watson
Guest
Posts: n/a
 
      11-11-2010
On 2010-11-11 08:07, chad wrote:
> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over ever instance of the words "the",
> "and", "or", and "but". What would be the general strategy for
> attacking a problem like this?


I realize that you may need or want to do this in Python. This would be
trivial in an awk script.
 
 
Paul Rubin
Guest
Posts: n/a
 
      11-11-2010
chad writes:

> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over ever instance of the words "the",
> "and", "or", and "but". What would be the general strategy for
> attacking a problem like this?


Something like (untested):

stopwords = set(('the', 'and', 'or', 'but'))

def goodwords(f):
    # f is any iterable of lines, e.g. an open file object
    for line in f:
        for w in line.split():
            if w.lower() not in stopwords:
                yield w

Removing punctuation is left as an exercise.
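
One possible take on that exercise, as a sketch assuming Python 3: trim punctuation from both ends of each whitespace-separated token before the stop-word check. The name good_words and the sample lines are hypothetical. Punctuation inside a word (the apostrophe in "Let's") is left alone, because str.strip() only touches the ends.

```python
import string

STOP_WORDS = {"the", "and", "or", "but"}

def good_words(lines):
    """Yield words from lines, skipping stop words; punctuation is trimmed from both ends."""
    for line in lines:
        for raw in line.split():
            word = raw.strip(string.punctuation)
            # Skip tokens that were pure punctuation as well as stop words
            if word and word.lower() not in STOP_WORDS:
                yield word

sample = ['Skip over "the", "and", "or", and "but".', "But keep the rest!"]
print(list(good_words(sample)))
```

Since good_words accepts any iterable of lines, an open file object can be passed in place of the sample list.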
 
 
Stefan Sonnenberg-Carstens
Guest
Posts: n/a
 
      11-11-2010
On 11.11.2010 21:33, Paul Watson wrote:
> On 2010-11-11 08:07, chad wrote:
>> Let's say that I have an article. What I want to do is read in this
>> file and have the program skip over ever instance of the words "the",
>> "and", "or", and "but". What would be the general strategy for
>> attacking a problem like this?

>
> I realize that you may need or want to do this in Python. This would
> be trivial in an awk script.

There are several ways to do this.

skip = ('and', 'or', 'but')
words = []
with open('some.txt') as f:
    for line in f:
        # keep every whitespace-separated token that is not a stop word
        words.extend(w for w in line.split() if w not in skip)
print(words)

If some.txt contains your original question, it returns this:
["Let's", 'say', 'that', 'I', 'have', 'an', 'article.', 'What', 'I',
'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
'program', 'skip', 'over', 'ever', 'instance', 'of', 'the', 'words',
'"the",', '"and",', '"or",', '"but".', 'What', 'would', 'be', 'the',
'general', 'strategy', 'for', 'attacking', 'a', 'problem', 'like', 'this?']

But this is just _one_ way to get there.
Faster solutions could be based on a regex:
import re
skip = ('and', 'or', 'but')
word_re = re.compile(r'\w+')  # raw string, so \w is not treated as a string escape
with open('some.txt') as f:
    print([w for w in word_re.findall(f.read()) if w not in skip])

This gives the following result (you lose some punctuation etc.):
['Let', 's', 'say', 'that', 'I', 'have', 'an', 'article', 'What', 'I',
'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
'program', 'skip', 'over', 'ever', 'instance', 'of', 'the', 'words',
'the', 'What', 'would', 'be', 'the', 'general', 'strategy', 'for',
'attacking', 'a', 'problem', 'like', 'this']

But there are so many ways to do it ...


 
 
 
 