Replace stop words (remove words from a string)

 
 
BerlinBrown
      01-17-2008
If I have an array of "stop" words and I want to replace those values
in a string with something else, how would I go about doing this? I
have code that splits the string and then takes a set difference, but I
think there is an easier approach:

E.g.

mystr =
kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;

if I have an array stop_list = [ "[BAD]", "[BAD2]" ]

I want to replace the values in that list with a zero length string.

I had this before, but I don't want to use this approach; I don't want
to use the split.

line_list = line.lower().split()
res = list(set(keywords_list).difference(set(ENTITY_IGNORE_LIST)))


 
Karthik
      01-17-2008
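How about -

import string  # Python 2 only; string.replace() was removed in Python 3

for s in stoplist:
    # replace() returns a new string, so keep the result
    mystr = string.replace(mystr, s, "")

I hope this works.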

Gary Herron
      01-17-2008
BerlinBrown wrote:
> if I have an array of "stop" words, and I want to replace those values
> with something else; in a string, how would I go about doing this. I
> have this code that splits the string and then does a difference but I
> think there is an easier approach:
>
> E.g.
>
> mystr =
> kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;
>
> if I have an array stop_list = [ "[BAD]", "[BAD2]" ]
>
> I want to replace the values in that list with a zero length string.
>
> I had this before, but I don't want to use this approach; I don't want
> to use the split.
>
> line_list = line.lower().split()
> res = list(set(keywords_list).difference(set(ENTITY_IGNORE_LIST)))
>

Strings have a replace method that produces a new string with all
occurrences of one substring replaced by another. You'd have to loop
through your stop_list one word at a time.

>>> s = 'abcxyzabc'
>>> s.replace('xyz', '')
'abcabc'


If either the string or the stop_list grows particularly large, this
approach won't scale very well, since the whole string is re-created
for each stop_list entry. In that case, I'd look into the regular
expression (re) module. You may be able to finagle a way to find and
replace all stop_list entries in one pass. (Finding them all is easy;
I'm not so sure you could replace them all at once, though.)


Gary Herron
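
On the question of replacing everything in one pass: one possible sketch, assuming each stop word may map to its own replacement, is to hand re.sub a function instead of a fixed string. The helper name replace_all and the sample mapping are made up for illustration and are not from the thread.

```python
import re

def replace_all(text, replacements):
    """Replace every key of `replacements` found in `text` in a single pass.

    `replacements` is a hypothetical dict mapping stop word -> replacement.
    """
    if not replacements:
        return text
    # Try longer keys first so a short key cannot shadow a longer one.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(map(re.escape, keys)))
    # re.sub calls the function once per match and inserts its return value.
    return pattern.sub(lambda m: replacements[m.group(0)], text)

sample = "abc[BAD]def[BAD2]ghi"
result = replace_all(sample, {"[BAD]": "", "[BAD2]": "?"})
```

With an empty string as every replacement, this reduces to plain stop-word removal.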


 
Reply With Quote
 
Gary Herron
Guest
Posts: n/a
 
      01-17-2008
Karthik wrote:
> How about -
>
> for s in stoplist:
>     mystr = string.replace(mystr, s, "")
>

That will work, but the string module is long outdated. Better to use
string methods (replace returns a new string, so the result has to be
assigned back):

for s in stoplist:
    mystr = mystr.replace(s, "")

Gary Herron
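
A small illustration of why the result has to be assigned back: Python strings are immutable, so str.replace never changes the string it is called on. The sample string below is invented for the demonstration.

```python
mystr = "abc[BAD]def[BAD2]ghi"
stoplist = ["[BAD]", "[BAD2]"]

# Calling replace() and discarding the return value leaves mystr untouched.
for s in stoplist:
    mystr.replace(s, "")
unchanged = mystr

# Rebinding the name on each iteration keeps the intermediate results.
cleaned = mystr
for s in stoplist:
    cleaned = cleaned.replace(s, "")
```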


Raymond Hettinger
      01-17-2008
On Jan 17, 12:25 am, BerlinBrown <(E-Mail Removed)> wrote:
> if I have an array of "stop" words, and I want to replace those values
> with something else;
> mystr =
> kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;
> if I have an array stop_list = [ "[BAD]", "[BAD2]" ]
> I want to replace the values in that list with a zero length string.


Regular expressions should do the trick.

Try this:

>>> mystr = 'kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;'
>>> stoplist = ["[BAD]", "[BAD2]"]
>>> import re
>>> stoppattern = '|'.join(map(re.escape, stoplist))
>>> re.sub(stoppattern, '', mystr)
'kljsldkfjksjdfjsdjflkdjslkfKkjkkkkjkkjkLSKJFKSFJKSJF;Lkjsldfsd;'

Raymond
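
One caveat with a joined alternation, shown as a sketch: Python's re tries alternatives left to right, so if one stop word is a prefix of another (unlike "[BAD]" and "[BAD2]", which differ before the closing bracket), the shorter one can win and leave a fragment behind. Sorting longest-first avoids that. The helper name make_stop_pattern and the bracket-free stop words are assumptions made for the example.

```python
import re

def make_stop_pattern(stoplist):
    """Compile one alternation, trying longer stop words before shorter ones."""
    ordered = sorted(stoplist, key=len, reverse=True)
    return re.compile("|".join(map(re.escape, ordered)))

stoplist = ["BAD", "BAD2"]  # "BAD" is a prefix of "BAD2"
text = "xBADyBAD2z"

# Joined in list order, "BAD" matches first and the "2" is left behind.
naive_result = re.sub("|".join(map(re.escape, stoplist)), "", text)

# Longest-first ordering removes "BAD2" as a whole.
ordered_result = make_stop_pattern(stoplist).sub("", text)
```

Compiling the pattern once also helps when the same stop list is applied to many strings.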
 
Bruno Desthuilliers
      01-17-2008
BerlinBrown wrote:
> if I have an array of "stop" words, and I want to replace those values
> with something else; in a string, how would I go about doing this. I
> have this code that splits the string and then does a difference but I
> think there is an easier approach:
>
> E.g.
>
> mystr =
> kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;
>


<ot>you forgot the quotes</ot>

> if I have an array stop_list = [ "[BAD]", "[BAD2]" ]


s/array/list/

> I want to replace the values in that list with a zero length string.
>
> I had this before, but I don't want to use this approach; I don't want
> to use the split.
>
> line_list = line.lower().split()
> res = list(set(keywords_list).difference(set(ENTITY_IGNORE_LIST)))


res = mystr
for stop_word in stop_list:
    res = res.replace(stop_word, '')


 
bearophileHUGS@lycos.com
      01-17-2008
Raymond Hettinger:
> Regular expressions should do the trick.
> >>> stoppattern = '|'.join(map(re.escape, stoplist))
> >>> re.sub(stoppattern, '', mystr)


If the stop words are many (and similar) then that RE can be optimized
with a trie-based strategy, like this one called "List":
http://search.cpan.org/~dankogai/Reg...Regexp/List.pm

"List" is used by something more complex called "Optimizer" that's
overkill for the OP problem:
http://search.cpan.org/~dankogai/Reg...p/Optimizer.pm

I don't know whether a Python module similar to "List" is available; I
may write one.

Bye,
bearophile
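
For what it's worth, a rough sketch of what such a trie-based pattern builder could look like in Python. This is not a port of the Perl module; the function names build_trie, trie_to_regex and make_stop_regex are invented here, and the only idea shown is sharing common prefixes so the regex engine does not retry them for every alternative.

```python
import re

def build_trie(words):
    """Store the words in a nested-dict trie; the key "" marks end of word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Return a regex fragment matching every suffix stored below `node`."""
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""
    is_word_end = "" in node
    if len(branches) == 1 and not is_word_end:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # A word may end here and also continue; greedy "?" prefers the longer one.
    return group + "?" if is_word_end else group

def make_stop_regex(stoplist):
    """Compile a single prefix-sharing pattern for all stop words."""
    return re.compile(trie_to_regex(build_trie(stoplist)))

stoplist = ["[BAD]", "[BAD2]"]
mystr = "abc[BAD]def[BAD2]ghi"
cleaned = make_stop_regex(stoplist).sub("", mystr)
```

For the two stop words in this thread the shared prefix "[BAD" is matched once and only the tail is an alternation; whether that is measurably faster than the plain join depends on the size and similarity of the stop list.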
 