Velocity Reviews - Computer Hardware Reviews

Extracting repeated words

 
 
candide
      04-01-2011
Another question about regular expressions.

How can I extract all the duplicated words in a given text using regular
expression methods? To make the question concrete, if the text is

------------------
Now is better than never.
Although never is often better than *right* now.
------------------

the duplicates are:

------------------------
better is now than never
------------------------

Some code can solve the question, for instance

# ------------------
import re

regexp=r"\w+"

c=re.compile(regexp, re.IGNORECASE)

text="""
Now is better than never.
Although never is often better than *right* now."""

z=[s.lower() for s in c.findall(text)]

for d in set([s for s in z if z.count(s)>1]):
    print d,
# ------------------

but I'm in search of "plain" re code.



 
Ian Kelly
      04-01-2011
On Fri, Apr 1, 2011 at 2:54 PM, candide <(E-Mail Removed)> wrote:
> Another question relative to regular expressions.
>
> How to extract all word duplicates in a given text by use of regular
> expression methods? To make the question concrete, if the text is
>
> ------------------
> Now is better than never.
> Although never is often better than *right* now.
> ------------------
>
> duplicates are :
>
> ------------------------
> better is now than never
> ------------------------
>
> Some code can solve the question, for instance
>
> # ------------------
> import re
>
> regexp=r"\w+"
>
> c=re.compile(regexp, re.IGNORECASE)
>
> text="""
> Now is better than never.
> Although never is often better than *right* now."""
>
> z=[s.lower() for s in c.findall(text)]
>
> for d in set([s for s in z if z.count(s)>1]):
>     print d,
> # ------------------
>
> but I'm in search of "plain" re code.


You could use a look-ahead assertion with a captured group:

>>> regexp = r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)'
>>> c = re.compile(regexp, re.IGNORECASE | re.DOTALL)
>>> c.findall(text)
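
For readers who want to run the look-ahead approach end to end, here is a minimal sketch in Python 3 using the sample text from the question. The names `pattern`, `matches` and `duplicates` are chosen for illustration, and the final de-duplication step is an addition that is not part of the reply above: the pattern reports a word each time it occurs again later, so a word appearing three times would otherwise be listed twice.

```python
import re

text = """
Now is better than never.
Although never is often better than *right* now."""

# Match a whole word only if the same word (ignoring case) occurs again later.
# re.DOTALL lets "." cross line breaks, so the look-ahead can reach later lines.
pattern = re.compile(r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)',
                     re.IGNORECASE | re.DOTALL)

# findall returns the captured group, once per occurrence that recurs later.
matches = pattern.findall(text)

# Assumption for illustration: normalise case and drop repeats while keeping
# first-seen order (dict keys preserve insertion order in Python 3.7+).
duplicates = list(dict.fromkeys(m.lower() for m in matches))

print(matches)     # ['Now', 'is', 'better', 'than', 'never']
print(duplicates)  # ['now', 'is', 'better', 'than', 'never']
```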


But note that this is computationally expensive, because for every word
the look-ahead may have to scan the rest of the text. The code that you
posted is probably more efficient, especially if you use a
collections.Counter object instead of z.count.
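
The collections.Counter suggestion might look like the following sketch (Python 3). It keeps the `\w+` pattern from the original post but counts every word in a single pass, rather than calling `z.count` once per word. The variable names and the sorted output are illustrative choices, not something specified in the thread.

```python
import re
from collections import Counter

text = """
Now is better than never.
Although never is often better than *right* now."""

# Lower-case every word so that "Now" and "now" are counted together.
words = [w.lower() for w in re.findall(r"\w+", text)]

# Counter tallies all words in one pass over the list.
counts = Counter(words)

# Keep the words seen more than once; sorting only makes the output stable.
duplicates = sorted(w for w, n in counts.items() if n > 1)

print(duplicates)  # ['better', 'is', 'never', 'now', 'than']
```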

Cheers,
Ian
 
candide
      04-02-2011
On 02/04/2011 00:42, Ian Kelly wrote:

> You could use a look-ahead assertion with a captured group:
>
> >>> regexp = r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)'
> >>> c = re.compile(regexp, re.IGNORECASE | re.DOTALL)
> >>> c.findall(text)


It works fine. Lookahead assertions in action are exactly what I was
looking for, many thanks.
 