Velocity Reviews - Computer Hardware Reviews

find and replace with regular expressions

 
 
chrispoliquin@gmail.com
      07-31-2008
I am using regular expressions to search a string (always full
sentences, maybe more than one sentence) for common abbreviations and
remove the periods. I need to break the string into different
sentences but split('.') doesn't solve the whole problem because of
possible periods in the middle of a sentence.

So I have...

----------------

import re

middle_abbr = re.compile('[A-Za-z0-9]\.[A-Za-z0-9]\.')

# this will find abbreviations like e.g. or i.e. in the middle of a
# sentence.
# then I want to remove the periods.

----------------

I want to keep the ie or eg but just take out the periods. Any
ideas? Of course, newString = middle_abbr.sub('', txt), where txt is the
string, will take out the entire abbreviation, alphanumeric
characters included.
 
Mensanator
      07-31-2008
>>> middle_abbr = re.compile('[A-Za-z0-9]\.[A-Za-z0-9]\.')
>>> s = 'A test, i.e., an example.'
>>> a = middle_abbr.search(s) # find the abbreviation
>>> b = re.compile('\.') # period pattern
>>> c = b.sub('',a.group(0)) # remove periods from abbreviation
>>> d = middle_abbr.sub(c,s) # substitute new abbr for old
>>> d

'A test, ie, an example.'
 
Mensanator
      07-31-2008
A more versatile version:

import re

middle_abbr = re.compile(r'[A-Za-z0-9]\.[A-Za-z0-9]\.')
s = 'A test, i.e., an example.'
a = middle_abbr.search(s) # find the abbreviation
b = re.compile(r'\.') # period pattern
c = b.sub('', a.group(0)) # remove periods from abbreviation
d = middle_abbr.sub(c, s) # substitute new abbr for old

print(d)
print()
print()

s = """A test, i.e., an example.
Yet another test, i.e., example with 2 abbr."""

a = middle_abbr.search(s) # find the abbreviation
c = b.sub('', a.group(0)) # remove periods from abbreviation
d = middle_abbr.sub(c, s) # substitute new abbr for old (every match gets the same text)

print(d)
print()
print()

s = """A test, i.e., an example.
Yet another test, i.e., example with 2 abbr.
A multi-test, e.g., one with different abbr."""

done = False

while not done:
    a = middle_abbr.search(s) # find the next abbreviation
    if a:
        c = b.sub('', a.group(0)) # remove periods from abbreviation
        s = middle_abbr.sub(c, s, 1) # substitute new abbr for old ONCE
    else: # repeat until all removed
        done = True

print(s)

## A test, ie, an example.
##
##
## A test, ie, an example.
## Yet another test, ie, example with 2 abbr.
##
##
## A test, ie, an example.
## Yet another test, ie, example with 2 abbr.
## A multi-test, eg, one with different abbr.
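
The search-and-substitute loop above can probably be collapsed into one pass by handing sub() a function instead of a string. The function is called once per match, so different abbreviations (i.e., e.g.) are each handled on their own. This is a minimal sketch, not part of the original post; strip_periods is a hypothetical helper name.

```python
import re

middle_abbr = re.compile(r'[A-Za-z0-9]\.[A-Za-z0-9]\.')

def strip_periods(match):
    # Hypothetical helper: called once per match, returns the
    # matched abbreviation with its periods removed.
    return match.group(0).replace('.', '')

s = """A test, i.e., an example.
Yet another test, i.e., example with 2 abbr.
A multi-test, e.g., one with different abbr."""

# sub() accepts a callable as the replacement argument.
result = middle_abbr.sub(strip_periods, s)
print(result)
```

Because each replacement is computed from its own match, no loop or count argument is needed.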
 
Paul McGuire
      07-31-2008
On Jul 31, 3:07 pm, chrispoliq...@gmail.com wrote:
>
> middle_abbr = re.compile('[A-Za-z0-9]\.[A-Za-z0-9]\.')
>


When defining re's with string literals, it is good practice to use
the raw string literal format (precede with an 'r'):
middle_abbr = re.compile(r'[A-Za-z0-9]\.[A-Za-z0-9]\.')

What abbreviations have numeric digits in them?

I hope your input string doesn't include something like this:
For a good approximation of pi, use 3.1.
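
A quick check of that sentence against the pattern, as a sketch. The letters_only variant is an assumption added for illustration, not something proposed in the thread, and it would still match other two-letter sequences.

```python
import re

# Raw string: the backslashes reach the regex engine unchanged.
middle_abbr = re.compile(r'[A-Za-z0-9]\.[A-Za-z0-9]\.')

txt = 'For a good approximation of pi, use 3.1.'
match = middle_abbr.search(txt)
print(match.group(0))  # the number plus the sentence-ending period is matched

# Assumed variant: dropping 0-9 from the classes leaves the number alone.
letters_only = re.compile(r'[A-Za-z]\.[A-Za-z]\.')
print(letters_only.search(txt))
```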

-- Paul
 
MRAB
      07-31-2008
It's recommended that you use raw strings for regular
expressions.

Capture the letters using parentheses:

middle_abbr = re.compile(r'([A-Za-z0-9])\.([A-Za-z0-9])\.')

and replace what was found with what was captured:

newString = middle_abbr.sub(r'\1\2', txt)
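
Put together as a runnable sketch (the sample text is made up for illustration):

```python
import re

# Each parenthesized group captures one character of the abbreviation.
middle_abbr = re.compile(r'([A-Za-z0-9])\.([A-Za-z0-9])\.')

txt = 'A test, i.e., an example. A multi-test, e.g., one with different abbr.'

# \1 and \2 refer back to the two captured characters, so only the
# periods inside each match are dropped.
newString = middle_abbr.sub(r'\1\2', txt)
print(newString)
```

Every match is rewritten from its own captures, so different abbreviations in the same string are handled in a single call.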

HTH
 
dusans
      08-01-2008
It's impossible with regex. You could try it with a statistical analysis,
and even this would give you a good split.
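
To make the difficulty concrete, here is a sketch that strips two-letter abbreviations and then splits on sentence-ending punctuation followed by whitespace. The sample text, the letters-only pattern, and the sentence_end pattern are all assumptions for illustration, not from the thread.

```python
import re

# Assumed patterns: two-letter abbreviations, and a naive sentence boundary.
middle_abbr = re.compile(r'([A-Za-z])\.([A-Za-z])\.')
sentence_end = re.compile(r'(?<=[.!?])\s+')

txt = 'Ask Dr. Smith, i.e., the author. He will know.'

cleaned = middle_abbr.sub(r'\1\2', txt)   # "i.e." becomes "ie"
sentences = sentence_end.split(cleaned)   # split after ., ! or ?
print(sentences)
```

The "i.e." no longer causes a break, but the title "Dr." still does, which is the kind of case a pattern-only approach cannot settle without a list of known abbreviations or some other source of context.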
 
dusans
      08-01-2008
Correction: "and even this won't give you a good split."
 