Velocity Reviews - Computer Hardware Reviews

re.split() not keeping matched text

 
 
Robert Oschler
      07-25-2004
Hello,

Given the following program:

--------------

import re

x = "The dog ran. The cat eats! The bird flies? Done."
l = re.split("[.?!]", x)

for s in l:
    print(s.strip())
# for
---------------

I am getting the following output:

The dog ran
The cat eats
The bird flies
Done

As you can see, the end-of-sentence punctuation marks are being removed. Yet
the docs for re.split() say that the matched text is supposed to be
returned. I want to keep the punctuation marks.

Where am I going wrong here?

Thanks,
--
Robert


 
Test
      07-25-2004
Hi Robert,

Robert Oschler wrote:

> l = re.split("[.?!]", x)


> I want to keep the punctuation marks.


The docs say: If _capturing parentheses_ are used in pattern, then the text
of all groups in the pattern are also returned as part of the resulting
list.

So:

l = re.split("([.?!])", x)

will work as wanted.

Bye,
Kai
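
For reference, here is a minimal sketch of what the capturing-group version returns for the sample string from the original post. The variable name `parts` is only illustrative, and the snippet assumes Python 3.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

# The parentheses turn the punctuation into a captured group, so each
# delimiter comes back as its own list element between the text pieces.
parts = re.split(r"([.?!])", x)
print(parts)
# ['The dog ran', '.', ' The cat eats', '!', ' The bird flies', '?', ' Done', '.', '']
```

The trailing empty string appears because the input ends with a delimiter.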
 
Robert Oschler
      07-25-2004
"Test" wrote:
> Hi Robert,
>
> The docs say: If _capturing parentheses_ are used in pattern, then the text
> of all groups in the pattern are also returned as part of the resulting
> list.
>
> So:
>
> l = re.split("([.?!])", x)
>
> will work as wanted.
>
> Bye,
> Kai


Kai,

That works. Unfortunately the punctuation marks (matched text) are returned
as separate list entries. Is there any way to avoid having to walk the list
by steps of 2, and rejoin the "n" and "n+1" elements, to get back the
original sentence(s)? I'm trying to save some processing time if possible.

Thanks,
--
Robert
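
One possible way to do the pairwise rejoin described above without an explicit index loop is sketched below. The helper name `split_keep_punctuation` is hypothetical, the snippet assumes Python 3 (for `itertools.zip_longest`), and it makes no claim about being faster than other approaches.

```python
import re
from itertools import zip_longest


def split_keep_punctuation(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    # With a capturing group, even indexes hold text and odd indexes
    # hold the punctuation that followed it.
    parts = re.split(r"([.?!])", text)
    # zip_longest pads the shorter slice, so trailing text that has no
    # closing punctuation is kept instead of being dropped.
    pairs = zip_longest(parts[0::2], parts[1::2], fillvalue="")
    sentences = [(body + mark).strip() for body, mark in pairs]
    # Discard the empty piece produced when the text ends with a delimiter.
    return [s for s in sentences if s]


x = "The dog ran. The cat eats! The bird flies? Done."
print(split_keep_punctuation(x))
# ['The dog ran.', 'The cat eats!', 'The bird flies?', 'Done.']
```

Slicing with a step of 2 still visits every element once, so this mainly saves typing rather than work.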


 
Christopher T King
      07-25-2004
On Sun, 25 Jul 2004, Robert Oschler wrote:

> Given the following program:
>
> --------------
>
> import re
>
> x = "The dog ran. The cat eats! The bird flies? Done."
> l = re.split("[.?!]", x)
>
> for s in l:
>     print(s.strip())
> # for
> ---------------


> I want to keep the punctuation marks.
>
> Where am I going wrong here?


What you need is some magic with the (?<=...), or 'look-behind assertion'
operator:

re.split(r'(?<=[.?!])\s*', x)

What this regex is saying is "match a string of spaces that follows one of
[.?!]". This way, it will not consume the punctuation, but will consume
the spaces (thus killing two birds with one stone by obviating the need
for the subsequent s.strip()).

Unfortunately, there is a slight bug: if the punctuation is not
followed by whitespace, re.split won't split, because the regex matches a
zero-length string. There is a patch to fix this (SF #988761, see the end
of the message for a link), but until then, you can prevent the error by
using:

re.split(r'(?<=[.?!])\s+', x)

This won't match end-of-sentence marks not followed by whitespace, but
that may be preferable behaviour anyways (e.g. if you're parsing Python
documentation).

Hope this helps.

Patch #988761:
http://sourceforge.net/tracker/index...70&atid=305470
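
A short sketch of the `\s+` look-behind variant applied to the sample string from the original post, assuming Python 3. The second input string is made up for illustration.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

# Split on whitespace that is preceded by sentence punctuation.
# The look-behind checks for the punctuation without consuming it.
sentences = re.split(r"(?<=[.?!])\s+", x)
print(sentences)
# ['The dog ran.', 'The cat eats!', 'The bird flies?', 'Done.']

# Punctuation with no whitespace after it is not treated as a split point.
dotted = re.split(r"(?<=[.?!])\s+", "See re.split. It works.")
print(dotted)
# ['See re.split.', 'It works.']
```

As far as I know, re.split() does split on empty matches from Python 3.7 on, so the `\s*` form would behave differently on newer interpreters. The `\s+` form does not depend on that.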

 
mark@wutka.com
      07-26-2004
I don't know if this will save you any processing time, but you can just
replace the split with a findall like this:
l = re.findall("[^.?!]+[?!.]+", x)

This should handle your example, plus it handles multiple occurrences of
the punctuation at the end of the sentence.
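
To illustrate, a minimal sketch of the findall approach on the sample string and on a made-up input with repeated punctuation, assuming Python 3. The `strip()` call is an addition here, since each match keeps the space that preceded the sentence.

```python
import re

PATTERN = r"[^.?!]+[.?!]+"

x = "The dog ran. The cat eats! The bird flies? Done."

# Each match is a run of non-punctuation followed by one or more marks.
sentences = [s.strip() for s in re.findall(PATTERN, x)]
print(sentences)
# ['The dog ran.', 'The cat eats!', 'The bird flies?', 'Done.']

# Runs of punctuation stay attached to the sentence they end.
repeated = [s.strip() for s in re.findall(PATTERN, "Really?! Yes...")]
print(repeated)
# ['Really?!', 'Yes...']
```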


 
Peter Otten
      07-26-2004
mark@wutka.com wrote:

> I don't know if this will save you any processing time, but you can just
> replace the split with a findall like this:
> l = re.findall("[^.?!]+[?!.]+", x)
>
> This should handle your example, plus it handles multiple occurrences of
> the punctuation at the end of the sentence.


One caveat: the invariant

"".join(re.findall("[^?!.]+[?!.]+", s)) == s

will no longer hold as you will lose leading punctuation and trailing
non-punctuation:

>>> re.findall("[^?!.]+[?!.]+", "!so what! you're done? yes done")
['so what!', " you're done?"]


Peter
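
If the join invariant matters, one possible adjustment is sketched below. This pattern is not from the thread. It is a hypothetical variant that makes the leading text optional and adds a second alternative for trailing text with no closing punctuation. The name `LOSSLESS` is illustrative, and the snippet assumes Python 3.

```python
import re

# First alternative: optional text followed by one or more marks.
# Second alternative: leftover text with no closing punctuation.
LOSSLESS = re.compile(r"[^.?!]*[.?!]+|[^.?!]+")

s = "!so what! you're done? yes done"
pieces = LOSSLESS.findall(s)
print(pieces)
# ['!', 'so what!', " you're done?", ' yes done']

# Every character lands in exactly one piece, so the join round-trips.
print("".join(pieces) == s)
# True
```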

 