Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Extracting subsequences composed of the same character

Reply
Thread Tools

Extracting subsequences composed of the same character

 
 
candide
Guest
Posts: n/a
 
      04-01-2011
Suppose you have a string, for instance

"pyyythhooonnn ---> ++++"

and you search for the subquences composed of the same character, here
you get :

'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

It's not difficult to write a Python code that solves the problem, for
instance :

def f(text):
ch=text
r=[]
if not text:
return r
else:
x=ch[0]
i=0
for c in ch:
if c!=x:
if i>1:
r+=[x*i]
x=c
i=1
else:
i+=1
return r+(i>1)*[i*x]

print f("pyyythhooonnn ---> ++++")


I should confess that this code is rather cumbersome so I was looking
for an alternative. I imagine that a regular expressions approach could
provide a better method. Does a such code exist ? Note that the string
is not restricted to the ascii charset.
 
Reply With Quote
 
 
 
 
MRAB
Guest
Posts: n/a
 
      04-01-2011
On 01/04/2011 01:43, candide wrote:
> Suppose you have a string, for instance
>
> "pyyythhooonnn ---> ++++"
>
> and you search for the subquences composed of the same character, here
> you get :
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'
>
> It's not difficult to write a Python code that solves the problem, for
> instance :
>

[snip]
>
> I should confess that this code is rather cumbersome so I was looking
> for an alternative. I imagine that a regular expressions approach could
> provide a better method. Does a such code exist ? Note that the string
> is not restricted to the ascii charset.


>>> import re
>>> re.findall(r"((.)\2+)", s)

[('yyy', 'y'), ('hh', 'h'), ('ooo', 'o'), ('nnn', 'n'), ('---', '-'),
('++++', '+')]
>>> [m[0] for m in re.findall(r"((.)\2+)", s)]

['yyy', 'hh', 'ooo', 'nnn', '---', '++++']
 
Reply With Quote
 
 
 
 
Roy Smith
Guest
Posts: n/a
 
      04-01-2011
In article <4d952008$0$3943$(E-Mail Removed)>,
candide <(E-Mail Removed)> wrote:

> Suppose you have a string, for instance
>
> "pyyythhooonnn ---> ++++"
>
> and you search for the subquences composed of the same character, here
> you get :
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'


I got the following. It's O(n) (with the minor exception that the string
addition isn't, but that's trivial to fix, and in practice, the bunches
are short enough it hardly matters).

#!/usr/bin/env python

s = "pyyythhooonnn ---> ++++"
answer = ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']

last = None
bunches = []
bunch = ''
for c in s:
if c == last:
bunch += c
else:
if bunch:
bunches.append(bunch)
bunch = c
last = c
bunches.append(bunch)

multiples = [bunch for bunch in bunches if len(bunch) > 1]
print multiples
assert(multiples == answer)


[eagerly awaiting a PEP for collections.bunch and
collections.frozenbunch]
 
Reply With Quote
 
Tim Chase
Guest
Posts: n/a
 
      04-01-2011
On 03/31/2011 07:43 PM, candide wrote:
> Suppose you have a string, for instance
>
> "pyyythhooonnn ---> ++++"
>
> and you search for the subquences composed of the same character, here
> you get :
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'


>>> import re
>>> s = "pyyythhooonnn ---> ++++"
>>> [m.group(0) for m in re.finditer(r"(.)\1+", s)]

['yyy', 'hh', 'ooo', 'nnn', '---', '++++']
>>> [(m.group(0),m.group(1)) for m in re.finditer(r"(.)\1+", s)]

[('yyy', 'y'), ('hh', 'h'), ('ooo', 'o'), ('nnn', 'n'), ('---',
'-'), ('++++', '+')]

-tkc





 
Reply With Quote
 
Tim Chase
Guest
Posts: n/a
 
      04-01-2011
On 03/31/2011 07:43 PM, candide wrote:
> "pyyythhooonnn ---> ++++"
>
> and you search for the subquences composed of the same character, here
> you get :
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'


Or, if you want to do it with itertools instead of the "re" module:

>>> s = "pyyythhooonnn ---> ++++"
>>> from itertools import groupby
>>> [c*length for c, length in ((k, len(list(g))) for k, g in

groupby(s)) if length > 1]
['yyy', 'hh', 'ooo', 'nnn', '---', '++++']


-tkc


 
Reply With Quote
 
Terry Reedy
Guest
Posts: n/a
 
      04-01-2011
On 3/31/2011 10:20 PM, Tim Chase wrote:
> On 03/31/2011 07:43 PM, candide wrote:
>> "pyyythhooonnn ---> ++++"
>>
>> and you search for the subquences composed of the same character, here
>> you get :
>>
>> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

>
> Or, if you want to do it with itertools instead of the "re" module:
>
> >>> s = "pyyythhooonnn ---> ++++"
> >>> from itertools import groupby
> >>> [c*length for c, length in ((k, len(list(g))) for k, g in

> groupby(s)) if length > 1]
> ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']


Slightly shorter:
[r for r in (''.join(g) for k, g in groupby(s)) if len(r) > 1]

--
Terry Jan Reedy

 
Reply With Quote
 
candide
Guest
Posts: n/a
 
      04-01-2011
Thanks, yours responses gave me the opportunity to understand the
"backreference" feature, it was not clear in spite of my intensive study
of the well known RE howto manual.
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Non Continuous Subsequences bearophileHUGS@lycos.com Python 5 08-01-2008 01:56 AM
A string composed of a character repeated x number of times Sathyaish Java 11 04-04-2007 07:03 PM
How to iterate through a sequence, grabbing subsequences? Matthew Wilson Python 4 09-29-2006 03:01 PM
subsequences Eli Bendersky Ruby 14 04-19-2006 07:13 AM
grouping subsequences with BIO tags Steven Bethard Python 6 04-23-2005 02:10 AM



Advertisments