Velocity Reviews > Extracting subsequences composed of the same character

# Extracting subsequences composed of the same character

candide
Guest
Posts: n/a

 04-01-2011
Suppose you have a string, for instance

"pyyythhooonnn ---> ++++"

and you search for the subquences composed of the same character, here
you get :

'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

It's not difficult to write a Python code that solves the problem, for
instance :

def f(text):
ch=text
r=[]
if not text:
return r
else:
x=ch[0]
i=0
for c in ch:
if c!=x:
if i>1:
r+=[x*i]
x=c
i=1
else:
i+=1
return r+(i>1)*[i*x]

print f("pyyythhooonnn ---> ++++")

I should confess that this code is rather cumbersome so I was looking
for an alternative. I imagine that a regular expressions approach could
provide a better method. Does a such code exist ? Note that the string
is not restricted to the ascii charset.

MRAB
Guest
Posts: n/a

 04-01-2011
On 01/04/2011 01:43, candide wrote:
> Suppose you have a string, for instance
>
> "pyyythhooonnn ---> ++++"
>
> and you search for the subquences composed of the same character, here
> you get :
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'
>
> It's not difficult to write a Python code that solves the problem, for
> instance :
>

[snip]
>
> I should confess that this code is rather cumbersome so I was looking
> for an alternative. I imagine that a regular expressions approach could
> provide a better method. Does a such code exist ? Note that the string
> is not restricted to the ascii charset.

>>> import re
>>> re.findall(r"((.)\2+)", s)

[('yyy', 'y'), ('hh', 'h'), ('ooo', 'o'), ('nnn', 'n'), ('---', '-'),
('++++', '+')]
>>> [m[0] for m in re.findall(r"((.)\2+)", s)]

['yyy', 'hh', 'ooo', 'nnn', '---', '++++']

Roy Smith
Guest
Posts: n/a

 04-01-2011
In article <4d952008\$0\$3943\$(E-Mail Removed)>,
candide <(E-Mail Removed)> wrote:

> Suppose you have a string, for instance
>
> "pyyythhooonnn ---> ++++"
>
> and you search for the subquences composed of the same character, here
> you get :
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

I got the following. It's O(n) (with the minor exception that the string
addition isn't, but that's trivial to fix, and in practice, the bunches
are short enough it hardly matters).

#!/usr/bin/env python

s = "pyyythhooonnn ---> ++++"
answer = ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']

last = None
bunches = []
bunch = ''
for c in s:
if c == last:
bunch += c
else:
if bunch:
bunches.append(bunch)
bunch = c
last = c
bunches.append(bunch)

multiples = [bunch for bunch in bunches if len(bunch) > 1]
print multiples

[eagerly awaiting a PEP for collections.bunch and
collections.frozenbunch]

Tim Chase
Guest
Posts: n/a

 04-01-2011
On 03/31/2011 07:43 PM, candide wrote:
> Suppose you have a string, for instance
>
> "pyyythhooonnn ---> ++++"
>
> and you search for the subquences composed of the same character, here
> you get :
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

>>> import re
>>> s = "pyyythhooonnn ---> ++++"
>>> [m.group(0) for m in re.finditer(r"(.)\1+", s)]

['yyy', 'hh', 'ooo', 'nnn', '---', '++++']
>>> [(m.group(0),m.group(1)) for m in re.finditer(r"(.)\1+", s)]

[('yyy', 'y'), ('hh', 'h'), ('ooo', 'o'), ('nnn', 'n'), ('---',
'-'), ('++++', '+')]

-tkc

Tim Chase
Guest
Posts: n/a

 04-01-2011
On 03/31/2011 07:43 PM, candide wrote:
> "pyyythhooonnn ---> ++++"
>
> and you search for the subquences composed of the same character, here
> you get :
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

Or, if you want to do it with itertools instead of the "re" module:

>>> s = "pyyythhooonnn ---> ++++"
>>> from itertools import groupby
>>> [c*length for c, length in ((k, len(list(g))) for k, g in

groupby(s)) if length > 1]
['yyy', 'hh', 'ooo', 'nnn', '---', '++++']

-tkc

Terry Reedy
Guest
Posts: n/a

 04-01-2011
On 3/31/2011 10:20 PM, Tim Chase wrote:
> On 03/31/2011 07:43 PM, candide wrote:
>> "pyyythhooonnn ---> ++++"
>>
>> and you search for the subquences composed of the same character, here
>> you get :
>>
>> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

>
> Or, if you want to do it with itertools instead of the "re" module:
>
> >>> s = "pyyythhooonnn ---> ++++"
> >>> from itertools import groupby
> >>> [c*length for c, length in ((k, len(list(g))) for k, g in

> groupby(s)) if length > 1]
> ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']

Slightly shorter:
[r for r in (''.join(g) for k, g in groupby(s)) if len(r) > 1]

--
Terry Jan Reedy

candide
Guest
Posts: n/a

 04-01-2011
Thanks, yours responses gave me the opportunity to understand the
"backreference" feature, it was not clear in spite of my intensive study
of the well known RE howto manual.