returning regex matches as lists

 
 
Jonathan Lukens
Guest
Posts: n/a
 
      02-15-2008
I am in the last phase of building a Django app based on something I
wrote in Java a while back. Right now I am stuck on how to return the
matches of a regular expression as a list *at all*, and in particular
given that the regex has a number of groupings. The only method I've
seen that returns a list is .findall(string), but then I get back the
groups as tuples, which is sort of a problem.

Thank you,
Jonathan
 
 
 
 
 
John Machin
Guest
Posts: n/a
 
      02-15-2008
On Feb 16, 6:07 am, Jonathan Lukens <jonathan.luk...@gmail.com> wrote:
> I am in the last phase of building a Django app based on something I
> wrote in Java a while back. Right now I am stuck on how to return the
> matches of a regular expression as a list *at all*, and in particular
> given that the regex has a number of groupings. The only method I've
> seen that returns a list is .findall(string), but then I get back the
> groups as tuples, which is sort of a problem.
>


It would help if you explained what you want the contents of the list
to be, why you want a list as opposed to a tuple or a generator or
whatever ... we can't be expected to imagine why getting groups as
tuples is "sort of a problem".

Use a concrete example, e.g.

>>> import re
>>> regex = re.compile(r'(\w+)\s+(\d+)')
>>> text = 'python 1 junk xyzzy 42 java 666'
>>> r = regex.findall(text)
>>> r
[('python', '1'), ('xyzzy', '42'), ('java', '666')]


What would you like to see instead?
 
 
 
 
 
Jonathan Lukens
Guest
Posts: n/a
 
      02-15-2008
> What would you like to see instead?

I had mostly just expected that there was some method that would
return each entire match as an item on a list. I have this pattern:

>>> import re
>>> corporate_names = re.compile(u'(?u)\\b([А-Я]{2,}\\s+)([<<"][а-яА-Я]+)(\\s*-?[а-яА-Я]+)*([>>"])')
>>> terms = corporate_names.findall(sourcetext)


Which matches a specific way that Russian company names are
formatted. I was expecting a method that would return this:

>>> terms
[u'string one', u'string two', u'string three']

...mostly because I was working it this way in Java and haven't
learned to do things the Python way yet. At the suggestion of
someone on the list, I just used list() on all the tuples like so:

>>> detupled_terms = [list(term_tuple) for term_tuple in terms]
>>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]


which achieves the desired result, but I am not a programmer and so I
would still be interested to know if there is a more elegant way of
doing this.

I appreciate the help.

Jonathan
 
 
Gabriel Genellina
Guest
Posts: n/a
 
      02-16-2008
On Fri, 15 Feb 2008 17:07:21 -0200, Jonathan Lukens wrote:

> I am in the last phase of building a Django app based on something I
> wrote in Java a while back. Right now I am stuck on how to return the
> matches of a regular expression as a list *at all*, and in particular
> given that the regex has a number of groupings. The only method I've
> seen that returns a list is .findall(string), but then I get back the
> groups as tuples, which is sort of a problem.


Do you want something like this?

py> re.findall(r"([a-z]+)([0-9]+)", "foo bar3 w000 no abc123")
[('bar', '3'), ('w', '000'), ('abc', '123')]
py> re.findall(r"(([a-z]+)([0-9]+))", "foo bar3 w000 no abc123")
[('bar3', 'bar', '3'), ('w000', 'w', '000'), ('abc123', 'abc', '123')]
py> groups = re.findall(r"(([a-z]+)([0-9]+))", "foo bar3 w000 no abc123")
py> groups
[('bar3', 'bar', '3'), ('w000', 'w', '000'), ('abc123', 'abc', '123')]
py> [group[0] for group in groups]
['bar3', 'w000', 'abc123']

--
Gabriel Genellina

 
 
Gabriel Genellina
Guest
Posts: n/a
 
      02-16-2008
On Fri, 15 Feb 2008 19:25:59 -0200, Jonathan Lukens wrote:

>> What would you like to see instead?
>
> I had mostly just expected that there was some method that would
> return each entire match as an item on a list. I have this pattern:
>
> >>> import re
> >>> corporate_names = re.compile(u'(?u)\\b([А-Я]{2,}\\s+)([<<"][а-яА-Я]+)(\\s*-?[а-яА-Я]+)*([>>"])')
> >>> terms = corporate_names.findall(sourcetext)
>
> Which matches a specific way that Russian company names are
> formatted. I was expecting a method that would return this:
>
> >>> terms
> [u'string one', u'string two', u'string three']
>
> ...mostly because I was working it this way in Java and haven't
> learned to do things the Python way yet. At the suggestion from
> someone on the list, I just used list() on all the tuples like so:


The group() method of match objects does what you want:

terms = [match.group() for match in corporate_names.finditer(sourcetext)]

See http://docs.python.org/lib/match-objects.html
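
For comparison, here is a minimal sketch of the difference between findall() and finditer() on a pattern with groups. It assumes Python 3 and reuses the sample pattern and string from the earlier reply rather than the Russian company-name pattern; the variable names are only illustrative.

```python
import re

# Sample pattern and text borrowed from the earlier reply in this thread.
pattern = re.compile(r"([a-z]+)([0-9]+)")
text = "foo bar3 w000 no abc123"

# With capturing groups, findall() returns one tuple of groups per match.
as_tuples = pattern.findall(text)

# finditer() yields match objects; group() with no argument is the whole match.
whole_matches = [m.group() for m in pattern.finditer(text)]

print(as_tuples)      # [('bar', '3'), ('w', '000'), ('abc', '123')]
print(whole_matches)  # ['bar3', 'w000', 'abc123']
```

The finditer() form leaves the pattern untouched, so the groups stay available through m.group(1), m.group(2) and so on if they are needed later.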

> >>> detupled_terms = [list(term_tuple) for term_tuple in terms]
> >>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]

>
> which achieves the desired result, but I am not a programmer and so I
> would still be interested to know if there is a more elegant way of
> doing this.


That ''.join(...) works equally well on tuples; you don't have to convert
tuples to lists first:

delisted_terms = [''.join(term) for term in terms]
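
One caveat with joining the groups is worth checking: a group that is repeated with * or + keeps only its last repetition, so the joined string can be shorter than the text that actually matched. The sketch below assumes Python 3 and uses a made-up ASCII stand-in for the company-name pattern (the pattern, sample text and variable names are all hypothetical).

```python
import re

# ASCII stand-in with the same shape as the pattern discussed in the thread:
# a capitalised prefix, a quoted name, and a repeated group for extra words.
pattern = re.compile(r'\b([A-Z]{2,}\s+)("[a-z]+)(\s*-?[a-z]+)*(")')
text = 'OOO "alpha beta gamma"'

# Joining the groups: the repeated third group holds only ' gamma'.
joined = [''.join(groups) for groups in pattern.findall(text)]

# group() returns the full matched text, including every repetition.
whole = [m.group() for m in pattern.finditer(text)]

print(joined)  # ['OOO "alpha gamma"']
print(whole)   # ['OOO "alpha beta gamma"']
```

For names of at most two words the two results agree, which may be why the join appears to work on some input.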

--
Gabriel Genellina

 
 
Jonathan Lukens
Guest
Posts: n/a
 
      02-16-2008
Thanks Gabriel,

That is just what I was looking for.

Jonathan
 
 
John Machin
Guest
Posts: n/a
 
      02-16-2008
On Feb 16, 8:25 am, Jonathan Lukens <jonathan.luk...@gmail.com> wrote:
> > What would you like to see instead?

>
> I had mostly just expected that there was some method that would
> return each entire match as an item on a list. I have this pattern:
>
> >>> import re
> >>> corporate_names = re.compile(u'(?u)\\b([А-Я]{2,}\\s+)([<<"][а-яА-Я]+)(\\s*-?[а-яА-Я]+)*([>>"])')
> >>> terms = corporate_names.findall(sourcetext)

>
> Which matches a specific way that Russian company names are
> formatted. I was expecting a method that would return this:
>
> >>> terms

>
> [u'string one', u'string two', u'string three']


What is the point of having parenthesised groups in the regex if you
are interested only in the whole match?

Other comments:
(1) raw string for improved legibility
ur'(?u)\b([А-Я]{2,}\s+)([<<"][а-яА-Я]+)(\s*-?[а-яА-Я]+)*([>>"])'
(2) consider not including space at the end of a group
ur'(?u)\b([А-Я]{2,})\s+([<<"][а-яА-Я]+)\s*(-?[а-яА-Я]+)*([>>"])'
(3) what appears between [] is a set of characters, so [<<"] is the
same as [<"] and probably isn't doing what you expect; have you tested
this regex for correctness?
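
Two of the points above can be illustrated with a short sketch, assuming Python 3 and the sample text from the earlier findall() example. If only the whole match is wanted, non-capturing groups (?:...) make findall() return plain strings, and repeating a character inside [] does not change the set.

```python
import re

text = "python 1 junk xyzzy 42 java 666"

# Capturing groups: findall() returns tuples of the groups.
grouped = re.findall(r"(\w+)\s+(\d+)", text)

# Non-capturing groups: findall() returns each whole match as a string.
ungrouped = re.findall(r"(?:\w+)\s+(?:\d+)", text)

# A character class is a set, so a duplicated '<' adds nothing.
sample = 'a<b"c'
same_set = re.findall(r'[<<"]', sample) == re.findall(r'[<"]', sample)

print(grouped)    # [('python', '1'), ('xyzzy', '42'), ('java', '666')]
print(ungrouped)  # ['python 1', 'xyzzy 42', 'java 666']
print(same_set)   # True
```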

>
> ...mostly because I was working it this way in Java and haven't
> learned to do things the Python way yet. At the suggestion from
> someone on the list, I just used list() on all the tuples like so:
>
> >>> detupled_terms =[list(term_tuple) for term_tuple in terms]
> >>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]

>
> which achieves the desired result, but I am not a programmer and so I
> would still be interested to know if there is a more elegant way of
> doing this.


I can't imagine how "not a programmer" implies "interested to know if
there is a more elegant way". In any case, explore the correctness
axis first.

Cheers,
John
 
 
Jonathan Lukens
Guest
Posts: n/a
 
      02-16-2008
John,

> (1) raw string for improved legibility
> ur'(?u)\b([А-Я]{2,}\s+)([<<"][а-яА-Я]+)(\s*-?[а-яА-Я]+)*([>>"])'


This actually escaped my notice after I had posted -- the letters with
diacritics are incorrectly decoded Cyrillic letters -- I suppose I
could use the Unicode escape sequences (the sets [А-Я] and [а-яА-Я] are
the Cyrillic equivalents of [A-Z] and [A-Za-z]) but then suddenly the
legibility goes out the window again.
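
One possible way to keep escape sequences readable is sketched below. It assumes Python 3, where the re module itself understands \uXXXX escapes in str patterns, and it uses re.VERBOSE so each piece can carry a comment. The angled quotation marks, the non-capturing group and the sample company names are assumptions made for illustration, not the pattern as originally tested.

```python
import re

# U+0410-U+042F is А-Я, U+0430-U+044F is а-я,
# U+00AB and U+00BB are the angled quotation marks.
corporate_names = re.compile(r"""
    \b
    [\u0410-\u042F]{2,} \s+                      # abbreviation in capitals
    [\u00AB"] [\u0430-\u044F\u0410-\u042F]+      # opening quote, first word
    (?: \s* -? [\u0430-\u044F\u0410-\u042F]+ )*  # further words, maybe hyphenated
    [\u00BB"]                                    # closing quote
""", re.VERBOSE)

# Hypothetical sample text, invented for this sketch.
sourcetext = 'Компания ООО «Ромашка» и ЗАО "Вектор-плюс" работают.'

# No capturing groups, so findall() returns the whole matches.
terms = corporate_names.findall(sourcetext)
print(terms)  # ['ООО «Ромашка»', 'ЗАО "Вектор-плюс"']
```

Note that the ranges А-Я and а-я do not include Ё or ё, so names containing those letters would need them added to the sets.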

> (3) what appears between [] is a set of characters, so [<<"] is the
> same as [<"] and probably isn't doing what you expect; have you tested
> this regex for correctness?


These were angled quotation marks in the original Unicode. Sorry
again. The regex matches everything it is supposed to. The extra
parentheses were because I had somehow missed the .group method and it
had only been returning what was only in the one needed set of
parentheses.

> I can't imagine how "not a programmer" implies "interested to know if
> there is a more elegant way".


More carefully stated: "I am self-taught, have no real training or
experience as a programmer, and would be interested in seeing how a
programmer with training and experience would go about this."

Thank you,
Jonathan
 