Loop over regexp groups

 
 
January Weiner
 
      11-13-2006
Hello,

I am matching a regexp with an a priori unknown number of groups. I would
like to loop over all groups that were matched. For example:

/(\w+)\s(\w+)/ ;
#or
/(\w+)\s(\w+)\s(\w+)/ ;
# or something else

@groups = ...???

for( @groups ) {
process_match( $_ ) ;
}

Of course, the above example is simplifying reality and could be replaced
by split(). Here are more details on the problem:

I am processing protein sequence files in the FASTA format. Depending on
the database, the FASTA headers may look like that:

>O81231 (Q81999) Dehydrogenase alpha subunit


or like that

> O81231 123 Q81999


or

>gi|O81231||li|Q81999


or, possibly,

>O81231; synonyms: Q81999, P89812, O77781


or, basically, anything else. As you might guess, I'm interested in the
"O81231" or "Q81999" part. The idea is that my utility can take an
optional "regexp" string that matches the type of headers that are found in
a given database; while looping through the database, the regexp is
matched, and entries are made for any of the synonymous identifiers found
in one header.

Currently, I am assuming that I will not find more than four synonyms, and
I do the following:

for( $1, $2, $3, $4 ) {
    last unless $_ ;
    process_match( $_ ) ;
}

...which is, of course, crap.

Thanks in advance,
January

P.S. No, ([A-Z]\d{5}) would not match every identifier; the id format can
differ as well. Sometimes it is HBA_HUMAN.

--
 
Dr.Ruud
 
      11-13-2006
January Weiner wrote:

> I am matching a regexp with an a priori unknown number of groups. I
> would like to loop over all groups that were matched.


Use the /g modifier (see perlre), or use split plus grep.

--
Affijn, Ruud

"Gewoon is een tijger."
 
micmath@gmail.com
 
      11-13-2006


On Nov 13, 12:49 pm, January Weiner <(E-Mail Removed)> wrote:
> I am matching a regexp with an a priori unknown number of groups. I would
> like to loop over all groups that were matched. For example:
>
> /(\w+)\s(\w+)/ ;
> #or
> /(\w+)\s(\w+)\s(\w+)/ ;
> # or something else
>
> @groups = ...???
>
> for( @groups ) {
> process_match( $_ ) ;
> }



use strict;
use warnings;

# Each style is a precompiled regex (qr//) with one capture group.
my %styles = (
    style1 => qr/([A-Z]\d{5})/,
    style2 => qr/([A-Z]{3}_[A-Z]{5})/,
);

my $header1 = "O81231 (Q81999) Dehydrogenase alpha subunit";
my $header2 = "O81231 (HBA_HUMAN) Dehydrogenase alpha subunit";

# Return the first capture of $style in $header (undef if there is no match).
sub get_id {
    my ($header, $style) = @_;
    my ($id) = $header =~ m/$style/;
    return $id;
}

print get_id($header1, $styles{style1}), "\n"; # prints O81231 (the first match)
print get_id($header2, $styles{style2}), "\n"; # prints HBA_HUMAN

__END__

I'm not sure I entirely understand your question, but if you want to
store regular expressions in a structure you can loop over, you just
need the qr// operator. If I'm off base, just clarify what you mean and
I'll try again, but I hope that helps!
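
A minimal sketch of the "structure you can loop over" idea: try each stored qr// pattern in turn and keep the captures of the first one that matches. The style names, the patterns, and the helper first_style_match are made up for illustration; real headers would need patterns tuned to the database at hand.

```perl
use strict;
use warnings;

# Hypothetical table of header styles; names and patterns are illustrative only.
my %styles = (
    accession => qr/\b([A-Z]\d{5})\b/,
    entry     => qr/\b([A-Z0-9]{1,5}_[A-Z0-9]{1,5})\b/,
);

# Try the styles in a fixed (alphabetical) order and return the name of the
# first style that matches, followed by whatever it captured.
sub first_style_match {
    my ($header) = @_;
    for my $name (sort keys %styles) {
        my @captures = $header =~ $styles{$name};   # list context: the captures
        return ($name, @captures) if @captures;
    }
    return;   # empty list: no style matched
}

my ($style, @ids) = first_style_match("HBA_HUMAN Hemoglobin subunit alpha");
print "$style: @ids\n";
```

Sorting the keys only makes the order of attempts predictable; an array of patterns would do the same job if the order matters.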

Regards,
Michael
http://www.perlcircus.org/

 
anno4000@radom.zrz.tu-berlin.de
 
      11-13-2006
January Weiner <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> Hello,
>
> I am matching a regexp with an a priori unknown number of groups. I would
> like to loop over all groups that were matched. For example:
>
> /(\w+)\s(\w+)/ ;
> #or
> /(\w+)\s(\w+)\s(\w+)/ ;
> # or something else
>
> @groups = ...???


Very easy. Assuming the regex (with captures) in $re, and the string to
match in $_ (untested):

my @groups = m/$re/;

A match in list context returns all its captures ($1, $2, ...), or an
empty list if the match fails.

> for( @groups ) {
> process_match( $_ ) ;
> }


Right on. Even

process_match( $_) for m/$re/;

would work.
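
As a sketch of how this could look for the FASTA headers from the original post: the helper header_ids and the three patterns below are assumptions for illustration (the patterns stand in for whatever regexp the user would supply), not part of anyone's actual utility.

```perl
use strict;
use warnings;

# Return every group that took part in the match, however many groups $re has.
sub header_ids {
    my ($header, $re) = @_;
    my @groups = $header =~ $re;    # list context: ($1, $2, ...) or () on failure
    # An optional group that did not take part in the match comes back as undef.
    return grep { defined($_) && length($_) } @groups;
}

# Hypothetical header/pattern pairs; a real run would take the pattern from the user.
my @cases = (
    [ ">O81231 (Q81999) Dehydrogenase alpha subunit", qr/^>\s*(\S+)\s+\((\S+)\)/ ],
    [ "> O81231 123 Q81999",                          qr/^>\s*(\w+)(?:\s+\d+\s+(\w+))?/ ],
    [ ">gi|O81231||li|Q81999",                        qr/^>gi\|(\w+)\|\|li\|(\w+)/ ],
);

for my $case (@cases) {
    my ($header, $re) = @$case;
    print "id: $_\n" for header_ids($header, $re);
}
```

One caveat: a pattern with no capture groups at all returns the list (1) on success, so a user-supplied regexp without parentheses would yield "1" rather than an identifier.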

Anno


January Weiner
 
      11-13-2006
(E-Mail Removed) wrote:
> I'm not sure I entirely understand your question, but if you want to
> store regular expressions in a structure you can loop over, you just
> need the qr// operator. If I'm off base, just clarify what you mean and
> I'll try again, but I hope that helps!


Sorry, I think I did not make myself clear. Assume the following:

- you have a regular expression
- the regular expression contains an unknown number of groups enclosed in
parentheses
- you would like to print these groups, one by one.


If you know exactly that there are two groups, you can do the following:

$a =~ /(one) (two)/ ;

print "group one: $1\n" ;
print "group two: $2\n" ;

My question is: what can I do if I do not know the number of the groups?
For example, the regexp can be
/(one) (two)/

or it can be
/(one) (two) (three)/

or even
/(one) (two) (three) (four)/

My question rephrased: how can I loop through the automatic variables
$1 ... $n, where n is the number of groups in the regexp?
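
For readers who want the numbered groups themselves, one possible sketch uses the offset arrays @- and @+ described in perlvar: $#+ is the number of groups in the last successful match, and $-[$n] and $+[$n] give the start and end offsets of group $n. The helper name numbered_groups is made up for illustration.

```perl
use strict;
use warnings;

# Return the text of groups 1..n for a match of $re against $str.
sub numbered_groups {
    my ($str, $re) = @_;
    return unless $str =~ $re;      # empty list if the match fails
    my @groups;
    # $#+ is the number of capture groups in the last successful match.
    for my $n (1 .. $#+) {
        # $-[$n] is undef when group $n did not take part in the match.
        push @groups, defined $-[$n]
            ? substr($str, $-[$n], $+[$n] - $-[$n])
            : undef;
    }
    return @groups;
}

my @g = numbered_groups("one two three", qr/(one) (two) (three)/);
print "group $_: $g[$_ - 1]\n" for 1 .. scalar @g;
```

The match variables only stay valid until the next successful match in the same scope, which is why the sketch copies the groups into an array straight away.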

Regards,
j.

--
 
January Weiner
 
      11-13-2006
(E-Mail Removed)-berlin.de wrote:
> Very easy. Assuming the regex (with captures) in $re, and the string to
> match in $_ (untested):


> my @groups = m/$re/;


> A regex in list context returns all its captures.


Yes! That's it. Thank you so much. (very intuitive, when you think of it!)

j.

--
 
Mumia W. (reading news)
 
      11-13-2006
On 11/13/2006 06:49 AM, January Weiner wrote:
> Hello,
>
> I am matching a regexp with an a priori unknown number of groups. I would
> like to loop over all groups that were matched. For example:
>
> /(\w+)\s(\w+)/ ;
> #or
> /(\w+)\s(\w+)\s(\w+)/ ;
> # or something else
>
> @groups = ...???
>
> for( @groups ) {
> process_match( $_ ) ;
> }
>
> Of course, the above example is simplifying reality and could be replaced
> by split(). Here are more details on the problem:
>
> I am processing protein sequence files in the FASTA format. Depending on
> the database, the FASTA headers may look like that:
>
>> O81231 (Q81999) Dehydrogenase alpha subunit

>
> or like that
>
>> O81231 123 Q81999

>
> or
>
>> gi|O81231||li|Q81999

>
> or, possibly,
>
>> O81231; synonyms: Q81999, P89812, O77781

>
> or, basically, anything else. As you might guess, I'm interested in the
> "O81231" or "Q81999" part. The idea is that my utility can take an
> optional "regexp" string that matches the type of headers that are found in
> a given database; while looping through the database, the regexp is
> matched, and entries are made for any of the synonymous identifiers found
> in one header.
>
> Currently, I am assuming that I will not find more than four synonyms, and
> I do the following:
>
> for( $1, $2, $3, $4 ) {
> last unless $_ ;
> process_match( $_ ) ;
> }
>
> ...which is, of course, crap.
>
> Thanks in advance,
> January
>
> P.S. No, ([A-Z]\d{5}) would not match every identifier; the id format can
> differ as well. Sometimes it is HBA_HUMAN.
>


This

my @ids = /([[:upper:]\d]{3,})/g;

is a possibility.


--
(E-Mail Removed)
 
Dr.Ruud
 
      11-13-2006
(E-Mail Removed)-berlin.de wrote:

> Very easy. Assuming the regex (with captures) in $re, and the string
> to match in $_ (untested):
>
> my @groups = m/$re/;
>
> A regex in list context returns all its captures.



I think he meant to have only one (multi-format) capture in $re, in which
case the g-modifier is missing.

$ perl -wle'
$_ = "a b c";
@_ = /([a-z])/;
print "@_"
'
a

$ perl -wle'
$_ = "a b c";
@_ = /([a-z])/g;
print "@_"
'
a b c
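
The same idea applied to the headers from the original post, as a rough sketch: one capture plus /g in list context returns every identifier in the line. The id shape assumed here (one capital letter followed by five digits) is only an example, since the original poster says the format varies; all_ids is a hypothetical helper.

```perl
use strict;
use warnings;

# Assumed id shape, for illustration only: a capital letter and five digits.
my $id_re = qr/\b([A-Z]\d{5})\b/;

# With /g in list context, the match returns the capture from every
# successive match in the string.
sub all_ids {
    my ($header) = @_;
    my @ids = $header =~ /$id_re/g;
    return @ids;
}

for my $header (
    ">O81231 (Q81999) Dehydrogenase alpha subunit",
    ">gi|O81231||li|Q81999",
    ">O81231; synonyms: Q81999, P89812, O77781",
) {
    print join(" ", all_ids($header)), "\n";
}
```

If the pattern had several groups, /g in list context would return them flattened, one set per match.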

--
Affijn, Ruud

"Gewoon is een tijger."

 
anno4000@radom.zrz.tu-berlin.de
 
      11-14-2006
January Weiner <(E-Mail Removed)> wrote in comp.lang.perl.misc:
> (E-Mail Removed)-berlin.de wrote:
> > Very easy. Assuming the regex (with captures) in $re, and the string to
> > match in $_ (untested):

>
> > my @groups = m/$re/;

>
> > A regex in list context returns all its captures.

>
> Yes! That's it. Thank you so much. (very intuitive, when you think of it!)


It is. The behavior varies slightly with whether the regex has captures
and/or the /g modifier, but the variations usually do what you mean.

In fact, list assignment is the preferred method of accessing regex
captures. You avoid the special package variables $1, $2, ... and
their scoping issues. You can give the captures meaningful names,
individually or collectively. And, (your case), you don't have to
know in advance how many captures there are.

The only case where you can't avoid $1 etc. is when you need the
behavior of /g in scalar context and have captures.
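
A small sketch contrasting the two cases, with a header taken from the original post. The variable names and the pattern are illustrative assumptions.

```perl
use strict;
use warnings;

my $header = ">O81231; synonyms: Q81999, P89812, O77781";

# List assignment: the captures get meaningful names and $1, $2 are never touched.
my ($primary, $rest) = $header =~ /^>\s*(\w+);\s*synonyms:\s*(.*)$/;

# Scalar-context /g: each iteration resumes where the previous match ended,
# so the capture has to be read from $1 (and the position from pos()).
my @found;
while ($rest =~ /(\w+)/g) {
    push @found, [ $1, pos($rest) ];
}

print "primary: $primary\n";
print "synonym $_->[0] ends at offset $_->[1]\n" for @found;
```

The while loop is useful when something has to happen between matches, such as recording positions; when only the matched strings are wanted, the list-context form is shorter.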

Anno
 