Regular expressions (multiple match problem)

 
 
mikko.n
 
      04-02-2008
I have recently been experimenting with the GNU C library regular
expression functions and noticed a problem with pattern matching. It
seems to recognize only the first match and ignores the rest of them.
An example:

mikko.c:
-----

#include <stdio.h>
#include <regex.h>
#include <sys/types.h>

int main(void) {
    regex_t p;
    regmatch_t pm[2];
    regcomp(&p, "k", 0);
    regexec(&p, "mikko", 2, pm, 0);
    /* regoff_t is not guaranteed to be int, so cast for %d */
    printf("start=%d end=%d\n", (int)pm[0].rm_so, (int)pm[0].rm_eo);
    printf("start=%d end=%d\n", (int)pm[1].rm_so, (int)pm[1].rm_eo);
    regfree(&p);
    return 0;
}

-----

This is intended to match the regular expression 'k' against the string
'mikko' and return the start and end offsets of the first two matches in
the array pm of regmatch_t structures. The output, however, is:

$ ./mikko
start=2 end=3
start=-1 end=-1

instead of the expected

start=2 end=3
start=3 end=4

Is this a bug in the GNU library, or have I overlooked something? I have
not found any examples on the Internet of matching multiple
subexpressions with <regex.h> either.
With more complicated regular expressions it usually seems to return
only the first match, as here; with wildcards it returns the longest
match, but still only one.

Thanks,

Mikko Nummelin
 
Walter Roberson
 
      04-02-2008
In article <(E-Mail Removed)>,
mikko.n <(E-Mail Removed)> wrote:
>I have recently been experimenting with GNU C library regular
>expression functions and noticed a problem with pattern matching.


Then you should ask in a GNU newsgroup. Regular expressions are
not part of the C standard, so the proper usage of
any particular regular expression library should be discussed
in the appropriate forum for that library.
--
"They called it golf because all the other four letter words
were taken." -- Walter Hagen
 
Antoninus Twink
 
      04-02-2008
On 2 Apr 2008 at 6:20, mikko.n wrote:
> I have recently been experimenting with GNU C library regular
> expression functions and noticed a problem with pattern matching. It
> seems to recognize only the first match but ignoring the rest of them.

<snip>


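The problem is that you misunderstand what the pm array holds.

If the regex matches, then pm[0] contains the offsets of the (first)
match for the whole regex. But pm[1], pm[2], ... don't contain the
offsets of subsequent matches of the whole regex; they contain the
offsets of the parenthesized subexpressions within the match recorded in
pm[0]. A subexpression that took no part in the match gets -1 for both
offsets, and so does any slot beyond the last subexpression. That is why
your pm[1] printed -1: the pattern "k" has no subexpressions at all.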

For example, try:

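#include <stdio.h>
#include <regex.h>
#include <sys/types.h>

int main(void)
{
    regex_t p;
    regmatch_t pm[2];

    /* Basic syntax: \( \) marks a subexpression (the character after k). */
    if (regcomp(&p, "k\\(.\\)", 0) != 0)
        return 1;
    if (regexec(&p, "mikko", 2, pm, 0) == 0) {
        /* pm[0] is the whole match, pm[1] the first subexpression. */
        printf("start=%d end=%d\n", (int)pm[0].rm_so, (int)pm[0].rm_eo);
        printf("start=%d end=%d\n", (int)pm[1].rm_so, (int)pm[1].rm_eo);
    }
    regfree(&p);
    return 0;
}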


$ ./a
start=2 end=4
start=3 end=4

 
mikko.n
 
      04-02-2008
On 2 Apr, 11:01, Antoninus Twink <(E-Mail Removed)> wrote:

<snip>


Is there then a simple alternative that returns all the matches of the
original regexp in the text?

Mikko Nummelin
 
Flash Gordon
 
      04-02-2008
mikko.n wrote, On 02/04/08 09:37:
> On 2 Apr, 11:01, Antoninus Twink <(E-Mail Removed)> wrote:
>> On 2 Apr 2008 at 6:20, mikko.n wrote:


<snip>

> Is there then a simple alternative which would work so that it returns
> all the matches of the original regexp in the text?


As Walter suggested, ask in a GNU group or mailing list where your
question would be topical (there is one specifically for regexp) instead
of comp.lang.c where it is not.

I note that this time you have added a cross post to
comp.unix.programmer where your question might be topical, but why
continue posting where it is not?
--
Flash Gordon
 
Antoninus Twink
 
      04-02-2008
On 2 Apr 2008 at 8:37, mikko.n wrote:
> Is there then a simple alternative which would work so that it returns
> all the matches of the original regexp in the text?


Just use a loop, like this:


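#include <stdio.h>
#include <regex.h>
#include <sys/types.h>

int main(void)
{
    regex_t p;
    regmatch_t pm;
    const char *s = "mikko mikko";
    regoff_t last_match = 0;   /* offset in s where the next search starts */

    if (regcomp(&p, "k", 0) != 0)
        return 1;
    /* Each regexec call finds the first match in the rest of the string. */
    while (regexec(&p, s + last_match, 1, &pm, 0) == 0) {
        /* pm offsets are relative to s + last_match; make them absolute. */
        printf("start=%d end=%d\n",
               (int)(pm.rm_so + last_match), (int)(pm.rm_eo + last_match));
        /* Restart one past the start of this match, so overlapping matches
           are reported too. (A pattern that can match the empty string
           would also need an end-of-string check here.) */
        last_match += pm.rm_so + 1;
    }
    regfree(&p);
    return 0;
}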
