Velocity Reviews - Computer Hardware Reviews

RegEx engine returning empty matches between valid tokens.

 
 
John otac0n Gietzen
      02-05-2006
Dear RegEx Gurus,

I am writing an application to evaluate mathematical functions. The
first step in the process of creating the expressions is tokenizing the
input. I decided to use one large regular expression to perform this
tokenization:

~\G([a-zA-Z]\w*\(|[a-zA-Z]\w*|(<=|>=|!=|<>|==|=)|0x[\da-fA-F.]*|0b[\d.]*|[\d.]*|\s*|.)~

Now, according to my intuition, this should work. However, any time a
single character that is not explicitly recognized as a token comes by,
the regex engine returns two matches: one empty and one of the correct
character.

To simplify this odd behavior, I have prepared the following example:

Match the string
abcdefghijklmnop
to the expression
~\G(a|b|c*|\w)~

This "anomaly" is seen in the Perl, PHP, and C# regex engines (which
makes me think that it is expected behavior). The final destination
for this regex is C#, so I cannot just ignore null entries. (The C#
regex engine stops after the first null match.) Any help or advice
would be much appreciated.

Sincerely,
John "Otac0n" Gietzen

 
Xicheng
      02-05-2006
John otac0n Gietzen wrote:
> Dear RegEx Gurus,
>
> I am writing an application to evaluate mathematics functions. The
> first step in the process of creating the expressions is tokenizing the
> input. I decided to use one large regular expression to preform this
> tokenization:
>
> ~\G([a-zA-Z]\w*\(|[a-zA-Z]\w*|(<=|>=|!=|<>|==|=)|0x[\da-fA-F.]*|0b[\d.]*|[\d.]*|\s*|.)~
> Now, according to my intuition, this should work. However, any time a
> single character that is not explicitly recognized as a token comes by,
> the regex engine returns two matches: one empty and one of the correct
> character
>
> To simplify this odd behavior, I have prepared the following example:
>
> Match the string
> abcdefghijklmnop
> to the expression
> ~\G(a|b|c*|\w)~

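When you use "c*" as an alternative, the regex effectively behaves
like this:

~\G(a|b|c+||\w)~

so you have five alternatives (instead of four), one of which is empty.
The empty alternative always succeeds, so wherever "a", "b", and "c+"
fail, it produces a zero-length match before "\w" is ever tried. Perl
then retries at the same position, this time refusing a zero-length
match, and "\w" picks up the character. That is why you see an empty
match followed by the single character. If you want only one or more
"c" to show up in your matched text, use "c+" instead of "c*". The same
applies to "[\d.]*" and "\s*" in your full pattern.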
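
A minimal sketch of the difference, written in Python rather than Perl (my choice, not something from the thread). Python's re module has no \G, so pattern.match(text, pos) stands in for it here, and scan is a hypothetical helper name. The loop gives up at the first zero-length match, similar to what the original post reports for C#.

```python
import re

TEXT = "abcdefghijklmnop"

# Python's re has no \G anchor; pattern.match(text, pos) anchors at pos instead.
star = re.compile(r"a|b|c*|\w")  # alternation from the question
plus = re.compile(r"a|b|c+|\w")  # suggested fix: c+ cannot match nothing


def scan(pattern, text):
    """Collect anchored matches left to right; stop at the first empty match."""
    tokens, pos = [], 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None:
            break
        tokens.append(m.group())
        if m.end() == pos:  # zero-length match: the position cannot advance
            break
        pos = m.end()
    return tokens


star_tokens = scan(star, TEXT)
plus_tokens = scan(plus, TEXT)
print(star_tokens)  # stalls on an empty match right after "c"
print(plus_tokens)  # one token per character, no empty entries
```

With "c*" the scan stalls at "d" because the empty match wins before "\w" is tried; with "c+" every character is consumed.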

Xicheng
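
Applied to the full tokenizer pattern, the same change might look roughly like the sketch below. It is in Python, not the C# target, and TOKEN, tokenize, and SAMPLE are illustrative names with a made-up input. The only intended changes to the pattern are "[\d.]*" to "[\d.]+" and "\s*" to "\s+"; the inner comparison group is made non-capturing for convenience, and match(text, pos) again replaces \G.

```python
import re

# Original alternation with the two empty-capable branches tightened:
# [\d.]* -> [\d.]+ and \s* -> \s+. The 0x/0b branches already require a prefix,
# so they can never match the empty string.
TOKEN = re.compile(
    r"[a-zA-Z]\w*\(|[a-zA-Z]\w*|(?:<=|>=|!=|<>|==|=)|"
    r"0x[\da-fA-F.]*|0b[\d.]*|[\d.]+|\s+|."
)


def tokenize(text):
    """Split text into tokens by matching TOKEN repeatedly at the current position."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None or m.end() == pos:
            # Defensive guard: no branch should fail or match nothing here.
            raise ValueError(f"cannot tokenize at offset {pos}")
        tokens.append(m.group())
        pos = m.end()
    return tokens


SAMPLE = "sin(x) >= 0x1F + 2.5"  # made-up input for illustration
tokens = tokenize(SAMPLE)
print(tokens)
```

Whitespace runs still come back as tokens, as in the original pattern, so a caller would need to filter them out if they are not wanted.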


 
John otac0n Gietzen
      02-05-2006
Brilliant! Thanks very much.

 