Help simplify complex regexp needing positive lookahead and reluctant quantifiers

 
 
david.karr@wamu.net
      03-21-2005
I'm trying to build a regexp to handle somewhat complex data.

My sample data is the following (abstracted from real data):
--------------
*XXXlkjsflkw34lkjsfd
2XXXlkjsdfojsfjoimf344
3XXXabcdef9999999
4XXX9f9f9f9f9f9f9f9f
5XXXg8g8g8g8g8g8g8g
6XXXe6e6e6e6e6e6e6e6e
 YYY=D/23333333
 -xxxxxxxxxxxx
 -yyyyyyyyyyyy
 ZZZ=gggggggggggg
 AAA=hhhhhhhhhh
 -jjjjjjjjjjj
 -kkkkkkkkkkk
/XXX 2
--------------

The important elements are "XXX", "YYY", "ZZZ", and "AAA". Each of
"YYY", "ZZZ", and "AAA" could be in any order, and some could be
missing, or others like it could be added. What I'd like to build is a
regexp that can group each of "YYY", "ZZZ", and "AAA" along with their
"associated data", up to either the next "[A-Z]{3}=", or the ending
"/XXX". If I can get the "associated data" into group values, I can
use other regexps for the detail in those group values.

The regexp that I've built so far comes close to solving this, but not
quite. This is what I have so far:

--------------
"(?sm)\\*.{3}.*\n" +
"2.{3}.*\n" +
"3.{3}.*\n" +
"4.{3}.*\n" +
"5.{3}.*\n" +
"6.{3}.*\n" +
" ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" +
" ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" +
" ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" +
"/[A-Z]{3}.*";
--------------

You can ignore for now the fact that I'm not verifying that all the
places that require "XXX" are all "XXX". The problem area is the
"[A-Z]{3}=" groups. This regexp works for my sample data, but I wasn't
able to simplify those three repeated lines into a single expression,
which would handle any number of those. I tried the following, to
replace those three lines:

"( ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*"

but that didn't seem to work, and I'm not sure why.

The following is the output from my Java program, using the working
regexp, where it iterated through the found groups. I provide this
just as another view of what I'm trying to capture:

--------------
group[YYY=]
group[D/23333333
 -xxxxxxxxxxxx
 -yyyyyyyyyyyy
]
group[ZZZ=]
group[gggggggggggg
]
group[AAA=]
group[hhhhhhhhhh
 -jjjjjjjjjjj
 -kkkkkkkkkkk
]
--------------

 
Lisa
      03-21-2005

Did you consider using simpler expressions and passing over the data
in two (or more) passes, the way Unix folks like to do?

grep "pat1" filename | grep "pat2" | grep "pat3"


 
Alan Moore
      03-21-2005

The "(?sm)" at the beginning puts the whole regex in DOTALL and
MULTILINE mode. The 'm' is having no effect, since you aren't using
any line anchors; the 's' is what's causing your problem. Each ".*"
initially gobbles up the whole rest of the input, then backs off as
far as necessary to permit the next part of the regex to match. That
works as intended until the line starting with '6' is reached. After
the dot-star there wolfs everything down, it starts regurgitating as
usual. When it reaches the '/' at the beginning of the last line, the
rest of the regex is able to match, because your combined
subexpression is optional. The dot-star in the '6' line ends up
keeping all the text the subexpression was supposed to match.
Changing the "*" that controls the subexpression to a "+" won't
help--it will only force the subexpression to match once, letting the
dot-star keep anything else.

You could fix that by making all the dot-stars reluctant, but a better
way (more efficient, less error-prone) would be to remove the "(?sm)"
and add "(?s)" to the subexpression, since that's the only place you
actually need DOTALL mode:

--------------
"\\*.{3}.*\n" +
"2.{3}.*\n" +
"3.{3}.*\n" +
"4.{3}.*\n" +
"5.{3}.*\n" +
"6.{3}.*\n" +
"((?s: ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*)" +
"/[A-Z]{3}.*";
--------------

Note that I also changed the subexpression's enclosing group to
non-capturing, and put the capturing group around it and its
quantifier. That way, all the YYY|ZZZ|AAA entries with their
associated data are captured in group(1). The way you had it, only
the last entry would have been retained.
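
To see the DOTALL effect in isolation, here is a minimal sketch that runs the same subexpression twice against a shortened stand-in for the sample data: once with "(?s)" applied to the whole regex, and once with the flag confined to the group. The class name DotAllScopeDemo, the helper capture() and the cut-down input are made up for illustration; only java.util.regex is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the input is a shortened stand-in for the sample data.
final class DotAllScopeDemo {
    static final String INPUT =
          "6XXXe6e6\n"
        + " YYY=abc\n"
        + " ZZZ=def\n"
        + "/XXX 2";

    // The subrecord block (DOTALL scoped to the group), then the trailer line.
    static final String TAIL =
          "((?s: [A-Z]{3}=.*?(?= [A-Z]{3}=|/[A-Z]{3}))*)"
        + "/[A-Z]{3}.*";

    // Returns group(1) of the first match, or null if nothing matches.
    static String capture(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // DOTALL for the whole regex: the greedy ".*" on the '6' line runs to
        // the end of the input and only backs off as far as the last newline.
        String global = capture("(?s)6.{3}.*\n" + TAIL);

        // DOTALL only inside the group: ".*" stops at the end of the '6' line.
        String scoped = capture("6.{3}.*\n" + TAIL);

        System.out.println("global DOTALL: [" + global + "]");
        System.out.println("scoped DOTALL: [" + scoped + "]");
    }
}
```

With the global flag the regex still matches, but group(1) comes back empty because the dot-star on the '6' line has kept the subrecords. With the scoped flag, group(1) holds both entries.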

 
david.karr@wamu.net
      03-23-2005
Ok, this looks very promising, but it doesn't quite work yet. I'll
provide both the regexp I'm using and a sample string, so you can
check what I see, if you're able to.

I'm also wondering whether you meant to enter "?s:", or "(?s)" instead.
I tried both variations, with the same result.

The regexp I'm now using is this:
---------------
"\\*.{3}.*\n" +
"2.{3}.*\n" +
"3.{3}.*\n" +
"4.{3}.*\n" +
"5.{3}.*\n" +
"6.{3}.*\n" +
"((?s: ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*)" +
"/[A-Z]{3}.*";
---------------

My sample data is this:
---------------
*XXXlkjsflkw34lkjsfd
2XXXlkjsdfojsfjoimf344
3XXXabcdef9999999
4XXX9f9f9f9f9f9f9f9f
5XXXg8g8g8g8g8g8g8g
6XXXe6e6e6e6e6e6e6e6e
 YYY=D/23333333
 -xxxxxxxxxxxx
 -yyyyyyyyyyyy
 ZZZ=gggggggggggg
 AAA=hhhhhhhhhh
 -jjjjjjjjjjj
 -kkkkkkkkkkk
/XXX 2
---------------

My code is roughly this:
---------------
Pattern pattern = Pattern.compile(patternMask);
Matcher matcher = pattern.matcher(readSample);
System.out.println("groupCount[" + matcher.groupCount() + "]");
boolean found = matcher.find();
System.out.println("found[" + found + "]");
---------------

Where "patternMask" and "readSample" correspond to my regexp and the
sample data.

With this regexp and sample data, the "groupCount" prints out as "3",
and "found" is false.

 
Alan Moore
      03-24-2005
On 23 Mar 2005 12:25:04 -0800, (E-Mail Removed) wrote:

>Ok, this looks very promising, but it doesn't quite work yet. I'll
>provide both the regexp I'm using and a sample string, so you can
>check what I see, if you're able to.


That looks like what I'm doing; here's my test code:

//==== code ========================================================

import java.util.regex.*;

public class Test
{
  public static void main(String[] args)
  {
    String regex = "\\*.{3}.*\n"
                 + "2.{3}.*\n"
                 + "3.{3}.*\n"
                 + "4.{3}.*\n"
                 + "5.{3}.*\n"
                 + "6.{3}.*\n"
                 + "((?s: ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*)"
                 + "/[A-Z]{3}.*";

    String input = "*XXXlkjsflkw34lkjsfd\n"
                 + "2XXXlkjsdfojsfjoimf344\n"
                 + "3XXXabcdef9999999\n"
                 + "4XXX9f9f9f9f9f9f9f9f\n"
                 + "5XXXg8g8g8g8g8g8g8g\n"
                 + "6XXXe6e6e6e6e6e6e6e6e\n"
                 + " YYY=D/23333333\n"
                 + " -xxxxxxxxxxxx\n"
                 + " -yyyyyyyyyyyy\n"
                 + " ZZZ=gggggggggggg\n"
                 + " AAA=hhhhhhhhhh\n"
                 + " -jjjjjjjjjjj\n"
                 + " -kkkkkkkkkkk\n"
                 + "/XXX 2";

    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    if (m.find())
    {
      System.out.println(m.group(1));
    }
  }
}

//==================================================================

This prints:

 YYY=D/23333333
 -xxxxxxxxxxxx
 -yyyyyyyyyyyy
 ZZZ=gggggggggggg
 AAA=hhhhhhhhhh
 -jjjjjjjjjjj
 -kkkkkkkkkkk

>
>I'm also wondering whether you meant to enter "?s:", or "(?s)" instead.
> I tried both variations, with the same result.


"(?s)" sets the DOTALL flag for the rest of the regex, or until you
cancel it with "(?-s)". "(?s:<expr>)" both creates a non-capturing
group and sets the flag, but the flag is in effect only within that
group.
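
A small sketch of the three forms side by side may make the scoping easier to see. The patterns and inputs here are toy examples rather than the regex from this thread, and the class name InlineFlagScopeDemo and helper found() are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the patterns are toy examples, not the thread's regex.
final class InlineFlagScopeDemo {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String twoBreaks = "a\nb\nc"; // newline in both dot positions
        String oneBreak  = "a\nbxc";  // newline only in the first dot position

        // (?s) stays on to the end of the regex, so both dots match '\n'.
        System.out.println(found("(?s)a.b.c", twoBreaks));      // true

        // (?-s) switches it off again; the second dot no longer matches '\n'.
        System.out.println(found("(?s)a.b(?-s).c", twoBreaks)); // false
        System.out.println(found("(?s)a.b(?-s).c", oneBreak));  // true

        // (?s:...) confines the flag to the group, with the same effect.
        System.out.println(found("(?s:a.b).c", twoBreaks));     // false
        System.out.println(found("(?s:a.b).c", oneBreak));      // true
    }
}
```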
 
david.karr@wamu.net
      03-24-2005
Ok, the difference between our two tests was that my sample uses
"\r\n" for line endings. Once I changed my pattern to check for that
explicitly, I got similar output. I tried some variations with "$" and
"(?m)", but it only got past this when I specifically used "\r\n".
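
For anyone hitting the same thing, here is a minimal sketch of that line-ending mismatch on a two-line stand-in for the data. The class name LineEndingDemo and helper found() are made up for illustration, and the last pattern uses "\R", which assumes Java 8 or later.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the input is a two-line stand-in for the sample data.
final class LineEndingDemo {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String unix    = "6XXXabc\n YYY=1";
        String windows = "6XXXabc\r\n YYY=1";

        // Outside DOTALL mode "." matches neither '\r' nor '\n', so ".*" stops
        // before the '\r' and the literal "\n" that follows cannot match it.
        System.out.println(found("6.*\n YYY", unix));         // true
        System.out.println(found("6.*\n YYY", windows));      // false

        // MULTILINE "$" matches before "\r\n" but consumes nothing, so the
        // "\n" after it still faces the '\r'.
        System.out.println(found("(?m)6.*$\n YYY", windows)); // false

        // An optional '\r' accepts both conventions.
        System.out.println(found("6.*\r?\n YYY", unix));      // true
        System.out.println(found("6.*\r?\n YYY", windows));   // true

        // Assumption: Java 8+, where \R matches any line break sequence.
        System.out.println(found("6.*\\R YYY", windows));     // true
    }
}
```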

However, now I have to go deeper into this, and the current expression
doesn't quite do what I need.

What I really need to capture in individual groups would be the
following (each group surrounded by brackets):

[YYY=]
[D/23333333
xxxxxxxxxxxx
yyyyyyyyyyyy]
[ZZZ=]
[gggggggggggg]
[AAA=]
[hhhhhhhhhh
jjjjjjjjjjj
kkkkkkkkkkk]

Note that I've removed the initial spaces and dashes. That's my end
state, but I can work to that step by step.

When my code steps through all the groups it found, it finds this:

---------------
group[ YYY=D/23333333
 -xxxxxxxxxxxx
 -yyyyyyyyyyyy
 ZZZ=gggggggggggg
 AAA=hhhhhhhhhh
 -jjjjjjjjjjj
 -kkkkkkkkkkk
]
group[AAA=]
group[hhhhhhhhhh
 -jjjjjjjjjjj
 -kkkkkkkkkkk
]
---------------

I don't care about the first group, because that surrounds all of the
subrecords. I would have hoped that the next group would be "YYY=",
followed by the group with its associated data, and so on.

 
Alan Moore
      03-24-2005

When you have a capturing group that's controlled by a quantifier, the
only thing you can retrieve after a successful match is the *last*
thing that was matched by that group. Remember that the groupCount()
method only tells you how many capturing groups there are in the
Matcher's parent Pattern; it doesn't say anything about what was
actually matched.
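
Here is a minimal sketch of that behaviour on a toy "KEY=digits;" input (not the real data). The class name RepeatedGroupDemo and its two helper methods are made up for illustration. It contrasts a quantified group with an unquantified pattern applied repeatedly via find().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class working on a toy input, not the thread's data.
final class RepeatedGroupDemo {
    static final String INPUT = "YYY=1;ZZZ=2;AAA=3;";

    // Quantified group: after the match, groups 1 and 2 hold only the values
    // from the final repetition. groupCount() just counts the parentheses.
    static String lastIteration() {
        Matcher m = Pattern.compile("(?:([A-Z]{3})=(\\d+);)*").matcher(INPUT);
        if (!m.matches()) {
            return null;
        }
        return m.groupCount() + " groups, " + m.group(1) + "=" + m.group(2);
    }

    // Unquantified pattern applied in a find() loop: every entry is visible.
    static List<String> allEntries() {
        List<String> entries = new ArrayList<>();
        Matcher m = Pattern.compile("([A-Z]{3})=(\\d+);").matcher(INPUT);
        while (m.find()) {
            entries.add(m.group(1) + "=" + m.group(2));
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(lastIteration()); // 2 groups, AAA=3
        System.out.println(allEntries());    // [YYY=1, ZZZ=2, AAA=3]
    }
}
```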

You initially changed your regex to match all the subrecords with a
quantified subexpression because you didn't know how many subrecords
there would be. When you did that, you gave up the ability to break
out the individual subrecords in a single pass. What you have to do
now is take the substring containing the subrecords and process it
separately to break them out. In the following code, I went ahead and
added a third layer of processing to get rid of those initial spaces
and dashes as well.

//==== code ========================================================

import java.util.regex.*;

public class Test
{
  public static void main(String[] args)
  {
    String regex1 = "\\*.{3}.*\r?\n"
                  + "2.{3}.*\r?\n"
                  + "3.{3}.*\r?\n"
                  + "4.{3}.*\r?\n"
                  + "5.{3}.*\r?\n"
                  + "6.{3}.*\r?\n"
                  + "((?s: [A-Z]{3}=.*?(?=[ /][A-Z]{3}))*)"
                  + "/[A-Z]{3}.*";
    Pattern p1 = Pattern.compile(regex1);

    String regex2 = "(?s) ([A-Z]{3}=)(.*?)(?=\r?\n [A-Z]{3}|$)";
    Pattern p2 = Pattern.compile(regex2);

    String regex3 = "(?: -)?(.+)";
    Pattern p3 = Pattern.compile(regex3);

    String input = "*XXXlkjsflkw34lkjsfd\n"
                 + "2XXXlkjsdfojsfjoimf344\n"
                 + "3XXXabcdef9999999\n"
                 + "4XXX9f9f9f9f9f9f9f9f\n"
                 + "5XXXg8g8g8g8g8g8g8g\n"
                 + "6XXXe6e6e6e6e6e6e6e6e\n"
                 + " YYY=D/23333333\n"
                 + " -xxxxxxxxxxxx\n"
                 + " -yyyyyyyyyyyy\n"
                 + " ZZZ=gggggggggggg\n"
                 + " AAA=hhhhhhhhhh\n"
                 + " -jjjjjjjjjjj\n"
                 + " -kkkkkkkkkkk\n"
                 + "/XXX 2";

    Matcher m1 = p1.matcher(input);
    if (m1.find())
    {
      String sub = m1.group(1);
      Matcher m2 = p2.matcher(sub);
      while (m2.find())
      {
        System.out.println("[" + m2.group(1) + "]");
        String subsub = m2.group(2);
        System.out.print("[");
        Matcher m3 = p3.matcher(subsub);
        while (m3.find())
        {
          System.out.println(m3.group(1));
        }
        System.out.println("]");
      }
    }
  }
}

//==================================================================

result:

[YYY=]
[D/23333333
xxxxxxxxxxxx
yyyyyyyyyyyy
]
[ZZZ=]
[gggggggggggg
]
[AAA=]
[hhhhhhhhhh
jjjjjjjjjjj
kkkkkkkkkkk
]
 
david.karr@wamu.net
      03-25-2005
Excellent. Thanks for the thorough detail. This could have been a
whole chapter in "Regular Expression Recipes".

 