Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Ruby > recursive brace matching with Ruby regexp

Reply
Thread Tools

recursive brace matching with Ruby regexp

 
 
Jason Sweat
Guest
Posts: n/a
 
      11-05-2004
I wanted to learn Ruby, so I picked a small task of trying to write a
command line script to parse PHP classes and shell out some unit test
cases. I have it working for the most part, but I ran across a
problem trying to use Ruby regexp to find a set of matching curly
braces.

Please forgive the intrusion of this PHP code onto the list, but I
wanted to give you the flavor of what I am attempting to do, that can
be easily done with recursive regular expression available in the Perl
compatiable regexp engine.

<php>
$test = <<<EOS
/* some stuff */
class foo {
public \$var;
public function __construct() {}
public function bar() {
if (false) {
}
}

}
// some other stuff
EOS;

$re = <<<EOS
~(class\s+\w+\s+({((?>[^{}]+)|(?2))*}))~xms
EOS;
preg_match($re, $test, $match);
echo "your class matched:\n", $match[1];

</php>

Now it appears the regexp engine in Ruby does not support recursion
(at least in Ruby ruby 1.8.2 (2004-07-29) [i686-linux] that I am
working on, and with what I know how to test), thus the only
workaround I found was very ugly, model the nesting of braces to a
fixed depth, i.e.

open = '\{'
close = '\}'
other = '[^\{\}]*'
l1 = other+open+other+close+other
l2 = other+open+'('+l1 +')+'+other+close+other
l3 = other+open+'('+l2 +')+'+other+close+other
l4 = other+open+'('+l3 +')+'+other+close+other
l5 = other+open+'('+l4 +')+'+other+close+other
re = Regexp.new('class\s+'+@name+'\s+'+open+'((?:'+l5+' )|(?:'+l4+')|(?:'+l3+')|(?:'+l2+')|(?:'+l1+')|(?:' +other+
'))+'+close, 'ixm')

This code did work, but sometimes timed out on valid real classes.

I expect I am probably missing some facet of Ruby that eaily allows me
to next regexp inside of the Ruby code in some fasion to achieve the
result I am looking for, but how to do so eludes me. Can anyone
provide some insight for me on this situation?

Thanks,

Regards,
Jason
--
http://blog.casey-sweat.us/


 
Reply With Quote
 
 
 
 
James Edward Gray II
Guest
Posts: n/a
 
      11-05-2004
On Nov 4, 2004, at 6:04 PM, Jason Sweat wrote:

> I expect I am probably missing some facet of Ruby that eaily allows me
> to next regexp inside of the Ruby code in some fasion to achieve the
> result I am looking for, but how to do so eludes me. Can anyone
> provide some insight for me on this situation?


Ruby's regex engine doesn't yet support this, but will in future
versions.

Hope that helps.

James Edward Gray II



 
Reply With Quote
 
 
 
 
Bill Kelly
Guest
Posts: n/a
 
      11-05-2004

From: "James Edward Gray II" <(E-Mail Removed)>
> On Nov 4, 2004, at 6:04 PM, Jason Sweat wrote:
>
> > I expect I am probably missing some facet of Ruby that eaily allows me
> > to next regexp inside of the Ruby code in some fasion to achieve the
> > result I am looking for, but how to do so eludes me. Can anyone
> > provide some insight for me on this situation?

>
> Ruby's regex engine doesn't yet support this, but will in future
> versions.


Just for fun,


dat = <<-"END_TEXT"
/* some stuff */
class foo {
public \$var;
public function __construct() {}
public function bar() {
if (false) {
}
}

}
// some other stuff
END_TEXT

repl = []
true while dat.gsub!(/\([^(){}]*\)|\{[^(){}]*\}/) do |str|
repl.push(str)
"\0#{repl.length - 1}\0"
end
dat.scan(/class\s+\w+\s+\0\d+\0/) do
match = $&
true while match.gsub!(/\0(\d+)\0/) { repl[$1.to_i] }
puts "Your class matched:", match
end



The above matches arbitrarily deeply nested ( ) and { }
blocks... but all the "classes" it matches have to be
at toplevel, can't be inside a { } or ( ) themselves...

I don't think it would be hard to modify to handle nested
classes, but I haven't thought it through...


In case it's not obvious how it works... It replaces,
from innermost to outermost, () and {} spans with a
token, and saves the original span in an array.

When it finds a class match followed by one of these
tokens, it expands it back out to its original
representation.


Regards,

Bill




 
Reply With Quote
 
Mark Hubbart
Guest
Posts: n/a
 
      11-05-2004
Hi,

On Fri, 5 Nov 2004 09:04:30 +0900, Jason Sweat <(E-Mail Removed)> wrote:
> I wanted to learn Ruby, so I picked a small task of trying to write a
> command line script to parse PHP classes and shell out some unit test
> cases. I have it working for the most part, but I ran across a
> problem trying to use Ruby regexp to find a set of matching curly
> braces.
>
> Please forgive the intrusion of this PHP code onto the list, but I
> wanted to give you the flavor of what I am attempting to do, that can
> be easily done with recursive regular expression available in the Perl
> compatiable regexp engine.
>
> <php>
> $test = <<<EOS
> /* some stuff */
> class foo {
> public \$var;
> public function __construct() {}
> public function bar() {
> if (false) {
> }
> }
>
> }
> // some other stuff
> EOS;
>
> $re = <<<EOS
> ~(class\s+\w+\s+({((?>[^{}]+)|(?2))*}))~xms
> EOS;
> preg_match($re, $test, $match);
> echo "your class matched:\n", $match[1];
>
> </php>
>
> Now it appears the regexp engine in Ruby does not support recursion
> (at least in Ruby ruby 1.8.2 (2004-07-29) [i686-linux] that I am
> working on, and with what I know how to test), thus the only
> workaround I found was very ugly, model the nesting of braces to a
> fixed depth, i.e.
>
> open = '\{'
> close = '\}'
> other = '[^\{\}]*'
> l1 = other+open+other+close+other
> l2 = other+open+'('+l1 +')+'+other+close+other
> l3 = other+open+'('+l2 +')+'+other+close+other
> l4 = other+open+'('+l3 +')+'+other+close+other
> l5 = other+open+'('+l4 +')+'+other+close+other
> re = Regexp.new('class\s+'+@name+'\s+'+open+'((?:'+l5+' )|(?:'+l4+')|(?:'+l3+')|(?:'+l2+')|(?:'+l1+')|(?:' +other+
> '))+'+close, 'ixm')
>
> This code did work, but sometimes timed out on valid real classes.
>
> I expect I am probably missing some facet of Ruby that eaily allows me
> to next regexp inside of the Ruby code in some fasion to achieve the
> result I am looking for, but how to do so eludes me. Can anyone
> provide some insight for me on this situation?


Regular expressions, by all standard definitions, aren't recursive.
Perl's regexen have been extended to allow it, but it really isn't
considered a standard regex feature. You might try using a simple
tokenizer... here's a quick attempt:

def parse(code)
chunks = []
loop do
chunks << text.slice!(/\A.*?(?=[{}])/m) # match start of string to
before next bracket
bracket = text.slice! 0
chunks << parse(text) if bracket == ?{
return chunks if bracket == ?}
return chunks if text.size == 0
end
end

This returns a recursive array that holds all the text chunks around
the brackets. here's some sample code (can't remember exactly what php
looks like ATM) and sample output:

PHP code:

class Foo {
def bar(){
if true{
do_stuff()
}else{
do_nothing()
}
clean_up()
}
}


Then, after recursively stripping whitespace from strings, the
pretty-printed array:

["class Foo",
["def bar()",
["if true", ["do_stuff()"], "else", ["do_nothing()"], "clean_up()"],
""],
nil]

uh, yeah, I know it drops off the last piece of text (reads it as nil)
but I don't want to figure that out just yet. Dinner calls

HTH,
Mark



>
> Thanks,
>
> Regards,
> Jason
> --
> http://blog.casey-sweat.us/
>
>



 
Reply With Quote
 
Jason Sweat
Guest
Posts: n/a
 
      11-05-2004
On Fri, 5 Nov 2004 10:39:11 +0900, Bill Kelly <(E-Mail Removed)> wrote:
> repl = []
> true while dat.gsub!(/\([^(){}]*\)|\{[^(){}]*\}/) do |str|
> repl.push(str)
> "\0#{repl.length - 1}\0"
> end
> dat.scan(/class\s+\w+\s+\0\d+\0/) do
> match = $&
> true while match.gsub!(/\0(\d+)\0/) { repl[$1.to_i] }
> puts "Your class matched:", match
> end
>
> The above matches arbitrarily deeply nested ( ) and { }
> blocks... but all the "classes" it matches have to be
> at toplevel, can't be inside a { } or ( ) themselves...
>
> In case it's not obvious how it works... It replaces,
> from innermost to outermost, () and {} spans with a
> token, and saves the original span in an array.


Thanks Bill,

I think I see how that works. I will play with it and see if it
solves my problem. I knew there had to be a much cleaner way than
what I was doing

Regards,
Jason
--
http://blog.casey-sweat.us/


 
Reply With Quote
 
James Edward Gray II
Guest
Posts: n/a
 
      11-05-2004
On Nov 4, 2004, at 8:04 PM, Mark Hubbart wrote:

> Regular expressions, by all standard definitions, aren't recursive.
> Perl's regexen have been extended to allow it, but it really isn't
> considered a standard regex feature.


Of course, you are correct. However, recursive "Regular Expressions"
are becoming fairly common place now. Ruby's next regex engine will
support them as well.

Regular Expression has evolved considerably from the original
definition, with recursive capabilities being just another change in a
long line of added usefulness. Does that really means they cease to be
Regular Expressions? Longhorn will still be Windows, right?

I think of it more in terms of supersets or, for a programming slant,
subclassing. The spirit of Regular Expressions, patterns to
locate/breakdown text, is intact, I believe. They're just even more
handy now.

Just my two cents.

James Edward Gray II

P.S. Proving whether or not Perl 6 changes will still be "Regular
Expression" is left as an exercise for the reader.



 
Reply With Quote
 
Mark Hubbart
Guest
Posts: n/a
 
      11-05-2004
Hi,

On Fri, 5 Nov 2004 12:17:01 +0900, James Edward Gray II
<(E-Mail Removed)> wrote:
> On Nov 4, 2004, at 8:04 PM, Mark Hubbart wrote:
>
> > Regular expressions, by all standard definitions, aren't recursive.
> > Perl's regexen have been extended to allow it, but it really isn't
> > considered a standard regex feature.

>
> Of course, you are correct. However, recursive "Regular Expressions"
> are becoming fairly common place now. Ruby's next regex engine will
> support them as well.


Are you saying Oniguruma *will* support it, or *does* support it? It
would be nice if it already does

> Regular Expression has evolved considerably from the original
> definition, with recursive capabilities being just another change in a
> long line of added usefulness. Does that really means they cease to be
> Regular Expressions? Longhorn will still be Windows, right?
>
> I think of it more in terms of supersets or, for a programming slant,
> subclassing. The spirit of Regular Expressions, patterns to
> locate/breakdown text, is intact, I believe. They're just even more
> handy now.


My statement wasn't supposed to insinuate that adding recursion is
bad, or turns regular expressions into something else; I was just
pointing out that (last time I checked) you can't *expect* to be able
to use recursion in regexen in a particular language. Perl's regexen
are *more* than regexen; you can even embed perl code in them. Also,
I'm pretty sure the recursive matching wasn't in Perl when I last used
it, a couple years back.

> Just my two cents.
>
> James Edward Gray II
>
> P.S. Proving whether or not Perl 6 changes will still be "Regular
> Expression" is left as an exercise for the reader.




cheers,
Mark


 
Reply With Quote
 
James Edward Gray II
Guest
Posts: n/a
 
      11-05-2004
On Nov 4, 2004, at 9:59 PM, Mark Hubbart wrote:

> Are you saying Oniguruma *will* support it, or *does* support it? It
> would be nice if it already does


Sure does.

James Edward Gray II



 
Reply With Quote
 
Mark Hubbart
Guest
Posts: n/a
 
      11-05-2004
On Fri, 5 Nov 2004 13:06:40 +0900, James Edward Gray II
<(E-Mail Removed)> wrote:
> On Nov 4, 2004, at 9:59 PM, Mark Hubbart wrote:
>
> > Are you saying Oniguruma *will* support it, or *does* support it? It
> > would be nice if it already does

>
> Sure does.


Indeed it does! I just tried it out. And named subexpressions work too
many new features since I last checked it out:

http://www.geocities.jp/kosako1/oniguruma/doc/RE.txt

Thanks for pointing it out.

cheers,
Mark


 
Reply With Quote
 
Jos Backus
Guest
Posts: n/a
 
      11-05-2004
On Fri, Nov 05, 2004 at 02:05:00PM +0900, Mark Hubbart wrote:
> http://www.geocities.jp/kosako1/oniguruma/doc/RE.txt


A-1. Syntax depend options

+ ONIG_SYNTAX_RUBY
(?m): dot(.) match newline

+ ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
(?s): dot(.) match newline
(?m): ^ match after newline, $ match before newline

Any idea why Ruby was chosen to behave differently from Perl here?

--
Jos Backus _/ _/_/_/ Sunnyvale, CA
_/ _/ _/
_/ _/_/_/
_/ _/ _/ _/
jos at catnook.com _/_/ _/_/_/ require 'std/disclaimer'


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Matching block of nested brace pairs Peng Yu Perl Misc 4 06-14-2010 03:16 AM
[regexp] How to convert string "/regexp/i" to /regexp/i - ? Joao Silva Ruby 16 08-21-2009 05:52 PM
Matching brace in eclipse Orson Java 8 07-04-2008 03:22 PM
Searching for Kevin Brace (Graphic chip research information) Derek Simmons VHDL 1 03-31-2005 01:51 AM
xerces parser won't open file with brace in the name Andy Fish Java 3 11-19-2003 08:21 PM



Advertisments