Match a pattern multiple times, returning matches, captures and offset?
Hi,
I'm used to being able to use the following in PHP. It returns all matches, including the captures, ordered by match set, and provides the offsets:

    $ php -r 'preg_match_all("/_(\w+)_/", "_foo_ _bar_", $matches, PREG_SET_ORDER|PREG_OFFSET_CAPTURE); var_dump($matches);'
    array(2) {
      [0]=>
      array(2) {
        [0]=>
        array(2) {
          [0]=>
          string(5) "_foo_"
          [1]=>
          int(0)
        }
        [1]=>
        array(2) {
          [0]=>
          string(3) "foo"
          [1]=>
          int(1)
        }
      }
      [1]=>
      array(2) {
        [0]=>
        array(2) {
          [0]=>
          string(5) "_bar_"
          [1]=>
          int(6)
        }
        [1]=>
        array(2) {
          [0]=>
          string(3) "bar"
          [1]=>
          int(7)
        }
      }
    }

I've found two ways in Ruby that go in this direction, String#match and String#scan, but each gives me only part of the information. I guess I could combine the two, but before attempting that I wanted to check that I haven't overlooked something. Here are my Ruby attempts:

    ruby-1.9.2-p180 :001 > m = "_foo_ _bar_".match(/_(\w+)_/)
     => #<MatchData "_foo_" 1:"foo">
    ruby-1.9.2-p180 :002 > [ m[0], m[1] ]
     => ["_foo_", "foo"]
    ruby-1.9.2-p180 :003 > [ m.begin(0), m.begin(1) ]
     => [0, 1]

But here I'm missing the further possible matches, "_bar_" and "bar". Or the #scan approach:

    ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
     => [["foo"], ["bar"]]

But in this case I have even less information: the whole match (_foo_ or _bar_) is not present, and I can't get the offsets either.

I re-read the documentation for Regexp#match and found out that you can pass an offset into the string as the second parameter, so I guess I can iterate over the string in a loop until I find no further matches ...?
Considering this I came up with:

    $ cat test_match_all.rb
    require 'pp'

    class String
      def match_all(pattern)
        matches = []
        offset = 0
        while m = match(pattern, offset) do
          matches << m
          offset = m.begin(0) + m[0].length
        end
        matches
      end
    end

    pp "_foo_ _bar_ _baz_".match_all(/_(\w+)_/)

    $ ruby test_match_all.rb
    [#<MatchData "_foo_" 1:"foo">,
     #<MatchData "_bar_" 1:"bar">,
     #<MatchData "_baz_" 1:"baz">]

I have lots of data to parse, so I can foresee this approach becoming a bottleneck. Is there a more direct solution?

thanks,
- Markus
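One thing to watch with the loop above: if the pattern can match the empty string, the offset never advances and the loop does not terminate. A minimal sketch of a guarded variant follows. It is written as a free-standing method rather than a String monkey patch, and the name `match_all` and its argument order are illustrative only, not an existing Ruby API.

```ruby
# Hypothetical helper (not part of Ruby's core API): collect one MatchData
# per match by repeatedly calling Regexp#match with a start offset.
def match_all(string, pattern)
  matches = []
  offset = 0
  while offset <= string.length && (m = pattern.match(string, offset))
    matches << m
    # Continue after the match; step one extra character after an empty
    # match so the loop cannot stall at the same position.
    offset = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  matches
end

all = match_all("_foo_ _bar_ _baz_", /_(\w+)_/)
p all.map { |m| [m[0], m.begin(0)] }
# => [["_foo_", 0], ["_bar_", 6], ["_baz_", 12]]
```

Whether this is fast enough for large inputs is something to measure; the sketch only addresses termination.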
Re: Match a pattern multiple times, returning matches, captures and offset?
String#scan with a block may do what you want:

    >> "_foo_ _bar_".scan(/_(\w+)_/) { |x| puts "Offset #{$`.size}, captures #{x.inspect}" }
    Offset 0, captures ["foo"]
    Offset 6, captures ["bar"]
    => "_foo_ _bar_"

But it doesn't give you offsets to the individual captures, just to the start of the whole match. (You also get the full match in $& and the rest of the string after the match in $')

-- Posted via http://www.ruby-forum.com/.
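To make the special variables mentioned above concrete, here is a small sketch that records them for each iteration of the scan block. The variable name `rows` is just for illustration.

```ruby
# Record, per match: offset of the whole match (length of the pre-match $`),
# the whole match ($&), the rest of the string ($'), and the captures.
rows = []
"_foo_ _bar_".scan(/_(\w+)_/) do |captures|
  rows << [$`.size, $&, $', captures]
end

rows.each { |row| p row }
# [0, "_foo_", " _bar_", ["foo"]]
# [6, "_bar_", "", ["bar"]]
```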
Re: Match a pattern multiple times, returning matches, captures and offset?
Markus Fischer wrote in post #991092:
> But here I'm missing the further possible matches, "_bar_" and "bar". Or
> the #scan approach:
>
> ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
> => [["foo"], ["bar"]]
>
> But in this case I've even less information, the match including _foo_
> or _bar_ is not present and I can't get the offsets too.
>
> I re-read the documentation for Regexp#match

If you look at the preamble in the docs for the MatchData class, you will see that you can retrieve a MatchData object using Regexp.last_match, which you can call inside a scan() block:

    str = "_foo_ _bar_"

    str.scan(/_(\w+)_/) do |match|
      md = Regexp.last_match
      p [md[0], md[1], md.offset(1)]
    end

    --output:--
    ["_foo_", "foo", [1, 4]]
    ["_bar_", "bar", [7, 10]]
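Building on the scan plus Regexp.last_match approach, the following is a rough sketch of how the nested structure from the PHP example in the original question could be reproduced. The method name `match_all_with_offsets` is made up for this illustration. Note that MatchData#begin returns character offsets, whereas PHP's PREG_OFFSET_CAPTURE reports byte offsets, so the numbers only agree for single-byte text.

```ruby
# Hypothetical helper: one entry per match ("set order"), each entry holding
# a [text, offset] pair for the whole match (group 0) and every capture.
def match_all_with_offsets(string, pattern)
  sets = []
  string.scan(pattern) do
    md = Regexp.last_match
    sets << (0...md.size).map { |i| [md[i], md.begin(i)] }
  end
  sets
end

p match_all_with_offsets("_foo_ _bar_", /_(\w+)_/)
# => [[["_foo_", 0], ["foo", 1]], [["_bar_", 6], ["bar", 7]]]
```

A group that did not participate in the match shows up as `[nil, nil]` in this sketch.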
Re: Match a pattern multiple times, returning matches, captures and offset?
On Wed, Apr 6, 2011 at 3:37 AM, 7stud -- <bbxx789_05ss@yahoo.com> wrote:
> Markus Fischer wrote in post #991092:
>>
>> But here I'm missing the further possible matches, "_bar_" and "bar". Or
>> the #scan approach:
>>
>> ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
>> => [["foo"], ["bar"]]
>>
>> But in this case I've even less information, the match including _foo_
>> or _bar_ is not present and I can't get the offsets too.
>>
>> I re-read the documentation for Regexp#match
>
> If you look at the preamble in the docs for the MatchData class, you can
> retrieve a MatchData object using Regexp.last_match, which you can call
> inside a scan() block:

When doing nested matching it may be better to use $~ because that is local to the current stack frame, which Regexp.last_match isn't. Example with relative offsets as well:

    irb(main):022:0> str.scan /_(\w+)_/ do
    irb(main):023:1*   2.times {|i| p [$~[i], $~.offset(i), $~.offset(i).map {|o| o - $~.offset(0)[0]}]}
    irb(main):024:1> end
    ["_foo_", [0, 5], [0, 5]]
    ["foo", [1, 4], [1, 4]]
    ["_bar_", [6, 11], [0, 5]]
    ["bar", [7, 10], [1, 4]]
    => "_foo_ _bar_"

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
Re: Match a pattern multiple times, returning matches, captures and offset?
You can also get the relative offset like this:

    str = "_foo_ _bar_"

    str.scan(/_(\w+)_/) do |curr_match|
      md = Regexp.last_match
      whole_match = md[0]
      captures = md.captures

      captures.each do |capture|
        p [whole_match, capture, whole_match.index(capture)]
      end
    end
Re: Match a pattern multiple times, returning matches, captures and offset?
On Thu, Apr 7, 2011 at 1:58 AM, 7stud -- <bbxx789_05ss@yahoo.com> wrote:
> You can also get the relative offset like this:
>
> str = "_foo_ _bar_"
>
> str.scan(/_(\w+)_/) do |curr_match|
>   md = Regexp.last_match
>   whole_match = md[0]
>   captures = md.captures
>   captures.each do |capture|
>     p [whole_match, capture, whole_match.index(capture)]
>   end
> end

That's nice! I wasn't aware of this. Thanks for sharing!

I also just read this in the docs:

"Note that the last_match is local to the thread and method scope of the method that did the pattern match."

So forget my point about $~ being safer.

Kind regards

robert
Re: Match a pattern multiple times, returning matches, captures and offset?
7stud -- wrote in post #991338:
> You can also get relative beginning offsets like this:
>
> str = "_foo_ _bar_"
>
> str.scan(/_(\w+)_/) do |curr_match|
>   md = Regexp.last_match
>   whole_match = md[0]
>   captures = md.captures
>
>   captures.each do |capture|
>     p [whole_match, capture, whole_match.index(capture)]
>   end
>
> end

Using 'index' doesn't work if you have multiple captures which match the same text, or if one is a substring of the other.

Use MatchData#begin and MatchData#end instead.

    >> md = /(...)(...)/.match "foofoo"
    => #<MatchData "foofoo" 1:"foo" 2:"foo">
    >> md.captures
    => ["foo", "foo"]
    >> md.begin(1)
    => 0
    >> md.begin(2)
    => 3
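The difference is easiest to see side by side. This is a small sketch with an input string chosen for illustration: String#index always finds the first occurrence of the captured text, while MatchData#begin reports where each group actually matched. Subtracting begin(0) turns the absolute position into one relative to the whole match.

```ruby
md = /(foo)(foo)/.match("xx foofoo")
whole = md[0]  # "foofoo", which starts at absolute offset 3

# index() looks up the capture's text, so both captures report position 0.
by_index = md.captures.map { |c| whole.index(c) }

# begin(i) is tied to the group number, so the second capture is found.
by_begin = (1...md.size).map { |i| md.begin(i) - md.begin(0) }

p by_index  # => [0, 0]
p by_begin  # => [0, 3]
```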
Re: Match a pattern multiple times, returning matches, captures and offset?
Brian Candler wrote in post #991406:
> 7stud -- wrote in post #991338:
>> You can also get relative beginning offsets like this:
>>
>> str = "_foo_ _bar_"
>>
>> str.scan(/_(\w+)_/) do |curr_match|
>>   md = Regexp.last_match
>>   whole_match = md[0]
>>   captures = md.captures
>>
>>   captures.each do |capture|
>>     p [whole_match, capture, whole_match.index(capture)]
>>   end
>>
>> end
>
> Using 'index' doesn't work if you have multiple captures which match the
> same text, or if one is a substring of the other.
>
> Use MatchData#begin and MatchData#end instead.
>

begin() and end() are the two elements of offset(), which we've already discussed above. The idea was to get the relative offsets within a match, not the absolute offsets within the string.
Re: Match a pattern multiple times, returning matches, captures and offset?
7stud -- wrote in post #991546:
> However, note that
> begin() and end() are the two elements of offset(), which we've already
> discussed above. The idea was to additionally provide the relative
> offsets within a match, not just the absolute offsets within the string.

That's easy - subtract begin(0), which is the absolute offset of the start of the match.

    >> "foo bar" =~ /ba(.)/
    => 4
    >> $~.captures
    => ["r"]
    >> $~.begin(1)
    => 6
    >> $~.begin(1) - $~.begin(0)
    => 2
Re: Match a pattern multiple times, returning matches, captures and offset?
Brian Candler wrote in post #991686:
> 7stud -- wrote in post #991546:
>> However, note that
>> begin() and end() are the two elements of offset(), which we've already
>> discussed above. The idea was to additionally provide the relative
>> offsets within a match, not just the absolute offsets within the string.
>
> That's easy - subtract begin(0) which is the absolute offset of the
> start of the match.

The "subtraction method" was thoroughly vetted earlier.