Velocity Reviews


Markus Fischer 04-05-2011 05:22 PM

Match a pattern multiple times, returning matches, captures and offset?
 
Hi,

In PHP I'm used to being able to do the following. Basically, it returns
all matches, including the captures, ordered by match set, and gives me
the offsets.

$ php -r 'preg_match_all("/_(\w+)_/", "_foo_ _bar_", $matches,
PREG_SET_ORDER|PREG_OFFSET_CAPTURE); var_dump($matches);'
array(2) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(5) "_foo_"
      [1]=>
      int(0)
    }
    [1]=>
    array(2) {
      [0]=>
      string(3) "foo"
      [1]=>
      int(1)
    }
  }
  [1]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(5) "_bar_"
      [1]=>
      int(6)
    }
    [1]=>
    array(2) {
      [0]=>
      string(3) "bar"
      [1]=>
      int(7)
    }
  }
}

I've found two ways in Ruby that go in this direction, String#match and
String#scan, but each provides only part of the information. I guess I
can combine the two, but before attempting this I wanted to check that I
haven't overlooked something. Here are my Ruby attempts:

ruby-1.9.2-p180 :001 > m = "_foo_ _bar_".match(/_(\w+)_/)
=> #<MatchData "_foo_" 1:"foo">
ruby-1.9.2-p180 :002 > [ m[0], m[1] ]
=> ["_foo_", "foo"]
ruby-1.9.2-p180 :003 > [ m.begin(0), m.begin(1) ]
=> [0, 1]

But here I'm missing the further possible matches, "_bar_" and "bar". Or
the #scan approach:

ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
=> [["foo"], ["bar"]]

But in this case I have even less information: the full match (_foo_ or
_bar_) is not present, and I can't get the offsets either.

I re-read the documentation for Regexp#match and found out that you can
pass an offset into the string as second parameter, so I guess I can
iterate over the string in a loop until I find no further matches ...?
Considering this I came up with:

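$ cat test_match_all.rb
require 'pp'

class String
  # Collect a MatchData object for every non-overlapping match of pattern.
  def match_all(pattern)
    matches = []
    offset = 0
    while offset <= length && (m = match(pattern, offset))
      matches << m
      # Resume after this match; step one extra character past an empty
      # match so the loop cannot stall at the same position.
      offset = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
    end
    matches
  end
end

pp "_foo_ _bar_ _baz_".match_all(/_(\w+)_/)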


$ ruby test_match_all.rb
[#<MatchData "_foo_" 1:"foo">,
#<MatchData "_bar_" 1:"bar">,
#<MatchData "_baz_" 1:"baz">]


I've lots of data to parse so I could foresee that this approach can
become a bottleneck. Is there a more direct solution to it?

thanks,
- Markus



Brian Candler 04-05-2011 06:07 PM

Re: Match a pattern multiple times, returning matches, captures and offset?
 
String#scan with a block may do what you want:

>> "_foo_ _bar_".scan(/_(\w+)_/) { |x| puts "Offset #{$`.size}, captures #{x.inspect}" }
Offset 0, captures ["foo"]
Offset 6, captures ["bar"]
=> "_foo_ _bar_"

But it doesn't give you offsets to the individual captures, just to the
start of the whole match. (You also get the full match in $& and the
rest of the string after the match in $')
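
As a rough sketch of what the block form exposes, the following collects the full match ($&), its start offset (the length of the pre-match $`), and the captures for every hit. The helper name scan_with_offsets is made up for illustration and is not part of Ruby.

```ruby
# Hypothetical helper: gather [full_match, start_offset, captures] for every
# match, using the special variables that String#scan sets inside its block.
def scan_with_offsets(str, pattern)
  results = []
  str.scan(pattern) do |captures|
    # $& is the full match, $` is everything in str before it.
    results << [$&, $`.size, captures]
  end
  results
end

p scan_with_offsets("_foo_ _bar_", /_(\w+)_/)
# prints [["_foo_", 0, ["foo"]], ["_bar_", 6, ["bar"]]]
```

The offsets here are still those of the whole match only, as noted above.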

--
Posted via http://www.ruby-forum.com/.


7stud -- 04-06-2011 01:37 AM

Re: Match a pattern multiple times, returning matches, captures and offset?
 
Markus Fischer wrote in post #991092:
>
> But here I'm missing the further possible matches, "_bar_" and "bar". Or
> the #scan approach:
>
> ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
> => [["foo"], ["bar"]]
>
> But in this case I've even less information, the match including _foo_
> or _bar_ is not present and I can't get the offsets too.
>
> I re-read the documentation for Regexp#match


According to the preamble in the docs for the MatchData class, you can
retrieve a MatchData object using Regexp.last_match, which you can call
inside a scan() block:

str = "_foo_ _bar_"

str.scan(/_(\w+)_/) do |match|
  md = Regexp.last_match
  p [md[0], md[1], md.offset(1)]
end

--output:--
["_foo_", "foo", [1, 4]]
["_bar_", "bar", [7, 10]]
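
Building on this, here is a minimal sketch of how the PHP structure from the original question (PREG_SET_ORDER with PREG_OFFSET_CAPTURE) might be reproduced. The method name match_sets is hypothetical. Note that Ruby reports character offsets while PHP reports byte offsets, so the numbers agree only for single-byte text.

```ruby
# Hypothetical helper: one entry per match, each entry a list of
# [text, start_offset] pairs for group 0 (the whole match) followed by
# every capture group.
def match_sets(str, pattern)
  sets = []
  str.scan(pattern) do
    md = Regexp.last_match
    # md.size counts the whole match plus all capture groups.
    sets << (0...md.size).map { |i| [md[i], md.begin(i)] }
  end
  sets
end

p match_sets("_foo_ _bar_", /_(\w+)_/)
# prints [[["_foo_", 0], ["foo", 1]], [["_bar_", 6], ["bar", 7]]]
```

A group that did not take part in a match shows up as [nil, nil], since both md[i] and md.begin(i) return nil in that case.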



Robert Klemme 04-06-2011 09:42 AM

Re: Match a pattern multiple times, returning matches, captures and offset?
 
On Wed, Apr 6, 2011 at 3:37 AM, 7stud -- <bbxx789_05ss@yahoo.com> wrote:
> Markus Fischer wrote in post #991092:
>>
>> But here I'm missing the further possible matches, "_bar_" and "bar". Or
>> the #scan approach:
>>
>> ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
>> => [["foo"], ["bar"]]
>>
>> But in this case I've even less information, the match including _foo_
>> or _bar_ is not present and I can't get the offsets too.
>>
>> I re-read the documentation for Regexp#match

>
> If you look at the preamble in the docs for the MatchData class, you can
> retrieve a MatchData object using Regexp.last_match, which you can call
> inside a scan() block:


When doing nested matching it may be better to use $~ because that is
local to the current stack frame which Regexp.last_match isn't.
Example with relative offsets as well:

irb(main):022:0> str.scan /_(\w+)_/ do
irb(main):023:1* 2.times {|i| p [$~[i], $~.offset(i), $~.offset(i).map {|o| o - $~.offset(0)[0]}]}
irb(main):024:1> end
["_foo_", [0, 5], [0, 5]]
["foo", [1, 4], [1, 4]]
["_bar_", [6, 11], [0, 5]]
["bar", [7, 10], [1, 4]]
=> "_foo_ _bar_"
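
One practical point for nested matching, sketched here with a made-up inner pattern: any match performed inside the block replaces the match that $~ and Regexp.last_match report, so it seems safest to copy the MatchData into a local variable before matching again. The method name nested_matches is only for illustration.

```ruby
# Hypothetical example: run a second match inside the scan block while
# keeping hold of the MatchData from the outer scan.
def nested_matches(str)
  results = []
  str.scan(/_(\w+)_/) do
    outer = $~            # MatchData for the current scan match
    outer[1] =~ /o+|a/    # nested match: $~ now describes this one
    inner = $~
    results << [outer[0], outer.begin(1), inner[0], inner.begin(0)]
  end
  results
end

p nested_matches("_foo_ _bar_")
# prints [["_foo_", 1, "oo", 1], ["_bar_", 7, "a", 1]]
```

The inner offsets are relative to the captured substring, because that is the string the nested match ran against.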

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/


7stud -- 04-06-2011 11:58 PM

Re: Match a pattern multiple times, returning matches, captures and offset?
 
You can also get the relative offset like this:

str = "_foo_ _bar_"

str.scan(/_(\w+)_/) do |curr_match|
  md = Regexp.last_match
  whole_match = md[0]
  captures = md.captures
  captures.each do |capture|
    # index of the capture's text within the whole match
    p [whole_match, capture, whole_match.index(capture)]
  end
end



Robert Klemme 04-07-2011 07:13 AM

Re: Match a pattern multiple times, returning matches, captures and offset?
 
On Thu, Apr 7, 2011 at 1:58 AM, 7stud -- <bbxx789_05ss@yahoo.com> wrote:
> You can also get the relative offset like this:
>
> str = "_foo_ _bar_"
>
> str.scan(/_(\w+)_/) do |curr_match|
>   md = Regexp.last_match
>   whole_match = md[0]
>   captures = md.captures
>   captures.each do |capture|
>     p [whole_match, capture, whole_match.index(capture)]
>   end
> end


That's nice! I wasn't aware of this. Thanks for sharing!

I also just read this in the docs:

"Note that the last_match is local to the thread and method scope of the
method that did the pattern match."

So forget my point about $~ being safer.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/


Brian Candler 04-07-2011 08:39 AM

Re: Match a pattern multiple times, returning matches, captures and offset?
 
7stud -- wrote in post #991338:
> You can also get relative beginning offsets like this:
>
> str = "_foo_ _bar_"
>
> str.scan(/_(\w+)_/) do |curr_match|
> md = Regexp.last_match
> whole_match = md[0]
> captures = md.captures
>
> captures.each do |capture|
> p [whole_match, capture, whole_match.index(capture)]
> end
>
> end


Using 'index' doesn't work if you have multiple captures which match the
same text, or if one is a substring of the other.

Use MatchData#begin and MatchData#end (md.begin(n), md.end(n)) instead.

>> md = /(...)(...)/.match "foofoo"
=> #<MatchData "foofoo" 1:"foo" 2:"foo">
>> md.captures
=> ["foo", "foo"]
>> md.begin(1)
=> 0
>> md.begin(2)
=> 3
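
To make the difference concrete, this small sketch runs both approaches on the same match. The two method names are invented for the comparison.

```ruby
# Hypothetical helpers comparing the two ways of locating captures.

# String#index finds the first occurrence of the capture's text, so
# identical captures are all reported at the same position.
def offsets_by_index(md)
  md.captures.map { |capture| md[0].index(capture) }
end

# MatchData#begin reports where each group actually matched.
def offsets_by_begin(md)
  (1...md.size).map { |i| md.begin(i) }
end

md = /(...)(...)/.match("foofoo")
p offsets_by_index(md)  # prints [0, 0]
p offsets_by_begin(md)  # prints [0, 3]
```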



7stud -- 04-07-2011 07:04 PM

Re: Match a pattern multiple times, returning matches, captures and offset?
 
Brian Candler wrote in post #991406:
> 7stud -- wrote in post #991338:
>> You can also get relative beginning offsets like this:
>>
>> str = "_foo_ _bar_"
>>
>> str.scan(/_(\w+)_/) do |curr_match|
>> md = Regexp.last_match
>> whole_match = md[0]
>> captures = md.captures
>>
>> captures.each do |capture|
>> p [whole_match, capture, whole_match.index(capture)]
>> end
>>
>> end

>
> Using 'index' doesn't work if you have multiple captures which have the
> same pattern, or one is a substring of the other.
>
> Use MatchData#begin and MatchData#end (md.begin(n), md.end(n)) instead.
>


Note that begin() and end() are just the two elements of offset(), which
we've already discussed above.

The idea was to get the relative offsets within a match, not the
absolute offsets within the string.



Brian Candler 04-08-2011 07:19 AM

Re: Match a pattern multiple times, returning matches, captures and offset?
 
7stud -- wrote in post #991546:
> However, note that
> begin() and end() are the two elements of offset(), which we've already
> discussed above. The idea was to additionally provide the relative
> offsets within a match, not just the absolute offsets within the string.


That's easy - subtract begin(0) which is the absolute offset of the
start of the match.

>> "foo bar" =~ /ba(.)/
=> 4
>> $~.captures
=> ["r"]
>> $~.begin(1)
=> 6
>> $~.begin(1) - $~.begin(0)
=> 2
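
Generalised to every capture group, the subtraction could look like the following sketch. The method name relative_offsets is hypothetical.

```ruby
# Hypothetical helper: [start, end] of each capture group, measured from
# the start of the whole match rather than the start of the string.
def relative_offsets(md)
  base = md.begin(0)
  (1...md.size).map do |i|
    start, finish = md.offset(i)
    # offset(i) is [nil, nil] for a group that did not participate.
    start && [start - base, finish - base]
  end
end

md = "foo bar".match(/ba(.)/)
p relative_offsets(md)  # prints [[2, 3]]
```

Unlike the index approach, this stays correct when two captures contain the same text.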



7stud -- 04-08-2011 07:53 PM

Re: Match a pattern multiple times, returning matches, captures and offset?
 
Brian Candler wrote in post #991686:
> 7stud -- wrote in post #991546:
>> However, note that
>> begin() and end() are the two elements of offset(), which we've already
>> discussed above. The idea was to additionally provide the relative
>> offsets within a match, not just the absolute offsets within the string.

>
> That's easy - subtract begin(0) which is the absolute offset of the
> start of the match.


The "subtraction method" was thoroughly vetted earlier.



