Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Ruby > Negate a character sequence in a regular expression?

Reply
Thread Tools

Negate a character sequence in a regular expression?

 
 
crm_114@mac.com
Guest
Posts: n/a
 
      12-02-2007
Hello all - this was my first post, so thank you James, Jordan,
yermej, Daniel, Raul and Tanaka. I really appreciate the help. I've
learned quite a bit - especially about negative lookahead. Tricky
stuff.

As you might expect, I am working on a script to scan a big file for
multiple cat...dog pairs. Well not cats and dogs, more like time
stamps, etc.

I've tried all the solutions posted here and they work very nicely for
the string I posted. But I botched the original specification in at
least two ways:

1. I should have mentioned that there might be multiple cat...dog
pairs. Then James would probably have given me something looking like
Tanaka's solution, which uses negative lookahead but only consumes one
character at a time.

2. I should have mentioned that the characters between cat and dog
should contain neither 'cat' nor 'dog'. I'm really looking for the
'innermost' cat...dog pair.

So I tried Daniel's solution and that works pretty well. If there is
a way to break it I did not find it.

I also tweaked Tanaka's solution a little bit after a trip back to the
pickaxe book. I added the (?: to keep the group from generating
backreferences, and I added an alternate 'dog' to the negative
assertion. This is confusing but it seems to work:


irb(main):001:0> x = 'cat sheep horse cat tac dog cat cat sheep dog
dog '
=> "cat sheep horse cat tac dog cat cat sheep dog dog "
irb(main):002:0> x.scan(/cat(??!cat|dog).)*dog/)
=> ["cat tac dog", "cat sheep dog"]


Here is my attempt at breaking it down, for anyone who is interested:
/cat(??!cat|dog).)*dog/
1 2 3 4 5 6

1. The engine starts scanning through the string until it matches
'cat'.

2. At the first position after the t in cat, the engine tries to
match this expression (??!cat|dog).), zero or more times. The (?:
is nothing special. It is just a grouping that does not generate
backreferences.

3. The negative lookahead assertion. If the regular expression
following the ! matches the string, starting from the current position
of the engine, then the expression to which the assertion belongs will
fail. But since the negative assertion belongs to a grouping that can
match 0 or more times, the assertion can fail but the full regular
expression can succeed.
The hard part for me is that the negative lookahead assertion does
not consume any characters. It looks ahead, tries to match its
regular expression, and whether it returns 'MATCH' or 'NO MATCH', the
current position of the engine stays the same.

4. The regexp for the negative assertion. The stuff between each
cat...dog pair should contain neither 'cat' nor 'dog'.

5. One character is consumed, the engine moves down the string one
position. Thanks to the negative lookahead, we know that that the
consumed character is not the start of a 'cat' or 'dog' substring.

6. Eventually, the (??!cat|dog).)* fails when the current position
of the engine reaches the beginning of a 'dog' substring. But then
the last part of the regular expression (6) will match.

---

Also, another way of getting the 'innermost' cat...dog pair is to use
non-greedy kleene star(*?). This way, the engine will take the first
'dog' suffix that it finds, rather than the last possible:

irb(main):003:0> x.scan(/cat(??!cat).)*?dog/)
=> ["cat tac dog", "cat sheep dog"]

So there is more than one way to scan a cat! (Sorry!!!)

....In any event, I think I still like Daniel's solution better,
because I can look at it and feel fairly certain that it will do
exactly what it should do.


Thanks--
Josh
 
Reply With Quote
 
 
 
 
Raul Parolari
Guest
Posts: n/a
 
      12-02-2007
unknown wrote:

> I also tweaked Tanaka's solution a little bit after a trip back to the
> pickaxe book. I added the (?: to keep the group from generating
> backreferences,

Great

> and I added an alternate 'dog' to the negative assertion.

Just using a non-greedy expression removes this complication (as you
realized later), although it made the following even more interesting.

> Here is my attempt at breaking it down, for anyone who is interested:
> /cat(??!cat|dog).)*dog/
> 1 2 3 4 5 6
>
> 1. The engine starts scanning through the string until matches 'cat'.
> 2. At the first position...


This analysis was great (I had missed Tanaka's solution).

> ...In any event, I think I still like Daniel's solution better,
> because I can look at it and feel fairly certain that it will do
> exactly what it should do.


Josh
I think both solutions are indestructible. I was curious to benchmark
them; here, they showed interesting differences:

a) for one-line sentences, where each substring cat...dog contains a few
words, Tanaka's solution is approx 25% faster.

b) for long paragraphs where cat...dog are separated by (say) 2 dozen
words, Daniels' solution becomes faster by 20-25% (for longer intervals,
the advantage can go up to 50% and more).

c) For long paragraphs, where cat and dog were separated by only a few
words, the performance is almost identical. So, it is not so much the
length of the string, but the frequency of appearance of the keywords
that is important.

[Note: you need to add the /m modifier if your strings are paragraphs.
In fact, it may be better to always write /m, so that you do not need to
touch the regexp].

The reason for the performance discrepancies is clear: for short gaps
between the keywords (ie cat, dog), the trio scan+map+gsub exacts a
toll. But for long paragraphs with many words between the keywords, it
is instead the negative lookahead on 'cat' (at every character) which
suffers.

You may want to take this in account, depending on the type of text
configurations you expect.

In any case, both methods seem technically perfect.

All the best, and thanks for that great summary,

Raul

--
Posted via http://www.ruby-forum.com/.

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
regular expression negate a word (not character) Summercool Ruby 22 08-06-2010 08:58 AM
regular expression negate a word (not character) Summercool Python 13 02-01-2008 10:36 AM
regular expression negate a word (not character) Summercool Perl Misc 14 02-01-2008 10:36 AM
character encoding +missing character sequence raavi Java 2 03-02-2006 05:01 AM
regular expression to negate the start of line char '^'? Neil Morris Javascript 1 07-15-2003 10:07 PM



Advertisments