(Maybe) a simple question about regex

 
 
Sam Kong
 
      03-24-2005
Hello!

I think that I am missing a very simple concept about regex.

s = '0123456789'
s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]

Now I want to exclude "45".
How can I express it in the regex?
When it's only one character, I can use ^.
But for 2 characters, I don't think I can use it.

What I want is:

s = '0123456789'
s.scan(some_regex) #-> ["01", "23", "67", "89"]

What should some_regex be?

Can somebody help me?

Sam

 
Assaph Mehr
 
      03-24-2005

> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.
>
> What I want is:
>
> s = '0123456789'
> s.scan(some_regex) #-> ["01", "23", "67", "89"]


Negative lookahead:
s.scan /(?!4|5)\d\d/
Note the OR sign ('|') between the digits, otherwise it would produce:
["01", "23", "56", "78"]

You need to tune it to your exact domain.
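
One way to see what (?!4|5) actually tests: it inspects a single character and rejects any pair whose first digit is 4 or 5, so it behaves like a character class for the first digit. The snippet below is only an illustrative sketch; the variable names are invented for this example.

```ruby
# (?!4|5) looks at ONE character: it blocks any match that starts with 4 or 5.
lookahead  = /(?!4|5)\d\d/
char_class = /[0-36-9]\d/   # first digit: anything except 4 or 5

['0123456789', '01234567894657'].each do |s|
  # Both patterns return the same pairs for each sample string.
  puts "#{s}: #{s.scan(lookahead).inspect} / #{s.scan(char_class).inspect}"
end
```

So the pattern does not really say "not the pair 45"; it says "no pair beginning with 4 or 5", which happens to give the wanted result on the sample string.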

Cheers,
Assaph

 
Carlos
 
      03-24-2005
[Sam Kong <(E-Mail Removed)>, 2005-03-24 02.49 CET]
> Hello!
>
> I think that I am missing a very simple concept about regex.
>
> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.


You can use a "negative lookahead assertion":

s.scan(/(?!45)\d\d/)

This means, at every point the regex tries to match, "if the next two
characters aren't "45", match \d\d".
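
The "at every point" behaviour can be imitated by hand. The loop below is a rough sketch of how scan proceeds through the string (it is not how Ruby implements scan internally, and the variable names are made up for illustration): try the pattern at the current position, jump past the match on success, and move one character forward on failure.

```ruby
s = '0123456789'
anchored = /\A(?!45)\d\d/   # \A pins each attempt to the start of the slice

pos = 0
found = []
while pos < s.length
  m = anchored.match(s[pos..-1])
  if m
    found << m[0]           # matched here: keep it and jump past it
    pos += m[0].length
  else
    pos += 1                # refused here: retry one character later
  end
end

p found  # => ["01", "23", "56", "78"]
```

At position 4 the lookahead refuses "45", so the next attempt starts at position 5 and picks up "56". The pairs after that are shifted by one character.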

HTH.
--


 
Jason Sweat
 
      03-24-2005
On Thu, 24 Mar 2005 10:49:49 +0900, Sam Kong <(E-Mail Removed)> wrote:
> Hello!
>
> I think that I am missing a very simple concept about regex.
>
> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.
>
> What I want is:
>
> s = '0123456789'
> s.scan(some_regex) #-> ["01", "23", "67", "89"]
>
> What should some_regex be?


You can use a negative assertion to say you want to skip "45", but the
scan will then move forward one character, and you will end up with the
last matches being "56" and "78":

>> s.scan(/(?!45)\d\d/)

=> ["01", "23", "56", "78"]

So with a slightly uglier assertion, you can say:

>> s.scan(/(?!45|5)\d\d/)

=> ["01", "23", "67", "89"]

and get what you specified, but though it works for your toy case, I
would be worried that this might not extrapolate out to your real goal
well.
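
To make that worry concrete, here is a small sketch with a different input string (chosen purely for illustration) that contains no "45" pair at all, yet still loses data:

```ruby
re = /(?!45|5)\d\d/

toy = '0123456789'.scan(re)
p toy        # => ["01", "23", "67", "89"]  (the toy case works)

# No "45" pair here: the aligned pairs are simply 01, 56, 78.
other   = '015678'
aligned = other.scan(/\d\d/)
skewed  = other.scan(re)
p aligned    # => ["01", "56", "78"]
p skewed     # => ["01", "67"]  ("56" is refused, then the pairs shift)
```

The extra "|5" alternative refuses every pair that starts with 5, not only the one left over after skipping "45".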

HTH

Regards,
Jason
http://blog.casey-sweat.us/


 
Patrick Hurley
 
      03-24-2005
What they said, but also, if you can be more precise about your real
problem, we might be able to model a solution better. You might find
that matching the expression you want and then scanning it is more
flexible, for example.


On Thu, 24 Mar 2005 11:09:51 +0900, Assaph Mehr <(E-Mail Removed)> wrote:
>
> > s = '0123456789'
> > s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
> >
> > Now I want to exclude "45".
> > How can I express it in the regex?
> > When it's only one character, I can use ^.
> > But for 2 characters, I don't think I can use it.
> >
> > What I want is:
> >
> > s = '0123456789'
> > s.scan(some_regex) #-> ["01", "23", "67", "89"]

>
> Negative lookahead:
> s.scan /(?!4|5)\d\d/
> Note the OR sign ('|') between the digits, otherwise it would produce:
> ["01", "23", "56", "78"]
>
> You need to tune it to your exact domain.
>
> Cheers,
> Assaph
>
>



 
Robert Klemme
 
      03-24-2005

"Assaph Mehr" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) oups.com...
>
> > s = '0123456789'
> > s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
> >
> > Now I want to exclude "45".
> > How can I express it in the regex?
> > When it's only one character, I can use ^.
> > But for 2 characters, I don't think I can use it.
> >
> > What I want is:
> >
> > s = '0123456789'
> > s.scan(some_regex) #-> ["01", "23", "67", "89"]

>
> Negative lookahead:
> s.scan /(?!4|5)\d\d/
> Note the OR sign ('|') between the digits, otherwise it would produce:
> ["01", "23", "56", "78"]


But:

>> s = '01234567894657'

=> "01234567894657"
>> s.scan /(?!4|5)\d\d/

=> ["01", "23", "67", "89", "65"]
>> s.scan /\d\d/

=> ["01", "23", "45", "67", "89", "46", "57"]

IOW, you lose "46" and "57".

I prefer a non-RE solution in these cases, as it's simpler:

>> s.scan(/\d\d/).reject {|x| "45" == x}

=> ["01", "23", "67", "89", "46", "57"]

Otherwise RE becomes really complex if you want to make it right - if it's
possible at all (see other postings).
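
For what it's worth, a regex-only version that keeps the pairs aligned does seem possible with the \G anchor, which makes each scan attempt start exactly where the previous match ended. This is just one possible sketch, not necessarily what the other postings had in mind:

```ruby
s = '01234567894657'

# \G keeps every attempt aligned on the pair boundary. The pair "45" is
# consumed by the first alternative without being captured (yielding nil);
# any other pair is captured by the group.
pairs = s.scan(/\G(?:45|(\d\d))/).flatten.compact
p pairs  # => ["01", "23", "67", "89", "46", "57"]

# Same result as the non-RE approach on this input:
p pairs == s.scan(/\d\d/).reject { |x| x == "45" }  # => true
```

Note that with \G the scan stops at the first position where no pair can be matched (for example a non-digit), so it only suits input that is digits all the way through.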

Kind regards

robert

 
Sam Kong
 
      03-24-2005
Thank you and other posters for the answers.
Actually s.scan(/(?!45)\d\d/) suffices for my real problem.

What I was trying to solve was...
To extract URLs from an HTML source that includes a list of sites.
They are formatted like <a href="something.html">.
But I wanted to exclude <a href="index.html"> from the list.
So (?!index.html) will do.
Actually my toy case was not well-defined (I realized this later) and
thus it required more complex solutions like your second case -
s.scan(/(?!45|5)\d\d/) .
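
In the href case the lookahead sits right after a fixed prefix, so there is no neighbouring position for the match to slide to, which is why the simple form is enough here. A minimal sketch, assuming markup shaped like the description above (the sample HTML and variable names are invented for illustration):

```ruby
html = <<~HTML
  <a href="index.html">Home</a>
  <a href="ruby.html">Ruby</a>
  <a href="indexing.html">Indexing</a>
  <a href="perl.html">Perl</a>
HTML

# The lookahead is tried only right after the literal <a href=" prefix.
# Escaping the dot and including the closing quote keeps "indexing.html".
links = html.scan(/<a href="(?!index\.html")([^"]+)"/).flatten
p links  # => ["ruby.html", "indexing.html", "perl.html"]
```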

I think a non-RE solution would be better, as Mr. Robert Klemme said.
But I wanted to learn some RE.

Thanks.
Sam

 
Simon Strandgaard
 
      03-24-2005
On Thu, 24 Mar 2005 18:09:50 +0900, Sam Kong <(E-Mail Removed)> wrote:
> To extract url's from an html source which includes list of sites.
> They are formatted like <a href="something.html">.
> But I wanted to exclude <a href="index.html"> from the list.
> So (?!index.html) will do.



Does this help?

ary=%w(a.html index.html other.txt evil.html.exe stuff.html)
ary.select{|s| s =~ /\A(?!index).*\.html\z/ } #=> ["a.html", "stuff.html"]
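
One caveat with (?!index): it also rejects any other name that merely begins with "index". A slightly stricter variant, sketched here with an extra made-up file name (indexes.html) to show the difference, excludes only the exact name:

```ruby
files = %w(a.html index.html indexes.html other.txt evil.html.exe stuff.html)

# Rejects everything that starts with "index".
loose  = files.select { |f| f =~ /\A(?!index).*\.html\z/ }
# Rejects only the exact name "index.html".
strict = files.select { |f| f =~ /\A(?!index\.html\z).*\.html\z/ }

p loose   # => ["a.html", "stuff.html"]
p strict  # => ["a.html", "indexes.html", "stuff.html"]
```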


--
Simon Strandgaard


 
Csaba Henk
 
      03-25-2005
On 2005-03-24, Sam Kong <(E-Mail Removed)> wrote:
> What I was trying to solve was...
> To extract url's from an html source which includes list of sites.
> They are formatted like <a href="something.html">.
> But I wanted to exclude <a href="index.html"> from the list.
> So (?!index.html) will do.
> Actually my toy case was not well-defined (I realized this later) and
> thus it required more complex solutions like your second case -
> s.scan(/(?!45|5)\d\d/) .


Why don't you use a dedicated HTML parser? E.g. there's htmltokenizer,
available at RubyForge, quite lightweight and very easy to use, but
there are others, of course.

> I think non-RE solution would be better like Mr. Robert Klemme said.
> But I wanted to learn some RE.


This thread was useful, I admit.

Csaba
 