Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Ruby > regular expression too big

Reply
Thread Tools

regular expression too big

 
 
Peter Schrammel
Guest
Posts: n/a
 
      11-12-2006
Hi,

got problem with big regexes:
I have a regex of about 70000+ words concated with '|' that I'd like to
match as a regex. /bla|blub|foo|bar|.....(70000)/

But unfortunately ruby gives me a 'regular expression too big' if I'm
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

Peter
 
Reply With Quote
 
 
 
 
Peter Schrammel
Guest
Posts: n/a
 
      11-12-2006
Jeffrey Schwab wrote:
> Peter Schrammel wrote:
>
>> got problem with big regexes:
>> I have a regex of about 70000+ words concated with '|' that I'd like to
>> match as a regex. /bla|blub|foo|bar|.....(70000)/
>>
>> But unfortunately ruby gives me a 'regular expression too big' if I'm
>> trying to build such a thing.
>> I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
>> regexes. Is there a way around this (without going down to 2000 words) ?
>>
>> Thanks for any hint

>
> You could optimize the regex a little for size, e.g. by factoring out
> common prefixes:
>
> (b(l(a|ub)|ar)|foo)...


Thought of that.


> Of course, that will only help if the | alternatives have a reasonable
> amount of redundancy. Alternatively, you could just break the whole
> thing into multiple expressions. Instead of
>
> if /first_part|second_part/ =~ text
>
> You could try:
>
> if /first_part/ =~ text or /second_part/ =~ text


Yes, that was my next thought but where to split? Just count the bytes
and splitt near 1 <<16?


Why is there a limitation at all? I implemented the same thing in perl
and it no complains ...
Is the regexp engine of perl that much better?

Thanks for the reply
 
Reply With Quote
 
 
 
 
Louis J Scoras
Guest
Posts: n/a
 
      11-12-2006
This doesn't really help with the actual question of getting past
regex size limits, but are you sure that regexen are correct solution
to this problem? Unless I'm mistaken, the above match is going to be
horribly, painfully slow; the issue you're running into is probably an
indication that you might want to look elsewhere.

I don't want to make any silly claims without doing any benchmarking
on your data, but I would imagine that even doing an efficient search
on a sorted array of your words would give you better performance than
the regex search. You can move up from there into hashes or other
data structures for this sort of thing.


--
Lou

 
Reply With Quote
 
Ken Bloom
Guest
Posts: n/a
 
      11-12-2006
On Sun, 12 Nov 2006 15:25:49 +0100, Peter Schrammel wrote:

> Hi,
>
> got problem with big regexes:
> I have a regex of about 70000+ words concated with '|' that I'd like to
> match as a regex. /bla|blub|foo|bar|.....(70000)/
>
> But unfortunately ruby gives me a 'regular expression too big' if I'm
> trying to build such a thing.
> I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
> regexes. Is there a way around this (without going down to 2000 words) ?
>
> Thanks for any hint
>
> Peter


Maybe a trie would be useful?
http://rubyforge.org/projects/trie/
(or there's another trie at http://kzk9.net/software/miscprograms/ruby/)

--Ken

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/
 
Reply With Quote
 
Ken Bloom
Guest
Posts: n/a
 
      11-12-2006
On Sun, 12 Nov 2006 19:15:12 +0000, Ken Bloom wrote:

> On Sun, 12 Nov 2006 15:25:49 +0100, Peter Schrammel wrote:
>
>> Hi,
>>
>> got problem with big regexes:
>> I have a regex of about 70000+ words concated with '|' that I'd like to
>> match as a regex. /bla|blub|foo|bar|.....(70000)/
>>
>> But unfortunately ruby gives me a 'regular expression too big' if I'm
>> trying to build such a thing.
>> I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
>> regexes. Is there a way around this (without going down to 2000 words) ?
>>
>> Thanks for any hint
>>
>> Peter

>
> Maybe a trie would be useful?
> http://rubyforge.org/projects/trie/
> (or there's another trie at http://kzk9.net/software/miscprograms/ruby/)


On one more thing: to implement substring search using a trie, when
adding words to the trie, you should generate appropriate back links so
that you can implement =~ use the Knuth-Morris-Pratt algorithm for matching.

--Ken

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/
 
Reply With Quote
 
Simon Strandgaard
Guest
Posts: n/a
 
      11-12-2006
On 11/12/06, Peter Schrammel <(E-Mail Removed)> wrote:
> got problem with big regexes:
> I have a regex of about 70000+ words concated with '|' that I'd like to
> match as a regex. /bla|blub|foo|bar|.....(70000)/


if you have many words to check for then consider using a
http://en.wikipedia.org/wiki/Bloom_filter


--
Simon Strandgaard

 
Reply With Quote
 
Peter Schrammel
Guest
Posts: n/a
 
      11-12-2006
> When you post an question like this, it is always a good idea to
reveal the
> purpose as well as the method.
>
>


First of all: Thanks for the replies, I think I have enough input to
chew on.

And to reveal the purpose:
I'd like to match a LOT of words/strings (words with spaces) and
sometimes regexes against a lot of small and medium size( < 10000 byte )
strings.

The comparison with Perl was just a comparison of the regexp engines and
not the language itself. It was just that until now I had never to think
much about the underlying hardware .... because ruby did that for me. So
I was surprised by the "too big exception".

Peter
 
Reply With Quote
 
Ken Bloom
Guest
Posts: n/a
 
      11-13-2006
On Mon, 13 Nov 2006 05:46:30 +0900, Simon Strandgaard wrote:

> On 11/12/06, Peter Schrammel <(E-Mail Removed)> wrote:
>> got problem with big regexes:
>> I have a regex of about 70000+ words concated with '|' that I'd like to
>> match as a regex. /bla|blub|foo|bar|.....(70000)/

>
> if you have many words to check for then consider using a
> http://en.wikipedia.org/wiki/Bloom_filter


The only reason I didn't suggest that is becuase it can have false
positives.

--Ken Bloom
^^^^^

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/
 
Reply With Quote
 
Ken Bloom
Guest
Posts: n/a
 
      11-13-2006
On Mon, 13 Nov 2006 00:07:52 -0800, Paul Lutus wrote:

> Peter Schrammel wrote:
>
>>> When you post an question like this, it is always a good idea to

>> reveal the
>>> purpose as well as the method.
>>>
>>>

>>
>> First of all: Thanks for the replies, I think I have enough input to
>> chew on.
>>
>> And to reveal the purpose:
>> I'd like to match a LOT of words/strings (words with spaces) and
>> sometimes regexes against a lot of small and medium size( < 10000 byte )
>> strings.

>
> If you are going to compare strings to strings as well as strings to
> regexes, maybe a two-tiered scheme would be better. First tier, simple
> textual comparison using a hash of strings. Second tier, regexes.
>
> This removes the ambiguity that a particular data string might pass a string
> comparison but fail any of the provided regexes, or the opposite. It will
> also be much faster than a gigantic list of strings and regexes, all
> presented to the regex engine as though they were all regular expressions,
> even though some are simple string comparisons.
>
> I think this (e.g speed improvement) would have been true in Perl also.
>


His regex isn't anchored to the beginning and end of the string. This makes
hashes useless for the kind of comparison he seems to want to perform.

--Ken

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/
 
Reply With Quote
 
Gabriele Marrone
Guest
Posts: n/a
 
      11-13-2006

On 12/nov/06, at 15:30, Peter Schrammel wrote:

> Hi,
>
> got problem with big regexes:
> I have a regex of about 70000+ words concated with '|' that I'd
> like to
> match as a regex. /bla|blub|foo|bar|.....(70000)/
>
> But unfortunately ruby gives me a 'regular expression too big' if I'm
> trying to build such a thing.
> I had a look at the regex.c code and saw the limit of 1 << 16 bytes
> for
> regexes. Is there a way around this (without going down to 2000
> words) ?


I friend of mine told me to suggest you to system() a perl script
which does the job

However, since he was just kidding, I think I'll suggest you
something like this:


class String
def include_at_least_one_of?(words)
words.find { |w| include? w }
end
end

my_words = %w(bla blub foo bar)

"zump blub asd".include_at_least_one_of? my_words # => "blub"
"nah none included".include_at_least_one_of? my_words # => nil


You probably don't want to hack core classes in order to do this, but
you get the idea.
>
> Thanks for any hint
>
> Peter
>


--
Gabriele Marrone



 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
GIDS 2009 .Net:: Save Big, Win Big, Learn Big: Act Before Dec 29 2008 Shaguf ASP .Net 0 12-26-2008 09:29 AM
GIDS 2009 .Net:: Save Big, Win Big, Learn Big: Act Before Dec 29 2008 Shaguf ASP .Net Web Controls 0 12-26-2008 06:11 AM
GIDS 2009 Java:: Save Big, Win Big, Learn Big: Act Before Dec 29 2008 Shaguf Python 0 12-24-2008 07:35 AM
boost regex --- sregex_iterator -- Regular expression too big wolverine C++ 2 08-29-2006 11:22 PM
Dynamically changing the regular expression of Regular Expression validator VSK ASP .Net 2 08-24-2003 02:47 PM



Advertisments