Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Ruby > regular expression too big

Reply
Thread Tools

regular expression too big

 
 
Ken Bloom
Guest
Posts: n/a
 
      11-13-2006
Simon Strandgaard <> wrote:
> On 11/12/06, Peter Schrammel <> wrote:
>> got problem with big regexes:
>> I have a regex of about 70000+ words concated with '|' that I'd like to
>> match as a regex. /bla|blub|foo|bar|.....(70000)/

>
> if you have many words to check for then consider using a
> http://en.wikipedia.org/wiki/Bloom_filter


There's even a library for doing Bloom filters:
http://raa.ruby-lang.org/project/rbloomfilter/

(Bloom filters were invented by Burton Bloom, who is not even remotely
related to me)

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/
 
Reply With Quote
 
 
 
 
James Edward Gray II
Guest
Posts: n/a
 
      11-13-2006
On Nov 13, 2006, at 11:45 AM, Gabriele Marrone wrote:

> class String
> def include_at_least_one_of?(words)
> words.find { |w| include? w }
> end
> end


Just a tiny style related suggestion. I would use any?() instead of
find() there.

James Edward Gray II

 
Reply With Quote
 
 
 
 
Peter Schrammel
Guest
Posts: n/a
 
      11-16-2006
Paul Lutus wrote:
> Yes, unless he is matching whole words, he is stuck with regexes. There is
> very likely to be a refactoring for this problem, and it would have to
> start with a clear statement of the problem to be solved.
>

Sorry, my fault.

The problem is to match a whole bunch (>70000) of words (later regexps)
on a string.
The actual implementation is to concat the word with | and create a big
regexp. word1|word2|word3|word4 ...
This went well until I tested with some 100 words. Now I have the 'big
regexp problem' problem. The solution has to work with regexps as words
as well like: word1(the|a|an)word11|regexp2|...
So the currently working solutions are:
-use ruby 1.9 (the performance is far below perl, but I have to
benchmark this.)
-loop through the words/regexp and match each of them on the string
(don't ask for performance here).
-perhaps I'll realy pipe the problem into a
perl-implemented-matching-server (not that bad idea)

peter
 
Reply With Quote
 
Simon Strandgaard
Guest
Posts: n/a
 
      11-17-2006
On 11/17/06, Peter Schrammel <> wrote:
> Paul Lutus wrote:
> > Yes, unless he is matching whole words, he is stuck with regexes. There is
> > very likely to be a refactoring for this problem, and it would have to
> > start with a clear statement of the problem to be solved.
> >

> Sorry, my fault.
>
> The problem is to match a whole bunch (>70000) of words (later regexps)
> on a string.
> The actual implementation is to concat the word with | and create a big
> regexp. word1|word2|word3|word4 ...


Is it exact word matching?

if so then you can split the input-string on blankspace,
so you get an array of words. Then you can sort the array of words

irb(main):001:0> a = %w(a b c e f g)
=> ["a", "b", "c", "e", "f", "g"]
irb(main):002:0> b = %w(b d f)
=> ["b", "d", "f"]
irb(main):003:0> a & b
=> ["b", "f"]


--
Simon Strandgaard

 
Reply With Quote
 
Peter Schrammel
Guest
Posts: n/a
 
      11-17-2006
Paul Lutus wrote:
> Peter Schrammel wrote:
>
>> Paul Lutus wrote:
>>> Yes, unless he is matching whole words, he is stuck with regexes. There
>>> is very likely to be a refactoring for this problem, and it would have to
>>> start with a clear statement of the problem to be solved.
>>>

>> Sorry, my fault.
>>
>> The problem is to match a whole bunch (>70000) of words (later regexps)
>> on a string.

>
> You are not being very clear. Do you mean to match entire words, letter for
> letter, beginning to end, at least some of the time? If so, for those
> cases, you can use a hash table that is preloaded with the words as keys.
> That will be fast.


no its a substring match kind of /hello/.match("dear customer id like to
say hello to ...."). I have to find 70000 words in a large textfile:
if lotofwords.match(bigtext) .....
I need a trigger to classify the bitext depending on wether any of the
word (later regexps) are in it or not. Tries work well for substring
matching in this case but fail with regexps.

> Also precompile the regexes before use. You probably already know this, but
> I thought I would mention it anyway.


yes, done that.

> Another option is C++, which has a readily available regexp library. The way
> I would go about this is to design the entire thing in Ruby, then, if the
> speed was not acceptable, recreate it in C++. This gives you the advantage
> of speedy development in Ruby, followed by speedy execution.
>
> If this is a full-on language analysis problem, you really should be using
> Prolog or Lisp anyway. If the problem really is as complex as you are
> hinting at, you may not be using the right language, or even class of
> language.
>

yes, but I wanted to try it with "my favorit" language first before
doing something wild. It's a rails application so I'd have to call
lisp/perl/.... programms, interfacing them (calling on each test would
be a mess because reading somethousand lines every time is a performance
killer.)

Anyway, thanks for the replies I'll give a perl interface a try (due to
some other reasons like robust html parsing ...).

 
Reply With Quote
 
brabuhr@gmail.com
Guest
Posts: n/a
 
      11-17-2006
Peter Schrammel wrote:
> >> got problem with big regexes:
> >> I have a regex of about 70000+ words concated with '|' that I'd like to
> >> match as a regex. /bla|blub|foo|bar|.....(70000)/
> >>
> >> But unfortunately ruby gives me a 'regular expression too big' if I'm
> >> trying to build such a thing.
> >> I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
> >> regexes. Is there a way around this (without going down to 2000 words) ?
> >>
> >> Thanks for any hint


Jeffrey Schwab wrote:
> > You could optimize the regex a little for size, e.g. by factoring out
> > common prefixes:
> >
> > (b(l(a|ub)|ar)|foo)...


Peter Schrammel wrote:
> Thought of that.


Have you seen:
"Converts a list of words to a regular expression with minimum
backtracking by joining words with common prefixes. It is a port
of the Perl module MakeRegex.pm by Hakan Kjellerstrand with
some improvements."
http://raa.ruby-lang.org/project/makeregex/

YMMV; I have never used it on anything like the scale you are.

 
Reply With Quote
 
brabuhr@gmail.com
Guest
Posts: n/a
 
      11-17-2006
> Jeffrey Schwab wrote:
> > > You could optimize the regex a little for size, e.g. by factoring out
> > > common prefixes:
> > >
> > > (b(l(a|ub)|ar)|foo)...

>
> Peter Schrammel wrote:
> > Thought of that.

>
> Have you seen:
> "Converts a list of words to a regular expression with minimum
> backtracking by joining words with common prefixes. It is a port
> of the Perl module MakeRegex.pm by Hakan Kjellerstrand with
> some improvements."
> http://raa.ruby-lang.org/project/makeregex/
>
> YMMV; I have never used it on anything like the scale you are.


In a little bit of testing here, it goes to long after about 8,000
words.

> require 'makeregex'
>
> 20.times do |n|
> words = IO.readlines("/usr/share/dict/words")[0..(2 ** n)]
>
> start = Time.now
>
> r = Regexp.make(words)
>
> finish = Time.now
>
> puts "Took #{finish - start} seconds to convert #{words.size} words into a regex #{r.size} bytes long."
>
> "FOO".match(r)
> end


Took 0.000372 seconds to convert 2 words into a regex 20 bytes long.
Took 0.000285 seconds to convert 3 words into a regex 25 bytes long.
Took 0.000359 seconds to convert 5 words into a regex 51 bytes long.
Took 0.000493 seconds to convert 9 words into a regex 86 bytes long.
Took 0.000973 seconds to convert 17 words into a regex 157 bytes long.
Took 0.001773 seconds to convert 33 words into a regex 285 bytes long.
Took 0.005386 seconds to convert 65 words into a regex 491 bytes long.
Took 0.00823 seconds to convert 129 words into a regex 933 bytes long.
Took 0.019234 seconds to convert 257 words into a regex 1876 bytes long.
Took 0.042557 seconds to convert 513 words into a regex 3856 bytes long.
Took 0.09146 seconds to convert 1025 words into a regex 7807 bytes long.
Took 0.196851 seconds to convert 2049 words into a regex 15669 bytes long.
Took 0.399155 seconds to convert 4097 words into a regex 32325 bytes long.
Took 0.968776 seconds to convert 8193 words into a regex 64671 bytes long.
foo:14:in `match': regular expression too big:
/(?:1(?:0(?:80\n|\-point\n|th\n)|...

 
Reply With Quote
 
brabuhr@gmail.com
Guest
Posts: n/a
 
      11-17-2006
> In a little bit of testing here, it goes to long after about 8,000
> words.
>
> > require 'makeregex'
> >
> > 20.times do |n|
> > words = IO.readlines("/usr/share/dict/words")[0..(2 ** n)]


Of course, that isn't the best sample of words to test with.

Here is a revised version (fails with a smaller word count):

> require 'makeregex'
>
> class Array
> def sample n
> a = []; s = self.size
> n.times do
> a << self[rand(s)].chomp
> end
> a
> end
> end
>
> words = IO.readlines("/usr/share/dict/words")
>
> 20.times do |n|
> w = words.sample(2 ** n)
>
> start = Time.now
>
> r = Regexp.make(w)
>
> finish = Time.now
>
> puts "Took #{finish - start} seconds to convert #{w.size} words into a regex #{r.size} bytes long."
>
> "FOO".match(r)
> end


Took 9.9e-05 seconds to convert 1 words into a regex 9 bytes long.
Took 0.000163 seconds to convert 2 words into a regex 18 bytes long.
Took 0.000169 seconds to convert 4 words into a regex 42 bytes long.
Took 0.000299 seconds to convert 8 words into a regex 80 bytes long.
Took 0.000618 seconds to convert 16 words into a regex 176 bytes long.
Took 0.001366 seconds to convert 32 words into a regex 324 bytes long.
Took 0.003021 seconds to convert 64 words into a regex 689 bytes long.
Took 0.007129 seconds to convert 128 words into a regex 1391 bytes long.
Took 0.019143 seconds to convert 256 words into a regex 2740 bytes long.
Took 0.149488 seconds to convert 512 words into a regex 5295 bytes long.
Took 0.304324 seconds to convert 1024 words into a regex 10236 bytes long.
Took 0.307832 seconds to convert 2048 words into a regex 19755 bytes long.
Took 0.404071 seconds to convert 4096 words into a regex 37308 bytes long.
foo:26:in `match': regular expression too big:
/(?:A(?:ME|c(?:arapis|kley|mispon)|...

 
Reply With Quote
 
Giles Bowkett
Guest
Posts: n/a
 
      11-17-2006
> > If this is a full-on language analysis problem, you really should be using
> > Prolog or Lisp anyway. If the problem really is as complex as you are
> > hinting at, you may not be using the right language, or even class of
> > language.
> >

> yes, but I wanted to try it with "my favorit" language first before
> doing something wild. It's a rails application so I'd have to call
> lisp/perl/.... programms, interfacing them (calling on each test would
> be a mess because reading somethousand lines every time is a performance
> killer.)
>
> Anyway, thanks for the replies I'll give a perl interface a try (due to
> some other reasons like robust html parsing ...).


there's got to be a better way to do this. I recently attempted to
implement a particle simulation algorithm for arranging nodes in a
graph in Rails, before I came to my senses.

solving this problem with regexes sounds like a nightmare on the
maintenance front. I think it's possible that you're using the wrong
language but more likely that you're just using the wrong technique.

--
Giles Bowkett
http://www.gilesgoatboy.org

 
Reply With Quote
 
brabuhr@gmail.com
Guest
Posts: n/a
 
      11-17-2006
On 11/12/06, Ross Bamford <> wrote:
> On Sun, 12 Nov 2006 15:01:56 -0000, Peter Schrammel
> <> wrote:
> > Why is there a limitation at all? I implemented the same thing in perl
> > and it no complains ...
> > Is the regexp engine of perl that much better?
> >

>
> Irrespective of whether regex the best solution for your needs, it seems
> Oniguruma will improve the situation somewhat with respect to large
> regular expressions.


I built a local version of 1.8.5 with the oniguruma engine:
http://raa.ruby-lang.org/project/oniguruma/

And re-ran (a slight variation of) my test program:

[~]$ ruby foo
Using the <undefined> regex engine.
Converted a list of 1 words into a regex 8 bytes long.
Converted a list of 2 words into a regex 36 bytes long.
Converted a list of 4 words into a regex 48 bytes long.
Converted a list of 8 words into a regex 73 bytes long.
Converted a list of 16 words into a regex 173 bytes long.
Converted a list of 32 words into a regex 352 bytes long.
Converted a list of 64 words into a regex 718 bytes long.
Converted a list of 128 words into a regex 1415 bytes long.
Converted a list of 256 words into a regex 2656 bytes long.
Converted a list of 512 words into a regex 5210 bytes long.
Converted a list of 1024 words into a regex 10105 bytes long.
Converted a list of 2048 words into a regex 19432 bytes long.
Converted a list of 4096 words into a regex 37509 bytes long.
@_@

[~]$ /usr/local/bin/ruby foo
Using the Oniguruma regex engine.
Converted a list of 1 words into a regex 11 bytes long.
Converted a list of 2 words into a regex 16 bytes long.
Converted a list of 4 words into a regex 38 bytes long.
Converted a list of 8 words into a regex 97 bytes long.
Converted a list of 16 words into a regex 185 bytes long.
Converted a list of 32 words into a regex 359 bytes long.
Converted a list of 64 words into a regex 686 bytes long.
Converted a list of 128 words into a regex 1387 bytes long.
Converted a list of 256 words into a regex 2715 bytes long.
Converted a list of 512 words into a regex 5264 bytes long.
Converted a list of 1024 words into a regex 10074 bytes long.
Converted a list of 2048 words into a regex 19439 bytes long.
Converted a list of 4096 words into a regex 37452 bytes long.
Converted a list of 8192 words into a regex 71931 bytes long.
Converted a list of 16384 words into a regex 135572 bytes long.
Converted a list of 32768 words into a regex 253027 bytes long.
Converted a list of 65536 words into a regex 461607 bytes long.
Converted a list of 131072 words into a regex 808171 bytes long.
Converted a list of 262144 words into a regex 1326345 bytes long.
Converted a list of 479625 words into a regex 1873539 bytes long.

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
GIDS 2009 .Net:: Save Big, Win Big, Learn Big: Act Before Dec 29 2008 Shaguf ASP .Net 0 12-26-2008 09:29 AM
GIDS 2009 .Net:: Save Big, Win Big, Learn Big: Act Before Dec 29 2008 Shaguf ASP .Net Web Controls 0 12-26-2008 06:11 AM
GIDS 2009 Java:: Save Big, Win Big, Learn Big: Act Before Dec 29 2008 Shaguf Python 0 12-24-2008 07:35 AM
boost regex --- sregex_iterator -- Regular expression too big wolverine C++ 2 08-29-2006 11:22 PM
Dynamically changing the regular expression of Regular Expression validator VSK ASP .Net 2 08-24-2003 02:47 PM



Advertisments