Optimization anyone

 
 
Horacio Sanson
      11-22-2005

I have a little script that takes a list of keyword sets. Each set has exactly
two keywords, and for each set the script creates a regular expression
like this:

Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")

Then I match it against a string that contains a long text fetched from a web page.

More complete pseudo-code:

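#########################################
# Returns the id of the first keyword pair that matches the page, or -1.
# (Wrapped in a method so that "return" is valid; the method name is only
# illustrative. get_web_page and load_keyword_array_from_database are
# defined elsewhere.)
def find_keyword_id(url)
  long_text = get_web_page(url)

  keyword_hash = load_keyword_array_from_database

  keyword_hash.each_pair { |id, value|
    key1 = value[0]
    key2 = value[1]

    # "\." inside a double-quoted string is just ".", so the pattern
    # is key1.*key2|key2.*key1
    r = Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")
    return id if long_text =~ r
  }

  return -1
end
###########################################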


This code works perfectly. The problem is that keyword_hash has more
than 300 elements, and running this code takes between 50 and 120 seconds
per page. Since I am processing more than 1000 pages, it takes forever.


I solved this problem by replacing the regular expression match with:

r1 = Regexp.new("#{key1}\.*#{key2}")
r2 = Regexp.new("#{key2}\.*#{key1}")

return id if long_text =~ r1 or long_text =~ r2


I simply moved the "or" outside the regular expression, and the time
dropped from 50~120 secs to 0.40 secs per page.


Using the Benchmark class and running some tests, I got:

              user     system      total        real
normal:  27.688000   0.015000  27.703000 ( 27.765000)
fast:     0.469000   0.000000   0.484000 (  0.954000)


The speed difference is huge.

Is this expected when using regular expressions?
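
A comparison like the one above can be set up with a minimal sketch such as the following. It is not the original test script: the sample text, the keywords and the method names match_combined / match_split are assumptions made for illustration, and the timings depend heavily on the Ruby version and its regexp engine. The patterns are written with a plain ".*", which is what "\.*" inside a double-quoted string evaluates to.

```ruby
require 'benchmark'

# Hypothetical test data: a long multi-line text in which the keywords never occur.
LONG_TEXT = ("lorem ipsum dolor sit amet\n" * 2_000).freeze
KEY1 = 'foo'
KEY2 = 'bar'

# One regexp with the alternation inside the pattern.
def match_combined(text, key1, key2)
  r = Regexp.new("#{key1}.*#{key2}|#{key2}.*#{key1}")
  !(text =~ r).nil?
end

# Two simpler regexps, with the "or" done in Ruby.
def match_split(text, key1, key2)
  r1 = Regexp.new("#{key1}.*#{key2}")
  r2 = Regexp.new("#{key2}.*#{key1}")
  !(text =~ r1).nil? || !(text =~ r2).nil?
end

Benchmark.bm(8) do |bm|
  bm.report('normal:') { 100.times { match_combined(LONG_TEXT, KEY1, KEY2) } }
  bm.report('fast:')   { 100.times { match_split(LONG_TEXT, KEY1, KEY2) } }
end
```

Both methods should return the same answer for any input; only the time taken differs.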


regards,
Horacio


 
ts
      11-22-2005
>>>>> "H" == Horacio Sanson writes:

H> Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")

vs

H> r1 = Regexp.new("#{key1}\.*#{key2}")
H> r2 = Regexp.new("#{key2}\.*#{key1}")

H> Is this expected when using regular expressions??

Yes, Ruby has some optimizations. For example, with the regexp /abc.*def/:

svg% ruby -rjj -e '/abc.*def/.dump'
Regexp /abc.*def/
0 exactn "abc" (3)
1 anychar_repeat
2 exactn "def" (3)
3 end
must : abc
optimize : exactn
svg%

It calls the regexp engine (which is slow) only when it has found the
substring "abc" in the string.
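
The effect of that optimization can be imitated in plain Ruby. The sketch below only illustrates the idea and is not how the engine is implemented internally: it does a cheap substring search for the required literal first and runs the regexp only when the literal is present. The method name match_with_prefilter and its must: parameter are made up for this example.

```ruby
# Run the regexp only if the literal it cannot match without is present.
# `must` plays the role of the "must : abc" line in the dump above.
def match_with_prefilter(text, regexp, must:)
  return nil unless text.include?(must)  # cheap substring scan, no regexp engine
  regexp.match(text)                     # full regexp match only when needed
end

p match_with_prefilter("xx abc yy def zz", /abc.*def/, must: "abc")
p match_with_prefilter("no keywords here", /abc.*def/, must: "abc")
```

The first call returns a MatchData for "abc yy def"; the second returns nil without the regexp ever being run.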

Now, if you use /abc.*def|def.*abc/, you break this optimization:


svg% ruby -rjj -e '/abc.*def|def.*abc/.dump'
Regexp /abc.*def|def.*abc/
0 on_failure_jump ==> 5
1 exactn "abc" (3)
2 anychar_repeat
3 exactn "def" (3)
4 jump ==> 8
5 exactn "def" (3)
6 anychar_repeat
7 exactn "abc" (3)
8 end
svg%


With no "must" substring left to search for first, it must call the stupid
regexp engine for each line.


Guy Decoux


 
Robert Klemme
      11-22-2005

One obvious optimization is to create all regexps in
load_keyword_array_from_database() and not during iteration of the hash.
That way you only have to do it once and can reuse those regexps for
every page you check.
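
A minimal sketch of that suggestion, assuming the loader returns a hash of id => [key1, key2] as in the original post. The helper names compile_keyword_regexps and first_matching_id, and the sample data, are hypothetical.

```ruby
# Build the regexps once, up front, instead of once per page and pair.
# Input:  { id => [key1, key2] }   Output: { id => [r1, r2] }
def compile_keyword_regexps(keyword_hash)
  keyword_hash.each_with_object({}) do |(id, (key1, key2)), compiled|
    compiled[id] = [Regexp.new("#{key1}.*#{key2}"),
                    Regexp.new("#{key2}.*#{key1}")]
  end
end

# Return the id of the first pair found in the text, or -1 if none matches.
def first_matching_id(long_text, compiled)
  compiled.each_pair do |id, (r1, r2)|
    return id if long_text =~ r1 || long_text =~ r2
  end
  -1
end

# Hypothetical sample data standing in for the database and the web pages.
keyword_hash = { 1 => %w[ruby regexp], 2 => %w[apple pie] }
compiled = compile_keyword_regexps(keyword_hash)  # done once

pages = ["a regexp question about ruby", "nothing relevant here"]
pages.each { |page| puts first_matching_id(page, compiled) }
```

The compiled hash is built a single time and then shared by every page, so the per-page cost is only the matching itself.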

Another possible optimization is to take your approach of splitting the
regexps a bit further and create two regexps - one for each keyword - and
return the id if both match. This only works correctly if (i) the keywords
don't overlap or (ii) you can use \b to ensure matching on word
boundaries.
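
The second suggestion could look roughly like this. It is a sketch under two assumptions that are not in the original post: the keywords are plain words (hence Regexp.escape and \b), and it is acceptable for the two keywords to appear anywhere in the page, since two independent matches no longer require them to be on the same line the way key1.*key2 does. The method names are illustrative.

```ruby
# One word-bounded regexp per keyword; Regexp.escape keeps characters such
# as "." or "+" in a keyword from being treated as regexp syntax.
def keyword_regexp(keyword)
  Regexp.new("\\b#{Regexp.escape(keyword)}\\b")
end

# True if both keywords occur somewhere in the text, in any order.
def both_keywords?(text, key1, key2)
  r1 = keyword_regexp(key1)
  r2 = keyword_regexp(key2)
  !!(text =~ r1 && text =~ r2)
end

puts both_keywords?("the cat sat on the mat", "mat", "cat")
puts both_keywords?("concatenate the mat", "mat", "cat")
```

The second call is false because "cat" only occurs inside "concatenate", which \b rules out. Note that \b assumes keywords begin and end with word characters; a keyword ending in punctuation would need a different boundary test.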

Kind regards

robert


 