Velocity Reviews - Computer Hardware Reviews

Searching for a very fast string parser

 
 
|MKSM|
 
      03-08-2006
Hello,

I want to parse a log file containing many lines in the same format.
My log files are about 50 MB each (350k lines), so I need something
quite fast. The current (and fastest) solution I came up with uses
StringScanner.

I save what I get into variables and then pass them all into a Struct
I created. Each new struct is then appended to an Array that holds all
the structs.


Here's my test code:

require 'strscan'

a = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: 80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"

s = StringScanner.new(a)
time = s.scan(/\d+\.\d+/)
s.pos += 23
rule_no = s.scan(/\d+/)
s.skip(/[\d\D]*?\s/)
stat = s.scan(/\w+/)
s.skip(/.*on\s/)
interface = s.scan(/\w+\:/)
s.skip(/\D+?\s/)
out_ip = s.scan(/(\d+\.){3}\d{0,3}/)
s.pos += 1
out_port = s.scan(/\d+/)
s.skip(/\D+/)
in_ip = s.scan(/(\d+\.){3}\d{0,3}/)
s.pos += 1
in_port = s.scan(/\d+/)
s.pos += 2
proto = s.scan(/\w+/)
proto
s.pos += 1

Running that in a loop 10k times takes about 0.6 seconds to
complete. Is there a better/faster way of doing it?

Regards,

Ricardo.
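
For anyone following along, here is a minimal sketch of the flow described in the post above: scan the fields with StringScanner, put them into a Struct, and collect the structs in an Array. The names LogEntry, IP and parse_line, and all of the skip patterns, are assumptions made for illustration and are not the poster's actual code.

```ruby
require 'strscan'

# Hypothetical struct; the field names mirror the variables in the post above.
LogEntry = Struct.new(:time, :rule_no, :stat, :interface,
                      :out_ip, :out_port, :in_ip, :in_port, :proto)

# Dotted quad; the port follows as a fifth dotted group in this log format.
IP = /\d+\.\d+\.\d+\.\d+/

# Parses one line in the format from the post. Returns nil if the line
# does not start like a log line.
def parse_line(line)
  s = StringScanner.new(line)
  time = s.scan(/\d+\.\d+/)
  return nil unless time && s.skip(/\s+rule\s+/)
  rule_no = s.scan(/\d+/)
  s.skip(/\S*\s+/)          # rest of the rule token, e.g. "/0(match):"
  stat = s.scan(/\w+/)
  s.skip_until(/\son\s+/)   # jump past " on "
  interface = s.scan(/\w+/)
  s.skip(/:\s+/)
  out_ip = s.scan(IP)
  s.skip(/\./)
  out_port = s.scan(/\d+/)
  s.skip(/\s+>\s+/)
  in_ip = s.scan(IP)
  s.skip(/\./)
  in_port = s.scan(/\d+/)
  s.skip(/:\s+/)
  proto = s.scan(/\w+/)
  LogEntry.new(time, rule_no, stat, interface,
               out_ip, out_port, in_ip, in_port, proto)
end

sample = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
         "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"

# One struct per parsable line, collected into an Array.
entries = [sample].map { |line| parse_line(line) }.compact
puts entries.first.to_a.inspect
```

For a real file, the one-element array would presumably be replaced by something like File.foreach over the log, which avoids loading all 50 MB at once.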


 
ara.t.howard@noaa.gov
 
      03-08-2006
can you put a demo log file on the web somewhere?

-a

--
knowledge is important, but the much more important is the use toward which it
is put. this depends on the heart and mind the one who uses it.
- h.h. the 14th dalai lama


 
|MKSM|
 
      03-08-2006
I'm sorry, the log file I have comes from a live firewall. I'd rather
not release it.

The log consists only of lines like the one I used in the code.

Regards,

Ricardo


 
James Edward Gray II
 
      03-08-2006
On Mar 8, 2006, at 12:09 PM, |MKSM| wrote:

> I'm sorry, the log file i have comes from a live firewall. I'd rather
> not release it.


Would randomizing the data render it safe?

>> "ABC 123".gsub(/[a-zA-Z0-9]/i) { |chr| ("0".."9").include?(chr) ? rand(10) : ("A".."Z").to_a[rand(26)] }
=> "HNQ 265"

James Edward Gray II
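
A variation on the one-liner above, as a rough sketch: randomizing letters would also scramble the keywords a parser keys on ("rule", "pass", "on"), so this version only touches digits and keeps the line shape intact. The helper name scramble_digits is made up for illustration. Whether digit scrambling alone makes a firewall log safe to publish is a separate question that this does not answer.

```ruby
# Hypothetical helper: swaps every digit for a random one and leaves
# letters, punctuation and spacing untouched.
def scramble_digits(line)
  line.gsub(/\d/) { rand(10).to_s }
end

line = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
       "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"
puts scramble_digits(line)
```

The output differs on every run, but it has the same length and the same non-digit characters as the input, so a parser written for the real log should still accept it.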


 
Robert Klemme
 
      03-09-2006
Caleb Clausen wrote:
> OK, so first off, your sample implementation seemed to have several
> bugs in it. After fixing those, I thought you might be able to save
> some time by glomming all the regexp's together, obviating the need
> for StringScanner altogether. However, that doesn't seem to have
> actually made any difference...


I don't buy this. A single plain RX is usually faster than a more complex
solution. Even on a machine under constant high load (I had no other machine
available at the moment) I get a significant difference (north of 6%):

>> 15:22:14 [source]: /c/temp/ruby/logscan.rb

Rehearsal ------------------------------------------------
strscan        5.969000   0.000000   5.969000 (  6.095000)
rx             5.828000   0.000000   5.828000 (  5.951000)
rx with conv   5.860000   0.000000   5.860000 (  5.922000)
-------------------------------------- total: 17.657000sec

                   user     system      total        real
strscan        5.953000   0.000000   5.953000 (  6.043000)
rx             5.547000   0.000000   5.547000 (  5.747000)
rx with conv   5.765000   0.000000   5.765000 (  5.924000)

(script attached)

> if anything it seems to have been a
> little slower. I don't know why. And the great big long Regexp is
> considerably harder to read.


Using %r{} and /x helps readability a great deal (see script).

Kind regards

robert
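
The attached script is not part of this thread, so as a stand-in here is a small sketch of what a %r{} literal with the /x flag can look like for the log line from the original question. The constant name LOG_RX and the pattern itself are assumptions, not Robert's regexp. With /x, whitespace in the pattern is ignored and # starts a comment, so each field can sit on its own line.

```ruby
# Hypothetical pattern for the log line format shown earlier in the thread.
LOG_RX = %r{
  \A
  (\d+\.\d+)                     \s+   # timestamp
  rule \s+ (\d+) \S*             \s+   # rule number, rest of the token skipped
  (\w+)                          \s+   # action, e.g. pass
  .*? \s on                      \s+   # everything up to the word on
  (\w+) :                        \s+   # interface
  (\d+\.\d+\.\d+\.\d+) \. (\d+)  \s+   # first address and port
  >                              \s+
  (\d+\.\d+\.\d+\.\d+) \. (\d+) : \s+  # second address and port
  (\w+)                                # protocol
}x

line = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
       "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"

m = LOG_RX.match(line)
time, rule_no, stat, interface, out_ip, out_port, in_ip, in_port, proto = m.captures
puts [time, rule_no, stat, interface, out_ip, out_port, in_ip, in_port, proto].inspect
```

Because /x ignores literal spaces, every space in the input has to be matched explicitly with \s, which is the main thing to watch for when converting a one-line regexp.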

 
Robert Klemme
 
      03-09-2006
I redid the test on an idle Linux machine with Ruby 1.8.1 and the
StringScanner is actually faster:

[root@fox tmp]# ./logscan.rb
Rehearsal ------------------------------------------------
strscan        2.990000   0.000000   2.990000 (  2.991096)
rx             4.870000   0.000000   4.870000 (  4.868536)
rx with        4.280000   0.010000   4.290000 (  4.284334)
rx with conv   5.240000   0.000000   5.240000 (  5.459702)
-------------------------------------- total: 17.390000sec

                   user     system      total        real
strscan        3.000000   0.000000   3.000000 (  2.999783)
rx             4.870000   0.000000   4.870000 (  4.899242)
rx with        4.300000   0.010000   4.310000 (  4.869835)
rx with conv   5.240000   0.000000   5.240000 (  5.442722)

Apparently I have to correct myself...

robert
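
Since the two runs above disagree, the safest move is probably to measure on your own machine. The "Rehearsal" block in those reports is what Benchmark.bmbm from the standard library prints, so a comparison along these lines should produce output in the same shape. This is only a sketch: RX, with_rx and with_strscan are assumed stand-ins, not the contents of logscan.rb, and the iteration count is the 10k from the original question.

```ruby
require 'benchmark'
require 'strscan'

LINE = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
       "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"
N = 10_000  # iteration count taken from the original question

# Assumed single-regexp variant; not the script attached to the post.
RX = /\A(\d+\.\d+) rule (\d+)\S* (\w+) .*? on (\w+): (\d+(?:\.\d+){3})\.(\d+) > (\d+(?:\.\d+){3})\.(\d+): (\w+)/

def with_rx(line)
  m = RX.match(line)
  m && m.captures
end

# Assumed StringScanner variant extracting the same nine fields.
def with_strscan(line)
  s = StringScanner.new(line)
  fields = []
  fields << s.scan(/\d+\.\d+/)          # timestamp
  s.skip(/ rule /)
  fields << s.scan(/\d+/)               # rule number
  s.skip(/\S* /)
  fields << s.scan(/\w+/)               # action
  s.skip_until(/ on /)
  fields << s.scan(/\w+/)               # interface
  s.skip(/: /)
  fields << s.scan(/\d+(?:\.\d+){3}/)   # first address
  s.skip(/\./)
  fields << s.scan(/\d+/)               # first port
  s.skip(/ > /)
  fields << s.scan(/\d+(?:\.\d+){3}/)   # second address
  s.skip(/\./)
  fields << s.scan(/\d+/)               # second port
  s.skip(/: /)
  fields << s.scan(/\w+/)               # protocol
  fields
end

# bmbm runs a rehearsal pass first, then the measured pass.
Benchmark.bmbm do |bm|
  bm.report("strscan") { N.times { with_strscan(LINE) } }
  bm.report("rx")      { N.times { with_rx(LINE) } }
end
```

Both variants return the same nine strings for the sample line, so the timings compare like with like. Which one comes out ahead may well depend on the Ruby version and the machine, as the two reports in this thread suggest.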

 