

parser performance comparisons

 
 
Eric Mahurin
11-04-2005

Jim Freeze was curious as to how Grammar (LL parser) compares to RACC (LR
parser). I was a little bit too... As a test case, I compared against the
racc calc.y parser (a simple expression calculator) with a huge 2.6MB
expression (randomly generated by a script). Since ruby could evaluate this
expression directly (I used load), I also threw it into the mix. Here are
the results:

parser   lexer    user   vmsize   notes
------   -----    ----   ------   -----
ruby     ruby     0.8s   30.6MB   -
racc     Regexp   33.4s  278.2MB  w/o racc C extension
racc     Regexp   15.2s  142.1MB  w/ racc C extension
Grammar  Grammar  22.1s  11.4MB   multi-threaded (300 token buf)
Grammar  Grammar  20.9s  11.4MB   multi-threaded, inlined
Grammar  Grammar  21.1s  11.3MB   one token at a time
Grammar  Grammar  20.1s  11.4MB   one token at a time, inlined
Grammar  Regexp   15.6s  76.0MB   -
Grammar  Regexp   13.8s  74.4MB   inlined

I think the best apples-to-apples comparison is racc w/ its Regexp lexer
and no C extension (mine is pure ruby) vs. my Grammar parser with a similar
Regexp lexer. You can even get a little more speed by inlining your actions
with Grammar. Unfortunately this makes it less readable, because you have to
put your code in strings instead of a simple block. My parser is more than
twice as fast under those circumstances. racc's C extension gets it back on
par with my pure ruby solution. I think the memory usage is so high for the
racc solutions because with racc they typically generate all the tokens up
front (in memory). I think every racc example I found did this. I'm not sure
why.

I only showed the Regexp lexer solution because that is what the racc
examples use. For a Grammar parser, I recommend using a Grammar lexer. A
Regexp lexer isn't as readable or flexible, and most importantly, a Regexp
is meant to work with a String, not an IO. Because Regexps work on Strings,
all of the racc examples I found read the IO into a String first. That is
why the memory usage is so much higher for the Regexp lexers above.
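
To illustrate, here's roughly what that whole-file Regexp lexing pattern
looks like - a minimal sketch with hypothetical method and token names,
using StringScanner (the real racc calc examples differ in detail):

  require 'strscan'

  # Sketch of the whole-file Regexp lexing pattern described above.
  # Method and token names are hypothetical.
  def tokenize(io)
    s = StringScanner.new(io.read)      # the entire IO slurped into a String
    tokens = []
    until s.eos?
      if s.scan(/\s+/)                  # skip whitespace
      elsif t = s.scan(/\d+/)           then tokens << [:NUMBER, t.to_i]
      elsif t = s.scan(%r{[-+*/()]})    then tokens << [t, t]
      else raise "lex error at #{s.pos}"
      end
    end
    tokens << [false, false]            # racc's end-of-input convention
    tokens                              # the whole token stream, in memory
  end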

BTW, I did this on my experimental code. Hopefully this will become version
0.6 in a few weeks.



 
Phil Tomson
11-04-2005
In article <29256ea00511041519i654e648fia23b73856ef95936@mail.gmail.com>,
Eric Mahurin <(E-Mail Removed)> wrote:
>Jim Freeze was curious as to how Grammar (LL parser) compares to RACC (LR
>parser). I was a little bit too... As a test case, I compared against the
>racc calc.y parser (a simple expression calculator) with a huge 2.6MB
>expression (randomly generated by a script). Since ruby could evaluate this
>expression directly (I used load), I also threw it into the mix. Here are
>the results:
>[...]


It's been a while since I looked into the LL(k) vs. LALR issues (LL(k) in
the case of ANTLR and LALR in the case of YACC), but isn't the calc.y
grammar pretty simple and thus maybe not a good way to compare the two?
Specifically, did you need to use any lookahead?

A better comparison would be to create a Ruby grammar...
You're working on that, right?

Phil
 
Jim Freeze
11-05-2005
On 11/4/05, Phil Tomson <(E-Mail Removed)> wrote:
> A better comparison would be to create a Ruby grammar...


Other comparisons may be needed, but this one is very important.
It says that Grammar can play equally with RACC in terms of speed,
which is important if you are the new kid on the block.

What I would like to see is Grammar speed up a bit so that the
one-token-at-a-time version is as fast as Racc with the C extension. Then
it would almost be a no brainer to use Grammar instead of Racc.

Good work Eric.



--
Jim Freeze


 
Eric Mahurin
11-05-2005

On 11/5/05, Jim Freeze <(E-Mail Removed)> wrote:
>
> On 11/4/05, Phil Tomson <(E-Mail Removed)> wrote:
> > A better comparison would be to create a Ruby grammar...



And compare to what? We have no ruby parsers in ruby.

I would like to compare against something else too. If anybody has other
suggestions, here's what I would need for the comparison:

- a racc (or even a working rockit) grammar
- a large test file that takes at least several seconds to parse
- hopefully something simple enough that it won't take much time to convert
to Grammar (I don't want to be distracted from my ruby parser much)

> Other comparisons may be needed, but this one is very important.
> It says that Grammar can play equally with RACC in terms of speed,
> which is important if you are the new kid on the block.
>
> What I would like to see is grammar speed up a bit so that the
> one-token-at-a-time
> is as fast as Racc with the C extension. Then it would almost be a no
> brainer
> to use Grammar instead of Racc.



Why is the one-token-at-a-time Grammar lexer important? If you already have
a Regexp lexer (for RACC or whatever), it is very easy to make it a lexer
for a Grammar parser - just make it an object that has a read1next method
(like a Cursor), similar to next_token with yacc. Until I start generating
C code, it will be hard to compete against a Regexp lexer, since Regexp
does its work in C.
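
For example, a rough sketch of such an adapter (the class name is
hypothetical; read1next and next_token are as described above):

  # Sketch: wrap an existing racc-style lexer (anything with #next_token)
  # in an object exposing #read1next, so a Grammar parser can pull tokens
  # from it like a Cursor.
  class TokenCursor
    def initialize(racc_lexer)
      @lexer = racc_lexer
    end

    def read1next
      @lexer.next_token    # next [type, value] pair from the Regexp lexer
    end
  end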

If you are starting from scratch though, the recommended way to make a
lexer for a Grammar parser is the multi-threaded approach. This gives the
most flexibility and the best readability for complex lexers. With this
approach, the Grammar of the lexer parses the whole file and generates all
tokens (instead of parsing just one token). By doing it this way, most
lexers don't need to hold state internal to the lexer. For example, if a
string corresponds to multiple tokens, a token-at-a-time lexer would have
to set a lexer state when it entered a string so that the next time it is
called it would return a string-type token (and reset that state when it
found the end-of-string). With the lexer parsing all tokens, when it finds
a string it just generates all the tokens for that string right there - no
need for a lexer state.
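
To make that concrete, here is an illustration in generic Ruby
(StringScanner code with made-up token names, not Grammar's actual API):
once the opening quote has been consumed, all of the string's tokens come
out in one pass, with no flag surviving between calls:

  require 'strscan'

  # Sketch: emit all of a string literal's tokens at once. The scanner is
  # positioned just past the opening quote; no @in_string state persists.
  def lex_string(s)
    toks = [[:STRING_BEGIN, '"']]
    until s.scan(/"/)                   # stop at the closing quote
      if t = s.scan(/\\./)              then toks << [:STR_ESCAPE, t]
      elsif t = s.scan(/[^"\\]+/)       then toks << [:STR_PART, t]
      else raise "unterminated string"
      end
    end
    toks << [:STRING_END, '"']
  end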

But, Grammar will get faster over time. Here are some of my ideas:

- use ruby2c to convert my generated ruby code to C.
- if that doesn't pan out, make another set of Grammar classes/methods (same
API) that generate C code and inline it.
- optimize the methods from various Cursor classes that Grammar might use.

Also keep in mind that with my Grammar stuff, parser generation is done at
run-time (included in the numbers I gave). This should typically take a
fraction of a second for most parsers. If it becomes an issue I will add the
ability to dump the generated code to a file. But, for now I'd like to leave
it, because the programmer not having to worry about the parser generation
stage and how to integrate the generated file makes the usability much
higher. Grammar can be treated as just another library.

> Good work Eric.


Thanks, Jim.

Phil Tomson
11-05-2005
In article <29256ea00511050703x4222076qf6bb2ef52b2c5b2b@mail. gmail.com>,
Eric Mahurin <(E-Mail Removed)> wrote:
>On 11/5/05, Jim Freeze <(E-Mail Removed)> wrote:
>>
>> On 11/4/05, Phil Tomson <(E-Mail Removed)> wrote:
>> > A better comparison would be to create a Ruby grammar...

>
>
>And compare to what? We have no ruby parsers in ruby.


Well, the suggestion wasn't entirely serious...



Phil
 
Jim Freeze
11-05-2005
On 11/5/05, Eric Mahurin <(E-Mail Removed)> wrote:
> Other comparisons may be needed, but this one is very important.
> > It says that Grammar can play equally with RACC in terms of speed,
> > which is important if you are the new kid on the block.
> >
> > What I would like to see is grammar speed up a bit so that the
> > one-token-at-a-time
> > is as fast as Racc with the C extension. Then it would almost be a no
> > brainer
> > to use Grammar instead of Racc.

>
> Why is the one-token-at-a-time Grammar lexer important? If you already have
> a Regexp lexer (for RACC or whatever), it is very easy to make it a lexer
> for a Grammar parser - just make it an object that has a read1next method
> (like a Cursor), similar to next_token with yacc. Until I start generating
> C code, it will be hard to compete against a Regexp lexer, since Regexp
> does its work in C.


Memory consumption.
From the data that you showed, it looks like I should avoid
Regexp if I am concerned about memory usage.

> Good work Eric.
>
> Thanks, Jim.


Don't mention it.

--
Jim Freeze


 
Eric Mahurin
11-05-2005

On 11/5/05, Jim Freeze <(E-Mail Removed)> wrote:
>
> On 11/5/05, Eric Mahurin <(E-Mail Removed)> wrote:
> > Other comparisons may be needed, but this one is very important.
> > > It says that Grammar can play equally with RACC in terms of speed,
> > > which is important if you are the new kid on the block.
> > >
> > > What I would like to see is grammar speed up a bit so that the
> > > one-token-at-a-time
> > > is as fast as Racc with the C extension. Then it would almost be a no
> > > brainer
> > > to use Grammar instead of Racc.

> >
> > Why is the one-token-at-a-time Grammar lexer important? If you already
> > have a Regexp lexer (for RACC or whatever), it is very easy to make it a
> > lexer for a Grammar parser - just make it an object that has a read1next
> > method (like a Cursor), similar to next_token with yacc. Until I start
> > generating C code, it will be hard to compete against a Regexp lexer,
> > since Regexp does its work in C.

>
> Memory consumption.
> From the data that you showed, it looks like I should avoid
> Regexp if I am concerned about memory usage.
>


Yes, but you have the same problem in racc (and more, since typical usage is
to generate the entire token stream in memory first). The Regexp lexers
could be brought down to more reasonable memory levels if you read in a
block (or line) at a time to match against. In this particular example a
line at a time wouldn't help, since I put the entire 2.6MB expression on one
line (not realistic of course). With reading in a block (fixed number of
characters) or line at a time, you'd have to deal with tokens that can span
multiple blocks or lines. You'd lose some of the Regexp advantage, but
probably not much. The racc memory usage could also be helped if you don't
generate the entire token stream in memory first. I'm not sure why there are
no racc examples with low memory usage (don't read the entire file into a
string and don't generate the entire token stream in memory up front).
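
Here is a rough sketch of the block-at-a-time idea (hypothetical helper;
the refill threshold assumes no token is longer than 64 characters):

  require 'strscan'

  BLOCK = 8192

  # Sketch: lex from an IO one block at a time. Refill the buffer whenever
  # the unscanned remainder gets short, so a token split across a block
  # boundary can still match (assumes tokens shorter than 64 characters).
  def each_token(io)
    s = StringScanner.new(io.read(BLOCK) || "")
    loop do
      if s.rest_size < 64 && (chunk = io.read(BLOCK))
        s.string = s.rest + chunk       # drop consumed text, append new block
      end
      break if s.eos?
      if s.scan(/\s+/)                  # skip whitespace
      elsif t = s.scan(/\d+/)           then yield [:NUMBER, t.to_i]
      elsif t = s.scan(%r{[-+*/()]})    then yield [t, t]
      else raise "lex error near #{s.rest[0, 10].inspect}"
      end
    end
  end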



 
Steven Jenkins
11-05-2005
Eric Mahurin wrote:
> The racc memory usage could also be helped if you don't
> generate the entire token stream in memory first. I'm not sure why there are
> no racc examples with low memory usage (don't read the entire file into a
> string and don't generate the entire token stream in memory up front).


Probably because for most applications it doesn't matter. But the
example I sent you doesn't read the entire file into a string. It does
load the entire token stream into an array, but that could easily be
changed by running the lexer in a separate thread and using a SizedQueue
for the token stream. Maybe a five-line change.
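
Something like this, perhaps (a sketch only; the lexer's each_token method
and the buffer size are made up):

  require 'thread'    # SizedQueue

  # Sketch of the change described above (names hypothetical): run the
  # lexer in its own thread, bounding the token stream with a SizedQueue.
  def start_lexer(lexer, bound = 300)
    @queue = SizedQueue.new(bound)      # at most `bound` buffered tokens
    Thread.new do
      lexer.each_token { |tok| @queue.push(tok) }  # hypothetical lexer loop
      @queue.push([false, false])                  # racc end-of-input marker
    end
  end

  def next_token                        # racc pulls tokens through this
    @queue.pop
  end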

For this application, I really don't care about memory consumption. The
largest file I'll ever parse is 100 times smaller than the physical
memory on the machine, much less the available virtual memory.

Steve


 
Eric Mahurin
11-06-2005

On 11/5/05, Steven Jenkins <(E-Mail Removed)> wrote:
>
> Eric Mahurin wrote:
> > The racc memory usage could also be helped if you don't
> > generate the entire token stream in memory first. I'm not sure why there
> > are no racc examples with low memory usage (don't read the entire file
> > into a string and don't generate the entire token stream in memory up
> > front).

>
> Probably because for most applications it doesn't matter. But the
> example I sent you doesn't read the entire file into a string. It does
> load the entire token stream into an array, but that could easily be
> changed by running the lexer in a separate thread and using a SizedQueue
> for the token stream. Maybe a five-line change.
>
> For this application, I really don't care about memory consumption. The
> largest file I'll ever parse is 100 times smaller than the physical
> memory on the machine, much less the available virtual memory.
>
> Steve
>
>

Do you even need a separate thread? Why can't you just have #next_token
parse the next token and return it? From what I saw of your lexer, it looked
like that would be possible, but then again I have almost no racc experience
(I've only learned enough to try to benchmark against it). When you go
multi-threaded, you do have to make sure you're doing enough work in each
thread, because thread switching is relatively expensive. For my
multi-threaded test, increasing my token buffer from no buffering (lexer and
parser synchronized) to a 300 token buffer made a couple orders of magnitude
difference in performance.

To me, if a parser is reading in a whole file into memory and/or generating
all the tokens of the file in memory up front, it doesn't seem like a "real"
parser. Regexp fits in this camp (it parses from a string - in memory). With
some of the features that are being added to regexes (look at some of perl's
advanced/beta features where you can embed code and do recursion), you could
call a regex a parser, but it doesn't seem like one to me because it needs
the whole thing it is parsing in memory. It just seems strange to me that
the common racc usage does things this way. You could actually optimize a
parser quite a bit if you knew everything it was parsing was in memory
(regexes make this assumption for doing backtracking - similar to lookahead
for a standard parser).



 
Steven Jenkins
11-06-2005
Eric Mahurin wrote:
> Do you even need a separate thread? Why can't you just have #next_token
> parse the next token and return it? From what I saw of your lexer, it
> looked like that would be possible, but then again I have almost no racc
> experience (I've only learned enough to try to benchmark against it). When
> you go multi-threaded, you do have to make sure you're doing enough work in
> each thread, because thread switching is relatively expensive. For my
> multi-threaded test, increasing my token buffer from no buffering (lexer
> and parser synchronized) to a 300 token buffer made a couple orders of
> magnitude difference in performance.


If you want to limit memory consumption, a separate thread and
SizedQueue are sufficient. That might not give the highest performance,
but it's the cleanest code. If performance is an issue, then yes, you
could have next_token read another line from the input file if the
current buffer is empty.
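
That version might look something like this (a sketch; @io, @ss, and
scan_token are hypothetical):

  require 'strscan'

  # Sketch: next_token refills its buffer from the IO a line at a time,
  # only when the current buffer runs dry (helper names hypothetical).
  def next_token
    while @ss.nil? || @ss.eos?
      line = @io.gets or return [false, false]   # racc end-of-input
      @ss = StringScanner.new(line)
    end
    scan_token(@ss)    # hypothetical: match one token from the buffer
  end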

> To me, if a parser is reading in a whole file into memory and/or generating
> all the tokens of the file in memory up front, it doesn't seem like a "real"
> parser. Regexp fits in this camp (parses from a string - in memory). With
> some of features that are being added to regexes (look at some of perl's
> advanced/beta features where you can embed code and do recursion), you could
> call a regex a parser, but it doesn't seem like one to me because it needs
> the whole thing it is parsing in memory. It just seems strange to me that
> the common racc usage does things this way. You could actually optimize a
> parser quite a bit if you knew everything it was parsing was in memory
> (regexes make this assumption for doing backtracking - similar to lookahead
> for a standard parser).


Again, only if performance and memory usage are issues. I'd guess that
for most applications using racc, they aren't. And if they really are,
Ruby might not be the best choice anyway. Ruby/racc will never seriously
compete with C/yacc in speed. For my particular application, the largest
file I'll ever encounter can be parsed in less than one second, so I put
exactly zero effort into making it faster.

Steve


 