Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > Huffman coding and Parse::RecDescent

Reply
Thread Tools

Huffman coding and Parse::RecDescent

 
 
Jon Ericson
Guest
Posts: n/a
 
      04-22-2004
I've been playing with Parse::RecDescent lately and I've run into an
annoyance. I'd appreciate some advice about how to deal with it.

Take, for an example, this abbreviated grammar for reading a baseball
score-sheet:

events: play(s) /\z/

play: home_run | caught_stealing | hit_by_pitch | catchers_interference
| <error>

home_run: 'H' | 'HR'
caught_stealing: 'CS'
hit_by_pitch: 'HP'
catchers_interference: 'C'

Obviously, there are bugs in the grammar. For instance, it fails when
given 'HP' or 'HR', since the first production in the home_run rule
greedily gobbles up the initial 'H'. It isn't that hard to fix:

events: play(s) /\z/

play: caught_stealing | catchers_interference | hit_by_pitch | home_run
| <error>

caught_stealing: 'CS'
catchers_interference: 'C'
hit_by_pitch: 'HP'
home_run: 'HR' | 'H'

This works and it's easy to maintain, but it's massively inefficient.
Home runs are substantially more common than hit batsmen, but because
the code for hit by pitch is longer than one of the home run codes,
you have to try it first. That means a lot of expensive backtracking,
which is criminal in this sort of data. It's trivial to order
baseball plays by frequency.

Notice that Huffman coding hurts us here. The most common plays in
baseball have the shortest codes, in general. Cather's interference
and caught stealing are the exceptions that prove the rule.
Fortunately, we can do negative lookahead and preserve the original
ordering:

home_run: 'HR' |'H' ...!'P'
{$item[1]}

The action explicitly returns the first item, otherwise we get
whatever was returned by the negative lookahead when 'H' matches. I'm
not sure if this is a bug, since it doesn't seem to be documented.
Maybe I should use regexp lookahead instead?

But this is harder to maintain. The next time you add a play, you
have to go back and make sure it doesn't interfere with one of the
other productions. Or if you reorder the productions to squeeze a bit
more performance out of the grammar, you have to audit the rule.

How have other people dealt with this?

Thanks,
Jon
 
Reply With Quote
 
 
 
 
Uri Guttman
Guest
Posts: n/a
 
      04-22-2004
>>>>> "JE" == Jon Ericson <(E-Mail Removed)> writes:

JE> events: play(s) /\z/

JE> play: home_run | caught_stealing | hit_by_pitch | catchers_interference
JE> | <error>

JE> home_run: 'H' | 'HR'
JE> caught_stealing: 'CS'
JE> hit_by_pitch: 'HP'
JE> catchers_interference: 'C'

JE> Obviously, there are bugs in the grammar. For instance, it fails when
JE> given 'HP' or 'HR', since the first production in the home_run rule
JE> greedily gobbles up the initial 'H'. It isn't that hard to fix:

two thoughts (i have used P:RD but i am no expert in it). first, you can
use regexes instead of strings and then use anchors/assertions of some
sort:

home_run: /\bH\b/ | /\bHR\b/

also i was under the impression that basic P:RD will split input on
white space so the home_run production should only match H when the
token is H.

all this is conjecture and hopefully the damian will see this and jump
in to correct me. will someone please turn on the damian signal lamp?

uri
 
Reply With Quote
 
 
 
 
Jon Ericson
Guest
Posts: n/a
 
      04-23-2004
Uri Guttman <(E-Mail Removed)> writes:

> two thoughts (i have used P:RD but i am no expert in it). first, you can
> use regexes instead of strings and then use anchors/assertions of some
> sort:
>
> home_run: /\bH\b/ | /\bHR\b/
>
> also i was under the impression that basic P:RD will split input on
> white space so the home_run production should only match H when the
> token is H.


Right. The problem is this data is *not* split into tokens by white
space. For instance `H8' is the code for an inside the park home run
fielded by the center fielder. Sorry I didn't make that clear.

Someone wrote an email with the suggestion:

home_run: /H(?![A-Z])/ | 'HR'

That does the trick, I think.

I'm not sure Parse::RecDescent is the right tool for the job, but it
sure is fun to play around with.

Thanks,
Jon
 
Reply With Quote
 
Uri Guttman
Guest
Posts: n/a
 
      04-23-2004
>>>>> "JE" == Jon Ericson <(E-Mail Removed)> writes:

JE> Uri Guttman <(E-Mail Removed)> writes:
>>
>> home_run: /\bH\b/ | /\bHR\b/
>>
>> also i was under the impression that basic P:RD will split input on
>> white space so the home_run production should only match H when the
>> token is H.


JE> Right. The problem is this data is *not* split into tokens by white
JE> space. For instance `H8' is the code for an inside the park home run
JE> fielded by the center fielder. Sorry I didn't make that clear.

well, i can't help much with poor specs!

JE> Someone wrote an email with the suggestion:

JE> home_run: /H(?![A-Z])/ | 'HR'

JE> That does the trick, I think.

looks ok. if you gave a complete spec then we could help more.

JE> I'm not sure Parse::RecDescent is the right tool for the job, but it
JE> sure is fun to play around with.

sounds like overkill for this. is it all just one line? P:RD is meant
for real grammars and such and what you have seems like a simple string
with some tricky parsing. a decent set of regexes could do that but a
proper and full spec is needed.

uri

--
Uri Guttman ------ http://www.velocityreviews.com/forums/(E-Mail Removed) -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
 
Reply With Quote
 
Jon Ericson
Guest
Posts: n/a
 
      04-23-2004
Uri Guttman <(E-Mail Removed)> writes:

>>>>>> "JE" == Jon Ericson <(E-Mail Removed)> writes:

>
> JE> Someone wrote an email with the suggestion:
>
> JE> home_run: /H(?![A-Z])/ | 'HR'
>
> JE> That does the trick, I think.
>
> looks ok. if you gave a complete spec then we could help more.


Ok. http://www.retrosheet.org/datause.txt is a close to a spec as
exists. There are quite a few things that are not documented. I'm
looking at item 6. a., but that is fairly incomplete. I've been
writing the grammar by trial and error. (Emphasis on error.

Mostly, however, this is an exercise in learning. I suppose this
example is an aberration, but it really bothered me that adding an
uncommon case broke the frequent case.

Jon
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Fast and efficent Huffman coding for nibbles and bytes in C++? 88888 Dihedral C++ 4 02-28-2012 03:40 PM
javax.imageio.IIOException: Missing Huffman code niko Java 3 02-12-2005 08:37 PM
Python Huffman encoding dot Python 3 11-30-2004 12:18 PM
Compression (besides Huffman) and Ruby Josef 'Jupp' SCHUGT Ruby 10 01-08-2004 01:23 AM
deflater/inflater and dictionnary for huffman NOBODY Java 2 10-17-2003 08:56 AM



Advertisments