Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Is pyparsing really a recursive descent parser?

Reply
Thread Tools

Is pyparsing really a recursive descent parser?

 
 
Just Another Victim of the Ambient Morality
Guest
Posts: n/a
 
      11-02-2007
Is pyparsing really a recursive descent parser? I ask this because
there are grammars it can't parse that my recursive descent parser would
parse, should I have written one. For instance:


from pyparsing import *

grammar = OneOrMore(Word(alphas)) + Literal('end')
grammar.parseString('First Second Third end')


Amazingly enough, this will fail to parse!
Now, maybe I have no idea what I'm talking about but shouldn't a
recursive descent parser recursively descend through all possible rule
combinations to discover one that works? Why does pyparsing fail to parse
this? Is there a way to get pyparsing to parse a grammar like this?
Thank you...


 
Reply With Quote
 
 
 
 
Marc 'BlackJack' Rintsch
Guest
Posts: n/a
 
      11-02-2007
On Fri, 02 Nov 2007 06:05:13 +0000, Just Another Victim of the Ambient
Morality wrote:

> Is pyparsing really a recursive descent parser? I ask this because
> there are grammars it can't parse that my recursive descent parser would
> parse, should I have written one. For instance:
>
>
> from pyparsing import *
>
> grammar = OneOrMore(Word(alphas)) + Literal('end')
> grammar.parseString('First Second Third end')
>
>
> Amazingly enough, this will fail to parse!
> Now, maybe I have no idea what I'm talking about but shouldn't a
> recursive descent parser recursively descend through all possible rule
> combinations to discover one that works? Why does pyparsing fail to parse
> this?



Pyparsing is no recursive descent parser. It doesn't go back in the input
stream. The ``OneOrMore(Word(alphas))`` part "eats" the 'end' and when it
can't get more, the parser moves to the ``Literal('end')`` part which
fails because the 'end' is already gone.

> Is there a way to get pyparsing to parse a grammar like this?


Negative lookahead maybe:

grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
+ Literal('end'))

Ciao,
Marc 'BlackJack' Rintsch
 
Reply With Quote
 
 
 
 
Paul McGuire
Guest
Posts: n/a
 
      11-02-2007
On Nov 2, 5:47 am, Marc 'BlackJack' Rintsch <(E-Mail Removed)> wrote:
>
> Pyparsing is no recursive descent parser. It doesn't go back in the input
> stream. The ``OneOrMore(Word(alphas))`` part "eats" the 'end' and when it
> can't get more, the parser moves to the ``Literal('end')`` part which
> fails because the 'end' is already gone.
>
> > Is there a way to get pyparsing to parse a grammar like this?

>
> Negative lookahead maybe:
>
> grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
> + Literal('end'))
>
> Ciao,
> Marc 'BlackJack' Rintsch- Hide quoted text -
>
> - Show quoted text -


Well I'll be darned! All this time, I thought "recursive descent"
described the recursive behavior of the parser, which pyparsing
definitely has. I never knew that backing up in the face of parse
mismatches was a required part of the picture.

In pyparsing, each construction gets composed from more fine-grained
constructions, and they are called recursively to match or not match
against the input string.

For example, taking your working parser example:

grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
+ Literal('end'))

This creates the following data structure:

- And
- OneOrMore
- And
- NotAny
- Literal('end')
- Word(alphas)
- Literal('end')

Every instance in this structure derives from the base class
ParserElement, which defines a method parse(). parse() in turn calls
preParse(), parseImpl(), and postParse(). If parseImpl succeeds, it
returns a ParseResults object and the next location within the input
string to continue parsing.

The parseImpl methods are most often overridden in the subclasses (a
few override postParse as well), such as:
- And.parseImpl invokes parse() (note the recursion) on each of the
expressions in its list. All must succeed or And.parseImpl throws an
exception.
- OneOrMore.parseImpl invokes parse() on its contained expression,
which must succeed at least once; if not, the exception is rethrown.
If the contained expression succeeds once, then its parse() method is
called again and again until it fails, but no exceptions are rethrown,
since only one match was actually required.
- NotAny inverts the success/failure of its contained expression. If
the expression's parse() method succeeds, NotAny.parseImpl throws an
exception. If the contained expression's parse() method throws an
exception, NotAny returns successfully. (NotAny does not advance the
parse location, nor does it return any tokens.)
- Literal and Word are terminals, in that they do not invoke any other
expression's parse() method. Literal.parseImpl tests whether its
string exists at the current parse location, and throws an exception
if it doesn't. Word.parseImpl tests whether the current parse
location contains a letter in the Word instance's set of valid initial
characters - if so success; if not, throws an exception. It then
advances through the input string, matching characters in the Word
instance's set of valid body characters. The entire matched string is
then returned, along with an updated string index at which to continue
parsing.

In my concept of "recursive descent" parsing, I was under the
impression that pyparsing's use of this data structure, and
*recursive* calls of parse() as it *descends* through the data
structure, constituted a recursive descent parser. What the OP
requested was a more regular expression-ish matcher, with backup (or
backtracking).

Your example did not include either of the alternation classes,
MatchFirst or Or. These classes implement a form of backtracking on
the stack, that if we descend into an expression on the list of
alternatives, and that expression fails, then MatchFirst or Or will
back up to the same starting location and try the next alternative.
(MatchFirst is an "eager" matcher, in that it will pick the first
matching expression in its list - Or is an "exhaustive" matcher, in
that it will evaluate all of the alternatives, and select the longest
match.)

The Wikipedia entry for "Recursive Descent Parser" describes this
parser model as a "predictive parser", and later goes on to say that
some (uncited) authors equate "predictive parser" with "recursive
descent parsers". The article makes a special distinction for a
"recursive parser with backup", which is what I believe the OP was
asking for. Yes, pyparsing is definitely *not* a "recursive descent
parser with backup." The solution, as you correctly posted, is to add
a negative lookahead. NotAny is also represented using the '~'
operator.

By the way, there are a couple of other places where pyparsing
diverges from the standard model:
- implicit whitespace skipping
- user-definable comment skipping
- attachment of parse actions to expressions
These features, while atypical, provide some added capability and ease-
of-use in creating quick and simple parsers.

So I will take my stance with the uncited authors who lump predictive
parsers in with recursive descent parsers.

-- Paul

 
Reply With Quote
 
Just Another Victim of the Ambient Morality
Guest
Posts: n/a
 
      11-03-2007

"Paul McGuire" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) ups.com...
> On Nov 2, 5:47 am, Marc 'BlackJack' Rintsch <(E-Mail Removed)> wrote:
>>
>> Pyparsing is no recursive descent parser. It doesn't go back in the
>> input
>> stream. The ``OneOrMore(Word(alphas))`` part "eats" the 'end' and when
>> it
>> can't get more, the parser moves to the ``Literal('end')`` part which
>> fails because the 'end' is already gone.
>>
>> > Is there a way to get pyparsing to parse a grammar like this?

>>
>> Negative lookahead maybe:
>>
>> grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
>> + Literal('end'))
>>
>> Ciao,
>> Marc 'BlackJack' Rintsch- Hide quoted text -
>>
>> - Show quoted text -

>
> Well I'll be darned! All this time, I thought "recursive descent"
> described the recursive behavior of the parser, which pyparsing
> definitely has. I never knew that backing up in the face of parse
> mismatches was a required part of the picture.


It has recursion in it but that's not sufficient to call it a recursive
descent parser any more than having a recursive implementation of the
factorial function. The important part is what it recurses through...


> In pyparsing, each construction gets composed from more fine-grained
> constructions, and they are called recursively to match or not match
> against the input string.
>
> For example, taking your working parser example:
>
> grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
> + Literal('end'))
>
> This creates the following data structure:
>
> - And
> - OneOrMore
> - And
> - NotAny
> - Literal('end')
> - Word(alphas)
> - Literal('end')
>
> Every instance in this structure derives from the base class
> ParserElement, which defines a method parse(). parse() in turn calls
> preParse(), parseImpl(), and postParse(). If parseImpl succeeds, it
> returns a ParseResults object and the next location within the input
> string to continue parsing.
>
> The parseImpl methods are most often overridden in the subclasses (a
> few override postParse as well), such as:
> - And.parseImpl invokes parse() (note the recursion) on each of the
> expressions in its list. All must succeed or And.parseImpl throws an
> exception.
> - OneOrMore.parseImpl invokes parse() on its contained expression,
> which must succeed at least once; if not, the exception is rethrown.
> If the contained expression succeeds once, then its parse() method is
> called again and again until it fails, but no exceptions are rethrown,
> since only one match was actually required.
> - NotAny inverts the success/failure of its contained expression. If
> the expression's parse() method succeeds, NotAny.parseImpl throws an
> exception. If the contained expression's parse() method throws an
> exception, NotAny returns successfully. (NotAny does not advance the
> parse location, nor does it return any tokens.)
> - Literal and Word are terminals, in that they do not invoke any other
> expression's parse() method. Literal.parseImpl tests whether its
> string exists at the current parse location, and throws an exception
> if it doesn't. Word.parseImpl tests whether the current parse
> location contains a letter in the Word instance's set of valid initial
> characters - if so success; if not, throws an exception. It then
> advances through the input string, matching characters in the Word
> instance's set of valid body characters. The entire matched string is
> then returned, along with an updated string index at which to continue
> parsing.


Thank you for the detailed description of pyparsing's implementation.


> In my concept of "recursive descent" parsing, I was under the
> impression that pyparsing's use of this data structure, and
> *recursive* calls of parse() as it *descends* through the data
> structure, constituted a recursive descent parser. What the OP
> requested was a more regular expression-ish matcher, with backup (or
> backtracking).


In my concept of "recursive descent" parsing, I was under the impression
that one should recurse through all rule combinations to ensure that the
grammar is fully applied. As I have mentioned before, merely having
recursion in your algorithm is insufficient. What you recurse through is
key. pyparsing recurses through rules but what's important is to recurse
through rule combinations. Otherwise, the parser won't be correct. That is
to say, there will be strings that are grammatically correct for which the
parser will fail to parse. Honestly, what is the point of a recursive
descent parser if it doesn't parse correct string? If you're willing to
have correct strings unparsed, you might as well go for LALR parsing, or
some such...
In my opinion, the rule set I mentioned in my original post:


grammar = OneOrMore(Word(alphas)) + Literal('end')


...should be translated (internally) to something like this:


word = Word(alphas)
grammar = Forward()
grammar << ((word + grammar) | (word + Literal(end)))


This allows a recursive search to find the string correct without any
need for "backtracking," if I understand what you mean by that. Couldn't
pyparsing have done something like this?


> Your example did not include either of the alternation classes,
> MatchFirst or Or. These classes implement a form of backtracking on
> the stack, that if we descend into an expression on the list of
> alternatives, and that expression fails, then MatchFirst or Or will
> back up to the same starting location and try the next alternative.
> (MatchFirst is an "eager" matcher, in that it will pick the first
> matching expression in its list - Or is an "exhaustive" matcher, in
> that it will evaluate all of the alternatives, and select the longest
> match.)


I guess I was hoping that I could simply specify, mathematically, a
grammar and have the software Do The Right Thing(tm)...


> The Wikipedia entry for "Recursive Descent Parser" describes this
> parser model as a "predictive parser", and later goes on to say that
> some (uncited) authors equate "predictive parser" with "recursive
> descent parsers". The article makes a special distinction for a
> "recursive parser with backup", which is what I believe the OP was
> asking for. Yes, pyparsing is definitely *not* a "recursive descent
> parser with backup." The solution, as you correctly posted, is to add
> a negative lookahead. NotAny is also represented using the '~'
> operator.
>
> So I will take my stance with the uncited authors who lump predictive
> parsers in with recursive descent parsers.


Well, the good thing about Wikipedia is that, if you don't like the
definition on the page, you're welcome to change it! Although, I'd
recommend discussing it on the talk page before you do, just to make sure
there isn't a good reason for the page to be as it is...
Thank you...


 
Reply With Quote
 
Kay Schluehr
Guest
Posts: n/a
 
      11-03-2007
On Nov 3, 6:33 am, "Just Another Victim of the Ambient Morality"
<(E-Mail Removed)> wrote:
> "Paul McGuire" <(E-Mail Removed)> wrote in message
>
> news:(E-Mail Removed) ups.com...
>
>
>
> > On Nov 2, 5:47 am, Marc 'BlackJack' Rintsch <(E-Mail Removed)> wrote:

>
> >> Pyparsing is no recursive descent parser. It doesn't go back in the
> >> input
> >> stream. The ``OneOrMore(Word(alphas))`` part "eats" the 'end' and when
> >> it
> >> can't get more, the parser moves to the ``Literal('end')`` part which
> >> fails because the 'end' is already gone.

>
> >> > Is there a way to get pyparsing to parse a grammar like this?

>
> >> Negative lookahead maybe:

>
> >> grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
> >> + Literal('end'))

>
> >> Ciao,
> >> Marc 'BlackJack' Rintsch- Hide quoted text -

>
> >> - Show quoted text -

>
> > Well I'll be darned! All this time, I thought "recursive descent"
> > described the recursive behavior of the parser, which pyparsing
> > definitely has. I never knew that backing up in the face of parse
> > mismatches was a required part of the picture.

>
> It has recursion in it but that's not sufficient to call it a recursive
> descent parser any more than having a recursive implementation of the
> factorial function. The important part is what it recurses through...
>
>
>
> > In pyparsing, each construction gets composed from more fine-grained
> > constructions, and they are called recursively to match or not match
> > against the input string.

>
> > For example, taking your working parser example:

>
> > grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
> > + Literal('end'))

>
> > This creates the following data structure:

>
> > - And
> > - OneOrMore
> > - And
> > - NotAny
> > - Literal('end')
> > - Word(alphas)
> > - Literal('end')

>
> > Every instance in this structure derives from the base class
> > ParserElement, which defines a method parse(). parse() in turn calls
> > preParse(), parseImpl(), and postParse(). If parseImpl succeeds, it
> > returns a ParseResults object and the next location within the input
> > string to continue parsing.

>
> > The parseImpl methods are most often overridden in the subclasses (a
> > few override postParse as well), such as:
> > - And.parseImpl invokes parse() (note the recursion) on each of the
> > expressions in its list. All must succeed or And.parseImpl throws an
> > exception.
> > - OneOrMore.parseImpl invokes parse() on its contained expression,
> > which must succeed at least once; if not, the exception is rethrown.
> > If the contained expression succeeds once, then its parse() method is
> > called again and again until it fails, but no exceptions are rethrown,
> > since only one match was actually required.
> > - NotAny inverts the success/failure of its contained expression. If
> > the expression's parse() method succeeds, NotAny.parseImpl throws an
> > exception. If the contained expression's parse() method throws an
> > exception, NotAny returns successfully. (NotAny does not advance the
> > parse location, nor does it return any tokens.)
> > - Literal and Word are terminals, in that they do not invoke any other
> > expression's parse() method. Literal.parseImpl tests whether its
> > string exists at the current parse location, and throws an exception
> > if it doesn't. Word.parseImpl tests whether the current parse
> > location contains a letter in the Word instance's set of valid initial
> > characters - if so success; if not, throws an exception. It then
> > advances through the input string, matching characters in the Word
> > instance's set of valid body characters. The entire matched string is
> > then returned, along with an updated string index at which to continue
> > parsing.

>
> Thank you for the detailed description of pyparsing's implementation.
>
> > In my concept of "recursive descent" parsing, I was under the
> > impression that pyparsing's use of this data structure, and
> > *recursive* calls of parse() as it *descends* through the data
> > structure, constituted a recursive descent parser. What the OP
> > requested was a more regular expression-ish matcher, with backup (or
> > backtracking).

>
> In my concept of "recursive descent" parsing, I was under the impression
> that one should recurse through all rule combinations to ensure that the
> grammar is fully applied. As I have mentioned before, merely having
> recursion in your algorithm is insufficient. What you recurse through is
> key. pyparsing recurses through rules but what's important is to recurse
> through rule combinations.


I think the confusing aspect of pyparsing for someone like me coming
from an (E)BNF and formal language background is that scanning and
parsing are merged into one. pyparsing is scannerless and where a
tokenizer such as Pythons performs a longest match tokenization but
interprets CFGs correctly ( in the bounds of the parsers power ) one
has to disambiguate grammars in pyparsing where you expect a CFG to be
established.

Note that I don't like pyparsings approach and it is clearly not for
everyone. I rather tend to make the experience that token definitions
can be reused across different languages. I count 65 token for Python
and I added 10 token for an experimental C++ parser and deactivated
some others. The distance between both languages on a lexical level is
not that huge.

What pyparsing has been established in the Python community is
something different IMO: readable and verbose pattern matching syntax
which is not regexp line noise. Around 8 years ago when I started
using Python I found a little module called reverb.py. It was actually
nothing but a front end for regular expressions. It didn't make any
career but I think it is the only one survived the anagrammatical
release hopping from 1.5.2 to 2.5.1 in my own case:

from reverb import*
>>> p = required(required(alphanum)+whitespace)+text("end" )
>>> p

<reverb.raw instance at 0x00C0D260>
>>> RE(p).pattern

'(?:\\w+\\s)+end'


 
Reply With Quote
 
Paul McGuire
Guest
Posts: n/a
 
      11-03-2007
On Nov 3, 12:33 am, "Just Another Victim of the Ambient Morality"
<(E-Mail Removed)> wrote:
> It has recursion in it but that's not sufficient to call it a recursive
> descent parser any more than having a recursive implementation of the
> factorial function. The important part is what it recurses through...


<snip>

> In my opinion, the rule set I mentioned in my original post:
>
> grammar = OneOrMore(Word(alphas)) + Literal('end')
>
> ...should be translated (internally) to something like this:
>
> word = Word(alphas)
> grammar = Forward()
> grammar << ((word + grammar) | (word + Literal(end)))
>
> This allows a recursive search to find the string correct without any
> need for "backtracking," if I understand what you mean by that. Couldn't
> pyparsing have done something like this?
>


Dear JAVotAM -

This was a great post! (Although I'm not sure the comment in the
first paragraph was entirely fair - I know the difference between
recursive string parsing and recursive multiplication.)

You really got me thinking more about how this recursion actually
behaves, especially with respect to elements such as OneOrMore. Your
original question is really quite common, and up until now, my stock
answer has been to use negative lookahead. The expression you posted
is the first time I've seen this solution, and it *does* work.

I was all set to write a cocky response on why your expression
wouldn't work. I've seen it many times before, where people (usually
coming from EBNF experience) implement listOfXs = OneOrMore(x) as:

listOfXs = Forward()
listOfXs << ( x + listOfXs | x )

Actually, what they usually write is:

listOfXs << ( listOfXs + x )

but this sends pyparsing into a recursive tailspin.

So I fired up SciTE and copy/pasted your code into the editor and ran
it, and it worked just fine - this was a shock! I studied this for a
few minutes, and then realized what was happening. First of all, I
misread what you posted. You posted this:

grammar << ((word + grammar) | (word + Literal(end)))

which works. I *thought* you posted this:

grammar << ((word + grammar) | word) + Literal(end)

which doesn't work. In fact this behaves the same as your original
post, except it iterates over the input string's words recursively,
vs. repetition ins a for-loop, as is done by OneOrMore.

So going back to your working version, I had to see why it works.
Initially, the first term in the MatchFirst (the '|' operator creates
MatchFirst instances) is just the same, and by grammar referencing
itself, it just goes word by word through the input trying to find a
match. I'll try to visualize the steps:

level "First Second Third end"
1 word grammar
2 word grammar
3 word grammar
4 word grammar <- fails!
4 word end <- fails!
(therefore level 3 grammar fails!)
3 word end <-- success!!!

grammar has 2 options: to match a word followed by a grammar, or to
match a word followed by 'end'. At 4 levels deep into the Forward's
recursion, the first option fails, because after parsing "end" as the
initial word, there is no more text to try to match against grammar.
Level 4's Forward then also tries to match a word followed by 'end',
but this fails for the same reason. So at this point, the 4th level
Forward fails to match either of its options, so it throws its
exception back up to level 3, indicating that the first alternative,
word followed by grammar, failed. Level 3 then moves on to see if
word followed by the literal 'end' matches, and it does - success!

Here's where I am stuck now. In the original grammar that you posted,
which you want to render into this behavior, grammar is defined as:

grammar = OneOrMore(Word(alphas)) + Literal('end')

Unfortunately, when the OneOrMore gets constructed, it does not have
any visibility beyond knowing what is to be repeated. Again, here is
the data structure that is being built:

- And
- OneOrMore
- Word(alphas)
- Literal('end')

Only at the level of the And is there any awareness that the OneOrMore
is followed by anything, let alone by something which could be
misinterpreted as something matching the OneOrMore's repetition
expression.

Can you suggest a way I could generalize this, so that OneOrMore stops
matching before it gets to 'end'?

-- Paul

 
Reply With Quote
 
Neil Cerutti
Guest
Posts: n/a
 
      11-03-2007
On 2007-11-03, Paul McGuire <(E-Mail Removed)> wrote:
> On Nov 3, 12:33 am, "Just Another Victim of the Ambient Morality"
><(E-Mail Removed)> wrote:
>> It has recursion in it but that's not sufficient to call it a recursive
>> descent parser any more than having a recursive implementation of the
>> factorial function. The important part is what it recurses through...

>
><snip>
>
>> In my opinion, the rule set I mentioned in my original post:
>>
>> grammar = OneOrMore(Word(alphas)) + Literal('end')
>>
>> ...should be translated (internally) to something like this:
>>
>> word = Word(alphas)
>> grammar = Forward()
>> grammar << ((word + grammar) | (word + Literal(end)))
>>
>> This allows a recursive search to find the string correct without any
>> need for "backtracking," if I understand what you mean by that. Couldn't
>> pyparsing have done something like this?
>>

>
> Dear JAVotAM -
>
> This was a great post! (Although I'm not sure the comment in the
> first paragraph was entirely fair - I know the difference between
> recursive string parsing and recursive multiplication.)
>
> You really got me thinking more about how this recursion actually
> behaves, especially with respect to elements such as OneOrMore. Your
> original question is really quite common, and up until now, my stock
> answer has been to use negative lookahead. The expression you posted
> is the first time I've seen this solution, and it *does* work.
>
> I was all set to write a cocky response on why your expression
> wouldn't work. I've seen it many times before, where people (usually
> coming from EBNF experience) implement listOfXs = OneOrMore(x) as:
>
> listOfXs = Forward()
> listOfXs << ( x + listOfXs | x )



>
> Actually, what they usually write is:
>
> listOfXs << ( listOfXs + x )
>
> but this sends pyparsing into a recursive tailspin.
>
> So I fired up SciTE and copy/pasted your code into the editor and ran
> it, and it worked just fine - this was a shock! I studied this for a
> few minutes, and then realized what was happening. First of all, I
> misread what you posted. You posted this:
>
> grammar << ((word + grammar) | (word + Literal(end)))
>
> which works. I *thought* you posted this:
>
> grammar << ((word + grammar) | word) + Literal(end)
>
> which doesn't work. In fact this behaves the same as your original
> post, except it iterates over the input string's words recursively,
> vs. repetition ins a for-loop, as is done by OneOrMore.
>
> So going back to your working version, I had to see why it works.
> Initially, the first term in the MatchFirst (the '|' operator creates
> MatchFirst instances) is just the same, and by grammar referencing
> itself, it just goes word by word through the input trying to find a
> match. I'll try to visualize the steps:
>
> level "First Second Third end"
> 1 word grammar
> 2 word grammar
> 3 word grammar
> 4 word grammar <- fails!
> 4 word end <- fails!
> (therefore level 3 grammar fails!)
> 3 word end <-- success!!!
>
> grammar has 2 options: to match a word followed by a grammar, or to
> match a word followed by 'end'. At 4 levels deep into the Forward's
> recursion, the first option fails, because after parsing "end" as the
> initial word, there is no more text to try to match against grammar.
> Level 4's Forward then also tries to match a word followed by 'end',
> but this fails for the same reason. So at this point, the 4th level
> Forward fails to match either of its options, so it throws its
> exception back up to level 3, indicating that the first alternative,
> word followed by grammar, failed. Level 3 then moves on to see if
> word followed by the literal 'end' matches, and it does - success!
>
> Here's where I am stuck now. In the original grammar that you posted,
> which you want to render into this behavior, grammar is defined as:
>
> grammar = OneOrMore(Word(alphas)) + Literal('end')


Is there not an ambiguity in the grammar?

In EBNF:

goal --> WORD { WORD } END

WORD is '[a-zA-Z]+'
END is 'end'

I think it is fine that PyParsing can't guess what the composer
of that grammar meant.

--
Neil Cerutti
 
Reply With Quote
 
Just Another Victim of the Ambient Morality
Guest
Posts: n/a
 
      11-03-2007

"Paul McGuire" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) ups.com...
> On Nov 3, 12:33 am, "Just Another Victim of the Ambient Morality"
> <(E-Mail Removed)> wrote:
>> It has recursion in it but that's not sufficient to call it a
>> recursive
>> descent parser any more than having a recursive implementation of the
>> factorial function. The important part is what it recurses through...

>
> <snip>
>
>> In my opinion, the rule set I mentioned in my original post:
>>
>> grammar = OneOrMore(Word(alphas)) + Literal('end')
>>
>> ...should be translated (internally) to something like this:
>>
>> word = Word(alphas)
>> grammar = Forward()
>> grammar << ((word + grammar) | (word + Literal(end)))
>>
>> This allows a recursive search to find the string correct without any
>> need for "backtracking," if I understand what you mean by that. Couldn't
>> pyparsing have done something like this?
>>

>
> Dear JAVotAM -
>
> This was a great post! (Although I'm not sure the comment in the
> first paragraph was entirely fair - I know the difference between
> recursive string parsing and recursive multiplication.)


I often use hyperbole to emphasize a point. I honestly mean no offense.
That comment wasn't even meant to be fair, I was hoping to be provocative...


> You really got me thinking more about how this recursion actually
> behaves, especially with respect to elements such as OneOrMore. Your
> original question is really quite common, and up until now, my stock
> answer has been to use negative lookahead. The expression you posted
> is the first time I've seen this solution, and it *does* work.


I'm glad I got you thinking! I'd hate to be another newbie with another
of a thousand questions answered in the FAQ....
Hey, are you the author of pyparsing?


> I was all set to write a cocky response on why your expression
> wouldn't work. I've seen it many times before, where people (usually
> coming from EBNF experience) implement listOfXs = OneOrMore(x) as:
>
> listOfXs = Forward()
> listOfXs << ( x + listOfXs | x )
>
> Actually, what they usually write is:
>
> listOfXs << ( listOfXs + x )
>
> but this sends pyparsing into a recursive tailspin.
>
> So I fired up SciTE and copy/pasted your code into the editor and ran
> it, and it worked just fine - this was a shock! I studied this for a
> few minutes, and then realized what was happening. First of all, I
> misread what you posted. You posted this:
>
> grammar << ((word + grammar) | (word + Literal(end)))
>
> which works. I *thought* you posted this:
>
> grammar << ((word + grammar) | word) + Literal(end)
>
> which doesn't work. In fact this behaves the same as your original
> post, except it iterates over the input string's words recursively,
> vs. repetition ins a for-loop, as is done by OneOrMore.


I'm grateful that you actually tested my code before posting your cocky
response!


> So going back to your working version, I had to see why it works.
> Initially, the first term in the MatchFirst (the '|' operator creates
> MatchFirst instances) is just the same, and by grammar referencing
> itself, it just goes word by word through the input trying to find a
> match. I'll try to visualize the steps:
>
> level "First Second Third end"
> 1 word grammar
> 2 word grammar
> 3 word grammar
> 4 word grammar <- fails!
> 4 word end <- fails!
> (therefore level 3 grammar fails!)
> 3 word end <-- success!!!
>
> grammar has 2 options: to match a word followed by a grammar, or to
> match a word followed by 'end'. At 4 levels deep into the Forward's
> recursion, the first option fails, because after parsing "end" as the
> initial word, there is no more text to try to match against grammar.
> Level 4's Forward then also tries to match a word followed by 'end',
> but this fails for the same reason. So at this point, the 4th level
> Forward fails to match either of its options, so it throws its
> exception back up to level 3, indicating that the first alternative,
> word followed by grammar, failed. Level 3 then moves on to see if
> word followed by the literal 'end' matches, and it does - success!


This is, literally, what it's doing. I'm not exactly a programming whiz
so I think of it a little more abstractly. In pyparsing's implementation,
it recurses through the first rule, OneOrMore, then iteratively moves to the
next rule, only to fail. So, obviously that rule must be part of the
recursion if we're not to use backtracking or lookaheads. If it's part of
the recursion, it has to be the terminal case. We know that the OneOrMore
rule is necessarily followed by the literal, so we can safely conclude that
the terminal case will necessarily be followed by the literal, hence the
production mentioned...


> Here's where I am stuck now. In the original grammar that you posted,
> which you want to render into this behavior, grammar is defined as:
>
> grammar = OneOrMore(Word(alphas)) + Literal('end')
>
> Unfortunately, when the OneOrMore gets constructed, it does not have
> any visibility beyond knowing what is to be repeated. Again, here is
> the data structure that is being built:
>
> - And
> - OneOrMore
> - Word(alphas)
> - Literal('end')
>
> Only at the level of the And is there any awareness that the OneOrMore
> is followed by anything, let alone by something which could be
> misinterpreted as something matching the OneOrMore's repetition
> expression.
>
> Can you suggest a way I could generalize this, so that OneOrMore stops
> matching before it gets to 'end'?


Again, I'm not an elite hacker so it's hard to give a good suggestion.
I might have implemented the parser by having its input be a string
specifying the grammar but, then again, I'd need a parser to prase that
string! Oh, the irony...
Probably, I would have constructed the rule tree differently. I like
how pyparsing uses the Python interpreter to parse the input string, the
Python code, itself, as pyparsing's input. That's worth preserving...
So, instead of organizing the tree by rule categories, as pyparsing
does:


-And
-OneOrMore
-Word(alphas)
-Literal('end')


...I would have organized it by what the grammar actually expects. It's
hard for me to describe what I'm thinking as a simple hierarchy. I could
use a directed graph but that's hard to type, so I'll just use "labels":


start:
call OneOrMore:

OneOrMore:
Or(
And(
Word(alphas),
call OneOrMore,
And(
Word(alphas),
Literal('end')))


...hopefully, that indentation makes some sort of sense. I would
imagine that it would be constructed like this:


# This translates to temp << ((word + temp) | word)
temp = OneOrMore(word)

# Instead of wrapping this in an And structure,
# tell temp to And this with its terminal case...
grammar = temp + literal


In my parser, everything would be recursive, hence the term "recursive
descent" parser. So, a list of Ands like:


grammar = Literal('a') + Literal('b') + Literal('c')


...would look like this:


start:
call And_a:

And_a:
Literal('a') + call And_b:

And_b:
Literal('b') + call And_c:

And_c:
Literal('c')


...notice the utter lack of iteration.
Now, this example isn't, strictly speaking, recursive, but it descends
through a set of function calls that are, potentionally, recursive...
Now, obviously things like Or will have iterate through its cases but
that's unavoidable and should be okay. The other constructions should need
some thought, too. However, hopefully, this technique will render some
constructions unnecessary...
So, in my parser, the grammar is represented by a directed graph (the
temporary variable flowing through operations like a + b + c + d) and all
operations modify this graph. Obviously, some research needs to be done on
how feasible this implementation is, especially with the rich set of
operations available in pyparsing, but I'm hopeful...
Hopefully this all makes sense. Thanks for reading...




 
Reply With Quote
 
Just Another Victim of the Ambient Morality
Guest
Posts: n/a
 
      11-04-2007

"Neil Cerutti" <(E-Mail Removed)> wrote in message
newsz3Xi.39812$(E-Mail Removed). net...
> On 2007-11-03, Paul McGuire <(E-Mail Removed)> wrote:
>> On Nov 3, 12:33 am, "Just Another Victim of the Ambient Morality"
>><(E-Mail Removed)> wrote:
>>> It has recursion in it but that's not sufficient to call it a
>>> recursive
>>> descent parser any more than having a recursive implementation of the
>>> factorial function. The important part is what it recurses through...

>>
>><snip>
>>
>>> In my opinion, the rule set I mentioned in my original post:
>>>
>>> grammar = OneOrMore(Word(alphas)) + Literal('end')
>>>
>>> ...should be translated (internally) to something like this:
>>>
>>> word = Word(alphas)
>>> grammar = Forward()
>>> grammar << ((word + grammar) | (word + Literal(end)))
>>>
>>> This allows a recursive search to find the string correct without
>>> any
>>> need for "backtracking," if I understand what you mean by that.
>>> Couldn't
>>> pyparsing have done something like this?
>>>

>>
>> Dear JAVotAM -
>>
>> This was a great post! (Although I'm not sure the comment in the
>> first paragraph was entirely fair - I know the difference between
>> recursive string parsing and recursive multiplication.)
>>
>> You really got me thinking more about how this recursion actually
>> behaves, especially with respect to elements such as OneOrMore. Your
>> original question is really quite common, and up until now, my stock
>> answer has been to use negative lookahead. The expression you posted
>> is the first time I've seen this solution, and it *does* work.

>
> Is there not an ambiguity in the grammar?
>
> In EBNF:
>
> goal --> WORD { WORD } END
>
> WORD is '[a-zA-Z]+'
> END is 'end'
>
> I think it is fine that PyParsing can't guess what the composer
> of that grammar meant.


First, I don't know if that constitutes an ambiguity in the grammar.
'end' is a word but it's unambiguous that this grammar must end in a literal
'end'. You could interpret the input as just a sequence of words or you
could interpret it as a sequence of words terminated by the word 'end'. One
interpretation conforms to the grammar while the other doesn't. You would
assume that the interpretation that agrees with the grammar would be the
preferable choice and so should the program...
Secondly, even if it is an ambiguity... so what? pyparsing's current
behaviour is to return a parse error, pretending that the string can't be
parsed. Ideally, perhaps it should alert you to the ambiguity but, surely,
it's better to return _a_ valid parsing than to pretend that the string
can't be parsed at all...





 
Reply With Quote
 
Neil Cerutti
Guest
Posts: n/a
 
      11-04-2007
On 2007-11-04, Just Another Victim of the Ambient Morality
<(E-Mail Removed)> wrote:
>
> "Neil Cerutti" <(E-Mail Removed)> wrote in message
> newsz3Xi.39812$(E-Mail Removed). net...
>> On 2007-11-03, Paul McGuire <(E-Mail Removed)> wrote:
>>> On Nov 3, 12:33 am, "Just Another Victim of the Ambient Morality"
>>><(E-Mail Removed)> wrote:
>>>> It has recursion in it but that's not sufficient to call it a
>>>> recursive
>>>> descent parser any more than having a recursive implementation of the
>>>> factorial function. The important part is what it recurses through...
>>>
>>><snip>
>>>
>>>> In my opinion, the rule set I mentioned in my original post:
>>>>
>>>> grammar = OneOrMore(Word(alphas)) + Literal('end')
>>>>
>>>> ...should be translated (internally) to something like this:
>>>>
>>>> word = Word(alphas)
>>>> grammar = Forward()
>>>> grammar << ((word + grammar) | (word + Literal(end)))
>>>>
>>>> This allows a recursive search to find the string correct without
>>>> any
>>>> need for "backtracking," if I understand what you mean by that.
>>>> Couldn't
>>>> pyparsing have done something like this?
>>>>
>>>
>>> Dear JAVotAM -
>>>
>>> This was a great post! (Although I'm not sure the comment in the
>>> first paragraph was entirely fair - I know the difference between
>>> recursive string parsing and recursive multiplication.)
>>>
>>> You really got me thinking more about how this recursion actually
>>> behaves, especially with respect to elements such as OneOrMore. Your
>>> original question is really quite common, and up until now, my stock
>>> answer has been to use negative lookahead. The expression you posted
>>> is the first time I've seen this solution, and it *does* work.

>>
>> Is there not an ambiguity in the grammar?
>>
>> In EBNF:
>>
>> goal --> WORD { WORD } END
>>
>> WORD is '[a-zA-Z]+'
>> END is 'end'
>>
>> I think it is fine that PyParsing can't guess what the composer
>> of that grammar meant.

>
> First, I don't know if that constitutes an ambiguity in the
> grammar. 'end' is a word but it's unambiguous that this grammar
> must end in a literal 'end'. You could interpret the input as
> just a sequence of words or you could interpret it as a
> sequence of words terminated by the word 'end'.


Yeah. If it were a regex, it would be: '[ab]+b'. That is a fine
regex, because a regex is generally just a recognizer; the
ambiguity doesn't have to do with recognizing the language. But
PyParsing is parser. The ambiguity is in deciding what's a
Word(alphas) and what's an 'end'.

> One interpretation conforms to the grammar while the other
> doesn't. You would assume that the interpretation that agrees
> with the grammar would be the preferable choice and so should
> the program. Secondly, even if it is an ambiguity... so what?
> pyparsing's current behaviour is to return a parse error,
> pretending that the string can't be parsed. Ideally, perhaps
> it should alert you to the ambiguity but, surely, it's better
> to return _a_ valid parsing than to pretend that the string
> can't be parsed at all...


I wouldn't characterize it as pretending. How would you parse:

hello end hello end

"WORD END WORD END" and "WORD WORD WORD END" are both valid
interpretations, according to the grammar.

As soon as you remove the ambiguity from the grammar, PyParsing
starts to work correctly.

Consider writing a recursive decent parser by hand to parse the
language '[ab]+b'.

goal --> ab_list 'b'
ab_list --> 'a' list_tail
ab_list --> 'b' list_tail
list_tail --> 'a' list_tail
list_tail --> 'b' list_tail
list_tail --> null


The above has the exact same bug (and probably some others--I'm
sorry unable to test it just now) as the PyParsing solution.

The error is in the grammar. It might be fixed by specifying that
'b' must be followed by EOF, and then it could be coded by using
more than one character of lookahead.

class ParseError(Exception):
pass

class Parser:
def __init__(self, s):
self.s = s
self.i = 0
self.c = self.s[self.i]

def match(self, c):
if self.c != c:
raise ParseError('expected %s got %s' % (c, self.c))
self.i += 1
if self.i == len(self.s):
raise ParseError('unexpected EOF')
self.c = self.s[self.i]

def goal(self):
self.ab_list()
self.match('b')

def ab_list(self):
if self.c in 'ab':
self.match(self.c)
self.list_tail()
else:
raise ParseError('expected list of a or b, got %s' % self.c)

def list_tail(self):
if self.c in 'ab':
self.match(self.c)
self.list_tail()
else:
raise ParseError('expected a list of a or b, got %s' % self.c)

>>> Parser('aaab').goal()

Traceback (most recent call last):
File "py.py", line 37, in ?
Parser('aaab').goal()
File "py.py", line 20, in goal
self.ab_list()
File "py.py", line 26, in ab_list
self.list_tail()
File "py.py", line 33, in list_tail
self.list_tail()
File "py.py", line 33, in list_tail
self.list_tail()
File "py.py", line 32, in list_tail
self.match(self.c)
File "py.py", line 16, in match
raise ParseError('unexpected EOF')
__main__.ParseError: unexpected EOF

It's not a coincidence that is has the same bug as the PyParsing
solution. You can remove the ambiguity in several ways, perhaps
by specifying where EOF should be (you seem to be assuming this
in your interpretation of the grammar, but PyParsing does not).

goal --> ab_list 'b' EOF
ab_list --> 'a' list_tail
ab_list --> 'b' list_tail
list_tail --> 'a' list_tail
list_tail --> 'b' list_tail
list_tail --> null

I need to change goal, match and list_tail for this new grammar,
and add an EOF object and a peek method.

....
EOF = object()
....
def match(self, c):
if self.c != c:
raise ParseError('expected %s got %s' % (c, self.c))
if self.i >= len(self.s)-1:
self.c = EOF
self.i = len(self.s)
else:
self.i += 1
self.c = self.s[self.i]

def peek(self, lookahead):
if self.i+lookahead >= len(self.s):
return EOF
else:
return self.s[self.i + lookahead]
...
def list_tail(self):
if self.c == 'a':
self.match('a')
self.list_tail()
elif self.c == 'b':
if self.peek(1) != EOF:
self.match('b')
self.list_tail()
else:
raise ParseError('expected list of a or b, got %s' % self.c)

The grammar now has the problem that it's LL(2) rather than
LL(1). PyParsing can handle that type of grammar with negative
lookahead assertions, I think.

--
Neil Cerutti
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Recursive descent algorithm able to parse Python? seberino@spawar.navy.mil Python 9 10-07-2006 03:34 AM
Can recursive descent parser handle Python grammar? seberino@spawar.navy.mil Python 4 10-02-2006 08:58 AM
[Snippet] a Recursive Descent Parser via TDD - recursiveDescentParser.h Phlip C++ 6 08-05-2004 03:12 AM
[Snippet] a Recursive Descent Parser via TDD - test.h Phlip C++ 0 08-02-2004 07:59 PM
[Snippet] a Recursive Descent Parser via TDD - recursiveDescentParser.cpp Phlip C++ 0 08-02-2004 07:48 PM



Advertisments