LEE Sau Dan writes:
>What? Parsing English with yacc? yacc is supposed to handle LALR(1)
>context-free grammars only. English is much more complicated than
>LALR(1), unless you deliberately trim it down into a toy language,
>resembling some programming languages.
Well, yes, that's true.
However, it's an interesting question just how significant that is.
That is, just how much of English *is*, if not LALR(1), at least
context-free, and thus amenable to an approach along the lines of yacc+lex?
Or might we even (*shudder*) consider skipping parsing altogether and going
straight to the semantics?
Granted, there are parts of English that are beyond this approach, and they
deserve (and get) plenty of syntactic attention. However, decades of CL/NLP
work have shown that there is also a vast amount of actually occurring
English speech and text that yields readily to 'brute-force' methods. This
has to be due to some characteristics of English, or (perhaps) of Language
itself.
And it might be worthwhile for linguists to be thinking about these
questions, because others certainly are. For instance, here's a relevant
quotation:
"... the complexity and power required to analyze linguistic data is
discontinuous in its distribution. Coarsely put, we have seen over and
over that the simplest tools have the broadest coverage, and more and
more complexity is required to expand the coverage less and less.
Consider the place of natural language as a whole on the Chomsky
hierarchy, for instance. Chomsky (1956) demonstrated that natural
language is at least context-free in its complexity, and after a number
of failed proofs, it is now commonly agreed that natural language is
strongly and weakly trans-context-free (Shieber 1985, Kac 1987, Culy
1985, Bresnan et al. 1982).
"Yet what is striking about these results is both the relative
infrequency of constructions which demonstrate this complexity and the
increase in computational power required to account for them. For
example, the constructions which are necessarily at least context-free
(such as center embedding) seem fairly uncommon in comparison with
constructions which could be fairly characterized as finite state; the
constructions which are necessarily trans-context-free are even fewer.
"In other words, a large subset of language can be handled with
relatively simple computational tools; a much smaller subset requires a
radically more expensive approach; and an even smaller subset something
more expensive still. This observation has profound effects on the
analysis of large corpora: there is a premium on identifying those
linguistic insights which are simplest, most general, least
controversial, and most powerful, in order to exploit them to gain the
broadest coverage for the least effort."
--- "Theoretical and Computational Linguistics: Toward a Mutual
Understanding", by Sam Bayer & the MITRE Natural Language Group,
Chapter 8 of "Using Computers in Linguistics" (Routledge 1998:212)
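
The contrast drawn in the quotation can be made concrete with a small
sketch. This is only an illustration under assumptions of my own choosing:
the toy lexicon, the one-letter tags, and the function names `is_flat_np`
and `is_center_embedded` are hypothetical and come from neither the chapter
nor any library. A flat noun phrase is matched with an ordinary regular
expression (a finite-state device), whereas a center-embedded clause of the
shape (Det N)^n V^n needs at least a counter, the simplest stand-in for the
stack a context-free parser would use.

```python
import re

# Hypothetical toy lexicon: word -> one-letter part-of-speech tag.
TAGS = {
    "the": "D", "a": "D",
    "big": "A", "red": "A", "old": "A",
    "dog": "N", "cat": "N", "rat": "N",
    "chased": "V", "bit": "V", "ate": "V",
}

def to_tags(sentence):
    """Map a sentence to a string of tags; unknown words become '?'."""
    return "".join(TAGS.get(word, "?") for word in sentence.lower().split())

# Finite-state pattern: determiner, any number of adjectives, noun.
FLAT_NP = re.compile(r"DA*N")

def is_flat_np(sentence):
    """True if the sentence is a flat noun phrase such as 'the big red dog'."""
    return FLAT_NP.fullmatch(to_tags(sentence)) is not None

def is_center_embedded(sentence):
    """True if the tags have the shape (D N)^n V^n for some n >= 1.

    The count of noun phrases must be remembered and matched against the
    count of verbs, which no fixed finite-state pattern can do for
    unbounded n.
    """
    tags = to_tags(sentence)
    depth = 0
    pos = 0
    # Count the leading determiner-noun pairs.
    while tags.startswith("DN", pos):
        depth += 1
        pos += 2
    # Require exactly as many verbs as noun phrases, and nothing else.
    return depth > 0 and tags[pos:] == "V" * depth
```

With this lexicon, `is_flat_np("the big red dog")` succeeds using the
regular expression alone, while `is_center_embedded` accepts "the rat the
cat the dog chased bit ate" and rejects the same string with one verb
dropped. The second function is barely longer here, but only because the
sketch covers a single construction; the point of the quotation is that
such cases are rare relative to the flat ones.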
- John Lawler, University of Michigan Linguistics Department
------------------------------------------------------------------
"Using Computers in Linguistics: A Practical Guide" Routledge 1998
http://www.routledge.com/linguistics...html#chapter.8