Re: Parsing English with lex and yacc

 
 
Arthur T. Murray
 
      01-23-2004
(Amnon Meyers) wrote on 16 Jan 2004:
> [...]
> I of course recommend using TAIParse with a VisualText trial version.
> http://www.textanalysis.com


> Amnon


Thank you. TAIParse has been added as a link for AI4U users at
http://mentifex.virtualentity.com/parser.html#taiparse (q.v.).

Text AI (Text Analysis International) has been added at
http://mentifex.virtualentity.com/english.html#textai

NLP++ (the C++ -like programming language) has been added at
http://mentifex.virtualentity.com/cpp.html -- C++ AI Weblog.

ATM
--
http://www.mail-archive.com/.../msg01548.html
http://www.amazon.com/exec/obidos/ASIN/0595654371/ -- AI4U
 
 
 
 
 
LEE Sau Dan
 
      01-23-2004
>>>>> "Arthur" == Arthur T Murray <> writes:

Arthur> (Amnon Meyers) wrote on 16
Arthur> Jan 2004:
>> [...] I of course recommend using TAIParse with a VisualText
>> trial version. http://www.textanalysis.com


>> Amnon


Arthur> Thank you. TAIParse has been added as a link for AI4U
Arthur> users at
Arthur> http://mentifex.virtualentity.com/parser.html#taiparse
Arthur> (q.v.).

Arthur> Text AI (Text Analysis International) has been added at
Arthur> http://mentifex.virtualentity.com/english.html#textai

Arthur> NLP++ (the C++ -like programming language) has been added
Arthur> at http://mentifex.virtualentity.com/cpp.html -- C++ AI
Arthur> Weblog.

Arthur> ATM --
Arthur> http://www.mail-archive.com/.../msg01548.html
Arthur> http://www.amazon.com/exec/obidos/ASIN/0595654371/ -- AI4U

What? Parsing English with yacc? yacc is supposed to handle LALR(1)
context-free grammars only. English is much more complicated than
LALR(1), unless you deliberately trim it down into a toy language,
resembling some programming languages.
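
One way to make this concrete: every LALR(1) grammar is unambiguous, so a grammar that assigns two trees to one sentence cannot be LALR(1), and yacc will report conflicts on it. The sketch below is only an illustration under assumptions: the toy grammar, the lexicon and the function name `count_parses` are made up for this example, and it uses a CYK-style chart (not yacc) to count the trees for a classic PP-attachment sentence.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (parent, left, right).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: word -> possible categories.
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}


def count_parses(words, start="S"):
    """Count distinct parse trees for `words` with a CYK-style chart."""
    n = len(words)
    # chart[i][j][cat] = number of trees deriving words[i:j] from cat
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for cat in LEXICON.get(word, []):
            chart[i][i + 1][cat] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][start]


if __name__ == "__main__":
    print(count_parses("I saw the man".split()))                     # 1
    print(count_parses("I saw the man with the telescope".split()))  # 2
```

The second sentence gets two trees because "with the telescope" can attach to "saw" or to "the man". A deterministic LALR(1) parser has to commit to one of them.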


--
Lee Sau Dan 李守敦

Home page: http://www.informatik.uni-freiburg.de/~danlee
 
 
 
 
 
John Lawler
 
      01-26-2004
LEE Sau Dan <> writes:

>What? Parsing English with yacc? yacc is supposed to handle LALR(1)
>context-free grammars only. English is much more complicated than
>LALR(1), unless you deliberately trim it down into a toy language,
>resembling some programming languages.


Well, yes, that's true.
However, it's an interesting question just how significant that is.

That is, just how much of English *is*, if not LALR(1) C-F, at least C-F,
and thus amenable to an approach along the lines of yacc+lex? Or, might we
even (*shudder*) consider skipping parsing altogether and going straight to
the semantics?

Granted, there are parts of English that are beyond this approach, and they
deserve (and get) plenty of syntactic attention. However, decades of CL/NLP
work have shown that there is also a vast amount of actually occurring
English speech and text that yields readily to 'brute-force' methods. This
has to be due to some characteristics of English, or (perhaps) of Language
itself.

And it might be worthwhile for linguists to be thinking about these
questions, because others certainly are. For instance, here's a relevant
quotation:

"... the complexity and power required to analyze linguistic data is
discontinuous in its distribution. Coarsely put, we have seen over and
over that the simplest tools have the broadest coverage, and more and
more complexity is required to expand the coverage less and less.
Consider the place of natural language as a whole on the Chomsky
hierarchy, for instance. Chomsky (1956) demonstrated that natural
language is at least context-free in its complexity, and after a number
of failed proofs, it is now commonly agreed that natural language is
strongly and weakly trans-context-free (Shieber 1985, Kac 1987, Culy
1985, Bresnan et al. 1982).

"Yet what is striking about these results is both the relative
infrequency of constructions which demonstrate this complexity and the
increase in computational power required to account for them. For
example, the constructions which are necessarily at least context-free
(such as center embedding) seem fairly uncommon in comparison with
constructions which could be fairly characterized as finite state; the
constructions which are necessarily trans-context-free are even fewer.

"In other words, a large subset of language can be handled with
relatively simple computational tools; a much smaller subset requires a
radically more expensive approach; and an even smaller subset something
more expensive still. This observation has profound effects on the
analysis of large corpora: there is a premium on identifying those
linguistic insights which are simplest, most general, least
controversial, and most powerful, in order to exploit them to gain the
broadest coverage for the least effort."

--- "Theoretical and Computational Linguistics: Toward a Mutual
Understanding", by Sam Bayer & the MITRE Natural Language Group
Chapter 8 of "Using Computers in Linguistics" (Routledge 1998:212)
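
A small sketch of the gap the quotation describes, under stated assumptions: sentences are reduced to hypothetical tag strings where "N" stands for a noun phrase and "V" for a verb, so center embedding ("the rat the cat the dog chased bit died") shows up as N^n V^n. The names `MAX_DEPTH`, `finite_state_accepts` and `context_free_accepts` are invented for the example. A finite-state pattern can only list the depths it was built for, while a recognizer with a counter (a one-symbol stack) handles any depth.

```python
import re

# Finite-state approximation: spell out each embedding depth up to a bound.
MAX_DEPTH = 2
FINITE_STATE = re.compile(
    "|".join("N" * d + "V" * d for d in range(1, MAX_DEPTH + 1))
)


def finite_state_accepts(tags):
    """Accept N^d V^d only for the depths listed in the pattern."""
    return FINITE_STATE.fullmatch("".join(tags)) is not None


def context_free_accepts(tags):
    """Accept N^n V^n for any n >= 1, using a counter as a stack."""
    depth = 0
    i = 0
    # Push phase: count the opening noun phrases.
    while i < len(tags) and tags[i] == "N":
        depth += 1
        i += 1
    if depth == 0:
        return False
    # Pop phase: each verb closes one pending noun phrase.
    while i < len(tags) and tags[i] == "V":
        depth -= 1
        i += 1
    return i == len(tags) and depth == 0


if __name__ == "__main__":
    shallow = list("NNVV")    # depth 2
    deep = list("NNNVVV")     # depth 3, e.g. the rat/cat/dog sentence
    print(finite_state_accepts(shallow), context_free_accepts(shallow))  # True True
    print(finite_state_accepts(deep), context_free_accepts(deep))        # False True
```

If depth 3 and beyond are as rare in real text as the quotation suggests, the bounded pattern already covers most of what actually occurs, which is the point being made about cheap tools and broad coverage.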

- John Lawler University of Michigan Linguistics Department
------------------------------------------------------------------
"Using Computers in Linguistics: A Practical Guide" Routledge 1998
http://www.routledge.com/linguistics...html#chapter.8
 
 
Ron Hardin
 
      01-26-2004
It was downhill after grep, in other words.
--
Ron Hardin


On the internet, nobody knows you're a jerk.
 
 
redspot
 
      01-26-2004
"John Lawler" <> wrote in message
news:N_aRb.1234$...
> "... the complexity and power required to analyze linguistic data is
> discontinuous in its distribution. Coarsely put, we have seen over and
> over that the simplest tools have the broadest coverage, and more and
> more complexity is required to expand the coverage less and less.
> Consider the place of natural language as a whole on the Chomsky
> hierarchy, for instance. Chomsky (1956) demonstrated that natural
> language is at least context-free in its complexity, and after a number
> of failed proofs, it is now commonly agreed that natural language is
> strongly and weakly trans-context-free (Shieber 1985, Kac 1987, Culy
> 1985, Bresnan et al. 1982).
>
> "Yet what is striking about these results is both the relative
> infrequency of constructions which demonstrate this complexity and the
> increase in computational power required to account for them. For
> example, the constructions which are necessarily at least context-free
> (such as center embedding) seem fairly uncommon in comparison with
> constructions which could be fairly characterized as finite state; the
> constructions which are necessarily trans-context-free are even fewer.
>
> "In other words, a large subset of language can be handled with
> relatively simple computational tools; a much smaller subset requires a
> radically more expensive approach; and an even smaller subset something
> more expensive still. This observation has profound effects on the
> analysis of large corpora: there is a premium on identifying those
> linguistic insights which are simplest, most general, least
> controversial, and most powerful, in order to exploit them to gain the
> broadest coverage for the least effort."


Could he have found a more complicated way of saying that
yes, the Law Of Diminishing Returns is still in effect?

My question is whether the ability to parse/interpret those very small
subsets is necessary. Isn't asking the speaker to clarify what he's
trying to get across a valid approach?

After all, it is possible that you are trying to parse/interpret text
that the speaker did not use correctly, in which case you
would virtually always have to ask the speaker for clarification
(or correction). Why not do the same for that small subset of
text which may be correct, but extremely difficult to interpret?

I know next to nothing about any of this. I'm just asking based on
what I hope is a common sense view and a desire to learn.


 
 
Amnon Meyers
 
      01-26-2004
LEE Sau Dan <> wrote in message news:<>...

> [snip]
> What? Parsing English with yacc? yacc is supposed to handle LALR(1)
> context-free grammars only. English is much more complicated than
> LALR(1), unless you deliberately trim it down into a toy language,
> resembling some programming languages.


Hi,

You may want to look up Masaru Tomita and Generalized LR parsing (or
GLR). This is work to handle ambiguity and perhaps other NLP
constructs in an LR parser. The original work is old (mid-1980s, with
Tomita's edited volume "Generalized LR Parsing" following in 1991),
and I don't know if much has been done since then.
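
To give a feel for the idea, here is a rough sketch that is not Tomita's algorithm: a real GLR parser drives the search from LR tables and shares work through a graph-structured stack, while this version simply forks a plain shift-reduce parser at every choice point and keeps every branch that ends in a complete sentence. The grammar, the lexicon and the name `all_parses` are assumptions made up for the illustration.

```python
# Hypothetical toy grammar with binary rules: (parent, left, right).
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "NP", "PP"),
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: word -> possible categories.
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}


def all_parses(words, start="S"):
    """Follow every shift/reduce choice and return all complete trees."""
    results = []
    # A configuration is (stack, position); stack items are (category, tree).
    agenda = [((), 0)]
    while agenda:
        stack, pos = agenda.pop()
        if pos == len(words) and len(stack) == 1 and stack[0][0] == start:
            results.append(stack[0][1])
        # Choice 1: shift the next word under each of its categories.
        if pos < len(words):
            word = words[pos]
            for cat in LEXICON.get(word, []):
                agenda.append((stack + ((cat, f"({cat} {word})"),), pos + 1))
        # Choice 2: reduce the top two stack items by any matching rule.
        if len(stack) >= 2:
            (lcat, ltree), (rcat, rtree) = stack[-2], stack[-1]
            for parent, left, right in RULES:
                if (lcat, rcat) == (left, right):
                    node = (parent, f"({parent} {ltree} {rtree})")
                    agenda.append((stack[:-2] + (node,), pos))
    return sorted(results)


if __name__ == "__main__":
    for tree in all_parses("I saw the man with the telescope".split()):
        print(tree)
```

Where a yacc-generated parser would stop at a shift/reduce conflict, this one follows both branches, so both attachments of "with the telescope" come out. The naive forking is exponential in the worst case, which is the cost the shared stack in GLR is meant to avoid.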

On a separate branch, NLP++ is a multi-pass language that integrates
recursive grammars, patterns, code, a KBMS, and more.
http://www.textanalysis.com/tai-multi2003.pdf

Amnon
 
 
 
 