Velocity Reviews > Newsgroups > Programming > C Programming > A very simple parser with scanf & C
A very simple parser with scanf & C

 
 
Ben Bacarisse
09-24-2011
"BartC" <(E-Mail Removed)> writes:

> "Ben Bacarisse" <(E-Mail Removed)> wrote in message
> news:0.f03c4aaff4700164ddab.20110924003423BST.8762 (E-Mail Removed)...
>> "BartC" <(E-Mail Removed)> writes:

>
>>> Also both
>>> "ABC" and "ABCDEF" might be valid keywords, and the wrong one will be
>>> picked. Probably the error will be picked up eventually, but it's no
>>> longer clear where it occurred.

>>
>> I am not sure I follow. I think you are saying that you should not
>> match "ABC" followed by "DEF" when the longer string "ABCDEF" should be
>> a single (preferred) match. That's like saying "you need to code the
>> matching correctly". Sure, don't accept "ABC" early if you might need
>> to accept something longer. That does not seem to be the fault of using
>> a length-limited compare -- it's the fault of getting the algorithm
>> wrong. I'd be happy to match both using a length-limited compare, I'd
>> just make sure to match from longest token to shortest.

>
> Clearly this can be made to work, *if* you're aware of the implications and
> can work around any potential issues.
>
> But as I said, I'd be a little uneasy using it, if the test string is not
> well defined. Because you're trying to match, say, "ABC" again, with a test
> string that might be A*****... or one that might be ABC*****..., where *
> represents any possible character, and which might not be intended to form
> part of your test token.


I just don't follow. If you are looking for ABC and you find ABC you
have found what you are looking for. You seem to be saying that the BC
might be there somehow "by accident" but that can't be what you mean.
If you can't rely on the string contents being "intended", how can you
parse it using any method? Maybe you could give a concrete example --
a simple set of tokens and some example strings which justifies your
unease.

> I would rather find the extent of the test token, then you know the lengths
> of both strings being compared. Now, if the lengths are unequal, you can
> move on to the next test. And if the strings are in fact identical, you
> know for sure that first token is correct.


How does this help with characters that are there but not "intended to
form part of" the token? If you are looking for a C operator and > is
followed by =, say, you have to take it. You can't "find the extent" of
the token without looking at the characters that might make it up.
"1>=+2" is no different to "1ABC+2". If BC are there, and ABC is a
valid token, you've found one (always presuming >=+ and ABC+ are not
also, longer, valid tokens).

<snip>
--
Ben.
 
BartC
09-24-2011
"Ben Bacarisse" <(E-Mail Removed)> wrote in message
news:0.f62ca533fe3fac1321b9.20110924024253BST.87zk (E-Mail Removed)...
> "BartC" <(E-Mail Removed)> writes:


>> But as I said, I'd be a little uneasy using it, if the test string is not
>> well defined. Because you're trying to match, say, "ABC" again, with a
>> test
>> string that might be A*****... or one that might be ABC*****..., where *
>> represents any possible character, and which might not be intended to
>> form
>> part of your test token.

>


> Maybe you could give a concrete example --
> a simple set of tokens and some example strings which justifies your
> unease.


Token: "proc" (define a procedure).

Input: processno=(a+b) (an assignment)

The parser would match the "proc" in "processno" and tell me it's a
procedure definition, or the start of one:

'proc essno=(a+b)'

If the input is *supposed* to be an assignment, I'd be rather cross with my
parser if it told me the next token was a 'proc' keyword! It would likely
generate a syntax error at some point, but that's not right because there's
nothing wrong with the input.

It depends also on how it deals with expected whitespace or separator after
the token; the parser could check for this and only recognise a 'proc' token
if that separator was present. But this is little different from my
suggestion to find the extent of the token in the first place.

The very fact that you have to think about this stuff at all suggests that
there is a better way! Which I consider to be simpler and more efficient
than blindly testing every possible length of input sequence. For example, is
there really much point in trying to see if "unsigned" matches
"C->sd=AE..."? Instead you get the token "C", and while you still have to do
a lookup, you only need an actual string compare against single-letter
keywords, if there are any.




Bartc

 
BartC
09-24-2011

"BartC" <(E-Mail Removed)> wrote in message news:j5ke3c$dp5$(E-Mail Removed)...
> "Ben Bacarisse" <(E-Mail Removed)> wrote in message


>> Maybe you could give a concrete example --


>for example is
> there really much point in trying to see if "unsigned" matches
> "C->sd=AE..."? Instead you get the token "C", and while you still have to
> do
> a lookup, you only need an actual string compare against single-letter
> keywords, if there are any..


Which brings up another point (before I clicked the wrong button..); exactly
how *do* you look up a string S in a table, if you don't even know its
length?

Some lookup methods will work (the linear one the OP used, for a start), and
perhaps some hairy ones which navigate a tree by indexing by each letter in
turn or something (and can still turn up false positives). But the hash
table method I normally use would have difficulty.

--
Bartc

 
Ben Bacarisse
09-24-2011
"BartC" <(E-Mail Removed)> writes:

> "Ben Bacarisse" <(E-Mail Removed)> wrote in message
> news:0.f62ca533fe3fac1321b9.20110924024253BST.87zk (E-Mail Removed)...
>> "BartC" <(E-Mail Removed)> writes:

>
>>> But as I said, I'd be a little uneasy using it, if the test string is not
>>> well defined. Because you're trying to match, say, "ABC" again, with a
>>> test
>>> string that might be A*****... or one that might be ABC*****..., where *
>>> represents any possible character, and which might not be intended to
>>> form
>>> part of your test token.

>>

>
>> Maybe you could give a concrete example --
>> a simple set of tokens and some example strings which justifies your
>> unease.

>
> Token: "proc" (define a procedure).
>
> Input: processno=(a+b) (an assignment)
>
> The parser would match the "proc" in "processno" and tell me it's a
> procedure definition, or the start of one:
>
> 'proc essno=(a+b)'


And it would be right to do so. You haven't given the full example
whereas the OP did. The full example includes other tokens which can be
longer than "proc" and can have proc as a prefix. If the OP had
specified that arbitrary identifiers have to be matched as well, and
that some of them may start with "DOOR", do you think I'd have said his
parser was OK? Of course not, but he didn't. The specification was
admirably clear, which made me not at all uneasy about using a
length-limited compare.

Of course, you may be right. Maybe this is the start of a larger project
that will one day include arbitrary identifiers, or keywords like "door"
and "doors", at which point the OP will need some full-blown lexer and
parser. But maybe not; sometimes the simple solution is the right one.

> If the input is *supposed* to be an assignment, I'd be rather cross with my
> parser if it told me the next token was a 'proc' keyword! It would likely
> generate a syntax error at some point, but that's not right because there's
> nothing wrong with the input.
>
> It depends also on how it deals with expected whitespace or separator after
> the token; the parser could check for this and only recognise a 'proc' token
> if that separator was present. But this is little different from my
> suggestion to find the extent of the token in the first place.
>
> The very fact that you have to think about this stuff at all, suggests that
> there is a better way! Which I consider to be simpler and more efficient
> than blindly testing every possible length of input sequence,


There possibly is. Please code it up and we can compare. The OP's code
was about 16 lines, not counting }s and the unnecessary if. Also I
think it was probably reasonably efficient, though that's hard to
measure.

> for example is
> there really much point in trying to see if "unsigned" matches
> "C->sd=AE..."? Instead you get the token "C", and while you still have to do
> a lookup, you only need an actual string compare against single-letter
> keywords, if there are any..


We may be talking at cross purposes. You are saying, I think, that
there is a better way to write a full-blown lexer and parser. I was
curious about your unease at using a specific kind of test in a specific
case where I felt no such unease. Maybe what I should have got from
your comment is that you think that this is one of the cases that is
just bound to get more complex, and therefore the OP should be thinking
in more general terms right from the start. You might be right about
that, and if I'd realised that was your point, I may not have commented
at all.

--
Ben.
 
Ben Bacarisse
09-24-2011
"BartC" <(E-Mail Removed)> writes:

> "BartC" <(E-Mail Removed)> wrote in message news:j5ke3c$dp5$(E-Mail Removed)...
>> "Ben Bacarisse" <(E-Mail Removed)> wrote in message

>
>>> Maybe you could give a concrete example --

>
>>for example is
>> there really much point in trying to see if "unsigned" matches
>> "C->sd=AE..."? Instead you get the token "C", and while you still
>> have to do
>> a lookup, you only need an actual string compare against single-letter
>> keywords, if there are any..

>
> Which brings up another point (before I clicked the wrong button..);
> exactly how *do* you look up a string S in a table, if you don't even
> know its length?
>
> Some lookup methods will work (the linear one the OP used, for a
> start), and perhaps some hairy ones which navigate a tree by indexing
> by each letter in turn or something (and can still turn up false
> positives). But the hash table method I normally use would have
> difficulty.


You answered your own question, albeit with a delightfully biased
description of a trie ("hairy" and prone to "false positives").

There is something a little circular about your desire to delimit the
tokens before trying to match them. This is usually the correct way to
do it because most programming languages have been designed with this
sort of lexical analysis in mind. But if the keywords include 2d, 3d
(and 47d, etc), and things like +more and <p> then it may be hard to
delimit possible candidate strings before matching them.

--
Ben.
 
BartC
09-24-2011
"Ben Bacarisse" <(E-Mail Removed)> wrote in message
news:0.4c392657a43f99297620.20110924144538BST.87fw (E-Mail Removed)...
> "BartC" <(E-Mail Removed)> writes:



> We may be talking at cross purposes. You are saying, I think, that
> there is a better way to write a full-blown lexer and parser. I was
> curious about your unease at using a specific kind of test in a specific
> case where I felt no such unease. Maybe what I should have got from
> your comment is that you think that this is one of the cases that is
> just bound to get more complex, and therefore OP should be thinking in
> more general terms right from the start. You might be right about that,
> and if I'd got that that was your point, I may not have commented at
> all.


My original reply just pointed out some possible issues with the method that
was used, and suggested that it could also be done another way. I don't
think it was too critical.

I made no comment about the overall approach (the parsing bit rather than
the lexing).

> You answered your own question, albeit with a delightfully biased
> description of a trie ("hairy" and prone to "false positives").


I'm not sure I was biased. I think that trie thing is hairier than some
other methods. And the false positives will apply to any lookup method when
the word you're looking up doesn't have a well-defined length.

--
Bartc

 
Ben Bacarisse
09-24-2011
"BartC" <(E-Mail Removed)> writes:

> "Ben Bacarisse" <(E-Mail Removed)> wrote in message
> news:0.4c392657a43f99297620.20110924144538BST.87fw (E-Mail Removed)...

<snip>
>> You answered your own question, albeit with a delightfully biased
>> description of a trie ("hairy" and prone to "false positives").

>
> I'm not sure I was biased. I think that trie thing is hairier than
> some other methods. And the false positives will apply to any lookup
> method when the word you're looking up doesn't have a well-defined
> length.


We've done this to death. I know that was what you meant by false
positives, but it's a deliberately polemical way of putting it! If
matching "abc", right here, right now (no matter what follows) is
correct then it's not a false positive -- it's a true positive. If you
can't match "abc", right here, right now, you have a bug. Your code is
wrong. Calling it a "false positive" (rather like your original word
"uneasy") suggests some sort of statistical correctness or, worse, some
sort of uncertainty about what is right at this point of the match. The
same sense of uncertainty is carried by the notion of not having "a
well-defined length". The word we are looking up (all of them in the
case of a trie) has a perfectly well-defined length and so does every
part of the sub-string we may be matching it against.

I doubt we disagree on any technical matter in the debate anymore. It's
mostly down to writing style I suspect. At least I think it is. How
would you write the OP matcher?

--
Ben.
 
BartC
09-24-2011
"Ben Bacarisse" <(E-Mail Removed)> wrote in message

> How would you write the OP matcher?


I posted the code below. I hope this does all the things I suggested!

There is a function lex() which extracts the next token from the input in a
general manner (and reports, to some extent, what type of token it is).

And there is a function parsetest(), which takes a given string, and checks
it corresponds to the OP's syntax spec.

Here, the details of the character-level handling, and the syntax parsing,
are largely separate.

(I haven't bothered with a formal table lookup; lex() uses file globals for
simplicity; and overflow of the lextoken[] array isn't checked. Some checks,
eg. lex()!='A', could probably be dispensed with. And there's supposed to be
something funny about using atoi(), but I can't remember what..)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* globals set by lex() */
int lextype;         /* One of 'A' 'N' 'P' 'L' 'E' '?' meaning Alpha, Number,
                        Punctuation, EOL, End of string, Error */
int lexleadsp;       /* Set to 1 when leading whitespace seen (requirement
                        of OP) */
char lextoken[1000]; /* Contains token for 'A', 'N' and 'P' lexical types */
char *lexptr;        /* Points to next character of input */

int lex(void) {
    char c;
    int i;

    lextoken[0]=0;
    lexleadsp=0;

    while (*lexptr!='\n' && isspace((unsigned char)*lexptr)) {
        lexleadsp=1;
        ++lexptr;
    }

    c=*lexptr;
    i=0;

    if (c==0)
        lextype='E';
    else if (isalpha((unsigned char)c)) {
        do
            lextoken[i++]=toupper((unsigned char)*lexptr++);
        while (isalpha((unsigned char)*lexptr));
        lextype='A';
    }
    else if (isdigit((unsigned char)c)) {
        do
            lextoken[i++]=*lexptr++;
        while (isdigit((unsigned char)*lexptr));
        lextype='N';
    }
    else if (c=='\n') {
        lextype='L';
        ++lexptr;
    }
    else if (ispunct((unsigned char)c)) {
        lextoken[i++]=*lexptr++;
        lextype='P';
    }
    else {
        lextype='?';
        ++lexptr;
    }

    lextoken[i]=0;
    return lextype;
}

/* Make use of lex() function to test an input string in this syntax:
     OPEN|CLOSE|LOCK DOOR [+]n
   where n must be 1 to 10, and without leading or trailing white space.
   Returns 1 (passed) or 0 (failed) */
int parsetest(char *input) {
    int n,t;

    lexptr=input;

    if (lex()!='A' || lexleadsp) return 0;
    if (strcmp(lextoken,"OPEN")!=0 &&
        strcmp(lextoken,"CLOSE")!=0 &&
        strcmp(lextoken,"LOCK")!=0)
        return 0;

    if (lex()!='A' || strcmp(lextoken,"DOOR")!=0) return 0;

    t=lex();
    if (t=='P' && lextoken[0]=='+') t=lex();

    if (t!='N') return 0;
    n=atoi(lextoken);
    if (n<1 || n>10) return 0;

    switch (lex()) {
    case 'E': case 'L':
        if (lexleadsp) return 0;
        break;
    default:
        return 0;
    }
    return 1;
}

int main(void) {
    char *teststr = "Lock DOOR 10";
    int status;

    printf("Testing: \"%s\"\n",teststr);

    status=parsetest(teststr);

    printf("Result: %s\n",status?"Passed":"Failed");
    return 0;
}

--
bartc

 
Ben Bacarisse
09-25-2011
"BartC" <(E-Mail Removed)> writes:

> "Ben Bacarisse" <(E-Mail Removed)> wrote in message
>
>> How would you write the OP matcher?

>
> I posted the code below. I hope this does all the things I suggested!
>
> There is a function lex() which extracts the next token from the input in a
> general manner (and reports, to some extent, what type of token it is).
>
> And there is a function parsetest(), which takes a given string, and checks
> it corresponds to the OP's syntax spec.
>
> Here, the details of the character-level handling, and the syntax parsing,
> are largely separate.
>
> (I haven't bothered with a formal table lookup; lex() uses file
> globals for simplicity; and overflow of the lextoken[] array isn't
> checked. Some checks, eg. lex()!='A', could probably be dispensed
> with.


I think the extra globals are a problem. To avoid the length-limited
compare, you have to store the tokens somewhere, and in C that means
either a lot of ugliness (passed-in buffers, global arrays, what have
you) or a lot of machinery. Even taking the "no machinery" approach, 16
lines with no globals has turned into about 50 with four of them.
Upthread you said this approach was simpler and more efficient. OK, the
"more efficient" probably referred to using hashing to check keywords,
but then there would be a lot more than 50 lines. Is it really simpler?
More extensible, yes, but not, I'd say, simpler.

> And there's supposed to be
> something funny about using atoi(), but I can't remember what..)


It's undefined on integer overflow.

<snip code>
Thanks for posting that.

--
Ben.
 
BartC
09-25-2011

"Ben Bacarisse" <(E-Mail Removed)> wrote in message
news:0.dc9f6dac9be14d60b0ee.20110925015426BST.871u (E-Mail Removed)...
> "BartC" <(E-Mail Removed)> writes:


>> I posted the code below. I hope this does all the things I suggested!


> Upthread you said this approach was simpler and more efficient. OK, the
> "more efficient" probably referred to using hashing to check keywords,
> but then there would be a lot more then 50 lines. Is it really simpler?
> More extensible, yes, but not, I'd say, simpler.


Simplicity is not just about line-count. It's also knowing exactly what is
being compared and being confident about the results.

Also, the actual syntax parsing *is* simpler, once the lexical stuff has
been encapsulated.

The efficiency will come from opportunities to do more streamlined lookups.
As it is, with an extra step to tokenise input before doing anything with it
(and using all those isxxx() functions), it's about half the speed of the
OP's original.

(Tokenising can also leave the token strings in-place, especially if the
source string can be modified, but it's a bit more awkward.)

--
Bartc

 