Velocity Reviews


Steven 12-27-2005 05:12 PM

newbie fscanf %[ conversions, multipliers
 
Hi,

I am using fscanf() to read words, but I want to match alphanumeric
characters only. However, when I use the conversion specifier
%255[a-z,A-Z], the program prints only spaces and other non-standard
ASCII characters. I have listed a small example below. Can someone
please tell me what I am doing wrong or forgetting with regard to
the conversion specifier? Thanks!

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define MAXWORDLEN 256

int main(void) {
    char word[MAXWORDLEN];
    char *bigr[2];
    int i = 0;

    bigr[0] = calloc(MAXWORDLEN, sizeof(char));
    bigr[1] = calloc(MAXWORDLEN, sizeof(char));

    while(fscanf(stdin, "%255[a-z,A-Z]", word) != EOF) {
        strcpy(bigr[i++], word);
        if(i == 2) {
            printf("%s %s\n", bigr[0], bigr[1]);
            strcpy(bigr[0], bigr[1]);
            i = 1;
        }
    }

    return 0;
}

Chris Torek 12-27-2005 07:54 PM

Re: newbie fscanf %[ conversions, multipliers
 
In article <k2t2r1htco5qmvnl8ro4f4v62hmppij92u@4ax.com>
Steven <Steven@yahoo.com> wrote:
>I am using fscanf() to read words. But I want to match alphanumeric
>characters only. ...


>#include <stdio.h>
>#include <string.h>
>#include <stdlib.h>
>
>#define MAXWORDLEN 256
>
>int main(void) {
> char word[MAXWORDLEN];
> char *bigr[2];
> int i = 0;
>
> bigr[0] = calloc(MAXWORDLEN, sizeof(char));
> bigr[1] = calloc(MAXWORDLEN, sizeof(char));
>
> while(fscanf(stdin, "%255[a-z,A-Z]", word) != EOF) {


So far, not too bad, except that you did not store the return
value from fscanf(). There are three possible return values
for this particular call: EOF, 0, and 1 (representing "input
failure", "matching failure", and "success" respectively). This
code handles input failure, but cannot distinguish between
"matching failure" and "success".

It also seems a little odd to me that you include a comma in the
scanset, and no digits, when you say "alphanumeric only". (It is
also worth pointing out that on an EBCDIC machine, such as some
IBM mainframes, %[A-Za-z] includes some punctuation and such, as
the alphabetic characters are not contiguous. It will also not
work well with some European character sets in ISO Latin 1, where
legitimate alphabetic characters (accented letters, for example)
will be excluded.)

The biggest problem, though, is what happens if the scanf engine
succeeds. The conversion specification here is %255[ and the
scanset is "a through z" plus "," plus "A through Z": lowercase
alphabetic, comma, and uppercase alphabetic, if the machine uses
ASCII. If the input begins with at least one alphabetic character
or comma, the conversion will succeed -- fscanf will return 1 --
and the converted characters will be stored in the array named
"word", which is in fact big enough (256 characters).

Something eventually causes the scan to stop. There are only
three possibilities: an attempt to read encounters EOF; the 255
character limit runs out; or -- most likely -- the next character
in the input stream is not in the scanset. It is the third case
that is the immediate problem. When the scanf engine stops
processing input directives, whatever character(s) are in the
input stream remain in the input stream. Assuming the first
directive stops because of a space or a newline, the space or
newline remains in the stream.

The %[ directive, unlike most directives, *does not skip initial
white space* (spaces, tabs, newlines, etc).
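
The difference is easy to see with sscanf(), which follows the same
matching rules but reads from a string. This is only a sketch: the
three helper names are made up for illustration, and the scanset
drops the comma.

```c
#include <stdio.h>

/* %s skips leading white space before converting. */
int scan_s(const char *input, char word[256])
{
    return sscanf(input, "%255s", word);
}

/* %[ does not skip leading white space, so a leading blank in the
 * input is a matching failure. */
int scan_set(const char *input, char word[256])
{
    return sscanf(input, "%255[A-Za-z]", word);
}

/* A white-space directive in the format (the leading blank) skips
 * any amount of white space, including none, before %[ runs. */
int scan_set_skipping(const char *input, char word[256])
{
    return sscanf(input, " %255[A-Za-z]", word);
}
```

With the input "  abc", scan_s() and scan_set_skipping() both return
1 and store "abc", while scan_set() returns 0. The leading blank in
the format only deals with white space; punctuation in the input
still stops the scan.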

The code inside the loop also does not skip white space:

> strcpy(bigr[i++], word);
> if(i == 2) {
> printf("%s %s\n", bigr[0], bigr[1]);
> strcpy(bigr[0], bigr[1]);
> i = 1;
> }
> }


Thus, on the next trip through the loop, the first character that
the fscanf() call encounters will be the whitespace left behind by
the previous fscanf(). This will cause a "matching failure", so
that the second fscanf() will return 0, leaving the "word" array
unmodified.
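
To watch this happen, one can make the same call a few times and
record what comes back. A small sketch, with a made-up helper name
and a FILE * parameter standing in for stdin:

```c
#include <stdio.h>

/* Hypothetical helper: make the original fscanf() call n times on fp
 * and store each return value in results[]. */
void trace_calls(FILE *fp, char word[256], int results[], int n)
{
    for (int i = 0; i < n; i++)
        results[i] = fscanf(fp, "%255[a-z,A-Z]", word);
}
```

On a stream containing "foo bar", three calls give 1, 0, 0, and word
still holds "foo" afterwards. Since 0 is not EOF, the original while
loop never ends; it keeps copying the same stale word.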

You could attempt to fix this by skipping whitespace inside the
loop:

#include <ctype.h> /* with the other #includes */
...
/* somewhere inside the loop */
int c;

while ((c = getc(stdin)) != EOF && isspace(c))
    continue;
if (c != EOF)
    ungetc(c, stdin);

but this is not quite correct. Suppose the scanf engine eventually
stops, but not because of whitespace, not because of EOF, and not
because the 255-character limit ran out: suppose it stops because
the next input character available is, e.g., '('. This is not
alphabetic but is also not whitespace, so isspace() will say "not
space".

In fact, what you need is "read and convert stuff that *is* part
of a word" interleaved between "read and discard stuff that is
*not* part of a word". The question then becomes whether the file
must begin with a "word", or will you allow "non-word" stuff to
come before the first "word".

> return 0;
>}


Good, main() needs a return value. :-)

It *is* possible to do this with the scanf engine, but you will
need at least two calls to it unless the file *must* begin with a
word. In the latter case, you can do:

for (;;) {
    /* fscanf(stdin, fmt, ...) == scanf(fmt, ...) */
    result = scanf("%255[A-Za-z0-9]%*[^A-Za-z0-9]", word);
    if (result == EOF)
        break;
    if (result == 0)
        ... do something ...

This scanf directive-pair means: "Read and convert stuff in the
character class, with an input failure if EOF occurs before any
input, or a matching failure if there are no characters in the
class. Then, if no failure, read and discard stuff (not) in the
character class, with an input failure if EOF occurs before any
further characters are input, or a matching failure if the next
input character is in the class." The return value will be EOF
if input failure occurred before any data were stored in the array
named word, 0 if a matching failure occurred before any data were
stored in the array, or 1 if data were stored in the array. (You
get no notice if the second directive fails, due to the assignment
suppression.)

The second character-class is negated because of the "^", hence
the (not). Note that either directive can fail if there is not at
least one character in (or not in) the class: %[ demands that at
least one character be read (and discarded for %*[, or assigned
for %[).

The scanf() above will "get stuck" if the file begins with a non-word
character. Suppose the first character is ':' (colon), for instance.
The first %[ directive will see the colon and fail with a matching
failure, terminating the scan, returning 0, leaving the colon in
the input stream. A subsequent trip through the loop will again
see the colon and again cause the scanf to terminate with a matching
failure, returning 0.
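
A sketch of that "stuck" behaviour, using a FILE * parameter instead
of stdin so it can be tried on a temporary file. Both helper names
are invented for the example.

```c
#include <stdio.h>

/* The single-call form from above: read a word, then discard the
 * non-word characters that follow it. */
int word_then_skip(FILE *fp, char word[256])
{
    return fscanf(fp, "%255[A-Za-z0-9]%*[^A-Za-z0-9]", word);
}

/* Hypothetical helper: call word_then_skip() `tries` times and count
 * how many of those calls return 0 (matching failure). */
int count_stuck_calls(FILE *fp, int tries)
{
    char word[256];
    int stuck = 0;

    for (int i = 0; i < tries; i++)
        if (word_then_skip(fp, word) == 0)
            stuck++;
    return stuck;
}
```

On a stream holding ":abc def", every one of the calls returns 0 and
the colon is still the next character to be read. On "abc def" none
of them returns 0: the calls yield 1, 1, and then EOF.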

If you wish to discard "non-word" characters, but allow the case
of "no non-word characters", you can invoke scanf twice:

#define WORD_CLASS "A-Za-z0-9"

result = scanf("%*[^" WORD_CLASS "]");
/* XXX: throw above result away */

result = scanf("%255[" WORD_CLASS "]", word);
if (result == EOF)
    break;
if (result == 0)
    ... panic -- this should never happen ...

Here, the first call is allowed to fail with a matching failure if
there is a "word-class" character. In this case, it leaves the
"word-class" character in the input stream, and the second scanf
will find it there. It is also allowed to fail (silently) with an
input failure, in the hopes that the second scanf will also
immediately encounter input failure (this is likely, but not
guaranteed -- if you want to avoid the situation, you could test
the first result). And of course, it is allowed to succeed,
eating up all "non-word" characters and leaving either EOF or
a "word" character for the second scanf.

You cannot combine these two calls into one, because if the stream
currently begins with a valid word character, the negated class
directive ("%[^...]") will cause the scanf call to fail, and return
without converting-and-assigning into the "word" array.
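
Wrapped up as a function, the two calls might look like the sketch
below. The name next_word() and the FILE * parameter are assumptions
made for the example, and this version does test the first result
instead of throwing it away.

```c
#include <stdio.h>

#define WORD_CLASS "A-Za-z0-9"

/* Hypothetical wrapper around the two calls: skip any non-word
 * characters, then read one word of at most 255 characters.
 * Returns 1 if a word was stored, EOF at end of input. */
int next_word(FILE *fp, char word[256])
{
    int result;

    /* Discard non-word characters.  A result of 0 covers both "some
     * were discarded" and "there were none" (matching failure); EOF
     * is passed straight back rather than left for the second call. */
    result = fscanf(fp, "%*[^" WORD_CLASS "]");
    if (result == EOF)
        return EOF;

    result = fscanf(fp, "%255[" WORD_CLASS "]", word);
    return result;      /* 1 or EOF; 0 should not happen here */
}
```

Called repeatedly on "::hello, world42 (x)", this yields "hello",
"world42" and "x", and then EOF.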

Finally, two more notes.

First: suppose an input word exceeds 255 characters in length. A
loop of the form:

for (;;) {
    /* read and discard any non-word characters, allowing none */
    ...
    /* read and convert valid "word" characters, requiring 1 or
       more but stopping after 255 even if there are more */
    ...
    /* do something with the word */
}

will consider the remaining character(s) -- up to the next 255 --
as an additional, separate word, even though the two input "words"
were not separated by any non-word characters.

This may be what you want, or may not.
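
The splitting is easier to see with a much smaller limit. The sketch
below uses a width of 3 instead of 255 and scans a string with
sscanf(), using %n to find out how far each call got. The helper
name and the limit are illustrative only.

```c
#include <stdio.h>

/* Count how many "words" a width-limited %[ finds in input. */
int count_pieces(const char *input)
{
    char piece[4];              /* 3 characters plus the '\0' */
    int used, pieces = 0;

    for (;;) {
        /* Skip non-word characters, if any.  On a matching failure
         * %n is never reached, so used stays 0. */
        used = 0;
        sscanf(input, "%*[^A-Za-z0-9]%n", &used);
        input += used;

        /* Read at most 3 word characters. */
        used = 0;
        if (sscanf(input, "%3[A-Za-z0-9]%n", piece, &used) != 1)
            break;
        input += used;
        pieces++;
    }
    return pieces;
}
```

Here count_pieces("ab cd") is 2, as expected, but
count_pieces("abcdefg") is 3: the single seven-letter word comes back
as "abc", "def" and "g".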

Second: "alphanumeric" words often mean "words starting with an
alphabetic character, then allowing alphabetic or numeric characters"
(in programming languages, at least -- C among them -- identifiers
are alphanumeric words that cannot *begin* with digits). The
scanf engine is not very suited to such a job: its directives are
clumsier than typical regular-expression handlers (lex, perl and
awk REs, and the like). You can sort-of express this with:

char firstchar[2];
char rest[256 - 1];
int result;

result = scanf("%1[A-Za-z]%254[A-Za-z0-9]", firstchar, rest);

although to allow for EBCDIC, the "A-Z"s should be expanded out
as well:

#define ALPHABETIC "ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
                   "abcdefghijklmnopqrstuvwxyz"
#define ALPHANUMERIC ALPHABETIC "0-9"

...

result = scanf("%1[" ALPHABETIC "]%254[" ALPHANUMERIC "]",
               firstchar, rest);

(C guarantees that the digits are grouped "properly" so we can use
the shorthand for the digit part). In both cases, if "result" is
1, the input was just a single-character alphabetic-only "word";
if result is 2, the alphanumeric tail of the word is in "rest".
(We need a 2-character array to hold the first character because
the %[ directive always stores a C string, i.e., adds the '\0'.)
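
Putting the two pieces back together into one word takes a little
care, because "rest" is left untouched when result is 1. A sketch
using sscanf() on a string; scan_identifier() is a made-up name, and
"rest" is initialized to an empty string so it is safe to append
either way.

```c
#include <stdio.h>
#include <string.h>

#define ALPHABETIC   "ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
                     "abcdefghijklmnopqrstuvwxyz"
#define ALPHANUMERIC ALPHABETIC "0-9"

/* Hypothetical helper: scan one identifier-like word from the start
 * of input into word.  Returns the sscanf() result: 2 or 1 on
 * success, 0 or EOF if no word starts there. */
int scan_identifier(const char *input, char word[256])
{
    char firstchar[2];
    char rest[256 - 1] = "";    /* stays empty when result is 1 */
    int result;

    word[0] = '\0';
    result = sscanf(input, "%1[" ALPHABETIC "]%254[" ALPHANUMERIC "]",
                    firstchar, rest);
    if (result >= 1) {
        strcpy(word, firstchar);    /* 1 character */
        strcat(word, rest);         /* up to 254 more, plus '\0' */
    }
    return result;
}
```

For "abc123" this returns 2 and stores "abc123"; for "x" it returns 1
and stores "x"; for "9abc" it returns 0 and stores nothing.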

The best solution is probably to ignore scanf entirely. In this
case, you can write a small "word reading" function that uses
isalpha() and isdigit() from <ctype.h>, and a corresponding
"word skipping" function that also uses isalpha() and isdigit().

As usual, scanf is a poor solution: for simple problems, it is too
complicated; for robust programs that do complicated jobs, it is
too simple.
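
One possible shape for those two functions, as a sketch only. The
names, the FILE * parameter and the buffer-size parameter are
assumptions; isalnum() is used as shorthand for "isalpha() or
isdigit()", and no attempt is made to require that a word start with
a letter.

```c
#include <ctype.h>
#include <stdio.h>

/* Skip non-word characters, leaving the stream at the start of the
 * next word (or at end of input). */
void skip_nonword(FILE *fp)
{
    int c;

    while ((c = getc(fp)) != EOF && !isalnum(c))
        continue;
    if (c != EOF)
        ungetc(c, fp);
}

/* Read one word of letters and digits into buf, which must hold at
 * least 1 byte.  Returns the word's length, or 0 if no word starts
 * at the current position.  A word longer than size - 1 characters
 * is split, just as with the scanf width limit. */
size_t read_word(FILE *fp, char *buf, size_t size)
{
    size_t n = 0;
    int c = EOF;

    while (n + 1 < size && (c = getc(fp)) != EOF && isalnum(c))
        buf[n++] = (char)c;
    if (c != EOF && !isalnum(c))
        ungetc(c, fp);          /* put back the character that ended the word */
    buf[n] = '\0';
    return n;
}
```

A caller alternates skip_nonword() and read_word() until read_word()
returns 0. Because the <ctype.h> functions follow the current locale,
this also avoids spelling out the alphabet in a scanset.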
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.

Steven 12-27-2005 10:52 PM

Re: newbie fscanf %[ conversions, multipliers
 
Sorry for breaking the net etiquette by starting my reply at the top.

But thank you so much for the very complete reply!

I resorted to the scanf family for ease of use, but after reading your
reply I am not sure anymore `what easy actually is' :-)

Especially for a beginner, even a seemingly simple task such as
deriving tokens from text data can be dangerous. I hope that in the
future this will turn into C power for me; until then I promise to
keep reading.

Thanks again for the complete reply!

Steven.



Christopher Benson-Manica 12-28-2005 01:40 AM

Re: newbie fscanf %[ conversions, multipliers
 
Steven <Steven@yahoo.com> wrote:

> bigr[0] = calloc(MAXWORDLEN, sizeof(char));
> bigr[1] = calloc(MAXWORDLEN, sizeof(char));


Far be it from me to nitpick the outstanding reply you already
received, but you should check the return value of calloc() before
continuing.
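
Something along these lines would do. It is only a sketch, and the
helper name alloc_bigrams() is invented for the example:

```c
#include <stdlib.h>

/* Hypothetical helper: allocate both word buffers, or neither.
 * Returns 0 on success and -1 if either calloc() fails. */
int alloc_bigrams(char *bigr[2], size_t len)
{
    bigr[0] = calloc(len, sizeof *bigr[0]);
    bigr[1] = calloc(len, sizeof *bigr[1]);
    if (bigr[0] == NULL || bigr[1] == NULL) {
        free(bigr[0]);          /* free(NULL) is a no-op */
        free(bigr[1]);
        bigr[0] = bigr[1] = NULL;
        return -1;
    }
    return 0;
}
```

In main() you would call alloc_bigrams(bigr, MAXWORDLEN) and bail out,
for instance with return EXIT_FAILURE, if it returns -1.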

--
Christopher Benson-Manica | I *should* know what I'm talking about - if I
ataru(at)cyberspace.org | don't, I need to know. Flames welcome.


All times are GMT.
