Richard wrote:
> Which way would you guys recommend to best parse a multiline file which contains
> two fields separated by a tab? In this case it's the
> Linux /proc/filesystems file, a sample of which I have included below:
>
> nodev	usbfs
> 	ext3
> nodev	fuse
> 	vfat
> 	ntfs
> nodev	binfmt_misc
> 	udf
> 	iso9660
>
> The first field can be "empty" and consist of only a single tab
> character. The separator is a tab.
>
> Is sscanf best suited to this? Or use strtok/strtok_r?

strtok(..., "\t") will give the same result for "\tfoo"
and "\t\tfoo\t" and "foo". If you *know* that the input has
two tab-separated fields and that only the first (never the
second) can be empty, you can get this to work: If strtok()
finds two fields they are #1 and #2, but if it finds only
one it is #2 with #1 empty.

However, it makes me queasy to put that much faith in an
input source I don't control programmatically. Who knows?
Maybe in six months somebody will extend the format, adding
an optional third field. If that happened, then the field-
counting approach would misinterpret "\tfoo\tbar" as if it
were "foo\tbar". It would be better to adopt a method that
would complain about "\tfoo\tbar" than to be fooled by it.
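
To make the hazard concrete, here is a minimal sketch of the field-counting approach. split_by_count() is a made-up name for illustration, not a library function; it chops the line in place with strtok() and reports how many tokens turned up (capped at 3).

```c
#include <string.h>

/* Hypothetical helper: the field-counting approach built on strtok().
 * Splits `line` in place on tabs and returns the number of tokens
 * found (0, 1, 2, or 3 meaning "three or more").  If only one token
 * is found it is taken to be field #2, with field #1 empty. */
int split_by_count(char *line, const char **first, const char **second)
{
    char *a = strtok(line, "\t");
    char *b = a ? strtok(NULL, "\t") : NULL;
    char *c = b ? strtok(NULL, "\t") : NULL;

    if (a == NULL)
        return 0;               /* nothing but tabs, or empty */
    if (b == NULL) {
        *first = "";            /* one token: assume #1 was empty */
        *second = a;
        return 1;
    }
    *first = a;                 /* two (or more) tokens: #1 and #2 */
    *second = b;
    return c ? 3 : 2;
}
```

Fed "\tfoo", "\t\tfoo\t" or "foo", this reports one token with field #2 equal to "foo" every time, and it cannot tell "\tfoo\tbar" from "foo\tbar": both come back as two tokens, "foo" and "bar".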

fgets() plus sscanf() is a possibility, but it's a bit
tricky to use: The obvious "%s\t%s" will not do what you
want. (The first "%s" will skip any leading white space,
leaving you in the same hole as the strtok() approach, and
the "\t" will match any amount of any kind of white space,
tabs or other.) Something like "%[^\t]%*1[\t]%s" would do
a little better, but still wouldn't be fully satisfactory:
"%[" must match at least one character, so a line with an
empty first field would need separate handling, and it would
match the prefix of "foo\tbar baz goozle frobnitz" without
any warning of the trailing junk. You could use
"%[^\t]%*1[\t]%s%n" and then check that sscanf() had in fact
consumed the entire string ...
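
For what it's worth, a sketch of the sscanf()-plus-%n idea might look like the following. The name parse_with_sscanf() and the 64-byte field limit are my own assumptions. Two liberties are taken with the format discussed above: a scanset "%[^ \t\n]" stands in for "%s" so that no leading white space is skipped before the second field, and a line starting with a tab is handled separately because "%[^\t]" cannot match an empty field.

```c
#include <stdio.h>

enum { FIELD_MAX = 64 };   /* assumed buffer size; the 63s below depend on it */

/* Hypothetical helper: returns 1 if `line` is exactly
 * <field1> TAB <field2> with an optional trailing '\n', else 0.
 * `first` and `second` must each point to FIELD_MAX bytes. */
int parse_with_sscanf(const char *line, char *first, char *second)
{
    int used = 0;

    first[0] = '\0';
    if (line[0] == '\t') {
        /* Empty first field: step over the tab by hand. */
        if (sscanf(line + 1, "%63[^ \t\n]%n", second, &used) != 1)
            return 0;
        used += 1;              /* account for the tab we skipped */
    } else {
        if (sscanf(line, "%63[^\t]%*1[\t]%63[^ \t\n]%n",
                   first, second, &used) != 2)
            return 0;
    }
    if (line[used] == '\n')     /* fgets() usually leaves a newline */
        used++;
    return line[used] == '\0';  /* anything left over is trailing junk */
}
```

With this, "nodev\tusbfs\n" and "\text3\n" are accepted, while "foo\tbar baz goozle frobnitz\n" and "\tfoo\tbar\n" are rejected because %n shows that the scan stopped short of the end.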

... but wouldn't it be simpler just to pick the line
apart for yourself? Read it in with fgets(), use strchr()
to find the first tab (syntax error if there isn't one), and
the first (possibly empty) field is everything from the start
to just before the tab. Then start just after the tab and use
strchr() again to find the terminating '\n'; the second field
is everything from just after the tab to just before the '\n'
(syntax error if its length is zero). You can use strcspn()
to check that the second field contains no white space and
squawk if it does (somebody added a third field you don't
understand).
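
The do-it-yourself steps above can be sketched roughly as follows. parse_by_hand() and FIELD_MAX are names I made up, and two choices are my assumptions rather than anything stated above: a line with no '\n' (for instance one that fgets() truncated) is treated as an error, and fields that do not fit the caller's buffers are rejected instead of being cut short.

```c
#include <string.h>

enum { FIELD_MAX = 64 };   /* assumed size of the caller's buffers */

/* Hypothetical helper: picks apart one line as read by fgets().
 * Returns 0 on success, -1 on any syntax error.
 * `first` and `second` must each point to FIELD_MAX bytes. */
int parse_by_hand(const char *line, char *first, char *second)
{
    const char *tab, *start, *nl;
    size_t len1, len2;

    tab = strchr(line, '\t');
    if (tab == NULL)
        return -1;                      /* no tab at all */
    len1 = (size_t)(tab - line);        /* may be zero: empty field #1 */

    start = tab + 1;
    nl = strchr(start, '\n');
    if (nl == NULL)
        return -1;                      /* assumption: need the newline */
    len2 = (size_t)(nl - start);
    if (len2 == 0)
        return -1;                      /* empty field #2 */

    /* White space before the newline means extra fields or junk. */
    if (strcspn(start, " \t\r\v\f\n") != len2)
        return -1;

    if (len1 >= FIELD_MAX || len2 >= FIELD_MAX)
        return -1;                      /* would not fit */

    memcpy(first, line, len1);
    first[len1] = '\0';
    memcpy(second, start, len2);
    second[len2] = '\0';
    return 0;
}
```

Called on each line from an fgets() loop, this accepts "nodev\tusbfs\n" and "\text3\n" and squawks (returns -1) at "\tfoo\tbar\n", at a line with no tab, and at "\t\n".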

> The field I am really interested in is the second one : any hints & tips
> appreciated as to do this in the most efficient manor.

The "most efficient manor" is the house of Usher. Resist
this unnecessary impulse for efficiency, lest your program meet
the same fate as did that storied manse.

(In other words: How long is this file, anyhow? How many
times will you scan its contents? If you sped up the scanning
by a factor of four hundred twenty gazillion, how much faster
would the program as a whole run? If you give your SUV a coat
of wax, will you improve its fuel economy by making it slipperier
or harm it by adding weight?)
--
Eric Sosman