parse two field file

 
 
Richard
      12-17-2006

Which way would you guys recommend to best parse a multiline file which contains
two fields separated by a tab? In this case it's the
Linux /proc/filesystems file, a sample of which I have included below:

nodev	usbfs
	ext3
nodev	fuse
	vfat
	ntfs
nodev	binfmt_misc
	udf
	iso9660

The first field can be "empty", in which case the line starts with a single tab
character. The separator is a tab.

Is sscanf best suited to this? Or use strtok/strtok_r?

The field I am really interested in is the second one. Any hints and tips
on how to do this in the most efficient manor are appreciated.

CBFalconer
      12-17-2006
Richard wrote:
>
> Which way would you guys recommened to best parse a multiline file
> which contains two fields seperated by a tab. In this case its the
> linux/proc/filesystems file a sample of which I have included below:
>
> nodev usbfs
> ext3
> nodev fuse
> vfat
> ntfs
> nodev binfmt_misc
> udf
> iso9660
>
> The first field can be "empty" and concist of only a single tab
> character. The seperator is a tab.
>
> Is sscanf best suited to this? Or use strtok/strtok_r?
>
> The field I am really interested in is the second one : any hints
> & tips appreciated as to do this in the most efficient manor.


Use toksplit. Call with tokchar set to '\t'. Std C code follows:

/* ------- file toksplit.h ----------*/
#ifndef H_toksplit_h
# define H_toksplit_h

# ifdef __cplusplus
extern "C" {
# endif

#include <stddef.h>

/* copy over the next token from an input string, after
skipping leading blanks (or other whitespace?). The
token is terminated by the first appearance of tokchar,
or by the end of the source string.

The caller must supply sufficient space in token to
receive any token, Otherwise tokens will be truncated.

Returns: a pointer past the terminating tokchar.

This will happily return an infinity of empty tokens if
called with src pointing to the end of a string. Tokens
will never include a copy of tokchar.

released to Public Domain, by C.B. Falconer.
Published 2006-02-20. Attribution appreciated.
*/

const char *toksplit(const char *src, /* Source of tokens */
                     char tokchar,    /* token delimiting char */
                     char *token,     /* receiver of parsed token */
                     size_t lgh);     /* length token can receive */
                                      /* not including final '\0' */

# ifdef __cplusplus
}
# endif
#endif
/* ------- end file toksplit.h ----------*/

/* ------- file toksplit.c ----------*/
#include "toksplit.h"

/* copy over the next token from an input string, after
skipping leading blanks (or other whitespace?). The
token is terminated by the first appearance of tokchar,
or by the end of the source string.

The caller must supply sufficient space in token to
receive any token, Otherwise tokens will be truncated.

Returns: a pointer past the terminating tokchar.

This will happily return an infinity of empty tokens if
called with src pointing to the end of a string. Tokens
will never include a copy of tokchar.

A better name would be "strtkn", except that is reserved
for the system namespace. Change to that at your risk.

released to Public Domain, by C.B. Falconer.
Published 2006-02-20. Attribution appreciated.
Revised 2006-06-13
*/

const char *toksplit(const char *src, /* Source of tokens */
                     char tokchar,    /* token delimiting char */
                     char *token,     /* receiver of parsed token */
                     size_t lgh)      /* length token can receive */
                                      /* not including final '\0' */
{
   if (src) {
      while (' ' == *src) src++;

      while (*src && (tokchar != *src)) {
         if (lgh) {
            *token++ = *src;
            --lgh;
         }
         src++;
      }
      if (*src && (tokchar == *src)) src++;
   }
   *token = '\0';
   return src;
} /* toksplit */

#ifdef TESTING
#include <stdio.h>

#define ABRsize 6 /* length of acceptable token abbreviations */

/* ---------------- */

static void showtoken(int i, char *tok)
{
   putchar(i + '1'); putchar(':');
   puts(tok);
} /* showtoken */

/* ---------------- */

int main(void)
{
   char teststring[] = "This is a test, ,, abbrev, more";

   const char *t, *s = teststring;
   int i;
   char token[ABRsize + 1];

   puts(teststring);
   t = s;
   for (i = 0; i < 4; i++) {
      t = toksplit(t, ',', token, ABRsize);
      showtoken(i, token);
   }

   puts("\nHow to detect 'no more tokens' while truncating");
   t = s; i = 0;
   while (*t) {
      t = toksplit(t, ',', token, 3);
      showtoken(i, token);
      i++;
   }

   puts("\nUsing blanks as token delimiters");
   t = s; i = 0;
   while (*t) {
      t = toksplit(t, ' ', token, ABRsize);
      showtoken(i, token);
      i++;
   }
   return 0;
} /* main */

#endif
/* ------- end file toksplit.c ----------*/
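
For the /proc/filesystems case, a call sequence along the following lines might work. This is only a sketch: second_field() is a hypothetical helper, the 32-byte buffer for the first field is an assumption, and toksplit() is repeated from the post above purely so that the fragment compiles on its own. Note that toksplit() skips leading blanks but not tabs, so an empty first field comes back as an empty token, and that the newline left by fgets() ends up in the last token and has to be removed by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* toksplit() as posted above, repeated so this sketch compiles alone. */
const char *toksplit(const char *src, char tokchar, char *token, size_t lgh)
{
   if (src) {
      while (' ' == *src) src++;
      while (*src && (tokchar != *src)) {
         if (lgh) {
            *token++ = *src;
            --lgh;
         }
         src++;
      }
      if (*src && (tokchar == *src)) src++;
   }
   *token = '\0';
   return src;
}

/* Hypothetical helper: copy the second tab-separated field of line into
   out (capacity outsize, counting the final '\0') and drop a trailing
   newline. The first field is parsed and discarded. */
void second_field(const char *line, char *out, size_t outsize)
{
   char first[32];   /* assumed big enough for "nodev" */
   const char *rest;

   assert(line != NULL && out != NULL && outsize > 0);
   rest = toksplit(line, '\t', first, sizeof first - 1);
   toksplit(rest, '\t', out, outsize - 1);
   out[strcspn(out, "\n")] = '\0';
}
```

Called as second_field(buf, name, sizeof name) on each line read by fgets(), this leaves the filesystem name in name whether or not "nodev" was present.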

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>

 
Malcolm
      12-17-2006



"Richard" <> wrote in message
news:...
>
> Which way would you guys recommened to best parse a multiline file which
> contains
> two fields seperated by a tab. In this case its the
> linux/proc/filesystems file a sample of which I have included below:
>
> nodev usbfs
> ext3
> nodev fuse
> vfat
> ntfs
> nodev binfmt_misc
> udf
> iso9660
>
> The first field can be "empty" and concist of only a single tab
> character. The seperator is a tab.
>
> Is sscanf best suited to this? Or use strtok/strtok_r?
>
> The field I am really interested in is the second one : any hints & tips
> appreciated as to do this in the most efficient manor.
>

The input format is slightly quirky, so the best solution is to call fgets()
to read a line and then parse it yourself.

int checkheader(char *str)

can check whether the string is a header or not by looking for the tab or
counting whitespace.

parseheader(char *str, char *field1, char *field2)

will pull out the fields for you. Make sure you reject over-long strings.
Then the data fields only contain one string.

However

void trim(char *str)

which removes leading and trailing whitespace is a good function to have.

so too is
int checkblank(char *str)

which checks for strings which consist entirely of whitespace characters.
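
The last two helpers could look something like this. It is a sketch only: the signatures are the ones given above, but the bodies are one possible implementation and "whitespace" is taken to mean whatever isspace() accepts.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Remove leading and trailing whitespace from str, in place. */
void trim(char *str)
{
    size_t start = 0, len;

    assert(str != NULL);
    while (isspace((unsigned char)str[start]))
        start++;
    len = strlen(str + start);
    while (len > 0 && isspace((unsigned char)str[start + len - 1]))
        len--;
    memmove(str, str + start, len);   /* regions may overlap */
    str[len] = '\0';
}

/* Return 1 if str is empty or all whitespace, 0 otherwise. */
int checkblank(char *str)
{
    assert(str != NULL);
    while (*str != '\0') {
        if (!isspace((unsigned char)*str))
            return 0;
        str++;
    }
    return 1;
}
```

Bear in mind that trim() also removes the leading tab that marks an empty first field, so it is better applied to a field after splitting than to the whole line.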
--
www.personal.leeds.ac.uk/~bgy1mm
freeware games to download.


 
Richard
      12-17-2006
"Malcolm" <> writes:

> "Richard" <> wrote in message
> news:...
>>
>> Which way would you guys recommened to best parse a multiline file which
>> contains
>> two fields seperated by a tab. In this case its the
>> linux/proc/filesystems file a sample of which I have included below:
>>
>> nodev usbfs
>> ext3
>> nodev fuse
>> vfat
>> ntfs
>> nodev binfmt_misc
>> udf
>> iso9660
>>
>> The first field can be "empty" and concist of only a single tab
>> character. The seperator is a tab.
>>
>> Is sscanf best suited to this? Or use strtok/strtok_r?
>>
>> The field I am really interested in is the second one : any hints & tips
>> appreciated as to do this in the most efficient manor.
>>

> The input format is slightly quirky, so the best solution is to call fgets()
> to read a line and then parse it yourself.
>
> int checkheader(char *str)
>
> ccan check whether the string is a header or not by looking for the tab or
> counting whitespace.
>
> parseheader(char *str, char *field1, char *field2)
>
> will pull out the fields for you. make sure you reject over-long strings.
> Then the data fields only contain one string.
>
> However
>
> void trim(char *str)
>
> which removes leading and trailing whitespace is a good function to have.
>
> so too is
> int checkblank(char *str)
>
> which checks for strings which consist entirely of whitespace
> characters.


I just did sscanf("%s%s",f1,f2) in the end.
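
The one-liner above leaves out the input string. Spelled out, and assuming each line has already been read with fgets(), it might look like the sketch below. The helper name fs_name() and the 32-byte buffers are made up for illustration. Because %s skips any leading white space, a line with an empty first field yields one conversion instead of two, so the return value of sscanf() tells you which variable holds the name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the filesystem name (the second field) from
   line into name, which must have room for at least 32 bytes.
   Returns 1 on success, 0 if the line holds no field at all. */
int fs_name(const char *line, char *name)
{
    char f1[32], f2[32];
    int n;

    assert(line != NULL && name != NULL);
    /* %31s bounds each conversion to the buffer size minus one. */
    n = sscanf(line, "%31s%31s", f1, f2);
    if (n == 2)
        strcpy(name, f2);   /* "nodev\tusbfs": f2 is the name */
    else if (n == 1)
        strcpy(name, f1);   /* "\text3": the first field was empty */
    else
        return 0;           /* blank or empty line */
    return 1;
}
```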

Eric Sosman
      12-17-2006
Richard wrote:
> Which way would you guys recommened to best parse a multiline file which contains
> two fields seperated by a tab. In this case its the
> linux/proc/filesystems file a sample of which I have included below:
>
> nodev usbfs
> ext3
> nodev fuse
> vfat
> ntfs
> nodev binfmt_misc
> udf
> iso9660
>
> The first field can be "empty" and concist of only a single tab
> character. The seperator is a tab.
>
> Is sscanf best suited to this? Or use strtok/strtok_r?


strtok(..., "\t") will give the same result for "\tfoo"
and "\t\tfoo\t" and "foo". If you *know* that the input has
two tab-separated fields and that only the first (never the
second) can be empty, you can get this to work: If strtok()
finds two fields they are #1 and #2, but if it finds only
one it is #2 with #1 empty.
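
A sketch of that field-counting approach, for illustration only. The helper name split_by_count() is hypothetical, and "\n" is added to the delimiter set (an assumption, not part of the description above) so that the newline left by fgets() is dropped.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: split line (modified in place) with strtok().
   On return *f1 and *f2 point at the two fields; *f1 is "" when only
   one token was found. Returns the number of tokens seen: 0, 1 or 2. */
int split_by_count(char *line, const char **f1, const char **f2)
{
    char *a, *b;

    assert(line != NULL && f1 != NULL && f2 != NULL);
    a = strtok(line, "\t\n");
    if (a == NULL)
        return 0;               /* nothing but delimiters */
    b = strtok(NULL, "\t\n");
    if (b == NULL) {            /* one token: treat it as field #2 */
        *f1 = "";
        *f2 = a;
        return 1;
    }
    *f1 = a;
    *f2 = b;
    return 2;
}
```

Given "\tfoo\tbar\n" this reports two fields, "foo" and "bar", with no hint that the first field was really empty.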

However, it makes me queasy to put that much faith in an
input source I don't control programmatically. Who knows?
Maybe in six months somebody will extend the format, adding
an optional third field. If that happened, then the field-
counting approach would misinterpret "\tfoo\tbar" as if it
were "foo\tbar". It would be better to adopt a method that
would complain about "\tfoo\tbar" than to be fooled by it.

fgets() plus sscanf() is a possibility, but it's a bit
tricky to use: The obvious "%s\t%s" will not do what you
want. (The first "%s" will skip any leading white space,
leaving you in the same hole as the strtok() approach, and
the "\t" will match any amount of any kind of white space,
tabs or other.) Something like "%[^\t]%*1[\t]%s" would do
a little better, but still wouldn't be fully satisfactory:
It would match the prefix of "foo\tbar baz goozle frobnitz"
without any warning of the trailing junk. You could use
"%[^\t]%*1[\t]%s%n" and then check that sscanf() had in fact
consumed the entire string ...

... but wouldn't it be simpler just to pick the line
apart for yourself? Read it in with fgets(), use strchr()
to find the first tab (syntax error if there isn't one), and
the first (possibly empty) field is everything from the start
to just before the tab. Then start just after the tab and use
strchr() again to find the terminating '\n'; the second field
is everything from just after the tab to just before the '\n'
(syntax error if its length is zero). You can use strcspn()
to check that the second field contains no white space and
squawk if it does (somebody added a third field you don't
understand).
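
The steps just described might be sketched as follows. The function name split_line() and the 0 / -1 return convention are assumptions made for the example; the line is expected to come straight from fgets() and is split in place.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: split a line read by fgets() into two fields.
   On success the tab and the newline are overwritten with '\0' and
   *field1 / *field2 point into line. Returns 0 on success, -1 on a
   syntax error (line is left untouched in that case). */
int split_line(char *line, char **field1, char **field2)
{
    char *tab, *nl;

    assert(line != NULL && field1 != NULL && field2 != NULL);
    tab = strchr(line, '\t');
    if (tab == NULL)
        return -1;                      /* no tab at all */
    nl = strchr(tab + 1, '\n');
    if (nl == NULL || nl == tab + 1)
        return -1;                      /* no newline, or empty field 2 */
    /* A blank or tab before the newline means an unknown extra field. */
    if (strcspn(tab + 1, " \t") < (size_t)(nl - (tab + 1)))
        return -1;
    *tab = '\0';
    *nl = '\0';
    *field1 = line;                     /* may be the empty string */
    *field2 = tab + 1;
    return 0;
}
```

A line that was too long for the fgets() buffer has no '\n' and is rejected along with the other syntax errors.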

> The field I am really interested in is the second one : any hints & tips
> appreciated as to do this in the most efficient manor.


The "most efficient manor" is the house of Usher. Resist
this unnecessary impulse for efficiency, lest your program meet
the same fate as did that storied manse.

(In other words: How long is this file, anyhow? How many
times will you scan its contents? If you sped up the scanning
by a factor of four hundred twenty gazillion, how much faster
would the program as a whole run? If you give your SUV a coat
of wax, will you improve its fuel economy by making it slipperier
or harm it by adding weight?)

--
Eric Sosman
 
Giorgos Keramidas
      12-26-2006
On Sun, 17 Dec 2006 01:10:16 +0100, Richard <> wrote:
> Which way would you guys recommened to best parse a multiline file
> which contains two fields seperated by a tab. In this case its the
> linux/proc/filesystems file a sample of which I have included below:
>
> nodev usbfs
> ext3
> nodev fuse
> vfat
> ntfs
> nodev binfmt_misc
> udf
> iso9660
>
> The first field can be "empty" and concist of only a single tab
> character. The seperator is a tab.
>
> Is sscanf best suited to this? Or use strtok/strtok_r?


strtok() is not so nice, because it tries to modify the string you pass
to it. I would probably use strcspn() for this, with something like:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINE 256

static void doline(char *buf, size_t bufsize);

int
main(void)
{
        char buf[MAXLINE];
        FILE *fp;

        /*
         * Add code here that opens /proc/filesystems file, instead of using
         * `stdin' as the input file.
         */
        fp = stdin;

        clearerr(fp);
        while (fgets(buf, sizeof buf, fp) != NULL) {
                doline(buf, sizeof buf);
        }
        if (ferror(fp) != 0) {
                perror("fgets");
                exit(EXIT_FAILURE);
        }
        /*
         * Add code here that closes the open file referenced by `fp'.
         */

        return EXIT_SUCCESS;
}

static void
doline(char *buf, size_t bufsize)
{
        char *field;
        size_t pos, pos2, fieldsize;

        assert(buf != NULL && bufsize > 0);
        (void)bufsize;

        pos = strcspn(buf, "\t");
        if (buf[pos] == '\0') {
                fprintf(stderr,
                    "warning: no TAB in `%s', skipping this line\n", buf);
                return;
        }
        pos2 = strcspn(buf + pos + 1, "\t");

        fieldsize = pos2 + 1;
        field = malloc(fieldsize);
        if (field == NULL) {
                perror("malloc");
                return;
        }
        strncpy(field, buf + pos + 1, fieldsize - 1);
        field[fieldsize - 1] = '\0';
        field[strcspn(field, "\n\r")] = '\0';
        printf("%s\n", field);
        free(field);
}

The trick is to use strcspn() to find the 'part' of the original
string which you are interested in, and then you can do whatever you
like with this part. In this particular program, I temporarily
allocate a new string buffer, copy the original contents into this new
buffer, print the buffer and release its memory. Any other way you can
think of to use this substring is fine too.

 
 
Dave Thompson
      01-03-2007
On Sun, 17 Dec 2006 10:37:28 -0500, Eric Sosman
<> wrote:

> Richard wrote:
> > Which way would you guys recommened to best parse a multiline file which contains
> > two fields seperated by a tab. <snip>

> strtok(..., "\t") will [lose empty fields]


Right.

> fgets() plus sscanf() is a possibility, but it's a bit
> tricky to use: The obvious "%s\t%s" will not do what you
> want. (The first "%s" will skip any leading white space,
> leaving you in the same hole as the strtok() approach, and
> the "\t" will match any amount of any kind of white space,
> tabs or other.) Something like "%[^\t]%*1[\t]%s" would do
> a little better, but still wouldn't be fully satisfactory:


Not enough better. If the first field is empty and thus the first
%[^\t] matches nothing, *scanf stops and doesn't do the %*1[\t]s.
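
A small sketch that makes the failure visible. The wrapper scan_fields() is hypothetical, and field widths have been added to the format so the 32-byte buffers (an assumed size) cannot overflow.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical wrapper: apply the scanset format to line and return
   sscanf()'s result. f1 and f2 must each hold at least 32 bytes. */
int scan_fields(const char *line, char *f1, char *f2)
{
    assert(line != NULL && f1 != NULL && f2 != NULL);
    f1[0] = f2[0] = '\0';
    /* %31[^\t] needs at least one non-tab character to succeed. */
    return sscanf(line, "%31[^\t]%*1[\t]%31s", f1, f2);
}
```

For "nodev\tusbfs\n" this returns 2, but for "\text3\n" the first directive fails at once and the result is 0, with nothing stored in f2.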

This is effectively the same problem of the people who periodically
try to use {,f}scanf to replace <ILLEGAL> fflush (input) </>.
(Some people, including IIRC Dan Pop, have recommended e.g.
if( scanf ("%*[^\n]%*1[\n]") < 2 ) getchar ();
but I consider that too much uglier than the obvious, though slightly
longer and possibly slightly less efficient
while( (ch = getchar()) != EOF && ch != '\n' ) ;
etc.)

Plus unbounded %[...] or %s risks buffer overflow and resulting UB.
You should specify a length at most one less than the buffer size.

> It would match the prefix of "foo\tbar baz goozle frobnitz"
> without any warning of the trailing junk. You could use
> "%[^\t]%*1[\t]%s%n" and then check that sscanf() had in fact
> consumed the entire string ...
>
> ... but wouldn't it be simpler just to pick the line
> apart for yourself? Read it in with fgets(), use strchr()
> to find the first tab <snip>


Yes.

> The "most efficient manor" is the house of Usher. Resist
> this unnecessary impulse for efficiency, lest your program meet
> the same fate as did that storied manse.
>

Yes. Or even the hundred-year shay, IIRC grade school. <G>

- David.Thompson1 at worldnet.att.net
 