Parsing two formatted text files

 
 
bfowlkes@gmail.com
      03-31-2006
Hello,

I am trying to parse two pre-formatted text files and write them to
different files formatted in a different way. The story behind this is
that I was hired along with about 20 other people, and it seems we are
trying to learn the whole C language in two weeks! To top it all off, I
was an English major, but I'm trying my best. OK, back to the program.
We have two files, product_catalog.txt and sales_month.txt.

The info in product_catalog.txt looks like this:

1010:CD drive external 32x :1MagiCopy:15.5:100
1020:CD drive external 40x :20th Century Fox:16.74:130
1030:CD drive external 48x :3COM:13.48:160
1040:CD drive external 52x :4XEM:15.92:190

We need to write it to another file that is going to look like this:

ID Number  Description   Provider   Cost   Stock  Total
1010       CD Drive 32x  1MagiCopy  15.50  100    1550.00

Since the text file to be read is preformatted, I thought I could
use fscanf() to parse each line and assign the fields to structure
variables, but I am having problems.

Here is my code to read the file:

int readFile (char *filename, struct productData product[], size_t arrLen)
/* Returns number of products read */
{
    FILE *fp;

    if ( ( fp = fopen( "product_catalog.txt", "rb+" ) ) == NULL ) {
        printf( "File could not be opened.\n" );
    } /* end if */
    else
    {
        int i;
        for (i=0; i<arrLen && !feof(fp); i++)
        {
            if (5 != fscanf(fp, "%d %s %s %f %d",
                            &product[i].idnumber,
                            product[i].description,
                            product[i].provider,
                            &product[i].cost,
                            &product[i].stock))
            {
                printf("Invalid file format\n");
                fclose(fp);
                return 0;
            }
        }
        fclose(fp);
        return i;
    }
}

The problem seems to be that the fields I want to parse are
separated by colons. Is there any way to tell fscanf() to parse up
until it reaches a colon, then stop and start scanning again, or
should I give up this approach and try to tokenize the input stream?
Any help is much appreciated.

Brett

 
Eric Sosman
      03-31-2006


(E-Mail Removed) wrote On 03/31/06 18:06,:
> Hello,
>
> I am trying to parse two pre-formatted text files and write them to a
> different files formatted in a different way. The story about this is I
> was hired along with about 20 other people and it seems we are trying
> to learn the whole C language in two weeks! To top it all off, I was an
> English Major, but I'm trying my best. Ok back to the program. So we
> have two files product_catalog.txt and sales_month.txt
>
> The info in product_catalog.txt looks like this:
>
> 1010:CD drive external 32x :1MagiCopy:15.5:100
> 1020:CD drive external 40x :20th Century Fox:16.74:130
> 1030:CD drive external 48x :3COM:13.48:160
> 1040:CD drive external 52x :4XEM:15.92:190
>
> We need to write it to another file that is going to look like this
>
> ID Number Description Provider Cost Stock Total
> 1010 CD Drive 32x 1MagiCopy 15.50 100 1550.00


That's not just reformatting. There's a little bit
of computation (deriving the 1550.00), which isn't hard.
Harder -- potentially very hard -- is the translation
that seems to be occurring: How did "drive" become "Drive,"
and where did "external" disappear to, and what rules
govern such transformations?
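
The computational part (the Total column) can be sketched roughly as follows. This is only an illustration: the struct layout, the name format_row() and the column widths are assumptions, since the thread never shows struct productData or specifies exact widths.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record layout; the real struct productData is not shown. */
struct product_row {
    int    idnumber;
    char   description[64];
    char   provider[64];
    double cost;
    int    stock;
};

/* Format one report line into buf. Returns the snprintf() result:
 * the length the full line needs, not counting the '\0'. */
int format_row(char *buf, size_t size, const struct product_row *p)
{
    double total;

    assert(buf != NULL && p != NULL);
    total = p->cost * p->stock;          /* the derived "Total" column */

    /* Column widths are arbitrary choices for this sketch. */
    return snprintf(buf, size, "%-6d %-24s %-18s %8.2f %6d %10.2f",
                    p->idnumber, p->description, p->provider,
                    p->cost, p->stock, total);
}
```

A caller would fill a struct product_row from one parsed input line, call format_row() with a buffer of a hundred bytes or so, and write the result to the output file with fputs(). The renaming of the description text is a separate problem that this sketch does not attempt.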

> Since the text file to be read from is preformatted I thought I could
> use the fscanf() to to parse each line and assign it into structure
> variables, but I am having problems.
>
> Here is my code to read the file:
>
> int readFile (char *filename, struct productData product[], size_t
> arrLen)
> /* Returns number of products read */
> {
> FILE *fp;
>
> if ( ( fp = fopen( "product_catalog.txt", "rb+" ) ) == NULL ) {
> printf( "File could not be opened.\n" );
> } /* end if */
>
> else
> {
> int i;
> for (i=0; i<arrLen && !feof(fp); i++)
> {
> if (5 != fscanf(fp, "%d %s %s %f %d",
> &product[i].idnumber,
> product[i].description,
> product[i].provider,
> &product[i].cost,
> &product[i].stock))
> {
> printf("Invalid file format\n");
> fclose(fp);
> return 0;
> }
> }
> fclose(fp);
> return i;
> }
>
>
> }
>
> The problem seems to be that each field I want to parse seems to be
> separated by a colon ( Is there anyway to tell fscanf() to parse up
> until you reach a colon and then stop and start scanning again, or
> should I give up this approach and try to tokenize the input stream?
> Any help is much appreciated.


"%s" will skip leading white space, grab a string,
and stop when it hits white space again. Hence, it's
no good for your input format, where white spaces can
occur as part of a data field.

You could use "%[^:]" to look for colon-delimited
fields, but the resulting program would be rather fragile.
One lousy line with an extra colon or a missing colon,
and you'll be out of step for the rest of the journey,
or until you trip and fall, whichever comes first.
(fscanf() is no respecter of line boundaries, and will
happily cross them in search of more input.)

Recommended approach: Use fgets() (but not gets()!!!)
to read each line into a big char[] array, and then pick
the line apart with other tools. sscanf() may be a choice
you'd find familiar -- and since sscanf() cannot run off
the end of its input array (and thus inadvertently bypass
line boundaries), some of the infelicities of fscanf()
disappear.
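
A minimal sketch of the fgets() plus sscanf() approach, under some assumptions: struct product, FIELD_LEN, parse_line() and read_products() are made-up names for illustration, and the 256-byte line buffer is an arbitrary choice. The trailing " %n" is one common way to check that nothing unexpected follows the last field.

```c
#include <assert.h>
#include <stdio.h>

#define FIELD_LEN 64   /* assumed size of the text fields */

/* Hypothetical record; the real struct productData is not shown. */
struct product {
    int   idnumber;
    char  description[FIELD_LEN];
    char  provider[FIELD_LEN];
    float cost;
    int   stock;
};

/* Parse one "id:description:provider:cost:stock" line.
 * Returns 1 on success, 0 if the line does not match that layout. */
int parse_line(const char *line, struct product *p)
{
    int end = 0;

    assert(line != NULL && p != NULL);

    /* %63[^:] stores at most 63 non-colon characters plus the '\0'.
     * The space before %n skips the trailing newline; %n then records
     * how many characters of the line were consumed. */
    if (sscanf(line, "%d:%63[^:]:%63[^:]:%f:%d %n",
               &p->idnumber, p->description, p->provider,
               &p->cost, &p->stock, &end) != 5)
        return 0;

    return line[end] == '\0';   /* reject trailing junk, e.g. a sixth field */
}

/* Read up to max records from fp, one line at a time. A malformed line
 * is reported and skipped, so it cannot put later lines out of step. */
size_t read_products(FILE *fp, struct product *out, size_t max)
{
    char line[256];
    size_t n = 0;

    assert(fp != NULL && out != NULL);
    while (n < max && fgets(line, sizeof line, fp) != NULL) {
        if (parse_line(line, &out[n]))
            n++;
        else
            fprintf(stderr, "skipping malformed line: %s", line);
    }
    return n;
}
```

Because each call to sscanf() sees exactly one line, a missing or extra colon spoils only that line. Note that %[^:] needs at least one character, so an empty field is treated as malformed here; whether that is acceptable depends on the data.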

--
(E-Mail Removed)

 
Ben C
      03-31-2006
On 2006-03-31, (E-Mail Removed) <(E-Mail Removed)> wrote:
> [...]
> The info in product_catalog.txt looks like this:
>
> 1010:CD drive external 32x :1MagiCopy:15.5:100
> 1020:CD drive external 40x :20th Century Fox:16.74:130
> 1030:CD drive external 48x :3COM:13.48:160
> 1040:CD drive external 52x :4XEM:15.92:190
>
> We need to write it to another file that is going to look like this
>
> ID Number Description Provider Cost Stock Total
> 1010 CD Drive 32x 1MagiCopy 15.50 100 1550.00
>
> Since the text file to be read from is preformatted I thought I could
> use the fscanf() to to parse each line and assign it into structure
> variables, but I am having problems.
>
> Here is my code to read the file:
>
> int readFile (char *filename, struct productData product[], size_t
> arrLen)
> /* Returns number of products read */
> {
> FILE *fp;
>
> if ( ( fp = fopen( "product_catalog.txt", "rb+" ) ) == NULL ) {
> printf( "File could not be opened.\n" );
> } /* end if */
>
> else
> {
> int i;
> for (i=0; i<arrLen && !feof(fp); i++)
> {
> if (5 != fscanf(fp, "%d %s %s %f %d",
> &product[i].idnumber,
> product[i].description,
> product[i].provider,
> &product[i].cost,
> &product[i].stock))
> {
> printf("Invalid file format\n");
> fclose(fp);
> return 0;
> }
> }
> fclose(fp);
> return i;
> }
> }
>
> The problem seems to be that each field I want to parse seems to be
> separated by a colon ( Is there anyway to tell fscanf() to parse up
> until you reach a colon and then stop and start scanning again, or
> should I give up this approach and try to tokenize the input stream?


You put the colons in the format string:

if (5 != fscanf(fp, "%d:%s:%s:%f:%d" ...

But this still won't work quite right, because %s will make fscanf
stop at the spaces.

You can use %[^:] to mean "series of non-colons" so:

if (5 != fscanf(fp, "%d:%[^:]:%[^:]:%f:%d" ...

should do the trick.

You also have to be careful that badly formatted input data can't
overflow the arrays you're storing the data in. fscanf provides
format modifiers for this: it can scan up to a maximum field width, and
some implementations can allocate the buffers for you.

e.g.:

if (5 != fscanf(fp, "%d:%63[^:]:%63[^:]:%f:%d" ...

if your buffers for description and provider were 64 bytes long (the
width does not count the terminating null character, hence 63). A longer
field would be cut off at the width, and the colon that follows in the
format would then fail to match, so fscanf would return fewer than 5.
That might not be acceptable. In that case you could try %a[^:], a glibc
extension (see fscanf manual).

The other point is that, if you have any choice in the matter, C is not
the best language for this task. You'd be much better off with something
else: Python, Tcl, Perl, that kind of thing. Awk might be the perfect
choice.
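
For completeness, here is a rough sketch of the fscanf() route with the return value checked on every record instead of relying on feof(). The struct, the name scan_catalog() and the -1 error convention are assumptions made for the example, not anything from the original program.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record; the real struct productData is not shown. */
struct catalog_entry {
    int   idnumber;
    char  description[64];
    char  provider[64];
    float cost;
    int   stock;
};

/* Read colon-separated records straight from the stream.
 * Returns the number of records read, or -1 on a malformed record. */
long scan_catalog(FILE *fp, struct catalog_entry *out, size_t max)
{
    size_t n = 0;

    assert(fp != NULL && out != NULL);
    while (n < max) {
        /* Width 63 leaves room for the '\0' in the 64-byte arrays. */
        int got = fscanf(fp, "%d:%63[^:]:%63[^:]:%f:%d",
                         &out[n].idnumber, out[n].description,
                         out[n].provider, &out[n].cost, &out[n].stock);
        if (got == EOF)
            break;          /* input ended before another record began */
        if (got != 5)
            return -1;      /* partial or malformed record */
        n++;
    }
    return (long)n;
}
```

The leading %d skips white space, including the newline left over from the previous record, which is why no explicit newline handling appears. The same property means a damaged line can bleed into the next one, so this version gives up at the first bad record rather than trying to resynchronise.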
 
CBFalconer
      04-01-2006
Eric Sosman wrote:
> [...]
> Recommended approach: Use fgets() (but not gets()!!!)
> to read each line into a big char[] array, and then pick
> the line apart with other tools. sscanf() may be a choice
> you'd find familiar -- and since sscanf() cannot run off
> the end of its input array (and thus inadvertently bypass
> line boundaries), some of the infelicities of fscanf()
> disappear.


I would suggest he keep things as simple as possible. He could use
my ggets() to input the lines, and my toksplit to parse them.
toksplit was published here a few days ago, just search the group
archives. ggets is available on my page at:

<http://cbfalconer.home.att.net/download/ggets.zip>

Then the code will look much like:

char *ln, *tmp;
int ix;
char tok[MAXTOKEN + 1]; /* allow for '\0' always */

while (0 == ggets(&ln)) {
    tmp = ln; ix = 0;
    while (*tmp) {
        tmp = toksplit(tmp, ':', tok, MAXTOKEN);
        ix++; /* just to keep track of which token in line */
        /* code to modify and output from tok */
        /* probably best isolated in a separate function */
    }
    free(ln);
}

Notice that the only configuration constants are MAXTOKEN and what
the token delimiting character (':' here) actually is.
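
Since neither ggets nor toksplit is reproduced in this thread, the following is only a stand-in written with standard library calls, to show the general shape of such a splitter. The name next_field() and its return convention are invented for this sketch and are not the toksplit interface.

```c
#include <assert.h>
#include <string.h>

/* Copy the next field of src (everything up to delim or the end of the
 * string) into dst, truncating to dstlen - 1 characters.
 * Returns a pointer to the start of the following field, or NULL when
 * the field just copied was the last one. src itself is not modified. */
const char *next_field(const char *src, char delim, char *dst, size_t dstlen)
{
    const char *end;
    size_t len, n;

    assert(src != NULL && dst != NULL && dstlen > 0);

    end = strchr(src, delim);
    len = end ? (size_t)(end - src) : strlen(src);
    n = len < dstlen - 1 ? len : dstlen - 1;   /* bounded copy */

    memcpy(dst, src, n);
    dst[n] = '\0';

    return end ? end + 1 : NULL;
}
```

A loop would call next_field() repeatedly, feeding the returned pointer back in until it is NULL. Unlike strtok(), this leaves the input line untouched and reports an empty field (two adjacent colons) as an empty string instead of skipping it.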

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>

 
bfowlkes@gmail.com
      04-01-2006
I am making some progress, but not much, unfortunately. Using these two
code segments that I found in another post, I was able to parse out
each field as text. The output looks like this:

Line number: 1
Token: 1010
Token: CD drive external 32x
Token: 1MagiCopy
Token: 15.5
Token: 100

Line number: 2
Token: 1020
Token: CD drive external 40x
Token: 20th Century Fox
Token: 16.74
Token: 130


size_t get_line( FILE *f , char *line, size_t len )
{
    char *ptr;

    ptr = fgets( line, len, f );

    if( NULL == ptr ) {
        line[0] = '\0';
        return 0;
    }

    if( NULL != (ptr = strchr(line, DELIMITER)) ) *ptr = '\0';

    return strlen(line);
}

while( 0 != get_line( fp, data, sizeof(data)) ) {
    count++;
    printf( "Line number: %d\n", count );
    for( ptr0 = data; NULL != (ptr1 = strtok(ptr0, TOKEN)); ptr0 = NULL )
        printf( "Token: %s\n", ptr1 );
    putchar( '\n' );
}


What I was going to do was assign each field value into an array of
structures, but that gives me a segmentation fault. Is there another way
to achieve the main objective?

 
Ben C
      04-01-2006
On 2006-04-01, (E-Mail Removed) <(E-Mail Removed)> wrote:
> I am making some progress, but not much unfortunately. Using these two
> code segments that I found from another post I was able to parse out
> each field as a text file, the output looks like this:
>
> Line number: 1
> Token: 1010
> Token: CD drive external 32x
> Token: 1MagiCopy
> Token: 15.5
> Token: 100
>
> Line number: 2
> Token: 1020
> Token: CD drive external 40x
> Token: 20th Century Fox
> Token: 16.74
> Token: 130
>
> size_t get_line( FILE *f , char *line, size_t len )
> {
> char *ptr;
>
>
> ptr = fgets( line, len, f );
>
>
> if( NULL == ptr ) {
> line[0] = '\0';
> return 0;
> }
>
>
> if( NULL != (ptr = strchr(line, DELIMITER)) ) *ptr = '\0';
>
>
> return strlen(line);
> }
>
> while( 0 != get_line( fp, data, sizeof(data)) ) {
> count++;
> printf( "Line number: %d\n", count );
> for( ptr0 = data; NULL != (ptr1 = strtok(ptr0, TOKEN)); ptr0 =
> NULL )
> printf( "Token: %s\n", ptr1 );
> putchar( '\n' );
> }


> What I was going to do was assign each field value into an array of
> structures, but it gives me a segmentation fault, is there another way
> to achieve the main objective?


If the main objective is just to print it all out again formatted
differently, you can maybe do that in the loop, and avoid having to
store the data.

But you should be able to fix the segmentation fault! The error might be
in part of the code we can't see-- it looks from "data, sizeof(data)"
that data is an array; where do you declare it? And how's the array of
structures created?

In any case, you reuse the same buffer for each line, so you're going to
have to actually copy the strings out somehow.

Guessing, but the problem may be that you're just copying the pointers,
but not duplicating the actual strings.

for( ptr0 = data; NULL != (ptr1 = strtok(ptr0, TOKEN)); ptr0 = NULL )

records[i].name = ptr1; /* very likely to be wrong */
records[i].name = strdup(ptr1); /* some chance of working */
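
To make the copying point concrete, here is a small sketch that stores each token in the record itself, so nothing in the record points into the reused line buffer. The struct layout and the name fill_record() are guesses for illustration, as the real structure has not been posted, and atoi()/atof() are used only for brevity (they do no error checking).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record with its own storage for the text fields. */
struct record {
    int    idnumber;
    char   description[64];
    char   provider[64];
    double cost;
    int    stock;
};

/* Split line on ':' (strtok modifies line in place) and copy each field
 * into rec. Returns 1 if exactly five fields were found, else 0. */
int fill_record(char *line, struct record *rec)
{
    char *tok;
    int field = 0;

    assert(line != NULL && rec != NULL);

    for (tok = strtok(line, ":\n"); tok != NULL; tok = strtok(NULL, ":\n")) {
        switch (field++) {
        case 0: rec->idnumber = atoi(tok); break;
        case 1: /* snprintf copies at most 63 characters and always terminates */
            snprintf(rec->description, sizeof rec->description, "%s", tok);
            break;
        case 2:
            snprintf(rec->provider, sizeof rec->provider, "%s", tok);
            break;
        case 3: rec->cost = atof(tok); break;
        case 4: rec->stock = atoi(tok); break;
        default: break;   /* extra fields are counted but not stored */
        }
    }
    return field == 5;
}
```

With fixed-size arrays inside the struct there is nothing to free later, at the cost of truncating over-long fields. The strdup() variant above avoids truncation but needs a matching free() for every string.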

HTH
 
CBFalconer
      04-02-2006
(E-Mail Removed) wrote:
>
> I am making some progress, but not much unfortunately. Using these
> two code segments that I found from another post I was able to
> parse out each field as a text file, the output looks like this:


You reply to my posting, but ignore all that I suggested, and
refuse to quote proper context. I see no point in anyone
attempting to assist you further.



 