Velocity Reviews - Computer Hardware Reviews

Reading Words from File

 
 
dough
Guest
Posts: n/a
 
      10-04-2005
I want to read in lines from a file and then separate the words so I
can process each of them. Say the text file "readme.txt"
contains the following:

In the face of criticism from the left and right, President Bush
insisted Tuesday that Harriet Miers is the nation's best-qualified
candidate for the Supreme Court and assured skeptical conservatives
that his lawyer...

I would like to get input into a char *s such that s = "In" and then I do
something with s, then s = "the" and then I do something with that,
etc. I have no idea in advance how long any word, line, or run of
whitespace will be.

Here's what I have so far.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void process(char *s) /* what's here is not really important */
{
    printf("%s", s);
}

int main() {

    char buffer[80];
    FILE *f = fopen("readme.txt", "r");
    char *s;

    while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
    {
        while( sscanf(buffer, "%s", s) ) /* scans for words in line */
        {
            process(s); /* do stuff to the words */
        }
    }

    fclose(f);
    return 0;

}

Also, is there any way to adjust the size of the buffer or reallocate
the memory so it doesn't overflow and cause a segmentation fault?

 
 
 
 
 
Alexei A. Frounze
Guest
Posts: n/a
 
      10-04-2005
"dough" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) ups.com...
> I want to read in lines from a file and then seperate the words so i
> can do a process on each of the words. Say the text file "readme.txt"
> contains the following:
>
> In the face of criticism from the left and right, President Bush
> insisted Tuesday that Harriet Miers is the nation's best-qualified
> candidate for the Supreme Court and assured skeptical conservatives
> that his lawyer...
>
> I could get an input to a char *s such that s = "In" and then i do
> something with s, then s = "the" and then i do something with that,
> etc. With no idea the length of any string or line or whitespace.


I don't want to be harsh, but it seems to me the 2nd paragraph is off topic
and unwise for a poster looking for help...

Alex


 
 
 
 
 
Walter Roberson
Guest
Posts: n/a
 
      10-04-2005
In article <(E-Mail Removed). com>,
dough <(E-Mail Removed)> wrote:
:I want to read in lines from a file and then seperate the words so i
:can do a process on each of the words.

There is often a non-trivial semantic problem in deciding what
a "word" is in such matters. For example, in

"Oh!," he yelled (into his Hello-Kitty phone.)

then if you go by whitespace you get "words" such as

"Oh!," and (into and phone.) and Hello-Kitty

which is usually not the breakdown you want.
--
These .signatures are sold by volume, and not by weight.
 
 
Eric Sosman
Guest
Posts: n/a
 
      10-04-2005


dough wrote On 10/04/05 14:39,:
> I want to read in lines from a file and then seperate the words so i
> can do a process on each of the words. Say the text file "readme.txt"
> contains the following:
>
> In the face of criticism from the left and right, President Bush
> insisted Tuesday that Harriet Miers is the nation's best-qualified
> candidate for the Supreme Court and assured skeptical conservatives
> that his lawyer...
>
> I could get an input to a char *s such that s = "In" and then i do
> something with s, then s = "the" and then i do something with that,
> etc. With no idea the length of any string or line or whitespace.
>
> Heres what I have so far.
>
> #include <ctype.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> void process(char *s) /* whats here is not really important *
> {
> printf("%s", s);
> }
>
> int main() {
>
> char buffer[80];
> FILE *f = fopen("readme.txt", "r");
> char *s;


It would be a good idea to test `f == NULL' before
proceeding ...

> while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
> {
> while( sscanf(buffer, "%s", s) ) /* scans for words in line */


Here's a problem: `s' doesn't point to anything, so
when scanf() locates a word and tries to copy it to the
memory `s' points at, all manner of mischief can ensue.
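
For illustration, here is a minimal sketch of one way to give the word somewhere to live: a real array instead of a bare pointer, plus a field width so sscanf() cannot write past the end of it. The names first_word and WORD_MAX are made up for this example and do not appear in the original post; the limit of 63 characters is an arbitrary assumption.

```c
#include <assert.h>
#include <stdio.h>

#define WORD_MAX 63   /* hypothetical limit chosen for this sketch */

/*
 * Copy the first whitespace-delimited word of `line` into `word`,
 * which must have room for WORD_MAX + 1 characters.
 * Returns 1 if a word was stored, 0 if the line holds no word.
 */
int first_word(const char *line, char *word)
{
    assert(line != NULL && word != NULL);

    /* The field width has to be written out in the format string,
       so keep the 63 below in sync with WORD_MAX. sscanf() returns
       EOF, not 0, when it finds nothing but whitespace, hence the
       explicit comparison with 1. */
    return sscanf(line, "%63s", word) == 1;
}
```

A caller would declare `char word[WORD_MAX + 1];` and pass that array; a word longer than the limit is cut off at 63 characters rather than overflowing.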

> {
> process(s); /* do stuff to the words */
> }
> }
>
> fclose(f);
> return 0;
>
> }



> Also, is there anyway to adjust the size of the buffer or reallocate
> the memory so it doesn't overflow and get a seg error.


If you used malloc() to create the space for `buffer', you
could use realloc() to enlarge it. But the immediate problem
is not the size of `buffer', but the uninitialized `s'.

Your overall task sounds like a job for the much-maligned
strtok() function. However, see Walter Roberson's post for
some of the pitfalls of using simple string-bashing to separate
"words" from their surroundings.

--
(E-Mail Removed)

 
 
Christopher Benson-Manica
Guest
Posts: n/a
 
      10-04-2005
Walter Roberson <(E-Mail Removed)-cnrc.gc.ca> wrote:

> There is often a non-trivial semantic problem in deciding what
> a "word" is in such matters. For example, in


> "Oh!," he yelled (into his Hello-Kitty phone.)


I must say that that is a truly bizarre example sentence. That
aside, it seems to me that assuming a "word" is a sequence of
consecutive alpha characters would yield better results, at least
depending on what OP wants to do with the "words" once he has them.
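
As a rough sketch of the "consecutive alpha characters" rule, assuming the line is already in memory as a string: the function name next_alpha_word is hypothetical, and words longer than the caller's buffer are simply truncated here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Copy the next run of alphabetic characters starting at *pos into
 * `word` (at most size-1 characters, always '\0'-terminated) and
 * advance *pos past that run.
 * Returns 1 if a word was found, 0 when the string is exhausted.
 */
int next_alpha_word(const char **pos, char *word, size_t size)
{
    const char *p;
    size_t n = 0;
    int found = 0;

    assert(pos != NULL && *pos != NULL && word != NULL && size > 0);
    p = *pos;

    /* Skip everything that cannot start a word. The cast avoids
       undefined behaviour for negative char values. */
    while (*p != '\0' && !isalpha((unsigned char)*p))
        p++;

    /* Collect the run; characters that do not fit are dropped. */
    while (isalpha((unsigned char)*p)) {
        found = 1;
        if (n + 1 < size)
            word[n++] = *p;
        p++;
    }
    word[n] = '\0';
    *pos = p;
    return found;
}
```

Called in a loop on the example sentence, this yields Oh, he, yelled, into, his, Hello, Kitty and phone, so the punctuation disappears but Hello-Kitty is split in two.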

--
Christopher Benson-Manica | I *should* know what I'm talking about - if I
ataru(at)cyberspace.org | don't, I need to know. Flames welcome.
 
 
Hemanth
Guest
Posts: n/a
 
      10-04-2005
dough wrote:
> I want to read in lines from a file and then seperate the words so i
> can do a process on each of the words.



Use the strtok() function to split a string into words (use
whitespace or any other separator you want).
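
A minimal sketch of what that could look like, assuming whitespace is the only separator. The names for_each_word, word_fn and print_word are invented for this example. Keep in mind that strtok() writes '\0' bytes into the string it scans and keeps hidden state between calls, so the line must be a writable array and two tokenizing loops cannot be interleaved.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: called once for every word found. */
typedef void (*word_fn)(const char *word, void *ctx);

/*
 * Split `line` in place on whitespace and call `fn` for each token.
 * Returns the number of tokens found.
 */
size_t for_each_word(char *line, word_fn fn, void *ctx)
{
    static const char delims[] = " \t\r\n\f\v";
    size_t count = 0;
    char *tok;

    assert(line != NULL && fn != NULL);

    /* First call passes the string, later calls pass NULL to continue. */
    for (tok = strtok(line, delims); tok != NULL; tok = strtok(NULL, delims)) {
        fn(tok, ctx);
        count++;
    }
    return count;
}

/* Example callback: print each word on its own line. */
void print_word(const char *word, void *ctx)
{
    (void)ctx;   /* unused in this example */
    printf("%s\n", word);
}
```

Inside the fgets() loop this would be called as `for_each_word(buffer, print_word, NULL);`. Punctuation stays attached to the tokens, which is the pitfall Walter Roberson describes.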


> char buffer[80];
> FILE *f = fopen("readme.txt", "r");
> while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
>
> Also, is there anyway to adjust the size of the buffer or reallocate
> the memory so it doesn't overflow and get a seg error.



The fgets() call reads until num-1 characters have been read (79 in
this case) or a newline or EOF is reached, whichever happens first.
So I don't think you need a realloc in this case.
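
One caveat worth sketching: fgets() will not overflow the buffer, but a line longer than the buffer arrives in several pieces, and a word can be cut where two pieces meet. The helper below (read_chunk is a hypothetical name) reports whether the piece just read ends a line, so the caller can tell the two cases apart.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one chunk with fgets() and report whether it ends a line.
 * Returns -1 at end of file or on a read error,
 *          1 if the chunk ends a line (newline seen, or EOF reached),
 *          0 if the buffer filled up before a newline was seen.
 */
int read_chunk(FILE *f, char *buf, size_t size)
{
    size_t len;

    assert(f != NULL && buf != NULL && size > 1);

    if (fgets(buf, (int)size, f) == NULL)
        return -1;

    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n')
        return 1;

    /* No newline: either the buffer is full, or this is the last
       line of a file that does not end in a newline. */
    return feof(f) ? 1 : 0;
}
```

When the result is 0, the last word in the buffer may continue in the next chunk and should not be processed yet.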


HTH,
Hemanth

 
 
Michael Mair
Guest
Posts: n/a
 
      10-04-2005
dough wrote:
> I want to read in lines from a file and then seperate the words so i
> can do a process on each of the words. Say the text file "readme.txt"
> contains the following:
>
> In the face of criticism from the left and right, President Bush
> insisted Tuesday that Harriet Miers is the nation's best-qualified
> candidate for the Supreme Court and assured skeptical conservatives
> that his lawyer...
>
> I could get an input to a char *s such that s = "In" and then i do
> something with s, then s = "the" and then i do something with that,
> etc. With no idea the length of any string or line or whitespace.


I am not sure what your problem is.
When you have a problem, please help us help you:
State what you want to achieve (this part seems clear) and
what about your solution did not work.
Otherwise, everyone tells you about A because you seemed to
ask for B while meaning C...

>
> Heres what I have so far.
>
> #include <ctype.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> void process(char *s) /* whats here is not really important *
> {
> printf("%s", s);
> }
>
> int main() {
>
> char buffer[80];
> FILE *f = fopen("readme.txt", "r");
> char *s;


Check whether f is != NULL. If you omitted the check for
brevity, then write a comment.

> while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
> {
> while( sscanf(buffer, "%s", s) ) /* scans for words in line */
> {
> process(s); /* do stuff to the words */
> }
> }


Okay, so what is the problem here? Just about everything:
1) You may inadvertently split a word if your buffer is not
long enough (not critical).
2) You always scan from the same position (buffer is effectively &buffer[0]).
3) You read your string into memory pointed to by an uninitialized pointer.

Consider
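    char s[sizeof buffer] = "", *tmp = NULL;
    int n;
    while (....)
    {
        tmp = buffer;
        /* %n stores the number of characters consumed so far,
           including skipped leading whitespace. sscanf() returns
           EOF (which is nonzero) once only whitespace is left,
           so compare against 1. */
        while ( sscanf(tmp, "%s%n", s, &n) == 1 )
        {
            process(s);
            tmp += n;
        }
        /* a */
    }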
This solves 2) and 3).
Another solution is the use of strtok() etc.

If you check at point "a" whether buffer[strlen(buffer)-1]=='\n',
then you can also detect instances of 1).
However, this may not be what you are looking for (see below)

>
> fclose(f);
> return 0;
>
> }
>
> Also, is there anyway to adjust the size of the buffer or reallocate
> the memory so it doesn't overflow and get a seg error.


realloc() helps you do that.
Have a look at the comp.lang.c archives to see how to use it.
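
As a starting point, here is a hedged sketch of the usual fgets() plus realloc() pattern for reading a whole line of any length. The function name read_line and the starting capacity of 80 are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Read one line of any length from `f` into a malloc()ed string.
 * The trailing '\n' is stripped. Returns NULL at end of file, on a
 * read error, or if memory runs out; the caller frees the result.
 */
char *read_line(FILE *f)
{
    size_t cap = 80, len = 0;
    char *buf, *tmp;

    assert(f != NULL);

    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(cap - len), f) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n') {
            buf[len - 1] = '\0';
            return buf;
        }
        /* Buffer full without a newline: double it and keep reading.
           The result of realloc() goes into a temporary so the old
           block is not lost if the call fails. */
        if (len + 1 == cap) {
            tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }

    if (len == 0 || ferror(f)) {   /* nothing read, or read error */
        free(buf);
        return NULL;
    }
    return buf;   /* last line had no trailing newline */
}
```

The main loop then becomes `while ((line = read_line(f)) != NULL) { ...; free(line); }`, and no word is ever split by a buffer boundary.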

If you do not need the words in context, you can also use getc(), which
may be clearer:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define START_BUFSIZE 20


void process(const char *s);
int resize_buffer (char **buf, size_t *len);


int main (void)
{
    FILE *f;
    char *s = NULL;
    size_t length = 0;
    int input;

    if (NULL == (f = fopen("readme.txt", "r")))
    {
        fprintf(stderr, "Cannot open file\n");
        exit(EXIT_FAILURE);
    }
    if (NULL == (s = malloc((START_BUFSIZE+1) * sizeof *s)))
    {
        fprintf(stderr, "Error on allocating memory for s\n");
        fclose(f);
        exit(EXIT_FAILURE);
    }
    length = START_BUFSIZE;

    do /* ... while (input != EOF) */
    {
        size_t curr = 0;

        /* Read up to the first whitespace */
        while (!isspace(input = getc(f)) && input != EOF)
        {
            s[curr++] = input;
            if (curr == length)
            {
                if (resize_buffer(&s, &length))
                {
                    /* perform error handling */
                    break;
                }
            }
        }
        /* Make s a string */
        s[curr] = '\0';

        if (curr)
            process(s);

        /* Read up to the first non-whitespace */
        while ((input = getc(f)) != EOF)
        {
            putchar('*');
            if (!isspace(input))
            {
                ungetc(input, f);
                break;
            }
        }
    } while (input != EOF);

    free(s);
    fclose(f);

    putchar('\n');

    return 0;
}

void process(const char *s) /* what's here is not really important */
{
    printf("%s", s); fflush (stdout);
}

int resize_buffer (char **buf, size_t *len)
{
    /* Using mybuf and mylen for readability */
    char *mybuf = *buf;
    size_t mylen = *len;

    char *tmp;
    size_t destlen = 2*mylen+1;

    /* A */
    if (NULL == (tmp = realloc(mybuf, destlen)))
    {
        return 1;
    }
    mybuf = tmp;
    mylen = destlen - 1;

    /* write back to parameters */
    *buf = mybuf;
    *len = mylen;

    return 0;
}


Cheers
Michael
--
E-Mail: Mine is an /at/ gmx /dot/ de address.
 
 
Walter Roberson
Guest
Posts: n/a
 
      10-04-2005
In article <dhumdl$j2o$(E-Mail Removed)>,
Christopher Benson-Manica <(E-Mail Removed)> wrote:
>Walter Roberson <(E-Mail Removed)-cnrc.gc.ca> wrote:


>> There is often a non-trivial semantic problem in deciding what
>> a "word" is in such matters.


>aside, it seems to me that assuming a "word" is a sequence of
>consecutive alpha characters would yield better results, at least
>depending on what OP wants to do with the "words" once he has them.


Using "alpha" as the boundary definition runs into difficulties
with possessives, contractions, joined-words, and words such as
re-enter in which the dash indicates separation of vowels that
would otherwise form a diphthong. It would likely also run
into problems with Mr. Salutation, and abbreviations such as etc.
in which the period is really part of the word.
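
To make the trade-off concrete, here is one possible refinement, offered only as a sketch: letters count as word characters, and an apostrophe or hyphen counts too when it sits between two letters. The function name word_length is hypothetical. This rule keeps nation's and re-enter whole, but it still drops the period from abbreviations such as "Mr." and "etc.", so it does not settle the question.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Return the length of the word that starts at `s`, or 0 if `s`
 * does not start with a letter. A word is a run of letters in which
 * a single apostrophe or hyphen may appear between two letters.
 */
size_t word_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);

    while (s[n] != '\0') {
        unsigned char c = (unsigned char)s[n];
        unsigned char next = (unsigned char)s[n + 1];

        if (isalpha(c))
            n++;                 /* ordinary letter */
        else if ((c == '\'' || c == '-') && n > 0 && isalpha(next))
            n++;                 /* joiner with a letter on both sides */
        else
            break;               /* anything else ends the word */
    }
    return n;
}
```

A scanner built on this would skip non-letters, call word_length() at each letter, and copy that many characters out as one word.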
--
Okay, buzzwords only. Two syllables, tops. -- Laurie Anderson
 
 
Eric Sosman
Guest
Posts: n/a
 
      10-04-2005


Christopher Benson-Manica wrote On 10/04/05 15:50,:
> Walter Roberson <(E-Mail Removed)-cnrc.gc.ca> wrote:
>
>
>>There is often a non-trivial semantic problem in deciding what
>>a "word" is in such matters. For example, in

>
>
>> "Oh!," he yelled (into his Hello-Kitty phone.)

>
>
> I must say that that is a truly bizarre example sentence That
> aside, it seems to me that assuming a "word" is a sequence of
> consecutive alpha characters would yield better results, at least
> depending on what OP wants to do with the "words" once he has them.


This is a reasonable 1st approximation, but its tendency
to generate non-words (e.g., "st") isn't desirable.

--
(E-Mail Removed)


 
 
Barry
Guest
Posts: n/a
 
      10-04-2005

"dough" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) ups.com...
> I want to read in lines from a file and then seperate the words so i
> can do a process on each of the words. Say the text file "readme.txt"
> contains the following:
>
> In the face of criticism from the left and right, President Bush
> insisted Tuesday that Harriet Miers is the nation's best-qualified
> candidate for the Supreme Court and assured skeptical conservatives
> that his lawyer...
>
> I could get an input to a char *s such that s = "In" and then i do
> something with s, then s = "the" and then i do something with that,
> etc. With no idea the length of any string or line or whitespace.
>
> Heres what I have so far.
>
> #include <ctype.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> void process(char *s) /* whats here is not really important *
> {
> printf("%s", s);
> }
>
> int main() {
>
> char buffer[80];
> FILE *f = fopen("readme.txt", "r");
> char *s;
>
> while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
> {
> while( sscanf(buffer, "%s", s) ) /* scans for words in line */
> {
> process(s); /* do stuff to the words */
> }
> }
>
> fclose(f);
> return 0;
>
> }
>
> Also, is there anyway to adjust the size of the buffer or reallocate
> the memory so it doesn't overflow and get a seg error.
>


"process" is a terrible name for a function in any context.

Barry


 