Reading text file contents to a character buffer

Navaneeth
      08-02-2010
Hello,

I am trying to load a text file into an in-memory character buffer. I
wrote the following code.

static char *readcontent(const char *filename)
{
    char *fcontent = NULL, *tmp = NULL;
    size_t result, elements_read = 0;
    int pagenum = 1;

    FILE *fp;
    fp = fopen(filename, "r");

    if(fp) {
        while(1) {
            tmp = (char*) realloc(fcontent, PAGE_SIZE * pagenum);
            if(tmp) {
                fcontent = tmp;
            }
            else {
                free(fcontent);
                fcontent = NULL;
                printf("failed to reallocate memory\n");
                break;
            }

            result = fread(fcontent + elements_read, 1, PAGE_SIZE, fp);

            if(result != PAGE_SIZE) {
                if(feof(fp)) {
                    break;
                }
                else {
                    printf("error reading file");
                    break;
                }
            }
            elements_read += result;
            ++pagenum;
        }
        fclose(fp);
    }
    return fcontent;
}

This is working fine. Since I am new to C, I am wondering whether this
is the correct approach. Any suggestions to improve the code are
welcome. This code is expected to compile at least on GCC and MSVC.

Any thoughts?

Thanks
 
Ben Bacarisse
      08-02-2010
Navaneeth <(E-Mail Removed)> writes:

> I am trying to load a text file into an in-memory character buffer. I
> wrote the following code.
>
> static char *readcontent(const char *filename)
> {


From a purely functional point of view, I'd say that you should have
some way to return how long the file was to the caller. It may not
matter in your case but (a) you have the information so it's 'free' and
(b) it makes the function more useful. A good way to do this is to
supply a size_t * argument which, if non-NULL, gets set to the length of
the data written. Permitting a null pointer means that callers who
don't care about the length don't have to allocate a dummy variable just
for the purpose.

If you don't do this (and quite possibly even if you do) you should at
least null-terminate the result. As it stands, the caller can't use the
returned data without running off the end of the array.
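
Something along these lines, say (untested, keeping your PAGE_SIZE and
function name; the error printing is dropped, for reasons below):

#include <stdio.h>
#include <stdlib.h>

static char *readcontent(const char *filename, size_t *length)
{
    char *fcontent = NULL, *tmp = NULL;
    size_t result, elements_read = 0;
    int pagenum = 1;
    FILE *fp = fopen(filename, "r");

    if(fp) {
        while(1) {
            /* the +1 reserves room for the terminating null byte */
            tmp = realloc(fcontent, PAGE_SIZE * pagenum + 1);
            if(tmp == NULL) {
                free(fcontent);
                fcontent = NULL;
                break;
            }
            fcontent = tmp;

            result = fread(fcontent + elements_read, 1, PAGE_SIZE, fp);
            elements_read += result;
            if(result != PAGE_SIZE)
                break;   /* end of file, or a read error (check ferror if it matters) */
            ++pagenum;
        }
        fclose(fp);
    }
    if(fcontent) {
        fcontent[elements_read] = '\0';
        if(length != NULL)
            *length = elements_read;
    }
    return fcontent;
}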

> char *fcontent = NULL, *tmp = NULL;
> size_t result, elements_read = 0;
> int pagenum = 1;
>
> FILE *fp;
> fp = fopen(filename, "r");
>
> if(fp) {
> while(1) {
> tmp = (char*) realloc(fcontent, PAGE_SIZE * pagenum);


I would not cast the return from realloc.

A common trick is to increase the storage used by some factor (> 1)
rather than adding a fixed amount. This has the advantage of
reducing the number of realloc calls at the expense of a little more
memory.
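
In code, the reallocation step might look like this (untested; the
helper and its name are my invention, and it assumes <stdlib.h>):

/* grow a buffer's capacity by a factor of 1.5 */
static char *grow(char *buf, size_t *cap)
{
    size_t newcap = *cap + *cap / 2;   /* 1.5x growth keeps realloc calls O(log n) */
    char *tmp = realloc(buf, newcap);
    if(tmp != NULL)
        *cap = newcap;                 /* update the capacity only on success */
    return tmp;                        /* NULL on failure; the caller still owns buf */
}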

> if(tmp) {
> fcontent = tmp;
> }
> else {
> free(fcontent);
> fcontent = NULL;
> printf("failed to reallocate memory\n");


I would not print an error message to stdout. In fact, I would not
print an error message at all! A utility function like this should be
usable by programs that might want to report errors in all sorts of
other ways. In general, it is best to put error reporting as "high up"
in your program as is practical.

> break;
> }
>
> result = fread(fcontent + elements_read, 1, PAGE_SIZE,
> fp);
>
> if(result != PAGE_SIZE) {
> if(feof(fp)) {
> break;
> }
> else {
> printf("error reading file");
> break;
> }


I'd put the break here.

> }
> elements_read += result;
> ++pagenum;
> }
> fclose(fp);
> }
> return fcontent;
> }


<snip>
--
Ben.
 
BGB / cr88192
      08-02-2010

"Ben Bacarisse" <(E-Mail Removed)> wrote in message
news:0.18b2dae7adb5cead6481.20100802175331BST.87y6 (E-Mail Removed)...
> Navaneeth <(E-Mail Removed)> writes:
>
>> I am trying to load a text file into an in-memory character buffer. I
>> wrote the following code.
>>
>> static char *readcontent(const char *filename)
>> {

>
> From a purely functional point of view, I'd say that you should have
> some way to return how long the file was to the caller. It may not
> matter in your case but (a) you have the information so it's 'free' and
> (b) it makes the function more useful. A good way to do this is to
> supply a size_t * argument which, if non-NULL, gets set to the length of
> the data written. Permitting a null pointer means that callers who
> don't care about the length don't have to allocate a dummy variable just
> for the purpose.
>
> If you don't do this (and quite possibly even if you do) you should at
> least null-terminate the result. As it stands, the caller can't use the
> returned data without running off the end of the array.
>


yeah.


>> char *fcontent = NULL, *tmp = NULL;
>> size_t result, elements_read = 0;
>> int pagenum = 1;
>>
>> FILE *fp;
>> fp = fopen(filename, "r");
>>
>> if(fp) {
>> while(1) {
>> tmp = (char*) realloc(fcontent, PAGE_SIZE * pagenum);

>
> I would not cast the return from realloc.
>


I disagree. There is no harm done by the cast, and (among several other
practices commonly disliked here) it also aids in writing code which
works in both C and C++ compilers.


> A common trick is to increase the storage used by some factor (> 1)
> rather than adding a fixed amount. This has the advantage of
> reducing the number of realloc calls at the expense of a little more
> memory.
>


yeah, the 1.5^x curve works well in my case.
every time one needs more memory, scale the prior size by 1.5...
average case overhead is 25%.

scaling by a fixed size leads to more realloc, which means more time, and
greater memory fragmentation.
fragmentation may well be a greater enemy than the raw amount of memory
used.


>> if(tmp) {
>> fcontent = tmp;
>> }
>> else {
>> free(fcontent);
>> fcontent = NULL;
>> printf("failed to reallocate memory\n");

>
> I would not print an error message to stdout. In fact, I would not
> print an error message at all! A utility function like this should be
> usable by programs that might want to report errors in all sorts of
> other ways. In general, it is best to put error reporting as "high up"
> in your program as is practical.
>


yeah, status code and returning NULL being a common way to do this.

less common ways include callbacks, event queuing, exceptions, longjmp, ...
but each has drawbacks...

I list exceptions here, but these are typically OS/compiler/framework
specific (and not part of standard C), which is their main drawback.
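
ala, something like this (untested sketch; the enum and all the names
are made up, and it assumes <stdio.h> and <stdlib.h>):

enum read_status { READ_OK, READ_NO_FILE, READ_NO_MEM, READ_IO_ERROR };

/* returns NULL on failure and says why through *status */
static char *read_file(const char *filename, enum read_status *status);

static void example_caller(void)
{
    enum read_status st;
    char *text = read_file("foo.txt", &st);
    if(text == NULL)
        fprintf(stderr, "read failed (%d)\n", (int)st);  /* report "high up" */
    free(text);  /* free(NULL) is harmless */
}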


>> break;
>> }
>>
>> result = fread(fcontent + elements_read, 1, PAGE_SIZE,
>> fp);
>>
>> if(result != PAGE_SIZE) {
>> if(feof(fp)) {
>> break;
>> }
>> else {
>> printf("error reading file");
>> break;
>> }

>
> I'd put the break here.
>
>> }
>> elements_read += result;
>> ++pagenum;
>> }
>> fclose(fp);
>> }
>> return fcontent;
>> }

>


this function being a great example of CLC pedantics...


personally, I would just be like "to hell with it" and just use fseek/ftell,
since there is almost no way that a text file is going to be larger than 2GB
and expected to just be read into memory like this (it wouldn't fit on 32
bit systems anyways...).
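
ala, something like (untested; note the "rb" so that newline translation
doesn't throw ftell off, and the usual caveat that ftell returns a long):

#include <stdio.h>
#include <stdlib.h>

static char *readwhole(const char *filename, size_t *length)
{
    FILE *fp = fopen(filename, "rb");
    char *buf = NULL;
    long size;

    if(fp == NULL)
        return NULL;
    if(fseek(fp, 0, SEEK_END) == 0 && (size = ftell(fp)) >= 0) {
        rewind(fp);
        buf = malloc((size_t)size + 1);   /* +1 for the null terminator */
        if(buf != NULL) {
            size_t n = fread(buf, 1, (size_t)size, fp);
            buf[n] = '\0';
            if(length != NULL)
                *length = n;
        }
    }
    fclose(fp);
    return buf;
}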

typically, on a 32-bit system, even 500MB malloc's are prone to fail, since
often some random DLL will end up mapped right into the middle of an
otherwise clean 1GB span of memory, of the theoretically 2GB or 3GB or so
available to the app. (for those few apps which try to allocate memory in
such large chunks...).

but, memory usage patterns is its own complicated set of issues.


however, given that the average size of most "sane" text files is measurable
in kB, with a few larger ones being a few MB, usually there is little risk
of size overflows just using the fseek/ftell strategy...

(although, yes, if the file *is* larger than 2GB or so, then the size may
come back negative or otherwise completely wrong...).


but, oh well, whatever...


 
ImpalerCore
      08-02-2010
On Aug 2, 2:36 pm, "BGB / cr88192" <(E-Mail Removed)> wrote:

<snip>

> this function being a great example of CLC pedantics...
>
> personally, I would just be like "to hell with it" and just use fseek/ftell,
> since there is almost no way that a text file is going to be larger than 2GB
> and expected to just be read into memory like this (it wouldn't fit on 32
> bit systems anyways...).


Unless you have a data file that's over 2GB and you convert it to an
ASCII representation for 'grep-ability' and now it's ballooned to over
9GB. Not saying that it's a good thing, but I have done it before
when I've been in a pinch.

> typically, on a 32-bit system, even 500MB malloc's are prone to fail, since
> often some random DLL will end up mapped right into the middle of an
> otherwise clean 1GB span of memory, of the theoretically 2GB or 3GB or so
> available to the app. (for those few apps which try to allocate memory in
> such large chunks...).


I experience similar issues with 500MB mallocs failing, and I've not
been able to determine why. I don't allocate memory that large, but I
have some test programs that try to blow up memory and noticed the max
was around 500MB.

> but, memory usage patterns is its own complicated set of issues.
>
> however, given that the average size of most "sane" text files is measurable
> in kB, with a few larger ones being a few MB, usually there is little risk
> of size overflows just using the fseek/ftell strategy...
>
> (although, yes, if the file *is* larger than 2GB or so, then the size may
> come back negative or otherwise completely wrong...).
>
> but, oh well, whatever...


As soon as you say getting the size of a file, afaik it's outside the
scope of the standard. I know in my environment I can't use ftell to
get a 2GB+ NTFS file size since sizeof(long int) is 4, and thus I
have to resort to a system-specific method using uint64_t and
GetFileAttributesExA. Unfortunately there is no completely portable
method to get a file size. The best thing in my opinion is to make a
wrapper function that encapsulates the non-portable behavior so that
the non-portability is centralized in one spot if at all possible.
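
Something like this is what I have in mind (untested sketch;
get_file_size is a made-up name):

#include <stdint.h>

#ifdef _WIN32
#include <windows.h>

/* returns 0 on success, -1 on failure */
int get_file_size(const char *path, uint64_t *size)
{
    WIN32_FILE_ATTRIBUTE_DATA fad;
    if(!GetFileAttributesExA(path, GetFileExInfoStandard, &fad))
        return -1;
    *size = ((uint64_t)fad.nFileSizeHigh << 32) | fad.nFileSizeLow;
    return 0;
}
#else
#include <stdio.h>

/* fallback: portable, but limited to what ftell's long can represent */
int get_file_size(const char *path, uint64_t *size)
{
    FILE *fp = fopen(path, "rb");
    long n = -1;
    if(fp == NULL)
        return -1;
    if(fseek(fp, 0, SEEK_END) == 0)
        n = ftell(fp);
    fclose(fp);
    if(n < 0)
        return -1;
    *size = (uint64_t)n;
    return 0;
}
#endif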

Of course, I wouldn't fault someone for using the ftell/fseek combo if
that is all that is needed to handle his range of input file sizes as
long as that person doesn't claim it's portable.

Best regards,
John D.
 
BGB / cr88192
      08-02-2010

"ImpalerCore" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)...
On Aug 2, 2:36 pm, "BGB / cr88192" <(E-Mail Removed)> wrote:

<snip>

> this function being a great example of CLC pedantics...
>
> personally, I would just be like "to hell with it" and just use
> fseek/ftell,
> since there is almost no way that a text file is going to be larger than
> 2GB
> and expected to just be read into memory like this (it wouldn't fit on 32
> bit systems anyways...).


<--
Unless you have a data file that's over 2GB and you convert it to an
ASCII representation for 'grep-ability' and now it's ballooned to over
9GB. Not saying that it's a good thing, but I have done it before
when I've been in a pinch.
-->

note: by "and" I meant T&T=T, not T|T=T.

but, otherwise, potentially...
this depends on the app then.

I suspect this is likely to be a very rare case though, where most text one
deals with is typically in the kB or low MB range...


> typically, on a 32-bit system, even 500MB malloc's are prone to fail,
> since
> often some random DLL will end up mapped right into the middle of an
> otherwise clean 1GB span of memory, of the theoretically 2GB or 3GB or so
> available to the app. (for those few apps which try to allocate memory in
> such large chunks...).


<--
I experience similar issues with 500MB mallocs failing, and I've not
been able to determine why. I don't allocate memory that large, but I
have some test programs that try to blow up memory and noticed the max
was around 500MB.
-->

this is due mostly to address-space fragmentation...

as the size gets larger, the probability of there being an unbroken run of a
given size becomes increasingly smaller (although, the specifics depend on
the exact memory layout and allocation behaviors).

but, 500MB seems to be a good breakoff, as:
it represents, typically, around 1/4 of the total app-usable address space;
and the OS will load libraries, ... in ways which are not convenient for
maintaining an unbroken space.


there is much less of an issue on 64-bit systems, but there are still
issues; for example, trying to allocate a huge chunk of memory (several GB
or more) will often horridly bog down the system IME (or, at least on
Windows).

on Windows, a better strategy was to use reserved/uncommitted memory (via
VirtualAlloc), and then to commit it piecewise...

in this particular case, I had allocated a 4GB chunk of memory, mostly as I
needed a region of memory with a "window" where any point in the space could
be reached via a 32-bit offset, whereas with the wider (64-bit) address
space, I couldn't make any guarantees about distance (2GB would maybe be
better though, as 2GB allows any point to be reached from any other via a
signed 32-bit offset).
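
ala (untested Windows-only sketch; the function names are made up):

#include <windows.h>

/* reserve a big contiguous window of address space, no physical backing yet */
static char *reserve_window(SIZE_T reserve_size)
{
    return VirtualAlloc(NULL, reserve_size, MEM_RESERVE, PAGE_NOACCESS);
}

/* later, back [base+offset, base+offset+chunk) with real pages as needed */
static int commit_chunk(char *base, SIZE_T offset, SIZE_T chunk)
{
    return VirtualAlloc(base + offset, chunk, MEM_COMMIT, PAGE_READWRITE) != NULL;
}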

I haven't tested on Linux to see what happens with large malloc or mmap
calls.


> but, memory usage patterns is its own complicated set of issues.
>
> however, given that the average size of most "sane" text files is
> measurable
> in kB, with a few larger ones being a few MB, usually there is little risk
> of size overflows just using the fseek/ftell strategy...
>
> (although, yes, if the file *is* larger than 2GB or so, then the size may
> come back negative or otherwise completely wrong...).
>
> but, oh well, whatever...


<--
As soon as you say getting the size of a file, afaik it's outside the
scope of the standard. I know in my environment I can't use ftell to
get a 2GB+ NTFS file size since sizeof(long int) is 4, and thus I
have to resort to a system-specific method using uint64_t and
GetFileAttributesExA. Unfortunately there is no completely portable
method to get a file size. The best thing in my opinion is to make a
wrapper function that encapsulates the non-portable behavior so that
the non-portability is centralized in one spot if at all possible.
-->

agreed...

IME, one typically ends up with a lot of OS-specific wrappers, usually
dedicated to their own source files (like, having various OS-specific source
files, and then swapping them out for the OS in question).
this is typically much cleaner than the use of fine-grained ifdef's
(although, sometimes I have used coarse-grained ifdef's in these files,
typically so that I don't have to jerk off the Makefile so much, and so the
entire file contents disappear if not on the OS in question).


<--
Of course, I wouldn't fault someone for using the ftell/fseek combo if
that is all that is needed to handle his range of input file sizes as
long as that person doesn't claim it's portable.
-->

one can claim it is portable (since it *is* part of the standard), so long
as they also don't claim it will work with larger files...



 
Ben Bacarisse
      08-02-2010
"BGB / cr88192" <(E-Mail Removed)> writes:

> "Ben Bacarisse" <(E-Mail Removed)> wrote in message
> news:0.18b2dae7adb5cead6481.20100802175331BST.87y6 (E-Mail Removed)...
>> Navaneeth <(E-Mail Removed)> writes:

<snip>
>>> tmp = (char*) realloc(fcontent, PAGE_SIZE * pagenum);

>>
>> I would not cast the return from realloc.

>
> I disagree. There is no harm done by the cast, and (among several other
> practices commonly disliked here) it also aids in writing code which
> works in both C and C++ compilers.


You disagree that I would not cast the return from realloc?

I tried to put it as mildly as possible with the sole purpose of
encouraging the OP to think about it. I don't mind either way, but the
argument about compiling with C++ is (or should be) a red herring. It
is trivial to link C with C++ and preferable to compile the C with a C
compiler. In cases like this there is no reason to do anything else.

>> A common trick is to increase the storage used by some factor (> 1)
>> rather than adding a fixed amount. This has the advantage of
>> reducing the number of realloc calls at the expense of a little more
>> memory.

>
> yeah, the 1.5^x curve works well in my case.
> every time one needs more memory, scale the prior size by 1.5...
> average case overhead is 25%.
>
> scaling by a fixed size leads to more realloc, which means more time, and
> greater memory fragmentation.
> fragmentation may well be a greater enemy than the raw amount of memory
> used.


Did you really mean to say "scaling" here? It sounds like you are
arguing against it having agreed that it is often preferable.

<snip>
> this function being a great example of CLC pedantics...
>
> personally, I would just be like "to hell with it" and just use fseek/ftell,
> since there is almost no way that a text file is going to be larger than 2GB
> and expected to just be read into memory like this (it wouldn't fit on 32
> bit systems anyways...).


Putting aside the issue of pedantry, it is worth pointing out that ftell
does not report the correct size to read in a whole text file on some
very common platforms. It may usually be a "safe" lie (i.e. you
get told a size larger than required) but it is, none the less, not the
size you need to read.

<snip>
--
Ben.
 
Ian Collins
      08-02-2010
On 08/03/10 11:16 AM, Ben Bacarisse wrote:
> "BGB / cr88192"<(E-Mail Removed)> writes:
>>
>> personally, I would just be like "to hell with it" and just use fseek/ftell,
>> since there is almost no way that a text file is going to be larger than 2GB
>> and expected to just be read into memory like this (it wouldn't fit on 32
>> bit systems anyways...).

>
> Putting aside the issue of pedantry, it is worth pointing out that ftell
> does not report the correct size to read in a whole text file on some
> very common platforms. It may usually be a "safe" lie (i.e. you
> get told a size larger than required) but it is, none the less, not the
> size you need to read.


But on those systems there isn't a direct way of determining the correct
size to read without scanning the file. Provided the code doesn't
expect the result of a read to exactly match the size requested, all
should be well.

--
Ian Collins
 
Morris Keesan
      08-03-2010
On Mon, 02 Aug 2010 11:54:38 -0400, Navaneeth <(E-Mail Removed)>
wrote:

> printf("failed to reallocate memory\n");

....
> if(result != PAGE_SIZE) {
> if(feof(fp)) {
> break;
> }
> else {
> printf("error reading file");

....
>
> This is working fine. Since I am new to C, I am wondering whether this
> is the correct approach. Any suggestions to improve the code are
> welcome.


If you're going to output an error message, it's more appropriate to
send it to stderr, e.g.
fputs("failed to reallocate memory\n", stderr);

stdout and stderr are very often the same device, but when they're not,
users expect to see error messages on stderr, not stdout.
--
Morris Keesan -- (E-Mail Removed)
 
Navaneeth
      08-03-2010
> yeah, the 1.5^x curve works well in my case.
> every time one needs more memory, scale the prior size by 1.5...
> average case overhead is 25%.
>
> scaling by a fixed size leads to more realloc, which means more time, and
> greater memory fragmentation.
> fragmentation may well be a greater enemy than the raw amount of memory
> used.


Thanks. I am not sure I followed you fully. Can you give me a code
sample of "realloc" which allocates in the way you are explaining?
 
Malcolm McLean
      08-03-2010
On Aug 3, 7:16 pm, Navaneeth <(E-Mail Removed)> wrote:
>
> Thanks. I am not sure I followed you fully. Can you give me a code
> sample of "realloc" which allocates in the way you are explaining?

He means this:

/*
untested
*/
char *loadtext(FILE *fp)
{
    char *answer;
    void *temp;
    int N = 0;
    int buffsize = 8;
    int ch;

    answer = malloc(buffsize);
    if(!answer)
        return 0;

    while( (ch = fgetc(fp)) != EOF)
    {
        if(N == buffsize - 1)
        {
            buffsize = buffsize + buffsize/2;   /* grow by a factor of 1.5 */
            temp = realloc(answer, buffsize);
            if(!temp)
            {
                free(answer);
                return 0;
            }
            answer = temp;   /* keep the (possibly moved) buffer */
        }
        answer[N++] = ch;
    }
    answer[N] = 0;   /* null-terminate */
    return answer;
}
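
A caller would then look something like this (also untested; "input.txt"
is just an example):

FILE *fp = fopen("input.txt", "r");
if(fp)
{
    char *text = loadtext(fp);
    fclose(fp);
    if(text)
    {
        printf("%s", text);   /* safe: loadtext null-terminates */
        free(text);
    }
}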
 