Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > C Programming > File handling question

Reply
Thread Tools

File handling question

 
 
Richard Heathfield
Guest
Posts: n/a
 
      05-17-2006
Anunay said:

>
> Richard Heathfield wrote:
>> > As the file is too big to be fully fit into main memory, what
>> > will fopen() return?

>>
>> You seem to be confused about the purpose of fopen. It does not read a
>> file into main memory.

>
> Okay, that was a bad question. I meant to ask that where will the
> remaining portion of file be stored as it could not get loaded in one
> go?


On the disk, where it is already. That's what disks are for.

You just load one line of the file at a time. Having read it, you either
keep it or discard it, as I explained earlier, and then you read the next
one. And so on.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
 
Reply With Quote
 
 
 
 
CBFalconer
Guest
Posts: n/a
 
      05-17-2006
Richard Heathfield wrote:
> Anunay said:
>> Richard Heathfield wrote:
>>
>>>> As the file is too big to be fully fit into main memory, what
>>>> will fopen() return?
>>>
>>> You seem to be confused about the purpose of fopen. It does not
>>> read a file into main memory.

>>
>> Okay, that was a bad question. I meant to ask that where will
>> the remaining portion of file be stored as it could not get
>> loaded in one go?

>
> On the disk, where it is already. That's what disks are for.
>
> You just load one line of the file at a time. Having read it,
> you either keep it or discard it, as I explained earlier, and
> then you read the next one. And so on.


I think it is time to refer Anunay to K&R II and to his elementary
computers textbook.

Analog for Anunay: When you open a book, do you immediately
transfer its content to your brain? Or do you start at page 1 and
read a sentence? Later do you read another sentence?

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>

 
Reply With Quote
 
 
 
 
Flash Gordon
Guest
Posts: n/a
 
      05-17-2006
Ico wrote:
> Friedrich Dominicus <(E-Mail Removed)> wrote:
>> "Anunay" <(E-Mail Removed)> writes:
>>
>>> Hi all,
>>>
>>> Suppose a text file is given and we have to write a program that will
>>> return a random line from the file. This can be easily done. But what
>>> if the text file is too big and can't fit into the main memory
>>> completely?

>> What it the problem? You read the file line by line till you are at
>> your random number.

>
> Requiring 2 passes of file reading: one to count the number of lines,
> one to choos a random line. This can be used when reading a file
> (although not optimal, ofcourse), but is not possible with streams.


Since the random distribution was not specified you don't need to do two
passes. This might mean there is no possibility of returning lines
towards the end of the file, or if the file is only two lines a high
probability of returning the last line, but it is still random.

>>> Also, if we are given a stream, instead of a file, then what changes
>>> are required?

>
>> What do you mean with this?

>
> I assume the OP means reading from standard input, instead of a file.


Pick a probability for returning the current line, and keep going until
you either hit EOF or you at random decide to return the line.
--
Flash Gordon, living in interesting times.
Web site - http://home.flash-gordon.me.uk/
comp.lang.c posting guidelines and intro:
http://clc-wiki.net/wiki/Intro_to_clc

Inviato da X-Privat.Org - Registrazione gratuita http://www.x-privat.org/join.php
 
Reply With Quote
 
Eric Sosman
Guest
Posts: n/a
 
      05-17-2006
Flash Gordon wrote:

> Ico wrote:
>
>> Friedrich Dominicus <(E-Mail Removed)> wrote:
>>
>>> "Anunay" <(E-Mail Removed)> writes:
>>>
>>>> Hi all,
>>>>
>>>> Suppose a text file is given and we have to write a program that will
>>>> return a random line from the file. This can be easily done. But what
>>>> if the text file is too big and can't fit into the main memory
>>>> completely?
>>>
>>> What it the problem? You read the file line by line till you are at
>>> your random number.

>>
>>
>> Requiring 2 passes of file reading: one to count the number of lines,
>> one to choos a random line. This can be used when reading a file
>> (although not optimal, ofcourse), but is not possible with streams.

>
>
> Since the random distribution was not specified you don't need to do two
> passes. This might mean there is no possibility of returning lines
> towards the end of the file, or if the file is only two lines a high
> probability of returning the last line, but it is still random.
>
>>>> Also, if we are given a stream, instead of a file, then what changes
>>>> are required?

>>
>>
>>> What do you mean with this?

>>
>>
>> I assume the OP means reading from standard input, instead of a file.

>
>
> Pick a probability for returning the current line, and keep going until
> you either hit EOF or you at random decide to return the line.


Read Richard Heathfield's initial post on this thread
again; you'll find that his algorithm (1) requires no advance
knowledge of the line count, (2) makes just one pass over the
input, and (3) has equal probability of selecting any particular
line. It will, however, make one complete pass over all the
input even if it eventually selects the very first line read;
there is no "shortcut."

Point (3) may be a little difficult to grasp at first, so
let me offer a couple simple examples and then a formal proof.
What is the probability that the very last line is chosen? If
there are N lines altogether, the last line is chosen with
probability 1/N, as desired. How about the first line? It goes
into the "saved" buffer as soon as it's read (with probability
1/1), and then it survives there with probability

(1-1/2)*(1-1/3)*(1-1/4)*...*(1-1/N)
= (1/2)*(2/3)*(3/4)*...*((N-1)/N)
= 1/N

Now for the formal proof, by induction. If the file has
only one line, that line is selected with probability 1/1.
That's the "ground" step. Now suppose that the algorithm
works for N-line files, that is, that it chooses one of the
N lines with probability 1/N; we wish to show that it works
for files of N+1 lines. Imagine the situation just before
the final line is read: the first N lines have been processed
and one of them resides in the "saved" buffer. The final step
chooses the last line with probability 1/(N+1) -- which is as
it should be -- or chooses the already-saved line with probability
N/(N+1). Since the first N lines were processed by the N-line
algorithm (known to be correct), any particular one of those first
N lines is "saved" with probability 1/N, hence the probability that
that line is chosen from among N+1 is (1/N)*(N/(N+1)) = 1/(N+1).

The algorithm makes an unbiased "choice" from a 1-line file,
and if it works for N lines it also works for N+1. Therefore,
it works for all positive N. (To the limits of the random number
generator's precision, of course -- but that's an altogether
deeper topic.)

--
Eric Sosman
http://www.velocityreviews.com/forums/(E-Mail Removed)lid
 
Reply With Quote
 
Jordan Abel
Guest
Posts: n/a
 
      05-17-2006
On 2006-05-17, Anunay <(E-Mail Removed)> wrote:
> Hi all,
>
> Suppose a text file is given and we have to write a program that will
> return a random line from the file. This can be easily done. But what
> if the text file is too big and can't fit into the main memory
> completely? In this case, how will we modify our code?
>
> Also, if we are given a stream, instead of a file, then what changes
> are required?
>
> Thanks,
> Anunay
>


http://minnie.tuhs.org/UnixTree/V7/u...fortune.c.html

Note that, of course, '(1+(double)RAND_MAX)' needs to be used instead of
'32768.' in that code for modern systems, and it could stand to be
otherwise updated. But the idea is there.
 
Reply With Quote
 
Bill Pursell
Guest
Posts: n/a
 
      05-17-2006
Anunay wrote:
> Richard Heathfield wrote:
> > > As the file is too big to be fully fit into main memory, what
> > > will fopen() return?

> >
> > You seem to be confused about the purpose of fopen. It does not read a file
> > into main memory.

>
> Okay, that was a bad question. I meant to ask that where will the
> remaining portion of file be stored as it could not get loaded in one
> go?


As Richard pointed out, you are confused about the purpose of fopen().
The problem you are describing might be exemplified in the
following poorly designed program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

const char *path = "/tmp/some_file";

int
main(void)
{
FILE *ifp;
struct stat stat_buf;
unsigned char *data_buf;

/* fopen() doesn't care how big the file is. */
if ( (ifp = fopen(path, "r")) == NULL)
perror(path), exit(EXIT_FAILURE);

/* stat() will tell you how big the file is. */
if (stat(path, &stat_buf) == -1)
perror(NULL), exit(EXIT_FAILURE);

/*
* If you want to read the file all at once,
* you'll need to allocate space. This will
* fail if the file is too big.
*/
if ( (data_buf = malloc(stat_buf.st_size)) == NULL)
fprintf(stderr, "No memory\n"), exit(EXIT_FAILURE);

/* Process the file here. (look up fread()) */

return EXIT_SUCCESS;
}


The reason this is poorly designed is precisely because
of the problem you are trying to describe. It is fairly stupid
to read the entire file in at once. Instead, you should only
read in chunks and process them as you go.

 
Reply With Quote
 
Robert Latest
Guest
Posts: n/a
 
      05-17-2006
On Wed, 17 May 2006 07:08:21 -0400,
CBFalconer <(E-Mail Removed)> wrote
in Msg. <(E-Mail Removed)>

> Analog for Anunay: When you open a book


He doesn't. Or at least not C books.

robert
 
Reply With Quote
 
Joe Smith
Guest
Posts: n/a
 
      05-18-2006

"Eric Sosman">
>>>> "Anunay" <(E-Mail Removed)> writes:


>>>>> Suppose a text file is given and we have to write a program that will
>>>>> return a random line from the file. This can be easily done. But what
>>>>> if the text file is too big and can't fit into the main memory
>>>>> completely?


[snipped Flash Gordon for thematic considerations]
[Heathfield:]
Not at all, if you write it properly first time. You just need storage for
two lines: the currently "saved" line, and the most recently read line.
When you read line N into memory, copy it into the "saved" buffer with
probability 1/N. At the end of this process, the "saved" buffer will
contain a randomly selected line.

> Read Richard Heathfield's initial post on this thread
> again; you'll find that his algorithm (1) requires no advance
> knowledge of the line count, (2) makes just one pass over the
> input, and (3) has equal probability of selecting any particular
> line. It will, however, make one complete pass over all the
> input even if it eventually selects the very first line read;
> there is no "shortcut."
>
> Point (3) may be a little difficult to grasp at first, so
> let me offer a couple simple examples and then a formal proof.
> What is the probability that the very last line is chosen? If
> there are N lines altogether, the last line is chosen with
> probability 1/N, as desired. How about the first line? It goes
> into the "saved" buffer as soon as it's read (with probability
> 1/1), and then it survives there with probability
>
> (1-1/2)*(1-1/3)*(1-1/4)*...*(1-1/N)
> = (1/2)*(2/3)*(3/4)*...*((N-1)/N)
> = 1/N
>
> Now for the formal proof, by induction. If the file has
> only one line, that line is selected with probability 1/1.
> That's the "ground" step. Now suppose that the algorithm
> works for N-line files, that is, that it chooses one of the
> N lines with probability 1/N; we wish to show that it works
> for files of N+1 lines. Imagine the situation just before
> the final line is read: the first N lines have been processed
> and one of them resides in the "saved" buffer. The final step
> chooses the last line with probability 1/(N+1) -- which is as
> it should be -- or chooses the already-saved line with probability
> N/(N+1). Since the first N lines were processed by the N-line
> algorithm (known to be correct), any particular one of those first
> N lines is "saved" with probability 1/N, hence the probability that
> that line is chosen from among N+1 is (1/N)*(N/(N+1)) = 1/(N+1).
>
> The algorithm makes an unbiased "choice" from a 1-line file,
> and if it works for N lines it also works for N+1. Therefore,
> it works for all positive N. (To the limits of the random number
> generator's precision, of course -- but that's an altogether
> deeper topic.)


This proof is at the point where the other people in the room either shake
or nod their head and comment. First, I find it notable that there exist
proofs in theoretical comp sci and that this continues to be fertile ground
for unfolding scientific truth.

I find no error in the induction. One of the reasons that I find no error
could be that I've all but forgotten conditional probability. If that is
sound, the telescoping looks correct. I've taken the liberty of sending
this to a professor with whom I'm acquainted who authored a book in
statistical paradoxes to ask for comment. There will always be a
counterexample somewhere; the question is whether it's relevant and how far
afield. If relevant, one would hand Mr. Sosman the chalk back to elaborate.

Mr. Heathfield's algorithm would make the act of observation. Whenever that
happens, the Law of Unintended Consequences gets a phone call. joe


 
Reply With Quote
 
Richard Heathfield
Guest
Posts: n/a
 
      05-18-2006
Joe Smith said:

<snip>

> Mr. Heathfield's algorithm [...]


Um, I think it's in Knuth somewhere (although I had a look just now, and
couldn't find it in TAOCP2.)

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
 
Reply With Quote
 
Joe Smith
Guest
Posts: n/a
 
      05-18-2006

"Richard Heathfield" <(E-Mail Removed)> wrote in message
news(E-Mail Removed)...
> Joe Smith said:
>
> <snip>
>
>> Mr. Heathfield's algorithm [...]

>
> Um, I think it's in Knuth somewhere (although I had a look just now, and
> couldn't find it in TAOCP2.)


I wish I had that handy, as I recall that he offers someting like $2.56 for
each erratum. I believe that relatively close to the beginning, Knuth
claims that Leonardo de Pisa gives the following numbers: 0, 1, 1, 2, 3, 5,
8, 13 and so on. In point of fact, Leonardo in _Libber Abbaci_ gives 0, 1,
2, 3, 5...., omitting one of the one's. I wouldn't be surprised if
Professor Knuth already heard of this, but if he hasn't and I read the
passage correctly, then .... joe


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
signal handling and (structured) exception handling Peter C++ 34 10-17-2009 10:03 AM
python list handling and Lisp list handling Mark Tarver Python 22 04-26-2009 09:36 PM
Is faster handling hexadecimal values than handling chars? IƱaki Baz Castillo Ruby 1 04-15-2008 09:04 AM
file handling in a server (.py) file using xmlrpc uwb Python 4 07-08-2005 07:55 PM
General File Handling (Class Structure Preservation) Question Sean W. Quinn C++ 1 12-01-2003 02:58 AM



Advertisments