Flash Gordon wrote:

> Ico wrote:
>> Friedrich Dominicus <(E-Mail Removed)> wrote:
>>> "Anunay" <(E-Mail Removed)> writes:
>>>> Hi all,
>>>>
>>>> Suppose a text file is given and we have to write a program that will
>>>> return a random line from the file. This can be easily done. But what
>>>> if the text file is too big and can't fit into the main memory
>>>> completely?
>>>
>>> What is the problem? You read the file line by line till you are at
>>> your random number.
>>
>> That requires two passes over the file: one to count the number of
>> lines, and one to choose and read the random line. This can be used
>> when reading a file (although it is not optimal, of course), but it is
>> not possible with streams.
>
> Since the random distribution was not specified, you don't need to do
> two passes. This might mean there is no possibility of returning lines
> towards the end of the file, or, if the file is only two lines, a high
> probability of returning the last line -- but it is still random.
>
>>>> Also, if we are given a stream, instead of a file, then what changes
>>>> are required?
>>>
>>> What do you mean with this?
>>
>> I assume the OP means reading from standard input, instead of a file.
>
> Pick a probability for returning the current line, and keep going until
> you either hit EOF or you at random decide to return the line.

Read Richard Heathfield's initial post in this thread again; you'll
find that his algorithm (1) requires no advance knowledge of the line
count, (2) makes just one pass over the input, and (3) has an equal
probability of selecting any particular line. It will, however, make
one complete pass over all the input even if it eventually selects the
very first line read; there is no "shortcut."
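The one-pass technique described above is what is nowadays called
reservoir sampling with a reservoir of a single line. Here is a minimal
C sketch of the idea -- this is not Heathfield's actual code; the
function name, the 4096-byte line cap, and the use of rand() are
assumptions of this sketch:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read every line of fp; on return, saved[] holds one line chosen
 * uniformly at random (size is the capacity of saved[]).  Returns the
 * number of lines read; if it returns 0, saved[] is untouched. */
long choose_random_line(FILE *fp, char *saved, size_t size)
{
    char line[4096];    /* assumed cap on line length, for the sketch */
    long n = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        n++;
        /* Keep the n-th line with probability 1/n.  For n == 1 the
         * test always succeeds, so the buffer is always initialized. */
        if (rand() % n == 0) {
            strncpy(saved, line, size - 1);
            saved[size - 1] = '\0';
        }
    }
    return n;
}
```

Each new line overwrites the saved one with probability 1/n, which is
exactly the process analyzed below.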

Point (3) may be a little difficult to grasp at first, so let me
offer a couple of simple examples and then a formal proof. What is the
probability that the very last line is chosen? If there are N lines
altogether, the last line is chosen with probability 1/N, as desired.
How about the first line? It goes into the "saved" buffer as soon as
it's read (with probability 1/1), and then it survives there with
probability

    (1 - 1/2) * (1 - 1/3) * (1 - 1/4) * ... * (1 - 1/N)
  = (1/2) * (2/3) * (3/4) * ... * ((N-1)/N)
  = 1/N
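The telescoping product is easy to check numerically. A small helper
(the name first_line_survival is mine, not from the thread):

```c
/* Probability that the first line is still in the "saved" buffer
 * after n lines have been processed: the product
 * (1 - 1/2) * (1 - 1/3) * ... * (1 - 1/n), which telescopes to 1/n. */
double first_line_survival(int n)
{
    double p = 1.0;     /* saved with probability 1/1 at step 1 */
    int k;

    for (k = 2; k <= n; k++)
        p *= 1.0 - 1.0 / k;     /* survives step k with prob 1 - 1/k */
    return p;
}
```

For n = 10 this returns 0.1, for n = 100 it returns 0.01, and so on,
matching the 1/N the algebra predicts.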

Now for the formal proof, by induction. If the file has only one
line, that line is selected with probability 1/1. That's the "ground"
step. Now suppose that the algorithm works for N-line files, that is,
that it chooses one of the N lines with probability 1/N; we wish to
show that it works for files of N+1 lines. Imagine the situation just
before the final line is read: the first N lines have been processed
and one of them resides in the "saved" buffer. The final step chooses
the last line with probability 1/(N+1) -- which is as it should be --
or keeps the already-saved line with probability N/(N+1). Since the
first N lines were processed by the N-line algorithm (known to be
correct), any particular one of those first N lines is "saved" with
probability 1/N, hence the probability that that line is chosen from
among N+1 is (1/N) * (N/(N+1)) = 1/(N+1).

The algorithm makes an unbiased "choice" from a 1-line file, and if
it works for N lines it also works for N+1. Therefore, it works for
all positive N. (To the limits of the random number generator's
precision, of course -- but that's an altogether deeper topic.)
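The inductive argument can also be sanity-checked by simulation: run
the selection process many times on a small "file" and confirm each
line wins about equally often. A sketch (the simulate helper is
hypothetical, and rand() stands in for a real generator):

```c
#include <stdlib.h>

/* Run the one-pass selection process on a file of 'lines' lines,
 * 'trials' times, counting how often each line wins.  counts[] must
 * have room for 'lines' entries. */
void simulate(int lines, int trials, long counts[])
{
    int i, t, chosen;

    for (i = 0; i < lines; i++)
        counts[i] = 0;
    for (t = 0; t < trials; t++) {
        chosen = 0;                      /* line 1 kept with prob 1/1 */
        for (i = 1; i < lines; i++)
            if (rand() % (i + 1) == 0)   /* line i+1 wins with prob 1/(i+1) */
                chosen = i;
        counts[chosen]++;
    }
}
```

With, say, three lines and 30000 trials, each count should land close
to 10000, as the proof says it must.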

--

Eric Sosman
