Velocity Reviews > File handling question

# File handling question

Eric Sosman

 05-19-2006
Richard Heathfield wrote:
> Joe Smith said:
>
> <snip>
>
>>Mr. Heathfield's algorithm [...]

>
>
> Um, I think it's in Knuth somewhere (although I had a look just now, and
> couldn't find it in TAOCP2.)

Algorithm 3.4.2R is a more general version, selecting
a sample of size n possibly greater than 1. But be sure
to read the third edition of TAOCP II; the second edition
also has an Algorithm 3.4.2R for the same purpose, but it's
a different algorithm!
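For reference, that more general selection (keep a uniform sample of n records from a stream of unknown length) can be sketched roughly as follows. This is my own paraphrase of the reservoir idea, not Knuth's exact Algorithm R; the function name and the use of rand() are my simplifications:

```c
#include <stdlib.h>

/* Keep a uniform random sample of n items from a stream of unknown
 * length. The first n items fill the reservoir; thereafter the t-th
 * item (counting from 1) replaces a random reservoir slot with
 * probability n/t. 'reservoir' must have room for n ints.
 * Returns how many items were actually stored (min of len and n). */
static int reservoir_sample(const int *stream, int len, int *reservoir, int n)
{
    int stored = 0;
    for (int t = 0; t < len; t++) {
        if (t < n) {
            reservoir[t] = stream[t];   /* fill phase */
            stored++;
        } else {
            int j = rand() % (t + 1);   /* uniform in [0, t] */
            if (j < n)                  /* probability n/(t+1) */
                reservoir[j] = stream[t];
        }
    }
    return stored;
}
```

(With n = 1 this collapses to the one-line selection being discussed below.)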

--
Eric Sosman

Old Wolf

 05-19-2006
Joe Smith wrote:
> Eric Sosman wrote:
>> Now for the formal proof, by induction. If the file has
>> only one line, that line is selected with probability 1/1.
>> That's the "ground" step. Now suppose that the algorithm
>> works for N-line files, that is, that it chooses one of the
>> N lines with probability 1/N; we wish to show that it works
>> for files of N+1 lines. Imagine the situation just before
>> the final line is read: the first N lines have been processed
>> and one of them resides in the "saved" buffer. The final step
>> chooses the last line with probability 1/(N+1) -- which is as
>> it should be -- or chooses the already-saved line with probability
>> N/(N+1). Since the first N lines were processed by the N-line
>> algorithm (known to be correct), any particular one of those first
>> N lines is "saved" with probability 1/N, hence the probability that
>> that line is chosen from among N+1 is (1/N)*(N/(N+1)) = 1/(N+1).
>>
>> The algorithm makes an unbiased "choice" from a 1-line file,
>> and if it works for N lines it also works for N+1. Therefore,
>> it works for all positive N. (To the limits of the random number
>> generator's precision, of course -- but that's an altogether
>> deeper topic.)

>
> I find no error in the induction. One of the reasons that I find no error
> could be that I've all but forgotten conditional probability. If that is
> sound, the telescoping looks correct. I've taken the liberty of sending
> this to a professor with whom I'm acquainted who authored a book in

There is no error. This is a great example of how useful induction is.

Another way of writing the inductive step is to note that all of
the first N lines have equal probability of being selected; and
selecting or not selecting the new last line does not affect this,
so the new probability of one of the first N lines remaining selected
is (1 - P(new line)) / N .
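For concreteness, the selection loop the proof describes can be sketched in C like this. A minimal illustration, not anyone's posted code; the function name, the fixed-size line buffer, and the use of rand() are my own simplifications:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Select one line from the stream uniformly at random in a single pass.
 * The Kth line read (K = 1, 2, ...) replaces the saved line with
 * probability 1/K -- exactly the step the inductive proof analyzes.
 * Returns 'saved' on success, NULL if the stream had no lines. */
static char *choose_line(FILE *fp, char *saved, size_t size)
{
    char line[512];
    long k = 0;
    int have = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        k++;
        if (rand() % k == 0) {          /* true with probability 1/k */
            strncpy(saved, line, size - 1);
            saved[size - 1] = '\0';
            have = 1;
        }
    }
    return have ? saved : NULL;
}
```

(rand() % k is slightly biased whenever k doesn't divide RAND_MAX+1 evenly, which is the "limits of the random number generator's precision" caveat from the proof.)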

Eric Sosman

 05-19-2006
Old Wolf wrote:
> Joe Smith wrote:
>
>>Eric Sosman wrote:
>>> Now for the formal proof, by induction. [...]
> <snip>
>
> There is no error. This is a great example of how useful induction is.
>
> Another way of writing the inductive step is to note that all of
> the first N lines have equal probability of being selected; and
> selecting or not selecting the new last line does not affect this,
> so the new probability of one of the first N lines remaining selected
> is (1 - P(new line)) / N .

The proof can also be done non-inductively, by considering
the probability that any particular line is selected. The Kth
line is chosen if and only if

- When it is read it goes into the "saved" buffer, and

- It survives in the "saved" buffer as all the remaining lines are read.

The Kth line enters the "saved" buffer with probability 1/K,
survives the challenge from the next line with probability
K/(K+1), from the line after that with probability (K+1)/(K+2),
and so on, surviving the final line's challenge with probability
(N-1)/N. Multiply the probabilities together, and each denominator
except the last cancels with the numerator of the fraction right
after it, leaving 1/N as the probability that the Kth line is
chosen, for any 0 < K <= N.

The non-inductive proof is shorter, and may be more appealing
to some. To me, though, induction seems a more natural way to
prove things about iterative processes; the correspondence of
the inductive step with the iterative step is usually compelling.
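The telescoping product is easy to check numerically. Here's a throwaway C function (my own, not from the thread) that multiplies out the survival probabilities instead of relying on the cancellation:

```c
/* Probability that line K (1-based) ends up chosen from an N-line file:
 * it enters the buffer with probability 1/K, then survives each later
 * line's challenge with probability J/(J+1) for J = K .. N-1. */
static double p_chosen(int k, int n)
{
    double p = 1.0 / k;
    for (int j = k; j < n; j++)
        p *= (double)j / (double)(j + 1);
    return p;
}
```

Every p_chosen(k, n) comes out as 1.0/n to floating-point precision, matching the cancellation argument.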

--
Eric Sosman

Joe Smith

 05-19-2006

"Old Wolf" wrote in message:
> Joe Smith wrote:
>> Eric Sosman wrote:
>>> Now for the formal proof, by induction. [...]
> <snip>

>
> There is no error. This is a great example of how useful induction is.
>
> Another way of writing the inductive step is to note that all of
> the first N lines have equal probability of being selected; and
> selecting or not selecting the new last line does not affect this,
> so the new probability of one of the first N lines remaining selected
> is (1 - P(new line)) / N .

I would call that more an appeal to strong induction, whereas Mr. Sosman's
is to weak induction. Of course, that wouldn't change the outcome, just
what is "known" during the inductive step. BTW, you were right elsewhere
that I was mistakenly talking about the Euclidean Algorithm as the Reverse.
Why you think public key cryptography is OT here is beyond me, except that
it might mean you only discuss it on your Own Terms. joe