On 2009-04-01, Ilya Zakharevich <(E-Mail Removed)> wrote:

> On 2009-04-01, (E-Mail Removed) wrote:

>> I have a very big file (GBs).
>>
>> I want to calculate the median of this large data set. Any efficient
>> way to calculate it?
>>
>> I cannot read the file into an array, as it will go out of memory due
>> to the large amount of data.
>

> Median is MAJORLY tricky. (IIRC,) Contrary to a widespread belief,
                            ^^^^^^^
> there IS a linear-time, limited-memory algo (but it is MUCH more
> delicate than the usual FFT/quicksort-like tricks).

Yes, I remembered correctly. *This* time I even managed to
reconstruct the algo...

> I would be pretty sure that the algo is already implemented in any
> reasonable toolset. Did you check the CPAN search?
In fact, this algo makes sense ONLY if you know nothing about the
distribution of the data. E.g., if you know that your data is
`double-precision', then there are only 2**64 possible values; on
10-year-old hardware it is not a problem to decimate into about 2**24
buckets per pass.

So `double-precision' data is guaranteed to be fully decimated in 3
rounds (3 * 24 bits > 64 bits), and most probably would be in 2 (the
second round is very quick).
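The bucket decimation above can be sketched as follows. This is not the
original poster's code, just a minimal illustration of the idea, assuming
the data are finite IEEE-754 doubles (no NaNs) and that the stream can be
re-read once per pass (e.g. by re-opening the file). Each double's 64-bit
pattern is mapped to an unsigned key that sorts the same way as the float,
then the k-th smallest key is found by one counting pass per round, using
2**22 buckets per round so that 3 rounds cover all 64 bits; memory stays
bounded by the bucket array regardless of file size.

```python
import struct

MASK64 = (1 << 64) - 1

def float_key(x):
    """Map a double to a uint64 so that float order == unsigned order."""
    (bits,) = struct.unpack('<Q', struct.pack('<d', x))
    # standard trick: flip all bits of negatives, set the sign bit of
    # non-negatives; the result compares like the original float
    return (bits ^ MASK64) if bits >> 63 else (bits | (1 << 63))

def key_float(key):
    """Inverse of float_key."""
    bits = (key ^ (1 << 63)) if key >> 63 else (key ^ MASK64)
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

def kth_smallest(read_values, k, rounds=(22, 22, 20)):
    """0-based k-th smallest double; one counting pass per round.

    `read_values` is any callable returning a fresh iterator over the
    data (e.g. lambda: iter_doubles(path)); 22 + 22 + 20 = 64 bits.
    """
    prefix, fixed = 0, 0                    # key bits pinned down so far
    for width in rounds:
        counts = [0] * (1 << width)
        shift = 64 - fixed - width
        for x in read_values():
            key = float_key(x)
            if key >> (64 - fixed) == prefix:       # still a candidate?
                counts[(key >> shift) & ((1 << width) - 1)] += 1
        for i, c in enumerate(counts):              # bucket holding rank k
            if k < c:
                prefix = (prefix << width) | i      # narrow the range
                fixed += width
                break
            k -= c
    return key_float(prefix)                # prefix is now the full key

def stream_median(read_values):
    n = sum(1 for _ in read_values())       # one extra pass to count
    return kth_smallest(read_values, n // 2)  # lower median for even n
```

With the count of elements known up front, this is one counting pass per
round; the filter `key >> (64 - fixed) == prefix` is what makes the later
rounds quick, since only the surviving bucket's elements are re-counted.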

Hope this helps,
Ilya