Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > median of large data set (from large file)

Reply
Thread Tools

median of large data set (from large file)

 
 
friend.05@gmail.com
Guest
Posts: n/a
 
      04-01-2009
I have very big file(GBs).

I want to calculate median of this large data set. Any efficient way
to calculate it.

I cannot read filei n array as it will go out-of-memory due to large
amount of data.


THanks
 
Reply With Quote
 
 
 
 
ccc31807
Guest
Posts: n/a
 
      04-01-2009
On Apr 1, 4:34*pm, "(E-Mail Removed)" <(E-Mail Removed)>
wrote:
> I have very big file(GBs).
>
> I want to calculate median of this large data set. Any efficient way
> to calculate it.
>
> I cannot read filei n array as it will go out-of-memory due to large
> amount of data.
>
> THanks


As I recall, you can use a bucket sort to sort a list without reading
them all into memory. Sort the list using bucket sort, counting the
elements, and then take the middle element.

CC
 
Reply With Quote
 
 
 
 
smallpond
Guest
Posts: n/a
 
      04-01-2009
On Apr 1, 4:34*pm, "(E-Mail Removed)" <(E-Mail Removed)>
wrote:
> I have very big file(GBs).
>
> I want to calculate median of this large data set. Any efficient way
> to calculate it.
>
> I cannot read filei n array as it will go out-of-memory due to large
> amount of data.
>
> THanks


There is a method for calculating the median without
sorting that takes O(log N) passes through your 9-track
tapes, getting progressively closer upper and lower bounds.
It's probably in Knuth, but I don't have a copy handy.
 
Reply With Quote
 
Ilya Zakharevich
Guest
Posts: n/a
 
      04-01-2009
On 2009-04-01, Ilya Zakharevich <(E-Mail Removed)> wrote:
> Median is MAJORLY tricky. (IIRC,) Contrary to a widespead believes,
> there IS a linear time limited-memory algo (but it is MUCH more
> delicate than the usual FFT/quicksort-like tricks).


Oups, of course it needs "limited" RANDOM-ACCESS memory; it still needs
a lot of sequential-access memory (read: disk).

Ilya
 
Reply With Quote
 
Ilya Zakharevich
Guest
Posts: n/a
 
      04-01-2009
On 2009-04-01, Ilya Zakharevich <(E-Mail Removed)> wrote:
> On 2009-04-01, http://www.velocityreviews.com/forums/(E-Mail Removed) <(E-Mail Removed)> wrote:
>> I have very big file(GBs).
>>
>> I want to calculate median of this large data set. Any efficient way
>> to calculate it.
>>
>> I cannot read filei n array as it will go out-of-memory due to large
>> amount of data.

>
> Median is MAJORLY tricky. (IIRC,) Contrary to a widespead believes,

^^^^
> there IS a linear time limited-memory algo (but it is MUCH more
> delicate than the usual FFT/quicksort-like tricks).


Yes, I remembered correct. *This* time I even managed to
reconstruct the algo...

> I would be pretty sure that the algo is already implemented in any
> reasonable toolset. Did you check the CPAN search?


In fact, this algo makes sense ONLY if know nothing about distribution
of the data. E.g., if you know that your data is `double-precision',
then there are only 2**64 buckets; on 10-year-old hardware it is not a
problem to decimate into about 2**24 buckets.

So `double-precision' data is guarantied to be decimated in 3 rounds,
and most probably would in 2 (second round very quick).

Hope this helps,
Ilya
 
Reply With Quote
 
Xho Jingleheimerschmidt
Guest
Posts: n/a
 
      04-02-2009
(E-Mail Removed) wrote:
> I have very big file(GBs).
>
> I want to calculate median of this large data set. Any efficient way
> to calculate it.
>
> I cannot read filei n array as it will go out-of-memory due to large
> amount of data.


I'd start by using the system's file sort utility to sort the file, and
see if that is good enough.

Xho
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Re: Mean, median, and mode Josiah Carlson Python 7 12-06-2004 10:30 PM
RE: Mean, median, and mode - third time's the charm!!! Frohnhofer, James Python 0 12-06-2004 09:16 PM
RE: Mean, median, and mode Robert Brewer Python 5 12-06-2004 08:31 PM
Re: Python-Help ( Mean,Median & Mode) Alfred Canoy Python 2 12-06-2004 04:03 AM
Implementation of parallel 2D median filtering Andreas VHDL 1 12-02-2003 11:19 AM



Advertisments