# More math than perl...

 10-05-2007
Background:

I have a routine I am writing in perl that will give me the median for
a 0 to 5 rating. The ratings are stored in a file and I load the
values into 7 different variables, RATE0 - RATE5 and one called TOTAL.
When a person rates a page I increment one of the RATE variables
based on what they selected (0 - 5) and increment TOTAL so I have a
running count (which is really just the sum of RATE0 - RATE5).

The problem I have (and I hope I am explaining this right), to
calculate a median, I have to make an array that contains all the
values, sorted from low to high, and then look at the value of the
element in the middle to get the median. As an example if I have the
following (not real code, just an example of the logic):

$RATE[0] = 3;
$RATE[1] = 1;
$RATE[2] = 0;
$RATE[3] = 4;
$RATE[4] = 1;
$RATE[5] = 2;

Then my array would be:

@ARRAY = (0,0,0,1,3,3,3,3,4,5,5);

And the median would be $ARRAY[5] or 3. With an even number of
elements in @ARRAY I have to add the value below the middle and the
value above the middle, then divide by 2 to get the median.

For a small sample this is no problem, but when the number of people
who have rated it gets into the thousands this array is going to be too
cumbersome. Does anyone know of a simpler way to do it in perl without
adding in modules or using a lot of memory?

Any / all ideas are welcomed, but please remember that the example I
gave is just typed to give you an idea and is not any real code I am
using.

Bill H

 10-05-2007
>>>>> "BH" == Bill H writes:

BH> The problem I have (and I hope I am explaining this right), to
BH> calculate a median, I have to make an array that contains all
BH> the values, sorted from low to high, and then look at the
BH> value of the element in the middle to get the median.

So, let me see if I understand this right.

You have six variables, $RATE0 through $RATE5, and each contains a
count of how many people rated that page that number?

You could probably do something slick, but the brute-force method
looks like this:

my @array = ((0) x $RATE0, (1) x $RATE1, (2) x $RATE2,
             (3) x $RATE3, (4) x $RATE4, (5) x $RATE5);

my $median;

if (@array % 2)
{
    $median = ($array[(@array-1)/2] + $array[(@array+1)/2])/2;
}
else
{
    $median = $array[@array/2];
}

Alternately, if you did the sensible thing and kept $RATE0 through
$RATE5 in an array, you could say, much more elegantly,

my @array = map { ($_) x $RATE[$_] } (0..5);

Charlton


 10-05-2007
On 10/05/2007 04:12 PM, Bill H wrote:
> [...]
> And the median would be $ARRAY[5] or 3. With an even number of
> elements in @ARRAY I have to add the value below the middle and the
> value above the middle, divide by 2 to get the median.
>
> For a small sample this is no problem, but when the number of people
> who have rated it get in to the 1000's this array is going to be too
> cumbersome. Does anyone know of a simpler way to do it in perl without
> adding in modules or using alot of memory?
> [...]

I would just build the array in memory. On any reasonably modern system,
you'll have to have millions of values before you run out of memory.

I know the mean can be calculated "on the fly"--without storing all of
the values to be examined, but I can't see how this is to be done with
the median; I don't think it's possible.

PS.
I would have given this post a more descriptive subject line like:
calculating median without using too much memory.

 10-05-2007
Bill H wrote:
> Background:
>
> I have a routine I am writing in perl that will give me the median for
> a 0 to 5 rating. The ratings are stored in a file and I load the
> values into 7 different variables, RATE0 - RATE5 and one called TOTAL.
> When a person rates a page I increment one of the RATE variables
> based on what they selected (0 - 5) and increment TOTAL so I have a
> running count (which is really just the sum of RATE0 - RATE5).
>
> The problem I have (and I hope I am explaining this right), to
> calculate a median, I have to make an array that contains all the
> values, sorted from low to high, and then look at the value of the
> element in the middle to get the median. As an example if I have the
> following (not real code, just an example of the logic):
>
> $RATE[0] = 3;
> $RATE[1] = 1;
> $RATE[2] = 0;
> $RATE[3] = 4;
> $RATE[4] = 1;
> $RATE[5] = 2;
>
> Then my array would be:
>
> @ARRAY = (0,0,0,1,3,3,3,3,4,5,5);
>
> And the median would be $ARRAY[5] or 3. With an even number of
> elements in @ARRAY I have to add the value below the middle and the
> value above the middle, divide by 2 to get the median.
>
> For a small sample this is no problem, but when the number of people
> who have rated it get in to the 1000's this array is going to be too
> cumbersome. Does anyone know of a simpler way to do it in perl without
> adding in modules or using alot of memory?
>
> Any / all ideas are welcomed, but please remember that the example I
> gave is just typed to give you an idea and is not any real code I am
> using.

Perhaps this is close to what you require:

$ perl -le'
my @RATES = ( 3, 1, 0, 4, 1, 2 );
my $TOTAL = 11;

my $half = int( $TOTAL / 2 );
for my $i ( 0 .. $#RATES ) {
    if ( ( $half -= $RATES[ $i ] ) < 0 ) {
        print "Median = $i";
        last;
    }
}
'
Median = 3

John

 10-05-2007
On Oct 5, 5:44 pm, Charlton Wilbur wrote:
> >>>>> "BH" == Bill H writes:
>
> BH> The problem I have (and I hope I am explaining this right), to
> BH> calculate a median, I have to make an array that contains all
> BH> the values, sorted from low to high, and then look at the
> BH> value of the element in the middle to get the median.
>
> So, let me see if I understand this right.
>
> You have five variables, $RATE0 through $RATE5, and each contains a
> count of how many people rated that page that number?
>
> You could probably do something slick, but the brute-force method
> looks like this:
>
> my @array = ((0) x $RATE0, (1) x $RATE1, (2) x $RATE2,
> (3) x $RATE3, (4) x $RATE4, (5) x $RATE5);
>
> my $median;
>
> if (@array % 2)
> {
> $median = ($array[(@array-1)/2] + $array[(@array+1)/2])/2;
> }
> else
> {
> $median = $array[@array/2];
> }
>
> Alternately, if you did the sensible thing and kept $RATE0 through
> $RATE5 in an array, you could say, much more elegantly,
>
> my @array = map { ($_) x $RATE[$_] } (0..5);
>
> Charlton
>
Thanks Charlton, but would this not still make a large array if the
total number of people is high (unless I am missing something in it)?

Bill H

 10-05-2007
Bill H wrote:
> Background:
>
> I have a routine I am writing in perl that will give me the median for
> a 0 to 5 rating. The ratings are stored in a file and I load the
> values into 7 different variables, RATE0 - RATE5 and one called TOTAL.
> When a person rates a page I increment one of the RATE variables
> based on what they selected (0 - 5) and increment TOTAL so I have a
> running count (which is really just the sum of RATE0 - RATE5).
>
> The problem I have (and I hope I am explaining this right), to
> calculate a median, I have to make an array that contains all the
> values, sorted from low to high, and then look at the value of the
> element in the middle to get the median. As an example if I have the
> following (not real code, just an example of the logic):
>
> $RATE[0] = 3;
> $RATE[1] = 1;
> $RATE[2] = 0;
> $RATE[3] = 4;
> $RATE[4] = 1;
> $RATE[5] = 2;

Compute the median directly from the structure you already have.

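use List::Util qw(sum);

# $bins->[$x] is the number of ratings equal to $x (0..5);
# $total is optional and defaults to the sum of the bins.
sub median_from_bins {
    my ($bins, $total) = @_;
    $total = sum @$bins unless defined $total;
    return undef unless $total;    # no ratings yet
    my $sofar = 0;
    for (my $x = 0; $x <= 5; $x++) {
        $sofar += $bins->[$x];
        return $x if $sofar > $total / 2;
        if ($sofar == $total / 2) {
            # Even total with the middle pair straddling two bins:
            # average $x with the next non-empty bin.
            my $y = $x + 1;
            $y++ until $bins->[$y];
            return ($x + $y) / 2;
        }
    }
    die "Should never get here $sofar $total @$bins";
}

my $median = median_from_bins(\@RATE, $TOTAL);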
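
For anyone who wants to avoid even a core module, the same idea can be sketched without List::Util. This is only an illustration, not a drop-in replacement: the name median_from_counts is made up here, and it assumes the ratings run from 0 up to the last index of the list of counts. It finds the positions of the middle element(s) in the imaginary sorted list and walks the counts until it has passed them.

```perl
use strict;
use warnings;

# Median of ratings 0..$#counts, where $counts[$r] is how many people chose $r.
# Returns undef when there are no ratings at all.
# (median_from_counts is a hypothetical name used only for this sketch.)
sub median_from_counts {
    my @counts = @_;
    my $total = 0;
    $total += $_ for @counts;
    return undef unless $total;

    # 1-based positions of the middle element(s) in the imaginary sorted
    # list; the two positions coincide when $total is odd.
    my $lo_pos = int( ( $total + 1 ) / 2 );
    my $hi_pos = int( $total / 2 ) + 1;

    my ( $lo, $hi );
    my $seen = 0;
    for my $rating ( 0 .. $#counts ) {
        $seen += $counts[$rating];
        $lo = $rating if !defined $lo && $seen >= $lo_pos;
        if ( $seen >= $hi_pos ) { $hi = $rating; last }
    }
    return ( $lo + $hi ) / 2;
}

print median_from_counts( 3, 1, 0, 4, 1, 2 ), "\n";   # the thread's example, 11 ratings
print median_from_counts( 2, 0, 0, 2 ), "\n";         # 4 ratings: 0,0,3,3
```

With the counts from the original post this prints 3, and for the four ratings 0,0,3,3 it prints 1.5, since the two middle values fall in different bins. Memory use depends only on the number of distinct ratings, not on how many people voted.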

Xho

 10-06-2007
Bill H wrote:
> On Oct 5, 5:44 pm, Charlton Wilbur wrote:
>> >>>>> "BH" == Bill H writes:
>>
>> BH> The problem I have (and I hope I am explaining this right), to
>> BH> calculate a median, I have to make an array that contains all
>> BH> the values, sorted from low to high, and then look at the
>> BH> value of the element in the middle to get the median.

>> Alternately, if you did the sensible thing and kept $RATE0 through
>> $RATE5 in an array, you could say, much more elegantly,
>>
>> my @array = map { ($_) x $RATE[$_] } (0..5);

> Thanks Charlton, but would this not still make a large array if the
> total number of people is high (unles I am missing something in it).

How many hundreds of thousands of people do you expect
will take your survey?

 10-06-2007

"Bill H" wrote in message
> Background:
>
> I have a routine I am writing in perl that will give me the median for
> a 0 to 5 rating. The ratings are stored in a file and I load the
> values into 7 different variables, RATE0 - RATE5 and one called TOTAL.
> When a person rates a page I increment one of the RATE variables
> based on what they selected (0 - 5) and increment TOTAL so I have a
> running count (which is really just the sum of RATE0 - RATE5).
>
> The problem I have (and I hope I am explaining this right), to
> calculate a median, I have to make an array that contains all the
> values, sorted from low to high, and then look at the value of the
> element in the middle to get the median. As an example if I have the
> following (not real code, just an example of the logic):
>
> $RATE[0] = 3;
> $RATE[1] = 1;
> $RATE[2] = 0;
> $RATE[3] = 4;
> $RATE[4] = 1;
> $RATE[5] = 2;
>
> Then my array would be:
>
> @ARRAY = (0,0,0,1,3,3,3,3,4,5,5);
>
> And the median would be $ARRAY[5] or 3. With an even number of
> elements in @ARRAY I have to add the value below the middle and the
> value above the middle, divide by 2 to get the median.
>
> For a small sample this is no problem, but when the number of people
> who have rated it get in to the 1000's this array is going to be too
> cumbersome. Does anyone know of a simpler way to do it in perl without
> adding in modules or using alot of memory?
>
> Any / all ideas are welcomed, but please remember that the example I
> gave is just typed to give you an idea and is not any real code I am
> using.
>
> Bill H
>

Bill, if keeping memory use low is a priority and you do need the median
of thousands of ratings, then for your data you can probably use the
mean instead: provided the ratings are spread roughly symmetrically, the
mean and median should end up close to each other as the count grows.
int($mean) would give you a whole number if needed.
Cheers, Peter
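
As a rough sketch of that suggestion, the mean can be taken straight from the per-rating counts, with no expanded array. The name mean_from_bins is invented for this example, and it assumes the counts are held in an array indexed by rating, as in the original post's $RATE[0] .. $RATE[5].

```perl
use strict;
use warnings;

# Mean rating from per-rating counts: sum(rating * count) / total.
# Returns undef when there are no ratings.
# (mean_from_bins is a hypothetical name used only for this sketch.)
sub mean_from_bins {
    my ($bins) = @_;
    my ( $weighted, $total ) = ( 0, 0 );
    for my $rating ( 0 .. $#$bins ) {
        $weighted += $rating * $bins->[$rating];
        $total    += $bins->[$rating];
    }
    return $total ? $weighted / $total : undef;
}

my @RATE = ( 3, 1, 0, 4, 1, 2 );
printf "mean = %.2f\n", mean_from_bins( \@RATE );   # 27 / 11
```

For the counts in the original post this gives 27/11, about 2.45, while the median of the same data is 3, so the two can differ noticeably when the ratings are skewed.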

 10-06-2007
On 10/05/2007 09:55 PM, l v wrote:
> Bill H wrote:
>> [ problem calculating the median without using too much memory ]
>> Bill H
>>

>
>
> use strict;
> use warnings;
> @ARRAY = (0,0,0,1,3,3,3,3,4,5,5);

This kind of array is what Bill wanted to avoid creating.

>
> # using the same array since you are concerned about memory.
> # need to load the array to handle sorting of 2 digit numbers.
> @ARRAY = sort map {sprintf "%05d", $_} @ARRAY;

How is that simpler than this?

@ARRAY = sort { $a <=> $b } @ARRAY;

> $midPoint = $#ARRAY / 2;
> $median = $ARRAY[int $midPoint];
>
> if ($midPoint != int $midPoint) {
> $upperPoint = $midPoint +1;
> $median = ($median + $ARRAY[int $upperPoint]) / 2;
> }
>
> print "median = $median\n";
>

use POSIX 'floor';
# Middle element; exact only when the array has an odd number of elements.
print "median = ", $ARRAY[floor($#ARRAY/2)], "\n";

>
> But this is why I use the Statistics::Descriptive::Discrete module to
> calculate medians.
>

Bill said he didn't want to use any modules.

 10-06-2007
On 05 Oct 2007 17:44:43 -0400, Charlton Wilbur wrote:

>if (@array % 2)
>{
> $median = ($array[(@array-1)/2] + $array[(@array+1)/2])/2;
>}
>else
>{
> $median = $array[@array/2];
>}

Actually, AIUI the index in the latter should be (@array-1)/2 (or
$#array/2), and the two calculations should be swapped. Of
course, this is IMHO a good place to use the ternary conditional
operator.[*]

[*] On second thought, do "of course" and "IMHO" clash?
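
A minimal sketch of how the corrected version might look with the ternary operator. The wrapper name median_sorted is invented for illustration, and it assumes a non-empty list that is already sorted numerically.

```perl
use strict;
use warnings;

# Median of an already sorted, non-empty list of numbers.
# (median_sorted is a hypothetical name used only for this sketch.)
sub median_sorted {
    my @array = @_;
    my $n = @array;
    return $n % 2
        ? $array[ ( $n - 1 ) / 2 ]                             # odd: single middle element
        : ( $array[ $n / 2 - 1 ] + $array[ $n / 2 ] ) / 2;     # even: mean of the two middle elements
}

print median_sorted( 0, 0, 0, 1, 3, 3, 3, 3, 4, 5, 5 ), "\n";   # 11 elements
print median_sorted( 0, 1, 2, 3 ), "\n";                        # 4 elements
```

The first call prints 3 (index 5 of 11 elements) and the second prints 1.5. The even branch uses explicit integer indices rather than relying on Perl truncating a fractional subscript.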

Michele
