Re: Compact Python library for math statistics

 
 
Gerrit
      04-02-2004
The original poster wrote:
> I'm looking for a Python library for math statistics. This must be a clear set of general statistics functions like 'average', 'variance', 'covariance' etc.


The next version of Python will have a 'statistics' module. It is
probably usable in Python 2.3 as well. You can find it in CVS:

http://cvs.sourceforge.net/viewcvs.p.../statistics.py

I'm not sure whether it's usable in current CVS, though. You may have to
tweak it a little.

Gerrit.

--
Weather in Twenthe, Netherlands 02/04 11:55 UTC:
16.0°C Broken clouds mostly cloudy wind 4.5 m/s ESE (57 m above NAP)
--
Experiences with Asperger's Syndrome:
http://topjaklont.student.utwente.nl/english/

 
 
 
 
 
Chris Fonnesbeck
      04-06-2004
Gerrit <> wrote in message news:<mailman.267.1080904644.20120.python->...
> wrote:
> > I'm looking for a Python library for math statistics. This must be a clear set of general statistics functions like 'average', 'variance', 'covariance' etc.
>
> The next version of Python will have a 'statistics' module. It is
> probably usable in Python 2.3 as well. You can find it in CVS:
>
> http://cvs.sourceforge.net/viewcvs.p...hon/nondist/sandbox/statistics/statistics.py
>
> I'm not sure whether it's usable in current CVS, though. You may have to
> tweak it a little.
>
> Gerrit.



I'm hoping there will be more functions added to this module (e.g.
median, quantiles, skewness, kurtosis). It wouldn't take much to
include at least the basic summary stats. I would be more than happy
to contribute.

cjf
 
 
 
 
 
TaeKyon
      04-06-2004
On Mon, 05 Apr 2004 19:41:52 -0700, Chris Fonnesbeck wrote:

>> > I'm looking for a Python library for math statistics. This must be a clear set of general statistics functions like 'average', 'variance', 'covariance' etc.


You can also use R from within python; take a look at:

http://www.omegahat.org/RSPython/

--
Michele Alzetta
 
 
beliavsky@aol.com
      04-06-2004
Gerrit <> wrote in message news:<mailman.267.1080904644.20120.python->...
> wrote:
> > I'm looking for a Python library for math statistics. This must be a clear set of general statistics functions like 'average', 'variance', 'covariance' etc.
>
> The next version of Python will have a 'statistics' module. It is
> probably usable in Python 2.3 as well. You can find it in CVS:
>
> http://cvs.sourceforge.net/viewcvs.p...hon/nondist/sandbox/statistics/statistics.py
>
> I'm not sure whether it's usable in current CVS, though. You may have to
> tweak it a little.


<SNIP>

It works for me, at least the mean function. A statistics module will
be nice to have, although it is easy to write your own.

Here is a minor suggestion. The functions 'mean' and 'variance' are
separate, and the latter function requires a mean to be calculated. To
save CPU time, it would be nice to have a single function that returns
both the mean and variance, or a function to compute the variance with
a known mean.

Ideally there would be a function such as

def stats(x,ss)

where ss contains a list of statistics to be computed and the function
returns a list of the same size. If you called it with

y = stats(x,["mean","variance"])

the function would compute the mean and variance efficiently.
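A minimal sketch of what such a function might look like, written for current Python 3. The name stats and its behaviour are hypothetical and are not part of the sandbox module; the sketch assumes the sample variance (dividing by n - 1) is wanted and simply reuses the mean once it is known:

```python
def stats(x, ss):
    """Return the requested statistics for x, in the order given by ss.

    Hypothetical helper, not part of the sandbox statistics module.
    The mean is computed once and reused when the variance is requested.
    """
    data = list(x)
    n = len(data)
    if n == 0:
        raise ValueError("stats() requires at least one data point")
    mean = sum(data) / n
    results = []
    for name in ss:
        if name == "mean":
            results.append(mean)
        elif name == "variance":
            if n < 2:
                raise ValueError("sample variance requires at least two data points")
            # Assumption: sample variance (n - 1 in the denominator).
            results.append(sum((v - mean) ** 2 for v in data) / (n - 1))
        else:
            raise ValueError("unknown statistic: %r" % (name,))
    return results


y = stats([1.0, 2.0, 3.0, 4.0], ["mean", "variance"])
print(y)
```

The mean is always computed here because it is cheap and every statistic in this sketch depends on it; a fuller version would probably compute only what ss asks for.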

Other comments:
(1) In computing the median, there is a line of code

return (select(data, n//2) + select(data, n//2-1)) / 2

I think finding the 500th and 501st elements separately out of a 1000
element array is inefficient. Isn't there a way to get consecutive
ordered elements in about the same time needed to get a single
element?

(2) The following code crashes when median(x) is computed. Why?

from statistics import mean,median
x = [1.0,2.0,3.0,4.0]
print mean(x)
print median(x)

(3) The standard deviation is computed as

return variance(data, sample) ** 0.5

I think the sqrt function should be used instead -- this may be
implemented more efficiently than general exponentiation.
 
 
Dave Benjamin
      04-06-2004
In article < >, Chris Fonnesbeck wrote:
> Gerrit <> wrote in message news:<mailman.267.1080904644.20120.python->...
>> wrote:
>> > I'm looking for a Python library for math statistics. This must be a clear set of general statistics functions like 'average', 'variance', 'covariance' etc.
>>
>> The next version of Python will have a 'statistics' module. It is
>> probably usable in Python 2.3 as well. You can find it in CVS:
>>
>> http://cvs.sourceforge.net/viewcvs.p...hon/nondist/sandbox/statistics/statistics.py
>>
>> I'm not sure whether it's usable in current CVS, though. You may have to
>> tweak it a little.

>
> I'm hoping there will be more functions added to this module (e.g.
> median, quantiles, skewness, kurtosis). It wouldn't take much to
> include at least the basic summary stats. I would be more than happy
> to contribute.


I'd really like to see linear regression in the Python stats module. I've
used the one from stats.py successfully - this may be a good source of
ideas, too:

http://www.nmr.mgh.harvard.edu/Neura...ry/python.html

(see stats.py)
(apologies if this has already been pointed out somewhere)

--
..:[ dave benjamin: ramen/[sp00] -:- spoomusic.com -:- ramenfest.com ]:.
: please talk to your son or daughter about parametric polymorphism. :
 
 
Asier
      04-06-2004
> > I'm looking for a Python library for math statistics. This must be a clear set of general statistics functions like 'average', 'variance', 'covariance' etc.


Have you looked at PyGSL? http://pygsl.sf.net

I've programmed with the GSL library in C, and it works very well and is
fast. It has code for a very long list of mathematical functions.
Currently pygsl is a work in progress, but some modules are complete.

--
Asier.
 
 
Alan James Salmoni
      04-08-2004
Hi Gerrit,

If you want an object-oriented version, try the SalStat stats module
(salstat_stats.py). Features the descriptives you discussed plus a
range of inferential tests (currently up to and including anova and
nonparametric equivalents). Addy is http://salstat.sourceforge.net for
the entire package. The CVS stats module is a little borked right now
though as I've been making lots of changes, so get the stable
downloadable one.

Alan.

Gerrit <> wrote in message news:<mailman.267.1080904644.20120.python->...
> wrote:
> > I'm looking for a Python library for math statistics. This must be a clear set of general statistics functions like 'average', 'variance', 'covariance' etc.
>
> The next version of Python will have a 'statistics' module. It is
> probably usable in Python 2.3 as well. You can find it in CVS:
>
> http://cvs.sourceforge.net/viewcvs.p...hon/nondist/sandbox/statistics/statistics.py
>
> I'm not sure whether it's usable in current CVS, though. You may have to
> tweak it a little.
>
> Gerrit.
>
> --
> Weather in Twenthe, Netherlands 02/04 11:55 UTC:
> 16.0°C Broken clouds mostly cloudy wind 4.5 m/s ESE (57 m above NAP)

 
 
Raymond Hettinger
      04-09-2004
> A statistics module will
> be nice to have, although it is easy to write your own.
>
> Here is a minor suggestion. The functions 'mean' and 'variance' are
> separate, and the latter function requires a mean to be calculated. To
> save CPU time, it would be nice to have a single function that returns
> both the mean and variance, or a function to compute the variance with
> a known mean.


Like you said, that is easy enough to write on your own. This
lightweight module is not meant to replace heavy-weights that already
exist outside of the core distribution.

The goals are to have a simple set of functions for daily use and for
these data reduction functions to work as well as possible with
generator expressions (one pass over the data wherever possible).
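As an illustration of the one-pass goal, here is a sketch of Welford's running update, which yields both the mean and the variance from a single pass over any iterable, including a generator expression. It is not taken from the sandbox module; the function name mean_variance is hypothetical, and the sample variance (n - 1) is an assumption:

```python
def mean_variance(iterable):
    """Return (mean, sample variance) using one pass over the data.

    Hypothetical helper using Welford's running update; it never needs
    to hold the whole data set in memory.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for v in iterable:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
    if n < 2:
        raise ValueError("need at least two data points")
    return mean, m2 / (n - 1)


# Works directly on a generator expression, which can only be read once.
m, var = mean_variance(float(v) for v in range(1, 5))
print(m, var)
```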



> (1) In computing the median, there is a line of code
>
> return (select(data, n//2) + select(data, n//2-1)) / 2
>
> I think finding the 500th and 501st elements separately out of a 1000
> element array is inefficient. Isn't there a way to get consecutive
> ordered elements in about the same time needed to get a single
> element?


Select uses an O(n) algorithm, so the penalty is not that much.
Making it accommodate selecting a range would greatly complicate and
slow down the code. If you need the low, high, percentiles, then it
may be better to just sort the data.
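A short sketch of the sort-once approach, in current Python 3. The function name summary is hypothetical; it trades the O(n) selection for a single O(n log n) sort, after which the low, high, and both middle elements are all simple index lookups:

```python
def summary(data):
    """Return (low, median, high) from a single sort of the data.

    Hypothetical helper: after sorting, neighbouring order statistics
    cost nothing extra to read.
    """
    s = sorted(data)
    n = len(s)
    if n == 0:
        raise ValueError("summary() requires at least one data point")
    mid = n // 2
    if n % 2:
        median = s[mid]
    else:
        # Even length: average the two middle elements.
        median = (s[mid - 1] + s[mid]) / 2
    return s[0], median, s[-1]


print(summary([4.0, 1.0, 3.0, 2.0]))
```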



> (2) The following code crashes when median(x) is computed. Why?
>
> from statistics import mean,median
> x = [1.0,2.0,3.0,4.0]
> print mean(x)
> print median(x)


Hmm, it works for me. What does your traceback look like?



> (3) The standard deviation is computed as
>
> return variance(data, sample) ** 0.5
>
> I think the sqrt function should be used instead -- this may be
> implemented more efficiently than general exponentiation.


The timings show otherwise:

C:\pydev>python timeit.py -r9 -n100000 -s "import math; sqrt=math.sqrt" "sqrt(7.0)"
100000 loops, best of 9: 1.7 usec per loop

C:\pydev>python timeit.py -r9 -n100000 -s "7.0 ** 0.5"
100000 loops, best of 9: 0.237 usec per loop
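One caveat when reproducing these numbers: in the second command as quoted, the expression follows -s, so timeit appears to treat it as setup code and to time its default statement (pass) rather than the exponentiation. Below is a sketch of an equivalent comparison using the timeit module from current Python 3, with both expressions passed as the timed statement. The variable x is introduced here so that neither expression is a compile-time constant; no timings are claimed, since they depend on the machine and the Python version:

```python
import timeit

REPEAT = 5
NUMBER = 100000

# Both statements read x from the setup code, so the compiler cannot
# fold either expression into a constant before timing starts.
t_sqrt = min(timeit.repeat("sqrt(x)",
                           setup="from math import sqrt; x = 7.0",
                           repeat=REPEAT, number=NUMBER))
t_pow = min(timeit.repeat("x ** 0.5",
                          setup="x = 7.0",
                          repeat=REPEAT, number=NUMBER))

# Convert total seconds for NUMBER loops into microseconds per loop.
print("sqrt(x):  %.3f usec per loop" % (t_sqrt / NUMBER * 1e6))
print("x ** 0.5: %.3f usec per loop" % (t_pow / NUMBER * 1e6))
```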



Raymond Hettinger
 
 
beliavsky@aol.com
      04-09-2004
(Raymond Hettinger) wrote in message news:<. com>...

<SNIP>

> > (2) The following code crashes when median(x) is computed. Why?
> >
> > from statistics import mean,median
> > x = [1.0,2.0,3.0,4.0]
> > print mean(x)
> > print median(x)

>
> Hmm, it works for me. What does your traceback look like?


The module statistics.py imports a module 'random'. I have my own file
random.py, and it was importing that. My mistake -- sorry.
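A quick way to diagnose this kind of shadowing is to ask Python which file a module name resolves to. The sketch below uses importlib from current Python 3 (it did not exist in Python 2.3); the helper name module_origin is hypothetical:

```python
import importlib.util


def module_origin(name):
    """Return the location a module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# If this prints a path inside your own project directory rather than the
# standard library, a local random.py is shadowing the real module.
print(module_origin("random"))
```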

> > (3) The standard deviation is computed as
> >
> > return variance(data, sample) ** 0.5
> >
> > I think the sqrt function should be used instead -- this may be
> > implemented more efficiently than general exponentiation.

>
> The timings show otherwise:
>
> C:\pydev>python timeit.py -r9 -n100000 -s "import math; sqrt=math.sqrt" "sqrt(7.0)"
> 100000 loops, best of 9: 1.7 usec per loop
>
> C:\pydev>python timeit.py -r9 -n100000 -s "7.0 ** 0.5"
> 100000 loops, best of 9: 0.237 usec per loop


For the Compaq and Lahey/Fujitsu Fortran 95 compilers I found that
sqrt(x) and x**0.5 take the same time -- probably the compiler
converts the latter to the former. On one compiler I found that
computing x**0.49 takes about 10 times longer than sqrt(x), indicating
that a sqrt function should be considerably faster than real
exponentiation.

I wonder if for Python, psyco eliminates the speed difference between
sqrt(x) and x**0.5. Otherwise, the speed difference may indicate a
fundamental problem in using a scripting language like Python for
numerical work -- function calls take too much time. Because of that,
sqrt is much slower than real exponentiation, when it should be much
faster.

Overall, the Python code below is about 100 times slower than the
Fortran equivalent. This is a typical ratio I have found for code
involving loops.

from math import sqrt
n = 10000000 + 1
sum_sqrt = 0.0
for i in range(1,n):
    sum_sqrt = sum_sqrt + (float(i))**0.5
print sum_sqrt
 
 
Josiah Carlson
      04-10-2004
> Overall, the Python code below is about 100 times slower than the
> Fortran equivalent. This is a typical ratio I have found for code
> involving loops.
>
> from math import sqrt
> n = 10000000 + 1
> sum_sqrt = 0.0
> for i in range(1,n):
>     sum_sqrt = sum_sqrt + (float(i))**0.5
> print sum_sqrt


Yeah...you may want to consider doing some optimizations to the above
code. Using 'xrange' instead of 'range' is significantly faster
(especially when your machine can't hold 'n' integers in a Python list
in memory), as is the removal of the 'float(i)' cast (which is unnecessary).
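A sketch of the loop with both suggestions applied, written for current Python 3, where range is already lazy in the way xrange was in Python 2. The function name sum_sqrt is hypothetical, and no speed-up figure is claimed:

```python
def sum_sqrt(count):
    """Sum the square roots of the integers 1..count.

    Hypothetical rewrite of the quoted loop: range() yields values
    lazily, and i ** 0.5 already returns a float, so no float(i) cast
    is needed.
    """
    return sum(i ** 0.5 for i in range(1, count + 1))


print(sum_sqrt(4))
```

Calling sum_sqrt(10000000) corresponds to the loop in the quoted code.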

As for Python being slow compared to Fortran, of course it is going to
be slow in comparison. Fortran is compiled to assembly, and has fairly
decent (if not amazing) optimizers. Python is bytecode compiled,
interpreted, and lacks an even remotely equivalent optimizer.


- Josiah
 