
# Request for feedback on API design

Steven D'Aprano

 12-09-2010
I am soliciting feedback regarding the API of my statistics module:

Specifically the following couple of issues:

(1) Multivariate statistics such as covariance have two obvious APIs:

A pass the X and Y values as two separate iterable arguments, e.g.:
cov([1, 2, 3], [4, 5, 6])

B pass the X and Y values as a single iterable of tuples, e.g.:
cov([(1, 4), (2, 5), (3, 6)])

I currently support both APIs. Do people prefer one, or the other, or
both? If there is a clear preference for one over the other, I may drop
support for the other.
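To make the two calling conventions concrete, here is a minimal sketch of a single function that accepts both. It is an illustration only, not the module's actual code: the signature `cov(xdata, ydata=None)` and the choice of dividing by n (population covariance) are assumptions made for the example.

```python
def cov(xdata, ydata=None):
    """Population covariance (divides by n; an assumption for this sketch).

    Hypothetical signature accepting either calling convention:
      cov(xs, ys)    -- form A, two separate iterables
      cov(pairs)     -- form B, one iterable of (x, y) tuples
    """
    if ydata is None:
        # Form B: a single iterable of (x, y) pairs.
        pairs = list(xdata)
    else:
        # Form A: two iterables, which must be the same length.
        xs, ys = list(xdata), list(ydata)
        if len(xs) != len(ys):
            raise ValueError("x and y must have the same length")
        pairs = list(zip(xs, ys))
    n = len(pairs)
    if n == 0:
        raise ValueError("cov requires at least one data point")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n


# Both forms give the same answer for the example data.
form_a = cov([1, 2, 3], [4, 5, 6])
form_b = cov([(1, 4), (2, 5), (3, 6)])
```

Note that form A needs an explicit length check, whereas form B gets the pairing for free from the shape of the data.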

(2) Statistics text books often give formulae in terms of sums and
differences such as

Sxx = n*Σ(x**2) - (Σx)**2

There are quite a few of these: I count at least six common ones, all
closely related and confusingly named:

Sxx, Syy, Sxy, SSx, SSy, SPxy

(the x and y should all be subscript).

Are they useful, or would they just add unnecessary complexity? Would
people like to see these included in the package?

Thank you for your feedback.

--
Steven

Tim Chase

 12-10-2010
On 12/09/2010 05:44 PM, Steven D'Aprano wrote:
> (1) Multivariate statistics such as covariance have two obvious APIs:
>
> A pass the X and Y values as two separate iterable arguments, e.g.:
> cov([1, 2, 3], [4, 5, 6])
>
> B pass the X and Y values as a single iterable of tuples, e.g.:
> cov([(1, 4), (2, 5), (3, 6)]
>
> I currently support both APIs. Do people prefer one, or the other, or
> both? If there is a clear preference for one over the other, I may drop
> support for the other.

I'm partial to the "B" form (iterable of 2-tuples) -- it
indicates that the two data-sets (x_n and y_n) should be of the
same length and paired. The "A" form makes it less obvious
that len(param1) should equal len(param2).

I haven't poked at your code sufficiently to determine whether
all the functions within can handle streamed data, or whether
they keep the entire dataset internally, but handing off an
iterable-of-pairs tends to be a little more straightforward:

cov(humongous_dataset_iter)

or

cov(izip(humongous_dataset_iter1, humongous_dataset_iter2))

The "A" form makes doing this a little less obvious than the "B"
form.
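As a rough illustration of the streaming case, the sketch below computes a covariance in a single pass over an iterable of pairs without storing the data. The name `cov_stream`, the running co-moment update, and the division by n (population covariance) are all assumptions for this example, not anything taken from the module. It is written for Python 3, where the built-in `zip` is lazy in the same way `izip` is in Python 2.

```python
def cov_stream(pairs):
    """Single-pass population covariance over an iterable of (x, y) pairs.

    Hypothetical helper: keeps only running means and a running co-moment,
    so the input can be a generator of any length.
    """
    n = 0
    mean_x = mean_y = 0.0
    co_moment = 0.0  # running sum of (x - mean_x) * (y - mean_y)
    for x, y in pairs:
        n += 1
        dx = x - mean_x              # deviation from the previous mean of x
        mean_x += dx / n
        mean_y += (y - mean_y) / n
        co_moment += dx * (y - mean_y)   # uses the updated mean of y
    if n == 0:
        raise ValueError("cov_stream requires at least one data point")
    return co_moment / n


# Two lazy sources paired up on the fly; nothing is materialised as a list.
xs = (i for i in range(1, 4))   # yields 1, 2, 3
ys = (i for i in range(4, 7))   # yields 4, 5, 6
result = cov_stream(zip(xs, ys))
```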

> (2) Statistics text books often give formulae in terms of sums and
> differences such as
>
> Sxx = n*Σ(x**2) - (Σx)**2
>
> There are quite a few of these: I count at least six common ones,

When you take this count, is it across multiple text-books, or
are they common in just a small sampling of texts? (I confess
it's been a decade and a half since I last suffered a stats class)

> all closely related and confusing named:
>
> Sxx, Syy, Sxy, SSx, SSy, SPxy
>
> (the x and y should all be subscript).
>
> Are they useful, or would they just add unnecessary complexity?

I think it depends on your audience: amateur statisticians or
pros? I suspect that pros wouldn't blink at the distinctions
while weekenders like myself would get a little bleary-eyed
without at least a module docstring to clearly spell out the
distinctions and the formulae used for determining them.

Just my from-the-hip thoughts for whatever little they may be worth.

-tkc

Steven D'Aprano

 12-10-2010
On Thu, 09 Dec 2010 18:48:10 -0600, Tim Chase wrote:

> On 12/09/2010 05:44 PM, Steven D'Aprano wrote:
>> (1) Multivariate statistics such as covariance have two obvious APIs:
>>
>> A pass the X and Y values as two separate iterable arguments,
>> e.g.:
>> cov([1, 2, 3], [4, 5, 6])
>>
>> B pass the X and Y values as a single iterable of tuples, e.g.:
>> cov([(1, 4), (2, 5), (3, 6)]
>>
>> I currently support both APIs. Do people prefer one, or the other, or
>> both? If there is a clear preference for one over the other, I may drop
>> support for the other.

>
> I'm partial to the "B" form (iterable of 2-tuples) -- it indicates that
> the two data-sets (x_n and y_n) should be of the same length and paired.
> The "A" form leaves this less obvious that len(param1) should equal
> len(param2).

> I haven't poked at your code sufficiently to determine whether all the
> functions within can handle streamed data, or whether they keep the
> entire dataset internally,

Where possible, the functions don't keep the entire dataset internally.
Some functions have to (e.g. order statistics need to see the entire data
sequence at once), but the rest are capable of dealing with streamed data.

Also, there are a few functions such as standard deviation that have a
single-pass algorithm, and a more accurate multiple-pass algorithm.
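The thread does not show which algorithms the module uses, so the following is only a sketch of the general trade-off. It assumes Welford's method for the single-pass version and a plain two-pass computation for the other, both returning the sample standard deviation (dividing by n - 1). The function names are made up for the example.

```python
import math


def stdev_one_pass(data):
    """Sample standard deviation in one pass (Welford's method).

    Works on any iterable, including one that can only be consumed once.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        raise ValueError("need at least two data points")
    return math.sqrt(m2 / (n - 1))


def stdev_two_pass(data):
    """Sample standard deviation in two passes; needs the data in memory."""
    data = list(data)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = math.fsum(data) / n                      # first pass: the mean
    ss = math.fsum((x - mean) ** 2 for x in data)   # second pass: deviations
    return math.sqrt(ss / (n - 1))


sample = [2, 4, 4, 4, 5, 5, 7, 9]
one_pass = stdev_one_pass(iter(sample))   # consumed as a stream
two_pass = stdev_two_pass(sample)
```

On well-behaved data like this the two agree to within rounding error. The two-pass version can use `math.fsum` for both sums, which is where its extra accuracy would come from.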

>> (2) Statistics text books often give formulae in terms of sums and
>> differences such as
>>
>> Sxx = n*Σ(x**2) - (Σx)**2
>>
>> There are quite a few of these: I count at least six common ones,

>
> When you take this count, is it across multiple text-books, or are they
> common in just a small sampling of texts? (I confess it's been a decade
> and a half since I last suffered a stats class)

I admit that I haven't done an exhaustive search of the literature, but
it does seem quite common to extract common expressions from various
stats formulae and give them names. The only use-case I can imagine for
them is checking hand-calculations or doing schoolwork.

--
Steven

Arnaud Delobelle

 12-13-2010
Steven D'Aprano writes:

> I am soliciting feedback regarding the API of my statistics module:
>
>
>
> Specifically the following couple of issues:
>
> (1) Multivariate statistics such as covariance have two obvious APIs:
>
> A pass the X and Y values as two separate iterable arguments, e.g.:
> cov([1, 2, 3], [4, 5, 6])
>
> B pass the X and Y values as a single iterable of tuples, e.g.:
> cov([(1, 4), (2, 5), (3, 6)]
>
> I currently support both APIs. Do people prefer one, or the other, or
> both? If there is a clear preference for one over the other, I may drop
> support for the other.
>

I don't have an informed opinion on this.

> (2) Statistics text books often give formulae in terms of sums and
> differences such as
>
> Sxx = n*Σ(x**2) - (Σx)**2

Interestingly, your Sxx is closely related to the variance:

if x is a list of n numbers then

Sxx == (n**2)*var(x)

And more generally if x and y have the same length n, then Sxy (*) is
related to the covariance

Sxy == (n**2)*cov(x, y)

So if you have a variance and covariance function, it would be redundant
to include Sxx and Sxy. Another argument against including Sxx & co is
that their definition is not universally agreed upon. For example, I
have seen

Sxx = Σ(x**2) - (Σx)**2/n

HTH

--
Arnaud

(*) Here I take Sxy to be n*Σ(xy) - (Σx)(Σy), generalising from your
definition of Sxx.
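
The identities above can be checked with exact arithmetic. The sketch below assumes that var and cov mean the population versions (dividing by n), which is what makes the n**2 factor come out exactly. The helper names `pvar`, `pcov` and `S` are made up for this example.

```python
from fractions import Fraction


def pvar(x):
    """Population variance, computed exactly with Fractions."""
    n = len(x)
    mean = Fraction(sum(x), n)
    return sum((xi - mean) ** 2 for xi in x) / n


def pcov(x, y):
    """Population covariance, computed exactly with Fractions."""
    n = len(x)
    mean_x = Fraction(sum(x), n)
    mean_y = Fraction(sum(y), n)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def S(x, y):
    """n*Σ(xy) - (Σx)(Σy); S(x, x) is Sxx as defined in the thread."""
    n = len(x)
    return n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)


x = [1, 2, 3, 7]
y = [4, 5, 6, 1]
n = len(x)

sxx_matches = S(x, x) == n**2 * pvar(x)      # Sxx == (n**2)*var(x)
sxy_matches = S(x, y) == n**2 * pcov(x, y)   # Sxy == (n**2)*cov(x, y)

# The alternative definition Σ(x**2) - (Σx)**2/n differs by a factor of n.
alt_sxx = sum(a * a for a in x) - Fraction(sum(x) ** 2, n)
alt_matches = alt_sxx == Fraction(S(x, x), n)
```

With sample (n - 1) variance and covariance the factor would be n*(n - 1) rather than n**2.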

Ethan Furman

 12-13-2010
Steven D'Aprano wrote:
> I am soliciting feedback regarding the API of my statistics module:
>
>
>
> Specifically the following couple of issues:
>
> (1) Multivariate statistics such as covariance have two obvious APIs:
>
> A pass the X and Y values as two separate iterable arguments, e.g.:
> cov([1, 2, 3], [4, 5, 6])
>
> B pass the X and Y values as a single iterable of tuples, e.g.:
> cov([(1, 4), (2, 5), (3, 6)]
>
> I currently support both APIs. Do people prefer one, or the other, or
> both? If there is a clear preference for one over the other, I may drop
> support for the other.
>

Don't currently need/use stats, but B seems clearer to me.

~Ethan~