None versus MISSING sentinel -- request for design feedback

 
 
Steven D'Aprano
07-15-2011
Hello folks,

I'm designing an API for some lightweight calculator-like statistics
functions, such as mean, standard deviation, etc., and I want to support
missing values. Missing values should be just ignored. E.g.:

mean([1, 2, MISSING, 3]) => 6/3 = 2 rather than 6/4 or raising an error.

My question is, should I accept None as the missing value, or a dedicated
singleton?

In favour of None: it's already there, no extra code required. People may
expect it to work.

Against None: it's too easy to add None to a data set by mistake,
because functions return None by default.

In favour of a dedicated MISSING singleton: it's obvious from context. It's
not a lot of work to implement compared to using None. It's hard to include
by mistake. If None does creep into the data by accident, you
get a nice explicit exception.

Against MISSING: users may expect to be able to choose their own sentinel by
assigning to MISSING. I don't want to support that.


I've considered what other packages do:-

R uses a special value, NA, to stand in for missing values. This is more or
less the model I wish to follow.

I believe that MATLAB treats float NANs as missing values. I consider this
an abuse of NANs and I won't be supporting that.

Spreadsheets such as Excel, OpenOffice and Gnumeric generally ignore blank
cells, and give you a choice between ignoring text and treating it as zero.
E.g. with cells set to [1, 2, "spam", 3] the AVERAGE function returns 2 and
the AVERAGEA function returns 1.5.
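
The arithmetic behind those two results can be sketched roughly as follows. This only mimics the two behaviours described above; the function names are hypothetical, and using None to stand for a blank cell is an assumption:

```python
def average(cells):
    """Like AVERAGE as described above: skip blanks and text."""
    nums = [c for c in cells if isinstance(c, (int, float))]
    return sum(nums) / len(nums)


def averagea(cells):
    """Like AVERAGEA as described above: skip blanks, count text as zero."""
    nums = [0 if isinstance(c, str) else c for c in cells if c is not None]
    return sum(nums) / len(nums)


cells = [1, 2, "spam", 3]
print(average(cells))   # 2.0  (6 / 3)
print(averagea(cells))  # 1.5  (6 / 4)
```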

numpy uses masked arrays, which is probably over-kill for my purposes; I am
gratified to see it doesn't abuse NANs:

>>> import numpy as np
>>> a = np.array([1, 2, float('nan'), 3])
>>> np.mean(a)
nan

numpy also treats None as an error:

>>> a = np.array([1, 2, None, 3])
>>> np.mean(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/numpy/core/fromnumeric.py", line 860, in mean
    return mean(axis, dtype, out)
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'


I would appreciate any comments, advice or suggestions.


--
Steven

 
Chris Angelico
07-15-2011
On Fri, Jul 15, 2011 at 3:28 PM, Steven D'Aprano wrote:
> My question is, should I accept None as the missing value, or a dedicated
> singleton?
>
> In favour of None: it's already there, no extra code required. People may
> expect it to work.
>
> Against None: it's too easy to add None to a data set by mistake,
> because functions return None by default.


I guess the question is: Why are the missing values there? If they're
there because some function returned None because it didn't have a
value to return, and therefore it's a missing value, then using None
as "missing" would make a lot of sense. But if it's a more explicit
concept of "here's a table of values, and the user said that this one
doesn't exist", it'd be better to have an explicit MISSING. (Which I
assume would be exposed as yourmodule.MISSING or something.)

Agreed that float('nan') and "" and "spam" are all bad values for
Missings. Possibly "" should come out as 0, but "spam" should
definitely fail.

Chris Angelico
 
Rob Williscroft
07-15-2011
Steven D'Aprano wrote in gmane.comp.python.general:

> I'm designing an API for some lightweight calculator-like statistics
> functions, such as mean, standard deviation, etc., and I want to support
> missing values. Missing values should be just ignored. E.g.:
>
> mean([1, 2, MISSING, 3]) => 6/3 = 2 rather than 6/4 or raising an error.


If you can't make your mind up then maybe you shouldn't:

MISSING = MissingObject()
def mean(sequence, missing=MISSING):
    ...

Rob.
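
One possible way to fill in that outline, as a sketch only: a plain object() stands in for MissingObject(), which is not defined in the thread, and the sentinel is matched by identity.

```python
MISSING = object()  # stand-in for MissingObject(), an assumption


def mean(sequence, missing=MISSING):
    """Average sequence, skipping whichever object the caller names as missing."""
    values = [x for x in sequence if x is not missing]
    return sum(values) / len(values)


print(mean([1, 2, MISSING, 3]))             # 2.0
print(mean([1, 2, None, 3], missing=None))  # 2.0
```

Callers who want None to mean "missing" opt in per call; everyone else still gets a TypeError if None slips into the data.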

 
Cameron Simpson
07-15-2011
On 15Jul2011 15:28, Steven D'Aprano wrote:
| In favour of None: it's already there, no extra code required. People may
| expect it to work.

Broadly, I like this one for the reasons you cite.

| Against None: it's too easy to add None to a data set by mistake,
| because functions return None by default.

This is a hazard everywhere, but won't such a circumstance normally
break lots of stuff anyway? What's an example scenario for getting None
by accident but still a bunch of non-None values? The main one I can
imagine is a function with a code path that accidentally falls off the
end without returning a value, e.g.:

def f(x):
    if blah:
        return 7
    ...
    if foo:
        return 0
    # whoops!


I suppose there's no scope for having the append-to-the-list step sanity
check for the sentinel (be it None or otherwise)?
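
A rough sketch of what such an append-time check could look like. CheckedList is a hypothetical name, the conditions inside f are invented for illustration, and only append is guarded here (extend, insert and slice assignment are not):

```python
class CheckedList(list):
    """Hypothetical list that rejects None at the moment it is appended."""

    def append(self, value):
        if value is None:
            raise ValueError("None is not allowed; use an explicit sentinel")
        super().append(value)


def f(x):
    if x > 0:
        return 7
    if x < 0:
        return 0
    # whoops! x == 0 falls off the end and implicitly returns None


data = CheckedList()
data.append(f(5))       # fine, appends 7
try:
    data.append(f(0))   # f(0) is None, so the check fires here
except ValueError as exc:
    print("caught:", exc)
```

The error then surfaces where the bad value enters the data set, not later inside the statistics function.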

| In favour of a dedicated MISSING singleton: it's obvious from context. It's
| not a lot of work to implement compared to using None. It's hard to include
| by mistake. If None does creep into the data by accident, you
| get a nice explicit exception.

I confess to being about to discard None as a sentinel in a bit of my
own code, but only to allow None to be used as a valid value, using the
usual idiom:

class IQ(Queue):
    def __init__(self, ...):
        self._sentinel = object()
        ...

| Against MISSING: users may expect to be able to choose their own sentinel by
| assigning to MISSING. I don't want to support that.

Well, we don't have readonly values to play with.
Personally I'd do what I did above: give it a "private" name like
_MISSING so that people should expect to have inside (and unsupported,
unguaranteed) knowledge if they fiddle with it. Or are you publishing
the sentinel's name to your callers, i.e. may they really return _MISSING
legitimately from their functions?

Cheers,
--
Cameron Simpson DoD#743
http://www.cskk.ezoshosting.com/cs/

What's fair got to do with it? It's going to happen. - Lawrence of Arabia
 
bruno.desthuilliers@gmail.com
07-15-2011
On Jul 15, 8:08 am, Chris Angelico wrote:
>
> Agreed that float('nan') and "" and "spam" are all bad values for
> Missings. Possibly "" should come out as 0


"In the face of ambiguity, refuse the temptation to guess."

As far as I'm concerned, I'd expect this to raise a TypeError...


 
bruno.desthuilliers@gmail.com
07-15-2011
On Jul 15, 7:28 am, Steven D'Aprano wrote:
>
> I'm designing an API for some lightweight calculator-like statistics
> functions, such as mean, standard deviation, etc., and I want to support
> missing values. Missing values should be just ignored. E.g.:



(snip)

> Against None: it's too easy to add None to a data set by mistake,
> because functions return None by default.


Yeps.

> In favour of a dedicated MISSING singleton: it's obvious from context. It's
> not a lot of work to implement compared to using None. It's hard to include
> by mistake. If None does creep into the data by accident, you
> get a nice explicit exception.
>
> Against MISSING: users may expect to be able to choose their own sentinel by
> assigning to MISSING. I don't want to support that.


What about allowing users to specify their own sentinel in the
simplest pythonic way:

# stevencalc.py
MISSING = object()

def mean(values, missing=MISSING):
    # your code here
    pass


Or, if you want to make it easier to specify the sentinel once for the
whole API:

# stevencalc.py
MISSING = object()

class Calc(object):
    def __init__(self, missing=MISSING):
        self._missing = missing

    def mean(self, values):
        # your code here
        pass


# default:
_calc = Calc()
mean = _calc.mean
# etc...

My 2 cents...
 
Teemu Likonen
07-15-2011
* 2011-07-15T15:28:41+10:00 * Steven D'Aprano wrote:

> I'm designing an API for some lightweight calculator-like statistics
> functions, such as mean, standard deviation, etc., and I want to
> support missing values. Missing values should be just ignored. E.g.:
>
> mean([1, 2, MISSING, 3]) => 6/3 = 2 rather than 6/4 or raising an
> error.
>
> My question is, should I accept None as the missing value, or a
> dedicated singleton?


How about accepting anything but ignoring all non-numbers?
 
bruno.desthuilliers@gmail.com
07-15-2011
On Jul 15, 9:44 am, Cameron Simpson wrote:
> On 15Jul2011 15:28, Steven D'Aprano wrote:
> | Against MISSING: users may expect to be able to choose their own sentinel by
> | assigning to MISSING. I don't want to support that.
>
> Well, we don't have readonly values to play with.
> Personally I'd do what I did above: give it a "private" name like
> _MISSING so that people should expect to have inside (and unsupported,
> unguaranteed) knowledge if they fiddle with it.


I think the point is to allow users to explicitly use MISSING in
their data sets, so it does have to be public. But anyway: ALL_UPPER
names are supposed to be treated as constants, so the "warranty void
if messed with" still applies.
 
bruno.desthuilliers@gmail.com
07-15-2011
On Jul 15, 10:28 am, Teemu Likonen wrote:
>
> How about accepting anything but ignoring all non-numbers?


Totally unpythonic. Better to be explicit about what you expect and
crash as loudly as possible when you get anything unexpected.
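
To make the contrast concrete, here is a sketch of both policies side by side. The function names are hypothetical, numbers.Real is assumed as the test for "is a number", and MISSING is again a plain object():

```python
from numbers import Real

MISSING = object()  # assumed sentinel for this sketch


def lenient_mean(data):
    """Accept anything, silently ignoring every non-number."""
    values = [x for x in data if isinstance(x, Real)]
    return sum(values) / len(values)


def strict_mean(data):
    """Skip only the explicit sentinel; anything else non-numeric is an error."""
    values = []
    for x in data:
        if x is MISSING:
            continue
        if not isinstance(x, Real):
            raise TypeError("expected a number or MISSING, got %r" % (x,))
        values.append(x)
    return sum(values) / len(values)


data = [1, 2, "3", 4]       # "3" is a string that slipped in by mistake
print(lenient_mean(data))   # averages 1, 2 and 4; the typo goes unnoticed
try:
    strict_mean(data)
except TypeError as exc:
    print("caught:", exc)
```

The lenient version returns a plausible-looking number for bad input, which is exactly the kind of silent failure the strict version is meant to prevent.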
 
Steven D'Aprano
07-15-2011
Cameron Simpson wrote:

> On 15Jul2011 15:28, Steven D'Aprano wrote:
> | In favour of None: it's already there, no extra code required. People
> | may expect it to work.
>
> Broadly, I like this one for the reasons you cite.
>
> | Against None: it's too easy to add None to a data set by mistake,
> | because functions return None by default.
>
> This is a hazard everywhere, but won't such a circumstance normally
> break lots of stuff anyway?


Maybe, maybe not. Either way, it has nothing to do with me -- I only care
about what my library does if presented with None in a list of numbers.
Should I treat it as a missing value, and ignore it, or treat it as an
error?


> What's an example scenario for getting None
> by accident but still a bunch of non-None values? The main one I can
> imagine is a function with a code path that accidentally falls off the
> end without returning a value, e.g.:

[code snipped]

Yes, that's the main example I can think of. It doesn't really matter how it
happens though, only that it is more likely for None to accidentally get
inserted into a list than it is for a module-specific MISSING value.

My thoughts are, if my library gets presented with two lists:

[1, 2, 3, None, 5, 6]

[1, 2, 3, mylibrary.MISSING, 5, 6]

which is less likely to be an accident and more likely to be deliberate?
That's the one I should accept as the missing value. Does anyone think
that's the wrong choice?


> I suppose there's no scope for having the append-to-the-list step sanity
> check for the sentinel (be it None or otherwise)?


It is not my responsibility to validate data during construction, only to do
the right thing when given that data. The right thing being: raise an
exception if a value is not numeric, unless it is the explicit "missing"
value (whatever that ends up being).


> | Against MISSING: users may expect to be able to choose their own
> | sentinel by assigning to MISSING. I don't want to support that.
>
> Well, we don't have readonly values to play with.
> Personally I'd do what I did above: give it a "private" name like
> _MISSING so that people should expect to have inside (and unsupported,
> unguaranteed) knowledge if they fiddle with it. Or are you publishing
> the sentinel's name to your callers, i.e. may they really return _MISSING
> legitimately from their functions?


Assuming I choose against None, and go with MISSING, it will be a public
part of the library API. The idea being that callers will be responsible
for ensuring that if they have data with missing values, they insert the
correct sentinel, rather than whatever random non-numeric value they
started off with.



--
Steven

 