Velocity Reviews

Velocity Reviews (http://www.velocityreviews.com/forums/index.php)
-   Python (http://www.velocityreviews.com/forums/f43-python.html)
-   -   Numpy outlier removal (http://www.velocityreviews.com/forums/t956216-numpy-outlier-removal.html)

Joseph L. Casale 01-06-2013 07:44 PM

Numpy outlier removal
 
I have a dataset that consists of a dict with text descriptions and values that are integers. If
required, I collect the values into a list and create a numpy array, running it through a simple
routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
to include.


The problem is I lose track of which were removed, so the original display of the dataset is
misleading when the processed average is returned, as it includes the removed key/values.


Anyone know how I can maintain the relationship, and when I exclude a value, remove it from
the dict?

Thanks!
jlc

Hans Mulder 01-06-2013 10:33 PM

Re: Numpy outlier removal
 
On 6/01/13 20:44:08, Joseph L. Casale wrote:
> I have a dataset that consists of a dict with text descriptions and values that are integers. If
> required, I collect the values into a list and create a numpy array running it through a simple
> routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
> to include.
>
>
> The problem is I lose track of which were removed, so the original display of the dataset is
> misleading when the processed average is returned, as it includes the removed key/values.
>
>
> Anyone know how I can maintain the relationship, and when I exclude a value, remove it from
> the dict?


Assuming your data and the dictionary are keyed by a common set of keys:

for key in descriptions:
    if abs(data[key] - mean(data)) >= m * std(data):
        del data[key]
        del descriptions[key]


Hope this helps,

-- HansM


Joseph L. Casale 01-06-2013 10:50 PM

RE: Numpy outlier removal
 
> Assuming your data and the dictionary are keyed by a common set of keys:
>
> for key in descriptions:
>     if abs(data[key] - mean(data)) >= m * std(data):
>         del data[key]
>         del descriptions[key]



Heh, yeah, sometimes the obvious is too simple to see. I used a dict comp to rebuild
the results with the comparison.
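For reference, a dict-comprehension version of that rebuild might look like the following. This is only a sketch: `filter_outliers` and the `readings` data are illustrative names, not from the original code, and it uses the stdlib `statistics` module rather than numpy.

```python
import statistics

def filter_outliers(data, m=2):
    """Return a new dict keeping only entries within m sample standard
    deviations of the mean; key/value pairs stay associated."""
    values = list(data.values())
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return {k: v for k, v in data.items() if abs(v - mu) < m * sigma}

readings = {"a": 10, "b": 11, "c": 9, "d": 10,
            "e": 12, "f": 10, "g": 11, "h": 100}
print(filter_outliers(readings))  # "h" is dropped; the other keys survive
```

Because the comprehension builds a new dict, the original stays intact for display, which sidesteps the bookkeeping problem entirely.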


Thanks!
jlc

MRAB 01-06-2013 11:18 PM

Re: Numpy outlier removal
 
On 2013-01-06 22:33, Hans Mulder wrote:
> On 6/01/13 20:44:08, Joseph L. Casale wrote:
>> I have a dataset that consists of a dict with text descriptions and values that are integers. If
>> required, I collect the values into a list and create a numpy array running it through a simple
>> routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
>> to include.
>>
>>
>> The problem is I lose track of which were removed, so the original display of the dataset is
>> misleading when the processed average is returned, as it includes the removed key/values.
>>
>>
>> Anyone know how I can maintain the relationship, and when I exclude a value, remove it from
>> the dict?

>
> Assuming your data and the dictionary are keyed by a common set of keys:
>
> for key in descriptions:
>     if abs(data[key] - mean(data)) >= m * std(data):
>         del data[key]
>         del descriptions[key]
>

It's generally a bad idea to modify a collection over which you're
iterating. It's better to, say, make a list of what you're going to
delete and then iterate over that list to make the deletions:

deletions = []

for key in descriptions:
    if abs(data[key] - mean(data)) >= m * std(data):
        deletions.append(key)

for key in deletions:
    del data[key]
    del descriptions[key]
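Putting the pieces together, a self-contained version of the two-pass approach might look like this. The dict contents here are made up for illustration; note also that numpy's `std` defaults to the population SD (ddof=0), which differs slightly from the sample SD.

```python
import numpy as np

descriptions = {"a": "sensor A", "b": "sensor B", "c": "sensor C", "d": "sensor D"}
data = {"a": 10.0, "b": 11.0, "c": 9.0, "d": 100.0}

values = np.array(list(data.values()))
mu, sigma = values.mean(), values.std()  # np.std uses ddof=0 by default
m = 1

# First pass: record the keys that fail the test...
deletions = [k for k in data if abs(data[k] - mu) >= m * sigma]

# ...second pass: delete them from both dicts.
for key in deletions:
    del data[key]
    del descriptions[key]

print(sorted(data))  # the outlier "d" is gone from both dicts
```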


Steven D'Aprano 01-07-2013 01:46 AM

Re: Numpy outlier removal
 
On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:

> I have a dataset that consists of a dict with text descriptions and
> values that are integers. If required, I collect the values into a list
> and create a numpy array, running it through a simple routine:
>
> data[abs(data - mean(data)) < m * std(data)]
>
> where m is the number of std deviations to include.


I'm not sure that this approach is statistically robust. No, let me be
even more assertive: I'm sure that this approach is NOT statistically
robust, and may be scientifically dubious.

The above assumes your data is normally distributed. How sure are you
that this is actually the case?

For normally distributed data:

Since both the mean and std calculations are affected by the presence of
outliers, your test for what counts as an outlier will miss outliers for
data from a normal distribution. For small N (sample size), it may be
mathematically impossible for any data point to be greater than m*SD from
the mean. For example, with N=5, no data point can be more than 1.789*SD
from the mean. So for N=5, m=1 may throw away good data, and m=2 will
fail to find any outliers no matter how outrageous they are.

For large N, you will expect to find significant numbers of data points
more than m*SD from the mean. With N=100000, and m=3, you will expect to
throw away 270 perfectly good data points simply because they are out on
the tails of the distribution.
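Both figures are easy to check numerically; a quick sketch using only the standard library:

```python
import math

# The largest possible z-score using the sample SD (ddof=1) is (N-1)/sqrt(N);
# for N=5 that gives 4/sqrt(5), about 1.789, matching the figure above.
print((5 - 1) / math.sqrt(5))

# Expected fraction of normally distributed data beyond 3 SD of the mean:
p = math.erfc(3 / math.sqrt(2))  # two-tailed P(|Z| > 3) for standard normal Z
print(100000 * p)                # about 270 points in a sample of 100000
```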

Worse, if the data is not in fact from a normal distribution, all bets
are off. You may be keeping obvious outliers; or more often, your test
will be throwing away perfectly good data that it misidentifies as
outliers.

In other words: this approach for detecting outliers is nothing more than
a very rough, and very bad, heuristic, and should be avoided.

Identifying outliers is fraught with problems even for experts. For
example, the ozone hole over the Antarctic was ignored for many years
because the software being used to analyse it misidentified the data as
outliers.

The best general advice I have seen is:

Never automatically remove outliers except for values that are physically
impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"),
unless you have good, solid, physical reasons for justifying removal of
outliers. Other than that, manually remove outliers with care, or not at
all, and if you do so, always report your results twice, once with all
the data, and once with supposed outliers removed.

You can read up more about outlier detection, and the difficulties
thereof, here:

http://www.medcalc.org/manual/outliers.php

https://secure.graphpad.com/guides/p...tics/index.htm

http://www.webapps.cee.vt.edu/ewr/en...r/outlier.html

http://stats.stackexchange.com/quest...ard-deviations



--
Steven

Joseph L. Casale 01-07-2013 02:12 AM

RE: Numpy outlier removal
 
> In other words: this approach for detecting outliers is nothing more than

> a very rough, and very bad, heuristic, and should be avoided.


Heh, very true, but the results will only be used for conversational purposes.
I am making an assumption that the data is normally distributed, and I do expect
valid results to all be very nearly the same.

> You can read up more about outlier detection, and the difficulties
> thereof, here:



I much appreciate the links and the thought in the post. I'll admit I didn't
realize outlier detection was so involved.


Again, thanks!
jlc

Paul Simon 01-07-2013 02:21 AM

Re: Numpy outlier removal
 

"Steven D'Aprano" <steve+comp.lang.python@pearwood.info> wrote in message
news:50ea28e7$0$30003$c3e8da3$5496439d@news.astraweb.com...
> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>
>> I have a dataset that consists of a dict with text descriptions and
>> values that are integers. If required, I collect the values into a list
>> and create a numpy array running it through a simple routine:
>>
>> data[abs(data - mean(data)) < m * std(data)]
>>
>> where m is the number of std deviations to include.

>
> I'm not sure that this approach is statistically robust. No, let me be
> even more assertive: I'm sure that this approach is NOT statistically
> robust, and may be scientifically dubious.
>
> The above assumes your data is normally distributed. How sure are you
> that this is actually the case?
>
> For normally distributed data:
>
> Since both the mean and std calculations are affected by the presence of
> outliers, your test for what counts as an outlier will miss outliers for
> data from a normal distribution. For small N (sample size), it may be
> mathematically impossible for any data point to be greater than m*SD from
> the mean. For example, with N=5, no data point can be more than 1.789*SD
> from the mean. So for N=5, m=1 may throw away good data, and m=2 will
> fail to find any outliers no matter how outrageous they are.
>
> For large N, you will expect to find significant numbers of data points
> more than m*SD from the mean. With N=100000, and m=3, you will expect to
> throw away 270 perfectly good data points simply because they are out on
> the tails of the distribution.
>
> Worse, if the data is not in fact from a normal distribution, all bets
> are off. You may be keeping obvious outliers; or more often, your test
> will be throwing away perfectly good data that it misidentifies as
> outliers.
>
> In other words: this approach for detecting outliers is nothing more than
> a very rough, and very bad, heuristic, and should be avoided.
>
> Identifying outliers is fraught with problems even for experts. For
> example, the ozone hole over the Antarctic was ignored for many years
> because the software being used to analyse it misidentified the data as
> outliers.
>
> The best general advice I have seen is:
>
> Never automatically remove outliers except for values that are physically
> impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"),
> unless you have good, solid, physical reasons for justifying removal of
> outliers. Other than that, manually remove outliers with care, or not at
> all, and if you do so, always report your results twice, once with all
> the data, and once with supposed outliers removed.
>
> You can read up more about outlier detection, and the difficulties
> thereof, here:
>
> http://www.medcalc.org/manual/outliers.php
>
> https://secure.graphpad.com/guides/p...tics/index.htm
>
> http://www.webapps.cee.vt.edu/ewr/en...r/outlier.html
>
> http://stats.stackexchange.com/quest...ard-deviations
>
>
>
> --
> Steven

If you suspect that the data may not be normal you might look at exploratory
data analysis, see Tukey. It's descriptive rather than analytic, treats
outliers respectfully, uses median rather than mean, and is very visual.
Whenever I analyzed data both with Gaussian methods and with EDA, EDA always won.
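As a concrete illustration of that style of analysis: Tukey's fences flag points more than k interquartile ranges outside the quartiles, using only order statistics rather than the mean and SD. A minimal sketch (the sample data is made up):

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Return the values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

print(tukey_outliers([10, 11, 9, 10, 12, 10, 11, 100]))  # → [100]
```

Because quartiles barely move when an extreme value is added, the fences stay put where mean/SD thresholds get dragged toward the outlier.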

Paul



Oscar Benjamin 01-07-2013 02:29 AM

Re: Numpy outlier removal
 
On 7 January 2013 01:46, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>
>> I have a dataset that consists of a dict with text descriptions and
>> values that are integers. If required, I collect the values into a list
>> and create a numpy array running it through a simple routine:
>>
>> data[abs(data - mean(data)) < m * std(data)]
>>
>> where m is the number of std deviations to include.

>
> I'm not sure that this approach is statistically robust. No, let me be
> even more assertive: I'm sure that this approach is NOT statistically
> robust, and may be scientifically dubious.


Whether or not this is "statistically robust" requires more
explanation about the OP's intention. Thus far, the OP has not given
any reason/motivation for excluding data or even for having any data
in the first place! It's hard to say whether any technique applied is
really accurate/robust without knowing *anything* about the purpose of
the operation.


Oscar

Steven D'Aprano 01-07-2013 05:11 AM

Re: Numpy outlier removal
 
On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:

> On 7 January 2013 01:46, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>>
>>> I have a dataset that consists of a dict with text descriptions and
>>> values that are integers. If required, I collect the values into a
>>> list and create a numpy array running it through a simple routine:
>>>
>>> data[abs(data - mean(data)) < m * std(data)]
>>>
>>> where m is the number of std deviations to include.

>>
>> I'm not sure that this approach is statistically robust. No, let me be
>> even more assertive: I'm sure that this approach is NOT statistically
>> robust, and may be scientifically dubious.

>
> Whether or not this is "statistically robust" requires more explanation
> about the OP's intention.


Not really. Statistical robustness is objectively defined, and the user's
intention doesn't come into it. The mean is not a robust measure of
central tendency, the median is, regardless of why you pick one or the
other.

There are sometimes good reasons for choosing non-robust statistics or
techniques over robust ones, but some techniques are so dodgy that there
is *never* a good reason for doing so. E.g. finding the line of best fit
by eye, or taking more and more samples until you get a statistically
significant result. Such techniques are not just non-robust in the
statistical sense, but non-robust in the general sense, if not outright
deceitful.



--
Steven

Oscar Benjamin 01-07-2013 03:20 PM

Re: Numpy outlier removal
 
On 7 January 2013 05:11, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:
>
>> On 7 January 2013 01:46, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info> wrote:
>>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>>>
>>> I'm not sure that this approach is statistically robust. No, let me be
>>> even more assertive: I'm sure that this approach is NOT statistically
>>> robust, and may be scientifically dubious.

>>
>> Whether or not this is "statistically robust" requires more explanation
>> about the OP's intention.

>
> Not really. Statistics robustness is objectively defined, and the user's
> intention doesn't come into it. The mean is not a robust measure of
> central tendency, the median is, regardless of why you pick one or the
> other.


Okay, I see what you mean. I wasn't thinking of robustness as a
technical term but now I see that you are correct.

Perhaps what I should have said is that whether or not this matters
depends on the problem at hand (hopefully this isn't an important
medical trial) and the particular type of data that you have; assuming
normality is fine in many cases even if the data is not "really"
normal.

>
> There are sometimes good reasons for choosing non-robust statistics or
> techniques over robust ones, but some techniques are so dodgy that there
> is *never* a good reason for doing so. E.g. finding the line of best fit
> by eye, or taking more and more samples until you get a statistically
> significant result. Such techniques are not just non-robust in the
> statistical sense, but non-robust in the general sense, if not outright
> deceitful.


There are sometimes good reasons to get a line of best fit by eye. In
particular if your data contains clusters that are hard to separate,
sometimes it's useful to just pick out roughly where you think a line
through a subset of the data is.


Oscar

