Numpy outlier removal
I have a dataset that consists of a dict with text descriptions and values that are integers. If required, I collect the values into a list and create a numpy array, running it through a simple routine:

    data[abs(data - mean(data)) < m * std(data)]

where m is the number of standard deviations to include.

The problem is that I lose track of which values were removed, so the original display of the dataset is misleading when the processed average is returned, as it still includes the removed key/values.

Does anyone know how I can maintain the relationship so that when I exclude a value, it is also removed from the dict?

Thanks!
jlc
Re: Numpy outlier removal
On 6/01/13 20:44:08, Joseph L. Casale wrote:
> I have a dataset that consists of a dict with text descriptions and
> values that are integers. If required, I collect the values into a list
> and create a numpy array running it through a simple routine:
> data[abs(data - mean(data)) < m * std(data)] where m is the number of
> std deviations to include.
>
> The problem is I lose track of which were removed so the original
> display of the dataset is misleading when the processed average is
> returned as it includes the removed key/values.
>
> Anyone know how I can maintain the relationship and when I exclude a
> value, remove it from the dict?

Assuming your data and the dictionary are keyed by a common set of keys:

    for key in descriptions:
        if abs(data[key] - mean(data)) >= m * std(data):
            del data[key]
            del descriptions[key]

Hope this helps,

-- HansM
RE: Numpy outlier removal
> Assuming your data and the dictionary are keyed by a common set of keys:
>
> for key in descriptions:
>     if abs(data[key] - mean(data)) >= m * std(data):
>         del data[key]
>         del descriptions[key]

Heh, yeah, sometimes the obvious is too simple to see. I used a dict comp to rebuild the results with the comparison.

Thanks!
jlc
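A minimal sketch of the dict-comprehension approach mentioned above, assuming a single dict that maps descriptions to integer values. The function name `remove_outliers`, the variable names and the sample values are hypothetical; the cutoff is the same `abs(x - mean) < m * std` test from the original post.

```python
import numpy as np

def remove_outliers(dataset, m=2.0):
    """Return a new dict with only the entries within m std deviations of the mean."""
    values = np.array(list(dataset.values()), dtype=float)
    mean, std = values.mean(), values.std()
    # Rebuild the dict instead of deleting from it, so keys and values stay paired.
    return {key: value for key, value in dataset.items()
            if abs(value - mean) < m * std}

# Hypothetical sample data: one value is far from the rest.
dataset = {"a": 10, "b": 11, "c": 9, "d": 10, "e": 12, "f": 10, "g": 11, "h": 500}
cleaned = remove_outliers(dataset, m=2.0)
removed = sorted(set(dataset) - set(cleaned))
```

Because a new dict is built, the original stays intact, and `removed` lists exactly which keys were excluded, so the display and the processed average can be kept consistent.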
Re: Numpy outlier removal
On 2013-01-06 22:33, Hans Mulder wrote:
> On 6/01/13 20:44:08, Joseph L. Casale wrote:
>> I have a dataset that consists of a dict with text descriptions and
>> values that are integers. If required, I collect the values into a list
>> and create a numpy array running it through a simple routine:
>> data[abs(data - mean(data)) < m * std(data)] where m is the number of
>> std deviations to include.
>>
>> The problem is I lose track of which were removed so the original
>> display of the dataset is misleading when the processed average is
>> returned as it includes the removed key/values.
>>
>> Anyone know how I can maintain the relationship and when I exclude a
>> value, remove it from the dict?
>
> Assuming your data and the dictionary are keyed by a common set of keys:
>
> for key in descriptions:
>     if abs(data[key] - mean(data)) >= m * std(data):
>         del data[key]
>         del descriptions[key]
>

It's generally a bad idea to modify a collection over which you're iterating. It's better to, say, make a list of what you're going to delete and then iterate over that list to make the deletions:

    deletions = []
    for key in descriptions:
        if abs(data[key] - mean(data)) >= m * std(data):
            deletions.append(key)

    for key in deletions:
        del data[key]
        del descriptions[key]
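To illustrate the point above: in CPython 3, deleting from a dict while iterating over it raises `RuntimeError`. The sketch below uses hypothetical sample data to show that failure next to the collect-then-delete pattern suggested in the post.

```python
descriptions = {"a": "first", "b": "second", "c": "third"}

# Deleting while iterating over the dict itself fails on the next iteration step.
failed = False
try:
    for key in descriptions:
        if key == "a":
            del descriptions[key]
except RuntimeError:
    failed = True

# Collect the keys first, then delete in a second pass.
descriptions = {"a": "first", "b": "second", "c": "third"}
deletions = [key for key in descriptions if key != "b"]
for key in deletions:
    del descriptions[key]
```

Iterating over a snapshot such as `list(descriptions)` is another common way to get the same effect.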
Re: Numpy outlier removal
On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
> I have a dataset that consists of a dict with text descriptions and
> values that are integers. If required, I collect the values into a list
> and create a numpy array running it through a simple routine:
>
> data[abs(data - mean(data)) < m * std(data)]
>
> where m is the number of std deviations to include.

I'm not sure that this approach is statistically robust. No, let me be even more assertive: I'm sure that this approach is NOT statistically robust, and may be scientifically dubious.

The above assumes your data is normally distributed. How sure are you that this is actually the case?

For normally distributed data:

Since both the mean and std calculations are affected by the presence of outliers, your test for what counts as an outlier will miss outliers for data from a normal distribution. For small N (sample size), it may be mathematically impossible for any data point to be greater than m*SD from the mean. For example, with N=5, no data point can be more than 1.789*SD from the mean. So for N=5, m=1 may throw away good data, and m=2 will fail to find any outliers no matter how outrageous they are.

For large N, you will expect to find significant numbers of data points more than m*SD from the mean. With N=100000, and m=3, you will expect to throw away 270 perfectly good data points simply because they are out on the tails of the distribution.

Worse, if the data is not in fact from a normal distribution, all bets are off. You may be keeping obvious outliers; or more often, your test will be throwing away perfectly good data that it misidentifies as outliers.

In other words: this approach for detecting outliers is nothing more than a very rough, and very bad, heuristic, and should be avoided.

Identifying outliers is fraught with problems even for experts. For example, the ozone hole over the Antarctic was ignored for many years because the software being used to analyse it misidentified the data as outliers.

The best general advice I have seen is:

Never automatically remove outliers except for values that are physically impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"), unless you have good, solid, physical reasons for justifying removal of outliers. Other than that, manually remove outliers with care, or not at all, and if you do so, always report your results twice, once with all the data, and once with supposed outliers removed.

You can read up more about outlier detection, and the difficulties thereof, here:

http://www.medcalc.org/manual/outliers.php

https://secure.graphpad.com/guides/p...tics/index.htm

http://www.webapps.cee.vt.edu/ewr/en...r/outlier.html

http://stats.stackexchange.com/quest...ard-deviations

-- 
Steven
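The two figures in the post above (1.789 for N=5 and 270 for N=100000, m=3) can be checked with a short sketch. It assumes the sample standard deviation (divisor N-1), for which the largest possible deviation is (N-1)/sqrt(N) standard deviations, and an exactly normal distribution for the tail count. The function names are hypothetical. Note that NumPy's `std` defaults to the population form (`ddof=0`), where the corresponding bound is sqrt(N-1), i.e. 2.0 for N=5.

```python
import math
import statistics

def max_z_score(n):
    """Largest possible |x - mean| / SD in a sample of size n (sample SD, n - 1 divisor)."""
    return (n - 1) / math.sqrt(n)

def expected_beyond(n, m):
    """Expected count of points more than m SDs from the mean in n normal draws."""
    return n * (1.0 - math.erf(m / math.sqrt(2.0)))

# One extreme value among five identical ones sits exactly at the bound,
# however large the extreme value is.
sample = [0, 0, 0, 0, 1000]
z = abs(sample[-1] - statistics.mean(sample)) / statistics.stdev(sample)

print(round(max_z_score(5), 3))            # 1.789
print(round(z, 3))                         # 1.789
print(round(expected_beyond(100000, 3)))   # 270
```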
RE: Numpy outlier removal
> In other words: this approach for detecting outliers is nothing more than
> a very rough, and very bad, heuristic, and should be avoided.

Heh, very true, but the results will only be used for conversational purposes. I am making an assumption that the data is normally distributed and I do expect valid results to all be very nearly the same.

> You can read up more about outlier detection, and the difficulties
> thereof, here:

I much appreciate the links and the thought in the post. I'll admit I didn't realize outlier detection was as involved.

Again, thanks!
jlc
Re: Numpy outlier removal
"Steven D'Aprano" <steve+comp.lang.python@pearwood.info> wrote in message news:50ea28e7$0$30003$c3e8da3$5496439d@news.astraweb.com...
> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>
>> I have a dataset that consists of a dict with text descriptions and
>> values that are integers. If required, I collect the values into a list
>> and create a numpy array running it through a simple routine:
>>
>> data[abs(data - mean(data)) < m * std(data)]
>>
>> where m is the number of std deviations to include.
>
> I'm not sure that this approach is statistically robust. No, let me be
> even more assertive: I'm sure that this approach is NOT statistically
> robust, and may be scientifically dubious.
>
> The above assumes your data is normally distributed. How sure are you
> that this is actually the case?
>
> For normally distributed data:
>
> Since both the mean and std calculations are affected by the presence of
> outliers, your test for what counts as an outlier will miss outliers for
> data from a normal distribution. For small N (sample size), it may be
> mathematically impossible for any data point to be greater than m*SD from
> the mean. For example, with N=5, no data point can be more than 1.789*SD
> from the mean. So for N=5, m=1 may throw away good data, and m=2 will
> fail to find any outliers no matter how outrageous they are.
>
> For large N, you will expect to find significant numbers of data points
> more than m*SD from the mean. With N=100000, and m=3, you will expect to
> throw away 270 perfectly good data points simply because they are out on
> the tails of the distribution.
>
> Worse, if the data is not in fact from a normal distribution, all bets
> are off. You may be keeping obvious outliers; or more often, your test
> will be throwing away perfectly good data that it misidentifies as
> outliers.
>
> In other words: this approach for detecting outliers is nothing more than
> a very rough, and very bad, heuristic, and should be avoided.
>
> Identifying outliers is fraught with problems even for experts. For
> example, the ozone hole over the Antarctic was ignored for many years
> because the software being used to analyse it misidentified the data as
> outliers.
>
> The best general advice I have seen is:
>
> Never automatically remove outliers except for values that are physically
> impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"),
> unless you have good, solid, physical reasons for justifying removal of
> outliers. Other than that, manually remove outliers with care, or not at
> all, and if you do so, always report your results twice, once with all
> the data, and once with supposed outliers removed.
>
> You can read up more about outlier detection, and the difficulties
> thereof, here:
>
> http://www.medcalc.org/manual/outliers.php
>
> https://secure.graphpad.com/guides/p...tics/index.htm
>
> http://www.webapps.cee.vt.edu/ewr/en...r/outlier.html
>
> http://stats.stackexchange.com/quest...ard-deviations
>
> --
> Steven

If you suspect that the data may not be normal, you might look at exploratory data analysis (EDA); see Tukey. It's descriptive rather than analytic, treats outliers respectfully, uses the median rather than the mean, and is very visual. Wherever I analyzed data both with Gaussian methods and with EDA, EDA always won.

Paul
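One quartile-based rule associated with Tukey's exploratory data analysis is the box-plot fence, which flags values lying more than 1.5 interquartile ranges outside the quartiles. The sketch below is an assumption about the kind of technique meant above, not something the post spells out. The function name `tukey_fences` and the sample data are hypothetical, and `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample data keyed by description.
dataset = {"a": 10, "b": 11, "c": 9, "d": 10, "e": 12, "f": 10, "g": 11, "h": 500}
low, high = tukey_fences(list(dataset.values()))

# Flag suspicious entries for inspection instead of deleting them silently.
flagged = {key: value for key, value in dataset.items()
           if not low <= value <= high}
```

Flagging rather than deleting keeps the full dataset available, in line with the advice earlier in the thread to report results both with and without the suspected outliers.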
Re: Numpy outlier removal
On 7 January 2013 01:46, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>
>> I have a dataset that consists of a dict with text descriptions and
>> values that are integers. If required, I collect the values into a list
>> and create a numpy array running it through a simple routine:
>>
>> data[abs(data - mean(data)) < m * std(data)]
>>
>> where m is the number of std deviations to include.
>
> I'm not sure that this approach is statistically robust. No, let me be
> even more assertive: I'm sure that this approach is NOT statistically
> robust, and may be scientifically dubious.

Whether or not this is "statistically robust" requires more explanation about the OP's intention. Thus far, the OP has not given any reason/motivation for excluding data or even for having any data in the first place! It's hard to say whether any technique applied is really accurate/robust without knowing *anything* about the purpose of the operation.

Oscar
Re: Numpy outlier removal
On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:
> On 7 January 2013 01:46, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>>
>>> I have a dataset that consists of a dict with text descriptions and
>>> values that are integers. If required, I collect the values into a
>>> list and create a numpy array running it through a simple routine:
>>>
>>> data[abs(data - mean(data)) < m * std(data)]
>>>
>>> where m is the number of std deviations to include.
>>
>> I'm not sure that this approach is statistically robust. No, let me be
>> even more assertive: I'm sure that this approach is NOT statistically
>> robust, and may be scientifically dubious.
>
> Whether or not this is "statistically robust" requires more explanation
> about the OP's intention.

Not really. Statistical robustness is objectively defined, and the user's intention doesn't come into it. The mean is not a robust measure of central tendency, the median is, regardless of why you pick one or the other.

There are sometimes good reasons for choosing non-robust statistics or techniques over robust ones, but some techniques are so dodgy that there is *never* a good reason for doing so. E.g. finding the line of best fit by eye, or taking more and more samples until you get a statistically significant result. Such techniques are not just non-robust in the statistical sense, but non-robust in the general sense, if not outright deceitful.

-- 
Steven
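A small sketch of what "robust" means here, using hypothetical sample values: adding one wild value drags the mean a long way, while the median barely moves.

```python
import statistics

clean = [9, 10, 10, 10, 11, 11, 12]
contaminated = clean + [500]  # one wild value added

# How far each measure of central tendency moves because of the single outlier.
mean_shift = statistics.mean(contaminated) - statistics.mean(clean)
median_shift = statistics.median(contaminated) - statistics.median(clean)

print(f"mean moved by {mean_shift:.2f}, median moved by {median_shift:.2f}")
```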
Re: Numpy outlier removal
On 7 January 2013 05:11, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:
>
>> On 7 January 2013 01:46, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info> wrote:
>>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>>>
>>> I'm not sure that this approach is statistically robust. No, let me be
>>> even more assertive: I'm sure that this approach is NOT statistically
>>> robust, and may be scientifically dubious.
>>
>> Whether or not this is "statistically robust" requires more explanation
>> about the OP's intention.
>
> Not really. Statistics robustness is objectively defined, and the user's
> intention doesn't come into it. The mean is not a robust measure of
> central tendency, the median is, regardless of why you pick one or the
> other.

Okay, I see what you mean. I wasn't thinking of robustness as a technical term, but now I see that you are correct.

Perhaps what I should have said is that whether or not this matters depends on the problem at hand (hopefully this isn't an important medical trial) and the particular type of data that you have; assuming normality is fine in many cases even if the data is not "really" normal.

> There are sometimes good reasons for choosing non-robust statistics or
> techniques over robust ones, but some techniques are so dodgy that there
> is *never* a good reason for doing so. E.g. finding the line of best fit
> by eye, or taking more and more samples until you get a statistically
> significant result. Such techniques are not just non-robust in the
> statistical sense, but non-robust in the general sense, if not outright
> deceitful.

There are sometimes good reasons to get a line of best fit by eye. In particular, if your data contains clusters that are hard to separate, sometimes it's useful to just pick out roughly where you think a line through a subset of the data is.

Oscar