
Velocity Reviews > Newsgroups > Programming > Python > Numpy outlier removal


Numpy outlier removal

 
 
Robert Kern
      01-07-2013
On 07/01/2013 15:20, Oscar Benjamin wrote:
> On 7 January 2013 05:11, Steven D'Aprano
> <(E-Mail Removed)> wrote:
>> On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:
>>
>>> On 7 January 2013 01:46, Steven D'Aprano
>>> <(E-Mail Removed)> wrote:
>>>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
>>>>
>>>> I'm not sure that this approach is statistically robust. No, let me be
>>>> even more assertive: I'm sure that this approach is NOT statistically
>>>> robust, and may be scientifically dubious.
>>>
>>> Whether or not this is "statistically robust" requires more explanation
>>> about the OP's intention.

>>
>> Not really. Statistical robustness is objectively defined, and the user's
>> intention doesn't come into it. The mean is not a robust measure of
>> central tendency, the median is, regardless of why you pick one or the
>> other.

>
> Okay, I see what you mean. I wasn't thinking of robustness as a
> technical term but now I see that you are correct.
>
> Perhaps what I should have said is that whether or not this matters
> depends on the problem at hand (hopefully this isn't an important
> medical trial) and the particular type of data that you have; assuming
> normality is fine in many cases even if the data is not "really"
> normal.


"Having outliers" literally means that assuming normality is not fine. If
assuming normality were fine, then you wouldn't need to remove outliers.
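A minimal numpy sketch (invented numbers, not the OP's data) shows what "robust" buys you here: a single outlier drags the mean but barely moves the median.

```python
import numpy as np

# Ten well-behaved values near 10, plus one gross outlier.
data = np.array([9.8, 10.1, 9.9, 10.2, 10.0,
                 9.7, 10.3, 10.1, 9.9, 10.0, 55.0])

# The mean is dragged far from the bulk of the data by the one
# outlier; the median barely notices it.
print(np.mean(data))    # ~14.1
print(np.median(data))  # 10.0
```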

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

 
Steven D'Aprano
      01-07-2013
On Mon, 07 Jan 2013 15:20:57 +0000, Oscar Benjamin wrote:

> There are sometimes good reasons to get a line of best fit by eye. In
> particular if your data contains clusters that are hard to separate,
> sometimes it's useful to just pick out roughly where you think a line
> through a subset of the data is.


Cherry picking subsets of your data as well as line fitting by eye? Two
wrongs do not make a right.

If you're going to just invent a line based on where you think it should
be, what do you need the data for? Just declare "this is the line I wish
to believe in" and save yourself the time and energy of collecting the
data in the first place. Your conclusion will be no less valid.

How do you distinguish "data contains clusters that are hard to
separate" from "data doesn't fit a line at all"?

Even if the data actually is linear, on what basis could we distinguish
between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit
by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely
subjective judgement can be equally denied on the basis of subjective
judgement.

Anyone can fool themselves into placing a line through a subset of non-
linear data. Or, sadly more often, *deliberately* cherry picking fake
clusters in order to fool others. Here is a real world example of what
happens when people pick out the data clusters that they like based on
visual inspection:

http://www.skepticalscience.com/imag...pEscalator.gif

And not linear by any means, but related to the cherry picking theme:

http://www.skepticalscience.com/pics...alator2012.gif


To put it another way, when we fit patterns to data by eye, we can easily
fool ourselves into seeing patterns that aren't there, or missing the
patterns which are there. At best line fitting by eye is prone to honest
errors; at worst, it is open to the most deliberate abuse. We have eyes
and brains that evolved to spot the ripe fruit in trees, not to spot
linear trends in noisy data, and fitting by eye is not safe or
appropriate.
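For what it's worth, the disagreement between your eyeballed line and mine is exactly what an objective criterion settles. A sketch with synthetic data (the slopes and intercepts are just my example numbers from above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y = 3x + 4 plus noise.
x = np.linspace(0, 10, 50)
y = 3 * x + 4 + rng.normal(0, 1, x.size)

def sse(slope, intercept):
    # Sum of squared vertical residuals for a candidate line.
    return np.sum((y - (slope * x + intercept)) ** 2)

# Least squares minimizes this criterion by construction, so it beats
# (or at worst ties) any line either of us fits by eye.
slope_ls, intercept_ls = np.polyfit(x, y, 1)
print(sse(2.5, 3.7), sse(3.1, 4.1), sse(slope_ls, intercept_ls))
```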


--
Steven
 
Chris Angelico
      01-07-2013
On Tue, Jan 8, 2013 at 4:58 AM, Steven D'Aprano
<(E-Mail Removed)> wrote:
> Anyone can fool themselves into placing a line through a subset of non-
> linear data. Or, sadly more often, *deliberately* cherry picking fake
> clusters in order to fool others. Here is a real world example of what
> happens when people pick out the data clusters that they like based on
> visual inspection:
>
> http://www.skepticalscience.com/imag...pEscalator.gif


And sensible people will notice that, even drawn like that, it's only
a ~0.6 deg increase across ~30 years. Hardly statistically
significant, given that weather patterns have been known to follow
cycles at least that long. But that's nothing to do with drawing lines
through points, and more to do with how much data you collect before
you announce a conclusion, and how easily a graph can prove any point
you like.

Statistical analysis is a huge science. So is lying. And I'm not sure
most people can pick one from the other.

ChrisA
 
Oscar Benjamin
      01-07-2013
On 7 January 2013 17:58, Steven D'Aprano
<(E-Mail Removed)> wrote:
> On Mon, 07 Jan 2013 15:20:57 +0000, Oscar Benjamin wrote:
>
>> There are sometimes good reasons to get a line of best fit by eye. In
>> particular if your data contains clusters that are hard to separate,
>> sometimes it's useful to just pick out roughly where you think a line
>> through a subset of the data is.

>
> Cherry picking subsets of your data as well as line fitting by eye? Two
> wrongs do not make a right.


It depends on what you're doing, though. I wouldn't use an eyeball fit
to get numbers that were an important part of the conclusion of some
or other study. I would very often use it while I'm just in the
process of trying to understand something.

> If you're going to just invent a line based on where you think it should
> be, what do you need the data for? Just declare "this is the line I wish
> to believe in" and save yourself the time and energy of collecting the
> data in the first place. Your conclusion will be no less valid.


An example: Earlier today I was looking at some experimental data. A
simple model of the process underlying the experiment suggests that
two variables x and y will vary in direct proportion to one another
and the data broadly reflects this. However, at this stage there is
some non-normal variability in the data, caused by experimental
difficulties. A subset of the data appears to closely follow a well
defined linear pattern but there are outliers and the pattern breaks
down in an asymmetric way at larger x and y values. At some later time
either the sources of experimental variation will be reduced, or they
will be better understood but for now it is still useful to estimate
the constant of proportionality in order to check whether it seems
consistent with the observed values of z. With this particular dataset
I would have wasted a lot of time if I had tried to find a
> computational method to match the line that was very visible to me, so
I chose the line visually.

>
> How do you distinguish "data contains clusters that are hard to
> separate" from "data doesn't fit a line at all"?
>


In the example I gave it isn't possible to make that distinction with
the currently available data. That doesn't make it meaningless to try
and estimate the parameters of the relationship between the variables
using the preliminary data.

> Even if the data actually is linear, on what basis could we distinguish
> between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit
> by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely
> subjective judgement can be equally denied on the basis of subjective
> judgement.


It gets a bit easier if the line is constrained to go through the
origin. You seem to be thinking that the important thing is proving
that the line is "real", rather than identifying where it is. Both
things are important but not necessarily in the same problem. In my
example, the "real line" may not be straight and may not go through
the origin, but it is definitely there and if there were no
experimental problems then the data would all be very close to it.
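For a line constrained through the origin the least-squares slope has a closed form anyway, which makes the outlier's influence easy to see (made-up numbers, not my actual data):

```python
import numpy as np

# Made-up data for y ~ k*x, with one obvious outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 30.0, 11.8])  # the 30.0 is the outlier

# Minimizing sum((y - k*x)**2) gives k = sum(x*y) / sum(x*x).
k_all = np.dot(x, y) / np.dot(x, x)

# Dropping the visually obvious outlier changes the estimate a lot,
# which is why I'd rather eyeball it at this preliminary stage.
keep = y < 20
k_clean = np.dot(x[keep], y[keep]) / np.dot(x[keep], x[keep])
print(k_all, k_clean)   # roughly 3.1 versus 2.0
```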

> Anyone can fool themselves into placing a line through a subset of non-
> linear data. Or, sadly more often, *deliberately* cherry picking fake
> clusters in order to fool others. Here is a real world example of what
> happens when people pick out the data clusters that they like based on
> visual inspection:
>
> http://www.skepticalscience.com/imag...pEscalator.gif
>
> And not linear by any means, but related to the cherry picking theme:
>
> http://www.skepticalscience.com/pics...alator2012.gif
>
>
> To put it another way, when we fit patterns to data by eye, we can easily
> fool ourselves into seeing patterns that aren't there, or missing the
> patterns which are there. At best line fitting by eye is prone to honest
> errors; at worst, it is open to the most deliberate abuse. We have eyes
> and brains that evolved to spot the ripe fruit in trees, not to spot
> linear trends in noisy data, and fitting by eye is not safe or
> appropriate.


This is all true. But the human brain is also in many ways much better
than a typical computer program at recognising patterns in data when
the data can be depicted visually. I would very rarely attempt to
analyse data without representing it in some visual form. I also think
it would be highly foolish to go so far with refusing to eyeball data
that you would accept the output of some regression algorithm even
when it clearly looks wrong.


Oscar
 
Steven D'Aprano
      01-08-2013
On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote:

> An example: Earlier today I was looking at some experimental data. A
> simple model of the process underlying the experiment suggests that two
> variables x and y will vary in direct proportion to one another and the
> data broadly reflects this. However, at this stage there is some
> non-normal variability in the data, caused by experimental difficulties.
> A subset of the data appears to closely follow a well defined linear
> pattern but there are outliers and the pattern breaks down in an
> asymmetric way at larger x and y values. At some later time either the
> sources of experimental variation will be reduced, or they will be
> better understood but for now it is still useful to estimate the
> constant of proportionality in order to check whether it seems
> consistent with the observed values of z. With this particular dataset I
> would have wasted a lot of time if I had tried to find a computational
>> method to match the line that was very visible to me, so I chose the line
> visually.



If you mean:

"I looked at the data, identified that the range a < x < b looks linear
and the range x > b does not, then applied least squares (or some other
recognised, objective technique for fitting a line) to the data in that
linear range"

then I'm completely cool with that. That's fine, with the understanding
that this is the first step in either fixing your measurement problems,
fixing your model, or at least avoiding extrapolation into the non-linear
range.
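That two-step procedure is only a few lines of numpy (hypothetical data and cutoff):

```python
import numpy as np

# Hypothetical data: linear below x = 6, breaking down above it.
x = np.linspace(0, 10, 30)
y = np.where(x < 6, 2.0 * x + 1.0, 0.5 * x ** 2)

# Step 1: the eyeball step -- identify the linear range.
linear = x < 6

# Step 2: the objective step -- least squares on that range only.
slope, intercept = np.polyfit(x[linear], y[linear], 1)
print(slope, intercept)   # recovers 2.0 and 1.0
```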

But that is not fitting a line by eye, which is what I am talking about.

If on the other hand you mean:

"I looked at the data, identified that the range a < x < b looked linear,
so I laid a ruler down over the graph and pushed it around until I was
satisfied that the ruler looked more or less like it fitted the data
points, according to my guess of what counts as a close fit"

that *is* fitting a line by eye, and it is entirely subjective and
extremely dodgy for anything beyond quick and dirty back of the envelope
calculations[1]. That's okay if all you want is to get something within
an order of magnitude or so, or a line roughly pointing in the right
direction, but that's all.


[...]
> I also think it would
> be highly foolish to go so far with refusing to eyeball data that you
> would accept the output of some regression algorithm even when it
> clearly looks wrong.


I never said anything of the sort.

I said, don't fit lines to data by eye. I didn't say not to sanity-check
that your straight-line fit is reasonable by eyeballing it.



[1] Or if your data is so accurate and noise-free that you hardly have to
care about errors, since there clearly is one and only one straight line
that passes through all the points.


--
Steven
 
Steven D'Aprano
      01-08-2013
On Tue, 08 Jan 2013 06:43:46 +1100, Chris Angelico wrote:

> On Tue, Jan 8, 2013 at 4:58 AM, Steven D'Aprano
> <(E-Mail Removed)> wrote:
>> Anyone can fool themselves into placing a line through a subset of non-
>> linear data. Or, sadly more often, *deliberately* cherry picking fake
>> clusters in order to fool others. Here is a real world example of what
>> happens when people pick out the data clusters that they like based on
>> visual inspection:
>>
>> http://www.skepticalscience.com/imag...pEscalator.gif

>
> And sensible people will notice that, even drawn like that, it's only a
> ~0.6 deg increase across ~30 years. Hardly statistically significant,


Well, I don't know about "sensible people", but magnitude of an effect
has little to do with whether or not something is statistically
significant. Given noisy data, statistical significance relates to
whether or not we can be confident that the effect is *real*, not whether
it is a big effect or a small effect.

Here's an example: assume that you are on a fixed salary with a constant
weekly income. If you happen to win the lottery one day, and consequently
your income for that week quadruples, that is a large effect that fails
to have any statistical significance -- it's a blip, not part of any long-
term change in income. You can't conclude that you'll win the lottery
every week from now on.

On the other hand, if the government changes the rules relating to tax,
deductions, etc., even by a small amount, your weekly income might go
down, or up, by a single dollar. Even though that is a tiny effect, it is
*not* a blip, and will be statistically significant. In practice, it
takes a certain number of data points to reach that confidence level.
Your accountant, who knows the tax laws, will conclude that the change is
real immediately, but a statistician who sees only the pay slips may take
some months before she is convinced that the change is signal rather than
noise. With only three weeks pay slips in hand, the statistician cannot
be sure that the difference is not just some accounting error or other
fluke, but each additional data point increases the confidence that the
difference is real and not just some temporary aberration.
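The statistician's growing confidence can be sketched with a back-of-envelope z-score (invented figures: week-to-week noise of $10 around the true pay level, a true $1 change):

```python
import numpy as np

sigma = 10.0  # invented week-to-week noise in the pay slips
shift = 1.0   # the true $1 change in weekly pay

# z = shift / (sigma / sqrt(n)) grows like sqrt(n): a tiny but real
# effect becomes statistically significant once n is large enough.
for n in (3, 100, 1000):
    z = shift / (sigma / np.sqrt(n))
    print(n, round(z, 2))   # 0.17, then 1.0, then 3.16
```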

The other meaning of "significant" has nothing to do with statistics, and
everything to do with "a difference is only a difference if it makes a
difference". 0.2° per decade doesn't sound like much, not when we
consider daily or yearly temperatures that typically have a range of tens
of degrees between night and day, or winter and summer. But that is
misunderstanding the nature of long-term climate versus daily weather and
glossing over the fact that we're only talking about an average and
ignoring changes to the variability of the climate: a small increase in
average can lead to a large increase in extreme events.


> given that weather patterns have been known to follow cycles at least
> that long.


That is not a given. "Weather patterns" don't last for thirty years.
Perhaps you are talking about climate patterns? In which case, well, yes,
we can see a very strong climate pattern of warming on a time scale of
decades, with no evidence that it is a cycle.

There are, of course, many climate cycles that take place on a time frame
of years or decades, such as the North Atlantic Oscillation and the El
Nino Southern Oscillation. None of them are global, and as far as I know
none of them are exactly periodic. They are noise in the system, and
certainly not responsible for linear trends.



--
Steven
 
Chris Angelico
      01-08-2013
On Tue, Jan 8, 2013 at 1:06 PM, Steven D'Aprano
<(E-Mail Removed)> wrote:
>> given that weather patterns have been known to follow cycles at least
>> that long.

>
> That is not a given. "Weather patterns" don't last for thirty years.
> Perhaps you are talking about climate patterns?


Yes, that's what I meant. In any case, debate about global warming is
quite tangential to the point about statistical validity; it looks
quite significant to show a line going from the bottom of the graph to
the top, but sounds a lot less noteworthy when you see it as a
half-degree increase on about (I think?) 30 degrees, and even less
when you measure temperatures in absolute scale (Kelvin) and it's half
a degree in three hundred. Those are principles worth considering,
regardless of the subject matter. If your railway tracks have widened
by a full eight millimeters due to increased pounding from heavier
vehicles travelling over them, that's significant and dangerous on
HO-scale model trains, but utterly insignificant on 5'3" gauge.

ChrisA
 
Terry Reedy
      01-08-2013
On 1/7/2013 8:23 PM, Steven D'Aprano wrote:
> On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote:
>
>> An example: Earlier today I was looking at some experimental data. A
>> simple model of the process underlying the experiment suggests that two
>> variables x and y will vary in direct proportion to one another and the
>> data broadly reflects this. However, at this stage there is some
>> non-normal variability in the data, caused by experimental difficulties.
>> A subset of the data appears to closely follow a well defined linear
>> pattern but there are outliers and the pattern breaks down in an
>> asymmetric way at larger x and y values. At some later time either the
>> sources of experimental variation will be reduced, or they will be
>> better understood but for now it is still useful to estimate the
>> constant of proportionality in order to check whether it seems
>> consistent with the observed values of z. With this particular dataset I
>> would have wasted a lot of time if I had tried to find a computational
>> method to match the line that was very visible to me, so I chose the line
>> visually.

>
>
> If you mean:
>
> "I looked at the data, identified that the range a < x < b looks linear
> and the range x > b does not, then applied least squares (or some other
> recognised, objective technique for fitting a line) to the data in that
> linear range"
>
> then I'm completely cool with that.


If both x and y are measured values, then regressing x on y and y on x
will give different answers and both will be wrong in that *neither*
will be the best answer for the relationship between them. Oscar did not
specify whether either was an experimentally set input variable.

> But that is not fitting a line by eye, which is what I am talking about.


With the line constrained to go through 0,0, a line eyeballed with a
clear ruler could easily be better than either regression line, as a
human will tend to minimize the deviations *perpendicular to the line*,
which is the proper thing to do (assuming both variables are measured in
the same units).
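The perpendicular-distance criterion is orthogonal (total) least squares; one standard way to compute it is from the leading principal direction of the centred data. A sketch with synthetic data, noisy in *both* x and y:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: true relation y = 2x, with noise in both variables.
t = np.linspace(0, 10, 200)
x = t + rng.normal(0, 0.5, t.size)
y = 2 * t + rng.normal(0, 0.5, t.size)

# Ordinary regression of y on x minimizes vertical distances only,
# and noise in x biases its slope toward zero.
slope_ols = np.polyfit(x, y, 1)[0]

# Orthogonal fit: minimize perpendicular distances by taking the
# leading right singular vector of the centred data.
pts = np.column_stack([x - x.mean(), y - y.mean()])
_, _, vt = np.linalg.svd(pts, full_matrices=False)
slope_tls = vt[0, 1] / vt[0, 0]
print(slope_ols, slope_tls)   # both near 2
```

As Terry notes, minimizing perpendicular distance is only meaningful when both variables are in the same units; otherwise the answer depends on the units chosen.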

--
Terry Jan Reedy

 
Oscar Benjamin
      01-08-2013
On 8 January 2013 01:23, Steven D'Aprano
<(E-Mail Removed)> wrote:
> On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote:
>
> [...]
>> I also think it would
>> be highly foolish to go so far with refusing to eyeball data that you
>> would accept the output of some regression algorithm even when it
>> clearly looks wrong.

>
> I never said anything of the sort.
>
> I said, don't fit lines to data by eye. I didn't say not to sanity-check
> that your straight-line fit is reasonable by eyeballing it.


I should have been a little clearer. That was the situation when I
decided to just use a (digital) ruler - although really it was more of
a visual bisection (1, 2, 1.5, 1.25...). The regression result was
clearly wrong (and also invalid for the reasons Terry has described).
Some of the problems were easily fixable and others were not. I could
have spent an hour getting the code to make the line go where I wanted
it to, or I could just fit the line visually in about 2 minutes.


Oscar
 
Robert Kern
      01-08-2013
On 08/01/2013 06:35, Chris Angelico wrote:
> On Tue, Jan 8, 2013 at 1:06 PM, Steven D'Aprano
> <(E-Mail Removed)> wrote:
>>> given that weather patterns have been known to follow cycles at least
>>> that long.

>>
>> That is not a given. "Weather patterns" don't last for thirty years.
>> Perhaps you are talking about climate patterns?

>
> Yes, that's what I meant. In any case, debate about global warming is
> quite tangential to the point about statistical validity; it looks
> quite significant to show a line going from the bottom of the graph to
> the top, but sounds a lot less noteworthy when you see it as a
> half-degree increase on about (I think?) 30 degrees, and even less
> when you measure temperatures in absolute scale (Kelvin) and it's half
> a degree in three hundred.


Why on Earth do you think that the distance from nominal surface temperatures to
freezing, much less to absolute 0, is the right scale to compare global warming
changes against? You need to compare against the size of global mean temperature
changes that would cause large amounts of human suffering, and that scale is on
the order of a *few* degrees, not hundreds. A change of half a degree over a few
decades with no signs of slowing down *should* be alarming.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

 
