# Programmatically finding "significant" data points

erikcw
Guest
Posts: n/a

 11-14-2006
Hi all,

I have a collection of ordered numerical data in a list. The numbers
when plotted on a line chart make a low-high-low-high-high-low (random)
pattern. I need an algorithm to extract the "significant" high and low
points from this data.

Here is some sample data:
data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
0.10]

In this data, some of the significant points include:
data[0]
data[2]
data[4]
data[6]
data[8]
data[9]
data[13]
data[14]
.....

How do I sort through this data and pull out these points of
significance?

Erik

Jeremy Sanders
Guest
Posts: n/a

 11-14-2006
erikcw wrote:

> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>

....
>
> How do I sort through this data and pull out these points of
> significance?

Get a book on statistics. One idea is as follows. If you expect the points
to be centred around a single value, you can calculate the median or mean
of the points, calculate their standard deviation (aka spread), and remove
points which are more than N-times the standard deviation from the median.
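
A minimal sketch of the idea above, using only the standard library. It measures distance from the mean rather than the median, and the name `split_by_spread` and the default `n_sigma=1.5` are illustrative assumptions, not something from the post:

```python
from statistics import mean, pstdev

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def split_by_spread(values, n_sigma=1.5):
    """Split indices into (kept, removed) by distance from the mean.

    Hypothetical helper: a point is removed when it lies more than
    n_sigma standard deviations away from the mean of all points.
    """
    centre = mean(values)
    spread = pstdev(values)  # population standard deviation
    kept, removed = [], []
    for i, v in enumerate(values):
        if abs(v - centre) > n_sigma * spread:
            removed.append(i)
        else:
            kept.append(i)
    return kept, removed

kept, removed = split_by_spread(data)
print("far from the centre:", removed)
```

Note that this flags unusually low or high values overall, which is not the same thing as flagging the places where the curve changes direction.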

Jeremy

--
Jeremy Sanders
http://www.jeremysanders.net/

Fredrik Lundh
Guest
Posts: n/a

 11-14-2006
"erikcw" wrote:

> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>
> Here is some sample data:
> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
> 0.10]

silly solution:

for i in range(1, len(data)-1):
    if data[i-1] < data[i] > data[i+1] or data[i-1] > data[i] < data[i+1]:
        print(i)

(the above doesn't handle the "edges", but that's easy to fix)
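
One possible way to handle the edges, assuming the first and last samples should always count as turning points (the post does not say how it would treat them, so that is an assumption). `turning_points` is a hypothetical helper name:

```python
data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def turning_points(values):
    """Return indices of local highs and lows, including both ends."""
    if len(values) < 2:
        return list(range(len(values)))
    points = [0]  # assumption: the first sample is always reported
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if prev < cur > nxt or prev > cur < nxt:
            points.append(i)
    points.append(len(values) - 1)  # assumption: so is the last sample
    return points

print(turning_points(data))
```

On the sample data this also reports index 10 (a small dip from 1.20 to 1.10), which the list of significant points in the question skips, so some minimum swing size may be needed on top of this.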

</F>

Philipp Pagel
Guest
Posts: n/a

 11-14-2006
erikcw wrote:
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.

I am not sure what you mean by 'ordered' in this context. As
pointed out by Jeremy, you need to find an appropriate statistical test.
The appropriateness depends on how your data is (presumably) distributed
and what exactly you are trying to test. E.g. do the data points come from
different groups of some kind? Or are you just looking for extreme
values (outliers maybe?)?

So it's more of a statistical question than a Python one.

cu
Philipp

--
Dr. Philipp Pagel Tel. +49-8161-71 2131
Dept. of Genome Oriented Bioinformatics Fax. +49-8161-71 2186
Technical University of Munich
http://mips.gsf.de/staff/pagel

Peter Otten
Guest
Posts: n/a

 11-14-2006
erikcw wrote:

> Hi all,
>
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>
> Here is some sample data:
> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
> 0.10]
>
> In this data, some of the significant points include:
> data[0]
> data[2]
> data[4]
> data[6]
> data[8]
> data[9]
> data[13]
> data[14]
> ....
>
> How do I sort through this data and pull out these points of
> significance?

I think you are looking for "extrema":

def w3(items):
    # yield a sliding window of three consecutive items
    items = iter(items)
    view = None, next(items), next(items)
    for item in items:
        view = view[1:] + (item,)
        yield view

for i, (a, b, c) in enumerate(w3(data)):
    # b is the middle item of the window, at index i+1 in data
    if a > b < c:
        print(i+1, "min", b)
    elif a < b > c:
        print(i+1, "max", b)
    else:
        print(i+1, "---", b)

Peter

Alan J. Salmoni
Guest
Posts: n/a

 11-14-2006
If the order doesn't matter, you can sort the data and remove x * 0.5 *
n where x is the proportion of numbers you want. If you have too many
similar values though, this falls down. I suggest you check out
quantiles in a good statistics book.
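
A rough sketch of one reading of the suggestion above: sort the values and take x * 0.5 * n of them from each end as the extreme ones. That reading, the name `tails`, and the default `proportion=0.2` are assumptions made for illustration:

```python
data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def tails(values, proportion=0.2):
    """Return (low_tail, high_tail) taken from the sorted values.

    Each tail holds int(proportion * 0.5 * n) values, so together
    they cover roughly `proportion` of the data.
    """
    ordered = sorted(values)
    k = int(proportion * 0.5 * len(ordered))  # values taken from each end
    if k == 0:
        return [], []
    return ordered[:k], ordered[-k:]

low, high = tails(data)
print("lowest:", low)
print("highest:", high)
```

As the post warns, ties make the cut-off arbitrary: if several values equal the boundary value, only some of them land in a tail.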

Alan.

Ganesan Rajagopal
Guest
Posts: n/a

 11-14-2006
>>>>> Jeremy Sanders writes:

>> How do I sort through this data and pull out these points of
>> significance?

> Get a book on statistics. One idea is as follows. If you expect the points
> to be centred around a single value, you can calculate the median or mean
> of the points, calculate their standard deviation (aka spread), and remove
> points which are more than N-times the standard deviation from the median.

Standard deviation was the first thought that jumped to my mind
too. However, that's not what the OP is after. He seems to be looking for
points where the direction changes.

Ganesan

--
Ganesan Rajagopal

Roberto Bonvallet
Guest
Posts: n/a

 11-14-2006
erikcw wrote:
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.

In calculus, you identify high and low points by looking where the
derivative changes its sign. When working with discrete samples, you can
look at the sign changes in finite differences:

>>> data = [...]
>>> diff = [data[i + 1] - data[i] for i in range(len(data) - 1)]
>>> [str(round(d, 2)) for d in diff]

['0.4', '0.1', '-0.2', '-0.01', '0.11', '0.5', '-0.2', '-0.2', '0.6',
'-0.1', '0.2', '0.1', '0.1', '-0.45', '0.15', '-0.3', '-0.2', '0.1',
'-0.4', '0.05', '-0.1', '-0.25']

The high points are those where diff changes from + to -, and the low
points are those where diff changes from - to +.
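
The sign-change rule can be sketched as follows. `classify_turns` is a hypothetical name, and the sketch assumes no two neighbouring samples are equal (a zero difference is simply skipped, so plateaus are not detected):

```python
data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def classify_turns(values):
    """Return (highs, lows) as lists of indices into values."""
    # diff[i] is the step from values[i] to values[i + 1]
    diff = [b - a for a, b in zip(values, values[1:])]
    highs, lows = [], []
    for i in range(1, len(diff)):
        if diff[i - 1] > 0 and diff[i] < 0:
            highs.append(i)  # rising into values[i], falling after it
        elif diff[i - 1] < 0 and diff[i] > 0:
            lows.append(i)   # falling into values[i], rising after it
    return highs, lows

highs, lows = classify_turns(data)
print("highs:", highs)
print("lows:", lows)
```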

HTH,
--
Roberto Bonvallet

Roy Smith
Guest
Posts: n/a

 11-14-2006
"erikcw" wrote:
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.

I think you want a control chart. A good place to start might be
http://en.wikipedia.org/wiki/Control_chart. Even if you don't actually
graph the data, understanding the math behind control charts might help you.

Wow. I think this is the first time I've actually used something I learned
by sitting through those stupid Six Sigma training classes.
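
For what it's worth, the math behind one common kind of control chart (an individuals chart with moving ranges) can be sketched like this. The choice of that chart type and the name `control_limits` are assumptions for illustration; the post does not say which chart to use:

```python
from statistics import mean

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def control_limits(values):
    """Return (lower, centre, upper) limits for an individuals chart."""
    centre = mean(values)
    # moving range: absolute change between consecutive samples
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = mean(moving_ranges)
    # 2.66 is the conventional constant 3 / 1.128 for ranges of two samples
    half_width = 2.66 * mr_bar
    return centre - half_width, centre, centre + half_width

lower, centre, upper = control_limits(data)
out_of_control = [i for i, v in enumerate(data) if not lower <= v <= upper]
print("outside the limits:", out_of_control)
```

Like the standard deviation approach, this picks out unusually low or high samples rather than every change of direction.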

Beliavsky
Guest
Posts: n/a

 11-14-2006

erikcw wrote:
> Hi all,
>
> I have a collection of ordered numerical data in a list.

Called a "time series" in statistics.

> The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>
> Here is some sample data:
> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
> 0.10]
>
> In this data, some of the significant points include:
> data[0]
> data[2]
> data[4]
> data[6]
> data[8]
> data[9]
> data[13]
> data[14]
> ....
>
> How do I sort through this data and pull out these points of
> significance?

The best place to ask about an algorithm for this is not
comp.lang.python -- maybe sci.stat.math would be better. Once you have
an algorithm, coding it in Python should not be difficult. I'd suggest
using the NumPy array rather than the native Python list, which is not
designed for crunching numbers.