Programmatically finding "significant" data points

 
 
erikcw
      11-14-2006
Hi all,

I have a collection of ordered numerical data in a list. The numbers
when plotted on a line chart make a low-high-low-high-high-low (random)
pattern. I need an algorithm to extract the "significant" high and low
points from this data.

Here is some sample data:
data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
0.10]

In this data, some of the significant points include:
data[0]
data[2]
data[4]
data[6]
data[8]
data[9]
data[13]
data[14]
.....

How do I sort through this data and pull out these points of
significance?

Thanks for your help!

Erik

 
Jeremy Sanders
      11-14-2006
erikcw wrote:

> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>

....
>
> How do I sort through this data and pull out these points of
> significance?


Get a book on statistics. One idea is as follows. If you expect the points
to be centred around a single value, you can calculate the median or mean
of the points, calculate their standard deviation (aka spread), and remove
points which are more than N-times the standard deviation from the median.
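
A minimal sketch of that idea, assuming the population standard deviation and a caller-chosen multiplier; the function name `split_by_spread` and the parameter `n_sigma` are made up for illustration:

```python
import statistics

def split_by_spread(data, n_sigma=2.0):
    """Split indices into (kept, removed) by distance from the median.

    Hypothetical helper: a point is "removed" when it lies more than
    n_sigma standard deviations away from the median of the data.
    """
    centre = statistics.median(data)
    spread = statistics.pstdev(data)  # assumption: population std. deviation
    kept, removed = [], []
    for i, x in enumerate(data):
        if abs(x - centre) > n_sigma * spread:
            removed.append(i)
        else:
            kept.append(i)
    return kept, removed

kept, removed = split_by_spread([1, 1, 1, 1, 10], n_sigma=2.0)
print(removed)
```

Whether the removed points or the kept ones are the "significant" ones depends on what you are after, and the right value of N is a judgement call.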

Jeremy

--
Jeremy Sanders
http://www.jeremysanders.net/
 
Fredrik Lundh
      11-14-2006
"erikcw" wrote:

> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>
> Here is some sample data:
> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
> 0.10]


silly solution:

for i in range(1, len(data) - 1):
    # an extremum is strictly above or strictly below both neighbours
    if data[i-1] < data[i] > data[i+1] or data[i-1] > data[i] < data[i+1]:
        print(i)

(the above doesn't handle the "edges", but that's easy to fix)
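
One possible way to fix the edges, as a sketch only: it assumes the first and last points should always be reported, which matches data[0] being on the original poster's list but is otherwise a guess. The function name `turning_points` is invented here.

```python
def turning_points(data):
    """Return indices of local extrema, counting both end points."""
    if len(data) < 2:
        return list(range(len(data)))
    points = [0]  # assumption: the first point always counts
    for i in range(1, len(data) - 1):
        if data[i-1] < data[i] > data[i+1] or data[i-1] > data[i] < data[i+1]:
            points.append(i)
    points.append(len(data) - 1)  # assumption: so does the last point
    return points

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
        0.10]
print(turning_points(data))
# prints [0, 2, 4, 6, 8, 9, 10, 13, 14, 15, 17, 18, 19, 20, 22]
```

Note that this reports every change of direction, including small ones such as data[10], so it does not by itself decide which extrema are "significant".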

</F>



 
Philipp Pagel
      11-14-2006
erikcw wrote:
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.


I am not sure what you mean by 'ordered' in this context. As
pointed out by Jeremy, you need to find an appropriate statistical test.
The appropriateness depends on how your data is (presumably) distributed
and what exactly you are trying to test. E.g. do the data points come from
different groups of some kind? Or are you just looking for extreme
values (outliers maybe?)?

So it's more of a statistical question than a Python one.

cu
Philipp

--
Dr. Philipp Pagel Tel. +49-8161-71 2131
Dept. of Genome Oriented Bioinformatics Fax. +49-8161-71 2186
Technical University of Munich
http://mips.gsf.de/staff/pagel
 
Peter Otten
      11-14-2006
erikcw wrote:

> Hi all,
>
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>
> Here is some sample data:
> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
> 0.10]
>
> In this data, some of the significant points include:
> data[0]
> data[2]
> data[4]
> data[6]
> data[8]
> data[9]
> data[13]
> data[14]
> ....
>
> How do I sort through this data and pull out these points of
> significance?


I think you are looking for "extrema":

def w3(items):
    # slide a window of three consecutive items over the sequence
    items = iter(items)
    view = None, next(items), next(items)
    for item in items:
        view = view[1:] + (item,)
        yield view

for i, (a, b, c) in enumerate(w3(data)):
    # b is the middle item of the window; its index in data is i+1
    if a > b < c:
        print(i + 1, "min", b)
    elif a < b > c:
        print(i + 1, "max", b)
    else:
        print(i + 1, "---", b)

Peter
 
Alan J. Salmoni
      11-14-2006
If the order doesn't matter, you can sort the data and remove x * 0.5 *
n where x is the proportion of numbers you want. If you have too many
similar values though, this falls down. I suggest you check out
quantiles in a good statistics book.
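
A rough sketch of the sorting approach, under one reading of the description above: x is taken to be the share of points to pick out, with half of them coming from each end of the sorted data. That reading, the function name `extreme_tails`, and the parameter `proportion` are all assumptions.

```python
def extreme_tails(data, proportion=0.2):
    """Return (lows, highs): indices of the smallest and largest values.

    Assumption: proportion * 0.5 * n points are taken from each tail.
    """
    k = int(proportion * 0.5 * len(data))
    # indices of data, ordered by the value they point at
    order = sorted(range(len(data)), key=lambda i: data[i])
    lows = sorted(order[:k])
    highs = sorted(order[len(order) - k:]) if k else []
    return lows, highs

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
        0.10]
lows, highs = extreme_tails(data, proportion=0.2)
print(lows, highs)
# prints [0, 22] [12, 13]
```

This ignores the order of the points entirely, so it finds the globally smallest and largest values rather than the places where the line changes direction.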

Alan.

 
Ganesan Rajagopal
      11-14-2006
>>>>> Jeremy Sanders writes:

>> How do I sort through this data and pull out these points of
>> significance?


> Get a book on statistics. One idea is as follows. If you expect the points
> to be centred around a single value, you can calculate the median or mean
> of the points, calculate their standard deviation (aka spread), and remove
> points which are more than N-times the standard deviation from the median.


Standard deviation was the first thought that jumped to my mind
too. However, that's not what the OP is after. He seems to be looking for
points where the direction changes.

Ganesan

--
Ganesan Rajagopal

 
Roberto Bonvallet
      11-14-2006
erikcw wrote:
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.


In calculus, you identify high and low points by looking where the
derivative changes its sign. When working with discrete samples, you can
look at the sign changes in finite differences:

>>> data = [...]
>>> diff = [data[i + 1] - data[i] for i in range(len(data) - 1)]
>>> ['%g' % d for d in diff]

['0.4', '0.1', '-0.2', '-0.01', '0.11', '0.5', '-0.2', '-0.2', '0.6',
'-0.1', '0.2', '0.1', '0.1', '-0.45', '0.15', '-0.3', '-0.2', '0.1',
'-0.4', '0.05', '-0.1', '-0.25']

The high points are those where diff changes from + to -, and the low
points are those where diff changes from - to +.
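
The sign-change rule could be sketched like this. The function name `extrema_from_differences` is invented for the example, and a difference of exactly zero (two equal neighbours) is simply treated as no sign change, which is an assumption about how plateaus should be handled.

```python
def extrema_from_differences(data):
    """Return (highs, lows) as lists of indices into data."""
    # diff[i] is the step from data[i] to data[i + 1]
    diff = [b - a for a, b in zip(data, data[1:])]
    highs, lows = [], []
    for i in range(1, len(diff)):
        if diff[i - 1] > 0 and diff[i] < 0:
            highs.append(i)  # rising into data[i], falling out of it
        elif diff[i - 1] < 0 and diff[i] > 0:
            lows.append(i)   # falling into data[i], rising out of it
    return highs, lows

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
        0.10]
highs, lows = extrema_from_differences(data)
print(highs)  # prints [2, 6, 9, 13, 15, 18, 20]
print(lows)   # prints [4, 8, 10, 14, 17, 19]
```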

HTH,
--
Roberto Bonvallet
 
Roy Smith
      11-14-2006
"erikcw" wrote:
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.


I think you want a control chart. A good place to start might be
http://en.wikipedia.org/wiki/Control_chart. Even if you don't actually
graph the data, understanding the math behind control charts might help you
with your analysis.
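
As a rough illustration of that math, here is a sketch of the limits for one common kind of control chart, the individuals chart, where the limits sit 2.66 average moving ranges either side of the mean. The choice of chart type and the function names `control_limits` and `out_of_control` are assumptions, not something taken from the question.

```python
def control_limits(data):
    """Return (lower, centre, upper) for an individuals control chart."""
    centre = sum(data) / len(data)
    # moving range: absolute step between consecutive points
    moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    # 2.66 is 3 / 1.128, the usual constant for ranges of two points
    margin = 2.66 * mr_bar
    return centre - margin, centre, centre + margin

def out_of_control(data):
    """Return indices of points that fall outside the control limits."""
    lower, _, upper = control_limits(data)
    return [i for i, x in enumerate(data) if x < lower or x > upper]

readings = [10, 10, 10, 10, 10, 10, 10, 10, 10, 30]  # made-up numbers
print(out_of_control(readings))
```

This needs at least two points, and it flags unusually large or small values rather than changes of direction.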

Wow. I think this is the first time I've actually used something I learned
by sitting through those stupid Six Sigma training classes.
 
Beliavsky
      11-14-2006

erikcw wrote:
> Hi all,
>
> I have a collection of ordered numerical data in a list.


Called a "time series" in statistics.

> The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>
> Here is some sample data:
> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
> 0.10]
>
> In this data, some of the significant points include:
> data[0]
> data[2]
> data[4]
> data[6]
> data[8]
> data[9]
> data[13]
> data[14]
> ....
>
> How do I sort through this data and pull out these points of
> significance?


The best place to ask about an algorithm for this is not
comp.lang.python -- maybe sci.stat.math would be better. Once you have
an algorithm, coding it in Python should not be difficult. I'd suggest
using the NumPy array rather than the native Python list, which is not
designed for crunching numbers.
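
For example, if the algorithm turns out to be "find where the direction changes", a NumPy version might look like the sketch below. The function name `extrema_numpy` is invented, and it assumes no two neighbouring values are equal, since a flat step would be reported as a turn here.

```python
import numpy as np

def extrema_numpy(data):
    """Return (highs, lows) as arrays of indices into data."""
    values = np.asarray(data, dtype=float)
    slope_sign = np.sign(np.diff(values))  # +1 rising, -1 falling
    turn = np.diff(slope_sign)             # -2 at a peak, +2 at a trough
    highs = np.where(turn < 0)[0] + 1      # +1 maps back to indices in data
    lows = np.where(turn > 0)[0] + 1
    return highs, lows

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
        0.10]
highs, lows = extrema_numpy(data)
print(highs.tolist())  # prints [2, 6, 9, 13, 15, 18, 20]
print(lows.tolist())   # prints [4, 8, 10, 14, 17, 19]
```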

 