Perl routine for cluster detection

 
 
vincent64@yahoo.com
      04-28-2007
Here is some elementary code to detect the presence of a clustering
structure in a 2-dimensional dataset. It's more heuristic than
scientific, so take it with a grain of salt, as even the concept of
a cluster is highly fuzzy.

The seed routine creates a cluster of 1000 points, saved in
cluster.txt: each row corresponds to a point; the first column is the
cluster number, and the next two columns are the x and y coordinates.
The cluster number is automatically incremented each time a new call
to seed is made, resulting in the creation of a new cluster. The
distance routine computes the distance between two points, for 100
points randomly selected in the data set previously created
(cluster.txt). The output is a file dist.txt, with one row per pair
of points and two fields: the first column is an indicator, equal
to 1 if both points belong to the same cluster; the second column is
the distance between the two points. This script illustrates that it
is possible to check whether a data set contains one or two clusters
by looking at the distribution of distances: a gap in the
distribution means the presence of distinct clusters. It also
suggests that the computational complexity of determining whether a
data set contains one or more clusters is well below O(n), possibly
O(n^0.5), if one uses sampling techniques.
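
To make the "gap in the distribution" idea concrete, here is a minimal sketch, not the code behind the link below. It bins a list of distances into equal-width bins and reports the longest run of empty bins between occupied ones. The names histogram and longest_gap, the bin count of 10 and the sample distances are all assumptions made up for the illustration.

```perl
use strict;
use warnings;

# Bin distances into $nbins equal-width bins spanning [min, max].
sub histogram {
    my ($dists, $nbins) = @_;
    my ($min, $max) = ($dists->[0], $dists->[0]);
    for my $d (@$dists) {
        $min = $d if $d < $min;
        $max = $d if $d > $max;
    }
    my $width = ($max - $min) / $nbins || 1;   # avoid a zero bin width
    my @count = (0) x $nbins;
    for my $d (@$dists) {
        my $i = int(($d - $min) / $width);
        $i = $nbins - 1 if $i >= $nbins;       # the maximum lands in the last bin
        $count[$i]++;
    }
    return \@count;
}

# Longest run of empty bins that is followed by an occupied bin.
sub longest_gap {
    my ($count) = @_;
    my ($best, $run) = (0, 0);
    for my $c (@$count) {
        if ($c == 0) { $run++ }
        else         { $best = $run if $run > $best; $run = 0 }
    }
    return $best;
}

# Made-up distances: a group of short ones and a group of long ones.
my @dist = (0.1, 0.2, 0.3, 0.4, 10.0, 10.1, 10.2);
my $gap  = longest_gap(histogram(\@dist, 10));
print "longest run of empty bins: $gap\n";
```

A long run of empty bins relative to the bin count hints at two groups of distances; how long is "long" is a judgment call that this sketch does not settle.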


Source code: http://datashaping.com/regress.shtml

 
Mirco Wahab
 
      04-29-2007
vincent64@yahoo.com wrote:
> Here is some elementary code to detect the presence of a clustering
> structure in a 2-dimensional dataset. It's more heuristic than
> scientific, so take it with a grain of salt, as even the concept of
> cluster is highly fuzzy.


Before going into details, I'd like to ask what
you think the following part of your program
does:

...
sub seed {
local($x,$y)=@_;
$kmax=1000;

$x=rand($x)-0.5;
$y=rand($y)-0.5;

for ($k=0; $k<$kmax; $k++) {
print "\t$cluster [$1]\n";
$x=$x+rand($1)-0.5;
$y=$y+rand($1)-0.5;
$px[$k]=$x;
$py[$k]=$y;
}
...


Aside from not being able to run under 'strict',
what is meant by

$x=$x+rand($1)-0.5;
$y=$y+rand($1)-0.5;

because '$1' is, at this point, not set.
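
For reference, a small sketch of what that line does in a fresh script. It assumes no regex has matched earlier in the program, so '$1' is undefined. An undefined value counts as 0 in numeric context, and as far as I know perl's rand falls back to 1 for a zero argument, so the step would behave like rand(1)-0.5 (with an 'uninitialized value' warning under 'use warnings'). The helper name step_range is made up for this check.

```perl
use strict;
use warnings;

# No regex has matched yet, so the capture variable $1 is undefined.
print defined($1) ? "\$1 is set\n" : "\$1 is undefined\n";

# Draw $n steps the way the quoted loop does and return the smallest
# and largest step seen.
sub step_range {
    my ($n) = @_;
    my ($lo, $hi) = (1, -1);
    no warnings 'uninitialized';    # silence the warning from rand($1)
    for (1 .. $n) {
        my $step = rand($1) - 0.5;
        $lo = $step if $step < $lo;
        $hi = $step if $step > $hi;
    }
    return ($lo, $hi);
}

my ($lo, $hi) = step_range(10_000);
printf "steps fall in [%.3f, %.3f]\n", $lo, $hi;
```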

> The seed routine creates a cluster of 1000 points, saved in
> cluster.txt: each row corresponds to a point; the first column is the
> cluster number, and the next two columns are the x and y coordinates.


Don't do that. The convention in this business is:
first comes x, then y, then z. Because your 'cluster number'
is in effect 'a plane' in your problem space, you should make
it that ('z', third column).

> The cluster number is automatically incremented each time a new call
> to seed is made, resulting in the creation of a new cluster. The
> distance routine computes the distance between two points, for 100
> points randomly selected in the data set previously created
> (cluster.txt). The output is a file dist.txt, with one row per pair
> of points, with two fields: the first column is an indicator and is
> equal to 1 if both points belong to the same cluster; the second column is
> the distance between the two points. This script illustrates that it
> is possible to check whether a data set contains one or two clusters
> by looking at the distribution of distances: a gap in the
> distribution means the presence of distinct clusters. It also suggests
> that the computational complexity of computing whether a data set contains
> one or more clusters is well below O(n), possibly O(n^0.5), if one uses
> sampling techniques.


What's the point of that? You have, say, 10^7 2D points, then you
select 100 sample pairs from them, compute their distances and
claim you have 'complexity well below O(n), possibly O(n^0.5)'?

I don't get that ...

==>your code was: datashaping.com/cluster_pl.txt

I'd recommend translating the code from Perl 3 style
to Perl 5, which is not really that difficult, because
the code does almost nothing.

Starting point: ==>

use strict;
use warnings;

my $idclust = 0;

# two clusters: the first call creates cluster.txt, the second appends to it
dmp_seed([1.0,1.0], 1000, $idclust++, '>cluster.txt');
dmp_seed([25.0,25.0], 1000, $idclust++, '>>cluster.txt');

distance(100, 'cluster.txt', 'dist.txt'); # nsamp read write


sub distance {
    my ($nsamp, $fn_clust, $fn_dist) = @_;

    # read all points, one [x, y, cluster id] triple per row
    open my $fc,'<', $fn_clust or die "no coord in: $!";
    my @pc = map [/(\S+)/g], <$fc>;
    close $fc;

    open my $fd, '>', $fn_dist or die "no dist out: $!";
    for (1 .. $nsamp) {
        # pick two points at random
        my ($pm, $pn) = ( $pc[int rand @pc], $pc[int rand @pc] );
        # first column: 0 if both share a cluster id, 1 otherwise
        printf $fd "%d\t%.8f\n", 1-($pm->[2] == $pn->[2]),
            sqrt(($pm->[0]-$pn->[0])**2 + ($pm->[1]-$pn->[1])**2)
    }
    close $fd;
}

sub dmp_seed {
    my ($rseed, $nmax, $cluid, $fname_mod) = @_;
    # start near the seed point, then take a random walk of $nmax steps
    my ($x, $y) = map $_+rand(1)-0.5, @$rseed;

    # $fname_mod carries the open mode ('>' or '>>') in front of the file name
    open my $fh, $fname_mod or die "no way out: $!";
    for(1 .. $nmax) {
        printf $fh "%.8f\t%.8f\t%d\n", $x+=rand(1)-0.5, $y+=rand(1)-0.5, $cluid
    }
    close $fh;
}
<==
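
To look at the result, something like the following sketch could summarise dist.txt per indicator value. It assumes the two-column layout written by distance above (indicator, tab, distance); as written there, the first column is 0 when both points carry the same cluster id and 1 otherwise. The helper name summarise is made up for the example, and the file is only read if it exists in the current directory.

```perl
use strict;
use warnings;

# Summarise "indicator<TAB>distance" rows: count, min and max per indicator.
sub summarise {
    my (@rows) = @_;
    my %stat;
    for my $row (@rows) {
        my ($flag, $dist) = split ' ', $row;
        next unless defined $dist;             # skip blank or short lines
        my $s = $stat{$flag} ||= { n => 0, min => $dist, max => $dist };
        $s->{n}++;
        $s->{min} = $dist if $dist < $s->{min};
        $s->{max} = $dist if $dist > $s->{max};
    }
    return \%stat;
}

# Read dist.txt if it is there and print one line per indicator value.
if (open my $fd, '<', 'dist.txt') {
    my $stat = summarise(<$fd>);
    close $fd;
    for my $flag (sort keys %$stat) {
        printf "indicator %s: n=%d min=%.3f max=%.3f\n",
            $flag, @{ $stat->{$flag} }{qw(n min max)};
    }
}
```

If the largest distance within one indicator group stays well below the smallest distance in the other, that is the gap the original post talks about.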

Regards

M.
 