K-Means Versus K-Spilling

simCluster+ uses K-Means as its underlying clustering approach. In general, this means that simCluster+ returns K clusters and maps every element in the dataset to one of them. However, simCluster+ can also perform K-Spilling clustering. K-Spilling returns K clusters that are more tightly formed around their “mean” centroids; a data point that is not close enough to a centroid “spills” into a secondary cluster. K-Spilling therefore returns K+N clusters, where the first K are tightly bound and the N additional clusters contain the data points that do not meet the density criteria.

The Range Percentile parameter determines whether K-Means or K-Spilling is used. If Range Percentile is 1.0, all data points are clustered into K clusters. If Range Percentile is less than 1.0, dense clusters are produced, and data points that fall outside the specified density spill into clusters beyond K.
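The spilling idea can be sketched in a few lines of Python. This is only an illustration, not simCluster+'s actual implementation: the function name `assign_with_spilling`, the use of fixed centroids, the interpretation of Range Percentile as the fraction of each cluster's nearest points to keep, and the rule that points spilled from cluster j go to cluster K + j are all assumptions made for the example.

```python
import math

def assign_with_spilling(points, centroids, range_percentile=1.0):
    """Return one label per point.

    Labels 0..K-1 are the core clusters. Labels K..2K-1 hold points that
    spilled out of core cluster (label - K). Both conventions are
    assumptions for this sketch, not documented simCluster+ behavior.
    """
    if not 0.0 < range_percentile <= 1.0:
        raise ValueError("range_percentile must be in (0, 1]")
    k = len(centroids)

    # Step 1: plain K-Means-style assignment to the nearest centroid.
    nearest, dists = [], []
    for p in points:
        d = [math.dist(p, c) for c in centroids]
        j = min(range(k), key=d.__getitem__)
        nearest.append(j)
        dists.append(d[j])

    # Step 2: per cluster, keep the closest fraction of members and
    # spill everything farther away than the resulting cutoff distance.
    labels = list(nearest)
    for j in range(k):
        member_d = sorted(dists[i] for i in range(len(points)) if nearest[i] == j)
        if not member_d:
            continue
        keep = math.ceil(range_percentile * len(member_d))
        cutoff = member_d[keep - 1]
        for i in range(len(points)):
            if nearest[i] == j and dists[i] > cutoff:
                labels[i] = k + j  # hypothetical spill-cluster numbering
    return labels

# Two tight groups, each with one distant point (made-up data).
points = [(0, 0), (0.1, 0), (0, 0.1), (2, 0),
          (10, 10), (10.1, 10), (10, 10.1), (10, 13)]
centroids = [(0, 0), (10, 10)]  # assumed to come from an earlier K-Means run

print(assign_with_spilling(points, centroids, 1.0))   # [0, 0, 0, 0, 1, 1, 1, 1]
print(assign_with_spilling(points, centroids, 0.75))  # [0, 0, 0, 2, 1, 1, 1, 3]
```

With a Range Percentile of 1.0 the sketch behaves like the ordinary K-Means assignment step and yields K = 2 clusters. At 0.75 the farthest point of each group is spilled, giving four clusters in total.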

Here’s an abstract example of how this works. This is a plot of a dataset with two attributes (X and Y):

mceclip0.png

Using simCluster+ with a Range Percentile of 1.0 and K = 2 results in the following two clusters, shown as red for cluster 0 and blue for cluster 1:

mceclip1.png

If the Range Percentile is lowered below 1.0, but K remains at K = 2, the following four clusters might be produced:

mceclip2.png

Here the red and blue data points are tightly clustered around their centroids, based on the Range Percentile parameter. In addition to clusters 0 and 1, there are also clusters 2 (black) and 3 (green). These last two clusters contain the data points that do not meet the density requirement and have “spilled” over into separate clusters.
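When working with a K-Spilling result, it can be handy to separate the K core clusters from the spill clusters. The helper below is a minimal sketch that assumes the result is available as a list of zero-based integer cluster labels, one per data point, with labels below K denoting core clusters (as in the example above, where clusters 0 and 1 are core and clusters 2 and 3 are spilled). The function name and the label list are hypothetical and not part of any documented simCluster+ output format.

```python
from collections import defaultdict

def split_core_and_spill(labels, k):
    """Group point indices by cluster label, split into core and spill.

    Assumes labels 0..k-1 are core clusters and labels >= k are spill
    clusters (an assumption based on the article's example).
    """
    core = defaultdict(list)
    spill = defaultdict(list)
    for index, label in enumerate(labels):
        target = core if label < k else spill
        target[label].append(index)
    return dict(core), dict(spill)

# Made-up labels for seven data points, with K = 2.
labels = [0, 0, 2, 1, 1, 3, 0]
core, spill = split_core_and_spill(labels, k=2)
print(core)   # {0: [0, 1, 6], 1: [3, 4]}
print(spill)  # {2: [2], 3: [5]}
```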

Here’s a summary of the different types of clustering:

| Model | Range Percentile | No. of Clusters | All Data Points Clustered? | Use Case |
| --- | --- | --- | --- | --- |
| simCluster+ | 1.0 | K | Yes | Produce exactly K clusters and cluster all data points. |
| simCluster+ | < 1.0 | > K | Yes | Produce K clusters that are densely packed and spill other data points into additional clusters. All data points clustered. |
| simCluster | N/A | Many | No | Produce many clusters, very densely packed. Not all data points clustered. |
