Cluster Statistics


It is very rare for every row in a cluster to have the same values. Usually an attribute takes several values within a cluster, each with a different weight. For example, a cluster might be mostly female but still contain some male rows. This may be because other factors are more predictive, or because some factors in combination with those attributes are equally predictive. Furthermore, to gauge how distinctive a cluster is, it is useful to compare its statistics with those of the overall data set.

When you select a cluster and click the View Details button, a modal window appears that contains the average weight of the top factors (or frequency, if selected) and a table of the rows in the cluster. The window also has a Statistics tab, which displays the cluster statistics for an attribute. If the Spec Analyzer has been run on the input data set, overall data set statistics are also presented for comparison.


On the Statistics tab, select an attribute of interest from the drop-down. Alternatively, type a string into the Choose Attribute box and select an attribute from the search results.


Next, click the blue “+” button. This adds the statistical display for the attribute to the modal window. Once clicked, the blue “+” button turns into a red “-” button, which can be used to remove the display for this attribute. Also, when an attribute is added, a new blue “+” button and attribute drop-down appear at the end of the display so that other attributes can be added to the modal window.


Similar to the Spec Analyzer, there are sub-tabs with different statistical displays. For Real attributes there are sub-tabs for Top Values, a Distribution Histogram, and Statistics. For Nominal types there is a sub-tab for Top Values.
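To make the type-dependent sub-tabs concrete, here is a minimal Python sketch of how such summaries could be computed outside the product. It is not the Spec Analyzer's implementation. The function names (`summarize`, `histogram`), the equal-width binning, and the particular summary statistics (min, max, mean, median) are assumptions chosen for illustration; this article does not list which statistics the Statistics sub-tab reports.

```python
from collections import Counter
from statistics import mean, median


def histogram(values, bins=5):
    """Count values in equal-width bins (assumed binning scheme)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single bin holds everything.
        return [len(values)]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bin rather than overflowing.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def summarize(values, attr_type, top_n=3):
    """Build the displays for one attribute, keyed by sub-tab name.

    attr_type is "Real" or "Nominal" (hypothetical labels mirroring the text).
    """
    result = {"top_values": Counter(values).most_common(top_n)}
    if attr_type == "Real":
        # Real attributes also get a distribution histogram and statistics.
        result["histogram"] = histogram(values)
        result["statistics"] = {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "median": median(values),
        }
    return result


real_summary = summarize([0, 1, 1, 2, 5], "Real")
nominal_summary = summarize(["Female", "Female", "Male"], "Nominal")
```

In this sketch a Nominal attribute yields only the top values, while a Real attribute yields all three displays, matching the sub-tabs described above.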


The rightmost columns in the Top Values and Statistics sub-tabs show information about the overall data set for comparison. For example, in this diabetes rehospitalization supervised clustering, the top value of Number_Inpatient (number of inpatient procedures) in the selected cluster is 1, and it occurs roughly two-thirds of the time. In the overall data set, by contrast, the top value is 0, which also occurs roughly two-thirds of the time. Also, the values range from 0 to 10 in this cluster but from 0 to 21 in the overall data set.
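The same cluster-versus-overall comparison can be reproduced on exported data. The following Python sketch is an illustration under stated assumptions, not the product's code: `top_values` and `compare` are hypothetical helper names, and the two lists are made-up toy values chosen only to mimic the pattern described above (they are not the diabetes data set).

```python
from collections import Counter


def top_values(values, n=3):
    """Return the n most frequent values as (value, count, share) tuples."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, count / total)
            for value, count in counts.most_common(n)]


def compare(cluster_values, overall_values, n=3):
    """Summarize one attribute for a cluster next to the overall data set."""
    return {
        "cluster_top": top_values(cluster_values, n),
        "overall_top": top_values(overall_values, n),
        "cluster_range": (min(cluster_values), max(cluster_values)),
        "overall_range": (min(overall_values), max(overall_values)),
    }


# Made-up toy values for a single numeric attribute.
overall = [0] * 12 + [1] * 4 + [2] + [5]   # 18 rows, top value 0
cluster = [1] * 4 + [0] + [2]              # 6 rows, top value 1

summary = compare(cluster, overall)
for key, value in summary.items():
    print(key, value)
```

With these toy values, the cluster's top value is 1 and the overall top value is 0, each accounting for two-thirds of its rows, and the cluster's range is narrower than the overall range.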

Here’s an example of a Nominal attribute. The Distribution Histogram and Statistics sub-tabs are not shown for this data type.



