Hierarchical Clustering of Pokemon pt 2

Talking with a co-worker today, the question came up of how to determine the optimal number of clusters. In the previous example, 6 was chosen arbitrarily and is almost certainly not the best grouping. A quick bit of research did not give me a clear answer to this question, so I started deriving my own method. I figured I could use variance as a measure of how tight a cluster is. Then, by taking the mean variance across the clusters, I could look for the number of clusters beyond which adding another cluster no longer reduces the variance meaningfully. This can be quickly done with:

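 library(plyr)  # provides ddply() and numcolwise()

 # Mean within-cluster variance when the tree hc is cut into n clusters
 v <- function(n) {
       pokemon$groups <- cutree(hc, k=n)  # only changes the function's local copy
       dd <- ddply(pokemon, .(groups), numcolwise(var))  # variance of each numeric column, per cluster
       # average columns 2:7 of dd; single-member clusters have NA variance and are skipped
       return(mean(colMeans(dd[, 2:7], na.rm=TRUE), na.rm=TRUE))
 }

 dv <- data.frame(n=1:nrow(pokemon), var=sapply(1:nrow(pokemon), v))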
   

Graphing the mean variance against the number of clusters (n) yields this nice graph.
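For anyone who wants to reproduce a graph like this without the pokemon data, here is a minimal sketch in base R. It assumes the built-in USArrests data set as a stand-in, and mean_cluster_var is just a name I made up for a plyr-free version of v, so treat it as an illustration rather than the exact code behind the graph.

```r
# Stand-in data: the built-in USArrests set replaces the pokemon data frame (assumption).
stats <- as.data.frame(scale(USArrests))
hc <- hclust(dist(stats))

# Mean of the per-cluster variances for a cut into n clusters.
# Single-member clusters have NA variance and are dropped by na.rm.
mean_cluster_var <- function(n) {
  groups <- cutree(hc, k = n)
  per_cluster <- sapply(split(stats, groups),
                        function(g) mean(sapply(g, var)))
  mean(per_cluster, na.rm = TRUE)
}

# Stop one short of nrow(stats): with every point in its own cluster
# there is no variance left to average.
ks <- 1:(nrow(stats) - 1)
dv <- data.frame(n = ks, var = sapply(ks, mean_cluster_var))

plot(dv$n, dv$var, type = "b",
     xlab = "Number of clusters (n)",
     ylab = "Mean within-cluster variance")
```

The sketch stops at one cluster fewer than the number of rows, because the all-singletons cut has no defined variance.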



Clearly, the reduction in mean variance starts to level off somewhere around n of 15-30, but the exact number is still unknown to me. I think I need to read up on this a bit more. Also, you may notice there are a few values of n where the overall variance actually increases. I'm not exactly sure what's happening here. The first increase happens when the first two clusters are made, breaking out the high HP pokemon, which have large deviations from the norm and from each other. The second increase occurs at 6 clusters.
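One likely contributor to those increases is that the measure is an unweighted mean: a small, spread-out cluster counts as much as a large, tight one, so splitting it off can push the average up. A pooled within-cluster sum of squares counts every observation once and cannot rise when a cluster is split. Here is a rough sketch comparing the two, again assuming USArrests as stand-in data and with mean_var and pooled_wss as hypothetical helper names.

```r
# Stand-in data: USArrests replaces the pokemon data frame (assumption).
stats <- as.data.frame(scale(USArrests))
hc <- hclust(dist(stats))

# Unweighted mean of per-cluster variances (the measure used in this post).
mean_var <- function(groups) {
  per_cluster <- sapply(split(stats, groups),
                        function(g) mean(sapply(g, var)))
  mean(per_cluster, na.rm = TRUE)
}

# Pooled within-cluster sum of squares: every observation counts once.
pooled_wss <- function(groups) {
  sum(sapply(split(stats, groups), function(g) {
    sum(sapply(g, function(col) sum((col - mean(col))^2)))
  }))
}

ks <- 1:(nrow(stats) - 1)
cmp <- data.frame(
  n = ks,
  mean_var = sapply(ks, function(k) mean_var(cutree(hc, k = k))),
  wss = sapply(ks, function(k) pooled_wss(cutree(hc, k = k)))
)

# Number of steps at which each measure goes up as n grows.
sum(diff(cmp$mean_var) > 0)
sum(diff(cmp$wss) > 1e-9)  # small tolerance for floating point noise
```

Because each step from n to n+1 clusters in cutree splits exactly one existing cluster, the pooled measure only ever goes down or stays flat, which makes its curve easier to read for an elbow.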

In short, I'm not sure this answers the question of the optimal number of clusters. However, I feel pretty confident that the value is in the 15-30 range. I need to find another technique to answer this question. It may also be worth running k-means on this data.
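As a starting point for the k-means idea, here is a small sketch of the usual elbow plot using kmeans from base R. It assumes USArrests as stand-in data again, and the range of 1 to 10 clusters, the seed, and nstart = 20 are arbitrary choices of mine, not anything tuned for the pokemon set.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Stand-in data: USArrests replaces the pokemon data frame (assumption).
stats <- scale(USArrests)

# Total within-cluster sum of squares for k = 1..10.
# nstart = 20 reruns each fit from 20 random starts and keeps the best.
ks <- 1:10
km_wss <- sapply(ks, function(k) {
  kmeans(stats, centers = k, nstart = 20)$tot.withinss
})

plot(ks, km_wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

The point where the curve bends and flattens out is the candidate for the number of clusters, which is the same kind of eyeballing as with the mean variance graph.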

Updating the gist to include this new code.
