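library(plyr)  # provides ddply() and numcolwise()

# Mean within-cluster variance when the tree `hc` is cut into n clusters.
v <- function(n) {
  pokemon$groups <- cutree(hc, k = n)  # only changes the function's local copy
  dd <- ddply(pokemon, .(groups), numcolwise(var))  # per-cluster variance of each numeric column
  # Single-member clusters have NA variance, hence na.rm = TRUE.
  return(mean(colMeans(dd[, 2:7], na.rm = TRUE), na.rm = TRUE))
}
dv <- data.frame(n = 1:nrow(pokemon), var = sapply(1:nrow(pokemon), v))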
Graphing the mean variance against the number of clusters (n) yields this nice graph.
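For anyone who wants to reproduce this kind of plot without the pokemon data, here is a minimal base R sketch. It is not the exact code behind the graph above. It assumes the built-in `mtcars` data set as a stand-in, and `hc_demo`, `mean_var`, and `dv_demo` are names made up for this illustration.

```r
# Stand-in data (an assumption): the post's `pokemon` data frame and `hc` tree
# are not reproduced here, so the built-in `mtcars` data set is used instead.
x <- scale(mtcars)
hc_demo <- hclust(dist(x))

# Mean within-cluster variance for a cut into n clusters, base R only.
mean_var <- function(n) {
  groups <- cutree(hc_demo, k = n)
  # Variance of every column within every cluster;
  # clusters with a single member give NA and are skipped below.
  per_cluster <- sapply(split(as.data.frame(x), groups),
                        function(d) sapply(d, var))
  mean(per_cluster, na.rm = TRUE)
}

# Stop at nrow(x) - 1 so at least one cluster always has two members.
dv_demo <- data.frame(n = 1:(nrow(x) - 1))
dv_demo$var <- sapply(dv_demo$n, mean_var)

plot(dv_demo$n, dv_demo$var, type = "b",
     xlab = "Number of clusters (n)",
     ylab = "Mean within-cluster variance")
```

With the real data, the same `plot()` call should work on `dv$n` and `dv$var`.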
Clearly, the reduction in mean variance starts to level off somewhere around n = 15-30, but the exact number is still unknown to me. I think I need to read up on this a bit more. You may also notice a few values of n where the overall mean variance actually increases. I'm not entirely sure why. The first increase happens when the first two clusters are made, breaking out the high-HP pokemon, which have large deviations from the norm and from each other. The second increase occurs at 6 clusters.
In short, I'm not sure this answers the question of the optimal number of clusters. However, I feel pretty confident that the value is in the 15-30 range. I need to find another technique to answer this question. It may also be worth running k-means on this data.
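As a rough sketch of what the k-means idea could look like, the snippet below runs `kmeans()` for a range of k and plots the total within-cluster sum of squares, the usual "elbow" picture. Again this assumes the built-in `mtcars` data as a stand-in for the pokemon stats, and the range 1 to 15 is an arbitrary choice for the example, not a recommendation.

```r
# Stand-in data (an assumption): scaled `mtcars` instead of the pokemon stats.
set.seed(42)  # k-means starts from random centers, so fix the seed
x <- scale(mtcars)

ks <- 1:15  # arbitrary range of k for this illustration
# Total within-cluster sum of squares for each k; nstart = 20 keeps the
# best of 20 random starts to reduce the effect of a bad initialisation.
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

The k where the curve bends is one candidate answer. Like the variance plot, it is a visual heuristic rather than a definitive one.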
Updating the gist to include this new code.