Hierarchical Clustering of Pokemon pt 2

Talking with a co-worker today, the question came up of how to determine the optimal number of clusters. In the previous example, 6 was arbitrarily chosen and almost certainly not the best grouping. A quick bit of research does not provide me with a clear answer to this question. So, I started deriving my own method. I figured I could use variance as a measure of how tight a cluster is. Then by taking the mean variance of the clusters, I could look for a number of clusters where adding another cluster did not reduce variance meaningfully. This can be quickly done with:

 v <- function(n) {  
      pokemon$groups <- cutree(hc, k=n)  
      dd <- ddply(pokemon, .(groups), numcolwise(var))  
      return(mean(colMeans(dd[, 2:7], na.rm=T), na.rm=T))  
 }  
   
 dv <- data.frame(n=1:nrow(pokemon), var=sapply(1:nrow(pokemon), function (x) { v(x) }))  
   

Graphing the mean variance against the number of clusters (n) yields this nice graph.



Clearly, the reduction of mean variance starts to erode around n of 15-30.  But, the exact number is still unknown to me. Think I need to read up on this a bit more. Also, you may notice there are a few values where the overall variance actually increases. Not exactly sure what's happening here. It happens when the first two clusters are made, breaking out the high HP pokemon, which have large deviations from the norm and from each other. The 2nd increases occurs at 6 clusters.

In short, not sure this answers the question regarding the optimal number of clusters. However, I feel pretty confident thinking that value is in the 15-30 range. Need to find another technique to answer this question. Also, may be worth doing a k-means on this also.

Updating the gist to include this new code.

Hierarchical Clustering of Pokemon

In learning more about Pokemon as I get ready for X & Y release next month, I come to realize I should be thinking of the little monsters as I would character classes from a RPG.  Just like a RPG, we have our DPS'ers, tanks and healers.  These roles are determined by the moves a pokemon can have as well as their traits.  I thought it would be interesting to take a stab at clustering pokemon by traits and seeing what I can learn. As before, I'll be using Veekun's data on github for this exercise.

Let's start by loading the needed libraries and getting the data loaded into R.

 library(RCurl)  
 library(reshape2)  
 library(plyr)  
 library(ggplot2)  
   
 pokemon <- read.csv(text = getURL("https://raw.github.com/veekun/pokedex/master/pokedex/data/csv/pokemon_species.csv", ssl.verifypeer=FALSE))  
 pokemon_stats <- read.csv(text = getURL("https://raw.github.com/veekun/pokedex/master/pokedex/data/csv/pokemon_stats.csv", ssl.verifypeer=FALSE))  
 stat_names <- read.csv(text = getURL("https://raw.github.com/veekun/pokedex/master/pokedex/data/csv/stat_names.csv", ssl.verifypeer=FALSE))  
   

We don't care about the unevolved forms, so let's throw them away.

 pokemon <- ddply(pokemon, .(evolution_chain_id), subset, evolves_from_species_id == max(evolves_from_species_id, na.rm=T))  
   

Now, the pokemon attributes are in another table, so let's clean it up and merge everything together into a nice data frame listing our pokemon by name and their various attributes as columns.

 stat_names <- stat_names[stat_names$local_language_id == 9, c("stat_id", "name")]  
   
 pokemon_stats <- merge(pokemon_stats, stat_names)  
 pokemon_stats <- pokemon_stats[, c("pokemon_id", "base_stat", "name")]  
   
 colnames(pokemon_stats) <- c("pokemon_id", "value", "name")  
 pokemon_stats <- dcast(pokemon_stats, pokemon_id ~ name, mean )  
   
 colnames(pokemon_stats)[[1]] <- "id"  
   
 pokemon <- merge(pokemon, pokemon_stats)  
   
 pokemon <- pokemon[, c("identifier", "Attack", "Defense", "HP", "Special Attack", "Special Defense", "Speed")]  

> head(pokemon)
  identifier Attack Defense HP Special Attack Special Defense Speed 
1   venusaur     82      83 80            100             100    80
2  charizard     84      78 78            109              85   100      
3  blastoise     83     100 79             85             105    78      
4 butterfree     45      50 60             80              80    70      
5   beedrill     80      40 65             45              80    75      
6    pidgeot     80      75 83             70              70    91      

   

Previous experience using hclust has told me I should probably scale these values to give them equal weight. Then we can create the cluster and create a quick plot.

 pokemon.scaled <- scale(pokemon[, -1])  
   
 hc <- hclust(dist(pokemon.scaled))  
 plot(hc, labels=pokemon$identifier, hang=-1)  
   



Cool chart huh? Pretty hard to read. All I'm really interested in is choosing how many groups I want to have. I'm going to choose 6 as that seems to be a nice place to break it up. Based on what I find, I might want to try a different grouping, but I'll start with 6 and see if some characteristics appear within those groups.

 pokemon$groups <- cutree(hc, k=6)  
   
 df <- melt(ddply(pokemon[, 2:8], .(groups), numcolwise(mean)), id=c("groups"))  
   
 ggplot(df, aes(variable, value, fill=variable)) + geom_bar(stat="identity") + facet_grid(groups ~ .) + coord_flip() + theme(legend.position="none")  
   



Now we have something. Immediately we can see that group 5 consists of the high HP pokemon, so we'll call them "Meat Shields". Group 6 has high HP and high Attack, we'll call these our "Bulky Attackers". Likewise, Group 5 has high Defense and high Attack, thus the label "Hard Knocks". Group 3 seems to be distinguished by their lack of distinguishing attributes. I'll call them the "Bench". I would guess this group has some tricks in their moveset to make them more viable. Group 1 and 2 are the hardest to distinguish. I should probably break them into other groups to narrow down their similarities a bit. However, we can see at least Group 2 has the highest Speed of all groups so we'll call them the "Speed Demons". Leaving Group 1 with the label, "Middle Road" as all their attributes seem to be in that middle ground. With these labels in place, let's add them and then see which pokemon are in which groups.

 > pokemon$class <- factor(pokemon$groups, labels=c("Middle Road", "Speed Demons", "Bench", "Hard Knocks", "Meat Shield", "Bulky Attackers"))  
 pokemon$identifier <- as.character(pokemon$identifier)  
 split(pokemon$identifier, pokemon$class)  
 > > $`Middle Road`  
  [1] "venusaur"  "blastoise" "pidgeot"  "nidoqueen" "nidoking"   
  [6] "vileplume" "golduck"  "poliwrath" "tentacruel" "slowbro"    
 [11] "dewgong"  "hypno"   "weezing"  "seaking"  "mr-mime"    
 [16] "vaporeon"  "omastar"  "meganium"  "feraligatr" "noctowl"    
 [21] "ledian"   "lanturn"  "ampharos"  "bellossom" "politoed"   
 [26] "sunflora"  "quagsire"  "umbreon"  "slowking"  "magcargo"   
 [31] "mantine"  "kingdra"  "swampert"  "ludicolo"  "pelipper"   
 [36] "gardevoir" "exploud"  "swalot"   "grumpig"  "altaria"    
 [41] "whiscash"  "claydol"  "cradily"  "milotic"  "glalie"    
 [46] "walrein"  "huntail"  "gorebyss"  "torterra"  "empoleon"   
 [51] "bastiodon" "wormadam"  "vespiquen" "gastrodon" "skuntank"   
 [56] "bronzong"  "abomasnow" "magnezone" "lickilicky" "togekiss"   
 [61] "glaceon"  "probopass" "dusknoir"  "musharna"  "seismitoad"  
 [66] "scrafty"  "cofagrigus" "garbodor"  "gothitelle" "reuniclus"   
 [71] "vanilluxe" "escavalier" "amoonguss" "jellicent" "ferrothorn"  
 [76] "beheeyem"  "chandelure" "mandibuzz" "volcarona"   
   
 $`Speed Demons`  
  [1] "charizard" "butterfree" "beedrill"  "raticate"  "fearow"    
  [6] "arbok"   "ninetales" "venomoth"  "dugtrio"  "persian"    
  [11] "primeape"  "arcanine"  "alakazam"  "victreebel" "rapidash"   
  [16] "dodrio"   "gengar"   "electrode" "exeggutor" "hitmonlee"   
  [21] "hitmonchan" "starmie"  "jynx"    "electabuzz" "magmar"    
  [26] "jolteon"  "flareon"  "typhlosion" "furret"   "crobat"    
  [31] "xatu"    "jumpluff"  "espeon"   "octillery" "houndoom"   
  [36] "hitmontop" "sceptile"  "blaziken"  "mightyena" "linoone"    
  [41] "dustox"   "shiftry"  "swellow"  "masquerain" "breloom"    
  [46] "ninjask"  "delcatty"  "medicham"  "manectric" "roselia"    
  [51] "sharpedo"  "camerupt"  "flygon"   "cacturne"  "crawdaunt"   
  [56] "banette"  "chimecho"  "salamence" "infernape" "bibarel"    
  [61] "kricketune" "luxray"   "mothim"   "floatzel"  "cherrim"    
  [66] "ambipom"  "lopunny"  "mismagius" "honchkrow" "purugly"    
  [71] "lucario"  "toxicroak" "lumineon"  "weavile"  "yanmega"    
  [76] "porygon-z" "gallade"  "froslass"  "serperior" "emboar"    
  [81] "samurott"  "watchog"  "liepard"  "simisage"  "simisear"   
  [86] "simipour"  "unfezant"  "zebstrika" "swoobat"  "leavanny"   
  [91] "scolipede" "whimsicott" "lilligant" "archeops"  "zoroark"    
  [96] "cinccino"  "swanna"   "sawsbuck"  "galvantula" "eelektross"  
 [101] "accelgor"  "mienshao"  "hydreigon"   
   
 $Bench  
 [1] "pikachu"  "clefairy"  "jigglypuff" "marill"   "shedinja"   
   
 $`Hard Knocks`  
  [1] "sandslash" "parasect"  "machamp"  "golem"   "muk"      
  [6] "cloyster"  "kingler"  "marowak"  "gyarados"  "kabutops"   
 [11] "dragonite" "ariados"  "sudowoodo" "forretress" "steelix"    
 [16] "granbull"  "scizor"   "ursaring"  "donphan"  "tyranitar"   
 [21] "aggron"   "armaldo"  "metagross" "garchomp"  "hippowdon"   
 [26] "drapion"  "rhyperior" "tangrowth" "leafeon"  "gliscor"    
 [31] "mamoswine" "stoutland" "gigalith"  "conkeldurr" "krookodile"  
 [36] "crustle"  "carracosta" "klinklang" "haxorus"  "beartic"    
 [41] "golurk"   "bisharp"  "braviary"   
   
 $`Meat Shield`  
 [1] "chansey"  "wobbuffet"  
   
 $`Bulky Attackers`  
 [1] "snorlax"  "slaking"  "hariyama"  "wailord"  "staraptor"   
 [6] "rampardos" "drifblim"  "excadrill" "darmanitan"  
   

Unfortunately, I'm not really an expert enough trainer to know whether these classifications are accurate. Maybe a more experienced player can give us some feedback on how well this worked. At least a few of the pokemon I'm more familiar with are in the groups I would expect them to be in. To follow up, I think I'll try to break the groups up a bit more. See if I can get some distinction in the larger groups. Might be interesting to see which types fit into these classifications most frequently.

Link to complete script for anyone interested.

Last Night on Earth combat probabilities

Last Night on Earth is a game I really enjoy playing. It's got zombies, a cheesy theme, interesting randomness, deep strategy and cool combat. The combat is complicated enough though, that when playing, I'm not exactly sure the chances of the various outcomes. To help myself out, I'll take a look at the basic combat probabilities and see if I can create a generalized rule to make it easy to consider the situation. For this exercise, I'll ignore the effects of weapons, like the chainsaw, and potentially consider looking at them in the future. I would also like to see the effect "reroll a die" cards might have on the probabilities, but again, in the future maybe.

Combat occurs whenever a zombie and a hero occupy the same space. The base combat rules have the zombie rolling 1 die and the hero 2. If any of the hero's die are greater than any of the zombie's die, the hero "defends" against the zombie. If any of the zombie die are higher than all of the hero's die, then the hero takes a "wound". Finally, if the hero rolls doubles AND any die is higher than the zombie's, the hero "kills" the zombie. The possible outcomes of this type of combat are described with this R function.



To simulate the combat as described above, we simply use fight(1, 2).

 > fight(1, 2)  
 $counts  
 results  
 defend  kill wound   
   110   15   91   
   
 $ratios  
 results  
   defend    kill   wound   
 0.50925926 0.06944444 0.42129630   
   
 >   

So, roughly, a single zombie has a 42% chance to inflict a wound on the hero. The odds are in the hero's favor, but just barely and the hero has a dismal chance to kill the zombie. Now, one of the things that makes the combat so interesting is that player's can add extra die to their rolls using event or weapon cards like Faith. In this example, the hero is now rolling 3 fight die instead of 2 while the zombie is still only rolling 1.

 > fight(1, 3)  
 $counts  
 results  
 defend  kill wound   
   510  345  441   
   
 $ratios  
 results  
   defend   kill   wound   
 0.3935185 0.2662037 0.3402778   
   
 >   

The zombie's chance of inflicting a wound is reduced to 1:3 and the hero gains a much greater chance to kill the zombie. Clearly, the hero getting that extra fight die makes combat a bit more rational for the hero. But what about when the zombie gains an extra die.

 > fight(2, 2)  
 $counts  
 results  
 defend  kill wound   
   450   55  791   
   
 $ratios  
 results  
   defend    kill   wound   
 0.34722222 0.04243827 0.61033951   

The extra die increases a zombie's chances to inflict a wound by 50% over just one die. Big difference. One zombie attacking the player is now very dangerous. How about +1 die for each.

 > fight(2, 3)  
 $counts  
 results  
 defend  kill wound   
  2262  1405  4109   
   
 $ratios  
 results  
   defend   kill   wound   
 0.2908951 0.1806842 0.5284208   
   

Still very much in the zombie's favor. So, essentially that 1 extra die is much more powerful to a zombie than the hero.

Now we have a good idea of the probabilities for simple combat, but let's see about making that generalized rule I mentioned at the start. The question I want to know most is how many zombies should I concentrate on a hero to pressure them. Its rare than the zombie has an extra die, at least without the expansions, but common for the player to have 1 or more extra die. So in this example, I'm going to calculate the outcomes of the zombie(s) inflicting at least one wound to the hero with a varying number of combat die. To do this, I'm going to use this function to calculate probability of at least one wound using a varying number of zombies and varying number of hero combat die.

 > df <- expand.grid(numZombie = 1:6, heroDie = 2:4)  
 > df$woundChance <- apply(df, 1, function(x) { cumProb(1, x[[1]], fight(1, x[[2]])$ratios[[3]])})  
 > head(df)  
  numZombie heroDie woundChance  
 1     1      2  0.4212963  
 2     2      2  0.6651020  
 3     3      2  0.8061933  
 4     4      2  0.8878433  
 5     5      2  0.9350945  
 6     6      2  0.9624390  
 >   

Numbers are sometimes hard to read and that's where charts serve us nicely.



Pretty, but let's make it even easier. A simple linear model should net us a nice easy to remember rule, much like the 2/4 rule in hold em.

 lm <- lm(woundChance ~ numZombie + heroDie, data=df)  
 > summary(lm)  
   
 Call:  
 lm(formula = woundChance ~ numZombie + heroDie, data = df)  
   
 Residuals:  
    Min    1Q  Median    3Q   Max   
 -0.09623 -0.06060 0.01660 0.05243 0.08463   
   
 Coefficients:  
        Estimate Std. Error t value Pr(>|t|)    
 (Intercept) 0.525251  0.065099  8.068 7.75e-07 ***  
 numZombie  0.109486  0.008815 12.421 2.70e-09 ***  
 heroDie -0.066075  0.018437 -3.584 0.00271 **   
 ---  
 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  
   
 Residual standard error: 0.06387 on 15 degrees of freedom  
 Multiple R-squared: 0.9176,     Adjusted R-squared: 0.9067   
 F-statistic: 83.56 on 2 and 15 DF, p-value: 7.379e-09  

After all this, what do we have?
In combat, there is a 50-50 chance of wounding the hero. Each zombie in the fight increases the odds by 10%. Each hero die reduces it by ~6%.

Using this simplified rule, 2 zombies attacking a hero with 2 die has a (50% + (2 * 10%) - (2 * 6%)) 58% chance. Our table above states the chance to be 66%. Looking at the residual plot clearly illustrates the exponential nature of our model, but I want easy math to make the calculations on the fly. Adding +/- 5% will help me to generally consider the odds and while not perfect, it's better than what I had to start with.



Pokemon Type Analysis in R

Lately I've had a fascination with Pokemon. It's a game full of numbers and those numbers call for analysis. As I've been playing White 2 in preparation for the release of X/Y in October, I started to wonder what type pairings of the little monsters would have good resistances. All I really know is that my current Ferrothorn is pretty cool and that Spiritomb supposedly has no weaknesses, but I wanted to know more. Specifically, what type combos has the most resistances and immunities.  Let's see how R might help us figure this out.

Luckily, a site I frequent, shares their pocket monster data for anyone. The data I'm looking for is kept at csv in the pokedex/data/csv folder.  After getting the data, let's load it into R.

  # load the data  
  types <- read.csv("types.csv")  
  type_efficacy <- read.csv("type_efficacy.csv")  

Now let's clean up the data a bit.

 # get rid of some columns we don't care about  
 types <- types[, 1:2]  
 # scale down damage factor  
 type_efficacy$damage_factor <- type_efficacy$damage_factor / 100  

Ok, great.  If we look at our type_efficacy data frame, we have ids for the damage and target types.  Would like to replace those ids with plain text labels that I can understand a little easier.  This would be super easy in sql, but for some reason, seems to be a real pain in R.  Maybe there is a better answer.

 # hacky way to get this to work. need a better solutions.  
 # join from plyr won't work because column names are not the same?  
 te <- merge(type_efficacy, types, by.x = "damage_type_id", by.y="id", all=T)  
 names(te)[[4]] <- "damage_type"  
 te <- merge(te, types, by.x = "target_type_id", by.y="id", all=T)  
 names(te)[[5]] <- "target_type"  
 # get rid of the NA rows and id value columns  
 te <- te[complete.cases(te), c("damage_type", "target_type", "damage_factor")]  

Next we need all pairings of the types.  Turns out not all of these type pairings exist as real pokemon, but we can deal with that later.

 # create all pairings of types  
 tm <- expand.grid(t, t)  
 colnames(tm) <- c("t1", "t2")  
 # get rid of rows with same type  
 tm <- tm[which(tm$t1 != tm$t2 | is.na(tm$t2)) , ]  
 tm <- tm[!is.na(tm$t1), ]  

Great!  But, there is a small problem.  We have both Steel / Grass and Grass / Steel.  The order of the pairings does not matter in calculating the resistances, so let's get rid of those redundant combinations.  This is another place where a better solution is probably available, but this quick function makes it fast and easy to identify these duplicated pairings.  I'll source this function, run it against the type pairings data frame and then get rid of all duplicated pairings.

 # quick function to find combinations that are already in the data frame  
 tagDups <- function(df = tm) {  
  df$dup = FALSE  
  for (i in seq(1, nrow(df))) {  
   v1 = df[1:i, ]$t1 == df[i, ]$t2  
   v2 = df[1:i, ]$t2 == df[i, ]$t1  
     
   if(length(which(v1 & v2)) > 0) {  
    df[i, ]$dup = TRUE  
   }  
  }  
  return(df)  
 }  
   
 # and get rid of those rows  
 tm <- tagDups()  
 tm <- tm[!tm$dup, 1:2]  

Back to the damage calculations, I would like a data frame where I can easily look up the damage factors of the type can be easily calculated.  I'll cast our earlier type_efficacy data frame into a more convenient form.

 # create a damage / target matrix  
 library(reshape2)  
 type_dmg <- dcast(te, damage_type ~ target_type)  
   
 head(type_dmg)  
  damage_type bug dark dragon electric fighting fire flying ghost grass ground ice normal poison psychic rock steel water  
 1     bug 1.0 2.0  1.0   1.0   0.5 0.5  0.5  0.5  2.0   1  1   1  0.5   2.0 1.0  0.5  1.0  
 2    dark 1.0 0.5  1.0   1.0   0.5 1.0  1.0  2.0  1.0   1  1   1  1.0   2.0 1.0  0.5  1.0  
 3   dragon 1.0 1.0  2.0   1.0   1.0 1.0  1.0  1.0  1.0   1  1   1  1.0   1.0 1.0  0.5  1.0  
 4  electric 1.0 1.0  0.5   0.5   1.0 1.0  2.0  1.0  0.5   0  1   1  1.0   1.0 1.0  1.0  2.0  
 5  fighting 0.5 2.0  1.0   1.0   1.0 1.0  0.5  0.0  1.0   1  2   2  0.5   0.5 2.0  2.0  1.0  
 6    fire 2.0 1.0  0.5   1.0   1.0 0.5  1.0  1.0  2.0   1  2   1  1.0   1.0 0.5  2.0  0.5  

Ok, ugly formatting, but whatever.  You get the idea.  Now we can get a type profile simply with the command:

 # now we can get the damage taken profile for a type  
 type_dmg[, "ice"]  

At this point we have a data frame listing all type pairings in tm and a lookup table for a type damage profile in type_dmg.  This should give us enough to do all the analysis now.

First, we need a quick function to create the pairing summaries for us.

 # returns damage taken vector for type pairing  
 calcRes <- function(row, td = type_dmg) {  
  t1 = as.character(row$t1)  
  t2 = as.character(row$t2)  
  if(is.na(row$t2)) {  
   res <- td[, t1]  
  } else {  
   res <- (td[, t1] * td[, t2])  
  }  
    
  immunities <- length(which(res == 0))  
  resistances <- length(which(res > 0 & res < 1))  
  standards <- length(which(res == 1))  
  weakness <- length(which(res == 2))  
  kryptonite <- length(which(res == 4))  
    
  return(c(t1, t2, immunities, resistances, standards, weakness, kryptonite, res))  
 }  

And an object to store our results.

 type_summary <- data.frame(t1 = character(),   
               t2 = character(),  
               immunities = numeric(),  
               resistance = numeric(),  
               standards = numeric(),  
               weaknesses = numeric(),  
               kryptonite = numeric(),  
               bug    = numeric(),  
               dark    = numeric(),  
               dragon   = numeric(),  
               electric  = numeric(),  
               fighting  = numeric(),  
               fire    = numeric(),  
               flying   = numeric(),  
               ghost   = numeric(),  
               grass   = numeric(),  
               ground   = numeric(),  
               ice    = numeric(),  
               normal   = numeric(),  
               poison   = numeric(),  
               psychic  = numeric(),  
               rock    = numeric(),  
               steel   = numeric(),  
               water   = numeric(),  
               stringsAsFactors=FALSE)  

Finally, for each type pairing, let's calculate it's resistances.  I should have probably done this is sapply or something, but for only a few hundred rows, I guess it's ok to do it the "easiest" way I know.

 for(i in seq(1, nrow(tm))) {  
  type_summary[i, ] <- calcRes(tm[i, ])  
 }  

After it's all done, we have a nice data set that looks like this.
CSV
Rda

That was fun.  Now I can look at various type pairings as I plan my team for X/Y in October.  Surely this will make X/Y more fun for me.  Or actually, maybe this was the fun part.  No insight to share this time, maybe in the next post.  Of course, thanks to veekun and his pokedex for the raw data.