## Learning Data Science: Understanding and Using k-means Clustering

A few months ago I published a quite popular post on Clustering the Bible. One well-known clustering algorithm is k-means. If you want to learn how k-means works and how to apply it in a real-world example, read on…

k-means (not to be confused with k-nearest neighbours or KNN: Teach R to read handwritten Digits with just 4 Lines of Code) is a simple, yet often very effective unsupervised learning algorithm for finding similarities in large amounts of data and clustering it accordingly. The hyperparameter k stands for the number of clusters and has to be set beforehand.

The guiding principles are:

• The distance between data points within clusters should be as small as possible.
• The distance between the centroids (= centres of the clusters) should be as big as possible.

Because there are far too many possible assignments of data points to clusters to evaluate them all, k-means follows an iterative approach:

1. Initialization: assign clusters randomly to all data points
2. E-step (for expectation): assign each observation to the “nearest” (based on Euclidean distance) cluster
3. M-step (for maximization): determine new centroids based on the mean of assigned objects
4. Repeat steps 2 and 3 until no further changes occur

As can be seen above, k-means is an example of a so-called expectation-maximization (EM) algorithm.
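As an aside, to get a feeling for why trying all possible cluster assignments (as mentioned above) is hopeless: the number of ways to partition n data points into k non-empty clusters is the Stirling number of the second kind. A small illustrative sketch (the helper function stirling2 is of course not part of the k-means code below):

stirling2 <- function(n, k) {
  # number of ways to partition n objects into k non-empty groups
  j <- 0:k
  sum((-1)^j * choose(k, j) * (k - j)^n) / factorial(k)
}
sapply(c(10, 20, 50, 150), stirling2, k = 3) # grows astronomically fast

Already for 150 points and 3 clusters there are roughly 10^70 possibilities, so an exhaustive search is out of the question.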

To implement k-means in R we first assign some variables and define a helper function for plotting the steps:

n <- 3 # no. of centroids
set.seed(1415) # set seed for reproducibility

M1 <- matrix(round(runif(100, 1, 5), 1), ncol = 2)
M2 <- matrix(round(runif(100, 7, 12), 1), ncol = 2)
M3 <- matrix(round(runif(100, 20, 25), 1), ncol = 2)
M <- rbind(M1, M2, M3)

C <- M[1:n, ] # define centroids as first n objects
obs <- length(M) / 2
A <- sample(1:n, obs, replace = TRUE) # assign objects to centroids at random
colors <- seq(10, 200, 25)

clusterplot <- function(M, C, txt) {
  plot(M, main = txt, xlab = "", ylab = "")
  for(i in 1:n) {
    points(C[i, , drop = FALSE], pch = 23, lwd = 3, col = colors[i])
    points(M[A == i, , drop = FALSE], col = colors[i])
  }
}
clusterplot(M, C, "Initialization")


Here comes the k-means algorithm as described above (the circles are the data points, diamonds are the centroids and the three colours symbolize cluster assignments):

repeat {
  # calculate Euclidean distance between objects and centroids
  D <- matrix(data = NA, nrow = n, ncol = obs)
  for(i in 1:n) {
    for(j in 1:obs) {
      D[i, j] <- sqrt((M[j, 1] - C[i, 1])^2 + (M[j, 2] - C[i, 2])^2)
    }
  }
  O <- A

  ## E-step: parameters are fixed, distributions are optimized
  A <- max.col(t(-D)) # assign objects to centroids
  if(all(O == A)) break # if no change stop
  clusterplot(M, C, "E-step")

  ## M-step: distributions are fixed, parameters are optimized
  # determine new centroids based on mean of assigned objects
  for(i in 1:n) {
    C[i, ] <- apply(M[A == i, , drop = FALSE], 2, mean)
  }
  clusterplot(M, C, "M-step")
}


As can be seen, the clusters wander slowly but surely until all three are stable. We now compare the result with the kmeans function in base R:

cl <- kmeans(M, n)
clusterplot(M, cl$centers, "Base R")

(custom <- C[order(C[ , 1]), ])
##        [,1]   [,2]
## [1,]  3.008  2.740
## [2,]  9.518  9.326
## [3,] 22.754 22.396

(base <- cl$centers[order(cl$centers[ , 1]), ])
##     [,1]   [,2]
## 2  3.008  2.740
## 1  9.518  9.326
## 3 22.754 22.396

round(base - custom, 13)
##   [,1] [,2]
## 2    0    0
## 1    0    0
## 3    0    0

As you can see, the result is the same!

Now, for some real-world application: clustering wholesale customer data. The data set refers to clients of a wholesale distributor. It includes the annual spending on diverse product categories and is from the renowned UCI Machine Learning Repository (I guess the category “Delicassen” should rather be “Delicatessen”). Have a look at the following code:

data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale customers data.csv", header = TRUE)
head(data)
##   Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1       2      3 12669 9656    7561    214             2674       1338
## 2       2      3  7057 9810    9568   1762             3293       1776
## 3       2      3  6353 8808    7684   2405             3516       7844
## 4       1      3 13265 1196    4221   6404              507       1788
## 5       2      3 22615 5410    7198   3915             1777       5185
## 6       2      3  9413 8259    5126    666             1795       1451

set.seed(123)
k <- kmeans(data[ , -c(1, 2)], centers = 4) # remove columns 1 and 2, create 4 clusters
(centers <- k$centers) # display cluster centers
##       Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1  8149.837 18715.857 27756.592 2034.714        12523.020   2282.143
## 2 20598.389  3789.425  5027.274 3993.540         1120.142   1638.398
## 3 48777.375  6607.375  6197.792 9462.792          932.125   4435.333
## 4  5442.969  4120.071  5597.087 2258.157         1989.299   1053.272

round(prop.table(centers, 2) * 100) # percentage of sales per category
##   Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1    10   56      62     11               76         24
## 2    25   11      11     22                7         17
## 3    59   20      14     53                6         47
## 4     7   12      13     13               12         11

table(k$cluster) # number of customers per cluster
##
##   1   2   3   4
##  49 113  24 254

One interpretation could be the following for the four clusters:

1. Big general shops
2. Small food shops
3. Big food shops
4. Small general shops

As you can see, the interpretation of some clusters found by the algorithm can be quite a challenge. If you have a better idea of how to interpret the result please tell me in the comments below!

## Reinforcement Learning: Life is a Maze

It can be argued that most important decisions in life are some variant of an exploitation-exploration problem. Shall I stick with my current job or look for a new one? Shall I stay with my partner or seek a new love? Shall I continue reading the book or watch the movie instead? In all of those cases the question is always whether I should “exploit” the thing I have or whether I should “explore” new things. If you want to learn how to tackle this most basic trade-off, read on…

At its core this can be stated as the problem of a gambler who wants to play one-armed bandits: if there are several machines with different winning probabilities (a so-called multi-armed bandit problem) the question the gambler faces is: which machine to play? He could “exploit” one machine or “explore” different machines. So what is the best strategy given a limited amount of time… and money?

There are two extreme cases: no exploration, i.e. playing only one randomly chosen bandit, or no exploitation, i.e. playing all bandits randomly – so obviously we need some middle ground between those two extremes. We have to start with one randomly chosen bandit, try different ones after that and compare the results. So in the simplest case the first variable is the probability rate with which to switch to a random bandit – or to stick with the best bandit found so far.

Let us create an example with five bandits which return four units on average, except the second one, which returns five units. So the best strategy would obviously be to choose the second bandit right away and stick with it, but of course we don’t know the average returns of each bandit, so this won’t work. Instead we need another vector which tallies the results of each bandit so far. This vector has to be updated after each game, so we need an update function which gets as arguments the current bandit and the return of the game.

The intelligence of the strategy lies in this update function, so how should we go about it? The big idea behind this strategy is called the Bellman equation and in its simplest form it works as follows: the adjustment of the former result vector is the difference between the former result and the current result, weighted by some discount factor, in this case the inverse of the number of games played on the respective machine. This learning strategy is called Q-learning and is a so-called reinforcement learning technique.
Have a look at the following example implementation:

set.seed(3141) # for reproducibility

# Q-learning update function
update <- function(i, r) {
  Q[i] <<- Q[i] + 1/(k[i]+1) * (r-Q[i]) # Q-learning function
  k[i] <<- k[i] + 1 # one more game played on i'th bandit
}

# simulate game on one-armed bandit i
ret <- function(i) {
  round(rnorm(1, mean = rets[i]))
}

# choose which bandit to play
which.bandit <- function() {
  p <- runif(1)
  ifelse(p >= epsilon, which.max(Q), sample(1:n, 1))
}

epsilon <- 0.1 # switch to a random bandit in a fraction epsilon of cases
rets <- c(4, 5, 4, 4, 4) # average returns of bandits
n <- length(rets)
Q <- rep(0, n) # initialize return vector
k <- rep(0, n) # initialize vector for games played on each bandit
N <- 1000 # no. of runs
R <- 0 # sum of returns

for (j in 1:N) {
  i <- which.bandit() # choose bandit
  r <- ret(i) # simulate bandit
  R <- R + r # add return of bandit to overall sum of returns
  update(i, r) # call Q-learning update function
}

which.max(Q) # which bandit has the highest return?
## [1] 2

Q
## [1] 4.000000 5.040481 4.090909 4.214286 3.611111

k
## [1]  32 914  22  14  18

N * max(rets) # theoretical max. return
## [1] 5000

R
## [1] 4949

R / (N * max(rets)) # percent reached of theoretical max
## [1] 0.9898

So, obviously the algorithm found a nearly perfect strategy all on its own!

Now, this is the simplest possible application of reinforcement learning. Let us now implement a more sophisticated example: a robot navigating a maze. Whereas the difficulty in the first example was that the feedback was blurred (because the return of each one-armed bandit is only an average return), here we only get definitive feedback after several steps (when the robot reaches its goal). Because this situation is more complicated we need more memory to store the intermediate results. In our multi-armed bandit example the memory was a vector, here we will need a matrix.

The robot will try to reach the goal in the following maze (i.e.
to get out of the maze to reach F, which can be done via B or E) and find the best strategy for each room it is placed in:

Have a look at the code (it is based on the Matlab code from the same tutorial the picture is from, which is why the variables and functions are named the same way for consistency):

# find all possible actions
AllActions <- function(state, R) {
  which(R[state, ] >= 0)
}

# choose one action out of all possible actions by chance
PossibleAction <- function(state, R) {
  sample(AllActions(state, R), 1)
}

# Q-learning function
UpdateQ <- function(state, Q, R, gamma, goalstate) {
  action <- PossibleAction(state, R)
  Q[state, action] <- R[state, action] + gamma * max(Q[action, AllActions(action, R)]) # Bellman equation (learning rate implicitly = 1)
  if(action != goalstate) Q <- UpdateQ(action, Q, R, gamma, goalstate)
  Q
}

# recursive function to get the action with the maximum Q value
MaxQ <- function(state, Q, goalstate) {
  action <- which.max(Q[state[length(state)], ])
  if (action != goalstate) action <- c(action, MaxQ(action, Q, goalstate))
  action
}

# representation of the maze
R <- matrix(c(-Inf, -Inf, -Inf, -Inf,    0, -Inf,
              -Inf, -Inf, -Inf,    0, -Inf,  100,
              -Inf, -Inf, -Inf,    0, -Inf, -Inf,
              -Inf,    0,    0, -Inf,    0, -Inf,
                 0, -Inf, -Inf,    0, -Inf,  100,
              -Inf,    0, -Inf, -Inf,    0,  100), ncol = 6, byrow = TRUE)
colnames(R) <- rownames(R) <- LETTERS[1:6]
R
##      A    B    C    D    E    F
## A -Inf -Inf -Inf -Inf    0 -Inf
## B -Inf -Inf -Inf    0 -Inf  100
## C -Inf -Inf -Inf    0 -Inf -Inf
## D -Inf    0    0 -Inf    0 -Inf
## E    0 -Inf -Inf    0 -Inf  100
## F -Inf    0 -Inf -Inf    0  100

Q <- matrix(0, nrow = nrow(R), ncol = ncol(R))
colnames(Q) <- rownames(Q) <- LETTERS[1:6]
Q
##   A B C D E F
## A 0 0 0 0 0 0
## B 0 0 0 0 0 0
## C 0 0 0 0 0 0
## D 0 0 0 0 0 0
## E 0 0 0 0 0 0
## F 0 0 0 0 0 0

gamma <- 0.8 # discount factor
goalstate <- 6
N <- 50000 # no. of episodes

for (episode in 1:N) {
  state <- sample(1:goalstate, 1)
  Q <- UpdateQ(state, Q, R, gamma, goalstate)
}

Q
##      A    B    C    D   E   F
## A -Inf -Inf -Inf -Inf 400   0
## B    0    0    0  320   0 500
## C -Inf -Inf -Inf  320   0   0
## D    0  400  256    0 400   0
## E  320    0    0  320   0 500
## F    0  400    0    0 400 500

Q / max(Q) * 100
##      A    B    C    D  E   F
## A -Inf -Inf -Inf -Inf 80   0
## B    0    0  0.0   64  0 100
## C -Inf -Inf -Inf   64  0   0
## D    0   80 51.2    0 80   0
## E   64    0  0.0   64  0 100
## F    0   80  0.0    0 80 100

# print all learned routes for all rooms
for (i in 1:goalstate) {
  cat(LETTERS[i], LETTERS[MaxQ(i, Q, goalstate)], sep = " -> ")
  cat("\n")
}
## A -> E -> F
## B -> F
## C -> D -> B -> F
## D -> B -> F
## E -> F
## F -> F

So again, the algorithm has found the best route for each room!

Recently the combination of Neural Networks (see also Understanding the Magic of Neural Networks) and Reinforcement Learning has become quite popular. For example AlphaGo, the machine from Google that defeated a Go world champion for the first time in history, is based on this powerful combination!

## Teach R to read handwritten Digits with just 4 Lines of Code

What is the best way for me to find out whether you are rich or poor, when the only thing I know is your address? Looking at your neighbourhood! That is the big idea behind the k-nearest neighbours (or KNN) algorithm, where k stands for the number of neighbours to look at. The idea couldn’t be any simpler yet the results are often very impressive indeed – so read on…

Let us take a task that is very hard to code, like identifying handwritten numbers.
We will be using the Semeion Handwritten Digit Data Set from the UCI Machine Learning Repository and will separate training and test sets for the upcoming task as a first step:

# helper function for plotting images of digits in a nice way + returning the respective number
plot_digit <- function(digit) {
  M <- matrix(as.numeric(digit[1:256]), nrow = 16, ncol = 16, byrow = TRUE)
  image(t(M[nrow(M):1, ]), col = c(0, 1), xaxt = "n", yaxt = "n", useRaster = TRUE)
  digit[257]
}

# load data and choose some digits as examples
semeion <- read.table("data/semeion.data", quote = "\"", comment.char = "") # put in right path here!
digit_data <- semeion[ , 1:256]
which_digit <- apply(semeion[ , 257:266], 1, function(x) which.max(x) - 1)
semeion_new <- cbind(digit_data, which_digit)

# choose training and test set by chance
set.seed(123) # for reproducibility
data <- semeion_new
random <- sample(1:nrow(data), 0.8 * nrow(data)) # 80%: training data, 20%: test data
train <- data[random, ]
test <- data[-random, ]

# plot example digits
old_par <- par(mfrow = c(4, 6), oma = c(5, 4, 0, 0) + 0.1, mar = c(0, 0, 1, 1) + 0.1)
matrix(apply(train[1:24, ], 1, plot_digit), 4, 6, byrow = TRUE)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    3    1    2    5    7    3
## [2,]    1    5    1    6    7    6
## [3,]    6    2    8    5    9    3
## [4,]    5    7    5    7    5    9

par(old_par)

As you can see, teaching a computer to read those digits is a task which would take considerable effort and easily hundreds of lines of code. You would have to intelligently identify different regions in the images and find some boundaries to try to identify which number is being shown. You could expect to do a lot of tweaking before you would get acceptable results.

The real magic behind machine learning and artificial intelligence is that when something is too complicated to code, you let the machine program itself by just showing it lots of examples (see also my post So, what is AI really?). We will do just that with the nearest neighbour algorithm.

When talking about neighbours it is already implied that we need some kind of distance metric to define what constitutes a neighbour. As in real life the simplest one is the so-called Euclidean distance, which is just how far different points are apart from each other as the crow flies. The simple formula that is used for this is just the good old Pythagorean theorem (in this case in a vectorized way) – you can see what maths at school was good for after all:

dist_eucl <- function(x1, x2) {
  sqrt(sum((x1 - x2) ^ 2)) # Pythagorean theorem!
}

The k-nearest neighbours algorithm is pretty straightforward: it just compares the digit which is to be identified with all other digits and chooses the k nearest ones. In case the k nearest ones don’t come up with the same answer the majority vote (or mathematically the mode) is taken:

mode <- function(NNs) {
  names(sort(-table(NNs[ncol(NNs)])))[1] # mode = majority vote
}

knn <- function(train, test, k = 5) {
  dist_sort <- order(apply(train[-ncol(train)], 1, function(x) dist_eucl(as.numeric(x), x2 = as.numeric(test[-ncol(test)]))))
  mode(train[dist_sort[1:k], ])
}

So, the algorithm itself comprises barely 4 lines of code!
Now, let us see how it performs on this complicated task with k = 9 out of sample (first a few examples are shown and after that we have a look at the overall performance):

# show a few examples
set.seed(123) # for reproducibility
no_examples <- 24
examples <- sample(dim(test)[1], no_examples)
old_par <- par(mfrow = c(4, 6), oma = c(5, 4, 0, 0) + 0.1, mar = c(0, 0, 1, 1) + 0.1)
matrix(apply(test[examples, ], 1, plot_digit), 4, 6, byrow = TRUE)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    4    1    1    5    7    3
## [2,]    0    5    1    4    7    3
## [3,]    6    2    7    4    0    2
## [4,]    5    5    3    6    3    7

par(old_par)

prediction <- integer(no_examples)
for (i in 1:no_examples) {
  prediction[i] <- knn(train, test[examples[i], ], k = 9)
}
print(matrix(prediction, 4, 6, byrow = TRUE), quote = FALSE, right = TRUE)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    4    1    1    5    7    3
## [2,]    0    5    1    4    7    3
## [3,]    6    2    7    4    0    2
## [4,]    5    5    3    6    3    7

# now for the overall accuracy
library(OneR) # just for eval_model function to evaluate the model's accuracy
prediction <- integer(nrow(test))
ptm <- proc.time()
for (i in 1:nrow(test)) {
  prediction[i] <- knn(train, test[i, ], k = 9)
}
proc.time() - ptm
##    user  system elapsed
##   26.74    0.82   27.59

eval_model(prediction, test[ncol(test)], zero.print = ".")
##
## Confusion matrix (absolute):
##           Actual
## Prediction   0   1   2   3   4   5   6   7   8   9 Sum
##        0    34   .   .   .   .   .   1   .   .   .  35
##        1     .  36   1   .   2   .   .   1   .   1  41
##        2     .   .  36   .   .   .   .   .   1   1  38
##        3     .   1   .  32   .   .   .   .   .   2  35
##        4     .   .   .   .  29   .   .   .   .   .  29
##        5     .   .   .   .   .  35   2   .   1   .  38
##        6     .   .   .   .   .   1  23   .   .   .  24
##        7     .   .   .   .   .   .   .  22   .   1  23
##        8     .   .   .   .   .   .   .   .  31   .  31
##        9     .   .   .   .   .   .   .   .   2  23  25
##        Sum  34  37  37  32  31  36  26  23  35  28 319
##
## Confusion matrix (relative):
##           Actual
## Prediction    0    1    2    3    4    5    6    7    8    9  Sum
##        0   0.11    .    .    .    .    .    .    .    .    . 0.11
##        1      . 0.11    .    . 0.01    .    .    .    .    . 0.13
##        2      .    . 0.11    .    .    .    .    .    .    . 0.12
##        3      .    .    . 0.10    .    .    .    .    . 0.01 0.11
##        4      .    .    .    . 0.09    .    .    .    .    . 0.09
##        5      .    .    .    .    . 0.11 0.01    .    .    . 0.12
##        6      .    .    .    .    .    . 0.07    .    .    . 0.08
##        7      .    .    .    .    .    .    . 0.07    .    . 0.07
##        8      .    .    .    .    .    .    .    . 0.10    . 0.10
##        9      .    .    .    .    .    .    .    . 0.01 0.07 0.08
##        Sum 0.11 0.12 0.12 0.10 0.10 0.11 0.08 0.07 0.11 0.09 1.00
##
## Accuracy:
## 0.9436 (301/319)
##
## Error rate:
## 0.0564 (18/319)
##
## Error rate reduction (vs. base rate):
## 0.9362 (p-value < 2.2e-16)

Wow, it achieves an accuracy of nearly 95% out of the box, while some of the digits are really hard to read even for humans! And we haven’t even given it the information that those images are two-dimensional, because we coded all the images simply as (one-dimensional) binary numbers.

To get an idea of where it failed have a look at the digits that were misclassified:

# show misclassified digits
err <- which(as.integer(prediction) != unlist(test[ncol(test)]))
old_par <- par(mfrow = c(3, 6), oma = c(5, 4, 0, 0) + 0.1, mar = c(0, 0, 1, 1) + 0.1)
matrix(apply(test[err, ], 1, plot_digit), 3, 6, byrow = TRUE)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    6    9    8    9    5
## [2,]    6    6    7    4    8    8
## [3,]    9    9    1    4    8    9

par(old_par)

# show what was predicted
print(matrix(prediction[err], 3, 6, byrow = TRUE), quote = FALSE, right = TRUE)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    5    1    9    3    6
## [2,]    5    0    1    1    9    5
## [3,]    3    7    3    1    2    2

Most of us would have difficulties reading at least some of those digits too, e.g. the third digit in the first row is supposed to be a 9, yet it could also be a distorted 1 – same with the first digit in the last row: some people would read a 3 (like our little program) or nothing at all really, but it is supposed to be a 9.
So even the mistakes the system makes are understandable. Sometimes the simplest methods are – perhaps not the best, but – very effective indeed; you should keep that in mind!

## Causation doesn’t imply Correlation either

You may have misread the title as the old correlation does not imply causation mantra, but the opposite is also true! If you don’t believe me, read on…

First I want to provide you with some intuition on what correlation is really all about! For many people (and many of my students for sure) the implications of the following formula for the correlation coefficient of two variables $x$ and $y$ are not immediately clear:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x\, s_y}$$

In fact the most interesting part is this: $(x_i - \bar{x})(y_i - \bar{y})$. We see a product of two differences. The differences consist of the data points minus the respective means (average values): in effect this leads to the origin being moved to the means of both variables (as if you moved the crosshair right into the centre of all data points).

There are now four possible quadrants for every data point: top or bottom, left or right. Top right means that both differences are positive, so the result will be positive too. The same is true for the bottom left quadrant because minus times minus equals plus (it often boils down to simple school maths)! The other two quadrants give negative results because minus times plus and plus times minus equals minus.

After that we sum over all products and normalize them by dividing by the respective standard deviations $s_x$ and $s_y$ (how much the data are spread out), so that we will only get values between $-1$ and $+1$.

Let us see this in action with an example. First we define a helper function for visualizing this intuition:

cor.plot <- function(data) {
  x_mean <- mean(data[ , 1])
  y_mean <- mean(data[ , 2])
  plot(data, type = "n") # plot invisibly
  limits = par()$usr # limits of plot
  rect(x_mean, y_mean, limits[2], limits[4], col = "lightgreen")
  rect(x_mean, y_mean, limits[1], limits[4], col = "orangered")
  rect(x_mean, y_mean, limits[1], limits[3], col = "lightgreen")
  rect(x_mean, y_mean, limits[2], limits[3], col = "orangered")
  points(data, pch = 16) # plot scatterplot on top
  colnames(data) <- c("x", "y") # rename cols instead of dynamic variable names in lm
  abline(lm(y ~ x, data), lwd = 2) # add regression line
  title(paste("cor =", round(cor(data[1], data[2]), 2))) # add cor as title
}


Now for the actual example (in fact the same example we had in this post: Learning Data Science: Modelling Basics):

age <- c(21, 46, 55, 35, 28)
income <- c(1850, 2500, 2560, 2230, 1800)
data <- data.frame(age, income)
plot(data, pch = 16)


cor.plot(data)


The correlation is very high because most of the data points are in the positive (green) quadrants and the data is close to its regression line (linear regression and correlation are closely related mathematically).
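To connect the plot back to the formula above, we can also compute the correlation “by hand” and compare it with R’s built-in cor function (just an illustrative sketch using the data frame defined above):

x <- data$age
y <- data$income
sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y)) # manual calculation
cor(x, y) # should give the same value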

Now, let us get to the actual topic of this post: Causation doesn’t imply Correlation either. What could be “more causal” than a parabolic shot? When you shoot a projectile without air resistance the trajectory will form a perfect parabola! This is in fact rocket science!

Let us simulate such a shot and calculate the correlation between time and altitude, two variables that are perfectly causally dependent:

t <- c(-30:30)
x <- -t^2
data <- data.frame(t, x)
plot(data, pch = 16)


cor.plot(data)


The correlation is exactly zero, zip, nada! And it is clear why: the data points in the positive and in the negative quadrants cancel each other out completely because of the perfect symmetry!

This leads us to the following very important insight:

Correlation is a measure of linear dependence (and linear dependence only!).

Even a strong causal relationship can be overlooked by correlation because of its non-linear nature (as in this case with the quadratic relationship). The following example conveys the same idea in a somewhat more humorous manner – it is the by now infamous datasaurus:

library(datasauRus) # on CRAN
dino <- datasaurus_dozen[datasaurus_dozen$dataset == "dino", 2:3]
plot(dino, pch = 16, cex = 2)

cor.plot(dino)

As with the above example we can clearly see why the correlation is so low, although there is a whole dinosaur hiding in your data…

The lesson is that you should never just blindly trust statistical measures on their own; always visualize your data when possible: there might be some real beauties hiding inside your data, waiting to be discovered…

## Understanding AdaBoost – or how to turn Weakness into Strength

Many of you might have heard of the concept “Wisdom of the Crowd”: when many people independently guess some quantity, e.g. the number of marbles in a glass jar, the average of their guesses is often pretty accurate – even though many of the guesses are totally off. The same principle is at work in so-called ensemble methods, like bagging and boosting. If you want to know more about boosting and how to turn the pseudocode of a scientific paper into valid R code, read on…

We start from an original paper by one of the authors of the first practical boosting algorithm, i.e. AdaBoost: Robert E. Schapire: Explaining AdaBoost. The first sentence of the introduction gives the big idea:

Boosting is an approach to machine learning based on the idea of creating a highly accurate prediction rule by combining many relatively weak and inaccurate rules.

The second page gives the pseudocode of AdaBoost…:

Given: $(x_1, y_1), \dots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$.
Initialize: $D_1(i) = 1/m$ for $i = 1, \dots, m$.
For $t = 1, \dots, T$:

• Train weak learner using distribution $D_t$.
• Get weak hypothesis $h_t: X \to \{-1, +1\}$.
• Aim: select $h_t$ with low weighted error: $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.
• Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.
• Update, for $i = 1, \dots, m$: $D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}$, where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).

Output the final hypothesis: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

… with some explanation:

[…] we are given $m$ labeled training examples $(x_1, y_1), \dots, (x_m, y_m)$ where the $x_i$ are in some domain $X$, and the labels $y_i \in \{-1, +1\}$. On each round $t = 1, \dots, T$, a distribution $D_t$ is computed as in the figure over the $m$ training examples, and a given weak learner or weak learning algorithm is applied to find a weak hypothesis $h_t: X \to \{-1, +1\}$, where the aim of the weak learner is to find a weak hypothesis with low weighted error $\epsilon_t$ relative to $D_t$. The final or combined hypothesis $H$ computes the sign of a weighted combination of weak hypotheses, $F(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$. This is equivalent to saying that $H$ is computed as a weighted majority vote of the weak hypotheses $h_t$ where each is assigned weight $\alpha_t$. ([…] we use the terms “hypothesis” and “classifier” interchangeably.)

So, AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous ones.

But to really understand what is going on, my approach has always been that you haven’t really understood something until you have built it yourself… Perhaps you might want to try to translate the pseudocode into R code before reading on… (To increase your motivation I frankly admit that I also had some errors in my first implementation… which provides a good example of how strong the R community is, because I posted it on Stack Overflow and got a perfect answer two hours later: What is wrong with my implementation of AdaBoost?)
Anyway, here is my implementation (the data can be found here: http://freakonometrics.free.fr/myocarde.csv):

library(rpart)
library(OneR)

maxdepth <- 1
T <- 100 # number of rounds

# Given: (x_1, y_1),...,(x_m, y_m) where x_i element of X, y_i element of {-1, +1}
myocarde <- read.table("data/myocarde.csv", header = TRUE, sep = ";")
y <- (myocarde[ , "PRONO"] == "SURVIE") * 2 - 1
x <- myocarde[ , 1:7]
m <- nrow(x)
data <- data.frame(x, y)

# Initialize: D_1(i) = 1/m for i = 1,...,m
D <- rep(1/m, m)

H <- replicate(T, list())
a <- vector(mode = "numeric", T)
set.seed(123)

# For t = 1,...,T
for(t in 1:T) {
  # Train weak learner using distribution D_t
  # Get weak hypothesis h_t: X -> {-1, +1}
  H[[t]] <- rpart(y ~., data = data, weights = D, maxdepth = maxdepth, method = "class")
  # Aim: select h_t with low weighted error: e_t = Pr_i~D_t[h_t(x_i) != y_i]
  h <- predict(H[[t]], x, type = "class")
  e <- sum((h != y) * D)
  # Choose a_t = 0.5 * log((1-e) / e)
  a[t] <- 0.5 * log((1-e) / e)
  # Update for i = 1,...,m: D_t+1(i) = (D_t(i) * exp(-a_t * y_i * h_t(x_i))) / Z_t
  # where Z_t is a normalization factor (chosen so that D_t+1 will be a distribution)
  D <- D * exp(-a[t] * y * as.numeric(as.character(h)))
  D <- D / sum(D)
}

# Output the final hypothesis: H(x) = sign(sum of a_t * h_t(x) for t=1 to T)
newdata <- x
H_x <- sapply(H, function(x) as.numeric(as.character(predict(x, newdata = newdata, type = "class"))))
H_x <- t(a * t(H_x))
pred <- sign(rowSums(H_x))

eval_model(pred, y)
##
## Confusion matrix (absolute):
##           Actual
## Prediction -1  1 Sum
##        -1  29  0  29
##        1    0 42  42
##        Sum 29 42  71
##
## Confusion matrix (relative):
##           Actual
## Prediction   -1    1  Sum
##        -1  0.41 0.00 0.41
##        1   0.00 0.59 0.59
##        Sum 0.41 0.59 1.00
##
## Accuracy:
## 1 (71/71)
##
## Error rate:
## 0 (0/71)
##
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)

Let’s compare this with the result from the package JOUSBoost (on CRAN):

library(JOUSBoost)
## JOUSBoost 2.1.0

boost <- adaboost(as.matrix(x), y, tree_depth = maxdepth, n_rounds = T)
pred <- predict(boost, x)
eval_model(pred, y)
##
## Confusion matrix (absolute):
##           Actual
## Prediction -1  1 Sum
##        -1  29  0  29
##        1    0 42  42
##        Sum 29 42  71
##
## Confusion matrix (relative):
##           Actual
## Prediction   -1    1  Sum
##        -1  0.41 0.00 0.41
##        1   0.00 0.59 0.59
##        Sum 0.41 0.59 1.00
##
## Accuracy:
## 1 (71/71)
##
## Error rate:
## 0 (0/71)
##
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)

As you can see: zero errors, as with my implementation.

Two additional remarks are in order: an accuracy of 100% hints at one of the problems of boosting: it is prone to overfitting (see also Learning Data Science: Modelling Basics). The second problem is the lack of interpretability: whereas decision trees are normally well interpretable, ensembles of them are not. This is also known under the name Accuracy-Interpretability Trade-Off (another often used ensemble method is random forests, see also here: Learning Data Science: Predicting Income Brackets).

Hope that this post was helpful for you to understand the widely used boosting methodology better and to see how you can get from pseudocode to valid R code. If you have any questions or feedback please let me know in the comments – thank you and stay tuned!

## Learning R: The Ultimate Introduction (incl. Machine Learning!)

There are a million reasons to learn R (see e.g. Why R for Data Science – and not Python?), but where to start? I present to you the ultimate introduction to bring you up to speed!
So read on…

I call it ultimate because it is the essence of many years of teaching R… or put differently: it is the kind of introduction I would have liked to have when I started out with R back in the days!

A word of warning though: this is an introduction to R and not to statistics, so I won’t explain the statistical terms used here. You do not need to know any other programming language, but knowing one does no harm either. Ok, now let us start!

First you need to install R (https://www.r-project.org) and preferably RStudio as a Graphical User Interface (GUI): https://www.rstudio.com/products/RStudio/#Desktop. Both are free and available for all common operating systems. To get a quick overview of RStudio watch this video:

You can either type in the following commands in the console or open a new script tab (File -> New File -> R Script) and run the commands by pressing Ctrl + Enter/Return after having typed them.

First of all, R is a very good calculator:

2 + 2
## [1] 4

sin(0.5)
## [1] 0.4794255

abs(-10) # absolute value
## [1] 10

pi
## [1] 3.141593

exp(1) # e
## [1] 2.718282

factorial(6)
## [1] 720

By the way: the hash is used for comments, everything after it will be ignored!

Of course you can define variables and use them in your calculations:

n1 <- 2
n2 <- 3
n1 # show content of variable by just typing the name
## [1] 2

n1 + n2
## [1] 5

n1 * n2
## [1] 6

n1^n2
## [1] 8

Part of R’s power stems from the fact that functions can handle several numbers at once, called vectors, and do calculations on them. When calling a function, arguments are passed with round brackets:

n3 <- c(12, 5, 27) # concatenate (combine) elements into a vector
n3
## [1] 12  5 27

min(n3)
## [1] 5

max(n3)
## [1] 27

sum(n3)
## [1] 44

mean(n3)
## [1] 14.66667

sd(n3) # standard deviation
## [1] 11.23981

var(n3) # variance
## [1] 126.3333

median(n3)
## [1] 12

n3 / 12
## [1] 1.0000000 0.4166667 2.2500000

In the last example the 12 was recycled three times. R always tries to do that (when feasible), sometimes giving a warning when it might not be intended:

n3 / c(1, 2)
## Warning in n3/c(1, 2): longer object length is not a multiple of shorter
## object length
## [1] 12.0  2.5 27.0

In case you only want parts of your vectors you can apply subsetting with square brackets:

n3[1]
## [1] 12

n3[c(2, 3)]
## [1]  5 27

Ranges can easily be created with the colon:

n4 <- 10:20
n4
##  [1] 10 11 12 13 14 15 16 17 18 19 20

When you test whether this vector is bigger than a certain number you will get logicals as a result. You can use those logicals for subsetting:

n4 > 15
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

n4[n4 > 15]
## [1] 16 17 18 19 20

Perhaps you have heard the story of little Gauss, whose teacher gave him the task of adding all numbers from 1 to 100 to keep him busy for a while? Well, he found a mathematical trick to add them within seconds… we normal people can just use R:

sum(1:100)
## [1] 5050

When we want to use some code several times we can define our own function (a user-defined function). We do that the same way we create a vector (or any other data structure) because R is a so-called functional programming language and functions are so-called first-class citizens (i.e. on the same level as other data structures like vectors). The code that is being executed is put in curly brackets:

gauss <- function(x) {
  sum(1:x)
}
gauss(100)
## [1] 5050

gauss(1000)
## [1] 500500

Of course we also have other data types, e.g.
matrices are basically two-dimensional vectors:

M <- matrix(1:12, nrow = 3, byrow = TRUE) # create a matrix
M
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

dim(M)
## [1] 3 4

Subsetting now has to provide two numbers, the first for the row, the second for the column (like in the game Battleship). If you leave one out, all data of the respective dimension will be shown:

M[2, 3]
## [1] 7

M[ , c(1, 3)]
##      [,1] [,2]
## [1,]    1    3
## [2,]    5    7
## [3,]    9   11

Another possibility to create matrices:

v1 <- 1:4
v2 <- 4:1
M1 <- rbind(v1, v2) # row bind
M1
##    [,1] [,2] [,3] [,4]
## v1    1    2    3    4
## v2    4    3    2    1

M2 <- cbind(v1, v2) # column bind
M2
##      v1 v2
## [1,]  1  4
## [2,]  2  3
## [3,]  3  2
## [4,]  4  1

Naming rows, here with the built-in constant LETTERS:

rownames(M2) <- LETTERS[1:4]
M2
##   v1 v2
## A  1  4
## B  2  3
## C  3  2
## D  4  1

LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

When some result is Not Available:

LETTERS[50]
## [1] NA

Getting the structure of your variables:

str(LETTERS)
##  chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" ...

str(M2)
##  int [1:4, 1:2] 1 2 3 4 4 3 2 1
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:4] "A" "B" "C" "D"
##   ..$ : chr [1:2] "v1" "v2"

Another famous dataset (iris) that is also built into base R (to get help on any function or dataset just put the cursor in it and press F1):

iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## [all 150 rows are printed to the console…]
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Oops, that is a bit long… if you only want to show the first or last rows do the following:

head(iris) # first 6 rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(iris, 10) # last 10 rows
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 141          6.7         3.1          5.6         2.4 virginica
## 142          6.9         3.1          5.1         2.3 virginica
## 143          5.8         2.7          5.1         1.9 virginica
## 144          6.8         3.2          5.9         2.3 virginica
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

Iris is a so-called data frame, the workhorse of R and data science (you will see how to create one below):

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ##$ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ##$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


As you can see, data frames can combine different data types. If you try to do that with e.g. vectors, which can only hold one data type, something called coercion happens, i.e. at least one data type is forced to become another one so that consistency is maintained:

str(c(2, "Hello")) # 2 is coerced to become a character string too
##  chr [1:2] "2" "Hello"


You can get a fast overview of your data like so:

summary(iris[1:4])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

boxplot(iris[1:4])


As you have seen, R often runs a function on all of the data simultaneously. This feature is called vectorization and in many other languages you would need a loop for that. In R you don’t use loops that often, but of course they are available:

for (i in seq(5)) {
  print(1:i)
}
## [1] 1
## [1] 1 2
## [1] 1 2 3
## [1] 1 2 3 4
## [1] 1 2 3 4 5
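To make the vectorization point from above concrete, here is a small contrast between an explicit loop and the vectorized equivalent (just an illustrative sketch):

squares <- numeric(5)
for (i in 1:5) {
  squares[i] <- i^2 # loop: one element at a time
}
squares
## [1]  1  4  9 16 25

(1:5)^2 # vectorized: the whole vector at once
## [1]  1  4  9 16 25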


Speaking of control structures: of course conditional statements are available too:

even <- function(x) ifelse(x %% 2 == 0, TRUE, FALSE) # %% gives remainder of division (= modulo operator)
even(1:5)
## [1] FALSE  TRUE FALSE  TRUE FALSE
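ifelse() works on whole vectors; for a single condition there is also the classic if/else statement (a minimal sketch):

x <- 7
if (x %% 2 == 0) {
  print("even")
} else {
  print("odd")
}
## [1] "odd"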


Linear modelling (e.g. correlation and linear regression) couldn’t be any easier, it is included in the core language:

age <- c(21, 46, 55, 35, 28)
income <- c(1850, 2500, 2560, 2230, 1800)
df <- data.frame(age, income) # create a data frame
df
##   age income
## 1  21   1850
## 2  46   2500
## 3  55   2560
## 4  35   2230
## 5  28   1800

cor(df) # correlation
##              age    income
## age    1.0000000 0.9464183
## income 0.9464183 1.0000000

LinReg <- lm(income ~ age, data = df) # income as a linear model of age
LinReg
##
## Call:
## lm(formula = income ~ age, data = df)
##
## Coefficients:
## (Intercept)          age
##     1279.37        24.56

summary(LinReg)
##
## Call:
## lm(formula = income ~ age, data = df)
##
## Residuals:
##       1       2       3       4       5
##   54.92   90.98  -70.04   91.12 -166.98
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1279.367    188.510   6.787  0.00654 **
## age           24.558      4.838   5.076  0.01477 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 132.1 on 3 degrees of freedom
## Multiple R-squared:  0.8957, Adjusted R-squared:  0.8609
## F-statistic: 25.77 on 1 and 3 DF,  p-value: 0.01477

plot(df, pch = 16, main = "Linear model")
abline(LinReg, col = "blue", lwd = 2) # adding the regression line


You could directly use the model to make predictions:

pred_LinReg <- predict(LinReg, data.frame(age = seq(15, 70, 5)))
names(pred_LinReg) <- seq(15, 70, 5)
round(pred_LinReg, 2)
##      15      20      25      30      35      40      45      50      55
## 1647.73 1770.52 1893.31 2016.10 2138.88 2261.67 2384.46 2507.25 2630.04
##      60      65      70
## 2752.83 2875.61 2998.40


If you want to know more about the modelling process you can find it here: Learning Data Science: Modelling Basics

Another strength of R is the huge number of add-on packages for all kinds of specialized tasks. For the grand finale of this introduction, we are going to get a little taste of machine learning. For that we install the OneR package from CRAN (the official package repository of R): Tools -> Install packages… -> type in “OneR” -> click “Install”.
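If you prefer the console over the RStudio menu, the same can be achieved with a single command:

install.packages("OneR")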

After that we build a simple model on the iris dataset to predict the Species column:

library(OneR) # load package
data <- optbin(Species ~., data = iris) # find optimal bins for numeric predictors
model <- OneR(data, verbose = TRUE) # build actual model
##
##     Attribute    Accuracy
## 1 * Petal.Width  96%
## 2   Petal.Length 95.33%
## 3   Sepal.Length 74.67%
## 4   Sepal.Width  55.33%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model) # show rules
##
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
##
## Rules:
## If Petal.Width = (0.0976,0.791] then Species = setosa
## If Petal.Width = (0.791,1.63]   then Species = versicolor
## If Petal.Width = (1.63,2.5]     then Species = virginica
##
## Accuracy:
## 144 of 150 instances classified correctly (96%)
##
## Contingency table:
##             Petal.Width
## Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
##   setosa               * 50            0          0  50
##   versicolor              0         * 48          2  50
##   virginica               0            4       * 46  50
##   Sum                    50           52         48 150
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 266.35, df = 4, p-value < 2.2e-16

plot(model)


We’ll now see how well the model is doing:

prediction <- predict(model, data)
eval_model(prediction, data)
##
## Confusion matrix (absolute):
##             Actual
## Prediction   setosa versicolor virginica Sum
##   setosa         50          0         0  50
##   versicolor      0         48         4  52
##   virginica       0          2        46  48
##   Sum            50         50        50 150
##
## Confusion matrix (relative):
##             Actual
## Prediction   setosa versicolor virginica  Sum
##   setosa       0.33       0.00      0.00 0.33
##   versicolor   0.00       0.32      0.03 0.35
##   virginica    0.00       0.01      0.31 0.32
##   Sum          0.33       0.33      0.33 1.00
##
## Accuracy:
## 0.96 (144/150)
##
## Error rate:
## 0.04 (6/150)
##
## Error rate reduction (vs. base rate):
## 0.94 (p-value < 2.2e-16)


96% accuracy is not too bad, even for this simple dataset!

If you want to know more about the OneR package you can read the vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models.

Well, and that’s it for the ultimate introduction to R – hopefully you liked it and you learned something! Please share your first experiences with R in the comments and also if you miss something (I might add it in the future!) – Thank you for reading and stay tuned for more to come!

## Google’s Eigenvector… or how a Random Surfer finds the most relevant Webpages

Like most people you will have used a search engine lately, like Google. But have you ever thought about how it manages to give you the most fitting results? How does it order the results so that the best are on top? Read on to find out!

The earliest search engines either had human-curated indices, like Yahoo!, or used some simple heuristic, like Altavista: the more often the keyword you were looking for was mentioned on a page, the better – which led to all kinds of crazy effects like certain keywords being repeated thousands of times on webpages to make them more “relevant”.

Now, most of those search engines are long gone because a new kid arrived on the block: Google! Google’s search engine results were much better than all of the competition and they became the dominant player in no time. How did they do that?

The big idea was in fact published by the two founders: Sergey Brin and Lawrence Page, it is called the pagerank algorithm (which is of course a pun because one of the authors was named Page too). The original paper can be found here: S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine.

Let us start with another, related question: which properties are the best to own in Monopoly? Many would instinctively answer with the most expensive ones, i.e. Park Place and Boardwalk. But a second thought reveals that those might be the ones where you get the biggest rent if somebody lands on them – and that last part is the caveat… “IF” somebody lands on them! The best streets are actually the ones that players land on the most. Those happen to be the orange streets, St. James Place, Tennessee Avenue and New York Avenue, and therefore they are the key to winning the game.

How do you find those properties? For example by simulation: you just simulate thousands of dice rolls and see where the players land.
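A heavily simplified sketch of such a simulation (it only models two dice on a board of 40 fields and ignores Chance and Community Chest cards, jail and the doubles rule – which are exactly what skews the real distribution):

set.seed(123)
n_rolls <- 1e5
board <- integer(40) # tally of landings per field
pos <- 0             # start on "Go"
for (i in seq_len(n_rolls)) {
  pos <- (pos + sum(sample(1:6, 2, replace = TRUE))) %% 40 # roll two dice and move
  board[pos + 1] <- board[pos + 1] + 1
}
head(order(board, decreasing = TRUE)) # fields (1 = "Go") visited most often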

A similar idea holds true for finding the best web pages: you just start from a random position and simulate a surfer who visits different web pages by chance. For each surfing session you tally the respective webpage where she ends up and after many runs we get a percentage for each page. The higher this percentage is the more relevant the webpage!

Let us do this with some R code. First we define a very small net and plot it (the actual example can be found in chapter 30 of the very good book “Chaotic Fishponds and Mirror Universes” by Richard Elwes):

library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
##     decompose, spectrum
## The following object is masked from 'package:base':
##
##     union

M <- matrix(c(0, 0, 1, 1,
              1, 0, 0, 0,
              1, 0, 0, 0,
              1, 1, 1, 0), nrow = 4)
colnames(M) <- rownames(M) <- c("A", "B", "C", "D")
M
##   A B C D
## A 0 1 1 1
## B 0 0 0 1
## C 1 0 0 1
## D 1 0 0 0

g <- graph_from_adjacency_matrix(t(M)) # careful with how the adjacency matrix is defined -> transpose of matrix
plot(g)


Now, we are running the actual simulation. We define two helper functions for that, next_page for getting a random but possible next page given the page our surfer is on at the moment and last_page which gives the final page after N clicks:

next_page <- function(page, graph) {
  l <- sample(rownames(graph)[as.logical(graph[ , as.logical(page)])], 1)
  as.numeric(rownames(graph) == l)
}

last_page <- function(page, graph, N = 100) {
  for (i in seq(N)) {
    page <- next_page(page, graph)
  }
  page
}

current_page <- c(1, 0, 0, 0) # random surfer starting from A
random_surfer <- replicate(2e4, last_page(current_page, M, 50))
round(rowSums(random_surfer) / sum(random_surfer), 2)
## [1] 0.43 0.07 0.28 0.22


So we see that page A is the most relevant one because our surfer ends up being there in more than 40% of all sessions, after that come the pages C, D and B. When you look at the net that makes sense, since all pages refer to A whereas B gets only one link, so it doesn’t seem to be that relevant.

As you have seen, the simulation even for this small net took quite a while, so we need some clever mathematics to speed up the process. One idea is to transform the matrix which represents the network into a matrix which gives the probabilities of landing on the next pages, and then multiply this probability matrix with the current position (thereby propagating the probabilities). Let us do this for the first step:

M_prob <- prop.table(M, 2) # create probability matrix
M_prob
##     A B C         D
## A 0.0 1 1 0.3333333
## B 0.0 0 0 0.3333333
## C 0.5 0 0 0.3333333
## D 0.5 0 0 0.0000000

M_prob %*% current_page
##   [,1]
## A  0.0
## B  0.0
## C  0.5
## D  0.5


The result says that there is a fifty-fifty chance of landing on C or D. When you look at the graph you see that this is correct since there are two links, one to C and one to D! For the next step you would have to multiply the matrix with the result again, or first multiply the matrix with itself before multiplying with the current position, which gives $M^2 \cdot p$ (where $M$ is the probability matrix and $p$ the starting position).
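Just to check this reasoning in code (an illustrative sketch reusing M_prob and current_page from above; plain %*% suffices here, no extra package needed):

M_prob %*% M_prob %*% current_page # probabilities after two clicks
# should give roughly 0.67 (A), 0.17 (B), 0.17 (C), 0 (D)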

If we want to do this a hundred times we just have to raise this probability matrix to the one hundredth power: $M^{100} \cdot p$.

We use the %^% operator in the expm package (on CRAN) for that:

library(expm)
##
## Attaching package: 'expm'
## The following object is masked from 'package:Matrix':
##
##     expm

r <- M_prob %^% 100 %*% current_page
r
##         [,1]
## A 0.42857143
## B 0.07142857
## C 0.28571429
## D 0.21428571


Again, we get the same result! You might ask: why 100? The answer is that this is in most cases enough to get a stable result, so that any further multiplication still results in the same result: $M \cdot r = r$.
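A quick numerical sanity check of this stationarity, reusing the objects defined above (illustrative only):

max(abs(M_prob %^% 101 %*% current_page - r)) # one more multiplication changes (almost) nothing
# should be practically zero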

The last equation opens up still another possibility: we are obviously looking for a vector $r$ which remains unaffected when multiplied by the matrix $M$. There is a mathematical name for that kind of behaviour: eigenvector! As you might have guessed the name is an import from the German language where it means something like “own vector”.

This hints at the problem we were solving all along (without consciously realizing it perhaps): a page is the more relevant the more relevant the pages are that link to it… but to know the relevance of those pages we have to know the relevance of the pages that link to them, and so on and so forth – we are going in circles here. The same is true when you look at the equation above: you define $r$ in terms of $r$ itself… $r$ is the eigenvector of the matrix $M$!

There are very fast and powerful methods to find the eigenvectors of a matrix, and the corresponding eigen function is even a function in base R:

lr <- Re(eigen(M_prob)$vectors[ , 1]) # real parts of biggest eigenvector
lr / sum(lr) # normalization
## [1] 0.42857143 0.07142857 0.28571429 0.21428571

Again, the same result! You can now understand the title of this post and titles of other articles about the pagerank algorithm and Google like “The $25,000,000,000 eigenvector”.

Yet, a word of warning is in order: there are cases where the probability matrix is not diagonalizable (we won’t get into the mathematical details here), which means that the eigenvector method won’t give sensible results. To check this the following code must evaluate to TRUE:

ev <- eigen(M_prob)$values
length(unique(ev)) == length(ev)
## [1] TRUE

We now repeat the last two methods for a bigger network:

set.seed(1415)
n <- 10
g <- sample_gnp(n, p = 1/4, directed = TRUE) # create random graph
g <- set_vertex_attr(g, "name", value = LETTERS[1:n])
plot(g)

M <- t(as_adjacency_matrix(g, sparse = FALSE))
M_prob <- prop.table(M, 2) # create probability matrix
M_prob
##      A B C D   E   F   G         H         I   J
## A 0.00 0 0 1 0.5 0.5 0.5 0.0000000 0.0000000 0.5
## B 0.00 0 0 0 0.0 0.0 0.0 0.3333333 0.0000000 0.0
## C 0.00 1 0 0 0.0 0.0 0.0 0.0000000 0.3333333 0.5
## D 0.25 0 0 0 0.0 0.0 0.0 0.0000000 0.0000000 0.0
## E 0.25 0 0 0 0.0 0.0 0.5 0.3333333 0.3333333 0.0
## F 0.00 0 1 0 0.0 0.0 0.0 0.0000000 0.3333333 0.0
## G 0.25 0 0 0 0.0 0.0 0.0 0.0000000 0.0000000 0.0
## H 0.00 0 0 0 0.5 0.0 0.0 0.0000000 0.0000000 0.0
## I 0.00 0 0 0 0.0 0.5 0.0 0.0000000 0.0000000 0.0
## J 0.25 0 0 0 0.0 0.0 0.0 0.3333333 0.0000000 0.0

current_page <- c(1, rep(0, n-1))
r <- M_prob %^% 100 %*% current_page
r
##         [,1]
## A 0.27663574
## B 0.02429905
## C 0.08878509
## D 0.06915881
## E 0.14579434
## F 0.10654199
## G 0.06915881
## H 0.07289723
## I 0.05327107
## J 0.09345787

lr <- Re(eigen(M_prob)$vectors[ , 1])
lr / sum(lr) # normalization of the real parts
##  [1] 0.27663551 0.02429907 0.08878505 0.06915888 0.14579439 0.10654206
##  [7] 0.06915888 0.07289720 0.05327103 0.09345794


We can now order the pages according to their importance – like the first 10 results of a google search:

search <- data.frame(Page = LETTERS[1:n], Rank = r)
search[order(search$Rank, decreasing = TRUE), ]
##   Page       Rank
## A    A 0.27663574
## E    E 0.14579434
## F    F 0.10654199
## J    J 0.09345787
## C    C 0.08878509
## H    H 0.07289723
## D    D 0.06915881
## G    G 0.06915881
## I    I 0.05327107
## B    B 0.02429905


Looking at the net, does the resulting order make sense to you?

Congratulations, you now understand the big idea behind one of the greatest revolutions in information technology!

## Symbolic Regression, Genetic Programming… or if Kepler had R

A few weeks ago we published a post about using the power of the evolutionary method for optimization (see Evolution works!). In this post we will go a step further, so read on…

A problem researchers often face is that they have an amount of data and need to find some functional form, e.g. some kind of physical law, for it. The standard approach is to try different functional forms, like linear, quadratic or polynomial functions with higher-order terms. Also possible is a Fourier analysis with trigonometric functions. But all of those approaches are quite limited and it would be nice if we had a program to do this for us and come up with a functional form that is both accurate and parsimonious… well, your prayers were heard!

This approach is called symbolic regression (also sometimes called free-form regression) – a special case of what is called genetic programming – and the idea is to give the algorithm a grammar which defines some basic functional building blocks (like addition, subtraction, multiplication, logarithms, trigonometric functions and so on) and then try different combinations in an evolutionary process which keeps the better terms and recombines them into even better fitting terms. At the end we want to have a nice formula which captures the dynamics of the system without overfitting the noise. The package with which you can do such magic is the gramEvol package (on CRAN).

Let us start with a simple example where we first generate some data on the basis of a combination of trig functions:

x <- seq(0, 4*pi, length.out = 201)
y <- sin(x) + cos(x + x)
plot(y)


We now try to find this functional form by just giving the program the generated data points. As a first step we load the package and define the grammar, i.e.
the allowed functional building blocks (for details on how to define your own grammar consult the package’s documentation):

library("gramEvol")
ruleDef <- list(expr = grule(op(expr, expr), func(expr), var),
                func = grule(sin, cos),
                op = grule('+', '-', '*'),
                var = grule(x))
grammarDef <- CreateGrammar(ruleDef)
grammarDef
## <expr> ::= <op>(<expr>, <expr>) | <func>(<expr>) | <var>
## <func> ::= sin | cos
## <op> ::= "+" | "-" | "*"
## <var> ::= x


Just to give some examples of randomly created formulas from this grammar you could use the GrammarRandomExpression function:

set.seed(123)
GrammarRandomExpression(grammarDef, 6)
## [[1]]
## expression(sin(cos(x + x)))
##
## [[2]]
## expression(sin(cos(x * x)) + x)
##
## [[3]]
## expression((x - cos(x)) * x)
##
## [[4]]
## expression(x)
##
## [[5]]
## expression(sin(x))
##
## [[6]]
## expression(x + x)


After that we have to define some cost function which is used to evaluate how good the respective formula is (again, for details please consult the documentation):

SymRegFitFunc <- function(expr) {
  result <- eval(expr)
  if (any(is.nan(result))) return(Inf)
  return(mean(log(1 + abs(y - result))))
}


The last step is starting the search and seeing what the algorithm finds:

set.seed(314)
ge <- GrammaticalEvolution(grammarDef, SymRegFitFunc, terminationCost = 0.1, iterations = 2500, max.depth = 5)
ge
## Grammatical Evolution Search Results:
## No. Generations: 2149
## Best Expression: sin(x) + cos(x + x)
## Best Cost: 0

plot(y)
points(eval(ge$best$expressions), col = "red", type = "l")


Quite impressive, isn’t it? It found the exact same form by just “looking at” the data and trying different combinations of functions, guided by evolution.

Now, in a real-world setting you normally don’t have perfect data but always some measurement errors (= noise), so we run the example again, this time with some added noise (we use the jitter function for that):

x <- seq(0, 4*pi, length.out = 201)
y <- jitter(sin(x) + cos(x + x), amount = 0.2)
plot(y)


And now for the rest of the steps:

ruleDef <- list(expr = grule(op(expr, expr), func(expr), var),
                func = grule(sin, cos),
                op = grule('+', '-', '*'),
                var = grule(x))
grammarDef <- CreateGrammar(ruleDef)
grammarDef
## <expr> ::= <op>(<expr>, <expr>) | <func>(<expr>) | <var>
## <func> ::= sin | cos
## <op> ::= "+" | "-" | "*"
## <var> ::= x

SymRegFitFunc <- function(expr) {
  result <- eval(expr)
  if (any(is.nan(result))) return(Inf)
  return(mean(log(1 + abs(y - result))))
}

set.seed(314)
ge <- GrammaticalEvolution(grammarDef, SymRegFitFunc, terminationCost = 0.1, iterations = 2500, max.depth = 5)
ge
## Grammatical Evolution Search Results:
## No. Generations: 2149
## Best Expression: sin(x) + cos(x + x)
## Best Cost: 0.0923240003917875

plot(y)
points(eval(ge$best$expressions), col = "red", type = "l")


Although we added quite some noise the program was still successful in finding the original functional form!

Now we are ready to try something more useful: finding a real physical law of nature! We want to find the relationship between the orbital periods and the distances from the sun of the planets in our solar system.
First we provide the distance and period data, normalised for the earth:

planets <- c("Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus")
distance <- c(0.72, 1.00, 1.52, 5.20, 9.53, 19.10)
period <- c(0.61, 1.00, 1.84, 11.90, 29.40, 83.50)
data.frame(planets, distance, period)
##   planets distance period
## 1   Venus     0.72   0.61
## 2   Earth     1.00   1.00
## 3    Mars     1.52   1.84
## 4 Jupiter     5.20  11.90
## 5  Saturn     9.53  29.40
## 6  Uranus    19.10  83.50


And after that we perform just the same steps as above:

ruleDef <- list(expr = grule(op(expr, expr), func(expr), var),
                func = grule(sin, cos, tan, log, sqrt),
                op = grule('+', '-', '*', '/', '^'),
                var = grule(distance, n),
                n = grule(1, 2, 3, 4, 5, 6, 7, 8, 9))
grammarDef <- CreateGrammar(ruleDef)
grammarDef
## <expr> ::= <op>(<expr>, <expr>) | <func>(<expr>) | <var>
## <func> ::= sin | cos | tan | log | sqrt
## <op> ::= "+" | "-" | "*" | "/" | "^"
## <var> ::= distance | <n>
## <n> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

SymRegFitFunc <- function(expr) {
  result <- eval(expr)
  if (any(is.nan(result))) return(Inf)
  return(mean(log(1 + abs(period - result))))
}

set.seed(2)
suppressWarnings(ge <- GrammaticalEvolution(grammarDef, SymRegFitFunc, terminationCost = 0.05))
ge
## Grammatical Evolution Search Results:
## No. Generations: 42
## Best Expression: sqrt(distance) * distance
## Best Cost: 0.0201895728693589


Wow, the algorithm just rediscovered the third law of Kepler in no time! The square of the orbital period of a planet is directly proportional to the cube of the semi-major axis of its orbit. If only Kepler could have used R! 😉

## Learning Data Science: Predicting Income Brackets

As promised in the post Learning Data Science: Modelling Basics we will now go a step further and try to predict income brackets with real-world data and different modelling approaches. We will learn a thing or two along the way, e.g. about the so-called Accuracy-Interpretability Trade-Off, so read on…

The data we will use is from here: Marketing data set. The description reads:

This dataset contains questions from questionnaires that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the Annual Income of Household from the other 13 demographics attributes.

The following extra information (or metadata) is provided with the data:

cat(readLines('data/marketing-names.txt'), sep = '\n')
Marketing data set
1: Description.
This dataset contains questions from questionaries that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the Anual Income of Household from the other 13 demographics attributes.
2: Type. Classification
3: Origin. Real world
4: Instances. 6876 (8993)
5: Features. 14
6: Classes 9
7: Missing values. Yes
8: Header.
@relation marketing
@attribute Sex integer [1, 2]
@attribute MaritalStatus integer [1, 5]
@attribute Age integer [1, 7]
@attribute Education integer [1, 6]
@attribute Occupation integer [1, 9]
@attribute YearsInSf integer [1, 5]
@attribute DualIncome integer [1, 3]
@attribute HouseholdMembers integer [1, 9]
@attribute Under18 integer [0, 9]
@attribute HouseholdStatus integer [1, 3]
@attribute TypeOfHome integer [1, 5]
@attribute EthnicClass integer [1, 8]
@attribute Language integer [1, 3]
@attribute Income {1, 2, 3, 4, 5, 6, 7, 8, 9}
@inputs Sex, MaritalStatus, Age, Education, Occupation, YearsInSf, DualIncome, HouseholdMembers, Under18, HouseholdStatus, TypeOfHome, EthnicClass, Language
@outputs Income

DATA DICTIONARY

1 ANNUAL INCOME OF HOUSEHOLD (PERSONAL INCOME IF SINGLE)
1. Less than $10,000
2. $10,000 to$14,999
3. $15,000 to$19,999
4. $20,000 to$24,999
5. $25,000 to$29,999
6. $30,000 to$39,999
7. $40,000 to$49,999
8. $50,000 to$74,999
9. $75,000 or more

2 SEX
1. Male
2. Female

3 MARITAL STATUS
1. Married
2. Living together, not married
3. Divorced or separated
4. Widowed
5. Single, never married

4 AGE
1. 14 thru 17
2. 18 thru 24
3. 25 thru 34
4. 35 thru 44
5. 45 thru 54
6. 55 thru 64
7. 65 and Over

5 EDUCATION
1. Grade 8 or less
2. Grades 9 to 11
3. Graduated high school
4. 1 to 3 years of college
5. College graduate
6. Grad Study

6 OCCUPATION
1. Professional/Managerial
2. Sales Worker
3. Factory Worker/Laborer/Driver
4. Clerical/Service Worker
5. Homemaker
6. Student, HS or College
7. Military
8. Retired
9. Unemployed

7 HOW LONG HAVE YOU LIVED IN THE SAN FRAN./OAKLAND/SAN JOSE AREA?
1. Less than one year
2. One to three years
3. Four to six years
4. Seven to ten years
5. More than ten years

8 DUAL INCOMES (IF MARRIED)
1. Not Married
2. Yes
3. No

9 PERSONS IN YOUR HOUSEHOLD
1. One
2. Two
3. Three
4. Four
5. Five
6. Six
7. Seven
8. Eight
9. Nine or more

10 PERSONS IN HOUSEHOLD UNDER 18
0. None
1. One
2. Two
3. Three
4. Four
5. Five
6. Six
7. Seven
8. Eight
9. Nine or more

11 HOUSEHOLDER STATUS
1. Own
2. Rent
3. Live with Parents/Family

12 TYPE OF HOME
1. House
2. Condominium
3. Apartment
4. Mobile Home
5. Other

13 ETHNIC CLASSIFICATION
1. American Indian
2. Asian
3. Black
4. East Indian
5. Hispanic
6. Pacific Islander
7. White
8. Other

14 WHAT LANGUAGE IS SPOKEN MOST OFTEN IN YOUR HOME?
1. English
2. Spanish
3. Other


Our task is to predict the variable “Income”. So, let us first load the data (you can find the correctly formatted csv-file here: marketing.csv), have a look at some of its characteristics and perform a little bit of additional formatting. After that we divide it into a training (80%) and a test set (20%) to account for potential overfitting (also see Learning Data Science: Modelling Basics):

data <- read.csv("data/marketing.csv")
dim(data)
## [1] 6876   14

str(data)
## 'data.frame':    6876 obs. of  14 variables:
##  $ Sex             : int  1 2 2 2 1 1 1 1 1 1 ...
##  $ MaritalStatus   : int  1 1 5 5 1 5 3 1 1 5 ...
##  $ Age             : int  5 3 1 1 6 2 3 6 7 2 ...
##  $ Education       : int  5 5 2 2 4 3 4 3 4 4 ...
##  $ Occupation      : int  5 1 6 6 8 9 3 8 8 9 ...
##  $ YearsInSf       : int  5 5 5 3 5 4 5 5 4 5 ...
##  $ DualIncome      : int  3 2 1 1 3 1 1 3 3 1 ...
##  $ HouseholdMembers: int  5 3 4 4 2 3 1 3 2 1 ...
##  $ Under18         : int  2 1 2 2 0 1 0 0 0 0 ...
##  $ HouseholdStatus : int  1 2 3 3 1 2 2 2 2 2 ...
##  $ TypeOfHome      : int  1 3 1 1 1 3 3 3 3 3 ...
##  $ EthnicClass     : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ Language        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Income          : int  9 9 1 1 8 1 6 2 4 1 ...

data_names <- names(data)
data <- cbind(data[-ncol(data)], factor(data$Income)) # make variable Income (which should be predicted) a factor
names(data) <- data_names

set.seed(12)
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- data[random, ]
data_test <- data[-random, ]
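
Just to make sure the split worked as intended, you can check the sizes of the two sets; they should match the 5500 training instances and 1376 test instances reported in the outputs below:

nrow(data_train) # 80% of the data
nrow(data_test)  # the remaining 20%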


We start with a simple but comprehensible model, OneR (on CRAN), as a benchmark:

library(OneR)
data <- optbin(data_train)
model <- OneR(data, verbose = TRUE)
##
##     Attribute        Accuracy
## 1 * Age              28.2%
## 2   MaritalStatus    28.11%
## 3   Occupation       28.07%
## 4   HouseholdStatus  27.56%
## 5   DualIncome       27.04%
## 6   Education        25.98%
## 7   HouseholdMembers 22.51%
## 8   Under18          20.69%
## 9   TypeOfHome       19.36%
## 10  EthnicClass      19.29%
## 11  Sex              18.07%
## 12  Language         17.82%
## 13  YearsInSf        17.75%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model)
##
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
##
## Rules:
## If Age = 1 then Income = 1
## If Age = 2 then Income = 1
## If Age = 3 then Income = 6
## If Age = 4 then Income = 8
## If Age = 5 then Income = 8
## If Age = 6 then Income = 8
## If Age = 7 then Income = 6
##
## Accuracy:
## 1551 of 5500 instances classified correctly (28.2%)
##
## Contingency table:
##       Age
## Income     1     2     3     4     5    6    7  Sum
##    1   * 421 * 352    99    43    21   15   25  976
##    2      16   204   107    39    13   22   33  434
##    3       9   147   122    49    12   21   35  395
##    4       5   121   188    71    39   29   42  495
##    5       3    77   179    81    29   23   34  426
##    6      10    93 * 234   156    70   56 * 47  666
##    7      12    92   185   155   107   66   33  650
##    8      12   111   211 * 251 * 160 * 86   44  875
##    9      11    76   114   187   104   69   22  583
##    Sum   499  1273  1439  1032   555  387  315 5500
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 2671.2, df = 48, p-value < 2.2e-16

plot(model)


prediction <- predict(model, data_test)
eval_model(prediction, data_test)
##
## Confusion matrix (absolute):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1    232   45   46   32   33   27   19   27   24  485
##        2      0    0    0    0    0    0    0    0    0    0
##        3      0    0    0    0    0    0    0    0    0    0
##        4      0    0    0    0    0    0    0    0    0    0
##        5      0    0    0    0    0    0    0    0    0    0
##        6     31   30   44   44   41   66   44   57   50  407
##        7      0    0    0    0    0    0    0    0    0    0
##        8     16   20   20   47   27   87   71  110   86  484
##        9      0    0    0    0    0    0    0    0    0    0
##        Sum  279   95  110  123  101  180  134  194  160 1376
##
## Confusion matrix (relative):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1   0.17 0.03 0.03 0.02 0.02 0.02 0.01 0.02 0.02 0.35
##        2   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        3   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        4   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        5   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        6   0.02 0.02 0.03 0.03 0.03 0.05 0.03 0.04 0.04 0.30
##        7   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        8   0.01 0.01 0.01 0.03 0.02 0.06 0.05 0.08 0.06 0.35
##        9   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00
##
## Accuracy:
## 0.2965 (408/1376)
##
## Error rate:
## 0.7035 (968/1376)
##
## Error rate reduction (vs. base rate):
## 0.1176 (p-value < 2.2e-16)


What can we conclude from this? First, the most important feature is “Age”, while “Marital Status”, “Occupation” and “Household Status” perform comparably well. The overall accuracy (i.e. the percentage of correctly predicted instances) of about 30% is not that great, but on the other hand not that uncommon when dealing with real-world data. Looking at the model itself, in the form of rules and the diagnostic plot, we can see that there is a non-linear relationship between Age and Income: the older one gets the higher the income bracket, except for people who are old enough to retire. This seems plausible.

OneR is basically a one-step decision tree, so the natural choice for our next analysis is to build a full decision tree (all packages are on CRAN):

library(rpart)
library(rpart.plot)
model <- rpart(Income ~., data = data_train)
rpart.plot(model, type = 3, extra = 0, box.palette = "Grays")


prediction <- predict(model, data_test, type = "class")
eval_model(prediction, data_test)
##
## Confusion matrix (absolute):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1    201   36   22   13   16   12    8   15   12  335
##        2     43   25   32   22   17   12   10   14    6  181
##        3      0    0    0    0    0    0    0    0    0    0
##        4      0    0    0    0    0    0    0    0    0    0
##        5      0    0    0    0    0    0    0    0    0    0
##        6     18   24   40   50   42   68   32   33   22  329
##        7      0    0    0    0    0    0    0    0    0    0
##        8     17   10   16   38   26   88   84  132  120  531
##        9      0    0    0    0    0    0    0    0    0    0
##        Sum  279   95  110  123  101  180  134  194  160 1376
##
## Confusion matrix (relative):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1   0.15 0.03 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.24
##        2   0.03 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.00 0.13
##        3   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        4   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        5   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        6   0.01 0.02 0.03 0.04 0.03 0.05 0.02 0.02 0.02 0.24
##        7   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        8   0.01 0.01 0.01 0.03 0.02 0.06 0.06 0.10 0.09 0.39
##        9   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00
##
## Accuracy:
## 0.3096 (426/1376)
##
## Error rate:
## 0.6904 (950/1376)
##
## Error rate reduction (vs. base rate):
## 0.134 (p-value < 2.2e-16)


The new model has a slightly better accuracy but the interpretability is a little bit worse. You have to go through the different branches of the tree to see in which income bracket a case ends up: for example, for the lower Age brackets the tree predicts a low income bracket, for higher Age brackets the prediction additionally depends on the Household Status bracket, and so on (the exact splits can be read off the plot above). The most important variable, as you can see, is again Age (which is reassuring and shows that OneR was on the right track), but we also see Household Status and Occupation again.
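
If you don’t want to trace the branches in the plot by hand, newer versions of the rpart.plot package also provide a function that prints the fitted tree as a set of rules, one rule per leaf (just a quick pointer; the exact output format may differ depending on the package version):

rpart.rules(model) # print the fitted tree as one rule per leaf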

What is better than one tree? Many trees! So the next natural step is to build many trees, while varying the features and the examples that are included in each tree. In the end different trees may give different income brackets as their prediction, which we resolve by voting, as in a good democracy (we will take a quick look at this voting after the evaluation below). This whole method is fittingly called Random Forests and fortunately there are many good packages available in R. We use the randomForest package (also on CRAN) here:

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.

set.seed(4543) # for reproducibility
model <- randomForest(Income ~., data = data_train, importance = TRUE)
varImpPlot(model)


prediction <- predict(model, data_test)
eval_model(prediction, data_test)
##
## Confusion matrix (absolute):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1    223   35   26   16   19   11    9   18   16  373
##        2     24   15   12   20    7   11    1    4    1   95
##        3      9   10   11   14    9    6    3    4    2   68
##        4      5   15   25   22   10   22    6    9    5  119
##        5      2    2    8    9    6   12    6    3    1   49
##        6      3    5   15   14   19   40   23   17   15  151
##        7      8    4    7   13   14   26   25   24    5  126
##        8      3    8    5   11   13   44   49   87   68  288
##        9      2    1    1    4    4    8   12   28   47  107
##        Sum  279   95  110  123  101  180  134  194  160 1376
##
## Confusion matrix (relative):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1   0.16 0.03 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.27
##        2   0.02 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.07
##        3   0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.05
##        4   0.00 0.01 0.02 0.02 0.01 0.02 0.00 0.01 0.00 0.09
##        5   0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.04
##        6   0.00 0.00 0.01 0.01 0.01 0.03 0.02 0.01 0.01 0.11
##        7   0.01 0.00 0.01 0.01 0.01 0.02 0.02 0.02 0.00 0.09
##        8   0.00 0.01 0.00 0.01 0.01 0.03 0.04 0.06 0.05 0.21
##        9   0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.03 0.08
##        Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00
##
## Accuracy:
## 0.3459 (476/1376)
##
## Error rate:
## 0.6541 (900/1376)
##
## Error rate reduction (vs. base rate):
## 0.1796 (p-value < 2.2e-16)
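
By the way, the voting mentioned above can be made explicit: the predict method of the randomForest package can return, for each case, the share of trees that voted for each income bracket (a small sketch, reusing model and data_test from above):

votes <- predict(model, data_test, type = "vote") # fraction of trees voting for each class
head(votes) # each row sums to 1; the final prediction is the class with the most votes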


As an aside you can see that the basic modelling workflow stayed the same – no matter which approach (OneR, decision tree or random forest) you choose. This standard is kept by most (modern) packages and is one of the great strengths of R! With thousands of packages on CRAN alone there are of course sometimes differences, but those are normally well documented (so the help function or the vignette are your friends!).

Now, back to the result of the last analysis: again, the overall accuracy is better, but because we have hundreds of trees now the interpretability suffered a lot. This is also known as the Accuracy-Interpretability Trade-Off. The best we can do in the case of random forests is to find out which features are the most important ones. Again Age, Occupation and Household Status show up, which is consistent with our analyses so far. Additionally, because of the many trees that had to be built, this analysis ran a lot longer than the other two.
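
The numbers behind the variable importance plot can also be printed directly with the importance function of the randomForest package (reusing the fitted model from above; we set importance = TRUE when building it):

importance(model) # mean decrease in accuracy and in Gini index per feature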

Random forests are one of the best methods out of the box, so the accuracy of about 35% tells us that our first model (OneR) wasn’t doing too badly in the first place: we were able to reach comparatively good results with just one feature. This is true for many real-world data sets. Sometimes simple models are not much worse than very complicated ones – you should keep that in mind!

If you play around with this dataset I would be interested in your results! Please post them in the comments – Thank you and stay tuned!

## Evolution works!

Hamlet: Do you see yonder cloud that’s almost in shape of a camel?
Polonius: By the mass, and ’tis like a camel, indeed.
Hamlet: Methinks it is like a weasel.
from Hamlet by William Shakespeare

The best way to see how evolution works is to watch it in action! You can watch the evolution of cars live in this application (but be careful, it’s addictive): BoxCar 2D

It is fascinating to see how those cars get better and better over time, sometimes finding very impressive solutions:

To understand how evolution works even better, let us create an artificial evolution in R!

In his book “The Blind Watchmaker”, the famous evolutionary biologist Richard Dawkins gave the following thought experiment:

I don’t know who it was first pointed out that, given enough time, a monkey bashing away at random on a typewriter could produce all the works of Shakespeare. The operative phrase is, of course, given enough time. Let us limit the task facing our monkey somewhat. Suppose that he has to produce, not the complete works of Shakespeare but just the short sentence ‘Methinks it is like a weasel’, and we shall make it relatively easy by giving him a typewriter with a restricted keyboard, one with just the 26 (capital) letters, and a space bar. How long will he take to write this one little sentence?

We are now going to put this idea into practice! The following outline is from the Wikipedia article on the weasel program (Weasel program):

1. Start with a random string of 28 characters.
2. Make 100 copies of the string (reproduce).
3. For each character in each of the 100 copies, with a probability of 5%, replace (mutate) the character with a new random character.
4. Compare each new string with the target string “METHINKS IT IS LIKE A WEASEL”, and give each a score (the number of letters in the string that are correct and in the correct position).
5. If any of the new strings has a perfect score (28), halt. Otherwise, take the highest scoring string, and go to step 2.

So let us first define some variables and helper functions for reproduction, mutation and fitness calculation:

target <- unlist(strsplit("METHINKS IT IS LIKE A WEASEL", "")) # assign target string to "target"
pop_sz <- 100 # assign population size 100 to "pop_sz"
mt_rt <- 0.05 # assign mutation rate 5% to "mt_rt"

reproduce <- function(string) {
# input: vector "string"
# output: matrix with "pop_sz" columns, where each column is vector "string"
matrix(string, nrow = length(string), ncol = pop_sz)
}

mutate <- function(pop) {
# input: matrix of population "pop"
# output: matrix of population where each character, with a probability of mt_rt per cent (= 5%), is replaced with a new random character
mt_pos <- runif(length(pop)) <= mt_rt
pop[mt_pos] <- sample(c(LETTERS, " "), sum(mt_pos), replace = TRUE)
pop
}

fitness <- function(pop) {
# input: matrix of population "pop"
# output: vector of the number of letters that are correct (= equal to target) for each column
colSums(pop == target)
}


After that we are going through all five steps listed above:

# 1. Start with a random string of 28 characters.
set.seed(70)
start <- sample(c(LETTERS, " "), length(target), replace = TRUE)

# 2. Make 100 copies of this string (reproduce).
pop <- reproduce(start)

# 3. For each character in each of the 100 copies, with a probability of 5%, replace (mutate) the character with a new random character.
pop <- mutate(pop)

# 4. Compare each new string with the target "METHINKS IT IS LIKE A WEASEL", and give each a score (the number of letters in the string that are correct and in the correct position).
score <- fitness(pop)

# 5. If any of the new strings has a perfect score (28), halt. Otherwise, take the highest scoring string, and go to step 2.
highscorer <- pop[ , which.max(score)] # assign string to "highscorer" which has max. score in the population
gen_no <- 1 #assign 1 to generation counter "gen_no"

while (max(score) < length(target)) {
cat("No. of generations: ", gen_no, ", best so far: ", highscorer, " with score: ", max(score), "\n", sep = "")
pop <- reproduce(highscorer)           # 2. select the highest scoring string for reproduction
pop <- mutate(pop)                     # 3. mutation
score <- fitness(pop)                  # 4. fitness calculation
highscorer <- pop[ , which.max(score)] # assign string to "highscorer" which has max. score in the population
gen_no <- gen_no + 1                   # increment generation counter
}
## No. of generations: 1, best so far: BZRDXXINEIMYQVJWBFZKFCVUPFYL with score: 2
## No. of generations: 2, best so far: BZRDXNINEIMYQVJWBFZKFCVUPFYL with score: 3
## No. of generations: 3, best so far: BZRDXNINEIMYQVJWBFZKACVEPFYR with score: 4
## No. of generations: 4, best so far: BZRDININEIMYQBJWBFZKACVEPFYR with score: 5
## No. of generations: 5, best so far: BZRDININEIMYIBJWBFZKACVEPFYR with score: 6
## No. of generations: 6, best so far: BZRDININEIMYIBJLBFZKACVEPFYR with score: 7
## No. of generations: 7, best so far: BRRDININEIMYIBJLOFZKACVEPFYL with score: 8
## No. of generations: 8, best so far: BRRDININEIMYIZJLOFZKACVEAFYL with score: 9
## No. of generations: 9, best so far: BRRDINKNEIMYIZJLOFZKAT EAFYL with score: 10
## No. of generations: 10, best so far: BRRDINKNEIMYIZJLOFZKATVEASYL with score: 11
## No. of generations: 11, best so far: BRRDINKNEIMYIZJLOFEKATVEASYL with score: 12
## No. of generations: 12, best so far: BRRUINKNEIMYIZJLOFEKATVEASEL with score: 13
## No. of generations: 13, best so far: BERUINKNEIMYIZJLOFEKATVEASEL with score: 14
## No. of generations: 14, best so far: BERHINKNEIMYIZJLVFEKATVEASEL with score: 15
## No. of generations: 15, best so far: BERHINKNEIMQIZJLVFE ATVEASEL with score: 16
## No. of generations: 16, best so far: BERHINKNEIMQIZ LVFE ATVEASEL with score: 17
## No. of generations: 17, best so far: BERHINKNEIMQIZ LVFE ATVEASEL with score: 17
## No. of generations: 18, best so far: BERHINKNEIMQIZ LVFE ATVEASEL with score: 17
## No. of generations: 19, best so far: TERHINKNEIMQIZ LVFE ATDEASEL with score: 17
## No. of generations: 20, best so far: TERHINKNEIMQIZ LVFE ATDEASEL with score: 17
## No. of generations: 21, best so far: TERHINKNJISQIZ LVFE ATDEASEL with score: 17
## No. of generations: 22, best so far: TERHINKNJISQIZ LVFE A DEASEL with score: 18
## No. of generations: 23, best so far: TERHINKNJISQIZ LVFE A DEASEL with score: 18
## No. of generations: 24, best so far: TERHINKNJITQIZ LVFE A YEASEL with score: 19
## No. of generations: 25, best so far: TERHINKNJITQIZ LPFE A YEASEL with score: 19
## No. of generations: 26, best so far: TERHINKN ITQIZ LPFE A YEASEL with score: 20
## No. of generations: 27, best so far: MERHINKN ITQIZ LPFE A YEASEL with score: 21
## No. of generations: 28, best so far: MERHINKN IT IZ LPFE A YEASEL with score: 22
## No. of generations: 29, best so far: MERHINKN IT IS LPFE A YEASEL with score: 23
## No. of generations: 30, best so far: MERHINKN IT IS LPFE A YEASEL with score: 23
## No. of generations: 31, best so far: MERHINKN IT IS LPFE A YEASEL with score: 23
## No. of generations: 32, best so far: MERHINKN IT IS LAFE A WEASEL with score: 24
## No. of generations: 33, best so far: METHINKN IT IS LAFE A WEASEL with score: 25
## No. of generations: 34, best so far: METHINKN IT IS LAFE A WEASEL with score: 25
## No. of generations: 35, best so far: METHINKN IT IS LAFE A WEASEL with score: 25
## No. of generations: 36, best so far: METHINKN IT IS LAFE A WEASEL with score: 25
## No. of generations: 37, best so far: METHINKN IT IS LAFE A WEASEL with score: 25
## No. of generations: 38, best so far: METHINKU IT IS LIFE A WEASEL with score: 26
## No. of generations: 39, best so far: METHINKU IT IS LIFE A WEASEL with score: 26
## No. of generations: 40, best so far: METHINKU IT IS LIFE A WEASEL with score: 26
## No. of generations: 41, best so far: METHINKU IT IS LIKE A WEASEL with score: 27
## No. of generations: 42, best so far: METHINKU IT IS LIKE A WEASEL with score: 27
## No. of generations: 43, best so far: METHINKU IT IS LIKE A WEASEL with score: 27
## No. of generations: 44, best so far: METHINKU IT IS LIKE A WEASEL with score: 27
## No. of generations: 45, best so far: METHINKU IT IS LIKE A WEASEL with score: 27

cat("Mission accomplished in ", gen_no, " generations: ", highscorer, sep = "")
## Mission accomplished in 46 generations: METHINKS IT IS LIKE A WEASEL


As you can see, the algorithm arrived at the target phrase pretty quickly. Now you can try to tweak different parameter settings, like the population size or the mutation rate, and see what happens. You can of course also change the target phrase.

A minority of (often very religious) people reject the fact of evolution because they miss a crucial step: selection based on fitness. Selection gives evolution direction towards solutions that are better able to solve a certain problem. It is the exact opposite of pure randomness which many people still suspect behind evolution.

To see the difference, the only thing we have to do is comment out the line pop <- reproduce(highscorer), which selects the highest scoring string for reproduction. Without selection there is no improvement to be seen and the algorithm would run “forever”.
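
Here is a minimal sketch of that modified loop: the selection step is commented out, and a cap of 50 generations is added so that the demo actually stops (otherwise it really would run forever):

pop <- reproduce(start)
pop <- mutate(pop)
score <- fitness(pop)
highscorer <- pop[ , which.max(score)]
gen_no <- 1

while (max(score) < length(target) && gen_no <= 50) {
  cat("No. of generations: ", gen_no, ", best so far: ", highscorer, " with score: ", max(score), "\n", sep = "")
  # pop <- reproduce(highscorer)       # selection switched off: the population just keeps mutating randomly
  pop <- mutate(pop)
  score <- fitness(pop)
  highscorer <- pop[ , which.max(score)]
  gen_no <- gen_no + 1
}

The output of such a run looks like this: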

## No. of generations: 1, best so far: UJGGZYOEDJMRADTQUXFWAVWPBGFX with score: 2
## No. of generations: 2, best so far: UHGGZQOEDJERAD QBXFSBRWPBGFX with score: 2
## No. of generations: 3, best so far: UNGDZYOEDSERADTQIXFSBVWPAGFX with score: 3
## No. of generations: 4, best so far: UHGGZQNEDJERAG QBXFSBRWPBGWX with score: 2
## No. of generations: 5, best so far: IDGGTJOELJERAETQBDFSBVWEBGFX with score: 2
## No. of generations: 6, best so far: IDGGTJOELJERNETQBDFSBVWEBGFX with score: 2
## No. of generations: 7, best so far: FNJGZYOESJERERTQGXGSBVWEBSFX with score: 3
## No. of generations: 8, best so far: UJGWZBOERJMUAQTQUXFVAVWKKSFX with score: 3
## No. of generations: 9, best so far: VETGRYOEYVVSAOTQBKOSTVPPGGFM with score: 3
## No. of generations: 10, best so far: VETGRYOEYVVSAOTQBKOSTVPPGGFM with score: 3
## No. of generations: 11, best so far: VETGRYOEYVVSAKTQBKOSTVPPGGFM with score: 3
## No. of generations: 12, best so far: IETGRYOTYVVDAKTQBKOCTVPPGGFM with score: 3
## No. of generations: 13, best so far:  TTVVZOKDJERADELYXFKWGWXKGYO with score: 3
## No. of generations: 14, best so far: UNGWCYOZDEWRAD WKXKSBVWECGFX with score: 3
## No. of generations: 15, best so far: UNGWCYOZDEWRBD WKXKSBVWECGFX with score: 3
## No. of generations: 16, best so far: UNGSCYOZDEWRBD WKXKSAVCECGFX with score: 3
## No. of generations: 17, best so far: MXKGZYOMSJ RIOTQBLJSBVNPAGDL with score: 4
## No. of generations: 18, best so far: MXKGZYOMSJ RIOTQBLJSBVNPAGDL with score: 4
## No. of generations: 19, best so far: MXKGZYOMZJ RIOTQBLJSVVNPAGDL with score: 4
## No. of generations: 20, best so far:  TTVVJGKDDERADELYJXKRGWEKGYU with score: 4
## No. of generations: 21, best so far:  TTVVJGKDDERADELYDXBRGWEKGYU with score: 4
## No. of generations: 22, best so far:  TTWVJGKDQERADELYDXBRGWEKGYU with score: 4
## No. of generations: 23, best so far: MXKGOYOMCJ RIOTQBLJIVVAPAJDG with score: 3
## No. of generations: 24, best so far: MXKGOYOMCJ RIOTQBLJIVVAPAJDG with score: 3
## No. of generations: 25, best so far: MXKGOYOMCJ RIOTQBLJIVVAPAJDG with score: 3
## No. of generations: 26, best so far: MXKGOYOMCJ RIOTQBLJIVVAPAJDG with score: 3
## No. of generations: 27, best so far: TNTUXYKJPJNDAITLAJTYBAWPMGGB with score: 4
## No. of generations: 28, best so far: MXKGOYOMCJ RIOTLBLJIVVAPAJDX with score: 4
## No. of generations: 29, best so far: MXKGOYOMCJ RIOTLBLJIVVAJAJDX with score: 4
## No. of generations: 30, best so far: TUTUYYKNPJNDAITLAJTYBAAPMOGB with score: 3
## No. of generations: 31, best so far:  NGAFULYDZELWD QDPRSMPWYAPZH with score: 3
## No. of generations: 32, best so far: HKUOZSJSXDERS TLBHASAVGPBEJT with score: 3
## No. of generations: 33, best so far:  NGAFULYDTELWD QDPRSMPWYAPZH with score: 3
## No. of generations: 34, best so far: HKUYMSJAXDERS TLBHA AVGPBEJT with score: 3
## No. of generations: 35, best so far: HKUYMSJAXDSRS TLBHA AVGPBEJT with score: 3
## No. of generations: 36, best so far: HKXYMSJYXDSRS TLBHA AVGPNEJT with score: 3
## No. of generations: 37, best so far: KNGABULYDTELWD QDORSFPWYAPZH with score: 3
## No. of generations: 38, best so far: LLCIZN EOISJ DHFIEGPXNWYMYOX with score: 4
## No. of generations: 39, best so far: LLCIZN EOISJ DHFIEXPXNWYMYOX with score: 4
## No. of generations: 40, best so far: MZN KMIESQRRILELIIILFIGRYRZZ with score: 4
## No. of generations: 41, best so far: ITQXZEKK SENLSCJXAKQ EKNCNUJ with score: 3
## No. of generations: 42, best so far: MELBV VEUBRKXSNHWGILBU JVLZX with score: 3
## No. of generations: 43, best so far: DZNAKMIEOQRRILELIVILKIGVYRZZ with score: 3
## No. of generations: 44, best so far: DZNAKMIEOQRRILELIVILKIGVYRZZ with score: 3
## No. of generations: 45, best so far: LRPDILXMGCWDD ZQD BKANWHMKFI with score: 3
## No. of generations: 46, best so far: KEGAMRLYDAELDDUXLORSFPWOAPLH with score: 3
## No. of generations: 47, best so far: KEGAMRLYDAELDDUXLORSFPWOAPLH with score: 3
## No. of generations: 48, best so far: KEGAMRLYDAELDZUXLORHFPWOAPLH with score: 3
## No. of generations: 49, best so far: KEGAMRLYDAEWDZUXLORHFPWOAPLH with score: 3
## No. of generations: 50, best so far: KEGAMRLYDAEWDZDXLORHFPWOAPLH with score: 3


If this was how evolution really worked it wouldn’t work at all.

Because evolution is a very powerful optimization method there are also real-world applications of so-called genetic algorithms (GA). In the following example we want to find the global optimum of the so-called Rastrigin function. What makes this task especially difficult for this popular test problem is the large number of local minima, as can be seen when plotting the function:

library(GA)
## Package 'GA' version 3.2
## Type 'citation("GA")' for citing this R package in publications.
##
## Attaching package: 'GA'
## The following object is masked from 'package:utils':
##
##     de
Rastrigin <- function(x1, x2) {
20 + x1^2 + x2^2 - 10*(cos(2*pi*x1) + cos(2*pi*x2))
}

x1 <- x2 <- seq(-5.12, 5.12, by = 0.1)
f <- outer(x1, x2, Rastrigin)
persp3D(x1, x2, f, theta = 50, phi = 20)


filled.contour(x1, x2, f, color.palette = bl2gr.colors)


To find the global minimum (spoiler: it is at x1 = x2 = 0) we use the GA package (because GA only maximizes we use the minus sign in front of the fitness function):

set.seed(70)
GA <- ga(type = "real-valued",
fitness =  function(x) -Rastrigin(x[1], x[2]),
lower = c(-5.12, -5.12), upper = c(5.12, 5.12),
maxiter = 1000)
summary(GA)
## -- Genetic Algorithm -------------------
##
## GA settings:
## Type                  =  real-valued
## Population size       =  50
## Number of generations =  1000
## Elitism               =  2
## Crossover probability =  0.8
## Mutation probability  =  0.1
## Search domain =
##          x1    x2
## lower -5.12 -5.12
## upper  5.12  5.12
##
## GA results:
## Iterations             = 1000
## Fitness function value = -3.630204e-07
## Solution =
##               x1           x2
## [1,] 2.81408e-05 3.221658e-05

plot(GA)


filled.contour(x1, x2, f, color.palette = bl2gr.colors, plot.axes = {
axis(1); axis(2); points(GA@solution[ , 1], GA@solution[ , 2], pch = 3, cex = 2, col = "white", lwd = 2)
}
)


Quite impressive, isn’t it! Evolution just works!

In an upcoming post we will use evolutionary methods to find a nice functional form for some noisy data with a method called symbolic regression or genetic programming – so stay tuned!

Update
The post is now online: Symbolic Regression, Genetic Programming… or if Kepler had R