Learning Data Science: Understanding and Using k-means Clustering


A few months ago I published a quite popular post on Clustering the Bible… one well-known clustering algorithm is k-means. If you want to learn how k-means works and how to apply it to a real-world example, read on…

k-means (not to be confused with k-nearest neighbours or KNN: Teach R to read handwritten Digits with just 4 Lines of Code) is a simple, yet often very effective unsupervised learning algorithm to find similarities in large amounts of data and cluster them accordingly. The hyperparameter k stands for the number of clusters which has to be set beforehand.

The guiding principles are:

  • The distance between data points within clusters should be as small as possible.
  • The distance between the centroids (= the centres of the clusters) should be as big as possible.

Because there are far too many possible ways to assign all data points to clusters to check them all, k-means follows an iterative approach:

  1. Initialization: assign clusters randomly to all data points
  2. E-step (for expectation): assign each observation to the “nearest” (based on Euclidean distance) cluster
  3. M-step (for maximization): determine new centroids based on the mean of assigned objects
  4. Repeat steps 2 and 3 until no further changes occur

As can be seen above, k-means is an example of a so-called expectation-maximization (EM) algorithm.

To implement k-means in R we first assign some variables and define a helper function for plotting the steps:

n <- 3 # no. of centroids
set.seed(1415) # set seed for reproducibility

M1 <- matrix(round(runif(100, 1, 5), 1), ncol = 2)
M2 <- matrix(round(runif(100, 7, 12), 1), ncol = 2)
M3 <- matrix(round(runif(100, 20, 25), 1), ncol = 2)
M <- rbind(M1, M2, M3)

C <- M[1:n, ] # define centroids as first n objects
obs <- length(M) / 2
A <- sample(1:n, obs, replace = TRUE) # assign objects to centroids at random
colors <- seq(10, 200, 25) 

clusterplot <- function(M, C, txt) {
  plot(M, main = txt, xlab = "", ylab = "")
  for(i in 1:n) {
    points(C[i, , drop = FALSE], pch = 23, lwd = 3, col = colors[i])
    points(M[A == i, , drop = FALSE], col = colors[i])    
  }
}
clusterplot(M, C, "Initialization")


Here comes the k-means algorithm as described above (the circles are the data points, diamonds are the centroids and the three colours symbolize cluster assignments):

repeat {
  # calculate Euclidean distance between objects and centroids
  D <- matrix(data = NA, nrow = n, ncol = obs)
  for(i in 1:n) {
    for(j in 1:obs) {
      D[i, j] <- sqrt((M[j, 1] - C[i, 1])^2 + (M[j, 2] - C[i, 2])^2)
    }
  }
  O <- A
  
  ## E-step: parameters are fixed, distributions are optimized
  A <- max.col(t(-D)) # assign objects to centroids
  if(all(O == A)) break # if no change stop
  clusterplot(M, C, "E-step")
  
  ## M-step: distributions are fixed, parameters are optimized
  # determine new centroids based on mean of assigned objects
  for(i in 1:n) {
    C[i, ] <- apply(M[A == i, , drop = FALSE], 2, mean)
  }
  clusterplot(M, C, "M-step")
}

As can be seen, the centroids wander slowly but surely until all three clusters are stable. We now compare the result with the kmeans() function in base R:

cl <- kmeans(M, n)
clusterplot(M, cl$centers, "Base R")

(custom <- C[order(C[ , 1]), ])
##        [,1]   [,2]
## [1,]  3.008  2.740
## [2,]  9.518  9.326
## [3,] 22.754 22.396

(base <- cl$centers[order(cl$centers[ , 1]), ])
##     [,1]   [,2]
## 2  3.008  2.740
## 1  9.518  9.326
## 3 22.754 22.396

round(base - custom, 13)
##   [,1] [,2]
## 2    0    0
## 1    0    0
## 3    0    0

As you can see, the result is the same!

Now, for some real-world application: clustering wholesale customer data. The data set refers to clients of a wholesale distributor. It includes the annual spending on diverse product categories and is from the renowned UCI Machine Learning Repository (I guess the category “Delicassen” should rather be “Delicatessen”).

Have a look at the following code:

data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale customers data.csv", header = TRUE)
head(data)
##   Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1       2      3 12669 9656    7561    214             2674       1338
## 2       2      3  7057 9810    9568   1762             3293       1776
## 3       2      3  6353 8808    7684   2405             3516       7844
## 4       1      3 13265 1196    4221   6404              507       1788
## 5       2      3 22615 5410    7198   3915             1777       5185
## 6       2      3  9413 8259    5126    666             1795       1451

set.seed(123)
k <- kmeans(data[ , -c(1, 2)], centers = 4) # remove columns 1 and 2, create 4 clusters
(centers <- k$centers) # display cluster centers
##       Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
## 1  8149.837 18715.857 27756.592 2034.714        12523.020   2282.143
## 2 20598.389  3789.425  5027.274 3993.540         1120.142   1638.398
## 3 48777.375  6607.375  6197.792 9462.792          932.125   4435.333
## 4  5442.969  4120.071  5597.087 2258.157         1989.299   1053.272

round(prop.table(centers, 2) * 100) # percentage of sales per category
##   Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1    10   56      62     11               76         24
## 2    25   11      11     22                7         17
## 3    59   20      14     53                6         47
## 4     7   12      13     13               12         11

table(k$cluster) # number of customers per cluster
## 
##   1   2   3   4 
##  49 113  24 254

One interpretation could be the following for the four clusters:

  1. Big general shops
  2. Small food shops
  3. Big food shops
  4. Small general shops

As you can see, the interpretation of some clusters found by the algorithm can be quite a challenge. If you have a better idea of how to interpret the result please tell me in the comments below!
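
One way to dig a little deeper before settling on an interpretation is to cross-tabulate the cluster assignments with the two columns we removed before clustering. The following is just a sketch, assuming the data frame data and the kmeans object k from above are still in memory (according to the UCI documentation Channel distinguishes Horeca from Retail customers and Region encodes the distribution region):

table(k$cluster, data$Channel) # clusters vs. sales channel
table(k$cluster, data$Region)  # clusters vs. region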

From Coin Tosses to p-Hacking: Make Statistics Significant Again!


One of the most notoriously difficult subjects in statistics is the concept of statistical tests. We will explain the ideas behind it step by step to give you some intuition on how to use (and misuse) it, so read on…

Let us begin with some coin tosses and the question of how to find out whether a coin is fair, i.e. shows heads and tails with fifty-fifty probability. If you had all the time in the world you could toss it indefinitely often to see where the probabilities stabilize. Yet in reality we only have some sample of tosses and have to infer results from it. This is why the area that deals with those problems is called inferential statistics (also inductive statistics).

So, to keep things simple let us throw our coin only ten times. Where is the point where you would suspect that something is not quite right with this coin? As soon as you don’t get five times heads and five times tails? Or only when you get ten times one side only? Possibly somewhere in between? But where exactly?

The first thing to understand is that this is a somewhat arbitrary decision… but one that should be made in advance and that should adhere to some general standard. The idea is to fix a probability threshold beforehand and compare it with the probability of the result you see in your experiment (i.e. the number of heads and tails in our example). When you obtain a result that is as improbable as, or even more improbable than, that pre-fixed probability, you conclude that the result did not happen just by chance but that something significant happened (e.g. that your coin is unfair). The generally accepted probability below which one speaks of a significant result is 5% (or 0.05); it is therefore called the significance level.

After fixing the significance level at 0.05 we have to compare it with the probability of our observed result (or a result that is even more extreme). Let us say that we got nine times one side of the coin and only one time the other. What is the probability of getting this or an even more extreme result? The general probability formula is:

    \[probability\ of\ an\ event = \frac{number\ of\ favorable\ outcomes}{number\ of\ all\ possible\ outcomes}\]

For the numerator we get 22 possibilities for observing the above-mentioned or an even more extreme result: two times ten possibilities that each of the ten tosses could be the odd one out, plus the two even more extreme cases showing only heads or only tails. For the denominator we get 2^{10} possible outcomes for ten coin tosses:

22 / 2^10
## [1] 0.02148438

So, the probability is about 2% which is below 5%, so we would conclude that our coin is not fair!

Another possibility is to use the binomial coefficient, which gives us the number of ways to choose k items out of n items. Conveniently enough the R function is called choose(n, k):

(choose(10, 0) + choose(10, 1) + choose(10, 9) + choose(10, 10))/2^10
## [1] 0.02148438

Or, because the situation is symmetric:

2 * (choose(10, 0) + choose(10, 1))/2^10
## [1] 0.02148438

To make things even simpler we can use the binomial distribution for our calculation, where basically all of the needed binomial coefficients are summed up automatically:

prob <- 0.5 # fair coin
n <- 10     # number of coin tosses
k <- 1      # number of "successes", i.e. how many coins show a different side than the rest

2 * pbinom(k, n, prob)
## [1] 0.02148438

As you can see, we get the same probability value over and over again. If you have understood the line of thought so far you are ready for the last step: performing the statistical test with the binom.test() function:

binom.test(k, n)
## 
##  Exact binomial test
## 
## data:  k and n
## number of successes = 1, number of trials = 10, p-value = 0.02148
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.002528579 0.445016117
## sample estimates:
## probability of success 
##                    0.1

As you can see, we get the same number again: it is named p-value in the output:

The p-value is the probability of getting the observed – or an even more extreme – result under the condition that the null hypothesis is true.

The only thing that is new here is the null hypothesis. The null hypothesis is just a fancy name for the assumption that, in general, things are unremarkable; examples could be: our coin is fair, a new medical drug doesn’t work (its effect is just random noise) or one group of students is not better than the other.

So, after having a look at the output we would formally say that we have a significant result and therefore reject the null hypothesis that the coin is fair.
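
If you want to make this decision explicit in code you can also extract the p-value from the test object and compare it with the significance level directly (a small sketch using the objects from above):

alpha <- 0.05 # significance level
binom.test(k, n)$p.value < alpha # TRUE means: reject the null hypothesis
## [1] TRUE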

In most real-world scenarios the situation is not as clear cut as it is with our coins. Therefore we won’t normally use a binomial test but the workhorse of statistical inference: the famous t-test! On the positive side, you can use the t-test in many situations where you don’t have full information on the underlying population distribution, e.g. its variance. It is based on comparatively mild assumptions, e.g. that the population distribution is normal – and even this can be relaxed (courtesy of the Central Limit Theorem (CLT)) as long as the variance is finite and the sample size is big. If the population is normal, small sample sizes suffice. On the negative side, the t-test often only gives an approximation of an exact test, yet the bigger the sample size the better the approximation. Enough of the theory, let us perform a t-test with our data. We use the t.test() function for that:

data <- c(rep(0, n-k), rep(1, k))
t.test(data, mu = prob)
## 
##  One Sample t-test
## 
## data:  data
## t = -4, df = 9, p-value = 0.00311
## alternative hypothesis: true mean is not equal to 0.5
## 95 percent confidence interval:
##  -0.1262157  0.3262157
## sample estimates:
## mean of x 
##       0.1

As you can see, we get a different p-value this time but the test is still significant and leads to the right conclusion. As said before, the bigger the sample size the better the approximation.
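
If you want to see this effect for yourself you could rerun both tests with a bigger sample, e.g. 100 tosses of which 35 show one side (the numbers here are just for illustration); the two p-values should then lie closer together than in the ten-toss example:

n <- 100 # number of coin tosses
k <- 35  # number of "successes"
data <- c(rep(0, n-k), rep(1, k))
binom.test(k, n)$p.value # exact test
t.test(data, mu = 0.5)$p.value # approximation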

Let us try two more examples. First we use the inbuilt mtcars dataset to test whether the petrol consumption of manual and automatic transmissions differs significantly:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# make new column with better labels
mtcars$trans <- ifelse(mtcars$am == 0, "aut", "man")
mtcars$trans <- factor(mtcars$trans)

mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

mtcars$trans
##  [1] man man man aut aut aut aut aut aut aut aut aut aut aut aut aut aut
## [18] man man man aut aut aut aut aut man man man man man man man
## Levels: aut man

by(mtcars$mpg, mtcars$trans, summary)
## mtcars$trans: aut
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   14.95   17.30   17.15   19.20   24.40 
## -------------------------------------------------------- 
## mtcars$trans: man
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   21.00   22.80   24.39   30.40   33.90

boxplot(mpg ~ trans, data = mtcars)

As we can see the consumption is obviously different – but is it also significantly different at the 5% significance level? Let us find out:

M <- mtcars$trans == "man" #manual
mpg_man <- mtcars[M, ]$mpg #consumption of manual transmission
mpg_man
##  [1] 21.0 21.0 22.8 32.4 30.4 33.9 27.3 26.0 30.4 15.8 19.7 15.0 21.4

mpg_aut <- mtcars[!M, ]$mpg #consumption of automatic transmission
mpg_aut
##  [1] 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7
## [15] 21.5 15.5 15.2 13.3 19.2

t.test(mpg_man, mpg_aut)
## 
##  Welch Two Sample t-test
## 
## data:  mpg_man and mpg_aut
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.209684 11.280194
## sample estimates:
## mean of x mean of y 
##  24.39231  17.14737

Yes, it is significantly different!

And now for our final example, a classic use of the t-test to find out whether a medical drug has a significant effect (we use the inbuilt sleep dataset):

sleep
##    extra group ID
## 1    0.7     1  1
## 2   -1.6     1  2
## 3   -0.2     1  3
## 4   -1.2     1  4
## 5   -0.1     1  5
## 6    3.4     1  6
## 7    3.7     1  7
## 8    0.8     1  8
## 9    0.0     1  9
## 10   2.0     1 10
## 11   1.9     2  1
## 12   0.8     2  2
## 13   1.1     2  3
## 14   0.1     2  4
## 15  -0.1     2  5
## 16   4.4     2  6
## 17   5.5     2  7
## 18   1.6     2  8
## 19   4.6     2  9
## 20   3.4     2 10

by(sleep$extra, sleep$group, summary)
## sleep$group: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.600  -0.175   0.350   0.750   1.700   3.700 
## -------------------------------------------------------- 
## sleep$group: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.100   0.875   1.750   2.330   4.150   5.500

boxplot(extra ~ group, sleep)

Again, there is obviously a difference… but is it significant?

t.test(extra ~ group, sleep)
## 
##  Welch Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33

The p-value is above 0.05 which means that this time it cannot be ruled out that the difference is random!

As you can imagine whenever there is money to be made people get creative on how to game the system… unfortunately statistical tests are no exception. Especially in the medical field huge amounts of money are at stake, so one should be well aware of a manipulation technique called p-hacking (also known under the names data snooping or data dredging)!

To summarize our journey into statistical testing so far: you set a significance level (normally at 0.05) and draw a random sample. You calculate how probable drawing this sample (or an even more extreme one) is and compare that probability with the significance level. If the probability is below the significance level you say that the test shows a significant result, otherwise you say that one cannot rule out that the difference (if there is one) is just due to chance.

Now, if you don’t like the result… why not draw a sample again? And again? And again? This is p-hacking! Just draw a lot of samples and take the one that fits your agenda!

Of course this is a stark misuse of the idea of statistical tests. But the fact remains that sometimes samples will be significant just by chance!

In the following example we “prove” that homeopathy works after all (as we all know it just doesn’t work beyond the so-called placebo effect, as numerous high-quality studies over many decades have shown). To that end we just conduct one hundred trials and pick out one that fits our bill:

p_hack <- function() {
  homeopathy <- rnorm(20, 1.2, 5)
  t.test(homeopathy, mu = 1.2)$p.value
}

set.seed(123)
trials <- replicate(100, p_hack())
trials[6] # homeopathy works!
## [1] 0.033461758

# wait... no it doesn't!
trials
##   [1] 0.522741341 0.785376466 0.624588751 0.587983281 0.057318309
##   [6] 0.033461758 0.876112179 0.461733176 0.495169397 0.519897894
##  [11] 0.800638084 0.836667733 0.915272904 0.340430844 0.087656634
##  [16] 0.879783049 0.824428157 0.784957654 0.898447738 0.494603038
##  [21] 0.279707663 0.438005042 0.886043148 0.158778621 0.167712182
##  [26] 0.135880072 0.185375902 0.662883154 0.840370636 0.331157973
##  [31] 0.283332330 0.423746570 0.849937345 0.503256673 0.546014504
##  [36] 0.725568387 0.727886195 0.167482514 0.586386335 0.493916303
##  [41] 0.937060320 0.271651762 0.236448565 0.191925375 0.983841382
##  [46] 0.373146285 0.520463358 0.323242616 0.193315250 0.757835487
##  [51] 0.429311063 0.284986685 0.868272041 0.844042454 0.885548528
##  [56] 0.996021648 0.978855283 0.588192375 0.495521737 0.610192393
##  [61] 0.242524547 0.404220265 0.136245145 0.004912541 0.408530447
##  [66] 0.458030143 0.784011725 0.357773131 0.818207705 0.698330582
##  [71] 0.451268449 0.740943226 0.266786179 0.120894921 0.050307044
##  [76] 0.387838555 0.232600995 0.834739682 0.669270240 0.516910985
##  [81] 0.273077802 0.291004360 0.369842912 0.132130995 0.454371585
##  [86] 0.567029545 0.219163798 0.524362414 0.737950629 0.855945701
##  [91] 0.959611992 0.901298484 0.931203165 0.204489683 0.843491761
##  [96] 0.567982673 0.329414884 0.107350495 0.911279358 0.696661191

Lo and behold: trial no. 6 shows a significant result – although by design both means are the same (1.2)! So we take this study, publish it and sell lots and lots of useless homeopathy drugs!

How many significant results will we get just by chance on average? Well, exactly the proportion given by our significance level!

trials <- replicate(1e5, p_hack())
round(length(trials[trials < 0.05]) / length(trials), 2)
## [1] 0.05

As always our friends over at xkcd summarize the situation brilliantly:

source: https://xkcd.com/882/

Learning R: Permutations and Combinations with base R


The area of combinatorics, the art of systematic counting, is dreaded territory for many people, so let us bring some light into the matter: in this post we will explain the difference between permutations and combinations, with and without repetition, calculate the number of possibilities and present efficient R code to enumerate all of them, so read on…

The cases we cover in this post:

  • Permutations with repetitions
  • Combinations without repetitions
  • Permutations (of all elements) without repetitions

We won’t cover permutations of only a subset without repetition, nor combinations with repetition, because they are more complicated and beyond the scope of this post. We will perhaps cover those in a later post.

In all cases, you can imagine somebody drawing elements from a set and the different ways to do so. Permutation implies that the order does matter, with combinations it does not (e.g. in a lottery it normally does not matter in which order the numbers are drawn). Without repetition simply means that once an element has been drawn it cannot be drawn again, while with repetition implies that it is replaced and can be drawn again.
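
The difference between drawing with and without repetition can be illustrated directly with the sample() function (a small sketch; the seed is only set for reproducibility):

set.seed(123)
sample(1:5, 3, replace = FALSE) # without repetition: no element can be drawn twice
sample(1:5, 3, replace = TRUE)  # with repetition: elements are put back and can reappear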

Let us start with permutations with repetitions: as an example take a combination lock (should be permutation lock really!) where you have three positions with the numbers zero to nine each. We use the expand.grid() function for enumerating all possibilities:

P_wi <- expand.grid(rep(list(0:9), 3))
head(P_wi)
##   Var1 Var2 Var3
## 1    0    0    0
## 2    1    0    0
## 3    2    0    0
## 4    3    0    0
## 5    4    0    0
## 6    5    0    0

tail(P_wi)
##      Var1 Var2 Var3
## 995     4    9    9
## 996     5    9    9
## 997     6    9    9
## 998     7    9    9
## 999     8    9    9
## 1000    9    9    9

The formula for calculating the number of permutations is simple for obvious reasons (n is the number of elements to choose from, k is the number of actually chosen elements):

    \[n^k.\]

In R:

10^3
## [1] 1000

nrow(P_wi)
## [1] 1000

Next come combinations without repetition: the classic example is a lottery where six out of 49 balls are chosen. We use the combn() function for finding all possibilities:

C_wo <- combn(1:49, 6)
C_wo[ , 1:10]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    1    1    1    1    1    1    1    1     1
## [2,]    2    2    2    2    2    2    2    2    2     2
## [3,]    3    3    3    3    3    3    3    3    3     3
## [4,]    4    4    4    4    4    4    4    4    4     4
## [5,]    5    5    5    5    5    5    5    5    5     5
## [6,]    6    7    8    9   10   11   12   13   14    15

To calculate the number of combinations the binomial coefficient is used:

    \[\binom{n}{k} = \frac{n!}{k! (n-k)!}.\]

To give you some intuition consider the above example: you have 49 possibilities for choosing the first ball, 48 for the second, 47 for the third and so on up to the sixth ball. So that gives 49 \cdot 48 \cdot 47 \cdot 46 \cdot 45 \cdot 44. When you think about it this is the same as \frac{49 \cdot 48 \cdot 47 \cdot 46 \cdot 45 \cdot 44 \cdot...\cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{43 \cdot 42 \cdot 41 \cdot 40 \cdot 39 \cdot 38 \cdot...\cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = \frac{n!}{(n-k)!} because all the factors smaller than 44 cancel when reducing the fraction! Now, there are 6 possible positions for the first ball that is drawn, 5 for the second… and so on, and because the order doesn’t matter we have to divide by 6! = k!, which gives the binomial coefficient. In R we use the choose() function to calculate it:

choose(49, 6)
## [1] 13983816

ncol(C_wo)
## [1] 13983816

So, you see that the probability of winning the lottery is about the same, no matter whether you play it… or not 😉
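
To put this into perspective, the probability of winning with a single ticket is just the reciprocal of the number of combinations:

1 / choose(49, 6)
## [1] 7.151124e-08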

Our last case is permutations (of all elements) without repetition, which is also the most demanding one because there is no readily available function in base R. So we have to write our own:

perm <- function(v) {
  n <- length(v)
  if (n == 1) v
  else {
    X <- NULL
    for (i in 1:n) X <- rbind(X, cbind(v[i], perm(v[-i])))
    X
  }
}

As you can see, it is a recursive function; to understand recursion read my post: To understand Recursion you have to understand Recursion… What makes matters a little bit more complicated is that the recursive call is inside a for loop.

The idea is to fix one element after the other [for (i in 1:n) and cbind(v[i], ...)] and permute the remaining elements [perm(v[-i])] down to the base case when only one element remains [if (n == 1) v], which cannot be permuted any further. Collect all sets on the respective higher level [X <- rbind(X, ...)] and return the whole matrix X.

Let us see this in action; as an example we’ll look at how many different ways there are for four runners to reach the finishing line:

P_wo <- perm(1:4)
P_wo
##       [,1] [,2] [,3] [,4]
##  [1,]    1    2    3    4
##  [2,]    1    2    4    3
##  [3,]    1    3    2    4
##  [4,]    1    3    4    2
##  [5,]    1    4    2    3
##  [6,]    1    4    3    2
##  [7,]    2    1    3    4
##  [8,]    2    1    4    3
##  [9,]    2    3    1    4
## [10,]    2    3    4    1
## [11,]    2    4    1    3
## [12,]    2    4    3    1
## [13,]    3    1    2    4
## [14,]    3    1    4    2
## [15,]    3    2    1    4
## [16,]    3    2    4    1
## [17,]    3    4    1    2
## [18,]    3    4    2    1
## [19,]    4    1    2    3
## [20,]    4    1    3    2
## [21,]    4    2    1    3
## [22,]    4    2    3    1
## [23,]    4    3    1    2
## [24,]    4    3    2    1

After this rather complicated function the calculation of the number of ways is simple: it is just the factorial function (it should again be obvious why):

    \[n!.\]

In R we have the factorial() function:

factorial(4)
## [1] 24

nrow(P_wo)
## [1] 24

As you will see when solving real-world problems with R, the above functions often come in handy, so you should add them to your ever-growing tool set – have fun and stay tuned!

Learning R: Painting with Fire


A few months ago I published a post on recursion: To understand Recursion you have to understand Recursion…. In this post we will see how to use recursion to fill free areas of an image with colour, the caveats of recursion and how to transform a recursive algorithm into a loop-based version using a queue – so read on…

The recursive version of the painting algorithm we want to examine here is very easy to understand; Wikipedia gives the pseudocode of the so-called flood-fill algorithm:


Flood-fill (node, target-color, replacement-color):

  • If target-color is equal to replacement-color, return.
  • If the color of node is not equal to target-color, return.
  • Set the color of node to replacement-color.
  • Perform Flood-fill (one step to the south of node, target-color, replacement-color).
    Perform Flood-fill (one step to the north of node, target-color, replacement-color).
    Perform Flood-fill (one step to the west of node, target-color, replacement-color).
    Perform Flood-fill (one step to the east of node, target-color, replacement-color).
  • Return.

The translation into R couldn’t be any easier:

floodfill <- function(row, col, tcol, rcol) {
  if (tcol == rcol) return()
  if (M[row, col] != tcol) return()
  M[row, col] <<- rcol
  floodfill(row - 1, col    , tcol, rcol) # south
  floodfill(row + 1, col    , tcol, rcol) # north
  floodfill(row    , col - 1, tcol, rcol) # west
  floodfill(row    , col + 1, tcol, rcol) # east
  return("filling completed")
}

We take the image from Wikipedia as an example:

M <- matrix(c(1, 1, 1, 1, 1, 1, 1, 1, 1,
              1, 0, 0, 0, 1, 0, 0, 0, 1,
              1, 0, 0, 0, 1, 0, 0, 0, 1,
              1, 0, 0, 1, 0, 0, 0, 0, 1,
              1, 1, 1, 0, 0, 0, 1, 1, 1,
              1, 0, 0, 0, 0, 1, 0, 0, 1,
              1, 0, 0, 0, 1, 0, 0, 0, 1,
              1, 0, 0, 0, 1, 0, 0, 0, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1), 9, 9)
image(M, col = c(0, 1))

We now fill the three areas with three different colours and then plot the image again:

startrow <- 5; startcol <- 5
floodfill(startrow, startcol, 0, 2)
## [1] "filling completed"

startrow <- 3; startcol <- 3
floodfill(startrow, startcol, 0, 3)
## [1] "filling completed"

startrow <- 7; startcol <- 7
floodfill(startrow, startcol, 0, 4)
## [1] "filling completed"

image(M, col = 1:4)

This seems to work pretty well, but the problem is that the deeper the recursion goes the bigger the call stack has to be – which could lead to overflow errors. One comment on my original post on recursion read:

just keep in mind that recursion is useful in industrial work only if tail optimization is supported. otherwise your code will explode at some indeterminate time in the future. […]

One possibility is to increase the size of the stack with options(expressions = 10000) but even this may not be enough. Therefore we transform our recursive algorithm into a loop-based one and use a queue instead of a stack! The pseudocode from Wikipedia:


Flood-fill (node, target-color, replacement-color):

  • If target-color is equal to replacement-color, return.
  • If color of node is not equal to target-color, return.
  • Set the color of node to replacement-color.
  • Set Q to the empty queue.
  • Add node to the end of Q.
  • While Q is not empty:
    • Set n equal to the first element of Q.
    • Remove first element from Q.
    • If the color of the node to the west of n is target-color,
      set the color of that node to replacement-color and add that node to the end of Q.
    • If the color of the node to the east of n is target-color,
      set the color of that node to replacement-color and add that node to the end of Q.
    • If the color of the node to the north of n is target-color,
      set the color of that node to replacement-color and add that node to the end of Q.
    • If the color of the node to the south of n is target-color,
      set the color of that node to replacement-color and add that node to the end of Q.
  • Continue looping until Q is exhausted.
  • Return.

Because of the way the algorithm fills areas it is also called forest fire. Again, the translation into valid R code is straightforward:

floodfill <- function(row, col, tcol, rcol) {
  if (tcol == rcol) return()
  if (M[row, col] != tcol) return()
  Q <- matrix(c(row, col), 1, 2)
  while (dim(Q)[1] > 0) {
    n <- Q[1, , drop = FALSE]
    west  <- cbind(n[1]    , n[2] - 1)
    east  <- cbind(n[1]    , n[2] + 1)
    north <- cbind(n[1] + 1, n[2]    )
    south <- cbind(n[1] - 1, n[2]    )
    Q <- Q[-1, , drop = FALSE]
    if (M[n] == tcol) {
      M[n] <<- rcol
      if (M[west] == tcol)  Q <- rbind(Q, west)
      if (M[east] == tcol)  Q <- rbind(Q, east)
      if (M[north] == tcol) Q <- rbind(Q, north)
      if (M[south] == tcol) Q <- rbind(Q, south)
    }
  }
  return("filling completed")
}

As an example we will use a much bigger picture (it can be downloaded from here: Unfilledcirc.png):

library(png)
img <- readPNG("data/Unfilledcirc.png")
M <- img[ , , 1]
M <- ifelse(M < 0.5, 0, 1)
M <- rbind(M, 0)
M <- cbind(M, 0)
image(M, col = c(1, 0))

And now for the filling:

startrow <- 100; startcol <- 100
floodfill(startrow, startcol, 0, 2)
## [1] "filling completed"

startrow <- 50; startcol <- 50
floodfill(startrow, startcol, 1, 3)
## [1] "filling completed"

image(M, col = c(1, 0, 2, 3))

As you can see, with this version of the algorithm much bigger areas can be filled!

I also added both R implementations to the respective section of Rosetta Code: Bitmap/Flood fill.

Learning R: The Ultimate Introduction (incl. Machine Learning!)


There are a million reasons to learn R (see e.g. Why R for Data Science – and not Python?), but where to start? I present to you the ultimate introduction to bring you up to speed! So read on…

I call it ultimate because it is the essence of many years of teaching R… or put differently: it is the kind of introduction I would have liked to have when I started out with R back in the day!

A word of warning though: this is an introduction to R and not to statistics, so I won’t explain the statistics terms used here. You do not need to know any other programming language, but it does no harm either. Ok, now let us start!

First you need to install R (https://www.r-project.org) and preferably RStudio as a Graphical User Interface (GUI): https://www.rstudio.com/products/RStudio/#Desktop. Both are free and available for all common operating systems.

To get a quick overview of RStudio watch this video:

You can either type in the following commands in the console or open a new script tab (File -> New File -> R Script) and run the commands by pressing Ctrl + Enter/Return after having typed them.

First of all R is a very good calculator:

2 + 2
## [1] 4

sin(0.5)
## [1] 0.4794255

abs(-10) # absolute value
## [1] 10

pi
## [1] 3.141593

exp(1) # e
## [1] 2.718282

factorial(6)
## [1] 720

By the way: the hash (#) is used for comments; everything after it will be ignored!

Of course you can define variables and use them in your calculations:

n1 <- 2
n2 <- 3
n1 # show content of variable by just typing the name
## [1] 2

n1 + n2
## [1] 5

n1 * n2
## [1] 6

n1^n2
## [1] 8

Part of R’s power stems from the fact that functions can handle whole collections of numbers, called vectors, at once and do calculations on them. When calling a function, arguments are passed within round brackets:

n3 <- c(12, 5, 27) # concatenate (combine) elements into a vector 
n3
## [1] 12  5 27

min(n3)
## [1] 5

max(n3)
## [1] 27

sum(n3)
## [1] 44

mean(n3)
## [1] 14.66667

sd(n3) # standard deviation
## [1] 11.23981

var(n3) # variance
## [1] 126.3333

median(n3)
## [1] 12

n3 / 12
## [1] 1.0000000 0.4166667 2.2500000

In the last example the 12 was recycled three times. R always tries to do that (when feasible), sometimes giving a warning when it might not be intended:

n3 / c(1, 2)
## Warning in n3/c(1, 2): longer object length is not a multiple of shorter
## object length
## [1] 12.0  2.5 27.0

In case you only want parts of your vectors you can use subsetting with square brackets:

n3[1]
## [1] 12

n3[c(2, 3)]
## [1]  5 27

Ranges can easily be created with the colon:

n4 <- 10:20
n4
##  [1] 10 11 12 13 14 15 16 17 18 19 20

When you test whether this vector is bigger than a certain number you will get a vector of logicals as a result. You can use those logicals for subsetting:

n4 > 15
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

n4[n4 > 15]
## [1] 16 17 18 19 20

Perhaps you have heard the story of little Gauss, whose teacher gave him the task of adding all numbers from 1 to 100 to keep him busy for a while? Well, he found a mathematical trick to add them within seconds… we normal people can just use R:

sum(1:100)
## [1] 5050

When we want to use some code several times we can define our own function (a user-defined function). We do that the same way we create a vector (or any other data structure) because R is a so-called functional programming language and functions are so-called first-class citizens (i.e. on the same level as other data structures like vectors). The code to be executed is put in curly brackets:

gauss <- function(x) {
  sum(1:x)
}
gauss(100)
## [1] 5050

gauss(1000)
## [1] 500500
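
Because functions are first-class citizens you can also pass them around like any other object, e.g. hand them to other functions. A small sketch using sapply() and our gauss() function from above:

sapply(c(10, 100, 1000), gauss) # apply gauss to each element of the vector
## [1]     55   5050 500500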

Of course we also have other data types, e.g. matrices, which are basically two-dimensional vectors:

M <- matrix(1:12, nrow = 3, byrow = TRUE) # create a matrix
M
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

dim(M)
## [1] 3 4

Subsetting now has to provide two numbers, the first for the row, the second for the column (like in the game Battleship). If you leave one out, all data of the respective dimension will be shown:

M[2, 3]
## [1] 7

M[ , c(1, 3)]
##      [,1] [,2]
## [1,]    1    3
## [2,]    5    7
## [3,]    9   11

Another possibility to create matrices:

v1 <- 1:4
v2 <- 4:1
M1 <- rbind(v1, v2) # row bind
M1
##    [,1] [,2] [,3] [,4]
## v1    1    2    3    4
## v2    4    3    2    1

M2 <- cbind(v1, v2)  # column bind
M2
##      v1 v2
## [1,]  1  4
## [2,]  2  3
## [3,]  3  2
## [4,]  4  1

Naming rows, here using the built-in LETTERS constant (R also provides the lower-case letters):

rownames(M2) <- LETTERS[1:4]
M2
##   v1 v2
## A  1  4
## B  2  3
## C  3  2
## D  4  1

LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

When some result is Not Available:

LETTERS[50]
## [1] NA

Getting the structure of your variables:

str(LETTERS)
##  chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" ...

str(M2)
##  int [1:4, 1:2] 1 2 3 4 4 3 2 1
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:4] "A" "B" "C" "D"
##   ..$ : chr [1:2] "v1" "v2"

Another famous dataset (iris) that is also built into base R (to get help on any function or dataset just put the cursor in it and press F1):

iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Oops, that is a bit long… if you only want to show the first or last rows do the following:

head(iris) # first 6 rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(iris, 10) # last 10 rows
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 141          6.7         3.1          5.6         2.4 virginica
## 142          6.9         3.1          5.1         2.3 virginica
## 143          5.8         2.7          5.1         1.9 virginica
## 144          6.8         3.2          5.9         2.3 virginica
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

iris is a so-called data frame, the workhorse of R and data science (you will see how to create one below):

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

As you can see, data frames can combine different data types. If you try to do that with e.g. vectors, which can only hold one data type, something called coercion happens, i.e. at least one data type is forced to become another one so that consistency is maintained:

str(c(2, "Hello")) # 2 is coerced to become a character string too
##  chr [1:2] "2" "Hello"

You can get a fast overview of your data like so:

summary(iris[1:4])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

boxplot(iris[1:4])

As you have seen, R often runs a function on all of the data simultaneously. This feature is called vectorization and in many other languages you would need a loop for that. In R you don’t use loops that often, but of course they are available:

for (i in seq(5)) {
  print(1:i)
}
## [1] 1
## [1] 1 2
## [1] 1 2 3
## [1] 1 2 3 4
## [1] 1 2 3 4 5

Speaking of control structures: of course conditional statements are available too:

even <- function(x) ifelse(x %% 2 == 0, TRUE, FALSE) # %% gives remainder of division (= modulo operator)
even(1:5)
## [1] FALSE  TRUE FALSE  TRUE FALSE
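
Besides the vectorized ifelse() the classic if ... else construct is of course also available for non-vectorized conditions; a minimal sketch:

x <- 7
if (x %% 2 == 0) {
  print("even")
} else {
  print("odd")
}
## [1] "odd"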

Linear modelling (e.g. correlation and linear regression) couldn’t be any easier, it is included in the core language:

age <- c(21, 46, 55, 35, 28)
income <- c(1850, 2500, 2560, 2230, 1800)
df <- data.frame(age, income) # create a data frame
df
##   age income
## 1  21   1850
## 2  46   2500
## 3  55   2560
## 4  35   2230
## 5  28   1800

cor(df) # correlation
##              age    income
## age    1.0000000 0.9464183
## income 0.9464183 1.0000000

LinReg <- lm(income ~ age, data = df) # income as a linear model of age
LinReg
## 
## Call:
## lm(formula = income ~ age, data = df)
## 
## Coefficients:
## (Intercept)          age  
##     1279.37        24.56

summary(LinReg)
## 
## Call:
## lm(formula = income ~ age, data = df)
## 
## Residuals:
##       1       2       3       4       5 
##   54.92   90.98  -70.04   91.12 -166.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1279.367    188.510   6.787  0.00654 **
## age           24.558      4.838   5.076  0.01477 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132.1 on 3 degrees of freedom
## Multiple R-squared:  0.8957, Adjusted R-squared:  0.8609 
## F-statistic: 25.77 on 1 and 3 DF,  p-value: 0.01477

plot(df, pch = 16, main = "Linear model")
abline(LinReg, col = "blue", lwd = 2) # adding the regression line

You could directly use the model to make predictions:

pred_LinReg <- predict(LinReg, data.frame(age = seq(15, 70, 5)))
names(pred_LinReg) <- seq(15, 70, 5)
round(pred_LinReg, 2)
##      15      20      25      30      35      40      45      50      55 
## 1647.73 1770.52 1893.31 2016.10 2138.88 2261.67 2384.46 2507.25 2630.04 
##      60      65      70 
## 2752.83 2875.61 2998.40

If you want to know more about the modelling process you can find it here: Learning Data Science: Modelling Basics

Another strength of R is the huge number of add-on packages for all kinds of specialized tasks. For the grand finale of this introduction, we’re gonna get a little taste of machine learning. For that we install the OneR package from CRAN (the official package repository of R), in RStudio via Tools -> Install packages… -> type in “OneR” -> click “Install”.
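
If you prefer code over clicking through menus, the same can be achieved with the following one-liner (you only need to install a package once per R installation):

install.packages("OneR")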

After that we build a simple model on the iris dataset to predict the Species column:

library(OneR) # load package
data <- optbin(Species ~., data = iris) # find optimal bins for numeric predictors
model <- OneR(data, verbose = TRUE) # build actual model
## 
##     Attribute    Accuracy
## 1 * Petal.Width  96%     
## 2   Petal.Length 95.33%  
## 3   Sepal.Length 74.67%  
## 4   Sepal.Width  55.33%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model) # show rules
## 
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
## 
## Rules:
## If Petal.Width = (0.0976,0.791] then Species = setosa
## If Petal.Width = (0.791,1.63]   then Species = versicolor
## If Petal.Width = (1.63,2.5]     then Species = virginica
## 
## Accuracy:
## 144 of 150 instances classified correctly (96%)
## 
## Contingency table:
##             Petal.Width
## Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
##   setosa               * 50            0          0  50
##   versicolor              0         * 48          2  50
##   virginica               0            4       * 46  50
##   Sum                    50           52         48 150
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 266.35, df = 4, p-value < 2.2e-16

plot(model)

We’ll now see how well the model is doing:

prediction <- predict(model, data)
eval_model(prediction, data)
## 
## Confusion matrix (absolute):
##             Actual
## Prediction   setosa versicolor virginica Sum
##   setosa         50          0         0  50
##   versicolor      0         48         4  52
##   virginica       0          2        46  48
##   Sum            50         50        50 150
## 
## Confusion matrix (relative):
##             Actual
## Prediction   setosa versicolor virginica  Sum
##   setosa       0.33       0.00      0.00 0.33
##   versicolor   0.00       0.32      0.03 0.35
##   virginica    0.00       0.01      0.31 0.32
##   Sum          0.33       0.33      0.33 1.00
## 
## Accuracy:
## 0.96 (144/150)
## 
## Error rate:
## 0.04 (6/150)
## 
## Error rate reduction (vs. base rate):
## 0.94 (p-value < 2.2e-16)

96% accuracy is not too bad, even for this simple dataset!

If you want to know more about the OneR package you can read the vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models.

Well, and that’s it for the ultimate introduction to R – hopefully you liked it and you learned something! Please share your first experiences with R in the comments and also if you miss something (I might add it in the future!) – Thank you for reading and stay tuned for more to come!

Was the Bavarian Abitur too hard this time?


Bavaria is known for its famous Oktoberfest… and within Germany also for its reputedly difficult Abitur, a qualification granted by university-preparatory schools in Germany.

A mandatory part for all students is maths. This year many students protested that the maths part was way too hard; they even started an online petition with more than seventy thousand supporters at the time of writing!

It is not clear yet whether their marks will be adjusted upwards; the ministry of education is investigating the case. As a professor in Bavaria who also teaches statistics I will take the opportunity to share with you an actual question from the original examination together with its solution, so read on…

Let us have a look at the first (and easiest) question in the stochastics part:

Every sixth visitor to the Oktoberfest wears a gingerbread heart around his neck. During the festival 25 visitors are chosen at random. Determine the probability that at most one of the selected visitors will have a gingerbread heart.

Before you read on try to solve the task yourself…

Of course students are not allowed to use R in the examination, but in general the good thing about this kind of question is that if you have no idea how to solve it analytically, solving it by simulation is often much easier:

set.seed(12)
N <- 1e7
v <- matrix(sample(c(0L, 1L), size = 25 * N, replace = TRUE, prob = c(5/6, 1/6)), ncol = 25)
sum(rowSums(v) <= 1) / N
## [1] 0.062936

The answer is about 6.3%.

Now for the analytical solution: “At most one” implies that we have to differentiate between two cases, “no gingerbread heart” and “exactly one gingerbread heart”. “No gingerbread heart” has probability \left(\frac{5}{6}\right)^{25}. “Exactly one gingerbread heart” has probability 25\cdot\frac{1}{6}\cdot\left(\frac{5}{6}\right)^{24} because there are 25 positions at which the single gingerbread heart could occur. We have to add both probabilities:

(5/6)^25 + 25*1/6*(5/6)^24
## [1] 0.06289558

If you know a little bit about probability distributions you will recognize the above as the binomial distribution:

pbinom(q = 1, size = 25, prob = 1/6)
## [1] 0.06289558
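As a quick cross-check (a minimal sketch using base R’s binomial functions), the cumulative probability is simply the sum of the two individual cases from above:

dbinom(0, size = 25, prob = 1/6) + dbinom(1, size = 25, prob = 1/6) # P(X = 0) + P(X = 1)
## [1] 0.06289558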

Of course it is a little unfair to judge just on the basis of the easiest task and without knowing the general maths level that is required. But still, I would like to hear your opinion on this. Outsiders’ views from different countries and different educational systems are also very welcome! So, what do you think:

Was the Bavarian Abitur too hard this time? Please leave your reply in the comment section below!

Update (June 7, 2019): The Bavarian Ministry of Education has decided now that the marks will not be subsequently adjusted.

Separating the Signal from the Noise: Robust Statistics for Pedestrians


One of the problems of navigating an autonomous car through a city is extracting robust signals in the face of all the noise present in the different sensors. Just taking something like the arithmetic mean of all the data points could end in catastrophe: if a part of a wall looks similar to the street and the algorithm calculates an average trajectory of the two, the car would leave the road and possibly crash into pedestrians. So we need some robust algorithm to get rid of the noise. The area of statistics that deals with such problems is called robust statistics and the methods used therein are called robust estimation.
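To get a feeling for why the arithmetic mean is so easily thrown off, compare it with the median, a classic robust estimator, on a small made-up sample containing one gross outlier (a minimal sketch; the numbers are purely illustrative):

x <- c(2.1, 2.3, 1.9, 2.2, 2.0, 50) # five sensible readings and one faulty sensor value
mean(x)   # dragged far towards the outlier
## [1] 10.08333
median(x) # barely affected
## [1] 2.15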

Now, one of the problems is that one doesn’t know what is signal and what is noise. The big idea behind RANSAC (short for RAndom SAmple Consensus) is to get rid of outliers by basically taking as many points as possible which form a well-defined region and leaving out the others. It does that iteratively, similar to the famous k-means algorithm (the topic of one of the upcoming posts, so stay tuned…).

To really understand how RANSAC works we will now build it with R. We will take a simple linear regression as an example and make it robust against outliers.

For educational purposes we will do this step by step:

  1. Understanding the general outline of the algorithm.
  2. Looking at the steps in more detail.
  3. Expressing the steps in pseudocode.
  4. Translating this into R!

Conveniently enough, Wikipedia gives a good outline of the algorithm and even provides very clear pseudocode, which will serve as the basis for our own R implementation (the Matlab code given there is not very good in my opinion and has nothing to do with the pseudocode):

The RANSAC algorithm is essentially composed of two steps that are iteratively repeated:

  1. In the first step, a sample subset containing minimal data items is randomly selected from the input dataset. A fitting model and the corresponding model parameters are computed using only the elements of this sample subset. The cardinality of the sample subset is the smallest sufficient to determine the model parameters.
  2. In the second step, the algorithm checks which elements of the entire dataset are consistent with the model instantiated by the estimated model parameters obtained from the first step. A data element will be considered as an outlier if it does not fit the fitting model instantiated by the set of estimated model parameters within some error threshold that defines the maximum deviation attributable to the effect of noise. The set of inliers obtained for the fitting model is called consensus set. The RANSAC algorithm will iteratively repeat the above two steps until the obtained consensus set in certain iteration has enough inliers.

In more detail:

RANSAC achieves its goal by repeating the following steps:

  1. Select a random subset of the original data. Call this subset the hypothetical inliers.
  2. A model is fitted to the set of hypothetical inliers.
  3. All other data are then tested against the fitted model. Those points that fit the estimated model well, according to some model-specific loss function, are considered as part of the consensus set.
  4. The estimated model is reasonably good if sufficiently many points have been classified as part of the consensus set.
  5. Afterwards, the model may be improved by reestimating it using all members of the consensus set.

This procedure is repeated a fixed number of times, each time producing either a model which is rejected because too few points are part of the consensus set, or a refined model together with a corresponding consensus set size. In the latter case, we keep the refined model if its consensus set is larger than the previously saved model.

Now, this can be expressed in pseudocode:

Given:
    data - a set of observed data points
    model - a model that can be fitted to data points
    n - minimum number of data points required to fit the model
    k - maximum number of iterations allowed in the algorithm
    t - threshold value to determine when a data point fits a model
    d - number of close data points required to assert that a model fits well to data

Return:
    bestfit - model parameters which best fit the data (or nul if no good model is found)

iterations = 0
bestfit = nul
besterr = something really large
while iterations < k {
    maybeinliers = n randomly selected values from data
    maybemodel = model parameters fitted to maybeinliers
    alsoinliers = empty set
    for every point in data not in maybeinliers {
        if point fits maybemodel with an error smaller than t
             add point to alsoinliers
    }
    if the number of elements in alsoinliers is > d {
        % this implies that we may have found a good model
        % now test how good it is
        bettermodel = model parameters fitted to all points in maybeinliers and alsoinliers
        thiserr = a measure of how well model fits these points
        if thiserr < besterr {
            bestfit = bettermodel
            besterr = thiserr
        }
    }
    increment iterations
}
return bestfit

It is quite easy to convert this into valid R code (as a learner of R you should try it yourself before looking at my solution!):

ransac <- function(data, n, k, t, d) {
  iterations <- 0
  bestfit <- NULL
  besterr <- Inf # "something really large"
  while (iterations < k) {
    # randomly select a minimal subset and fit a model to it
    maybeinliers <- sample(nrow(data), n)
    maybemodel <- lm(y ~ x, data = data, subset = maybeinliers)
    alsoinliers <- NULL
    a <- maybemodel$coefficients[2] # slope
    b <- maybemodel$coefficients[1] # intercept
    # test all remaining points against the fitted line
    for (point in setdiff(1:nrow(data), maybeinliers)) {
      # perpendicular distance of the point from the line y = b + a*x
      if (abs(a * data[point, 1] - data[point, 2] + b) / sqrt(a^2 + 1) < t)
        alsoinliers <- c(alsoinliers, point)
    }
    # enough points in the consensus set? then refit on all of them
    if (length(alsoinliers) > d) {
      bettermodel <- lm(y ~ x, data = data, subset = c(maybeinliers, alsoinliers))
      thiserr <- summary(bettermodel)$sigma # residual standard error as the error measure
      if (thiserr < besterr) {
        bestfit <- bettermodel
        besterr <- thiserr
      }
    }
    iterations <- iterations + 1
  }
  bestfit
}

We now test this with some sample data:

data <- read.csv("data/RANSAC.csv")
plot(data)
abline(lm(y ~ x, data = data))
set.seed(3141)
abline(ransac(data, n = 10, k = 10, t = 0.5, d = 10), col = "blue")

The black line is a plain vanilla linear regression, the blue line is the RANSAC-enhanced version. As you can see: no more crashing into innocent pedestrians. 🙂
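By the way, if you don’t have the csv-file at hand, you can create similar toy data yourself and rerun the commands from above on it (a sketch with made-up numbers; the column names x and y are what the ransac function above expects):

set.seed(123)
x <- runif(100, 0, 10)
y <- 2 * x + 1 + rnorm(100, sd = 0.3) # points along the "road"
x_out <- runif(20, 6, 10)
y_out <- runif(20, 0, 5)              # a block of off-road outliers ("the wall")
toy <- data.frame(x = c(x, x_out), y = c(y, y_out))

plot(toy)
abline(lm(y ~ x, data = toy))
abline(ransac(toy, n = 10, k = 50, t = 0.5, d = 30), col = "blue")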

Learning Data Science: Predicting Income Brackets


As promised in the post Learning Data Science: Modelling Basics we will now go a step further and try to predict income brackets with real world data and different modelling approaches. We will learn a thing or two along the way, e.g. about the so-called Accuracy-Interpretability Trade-Off, so read on…

The data we will use is from here: Marketing data set. The description reads:

This dataset contains questions from questionnaires that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the Annual Income of Household from the other 13 demographics attributes.

The following extra information (or metadata) is provided with the data:

cat(readLines('data/marketing-names.txt'), sep = '\n')
Marketing data set
1: Description.
This dataset contains questions from questionaries that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the Anual Income of Household from the other 13 demographics  attributes.
2: Type.            Classification
3: Origin.          Real world
4: Instances.       6876 (8993)
5: Features.        14
6: Classes          9
7: Missing values.  Yes
8: Header.
@relation marketing
@attribute Sex integer [1, 2]
@attribute MaritalStatus integer [1, 5]
@attribute Age integer [1, 7]
@attribute Education integer [1, 6]
@attribute Occupation integer [1, 9]
@attribute YearsInSf integer [1, 5]
@attribute DualIncome integer [1, 3]
@attribute HouseholdMembers integer [1, 9]
@attribute Under18 integer [0, 9]
@attribute HouseholdStatus integer [1, 3]
@attribute TypeOfHome integer [1, 5]
@attribute EthnicClass integer [1, 8]
@attribute Language integer [1, 3]
@attribute Income {1, 2, 3, 4, 5, 6, 7, 8, 9}
@inputs Sex, MaritalStatus, Age, Education, Occupation, YearsInSf, DualIncome, HouseholdMembers, Under18, HouseholdStatus, TypeOfHome, EthnicClass, Language
@outputs Income
DATA DICTIONARY
 1    ANNUAL INCOME OF HOUSEHOLD (PERSONAL INCOME IF SINGLE)
             1. Less than $10,000
             2. $10,000 to $14,999
             3. $15,000 to $19,999
             4. $20,000 to $24,999
             5. $25,000 to $29,999
             6. $30,000 to $39,999
             7. $40,000 to $49,999
             8. $50,000 to $74,999
             9. $75,000 or more
             
 2    SEX
             1. Male
             2. Female
  3    MARITAL STATUS
             1. Married
             2. Living together, not married
             3. Divorced or separated
             4. Widowed
             5. Single, never married
  4    AGE
             1. 14 thru 17
             2. 18 thru 24
             3. 25 thru 34
             4. 35 thru 44
             5. 45 thru 54
             6. 55 thru 64
             7. 65 and Over
  5    EDUCATION
             1. Grade 8 or less
             2. Grades 9 to 11
             3. Graduated high school
             4. 1 to 3 years of college
             5. College graduate
             6. Grad Study
 6    OCCUPATION
             1. Professional/Managerial
             2. Sales Worker
             3. Factory Worker/Laborer/Driver
             4. Clerical/Service Worker
             5. Homemaker
             6. Student, HS or College
             7. Military
             8. Retired
             9. Unemployed
  7    HOW LONG HAVE YOU LIVED IN THE SAN FRAN./OAKLAND/SAN JOSE AREA?
             1. Less than one year
             2. One to three years
             3. Four to six years
             4. Seven to ten years
             5. More than ten years
  8    DUAL INCOMES (IF MARRIED)
             1. Not Married
             2. Yes
             3. No
  9    PERSONS IN YOUR HOUSEHOLD
             1. One
             2. Two
             3. Three
             4. Four
             5. Five
             6. Six
             7. Seven
             8. Eight
             9. Nine or more

 10    PERSONS IN HOUSEHOLD UNDER 18
             0. None
             1. One
             2. Two
             3. Three
             4. Four
             5. Five
             6. Six
             7. Seven
             8. Eight
             9. Nine or more
 11    HOUSEHOLDER STATUS
             1. Own
             2. Rent
             3. Live with Parents/Family
 12    TYPE OF HOME
             1. House
             2. Condominium
             3. Apartment
             4. Mobile Home
             5. Other
 13    ETHNIC CLASSIFICATION
             1. American Indian
             2. Asian
             3. Black
             4. East Indian
             5. Hispanic
             6. Pacific Islander
             7. White
             8. Other
  14    WHAT LANGUAGE IS SPOKEN MOST OFTEN IN YOUR HOME?
             1. English
             2. Spanish
             3. Other

Our task is to predict the variable “Income”.

So, let us first load the data (you can find the correctly formatted csv-file here: marketing.csv), have a look at some of its characteristics and perform a little bit of additional formatting. After that we divide it into a training (80%) and a test set (20%) to account for potential overfitting (also see Learning Data Science: Modelling Basics):

data <- read.csv("data/marketing.csv")
dim(data)
## [1] 6876   14

str(data)
## 'data.frame':    6876 obs. of  14 variables:
##  $ Sex             : int  1 2 2 2 1 1 1 1 1 1 ...
##  $ MaritalStatus   : int  1 1 5 5 1 5 3 1 1 5 ...
##  $ Age             : int  5 3 1 1 6 2 3 6 7 2 ...
##  $ Education       : int  5 5 2 2 4 3 4 3 4 4 ...
##  $ Occupation      : int  5 1 6 6 8 9 3 8 8 9 ...
##  $ YearsInSf       : int  5 5 5 3 5 4 5 5 4 5 ...
##  $ DualIncome      : int  3 2 1 1 3 1 1 3 3 1 ...
##  $ HouseholdMembers: int  5 3 4 4 2 3 1 3 2 1 ...
##  $ Under18         : int  2 1 2 2 0 1 0 0 0 0 ...
##  $ HouseholdStatus : int  1 2 3 3 1 2 2 2 2 2 ...
##  $ TypeOfHome      : int  1 3 1 1 1 3 3 3 3 3 ...
##  $ EthnicClass     : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ Language        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Income          : int  9 9 1 1 8 1 6 2 4 1 ...

data_names <- names(data)
data <- cbind(data[-ncol(data)], factor(data$Income)) # make variable Income (which should be predicted) a factor
names(data) <- data_names

set.seed(12)
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- data[random, ]
data_test <- data[-random, ]

We start with a simple but comprehensible model, OneR (on CRAN), as a benchmark:

library(OneR)
data <- optbin(data_train)
model <- OneR(data, verbose = TRUE)
## 
##     Attribute        Accuracy
## 1 * Age              28.2%   
## 2   MaritalStatus    28.11%  
## 3   Occupation       28.07%  
## 4   HouseholdStatus  27.56%  
## 5   DualIncome       27.04%  
## 6   Education        25.98%  
## 7   HouseholdMembers 22.51%  
## 8   Under18          20.69%  
## 9   TypeOfHome       19.36%  
## 10  EthnicClass      19.29%  
## 11  Sex              18.07%  
## 12  Language         17.82%  
## 13  YearsInSf        17.75%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model)
## 
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
## 
## Rules:
## If Age = 1 then Income = 1
## If Age = 2 then Income = 1
## If Age = 3 then Income = 6
## If Age = 4 then Income = 8
## If Age = 5 then Income = 8
## If Age = 6 then Income = 8
## If Age = 7 then Income = 6
## 
## Accuracy:
## 1551 of 5500 instances classified correctly (28.2%)
## 
## Contingency table:
##       Age
## Income     1     2     3     4     5    6    7  Sum
##    1   * 421 * 352    99    43    21   15   25  976
##    2      16   204   107    39    13   22   33  434
##    3       9   147   122    49    12   21   35  395
##    4       5   121   188    71    39   29   42  495
##    5       3    77   179    81    29   23   34  426
##    6      10    93 * 234   156    70   56 * 47  666
##    7      12    92   185   155   107   66   33  650
##    8      12   111   211 * 251 * 160 * 86   44  875
##    9      11    76   114   187   104   69   22  583
##    Sum   499  1273  1439  1032   555  387  315 5500
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 2671.2, df = 48, p-value < 2.2e-16

plot(model)

prediction <- predict(model, data_test)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1    232   45   46   32   33   27   19   27   24  485
##        2      0    0    0    0    0    0    0    0    0    0
##        3      0    0    0    0    0    0    0    0    0    0
##        4      0    0    0    0    0    0    0    0    0    0
##        5      0    0    0    0    0    0    0    0    0    0
##        6     31   30   44   44   41   66   44   57   50  407
##        7      0    0    0    0    0    0    0    0    0    0
##        8     16   20   20   47   27   87   71  110   86  484
##        9      0    0    0    0    0    0    0    0    0    0
##        Sum  279   95  110  123  101  180  134  194  160 1376
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1   0.17 0.03 0.03 0.02 0.02 0.02 0.01 0.02 0.02 0.35
##        2   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        3   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        4   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        5   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        6   0.02 0.02 0.03 0.03 0.03 0.05 0.03 0.04 0.04 0.30
##        7   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        8   0.01 0.01 0.01 0.03 0.02 0.06 0.05 0.08 0.06 0.35
##        9   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00
## 
## Accuracy:
## 0.2965 (408/1376)
## 
## Error rate:
## 0.7035 (968/1376)
## 
## Error rate reduction (vs. base rate):
## 0.1176 (p-value < 2.2e-16)

What can we conclude from this? First, the most important feature is “Age”, while “Marital Status”, “Occupation” and “Household Status” perform comparably well. The overall accuracy (i.e. the percentage of correctly predicted instances) of about 30% is not that great, but on the other hand not that uncommon when dealing with real-world data. Looking at the model itself, in the form of rules and the diagnostic plot, we can see that there is a non-linear relationship between Age and Income: the older one gets the higher the income bracket, except for people who are old enough to retire. This seems plausible.

OneR is basically a one-step decision tree, so the natural next step in our analysis is to build a full decision tree (all packages are on CRAN):

library(rpart)
library(rpart.plot)
model <- rpart(Income ~., data = data_train)
rpart.plot(model, type = 3, extra = 0, box.palette = "Grays")

prediction <- predict(model, data_test, type = "class")
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1    201   36   22   13   16   12    8   15   12  335
##        2     43   25   32   22   17   12   10   14    6  181
##        3      0    0    0    0    0    0    0    0    0    0
##        4      0    0    0    0    0    0    0    0    0    0
##        5      0    0    0    0    0    0    0    0    0    0
##        6     18   24   40   50   42   68   32   33   22  329
##        7      0    0    0    0    0    0    0    0    0    0
##        8     17   10   16   38   26   88   84  132  120  531
##        9      0    0    0    0    0    0    0    0    0    0
##        Sum  279   95  110  123  101  180  134  194  160 1376
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1   0.15 0.03 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.24
##        2   0.03 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.00 0.13
##        3   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        4   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        5   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        6   0.01 0.02 0.03 0.04 0.03 0.05 0.02 0.02 0.02 0.24
##        7   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        8   0.01 0.01 0.01 0.03 0.02 0.06 0.06 0.10 0.09 0.39
##        9   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00
## 
## Accuracy:
## 0.3096 (426/1376)
## 
## Error rate:
## 0.6904 (950/1376)
## 
## Error rate reduction (vs. base rate):
## 0.134 (p-value < 2.2e-16)

The new model’s accuracy is a little better but its interpretability is a little worse: you have to go through the different branches to see in which income bracket a case ends up. For example, when the age bracket is below 2 (which means it is 1) the model predicts income bracket 1; when it is bigger than 2 and the Household Status bracket is 1 it predicts income bracket 8, and so on. The most important variable, as you can see, is again Age (which is reassuring, since OneR was on the right track), but we also see Household Status and Occupation again.
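If tracing the branches in the plot feels tedious, the rpart.plot package can also print the fitted tree as a set of rules, one line per leaf (a short sketch, assuming the rpart model object from above and a version of rpart.plot that provides rpart.rules, i.e. 2.0 or later):

rpart.rules(model) # each row shows a predicted income bracket and the conditions leading to it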

What is better than one tree? Many trees! So the next natural step is to build many trees, while varying the features and the examples that are included in each tree. In the end different trees may give different income brackets as their respective predictions, which we resolve by voting, as in a good democracy. This whole method is fittingly called Random Forests and fortunately there are many good packages available in R. We use the randomForest package (also on CRAN) here:

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.

set.seed(4543) # for reproducibility
model <- randomForest(Income ~., data = data_train, importance = TRUE)
varImpPlot(model)

prediction <- predict(model, data_test)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1    223   35   26   16   19   11    9   18   16  373
##        2     24   15   12   20    7   11    1    4    1   95
##        3      9   10   11   14    9    6    3    4    2   68
##        4      5   15   25   22   10   22    6    9    5  119
##        5      2    2    8    9    6   12    6    3    1   49
##        6      3    5   15   14   19   40   23   17   15  151
##        7      8    4    7   13   14   26   25   24    5  126
##        8      3    8    5   11   13   44   49   87   68  288
##        9      2    1    1    4    4    8   12   28   47  107
##        Sum  279   95  110  123  101  180  134  194  160 1376
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1   0.16 0.03 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.27
##        2   0.02 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.07
##        3   0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.05
##        4   0.00 0.01 0.02 0.02 0.01 0.02 0.00 0.01 0.00 0.09
##        5   0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.04
##        6   0.00 0.00 0.01 0.01 0.01 0.03 0.02 0.01 0.01 0.11
##        7   0.01 0.00 0.01 0.01 0.01 0.02 0.02 0.02 0.00 0.09
##        8   0.00 0.01 0.00 0.01 0.01 0.03 0.04 0.06 0.05 0.21
##        9   0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.03 0.08
##        Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00
## 
## Accuracy:
## 0.3459 (476/1376)
## 
## Error rate:
## 0.6541 (900/1376)
## 
## Error rate reduction (vs. base rate):
## 0.1796 (p-value < 2.2e-16)

As an aside, you can see that the basic modelling workflow stayed the same no matter which approach (OneR, decision tree or random forest) we chose. This standard is kept by most (modern) packages and is one of the great strengths of R! With thousands of packages on CRAN alone there are of course sometimes differences, but those are normally well documented (so the help function or the vignette are your friends!).

Now, back to the result of the last analysis: again, the overall accuracy is better, but because we now have hundreds of trees the interpretability has suffered a lot. This is also known as the Accuracy-Interpretability Trade-Off. The best we can do in the case of random forests is to find out which features are the most important ones. Again Age, Occupation and Household Status show up, which is consistent with our analyses so far. Additionally, because of the many trees that had to be built, this analysis ran a lot longer than the other two.
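Besides the plot produced by varImpPlot above, the importance measures can also be inspected numerically (a short sketch, assuming the randomForest model object from above, which was fitted with importance = TRUE):

imp <- importance(model) # matrix of importance measures, one row per feature
imp[order(imp[ , "MeanDecreaseGini"], decreasing = TRUE), c("MeanDecreaseAccuracy", "MeanDecreaseGini")]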

Random forests are one of the best out-of-the-box methods, so the accuracy of about 34% tells us that our first model (OneR) wasn’t doing too badly in the first place: we were able to reach comparatively good results with just one feature. This is true for many real-world data sets. Sometimes simple models are not much worse than very complicated ones – you should keep that in mind!

If you play around with this dataset I would be interested in your results! Please post them in the comments – Thank you and stay tuned!

Learning R: The Collatz Conjecture


In this post we will see that a little bit of simple R code can go a very long way! So let’s get started!

One of the fascinating features of number theory (unlike many other branches of mathematics) is that many of its statements are easy to make, yet the brightest minds have not been able to prove them. The so-called Collatz conjecture (named after the German mathematician Lothar Collatz) is an especially striking example:


The Collatz conjecture states that when you start with any positive integer and repeatedly apply the following rules:

  • if the number is even, the next number is one half of the previous number,
  • if it is odd, the next number is three times the previous number plus one,

then the sequence will always eventually reach one.


It doesn’t get any simpler than that, but no one has been able to prove it – and not for a lack of trying! The great mathematician Paul Erdős said about it: “Mathematics may not be ready for such problems.” You can read more on Wikipedia: Collatz conjecture, and watch an especially nice film that was made by a group of students:

So let us write a little program and try some numbers!

First we need a simple helper function to determine whether a number is even:

is.even <- function(x) {
  if (x %% 2 == 0) TRUE
  else FALSE
}

is.even(2)
## [1] TRUE

is.even(3)
## [1] FALSE

Normally we wouldn’t use a dot within function names, but R itself (because of its legacy code) is not totally consistent here: the is-functions (like is.na or is.integer) all use a dot. After that we write a function for the rule itself, making use of the is.even function:

collatz <- function(n) {
  if (is.even(n)) n/2
  else 3 * n + 1
}

collatz(6)
## [1] 3

collatz(5)
## [1] 16

To try a number and plot it (like in the Wikipedia article) we could use a while-loop:

n_total <- n <- 27
while (n != 1) {
  n <- collatz(n)
  n_total <- c(n_total, n)
}

n_total
##   [1]   27   82   41  124   62   31   94   47  142   71  214  107  322  161
##  [15]  484  242  121  364  182   91  274  137  412  206  103  310  155  466
##  [29]  233  700  350  175  526  263  790  395 1186  593 1780  890  445 1336
##  [43]  668  334  167  502  251  754  377 1132  566  283  850  425 1276  638
##  [57]  319  958  479 1438  719 2158 1079 3238 1619 4858 2429 7288 3644 1822
##  [71]  911 2734 1367 4102 2051 6154 3077 9232 4616 2308 1154  577 1732  866
##  [85]  433 1300  650  325  976  488  244  122   61  184   92   46   23   70
##  [99]   35  106   53  160   80   40   20   10    5   16    8    4    2    1

plot(n_total, type = "l", col = "blue", xlab = "", ylab = "")
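If you want to experiment with other starting values it is convenient to wrap the loop from above into a little helper function (a sketch; the function name collatz_seq is my own):

collatz_seq <- function(n) {
  n_total <- n
  while (n != 1) {
    n <- collatz(n)
    n_total <- c(n_total, n)
  }
  n_total
}

length(collatz_seq(27)) # length of the sequence for 27, including the starting value
## [1] 112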

As you can see in the plot above, after a wild ride the sequence finally reaches one, as expected. We end with some nerd humour from the cult website xkcd:

Source: xkcd

To understand Recursion you have to understand Recursion…


Sorting values is one of the bread-and-butter tasks in computer science; this post uses it as a case study to learn what recursion is all about. It starts with some nerd humour… and ends with some more, so read on!

Have a look at the following code and try to understand it:

bogosort <- function(x) {
  while(is.unsorted(x)) x <- sample(x)
  x
}

set.seed(123)
(n <- sample(1:100, 10))
##  [1] 29 79 41 86 91  5 50 83 51 42

bogosort(n)
##  [1]  5 29 41 42 50 51 79 83 86 91

Obviously bogosort did its job… but in a ludicrously inefficient way! It just shuffles all values over and over until they are sorted by chance! In the worst case it just shuffles forever (i.e. until the sun swallows the earth or your partner switches the computer off… whatever comes first).
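A quick back-of-the-envelope calculation shows just how hopeless this is (a sketch, assuming all values are distinct): each shuffle happens to be sorted with probability 1/n!, so on average you need about n! shuffles. For our ten numbers that is already:

factorial(10) # expected number of shuffles for 10 distinct values
## [1] 3628800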

To create a more efficient sorting algorithm one could use the following strategy:

  1. Draw the smallest number, put it into a result vector, and
  2. sort the remaining numbers by going back to step 1.

The algorithm (which is called selection sort, see below) effectively sorts by calling itself! This is the main idea behind recursion.

To dive deeper into how recursion works watch the following video:

So, basically

Recursion means defining a function in terms of itself!

Additionally you need some stopping criterion, or base case, otherwise the recursion would never terminate.

The recursive version of the factorial function is:

    \[1!=1, n!=n(n-1)!\]

In the video this formula is written in C; here is the R version:

fact <- function(n) {
  if (n == 1) 1
  else n * fact(n-1)
}
fact(4)
## [1] 24

factorial(4) #inbuilt factorial function
## [1] 24

After we have seen how recursion actually works, let us implement the above selection sort algorithm in R. First watch this, admittedly unconventional, video:

So, the main idea behind the selection sort algorithm is to take the smallest number each time and add it to the sorted subset on the left. The sorting function calls itself for the unsorted subset on the right.

We now implement this idea recursively:

selsort <- function(x) {
  if(length(x) > 1) {
    mini <- which.min(x)
    c(x[mini], selsort(x[-mini]))
  } else x
}

set.seed(123)
(v <- sample(1:100, 20))
##  [1] 29 79 41 86 91  5 50 83 51 42 87 98 60 94  9 77 21  4 27 78

selsort(v)
##  [1]  4  5  9 21 27 29 41 42 50 51 60 77 78 79 83 86 87 91 94 98

Another, much more efficient sorting algorithm, quicksort, also works recursively. Watch the following video to get the idea:

So, the main idea behind the quicksort algorithm is to find a pivot for the unsorted set of numbers which splits it into two similarly sized subsets, on the left the numbers that are smaller than the pivot, on the right the numbers that are bigger. Then the sorting function calls itself for each subset.

Have a look at the following code:

qsort <- function(v) {
  if (length(v) > 1) {
    pivot <- (min(v) + max(v)) / 2 # midpoint of the range serves as pivot
    # recursively sort the values below and above the pivot
    c(qsort(v[v < pivot]), v[v == pivot], qsort(v[v > pivot]))
  } else v
}

qsort(v)
##  [1]  4  5  9 21 27 29 41 42 50 51 60 77 78 79 83 86 87 91 94 98

Now for some speed comparisons between selection sort, quicksort and R’s sort function (on my lame computer…):

options(expressions = 7500)
N <- 7000

set.seed(123)
vs <- runif(N)

system.time(u <- selsort(vs))
##    user  system elapsed 
##    0.59    0.14    0.73

system.time(u <- qsort(vs))
##    user  system elapsed 
##    0.01    0.00    0.01

system.time(u <- sort(vs))
##    user  system elapsed 
##       0       0       0

Wow, although quicksort was implemented in R (and not in C like the sort function) it is impressively fast. Why? Because each subset of unsorted numbers is again split into two subsets of only about half the size. This pushes the sizes of the subsets that still have to be sorted down pretty fast. In the case of selection sort the subset only gets smaller by one number (the smallest one) in each step.
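A quick back-of-the-envelope calculation (just a sketch) makes the difference concrete: halving the problem at each level only needs about log2(N) levels of recursion, whereas removing one element at a time needs about N recursive calls (presumably why the expressions option had to be raised above):

N <- 7000
log2(N) # rough recursion depth of quicksort
## [1] 12.77321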

Let us end this post with a little easter egg from Google – do you get it?