From Coin Tosses to p-Hacking: Make Statistics Significant Again!


One of the most notoriously difficult subjects in statistics is the concept of statistical tests. We will explain the ideas behind them step by step to give you some intuition on how to use (and misuse) them, so read on…

Let us begin with some coin tosses and the question of how to find out whether a coin is fair, i.e. whether it shows heads and tails with fifty-fifty probability. If you had all the time in the world you could toss it indefinitely and watch where the relative frequencies stabilize. In reality, however, we only have a sample of tosses and have to infer results from it. That is why the area that deals with such problems is called inferential statistics (also inductive statistics).

So, to keep things simple, let us toss our coin only ten times. At what point would you suspect that something is not quite right with this coin? As soon as you don’t get exactly five heads and five tails? Or only when all ten tosses show the same side? Possibly somewhere in between? But where exactly?

The first thing to understand is that this is a somewhat arbitrary decision… but one that should be set in advance and that should adhere to a general standard. The idea is to fix a threshold probability against which you compare the probability of the result you see in your experiment (i.e. the number of heads and tails in our example). When you obtain a result that is at least as improbable as that pre-fixed threshold, you conclude that it did not happen just by chance but that something significant is going on (e.g. that your coin is unfair). The generally accepted threshold below which one speaks of a significant result is 5% (or 0.05); it is therefore called the significance level.

After fixing the significance level at 0.05 we have to compare it with the probability of our observed result (or a result that is even more extreme). Let us say that we got one side of the coin nine times and the other side only once. What is the probability of getting this or an even more extreme result? The general probability formula is:

    \[probability\ of\ an\ event = \frac{number\ of\ favorable\ outcomes}{number\ of\ all\ possible\ outcomes}\]

For the numerator we get 22 possibilities of observing the above-mentioned or an even more extreme result: ten possibilities for the position of the single head among nine tails, another ten for the position of the single tail among nine heads, plus the two even more extreme cases of only heads or only tails. For the denominator we get 2^{10} possibilities for ten coin tosses:

22 / 2^10
## [1] 0.02148438
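
If you want to convince yourself that the numerator really is 22, you can enumerate all 2^{10} possible toss sequences and count the extreme ones directly. A small sanity check (not part of the original derivation, the object names are just for illustration):

tosses <- expand.grid(rep(list(0:1), 10)) # all 2^10 sequences of ten tosses (1 = heads)
heads <- rowSums(tosses)                  # number of heads in each sequence
sum(heads <= 1 | heads >= 9)              # sequences at least as extreme as a 9:1 split
# this evaluates to 22, the numerator used above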

So, the probability is about 2% which is below 5%, so we would conclude that our coin is not fair!

Another possibility is to use the binomial coefficient, which gives us the number of ways to choose k items out of n items. Conveniently enough, the R function is called choose(n, k):

(choose(10, 0) + choose(10, 1) + choose(10, 9) + choose(10, 10))/2^10
## [1] 0.02148438

Or, because the situation is symmetric:

2 * (choose(10, 0) + choose(10, 1))/2^10
## [1] 0.02148438

To make things even simpler we can use the binomial distribution for our calculation, where basically all of the needed binomial coefficients are summed up automatically:

prob <- 0.5 # fair coin
n <- 10     # number of coin tosses
k <- 1      # number of "successes", i.e. how many coins show a different side than the rest

2 * pbinom(k, n, prob)
## [1] 0.02148438

As you can see, we get the same probability value over and over again. If you have understood the line of thought so far you are ready for the last step: performing the statistical test with the binom.test() function:

binom.test(k, n)
## 
##  Exact binomial test
## 
## data:  k and n
## number of successes = 1, number of trials = 10, p-value = 0.02148
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.002528579 0.445016117
## sample estimates:
## probability of success 
##                    0.1

As you can see, we get the same number again: it is called the p-value in the output:

The p-value is the probability of getting the observed – or an even more extreme – result under the condition that the null hypothesis is true.

The only thing that is new here is the null hypothesis. The null hypothesis is just a fancy name for the assumption that, in general, things are unremarkable. Examples could be: our coin is fair, a new medical drug doesn’t work (its effect is just random noise), or one group of students is no better than the other.

So, after having a look at the output we would formally say that we have a significant result and therefore reject the null hypothesis that the coin is fair.

In most real-world scenarios the situation is not as clear-cut as it is with our coins. Therefore we won’t normally use a binomial test but the workhorse of statistical inference: the famous t-test! On the positive side, you can use the t-test in many situations where you don’t have all the information on the underlying population distribution, e.g. its variance. It is based on comparatively mild assumptions, e.g. that the population distribution is normal – and even this can be relaxed, thanks to the Central Limit Theorem (CLT), as long as the variance is finite and the sample size is big enough. If the population really is normal, small sample sizes suffice. On the negative side, the t-test often only gives an approximation of an exact test, yet the bigger the sample size, the better the approximation. Enough of the theory – let us perform a t-test with our data. We use the t.test() function for that:

data <- c(rep(0, n-k), rep(1, k))
t.test(data, mu = prob)
## 
##  One Sample t-test
## 
## data:  data
## t = -4, df = 9, p-value = 0.00311
## alternative hypothesis: true mean is not equal to 0.5
## 95 percent confidence interval:
##  -0.1262157  0.3262157
## sample estimates:
## mean of x 
##       0.1

As you can see, we get a different p-value this time but the test is still significant and leads to the right conclusion. As said before, the bigger the sample size the better the approximation.
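
To get a feeling for this approximation, you can rerun both tests on a hypothetical larger sample with a less extreme split, e.g. 100 tosses with 60 heads (the numbers are just for illustration); the exact binomial p-value and the t-test p-value should then lie much closer together than in the ten-toss example:

n_big <- 100 # hypothetical larger sample
k_big <- 60  # number of heads
data_big <- c(rep(0, n_big - k_big), rep(1, k_big))

binom.test(k_big, n_big)$p.value   # exact binomial test
t.test(data_big, mu = 0.5)$p.value # t-test approximation of it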

Let us try two more examples. First we use the inbuilt mtcars dataset to test whether the fuel consumption (mpg = miles per gallon) of cars with manual and automatic transmissions differs significantly:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# make new column with better labels
mtcars$trans <- ifelse(mtcars$am == 0, "aut", "man")
mtcars$trans <- factor(mtcars$trans)

mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

mtcars$trans
##  [1] man man man aut aut aut aut aut aut aut aut aut aut aut aut aut aut
## [18] man man man aut aut aut aut aut man man man man man man man
## Levels: aut man

by(mtcars$mpg, mtcars$trans, summary)
## mtcars$trans: aut
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   14.95   17.30   17.15   19.20   24.40 
## -------------------------------------------------------- 
## mtcars$trans: man
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   21.00   22.80   24.39   30.40   33.90

boxplot(mpg ~ trans, data = mtcars)

As we can see the consumption is obviously different – but is it also significantly different at the 5% significance level? Let us find out:

M <- mtcars$trans == "man" #manual
mpg_man <- mtcars[M, ]$mpg #consumption of manual transmission
mpg_man
##  [1] 21.0 21.0 22.8 32.4 30.4 33.9 27.3 26.0 30.4 15.8 19.7 15.0 21.4

mpg_aut <- mtcars[!M, ]$mpg #consumption of automatic transmission
mpg_aut
##  [1] 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7
## [15] 21.5 15.5 15.2 13.3 19.2

t.test(mpg_man, mpg_aut)
## 
##  Welch Two Sample t-test
## 
## data:  mpg_man and mpg_aut
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.209684 11.280194
## sample estimates:
## mean of x mean of y 
##  24.39231  17.14737

Yes, it is significantly different!

And now for our final example, a classic use of the t-test to find out whether a medical drug has a significant effect (we use the inbuilt sleep dataset):

sleep
##    extra group ID
## 1    0.7     1  1
## 2   -1.6     1  2
## 3   -0.2     1  3
## 4   -1.2     1  4
## 5   -0.1     1  5
## 6    3.4     1  6
## 7    3.7     1  7
## 8    0.8     1  8
## 9    0.0     1  9
## 10   2.0     1 10
## 11   1.9     2  1
## 12   0.8     2  2
## 13   1.1     2  3
## 14   0.1     2  4
## 15  -0.1     2  5
## 16   4.4     2  6
## 17   5.5     2  7
## 18   1.6     2  8
## 19   4.6     2  9
## 20   3.4     2 10

by(sleep$extra, sleep$group, summary)
## sleep$group: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.600  -0.175   0.350   0.750   1.700   3.700 
## -------------------------------------------------------- 
## sleep$group: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.100   0.875   1.750   2.330   4.150   5.500

boxplot(extra ~ group, sleep)

Again, there is obviously a difference… but is it significant?

t.test(extra ~ group, sleep)
## 
##  Welch Two Sample t-test
## 
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean in group 1 mean in group 2 
##            0.75            2.33

The p-value is above 0.05 which means that this time it cannot be ruled out that the difference is random!

As you can imagine, whenever there is money to be made people get creative in gaming the system… and unfortunately statistical tests are no exception. Especially in the medical field huge amounts of money are at stake, so one should be well aware of a manipulation technique called p-hacking (also known as data snooping or data dredging)!

To summarize our journey into statistical testing so far: you set a significance level (normally 0.05) and draw a random sample. You calculate how probable drawing this sample (or an even more extreme one) is and compare that probability to the significance level. If it is below the significance level you say that the test shows a significant result; otherwise you say that it cannot be ruled out that the difference (if there is one) is just due to chance.

Now, if you don’t like the result… why not draw a sample again? And again? And again? This is p-hacking! Just draw a lot of samples and take the one that fits your agenda!

Of course this is a stark misuse of the idea of statistical tests. But the fact remains that sometimes samples will be significant just by chance!

In the following example we “prove” that homeopathy works after all (as we all know it just doesn’t work beyond the so-called placebo effect, as numerous high-quality studies over many decades have shown). To that end we simply conduct one hundred trials and pick out one that fits our bill:

p_hack <- function() {
  homeopathy <- rnorm(20, 1.2, 5) # simulated data whose true mean is 1.2 – exactly the value we test against
  t.test(homeopathy, mu = 1.2)$p.value
}

set.seed(123)
trials <- replicate(100, p_hack())
trials[6] # homeopathy works!
## [1] 0.033461758

# wait... no it doesn't!
trials
##   [1] 0.522741341 0.785376466 0.624588751 0.587983281 0.057318309
##   [6] 0.033461758 0.876112179 0.461733176 0.495169397 0.519897894
##  [11] 0.800638084 0.836667733 0.915272904 0.340430844 0.087656634
##  [16] 0.879783049 0.824428157 0.784957654 0.898447738 0.494603038
##  [21] 0.279707663 0.438005042 0.886043148 0.158778621 0.167712182
##  [26] 0.135880072 0.185375902 0.662883154 0.840370636 0.331157973
##  [31] 0.283332330 0.423746570 0.849937345 0.503256673 0.546014504
##  [36] 0.725568387 0.727886195 0.167482514 0.586386335 0.493916303
##  [41] 0.937060320 0.271651762 0.236448565 0.191925375 0.983841382
##  [46] 0.373146285 0.520463358 0.323242616 0.193315250 0.757835487
##  [51] 0.429311063 0.284986685 0.868272041 0.844042454 0.885548528
##  [56] 0.996021648 0.978855283 0.588192375 0.495521737 0.610192393
##  [61] 0.242524547 0.404220265 0.136245145 0.004912541 0.408530447
##  [66] 0.458030143 0.784011725 0.357773131 0.818207705 0.698330582
##  [71] 0.451268449 0.740943226 0.266786179 0.120894921 0.050307044
##  [76] 0.387838555 0.232600995 0.834739682 0.669270240 0.516910985
##  [81] 0.273077802 0.291004360 0.369842912 0.132130995 0.454371585
##  [86] 0.567029545 0.219163798 0.524362414 0.737950629 0.855945701
##  [91] 0.959611992 0.901298484 0.931203165 0.204489683 0.843491761
##  [96] 0.567982673 0.329414884 0.107350495 0.911279358 0.696661191

Lo and behold: trial no. 6 shows a significant result – although by design both means are the same (1.2)! So we take this study, publish it and sell lots and lots of useless homeopathy drugs!

How many significant results will we get just by chance on average? Well, exactly the proportion given by our significance level!

trials <- replicate(1e5, p_hack())
round(length(trials[trials < 0.05]) / length(trials), 2)
## [1] 0.05

As always our friends over at xkcd summarize the situation brilliantly:

source: https://xkcd.com/882/

Learning R: Permutations and Combinations with base R


The area of combinatorics, the art of systematic counting, is dreaded territory for many people, so let us bring some light into the matter: in this post we will explain the difference between permutations and combinations, with and without repetition, calculate the number of possibilities and present efficient R code to enumerate all of them, so read on…

The cases we cover in this post:

  • Permutations with repetitions
  • Combinations without repetitions
  • Permutations (of all elements) without repetitions

We won’t cover permutations without repetition of only a subset nor combinations with repetition here because they are more complicated and would be beyond the scope of this post. We will perhaps cover those in a later post.

In all cases, you can imagine somebody drawing elements from a set and the different ways to do so. Permutation implies that the order does matter; with combinations it does not (e.g. in a lottery it normally does not matter in which order the numbers are drawn). Without repetition simply means that once an element has been drawn it cannot be drawn again; with repetition implies that it is replaced and can be drawn again.

Let us start with permutations with repetitions: as an example take a combination lock (should be permutation lock really!) where you have three positions with the numbers zero to nine each. We use the expand.grid() function for enumerating all possibilities:

P_wi <- expand.grid(rep(list(0:9), 3))
head(P_wi)
##   Var1 Var2 Var3
## 1    0    0    0
## 2    1    0    0
## 3    2    0    0
## 4    3    0    0
## 5    4    0    0
## 6    5    0    0

tail(P_wi)
##      Var1 Var2 Var3
## 995     4    9    9
## 996     5    9    9
## 997     6    9    9
## 998     7    9    9
## 999     8    9    9
## 1000    9    9    9

The formula for calculating the number of permutations is simple for obvious reasons (n is the number of elements to choose from, k is the number of actually chosen elements):

    \[n^k.\]

In R:

10^3
## [1] 1000

nrow(P_wi)
## [1] 1000

The next case is combinations without repetition: the classic example is a lottery where six out of 49 balls are chosen. We use the combn() function for finding all possibilities:

C_wo <- combn(1:49, 6)
C_wo[ , 1:10]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    1    1    1    1    1    1    1    1     1
## [2,]    2    2    2    2    2    2    2    2    2     2
## [3,]    3    3    3    3    3    3    3    3    3     3
## [4,]    4    4    4    4    4    4    4    4    4     4
## [5,]    5    5    5    5    5    5    5    5    5     5
## [6,]    6    7    8    9   10   11   12   13   14    15

To calculate the number of combinations the binomial coefficient is used:

    \[\binom{n}{k} = \frac{n!}{k! (n-k)!}.\]

To give you some intuition consider the above example: you have 49 possibilities for choosing the first ball, 48 for the second, 47 for the third and so on, up to the sixth ball. So that gives 49 \cdot 48 \cdot 47 \cdot 46 \cdot 45 \cdot 44. When you think about it this is the same as \frac{49 \cdot 48 \cdot 47 \cdot \ldots \cdot 3 \cdot 2 \cdot 1}{43 \cdot 42 \cdot 41 \cdot \ldots \cdot 3 \cdot 2 \cdot 1} = \frac{n!}{(n-k)!} because all the factors from 43 downwards cancel out when reducing the fraction. Now, the ball drawn first could sit in any of 6 positions, the ball drawn second in any of the remaining 5, and so on – and because the order doesn’t matter we have to divide by 6! = k!, which gives the binomial coefficient. In R we use the choose() function to calculate it:

choose(49, 6)
## [1] 13983816

ncol(C_wo)
## [1] 13983816
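
If you want to see the derivation above confirmed numerically, you can retrace it step by step in base R (just a cross-check, not needed for the actual calculation):

prod(49:44) / factorial(6)                     # 49*48*...*44 divided by 6!
factorial(49) / (factorial(43) * factorial(6)) # n! / ((n-k)! * k!)
# both reproduce choose(49, 6) = 13983816 (the factorial version up to floating-point rounding)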

So, you see that the chances of winning the lottery are about the same whether you play it… or not 😉

Our last case is permutations (of all elements) without repetition, which is also the most demanding one because there is no readily available function for it in base R. So we have to write our own:

perm <- function(v) {
  n <- length(v)
  if (n == 1) v # base case: a single element cannot be permuted any further
  else {
    X <- NULL
    # fix each element in turn and permute the rest, collecting the results row-wise
    for (i in 1:n) X <- rbind(X, cbind(v[i], perm(v[-i])))
    X
  }
}

As you can see it is a recursive function; to understand recursion read my post: To understand Recursion you have to understand Recursion… What makes matters a little bit more complicated here is that the recursive call sits within a for loop.

The idea is to fix one element after the other [for (i in 1:n) and cbind(v[i], ...)] and permute the remaining elements [perm(v[-i])] down to the base case when only one element remains [if (n == 1) v], which cannot be permuted any further. All of the resulting sets are collected on the respective higher level [X <- rbind(X, ...)] and the whole matrix X is returned.

Let us see this in action: as an example we’ll find out in how many different orders four runners can reach the finishing line:

P_wo <- perm(1:4)
P_wo
##       [,1] [,2] [,3] [,4]
##  [1,]    1    2    3    4
##  [2,]    1    2    4    3
##  [3,]    1    3    2    4
##  [4,]    1    3    4    2
##  [5,]    1    4    2    3
##  [6,]    1    4    3    2
##  [7,]    2    1    3    4
##  [8,]    2    1    4    3
##  [9,]    2    3    1    4
## [10,]    2    3    4    1
## [11,]    2    4    1    3
## [12,]    2    4    3    1
## [13,]    3    1    2    4
## [14,]    3    1    4    2
## [15,]    3    2    1    4
## [16,]    3    2    4    1
## [17,]    3    4    1    2
## [18,]    3    4    2    1
## [19,]    4    1    2    3
## [20,]    4    1    3    2
## [21,]    4    2    1    3
## [22,]    4    2    3    1
## [23,]    4    3    1    2
## [24,]    4    3    2    1

After this rather complicated function the calculation of the number of possibilities is simple: it is just the factorial (it should again be obvious why):

    \[n!.\]

In R we have the factorial() function:

factorial(4)
## [1] 24

nrow(P_wo)
## [1] 24

As you will see when solving real world problems with R the above functions often come in handy, so you should add them to your ever growing tool set – have fun and stay tuned!

Learning R: Painting with Fire


A few months ago I published a post on recursion: To understand Recursion you have to understand Recursion…. In this post we will see how to use recursion to fill free areas of an image with colour, the caveats of recursion and how to transform a recursive algorithm into a loop-based version using a queue – so read on…

The recursive version of the painting algorithm we want to examine here is very easy to understand; Wikipedia gives the pseudocode of the so-called flood-fill algorithm:


Flood-fill (node, target-color, replacement-color):

  • If target-color is equal to replacement-color, return.
  • If the color of node is not equal to target-color, return.
  • Set the color of node to replacement-color.
  • Perform Flood-fill (one step to the south of node, target-color, replacement-color).
    Perform Flood-fill (one step to the north of node, target-color, replacement-color).
    Perform Flood-fill (one step to the west of node, target-color, replacement-color).
    Perform Flood-fill (one step to the east of node, target-color, replacement-color).
  • Return.

The translation into R couldn’t be any easier:

floodfill <- function(row, col, tcol, rcol) {
  if (tcol == rcol) return()
  if (M[row, col] != tcol) return()
  M[row, col] <<- rcol
  floodfill(row - 1, col    , tcol, rcol) # south
  floodfill(row + 1, col    , tcol, rcol) # north
  floodfill(row    , col - 1, tcol, rcol) # west
  floodfill(row    , col + 1, tcol, rcol) # east
  return("filling completed")
}

We take the image from Wikipedia as an example:

M <- matrix(c(1, 1, 1, 1, 1, 1, 1, 1, 1,
              1, 0, 0, 0, 1, 0, 0, 0, 1,
              1, 0, 0, 0, 1, 0, 0, 0, 1,
              1, 0, 0, 1, 0, 0, 0, 0, 1,
              1, 1, 1, 0, 0, 0, 1, 1, 1,
              1, 0, 0, 0, 0, 1, 0, 0, 1,
              1, 0, 0, 0, 1, 0, 0, 0, 1,
              1, 0, 0, 0, 1, 0, 0, 0, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1), 9, 9)
image(M, col = c(0, 1))

We now fill the three areas with three different colours and then plot the image again:

startrow <- 5; startcol <- 5
floodfill(startrow, startcol, 0, 2)
## [1] "filling completed"

startrow <- 3; startcol <- 3
floodfill(startrow, startcol, 0, 3)
## [1] "filling completed"

startrow <- 7; startcol <- 7
floodfill(startrow, startcol, 0, 4)
## [1] "filling completed"

image(M, col = 1:4)

This seems to work pretty well, but the problem is that the deeper the recursion gets, the bigger the call stack has to be – which could lead to stack overflow errors. One comment on my original post on recursion read:

just keep in mind that recursion is useful in industrial work only if tail optimization is supported. otherwise your code will explode at some indeterminate time in the future. […]

One possibility is to increase the limit on nested expressions with options(expressions = 10000), but even this may not be enough. Therefore we transform our recursive algorithm into a loop-based one and use a queue instead of a stack! The pseudocode from Wikipedia:


Flood-fill (node, target-color, replacement-color):

  • If target-color is equal to replacement-color, return.
  • If color of node is not equal to target-color, return.
  • Set the color of node to replacement-color.
  • Set Q to the empty queue.
  • Add node to the end of Q.
  • While Q is not empty:
    • Set n equal to the first element of Q.
    • Remove first element from Q.
    • If the color of the node to the west of n is target-color,
      set the color of that node to replacement-color and add that node to the end of Q.
    • If the color of the node to the east of n is target-color,
      set the color of that node to replacement-color and add that node to the end of Q.
    • If the color of the node to the north of n is target-color,
      set the color of that node to replacement-color and add that node to the end of Q.
    • If the color of the node to the south of n is target-color,
      set the color of that node to replacement-color and add that node to the end of Q.
  • Continue looping until Q is exhausted.
  • Return.

Because of the way the algorithm fills areas it is also called forest fire. Again, the translation into valid R code is straightforward:

floodfill <- function(row, col, tcol, rcol) {
  if (tcol == rcol) return()
  if (M[row, col] != tcol) return()
  Q <- matrix(c(row, col), 1, 2)      # initialize the queue with the start node
  while (dim(Q)[1] > 0) {
    n <- Q[1, , drop = FALSE]         # take the first node from the queue...
    west  <- cbind(n[1]    , n[2] - 1)
    east  <- cbind(n[1]    , n[2] + 1)
    north <- cbind(n[1] + 1, n[2]    )
    south <- cbind(n[1] - 1, n[2]    )
    Q <- Q[-1, , drop = FALSE]        # ...and remove it
    if (M[n] == tcol) {
      M[n] <<- rcol                   # colour the node...
      if (M[west] == tcol)  Q <- rbind(Q, west)   # ...and enqueue its neighbours
      if (M[east] == tcol)  Q <- rbind(Q, east)   # that still have the target colour
      if (M[north] == tcol) Q <- rbind(Q, north)
      if (M[south] == tcol) Q <- rbind(Q, south)
    }
  }
  return("filling completed")
}

As an example we will use a much bigger picture (it can be downloaded from here: Unfilledcirc.png):

library(png)
img <- readPNG("data/Unfilledcirc.png")
M <- img[ , , 1]
M <- ifelse(M < 0.5, 0, 1)
M <- rbind(M, 0)
M <- cbind(M, 0)
image(M, col = c(1, 0))

And now for the filling:

startrow <- 100; startcol <- 100
floodfill(startrow, startcol, 0, 2)
## [1] "filling completed"

startrow <- 50; startcol <- 50
floodfill(startrow, startcol, 1, 3)
## [1] "filling completed"

image(M, col = c(1, 0, 2, 3))

As you can see, with this version of the algorithm much bigger areas can be filled!

I also added both R implementations to the respective section of Rosetta Code: Bitmap/Flood fill.

Learning R: The Ultimate Introduction (incl. Machine Learning!)


There are a million reasons to learn R (see e.g. Why R for Data Science – and not Python?), but where to start? I present to you the ultimate introduction to bring you up to speed! So read on…

I call it ultimate because it is the essence of many years of teaching R… or put differently: it is the kind of introduction I would have liked to have had when I started out with R back in the day!

A word of warning though: this is an introduction to R and not to statistics, so I won’t explain the statistical terms used here. You do not need to know any other programming language, but it does no harm either. Ok, now let us start!

First you need to install R (https://www.r-project.org) and preferably RStudio as a Graphical User Interface (GUI): https://www.rstudio.com/products/RStudio/#Desktop. Both are free and available for all common operating systems.

To get a quick overview of RStudio watch this video:

You can either type in the following commands in the console or open a new script tab (File -> New File -> R Script) and run the commands by pressing Ctrl + Enter/Return after having typed them.

First of all R is a very good calculator:

2 + 2
## [1] 4

sin(0.5)
## [1] 0.4794255

abs(-10) # absolute value
## [1] 10

pi
## [1] 3.141593

exp(1) # e
## [1] 2.718282

factorial(6)
## [1] 720

By the way: The hash is used for comments, everything after it will be ignored!

Of course you can define variables and use them in your calculations:

n1 <- 2
n2 <- 3
n1 # show content of variable by just typing the name
## [1] 2

n1 + n2
## [1] 5

n1 * n2
## [1] 6

n1^n2
## [1] 8

Part of R’s power stems from the fact that functions can handle several numbers at once, collected in so-called vectors, and do calculations on them. When calling a function, arguments are passed within round brackets:

n3 <- c(12, 5, 27) # concatenate (combine) elements into a vector 
n3
## [1] 12  5 27

min(n3)
## [1] 5

max(n3)
## [1] 27

sum(n3)
## [1] 44

mean(n3)
## [1] 14.66667

sd(n3) # standard deviation
## [1] 11.23981

var(n3) # variance
## [1] 126.3333

median(n3)
## [1] 12

n3 / 12
## [1] 1.0000000 0.4166667 2.2500000

In the last example the 12 was recycled three times. R always tries to do that (when feasible), sometimes giving a warning when it might not be intended:

n3 / c(1, 2)
## Warning in n3/c(1, 2): longer object length is not a multiple of shorter
## object length
## [1] 12.0  2.5 27.0

In case you only want parts of your vectors you can use subsetting with square brackets:

n3[1]
## [1] 12

n3[c(2, 3)]
## [1]  5 27

Ranges can easily be created with the colon:

n4 <- 10:20
n4
##  [1] 10 11 12 13 14 15 16 17 18 19 20

When you test whether this vector is bigger than a certain number you will get logicals as a result. You can use those logicals for subsetting:

n4 > 15
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

n4[n4 > 15]
## [1] 16 17 18 19 20

Perhaps you have heard the story of little Gauss, whose teacher gave him the task of adding all numbers from 1 to 100 to keep him busy for a while? Well, he found a mathematical trick to add them within seconds… we normal people can just use R:

sum(1:100)
## [1] 5050

When we want to use some code several times we can define our own function (a user-defined function). We do that the same way we create a vector (or any other data structure) because R is a so-called functional programming language in which functions are first-class citizens (i.e. on the same level as other data structures like vectors). The code that is being executed is put in curly brackets:

gauss <- function(x) {
  sum(1:x)
}
gauss(100)
## [1] 5050

gauss(1000)
## [1] 500500

Of course we also have other data types, e.g. matrices, which are basically two-dimensional vectors:

M <- matrix(1:12, nrow = 3, byrow = TRUE) # create a matrix
M
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

dim(M)
## [1] 3 4

Subsetting now has to provide two numbers, the first for the row, the second for the column (like in the game Battleship). If you leave one out, all data of the respective dimension will be shown:

M[2, 3]
## [1] 7

M[ , c(1, 3)]
##      [,1] [,2]
## [1,]    1    3
## [2,]    5    7
## [3,]    9   11

Another possibility to create matrices:

v1 <- 1:4
v2 <- 4:1
M1 <- rbind(v1, v2) # row bind
M1
##    [,1] [,2] [,3] [,4]
## v1    1    2    3    4
## v2    4    3    2    1

M2 <- cbind(v1, v2)  # column bind
M2
##      v1 v2
## [1,]  1  4
## [2,]  2  3
## [3,]  3  2
## [4,]  4  1

Naming rows, here using the built-in LETTERS constant (R also provides its lowercase counterpart letters):

rownames(M2) <- LETTERS[1:4]
M2
##   v1 v2
## A  1  4
## B  2  3
## C  3  2
## D  4  1

LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

When some result is Not Available:

LETTERS[50]
## [1] NA

Getting the structure of your variables:

str(LETTERS)
##  chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" ...

str(M2)
##  int [1:4, 1:2] 1 2 3 4 4 3 2 1
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:4] "A" "B" "C" "D"
##   ..$ : chr [1:2] "v1" "v2"

Another famous dataset (iris) that is also built into base R (to get help on any function or dataset just put the cursor in it and press F1):

iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Oops, that is a bit long… if you only want to show the first or last rows do the following:

head(iris) # first 6 rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(iris, 10) # last 10 rows
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 141          6.7         3.1          5.6         2.4 virginica
## 142          6.9         3.1          5.1         2.3 virginica
## 143          5.8         2.7          5.1         1.9 virginica
## 144          6.8         3.2          5.9         2.3 virginica
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

Iris is a so-called data frame, the workhorse of R and of data science (you will see how to create one below):

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

As you can see, data frames can combine different data types. If you try to do that with e.g. vectors, which can only hold one data type, something called coercion happens, i.e. at least one data type is forced to become another one so that consistency is maintained:

str(c(2, "Hello")) # 2 is coerced to become a character string too
##  chr [1:2] "2" "Hello"

You can get a fast overview of your data like so:

summary(iris[1:4])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

boxplot(iris[1:4])

As you have seen, R often runs a function on all of the data simultaneously. This feature is called vectorization and in many other languages you would need a loop for that. In R you don’t use loops that often, but of course they are available:

for (i in seq(5)) {
  print(1:i)
}
## [1] 1
## [1] 1 2
## [1] 1 2 3
## [1] 1 2 3 4
## [1] 1 2 3 4 5
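
To make the vectorization point above more concrete, here is a small sketch contrasting a loop that builds its result element by element with the equivalent one-line vectorized expression (the variable names are just for illustration):

v <- c(12, 5, 27)
res <- numeric(length(v))   # pre-allocate the result vector
for (i in seq_along(v)) {
  res[i] <- v[i] / 12       # loop version: one element at a time
}
res

v / 12                      # vectorized version: the same in one expression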

Speaking of control structures: of course conditional statements are available too:

even <- function(x) ifelse(x %% 2 == 0, TRUE, FALSE) # %% gives remainder of division (= modulo operator)
even(1:5)
## [1] FALSE  TRUE FALSE  TRUE FALSE
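
ifelse() is the vectorized variant; for ordinary branching on a single condition the classic if ... else construct is used. A minimal sketch:

x <- 7
if (x %% 2 == 0) {
  print("even")
} else {
  print("odd")
}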

Linear modelling (e.g. correlation and linear regression) couldn’t be any easier; it is included in the core language:

age <- c(21, 46, 55, 35, 28)
income <- c(1850, 2500, 2560, 2230, 1800)
df <- data.frame(age, income) # create a data frame
df
##   age income
## 1  21   1850
## 2  46   2500
## 3  55   2560
## 4  35   2230
## 5  28   1800

cor(df) # correlation
##              age    income
## age    1.0000000 0.9464183
## income 0.9464183 1.0000000

LinReg <- lm(income ~ age, data = df) # income as a linear model of age
LinReg
## 
## Call:
## lm(formula = income ~ age, data = df)
## 
## Coefficients:
## (Intercept)          age  
##     1279.37        24.56

summary(LinReg)
## 
## Call:
## lm(formula = income ~ age, data = df)
## 
## Residuals:
##       1       2       3       4       5 
##   54.92   90.98  -70.04   91.12 -166.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1279.367    188.510   6.787  0.00654 **
## age           24.558      4.838   5.076  0.01477 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132.1 on 3 degrees of freedom
## Multiple R-squared:  0.8957, Adjusted R-squared:  0.8609 
## F-statistic: 25.77 on 1 and 3 DF,  p-value: 0.01477

plot(df, pch = 16, main = "Linear model")
abline(LinReg, col = "blue", lwd = 2) # adding the regression line

You could directly use the model to make predictions:

pred_LinReg <- predict(LinReg, data.frame(age = seq(15, 70, 5)))
names(pred_LinReg) <- seq(15, 70, 5)
round(pred_LinReg, 2)
##      15      20      25      30      35      40      45      50      55 
## 1647.73 1770.52 1893.31 2016.10 2138.88 2261.67 2384.46 2507.25 2630.04 
##      60      65      70 
## 2752.83 2875.61 2998.40

If you want to know more about the modelling process you can find it here: Learning Data Science: Modelling Basics

Another strength of R is the huge number of add-on packages for all kinds of specialized tasks. For the grand finale of this introduction we’ll get a little taste of machine learning. To that end we install the OneR package from CRAN (the official package repository of R): Tools -> Install packages… -> type in “OneR” -> click “Install”.
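
If you prefer the console over the menu, the same can be achieved with the install.packages() function:

install.packages("OneR") # download and install the package from CRAN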

After that we build a simple model on the iris dataset to predict the Species column:

library(OneR) # load package
data <- optbin(Species ~., data = iris) # find optimal bins for numeric predictors
model <- OneR(data, verbose = TRUE) # build actual model
## 
##     Attribute    Accuracy
## 1 * Petal.Width  96%     
## 2   Petal.Length 95.33%  
## 3   Sepal.Length 74.67%  
## 4   Sepal.Width  55.33%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model) # show rules
## 
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
## 
## Rules:
## If Petal.Width = (0.0976,0.791] then Species = setosa
## If Petal.Width = (0.791,1.63]   then Species = versicolor
## If Petal.Width = (1.63,2.5]     then Species = virginica
## 
## Accuracy:
## 144 of 150 instances classified correctly (96%)
## 
## Contingency table:
##             Petal.Width
## Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
##   setosa               * 50            0          0  50
##   versicolor              0         * 48          2  50
##   virginica               0            4       * 46  50
##   Sum                    50           52         48 150
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 266.35, df = 4, p-value < 2.2e-16

plot(model)

We’ll now see how well the model is doing:

prediction <- predict(model, data)
eval_model(prediction, data)
## 
## Confusion matrix (absolute):
##             Actual
## Prediction   setosa versicolor virginica Sum
##   setosa         50          0         0  50
##   versicolor      0         48         4  52
##   virginica       0          2        46  48
##   Sum            50         50        50 150
## 
## Confusion matrix (relative):
##             Actual
## Prediction   setosa versicolor virginica  Sum
##   setosa       0.33       0.00      0.00 0.33
##   versicolor   0.00       0.32      0.03 0.35
##   virginica    0.00       0.01      0.31 0.32
##   Sum          0.33       0.33      0.33 1.00
## 
## Accuracy:
## 0.96 (144/150)
## 
## Error rate:
## 0.04 (6/150)
## 
## Error rate reduction (vs. base rate):
## 0.94 (p-value < 2.2e-16)

96% accuracy is not too bad, even for this simple dataset!

If you want to know more about the OneR package you can read the vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models.

Well, and that’s it for the ultimate introduction to R – hopefully you liked it and learned something! Please share your first experiences with R in the comments and also let me know if you are missing something (I might add it in the future!) – thank you for reading and stay tuned for more to come!

Was the Bavarian Abitur too hard this time?


Bavaria is known for its famous Oktoberfest… and within Germany also for its reputedly difficult Abitur, a qualification granted by university-preparatory schools in Germany.

A mandatory part for all students is maths. This year many students protested that the maths part was way too hard; they even started an online petition with more than seventy thousand supporters at the time of writing!

It is not clear yet whether their marks will be adjusted upwards; the ministry of education is investigating the case. As a professor in Bavaria who also teaches statistics I will take the opportunity to share with you an actual question from the original examination together with its solution, so read on…

Let us have a look at the first (and easiest) question in the stochastics part:

Every sixth visitor to the Oktoberfest wears a gingerbread heart around his neck. During the festival 25 visitors are chosen at random. Determine the probability that at most one of the selected visitors will have a gingerbread heart.

Before you read on try to solve the task yourself…

Of course students are not allowed to use R in the examination, but the good thing about this kind of question is that, if you have no idea how to solve it analytically, solving it by simulation is often much easier:

set.seed(12)
N <- 1e7
v <- matrix(sample(c(0L, 1L), size = 25 * N, replace = TRUE, prob = c(5/6, 1/6)), ncol = 25)
sum(rowSums(v) <= 1) / N
## [1] 0.062936

The answer is about 6.3%.

Now for the analytical solution: “at most one” implies that we have to differentiate between two cases, “no gingerbread heart” and “exactly one gingerbread heart”. “No gingerbread heart” is just \left(\frac{5}{6}\right)^{25}. “Exactly one gingerbread heart” is 25\cdot\frac{1}{6}\cdot\left(\frac{5}{6}\right)^{24} because there are 25 positions at which the single gingerbread heart could occur. We have to add both probabilities:

(5/6)^25 + 25*1/6*(5/6)^24
## [1] 0.06289558

If you know a little bit about probability distributions you will recognize the above as the binomial distribution:

pbinom(q = 1, size = 25, prob = 1/6)
## [1] 0.06289558
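
The two-case decomposition above also maps directly onto the binomial probability mass function; as a cross-check you can sum the two cases with dbinom():

dbinom(0, size = 25, prob = 1/6) + dbinom(1, size = 25, prob = 1/6) # same result as above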

Of course it is a little unfair to judge just on the basis of the easiest task and without knowing the general maths level that is required. But still, I would like to hear your opinion on this. Outsiders’ views from different countries and different educational systems are also very welcome! So, what do you think:

Was the Bavarian Abitur too hard this time? Please leave your reply in the comment section below!

Update (June 7, 2019): The Bavarian Ministry of Education has decided now that the marks will not be subsequently adjusted.

Backtest Trading Strategies like a real Quant


R is one of the best choices when it comes to quantitative finance. Here we will show you how to load financial data, plot charts and give you a step-by-step template to backtest trading strategies. So, read on…

We begin by just plotting a chart of the Standard & Poor’s 500 (S&P 500), an index of the 500 biggest companies in the US. To get the index data and plot the chart we use the powerful quantmod package (on CRAN). After that we add two popular indicators, the simple moving average (SMA) and the exponential moving average (EMA).

Have a look at the code:

library(quantmod)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.

getSymbols("^GSPC", from = "2000-01-01")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## [1] "^GSPC"

head(GSPC)
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2000-01-03   1469.25   1478.00  1438.36    1455.22   931800000
## 2000-01-04   1455.22   1455.22  1397.43    1399.42  1009000000
## 2000-01-05   1399.42   1413.27  1377.68    1402.11  1085500000
## 2000-01-06   1402.11   1411.90  1392.10    1403.45  1092300000
## 2000-01-07   1403.45   1441.47  1400.73    1441.47  1225200000
## 2000-01-10   1441.47   1464.36  1441.47    1457.60  1064800000
##            GSPC.Adjusted
## 2000-01-03       1455.22
## 2000-01-04       1399.42
## 2000-01-05       1402.11
## 2000-01-06       1403.45
## 2000-01-07       1441.47
## 2000-01-10       1457.60

tail(GSPC)
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2019-04-24   2934.00   2936.83  2926.05    2927.25  3448960000
## 2019-04-25   2928.99   2933.10  2912.84    2926.17  3425280000
## 2019-04-26   2925.81   2939.88  2917.56    2939.88  3248500000
## 2019-04-29   2940.58   2949.52  2939.35    2943.03  3118780000
## 2019-04-30   2937.14   2948.22  2924.11    2945.83  3919330000
## 2019-05-01   2952.33   2954.13  2923.36    2923.73  3645850000
##            GSPC.Adjusted
## 2019-04-24       2927.25
## 2019-04-25       2926.17
## 2019-04-26       2939.88
## 2019-04-29       2943.03
## 2019-04-30       2945.83
## 2019-05-01       2923.73

chartSeries(GSPC, theme = chartTheme("white"), subset = "last 10 months", show.grid = TRUE)

addSMA(20)

addEMA(20)

As you can see the moving averages are basically smoothed-out versions of the original data, shifted by the given number of days. While with the SMA (red curve) all days are weighted equally, with the EMA (blue curve) the more recent days are weighted more strongly, so that the resulting indicator detects changes more quickly. The idea is that by using those indicators investors might be able to detect longer-term trends and act accordingly. For example, a trading rule could be to buy the index whenever it crosses the MA from below and to sell when it crosses it from above. Judge for yourself whether this could have worked.
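
If you want to work with those two indicators directly instead of only adding them to the chart, the underlying SMA() and EMA() functions from the TTR package (loaded together with quantmod) can be applied to the closing prices. A small sketch:

cl <- Cl(GSPC)           # closing prices
sma20 <- SMA(cl, n = 20) # simple MA: all 20 days weighted equally
ema20 <- EMA(cl, n = 20) # exponential MA: recent days weighted more strongly
tail(merge(cl, sma20, ema20))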

Well, having said that, it might not be that easy to find out the profitability of certain trading rules just by staring at a chart. We are looking for something more systematic! We would need a decent backtest! This can of course also be done with R; a great choice is the PerformanceAnalytics package (on CRAN).

To backtest a trading strategy I provide you with a step-by-step template:

  1. Load libraries and data
  2. Create your indicator
  3. Use indicator to create equity curve
  4. Evaluate strategy performance

As an example we want to test the idea that it might be profitable to sell the index when the financial markets exhibit significant stress. Interestingly enough “stress” can be measured by certain indicators that are freely available. One of them is the National Financial Conditions Index (NFCI) of the Federal Reserve Bank of Chicago (https://www.chicagofed.org/publications/nfci/index):

The Chicago Fed’s National Financial Conditions Index (NFCI) provides a comprehensive weekly update on U.S. financial conditions in money markets, debt and equity markets and the traditional and “shadow” banking systems. […] The NFCI [is] constructed to have an average value of zero and a standard deviation of one over a sample period extending back to 1971. Positive values of the NFCI have been historically associated with tighter-than-average financial conditions, while negative values have been historically associated with looser-than-average financial conditions.

To make it more concrete we create a buy signal as long as the index stays below one standard deviation above its mean (NFCI < 1, i.e. no significant stress) and a sell signal otherwise.

Have a look at the following code:

# Step 1: Load libraries and data
library(quantmod)
library(PerformanceAnalytics)
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend

getSymbols('NFCI', src = 'FRED', from = '2000-01-01')
## [1] "NFCI"

NFCI <- na.omit(lag(NFCI)) # we can only act on the signal after release, i.e. the next day
getSymbols("^GSPC", from = '2000-01-01')
## [1] "^GSPC"

data <- na.omit(merge(NFCI, GSPC)) # merge before (!) calculating returns
data$GSPC <- na.omit(ROC(Cl(GSPC))) # calculate returns of closing prices

# Step 2: Create your indicator
data$sig <- ifelse(data$NFCI < 1, 1, 0)
data$sig <- na.locf(data$sig)

# Step 3: Use indicator to create equity curve
perf <- na.omit(merge(data$sig * data$GSPC, data$GSPC))
colnames(perf) <- c("Stress-based strategy", "SP500")

# Step 4: Evaluate strategy performance
table.DownsideRisk(perf)
##                               Stress-based strategy   SP500
## Semi Deviation                               0.0075  0.0087
## Gain Deviation                               0.0071  0.0085
## Loss Deviation                               0.0079  0.0095
## Downside Deviation (MAR=210%)                0.0125  0.0135
## Downside Deviation (Rf=0%)                   0.0074  0.0087
## Downside Deviation (0%)                      0.0074  0.0087
## Maximum Drawdown                             0.5243  0.6433
## Historical VaR (95%)                        -0.0173 -0.0188
## Historical ES (95%)                         -0.0250 -0.0293
## Modified VaR (95%)                          -0.0166 -0.0182
## Modified ES (95%)                           -0.0268 -0.0311

table.Stats(perf)
##                 Stress-based strategy     SP500
## Observations                4858.0000 4858.0000
## NAs                            0.0000    0.0000
## Minimum                       -0.0690   -0.0947
## Quartile 1                    -0.0042   -0.0048
## Median                         0.0003    0.0005
## Arithmetic Mean                0.0002    0.0002
## Geometric Mean                 0.0002    0.0001
## Quartile 3                     0.0053    0.0057
## Maximum                        0.0557    0.1096
## SE Mean                        0.0001    0.0002
## LCL Mean (0.95)               -0.0001   -0.0002
## UCL Mean (0.95)                0.0005    0.0005
## Variance                       0.0001    0.0001
## Stdev                          0.0103    0.0120
## Skewness                      -0.1881   -0.2144
## Kurtosis                       3.4430    8.5837

charts.PerformanceSummary(perf)

chart.RelativePerformance(perf[ , 1], perf[ , 2])

chart.RiskReturnScatter(perf)

The first chart shows that the stress-based strategy (black curve) clearly outperformed its benchmark, the S&P 500 (red curve). This can also be seen in the second chart, which shows the relative performance. The third chart shows that both the return (higher) and the risk (lower) of our backtested strategy are more favourable than those of the benchmark.

So, all in all this seems to be a viable strategy! But of course before investing real money many more tests have to be performed! You can use this framework for backtesting your own ideas.
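
For example, the moving-average idea from the beginning of this post could be dropped into the same four-step framework. A minimal sketch (with quantmod and PerformanceAnalytics already loaded and GSPC from above still in memory; the 200-day window and the simple greater-than rule are just illustrative assumptions, not a recommendation):

ret <- ROC(Cl(GSPC)) # daily returns of closing prices
sig <- lag(ifelse(Cl(GSPC) > SMA(Cl(GSPC), 200), 1, 0)) # act one day after the signal
perf2 <- na.omit(merge(sig * ret, ret))
colnames(perf2) <- c("SMA(200) strategy", "SP500")
charts.PerformanceSummary(perf2)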

This is not the place to explain all of the above tables and plots, but as you can see both packages are very powerful and I have only shown you a small fraction of their capabilities. To use their full potential you should have a look at the extensive documentation that comes with them on CRAN.

Disclaimer:
This is no investment advice! No responsibility is taken whatsoever if you lose money!

If you gain money though I would be happy if you could buy me a coffee… that is not too much to ask, is it? 😉

The Rich didn’t earn their Wealth, they just got Lucky


Tomorrow, on the First of May, many countries celebrate the so called International Workers’ Day (or Labour Day): time to talk about the unequal distribution of wealth again!

A few months ago I posted a piece with the title “If wealth had anything to do with intelligence…” where I argued that ability, e.g. intelligence, as an input has nothing to do with wealth as an output. It drew a lot of criticism (as expected), most of it unfounded in my opinion, but one point merits some discussion: the fact that the intelligence quotient (IQ) is normally distributed by construction. The argument goes that intelligence per se may well be a distribution with fat tails too, but by the way the IQ is constructed the metric is transformed into a well-formed Gaussian distribution. To a degree this is certainly true, yet I would still argue that the distribution of intelligence, and of all other human abilities, is far better behaved than the extremely unequal distribution of wealth. I wrote in a comment:

There are many aspects in your comment that are certainly true. Obviously there are huge problems in measuring “true” mental abilities, which is the exact reason why people came up with a somewhat artificial “intelligence quotient” with all its shortcomings.

What would be interesting to see is (and I don’t know if you perhaps have a source about this) what the outcome of an intelligence test would look like without the “quotient” part, i.e. without subsequently normalizing the results.

I guess the relationship wouldn’t be strictly linear but it wouldn’t be as extreme as the wealth distribution either.

What I think is true in any case, independent of the distributions, is when you rank all people by intelligence and by wealth respectively you wouldn’t really see any stable connection – and that spirit was the intention of my post in the first place and I still stand by it, although some of the technicalities are obviously debatable.

So, if you have a source, Dear Reader, you are more than welcome to share it in the comments – I am always eager to learn!

I ended my post with:

But if it is not ability/intelligence that determines the distribution of wealth what else could account for the extreme inequality we perceive in the world?

In this post I will argue that luck is a good candidate, so read on…

In 2014 there was a special issue of the renowned magazine Science titled “The science of inequality”. In one of the articles (Cho, A.: “Physicists say it’s simple”) the following thought experiment is being proposed:

Suppose you randomly divide $500 million in income among 10,000 people. There’s only one way to give everyone an equal, $50,000 share. So if you’re doling out earnings randomly, equality is extremely unlikely. But there are countless ways to give a few people a lot of cash and many people a little or nothing. In fact, given all the ways you could divvy out income, most of them produce an exponential distribution of income.

So, the basic idea is to randomly throw 9,999 darts at a scale ranging from zero to 500 million and study the resulting distribution of intervals:

library(MASS)

w <- 5e8 # wealth
p <- 1e4 # no. of people

set.seed(123)
d <- diff(c(0, sort(runif(p-1, max = w)), w)) # wealth distribution
h <- hist(d, col = "red", main = "Exponential decline", freq = FALSE, breaks = 45, xlim = c(0, quantile(d, 0.99)))

fit <- fitdistr(d, "exponential")
curve(dexp(x, rate = fit$estimate), col = "black", type = "p", pch = 16, add = TRUE)

The resulting distribution fits an exponential distribution very well. You can read some interesting discussions concerning this result on CrossValidated StackExchange: How can I analytically prove that randomly dividing an amount results in an exponential distribution (of e.g. income and wealth)?
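
Incidentally, we can sanity-check the fit (a quick sketch using the objects d, fit, p and w from above): the mean interval must be w/p = 50,000, so the estimated exponential rate should come out very close to the theoretical rate p/w = 2e-05:

mean(d)      # mean interval, should equal w/p = 50,000
fit$estimate # rate estimated by fitdistr
p / w        # theoretical rate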

Just to give you an idea of how unfair this distribution is: the richest seven persons together have more wealth than the poorest ten percent (i.e. 1,000 people) combined:

sum(sort(d)[9994:10000]) - sum(sort(d)[1:1000])
## [1] 183670.8

If you think that this is ridiculous just look at the real global wealth distribution: there it is not seven but only three persons who own more than the poorest ten percent!

Now, what does that mean? Well, equality seems to be the exception and (extreme) inequality the rule. The intervals were found randomly, no interval had any special skills, just luck – and the result is (extreme) inequality – as in the real world!

If you can reproduce the wealth distribution of a society stochastically, this suggests that it wasn’t so much the extraordinary skills of the rich that made them rich – they just got lucky.

Some rich people are decent enough to admit this. In his impressive essay “Why Poverty Is Like a Disease” Christian H. Cooper, a hillbilly turned investment banker writes:

So how did I get out? By chance.

It’s easy to attach a post-facto narrative of talent and hard work to my story, because that’s what we’re fed by everything from Hollywood to political stump speeches. But it’s the wrong story. My escape was made up of a series of incredibly unlikely events, none of which I had real control over.

[…]

I am the exception that proves the rule—but that rule is that escape from poverty is a matter of chance, and not a matter of merit.

A consequence would be that you cannot really learn much from the rich. So throw away all of your self-help books on how to become successful. I will end with a cartoon which brings home this message, on a closely related concept, the so-called survivorship bias (which is also important to keep in mind when backtesting trading strategies in quantitative finance, the topic of an upcoming post… so stay tuned!):

Source: xkcd.com/1827

Google’s Eigenvector… or how a Random Surfer finds the most relevant Webpages


Like most people you will have used a search engine lately, like Google. But have you ever thought about how it manages to give you the most fitting results? How does it order the results so that the best are on top? Read on to find out!

The earliest search engines either had human-curated indices, like Yahoo!, or used some simple heuristic such as “the more often the keyword you are looking for is mentioned on a page, the better”, like Altavista – which led to all kinds of crazy effects, like certain keywords being repeated thousands of times on webpages to make them more “relevant”.

Now, most of those search engines are long gone because a new kid arrived on the block: Google! Google’s search engine results were much better than all of the competition and they became the dominant player in no time. How did they do that?

The big idea was in fact published by the two founders: Sergey Brin and Lawrence Page, it is called the pagerank algorithm (which is of course a pun because one of the authors was named Page too). The original paper can be found here: S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine.

Let us start with another, related question: which properties are the best to own in Monopoly? Many would instinctively answer with the most expensive ones, i.e. Park Place and Boardwalk. But a second thought reveals that those might be the ones where you get the biggest rent if somebody lands on them – and that last part is the caveat: “IF” somebody lands on them! The best streets are actually the ones players land on most often. Those happen to be the orange streets, St. James Place, Tennessee Avenue and New York Avenue, and therefore they are the key to winning the game.

How do you find those properties? For example by simulation: you just simulate thousands of dice rolls and see where the players land most often.
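
Here is a minimal sketch of such a simulation (heavily simplified: only the two dice and the “Go To Jail” square are modelled – no Chance or Community Chest cards, no doubles rule – but enough to see the effect):

set.seed(123)
n_rolls <- 1e5
pos <- 0                   # start on GO (squares numbered 0 to 39)
counts <- integer(40)
for (i in seq_len(n_rolls)) {
  pos <- (pos + sum(sample(1:6, 2, replace = TRUE))) %% 40 # roll two dice and move
  if (pos == 30) pos <- 10 # "Go To Jail" sends you to the jail square
  counts[pos + 1] <- counts[pos + 1] + 1
}
round(counts[c(17, 19, 20)] / n_rolls, 3) # orange properties: squares 16, 18, 19
1 / 40                                    # frequency if all squares were equally likely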

A similar idea holds true for finding the best web pages: you just start from a random position and simulate a surfer who visits different web pages by chance. For each surfing session you tally the webpage where she ends up, and after many runs you get a percentage for each page. The higher this percentage, the more relevant the webpage!

Let us do this with some R code. First we define a very small net and plot it (the actual example can be found in chapter 30 of the very good book “Chaotic Fishponds and Mirror Universes” by Richard Elwes):

library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union

# cols represent outgoing links, rows incoming links
# A links to C, D; B links to A; C links to A; D links to A,B,C
M <- matrix(c(0, 0, 1, 1,
              1, 0, 0, 0,
              1, 0, 0, 0, 
              1, 1, 1, 0), nrow = 4)
colnames(M) <- rownames(M) <- c("A", "B", "C", "D")
M
##   A B C D
## A 0 1 1 1
## B 0 0 0 1
## C 1 0 0 1
## D 1 0 0 0

g <- graph_from_adjacency_matrix(t(M)) # careful with how the adjacency matrix is defined -> transpose of matrix
plot(g)

Now we run the actual simulation. We define two helper functions for that: next_page, which returns a random (but reachable) next page given the page our surfer is currently on, and last_page, which gives the final page after N clicks:

next_page <- function(page, graph) {
  l <- sample(rownames(graph)[as.logical(graph[ , as.logical(page)])], 1)
  as.numeric(rownames(graph) == l)
}

last_page <- function(page, graph, N = 100) {
  for (i in seq(N)) {
    page <- next_page(page, graph)  
  }
  page
}

current_page <- c(1, 0, 0, 0) # random surfer starting from A
random_surfer <- replicate(2e4, last_page(current_page, M, 50))
round(rowSums(random_surfer) / sum(random_surfer), 2)
## [1] 0.43 0.07 0.28 0.22

So we see that page A is the most relevant one because our surfer ends up there in more than 40% of all sessions; after that come the pages C, D and B. When you look at the net this makes sense, since all pages link to A, whereas B receives only one link, so it doesn’t seem to be that relevant.

As you have seen, even for this small net the simulation took quite long, so we need some clever mathematics to speed up the process. One idea is to transform the matrix which represents the network into a matrix which gives the probabilities of landing on the next pages, and then multiply this probability matrix with the current position (thereby propagating the probabilities one step forward). Let us do this for the first step:

M_prob <- prop.table(M, 2) # create probability matrix
M_prob
##     A B C         D
## A 0.0 1 1 0.3333333
## B 0.0 0 0 0.3333333
## C 0.5 0 0 0.3333333
## D 0.5 0 0 0.0000000

M_prob %*% current_page
##   [,1]
## A  0.0
## B  0.0
## C  0.5
## D  0.5

The result says that there is a fifty-fifty chance of landing on C or D. When you look at the graph you see that this is correct since there are two links, one to C and one to D! For the next step you would have to multiply the matrix with the result again, or first multiply the matrix with itself before multiplying with the current position, which gives:

    \[M \cdot M = M^2.\]
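
We can quickly convince ourselves of this in base R (a small sketch using the objects defined above – both expressions should give the same vector):

M_prob %*% (M_prob %*% current_page) # two multiplications, step by step
(M_prob %*% M_prob) %*% current_page # squaring the matrix first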

If we want to do this a hundred times we just have to raise this probability matrix to the one hundredth power:

    \[M^{100}.\]

We use the %^% operator in the expm package (on CRAN) for that:

library(expm)
## Loading required package: Matrix
## 
## Attaching package: 'expm'
## The following object is masked from 'package:Matrix':
## 
##     expm

r <- M_prob %^% 100 %*% current_page
r
##         [,1]
## A 0.42857143
## B 0.07142857
## C 0.28571429
## D 0.21428571

Again, we get the same result! You might ask: why 100? The answer is that this is in most cases enough to reach a stable state, so that any further multiplication leaves the result unchanged:

    \[M_{prob} \cdot r=r\]
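
We can check this fixed-point property numerically (a quick sketch with the r and M_prob from above):

all.equal(as.numeric(M_prob %*% r), as.numeric(r)) # one more multiplication leaves r unchanged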

The last equation opens up still another possibility: we are obviously looking for a vector r which remains unaffected when multiplied by the matrix M_{prob}. There is a mathematical name for that kind of behaviour: eigenvector! As you might have guessed the name is an import from the German language, where it means something like “own vector”.

This hints at the problem we were solving all along (without consciously realizing it perhaps): a page is the more relevant the more relevant the pages are that link to it… but to know the importance of those pages we have to know the importance of the pages that link to them, and so on and so forth – we are going in circles here. The same is true when you look at the equation above: we define r in terms of r itself – r is the eigenvector of matrix M_{prob}!

There are very fast and powerful methods to find the eigenvectors of a matrix, and the corresponding function, eigen, is even part of base R:

lr <- Re(eigen(M_prob)$vectors[ , 1]) # real parts of the eigenvector belonging to the largest eigenvalue
lr / sum(lr) # normalization
## [1] 0.42857143 0.07142857 0.28571429 0.21428571

Again, the same result! You can now understand the title of this post and titles of other articles about the pagerank algorithm and Google like “The $25,000,000,000 eigenvector”.

Yet, a word of warning is in order: there are cases where the probability matrix is not diagonalizable (we won’t get into the mathematical details here), which means that the eigenvector method won’t give sensible results. A sufficient check is that all eigenvalues are distinct, i.e. the following code must evaluate to TRUE:

ev <- eigen(M_prob)$values
length(unique(ev)) == length(ev)
## [1] TRUE

We now repeat the last two methods for a bigger network:

set.seed(1415)
n <- 10
g <- sample_gnp(n, p = 1/4, directed = TRUE) # create random graph
g <- set_vertex_attr(g, "name", value = LETTERS[1:n])
plot(g)

M <- t(as_adjacency_matrix(g, sparse = FALSE))
M_prob <- prop.table(M, 2) # create probability matrix
M_prob
##      A B C D   E   F   G         H         I   J
## A 0.00 0 0 1 0.5 0.5 0.5 0.0000000 0.0000000 0.5
## B 0.00 0 0 0 0.0 0.0 0.0 0.3333333 0.0000000 0.0
## C 0.00 1 0 0 0.0 0.0 0.0 0.0000000 0.3333333 0.5
## D 0.25 0 0 0 0.0 0.0 0.0 0.0000000 0.0000000 0.0
## E 0.25 0 0 0 0.0 0.0 0.5 0.3333333 0.3333333 0.0
## F 0.00 0 1 0 0.0 0.0 0.0 0.0000000 0.3333333 0.0
## G 0.25 0 0 0 0.0 0.0 0.0 0.0000000 0.0000000 0.0
## H 0.00 0 0 0 0.5 0.0 0.0 0.0000000 0.0000000 0.0
## I 0.00 0 0 0 0.0 0.5 0.0 0.0000000 0.0000000 0.0
## J 0.25 0 0 0 0.0 0.0 0.0 0.3333333 0.0000000 0.0

current_page <- c(1, rep(0, n-1))
r <- M_prob %^% 100 %*% current_page
r
##         [,1]
## A 0.27663574
## B 0.02429905
## C 0.08878509
## D 0.06915881
## E 0.14579434
## F 0.10654199
## G 0.06915881
## H 0.07289723
## I 0.05327107
## J 0.09345787

lr <- Re(eigen(M_prob)$vectors[ , 1])
lr / sum(lr) # normalization of the real parts
##  [1] 0.27663551 0.02429907 0.08878505 0.06915888 0.14579439 0.10654206
##  [7] 0.06915888 0.07289720 0.05327103 0.09345794

We can now order the pages according to their importance – like the first 10 results of a Google search:

search <- data.frame(Page = LETTERS[1:n], Rank = r)
search[order(search$Rank, decreasing = TRUE), ]
##   Page       Rank
## A    A 0.27663574
## E    E 0.14579434
## F    F 0.10654199
## J    J 0.09345787
## C    C 0.08878509
## H    H 0.07289723
## D    D 0.06915881
## G    G 0.06915881
## I    I 0.05327107
## B    B 0.02429905

Looking at the net, does the resulting order make sense to you?
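
By the way, igraph ships with its own PageRank implementation, so we can cross-check our result (a quick sketch; the numbers will differ somewhat because page_rank() by default uses the full algorithm with a damping factor of 0.85, i.e. a random surfer who sometimes teleports to a random page):

round(page_rank(g)$vector, 3) # igraph's built-in PageRank for comparison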

Congratulations, you now understand the big idea behind one of the greatest revolutions in information technology!

Base Rate Fallacy – or why No One is justified to believe that Jesus rose


In this post we are talking about one of the most unintuitive results in statistics: the so called false positive paradox, which is an example of the so called base rate fallacy. It describes a situation where a very sensitive medical test comes back positive, indicating that you have the respective disease… yet you are most probably healthy!

The reason for this is that the disease itself is so rare that even with a very sensitive test the result is most probably false positive: it shows that you have the disease yet this result is false, you are healthy.

The key to understanding this result is to understand the difference between two conditional probabilities: the probability of a positive test result given that you are sick, and the probability that you are sick given a positive test result – you are interested in the latter (am I really sick?) but you only know the former.

Now for some notation (the vertical dash means “under the condition that”, P stands for probability):

  • P(B \mid A): if you are sick (A) you will most probably get a positive test result (B) – this (the quality of the test) is what we know
  • P(A \mid B): if you have a positive test result (B), how probable is it that you are really sick (A)? – this (whether we are sick) is what we want to know

To calculate one conditional probability from the other we use the famous Bayes’ theorem:

    \[P(A\mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}\]

In the following example we assume a disease with an infection rate of 1 in 1,000 and a test to detect this disease with a sensitivity of 99% (and, for simplicity, a specificity of 99% as well). Have a look at the following code, which illustrates the situation with Euler diagrams – first the big picture, then a zoomed-in version:

library(eulerr)

A <- 0.001 # prevalence of disease
BlA <- 0.99 # sensitivity of test

B <- A * BlA + (1 - A) * (1 - BlA) # positive test (specificity same as sensitivity)
AnB <- BlA * A
AlB <- BlA * A / B # Bayes's theorem
#AnB / B # Bayes's theorem in different form

C <- 1 # the whole population
main <- paste0("P(B|A) = ", round(BlA, 2), ", but P(A|B) = ", round(AlB, 2))

set.seed(123)
fit1 <- euler(c("A" = A, "B" = B, "C" = C, "A&B" = AnB, "A&C" = A, "B&C" = B, "A&B&C" = AnB), input = "union")
plot(fit1, main = main, fill = c("red", "green", "gray90"))

fit2 <- euler(c("A" = A, "B" = B, "A&B" = AnB), input = "union")
plot(fit2, main = main, fill = c("red", "green"))

As you can see, although this test is very sensitive, when you get a positive test result the probability of actually being infected is only about 9%! To see why, think in absolute numbers: out of 100,000 people about 100 are infected, of whom 99 test positive – but of the 99,900 healthy people about 999 also test positive, so only 99 of the roughly 1,100 positive results are true positives.

In the diagrams C is the whole population and A are the infected individuals. B shows the people with a positive test result, and you can see in the second diagram that almost all of the infected (A) are also part of B (the brown overlap = true positives), but still most of B lies outside of A (the green area): although these people are not infected they have a positive test result – they are false positives.

The red area shows the people who are infected (A) but get a negative test result stating that they are healthy: these are the false negatives. The grey area shows the people who are healthy and get a negative test result: they are the true negatives.
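
If you prefer simulation over formulas, a quick Monte Carlo sketch (with the same hypothetical numbers: prevalence 0.1%, sensitivity and specificity 99%) gives roughly the same answer:

set.seed(123)
N <- 1e6
sick <- runif(N) < 0.001                                   # about 0.1% are infected
test_pos <- ifelse(sick, runif(N) < 0.99, runif(N) < 0.01) # 99% sensitivity and specificity
mean(test_pos[sick]) # P(B|A): close to 0.99
mean(sick[test_pos]) # P(A|B): only about 0.09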

Due to the occasion we are now coming to an even more extreme example: did Jesus rise from the dead? It is inspired by the very good essay “A drop in the sea”: Don’t believe in miracles until you’ve done the math.

Let us assume that we had very, very reliable witnesses (as a side note what is strange though is that the gospels cannot even agree on how many men or angels appeared at the tomb: it is one angel in Matthew, a young man in Mark, two men in Luke and two angels in John… but anyway), yet the big problem is that not many people so far have been able to defy death. I have only heard of two cases: supposedly the King of Kings (Jesus) but also of course the King himself (Elvis!), whereby sightings of Elvis after his death are much more numerous than of Jesus (just saying… 😉 )

Have a look at the following code (source for the number of people who have ever lived: WolframAlpha):

A <- 2/108500000000 # probability of coming back from the dead (The King = Elvis and the King of Kings = Jesus)
BlA <- 0.9999999 # sensitivity of test -> very, very reliable witnesses (many more in case of Elvis 😉 )

B <- A * BlA + (1 - A) * (1 - BlA) # positive test = witnesses say He rose
AnB <- BlA * A
AlB <- BlA * A / B # Bayes's theorem

C <- 1 # all people
main <- paste0("P(B|A) = ", round(BlA, 2), ", but P(A|B) = ", round(AlB, 2))

fit1 <- euler(c("A" = A, "B" = B, "C" = C, "A&B" = AnB, "A&C" = A, "B&C" = B, "A&B&C" = AnB), input = "union")
plot(fit1, main = main, fill = c("red", "green", "gray90"))

fit2 <- euler(c("A" = A, "B" = B, "A&B" = AnB), input = "union")
plot(fit2, main = main, fill = c("red", "green"))

So, in this case C is the unfortunate group of people who are gone for good… which is basically all of us. 🙁 As you can see, although the witnesses are super reliable, when they claim that somebody rose from the dead it is almost certain that they are wrong:

  • P(B \mid A): if Jesus rose (A) the very, very reliable witnesses would with a very high probability say so (B)
  • P(A \mid B): if the very, very reliable witnesses said that Jesus rose (B) Jesus would still almost surely have stayed dead

Or in the words of the above mentioned essay:

No one is justified in believing in Jesus’s resurrection. The numbers simply don’t justify the conclusion.

But this chimes well with a famous Christian saying “I believe because it is absurd” (or in Latin “Credo quia absurdum”) – you can find out more about that in another highly interesting essay: ‘I believe because it is absurd’: Christianity’s first meme

Unfortunately this devastating conclusion is also true in the case of Elvis…

Separating the Signal from the Noise: Robust Statistics for Pedestrians


One of the problems of navigating an autonomous car through a city is extracting robust signals in the face of all the noise present in the different sensors. Just taking something like an arithmetic mean of all the data points could end in catastrophe: if part of a wall looks similar to the street and the algorithm calculates an average trajectory of the two, the car would leave the road and possibly crash into pedestrians. So we need some robust algorithm to get rid of the noise. The area of statistics that deals with such problems is called robust statistics and the methods used therein robust estimation.

Now, one of the problems is that one doesn’t know beforehand what is signal and what is noise. The big idea behind RANSAC (short for RAndom SAmple Consensus) is to get rid of outliers by basically taking as many points as possible which are consistent with one well-defined model and leaving out the others. It does that iteratively, similar in spirit to the famous k-means algorithm (the topic of one of the upcoming posts, so stay tuned…).

To really understand how RANSAC works we will now build it with R. We will take a simple linear regression as an example and make it robust against outliers.

For educational purposes we will do this step by step:

  1. Understanding the general outline of the algorithm.
  2. Looking at the steps in more detail.
  3. Expressing the steps in pseudocode.
  4. Translating this into R!

Conveniently enough, Wikipedia gives a good outline of the algorithm and even provides very clear pseudocode, which will serve as the basis for our own R implementation (the Matlab code given there is not very good in my opinion and has little to do with the pseudocode):

The RANSAC algorithm is essentially composed of two steps that are iteratively repeated:

  1. In the first step, a sample subset containing minimal data items is randomly selected from the input dataset. A fitting model and the corresponding model parameters are computed using only the elements of this sample subset. The cardinality of the sample subset is the smallest sufficient to determine the model parameters.
  2. In the second step, the algorithm checks which elements of the entire dataset are consistent with the model instantiated by the estimated model parameters obtained from the first step. A data element will be considered as an outlier if it does not fit the fitting model instantiated by the set of estimated model parameters within some error threshold that defines the maximum deviation attributable to the effect of noise. The set of inliers obtained for the fitting model is called consensus set. The RANSAC algorithm will iteratively repeat the above two steps until the obtained consensus set in certain iteration has enough inliers.

In more detail:

RANSAC achieves its goal by repeating the following steps:

  1. Select a random subset of the original data. Call this subset the hypothetical inliers.
  2. A model is fitted to the set of hypothetical inliers.
  3. All other data are then tested against the fitted model. Those points that fit the estimated model well, according to some model-specific loss function, are considered as part of the consensus set.
  4. The estimated model is reasonably good if sufficiently many points have been classified as part of the consensus set.
  5. Afterwards, the model may be improved by reestimating it using all members of the consensus set.

This procedure is repeated a fixed number of times, each time producing either a model which is rejected because too few points are part of the consensus set, or a refined model together with a corresponding consensus set size. In the latter case, we keep the refined model if its consensus set is larger than the previously saved model.

Now, this can be expressed in pseudocode:

Given:
    data - a set of observed data points
    model - a model that can be fitted to data points
    n - minimum number of data points required to fit the model
    k - maximum number of iterations allowed in the algorithm
    t - threshold value to determine when a data point fits a model
    d - number of close data points required to assert that a model fits well to data

Return:
    bestfit - model parameters which best fit the data (or nul if no good model is found)

iterations = 0
bestfit = nul
besterr = something really large
while iterations < k {
    maybeinliers = n randomly selected values from data
    maybemodel = model parameters fitted to maybeinliers
    alsoinliers = empty set
    for every point in data not in maybeinliers {
        if point fits maybemodel with an error smaller than t
             add point to alsoinliers
    }
    if the number of elements in alsoinliers is > d {
        % this implies that we may have found a good model
        % now test how good it is
        bettermodel = model parameters fitted to all points in maybeinliers and alsoinliers
        thiserr = a measure of how well model fits these points
        if thiserr < besterr {
            bestfit = bettermodel
            besterr = thiserr
        }
    }
    increment iterations
}
return bestfit

It is quite easy to convert this into valid R code (as a learner of R you should try it yourself before looking at my solution!):

ransac <- function(data, n, k, t, d) {
  iterations <- 0
  bestfit <- NULL
  besterr <- 1e5 # "something really large"
  while (iterations < k) {
    maybeinliers <- sample(nrow(data), n)                       # random subset of n points
    maybemodel <- lm(y ~ x, data = data, subset = maybeinliers) # fit model to this subset
    alsoinliers <- NULL
    for (point in setdiff(1:nrow(data), maybeinliers)) {
      # perpendicular distance of the point from the fitted line y = a + b*x: |b*x0 - y0 + a| / sqrt(b^2 + 1)
      if (abs(maybemodel$coefficients[2]*data[point, 1] - data[point, 2] + maybemodel$coefficients[1])/sqrt(maybemodel$coefficients[2]^2 + 1) < t)
        alsoinliers <- c(alsoinliers, point)
    }
    if (length(alsoinliers) > d) {
      # enough additional inliers: refit the model on the whole consensus set
      bettermodel <- lm(y ~ x, data = data, subset = c(maybeinliers, alsoinliers))
      thiserr <- summary(bettermodel)$sigma # residual standard error as error measure
      if (thiserr < besterr) {
        bestfit <- bettermodel
        besterr <- thiserr
      }
    }
    iterations <- iterations + 1
  }
  bestfit
}

We now test this with some sample data:

data <- read.csv("data/RANSAC.csv")
plot(data)
abline(lm(y ~ x, data = data))
set.seed(3141)
abline(ransac(data, n = 10, k = 10, t = 0.5, d = 10), col = "blue")

The black line is the plain-vanilla linear regression, the blue line the RANSAC-enhanced version. As you can see: no more crashing into innocent pedestrians. 🙂
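
As a cross-check you could also compare the result with an off-the-shelf robust estimator, e.g. an M-estimator via rlm() from the MASS package (a sketch reusing the data from above; the fit won’t be identical to RANSAC but should also ignore the gross outliers):

library(MASS)
abline(rlm(y ~ x, data = data), col = "green", lty = 2) # robust regression for comparison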