One of the most notoriously difficult subjects in statistics is the concept of *statistical tests*. We will explain the ideas behind it step by step to give you some intuition on how to use (and misuse) it, so read on…

Let us begin with some *coin tosses* and the question how to find out whether a coin is *fair*, i.e. shows heads and tails with fifty-fifty probability. If you had all the time of the world you could throw it indefinitely often to see where the probabilities stabilize. Yet, in reality we only have some *sample* of tosses and have to *infer* results based on this. Because of this the area that deals with those problems is called *inferential statistics* (also *inductive statistics*).

So, to keep things simple let us throw our coin only ten times. Where is the point where you would suspect that something is not quite right with this coin? As soon as you don’t get five times heads and five times tails? Or only when you get ten times one side only? Possibly somewhere in between? But where exactly?

The first thing to understand is that this is a somewhat arbitrary decision… but one that should be set in advance and which should adhere to some general standard. The idea is to set some fixed probability with which you compare the probability of the result you see in your experiment (i.e. the number of heads and tails in our example). When you obtain a result that is *as improbable or even more so* than that pre-fixed probability you conclude that the result happened not just by chance but that something *significant* happened (e.g. that your coin is unfair). The generally accepted probability below which one speaks of a significant result is 5% (or 0.05), it is therefore called *significance level*.

After fixing the significance level at 0.05 we have to compare it with the probability of our observed result (or a result that is even more extreme). Let us say that we got nine times one side of the coin and only one time the other. What is the probability of getting this or an even more extreme result? The general *probability formula* is:

For the numerator we get *22* possibilities for observing the above mentioned or an even more extreme result: two times ten possibilities that each of the ten coins could be the odd one out and two more extreme cases showing only heads or only tails. For the denominator we get possibilities for throwing coins:

22 / 2^10 ## [1] 0.02148438

So, the probability is about 2% which is below 5%, so we would conclude that our coin is *not* fair!

Another possibility is to use the *binomial coefficient* which gives us the number of possibilities to choose *k* items out of *n* items. Conveniently enough the R function is called `choose(n, k)`

:

(choose(10, 0) + choose(10, 1) + choose(10, 9) + choose(10, 10))/2^10 ## [1] 0.02148438

Or, because the situation is symmetric:

2 * (choose(10, 0) + choose(10, 1))/2^10 ## [1] 0.02148438

To make things even simpler we can use the *binomial distribution* for our calculation, where basically all of the needed binomial coefficients are summed up automatically:

prob <- 0.5 # fair coin n <- 10 # number of coin tosses k <- 1 # number of "successes", i.e. how many coins show a different side than the rest 2 * pbinom(k, n, prob) ## [1] 0.02148438

As you can see, we get the same probability value over and over again. If you have understood the line of thought so far you are ready for the last step: performing the statistical test with the `binom.test()`

function:

binom.test(k, n) ## ## Exact binomial test ## ## data: k and n ## number of successes = 1, number of trials = 10, p-value = 0.02148 ## alternative hypothesis: true probability of success is not equal to 0.5 ## 95 percent confidence interval: ## 0.002528579 0.445016117 ## sample estimates: ## probability of success ## 0.1

As you can see, we get the same number again: it is named *p-value* in the output:

The p-value is the probability of getting the observed – or an even more extreme – result under the condition that the null hypothesis is true.

The only thing that is new here is the *null hypothesis*. The null hypothesis is just a fancy name for the assumption that in general things are unremarkable, examples could be: our coin is fair, a new medical drug doesn’t work (its effect is just random noise) or one group of students is not better than the other.

So, after having a look at the output we would formally say that we *have a significant result* and therefore *reject the null hypothesis* that the coin is fair.

In most real world scenarios the situation is not that clear cut as it is with our coins. Therefore we won’t normally use a binomial test but the workhorse of statistical inference: the famous *t-test*! On the positive side you can use the t-test in many situations where you don’t have all the information on the underlying *population distribution*, e.g. its *variance*. It is based on comparatively mild assumptions, e.g. that the population distribution is normal (which is itself effectively based on the *Central Limit Theorem (CLT)*) – and even this can be relaxed as long as its variance is finite and when you have big sample sizes. If it is normal though small sample sizes suffice. On the negative side the t-test often only gives approximations of exact tests. Yet the bigger the sample sizes the better the approximation. Enough of the theory, let us perform a t-test with our data. We use the `t.test()`

function for that:

data <- c(rep(0, n-k), rep(1, k)) t.test(data, mu = prob) ## ## One Sample t-test ## ## data: data ## t = -4, df = 9, p-value = 0.00311 ## alternative hypothesis: true mean is not equal to 0.5 ## 95 percent confidence interval: ## -0.1262157 0.3262157 ## sample estimates: ## mean of x ## 0.1

As you can see, we get a different p-value this time but the test is still significant and leads to the right conclusion. As said before, the bigger the sample size the better the approximation.

Let us try two more examples. First we use the inbuilt `mtcars`

dataset to test whether petrol consumption of manual and automatic transmissions differ significantly:

head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 # make new column with better labels mtcars$trans <- ifelse(mtcars$am == 0, "aut", "man") mtcars$trans <- factor(mtcars$trans) mtcars$mpg ## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 ## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 ## [29] 15.8 19.7 15.0 21.4 mtcars$trans ## [1] man man man aut aut aut aut aut aut aut aut aut aut aut aut aut aut ## [18] man man man aut aut aut aut aut man man man man man man man ## Levels: aut man by(mtcars$mpg, mtcars$trans, summary) ## mtcars$trans: aut ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 10.40 14.95 17.30 17.15 19.20 24.40 ## -------------------------------------------------------- ## mtcars$trans: man ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 15.00 21.00 22.80 24.39 30.40 33.90 boxplot(mpg ~ trans, data = mtcars)

As we can see the consumption is obviously different – but is it also *significantly* different at the *5%* significance level? Let us find out:

M <- mtcars$trans == "man" #manual mpg_man <- mtcars[M, ]$mpg #consumption of manual transmission mpg_man ## [1] 21.0 21.0 22.8 32.4 30.4 33.9 27.3 26.0 30.4 15.8 19.7 15.0 21.4 mpg_aut <- mtcars[!M, ]$mpg #consumption of automatic transmission mpg_aut ## [1] 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 ## [15] 21.5 15.5 15.2 13.3 19.2 t.test(mpg_man, mpg_aut) ## ## Welch Two Sample t-test ## ## data: mpg_man and mpg_aut ## t = 3.7671, df = 18.332, p-value = 0.001374 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## 3.209684 11.280194 ## sample estimates: ## mean of x mean of y ## 24.39231 17.14737

Yes, it is significantly different!

And now for our final example, a classic use of the t-test to find out whether a medical drug has a significant effect (we use the inbuilt `sleep`

dataset):

sleep ## extra group ID ## 1 0.7 1 1 ## 2 -1.6 1 2 ## 3 -0.2 1 3 ## 4 -1.2 1 4 ## 5 -0.1 1 5 ## 6 3.4 1 6 ## 7 3.7 1 7 ## 8 0.8 1 8 ## 9 0.0 1 9 ## 10 2.0 1 10 ## 11 1.9 2 1 ## 12 0.8 2 2 ## 13 1.1 2 3 ## 14 0.1 2 4 ## 15 -0.1 2 5 ## 16 4.4 2 6 ## 17 5.5 2 7 ## 18 1.6 2 8 ## 19 4.6 2 9 ## 20 3.4 2 10 by(sleep$extra, sleep$group, summary) ## sleep$group: 1 ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -1.600 -0.175 0.350 0.750 1.700 3.700 ## -------------------------------------------------------- ## sleep$group: 2 ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -0.100 0.875 1.750 2.330 4.150 5.500 boxplot(extra ~ group, sleep)

Again, there is obviously a difference… but is it significant?

t.test(extra ~ group, sleep) ## ## Welch Two Sample t-test ## ## data: extra by group ## t = -1.8608, df = 17.776, p-value = 0.07939 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -3.3654832 0.2054832 ## sample estimates: ## mean in group 1 mean in group 2 ## 0.75 2.33

The p-value is above 0.05 which means that this time it cannot be ruled out that the difference is random!

As you can imagine whenever there is money to be made people get creative on how to game the system… unfortunately statistical tests are no exception. Especially in the medical field huge amounts of money are at stake, so one should be well aware of a manipulation technique called *p-hacking* (also known under the names *data snooping* or *data dredging*)!

To summarize our journey into statistical testing so far we have seen that you set a significance level (normally at 0.05) and draw a *random sample*. You calculate how probable the drawing of this sample (or an even more extreme sample) is and compare it to the significance level. If the probability is below the significance level you say that the test shows a significant result, otherwise you say that one cannot rule out that the difference (if there is one) is just due to chance.

Now, if you don’t like the result… why not draw a sample again? And again? And again? This is p-hacking! Just draw a lot of samples and take the one that fits your agenda!

Of course this is a stark misuse of the idea of statistical tests. But the fact remains that sometimes samples will be significant *just by chance*!

In the following example we “prove” that *homeopathy* works after all (as we all know it just doesn’t work beyond the so-called *placebo effect* as numerous high quality studies over many decades have shown). For that end we just conduct one hundred trials and take one out that fits our bill:

p_hack <- function() { homeopathy <- rnorm(20, 1.2, 5) t.test(homeopathy, mu = 1.2)$p.value } set.seed(123) trials <- replicate(100, p_hack()) trials[6] # homeopathy works! ## [1] 0.033461758 # wait... no it doesn't! trials ## [1] 0.522741341 0.785376466 0.624588751 0.587983281 0.057318309 ## [6] 0.033461758 0.876112179 0.461733176 0.495169397 0.519897894 ## [11] 0.800638084 0.836667733 0.915272904 0.340430844 0.087656634 ## [16] 0.879783049 0.824428157 0.784957654 0.898447738 0.494603038 ## [21] 0.279707663 0.438005042 0.886043148 0.158778621 0.167712182 ## [26] 0.135880072 0.185375902 0.662883154 0.840370636 0.331157973 ## [31] 0.283332330 0.423746570 0.849937345 0.503256673 0.546014504 ## [36] 0.725568387 0.727886195 0.167482514 0.586386335 0.493916303 ## [41] 0.937060320 0.271651762 0.236448565 0.191925375 0.983841382 ## [46] 0.373146285 0.520463358 0.323242616 0.193315250 0.757835487 ## [51] 0.429311063 0.284986685 0.868272041 0.844042454 0.885548528 ## [56] 0.996021648 0.978855283 0.588192375 0.495521737 0.610192393 ## [61] 0.242524547 0.404220265 0.136245145 0.004912541 0.408530447 ## [66] 0.458030143 0.784011725 0.357773131 0.818207705 0.698330582 ## [71] 0.451268449 0.740943226 0.266786179 0.120894921 0.050307044 ## [76] 0.387838555 0.232600995 0.834739682 0.669270240 0.516910985 ## [81] 0.273077802 0.291004360 0.369842912 0.132130995 0.454371585 ## [86] 0.567029545 0.219163798 0.524362414 0.737950629 0.855945701 ## [91] 0.959611992 0.901298484 0.931203165 0.204489683 0.843491761 ## [96] 0.567982673 0.329414884 0.107350495 0.911279358 0.696661191

Lo and behold: trial no. 6 shows a significant result – although by design both means are the same (1.2)! So we take this study, publish it and sell lots and lots of useless homeopathy drugs!

How many significant results will we get just by chance on average? Well, exactly the amount of our significance level!

trials <- replicate(1e5, p_hack()) round(length(trials[trials < 0.05]) / length(trials), 2) ## [1] 0.05

As always our friends over at *xkcd* summarize the situation brilliantly:

The area of *combinatorics*, the art of systematic counting, is dreaded territory for many people, so let us bring some light into the matter: in this post we will explain the difference between *permutations* and *combinations*, with and without *repetitions*, will calculate the number of possibilities and present efficient R code to enumerate all of them, so read on…

The cases we cover in this post:

- Permutations with repetitions
- Combinations without repetitions
- Permutations (of all elements) without repetitions

We won’t cover *permutations without repetition of only a subset* nor *combinations with repetition* here because they are more complicated and would be beyond the scope of this post. We will perhaps cover those in a later post.

In all cases, you can imagine somebody drawing *elements* from a *set* and the different ways to do so. *Permutation* implies that the order does matter, with *combinations* it does not (e.g. in a lottery it normally does not matter in which order the numbers are drawn). *Without repetition* simply means that when one has drawn an element it cannot be drawn again, so *with repetition* implies that it is replaced and can be drawn again.

Let us start with *permutations with repetitions*: as an example take a combination lock (should be *permutation lock* really!) where you have three positions with the numbers zero to nine each. We use the `expand.grid()`

function for enumerating all possibilities:

P_wi <- expand.grid(rep(list(0:9), 3)) head(P_wi) ## Var1 Var2 Var3 ## 1 0 0 0 ## 2 1 0 0 ## 3 2 0 0 ## 4 3 0 0 ## 5 4 0 0 ## 6 5 0 0 tail(P_wi) ## Var1 Var2 Var3 ## 995 4 9 9 ## 996 5 9 9 ## 997 6 9 9 ## 998 7 9 9 ## 999 8 9 9 ## 1000 9 9 9

The formula for calculating the number of permutations is simple for obvious reasons ( is the number of elements to choose from, is the number of actually chosen elements):

In R:

10^3 ## [1] 1000 nrow(P_wi) ## [1] 1000

The next is *combinations without repetitions*: the classic example is a lottery where six out of *49* balls are chosen. We use the `combn()`

function for finding all possibilities:

C_wo <- combn(1:49, 6) C_wo[ , 1:10] ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 1 1 1 1 1 1 1 1 1 1 ## [2,] 2 2 2 2 2 2 2 2 2 2 ## [3,] 3 3 3 3 3 3 3 3 3 3 ## [4,] 4 4 4 4 4 4 4 4 4 4 ## [5,] 5 5 5 5 5 5 5 5 5 5 ## [6,] 6 7 8 9 10 11 12 13 14 15

To calculate the number of combinations the *binomial coefficient* is used:

To give you some intuition consider the above example: you have possibilities for choosing the first ball, for the second, for the third and so on up to the sixth ball. So that gives . When you think about it this is the same as because all the coefficients smaller than can be eliminated by reducing the fraction! Now, there are possible positions for the first ball that is drawn, for the second… and so on and because the order doesn’t matter we have to divide by , which gives the binomial coefficient. In R we use the `choose()`

function to calculate it:

choose(49, 6) ## [1] 13983816 ncol(C_wo) ## [1] 13983816

So, you see that the probability of winning the lottery are about the same, no matter whether you play it… or not

Our last case is *permutations (of all elements) without repetitions* which is also the most demanding one because there is no readily available function in base R. So, we have to write our own:

perm <- function(v) { n <- length(v) if (n == 1) v else { X <- NULL for (i in 1:n) X <- rbind(X, cbind(v[i], perm(v[-i]))) X } }

As you can see it is a recursive function, to understand recursion read my post: To understand Recursion you have to understand RecursionâŚ. What makes matters a little bit more complicated is that the *recursive call* is within a *for loop*.

The idea is to fix one element after the other [`for (i in 1:n)`

and `cbind(v[i], ...)`

] and permute the remaining elements [`perm(v[-i])`

] down to the *base case* when only one element remains [`if (n == 1) v`

], which cannot be permuted any further. Collect all sets on the respective higher level [`X <- (rbind(X, ...)`

] and return the whole matrix `X`

.

Let us see this in action, as an example we’ll see how many different ways there are of four runners reaching the finishing line:

P_wo <- perm(1:4) P_wo ## [,1] [,2] [,3] [,4] ## [1,] 1 2 3 4 ## [2,] 1 2 4 3 ## [3,] 1 3 2 4 ## [4,] 1 3 4 2 ## [5,] 1 4 2 3 ## [6,] 1 4 3 2 ## [7,] 2 1 3 4 ## [8,] 2 1 4 3 ## [9,] 2 3 1 4 ## [10,] 2 3 4 1 ## [11,] 2 4 1 3 ## [12,] 2 4 3 1 ## [13,] 3 1 2 4 ## [14,] 3 1 4 2 ## [15,] 3 2 1 4 ## [16,] 3 2 4 1 ## [17,] 3 4 1 2 ## [18,] 3 4 2 1 ## [19,] 4 1 2 3 ## [20,] 4 1 3 2 ## [21,] 4 2 1 3 ## [22,] 4 2 3 1 ## [23,] 4 3 1 2 ## [24,] 4 3 2 1

After this rather complicated function the calculation of the number of ways is simple, it is just the *factorial function* (it should again be obvious why):

In R we have the `factorial()`

function:

factorial(4) ## [1] 24 nrow(P_wo) ## [1] 24

As you will see when solving real world problems with R the above functions often come in handy, so you should add them to your ever growing tool set – have fun and stay tuned!

]]>

A few months ago I published a post on *recursion*: To understand Recursion you have to understand RecursionâŚ. In this post we will see how to use recursion to fill free areas of an image with colour, the caveats of recursion and how to transform a recursive algorithm into a loop-based version using a *queue* – so read on…

The recursive version of the painting algorithm we want to examine here is very easy to understand, Wikipedia gives the pseudocode of the so called *flood-fill algorithm*:

- If target-color is equal to replacement-color, return.
- If the color of node is not equal to target-color, return.
- Set the color of node to replacement-color.
- Perform
**Flood-fill**(one step to the south of node, target-color, replacement-color).

Perform**Flood-fill**(one step to the north of node, target-color, replacement-color).

Perform**Flood-fill**(one step to the west of node, target-color, replacement-color).

Perform**Flood-fill**(one step to the east of node, target-color, replacement-color). - Return.

The translation into R couldn’t be any easier:

floodfill <- function(row, col, tcol, rcol) { if (tcol == rcol) return() if (M[row, col] != tcol) return() M[row, col] <<- rcol floodfill(row - 1, col , tcol, rcol) # south floodfill(row + 1, col , tcol, rcol) # north floodfill(row , col - 1, tcol, rcol) # west floodfill(row , col + 1, tcol, rcol) # east return("filling completed") }

We take the image from Wikipedia as an example:

M <- matrix(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 9, 9) image(M, col = c(0, 1))

We now fill the three areas with three different colours and then plot the image again:

startrow <- 5; startcol <- 5 floodfill(startrow, startcol, 0, 2) ## [1] "filling completed" startrow <- 3; startcol <- 3 floodfill(startrow, startcol, 0, 3) ## [1] "filling completed" startrow <- 7; startcol <- 7 floodfill(startrow, startcol, 0, 4) ## [1] "filling completed" image(M, col = 1:4)

This seems to work pretty well but the problem is that the more nested the algorithm becomes the bigger the *stack* has to be – which could lead to overflow errors. One comment on my original post on recursion read:

just keep in mind that recursion is useful in industrial work only if tail optimization is supported. otherwise your code will explode at some indeterminate time in the future. […]

One possibility is to increase the size of the stack with `options(expressions = 10000)`

but even this may not be enough. Therefore we transform our recursive algorithm into a loop-based one and use a *queue* instead of a stack! The pseudocode from Wikipedia:

- If target-color is equal to replacement-color, return.
- If color of node is not equal to target-color, return.
- Set the color of node to replacement-color.
- Set Q to the empty queue.
- Add node to the end of Q.
- While Q is not empty:
- Set n equal to the first element of Q.
- Remove first element from Q.
- If the color of the node to the west of n is target-color,

set the color of that node to replacement-color and add that node to the end of Q. - If the color of the node to the east of n is target-color,

set the color of that node to replacement-color and add that node to the end of Q. - If the color of the node to the north of n is target-color,

set the color of that node to replacement-color and add that node to the end of Q. - If the color of the node to the south of n is target-color,

set the color of that node to replacement-color and add that node to the end of Q. - Continue looping until Q is exhausted.
- Return.

Because of the way the algorithm fills areas it is also called *forest fire*. Again, the translation into valid R code is straightforward:

floodfill <- function(row, col, tcol, rcol) { if (tcol == rcol) return() if (M[row, col] != tcol) return() Q <- matrix(c(row, col), 1, 2) while (dim(Q)[1] > 0) { n <- Q[1, , drop = FALSE] west <- cbind(n[1] , n[2] - 1) east <- cbind(n[1] , n[2] + 1) north <- cbind(n[1] + 1, n[2] ) south <- cbind(n[1] - 1, n[2] ) Q <- Q[-1, , drop = FALSE] if (M[n] == tcol) { M[n] <<- rcol if (M[west] == tcol) Q <- rbind(Q, west) if (M[east] == tcol) Q <- rbind(Q, east) if (M[north] == tcol) Q <- rbind(Q, north) if (M[south] == tcol) Q <- rbind(Q, south) } } return("filling completed") }

As an example we will use a much bigger picture (it can be downloaded from here: Unfilledcirc.png):

library(png) img <- readPNG("data/Unfilledcirc.png") M <- img[ , , 1] M <- ifelse(M < 0.5, 0, 1) M <- rbind(M, 0) M <- cbind(M, 0) image(M, col = c(1, 0))

And now for the filling:

startrow <- 100; startcol <- 100 floodfill(startrow, startcol, 0, 2) ## [1] "filling completed" startrow <- 50; startcol <- 50 floodfill(startrow, startcol, 1, 3) ## [1] "filling completed" image(M, col = c(1, 0, 2, 3))

As you can see, with this version of the algorithm much bigger areas can be filled!

I also added both R implementations to the respective section of Rosetta Code: Bitmap/Flood fill.

]]>

There are a million reasons to learn R (see e.g. Why R for Data Science â and not Python?), but where to start? I present to you the ultimate introduction to bring you up to speed! So read on…

I call it *ultimate* because it is the essence of many years of teaching R… or put differently: it is the kind of introduction I would have liked to have when I started out with R back in the days!

A word of warning though: this is a introduction to R and not to statistics, so I won’t explain the statistics terms used here. You do not need to know any other *programming language* but it does no harm either. Ok, now let us start!

First you need to install **R** (https://www.r-project.org) and preferably **RStudio** as a *Graphical User Interface (GUI)*: https://www.rstudio.com/products/RStudio/#Desktop. Both are free and available for all common operating systems.

To get a quick overview of RStudio watch this video:

You can either type in the following commands in the *console* or open a new *script* tab (File -> New File -> R Script) and run the commands by pressing Ctrl + Enter/Return after having typed them.

First of all R is a very good *calculator*:

2 + 2 ## [1] 4 sin(0.5) ## [1] 0.4794255 abs(-10) # absolute value ## [1] 10 pi ## [1] 3.141593 exp(1) # e ## [1] 2.718282 factorial(6) ## [1] 720

By the way: The hash is used for *comments*, everything after it will be ignored!

Of course you can define *variables* and use them in your calculations:

n1 <- 2 n2 <- 3 n1 # show content of variable by just typing the name ## [1] 2 n1 + n2 ## [1] 5 n1 * n2 ## [1] 6 n1^n2 ## [1] 8

Part of R’s power stems from the fact that *functions* can handle several numbers at once, called *vectors*, and do calculations on them. When calling a function *arguments* are passed with round brackets:

n3 <- c(12, 5, 27) # concatenate (combine) elements into a vector n3 ## [1] 12 5 27 min(n3) ## [1] 5 max(n3) ## [1] 27 sum(n3) ## [1] 44 mean(n3) ## [1] 14.66667 sd(n3) # standard deviation ## [1] 11.23981 var(n3) # variance ## [1] 126.3333 median(n3) ## [1] 12 n3 / 12 ## [1] 1.0000000 0.4166667 2.2500000

In the last example the *12* was *recycled* three times. R always tries to do that (when feasible), sometimes giving a *warning* when it might not be intended:

n3 / c(1, 2) ## Warning in n3/c(1, 2): longer object length is not a multiple of shorter ## object length ## [1] 12.0 2.5 27.0

In cases you only want parts of your vectors you can apply *subsetting* with square brackets:

n3[1] ## [1] 12 n3[c(2, 3)] ## [1] 5 27

*Ranges* can easily be created with the colon:

n4 <- 10:20 n4 ## [1] 10 11 12 13 14 15 16 17 18 19 20

When you test whether this vector is bigger than a certain number you will get *logicals* as a result. You can use those logicals for subsetting:

n4 > 15 ## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE n4[n4 > 15] ## [1] 16 17 18 19 20

Perhaps you have heard the story of little Gauss where his teacher gave him the task to add all numbers from *1* to *100* to keep him busy for a while? Well, he found a mathematical trick to add them within seconds… for us normal people we can use R:

sum(1:100) ## [1] 5050

When we want to use some code several times we can define our own function (a *user-defined function*). We do that the same way we create a vector (or any other *data structure*) because R is a so called *functional programming language* and functions are so called *first-class citizens* (i.e. on the same level as other data structures like vectors). The code that is being executed is put in curly brackets:

gauss <- function(x) { sum(1:x) } gauss(100) ## [1] 5050 gauss(1000) ## [1] 500500

Of course we also have other data types, e.g. *matrices* are basically two dimensional vectors:

M <- matrix(1:12, nrow = 3, byrow = TRUE) # create a matrix M ## [,1] [,2] [,3] [,4] ## [1,] 1 2 3 4 ## [2,] 5 6 7 8 ## [3,] 9 10 11 12 dim(M) ## [1] 3 4

Subsetting now has to provide two numbers, the first for the *row*, the second for the *column* (like in the game Battleship). If you leave one out, all data of the respective dimension will be shown:

M[2, 3] ## [1] 7 M[ , c(1, 3)] ## [,1] [,2] ## [1,] 1 3 ## [2,] 5 7 ## [3,] 9 11

Another possibility to create matrices:

v1 <- 1:4 v2 <- 4:1 M1 <- rbind(v1, v2) # row bind M1 ## [,1] [,2] [,3] [,4] ## v1 1 2 3 4 ## v2 4 3 2 1 M2 <- cbind(v1, v2) # column bind M2 ## v1 v2 ## [1,] 1 4 ## [2,] 2 3 ## [3,] 3 2 ## [4,] 4 1

*Naming* rows, here with inbuilt datasets:

rownames(M2) <- LETTERS[1:4] M2 ## v1 v2 ## A 1 4 ## B 2 3 ## C 3 2 ## D 4 1 LETTERS ## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" ## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z" letters ## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" ## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

When some result is **N**ot **A**vailable:

LETTERS[50] ## [1] NA

Getting the *structure* of your variables:

str(LETTERS) ## chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" ... str(M2) ## int [1:4, 1:2] 1 2 3 4 4 3 2 1 ## - attr(*, "dimnames")=List of 2 ## ..$ : chr [1:4] "A" "B" "C" "D" ## ..$ : chr [1:2] "v1" "v2"

Another famous dataset (*iris*) that is also built into base R (to get help on any function or dataset just put the cursor in it and press *F1*):

iris ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5.0 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## 11 5.4 3.7 1.5 0.2 setosa ## 12 4.8 3.4 1.6 0.2 setosa ## 13 4.8 3.0 1.4 0.1 setosa ## 14 4.3 3.0 1.1 0.1 setosa ## 15 5.8 4.0 1.2 0.2 setosa ## 16 5.7 4.4 1.5 0.4 setosa ## 17 5.4 3.9 1.3 0.4 setosa ## 18 5.1 3.5 1.4 0.3 setosa ## 19 5.7 3.8 1.7 0.3 setosa ## 20 5.1 3.8 1.5 0.3 setosa ## 21 5.4 3.4 1.7 0.2 setosa ## 22 5.1 3.7 1.5 0.4 setosa ## 23 4.6 3.6 1.0 0.2 setosa ## 24 5.1 3.3 1.7 0.5 setosa ## 25 4.8 3.4 1.9 0.2 setosa ## 26 5.0 3.0 1.6 0.2 setosa ## 27 5.0 3.4 1.6 0.4 setosa ## 28 5.2 3.5 1.5 0.2 setosa ## 29 5.2 3.4 1.4 0.2 setosa ## 30 4.7 3.2 1.6 0.2 setosa ## 31 4.8 3.1 1.6 0.2 setosa ## 32 5.4 3.4 1.5 0.4 setosa ## 33 5.2 4.1 1.5 0.1 setosa ## 34 5.5 4.2 1.4 0.2 setosa ## 35 4.9 3.1 1.5 0.2 setosa ## 36 5.0 3.2 1.2 0.2 setosa ## 37 5.5 3.5 1.3 0.2 setosa ## 38 4.9 3.6 1.4 0.1 setosa ## 39 4.4 3.0 1.3 0.2 setosa ## 40 5.1 3.4 1.5 0.2 setosa ## 41 5.0 3.5 1.3 0.3 setosa ## 42 4.5 2.3 1.3 0.3 setosa ## 43 4.4 3.2 1.3 0.2 setosa ## 44 5.0 3.5 1.6 0.6 setosa ## 45 5.1 3.8 1.9 0.4 setosa ## 46 4.8 3.0 1.4 0.3 setosa ## 47 5.1 3.8 1.6 0.2 setosa ## 48 4.6 3.2 1.4 0.2 setosa ## 49 5.3 3.7 1.5 0.2 setosa ## 50 5.0 3.3 1.4 0.2 setosa ## 51 7.0 3.2 4.7 1.4 versicolor ## 52 6.4 3.2 4.5 1.5 versicolor ## 53 6.9 3.1 4.9 1.5 versicolor ## 54 5.5 2.3 4.0 1.3 versicolor ## 55 6.5 2.8 4.6 1.5 versicolor ## 56 5.7 2.8 4.5 1.3 versicolor ## 57 6.3 3.3 4.7 1.6 versicolor ## 58 4.9 2.4 3.3 1.0 versicolor ## 59 6.6 2.9 4.6 1.3 versicolor ## 60 5.2 2.7 3.9 1.4 versicolor ## 61 5.0 2.0 3.5 1.0 versicolor ## 62 5.9 3.0 4.2 1.5 versicolor ## 63 6.0 2.2 4.0 1.0 versicolor ## 64 6.1 2.9 4.7 1.4 versicolor ## 65 5.6 2.9 3.6 1.3 versicolor ## 66 6.7 3.1 4.4 1.4 versicolor ## 67 5.6 3.0 4.5 1.5 versicolor ## 68 5.8 2.7 4.1 1.0 versicolor ## 69 6.2 2.2 4.5 1.5 versicolor ## 70 5.6 2.5 3.9 1.1 versicolor ## 71 5.9 3.2 4.8 1.8 versicolor ## 72 6.1 2.8 4.0 1.3 versicolor ## 73 6.3 2.5 4.9 1.5 versicolor ## 74 6.1 2.8 4.7 1.2 versicolor ## 75 6.4 2.9 4.3 1.3 versicolor ## 76 6.6 3.0 4.4 1.4 versicolor ## 77 6.8 2.8 4.8 1.4 versicolor ## 78 6.7 3.0 5.0 1.7 versicolor ## 79 6.0 2.9 4.5 1.5 versicolor ## 80 5.7 2.6 3.5 1.0 versicolor ## 81 5.5 2.4 3.8 1.1 versicolor ## 82 5.5 2.4 3.7 1.0 versicolor ## 83 5.8 2.7 3.9 1.2 versicolor ## 84 6.0 2.7 5.1 1.6 versicolor ## 85 5.4 3.0 4.5 1.5 versicolor ## 86 6.0 3.4 4.5 1.6 versicolor ## 87 6.7 3.1 4.7 1.5 versicolor ## 88 6.3 2.3 4.4 1.3 versicolor ## 89 5.6 3.0 4.1 1.3 versicolor ## 90 5.5 2.5 4.0 1.3 versicolor ## 91 5.5 2.6 4.4 1.2 versicolor ## 92 6.1 3.0 4.6 1.4 versicolor ## 93 5.8 2.6 4.0 1.2 versicolor ## 94 5.0 2.3 3.3 1.0 versicolor ## 95 5.6 2.7 4.2 1.3 versicolor ## 96 5.7 3.0 4.2 1.2 versicolor ## 97 5.7 2.9 4.2 1.3 versicolor ## 98 6.2 2.9 4.3 1.3 versicolor ## 99 5.1 2.5 3.0 1.1 versicolor ## 100 5.7 2.8 4.1 1.3 versicolor ## 101 6.3 3.3 6.0 2.5 virginica ## 102 5.8 2.7 5.1 1.9 virginica ## 103 7.1 3.0 5.9 2.1 virginica ## 104 6.3 2.9 5.6 1.8 virginica ## 105 6.5 3.0 5.8 2.2 virginica ## 106 7.6 3.0 6.6 2.1 virginica ## 107 4.9 2.5 4.5 1.7 virginica ## 108 7.3 2.9 6.3 1.8 virginica ## 109 6.7 2.5 5.8 1.8 virginica ## 110 7.2 3.6 6.1 2.5 virginica ## 111 6.5 3.2 5.1 2.0 virginica ## 112 6.4 2.7 5.3 1.9 virginica ## 113 6.8 3.0 5.5 2.1 virginica ## 114 5.7 2.5 5.0 2.0 virginica ## 115 5.8 2.8 5.1 2.4 virginica ## 116 6.4 3.2 5.3 2.3 virginica ## 117 6.5 3.0 5.5 1.8 virginica ## 118 7.7 3.8 6.7 2.2 virginica ## 119 7.7 2.6 6.9 2.3 virginica ## 120 6.0 2.2 5.0 1.5 virginica ## 121 6.9 3.2 5.7 2.3 virginica ## 122 5.6 2.8 4.9 2.0 virginica ## 123 7.7 2.8 6.7 2.0 virginica ## 124 6.3 2.7 4.9 1.8 virginica ## 125 6.7 3.3 5.7 2.1 virginica ## 126 7.2 3.2 6.0 1.8 virginica ## 127 6.2 2.8 4.8 1.8 virginica ## 128 6.1 3.0 4.9 1.8 virginica ## 129 6.4 2.8 5.6 2.1 virginica ## 130 7.2 3.0 5.8 1.6 virginica ## 131 7.4 2.8 6.1 1.9 virginica ## 132 7.9 3.8 6.4 2.0 virginica ## 133 6.4 2.8 5.6 2.2 virginica ## 134 6.3 2.8 5.1 1.5 virginica ## 135 6.1 2.6 5.6 1.4 virginica ## 136 7.7 3.0 6.1 2.3 virginica ## 137 6.3 3.4 5.6 2.4 virginica ## 138 6.4 3.1 5.5 1.8 virginica ## 139 6.0 3.0 4.8 1.8 virginica ## 140 6.9 3.1 5.4 2.1 virginica ## 141 6.7 3.1 5.6 2.4 virginica ## 142 6.9 3.1 5.1 2.3 virginica ## 143 5.8 2.7 5.1 1.9 virginica ## 144 6.8 3.2 5.9 2.3 virginica ## 145 6.7 3.3 5.7 2.5 virginica ## 146 6.7 3.0 5.2 2.3 virginica ## 147 6.3 2.5 5.0 1.9 virginica ## 148 6.5 3.0 5.2 2.0 virginica ## 149 6.2 3.4 5.4 2.3 virginica ## 150 5.9 3.0 5.1 1.8 virginica

Oops, that is a bit long… if you only want to show the first or last rows do the following:

head(iris) # first 6 rows ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa tail(iris, 10) # last 10 rows ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 141 6.7 3.1 5.6 2.4 virginica ## 142 6.9 3.1 5.1 2.3 virginica ## 143 5.8 2.7 5.1 1.9 virginica ## 144 6.8 3.2 5.9 2.3 virginica ## 145 6.7 3.3 5.7 2.5 virginica ## 146 6.7 3.0 5.2 2.3 virginica ## 147 6.3 2.5 5.0 1.9 virginica ## 148 6.5 3.0 5.2 2.0 virginica ## 149 6.2 3.4 5.4 2.3 virginica ## 150 5.9 3.0 5.1 1.8 virginica

Iris is a so called *data frame*, the workhorse of R and data science (you will see how to create one below):

str(iris) ## 'data.frame': 150 obs. of 5 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

As you can see, data frames can combine different data types. If you try to do that with e.g. vectors, which can only hold one data type, something called *coercion* happens, i.e. at least one data type is forced to become another one so that consistency is maintained:

str(c(2, "Hello")) # 2 is coerced to become a character string too ## chr [1:2] "2" "Hello"

You can get a fast overview of your data like so:

summary(iris[1:4]) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 boxplot(iris[1:4])

As you have seen, R often runs a function on all of the data simultaneously. This feature is called *vectorization* and in many other languages you would need a *loop* for that. In R you don’t use loops that often, but of course they are available:

for (i in seq(5)) { print(1:i) } ## [1] 1 ## [1] 1 2 ## [1] 1 2 3 ## [1] 1 2 3 4 ## [1] 1 2 3 4 5

Speaking of *control structures*: of course *conditional statements* are available too:

even <- function(x) ifelse(x %% 2 == 0, TRUE, FALSE) # %% gives remainder of division (= modulo operator) even(1:5) ## [1] FALSE TRUE FALSE TRUE FALSE

Linear modelling (e.g. *correlation* and *linear regression*) couldn’t be any easier, it is included in the core language:

age <- c(21, 46, 55, 35, 28) income <- c(1850, 2500, 2560, 2230, 1800) df <- data.frame(age, income) # create a data frame df ## age income ## 1 21 1850 ## 2 46 2500 ## 3 55 2560 ## 4 35 2230 ## 5 28 1800 cor(df) # correlation ## age income ## age 1.0000000 0.9464183 ## income 0.9464183 1.0000000 LinReg <- lm(income ~ age, data = df) # income as a linear model of age LinReg ## ## Call: ## lm(formula = income ~ age, data = df) ## ## Coefficients: ## (Intercept) age ## 1279.37 24.56 summary(LinReg) ## ## Call: ## lm(formula = income ~ age, data = df) ## ## Residuals: ## 1 2 3 4 5 ## 54.92 90.98 -70.04 91.12 -166.98 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1279.367 188.510 6.787 0.00654 ** ## age 24.558 4.838 5.076 0.01477 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 132.1 on 3 degrees of freedom ## Multiple R-squared: 0.8957, Adjusted R-squared: 0.8609 ## F-statistic: 25.77 on 1 and 3 DF, p-value: 0.01477 plot(df, pch = 16, main = "Linear model") abline(LinReg, col = "blue", lwd = 2) # adding the regression line

You could directly use the model to make predictions:

pred_LinReg <- predict(LinReg, data.frame(age = seq(15, 70, 5))) names(pred_LinReg) <- seq(15, 70, 5) round(pred_LinReg, 2) ## 15 20 25 30 35 40 45 50 55 ## 1647.73 1770.52 1893.31 2016.10 2138.88 2261.67 2384.46 2507.25 2630.04 ## 60 65 70 ## 2752.83 2875.61 2998.40

If you want to know more about the modelling process you can find it here: Learning Data Science: Modelling Basics

Another strength of R is the huge number of add-on *packages* for all kinds of specialized tasks. For the grand finale of this introduction, we’re gonna get a little taste of *machine learning*. For that matter we install the `OneR`

package from *CRAN* (the official package repository of R): Tools -> Install packages… -> type in “OneR” -> click “Install”.

After that we build a simple model on the iris dataset to predict the *Species* column:

library(OneR) # load package data <- optbin(Species ~., data = iris) # find optimal bins for numeric predictors model <- OneR(data, verbose = TRUE) # build actual model ## ## Attribute Accuracy ## 1 * Petal.Width 96% ## 2 Petal.Length 95.33% ## 3 Sepal.Length 74.67% ## 4 Sepal.Width 55.33% ## --- ## Chosen attribute due to accuracy ## and ties method (if applicable): '*' summary(model) # show rules ## ## Call: ## OneR.data.frame(x = data, verbose = TRUE) ## ## Rules: ## If Petal.Width = (0.0976,0.791] then Species = setosa ## If Petal.Width = (0.791,1.63] then Species = versicolor ## If Petal.Width = (1.63,2.5] then Species = virginica ## ## Accuracy: ## 144 of 150 instances classified correctly (96%) ## ## Contingency table: ## Petal.Width ## Species (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum ## setosa * 50 0 0 50 ## versicolor 0 * 48 2 50 ## virginica 0 4 * 46 50 ## Sum 50 52 48 150 ## --- ## Maximum in each column: '*' ## ## Pearson's Chi-squared test: ## X-squared = 266.35, df = 4, p-value < 2.2e-16 plot(model)

We’ll now see how well the model is doing:

prediction <- predict(model, data) eval_model(prediction, data) ## ## Confusion matrix (absolute): ## Actual ## Prediction setosa versicolor virginica Sum ## setosa 50 0 0 50 ## versicolor 0 48 4 52 ## virginica 0 2 46 48 ## Sum 50 50 50 150 ## ## Confusion matrix (relative): ## Actual ## Prediction setosa versicolor virginica Sum ## setosa 0.33 0.00 0.00 0.33 ## versicolor 0.00 0.32 0.03 0.35 ## virginica 0.00 0.01 0.31 0.32 ## Sum 0.33 0.33 0.33 1.00 ## ## Accuracy: ## 0.96 (144/150) ## ## Error rate: ## 0.04 (6/150) ## ## Error rate reduction (vs. base rate): ## 0.94 (p-value < 2.2e-16)

*96% accuracy* is not too bad, even for this simple dataset!

If you want to know more about the OneR package you can read the *vignette*: OneR – Establishing a New Baseline for Machine Learning Classification Models.

Well, and that’s it for the ultimate introduction to R – hopefully you liked it and you learned something! Please share your first experiences with R in the comments and also if you miss something (I might add it in the future!) – Thank you for reading and stay tuned for more to come!

]]>

Bavaria is known for its famous Oktoberfest… and within Germany also for its presumably difficult *Abitur,* a qualification granted by university-preparatory schools in Germany.

A mandatory part for all students is maths. This year many students protested that the maths part was way too hard, they even started an online petition with more than seventy thousand supporters at this time of writing!

It is not clear yet whether their marks will be adjusted upwards, the ministry of education is investigating the case. As a professor in Bavaria who also teaches statistics I will take the opportunity to share with you an actual question from the original examination with solution, so read on…

Let us have a look at the first (and easiest) question in the stochastics part:

Every sixth visitor to the Oktoberfest wears a gingerbread heart around his neck. During the festival 25 visitors are chosen at random. Determine the probability that at most one of the selected visitors will have a gingerbread heart.

Before you read on try to solve the task yourself…

Of course students are not allowed to use R in the examination but in general the good thing about this kind of questions is that if you have no idea how to solve them *analytically* solving them by *simulation* is often much easier:

set.seed(12) N <- 1e7 v <- matrix(sample(c(0L, 1L), size = 25 * N, replace = TRUE, prob = c(5/6, 1/6)), ncol = 25) sum(rowSums(v) <= 1) / N ## [1] 0.062936

The answer is about .

Now for the analytical solution: “At least one” implies that we have to differentiate between two cases, “no gingerbread heart” and “exactly one gingerbread heart”. “No gingerbread heart” is just . “Exactly one gingerbread heart” is because there are possibilities of where the gingerbread heart could occur. We have to add both probabilities:

(5/6)^25 + 25*1/6*(5/6)^24 ## [1] 0.06289558

If you know a little bit about *probability distributions* you will recognize the above as the *binomial distribution*:

pbinom(q = 1, size = 25, prob = 1/6) ## [1] 0.06289558

Of course it is a little unfair to judge just on basis of the easiest task and without knowing the general maths level that is required. But still, I would like to hear your opinion on this. Also outsiders’ views from different countries and different educational systems are very welcome! So, what do you think:

**Was the Bavarian Abitur too hard this time? Please leave your reply in the comment section below!**

**Update** (June 7, 2019): The Bavarian Ministry of Education has decided now that the marks will *not* be subsequently adjusted.

R is one of the best choices when it comes to *quantitative finance*. Here we will show you how to load financial data, plot *charts* and give you a step-by-step template to *backtest trading strategies*. So, read on…

We begin by just plotting a chart of the *Standard & Poor’s 500 (S&P 500)*, an index of the 500 biggest companies in the US. To get the index data and plot the chart we use the powerful `quantmod`

package (on CRAN). After that we add two popular *indicators*, the *simple moving average (SMI)* and the *exponential moving average (EMA)*.

Have a look at the code:

library(quantmod) ## Loading required package: xts ## Loading required package: zoo ## ## Attaching package: 'zoo' ## The following objects are masked from 'package:base': ## ## as.Date, as.Date.numeric ## Loading required package: TTR ## Version 0.4-0 included new data defaults. See ?getSymbols. getSymbols("^GSPC", from = "2000-01-01") ## 'getSymbols' currently uses auto.assign=TRUE by default, but will ## use auto.assign=FALSE in 0.5-0. You will still be able to use ## 'loadSymbols' to automatically load data. getOption("getSymbols.env") ## and getOption("getSymbols.auto.assign") will still be checked for ## alternate defaults. ## ## This message is shown once per session and may be disabled by setting ## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details. ## [1] "^GSPC" head(GSPC) ## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume ## 2000-01-03 1469.25 1478.00 1438.36 1455.22 931800000 ## 2000-01-04 1455.22 1455.22 1397.43 1399.42 1009000000 ## 2000-01-05 1399.42 1413.27 1377.68 1402.11 1085500000 ## 2000-01-06 1402.11 1411.90 1392.10 1403.45 1092300000 ## 2000-01-07 1403.45 1441.47 1400.73 1441.47 1225200000 ## 2000-01-10 1441.47 1464.36 1441.47 1457.60 1064800000 ## GSPC.Adjusted ## 2000-01-03 1455.22 ## 2000-01-04 1399.42 ## 2000-01-05 1402.11 ## 2000-01-06 1403.45 ## 2000-01-07 1441.47 ## 2000-01-10 1457.60 tail(GSPC) ## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume ## 2019-04-24 2934.00 2936.83 2926.05 2927.25 3448960000 ## 2019-04-25 2928.99 2933.10 2912.84 2926.17 3425280000 ## 2019-04-26 2925.81 2939.88 2917.56 2939.88 3248500000 ## 2019-04-29 2940.58 2949.52 2939.35 2943.03 3118780000 ## 2019-04-30 2937.14 2948.22 2924.11 2945.83 3919330000 ## 2019-05-01 2952.33 2954.13 2923.36 2923.73 3645850000 ## GSPC.Adjusted ## 2019-04-24 2927.25 ## 2019-04-25 2926.17 ## 2019-04-26 2939.88 ## 2019-04-29 2943.03 ## 2019-04-30 2945.83 ## 2019-05-01 2923.73 chartSeries(GSPC, theme = chartTheme("white"), subset = "last 10 months", show.grid = TRUE)

addSMA(20)

addEMA(20)

As you can see the moving averages are basically smoothed out versions of the original data shifted by the given number of days. While with the SMA (red curve) all days are weighted equally with the EMA (blue curve) the more recent days are weighted stronger, so that the resulting indicator detects changes quicker. The idea is that by using those indicators investors might be able to detect longer term trends and act accordingly. For example a trading rule could be to buy the index whenever it crosses the MA from below and sell when it goes the other direction. Judge for yourself if this could have worked.

Well, having said that it might not be that easy to find out the profitability of certain trading rules just by staring at a chart. We are looking for something more systematic! We would need a decent *backtest*! This can of course also be done with R, a great choice is the `PerformanceAnalytics`

package (on CRAN).

To backtest a trading strategy I provide you with a step-by-step template:

- Load libraries and data
- Create your indicator
- Use indicator to create equity curve
- Evaluate strategy performance

As an example we want to test the idea that it might be profitable to sell the index when the financial markets exhibit significant stress. Interestingly enough âstressâ can be measured by certain *indicators* that are freely available. One of them is the *National Financial Conditions Index (NFCI)* of the *Federal Reserve Bank of Chicago* (https://www.chicagofed.org/publications/nfci/index):

The Chicago Fedâs National Financial Conditions Index (NFCI) provides a comprehensive weekly update on U.S. financial conditions in money markets, debt and equity markets and the traditional and âshadowâ banking systems. […] The NFCI [is] constructed to have an average value of zero and a standard deviation of one over a sample period extending back to 1971. Positive values of the NFCI have been historically associated with tighter-than-average financial conditions, while negative values have been historically associated with looser-than-average financial conditions.

To make it more concrete we want to create a *buy signal* when the index is above one standard deviation in negative territory and a *sell signal* otherwise.

Have a look at the following code:

# Step 1: Load libraries and data library(quantmod) library(PerformanceAnalytics) ## ## Attaching package: 'PerformanceAnalytics' ## The following object is masked from 'package:graphics': ## ## legend getSymbols('NFCI', src = 'FRED', , from = '2000-01-01') ## [1] "NFCI" NFCI <- na.omit(lag(NFCI)) # we can only act on the signal after release, i.e. the next day getSymbols("^GSPC", from = '2000-01-01') ## [1] "^GSPC" data <- na.omit(merge(NFCI, GSPC)) # merge before (!) calculating returns) data$GSPC <- na.omit(ROC(Cl(GSPC))) # calculate returns of closing prices # Step 2: Create your indicator data$sig <- ifelse(data$NFCI < 1, 1, 0) data$sig <- na.locf(data$sig) # Step 3: Use indicator to create equity curve perf <- na.omit(merge(data$sig * data$GSPC, data$GSPC)) colnames(perf) <- c("Stress-based strategy", "SP500") # Step 4: Evaluate strategy performance table.DownsideRisk(perf) ## Stress-based strategy SP500 ## Semi Deviation 0.0075 0.0087 ## Gain Deviation 0.0071 0.0085 ## Loss Deviation 0.0079 0.0095 ## Downside Deviation (MAR=210%) 0.0125 0.0135 ## Downside Deviation (Rf=0%) 0.0074 0.0087 ## Downside Deviation (0%) 0.0074 0.0087 ## Maximum Drawdown 0.5243 0.6433 ## Historical VaR (95%) -0.0173 -0.0188 ## Historical ES (95%) -0.0250 -0.0293 ## Modified VaR (95%) -0.0166 -0.0182 ## Modified ES (95%) -0.0268 -0.0311 table.Stats(perf) ## Stress-based strategy SP500 ## Observations 4858.0000 4858.0000 ## NAs 0.0000 0.0000 ## Minimum -0.0690 -0.0947 ## Quartile 1 -0.0042 -0.0048 ## Median 0.0003 0.0005 ## Arithmetic Mean 0.0002 0.0002 ## Geometric Mean 0.0002 0.0001 ## Quartile 3 0.0053 0.0057 ## Maximum 0.0557 0.1096 ## SE Mean 0.0001 0.0002 ## LCL Mean (0.95) -0.0001 -0.0002 ## UCL Mean (0.95) 0.0005 0.0005 ## Variance 0.0001 0.0001 ## Stdev 0.0103 0.0120 ## Skewness -0.1881 -0.2144 ## Kurtosis 3.4430 8.5837 charts.PerformanceSummary(perf)

chart.RelativePerformance(perf[ , 1], perf[ , 2])

chart.RiskReturnScatter(perf)

The first chart shows that the stress-based strategy (black curve) clearly outperformed its *benchmark*, the S&P 500 (red curve). This can also be seen in the second chart, showing the relative performance. In the third chart we see that both return (more) and (!) risk (less) of our backtested strategy are more favourable compared to the benchmark.

So, all in all this seems to be a viable strategy! But of course before investing real money many more tests have to be performed! You can use this framework for backtesting your own ideas.

Here is not the place to explain all of the above tables and plots but as you can see both packages are very, very powerful and I have only shown you a small fraction of their capabilities. To use their full potential you should have a look at the extensive documentation that comes with it on CRAN.

**Disclaimer: This is no investment advice! No responsibility is taken whatsoever if you lose money!**

If you gain money though I would be happy if you could buy me a coffee… that is not too much to ask, is it?

]]>

Tomorrow, on the *First of May*, many countries celebrate the so called *International Workers’ Day* (or *Labour Day*): time to talk about the *unequal distribution of wealth* again!

A few months ago I posted a piece with the title “If wealth had anything to do with intelligenceâŚ” where I argued that *ability*, e.g. *intelligence*, as an input has nothing to do with *wealth* as an output. It drew a lot of criticism (as expected), most of it unfounded in my opinion but one piece merits some discussion: the fact that the *intelligence quotient* (IQ) is normally distributed *by construction*. The argument goes that intelligence per se may be a distribution with *fat tails* too but by the way the IQ is constructed the metric is being transformed into a well formed *gaussian distribution*. To a degree this is certainly true, yet I would still argue that the distribution of intelligence and all other human abilities are far more well behaved than the extremely unequal distribution of wealth. I wrote in a comment:

There are many aspects in your comment that are certainly true. Obviously there are huge problems in measuring âtrueâ mental abilities, which is the exact reason why people came up with a somewhat artificial âintelligence quotientâ with all its shortcomings.

What would be interesting to see is (and I donât know if you perhaps have a source about this) what the outcome of an intelligence test would look like without the âquotientâ part, i.e. without subsequently normalizing the results.

I guess the relationship wouldnât be strictly linear but it wouldnât be as extreme as the wealth distribution either.

What I think is true in any case, independent of the distributions, is when you rank all people by intelligence and by wealth respectively you wouldnât really see any stable connection â and that spirit was the intention of my post in the first place and I still stand by it, although some of the technicalities are obviously debatable.

So, if you have a source, Dear Reader, you are more than welcome to share it in the comments – I am always eager to learn!

I ended my post with:

But if it is not ability/intelligence that determines the distribution of wealth what else could account for the extreme inequality we perceive in the world?

In this post I will argue that *luck* is a good candidate, so read on…

In *2014* there was a special issue of the renowned magazine *Science* titled “The science of inequality”. In one of the articles (Cho, A.: “Physicists say itâs simple”) the following thought experiment is being proposed:

Suppose you randomly divide 500 million in income among 10,000 people. Thereâs only one way to give everyone an equal, 50,000 share. So if youâre doling out earnings randomly, equality is extremely unlikely. But there are countless ways to give a few people a lot of cash and many people a little or nothing. In fact, given all the ways you could divvy out income, most of them produce an exponential distribution of income.

So, the basic idea is to randomly throw *9,999* darts at a scale ranging from zero to *500* million and study the resulting distribution of intervals:

library(MASS) w <- 5e8 # wealth p <- 1e4 # no. of people set.seed(123) d <- diff(c(0, sort(runif(p-1, max = w)), w)) # wealth distribution h <- hist(d, col = "red", main = "Exponential decline", freq = FALSE, breaks = 45, xlim = c(0, quantile(d, 0.99))) fit <- fitdistr(d, "exponential") curve(dexp(x, rate = fit$estimate), col = "black", type = "p", pch = 16, add = TRUE)

The resulting distribution fits an *exponential distribution* very well. You can read some interesting discussions concerning this result on *CrossValidated StackExchange*: How can I analytically prove that randomly dividing an amount results in an exponential distribution (of e.g. income and wealth)?

Just to give you an idea of how unfair this distribution is: the richest six persons have more wealth than the poorest ten percent together:

sum(sort(d)[9994:10000]) - sum(sort(d)[0:1000]) ## [1] 183670.8

If you think that this is ridiculous just look at the real global wealth distribution: here it is not six but three persons who own more than the poorest ten percent!

Now, what does that mean? Well, equality seems to be the exception and (extreme) inequality the rule. The intervals were found randomly, no interval had any special skills, just luck – and the result is (extreme) inequality – as in the real world!

If you can reproduce the wealth distribution of a society stochastically this could have the implication that it werenât so much the extraordinary skills of the rich which made them rich but they just got lucky.

Some rich people are decent enough to admit this. In his impressive essay “Why Poverty Is Like a Disease” Christian H. Cooper, a hillbilly turned investment banker writes:

So how did I get out? By chance.

Itâs easy to attach a post-facto narrative of talent and hard work to my story, because thatâs what weâre fed by everything from Hollywood to political stump speeches. But itâs the wrong story. My escape was made up of a series of incredibly unlikely events, none of which I had real control over.

[…]

I am the exception that proves the ruleâbut that rule is that escape from poverty is a matter of chance, and not a matter of merit.

A consequence would be that you cannot really learn much from the rich. So throw away all of your self help books on how to become successful. I will end with a cartoon, which brings home this message, on a closely related concept, the so called *survivorship bias* (which is also important to keep in mind when backtesting trading strategies in quantitative finance, the topic of an upcoming post… so stay tuned!):

Like most people you will have used a search engine lately, like *Google*. But have you ever thought about how it manages to give you the most fitting results? How does it order the results so that the best are on top? Read on to find out!

The earliest search engines either had human curated indices, like *Yahoo!* or used some simple heuristic like the more often the keyword you were looking for was mentioned on a page the better, like *Altavista* – which led to all kinds of crazy effects like certain keywords being repeated thousands of times on webpages to make them more “relevant”.

Now, most of those search engines are long gone because a new kid arrived on the block: Google! Googleâs search engine results were much better than all of the competition and they became the dominant player in no time. How did they do that?

The big idea was in fact published by the two founders: Sergey Brin and Lawrence Page, it is called the *pagerank algorithm* (which is of course a pun because one of the authors was named Page too). The original paper can be found here: S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine.

Let us start with another, related question: which properties are the best to own in Monopoly? Many would instinctively answer with the most expensive ones, i.e. Park Place and Boardwalk. But a second thought reveals that those might be the the ones where you get the biggest rent if somebody lands on them but that the last part is the caveatâŚ âIFâ somebody lands on them! The best streets are actually the ones where players land on the most. Those happen to be the orange streets, St. James Place, Tennessee Avenue and New York Avenue and therefore they are the key to winning the game.

How do find those properties? For example by simulation: you just simulate thousands of dice rolls and see where the players land.

A similar idea holds true for finding the best web pages: you just start from a random position and simulate a surfer who visits different web pages by chance. For each surfing session you tally the respective webpage where she ends up and after many runs we get a percentage for each page. The higher this percentage is the more relevant the webpage!

Let us do this with some R code. First we define a very small net and plot it (the actual example can be found in chapter 30 of the very good book “Chaotic Fishponds and Mirror Universes” by Richard Elwes):

library(igraph) ## ## Attaching package: 'igraph' ## The following objects are masked from 'package:stats': ## ## decompose, spectrum ## The following object is masked from 'package:base': ## ## union # cols represent outgoing links, rows incoming links # A links to C, D; B links to A; C links to A; D links to A,B,C M <- matrix(c(0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0), nrow = 4) colnames(M) <- rownames(M) <- c("A", "B", "C", "D") M ## A B C D ## A 0 1 1 1 ## B 0 0 0 1 ## C 1 0 0 1 ## D 1 0 0 0 g <- graph_from_adjacency_matrix(t(M)) # careful with how the adjacency matrix is defined -> transpose of matrix plot(g)

Now, we are running the actual simulation. We define two helper functions for that, `next_page`

for getting a random but possible next page given the page our surfer is on at the moment and `last_page`

which gives the final page after `N`

clicks:

next_page <- function(page, graph) { l <- sample(rownames(graph)[as.logical(graph[ , as.logical(page)])], 1) as.numeric(rownames(graph) == l) } last_page <- function(page, graph, N = 100) { for (i in seq(N)) { page <- next_page(page, graph) } page } current_page <- c(1, 0, 0, 0) # random surfer starting from A random_surfer <- replicate(2e4, last_page(current_page, M, 50)) round(rowSums(random_surfer) / sum(random_surfer), 2) ## [1] 0.43 0.07 0.28 0.22

So we see that page A is the most relevant one because our surfer ends up being there in more than 40% of all sessions, after that come the pages C, D and B. When you look at the net that makes sense, since all pages refer to A whereas B gets only one link, so it doesnât seem to be that relevant.

As you have seen the simulation even for this small net took quite long so we need some clever mathematics to speed up the process. One idea is to transform our matrix which represents the network into a matrix which gives the probabilities of landing on the next pages and then multiply the probability matrix with the current position (and thereby transform the probabilities). Let us do this for the first step:

M_prob <- prop.table(M, 2) # create probability matrix M_prob ## A B C D ## A 0.0 1 1 0.3333333 ## B 0.0 0 0 0.3333333 ## C 0.5 0 0 0.3333333 ## D 0.5 0 0 0.0000000 M_prob %*% current_page ## [,1] ## A 0.0 ## B 0.0 ## C 0.5 ## D 0.5

The result says that there is a fifty-fifty chance of landing on C or D. When you look at the graph you see that this is correct since there are two links, one to C and one to D! For the next step you would have to multiply the matrix with the result again, or first multiply the matrix with itself before multiplying with the current position, which gives:

If we want to do this a hundred times we just have to raise this probability matrix to the one hundredth power:

We use the `%^%`

operator in the `expm`

package (on CRAN) for that:

library(expm) ## Loading required package: Matrix ## ## Attaching package: 'expm' ## The following object is masked from 'package:Matrix': ## ## expm r <- M_prob %^% 100 %*% current_page r ## [,1] ## A 0.42857143 ## B 0.07142857 ## C 0.28571429 ## D 0.21428571

Again, we get the same result! You might ask: why ? The answer is that this is in most cases enough to get a stable result so that any further multiplication still results in the same result:

The last equations opens up still another possibility: we are obviously looking for a vector which goes unaffected when multiplied by the matrix . There is a mathematical name for that kind of behaviour: *eigenvector*! As you might have guessed the name is an import from the German language where it means something like “own vector”.

This hints at the problem we were solving all along (without consciously realizing perhaps): a page is the more relevant the more relevant a page is that links to it… now we have to know the importance of that page but that page two is the more relevant… and so on and so forth, we are going in circles here. The same is true when you look at the equation above: you define in terms of – is the eigenvector of matrix !

There are very fast and powerful methods to find the eigenvectors of a matrix, and the corresponding `eigen`

function is even a function in base R:

lr <- Re(eigen(M_prob)$vectors[ , 1]) # real parts of biggest eigenvector lr / sum(lr) # normalization ## [1] 0.42857143 0.07142857 0.28571429 0.21428571

Again, the same result! You can now understand the title of this post and titles of other articles about the pagerank algorithm and Google like “The $25,000,000,000 eigenvector”.

Yet, a word of warning is in order: there are cases where the probability matrix is not diagonalizable (we wonât get into the mathematical details here), which means that the eigenvector method wonât give sensible results. To check this the following code must evaluate to `TRUE`

:

ev <- eigen(M_prob)$values length(unique(ev)) == length(ev) ## [1] TRUE

We now repeat the last two methods for a bigger network:

set.seed(1415) n <- 10 g <- sample_gnp(n, p = 1/4, directed = TRUE) # create random graph g <- set_vertex_attr(g, "name", value = LETTERS[1:n]) plot(g)

M <- t(as_adjacency_matrix(g, sparse = FALSE)) M_prob <- prop.table(M, 2) # create probability matrix M_prob ## A B C D E F G H I J ## A 0.00 0 0 1 0.5 0.5 0.5 0.0000000 0.0000000 0.5 ## B 0.00 0 0 0 0.0 0.0 0.0 0.3333333 0.0000000 0.0 ## C 0.00 1 0 0 0.0 0.0 0.0 0.0000000 0.3333333 0.5 ## D 0.25 0 0 0 0.0 0.0 0.0 0.0000000 0.0000000 0.0 ## E 0.25 0 0 0 0.0 0.0 0.5 0.3333333 0.3333333 0.0 ## F 0.00 0 1 0 0.0 0.0 0.0 0.0000000 0.3333333 0.0 ## G 0.25 0 0 0 0.0 0.0 0.0 0.0000000 0.0000000 0.0 ## H 0.00 0 0 0 0.5 0.0 0.0 0.0000000 0.0000000 0.0 ## I 0.00 0 0 0 0.0 0.5 0.0 0.0000000 0.0000000 0.0 ## J 0.25 0 0 0 0.0 0.0 0.0 0.3333333 0.0000000 0.0 current_page <- c(1, rep(0, n-1)) r <- M_prob %^% 100 %*% current_page r ## [,1] ## A 0.27663574 ## B 0.02429905 ## C 0.08878509 ## D 0.06915881 ## E 0.14579434 ## F 0.10654199 ## G 0.06915881 ## H 0.07289723 ## I 0.05327107 ## J 0.09345787 lr <- Re(eigen(M_prob)$vectors[ , 1]) lr / sum(lr) # normalization of the real parts ## [1] 0.27663551 0.02429907 0.08878505 0.06915888 0.14579439 0.10654206 ## [7] 0.06915888 0.07289720 0.05327103 0.09345794

We can now order the pages according to their importance – like the first 10 results of a google search:

search <- data.frame(Page = LETTERS[1:n], Rank = r) search[order(search$Rank, decreasing = TRUE), ] ## Page Rank ## A A 0.27663574 ## E E 0.14579434 ## F F 0.10654199 ## J J 0.09345787 ## C C 0.08878509 ## H H 0.07289723 ## D D 0.06915881 ## G G 0.06915881 ## I I 0.05327107 ## B B 0.02429905

Looking at the net, does the resulting order make sense to you?

Congratulations, you now understand the big idea behind one the greatest revolutions in information technology!

]]>

In this post we are talking about one of the most unintuitive results in statistics: the so called *false positive paradox* which is an example of the so called *base rate fallacy*. It describes a situation where a positive test result of a very sensitive medical test shows that you have the respective disease… yet you are most probably healthy!

The reason for this is that the disease itself is so rare that even with a very sensitive test the result is most probably *false positive*: it shows that you have the disease yet this result is false, you are healthy.

The key to understanding this result is to understand the difference between two conditional probabilities: the probability that you have a positive test result when you are sick and the probability that you are sick in case you got a positive test result – you are interested in the last (am I really sick?) but you only know the first.

Now for some notation (the vertical dash means “under the condition that”, P stands for probability):

- : if you are sick () you will probably have a positive test result () – this (the test result) is what we know
- : if you have a positive test result () you are probably not sick () – this (whether we are sick) is what we want to know

To calculate one conditional probability from the other we use the famous **Bayes’ theorem**:

In the following example we assume a disease with an infection rate of *1* in *1000* and a test to detect this disease with a *sensitivity* of *99%*. Have a look at the following code which illustrates the situation with *Euler diagrams*, first the big picture, then a zoomed-in version:

library(eulerr) A <- 0.001 # prevalence of disease BlA <- 0.99 # sensitivity of test B <- A * BlA + (1 - A) * (1 - BlA) # positive test (specificity same as sensitivity) AnB <- BlA * A AlB <- BlA * A / B # Bayes's theorem #AnB / B # Bayes's theorem in different form C <- 1 # the whole population main <- paste0("P(B|A) = ", round(BlA, 2), ", but P(A|B) = ", round(AlB, 2)) set.seed(123) fit1 <- euler(c("A" = A, "B" = B, "C" = C, "A&B" = AnB, "A&C" = A, "B&C" = B, "A&B&C" = AnB), input = "union") plot(fit1, main = main, fill = c("red", "green", "gray90"))

fit2 <- euler(c("A" = A, "B" = B, "A&B" = AnB), input = "union") plot(fit2, main = main, fill = c("red", "green"))

As you can see although this test is very sensitive when you get a positive test result the probability of you being infected is only *9%*!

In the diagrams *C* is the whole population and *A* are the infected individuals. *B* shows the people with a positive test result and you can see in the second diagram that almost all of the infected *A* are also part of *B* (the brown area = *true positive*), but still most ob *B* are outside of *A* (the green area), so although they are not infected they have a positive test result! They are *false positive*.

The red area shows the people that are infected (*A*) but get a negative test result, stating that they are healthy. This is called *false negative*. The grey area shows the people who are healthy and get a negative test result, they are *true negative*.

Due to the occasion we are now coming to an even more extreme example: did Jesus rise from the dead? It is inspired by the very good essay “A drop in the sea”: Don’t believe in miracles until you’ve done the math.

Let us assume that we had very, very reliable witnesses (as a side note what is strange though is that the gospels cannot even agree on how many men or angels appeared at the tomb: it is one angel in Matthew, a young man in Mark, two men in Luke and two angels in JohnâŚ but anyway), yet the big problem is that not many people so far have been able to defy death. I have only heard of two cases: supposedly the King of Kings (Jesus) but also of course the King himself (Elvis!), whereby sightings of Elvis after his death are much more numerous than of Jesus (just sayingâŚ )

Have a look at the following code (source for the number of people who have ever lived: WolframAlpha)

A <- 2/108500000000 # probability of coming back from the dead (The King = Elvis and the King of Kings = Jesus) BlA <- 0.9999999 # sensitivity of test -> very, very reliable witnesses (many more in case of Elvis B <- A * BlA + (1 - A) * (1 - BlA) # positive test = witnesses say He rose AnB <- BlA * A AlB <- BlA * A / B # Bayes's theorem C <- 1 # all people main <- paste0("P(B|A) = ", round(BlA, 2), ", but P(A|B) = ", round(AlB, 2)) fit1 <- euler(c("A" = A, "B" = B, "C" = C, "A&B" = AnB, "A&C" = A, "B&C" = B, "A&B&C" = AnB), input = "union") plot(fit1, main = main, fill = c("red", "green", "gray90"))

fit2 <- euler(c("A" = A, "B" = B, "A&B" = AnB), input = "union") plot(fit2, main = main, fill = c("red", "green"))

So, in this case *C* is the unfortunate group of people who have to go for goodâŚ it is us. As you can see although the witnesses are super reliable when they claim that somebody rose it is almost certain that they are wrong:

- : if Jesus rose () the very, very reliable witnesses would with a very high probability say so ()
- : if the very, very reliable witnesses said that Jesus rose () Jesus would still almost surely have stayed dead

Or in the words of the above mentioned essay:

No one is justified in believing in Jesus’s resurrection. The numbers simply don’t justify the conclusion.

But this chimes well with a famous Christian saying “I believe because it is absurd” (or in Latin “Credo quia absurdum”) – you can find out more about that in another highly interesting essay: ‘I believe because it is absurd’: Christianityâs first meme

Unfortunately this devastating conclusion is also true in the case of Elvis…

]]>

One of the problems of navigating an autonomous car through a city is to extract *robust signals* in the face of all the *noise* that is present in the different sensors. Just taking something like an arithmetic mean of all the data points could possibly end in a catastrophe: if a part of a wall looks similar to the street and the algorithm calculates an average trajectory of the two this would end in leaving the road and possibly crashing into pedestrians. So we need some robust algorithm to get rid of the noise. The area of statistics that especially deals with such problems is called *robust statistics* and the methods used therein *robust estimation*.

Now, one of the problems is that one doesn’t know what is signal and what is noise. The big idea behind *RANSAC* (short for *RAndom SAmple Consensus*) is to get rid of *outliers* by basically taking as many points as possible which form a well-defined region and leaving out the others. It does that iteratively, similar to the famous *k-means algorithm* (the topic of one of the upcoming posts, so stay tuned…).

To really understand how RANSAC works we will now build it with R. We will take a simple linear regression as an example and make it robust against outliers.

For educational purposes we will do this step by step:

- Understanding the general outline of the algorithm.
- Looking at the steps in more detail.
- Expressing the steps in pseudocode.
- Translating this into R!

Conveniently enough Wikipedia gives a good outline of the algorithm and even provides us with the very clear pseudocode which will serve as the basis for our own R implementation (the given Matlab code is not very good in my opinion and has nothing to do with the pseudocode):

The RANSAC algorithm is essentially composed of two steps that are iteratively repeated:

- In the first step, a sample subset containing minimal data items is randomly selected from the input dataset. A fitting model and the corresponding model parameters are computed using only the elements of this sample subset. The cardinality of the sample subset is the smallest sufficient to determine the model parameters.
- In the second step, the algorithm checks which elements of the entire dataset are consistent with the model instantiated by the estimated model parameters obtained from the first step. A data element will be considered as an outlier if it does not fit the fitting model instantiated by the set of estimated model parameters within some error threshold that defines the maximum deviation attributable to the effect of noise. The set of inliers obtained for the fitting model is called consensus set. The RANSAC algorithm will iteratively repeat the above two steps until the obtained consensus set in certain iteration has enough inliers.

In more detail:

RANSAC achieves its goal by repeating the following steps:

- Select a random subset of the original data. Call this subset the hypothetical inliers.
- A model is fitted to the set of hypothetical inliers.
- All other data are then tested against the fitted model. Those points that fit the estimated model well, according to some model-specific loss function, are considered as part of the consensus set.
- The estimated model is reasonably good if sufficiently many points have been classified as part of the consensus set.
- Afterwards, the model may be improved by reestimating it using all members of the consensus set.
This procedure is repeated a fixed number of times, each time producing either a model which is rejected because too few points are part of the consensus set, or a refined model together with a corresponding consensus set size. In the latter case, we keep the refined model if its consensus set is larger than the previously saved model.

Now, this can be expressed in pseudocode:

Given: data - a set of observed data points model - a model that can be fitted to data points n - minimum number of data points required to fit the model k - maximum number of iterations allowed in the algorithm t - threshold value to determine when a data point fits a model d - number of close data points required to assert that a model fits well to data Return: bestfit - model parameters which best fit the data (or nul if no good model is found) iterations = 0 bestfit = nul besterr = something really large while iterations < k { maybeinliers = n randomly selected values from data maybemodel = model parameters fitted to maybeinliers alsoinliers = empty set for every point in data not in maybeinliers { if point fits maybemodel with an error smaller than t add point to alsoinliers } if the number of elements in alsoinliers is > d { % this implies that we may have found a good model % now test how good it is bettermodel = model parameters fitted to all points in maybeinliers and alsoinliers thiserr = a measure of how well model fits these points if thiserr < besterr { bestfit = bettermodel besterr = thiserr } } increment iterations } return bestfit

It is quite easy to convert this into valid R code (as a learner of R you should try it yourself before looking at my solution!):

ransac <- function(data, n, k, t, d) { iterations <- 0 bestfit <- NULL besterr <- 1e5 while (iterations < k) { maybeinliers <- sample(nrow(data), n) maybemodel <- lm(y ~ x, data = data, subset = maybeinliers) alsoinliers <- NULL for (point in setdiff(1:nrow(data), maybeinliers)) { if (abs(maybemodel$coefficients[2]*data[point, 1] - data[point, 2] + maybemodel$coefficients[1])/(sqrt(maybemodel$coefficients[2] + 1)) < t) alsoinliers <- c(alsoinliers, point) } if (length(alsoinliers) > d) { bettermodel <- lm(y ~ x, data = data, subset = c(maybeinliers, alsoinliers)) thiserr <- summary(bettermodel)$sigma if (thiserr < besterr) { bestfit <- bettermodel besterr <- thiserr } } iterations <- iterations + 1 } bestfit }

We now test this with some sample data:

data <- read.csv("data/RANSAC.csv") plot(data) abline(lm(y ~ x, data = data)) set.seed(3141) abline(ransac(data, n = 10, k = 10, t = 0.5, d = 10), col = "blue")

The black line is a plain vanilla linear regression, the blue line is the RANSAC-enhanced version. As you can see: no more crashing into innocent pedestrians.

]]>