## Causation doesn’t imply Correlation either

You may have misread the title as the old correlation does not imply causation mantra, but the opposite is also true! If you don’t believe me, read on…

First I want to provide you with some intuition on what correlation is really all about! For many people (and many of my students for sure) the implications of the following formula for the correlation coefficient of two variables and are not immediately clear:

In fact the most interesting part is this: . We see a product of two differences. The differences consist of the data points minus the respective means (average values): in effect this leads to the origin being moved to the means of both variables (as if you moved the crosshair right into the centre of all data points).

There are now four possible quadrants for every data point: top or bottom, left or right. Top right means that both differences are positive, so the result will be positive too. The same is true for the bottom left quadrant because minus times minus equals plus (it often boils down to simple school maths)! The other two quadrants give negative results because minus times plus and plus times minus equals minus.

After that we sum over all products and normalize them by dividing by the respective standard deviations (how much the data are spread out), so that we will only get values between and .

Let us see this in action with an example. First we define a helper function for visualizing this intuition:

cor.plot <- function(data) {
x_mean <- mean(data[ , 1])
y_mean <- mean(data[ , 2])
plot(data, type = "n") # plot invisibly
limits = par()$usr # limits of plot # plot correlation quadrants rect(x_mean, y_mean, limits[2], limits[4], col = "lightgreen") rect(x_mean, y_mean, limits[1], limits[4], col = "orangered") rect(x_mean, y_mean, limits[1], limits[3], col = "lightgreen") rect(x_mean, y_mean, limits[2], limits[3], col = "orangered") points(data, pch = 16) # plot scatterplot on top colnames(data) <- c("x", "y") # rename cols instead of dynamic variable names in lm abline(lm(y ~ x, data), lwd = 2) # add regression line title(paste("cor =", round(cor(data[1], data[2]), 2))) # add cor as title }  Now for the actual example (in fact the same example we had in this post: Learning Data Science: Modelling Basics): age <- c(21, 46, 55, 35, 28) income <- c(1850, 2500, 2560, 2230, 1800) data <- data.frame(age, income) plot(data, pch = 16)  cor.plot(data)  The correlation is very high because most of the data points are in the positive (green) quadrants and the data is close to its regression line (linear regression and correlation are closely related mathematically). Now, let us get to the actual topic of this post: Causation doesn’t imply Correlation either. What could be “more causal” than a parabolic shot? When you shoot a projectile without air resistance the trajectory will form a perfect parabola! This is in fact rocket science! Let us simulate such a shot and calculate the correlation between time and altitude, two variables that are perfectly causally dependent: t <- c(-30:30) x <- -t^2 data <- data.frame(t, x) plot(data, pch = 16)  cor.plot(data)  The correlation is exactly zero, zip, nada! And it is clear why: the data points in the positive and in the negative quadrants cancel each other out completely because of the perfect symmetry! This leads us to the following very important insight: Correlation is a measure of linear dependance (and linear dependance only!). Even a strong causal relationship can be overlooked by correlation because of its non-linear nature (as in this case with the quadratic relationship). The following example conveys the same idea in a somewhat more humorous manner – it is the by now infamous datasaurus: library(datasauRus) # on CRAN dino <- datasaurus_dozen[datasaurus_dozen$dataset == "dino", 2:3]
plot(dino, pch = 16, cex = 2)


cor.plot(dino)


As with the above example we can clearly see why the correlation is so low, although there is a whole dinosaur hiding in your data…

The learning is that you should never just blindly trust statistical measures on their own, always visualize your data when possible: there might be some real beauties hiding inside your data, waiting to be discovered…

## From Coin Tosses to p-Hacking: Make Statistics Significant Again!

One of the most notoriously difficult subjects in statistics is the concept of statistical tests. We will explain the ideas behind it step by step to give you some intuition on how to use (and misuse) it, so read on…

Let us begin with some coin tosses and the question how to find out whether a coin is fair, i.e. shows heads and tails with fifty-fifty probability. If you had all the time of the world you could throw it indefinitely often to see where the probabilities stabilize. Yet, in reality we only have some sample of tosses and have to infer results based on this. Because of this the area that deals with those problems is called inferential statistics (also inductive statistics).

So, to keep things simple let us throw our coin only ten times. Where is the point where you would suspect that something is not quite right with this coin? As soon as you don’t get five times heads and five times tails? Or only when you get ten times one side only? Possibly somewhere in between? But where exactly?

The first thing to understand is that this is a somewhat arbitrary decision… but one that should be set in advance and which should adhere to some general standard. The idea is to set some fixed probability with which you compare the probability of the result you see in your experiment (i.e. the number of heads and tails in our example). When you obtain a result that is as improbable or even more so than that pre-fixed probability you conclude that the result happened not just by chance but that something significant happened (e.g. that your coin is unfair). The generally accepted probability below which one speaks of a significant result is 5% (or 0.05), it is therefore called significance level.

After fixing the significance level at 0.05 we have to compare it with the probability of our observed result (or a result that is even more extreme). Let us say that we got nine times one side of the coin and only one time the other. What is the probability of getting this or an even more extreme result? The general probability formula is:

For the numerator we get 22 possibilities for observing the above mentioned or an even more extreme result: two times ten possibilities that each of the ten coins could be the odd one out and two more extreme cases showing only heads or only tails. For the denominator we get possibilities for throwing coins:

22 / 2^10
## [1] 0.02148438


So, the probability is about 2% which is below 5%, so we would conclude that our coin is not fair!

Another possibility is to use the binomial coefficient which gives us the number of possibilities to choose k items out of n items. Conveniently enough the R function is called choose(n, k):

(choose(10, 0) + choose(10, 1) + choose(10, 9) + choose(10, 10))/2^10
## [1] 0.02148438


Or, because the situation is symmetric:

2 * (choose(10, 0) + choose(10, 1))/2^10
## [1] 0.02148438


To make things even simpler we can use the binomial distribution for our calculation, where basically all of the needed binomial coefficients are summed up automatically:

prob <- 0.5 # fair coin
n <- 10     # number of coin tosses
k <- 1      # number of "successes", i.e. how many coins show a different side than the rest

2 * pbinom(k, n, prob)
## [1] 0.02148438


As you can see, we get the same probability value over and over again. If you have understood the line of thought so far you are ready for the last step: performing the statistical test with the binom.test() function:

binom.test(k, n)
##
##  Exact binomial test
##
## data:  k and n
## number of successes = 1, number of trials = 10, p-value = 0.02148
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.002528579 0.445016117
## sample estimates:
## probability of success
##                    0.1


As you can see, we get the same number again: it is named p-value in the output:

The p-value is the probability of getting the observed – or an even more extreme – result under the condition that the null hypothesis is true.

The only thing that is new here is the null hypothesis. The null hypothesis is just a fancy name for the assumption that in general things are unremarkable, examples could be: our coin is fair, a new medical drug doesn’t work (its effect is just random noise) or one group of students is not better than the other.

So, after having a look at the output we would formally say that we have a significant result and therefore reject the null hypothesis that the coin is fair.

In most real world scenarios the situation is not that clear cut as it is with our coins. Therefore we won’t normally use a binomial test but the workhorse of statistical inference: the famous t-test! On the positive side you can use the t-test in many situations where you don’t have all the information on the underlying population distribution, e.g. its variance. It is based on comparatively mild assumptions, e.g. that the population distribution is normal (which is itself effectively based on the Central Limit Theorem (CLT)) – and even this can be relaxed as long as its variance is finite and when you have big sample sizes. If it is normal though small sample sizes suffice. On the negative side the t-test often only gives approximations of exact tests. Yet the bigger the sample sizes the better the approximation. Enough of the theory, let us perform a t-test with our data. We use the t.test() function for that:

data <- c(rep(0, n-k), rep(1, k))
t.test(data, mu = prob)
##
##  One Sample t-test
##
## data:  data
## t = -4, df = 9, p-value = 0.00311
## alternative hypothesis: true mean is not equal to 0.5
## 95 percent confidence interval:
##  -0.1262157  0.3262157
## sample estimates:
## mean of x
##       0.1


As you can see, we get a different p-value this time but the test is still significant and leads to the right conclusion. As said before, the bigger the sample size the better the approximation.

Let us try two more examples. First we use the inbuilt mtcars dataset to test whether petrol consumption of manual and automatic transmissions differ significantly:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# make new column with better labels
mtcars$trans <- ifelse(mtcars$am == 0, "aut", "man")
mtcars$trans <- factor(mtcars$trans)

mtcars$mpg ## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 ## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 ## [29] 15.8 19.7 15.0 21.4 mtcars$trans
##  [1] man man man aut aut aut aut aut aut aut aut aut aut aut aut aut aut
## [18] man man man aut aut aut aut aut man man man man man man man
## Levels: aut man

by(mtcars$mpg, mtcars$trans, summary)
## mtcars$trans: aut ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 10.40 14.95 17.30 17.15 19.20 24.40 ## -------------------------------------------------------- ## mtcars$trans: man
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   15.00   21.00   22.80   24.39   30.40   33.90

boxplot(mpg ~ trans, data = mtcars)


As we can see the consumption is obviously different – but is it also significantly different at the 5% significance level? Let us find out:

M <- mtcars$trans == "man" #manual mpg_man <- mtcars[M, ]$mpg #consumption of manual transmission
mpg_man
##  [1] 21.0 21.0 22.8 32.4 30.4 33.9 27.3 26.0 30.4 15.8 19.7 15.0 21.4

mpg_aut <- mtcars[!M, ]$mpg #consumption of automatic transmission mpg_aut ## [1] 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 ## [15] 21.5 15.5 15.2 13.3 19.2 t.test(mpg_man, mpg_aut) ## ## Welch Two Sample t-test ## ## data: mpg_man and mpg_aut ## t = 3.7671, df = 18.332, p-value = 0.001374 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## 3.209684 11.280194 ## sample estimates: ## mean of x mean of y ## 24.39231 17.14737  Yes, it is significantly different! And now for our final example, a classic use of the t-test to find out whether a medical drug has a significant effect (we use the inbuilt sleep dataset): sleep ## extra group ID ## 1 0.7 1 1 ## 2 -1.6 1 2 ## 3 -0.2 1 3 ## 4 -1.2 1 4 ## 5 -0.1 1 5 ## 6 3.4 1 6 ## 7 3.7 1 7 ## 8 0.8 1 8 ## 9 0.0 1 9 ## 10 2.0 1 10 ## 11 1.9 2 1 ## 12 0.8 2 2 ## 13 1.1 2 3 ## 14 0.1 2 4 ## 15 -0.1 2 5 ## 16 4.4 2 6 ## 17 5.5 2 7 ## 18 1.6 2 8 ## 19 4.6 2 9 ## 20 3.4 2 10 by(sleep$extra, sleep$group, summary) ## sleep$group: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  -1.600  -0.175   0.350   0.750   1.700   3.700
## --------------------------------------------------------
## sleep$group: 2 ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -0.100 0.875 1.750 2.330 4.150 5.500 boxplot(extra ~ group, sleep)  Again, there is obviously a difference… but is it significant? t.test(extra ~ group, sleep) ## ## Welch Two Sample t-test ## ## data: extra by group ## t = -1.8608, df = 17.776, p-value = 0.07939 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -3.3654832 0.2054832 ## sample estimates: ## mean in group 1 mean in group 2 ## 0.75 2.33  The p-value is above 0.05 which means that this time it cannot be ruled out that the difference is random! As you can imagine whenever there is money to be made people get creative on how to game the system… unfortunately statistical tests are no exception. Especially in the medical field huge amounts of money are at stake, so one should be well aware of a manipulation technique called p-hacking (also known under the names data snooping or data dredging)! To summarize our journey into statistical testing so far we have seen that you set a significance level (normally at 0.05) and draw a random sample. You calculate how probable the drawing of this sample (or an even more extreme sample) is and compare it to the significance level. If the probability is below the significance level you say that the test shows a significant result, otherwise you say that one cannot rule out that the difference (if there is one) is just due to chance. Now, if you don’t like the result… why not draw a sample again? And again? And again? This is p-hacking! Just draw a lot of samples and take the one that fits your agenda! Of course this is a stark misuse of the idea of statistical tests. But the fact remains that sometimes samples will be significant just by chance! In the following example we “prove” that homeopathy works after all (as we all know it just doesn’t work beyond the so-called placebo effect as numerous high quality studies over many decades have shown). For that end we just conduct one hundred trials and take one out that fits our bill: p_hack <- function() { homeopathy <- rnorm(20, 1.2, 5) t.test(homeopathy, mu = 1.2)$p.value
}

set.seed(123)
trials <- replicate(100, p_hack())
trials[6] # homeopathy works!
## [1] 0.033461758

# wait... no it doesn't!
trials
##   [1] 0.522741341 0.785376466 0.624588751 0.587983281 0.057318309
##   [6] 0.033461758 0.876112179 0.461733176 0.495169397 0.519897894
##  [11] 0.800638084 0.836667733 0.915272904 0.340430844 0.087656634
##  [16] 0.879783049 0.824428157 0.784957654 0.898447738 0.494603038
##  [21] 0.279707663 0.438005042 0.886043148 0.158778621 0.167712182
##  [26] 0.135880072 0.185375902 0.662883154 0.840370636 0.331157973
##  [31] 0.283332330 0.423746570 0.849937345 0.503256673 0.546014504
##  [36] 0.725568387 0.727886195 0.167482514 0.586386335 0.493916303
##  [41] 0.937060320 0.271651762 0.236448565 0.191925375 0.983841382
##  [46] 0.373146285 0.520463358 0.323242616 0.193315250 0.757835487
##  [51] 0.429311063 0.284986685 0.868272041 0.844042454 0.885548528
##  [56] 0.996021648 0.978855283 0.588192375 0.495521737 0.610192393
##  [61] 0.242524547 0.404220265 0.136245145 0.004912541 0.408530447
##  [66] 0.458030143 0.784011725 0.357773131 0.818207705 0.698330582
##  [71] 0.451268449 0.740943226 0.266786179 0.120894921 0.050307044
##  [76] 0.387838555 0.232600995 0.834739682 0.669270240 0.516910985
##  [81] 0.273077802 0.291004360 0.369842912 0.132130995 0.454371585
##  [86] 0.567029545 0.219163798 0.524362414 0.737950629 0.855945701
##  [91] 0.959611992 0.901298484 0.931203165 0.204489683 0.843491761
##  [96] 0.567982673 0.329414884 0.107350495 0.911279358 0.696661191


Lo and behold: trial no. 6 shows a significant result – although by design both means are the same (1.2)! So we take this study, publish it and sell lots and lots of useless homeopathy drugs!

How many significant results will we get just by chance on average? Well, exactly the amount of our significance level!

trials <- replicate(1e5, p_hack())
round(length(trials[trials < 0.05]) / length(trials), 2)
## [1] 0.05


As always our friends over at xkcd summarize the situation brilliantly:

## Learning R: Permutations and Combinations with base R

The area of combinatorics, the art of systematic counting, is dreaded territory for many people, so let us bring some light into the matter: in this post we will explain the difference between permutations and combinations, with and without repetitions, will calculate the number of possibilities and present efficient R code to enumerate all of them, so read on…

The cases we cover in this post:

• Permutations with repetitions
• Combinations without repetitions
• Permutations (of all elements) without repetitions

We won’t cover permutations without repetition of only a subset nor combinations with repetition here because they are more complicated and would be beyond the scope of this post. We will perhaps cover those in a later post.

In all cases, you can imagine somebody drawing elements from a set and the different ways to do so. Permutation implies that the order does matter, with combinations it does not (e.g. in a lottery it normally does not matter in which order the numbers are drawn). Without repetition simply means that when one has drawn an element it cannot be drawn again, so with repetition implies that it is replaced and can be drawn again.

Let us start with permutations with repetitions: as an example take a combination lock (should be permutation lock really!) where you have three positions with the numbers zero to nine each. We use the expand.grid() function for enumerating all possibilities:

P_wi <- expand.grid(rep(list(0:9), 3))
##   Var1 Var2 Var3
## 1    0    0    0
## 2    1    0    0
## 3    2    0    0
## 4    3    0    0
## 5    4    0    0
## 6    5    0    0

tail(P_wi)
##      Var1 Var2 Var3
## 995     4    9    9
## 996     5    9    9
## 997     6    9    9
## 998     7    9    9
## 999     8    9    9
## 1000    9    9    9


The formula for calculating the number of permutations is simple for obvious reasons ( is the number of elements to choose from, is the number of actually chosen elements):

In R:

10^3
## [1] 1000

nrow(P_wi)
## [1] 1000


The next is combinations without repetitions: the classic example is a lottery where six out of 49 balls are chosen. We use the combn() function for finding all possibilities:

C_wo <- combn(1:49, 6)
C_wo[ , 1:10]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    1    1    1    1    1    1    1    1     1
## [2,]    2    2    2    2    2    2    2    2    2     2
## [3,]    3    3    3    3    3    3    3    3    3     3
## [4,]    4    4    4    4    4    4    4    4    4     4
## [5,]    5    5    5    5    5    5    5    5    5     5
## [6,]    6    7    8    9   10   11   12   13   14    15


To calculate the number of combinations the binomial coefficient is used:

To give you some intuition consider the above example: you have possibilities for choosing the first ball, for the second, for the third and so on up to the sixth ball. So that gives . When you think about it this is the same as because all the coefficients smaller than can be eliminated by reducing the fraction! Now, there are possible positions for the first ball that is drawn, for the second… and so on and because the order doesn’t matter we have to divide by , which gives the binomial coefficient. In R we use the choose() function to calculate it:

choose(49, 6)
## [1] 13983816

ncol(C_wo)
## [1] 13983816


So, you see that the probability of winning the lottery are about the same, no matter whether you play it… or not 😉

Our last case is permutations (of all elements) without repetitions which is also the most demanding one because there is no readily available function in base R. So, we have to write our own:

perm <- function(v) {
n <- length(v)
if (n == 1) v
else {
X <- NULL
for (i in 1:n) X <- rbind(X, cbind(v[i], perm(v[-i])))
X
}
}


As you can see it is a recursive function, to understand recursion read my post: To understand Recursion you have to understand Recursion…. What makes matters a little bit more complicated is that the recursive call is within a for loop.

The idea is to fix one element after the other [for (i in 1:n) and cbind(v[i], ...)] and permute the remaining elements [perm(v[-i])] down to the base case when only one element remains [if (n == 1) v], which cannot be permuted any further. Collect all sets on the respective higher level [X <- (rbind(X, ...)] and return the whole matrix X.

Let us see this in action, as an example we’ll see how many different ways there are of four runners reaching the finishing line:

P_wo <- perm(1:4)
P_wo
##       [,1] [,2] [,3] [,4]
##  [1,]    1    2    3    4
##  [2,]    1    2    4    3
##  [3,]    1    3    2    4
##  [4,]    1    3    4    2
##  [5,]    1    4    2    3
##  [6,]    1    4    3    2
##  [7,]    2    1    3    4
##  [8,]    2    1    4    3
##  [9,]    2    3    1    4
## [10,]    2    3    4    1
## [11,]    2    4    1    3
## [12,]    2    4    3    1
## [13,]    3    1    2    4
## [14,]    3    1    4    2
## [15,]    3    2    1    4
## [16,]    3    2    4    1
## [17,]    3    4    1    2
## [18,]    3    4    2    1
## [19,]    4    1    2    3
## [20,]    4    1    3    2
## [21,]    4    2    1    3
## [22,]    4    2    3    1
## [23,]    4    3    1    2
## [24,]    4    3    2    1


After this rather complicated function the calculation of the number of ways is simple, it is just the factorial function (it should again be obvious why):

In R we have the factorial() function:

factorial(4)
## [1] 24

nrow(P_wo)
## [1] 24


As you will see when solving real world problems with R the above functions often come in handy, so you should add them to your ever growing tool set – have fun and stay tuned!

## Learning R: The Ultimate Introduction (incl. Machine Learning!)

There are a million reasons to learn R (see e.g. Why R for Data Science – and not Python?), but where to start? I present to you the ultimate introduction to bring you up to speed! So read on…

I call it ultimate because it is the essence of many years of teaching R… or put differently: it is the kind of introduction I would have liked to have when I started out with R back in the days!

A word of warning though: this is a introduction to R and not to statistics, so I won’t explain the statistics terms used here. You do not need to know any other programming language but it does no harm either. Ok, now let us start!

First you need to install R (https://www.r-project.org) and preferably RStudio as a Graphical User Interface (GUI): https://www.rstudio.com/products/RStudio/#Desktop. Both are free and available for all common operating systems.

To get a quick overview of RStudio watch this video:

You can either type in the following commands in the console or open a new script tab (File -> New File -> R Script) and run the commands by pressing Ctrl + Enter/Return after having typed them.

First of all R is a very good calculator:

2 + 2
## [1] 4

sin(0.5)
## [1] 0.4794255

abs(-10) # absolute value
## [1] 10

pi
## [1] 3.141593

exp(1) # e
## [1] 2.718282

factorial(6)
## [1] 720


By the way: The hash is used for comments, everything after it will be ignored!

Of course you can define variables and use them in your calculations:

n1 <- 2
n2 <- 3
n1 # show content of variable by just typing the name
## [1] 2

n1 + n2
## [1] 5

n1 * n2
## [1] 6

n1^n2
## [1] 8


Part of R’s power stems from the fact that functions can handle several numbers at once, called vectors, and do calculations on them. When calling a function arguments are passed with round brackets:

n3 <- c(12, 5, 27) # concatenate (combine) elements into a vector
n3
## [1] 12  5 27

min(n3)
## [1] 5

max(n3)
## [1] 27

sum(n3)
## [1] 44

mean(n3)
## [1] 14.66667

sd(n3) # standard deviation
## [1] 11.23981

var(n3) # variance
## [1] 126.3333

median(n3)
## [1] 12

n3 / 12
## [1] 1.0000000 0.4166667 2.2500000


In the last example the 12 was recycled three times. R always tries to do that (when feasible), sometimes giving a warning when it might not be intended:

n3 / c(1, 2)
## Warning in n3/c(1, 2): longer object length is not a multiple of shorter
## object length
## [1] 12.0  2.5 27.0


In cases you only want parts of your vectors you can apply subsetting with square brackets:

n3[1]
## [1] 12

n3[c(2, 3)]
## [1]  5 27


Ranges can easily be created with the colon:

n4 <- 10:20
n4
##  [1] 10 11 12 13 14 15 16 17 18 19 20


When you test whether this vector is bigger than a certain number you will get logicals as a result. You can use those logicals for subsetting:

n4 > 15
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

n4[n4 > 15]
## [1] 16 17 18 19 20


Perhaps you have heard the story of little Gauss where his teacher gave him the task to add all numbers from 1 to 100 to keep him busy for a while? Well, he found a mathematical trick to add them within seconds… for us normal people we can use R:

sum(1:100)
## [1] 5050


When we want to use some code several times we can define our own function (a user-defined function). We do that the same way we create a vector (or any other data structure) because R is a so called functional programming language and functions are so called first-class citizens (i.e. on the same level as other data structures like vectors). The code that is being executed is put in curly brackets:

gauss <- function(x) {
sum(1:x)
}
gauss(100)
## [1] 5050

gauss(1000)
## [1] 500500


Of course we also have other data types, e.g. matrices are basically two dimensional vectors:

M <- matrix(1:12, nrow = 3, byrow = TRUE) # create a matrix
M
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

dim(M)
## [1] 3 4


Subsetting now has to provide two numbers, the first for the row, the second for the column (like in the game Battleship). If you leave one out, all data of the respective dimension will be shown:

M[2, 3]
## [1] 7

M[ , c(1, 3)]
##      [,1] [,2]
## [1,]    1    3
## [2,]    5    7
## [3,]    9   11


Another possibility to create matrices:

v1 <- 1:4
v2 <- 4:1
M1 <- rbind(v1, v2) # row bind
M1
##    [,1] [,2] [,3] [,4]
## v1    1    2    3    4
## v2    4    3    2    1

M2 <- cbind(v1, v2)  # column bind
M2
##      v1 v2
## [1,]  1  4
## [2,]  2  3
## [3,]  3  2
## [4,]  4  1


Naming rows, here with inbuilt datasets:

rownames(M2) <- LETTERS[1:4]
M2
##   v1 v2
## A  1  4
## B  2  3
## C  3  2
## D  4  1

LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"


When some result is Not Available:

LETTERS[50]
## [1] NA


Getting the structure of your variables:

str(LETTERS)
##  chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" ...

str(M2)
##  int [1:4, 1:2] 1 2 3 4 4 3 2 1
##  - attr(*, "dimnames")=List of 2
##   ..$: chr [1:4] "A" "B" "C" "D" ## ..$ : chr [1:2] "v1" "v2"


Another famous dataset (iris) that is also built into base R (to get help on any function or dataset just put the cursor in it and press F1):

iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica


Oops, that is a bit long… if you only want to show the first or last rows do the following:

head(iris) # first 6 rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(iris, 10) # last 10 rows
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 141          6.7         3.1          5.6         2.4 virginica
## 142          6.9         3.1          5.1         2.3 virginica
## 143          5.8         2.7          5.1         1.9 virginica
## 144          6.8         3.2          5.9         2.3 virginica
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica


Iris is a so called data frame, the workhorse of R and data science (you will see how to create one below):

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ##$ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ##$ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...  As you can see, data frames can combine different data types. If you try to do that with e.g. vectors, which can only hold one data type, something called coercion happens, i.e. at least one data type is forced to become another one so that consistency is maintained: str(c(2, "Hello")) # 2 is coerced to become a character string too ## chr [1:2] "2" "Hello"  You can get a fast overview of your data like so: summary(iris[1:4]) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 boxplot(iris[1:4])  As you have seen, R often runs a function on all of the data simultaneously. This feature is called vectorization and in many other languages you would need a loop for that. In R you don’t use loops that often, but of course they are available: for (i in seq(5)) { print(1:i) } ## [1] 1 ## [1] 1 2 ## [1] 1 2 3 ## [1] 1 2 3 4 ## [1] 1 2 3 4 5  Speaking of control structures: of course conditional statements are available too: even <- function(x) ifelse(x %% 2 == 0, TRUE, FALSE) # %% gives remainder of division (= modulo operator) even(1:5) ## [1] FALSE TRUE FALSE TRUE FALSE  Linear modelling (e.g. correlation and linear regression) couldn’t be any easier, it is included in the core language: age <- c(21, 46, 55, 35, 28) income <- c(1850, 2500, 2560, 2230, 1800) df <- data.frame(age, income) # create a data frame df ## age income ## 1 21 1850 ## 2 46 2500 ## 3 55 2560 ## 4 35 2230 ## 5 28 1800 cor(df) # correlation ## age income ## age 1.0000000 0.9464183 ## income 0.9464183 1.0000000 LinReg <- lm(income ~ age, data = df) # income as a linear model of age LinReg ## ## Call: ## lm(formula = income ~ age, data = df) ## ## Coefficients: ## (Intercept) age ## 1279.37 24.56 summary(LinReg) ## ## Call: ## lm(formula = income ~ age, data = df) ## ## Residuals: ## 1 2 3 4 5 ## 54.92 90.98 -70.04 91.12 -166.98 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1279.367 188.510 6.787 0.00654 ** ## age 24.558 4.838 5.076 0.01477 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 132.1 on 3 degrees of freedom ## Multiple R-squared: 0.8957, Adjusted R-squared: 0.8609 ## F-statistic: 25.77 on 1 and 3 DF, p-value: 0.01477 plot(df, pch = 16, main = "Linear model") abline(LinReg, col = "blue", lwd = 2) # adding the regression line  You could directly use the model to make predictions: pred_LinReg <- predict(LinReg, data.frame(age = seq(15, 70, 5))) names(pred_LinReg) <- seq(15, 70, 5) round(pred_LinReg, 2) ## 15 20 25 30 35 40 45 50 55 ## 1647.73 1770.52 1893.31 2016.10 2138.88 2261.67 2384.46 2507.25 2630.04 ## 60 65 70 ## 2752.83 2875.61 2998.40  If you want to know more about the modelling process you can find it here: Learning Data Science: Modelling Basics Another strength of R is the huge number of add-on packages for all kinds of specialized tasks. For the grand finale of this introduction, we’re gonna get a little taste of machine learning. For that matter we install the OneR package from CRAN (the official package repository of R): Tools -> Install packages… -> type in “OneR” -> click “Install”. After that we build a simple model on the iris dataset to predict the Species column: library(OneR) # load package data <- optbin(Species ~., data = iris) # find optimal bins for numeric predictors model <- OneR(data, verbose = TRUE) # build actual model ## ## Attribute Accuracy ## 1 * Petal.Width 96% ## 2 Petal.Length 95.33% ## 3 Sepal.Length 74.67% ## 4 Sepal.Width 55.33% ## --- ## Chosen attribute due to accuracy ## and ties method (if applicable): '*' summary(model) # show rules ## ## Call: ## OneR.data.frame(x = data, verbose = TRUE) ## ## Rules: ## If Petal.Width = (0.0976,0.791] then Species = setosa ## If Petal.Width = (0.791,1.63] then Species = versicolor ## If Petal.Width = (1.63,2.5] then Species = virginica ## ## Accuracy: ## 144 of 150 instances classified correctly (96%) ## ## Contingency table: ## Petal.Width ## Species (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum ## setosa * 50 0 0 50 ## versicolor 0 * 48 2 50 ## virginica 0 4 * 46 50 ## Sum 50 52 48 150 ## --- ## Maximum in each column: '*' ## ## Pearson's Chi-squared test: ## X-squared = 266.35, df = 4, p-value < 2.2e-16 plot(model)  We’ll now see how well the model is doing: prediction <- predict(model, data) eval_model(prediction, data) ## ## Confusion matrix (absolute): ## Actual ## Prediction setosa versicolor virginica Sum ## setosa 50 0 0 50 ## versicolor 0 48 4 52 ## virginica 0 2 46 48 ## Sum 50 50 50 150 ## ## Confusion matrix (relative): ## Actual ## Prediction setosa versicolor virginica Sum ## setosa 0.33 0.00 0.00 0.33 ## versicolor 0.00 0.32 0.03 0.35 ## virginica 0.00 0.01 0.31 0.32 ## Sum 0.33 0.33 0.33 1.00 ## ## Accuracy: ## 0.96 (144/150) ## ## Error rate: ## 0.04 (6/150) ## ## Error rate reduction (vs. base rate): ## 0.94 (p-value < 2.2e-16)  96% accuracy is not too bad, even for this simple dataset! If you want to know more about the OneR package you can read the vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models. Well, and that’s it for the ultimate introduction to R – hopefully you liked it and you learned something! Please share your first experiences with R in the comments and also if you miss something (I might add it in the future!) – Thank you for reading and stay tuned for more to come! ## Was the Bavarian Abitur too hard this time? Bavaria is known for its famous Oktoberfest… and within Germany also for its presumably difficult Abitur, a qualification granted by university-preparatory schools in Germany. A mandatory part for all students is maths. This year many students protested that the maths part was way too hard, they even started an online petition with more than seventy thousand supporters at this time of writing! It is not clear yet whether their marks will be adjusted upwards, the ministry of education is investigating the case. As a professor in Bavaria who also teaches statistics I will take the opportunity to share with you an actual question from the original examination with solution, so read on… Let us have a look at the first (and easiest) question in the stochastics part: Every sixth visitor to the Oktoberfest wears a gingerbread heart around his neck. During the festival 25 visitors are chosen at random. Determine the probability that at most one of the selected visitors will have a gingerbread heart. Before you read on try to solve the task yourself… Of course students are not allowed to use R in the examination but in general the good thing about this kind of questions is that if you have no idea how to solve them analytically solving them by simulation is often much easier: set.seed(12) N <- 1e7 v <- matrix(sample(c(0L, 1L), size = 25 * N, replace = TRUE, prob = c(5/6, 1/6)), ncol = 25) sum(rowSums(v) <= 1) / N ## [1] 0.062936  The answer is about . Now for the analytical solution: “At least one” implies that we have to differentiate between two cases, “no gingerbread heart” and “exactly one gingerbread heart”. “No gingerbread heart” is just . “Exactly one gingerbread heart” is because there are possibilities of where the gingerbread heart could occur. We have to add both probabilities: (5/6)^25 + 25*1/6*(5/6)^24 ## [1] 0.06289558  If you know a little bit about probability distributions you will recognize the above as the binomial distribution: pbinom(q = 1, size = 25, prob = 1/6) ## [1] 0.06289558  Of course it is a little unfair to judge just on basis of the easiest task and without knowing the general maths level that is required. But still, I would like to hear your opinion on this. Also outsiders’ views from different countries and different educational systems are very welcome! So, what do you think: Was the Bavarian Abitur too hard this time? Please leave your reply in the comment section below! Update (June 7, 2019): The Bavarian Ministry of Education has decided now that the marks will not be subsequently adjusted. ## Backtest Trading Strategies like a real Quant R is one of the best choices when it comes to quantitative finance. Here we will show you how to load financial data, plot charts and give you a step-by-step template to backtest trading strategies. So, read on… We begin by just plotting a chart of the Standard & Poor’s 500 (S&P 500), an index of the 500 biggest companies in the US. To get the index data and plot the chart we use the powerful quantmod package (on CRAN). After that we add two popular indicators, the simple moving average (SMI) and the exponential moving average (EMA). Have a look at the code: library(quantmod) ## Loading required package: xts ## Loading required package: zoo ## ## Attaching package: 'zoo' ## The following objects are masked from 'package:base': ## ## as.Date, as.Date.numeric ## Loading required package: TTR ## Version 0.4-0 included new data defaults. See ?getSymbols. getSymbols("^GSPC", from = "2000-01-01") ## 'getSymbols' currently uses auto.assign=TRUE by default, but will ## use auto.assign=FALSE in 0.5-0. You will still be able to use ## 'loadSymbols' to automatically load data. getOption("getSymbols.env") ## and getOption("getSymbols.auto.assign") will still be checked for ## alternate defaults. ## ## This message is shown once per session and may be disabled by setting ## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details. ## [1] "^GSPC" head(GSPC) ## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume ## 2000-01-03 1469.25 1478.00 1438.36 1455.22 931800000 ## 2000-01-04 1455.22 1455.22 1397.43 1399.42 1009000000 ## 2000-01-05 1399.42 1413.27 1377.68 1402.11 1085500000 ## 2000-01-06 1402.11 1411.90 1392.10 1403.45 1092300000 ## 2000-01-07 1403.45 1441.47 1400.73 1441.47 1225200000 ## 2000-01-10 1441.47 1464.36 1441.47 1457.60 1064800000 ## GSPC.Adjusted ## 2000-01-03 1455.22 ## 2000-01-04 1399.42 ## 2000-01-05 1402.11 ## 2000-01-06 1403.45 ## 2000-01-07 1441.47 ## 2000-01-10 1457.60 tail(GSPC) ## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume ## 2019-04-24 2934.00 2936.83 2926.05 2927.25 3448960000 ## 2019-04-25 2928.99 2933.10 2912.84 2926.17 3425280000 ## 2019-04-26 2925.81 2939.88 2917.56 2939.88 3248500000 ## 2019-04-29 2940.58 2949.52 2939.35 2943.03 3118780000 ## 2019-04-30 2937.14 2948.22 2924.11 2945.83 3919330000 ## 2019-05-01 2952.33 2954.13 2923.36 2923.73 3645850000 ## GSPC.Adjusted ## 2019-04-24 2927.25 ## 2019-04-25 2926.17 ## 2019-04-26 2939.88 ## 2019-04-29 2943.03 ## 2019-04-30 2945.83 ## 2019-05-01 2923.73 chartSeries(GSPC, theme = chartTheme("white"), subset = "last 10 months", show.grid = TRUE)  addSMA(20)  addEMA(20)  As you can see the moving averages are basically smoothed out versions of the original data shifted by the given number of days. While with the SMA (red curve) all days are weighted equally with the EMA (blue curve) the more recent days are weighted stronger, so that the resulting indicator detects changes quicker. The idea is that by using those indicators investors might be able to detect longer term trends and act accordingly. For example a trading rule could be to buy the index whenever it crosses the MA from below and sell when it goes the other direction. Judge for yourself if this could have worked. Well, having said that it might not be that easy to find out the profitability of certain trading rules just by staring at a chart. We are looking for something more systematic! We would need a decent backtest! This can of course also be done with R, a great choice is the PerformanceAnalytics package (on CRAN). To backtest a trading strategy I provide you with a step-by-step template: 1. Load libraries and data 2. Create your indicator 3. Use indicator to create equity curve 4. Evaluate strategy performance As an example we want to test the idea that it might be profitable to sell the index when the financial markets exhibit significant stress. Interestingly enough “stress” can be measured by certain indicators that are freely available. One of them is the National Financial Conditions Index (NFCI) of the Federal Reserve Bank of Chicago (https://www.chicagofed.org/publications/nfci/index): The Chicago Fed’s National Financial Conditions Index (NFCI) provides a comprehensive weekly update on U.S. financial conditions in money markets, debt and equity markets and the traditional and “shadow” banking systems. […] The NFCI [is] constructed to have an average value of zero and a standard deviation of one over a sample period extending back to 1971. Positive values of the NFCI have been historically associated with tighter-than-average financial conditions, while negative values have been historically associated with looser-than-average financial conditions. To make it more concrete we want to create a buy signal when the index is above one standard deviation in negative territory and a sell signal otherwise. Have a look at the following code: # Step 1: Load libraries and data library(quantmod) library(PerformanceAnalytics) ## ## Attaching package: 'PerformanceAnalytics' ## The following object is masked from 'package:graphics': ## ## legend getSymbols('NFCI', src = 'FRED', , from = '2000-01-01') ## [1] "NFCI" NFCI <- na.omit(lag(NFCI)) # we can only act on the signal after release, i.e. the next day getSymbols("^GSPC", from = '2000-01-01') ## [1] "^GSPC" data <- na.omit(merge(NFCI, GSPC)) # merge before (!) calculating returns) data$GSPC <- na.omit(ROC(Cl(GSPC))) # calculate returns of closing prices

# Step 2: Create your indicator
data$sig <- ifelse(data$NFCI < 1, 1, 0)
data$sig <- na.locf(data$sig)

# Step 3: Use indicator to create equity curve
perf <- na.omit(merge(data$sig * data$GSPC, data$GSPC)) colnames(perf) <- c("Stress-based strategy", "SP500") # Step 4: Evaluate strategy performance table.DownsideRisk(perf) ## Stress-based strategy SP500 ## Semi Deviation 0.0075 0.0087 ## Gain Deviation 0.0071 0.0085 ## Loss Deviation 0.0079 0.0095 ## Downside Deviation (MAR=210%) 0.0125 0.0135 ## Downside Deviation (Rf=0%) 0.0074 0.0087 ## Downside Deviation (0%) 0.0074 0.0087 ## Maximum Drawdown 0.5243 0.6433 ## Historical VaR (95%) -0.0173 -0.0188 ## Historical ES (95%) -0.0250 -0.0293 ## Modified VaR (95%) -0.0166 -0.0182 ## Modified ES (95%) -0.0268 -0.0311 table.Stats(perf) ## Stress-based strategy SP500 ## Observations 4858.0000 4858.0000 ## NAs 0.0000 0.0000 ## Minimum -0.0690 -0.0947 ## Quartile 1 -0.0042 -0.0048 ## Median 0.0003 0.0005 ## Arithmetic Mean 0.0002 0.0002 ## Geometric Mean 0.0002 0.0001 ## Quartile 3 0.0053 0.0057 ## Maximum 0.0557 0.1096 ## SE Mean 0.0001 0.0002 ## LCL Mean (0.95) -0.0001 -0.0002 ## UCL Mean (0.95) 0.0005 0.0005 ## Variance 0.0001 0.0001 ## Stdev 0.0103 0.0120 ## Skewness -0.1881 -0.2144 ## Kurtosis 3.4430 8.5837 charts.PerformanceSummary(perf)  chart.RelativePerformance(perf[ , 1], perf[ , 2])  chart.RiskReturnScatter(perf)  The first chart shows that the stress-based strategy (black curve) clearly outperformed its benchmark, the S&P 500 (red curve). This can also be seen in the second chart, showing the relative performance. In the third chart we see that both return (more) and (!) risk (less) of our backtested strategy are more favourable compared to the benchmark. So, all in all this seems to be a viable strategy! But of course before investing real money many more tests have to be performed! You can use this framework for backtesting your own ideas. Here is not the place to explain all of the above tables and plots but as you can see both packages are very, very powerful and I have only shown you a small fraction of their capabilities. To use their full potential you should have a look at the extensive documentation that comes with it on CRAN. Disclaimer: This is no investment advice! No responsibility is taken whatsoever if you lose money! If you gain money though I would be happy if you could buy me a coffee… that is not too much to ask, is it? 😉 ## The Rich didn’t earn their Wealth, they just got Lucky Tomorrow, on the First of May, many countries celebrate the so called International Workers’ Day (or Labour Day): time to talk about the unequal distribution of wealth again! A few months ago I posted a piece with the title “If wealth had anything to do with intelligence…” where I argued that ability, e.g. intelligence, as an input has nothing to do with wealth as an output. It drew a lot of criticism (as expected), most of it unfounded in my opinion but one piece merits some discussion: the fact that the intelligence quotient (IQ) is normally distributed by construction. The argument goes that intelligence per se may be a distribution with fat tails too but by the way the IQ is constructed the metric is being transformed into a well formed gaussian distribution. To a degree this is certainly true, yet I would still argue that the distribution of intelligence and all other human abilities are far more well behaved than the extremely unequal distribution of wealth. I wrote in a comment: There are many aspects in your comment that are certainly true. Obviously there are huge problems in measuring “true” mental abilities, which is the exact reason why people came up with a somewhat artificial “intelligence quotient” with all its shortcomings. What would be interesting to see is (and I don’t know if you perhaps have a source about this) what the outcome of an intelligence test would look like without the “quotient” part, i.e. without subsequently normalizing the results. I guess the relationship wouldn’t be strictly linear but it wouldn’t be as extreme as the wealth distribution either. What I think is true in any case, independent of the distributions, is when you rank all people by intelligence and by wealth respectively you wouldn’t really see any stable connection – and that spirit was the intention of my post in the first place and I still stand by it, although some of the technicalities are obviously debatable. So, if you have a source, Dear Reader, you are more than welcome to share it in the comments – I am always eager to learn! I ended my post with: But if it is not ability/intelligence that determines the distribution of wealth what else could account for the extreme inequality we perceive in the world? In this post I will argue that luck is a good candidate, so read on… In 2014 there was a special issue of the renowned magazine Science titled “The science of inequality”. In one of the articles (Cho, A.: “Physicists say it’s simple”) the following thought experiment is being proposed: Suppose you randomly divide 500 million in income among 10,000 people. There’s only one way to give everyone an equal, 50,000 share. So if you’re doling out earnings randomly, equality is extremely unlikely. But there are countless ways to give a few people a lot of cash and many people a little or nothing. In fact, given all the ways you could divvy out income, most of them produce an exponential distribution of income. So, the basic idea is to randomly throw 9,999 darts at a scale ranging from zero to 500 million and study the resulting distribution of intervals: library(MASS) w <- 5e8 # wealth p <- 1e4 # no. of people set.seed(123) d <- diff(c(0, sort(runif(p-1, max = w)), w)) # wealth distribution h <- hist(d, col = "red", main = "Exponential decline", freq = FALSE, breaks = 45, xlim = c(0, quantile(d, 0.99))) fit <- fitdistr(d, "exponential") curve(dexp(x, rate = fit$estimate), col = "black", type = "p", pch = 16, add = TRUE)


The resulting distribution fits an exponential distribution very well. You can read some interesting discussions concerning this result on CrossValidated StackExchange: How can I analytically prove that randomly dividing an amount results in an exponential distribution (of e.g. income and wealth)?

Just to give you an idea of how unfair this distribution is: the richest six persons have more wealth than the poorest ten percent together:

sum(sort(d)[9994:10000]) - sum(sort(d)[0:1000])
## [1] 183670.8


If you think that this is ridiculous just look at the real global wealth distribution: here it is not six but three persons who own more than the poorest ten percent!

Now, what does that mean? Well, equality seems to be the exception and (extreme) inequality the rule. The intervals were found randomly, no interval had any special skills, just luck – and the result is (extreme) inequality – as in the real world!

If you can reproduce the wealth distribution of a society stochastically this could have the implication that it weren’t so much the extraordinary skills of the rich which made them rich but they just got lucky.

Some rich people are decent enough to admit this. In his impressive essay “Why Poverty Is Like a Disease” Christian H. Cooper, a hillbilly turned investment banker writes:

So how did I get out? By chance.

It’s easy to attach a post-facto narrative of talent and hard work to my story, because that’s what we’re fed by everything from Hollywood to political stump speeches. But it’s the wrong story. My escape was made up of a series of incredibly unlikely events, none of which I had real control over.

[…]

I am the exception that proves the rule—but that rule is that escape from poverty is a matter of chance, and not a matter of merit.

A consequence would be that you cannot really learn much from the rich. So throw away all of your self help books on how to become successful. I will end with a cartoon, which brings home this message, on a closely related concept, the so called survivorship bias (which is also important to keep in mind when backtesting trading strategies in quantitative finance, the topic of an upcoming post… so stay tuned!):

## Base Rate Fallacy – or why No One is justified to believe that Jesus rose

In this post we are talking about one of the most unintuitive results in statistics: the so called false positive paradox which is an example of the so called base rate fallacy. It describes a situation where a positive test result of a very sensitive medical test shows that you have the respective disease… yet you are most probably healthy!

The reason for this is that the disease itself is so rare that even with a very sensitive test the result is most probably false positive: it shows that you have the disease yet this result is false, you are healthy.

The key to understanding this result is to understand the difference between two conditional probabilities: the probability that you have a positive test result when you are sick and the probability that you are sick in case you got a positive test result – you are interested in the last (am I really sick?) but you only know the first.

Now for some notation (the vertical dash means “under the condition that”, P stands for probability):

• : if you are sick () you will probably have a positive test result () – this (the test result) is what we know
• : if you have a positive test result () you are probably not sick () – this (whether we are sick) is what we want to know

To calculate one conditional probability from the other we use the famous Bayes’ theorem:

In the following example we assume a disease with an infection rate of 1 in 1000 and a test to detect this disease with a sensitivity of 99%. Have a look at the following code which illustrates the situation with Euler diagrams, first the big picture, then a zoomed-in version:

library(eulerr)

A <- 0.001 # prevalence of disease
BlA <- 0.99 # sensitivity of test

B <- A * BlA + (1 - A) * (1 - BlA) # positive test (specificity same as sensitivity)
AnB <- BlA * A
AlB <- BlA * A / B # Bayes's theorem
#AnB / B # Bayes's theorem in different form

C <- 1 # the whole population
main <- paste0("P(B|A) = ", round(BlA, 2), ", but P(A|B) = ", round(AlB, 2))

set.seed(123)
fit1 <- euler(c("A" = A, "B" = B, "C" = C, "A&B" = AnB, "A&C" = A, "B&C" = B, "A&B&C" = AnB), input = "union")
plot(fit1, main = main, fill = c("red", "green", "gray90"))


fit2 <- euler(c("A" = A, "B" = B, "A&B" = AnB), input = "union")
plot(fit2, main = main, fill = c("red", "green"))


As you can see although this test is very sensitive when you get a positive test result the probability of you being infected is only 9%!

In the diagrams C is the whole population and A are the infected individuals. B shows the people with a positive test result and you can see in the second diagram that almost all of the infected A are also part of B (the brown area = true positive), but still most ob B are outside of A (the green area), so although they are not infected they have a positive test result! They are false positive.

The red area shows the people that are infected (A) but get a negative test result, stating that they are healthy. This is called false negative. The grey area shows the people who are healthy and get a negative test result, they are true negative.

Due to the occasion we are now coming to an even more extreme example: did Jesus rise from the dead? It is inspired by the very good essay “A drop in the sea”: Don’t believe in miracles until you’ve done the math.

Let us assume that we had very, very reliable witnesses (as a side note what is strange though is that the gospels cannot even agree on how many men or angels appeared at the tomb: it is one angel in Matthew, a young man in Mark, two men in Luke and two angels in John… but anyway), yet the big problem is that not many people so far have been able to defy death. I have only heard of two cases: supposedly the King of Kings (Jesus) but also of course the King himself (Elvis!), whereby sightings of Elvis after his death are much more numerous than of Jesus (just saying… 😉 )

Have a look at the following code (source for the number of people who have ever lived: WolframAlpha)

A <- 2/108500000000 # probability of coming back from the dead (The King = Elvis and the King of Kings = Jesus)
BlA <- 0.9999999 # sensitivity of test -> very, very reliable witnesses (many more in case of Elvis 😉

B <- A * BlA + (1 - A) * (1 - BlA) # positive test = witnesses say He rose
AnB <- BlA * A
AlB <- BlA * A / B # Bayes's theorem

C <- 1 # all people
main <- paste0("P(B|A) = ", round(BlA, 2), ", but P(A|B) = ", round(AlB, 2))

fit1 <- euler(c("A" = A, "B" = B, "C" = C, "A&B" = AnB, "A&C" = A, "B&C" = B, "A&B&C" = AnB), input = "union")
plot(fit1, main = main, fill = c("red", "green", "gray90"))


fit2 <- euler(c("A" = A, "B" = B, "A&B" = AnB), input = "union")
plot(fit2, main = main, fill = c("red", "green"))


So, in this case C is the unfortunate group of people who have to go for good… it is us. 🙁 As you can see although the witnesses are super reliable when they claim that somebody rose it is almost certain that they are wrong:

• : if Jesus rose () the very, very reliable witnesses would with a very high probability say so ()
• : if the very, very reliable witnesses said that Jesus rose () Jesus would still almost surely have stayed dead

Or in the words of the above mentioned essay:

No one is justified in believing in Jesus’s resurrection. The numbers simply don’t justify the conclusion.

But this chimes well with a famous Christian saying “I believe because it is absurd” (or in Latin “Credo quia absurdum”) – you can find out more about that in another highly interesting essay: ‘I believe because it is absurd’: Christianity’s first meme

Unfortunately this devastating conclusion is also true in the case of Elvis…

## Separating the Signal from the Noise: Robust Statistics for Pedestrians

One of the problems of navigating an autonomous car through a city is to extract robust signals in the face of all the noise that is present in the different sensors. Just taking something like an arithmetic mean of all the data points could possibly end in a catastrophe: if a part of a wall looks similar to the street and the algorithm calculates an average trajectory of the two this would end in leaving the road and possibly crashing into pedestrians. So we need some robust algorithm to get rid of the noise. The area of statistics that especially deals with such problems is called robust statistics and the methods used therein robust estimation.

Now, one of the problems is that one doesn’t know what is signal and what is noise. The big idea behind RANSAC (short for RAndom SAmple Consensus) is to get rid of outliers by basically taking as many points as possible which form a well-defined region and leaving out the others. It does that iteratively, similar to the famous k-means algorithm (the topic of one of the upcoming posts, so stay tuned…).

To really understand how RANSAC works we will now build it with R. We will take a simple linear regression as an example and make it robust against outliers.

For educational purposes we will do this step by step:

1. Understanding the general outline of the algorithm.
2. Looking at the steps in more detail.
3. Expressing the steps in pseudocode.
4. Translating this into R!

Conveniently enough Wikipedia gives a good outline of the algorithm and even provides us with the very clear pseudocode which will serve as the basis for our own R implementation (the given Matlab code is not very good in my opinion and has nothing to do with the pseudocode):

The RANSAC algorithm is essentially composed of two steps that are iteratively repeated:

1. In the first step, a sample subset containing minimal data items is randomly selected from the input dataset. A fitting model and the corresponding model parameters are computed using only the elements of this sample subset. The cardinality of the sample subset is the smallest sufficient to determine the model parameters.
2. In the second step, the algorithm checks which elements of the entire dataset are consistent with the model instantiated by the estimated model parameters obtained from the first step. A data element will be considered as an outlier if it does not fit the fitting model instantiated by the set of estimated model parameters within some error threshold that defines the maximum deviation attributable to the effect of noise. The set of inliers obtained for the fitting model is called consensus set. The RANSAC algorithm will iteratively repeat the above two steps until the obtained consensus set in certain iteration has enough inliers.

In more detail:

RANSAC achieves its goal by repeating the following steps:

1. Select a random subset of the original data. Call this subset the hypothetical inliers.
2. A model is fitted to the set of hypothetical inliers.
3. All other data are then tested against the fitted model. Those points that fit the estimated model well, according to some model-specific loss function, are considered as part of the consensus set.
4. The estimated model is reasonably good if sufficiently many points have been classified as part of the consensus set.
5. Afterwards, the model may be improved by reestimating it using all members of the consensus set.

This procedure is repeated a fixed number of times, each time producing either a model which is rejected because too few points are part of the consensus set, or a refined model together with a corresponding consensus set size. In the latter case, we keep the refined model if its consensus set is larger than the previously saved model.

Now, this can be expressed in pseudocode:

Given:
data - a set of observed data points
model - a model that can be fitted to data points
n - minimum number of data points required to fit the model
k - maximum number of iterations allowed in the algorithm
t - threshold value to determine when a data point fits a model
d - number of close data points required to assert that a model fits well to data

Return:
bestfit - model parameters which best fit the data (or nul if no good model is found)

iterations = 0
bestfit = nul
besterr = something really large
while iterations < k {
maybeinliers = n randomly selected values from data
maybemodel = model parameters fitted to maybeinliers
alsoinliers = empty set
for every point in data not in maybeinliers {
if point fits maybemodel with an error smaller than t
add point to alsoinliers
}
if the number of elements in alsoinliers is > d {
% this implies that we may have found a good model
% now test how good it is
bettermodel = model parameters fitted to all points in maybeinliers and alsoinliers
thiserr = a measure of how well model fits these points
if thiserr < besterr {
bestfit = bettermodel
besterr = thiserr
}
}
increment iterations
}
return bestfit


It is quite easy to convert this into valid R code (as a learner of R you should try it yourself before looking at my solution!):

ransac <- function(data, n, k, t, d) {
iterations <- 0
bestfit <- NULL
besterr <- 1e5
while (iterations < k) {
maybeinliers <- sample(nrow(data), n)
maybemodel <- lm(y ~ x, data = data, subset = maybeinliers)
alsoinliers <- NULL
for (point in setdiff(1:nrow(data), maybeinliers)) {
if (abs(maybemodel$coefficients[2]*data[point, 1] - data[point, 2] + maybemodel$coefficients[1])/(sqrt(maybemodel$coefficients[2] + 1)) < t) alsoinliers <- c(alsoinliers, point) } if (length(alsoinliers) > d) { bettermodel <- lm(y ~ x, data = data, subset = c(maybeinliers, alsoinliers)) thiserr <- summary(bettermodel)$sigma
if (thiserr < besterr) {
bestfit <- bettermodel
besterr <- thiserr
}
}
iterations <- iterations + 1
}
bestfit
}


We now test this with some sample data:

data <- read.csv("data/RANSAC.csv")
plot(data)
abline(lm(y ~ x, data = data))
set.seed(3141)
abline(ransac(data, n = 10, k = 10, t = 0.5, d = 10), col = "blue")


The black line is a plain vanilla linear regression, the blue line is the RANSAC-enhanced version. As you can see: no more crashing into innocent pedestrians. 🙂

## Inverse Statistics – and how to create Gain-Loss Asymmetry plots in R

Asset returns have certain statistical properties, also called stylized facts. Important ones are:

• Absence of autocorrelation: basically the direction of the return of one day doesn’t tell you anything useful about the direction of the next day.
• Fat tails: returns are not normal, i.e. there are many more extreme events than there would be if returns were normal.
• Volatility clustering: basically financial markets exhibit high-volatility and low-volatility regimes.
• Leverage effect: high-volatility regimes tend to coincide with falling prices and vice versa.

A good introduction and overview can be found in R. Cont: Empirical properties of asset returns: stylized facts and statistical issues.

One especially fascinating statistical property is the so called gain-loss asymmetry: it basically states that upward movements tend to take a lot longer than downward movements which often come in the form of sudden hefty crashes. So, an abstract illustration of this property would be a sawtooth pattern:

The same effect in real life:

suppressWarnings(suppressMessages(library(quantmod)))
suppressWarnings(suppressMessages(getSymbols("^GSPC", from = "1950-01-01")))
## [1] "GSPC"
plot.zoo(GSPC\$GSPC.Close, xlim = c(as.Date("2000-01-01"), as.Date("2013-01-01")), ylim = c(600, 1700), ylab ="", main ="S&P from 2000 to 2013")


The practical implication for your investment horizon is that your losses often come much faster than your gains (life is just not fair…). To illustrate this authors often plot the investment horizon distribution. It illustrates how long you have to wait for a certain target return, negative as well as positive (for some examples see e.g. here, also the source of the following plot):

This is closely related to what statisticians call first passage time: when is a given threshold passed for the first time? To perform such an analysis you need something called inverse statistics. Normally you would plot the distribution of returns given a fixed time window (= forward statistics). Here we do it the other way around: you fix the return and want to find the shortest waiting time needed to obtain at least the respective return. To achieve that you have to test all possible time windows which can be quite time consuming.

Because I wanted to reproduce those plots I tried to find some code somewhere… to no avail. I then contacted some of the authors of the respective papers… no answer. I finally asked a question on Quantitative Finance StackExchange… and got no satisfying answer either. I therefore wrote the code myself and thereby answered my own question:

inv_stat <- function(symbol, name, target = 0.05) {
p <- coredata(Cl(symbol))
end <- length(p)
days_n <- days_p <- integer(end)

# go through all days and look when target is reached the first time from there
for (d in 1:end) {
ret <- cumsum(as.numeric(na.omit(ROC(p[d:end]))))
cond_n <- ret < -target
cond_p <- ret > target
suppressWarnings(days_n[d] <- min(which(cond_n)))
suppressWarnings(days_p[d] <- min(which(cond_p)))
}

days_n_norm <- prop.table(as.integer(table(days_n, exclude = "Inf")))
days_p_norm <- prop.table(as.integer(table(days_p, exclude = "Inf")))

plot(days_n_norm, log = "x", xlim = c(1, 1000), main = paste0(name, " gain-/loss-asymmetry with target ", target), xlab = "days", ylab = "density", col = "red")
points(days_p_norm, col = "blue")

c(which.max(days_n_norm), which.max(days_p_norm)) # mode of days to obtain (at least) neg. and pos. target return
}

inv_stat(GSPC, name = "S&P 500")


## [1] 10 24


So, here you see that for the S&P 500 since 1950 the mode (peak) of the days to obtain a loss of at least 5% has been 10 days and a gain of the same size 24 days! That is the gain-loss asymmetry in action!

Still two things are missing in the code:

• Detrending of the time series.
• Fitting a probability distribution (the generalized gamma distribution seems to work well).

If you want to add them or if you have ideas how to improve the code, please let me know in the comments! Thank you and stay tuned!