Learning R: Painting with Fire


A few months ago I published a post on recursion: To understand Recursion you have to understand Recursion…. In this post we will see how to use recursion to fill free areas of an image with colour, learn about the caveats of recursion, and see how to transform a recursive algorithm into a loop-based version using a queue – so read on…

The recursive version of the painting algorithm we want to examine here is very easy to understand; Wikipedia gives the pseudocode of the so-called flood-fill algorithm:


Flood-fill (node, target-color, replacement-color):

  • If target-color is equal to replacement-color, return.
  • If the color of node is not equal to target-color, return.
  • Set the color of node to replacement-color.
  • Perform Flood-fill (one step to the south of node, target-color, replacement-color).
    Perform Flood-fill (one step to the north of node, target-color, replacement-color).
    Perform Flood-fill (one step to the west of node, target-color, replacement-color).
    Perform Flood-fill (one step to the east of node, target-color, replacement-color).
  • Return.

The translation into R couldn’t be any easier:

floodfill <- function(row, col, tcol, rcol) {
  # assumes the area is enclosed by a border, so the indices never leave M
  if (tcol == rcol) return()        # target and replacement colour identical
  if (M[row, col] != tcol) return() # pixel does not have the target colour
  M[row, col] <<- rcol              # paint pixel in the global matrix M
  floodfill(row - 1, col    , tcol, rcol) # south
  floodfill(row + 1, col    , tcol, rcol) # north
  floodfill(row    , col - 1, tcol, rcol) # west
  floodfill(row    , col + 1, tcol, rcol) # east
  return("filling completed")
}

We take the image from Wikipedia as an example:

M <- matrix(c(1, 1, 1, 1, 1, 1, 1, 1, 1,
              1, 0, 0, 0, 1, 0, 0, 0, 1,
              1, 0, 0, 0, 1, 0, 0, 0, 1,
              1, 0, 0, 1, 0, 0, 0, 0, 1,
              1, 1, 1, 0, 0, 0, 1, 1, 1,
              1, 0, 0, 0, 0, 1, 0, 0, 1,
              1, 0, 0, 0, 1, 0, 0, 0, 1,
              1, 0, 0, 0, 1, 0, 0, 0, 1,
              1, 1, 1, 1, 1, 1, 1, 1, 1), 9, 9)
image(M, col = c(0, 1))

We now fill the three areas with three different colours and then plot the image again:

startrow <- 5; startcol <- 5
floodfill(startrow, startcol, 0, 2)
## [1] "filling completed"

startrow <- 3; startcol <- 3
floodfill(startrow, startcol, 0, 3)
## [1] "filling completed"

startrow <- 7; startcol <- 7
floodfill(startrow, startcol, 0, 4)
## [1] "filling completed"

image(M, col = 1:4)

This seems to work pretty well, but the problem is that the deeper the recursion goes, the bigger the call stack has to grow – which can lead to stack overflow errors. One comment on my original post on recursion read:

just keep in mind that recursion is useful in industrial work only if tail optimization is supported. otherwise your code will explode at some indeterminate time in the future. […]

One possibility is to raise the limit for nested expressions with options(expressions = 10000), but even this may not be enough – and the C stack R runs on imposes a hard limit of its own.
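
You can check both limits on your machine directly – a quick look, using nothing but base R:

getOption("expressions") # maximum number of nested expressions (default value)
## [1] 5000

Cstack_info() # size and current usage of the C stack (machine-dependent)

Because even raised limits will eventually be exhausted, we transform our recursive algorithm into a loop-based one and use a queue instead of a stack! The pseudocode from Wikipedia: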


Flood-fill (node, target-color, replacement-color):

  • If target-color is equal to replacement-color, return.
  • If color of node is not equal to target-color, return.
  • Set the color of node to replacement-color.
  • Set Q to the empty queue.
  • Add node to the end of Q.
  • While Q is not empty:
    • Set n equal to the first element of Q.
    • Remove first element from Q.
    • If the color of the node to the west of n is target-color,
      set the color of that node to replacement-color and add that node to the end of Q.
    • If the color of the node to the east of n is target-color,
      set the color of that node to replacement-color and add that node to the end of Q.
    • If the color of the node to the north of n is target-color,
      set the color of that node to replacement-color and add that node to the end of Q.
    • If the color of the node to the south of n is target-color,
      set the color of that node to replacement-color and add that node to the end of Q.
  • Continue looping until Q is exhausted.
  • Return.

Because of the way the algorithm fills areas it is also called forest fire. Again, the translation into valid R code is straightforward:

floodfill <- function(row, col, tcol, rcol) {
  if (tcol == rcol) return()
  if (M[row, col] != tcol) return()
  Q <- matrix(c(row, col), 1, 2) # initialize queue with the start node
  while (nrow(Q) > 0) {
    n <- Q[1, , drop = FALSE]    # take the first node from the queue...
    west  <- cbind(n[1]    , n[2] - 1)
    east  <- cbind(n[1]    , n[2] + 1)
    north <- cbind(n[1] + 1, n[2]    )
    south <- cbind(n[1] - 1, n[2]    )
    Q <- Q[-1, , drop = FALSE]   # ...and remove it
    if (M[n] == tcol) {
      M[n] <<- rcol              # paint node in the global matrix M
      if (M[west] == tcol)  Q <- rbind(Q, west)  # enqueue all neighbours
      if (M[east] == tcol)  Q <- rbind(Q, east)  # that still have the
      if (M[north] == tcol) Q <- rbind(Q, north) # target colour
      if (M[south] == tcol) Q <- rbind(Q, south)
    }
  }
  return("filling completed")
}

As an example we will use a much bigger picture (it can be downloaded from here: Unfilledcirc.png):

library(png)
img <- readPNG("data/Unfilledcirc.png")
M <- img[ , , 1]           # keep only the first (red) colour channel
M <- ifelse(M < 0.5, 0, 1) # binarize: dark pixels become 0, light ones 1
M <- rbind(M, 0)           # pad bottom and right with a border of 0s
M <- cbind(M, 0)
image(M, col = c(1, 0))

And now for the filling:

startrow <- 100; startcol <- 100
floodfill(startrow, startcol, 0, 2)
## [1] "filling completed"

startrow <- 50; startcol <- 50
floodfill(startrow, startcol, 1, 3)
## [1] "filling completed"

image(M, col = c(1, 0, 2, 3))

As you can see, with this version of the algorithm much bigger areas can be filled!

I also added both R implementations to the respective section of Rosetta Code: Bitmap/Flood fill.

Learning R: The Ultimate Introduction (incl. Machine Learning!)


There are a million reasons to learn R (see e.g. Why R for Data Science – and not Python?), but where to start? I present to you the ultimate introduction to bring you up to speed! So read on…

I call it ultimate because it is the essence of many years of teaching R… or put differently: it is the kind of introduction I would have liked to have had when I started out with R back in the day!

A word of warning though: this is an introduction to R and not to statistics, so I won’t explain the statistical terms used here. You do not need to know any other programming language, but knowing one does no harm either. Ok, now let us start!

First you need to install R (https://www.r-project.org) and preferably RStudio as a Graphical User Interface (GUI): https://www.rstudio.com/products/RStudio/#Desktop. Both are free and available for all common operating systems.

You can either type in the following commands in the console or open a new script tab (File -> New File -> R Script) and run the commands by pressing Ctrl + Enter/Return after having typed them.

First of all R is a very good calculator:

2 + 2
## [1] 4

sin(0.5)
## [1] 0.4794255

abs(-10) # absolute value
## [1] 10

pi
## [1] 3.141593

exp(1) # e
## [1] 2.718282

factorial(6)
## [1] 720

By the way: The hash is used for comments, everything after it will be ignored!
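
To see that everything after the hash really is ignored:

2 + 2 # + 1000 (this part does not count)
## [1] 4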

Of course you can define variables and use them in your calculations:

n1 <- 2
n2 <- 3
n1 # show content of variable by just typing the name
## [1] 2

n1 + n2
## [1] 5

n1 * n2
## [1] 6

n1^n2
## [1] 8

Part of R’s power stems from the fact that functions can handle several numbers at once, so-called vectors, and do calculations on them. When calling a function, arguments are passed within round brackets:

n3 <- c(12, 5, 27) # concatenate (combine) elements into a vector 
n3
## [1] 12  5 27

min(n3)
## [1] 5

max(n3)
## [1] 27

sum(n3)
## [1] 44

mean(n3)
## [1] 14.66667

sd(n3) # standard deviation
## [1] 11.23981

var(n3) # variance
## [1] 126.3333

median(n3)
## [1] 12

n3 / 12
## [1] 1.0000000 0.4166667 2.2500000

In the last example the 12 was recycled three times. R always tries to do that (when feasible), sometimes giving a warning when it might not be intended:

n3 / c(1, 2)
## Warning in n3/c(1, 2): longer object length is not a multiple of shorter
## object length
## [1] 12.0  2.5 27.0

If you only want parts of your vectors you can use subsetting with square brackets:

n3[1]
## [1] 12

n3[c(2, 3)]
## [1]  5 27

Ranges can easily be created with the colon:

n4 <- 10:20
n4
##  [1] 10 11 12 13 14 15 16 17 18 19 20
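
More flexible sequences can be created with seq(), which we will also use later on:

seq(5) # same as 1:5
## [1] 1 2 3 4 5

seq(15, 70, 5) # from 15 to 70 in steps of 5
##  [1] 15 20 25 30 35 40 45 50 55 60 65 70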

When you test whether this vector is bigger than a certain number you will get logicals as a result. You can use those logicals for subsetting:

n4 > 15
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

n4[n4 > 15]
## [1] 16 17 18 19 20
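
Because TRUE counts as 1 and FALSE as 0 in calculations, you can even count the matching elements directly:

sum(n4 > 15)
## [1] 5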

Perhaps you have heard the story of little Gauss, whose teacher gave him the task of adding all the numbers from 1 to 100 to keep him busy for a while? Well, he found a mathematical trick to add them within seconds… we normal people can just use R:

sum(1:100)
## [1] 5050
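
Gauss’s trick, by the way, was the closed-form formula n(n+1)/2, which we can verify directly:

100 * (100 + 1) / 2
## [1] 5050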

When we want to use some code several times we can define our own function (a so-called user-defined function). We do that the same way we create a vector (or any other data structure) because R is a functional programming language in which functions are first-class citizens (i.e. on the same level as other data structures like vectors). The code to be executed is put in curly brackets:

gauss <- function(x) {
  sum(1:x)
}
gauss(100)
## [1] 5050

gauss(1000)
## [1] 500500

Of course we also have other data structures, e.g. matrices, which are basically two-dimensional vectors:

M <- matrix(1:12, nrow = 3, byrow = TRUE) # create a matrix
M
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

dim(M)
## [1] 3 4

Subsetting now has to provide two numbers, the first for the row, the second for the column (like in the game Battleship). If you leave one out, all data of the respective dimension will be shown:

M[2, 3]
## [1] 7

M[ , c(1, 3)]
##      [,1] [,2]
## [1,]    1    3
## [2,]    5    7
## [3,]    9   11

Another possibility to create matrices:

v1 <- 1:4
v2 <- 4:1
M1 <- rbind(v1, v2) # row bind
M1
##    [,1] [,2] [,3] [,4]
## v1    1    2    3    4
## v2    4    3    2    1

M2 <- cbind(v1, v2)  # column bind
M2
##      v1 v2
## [1,]  1  4
## [2,]  2  3
## [3,]  3  2
## [4,]  4  1

Naming rows, here with the built-in LETTERS constant:

rownames(M2) <- LETTERS[1:4]
M2
##   v1 v2
## A  1  4
## B  2  3
## C  3  2
## D  4  1

LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

When some result is Not Available:

LETTERS[50]
## [1] NA

Getting the structure of your variables:

str(LETTERS)
##  chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" ...

str(M2)
##  int [1:4, 1:2] 1 2 3 4 4 3 2 1
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:4] "A" "B" "C" "D"
##   ..$ : chr [1:2] "v1" "v2"

Another famous dataset (iris) that is also built into base R (to get help on any function or dataset just put the cursor on it and press F1, or type e.g. ?iris into the console):

iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Oops, that is a bit long… if you only want to show the first or last rows do the following:

head(iris) # first 6 rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(iris, 10) # last 10 rows
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 141          6.7         3.1          5.6         2.4 virginica
## 142          6.9         3.1          5.1         2.3 virginica
## 143          5.8         2.7          5.1         1.9 virginica
## 144          6.8         3.2          5.9         2.3 virginica
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

iris is a so-called data frame, the workhorse of R and of data science (you will see how to create one below):

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

As you can see, data frames can combine different data types. If you try to do that with e.g. vectors, which can only hold one data type, something called coercion happens, i.e. at least one data type is forced to become another one so that consistency is maintained:

str(c(2, "Hello")) # 2 is coerced to become a character string too
##  chr [1:2] "2" "Hello"
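
The reverse direction, explicit coercion, works with the as.* family of functions:

as.numeric(c("2", "3")) # turn character strings back into numbers
## [1] 2 3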

You can get a fast overview of your data like so:

summary(iris[1:4])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

boxplot(iris[1:4])

As you have seen, R often runs a function on all of the data simultaneously. This feature is called vectorization and in many other languages you would need a loop for that. In R you don’t use loops that often, but of course they are available:

for (i in seq(5)) {
  print(1:i)
}
## [1] 1
## [1] 1 2
## [1] 1 2 3
## [1] 1 2 3 4
## [1] 1 2 3 4 5
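
To make the contrast with vectorization explicit, here is the same operation – doubling every element of n4 – once as a loop and once vectorized (a quick sketch):

res <- numeric(length(n4)) # pre-allocate a result vector
for (i in seq_along(n4)) {
  res[i] <- n4[i] * 2      # element by element
}
res
##  [1] 20 22 24 26 28 30 32 34 36 38 40

n4 * 2 # vectorized: the same in one line
##  [1] 20 22 24 26 28 30 32 34 36 38 40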

Speaking of control structures: of course conditional statements are available too:

even <- function(x) ifelse(x %% 2 == 0, TRUE, FALSE) # %% gives remainder of division (= modulo operator)
even(1:5)
## [1] FALSE  TRUE FALSE  TRUE FALSE
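
By the way, ifelse() is the vectorized variant; the classic if … else statement works on a single condition (a small sketch):

x <- 7
if (x %% 2 == 0) {
  "even"
} else {
  "odd"
}
## [1] "odd"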

Linear modelling (e.g. correlation and linear regression) couldn’t be any easier – it is included in the core language:

age <- c(21, 46, 55, 35, 28)
income <- c(1850, 2500, 2560, 2230, 1800)
df <- data.frame(age, income) # create a data frame
df
##   age income
## 1  21   1850
## 2  46   2500
## 3  55   2560
## 4  35   2230
## 5  28   1800

cor(df) # correlation
##              age    income
## age    1.0000000 0.9464183
## income 0.9464183 1.0000000

LinReg <- lm(income ~ age, data = df) # income as a linear model of age
LinReg
## 
## Call:
## lm(formula = income ~ age, data = df)
## 
## Coefficients:
## (Intercept)          age  
##     1279.37        24.56

summary(LinReg)
## 
## Call:
## lm(formula = income ~ age, data = df)
## 
## Residuals:
##       1       2       3       4       5 
##   54.92   90.98  -70.04   91.12 -166.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1279.367    188.510   6.787  0.00654 **
## age           24.558      4.838   5.076  0.01477 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132.1 on 3 degrees of freedom
## Multiple R-squared:  0.8957, Adjusted R-squared:  0.8609 
## F-statistic: 25.77 on 1 and 3 DF,  p-value: 0.01477

plot(df, pch = 16, main = "Linear model")
abline(LinReg, col = "blue", lwd = 2) # adding the regression line

You could directly use the model to make predictions:

pred_LinReg <- predict(LinReg, data.frame(age = seq(15, 70, 5)))
names(pred_LinReg) <- seq(15, 70, 5)
round(pred_LinReg, 2)
##      15      20      25      30      35      40      45      50      55 
## 1647.73 1770.52 1893.31 2016.10 2138.88 2261.67 2384.46 2507.25 2630.04 
##      60      65      70 
## 2752.83 2875.61 2998.40

If you want to know more about the modelling process you can find it here: Learning Data Science: Modelling Basics

Another strength of R is the huge number of add-on packages for all kinds of specialized tasks. For the grand finale of this introduction, we’re gonna get a little taste of machine learning. To that end we install the OneR package from CRAN (the official package repository of R): Tools -> Install packages… -> type in “OneR” -> click “Install”.
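
If you prefer the console over clicking through menus, the same can be achieved with a single command:

install.packages("OneR") # download and install the package from CRAN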

After that we build a simple model on the iris dataset to predict the Species column:

library(OneR) # load package
data <- optbin(Species ~ ., data = iris) # find optimal bins for numeric predictors
model <- OneR(data, verbose = TRUE) # build actual model
## 
##     Attribute    Accuracy
## 1 * Petal.Width  96%     
## 2   Petal.Length 95.33%  
## 3   Sepal.Length 74.67%  
## 4   Sepal.Width  55.33%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model) # show rules
## 
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
## 
## Rules:
## If Petal.Width = (0.0976,0.791] then Species = setosa
## If Petal.Width = (0.791,1.63]   then Species = versicolor
## If Petal.Width = (1.63,2.5]     then Species = virginica
## 
## Accuracy:
## 144 of 150 instances classified correctly (96%)
## 
## Contingency table:
##             Petal.Width
## Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
##   setosa               * 50            0          0  50
##   versicolor              0         * 48          2  50
##   virginica               0            4       * 46  50
##   Sum                    50           52         48 150
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 266.35, df = 4, p-value < 2.2e-16

plot(model)

We’ll now see how well the model is doing:

prediction <- predict(model, data)
eval_model(prediction, data)
## 
## Confusion matrix (absolute):
##             Actual
## Prediction   setosa versicolor virginica Sum
##   setosa         50          0         0  50
##   versicolor      0         48         4  52
##   virginica       0          2        46  48
##   Sum            50         50        50 150
## 
## Confusion matrix (relative):
##             Actual
## Prediction   setosa versicolor virginica  Sum
##   setosa       0.33       0.00      0.00 0.33
##   versicolor   0.00       0.32      0.03 0.35
##   virginica    0.00       0.01      0.31 0.32
##   Sum          0.33       0.33      0.33 1.00
## 
## Accuracy:
## 0.96 (144/150)
## 
## Error rate:
## 0.04 (6/150)
## 
## Error rate reduction (vs. base rate):
## 0.94 (p-value < 2.2e-16)

96% accuracy is not too bad, even for this simple dataset!

If you want to know more about the OneR package you can read the vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models.

Well, and that’s it for the ultimate introduction to R – hopefully you liked it and you learned something! Please share your first experiences with R in the comments and also if you miss something (I might add it in the future!) – Thank you for reading and stay tuned for more to come!

Was the Bavarian Abitur too hard this time?


Bavaria is known for its famous Oktoberfest… and within Germany also for its reputedly difficult Abitur, the qualification granted by university-preparatory schools.

A mandatory part for all students is maths. This year many students protested that the maths part was way too hard; they even started an online petition with more than seventy thousand supporters at the time of writing!

It is not yet clear whether their marks will be adjusted upwards; the ministry of education is investigating the case. As a professor in Bavaria who also teaches statistics I will take the opportunity to share with you an actual question from the original examination together with a solution, so read on…

Let us have a look at the first (and easiest) question in the stochastics part:

Every sixth visitor to the Oktoberfest wears a gingerbread heart around his neck. During the festival 25 visitors are chosen at random. Determine the probability that at most one of the selected visitors will have a gingerbread heart.

Before you read on try to solve the task yourself…

Of course students are not allowed to use R in the examination, but in general the good thing about this kind of question is that if you have no idea how to solve it analytically, solving it by simulation is often much easier:

set.seed(12) # for reproducibility
N <- 1e7     # number of simulated groups of 25 visitors
v <- matrix(sample(c(0L, 1L), size = 25 * N, replace = TRUE, prob = c(5/6, 1/6)), ncol = 25) # 1 = gingerbread heart
sum(rowSums(v) <= 1) / N # proportion of groups with at most one heart
## [1] 0.062936

The answer is about 6.3%.

Now for the analytical solution: “at most one” implies that we have to differentiate between two cases, “no gingerbread heart” and “exactly one gingerbread heart”. The probability of “no gingerbread heart” is (5/6)^25. The probability of “exactly one gingerbread heart” is 25 · 1/6 · (5/6)^24, because there are 25 positions at which the single gingerbread heart could occur. We have to add both probabilities:

(5/6)^25 + 25*1/6*(5/6)^24
## [1] 0.06289558

If you know a little bit about probability distributions you will recognize the above as the binomial distribution:

pbinom(q = 1, size = 25, prob = 1/6)
## [1] 0.06289558
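
Equivalently, you can sum the point probabilities for zero and one successes yourself:

dbinom(0, size = 25, prob = 1/6) + dbinom(1, size = 25, prob = 1/6)
## [1] 0.06289558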

Of course it is a little unfair to judge just on the basis of the easiest task and without knowing the general maths level that is required. But still, I would like to hear your opinion on this. Outsiders’ views from different countries and different educational systems are also very welcome! So, what do you think:

Was the Bavarian Abitur too hard this time? Please leave your reply in the comment section below!

Update (June 7, 2019): The Bavarian Ministry of Education has decided now that the marks will not be subsequently adjusted.

Backtest Trading Strategies like a real Quant


R is one of the best choices when it comes to quantitative finance. Here we will show you how to load financial data, plot charts and give you a step-by-step template to backtest trading strategies. So, read on…

We begin by just plotting a chart of the Standard & Poor’s 500 (S&P 500), an index of the 500 biggest companies in the US. To get the index data and plot the chart we use the powerful quantmod package (on CRAN). After that we add two popular indicators, the simple moving average (SMA) and the exponential moving average (EMA).

Have a look at the code:

library(quantmod)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.

getSymbols("^GSPC", from = "2000-01-01")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## [1] "^GSPC"

head(GSPC)
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2000-01-03   1469.25   1478.00  1438.36    1455.22   931800000
## 2000-01-04   1455.22   1455.22  1397.43    1399.42  1009000000
## 2000-01-05   1399.42   1413.27  1377.68    1402.11  1085500000
## 2000-01-06   1402.11   1411.90  1392.10    1403.45  1092300000
## 2000-01-07   1403.45   1441.47  1400.73    1441.47  1225200000
## 2000-01-10   1441.47   1464.36  1441.47    1457.60  1064800000
##            GSPC.Adjusted
## 2000-01-03       1455.22
## 2000-01-04       1399.42
## 2000-01-05       1402.11
## 2000-01-06       1403.45
## 2000-01-07       1441.47
## 2000-01-10       1457.60

tail(GSPC)
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2019-04-24   2934.00   2936.83  2926.05    2927.25  3448960000
## 2019-04-25   2928.99   2933.10  2912.84    2926.17  3425280000
## 2019-04-26   2925.81   2939.88  2917.56    2939.88  3248500000
## 2019-04-29   2940.58   2949.52  2939.35    2943.03  3118780000
## 2019-04-30   2937.14   2948.22  2924.11    2945.83  3919330000
## 2019-05-01   2952.33   2954.13  2923.36    2923.73  3645850000
##            GSPC.Adjusted
## 2019-04-24       2927.25
## 2019-04-25       2926.17
## 2019-04-26       2939.88
## 2019-04-29       2943.03
## 2019-04-30       2945.83
## 2019-05-01       2923.73

chartSeries(GSPC, theme = chartTheme("white"), subset = "last 10 months", show.grid = TRUE)

addSMA(20)

addEMA(20)

As you can see the moving averages are basically smoothed-out versions of the original data that lag behind it. While the SMA (red curve) weights all days within its window equally, the EMA (blue curve) gives more weight to the more recent days, so it picks up changes more quickly. The idea is that by using such indicators investors might be able to detect longer-term trends and act accordingly. For example, a trading rule could be to buy the index whenever it crosses the MA from below and to sell when it crosses it from above. Judge for yourself whether this could have worked.
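
To get a feel for the different weighting schemes you can run both functions on a small toy series yourself – a minimal sketch using the SMA() and EMA() functions from the TTR package (loaded automatically with quantmod); the data is of course made up:

x <- c(1, 2, 3, 4, 5, 100) # artificial series ending in a sudden jump
SMA(x, n = 3) # all 3 days in the window weighted equally
EMA(x, n = 3) # more recent days weighted more strongly

With the default settings the last EMA value ends up much closer to the jump than the last SMA value – exactly the quicker reaction described above.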

Well, having said that, it might not be that easy to judge the profitability of certain trading rules just by staring at a chart. We are looking for something more systematic – a decent backtest! This can of course also be done with R; a great choice is the PerformanceAnalytics package (on CRAN).

To backtest a trading strategy I provide you with a step-by-step template:

  1. Load libraries and data
  2. Create your indicator
  3. Use indicator to create equity curve
  4. Evaluate strategy performance

As an example we want to test the idea that it might be profitable to sell the index when the financial markets exhibit significant stress. Interestingly enough “stress” can be measured by certain indicators that are freely available. One of them is the National Financial Conditions Index (NFCI) of the Federal Reserve Bank of Chicago (https://www.chicagofed.org/publications/nfci/index):

The Chicago Fed’s National Financial Conditions Index (NFCI) provides a comprehensive weekly update on U.S. financial conditions in money markets, debt and equity markets and the traditional and “shadow” banking systems. […] The NFCI [is] constructed to have an average value of zero and a standard deviation of one over a sample period extending back to 1971. Positive values of the NFCI have been historically associated with tighter-than-average financial conditions, while negative values have been historically associated with looser-than-average financial conditions.

To make it more concrete we want to create a buy signal as long as the index stays below one (i.e. financial conditions are not more than one standard deviation tighter than average) and a sell signal otherwise.

Have a look at the following code:

# Step 1: Load libraries and data
library(quantmod)
library(PerformanceAnalytics)
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend

getSymbols('NFCI', src = 'FRED', from = '2000-01-01')
## [1] "NFCI"

NFCI <- na.omit(lag(NFCI)) # we can only act on the signal after release, i.e. the next day
getSymbols("^GSPC", from = '2000-01-01')
## [1] "^GSPC"

data <- na.omit(merge(NFCI, GSPC)) # merge before (!) calculating returns
data$GSPC <- na.omit(ROC(Cl(GSPC))) # calculate returns of closing prices

# Step 2: Create your indicator
data$sig <- ifelse(data$NFCI < 1, 1, 0)
data$sig <- na.locf(data$sig)

# Step 3: Use indicator to create equity curve
perf <- na.omit(merge(data$sig * data$GSPC, data$GSPC))
colnames(perf) <- c("Stress-based strategy", "SP500")

# Step 4: Evaluate strategy performance
table.DownsideRisk(perf)
##                               Stress-based strategy   SP500
## Semi Deviation                               0.0075  0.0087
## Gain Deviation                               0.0071  0.0085
## Loss Deviation                               0.0079  0.0095
## Downside Deviation (MAR=210%)                0.0125  0.0135
## Downside Deviation (Rf=0%)                   0.0074  0.0087
## Downside Deviation (0%)                      0.0074  0.0087
## Maximum Drawdown                             0.5243  0.6433
## Historical VaR (95%)                        -0.0173 -0.0188
## Historical ES (95%)                         -0.0250 -0.0293
## Modified VaR (95%)                          -0.0166 -0.0182
## Modified ES (95%)                           -0.0268 -0.0311

table.Stats(perf)
##                 Stress-based strategy     SP500
## Observations                4858.0000 4858.0000
## NAs                            0.0000    0.0000
## Minimum                       -0.0690   -0.0947
## Quartile 1                    -0.0042   -0.0048
## Median                         0.0003    0.0005
## Arithmetic Mean                0.0002    0.0002
## Geometric Mean                 0.0002    0.0001
## Quartile 3                     0.0053    0.0057
## Maximum                        0.0557    0.1096
## SE Mean                        0.0001    0.0002
## LCL Mean (0.95)               -0.0001   -0.0002
## UCL Mean (0.95)                0.0005    0.0005
## Variance                       0.0001    0.0001
## Stdev                          0.0103    0.0120
## Skewness                      -0.1881   -0.2144
## Kurtosis                       3.4430    8.5837

charts.PerformanceSummary(perf)

chart.RelativePerformance(perf[ , 1], perf[ , 2])

chart.RiskReturnScatter(perf)

The first chart shows that the stress-based strategy (black curve) clearly outperformed its benchmark, the S&P 500 (red curve). This can also be seen in the second chart, which shows the relative performance directly. And the third chart shows that the backtested strategy beats the benchmark on both dimensions: higher return and lower risk.

So, all in all this seems to be a viable strategy! But of course before investing real money many more tests have to be performed! You can use this framework for backtesting your own ideas.

This is not the place to explain all of the above tables and plots, but as you can see both packages are very, very powerful and I have only shown you a small fraction of their capabilities. To use their full potential you should have a look at the extensive documentation that comes with them on CRAN.

Disclaimer:
This is no investment advice! No responsibility is taken whatsoever if you lose money!

If you gain money though I would be happy if you could buy me a coffee… that is not too much to ask, is it? 😉