Learning R: The Ultimate Introduction (incl. Machine Learning!)


There are a million reasons to learn R (see e.g. Why R for Data Science – and not Python?), but where to start? I present to you the ultimate introduction to bring you up to speed! So read on…

I call it ultimate because it is the essence of many years of teaching R… or put differently: it is the kind of introduction I would have liked to have when I started out with R back in the day!

A word of warning though: this is an introduction to R and not to statistics, so I won’t explain the statistics terms used here. You do not need to know any other programming language, but it does no harm either. Ok, now let us start!

First you need to install R (https://www.r-project.org) and preferably RStudio as a Graphical User Interface (GUI): https://www.rstudio.com/products/RStudio/#Desktop. Both are free and available for all common operating systems.

To get a quick overview of RStudio watch this video:

You can either type in the following commands in the console or open a new script tab (File -> New File -> R Script) and run the commands by pressing Ctrl + Enter/Return after having typed them.

First of all R is a very good calculator:

2 + 2
## [1] 4

sin(0.5)
## [1] 0.4794255

abs(-10) # absolute value
## [1] 10

pi
## [1] 3.141593

exp(1) # e
## [1] 2.718282

factorial(6)
## [1] 720

By the way: the hash is used for comments; everything after it will be ignored!

Of course you can define variables and use them in your calculations:

n1 <- 2
n2 <- 3
n1 # show content of variable by just typing the name
## [1] 2

n1 + n2
## [1] 5

n1 * n2
## [1] 6

n1^n2
## [1] 8

Part of R’s power stems from the fact that functions can handle whole collections of numbers, called vectors, at once and do calculations on them. When calling a function, arguments are passed within round brackets:

n3 <- c(12, 5, 27) # concatenate (combine) elements into a vector 
n3
## [1] 12  5 27

min(n3)
## [1] 5

max(n3)
## [1] 27

sum(n3)
## [1] 44

mean(n3)
## [1] 14.66667

sd(n3) # standard deviation
## [1] 11.23981

var(n3) # variance
## [1] 126.3333

median(n3)
## [1] 12

n3 / 12
## [1] 1.0000000 0.4166667 2.2500000

In the last example the 12 was recycled three times. R always tries to do that (when feasible), sometimes giving a warning when it might not be intended:

n3 / c(1, 2)
## Warning in n3/c(1, 2): longer object length is not a multiple of shorter
## object length
## [1] 12.0  2.5 27.0

In case you only want parts of your vectors, you can apply subsetting with square brackets:

n3[1]
## [1] 12

n3[c(2, 3)]
## [1]  5 27
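
As a small aside (not covered in the code above), you can also exclude elements by using negative indices:

n3[-1] # everything except the first element
## [1]  5 27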

Ranges can easily be created with the colon:

n4 <- 10:20
n4
##  [1] 10 11 12 13 14 15 16 17 18 19 20

When you test whether this vector is bigger than a certain number you will get logicals as a result. You can use those logicals for subsetting:

n4 > 15
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

n4[n4 > 15]
## [1] 16 17 18 19 20
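
A related helper is the which function, which returns the positions of the TRUE values instead of the values themselves:

which(n4 > 15)
## [1]  7  8  9 10 11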

Perhaps you have heard the story of little Gauss whose teacher gave him the task to add all numbers from 1 to 100 to keep him busy for a while? Well, he found a mathematical trick to add them within seconds… as for us normal people, we can use R:

sum(1:100)
## [1] 5050

When we want to use some code several times we can define our own function (a user-defined function). We do that the same way we create a vector (or any other data structure) because R is a so-called functional programming language, where functions are first-class citizens (i.e. on the same level as other data structures like vectors). The code that is being executed is put in curly brackets:

gauss <- function(x) {
  sum(1:x)
}
gauss(100)
## [1] 5050

gauss(1000)
## [1] 500500
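
By the way, the trick little Gauss found was the closed formula n(n+1)/2, which we can check directly:

100 * 101 / 2
## [1] 5050

1000 * 1001 / 2
## [1] 500500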

Of course we also have other data types, e.g. matrices, which are basically two-dimensional vectors:

M <- matrix(1:12, nrow = 3, byrow = TRUE) # create a matrix
M
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

dim(M)
## [1] 3 4

For subsetting you now have to provide two indices, the first for the row, the second for the column (like in the game Battleship). If you leave one out, all data of the respective dimension will be shown:

M[2, 3]
## [1] 7

M[ , c(1, 3)]
##      [,1] [,2]
## [1,]    1    3
## [2,]    5    7
## [3,]    9   11

Another way to create matrices:

v1 <- 1:4
v2 <- 4:1
M1 <- rbind(v1, v2) # row bind
M1
##    [,1] [,2] [,3] [,4]
## v1    1    2    3    4
## v2    4    3    2    1

M2 <- cbind(v1, v2)  # column bind
M2
##      v1 v2
## [1,]  1  4
## [2,]  2  3
## [3,]  3  2
## [4,]  4  1

Naming rows, here with the built-in constant LETTERS:

rownames(M2) <- LETTERS[1:4]
M2
##   v1 v2
## A  1  4
## B  2  3
## C  3  2
## D  4  1

LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

When some result is Not Available:

LETTERS[50]
## [1] NA

Getting the structure of your variables:

str(LETTERS)
##  chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" ...

str(M2)
##  int [1:4, 1:2] 1 2 3 4 4 3 2 1
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:4] "A" "B" "C" "D"
##   ..$ : chr [1:2] "v1" "v2"

Another famous dataset (iris) that is also built into base R (to get help on any function or dataset just put the cursor on it and press F1, or type e.g. ?iris in the console):

iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Oops, that is a bit long… if you only want to show the first or last rows do the following:

head(iris) # first 6 rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(iris, 10) # last 10 rows
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 141          6.7         3.1          5.6         2.4 virginica
## 142          6.9         3.1          5.1         2.3 virginica
## 143          5.8         2.7          5.1         1.9 virginica
## 144          6.8         3.2          5.9         2.3 virginica
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

iris is a so-called data frame, the workhorse of R and data science (you will see how to create one below):

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

As you can see, data frames can combine different data types. If you try to do that with e.g. vectors, which can only hold one data type, something called coercion happens, i.e. at least one data type is forced to become another one so that consistency is maintained:

str(c(2, "Hello")) # 2 is coerced to become a character string too
##  chr [1:2] "2" "Hello"

You can get a fast overview of your data like so:

summary(iris[1:4])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

boxplot(iris[1:4])

As you have seen, R often runs a function on all of the data simultaneously. This feature is called vectorization and in many other languages you would need a loop for that. In R you don’t use loops that often, but of course they are available:

for (i in seq(5)) {
  print(1:i)
}
## [1] 1
## [1] 1 2
## [1] 1 2 3
## [1] 1 2 3 4
## [1] 1 2 3 4 5
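
Besides for there is also a while loop, which runs as long as a condition holds (shown here just for completeness, we will meet it again later):

i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}
## [1] 1
## [1] 2
## [1] 3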

Speaking of control structures: of course conditional statements are available too:

even <- function(x) ifelse(x %% 2 == 0, TRUE, FALSE) # %% gives remainder of division (= modulo operator)
even(1:5)
## [1] FALSE  TRUE FALSE  TRUE FALSE
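
The non-vectorized counterpart is the plain if ... else construct, which works on a single condition, e.g. reusing our even function from above:

x <- 7
if (even(x)) "even" else "odd"
## [1] "odd"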

Linear modelling (e.g. correlation and linear regression) couldn’t be any easier; it is included in the core language:

age <- c(21, 46, 55, 35, 28)
income <- c(1850, 2500, 2560, 2230, 1800)
df <- data.frame(age, income) # create a data frame
df
##   age income
## 1  21   1850
## 2  46   2500
## 3  55   2560
## 4  35   2230
## 5  28   1800

cor(df) # correlation
##              age    income
## age    1.0000000 0.9464183
## income 0.9464183 1.0000000

LinReg <- lm(income ~ age, data = df) # income as a linear model of age
LinReg
## 
## Call:
## lm(formula = income ~ age, data = df)
## 
## Coefficients:
## (Intercept)          age  
##     1279.37        24.56

summary(LinReg)
## 
## Call:
## lm(formula = income ~ age, data = df)
## 
## Residuals:
##       1       2       3       4       5 
##   54.92   90.98  -70.04   91.12 -166.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1279.367    188.510   6.787  0.00654 **
## age           24.558      4.838   5.076  0.01477 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132.1 on 3 degrees of freedom
## Multiple R-squared:  0.8957, Adjusted R-squared:  0.8609 
## F-statistic: 25.77 on 1 and 3 DF,  p-value: 0.01477

plot(df, pch = 16, main = "Linear model")
abline(LinReg, col = "blue", lwd = 2) # adding the regression line

You could directly use the model to make predictions:

pred_LinReg <- predict(LinReg, data.frame(age = seq(15, 70, 5)))
names(pred_LinReg) <- seq(15, 70, 5)
round(pred_LinReg, 2)
##      15      20      25      30      35      40      45      50      55 
## 1647.73 1770.52 1893.31 2016.10 2138.88 2261.67 2384.46 2507.25 2630.04 
##      60      65      70 
## 2752.83 2875.61 2998.40

If you want to know more about the modelling process you can find it here: Learning Data Science: Modelling Basics

Another strength of R is the huge number of add-on packages for all kinds of specialized tasks. For the grand finale of this introduction, we are going to get a little taste of machine learning. To that end we install the OneR package from CRAN (the official package repository of R): Tools -> Install Packages… -> type in “OneR” -> click “Install”.
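
Alternatively, you can install packages directly from the console:

install.packages("OneR") # same effect as using the RStudio menu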

After that we build a simple model on the iris dataset to predict the Species column:

library(OneR) # load package
data <- optbin(Species ~., data = iris) # find optimal bins for numeric predictors
model <- OneR(data, verbose = TRUE) # build actual model
## 
##     Attribute    Accuracy
## 1 * Petal.Width  96%     
## 2   Petal.Length 95.33%  
## 3   Sepal.Length 74.67%  
## 4   Sepal.Width  55.33%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model) # show rules
## 
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
## 
## Rules:
## If Petal.Width = (0.0976,0.791] then Species = setosa
## If Petal.Width = (0.791,1.63]   then Species = versicolor
## If Petal.Width = (1.63,2.5]     then Species = virginica
## 
## Accuracy:
## 144 of 150 instances classified correctly (96%)
## 
## Contingency table:
##             Petal.Width
## Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
##   setosa               * 50            0          0  50
##   versicolor              0         * 48          2  50
##   virginica               0            4       * 46  50
##   Sum                    50           52         48 150
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 266.35, df = 4, p-value < 2.2e-16

plot(model)

We’ll now see how well the model is doing:

prediction <- predict(model, data)
eval_model(prediction, data)
## 
## Confusion matrix (absolute):
##             Actual
## Prediction   setosa versicolor virginica Sum
##   setosa         50          0         0  50
##   versicolor      0         48         4  52
##   virginica       0          2        46  48
##   Sum            50         50        50 150
## 
## Confusion matrix (relative):
##             Actual
## Prediction   setosa versicolor virginica  Sum
##   setosa       0.33       0.00      0.00 0.33
##   versicolor   0.00       0.32      0.03 0.35
##   virginica    0.00       0.01      0.31 0.32
##   Sum          0.33       0.33      0.33 1.00
## 
## Accuracy:
## 0.96 (144/150)
## 
## Error rate:
## 0.04 (6/150)
## 
## Error rate reduction (vs. base rate):
## 0.94 (p-value < 2.2e-16)

96% accuracy is not too bad, even for this simple dataset!

If you want to know more about the OneR package you can read the vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models.

Well, and that’s it for the ultimate introduction to R – hopefully you liked it and you learned something! Please share your first experiences with R in the comments and also if you miss something (I might add it in the future!) – Thank you for reading and stay tuned for more to come!

Was the Bavarian Abitur too hard this time?


Bavaria is known for its famous Oktoberfest… and within Germany also for its reputedly difficult Abitur, a qualification granted by university-preparatory schools in Germany.

A mandatory part for all students is maths. This year many students protested that the maths part was way too hard; they even started an online petition with more than seventy thousand supporters at the time of writing!

It is not clear yet whether their marks will be adjusted upwards; the ministry of education is investigating the case. As a professor in Bavaria who also teaches statistics I will take the opportunity to share with you an actual question from the original examination together with a solution, so read on…

Let us have a look at the first (and easiest) question in the stochastics part:

Every sixth visitor to the Oktoberfest wears a gingerbread heart around his neck. During the festival 25 visitors are chosen at random. Determine the probability that at most one of the selected visitors will have a gingerbread heart.

Before you read on try to solve the task yourself…

Of course students are not allowed to use R in the examination, but in general the good thing about this kind of question is that if you have no idea how to solve it analytically, solving it by simulation is often much easier:

set.seed(12)
N <- 1e7
v <- matrix(sample(c(0L, 1L), size = 25 * N, replace = TRUE, prob = c(5/6, 1/6)), ncol = 25)
sum(rowSums(v) <= 1) / N
## [1] 0.062936

The answer is about 6.3%.

Now for the analytical solution: “At most one” implies that we have to differentiate between two cases, “no gingerbread heart” and “exactly one gingerbread heart”. “No gingerbread heart” has probability $\left(\frac{5}{6}\right)^{25}$. “Exactly one gingerbread heart” has probability $25\cdot\frac{1}{6}\cdot\left(\frac{5}{6}\right)^{24}$ because there are 25 possibilities of where the gingerbread heart could occur. We have to add both probabilities:

(5/6)^25 + 25*1/6*(5/6)^24
## [1] 0.06289558

If you know a little bit about probability distributions you will recognize the above as the binomial distribution:

pbinom(q = 1, size = 25, prob = 1/6)
## [1] 0.06289558
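
Equivalently, you can sum the two individual binomial probabilities with dbinom (just to make the connection explicit):

sum(dbinom(0:1, size = 25, prob = 1/6))
## [1] 0.06289558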

Of course it is a little unfair to judge just on the basis of the easiest task and without knowing the general maths level that is required. But still, I would like to hear your opinion on this. Also, outsiders’ views from different countries and different educational systems are very welcome! So, what do you think:

Was the Bavarian Abitur too hard this time? Please leave your reply in the comment section below!

Separating the Signal from the Noise: Robust Statistics for Pedestrians


One of the problems of navigating an autonomous car through a city is to extract robust signals in the face of all the noise that is present in the different sensors. Just taking something like an arithmetic mean of all the data points could possibly end in a catastrophe: if a part of a wall looks similar to the street and the algorithm calculates an average trajectory of the two, this would end in leaving the road and possibly crashing into pedestrians. So we need some robust algorithm to get rid of the noise. The area of statistics that deals especially with such problems is called robust statistics and the methods used therein are called robust estimation.

Now, one of the problems is that one doesn’t know what is signal and what is noise. The big idea behind RANSAC (short for RAndom SAmple Consensus) is to get rid of outliers by basically taking as many points as possible which form a well-defined region and leaving out the others. It does that iteratively, similar to the famous k-means algorithm (the topic of one of the upcoming posts, so stay tuned…).

To really understand how RANSAC works we will now build it with R. We will take a simple linear regression as an example and make it robust against outliers.

For educational purposes we will do this step by step:

  1. Understanding the general outline of the algorithm.
  2. Looking at the steps in more detail.
  3. Expressing the steps in pseudocode.
  4. Translating this into R!

Conveniently enough, Wikipedia gives a good outline of the algorithm and even provides us with very clear pseudocode, which will serve as the basis for our own R implementation (the given Matlab code is not very good in my opinion and has nothing to do with the pseudocode):

The RANSAC algorithm is essentially composed of two steps that are iteratively repeated:

  1. In the first step, a sample subset containing minimal data items is randomly selected from the input dataset. A fitting model and the corresponding model parameters are computed using only the elements of this sample subset. The cardinality of the sample subset is the smallest sufficient to determine the model parameters.
  2. In the second step, the algorithm checks which elements of the entire dataset are consistent with the model instantiated by the estimated model parameters obtained from the first step. A data element will be considered as an outlier if it does not fit the fitting model instantiated by the set of estimated model parameters within some error threshold that defines the maximum deviation attributable to the effect of noise. The set of inliers obtained for the fitting model is called consensus set. The RANSAC algorithm will iteratively repeat the above two steps until the obtained consensus set in certain iteration has enough inliers.

In more detail:

RANSAC achieves its goal by repeating the following steps:

  1. Select a random subset of the original data. Call this subset the hypothetical inliers.
  2. A model is fitted to the set of hypothetical inliers.
  3. All other data are then tested against the fitted model. Those points that fit the estimated model well, according to some model-specific loss function, are considered as part of the consensus set.
  4. The estimated model is reasonably good if sufficiently many points have been classified as part of the consensus set.
  5. Afterwards, the model may be improved by reestimating it using all members of the consensus set.

This procedure is repeated a fixed number of times, each time producing either a model which is rejected because too few points are part of the consensus set, or a refined model together with a corresponding consensus set size. In the latter case, we keep the refined model if its consensus set is larger than the previously saved model.

Now, this can be expressed in pseudocode:

Given:
    data - a set of observed data points
    model - a model that can be fitted to data points
    n - minimum number of data points required to fit the model
    k - maximum number of iterations allowed in the algorithm
    t - threshold value to determine when a data point fits a model
    d - number of close data points required to assert that a model fits well to data

Return:
    bestfit - model parameters which best fit the data (or nul if no good model is found)

iterations = 0
bestfit = nul
besterr = something really large
while iterations < k {
    maybeinliers = n randomly selected values from data
    maybemodel = model parameters fitted to maybeinliers
    alsoinliers = empty set
    for every point in data not in maybeinliers {
        if point fits maybemodel with an error smaller than t
             add point to alsoinliers
    }
    if the number of elements in alsoinliers is > d {
        % this implies that we may have found a good model
        % now test how good it is
        bettermodel = model parameters fitted to all points in maybeinliers and alsoinliers
        thiserr = a measure of how well model fits these points
        if thiserr < besterr {
            bestfit = bettermodel
            besterr = thiserr
        }
    }
    increment iterations
}
return bestfit

It is quite easy to convert this into valid R code (as a learner of R you should try it yourself before looking at my solution!):

ransac <- function(data, n, k, t, d) {
  iterations <- 0
  bestfit <- NULL
  besterr <- Inf # "something really large"
  while (iterations < k) {
    # fit a model to a random sample of n points
    maybeinliers <- sample(nrow(data), n)
    maybemodel <- lm(y ~ x, data = data, subset = maybeinliers)
    alsoinliers <- NULL
    # collect all other points whose distance to the fitted line is smaller than t
    # (distance of point (x0, y0) to the line y = a + b*x is |b*x0 - y0 + a| / sqrt(b^2 + 1))
    for (point in setdiff(1:nrow(data), maybeinliers)) {
      a <- maybemodel$coefficients[1]
      b <- maybemodel$coefficients[2]
      if (abs(b * data[point, 1] - data[point, 2] + a) / sqrt(b^2 + 1) < t)
        alsoinliers <- c(alsoinliers, point)
    }
    # if the consensus set is big enough, refit the model on all inliers
    # and keep it if it is the best one found so far
    if (length(alsoinliers) > d) {
      bettermodel <- lm(y ~ x, data = data, subset = c(maybeinliers, alsoinliers))
      thiserr <- summary(bettermodel)$sigma
      if (thiserr < besterr) {
        bestfit <- bettermodel
        besterr <- thiserr
      }
    }
    iterations <- iterations + 1
  }
  bestfit
}

We now test this with some sample data:

data <- read.csv("data/RANSAC.csv")
plot(data)
abline(lm(y ~ x, data = data))
set.seed(3141)
abline(ransac(data, n = 10, k = 10, t = 0.5, d = 10), col = "blue")

The black line is a plain vanilla linear regression; the blue line is the RANSAC-enhanced version. As you can see: no more crashing into innocent pedestrians. 🙂
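
In case you do not have the csv-file at hand, here is a small sketch that creates hypothetical data with a similar structure (a noisy linear relationship plus a block of outliers); the data and the parameter values are my own guesses, not the ones used above:

# hypothetical test data: a noisy line plus a cluster of outliers above it
set.seed(123)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.3)
outliers <- data.frame(x = runif(20, 0, 10), y = runif(20, 8, 10))
synthetic <- rbind(data.frame(x, y), outliers)

plot(synthetic)
abline(lm(y ~ x, data = synthetic)) # ordinary regression, pulled towards the outliers
abline(ransac(synthetic, n = 10, k = 50, t = 0.5, d = 50), col = "blue") # robust fit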

Learning Data Science: Predicting Income Brackets


As promised in the post Learning Data Science: Modelling Basics we will now go a step further and try to predict income brackets with real world data and different modelling approaches. We will learn a thing or two along the way, e.g. about the so-called Accuracy-Interpretability Trade-Off, so read on…

The data we will use is from here: Marketing data set. The description reads:

This dataset contains questions from questionnaires that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the Annual Income of Household from the other 13 demographics attributes.

The following extra information (or metadata) is provided with the data:

cat(readLines('data/marketing-names.txt'), sep = '\n')
Marketing data set
1: Description.
This dataset contains questions from questionaries that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the Anual Income of Household from the other 13 demographics  attributes.
2: Type.            Classification
3: Origin.          Real world
4: Instances.       6876 (8993)
5: Features.        14
6: Classes          9
7: Missing values.  Yes
8: Header.
@relation marketing
@attribute Sex integer [1, 2]
@attribute MaritalStatus integer [1, 5]
@attribute Age integer [1, 7]
@attribute Education integer [1, 6]
@attribute Occupation integer [1, 9]
@attribute YearsInSf integer [1, 5]
@attribute DualIncome integer [1, 3]
@attribute HouseholdMembers integer [1, 9]
@attribute Under18 integer [0, 9]
@attribute HouseholdStatus integer [1, 3]
@attribute TypeOfHome integer [1, 5]
@attribute EthnicClass integer [1, 8]
@attribute Language integer [1, 3]
@attribute Income {1, 2, 3, 4, 5, 6, 7, 8, 9}
@inputs Sex, MaritalStatus, Age, Education, Occupation, YearsInSf, DualIncome, HouseholdMembers, Under18, HouseholdStatus, TypeOfHome, EthnicClass, Language
@outputs Income
DATA DICTIONARY
 1    ANNUAL INCOME OF HOUSEHOLD (PERSONAL INCOME IF SINGLE)
             1. Less than $10,000
             2. $10,000 to $14,999
             3. $15,000 to $19,999
             4. $20,000 to $24,999
             5. $25,000 to $29,999
             6. $30,000 to $39,999
             7. $40,000 to $49,999
             8. $50,000 to $74,999
             9. $75,000 or more
             
 2    SEX
             1. Male
             2. Female
  3    MARITAL STATUS
             1. Married
             2. Living together, not married
             3. Divorced or separated
             4. Widowed
             5. Single, never married
  4    AGE
             1. 14 thru 17
             2. 18 thru 24
             3. 25 thru 34
             4. 35 thru 44
             5. 45 thru 54
             6. 55 thru 64
             7. 65 and Over
  5    EDUCATION
             1. Grade 8 or less
             2. Grades 9 to 11
             3. Graduated high school
             4. 1 to 3 years of college
             5. College graduate
             6. Grad Study
 6    OCCUPATION
             1. Professional/Managerial
             2. Sales Worker
             3. Factory Worker/Laborer/Driver
             4. Clerical/Service Worker
             5. Homemaker
             6. Student, HS or College
             7. Military
             8. Retired
             9. Unemployed
  7    HOW LONG HAVE YOU LIVED IN THE SAN FRAN./OAKLAND/SAN JOSE AREA?
             1. Less than one year
             2. One to three years
             3. Four to six years
             4. Seven to ten years
             5. More than ten years
  8    DUAL INCOMES (IF MARRIED)
             1. Not Married
             2. Yes
             3. No
  9    PERSONS IN YOUR HOUSEHOLD
             1. One
             2. Two
             3. Three
             4. Four
             5. Five
             6. Six
             7. Seven
             8. Eight
             9. Nine or more

 10    PERSONS IN HOUSEHOLD UNDER 18
             0. None
             1. One
             2. Two
             3. Three
             4. Four
             5. Five
             6. Six
             7. Seven
             8. Eight
             9. Nine or more
 11    HOUSEHOLDER STATUS
             1. Own
             2. Rent
             3. Live with Parents/Family
 12    TYPE OF HOME
             1. House
             2. Condominium
             3. Apartment
             4. Mobile Home
             5. Other
 13    ETHNIC CLASSIFICATION
             1. American Indian
             2. Asian
             3. Black
             4. East Indian
             5. Hispanic
             6. Pacific Islander
             7. White
             8. Other
  14    WHAT LANGUAGE IS SPOKEN MOST OFTEN IN YOUR HOME?
             1. English
             2. Spanish
             3. Other

Our task is to predict the variable “Income”.

So, let us first load the data (you can find the correctly formatted csv-file here: marketing.csv), have a look at some of its characteristics and perform a little bit of additional formatting. After that we divide it into a training set (80%) and a test set (20%) so that we can detect potential overfitting (also see Learning Data Science: Modelling Basics):

data <- read.csv("data/marketing.csv")
dim(data)
## [1] 6876   14

str(data)
## 'data.frame':    6876 obs. of  14 variables:
##  $ Sex             : int  1 2 2 2 1 1 1 1 1 1 ...
##  $ MaritalStatus   : int  1 1 5 5 1 5 3 1 1 5 ...
##  $ Age             : int  5 3 1 1 6 2 3 6 7 2 ...
##  $ Education       : int  5 5 2 2 4 3 4 3 4 4 ...
##  $ Occupation      : int  5 1 6 6 8 9 3 8 8 9 ...
##  $ YearsInSf       : int  5 5 5 3 5 4 5 5 4 5 ...
##  $ DualIncome      : int  3 2 1 1 3 1 1 3 3 1 ...
##  $ HouseholdMembers: int  5 3 4 4 2 3 1 3 2 1 ...
##  $ Under18         : int  2 1 2 2 0 1 0 0 0 0 ...
##  $ HouseholdStatus : int  1 2 3 3 1 2 2 2 2 2 ...
##  $ TypeOfHome      : int  1 3 1 1 1 3 3 3 3 3 ...
##  $ EthnicClass     : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ Language        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Income          : int  9 9 1 1 8 1 6 2 4 1 ...

data_names <- names(data)
data <- cbind(data[-ncol(data)], factor(data$Income)) # make variable Income (which should be predicted) a factor
names(data) <- data_names

set.seed(12)
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- data[random, ]
data_test <- data[-random, ]

We start with a simple and comprehensible model, OneR (on CRAN), as a benchmark:

library(OneR)
data <- optbin(data_train)
model <- OneR(data, verbose = TRUE)
## 
##     Attribute        Accuracy
## 1 * Age              28.2%   
## 2   MaritalStatus    28.11%  
## 3   Occupation       28.07%  
## 4   HouseholdStatus  27.56%  
## 5   DualIncome       27.04%  
## 6   Education        25.98%  
## 7   HouseholdMembers 22.51%  
## 8   Under18          20.69%  
## 9   TypeOfHome       19.36%  
## 10  EthnicClass      19.29%  
## 11  Sex              18.07%  
## 12  Language         17.82%  
## 13  YearsInSf        17.75%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model)
## 
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
## 
## Rules:
## If Age = 1 then Income = 1
## If Age = 2 then Income = 1
## If Age = 3 then Income = 6
## If Age = 4 then Income = 8
## If Age = 5 then Income = 8
## If Age = 6 then Income = 8
## If Age = 7 then Income = 6
## 
## Accuracy:
## 1551 of 5500 instances classified correctly (28.2%)
## 
## Contingency table:
##       Age
## Income     1     2     3     4     5    6    7  Sum
##    1   * 421 * 352    99    43    21   15   25  976
##    2      16   204   107    39    13   22   33  434
##    3       9   147   122    49    12   21   35  395
##    4       5   121   188    71    39   29   42  495
##    5       3    77   179    81    29   23   34  426
##    6      10    93 * 234   156    70   56 * 47  666
##    7      12    92   185   155   107   66   33  650
##    8      12   111   211 * 251 * 160 * 86   44  875
##    9      11    76   114   187   104   69   22  583
##    Sum   499  1273  1439  1032   555  387  315 5500
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 2671.2, df = 48, p-value < 2.2e-16

plot(model)

prediction <- predict(model, data_test)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1    232   45   46   32   33   27   19   27   24  485
##        2      0    0    0    0    0    0    0    0    0    0
##        3      0    0    0    0    0    0    0    0    0    0
##        4      0    0    0    0    0    0    0    0    0    0
##        5      0    0    0    0    0    0    0    0    0    0
##        6     31   30   44   44   41   66   44   57   50  407
##        7      0    0    0    0    0    0    0    0    0    0
##        8     16   20   20   47   27   87   71  110   86  484
##        9      0    0    0    0    0    0    0    0    0    0
##        Sum  279   95  110  123  101  180  134  194  160 1376
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1   0.17 0.03 0.03 0.02 0.02 0.02 0.01 0.02 0.02 0.35
##        2   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        3   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        4   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        5   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        6   0.02 0.02 0.03 0.03 0.03 0.05 0.03 0.04 0.04 0.30
##        7   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        8   0.01 0.01 0.01 0.03 0.02 0.06 0.05 0.08 0.06 0.35
##        9   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00
## 
## Accuracy:
## 0.2965 (408/1376)
## 
## Error rate:
## 0.7035 (968/1376)
## 
## Error rate reduction (vs. base rate):
## 0.1176 (p-value < 2.2e-16)

What can we conclude from this? First, the most important feature is “Age”, while “Marital Status”, “Occupation” and “Household Status” perform comparably well. The overall accuracy (i.e. the percentage of correctly predicted instances), at about 30%, is not that great; on the other hand, that is not extraordinarily uncommon when dealing with real-world data. Looking at the model itself, in the form of rules and the diagnostic plot, we can see that there is a non-linear relationship between Age and Income: the older one gets, the higher the income bracket, except for people who are old enough to retire. This seems plausible.

OneR is basically a one-step decision tree, so the natural next step in our analysis is to build a full decision tree (all packages are on CRAN):

library(rpart)
library(rpart.plot)
model <- rpart(Income ~., data = data_train)
rpart.plot(model, type = 3, extra = 0, box.palette = "Grays")

prediction <- predict(model, data_test, type = "class")
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1    201   36   22   13   16   12    8   15   12  335
##        2     43   25   32   22   17   12   10   14    6  181
##        3      0    0    0    0    0    0    0    0    0    0
##        4      0    0    0    0    0    0    0    0    0    0
##        5      0    0    0    0    0    0    0    0    0    0
##        6     18   24   40   50   42   68   32   33   22  329
##        7      0    0    0    0    0    0    0    0    0    0
##        8     17   10   16   38   26   88   84  132  120  531
##        9      0    0    0    0    0    0    0    0    0    0
##        Sum  279   95  110  123  101  180  134  194  160 1376
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1   0.15 0.03 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.24
##        2   0.03 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.00 0.13
##        3   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        4   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        5   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        6   0.01 0.02 0.03 0.04 0.03 0.05 0.02 0.02 0.02 0.24
##        7   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        8   0.01 0.01 0.01 0.03 0.02 0.06 0.06 0.10 0.09 0.39
##        9   0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##        Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00
## 
## Accuracy:
## 0.3096 (426/1376)
## 
## Error rate:
## 0.6904 (950/1376)
## 
## Error rate reduction (vs. base rate):
## 0.134 (p-value < 2.2e-16)

The new model has an accuracy that is a little bit better, but the interpretability is a little bit worse. You have to go through the different branches to see in which income bracket a case ends up. So, for example, when the age bracket is below 2 (which means that it is 1) the model predicts income bracket 1; when it is bigger than 2 and the Household Status bracket is 1 it predicts income bracket 8, and so on. The most important variable, as you can see, is again Age (which is reassuring, since it means OneR was on the right track), but we also see Household Status and Occupation again.

What is better than one tree? Many trees! So the next natural step is to build many trees, while varying the features and the examples that go into each tree. In the end different trees may give different income brackets as their respective prediction, which we resolve by voting, as in a good democracy. This whole method is fittingly called Random Forests and fortunately there are many good packages available in R. We use the randomForest package (also on CRAN) here:

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.

set.seed(4543) # for reproducibility
model <- randomForest(Income ~., data = data_train, importance = TRUE)
varImpPlot(model)

prediction <- predict(model, data_test)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1    223   35   26   16   19   11    9   18   16  373
##        2     24   15   12   20    7   11    1    4    1   95
##        3      9   10   11   14    9    6    3    4    2   68
##        4      5   15   25   22   10   22    6    9    5  119
##        5      2    2    8    9    6   12    6    3    1   49
##        6      3    5   15   14   19   40   23   17   15  151
##        7      8    4    7   13   14   26   25   24    5  126
##        8      3    8    5   11   13   44   49   87   68  288
##        9      2    1    1    4    4    8   12   28   47  107
##        Sum  279   95  110  123  101  180  134  194  160 1376
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2    3    4    5    6    7    8    9  Sum
##        1   0.16 0.03 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.27
##        2   0.02 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.07
##        3   0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.05
##        4   0.00 0.01 0.02 0.02 0.01 0.02 0.00 0.01 0.00 0.09
##        5   0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.04
##        6   0.00 0.00 0.01 0.01 0.01 0.03 0.02 0.01 0.01 0.11
##        7   0.01 0.00 0.01 0.01 0.01 0.02 0.02 0.02 0.00 0.09
##        8   0.00 0.01 0.00 0.01 0.01 0.03 0.04 0.06 0.05 0.21
##        9   0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.03 0.08
##        Sum 0.20 0.07 0.08 0.09 0.07 0.13 0.10 0.14 0.12 1.00
## 
## Accuracy:
## 0.3459 (476/1376)
## 
## Error rate:
## 0.6541 (900/1376)
## 
## Error rate reduction (vs. base rate):
## 0.1796 (p-value < 2.2e-16)

As an aside you can see that the basic modelling workflow stayed the same – no matter which approach (OneR, decision tree or random forest) you chose. This standard is kept by most (modern) packages and is one of the great strengths of R! With thousands of packages on CRAN alone there are of course sometimes differences, but those are normally well documented (so the help function or the vignette are your friends!).
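
To make this shared interface explicit, here is a minimal sketch (the helper function and its name are my own, not part of any package) that wraps the three steps used above:

# sketch of the shared workflow: fit with the formula interface,
# predict on the test set, evaluate with eval_model() from the OneR package
# (some predict methods need extra arguments, e.g. type = "class" for rpart)
evaluate_workflow <- function(fit_fun, train, test, ...) {
  model <- fit_fun(Income ~ ., data = train, ...)
  prediction <- predict(model, test)
  eval_model(prediction, test)
}
# e.g. evaluate_workflow(randomForest, data_train, data_test)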

Now, back to the result of the last analysis: again, the overall accuracy is better, but because we have hundreds of trees now the interpretability suffered a lot. This is also known as the Accuracy-Interpretability Trade-Off. The best we can do in the case of random forests is to find out which features are the most important ones. Again Age, Occupation and Household Status show up, which is consistent with our analyses so far. Additionally, because of the many trees that had to be built, this analysis ran a lot longer than the other two.
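
If you prefer the raw numbers over the plot, the randomForest package also provides the importance function (output not shown here):

importance(model) # numeric variable importance measures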

Random forests are one of the best methods out of the box, so the accuracy of about 34% tells us that our first model (OneR) wasn’t doing too badly in the first place! We were able to reach comparatively good results with just one feature, and this is true for many real-world data sets. Sometimes simple models are not much worse than very complicated ones – you should keep that in mind!

If you play around with this dataset I would be interested in your results! Please post them in the comments – Thank you and stay tuned!

Learning R: The Collatz Conjecture


In this post we will see that a little bit of simple R code can go a very long way! So let’s get started!

One of the fascinating features of number theory (unlike many other branches of mathematics) is that many statements are easy to make but even the brightest minds are not able to prove them. The so-called Collatz conjecture (named after the German mathematician Lothar Collatz) is an especially fascinating example:


The Collatz conjecture states that when you start with any positive integer and repeatedly apply the following rule:

  • if it is even, the next number is one half of the previous number,
  • if it is odd, the next number is three times the previous number plus one,

then the sequence will always reach one.


It doesn’t get any simpler than that but no one has been able to prove this – and not for a lack of trying! The great mathematician Paul Erdős said about it “Mathematics may not be ready for such problems.” You can read more on Wikipedia: Collatz conjecture and watch an especially nice film that was made by a group of students:

So let us write a little program and try some numbers!

First we need a simple helper function to determine whether a number is even:

is.even <- function(x) {
  if (x %% 2 == 0) TRUE
  else FALSE
}

is.even(2)
## [1] TRUE

is.even(3)
## [1] FALSE

Normally we wouldn’t use a dot within function names but R itself (because of its legacy code) is not totally consistent here and the is-function family (like is.na or is.integer) all use a dot. After that we write a function for the rule itself, making use of the is.even function:

collatz <- function(n) {
  if (is.even(n)) n/2
  else 3 * n + 1
}

collatz(6)
## [1] 3

collatz(5)
## [1] 16

To try a number and plot it (like in the Wikipedia article) we could use a while-loop:

n_total <- n <- 27
while (n != 1) {
  n <- collatz(n)
  n_total <- c(n_total, n)
}

n_total
##   [1]   27   82   41  124   62   31   94   47  142   71  214  107  322  161
##  [15]  484  242  121  364  182   91  274  137  412  206  103  310  155  466
##  [29]  233  700  350  175  526  263  790  395 1186  593 1780  890  445 1336
##  [43]  668  334  167  502  251  754  377 1132  566  283  850  425 1276  638
##  [57]  319  958  479 1438  719 2158 1079 3238 1619 4858 2429 7288 3644 1822
##  [71]  911 2734 1367 4102 2051 6154 3077 9232 4616 2308 1154  577 1732  866
##  [85]  433 1300  650  325  976  488  244  122   61  184   92   46   23   70
##  [99]   35  106   53  160   80   40   20   10    5   16    8    4    2    1
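
Just counting the values shows how long this ride is:

length(n_total)
## [1] 112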

plot(n_total, type = "l", col = "blue", xlab = "", ylab = "")

As you can see, after a wild ride the sequence finally reaches one as expected. We end with some nerd humour from the cult website xkcd:

Source: xkcd

To understand Recursion you have to understand Recursion…


Sorting values is one of the bread-and-butter tasks in computer science: this post uses it as a use case to learn what recursion is all about. It starts with some nerd humour… and ends with some more, so read on!

Have a look at the following code and try to understand it:

bogosort <- function(x) {
  while(is.unsorted(x)) x <- sample(x)
  x
}

set.seed(123)
(n <- sample(1:100, 10))
##  [1] 29 79 41 86 91  5 50 83 51 42

bogosort(n)
##  [1]  5 29 41 42 50 51 79 83 86 91

Obviously bogosort did its job… but in a ludicrously inefficient way! It just shuffles all values over and over until they are sorted by chance! In the worst case it just shuffles forever (i.e. until the sun swallows the earth or your partner switches the computer off… whatever comes first).

To create a more efficient sorting algorithm one could use the following strategy:

  1. Draw the smallest number, put it into a vector and
  2. Sort the remaining numbers by going back to 1.

The algorithm (which is called selection sort, see below) effectively sorts by calling itself! This is the main idea behind recursion.

To dive deeper into how recursion works watch the following video:

So, basically

Recursion means defining a function in terms of itself!

Additionally you need some stopping criterion or base case, otherwise you would create an infinite loop.

The recursive version of the factorial function is:

    \[1!=1, n!=n(n-1)!\]

In the video this formula is written in C, here is the R version:

fact <- function(n) {
  if (n == 1) 1
  else n * fact(n-1)
}
fact(4)
## [1] 24

factorial(4) #inbuilt factorial function
## [1] 24

After we have seen how recursion actually works, let us implement the above selection sort algorithm in R. First watch this, admittedly unconventional, video:

We now implement the main idea recursively:

selsort <- function(x) {
  if(length(x) > 1) {
    mini <- which.min(x)
    c(x[mini], selsort(x[-mini]))
  } else x
}

set.seed(123)
(v <- sample(1:100, 20))
##  [1] 29 79 41 86 91  5 50 83 51 42 87 98 60 94  9 77 21  4 27 78

selsort(v)
##  [1]  4  5  9 21 27 29 41 42 50 51 60 77 78 79 83 86 87 91 94 98

Another, even more efficient sorting algorithm, quicksort, also works recursively. Watch the following video to get the idea:

So, the main idea behind the quicksort algorithm is to find a pivot for each unsorted subset of numbers which splits it into two similarly sized subsets. Then the sorting function calls itself for each subset.

Have a look at the following code:

qsort <- function(v) {
  if (length(v) > 1) {
    pivot <- (min(v) + max(v)) / 2
    c(qsort(v[v < pivot]), v[v == pivot], qsort(v[v > pivot])) 
  } else v
}

qsort(v)
##  [1]  4  5  9 21 27 29 41 42 50 51 60 77 78 79 83 86 87 91 94 98

Now for some speed comparisons between selection sort, quicksort and R’s sort function (on my lame computer…):

options(expressions = 7500)
N <- 7000

set.seed(123)
vs <- runif(N)

system.time(u <- selsort(vs))
##    user  system elapsed 
##    0.59    0.14    0.73

system.time(u <- qsort(vs))
##    user  system elapsed 
##    0.01    0.00    0.01

system.time(u <- sort(vs))
##    user  system elapsed 
##       0       0       0

Wow, although quicksort was implemented in R (and not in C like the built-in sort function) it is impressively fast. Why? Because each subset of unsorted numbers is again split into two subsets only about half as big. This pushes the sizes of the subsets that still have to be sorted down pretty fast. In the case of selection sort the subset only gets smaller by one number (the smallest one) in each step.

Let us end this post with a little easter egg from google – do you get it?

So, what is AI really?


One of the topics that is totally hyped at the moment is obviously Artificial Intelligence or AI for short. There are many self-proclaimed experts running around trying to sell you the stuff they have been doing all along under this new label. When you ask them what AI means you will normally get some convoluted explanations (which is a good sign that they don’t get it themselves) and some “success stories”. The truth is that many of those talking heads don’t really know what they are talking about, yet happen to have a friend who knows somebody who picked up a book at the local station bookshop… ok, that was nasty but unfortunately often not too far away from the truth.

So, what is AI really? This post tries to give some guidance.

The traditional way of coding a computer program was to carefully analyze the problem, separate it into simpler sub-problems, put the solution into an algorithmic form (a kind of recipe) and finally code it. Let’s have a look at an example!

Let’s say you want to program an app for mushroom pickers that warns about certain qualities. The idea is to find an easy-to-follow guide in the form of “if your mushroom has this quality, then DO NOT eat it!” (in computer lingo called “conditional statements” or just “conditionals”). As any mushroom picker can attest, this is not an easy task. For that matter, have a look at the following dataset (you can find it here: mushrooms, originally it is from https://archive.ics.uci.edu/ml/datasets/mushroom):

mushrooms <- read.csv("data/mushrooms.csv")
str(mushrooms)
## 'data.frame':    8124 obs. of  23 variables:
##  $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
##  $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
##  $ cap_color               : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
##  $ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill_color              : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
##  $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil_type               : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
##  $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
##  $ type                    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...

head(mushrooms)
##   cap_shape cap_surface cap_color bruises    odor gill_attachment
## 1    convex      smooth     brown     yes pungent            free
## 2    convex      smooth    yellow     yes  almond            free
## 3      bell      smooth     white     yes   anise            free
## 4    convex       scaly     white     yes pungent            free
## 5    convex      smooth      gray      no    none            free
## 6    convex       scaly    yellow     yes  almond            free
##   gill_spacing gill_size gill_color stalk_shape stalk_root
## 1        close    narrow      black   enlarging      equal
## 2        close     broad      black   enlarging       club
## 3        close     broad      brown   enlarging       club
## 4        close    narrow      brown   enlarging      equal
## 5      crowded     broad      black    tapering      equal
## 6        close     broad      brown   enlarging       club
##   stalk_surface_above_ring stalk_surface_below_ring stalk_color_above_ring
## 1                   smooth                   smooth                  white
## 2                   smooth                   smooth                  white
## 3                   smooth                   smooth                  white
## 4                   smooth                   smooth                  white
## 5                   smooth                   smooth                  white
## 6                   smooth                   smooth                  white
##   stalk_color_below_ring veil_type veil_color ring_number  ring_type
## 1                  white   partial      white         one    pendant
## 2                  white   partial      white         one    pendant
## 3                  white   partial      white         one    pendant
## 4                  white   partial      white         one    pendant
## 5                  white   partial      white         one evanescent
## 6                  white   partial      white         one    pendant
##   spore_print_color population habitat      type
## 1             black  scattered   urban poisonous
## 2             brown   numerous grasses    edible
## 3             brown   numerous meadows    edible
## 4             black  scattered   urban poisonous
## 5             brown   abundant grasses    edible
## 6             black   numerous grasses    edible

The dataset consists of 8124 examples of mushrooms with 22 qualities each (plus the attribute whether the respective mushroom is edible or poisonous). Well, obviously this is not going to be easy…

A naive approach would be to formulate rules for every instance: if the cap shape is convex and the cap surface smooth and the cap colour brown… and so on for all 22 attributes, then DO NOT eat it! This would obviously not be very helpful. Another approach would be to go through every attribute and see whether it is helpful in determining the type, so for example:

table(mushrooms$cap_shape, mushrooms$type)
##          
##           edible poisonous
##   bell       404        48
##   conical      0         4
##   convex    1948      1708
##   flat      1596      1556
##   knobbed    228       600
##   sunken      32         0

Obviously this attribute isn’t very helpful: in many cases it just gives a “maybe, maybe not” answer. Perhaps the approach itself is not so bad after all, but you would have to try it for all 22 attributes, interpret the results, pick the best one, formulate if-then rules and code them… tiresome and error-prone.
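
To make the tedium concrete, here is a quick sketch of what evaluating just this one attribute would look like: predict the majority type for each cap shape and count how many mushrooms that classifies correctly (you would then have to repeat this – and the rule writing – for the remaining 21 attributes):

tab <- table(mushrooms$cap_shape, mushrooms$type) # the contingency table from above
sum(apply(tab, 1, max)) / nrow(mushrooms)         # share classified correctly by the majority type per cap shape: about 56%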

Wouldn’t it be nice to do it the other way around: just show the computer all of the examples and it magically programs itself by finding the appropriate if-then-rules automatically? This is what AI is all about:

Artificial Intelligence (AI): Showing a computer examples of a problem so that it programs itself to solve it.

So, let us throw AI at our problem in the form of the OneR package (on CRAN):

library(OneR)
OneR(mushrooms, verbose = TRUE)
## 
##     Attribute                Accuracy
## 1 * odor                     98.52%  
## 2   spore_print_color        86.8%   
## 3   gill_color               80.5%   
## 4   ring_type                77.55%  
## 5   stalk_surface_above_ring 77.45%  
## 6   stalk_surface_below_ring 76.61%  
## 7   gill_size                75.63%  
## 8   bruises                  74.4%   
## 9   population               72.18%  
## 10  stalk_color_above_ring   71.64%  
## 11  stalk_color_below_ring   71.44%  
## 12  habitat                  69.03%  
## 13  stalk_root               64.6%   
## 14  gill_spacing             61.6%   
## 15  cap_color                59.53%  
## 16  cap_surface              58.05%  
## 17  cap_shape                56.43%  
## 18  stalk_shape              55.29%  
## 19  ring_number              53.82%  
## 20  veil_color               51.9%   
## 21  gill_attachment          51.8%   
## 21  veil_type                51.8%   
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
## 
## Call:
## OneR.data.frame(x = mushrooms, verbose = TRUE)
## 
## Rules:
## If odor = almond   then type = edible
## If odor = anise    then type = edible
## If odor = creosote then type = poisonous
## If odor = fishy    then type = poisonous
## If odor = foul     then type = poisonous
## If odor = musty    then type = poisonous
## If odor = none     then type = edible
## If odor = pungent  then type = poisonous
## If odor = spicy    then type = poisonous
## 
## Accuracy:
## 8004 of 8124 instances classified correctly (98.52%)

Wow! In the blink of an eye and with just one command (OneR) we got all the rules we need for our app! It is the odour: if it smells poisonous it probably is. The accuracy of this rule is nearly 99%, not too bad for such a simple rule… we wouldn’t even need an app for that.
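
Just as a hedged aside (the OneR package also provides a formula interface and a predict method, which I won’t cover in detail here): you can store the model in a variable and use it to classify mushrooms, for example to build a confusion matrix with base R’s table:

model <- OneR(type ~ odor, data = mushrooms) # fit the single best rule, based on odor
prediction <- predict(model, mushrooms)      # classify (here: the training data again)
table(prediction, actual = mushrooms$type)   # confusion matrix of predicted vs. actual type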

For more examples, some deeper explanation (and even a video) on the OneR package go here: OneR – Establishing a New Baseline for Machine Learning Classification Models.

By the way, the phrase “programs itself” – impressive as it may sound – still contains the word “programming”, so we are not talking about machines developing intentions, feelings or even consciousness anytime soon. That is the domain of Hollywood, not of AI!

As with every hype there are many terms flying around, like Machine Learning (ML), Data Science and (Predictive) Analytics, and somewhat older terms like Data Mining, Knowledge Discovery and Business Intelligence… and many, many more. Of course you can define them in all sorts of ways, but to be honest with you, in essence they all mean the same thing (see the definition above). I would argue that Data Science is a somewhat broader term which also comprises e.g. the handling of data, interpretation by domain experts and presentation to the business, but that is just one definition. There is an old joke that Machine Learning, when put on PowerPoint, becomes Artificial Intelligence – it just sounds so much better than any technical term.

We can now answer another question: why now? Why is there a hype now? I will share a secret with you: most of the AI methods used today are quite old! For example, the core principles of (artificial) neural networks (a.k.a. deep learning) date back to the 1960s! Now, more than half a century later, we’ve got the hype. The reason is:

It’s the data, stupid!

Because AI is all about learning from examples, we need lots and lots of data, and because of the internet and mobile revolution we are now drowning in data. Combine this with more and more powerful hardware (also in the form of cheap cloud services) and you’ve got a revolution at your hands.

And there is yet another lesson to be learned: when I said “we” are drowning in data it was not entirely correct: tech companies like Google, Amazon, Facebook, Apple (GAFA) and their Chinese counterparts, like Baidu, Alibaba, and Tencent (BAT) are drowning in our (!) data. This is the reason why they are also leading the field of AI and not because they necessarily have the brightest AI geniuses.

This point is best illustrated by the example of DeepL: a small German startup in the area of machine translation. Many tests conducted by professional translators came to the conclusion that the translations by DeepL are far better than any other machine translations (like Google Translate). Why? Not because there necessarily is a secret sauce in their algorithms but because the company behind it (Linguee) had been amassing hand-curated bilingual language pairs for many, many years. The quality of their data is just so much better than that of all the other translation services out there. This is their main differentiator.

That is not to say that there haven’t been major developments in AI algorithms or computer hardware, or that you don’t need bright people to set up and fine-tune those systems (quite the contrary!). But many of the algorithms are open source, computing power is relatively cheap via cloud services and, yes, while there is a shortage of good people at the moment, there are still many highly qualified data scientists around: the real difference lies in (the quality and amount of) the data!

One last thing: many people think that AI is more objective than humans, after all it’s maths and cool logic, right? Wrong! When you only learn from examples and the examples given are racist, the learned rules will be racist too! A very good book on the subject is Weapons of Math Destruction by Cathy O’Neil.

I hope this gave you some perspective on the ongoing hype… perhaps hype is not such a good word after all, because when you look at the underlying reasons you can see that this mega trend is here to stay!

Learning Data Science: Modelling Basics


Data Science is all about building good models, so let us start by building a very simple model: we want to predict monthly income from age (in a later post we will see that age is indeed a good predictor for income).

For illustrative purposes we just make up some numbers for age and income, load both into a data frame, plot it and calculate its correlation matrix:

age <- c(21, 46, 55, 35, 28)
income <- c(1850, 2500, 2560, 2230, 1800)
data <- data.frame(age, income)
plot(data, pch = 16)

cor(data) # correlation
##              age    income
## age    1.0000000 0.9464183
## income 0.9464183 1.0000000

Just by looking at the plot we can see that the points seem to be linearly dependent. The correlation matrix corroborates this with a correlation of nearly 95%. To build a linear regression model we use the lm function (for linear model). To specify the model we use the standard R formula interface: on the left hand side comes the dependent variable, i.e. the variable that we want to predict (income) and on the right hand side comes the independent variable, i.e. the variable that we want to use to predict income, viz. age – in between both variables we put a tilde (~). After that we plot the model as a line and look at some model diagnostics:

# building linear regression model
LinReg <- lm(income ~ age) # assign model to variable
abline(LinReg, col = "green", lwd = 2) # add regression line

LinReg # coefficients
## 
## Call:
## lm(formula = income ~ age)
## 
## Coefficients:
## (Intercept)          age  
##     1279.37        24.56

summary(LinReg) # model summary
## 
## Call:
## lm(formula = income ~ age)
## 
## Residuals:
##       1       2       3       4       5 
##   54.92   90.98  -70.04   91.12 -166.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1279.367    188.510   6.787  0.00654 **
## age           24.558      4.838   5.076  0.01477 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132.1 on 3 degrees of freedom
## Multiple R-squared:  0.8957, Adjusted R-squared:  0.8609 
## F-statistic: 25.77 on 1 and 3 DF,  p-value: 0.01477

The general line equation is \(y = mx + b\), where \(m\) is the slope (in this case 24.558) and \(b\) the intercept (here 1279.367). The p-values in the last column show us that both parameters are significant, so with a high probability not due to randomness. If we now want to predict the income of a 20-year-old we just have to calculate \(1279.367 + 20 \cdot 24.558 \approx 1770.5\).
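
As a small sketch, the same number can also be computed in R directly from the stored coefficients:

# the prediction "by hand": intercept + slope * age
coef(LinReg)[1] + coef(LinReg)[2] * 20 # about 1770, the same value predict will give us below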

Instead of doing this by hand we can conveniently use the predict function, which does the calculation for us automatically. We can also use vectors as input and get several predictions at once. It is important that we pass the independent variable by its name in a data frame, otherwise R will throw an error.

# predicting income with linear model
predict(LinReg, data.frame(age = 20))
##       1 
## 1770.52
pred_LinReg <- predict(LinReg, data.frame(age = seq(from = 0, to = 80, by = 5)))
names(pred_LinReg) <- seq(0, 80, 5)
pred_LinReg
##        0        5       10       15       20       25       30       35 
## 1279.367 1402.155 1524.944 1647.732 1770.520 1893.308 2016.097 2138.885 
##       40       45       50       55       60       65       70       75 
## 2261.673 2384.461 2507.249 2630.038 2752.826 2875.614 2998.402 3121.190 
##       80 
## 3243.979

So far so good, but wouldn’t it be nice to have a model that is even more accurate, a model where the line actually goes through all the data points so that all the available information is being used? Let us do that with a so-called polynomial regression (in red):

# polynomial regression
PolyReg <- lm(income ~ poly(age, 4, raw = TRUE))
lines(c(20:55), predict(PolyReg, data.frame(age = c(20:55))), col = "red")

pred_PolyReg <- predict(PolyReg, data.frame(age = seq(0, 80, 5)))
names(pred_PolyReg) <- seq(0, 80, 5)
pred_PolyReg
##         0         5        10        15        20        25        30 
## 19527.182 11103.509  5954.530  3164.488  1959.098  1705.541  1912.468 
##        35        40        45        50        55        60        65 
##  2230.000  2449.726  2504.705  2469.464  2560.000  3133.778  4689.732 
##        70        75        80 
##  7868.267 13451.255 22362.037

What do you think? Is this a better model? Well, there is no good reason why income should go up the younger or older you are, and there is no good reason why there should be a bump between 50 and 55. In other words, this model treats the noise as legitimate input, which actually makes it worse! To get a good model you always need some rigidity so that it generalizes beyond the given data.
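
You can see how hard the polynomial model hugs the data by comparing the residuals of both models (just a quick sketch): the degree-4 polynomial goes exactly through the five points, so its residuals are essentially zero.

round(resid(LinReg), 2)  # linear model: some error is left over for each of the five points
round(resid(PolyReg), 2) # degree-4 polynomial through five points: (numerically) zero residuals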

This illustration is a good example of one of the big problems of machine learning: overfitting the data! The other extreme would be underfitting, which is included in the following plot as a blue line:

# mean income as a model
abline(h = mean(income), col = "blue")

title(main = "All models are wrong, some are useful")
legend("bottomright", c("Mean: Underfitting", "Linear fit", "Polynomial Reg.: Overfitting"), col = c("blue", "green", "red"), lwd = 3)

mean(income)
## [1] 2188

We just took the mean value of income and claimed that, no matter the age, everybody earns about 2200. This obviously doesn’t make the cut either. In practice, in every data science project you have to look for a model that hits the sweet spot between over- and underfitting. This is as much an art as a science, although there are certain best practices, like dividing the data into a training and a test set (see the little sketch below), cross-validation etc.
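
Just to give an idea of what such a split looks like, here is a minimal sketch on a hypothetical, larger data frame df with age and income columns (our five made-up points are of course far too few for this):

set.seed(123)
idx <- sample(nrow(df), round(0.8 * nrow(df)))           # df is a hypothetical, larger data set
train <- df[idx, ]                                       # 80% of the rows to fit the model on
test  <- df[-idx, ]                                      # 20% held back for an honest evaluation
model_train <- lm(income ~ age, data = train)
sqrt(mean((test$income - predict(model_train, test))^2)) # root mean squared error on unseen data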

Another important point is that all models have their area of applicability. In physics, for example, quantum physics works well for the very small but not very well for the very big; for relativity theory it is the other way round. In our case we didn’t have data for people of very young or very old age, so it is in general very dangerous to make predictions for them, i.e. for values outside of the original observation range (also called extrapolation – in contrast to interpolation, predictions within the original range). In the case of the linear model we got an income of about 1280 for newborns (the intercept)! And the income keeps rising the older people get, which doesn’t make sense either. This behaviour was a lot worse for the non-linear model.

So, here you can see that it is one thing to just let R run and spit something out and quite another one to find a good model and interpret it correctly – even for such a simple problem as this one. In a later post we will use real-world data to predict income brackets with OneR, decision trees and random forests, so stay tuned!

Addendum
The post is now online: Learning Data Science: Predicting Income Brackets!

Hash Me If You Can

We are living in the era of Big Data but the problem of course is that the bigger our data sets become the slower even simple search operations get. I will now show you a trick that is the next best thing to magic: building a search function that practically doesn’t slow down even for large data sets… in base R!

At first this seems totally counterintuitive: the bigger the data set is, the longer it should take to search it, right? Wrong!

The data structure we will be talking about is called a hash or a dictionary (sometimes also called an associative array). The big idea is to use a mathematical function (called a hash function) which maps each data item (e.g. a name) to an address (called a hash) where the corresponding value (e.g. a telephone number) is stored. So to find the telephone number for a certain name you don’t have to search through all the names, you just put the name into the hash function and get back the address of the telephone number instantaneously:

Source: wikimedia

This image is from the very good wikipedia article on hash function algorithms: Hash function. It also gives a trivial example of a hash function to get the idea:

If the data to be hashed is small enough, one can use the data itself (reinterpreted as an integer) as the hashed value. The cost of computing this “trivial” (identity) hash function is effectively zero. This hash function is perfect, as it maps each input to a distinct hash value.
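
A toy version of this identity hash in R could look like this (just a sketch with a plain vector, using small integer keys as their own addresses):

phone <- numeric(100) # 100 empty "slots"
phone[42] <- 5551234  # the key 42 is used directly as the address
phone[42]             # look it up without searching anything
## [1] 5551234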

To do this with R we use environments with the option hash = TRUE. The following example is from an answer I gave on stackoverflow (https://stackoverflow.com/a/42350672/468305):

# vectorize assign, get and exists for convenience
assign_hash <- Vectorize(assign, vectorize.args = c("x", "value"))
get_hash <- Vectorize(get, vectorize.args = "x")
exists_hash <- Vectorize(exists, vectorize.args = "x")

# keys and values
key <- c("tic", "tac", "toe")
value <- c(1, 22, 333)

# initialize hash
hash <- new.env(hash = TRUE, parent = emptyenv(), size = 100L)

# assign values to keys
assign_hash(key, value, hash)
## tic tac toe 
##   1  22 333

# get values for keys
get_hash(c("toe", "tic"), hash)
## toe tic 
## 333   1

# alternatively:
mget(c("toe", "tic"), hash)
## $toe
## [1] 333
## 
## $tic
## [1] 1

# show all keys
ls(hash)
## [1] "tac" "tic" "toe"

# show all keys with values
get_hash(ls(hash), hash)
## tac tic toe 
##  22   1 333

# remove key-value pairs
rm(list = c("toe", "tic"), envir = hash)
get_hash(ls(hash), hash)
## tac 
##  22

# check if keys are in hash
exists_hash(c("tac", "nothere"), hash)
##     tac nothere 
##    TRUE   FALSE

# for single keys this is also possible:
# show value for single key
hash[["tac"]]
## [1] 22

# create new key-value pair
hash[["test"]] <- 1234
get_hash(ls(hash), hash)
##  tac test 
##   22 1234

# update single value
hash[["test"]] <- 54321
get_hash(ls(hash), hash)
##   tac  test 
##    22 54321

So you see that using the inbuilt hash functionality is quite simple. To get an idea of the performance boost there is a very thorough article here: http://jeffreyhorner.tumblr.com/post/114524915928/hash-table-performance-in-r-part-i.

Have a look at the following plot from the article:

Horner writes:

Bam! See that blue line? That’s near constant time for searching the entire 2^15 size hash table!
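
If you want to get a rough feeling for this effect on your own machine, here is a minimal benchmark sketch (timings will of course differ): looking up a key in a plain named vector means searching through all the names, looking it up in a hashed environment does not.

n <- 100000
keys <- paste0("key", seq_len(n))
named_vec <- setNames(seq_len(n), keys)                    # plain named vector
hash_env <- new.env(hash = TRUE, parent = emptyenv())
for (i in seq_len(n)) assign(keys[i], i, envir = hash_env) # fill the hashed environment

system.time(for (j in 1:1000) named_vec[["key99999"]])           # linear search through the names
system.time(for (j in 1:1000) get("key99999", envir = hash_env)) # near constant-time hashed lookup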

If I have whetted your appetite, I want to close with a pointer to a much more professional implementation of hash tables using environments: https://CRAN.R-project.org/package=hash

I haven’t tried the package myself so far but the author, Christopher Brown, promises:

The hash package is the only full featured hash implementation for the R language. It provides more features and finer control of the hash behavior than the native feature set and has similar and sometimes better performance.

If you use this package (or any other hash implementation) I would love to read about your experience in the comments!

Learning R: A gentle introduction to higher-order functions

Have you ever thought about why the definition of a function in R is different from many other programming languages? The part that causes the biggest difficulties (especially for beginners of R) is that you state the name of the function at the beginning and use the assignment operator – as if functions were like any other data type, like vectors, matrices or data frames…

Congratulations! You just encountered one of the big ideas of functional programming: functions are indeed like any other data type, they are not special – or in programming lingo, functions are first-class members. Now, you might ask: So what? Well, there are many ramifications, for example that you could use functions on other functions by using one function as an argument for another function. Sounds complicated?

In mathematics most of you will be familiar with taking the derivative of a function. When you think about it you could say that you put one function into the derivative function (or operator) and get out another function!
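
We can sketch exactly that idea in R with a few lines: a function that takes a function and returns another function, namely a numerical approximation of its derivative.

derivative <- function(f, h = 1e-6) {
  function(x) (f(x + h) - f(x - h)) / (2 * h) # central difference quotient
}
dsin <- derivative(sin) # dsin is itself a function...
dsin(0)                 # ...and returns a value very close to cos(0) = 1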

There are many applications of this idea in R as well, so let us go through a simple example step by step.

Let’s say I want to apply the mean function on the first four columns of the iris dataset. I could do the following:

mean(iris[ , 1])
## [1] 5.843333
mean(iris[ , 2])
## [1] 3.057333
mean(iris[ , 3])
## [1] 3.758
mean(iris[ , 4])
## [1] 1.199333

Quite tedious and not very elegant. Of course, we can use a for loop for that:

for (x in iris[1:4]) {
  print(mean(x))
}
## [1] 5.843333
## [1] 3.057333
## [1] 3.758
## [1] 1.199333

This works fine but there is an even more intuitive approach. Just look at the original task: “apply the mean function on the first four columns of the iris dataset” – so let us do just that:

apply(iris[1:4], 2, mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

Wow, this is very concise and works perfectly (the 2 just stands for “go through the data column-wise”, 1 would be for “row-wise”). apply is called a “higher-order function” and we could use it with all kinds of other functions:

apply(iris[1:4], 2, sd)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8280661    0.4358663    1.7652982    0.7622377
apply(iris[1:4], 2, min)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          4.3          2.0          1.0          0.1
apply(iris[1:4], 2, max)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          7.9          4.4          6.9          2.5

You can also use user-defined functions:

midrange <- function(x) (min(x) + max(x)) / 2
apply(iris[1:4], 2, midrange)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         6.10         3.20         3.95         1.30

We can even use new functions that are defined “on the fly” (or in functional programming lingo “anonymous functions”):

apply(iris[1:4], 2, function(x) (min(x) + max(x)) / 2)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         6.10         3.20         3.95         1.30

Let us now switch to another inbuilt data set, the mtcars dataset with 11 different variables of 32 cars (if you want to find out more, please consult the documentation):

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

To see the power of higher-order functions let us create a (numeric) matrix with minimum, first quartile, median, mean, third quartile and maximum for all 11 columns of the mtcars dataset with just one command!

apply(mtcars, 2, summary)
##              mpg    cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
## Min.    10.40000 4.0000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
## 1st Qu. 15.42500 4.0000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
## Median  19.20000 6.0000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
## Mean    20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
## 3rd Qu. 22.80000 8.0000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
## Max.    33.90000 8.0000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000

Wow, that was easy and the result is quite impressive, is it not?

Or if you want to perform a linear regression of mpg against each of the variables separately and get a table with all the coefficients – there you go (the first column just regresses mpg against itself, which gives an intercept of 0 and a slope of 1):

sapply(mtcars, function(x) round(coef(lm(mpg ~ x, data = mtcars)), 3))
##             mpg    cyl   disp     hp   drat     wt   qsec     vs     am  gear   carb
## (Intercept)   0 37.885 29.600 30.099 -7.525 37.285 -5.114 16.617 17.147 5.623 25.872
## x             1 -2.876 -0.041 -0.068  7.678 -5.344  1.412  7.940  7.245 3.923 -2.056

Here we used another higher-order function, sapply, together with an anonymous function. sapply goes through all the columns of a data frame (i.e. the elements of a list) and tries to simplify the result (here you get back a nice matrix).

Often you might not even have realised that you were using higher-order functions! I can tell you that it is quite a hassle in many programming languages to program a simple function plotter, i.e. a function which plots another function. In R it has already been done for you: you just use the higher-order function curve and give it the function you want to plot as an argument:

curve(sin(x) + cos(1/2 * x), -10, 10)

I want to give you one last example of another very helpful higher-order function (which not too many people know or use): by. It comes in very handy when you want to apply a function on different attributes split by a factor. So let’s say you want to get a summary of all the attributes of iris split by (!) species – here it comes:

by(iris[1:4], iris$Species, summary)
## iris$Species: setosa
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
## --------------------------------------------------------------
## iris$Species: versicolor
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width   
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800  
## --------------------------------------------------------------
## iris$Species: virginica
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500

This was just a first shy glance at this huge topic. There are more very powerful higher-order functions in R, like lapply, aggregate, replicate (very handy for numerical simulations) and many more. A good overview can be found in the answers to this question on stackoverflow (my answer there is on the rather elusive switch function: switch).
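
Just to give you a taste of replicate, here is a tiny simulation sketch: estimating the probability of getting at least one six in four throws of a die (the exact value is 1 - (5/6)^4, about 51.8%).

set.seed(123)
mean(replicate(100000, any(sample(1:6, 4, replace = TRUE) == 6))) # simulated probability, close to 0.518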

For some reason people tend to confuse higher-order functions with recursive functions but that is the topic of another post, so stay tuned…