Learning Data Science: Modelling Basics


Data Science is all about building good models, so let us start by building a very simple model: we want to predict monthly income from age (in a later post we will see that age is indeed a good predictor for income).

For illustrative purposes we just make up some numbers for age and income, load both into a data frame, plot it and calculate its correlation matrix:

age <- c(21, 46, 55, 35, 28)
income <- c(1850, 2500, 2560, 2230, 1800)
data <- data.frame(age, income)
plot(data, pch = 16)

cor(data) # correlation
##              age    income
## age    1.0000000 0.9464183
## income 0.9464183 1.0000000

Just by looking at the plot we can see that the points seem to be linearly dependent. The correlation matrix corroborates this with a correlation of nearly 95\%. To build a linear regression model we use the lm function (for linear model). To specify the model we use the standard R formula interface: on the left hand side comes the dependent variable, i.e. the variable that we want to predict (income) and on the right hand side comes the independent variable, i.e. the variable that we want to use to predict income, viz. age – in between both variables we put a tilde (~). After that we plot the model as a line and look at some model diagnostics:

# building linear regression model
LinReg <- lm(income ~ age) # assign model to variable
abline(LinReg, col = "green", lwd = 2) # add regression line

LinReg # coefficients
## 
## Call:
## lm(formula = income ~ age)
## 
## Coefficients:
## (Intercept)          age  
##     1279.37        24.56

summary(LinReg) # model summary
## 
## Call:
## lm(formula = income ~ age)
## 
## Residuals:
##       1       2       3       4       5 
##   54.92   90.98  -70.04   91.12 -166.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1279.367    188.510   6.787  0.00654 **
## age           24.558      4.838   5.076  0.01477 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132.1 on 3 degrees of freedom
## Multiple R-squared:  0.8957, Adjusted R-squared:  0.8609 
## F-statistic: 25.77 on 1 and 3 DF,  p-value: 0.01477

The general line equation is y=mx+b, where m is the slope (in this case 24.58) and b the intercept (here 1279.367). The p-values in the last column show us that both parameters are significant, so with a high probability not due to randomness. If we now want to predict the income of a 20-year old we just have to calculate 1279.37+20\cdot24.59=1770.52.

To do the same calculation in R we can conveniently use the predict function which does this calculation for us automatically. We can also use vectors as input and get several predictions at once. Important is that we use the independent variable with its name as a data frame, otherwise R will throw an error.

# predicting income with linear model
predict(LinReg, data.frame(age = 20))
##       1 
## 1770.52
pred_LinReg <- predict(LinReg, data.frame(age = seq(from = 0, to = 80, by = 5)))
names(pred_LinReg) <- seq(0, 80, 5)
pred_LinReg
##        0        5       10       15       20       25       30       35 
## 1279.367 1402.155 1524.944 1647.732 1770.520 1893.308 2016.097 2138.885 
##       40       45       50       55       60       65       70       75 
## 2261.673 2384.461 2507.249 2630.038 2752.826 2875.614 2998.402 3121.190 
##       80 
## 3243.979

So far so good, but wouldn’t it be nice to actually have a model that is even more accurate, a model where the line actually goes through all the data points so that all the available information is being used. Let us do that with a so called polynomial regression (in red):

# polynomial regression
PolyReg <- lm(income ~ poly(age, 4, raw = TRUE))
lines(c(20:55), predict(PolyReg, data.frame(age = c(20:55))), col = "red")

pred_PolyReg <- predict(PolyReg, data.frame(age = seq(0, 80, 5)))
names(pred_PolyReg) <- seq(0, 80, 5)
pred_PolyReg
##         0         5        10        15        20        25        30 
## 19527.182 11103.509  5954.530  3164.488  1959.098  1705.541  1912.468 
##        35        40        45        50        55        60        65 
##  2230.000  2449.726  2504.705  2469.464  2560.000  3133.778  4689.732 
##        70        75        80 
##  7868.267 13451.255 22362.037

What do you think? Is this a better model? Well, there is no good reason why income should go up the younger or older you are, and there is no good reason why there should be a bump between 50 and 55. In other words, this model takes the noise as legitimate input which actually makes it worse! To have a good model you always need some rigidity to generalize the data.

This illustration is a good example of one of the big problems of machine learning: overfitting the data! The other extreme would be underfitting, which is included in the following plot as a blue line:

# mean income as a model
abline(h = mean(income), col = "blue")

title(main = "All models are wrong, some are useful")
legend("bottomright", c("Mean: Underfitting", "Linear fit", "Polynomial Reg.: Overfitting"), col = c("blue", "green", "red"), lwd = 3)

mean(income)
## [1] 2188

We just took the mean value of income and claimed that no matter the age everybody would earn about 2200. This also obviously doesn’t make the cut. In practice and in all data science projects you always have to look for a model that hits the sweet spot between over- and underfitting. This is as much an art as a science, although there are certain best practices like dividing the data into a training and a test set, cross validation etc.

Another important point it that all models have their area of applicability. In physics for example quantum physics works well for the very small but not very good for the very big, it is the other way round for relativity theory. In this case we didn’t have data for people of very young or very old age, so it is in general very dangerous to make prediction for those people, i.e. for values outside of the original observation range (also called extrapolation – in contrast to interpolation = predictions within the original range). In case of the linear model we got an income of about 1280 for newborns (the intercept)! And the income gets higher the older people get, also something that doesn’t make sense. This behaviour was a lot worse for the non-linear model.

So, here you can see that it is one thing to just let R run and spit something out and quite another one to find a good model and interpret it correctly – even for such a simple problem as this one. In a later post we will use real-world data to predict income brackets with OneR, decision trees and random forests, so stay tuned!

Addendum
The post on now online: Learning Data Science: Predicting Income Brackets!

Hash Me If You Can

We are living in the era of Big Data but the problem of course is that the bigger our data sets become the slower even simple search operations get. I will now show you a trick that is the next best thing to magic: building a search function that practically doesn’t slow down even for large data sets… in base R!

On first thought this is totally counterintuitive: the bigger the data set is, the longer it should take to search it, right? Wrong!

The data structure we will be talking about is called a hash or a dictionary (sometimes also called associated array). The big idea is to use a mathematical function (called hash function) which maps each data item (e.g. a name) to an address (called hash) where the corresponding value (e.g. a telephone number) is stored. So to find the telephone number for a certain name you don’t have to search through all the names but you just put it into the hash function and you get back the address for the telephone number instantaneously:

Source: wikimedia

This image is from the very good wikipedia article on hash function algorithms: Hash function. It also gives a trivial example of a hash function to get the idea:

If the data to be hashed is small enough, one can use the data itself (reinterpreted as an integer) as the hashed value. The cost of computing this “trivial” (identity) hash function is effectively zero. This hash function is perfect, as it maps each input to a distinct hash value.

To do this with R we use environments with the option hash = TRUE. The following example is from an answer I gave on stackoverflow (https://stackoverflow.com/a/42350672/468305):

# vectorize assign, get and exists for convenience
assign_hash <- Vectorize(assign, vectorize.args = c("x", "value"))
get_hash <- Vectorize(get, vectorize.args = "x")
exists_hash <- Vectorize(exists, vectorize.args = "x")

# keys and values
key <- c("tic", "tac", "toe")
value <- c(1, 22, 333)

# initialize hash
hash <- new.env(hash = TRUE, parent = emptyenv(), size = 100L)

# assign values to keys
assign_hash(key, value, hash)
## tic tac toe 
##   1  22 333

# get values for keys
get_hash(c("toe", "tic"), hash)
## toe tic 
## 333   1

# alternatively:
mget(c("toe", "tic"), hash)
## $toe
## [1] 333
## 
## $tic
## [1] 1

# show all keys
ls(hash)
## [1] "tac" "tic" "toe"

# show all keys with values
get_hash(ls(hash), hash)
## tac tic toe 
##  22   1 333

# remove key-value pairs
rm(list = c("toe", "tic"), envir = hash)
get_hash(ls(hash), hash)
## tac 
##  22

# check if keys are in hash
exists_hash(c("tac", "nothere"), hash)
##     tac nothere 
##    TRUE   FALSE

# for single keys this is also possible:
# show value for single key
hash[["tac"]]
## [1] 22

# create new key-value pair
hash[["test"]] <- 1234
get_hash(ls(hash), hash)
##  tac test 
##   22 1234

# update single value
hash[["test"]] <- 54321
get_hash(ls(hash), hash)
##   tac  test 
##    22 54321

So you see that using the inbuilt hash functionality is quite simple. To get an idea of the performance boost there is a very thorough article here: http://jeffreyhorner.tumblr.com/post/114524915928/hash-table-performance-in-r-part-i.

Have a look at the following plot from the article:

Horner writes:

Bam! See that blue line? That’s near constant time for searching the entire 2^{15} size hash table!

If I could whet your appetite I want to close with a pointer to a much more professional implementation of hash tables using environments: https://CRAN.R-project.org/package=hash

I haven’t tried the package myself so far but the author, Christopher Brown, promises:

The hash package is the only full featured hash implementation for the R language. It provides more features and finer control of the hash behavior than the native feature set and has similar and sometimes better performance.

If you use this package (or any other hash implementation) I would love to read about your experience in the comments!

If wealth had anything to do with intelligence…

…the richest man on earth would have a fortune of no more than $43,000! If you don’t believe me read this post!

Have you ever thought about the distribution of wealth as a function of some quality? Especially rich people pride themselves on extraordinary abilities, so that they somehow “deserve” their wealth. Now “abilities” is somewhat hard to measure so let us take “intelligence” as a proxy. Intelligence is distributed normally with mean = 100 and standard deviation = 15. An interesting question is what is the expected maximum intelligence of all persons alive today?

IQ distribution Source: wikimedia

The first thought could be to simulate this but as it soon turns out this approach doesn’t work because you had to create and analyze a vector with nearly 8 billion elements – R will balk at this.

So we will have to look for an analytical solution – fortunately, asking my colleague Prof. Dr. Google brings up the following mathunderflow question and answers: Expected value for maximum of n normal random variable. According to this the expected maximum of n independent N(\mu, \sigma) is:

    \[\mu + \sigma n \int_{-\infty}^{\infty} x \phi(x) \Phi(x)^{n-1}dx,\]

where \phi (lowercase phi) and \Phi (uppercase phi) stand for the pdf and cdf of the normal distribution respectively. I spare you the mathematical details of the derivation.

Unfortunately there doesn’t seem to be a closed form for the integral so we will solve it numerically by making use of the integral function – first we reproduce the values of the accepted answer:

f <- function(x, n) x * dnorm(x) * pnorm(x)^(n-1)
for (i in 2:10) print(i * integrate(f, n = i, -Inf, Inf, abs.tol = 1e-20)$value) # test
## [1] 0.5641896
## [1] 0.8462844
## [1] 1.029375
## [1] 1.162964
## [1] 1.267206
## [1] 1.352178
## [1] 1.4236
## [1] 1.485013
## [1] 1.538753

This seems to be good enough. Now, let us calculate the maximum IQ that is to be expected for the current world population:

100 + 15 * world_population * integrate(f, n = world_population, -Inf, Inf, abs.tol = 1e-20)$value # expected max IQ of all people alive
## [1] 196.1035

To compare this with the actual record we find that Marilyn vos Savant seems to be (one of) the most intelligent persons on the planet. When you read the Wikipedia article you will find that there is some debate about what the actual number really is but it seems to be around 200… again, the above approximation seems to be good enough.

And now for the final leg: if the wealth distribution followed the same rules as the IQ distribution we could use the same formula to calculate the expected maximum wealth:

global_wealth <- 1.68e+14 # source: Allianz Global Wealth Report 2018
world_population <- 7.7e+09
per_capita_wealth <- global_wealth / world_population
per_capita_wealth_sd <- per_capita_wealth * 0.15 # equivalent to IQ

per_capita_wealth
## [1] 21818.18

per_capita_wealth + per_capita_wealth_sd * world_population * integrate(f, n = world_population, -Inf, Inf, abs.tol = 1e-20)$value
## [1] 42786.21

There you go: not more than $43.000! But the richest man at the moment, Jeff Bezos, has a net worth of about 140 billion US$!!! Conversely that would translate to an IQ of

1.4e+11 / per_capita_wealth * 100
## [1] 641666667

So, compared to an IQ of over 641 million Einstein would be the mental equivalent of an amoeba! If IQ were additive, i.e. if two persons with an IQ of 100 each had an combined IQ of 200, you had to take nearly the whole population of Europe (about 738 million) to match that; even the whole North American continent wouldn’t suffice with its 579 million inhabitants!

But if it is not ability/intelligence that determines the distribution of wealth what else could account for the extreme inequality we perceive in the world?

Stay tuned for more to come…

Understanding the Magic of Neural Networks


Everything “neural” is (again) the latest craze in machine learning and artificial intelligence. Now what is the magic here?

Let us dive directly into a (supposedly little silly) example: we have three protagonists in the fairy tail little red riding hood, the wolf, the grandmother and the woodcutter. They all have certain qualities and little red riding hood reacts in certain ways towards them. For example the grandmother has big eyes, is kindly and wrinkled – little red riding hood will approach her, kiss her on the cheek and offer her food (the behavior “flirt with” towards the woodcutter is a little sexist but we kept it to reproduce the original example from Jones, W. & Hoskins, J.: Back-Propagation, Byte, 1987). We will build and train a neural network which gets the qualities as inputs and little red riding wood’s behaviour as output, i.e. we train it to learn the adequate behaviour for each quality.

Have a look at the following code and its output including the resulting plot:

library(neuralnet)
library(NeuralNetTools)

# code qualities and actions
qualities <- matrix (c(1, 1, 1, 0, 0, 0,
                       0, 1, 0, 1, 1, 0,
                       1, 0, 0, 1, 0, 1), byrow = TRUE, nrow = 3)
colnames(qualities) <- c("big_ears", "big_eyes", "big_teeth", "kindly", "wrinkled", "handsome")
rownames(qualities) <- c("wolf", "grannie", "woodcutter")
qualities
##            big_ears big_eyes big_teeth kindly wrinkled handsome
## wolf              1        1         1      0        0        0
## grannie           0        1         0      1        1        0
## woodcutter        1        0         0      1        0        1

actions <- matrix (c(1, 1, 1, 0, 0, 0, 0,
                     0, 0, 0, 1, 1, 1, 0,
                     0, 0, 0, 1, 0, 1, 1), byrow = TRUE, nrow = 3)
colnames(actions) <- c("run_away", "scream", "look_for_woodcutter", "kiss_on_cheek", "approach", "offer_food", "flirt_with")
rownames(actions) <- rownames(qualities)
actions
##            run_away scream look_for_woodcutter kiss_on_cheek approach offer_food flirt_with
## wolf              1      1                   1             0        0          0          0
## grannie           0      0                   0             1        1          1          0
## woodcutter        0      0                   0             1        0          1          1

data <- cbind(qualities, actions)

# train the neural network (NN)
set.seed(123) # for reproducibility
neuralnetwork <- neuralnet(run_away + scream+look_for_woodcutter + kiss_on_cheek + approach + 
                           offer_food + flirt_with ~ 
                           big_ears + big_eyes + big_teeth + kindly + wrinkled + handsome,
                           data = data, hidden = 3, exclude = c(1, 8, 15, 22, 26, 30, 34, 38, 42, 46), 
                           lifesign = "minimal", linear.output = FALSE)
## hidden: 3    thresh: 0.01    rep: 1/1    steps:      48  error: 0.01319  time: 0.01 secs

# plot the NN
par_bkp <- par(mar = c(0, 0, 0, 0)) # set different margin to minimize cutoff text
plotnet(neuralnetwork, bias = FALSE)

par(par_bkp)

# predict actions
round(compute(neuralnetwork, qualities)$net.result)
##            [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## wolf          1    1    1    0    0    0    0
## grannie       0    0    0    1    1    1    0
## woodcutter    0    0    0    1    0    1    1

First the qualities and the actions are coded as binary variables in a data frame. After that the neural network is being trained with the qualities as input and the resulting behaviour as output (using the standard formula syntax). In the neuralnet function a few additional technical arguments are set which details won’t concern us here, they just simplify the process in this context). Then we plot the learned net and test it by providing it with the respective qualities: in all three cases it predicts the right actions. How did it learn those?

Let us look at the plot of the net. We see that there are two basic building blocks: neurons and weighted connections between them. We have one neuron for each quality and one neuron for each action. Between both layers we have a so called hidden layer with three neurons in this case. The learned strength between the neurons is shown by the thickness of the lines (whereby ‘black’ means positive and ‘grey’ negative weights). Please have a thorough look at those weights.

You might have noticed that although the net didn’t know anything about the three protagonists in our little story it nevertheless correctly built a representation of them: ‘H1’ (for Hidden 1) represents the wolf because its differentiating quality is ‘big teeth’ which leads to ‘run away’, ‘scream’ and ‘look for woodcutter’, by the same logic ‘H2’ is the woodcutter and ‘H3’ is the grandmother. So the net literally learned to connect the qualities with respective actions of little red riding hood by creating a representation of the three protagonists!

So an artificial neural network is obviously a network of neurons… so let us have a look at those neurons! Basically they are mathematical abstractions of real neurons in your brain. They consist of inputs and an output. The biologically inspired idea is that when the activation of the inputs surpasses a certain threshold the neuron fires. To be able to learn the neuron must, before summing up the inputs, adjust the inputs so that the output is not just arbitrary but matches some sensible result. What is ‘sensible’ you might ask. In a biological environment the answer is not always so clear cut but in our simple example here the neuron has just to match the output we provide it with (= supervised learning).

The following abstraction has all we need, inputs, weights, the sum function, a threshold after that and finally the output of the neuron:

Simple artificial neuron

Let us talk a little bit about what is going on here intuitively. First every input is taken, multiplied by its weight and all of this is summed up. Some of you might recognize this mathematical operation as a scalar product (also called dot product). Another mathematical definition of a scalar product is the following:

    \[\mathbf{a}\cdot\mathbf{b}=\|\mathbf{a}\|\ \|\mathbf{b}\|\cos(\theta)\]

That is we multiply the length of two vectors by the cosine of the angle of those two vectors. What has cosine to do with it? The cosine of an angle becomes one when both vectors point into the same direction, it becomes zero when they are orthogonal and minus one when both point into opposite directions. Does this make sense? Well, I give you a litte (albeit crude) parable. When growing up there are basically three stages: first you are totally dependent on your parents, then comes puberty and you are against whatever they say or think and after some years you are truly independent (some never reach that stage…). What does “independent” mean here? It means that you agree with some of the things your parents say and think and you disagree with some other things. During puberty you are as dependent on your parents as during being a toddler – you just don’t realize that but in reality you, so to speak, only multiply everything your parents say or think times minus one!

What is the connection with cosine? Well, as a toddler both you and your parents tend to be aligned which gives one, during puberty both of you are aligned but in opposing directions which gives minus one and only as a grown up you are both independent which mathematically means that your vector in a way points in both directions at the same time which is only possible when it is orthogonal on the vector of your parents (you entered a new dimension, literally) – and that gives zero for the cosine.

So cosine is nothing but a measure of dependence – as is correlation by the way. So this setup ensures that the neuron learns the dependence (or correlation) structure between the inputs and the output! The step function is just a way to help it to decide on which side of the fence it wants to sit, to make the decision clearer whether to fire or not. To sum it up, an artificial neuron is a non-linear function (in this case a step function) on a scalar product of the inputs (fixed) and the weights (adaptable to be able to learn). By adapting the weights the neuron learns the dependence structure between inputs and output.

In R you code this idea of an artificial neuron as follows:

neuron <- function(input) ifelse(weights %*% input > 0, 1, 0)

Now let us use this idea in R by training an artificial neuron to classify points in a plane. Have a look at the following table:

Input 1 Input 2 Output
1 0 0
0 0 1
1 1 0
0 1 1

If you plot those points with the colour coded pattern you get the following picture:

The task for the neuron is to find a separating line and thereby classify the two groups. Have a look at the following code:

# inspired by Kubat: An Introduction to Machine Learning, p. 72
plot_line <- function(w, col = "blue", add = TRUE)
  curve(-w[1] / w[2] * x - w[3] / w[2], xlim = c(-0.5, 1.5), ylim = c(-0.5, 1.5), col = col, lwd = 3, xlab = "Input 1", ylab = "Input 2", add = add)
neuron <- function(input) as.vector(ifelse(input %*% weights > 0, 1, 0)) # step function on scalar product of weights and input
eta <- 0.5 # learning rate

# examples
input <- matrix(c(1, 0,
                  0, 0,
                  1, 1,
                  0, 1), ncol = 2, byrow = TRUE)
input <- cbind(input, 1) # bias for intercept of line
output <- c(0, 1, 0, 1)
weights <- c(0.25, 0.2, 0.35) # random initial weights

plot_line(weights, add = FALSE); grid()
points(input[ , 1:2], pch = 16, col = (output + 2))

# training of weights of neuron
for (example in 1:length(output)) {
  weights <- weights + eta * (output[example] - neuron(input[example, ])) * input[example, ]
  plot_line(weights)
}
plot_line(weights, col = "black")

# test: applying neuron on input
apply(input, 1, neuron)
## [1] 0 1 0 1

As you can see the result matches the desired output, graphically the black line is the end result and as you can see it separates the green from the red points: the neuron has learned this simple classification task. The blue lines are where the neuron starts from and where it is during training – they are not able to classify the points correctly.

The training, i.e. adapting the weights, takes places in this line:

weights <- weights + eta * (output[example] - neuron(input[example, ])) * input[example, ]

The idea is to compare the current output of the neuron with the wanted output, scale that by some learning factor \eta (eta) and modify the weights accordingly. So if the output is too big make the weights smaller and vice versa. Do this for all examples (sometimes you need another loop to train the neuron with the examples several times) and that’s it. That is the core idea behind the ongoing revolution of neural networks!

Ok, so far we had a closer look at one part of neural networks, namely the neurons, let us now turn to the network structure (also called network topology). First, why do we need a whole network anyway when the neurons are already able to solve classification tasks? The answer is that they can do that only for very simple problems. For example the neuron above can only distinguish between linearly separable points, i.e. it can only draw lines. It fails in case of the simple problem of four points that are coloured green, red, red, green from top left to bottom right (try it yourself). We would need a non-linear function to separate the points. We have to combine several neurons to solve more complicated problems.

The biggest problem you have to overcome when you combine several neurons is how to adapt all the weights. You need a system how to attribute the error at the output layer to all the weights in the net. This had been a profound obstacle until an algorithm called backpropagation (also abbreviated backprop) was invented (or found). We won’t get into the details here but the general idea is to work backwards from the output layers through all of the hidden layers till one reaches the input layer and modify the weights according to their respective contribution to the resulting error. This is done several (sometimes millions of times) for all training examples until one achieves an acceptable error rate for the training data.

The result is that you get several layers of abstraction, so when you e.g. want to train a neural network to recognize certain faces you start with the raw input data in the form of pixels, these are automatically combined into abstract geometrical structures, after that the net detects certain elements of faces, like eyes and noses, and finally abstractions of certain faces are being rebuilt by the net. See the following picture (from nivdul.wordpress.com) for an illustration:

So far we have only coded very small examples of a neural networks. Real-world examples often have dozens of layers with thousands of neurons so that much more complicated patterns can be learned. The more layers there are the ‘deeper’ a net becomes… which is the reason why the current revolution in this field is called “deep learning” because there are so many hidden layers involved. Let us now look at a more realistic example: predicting whether a breast cell is malignant or benign.

Have a look at the following code:

library(OneR)
data(breastcancer)
data <- breastcancer
colnames(data) <- make.names(colnames(data))
data$Class <- as.integer(as.numeric(data$Class) - 1) # for compatibility with neuralnet
data <- na.omit(data)

# Divide training (80%) and test set (20%)
set.seed(12) # for reproducibility
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- data[random, ]
data_test <- data[-random, ]

# Train NN on training set
model_train <- neuralnet(Class ~., data = data_train, hidden = c(9, 9), lifesign = "minimal")
## hidden: 9, 9    thresh: 0.01    rep: 1/1    steps:    3784   error: 0.00524  time: 3.13 secs

# Plot net
plot(model_train, rep = "best")

# Use trained model to predict test set
prediction <- round(predict(model_train, data_test))
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction   0   1 Sum
##        0    93   2  95
##        1     4  38  42
##        Sum  97  40 137
## 
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.68 0.01 0.69
##        1   0.03 0.28 0.31
##        Sum 0.71 0.29 1.00
## 
## Accuracy:
## 0.9562 (131/137)
## 
## Error rate:
## 0.0438 (6/137)
## 
## Error rate reduction (vs. base rate):
## 0.85 (p-value = 1.298e-13)

So you see that a relatively simple net achieves an accuracy of about 95% out of sample. The code itself should be mostly self-explanatory. For the actual training the neuralnet function from the package with the same name is being used, the input method is the standard R formula interface, where you define Class as the variable to be predicted by using all the other variables (coded as .~).

When you look at the net one thing might strike you as odd: there are three neurons at the top with a fixed value of 1. These are so called bias neurons and they serve a similar purpose as the intercept in a linear regression: they kind of shift the model as a whole in n-dimensional feature space just as a regression line is being shifted by the intercept. In case you were attentive we also smuggled in a bias neuron in the above example of a single neuron: it is the last column of the input matrix which contains only ones.

Another thing: as can even be seen in this simple example it is very hard to find out what a neural network has actually learned – the following well-known anecdote (urban legend?) shall serve as a warning: some time ago the military built a system which had the aim to distinguish military vehicles from civilian ones. They chose a neural network approach and trained the system with pictures of tanks, humvees and missile launchers on the one hand and normal cars, pickups and lorries on the other. After having reached a satisfactory accuracy they brought the system into the field (quite literally). It failed completely, performing no better than a coin toss. What had happened? No one knew, so they re-engineered the black box (no small feat in itself) and found that most of the military pics where taken at dusk or dawn and most civilian pics under brighter weather conditions. The neural net had learned the difference between light and dark!

Just for comparison the same example with the OneR package:

data(breastcancer)
data <- breastcancer

# Divide training (80%) and test set (20%)
set.seed(12) # for reproducibility
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- optbin(data[random, ], method = "infogain")
## Warning in optbin.data.frame(data[random, ], method = "infogain"): 12
## instance(s) removed due to missing values
data_test <- data[-random, ]

# Train OneR model on training set
model_train <- OneR(data_train, verbose = TRUE)
## 
##     Attribute                   Accuracy
## 1 * Uniformity of Cell Size     92.32%  
## 2   Uniformity of Cell Shape    91.59%  
## 3   Bare Nuclei                 90.68%  
## 4   Bland Chromatin             90.31%  
## 5   Normal Nucleoli             90.13%  
## 6   Single Epithelial Cell Size 89.4%   
## 7   Marginal Adhesion           85.92%  
## 8   Clump Thickness             84.28%  
## 9   Mitoses                     78.24%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

# Show model and diagnostics
summary(model_train)
## 
## Call:
## OneR.data.frame(x = data_train, verbose = TRUE)
## 
## Rules:
## If Uniformity of Cell Size = (0.991,2] then Class = benign
## If Uniformity of Cell Size = (2,10]    then Class = malignant
## 
## Accuracy:
## 505 of 547 instances classified correctly (92.32%)
## 
## Contingency table:
##            Uniformity of Cell Size
## Class       (0.991,2] (2,10] Sum
##   benign        * 318     30 348
##   malignant        12  * 187 199
##   Sum             330    217 547
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 381.78243, df = 1, p-value < 0.00000000000000022204

# Plot model diagnostics
plot(model_train)

# Use trained model to predict test set
prediction <- predict(model_train, data_test)

# Evaluate model performance on test set
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##            Actual
## Prediction  benign malignant Sum
##   benign        92         0  92
##   malignant      8        40  48
##   Sum          100        40 140
## 
## Confusion matrix (relative):
##            Actual
## Prediction  benign malignant  Sum
##   benign      0.66      0.00 0.66
##   malignant   0.06      0.29 0.34
##   Sum         0.71      0.29 1.00
## 
## Accuracy:
## 0.9429 (132/140)
## 
## Error rate:
## 0.0571 (8/140)
## 
## Error rate reduction (vs. base rate):
## 0.8 (p-value = 0.000000000007992571)

As you can see the accuracy is only slightly worse but you have full interpretability of the model… and you would only need to measure one value (“Uniformity of Cell Size”) instead of 9 to get a prediction!

Captured images of layers of glass with smears of breast mass (the parts stained correspond to cell nuclei) – Source

On the other hand making neural networks interpretable is one of the big research challenges at the moment.

To end this rather long post: there is a real revolution going on at the moment with all kinds of powerful neural networks. Especially promising is a combination of reinforcement learning (the topic of an upcoming post) and neural networks, where the reinforcement learning algorithm uses a neural network as its memory. For example the revolutionary AlphaGo Zero is built this way: it just received the rules of Go, one of the most demanding strategy games humanity has ever invented, and grew superhuman strength after just three days! The highest human rank in Go has an ELO value of 2940 – AlphaGo Zero achieves 5185! Even the best players don’t stand a chance against this monster of a machine. The neural network technology that is used for AlphaGo Zero and many other deep neural networks is called Tensorflow, which can also easily be integrated into the R environment. To find out more go here: https://tensorflow.rstudio.com/

In this whole area there are many mind-blowing projects underway, so stay tuned!

Understanding the Maths of Computed Tomography (CT) scans


Noseman is having a headache and as an old-school hypochondriac he goes to see his doctor. His doctor is quite worried and makes an appointment with a radiologist for Noseman to get a CT scan.

Modern CT scanner from Siemens

Because Noseman always wants to know how things work he asks the radiologist about the inner workings of a CT scanner.

The basic idea is that X-rays are fired from one side of the scanner to the other. Because different sorts of tissue (like bones, brain cells, cartilage etc.) block different amounts of the X-rays the intensity measured on the other side varies accordingly.

The problem is of course that a single picture cannot give the full details of what is inside the body because it is a combination of different sorts of tissue in the way of the respective X-rays. The solution is to rotate the scanner and combine the different slices.

How, you ask? Good old linear algebra to the rescue!

We start with the initial position and fire X-rays with an intensity of 30 (just a number for illustrative purposes) through the body:

Initial position

As can be seen in the picture the upper ray goes through areas 1, 2 and 3 and let’s say that the intensity value of 12 is measured on the other side of the scanner:

    \[30-x_1-x_2-x_3=12\]

or

    \[x_1+x_2+x_3=18\]

The rest of the formula is found accordingly:

    \[\underbrace{ \begin{pmatrix}   1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\   0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\   0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \end{pmatrix} }_{\bold{A}_1} \underbrace{ \begin{pmatrix}   x_1 \\   x_2 \\   x_3 \\   x_4 \\   x_5 \\   x_6 \\   x_7 \\   x_8 \\   x_9 \end{pmatrix} }_{\bold{x}} = \underbrace{ \begin{pmatrix}   18 \\   21 \\   18 \end{pmatrix} }_{\bold{b}_1}\]

We then rotate the scanner for the first time…

Position after first rotation

…which gives the following formula:

    \[\underbrace{ \begin{pmatrix}   0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\   0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 \\   0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix} }_{\bold{A}_2} \underbrace{ \begin{pmatrix}   x_1 \\   x_2 \\   x_3 \\   x_4 \\   x_5 \\   x_6 \\   x_7 \\   x_8 \\   x_9 \end{pmatrix} }_{\bold{x}} = \underbrace{ \begin{pmatrix}   18 \\   21 \\    9 \end{pmatrix} }_{\bold{b}_2}\]

And a second rotation…

Position after second rotation

…yields the following formula:

    \[\underbrace{ \begin{pmatrix}   0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\   0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\   0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \end{pmatrix} }_{\bold{A}_3} \underbrace{ \begin{pmatrix}   x_1 \\   x_2 \\   x_3 \\   x_4 \\   x_5 \\   x_6 \\   x_7 \\   x_8 \\   x_9 \end{pmatrix} }_{\bold{x}} = \underbrace{ \begin{pmatrix}   18 \\   14 \\   16 \end{pmatrix} }_{\bold{b}_3}\]

Now we are combining all three systems of equations:

    \[\begin{pmatrix}   \bold{A}_1 \\   \bold{A}_2 \\   \bold{A}_3 \end{pmatrix} \bold{x} = \begin{pmatrix}   \bold{b}_1 \\   \bold{b}_2 \\   \bold{b}_3 \end{pmatrix}\]

or written out in full:

    \[\underbrace{ \begin{pmatrix}   1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\   0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\   0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\   0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\   0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 \\   0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\   0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\   0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\   0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \end{pmatrix} }_{\bold{A}} \underbrace{ \begin{pmatrix}   x_1 \\   x_2 \\   x_3 \\   x_4 \\   x_5 \\   x_6 \\   x_7 \\   x_8 \\   x_9 \end{pmatrix} }_{\bold{x}} = \underbrace{ \begin{pmatrix}   18 \\   21 \\   18 \\   18 \\   21 \\    9 \\   18 \\   14 \\   16 \end{pmatrix} }_{\bold{b}}\]

Here is the data of the matrix \bold{A} for you to download: ct-scan.txt).

We now have 9 equations with 9 unknown variables… which should easily be solvable by R, which can also depict the solution as a gray-scaled image… the actual CT-scan!

A <- read.csv("data/ct-scan.txt")
b <- c(18, 21, 18, 18, 21, 9, 18, 14, 16)
v <- solve(A, b)
matrix(v, ncol = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    9    9    0
## [2,]    9    5    7
## [3,]    9    9    0
image(matrix(v, ncol = 3), col = gray(4:0 / 4))
CT of Noseman

The radiologist looks at the picture… and has good news for Noseman: everything is how it should be! Noseman is relieved and his headache is much better now…

Real CT scans make use of the same basic principles (of course with a lot of additional engineering and maths magic 😉 )

Here are real images of CT scans of a human brain…

Source: Wikimedia

… which can be combined into a 3D-animation:

Source: Wikimedia

Isn’t it fascinating how a little bit of maths can save lives!

Check Machin-like formulae with arbitrary-precision arithmetic

Happy New Year to all of you! Let us start the year with something for your inner maths nerd 🙂

Rosetta Stone
Source: Wikimedia

For those of you who don’t yet know Rosetta Code: it is a real cool site where you can find lots of interesting code examples in all kinds of different languages for many different tasks. Of course R is also present big time (at the time of writing 426 code examples!): Rosetta Code for R.

The name of the site is inspired by the famous Rosetta Stone of Ancient Egypt which is inscribed with three different versions of the same text: in Ancient Egyptian hieroglyphs, Demotic script, and Ancient Greek script which proved invaluable in deciphering Egyptian hieroglyphs and thereby opening the window into ancient Egyptian history.

Now, a few days a ago I again added an example (for the other tasks I solved I will write more posts in the future, so stay tuned!). The task is to verify the correctness of Machin-like formulae using exact arithmetic.

A little bit of mathematical background is in order, so Wikipedia to the rescue:

Machin-like formulae are a popular technique for computing \pi to a large number of digits. They are generalizations of John Machin]s formula from 1706:

    \[\frac{\pi}{4} = 4 \arctan \frac{1}{5} - \arctan \frac{1}{239}\]

which he used to compute \pi to 100 decimal places.

Machin-like formulae have the form

    \[c_0 \frac{\pi}{4} = \sum_{n=1}^N c_n \arctan \frac{a_n}{b_n}\]

where a_n and b_n are positive integers such that a_n < b_n, c_n is a signed non-zero integer, and c_0 is a positive integer.

The exact task is to verify that the following Machin-like formulae are correct by calculating the value of tan (right hand side) for each equation using exact arithmetic and showing they equal one:

{\pi\over4} = \arctan{1\over2} + \arctan{1\over3}
{\pi\over4} = 2 \arctan{1\over3} + \arctan{1\over7}
{\pi\over4} = 4 \arctan{1\over5} - \arctan{1\over239}
{\pi\over4} = 5 \arctan{1\over7} + 2 \arctan{3\over79}
{\pi\over4} = 5 \arctan{29\over278} + 7 \arctan{3\over79}
{\pi\over4} = \arctan{1\over2} + \arctan{1\over5} + \arctan{1\over8}
{\pi\over4} = 4 \arctan{1\over5} - \arctan{1\over70} + \arctan{1\over99}
{\pi\over4} = 5 \arctan{1\over7} + 4 \arctan{1\over53} + 2 \arctan{1\over4443}
{\pi\over4} = 6 \arctan{1\over8} + 2 \arctan{1\over57} + \arctan{1\over239}
{\pi\over4} = 8 \arctan{1\over10} - \arctan{1\over239} - 4 \arctan{1\over515}
{\pi\over4} = 12 \arctan{1\over18} + 8 \arctan{1\over57} - 5 \arctan{1\over239}
{\pi\over4} = 16 \arctan{1\over21} + 3 \arctan{1\over239} + 4 \arctan{3\over1042}
{\pi\over4} = 22 \arctan{1\over28} + 2 \arctan{1\over443} - 5 \arctan{1\over1393} - 10 \arctan{1\over11018}
{\pi\over4} = 22 \arctan{1\over38} + 17 \arctan{7\over601} + 10 \arctan{7\over8149}
{\pi\over4} = 44 \arctan{1\over57} + 7 \arctan{1\over239} - 12 \arctan{1\over682} + 24 \arctan{1\over12943}

The same should be done for the last and most complicated case…

{\pi\over4} = 88 \arctan{1\over172} + 51 \arctan{1\over239} + 32 \arctan{1\over682} + 44 \arctan{1\over5357} + 68 \arctan{1\over12943}

… but it should be confirmed that the following, slightly changed, formula is incorrect by showing tan (right hand side) is not one:

{\pi\over4} = 88 \arctan{1\over172} + 51 \arctan{1\over239} + 32 \arctan{1\over682} + 44 \arctan{1\over5357} + 68 \arctan{1\over12944}

This is what I contributed to Rosetta Code:

library(Rmpfr)
prec <- 1000 # precision in bits
`%:%` <- function(e1, e2) '/'(mpfr(e1, prec), mpfr(e2, prec)) # operator %:% for high precision division
# function for checking identity of tan of expression and 1, making use of high precision division operator %:%
tanident_1 <- function(x) identical(round(tan(eval(parse(text = gsub("/", "%:%", deparse(substitute(x)))))), (prec/10)), mpfr(1, prec))
 
tanident_1( 1*atan(1/2)    +  1*atan(1/3) )
## [1] TRUE
tanident_1( 2*atan(1/3)    +  1*atan(1/7))
## [1] TRUE
tanident_1( 4*atan(1/5)    + -1*atan(1/239))
## [1] TRUE
tanident_1( 5*atan(1/7)    +  2*atan(3/79))
## [1] TRUE
tanident_1( 5*atan(29/278) +  7*atan(3/79))
## [1] TRUE
tanident_1( 1*atan(1/2)    +  1*atan(1/5)   +   1*atan(1/8) )
## [1] TRUE
tanident_1( 4*atan(1/5)    + -1*atan(1/70)  +   1*atan(1/99) )
## [1] TRUE
tanident_1( 5*atan(1/7)    +  4*atan(1/53)  +   2*atan(1/4443))
## [1] TRUE
tanident_1( 6*atan(1/8)    +  2*atan(1/57)  +   1*atan(1/239))
## [1] TRUE
tanident_1( 8*atan(1/10)   + -1*atan(1/239) +  -4*atan(1/515))
## [1] TRUE
tanident_1(12*atan(1/18)   +  8*atan(1/57)  +  -5*atan(1/239))
## [1] TRUE
tanident_1(16*atan(1/21)   +  3*atan(1/239) +   4*atan(3/1042))
## [1] TRUE
tanident_1(22*atan(1/28)   +  2*atan(1/443) +  -5*atan(1/1393) + -10*atan(1/11018))
## [1] TRUE
tanident_1(22*atan(1/38)   + 17*atan(7/601) +  10*atan(7/8149))
## [1] TRUE
tanident_1(44*atan(1/57)   +  7*atan(1/239) + -12*atan(1/682)  +  24*atan(1/12943))
## [1] TRUE

tanident_1(88*atan(1/172)  + 51*atan(1/239) +  32*atan(1/682)  +  44*atan(1/5357) + 68*atan(1/12943))
## [1] TRUE
tanident_1(88*atan(1/172)  + 51*atan(1/239) +  32*atan(1/682)  +  44*atan(1/5357) + 68*atan(1/12944))
## [1] FALSE

As you can see all statements are TRUE except for the last one!

In the code I make use of the Rmpfr package (from Martin Maechler of ETH Zürich, Switzerland) which is based on the excellent GMP (GNU Multiple Precision) library. I define a new infix operator %:% for high-precision division and after that convert all standard divisions in the formulae to high-precision divisions and calculate the tan. Before I check if the result is identical to one I round it to 100 decimal places which is more than enough given the precision of log_{10}(2^{1000})=301.03, so about 300 decimal places, in the example.

Please let me know in the comments what you think of this approach and whether you see room for improvement for the code – Thank you!

Clustering the Bible

During this time of year there is obviously a lot of talk about the Bible. As most people know the New Testament comprises four different Gospels written by anonymous authors 40 to 70 years after Jesus’ supposed crucifiction. Unfortunately we have lost all of the originals but only retained copies of copies of copies (and so on) which date back hundreds of years after they were written in all kinds of different versions (renowned Biblical scholar Professor Bart Ehrmann states that there are more versions of the New Testament than there are words in the New Testament). Just as a fun fact: there are many more Gospels but only those four were included in the official Bible.

So in general it is interesting to learn a little bit more about this stuff, especially because, for better or worse, it is at the core of Western civilization. One interesting question is how do the four Gospels relate to each other. To find that out you could either study Greek (i.e. the language of the New Testament) and Christian theology for many years and decades – or you could use the stylo package and do a cluster analysis within seconds. Obviously we go for the latter…

For that matter we download one version of the four Gospels from Project Gutenberg in Greek and put it into a separate directory called ‘corpus’ (which is the name of the default directory of the package). The famous beginning of the Gospel of John (“In the beginning was the Word, and the Word was with God, and the Word was God…”) reads as follows in Greek:

Στην αρχή ‘ταν ο λόγος κι’ ο λόγος είτανε με το
Θεό και Θεός είταν ο λόγος. Είταν εκείνος στην αρχή
με το Θεό. Όλα τα πάντα μέσο του έγιναν, και χωρίς
του τίποτα δεν έγινε που γίνηκε. Μέσα του είτανε ζωή
κι’ η ζωή ‘τανε το φως των ανθρώπων, και το φως
μέσα στο σκοτάδι φέγγει και το σκοτάδι δεν το κυρίεψε.

For your convenience I provide you with the curated texts here: John, Luke, Mark, Matthew.

After that we can already call the main function of the package:

library(stylo)

## ### stylo version: 0.6.8 ###
## 
## If you plan to cite this software (please do!), use the following reference:
##     Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R:
##     a package for computational text analysis. R Journal 8(1): 107-121.
##     <https://journal.r-project.org/archive/2016/RJ-2016-007/index.html>
## 
## To get full BibTeX entry, type: citation("stylo")

stylo(corpus.lang = "Other", encoding = "UTF-8", mfw.min = 200, mfw.max = 200, custom.graph.title = "Ancient Greek Gospels", gui = FALSE)

## using current directory...
## Performing no sampling (using entire text as sample)
## loading John.txt ...
## loading Luke.txt ...
## loading Mark.txt ...
## loading Matthew.txt  ...
## slicing input text into tokens...
## 
## turning words into features, e.g. char n-grams (if applicable)...
## 
## Total nr. of samples in the corpus: 4
## ....
## The corpus consists of 70943 tokens
## 
## processing  4  text samples
##        
## combining frequencies into a table...
## 
## 
## culling @ 0  available features (words) 5000
## Calculating z-scores... 
## 
## Calculating classic Delta distances...
## MFW used: 
## 200

## 
## Function call:
## stylo(gui = FALSE, corpus.lang = "Other", encoding = "UTF-8",     mfw.min = 200, mfw.max = 200, custom.graph.title = "Ancient Greek Gospels")
## 
## Depending on your chosen options, some results should have been written
## into a few files; you should be able to find them in your current
## (working) directory. Usually, these include a list of words/features
## used to build a table of frequencies, the table itself, a file containing
## recent configuration, etc.
## 
## Advanced users: you can pipe the results to a variable, e.g.:
##   I.love.this.stuff = stylo()
## this will create a class "I.love.this.stuff" containing some presumably
## interesting stuff. The class created, you can type, e.g.:
##   summary(I.love.this.stuff)
## to see which variables are stored there and how to use them.
## 
## 
## for suggestions how to cite this software, type: citation("stylo")

As you can see the Gospels of Matthew and Luke are more similar than Mark (the oldest Gospel), John (the last Gospel that made it into the Bible) is farthest away. This is indeed what Biblical scholars have found out by diligently studying the New Testament: Mark, Luke and Matthew are called “Synoptic Gospels” because they are so similar, John kind of stands on its own. The shared pieces of the Synoptic Gospels can be seen here (picture from Wikipedia, Synoptic Gospels):

As you can see, Biblical scholars have found out that Luke and Matthew share about 60 to 70 percent of the text (through the so called double tradition), whereas Mark shares about 40 percent of the text (through the triple tradition). For comparison, John only shares about 10 percent with the synoptic Gospels. Now, this is exactly what our cluster analysis form above shows – what an impressive feat!

By the way, there is an even simpler way to interact with the stylo package: through the GUI. Just call stylo() and the following window will open where you can play around with different arguments:

It has never been easier to do sophisticated textual analyses, even in ancient languages one cannot understand! You might ask: how did R do that?

Well, behind the scenes R basically creates frequency tables for the words used in the different documents. It then tries to cluster the documents based on a similarity (or distance) measure. Intuitively that means that the more often authors use the same words the closer their texts are. As you can see R doesn’t really have to understand the words or the language used. It just maps statistical measures which seems to be quite effective (not only) in this case. The statistical measures often give some kind of characteristic signature of an author or a source. In a way this is the quantified version of fuzzy concepts like ‘writing style’ yet R can do it much more quickly even for long texts than any human ever could!

Happy New Year… stay tuned for much more to come in
2019

Your AI journey… and Happy Holidays!

I want to draw your attention to a very valuable (and short!) whitepaper from my colleague, Professor Andrew Ng, where he shares important insights on how to lead companies into the AI era.

The five steps outlined in the paper are:

  1. Execute pilot projects to gain momentum
  2. Build an in-house AI team
  3. Provide broad AI training
  4. Develop an AI strategy
  5. Develop internal and external communications

Many points raised in the paper coincide with my own consulting experience in different industries: AI Transformation Playbook

I also recently gave an interview on the same topic, you can find it here:
Artificial intelligence: What is the best way for companies to innovate?

Happy reading!

If you need qualified support/consulting for your own AI journey, especially for crafting an AI strategy, setting up and management of AI projects, building an AI organization and planning/conducting AI coachings please contact me here: Prof. Dr. Holger K. von Jouanne-Diedrich

…and now for something completely different: like every year Christmas arrives totally unexpected!

So, I wish you a Merry Christmas/Happy Holidays (whatever applies 😉 ) with the following little ctree function:

# ctree: prints a Christmas tree with numbers of branch levels N
ctree <- function(N = 7) {
  for (i in 1:N) cat(rep("", N-i+1), rep("*", i), "\n")
  cat(rep("", N), "*\n")
}

ctree(5)
##      *
##     * *
##    * * *
##   * * * *
##  * * * * *
##      *
ctree()
##        *
##       * *
##      * * *
##     * * * *
##    * * * * *
##   * * * * * *
##  * * * * * * *
##        *
ctree(9)
##          *
##         * *
##        * * *
##       * * * *
##      * * * * *
##     * * * * * *
##    * * * * * * *
##   * * * * * * * *
##  * * * * * * * * *
##          *

Learning R: A gentle introduction to higher-order functions

Have you ever thought about why the definition of a function in R is different from many other programming languages? The part that causes the biggest difficulties (especially for beginners of R) is that you state the name of the function at the beginning and use the assignment operator – as if functions were like any other data type, like vectors, matrices or data frames…

Congratulations! You just encountered one of the big ideas of functional programming: functions are indeed like any other data type, they are not special – or in programming lingo, functions are first-class members. Now, you might ask: So what? Well, there are many ramifications, for example that you could use functions on other functions by using one function as an argument for another function. Sounds complicated?

In mathematics most of you will be familiar with taking the derivative of a function. When you think about it you could say that you put one function into the derivative function (or operator) and get out another function!

In R there are many applications as well, let us go through a simple example step by step.

Let’s say I want to apply the mean function on the first four columns of the iris dataset. I could do the following:

mean(iris[ , 1])
## [1] 5.843333
mean(iris[ , 2])
## [1] 3.057333
mean(iris[ , 3])
## [1] 3.758
mean(iris[ , 4])
## [1] 1.199333

Quite tedious and not very elegant. Of course, we can use a for loop for that:

for (x in iris[1:4]) {
  print(mean(x))
}
## [1] 5.843333
## [1] 3.057333
## [1] 3.758
## [1] 1.199333

This works fine but there is an even more intuitive approach. Just look at the original task: “apply the mean function on the first four columns of the iris dataset” – so let us do just that:

apply(iris[1:4], 2, mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

Wow, this is very concise and works perfectly (the 2 just stands for “go through the data column wise”, 1 would be for “row wise”). apply is called a “higher-order function” and we could use it with all kinds of other functions:

apply(iris[1:4], 2, sd)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8280661    0.4358663    1.7652982    0.7622377
apply(iris[1:4], 2, min)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          4.3          2.0          1.0          0.1
apply(iris[1:4], 2, max)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          7.9          4.4          6.9          2.5

You can also use user-defined functions:

midrange <- function(x) (min(x) + max(x)) / 2
apply(iris[1:4], 2, midrange)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         6.10         3.20         3.95         1.30

We can even use new functions that are defined “on the fly” (or in functional programming lingo “anonymous functions”):

apply(iris[1:4], 2, function(x) (min(x) + max(x)) / 2)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         6.10         3.20         3.95         1.30

Let us now switch to another inbuilt data set, the mtcars dataset with 11 different variables of 32 cars (if you want to find out more, please consult the documentation):

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

To see the power of higher-order functions let us create a (numeric) matrix with minimum, first quartile, median, mean, third quartile and maximum for all 11 columns of the mtcars dataset with just one command!

apply(mtcars, 2, summary)
##              mpg    cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
## Min.    10.40000 4.0000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
## 1st Qu. 15.42500 4.0000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
## Median  19.20000 6.0000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
## Mean    20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
## 3rd Qu. 22.80000 8.0000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
## Max.    33.90000 8.0000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000

Wow, that was easy and the result is quite impressive, is it not!

Or if you want to perform a linear regression for all ten variables separately against mpg and want to get a table with all coefficients – there you go:

sapply(mtcars, function(x) round(coef(lm(mpg ~ x, data = mtcars)), 3))
##             mpg    cyl   disp     hp   drat     wt   qsec     vs     am  gear   carb
## (Intercept)   0 37.885 29.600 30.099 -7.525 37.285 -5.114 16.617 17.147 5.623 25.872
## x             1 -2.876 -0.041 -0.068  7.678 -5.344  1.412  7.940  7.245 3.923 -2.056

Here we used another higher-order function, sapply, together with an anonymous function. sapply goes through all the columns of a data frame (i.e. elements of a list) and tries to simplify the result (here your get back a nice matrix).

Often, you might not even have realised when you were using higher-order functions! I can tell you that it is quite a hassle in many programming languages to program a simple function plotter, i.e. a function which plots another function. In R it has already been done for you: you just use the higher-order function curve and give it the function you want to plot as an argument:

curve(sin(x) + cos(1/2 * x), -10, 10)

I want to give you one last example of another very helpful higher-order function (which not too many people know or use): by. It comes in very handy when you want to apply a function on different attributes split by a factor. So let’s say you want to get a summary of all the attributes of iris split by (!) species – here it comes:

by(iris[1:4], iris$Species, summary)
## iris$Species: setosa
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
## --------------------------------------------------------------
## iris$Species: versicolor
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width   
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800  
## --------------------------------------------------------------
## iris$Species: virginica
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500

This was just a very shy look at this huge topic. There are very powerful higher-order functions in R, like lapppy, aggregate, replicate (very handy for numerical simulations) and many more. A good overview can be found in the answers of this question: stackoverflow (my answer there is on the rather illusive switch function: switch).

For some reason people tend to confuse higher-order functions with recursive functions but that is the topic of another post, so stay tuned…

Intuition for principal component analysis (PCA)

Principal component analysis (PCA) is a dimension-reduction method that can be used to reduce a large set of (often correlated) variables into a smaller set of (uncorrelated) variables, called principal components, which still contain most of the information.

PCA is a concept that is traditionally hard to grasp so instead of giving you the n’th mathematical derivation I will provide you with some intuition.

Basically PCA is nothing else but a projection of some higher dimensional object into a lower dimension. What sounds complicated is really something we encounter every day: when we watch TV we see a 2D-projection of 3D-objects!

Please watch this instructive video:

So what we want is not only a projection but a projection which retains as much information as possible. A good strategy is to find the longest axis of the object and after that turn the object around this axis so that we again find the second longest axis.

Mathematically this can be done by calculating the so called eigenvectors of the covariance matrix, after that we use just use the n eigenvectors with the biggest eigenvalues to project the object into n-dimensional space (in our example n = 2) – but as I said I won’t go into the mathematical details here (many good explanations can be found here on stackexchange).

To make matters concrete we will now recreate the example from the video above! The teapot data is from https://www.khanacademy.org/computer-programming/3d-teapot/971436783 – there you can also turn it around interactively.

Have a look at the following code:

library(scatterplot3d)

teapot <- read.csv("data/teapot.csv", header = FALSE)
dat <- matrix(unlist(teapot), ncol = 3, byrow = TRUE)
head(dat)
##      [,1] [,2] [,3]
## [1,] 40.6 28.3 -1.1
## [2,] 40.1 30.4 -1.1
## [3,] 40.7 31.1 -1.1
## [4,] 42.0 30.4 -1.1
## [5,] 43.5 28.3 -1.1
## [6,] 37.5 28.3 14.5

# teapot in 3D
scatterplot3d(dat[ , 1], dat[ , 2], dat[ , 3], highlight.3d = TRUE, angle = 75, pch = 19, lwd = 15, xlab = "", ylab = "", zlab = "", main = "teapot in 3D")

And now we project that teapot from 3D-space onto a 2D-plane while retaining as much information as possible by taking the two eigenvectors with biggest eigenvalues:

# PCA
(eigenval <- eigen(cov(dat))$values)
## [1] 2050.6935  747.5764  553.4070

eigenvec <- eigen(cov(dat))$vectors
# take only the two eigenvectors with biggest eigenvalues
PCA_2 <- dat %*% eigenvec[ , 1:2] 
plot(PCA_2, ylim = c(50, -55), pch = 19, lwd = 35, col= "blue", xlab = "", ylab = "", main = "teapot in 2D")

round(cumsum(eigenval)/sum(eigenval) * 100, 2) # cumulative percentage of retrieved information
## [1]  61.18  83.49 100.00

That was fun, wasn’t it!

In data science you will often use PCA to condense large datasets into smaller datasets while retaining most of the information (which means retaining as much variance as possible in this case). PCA is like clustering an unsupervised learning method.