## Reinforcement Learning: Life is a Maze

It can be argued that most important decisions in life are some variant of an exploitation-exploration problem. Shall I stick with my current job or look for a new one? Shall I stay with my partner or seek a new love? Shall I continue reading the book or watch the movie instead? In all of those cases the question is always whether I should “exploit” the thing I have or whether I should “explore” new things. If you want to learn how to tackle this most basic trade off read on…

At the core this can be stated as the problem a gambler has who wants to play a one-armed bandit: if there are several machines with different winning probabilities (a so-called multi-armed bandit problem) the question the gambler faces is: which machine to play? He could “exploit” one machine or “explore” different machines. So what is the best strategy given a limited amount of time… and money?

There are two extreme cases: no exploration, i.e. playing only one randomly chosen bandit, or no exploitation, i.e. playing all bandits randomly – so obviously we need some middle ground between those two extremes. We have to start with one randomly chosen bandit, try different ones after that and compare the results. So in the simplest case the first variable is the probability rate with which to switch to a random bandit – or to stick with the best bandit found so far.

Let us create an example with bandits, which return units on average, except the second one which returns units. So the best strategy would obviously be to choose the second bandit right away and stick with it, but of course we don’t know the average returns of each bandit so this won’t work. Instead we need another vector which tallies the results of each bandit so far. This vector has to be updated after each game, so we need an update function which gets as arguments the current bandit and the return of the game.

The intelligence of the strategy lies in this update function, so how should we go about? The big idea behind this strategy is called Bellman equation and in its simplest form it works as follows: the adjustment of the former result vector is the difference between the former result and the current result weighted by some discount factor, in this case the inverse of the number of games played on the respective machine. This learning strategy is called Q-learning and is a so called reinforcement learning technique.

Have a look at the following example implementation:

set.seed(3141) # for reproducibility

# Q-learning update function
update <- function(i, r) {
Q[i] <<- Q[i] + 1/(k[i]+1) * (r-Q[i]) # Q-learning function
k[i] <<- k[i] + 1 # one more game played on i'th bandit
}

# simulate game on one-armed bandit i
ret <- function(i) {
round(rnorm(1, mean = rets[i]))
}

# chose which bandit to play
which.bandit <- function() {
p <- runif(1)
ifelse(p >= epsilon, which.max(Q), sample(1:n, 1))
}

epsilon <- 0.1 # switch in epsilon percent of cases
rets <- c(4, 5, 4, 4, 4) # average returns of bandits
n <- length(rets)
Q <- rep(0, n) # initialize return vector
k <- rep(0, n) # initialize vector for games played on each bandit
N <- 1000 # no. of runs
R <- 0 # sum of returns

for (j in 1:N) {
i <- which.bandit() # chose bandit
r <- ret(i) # simulate bandit
R <- R + r # add return of bandit to overall sum of returns
update(i, r) # calling Q-learning update function
}

which.max(Q) # which bandit has the highest return?
## [1] 2

Q
## [1] 4.000000 5.040481 4.090909 4.214286 3.611111

k
## [1]  32 914  22  14  18

N * max(rets) # theoretical max. return
## [1] 5000

R
## [1] 4949

R / (N * max(rets)) # percent reached of theoretical max
## [1] 0.9898


So, obviously the algorithm found a nearly perfect strategy all on its own!

Now, this is the simplest possible application of reinforcement learning. Let us now implement a more sophisticated example: a robot navigating a maze. Whereas the difficulty in the first example was that the feedback was blurred (because the return of each one-armed bandit is only an average return) here we only get definitive feedback after several steps (when the robot reaches its goal). Because this situation is more complicated we need more memory to store the intermediate results. In our multi-armed bandit example the memory was a vector, here we will need a matrix.

The robot will try to reach the goal in the following maze (i.e. to get out of the maze to reach F which can be done via B or E) and find the best strategy for each room it is placed in:

Have a look at the code (it is based on the Matlab code from the same tutorial the picture is from, which is why the names of variables and functions are called the same way to ensure consistency):

# find all possible actions
AllActions <- function(state, R) {
which(R[state, ] >= 0)
}

# chose one action out of all possible actions by chance
PossibleAction <- function(state, R) {
sample(AllActions(state, R), 1)
}

# Q-learning function
UpdateQ <- function(state, Q, R, gamma, goalstate) {
action <- PossibleAction(state, R)
Q[state, action] <- R[state, action] + gamma * max(Q[action, AllActions(action, R)]) # Bellman equation (learning rate implicitly = 1)
if(action != goalstate) Q <- UpdateQ(action, Q, R, gamma, goalstate)
Q
}

# recursive function to get the action with the the maximum Q value
MaxQ <- function(state, Q, goalstate) {
action <- which.max(Q[state[length(state)], ])
if (action != goalstate) action <- c(action, MaxQ(action, Q, goalstate))
action
}

# representation of the maze
R <- matrix(c(-Inf, -Inf, -Inf, -Inf,    0, -Inf,
-Inf, -Inf, -Inf,    0, -Inf,  100,
-Inf, -Inf, -Inf,    0, -Inf, -Inf,
-Inf,    0,    0, -Inf,    0, -Inf,
0, -Inf, -Inf,    0, -Inf,  100,
-Inf,    0, -Inf, -Inf,    0,  100), ncol = 6, byrow = TRUE)
colnames(R) <- rownames(R) <- LETTERS[1:6]
R
##      A    B    C    D    E    F
## A -Inf -Inf -Inf -Inf    0 -Inf
## B -Inf -Inf -Inf    0 -Inf  100
## C -Inf -Inf -Inf    0 -Inf -Inf
## D -Inf    0    0 -Inf    0 -Inf
## E    0 -Inf -Inf    0 -Inf  100
## F -Inf    0 -Inf -Inf    0  100

Q <- matrix(0, nrow = nrow(R), ncol = ncol(R))
colnames(Q) <- rownames(Q) <- LETTERS[1:6]
Q
##   A B C D E F
## A 0 0 0 0 0 0
## B 0 0 0 0 0 0
## C 0 0 0 0 0 0
## D 0 0 0 0 0 0
## E 0 0 0 0 0 0
## F 0 0 0 0 0 0

gamma <- 0.8 # learning rate
goalstate <- 6
N <- 50000 # no. of episodes

for (episode in 1:N) {
state <- sample(1:goalstate, 1)
Q <- UpdateQ(state, Q, R, gamma, goalstate)
}

Q
##      A    B    C    D   E   F
## A -Inf -Inf -Inf -Inf 400   0
## B    0    0    0  320   0 500
## C -Inf -Inf -Inf  320   0   0
## D    0  400  256    0 400   0
## E  320    0    0  320   0 500
## F    0  400    0    0 400 500

Q / max(Q) * 100
##      A    B    C    D  E   F
## A -Inf -Inf -Inf -Inf 80   0
## B    0    0  0.0   64  0 100
## C -Inf -Inf -Inf   64  0   0
## D    0   80 51.2    0 80   0
## E   64    0  0.0   64  0 100
## F    0   80  0.0    0 80 100

# print all learned routes for all rooms
for (i in 1:goalstate) {
cat(LETTERS[i], LETTERS[MaxQ(i, Q, goalstate)], sep = " -> ")
cat("\n")
}
## A -> E -> F
## B -> F
## C -> D -> B -> F
## D -> B -> F
## E -> F
## F -> F


So again, the algorithm has found the best route for each room!

Recently the combination of Neural Networks (see also Understanding the Magic of Neural Networks) and Reinforcement Learning has become quite popular. For example AlphaGo, the machine from Google that defeated a Go world champion for the first time in history is based on this powerful combination!

## Teach R to read handwritten Digits with just 4 Lines of Code

What is the best way for me to find out whether you are rich or poor, when the only thing I know is your address? Looking at your neighbourhood! That is the big idea behind the k-nearest neighbour (or KNN) algorithm, where k stands for the number of neighbours to look at. The idea couldn’t be any simpler yet the results are often very impressive indeed – so read on…

Let us take a task that is very hard to code, like identifying handwritten numbers. We will be using the Semeion Handwritten Digit Data Set from the UCI Machine Learning Repository and are separating training and test set for the upcoming task in the first step:

# helper function for plotting images of digits in a nice way + returning the respective number
plot_digit <- function(digit) {
M <- matrix(as.numeric(digit[1:256]), nrow = 16, ncol = 16, byrow = TRUE)
image(t(M[nrow(M):1, ]), col = c(0,1), xaxt = "n", yaxt = "n", useRaster = TRUE)
digit[257]
}

# load data and chose some digits as examples
semeion <- read.table("data/semeion.data", quote = "\"", comment.char = "") # put in right path here!
digit_data <- semeion[ , 1:256]
which_digit <- apply(semeion[ , 257:266], 1, function(x) which.max(x) - 1)
semeion_new <- cbind(digit_data, which_digit)

# chose training and test set by chance
set.seed(123) # for reproducibility
data <- semeion_new
random <- sample(1:nrow(data), 0.8 * nrow(data)) # 80%: training data, 20%: test data
train <- data[random, ]
test <- data[-random, ]

# plot example digits
old_par <- par(mfrow = c(4, 6), oma = c(5, 4, 0, 0) + 0.1, mar = c(0, 0, 1, 1) + 0.1)
matrix(apply(train[1:24, ], 1, plot_digit), 4, 6, byrow = TRUE)


##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    3    1    2    5    7    3
## [2,]    1    5    1    6    7    6
## [3,]    6    2    8    5    9    3
## [4,]    5    7    5    7    5    9

par(old_par)


As you can see teaching a computer to read those digits is a task which would take considerable effort and easily hundreds of lines of code. You would have to intelligently identify different regions in the images and find some boundaries to try to identify which number is being shown. You could expect to do a lot of tweaking before you would get acceptable results.

The real magic behind machine learning and artificial intelligence is that when something is too complicated to code let the machine program itself by just showing it lots of examples (see also my post So, what is AI really?). We will do just that with the nearest neighbour algorithm.

When talking about neighbours it is implied already that we need some kind of distance metric to define what constitutes a neighbour. As in real life the simplest one is the so called Euclidean distance which is just how far different points are apart from each other as the crow flies. The simple formula that is used for this is just the good old Pythagorean theorem (in this case in a vectorized way) – you can see what maths at school was good for after all:

dist_eucl <- function(x1, x2) {
sqrt(sum((x1 - x2) ^ 2)) # Pythagorean theorem!
}


The k-nearest neighbours algorithm is pretty straight forward: it just compares the digit which is to be identified with all other digits and choses the k nearest ones. In case that the k nearest ones don’t come up with the same answer the majority vote (or mathematically the mode) is taken:

mode <- function(NNs) {
names(sort(-table(NNs[ncol(NNs)])))[1] # mode = majority vote
}
knn <- function(train, test, k = 5) {
dist_sort <- order(apply(train[-ncol(train)], 1, function(x) dist_eucl(as.numeric(x), x2 = as.numeric(test[-ncol(test)]))))
mode(train[dist_sort[1:k], ])
}


So, the algorithm itself comprises barely 4 lines of code! Now, let us see how it performs on this complicated task with k = 9 out of sample (first a few examples are shown and after that we have a look at the overall performance):

# show a few examples
set.seed(123) # for reproducibility
no_examples <- 24
examples <- sample(dim(test)[1], no_examples)
old_par <- par(mfrow = c(4, 6), oma = c(5, 4, 0, 0) + 0.1, mar = c(0, 0, 1, 1) + 0.1)
matrix(apply(test[examples, ], 1, plot_digit), 4, 6, byrow = TRUE)


##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    4    1    1    5    7    3
## [2,]    0    5    1    4    7    3
## [3,]    6    2    7    4    0    2
## [4,]    5    5    3    6    3    7

par(old_par)
prediction <- integer(no_examples)
for (i in 1:no_examples) {
prediction[i] <- knn(train, test[examples[i], ], k = 9)
}
print(matrix(prediction, 4, 6, byrow = TRUE), quote = FALSE, right = TRUE)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    4    1    1    5    7    3
## [2,]    0    5    1    4    7    3
## [3,]    6    2    7    4    0    2
## [4,]    5    5    3    6    3    7

# now for the overall accuracy
library(OneR) # just for eval_model function to evaluate the model's accuracy

prediction <- integer(nrow(test))

ptm <- proc.time()
for (i in 1:nrow(test)) {
prediction[i] <- knn(train, test[i, ], k = 9)
}
proc.time() - ptm
##    user  system elapsed
##   26.74    0.82   27.59

eval_model(prediction, test[ncol(test)], zero.print = ".")
##
## Confusion matrix (absolute):
##           Actual
## Prediction   0   1   2   3   4   5   6   7   8   9 Sum
##        0    34   .   .   .   .   .   1   .   .   .  35
##        1     .  36   1   .   2   .   .   1   .   1  41
##        2     .   .  36   .   .   .   .   .   1   1  38
##        3     .   1   .  32   .   .   .   .   .   2  35
##        4     .   .   .   .  29   .   .   .   .   .  29
##        5     .   .   .   .   .  35   2   .   1   .  38
##        6     .   .   .   .   .   1  23   .   .   .  24
##        7     .   .   .   .   .   .   .  22   .   1  23
##        8     .   .   .   .   .   .   .   .  31   .  31
##        9     .   .   .   .   .   .   .   .   2  23  25
##        Sum  34  37  37  32  31  36  26  23  35  28 319
##
## Confusion matrix (relative):
##           Actual
## Prediction    0    1    2    3    4    5    6    7    8    9  Sum
##        0   0.11    .    .    .    .    .    .    .    .    . 0.11
##        1      . 0.11    .    . 0.01    .    .    .    .    . 0.13
##        2      .    . 0.11    .    .    .    .    .    .    . 0.12
##        3      .    .    . 0.10    .    .    .    .    . 0.01 0.11
##        4      .    .    .    . 0.09    .    .    .    .    . 0.09
##        5      .    .    .    .    . 0.11 0.01    .    .    . 0.12
##        6      .    .    .    .    .    . 0.07    .    .    . 0.08
##        7      .    .    .    .    .    .    . 0.07    .    . 0.07
##        8      .    .    .    .    .    .    .    . 0.10    . 0.10
##        9      .    .    .    .    .    .    .    . 0.01 0.07 0.08
##        Sum 0.11 0.12 0.12 0.10 0.10 0.11 0.08 0.07 0.11 0.09 1.00
##
## Accuracy:
## 0.9436 (301/319)
##
## Error rate:
## 0.0564 (18/319)
##
## Error rate reduction (vs. base rate):
## 0.9362 (p-value < 2.2e-16)


Wow, it achieves an accuracy of nearly 95% out of the box while some of the digits are really hard to read even for humans! And we haven’t even given it the information that those images are two-dimensional because we coded all the images simply as (one-dimensional) binary numbers.

To get the idea where it failed have a look at the digits that were misclassified:

# show misclassified digits
err <- which(as.integer(prediction) != unlist(test[ncol(test)]))
old_par <- par(mfrow = c(3, 6), oma = c(5, 4, 0, 0) + 0.1, mar = c(0, 0, 1, 1) + 0.1)
matrix(apply(test[err, ], 1, plot_digit), 3, 6, byrow = TRUE)


##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    6    9    8    9    5
## [2,]    6    6    7    4    8    8
## [3,]    9    9    1    4    8    9

par(old_par)

# show what was predicted
print(matrix(prediction[err], 3, 6, byrow = TRUE), quote = FALSE, right = TRUE)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    5    1    9    3    6
## [2,]    5    0    1    1    9    5
## [3,]    3    7    3    1    2    2


Most of us would have difficulties reading at least some of those digits too, e.g. the third digit in the first row is supposed to be a 9, yet it could also be a distorted 1 – same with the first digit in the last row: some people would read a 3 (like our little program) or nothing at all really, but it is supposed to be a 9. So even the mistakes the system makes are understandable.

Sometimes the simplest methods are – perhaps not the best but – very effective indeed, you should keep that in mind!

## Creating a Movie with Data from Outer Space in R

The Rosetta mission of the European Space Agency (ESA) is one of the greatest (yet underappreciated) triumphs of humankind: it was launched in 2004 and landed the spacecraft Philae ten years later on a small comet, named 67P/Churyumov–Gerasimenko (for the whole timeline of the mission see here: Timeline of Rosetta spacecraft).

ESA provided the world with datasets of the comet which we will use to create an animated gif in R… so read on!

The only prerequisites that we need is the rgl package (on CRAN) and an installed version of ImageMagick: http://www.imagemagick.org/script/download.php.

The script itself is very short due to the powerful functions of the rgl package, the dataset is loaded directly from the internet site of ESA:

library(rgl)

open3d()
## wgl
##   1

rgl.bg(color = c("black"))
movie3d(spin3d(axis = c(0, 0, 1)), duration = 12, dir = getwd())
## Writing 'movie000.png'
## Writing 'movie001.png'
## Writing 'movie002.png'
## Writing 'movie003.png'
[...]
## Writing 'movie118.png'
## Writing 'movie119.png'
## Writing 'movie120.png'

system("cmd.exe", input = paste("cd", getwd(), "&& convert -loop 0 movie.gif philae.gif")) # change GIF animation cycles to inf
Microsoft Windows [Version 10.0.17134.706]