## Customers who bought…

One of the classic examples in data science (called data mining at the time) is the beer and diapers example: when a big supermarket chain started analyzing their sales data they encountered not only trivial patterns, like toothbrushes and toothpaste being bought together, but also quite strange combinations like beer and diapers. Now, the trivial ones are reassuring that the method works but what about the more extravagant ones? Does it mean that young parents are alcoholics? Or that instead of breastfeeding they give their babies beer? Obviously, they had to get to the bottom of this.

As it turned out in many cases they following happened: stressed out mummy sends young daddy to the supermarket because they had run out of diapers. Young daddy seizes the opportunity to not only buy the much needed diapers but also to stock up on some beer! So what the supermarket did was to place the beer directly on the way from the aisle with the diapers – the result was a significant boost in beer sales (for all the young daddies who might have forgotten what they really wanted when buying diapers…).

So, to reproduce this example in a simplified way have a look at the following code:

# some example data for items bought together (market baskets)
Diapers <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0)
Baby_Oil <- c(1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0)
Ham <- c(rep(0, 6), rep(1, 2), rep(0, 7))
Beer <- c(rep(0, 3), 1, rep(0, 11))
(basket <- cbind(Diapers, Baby_Oil, Ham, Beer))
##       Diapers Baby_Oil Ham Beer
##  [1,]       1        1   0    0
##  [2,]       0        1   0    0
##  [3,]       1        1   0    0
##  [4,]       1        0   0    1
##  [5,]       0        0   0    0
##  [6,]       1        1   0    0
##  [7,]       1        0   1    0
##  [8,]       0        0   1    0
##  [9,]       1        1   0    0
## [10,]       1        1   0    0
## [11,]       0        1   0    0
## [12,]       0        1   0    0
## [13,]       1        1   0    0
## [14,]       0        0   0    0
## [15,]       0        0   0    0

# analysis of items bought together
round(cor_basket <- cor(basket), 2) # cor is the core of the method! (no pun intended)
##          Diapers Baby_Oil   Ham  Beer
## Diapers     1.00     0.33 -0.03  0.25
## Baby_Oil    0.33     1.00 -0.48 -0.33
## Ham        -0.03    -0.48  1.00 -0.10
## Beer        0.25    -0.33 -0.10  1.00

diag(cor_basket) <- 0 # we don't want to recommend the same products to the customers who already bought them
##          Diapers Baby_Oil   Ham  Beer
## Diapers     0.00     0.33 -0.03  0.25
## Baby_Oil    0.33     0.00 -0.48 -0.33
## Ham        -0.03    -0.48  0.00 -0.10
## Beer        0.25    -0.33 -0.10  0.00

# printing items bought together
col <- cor_basket[ , i, drop = FALSE]
col <- col[order(col, decreasing = TRUE), , drop = FALSE]
cat("Customers who bought", colnames(col), "also bought", rownames(col)[col > 0], "\n")
}
## Customers who bought Diapers also bought Baby_Oil Beer
## Customers who bought Baby_Oil also bought Diapers
## Customers who bought Ham also bought
## Customers who bought Beer also bought Diapers


What we are looking for is some kind of dependance pattern within the shopping baskets, in this case we use the good old correlation function. Traditionally other (dependance) measures are used, namely support, confidence and lift. We will come to that later on in this post.

Wikipedia offers the following fitting description of association rule learning:

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify rules discovered in databases using some measures of interestingness.

For example, the rule found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat.

Such information can be used as the basis for decisions about marketing activities such as, e.g. promotional pricing or product placements.

In addition to the above example from market basket analysis association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics.

So, this is also the basis of popular functions on ecommerce sites (“customers who bought this item also bought…”) or movie streaming platforms (“customers who watched this film also watched…”).

A very good package for real-world datasets is the arules package (on CRAN). Have a look at the following code:

library(arules)
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
##     abbreviate, write

data("Groceries")
rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules_conf <- arules::sort(rules, by = "confidence", decreasing = TRUE)
##      lhs                     rhs                    support confidence     lift count
## [1]  {rice,
##       sugar}              => {whole milk}       0.001220132          1 3.913649    12
## [2]  {canned fish,
##       hygiene articles}   => {whole milk}       0.001118454          1 3.913649    11
## [3]  {root vegetables,
##       butter,
##       rice}               => {whole milk}       0.001016777          1 3.913649    10
## [4]  {root vegetables,
##       whipped/sour cream,
##       flour}              => {whole milk}       0.001728521          1 3.913649    17
## [5]  {butter,
##       soft cheese,
##       domestic eggs}      => {whole milk}       0.001016777          1 3.913649    10
## [6]  {citrus fruit,
##       root vegetables,
##       soft cheese}        => {other vegetables} 0.001016777          1 5.168156    10
## [7]  {pip fruit,
##       butter,
##       hygiene articles}   => {whole milk}       0.001016777          1 3.913649    10
## [8]  {root vegetables,
##       whipped/sour cream,
##       hygiene articles}   => {whole milk}       0.001016777          1 3.913649    10
## [9]  {pip fruit,
##       root vegetables,
##       hygiene articles}   => {whole milk}       0.001016777          1 3.913649    10
## [10] {cream cheese ,
##       domestic eggs,
##       sugar}              => {whole milk}       0.001118454          1 3.913649    1


The algorithm used here is the so called Apriori algorithm. It ameliorates the problem with real-world datasets that when you want to test all combinations of all possible items you very soon run into performance problems – even with very fast computers – because there are just too many possibilities to be tested.

The Apriori algorithm aggressively prunes the possibilities to be tested by making use of the fact that if you are only interested in rules that are supported by a certain number of instances you can start with testing the support of individual items – which is easy to do – and work your way up to more complicated rules.

The trick is that you don’t test more complicated rules with items which don’t have enough support on the individual level: this is because if you don’t have enough instances on the individual level you don’t even have to look at more complicated combinations with those items included (which would be even more scarce). What sounds like an obvious point brings about huge time savings for real-world datasets which couldn’t be analyzed without this trick.

As mentioned above important concepts to assess the quality (or interestingness) of association rules are support, confidence and lift:

• Support of : percentage of X for all cases
• Confidence of : percentage of Y for all X
• Lift of : ratio of the observed support of X and Y to what would be expected if X and Y were independent

To understand these concepts better we are going to rebuild the examples given in the Wikipedia article in R. Have a look at the parts “Definition” and “Useful Concepts” of the article and after that at the following code, which should be self-explanatory:

M <- matrix(c(1, 1, 0, 0, 0,
0, 0, 1, 0, 0,
0, 0, 0, 1, 1,
1, 1, 1, 0, 0,
0, 1, 0, 0, 0), ncol = 5, byrow = TRUE)
colnames(M) <- c("milk", "bread", "butter", "beer", "diapers")
M
##      milk bread butter beer diapers
## [1,]    1     1      0    0       0
## [2,]    0     0      1    0       0
## [3,]    0     0      0    1       1
## [4,]    1     1      1    0       0
## [5,]    0     1      0    0       0

supp <- function(X) {
sum(rowSums(M[ , X, drop = FALSE]) == length(X)) / nrow(M) # "rowSums == length" mimics logical AND for the selected columns
}
conf <- function(X, Y) {
supp(c(X, Y)) / supp(X) # conf(X => Y)
}
lift <- function(X, Y) {
supp(c(X, Y)) / (supp(X) * supp(Y)) # lift(X => Y)
}

supp(c("beer", "diapers"))         # percentage of X for all cases
## [1] 0.2

conf(c("butter", "bread"), "milk") # percentage of Y for all X
## [1] 1

lift(c("milk", "bread"), "butter") # ratio of the observed support of X and Y to what would be expected if X and Y were independent
## [1] 1.25


You should conduct your own experiments by playing around with different item combinations so that you really understand the mechanics of those important concepts.

If all of those analyses are being done for perfecting your customer experience or just outright manipulation to lure you into buying stuff you don’t really need is obviously a matter of perspective…

## So, what is AI really?

One of the topics that is totally hyped at the moment is obviously Artificial Intelligence or AI for short. There are many self-proclaimed experts running around trying to sell you the stuff they have been doing all along under this new label. When you ask them what AI means you will normally get some convoluted explanations (which is a good sign that they don’t get it themselves) and some “success stories”. The truth is that many of those talking heads don’t really know what they are talking about, yet happen to have a friend who knows somebody who picked up a book at the local station bookshop… ok, that was nasty but unfortunately often not too far away from the truth.

So, what is AI really? This post tries to give some guidance.

The traditional way coding a computer program worked was through carefully analyzing the problem, trying to separate its different parts into simpler sub-problems, put the solution into an algorithmic form (kind of a recipe) and finally code it. Let’s have a look at an example!

Let’s say you want to program an app for mushroom pickers with warnings for certain qualities. The idea is to find an easy to follow guide in the form of “if your mushroom has this quality then DO NOT eat it!” (in computer lingo called “conditional statements” or just “conditionals”). As any mushroom picker can attest this is not an easy task. For the matter have a look at the following dataset (you can find it here: mushrooms, originally it is from https://archive.ics.uci.edu/ml/datasets/mushroom):

mushrooms <- read.csv("data/mushrooms.csv")
str(mushrooms)
## 'data.frame':    8124 obs. of  23 variables:
##  $cap_shape : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ... ##$ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
##  $cap_color : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ... ##$ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $odor : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ... ##$ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
##  $gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ... ##$ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $gill_color : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ... ##$ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $stalk_root : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ... ##$ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ... ##$ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $stalk_color_below_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ... ##$ veil_type               : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
##  $veil_color : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ... ##$ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ring_type : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ... ##$ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
##  $population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ... ##$ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
##  $type : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ... head(mushrooms) ## cap_shape cap_surface cap_color bruises odor gill_attachment ## 1 convex smooth brown yes pungent free ## 2 convex smooth yellow yes almond free ## 3 bell smooth white yes anise free ## 4 convex scaly white yes pungent free ## 5 convex smooth gray no none free ## 6 convex scaly yellow yes almond free ## gill_spacing gill_size gill_color stalk_shape stalk_root ## 1 close narrow black enlarging equal ## 2 close broad black enlarging club ## 3 close broad brown enlarging club ## 4 close narrow brown enlarging equal ## 5 crowded broad black tapering equal ## 6 close broad brown enlarging club ## stalk_surface_above_ring stalk_surface_below_ring stalk_color_above_ring ## 1 smooth smooth white ## 2 smooth smooth white ## 3 smooth smooth white ## 4 smooth smooth white ## 5 smooth smooth white ## 6 smooth smooth white ## stalk_color_below_ring veil_type veil_color ring_number ring_type ## 1 white partial white one pendant ## 2 white partial white one pendant ## 3 white partial white one pendant ## 4 white partial white one pendant ## 5 white partial white one evanescent ## 6 white partial white one pendant ## spore_print_color population habitat type ## 1 black scattered urban poisonous ## 2 brown numerous grasses edible ## 3 brown numerous meadows edible ## 4 black scattered urban poisonous ## 5 brown abundant grasses edible ## 6 black numerous grasses edible  The dataset consists of 8124 examples of mushrooms with 22 qualities each (plus the attribute whether the respective mushroom is edible or poisonous). Well, obviously this is not going to be easy… A naive approach would be to formulate rules for every instance: if the cap shape is convex and the cap surface smooth and the cap colour brown… and so on for all 22 attributes, then DO NOT eat it! This would obviously not be very helpful. Another approach would be to go through every attribute and see whether it is helpful in determining the type, so for example: table(mushrooms$cap_shape, mushrooms$type) ## ## edible poisonous ## bell 404 48 ## conical 0 4 ## convex 1948 1708 ## flat 1596 1556 ## knobbed 228 600 ## sunken 32 0  Obviously this attribute isn’t very helpful, in many cases it just gives a “maybe, maybe not”-answer. Perhaps the approach itself is not so bad after all but you would have to try it for all 22 attributes, interpret the results, pick the best one, formulate if-then-rules and code them… tiresome and error-prone. Wouldn’t it be nice to do it the other way around: just show the computer all of the examples and it magically programs itself by finding the appropriate if-then-rules automatically? This is what AI is all about: Artificial Intelligence (AI): Showing a computer examples of a problem so that it programs itself to solve it. So, let us throw AI at our problem in the form of the OneR package (on CRAN): library(OneR) OneR(mushrooms, verbose = TRUE) ## ## Attribute Accuracy ## 1 * odor 98.52% ## 2 spore_print_color 86.8% ## 3 gill_color 80.5% ## 4 ring_type 77.55% ## 5 stalk_surface_above_ring 77.45% ## 6 stalk_surface_below_ring 76.61% ## 7 gill_size 75.63% ## 8 bruises 74.4% ## 9 population 72.18% ## 10 stalk_color_above_ring 71.64% ## 11 stalk_color_below_ring 71.44% ## 12 habitat 69.03% ## 13 stalk_root 64.6% ## 14 gill_spacing 61.6% ## 15 cap_color 59.53% ## 16 cap_surface 58.05% ## 17 cap_shape 56.43% ## 18 stalk_shape 55.29% ## 19 ring_number 53.82% ## 20 veil_color 51.9% ## 21 gill_attachment 51.8% ## 21 veil_type 51.8% ## --- ## Chosen attribute due to accuracy ## and ties method (if applicable): '*' ## ## Call: ## OneR.data.frame(x = mushrooms, verbose = TRUE) ## ## Rules: ## If odor = almond then type = edible ## If odor = anise then type = edible ## If odor = creosote then type = poisonous ## If odor = fishy then type = poisonous ## If odor = foul then type = poisonous ## If odor = musty then type = poisonous ## If odor = none then type = edible ## If odor = pungent then type = poisonous ## If odor = spicy then type = poisonous ## ## Accuracy: ## 8004 of 8124 instances classified correctly (98.52%)  Wow! Within the blink of an eye and with just one command (OneR) we got all the rules we need for our app! It is the odour: if it smells poisonous it probably is. The accuracy of this is nearly 99%, not too bad for such a simple rule… we wouldn’t even need an app for that. For more examples, some deeper explanation (and even a video) on the OneR package go here: OneR – Establishing a New Baseline for Machine Learning Classification Models. By the way, in the words “programs itself” – impressive as it may be – is still the term “programming”, so we are not talking about machines developing intentions, feelings or even consciousness anytime soon. This is the domain of Hollywood and not AI! As with every hype there are many terms flying around, like Machine Learning (ML), Data Science and (Predictive) Analytics and somewhat older terms, like Data Mining, Knowledge Discovery and Business Intelligence… and many, many more. Of course you can define them in all sorts of ways but to be honest with you in essence they all mean the same (see definition above). I would argue that Data Science is a somewhat broader term which comprises also e.g. the handling of data, interpretation by domain experts and presentation to the business but that is just one definition. There is an old joke that what is Machine Learning when put on powerpoint becomes Artificial Intelligence – it just sounds so much better than any technical term. We can now answer another question: why now? Why is there a hype now? I will share a secret with you: most of the AI methods used today are quite old! For example the core principles of (artificial) neural networks (a.k.a. deep learning) are from the 60’s of the last century! Now, more than half a century later we’ve got the hype. The reason is: It’s the data, stupid! Because AI is all about learning from examples, we need lots and lots of data and because of the internet and mobile revolution we are now drowning in those data. Combine this with more and more powerful hardware (also in the form of cheap cloud services) and you’ve got a revolution at your hands. And there is yet another lesson to be learned: when I said “we” are drowning in data it was not entirely correct: tech companies like Google, Amazon, Facebook, Apple (GAFA) and their Chinese counterparts, like Baidu, Alibaba, and Tencent (BAT) are drowning in our (!) data. This is the reason why they are also leading the field of AI and not because they necessarily have the brightest AI geniuses. This point is best illustrated by the example of DeepL: a small German startup in the area of machine translation. Many tests conducted by professional translators came to the conclusion that the translations by DeepL are far better than any other machine translations (like Google Translate). Why? Not because there necessarily is a secret sauce in their algorithms but because the company behind it (Linguee) had been amassing hand-curated bilingual language pairs for many, many years. The quality of their data is just so much better than of all the other translation services out there. This is their main differentiator. That is not to say that there haven’t been any major developments in the area of AI algorithms or computer hardware or that you don’t need bright people to set up and finetune those systems (quite to the contrary!) but many of the algorithms are open-source and computing power is relatively cheap via cloud services and, yes, there is a shortage of good people at the moment but there are still many highly qualified data scientists around: the real difference is in (the quality and amount of) the data! One last thing: many people think that AI is more objective than humans, after all its maths and cool logic, right? Wrong! When you only learn from examples and the examples given are racist the learned rules will be racist too! A very good book on the subject is Weapons of Math Destruction by Cathy O’Neal. Hope that this gave you some perspective on the ongoing hype… perhaps hype is not such a good word after all because when you look at the underlying reasons you can see that this mega trend is here to stay! ## Learning Data Science: Modelling Basics Data Science is all about building good models, so let us start by building a very simple model: we want to predict monthly income from age (in a later post we will see that age is indeed a good predictor for income). For illustrative purposes we just make up some numbers for age and income, load both into a data frame, plot it and calculate its correlation matrix: age <- c(21, 46, 55, 35, 28) income <- c(1850, 2500, 2560, 2230, 1800) data <- data.frame(age, income) plot(data, pch = 16)  cor(data) # correlation ## age income ## age 1.0000000 0.9464183 ## income 0.9464183 1.0000000  Just by looking at the plot we can see that the points seem to be linearly dependent. The correlation matrix corroborates this with a correlation of nearly . To build a linear regression model we use the lm function (for linear model). To specify the model we use the standard R formula interface: on the left hand side comes the dependent variable, i.e. the variable that we want to predict (income) and on the right hand side comes the independent variable, i.e. the variable that we want to use to predict income, viz. age – in between both variables we put a tilde (~). After that we plot the model as a line and look at some model diagnostics: # building linear regression model LinReg <- lm(income ~ age) # assign model to variable abline(LinReg, col = "green", lwd = 2) # add regression line  LinReg # coefficients ## ## Call: ## lm(formula = income ~ age) ## ## Coefficients: ## (Intercept) age ## 1279.37 24.56 summary(LinReg) # model summary ## ## Call: ## lm(formula = income ~ age) ## ## Residuals: ## 1 2 3 4 5 ## 54.92 90.98 -70.04 91.12 -166.98 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1279.367 188.510 6.787 0.00654 ** ## age 24.558 4.838 5.076 0.01477 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 132.1 on 3 degrees of freedom ## Multiple R-squared: 0.8957, Adjusted R-squared: 0.8609 ## F-statistic: 25.77 on 1 and 3 DF, p-value: 0.01477  The general line equation is , where is the slope (in this case ) and the intercept (here ). The p-values in the last column show us that both parameters are significant, so with a high probability not due to randomness (more on statistical significance can be found here: From Coin Tosses to p-Hacking: Make Statistics Significant Again!). If we now want to predict the income of a 20-year old we just have to calculate . To do the same calculation in R we can conveniently use the predict function which does this calculation for us automatically. We can also use vectors as input and get several predictions at once. Important is that we use the independent variable with its name as a data frame, otherwise R will throw an error. # predicting income with linear model predict(LinReg, data.frame(age = 20)) ## 1 ## 1770.52 pred_LinReg <- predict(LinReg, data.frame(age = seq(from = 0, to = 80, by = 5))) names(pred_LinReg) <- seq(0, 80, 5) pred_LinReg ## 0 5 10 15 20 25 30 35 ## 1279.367 1402.155 1524.944 1647.732 1770.520 1893.308 2016.097 2138.885 ## 40 45 50 55 60 65 70 75 ## 2261.673 2384.461 2507.249 2630.038 2752.826 2875.614 2998.402 3121.190 ## 80 ## 3243.979  So far so good, but wouldn’t it be nice to actually have a model that is even more accurate, a model where the line actually goes through all the data points so that all the available information is being used. Let us do that with a so called polynomial regression (in red): # polynomial regression PolyReg <- lm(income ~ poly(age, 4, raw = TRUE)) lines(c(20:55), predict(PolyReg, data.frame(age = c(20:55))), col = "red")  pred_PolyReg <- predict(PolyReg, data.frame(age = seq(0, 80, 5))) names(pred_PolyReg) <- seq(0, 80, 5) pred_PolyReg ## 0 5 10 15 20 25 30 ## 19527.182 11103.509 5954.530 3164.488 1959.098 1705.541 1912.468 ## 35 40 45 50 55 60 65 ## 2230.000 2449.726 2504.705 2469.464 2560.000 3133.778 4689.732 ## 70 75 80 ## 7868.267 13451.255 22362.037  What do you think? Is this a better model? Well, there is no good reason why income should go up the younger or older you are, and there is no good reason why there should be a bump between and . In other words, this model takes the noise as legitimate input which actually makes it worse! To have a good model you always need some rigidity to generalize the data. This illustration is a good example of one of the big problems of machine learning: overfitting the data! The other extreme would be underfitting, which is included in the following plot as a blue line: # mean income as a model abline(h = mean(income), col = "blue") title(main = "All models are wrong, some are useful") legend("bottomright", c("Mean: Underfitting", "Linear fit", "Polynomial Reg.: Overfitting"), col = c("blue", "green", "red"), lwd = 3) mean(income) ## [1] 2188  We just took the mean value of income and claimed that no matter the age everybody would earn about . This also obviously doesn’t make the cut. In practice and in all data science projects you always have to look for a model that hits the sweet spot between over- and underfitting. This is as much an art as a science, although there are certain best practices like dividing the data into a training and a test set, cross validation etc. Another important point it that all models have their area of applicability. In physics for example quantum physics works well for the very small but not very good for the very big, it is the other way round for relativity theory. In this case we didn’t have data for people of very young or very old age, so it is in general very dangerous to make prediction for those people, i.e. for values outside of the original observation range (also called extrapolation – in contrast to interpolation = predictions within the original range). In case of the linear model we got an income of about for newborns (the intercept)! And the income gets higher the older people get, also something that doesn’t make sense. This behaviour was a lot worse for the non-linear model. So, here you can see that it is one thing to just let R run and spit something out and quite another one to find a good model and interpret it correctly – even for such a simple problem as this one. In a later post we will use real-world data to predict income brackets with OneR, decision trees and random forests, so stay tuned! Addendum The post on now online: Learning Data Science: Predicting Income Brackets! ## Understanding the Magic of Neural Networks Everything “neural” is (again) the latest craze in machine learning and artificial intelligence. Now what is the magic here? Let us dive directly into a (supposedly little silly) example: we have three protagonists in the fairy tail little red riding hood, the wolf, the grandmother and the woodcutter. They all have certain qualities and little red riding hood reacts in certain ways towards them. For example the grandmother has big eyes, is kindly and wrinkled – little red riding hood will approach her, kiss her on the cheek and offer her food (the behavior “flirt with” towards the woodcutter is a little sexist but we kept it to reproduce the original example from Jones, W. & Hoskins, J.: Back-Propagation, Byte, 1987). We will build and train a neural network which gets the qualities as inputs and little red riding wood’s behaviour as output, i.e. we train it to learn the adequate behaviour for each quality. Have a look at the following code and its output including the resulting plot: library(neuralnet) library(NeuralNetTools) # code qualities and actions qualities <- matrix (c(1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1), byrow = TRUE, nrow = 3) colnames(qualities) <- c("big_ears", "big_eyes", "big_teeth", "kindly", "wrinkled", "handsome") rownames(qualities) <- c("wolf", "grannie", "woodcutter") qualities ## big_ears big_eyes big_teeth kindly wrinkled handsome ## wolf 1 1 1 0 0 0 ## grannie 0 1 0 1 1 0 ## woodcutter 1 0 0 1 0 1 actions <- matrix (c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1), byrow = TRUE, nrow = 3) colnames(actions) <- c("run_away", "scream", "look_for_woodcutter", "kiss_on_cheek", "approach", "offer_food", "flirt_with") rownames(actions) <- rownames(qualities) actions ## run_away scream look_for_woodcutter kiss_on_cheek approach offer_food flirt_with ## wolf 1 1 1 0 0 0 0 ## grannie 0 0 0 1 1 1 0 ## woodcutter 0 0 0 1 0 1 1 data <- cbind(qualities, actions) # train the neural network (NN) set.seed(123) # for reproducibility neuralnetwork <- neuralnet(run_away + scream+look_for_woodcutter + kiss_on_cheek + approach + offer_food + flirt_with ~ big_ears + big_eyes + big_teeth + kindly + wrinkled + handsome, data = data, hidden = 3, exclude = c(1, 8, 15, 22, 26, 30, 34, 38, 42, 46), lifesign = "minimal", linear.output = FALSE) ## hidden: 3 thresh: 0.01 rep: 1/1 steps: 48 error: 0.01319 time: 0.01 secs # plot the NN par_bkp <- par(mar = c(0, 0, 0, 0)) # set different margin to minimize cutoff text plotnet(neuralnetwork, bias = FALSE)  par(par_bkp) # predict actions round(compute(neuralnetwork, qualities)$net.result)
##            [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## wolf          1    1    1    0    0    0    0
## grannie       0    0    0    1    1    1    0
## woodcutter    0    0    0    1    0    1    1


First the qualities and the actions are coded as binary variables in a data frame. After that the neural network is being trained with the qualities as input and the resulting behaviour as output (using the standard formula syntax). In the neuralnet function a few additional technical arguments are set which details won’t concern us here, they just simplify the process in this context). Then we plot the learned net and test it by providing it with the respective qualities: in all three cases it predicts the right actions. How did it learn those?

Let us look at the plot of the net. We see that there are two basic building blocks: neurons and weighted connections between them. We have one neuron for each quality and one neuron for each action. Between both layers we have a so called hidden layer with three neurons in this case. The learned strength between the neurons is shown by the thickness of the lines (whereby ‘black’ means positive and ‘grey’ negative weights). Please have a thorough look at those weights.

You might have noticed that although the net didn’t know anything about the three protagonists in our little story it nevertheless correctly built a representation of them: ‘H1’ (for Hidden 1) represents the wolf because its differentiating quality is ‘big teeth’ which leads to ‘run away’, ‘scream’ and ‘look for woodcutter’, by the same logic ‘H2’ is the woodcutter and ‘H3’ is the grandmother. So the net literally learned to connect the qualities with respective actions of little red riding hood by creating a representation of the three protagonists!

So an artificial neural network is obviously a network of neurons… so let us have a look at those neurons! Basically they are mathematical abstractions of real neurons in your brain. They consist of inputs and an output. The biologically inspired idea is that when the activation of the inputs surpasses a certain threshold the neuron fires. To be able to learn the neuron must, before summing up the inputs, adjust the inputs so that the output is not just arbitrary but matches some sensible result. What is ‘sensible’ you might ask. In a biological environment the answer is not always so clear cut but in our simple example here the neuron has just to match the output we provide it with (= supervised learning).

The following abstraction has all we need, inputs, weights, the sum function, a threshold after that and finally the output of the neuron:

Let us talk a little bit about what is going on here intuitively. First every input is taken, multiplied by its weight and all of this is summed up. Some of you might recognize this mathematical operation as a scalar product (also called dot product). Another mathematical definition of a scalar product is the following:

That is we multiply the length of two vectors by the cosine of the angle of those two vectors. What has cosine to do with it? The cosine of an angle becomes one when both vectors point into the same direction, it becomes zero when they are orthogonal and minus one when both point into opposite directions. Does this make sense? Well, I give you a litte (albeit crude) parable. When growing up there are basically three stages: first you are totally dependent on your parents, then comes puberty and you are against whatever they say or think and after some years you are truly independent (some never reach that stage…). What does “independent” mean here? It means that you agree with some of the things your parents say and think and you disagree with some other things. During puberty you are as dependent on your parents as during being a toddler – you just don’t realize that but in reality you, so to speak, only multiply everything your parents say or think times minus one!

What is the connection with cosine? Well, as a toddler both you and your parents tend to be aligned which gives one, during puberty both of you are aligned but in opposing directions which gives minus one and only as a grown up you are both independent which mathematically means that your vector in a way points in both directions at the same time which is only possible when it is orthogonal on the vector of your parents (you entered a new dimension, literally) – and that gives zero for the cosine.

So cosine is nothing but a measure of dependence – as is correlation by the way. So this setup ensures that the neuron learns the dependence (or correlation) structure between the inputs and the output! The step function is just a way to help it to decide on which side of the fence it wants to sit, to make the decision clearer whether to fire or not. To sum it up, an artificial neuron is a non-linear function (in this case a step function) on a scalar product of the inputs (fixed) and the weights (adaptable to be able to learn). By adapting the weights the neuron learns the dependence structure between inputs and output.

In R you code this idea of an artificial neuron as follows:

neuron <- function(input) ifelse(weights %*% input > 0, 1, 0)


Now let us use this idea in R by training an artificial neuron to classify points in a plane. Have a look at the following table:

Input 1 Input 2 Output
1 0 0
0 0 1
1 1 0
0 1 1

If you plot those points with the colour coded pattern you get the following picture:

The task for the neuron is to find a separating line and thereby classify the two groups. Have a look at the following code:

# inspired by Kubat: An Introduction to Machine Learning, p. 72
plot_line <- function(w, col = "blue", add = TRUE)
curve(-w[1] / w[2] * x - w[3] / w[2], xlim = c(-0.5, 1.5), ylim = c(-0.5, 1.5), col = col, lwd = 3, xlab = "Input 1", ylab = "Input 2", add = add)
neuron <- function(input) as.vector(ifelse(input %*% weights > 0, 1, 0)) # step function on scalar product of weights and input
eta <- 0.5 # learning rate

# examples
input <- matrix(c(1, 0,
0, 0,
1, 1,
0, 1), ncol = 2, byrow = TRUE)
input <- cbind(input, 1) # bias for intercept of line
output <- c(0, 1, 0, 1)
weights <- c(0.25, 0.2, 0.35) # random initial weights

points(input[ , 1:2], pch = 16, col = (output + 2))

# training of weights of neuron
for (example in 1:length(output)) {
weights <- weights + eta * (output[example] - neuron(input[example, ])) * input[example, ]
plot_line(weights)
}
plot_line(weights, col = "black")


# test: applying neuron on input
apply(input, 1, neuron)
## [1] 0 1 0 1


As you can see the result matches the desired output, graphically the black line is the end result and as you can see it separates the green from the red points: the neuron has learned this simple classification task. The blue lines are where the neuron starts from and where it is during training – they are not able to classify the points correctly.

The training, i.e. adapting the weights, takes places in this line:

weights <- weights + eta * (output[example] - neuron(input[example, ])) * input[example, ]


The idea is to compare the current output of the neuron with the wanted output, scale that by some learning factor (eta) and modify the weights accordingly. So if the output is too big make the weights smaller and vice versa. Do this for all examples (sometimes you need another loop to train the neuron with the examples several times) and that’s it. That is the core idea behind the ongoing revolution of neural networks!

Ok, so far we had a closer look at one part of neural networks, namely the neurons, let us now turn to the network structure (also called network topology). First, why do we need a whole network anyway when the neurons are already able to solve classification tasks? The answer is that they can do that only for very simple problems. For example the neuron above can only distinguish between linearly separable points, i.e. it can only draw lines. It fails in case of the simple problem of four points that are coloured green, red, red, green from top left to bottom right (try it yourself). We would need a non-linear function to separate the points. We have to combine several neurons to solve more complicated problems.

The biggest problem you have to overcome when you combine several neurons is how to adapt all the weights. You need a system how to attribute the error at the output layer to all the weights in the net. This had been a profound obstacle until an algorithm called backpropagation (also abbreviated backprop) was invented (or found). We won’t get into the details here but the general idea is to work backwards from the output layers through all of the hidden layers till one reaches the input layer and modify the weights according to their respective contribution to the resulting error. This is done several (sometimes millions of times) for all training examples until one achieves an acceptable error rate for the training data.

The result is that you get several layers of abstraction, so when you e.g. want to train a neural network to recognize certain faces you start with the raw input data in the form of pixels, these are automatically combined into abstract geometrical structures, after that the net detects certain elements of faces, like eyes and noses, and finally abstractions of certain faces are being rebuilt by the net. See the following picture (from nivdul.wordpress.com) for an illustration:

So far we have only coded very small examples of a neural networks. Real-world examples often have dozens of layers with thousands of neurons so that much more complicated patterns can be learned. The more layers there are the ‘deeper’ a net becomes… which is the reason why the current revolution in this field is called “deep learning” because there are so many hidden layers involved. Let us now look at a more realistic example: predicting whether a breast cell is malignant or benign.

Have a look at the following code:

library(OneR)
data(breastcancer)
data <- breastcancer
colnames(data) <- make.names(colnames(data))
data$Class <- as.integer(as.numeric(data$Class) - 1) # for compatibility with neuralnet
data <- na.omit(data)

# Divide training (80%) and test set (20%)
set.seed(12) # for reproducibility
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- data[random, ]
data_test <- data[-random, ]

# Train NN on training set
model_train <- neuralnet(Class ~., data = data_train, hidden = c(9, 9), lifesign = "minimal")
## hidden: 9, 9    thresh: 0.01    rep: 1/1    steps:    3784   error: 0.00524  time: 3.13 secs

# Plot net
plot(model_train, rep = "best")


# Use trained model to predict test set
prediction <- round(predict(model_train, data_test))
eval_model(prediction, data_test)
##
## Confusion matrix (absolute):
##           Actual
## Prediction   0   1 Sum
##        0    93   2  95
##        1     4  38  42
##        Sum  97  40 137
##
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.68 0.01 0.69
##        1   0.03 0.28 0.31
##        Sum 0.71 0.29 1.00
##
## Accuracy:
## 0.9562 (131/137)
##
## Error rate:
## 0.0438 (6/137)
##
## Error rate reduction (vs. base rate):
## 0.85 (p-value = 1.298e-13)


So you see that a relatively simple net achieves an accuracy of about 95% out of sample. The code itself should be mostly self-explanatory. For the actual training the neuralnet function from the package with the same name is being used, the input method is the standard R formula interface, where you define Class as the variable to be predicted by using all the other variables (coded as .~).

When you look at the net one thing might strike you as odd: there are three neurons at the top with a fixed value of 1. These are so called bias neurons and they serve a similar purpose as the intercept in a linear regression: they kind of shift the model as a whole in n-dimensional feature space just as a regression line is being shifted by the intercept. In case you were attentive we also smuggled in a bias neuron in the above example of a single neuron: it is the last column of the input matrix which contains only ones.

Another thing: as can even be seen in this simple example it is very hard to find out what a neural network has actually learned – the following well-known anecdote (urban legend?) shall serve as a warning: some time ago the military built a system which had the aim to distinguish military vehicles from civilian ones. They chose a neural network approach and trained the system with pictures of tanks, humvees and missile launchers on the one hand and normal cars, pickups and lorries on the other. After having reached a satisfactory accuracy they brought the system into the field (quite literally). It failed completely, performing no better than a coin toss. What had happened? No one knew, so they re-engineered the black box (no small feat in itself) and found that most of the military pics where taken at dusk or dawn and most civilian pics under brighter weather conditions. The neural net had learned the difference between light and dark!

Just for comparison the same example with the OneR package:

data(breastcancer)
data <- breastcancer

# Divide training (80%) and test set (20%)
set.seed(12) # for reproducibility
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- optbin(data[random, ], method = "infogain")
## Warning in optbin.data.frame(data[random, ], method = "infogain"): 12
## instance(s) removed due to missing values
data_test <- data[-random, ]

# Train OneR model on training set
model_train <- OneR(data_train, verbose = TRUE)
##
##     Attribute                   Accuracy
## 1 * Uniformity of Cell Size     92.32%
## 2   Uniformity of Cell Shape    91.59%
## 3   Bare Nuclei                 90.68%
## 4   Bland Chromatin             90.31%
## 5   Normal Nucleoli             90.13%
## 6   Single Epithelial Cell Size 89.4%
## 8   Clump Thickness             84.28%
## 9   Mitoses                     78.24%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

# Show model and diagnostics
summary(model_train)
##
## Call:
## OneR.data.frame(x = data_train, verbose = TRUE)
##
## Rules:
## If Uniformity of Cell Size = (0.991,2] then Class = benign
## If Uniformity of Cell Size = (2,10]    then Class = malignant
##
## Accuracy:
## 505 of 547 instances classified correctly (92.32%)
##
## Contingency table:
##            Uniformity of Cell Size
## Class       (0.991,2] (2,10] Sum
##   benign        * 318     30 348
##   malignant        12  * 187 199
##   Sum             330    217 547
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 381.78243, df = 1, p-value < 0.00000000000000022204

# Plot model diagnostics
plot(model_train)


# Use trained model to predict test set
prediction <- predict(model_train, data_test)

# Evaluate model performance on test set
eval_model(prediction, data_test)
##
## Confusion matrix (absolute):
##            Actual
## Prediction  benign malignant Sum
##   benign        92         0  92
##   malignant      8        40  48
##   Sum          100        40 140
##
## Confusion matrix (relative):
##            Actual
## Prediction  benign malignant  Sum
##   benign      0.66      0.00 0.66
##   malignant   0.06      0.29 0.34
##   Sum         0.71      0.29 1.00
##
## Accuracy:
## 0.9429 (132/140)
##
## Error rate:
## 0.0571 (8/140)
##
## Error rate reduction (vs. base rate):
## 0.8 (p-value = 0.000000000007992571)


As you can see the accuracy is only slightly worse but you have full interpretability of the model… and you would only need to measure one value (“Uniformity of Cell Size”) instead of 9 to get a prediction!

On the other hand making neural networks interpretable is one of the big research challenges at the moment.

To end this rather long post: there is a real revolution going on at the moment with all kinds of powerful neural networks. Especially promising is a combination of reinforcement learning (the topic of an upcoming post) and neural networks, where the reinforcement learning algorithm uses a neural network as its memory. For example the revolutionary AlphaGo Zero is built this way: it just received the rules of Go, one of the most demanding strategy games humanity has ever invented, and grew superhuman strength after just three days! The highest human rank in Go has an ELO value of 2940 – AlphaGo Zero achieves 5185! Even the best players don’t stand a chance against this monster of a machine. The neural network technology that is used for AlphaGo Zero and many other deep neural networks is called Tensorflow, which can also easily be integrated into the R environment. To find out more go here: https://tensorflow.rstudio.com/

In this whole area there are many mind-blowing projects underway, so stay tuned!

## Clustering the Bible

During this time of year there is obviously a lot of talk about the Bible. As most people know the New Testament comprises four different Gospels written by anonymous authors 40 to 70 years after Jesus’ supposed crucifiction. Unfortunately we have lost all of the originals but only retained copies of copies of copies (and so on) which date back hundreds of years after they were written in all kinds of different versions (renowned Biblical scholar Professor Bart Ehrmann states that there are more versions of the New Testament than there are words in the New Testament). Just as a fun fact: there are many more Gospels but only those four were included in the official Bible.

So in general it is interesting to learn a little bit more about this stuff, especially because, for better or worse, it is at the core of Western civilization. One interesting question is how do the four Gospels relate to each other. To find that out you could either study Greek (i.e. the language of the New Testament) and Christian theology for many years and decades – or you could use the stylo package and do a cluster analysis within seconds. Obviously we go for the latter…

For that matter we download one version of the four Gospels from Project Gutenberg in Greek and put it into a separate directory called ‘corpus’ (which is the name of the default directory of the package). The famous beginning of the Gospel of John (“In the beginning was the Word, and the Word was with God, and the Word was God…”) reads as follows in Greek:

Στην αρχή ‘ταν ο λόγος κι’ ο λόγος είτανε με το
Θεό και Θεός είταν ο λόγος. Είταν εκείνος στην αρχή
με το Θεό. Όλα τα πάντα μέσο του έγιναν, και χωρίς
του τίποτα δεν έγινε που γίνηκε. Μέσα του είτανε ζωή
κι’ η ζωή ‘τανε το φως των ανθρώπων, και το φως
μέσα στο σκοτάδι φέγγει και το σκοτάδι δεν το κυρίεψε.

For your convenience I provide you with the curated texts here: John, Luke, Mark, Matthew.

After that we can already call the main function of the package:

library(stylo)

## ### stylo version: 0.6.8 ###
##
## If you plan to cite this software (please do!), use the following reference:
##     Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R:
##     a package for computational text analysis. R Journal 8(1): 107-121.
##     <https://journal.r-project.org/archive/2016/RJ-2016-007/index.html>
##
## To get full BibTeX entry, type: citation("stylo")

stylo(corpus.lang = "Other", encoding = "UTF-8", mfw.min = 200, mfw.max = 200, custom.graph.title = "Ancient Greek Gospels", gui = FALSE)

## using current directory...
## Performing no sampling (using entire text as sample)
## slicing input text into tokens...
##
## turning words into features, e.g. char n-grams (if applicable)...
##
## Total nr. of samples in the corpus: 4
## ....
## The corpus consists of 70943 tokens
##
## processing  4  text samples
##
## combining frequencies into a table...
##
##
## culling @ 0  available features (words) 5000
## Calculating z-scores...
##
## Calculating classic Delta distances...
## MFW used:
## 200


##
## Function call:
## stylo(gui = FALSE, corpus.lang = "Other", encoding = "UTF-8",     mfw.min = 200, mfw.max = 200, custom.graph.title = "Ancient Greek Gospels")
##
## Depending on your chosen options, some results should have been written
## into a few files; you should be able to find them in your current
## (working) directory. Usually, these include a list of words/features
## used to build a table of frequencies, the table itself, a file containing
## recent configuration, etc.
##
## Advanced users: you can pipe the results to a variable, e.g.:
##   I.love.this.stuff = stylo()
## this will create a class "I.love.this.stuff" containing some presumably
## interesting stuff. The class created, you can type, e.g.:
##   summary(I.love.this.stuff)
## to see which variables are stored there and how to use them.
##
##
## for suggestions how to cite this software, type: citation("stylo")


As you can see the Gospels of Matthew and Luke are more similar than Mark (the oldest Gospel), John (the last Gospel that made it into the Bible) is farthest away. This is indeed what Biblical scholars have found out by diligently studying the New Testament: Mark, Luke and Matthew are called “Synoptic Gospels” because they are so similar, John kind of stands on its own. The shared pieces of the Synoptic Gospels can be seen here (picture from Wikipedia, Synoptic Gospels):

As you can see, Biblical scholars have found out that Luke and Matthew share about 60 to 70 percent of the text (through the so called double tradition), whereas Mark shares about 40 percent of the text (through the triple tradition). For comparison, John only shares about 10 percent with the synoptic Gospels. Now, this is exactly what our cluster analysis form above shows – what an impressive feat!

By the way, there is an even simpler way to interact with the stylo package: through the GUI. Just call stylo() and the following window will open where you can play around with different arguments:

It has never been easier to do sophisticated textual analyses, even in ancient languages one cannot understand! You might ask: how did R do that?

Well, behind the scenes R basically creates frequency tables for the words used in the different documents. It then tries to cluster the documents based on a similarity (or distance) measure. Intuitively that means that the more often authors use the same words the closer their texts are. As you can see R doesn’t really have to understand the words or the language used. It just maps statistical measures which seems to be quite effective (not only) in this case. The statistical measures often give some kind of characteristic signature of an author or a source. In a way this is the quantified version of fuzzy concepts like ‘writing style’ yet R can do it much more quickly even for long texts than any human ever could!

Happy New Year… stay tuned for much more to come in

## Intuition for principal component analysis (PCA)

Principal component analysis (PCA) is a dimension-reduction method that can be used to reduce a large set of (often correlated) variables into a smaller set of (uncorrelated) variables, called principal components, which still contain most of the information.

PCA is a concept that is traditionally hard to grasp so instead of giving you the n’th mathematical derivation I will provide you with some intuition.

Basically PCA is nothing else but a projection of some higher dimensional object into a lower dimension. What sounds complicated is really something we encounter every day: when we watch TV we see a 2D-projection of 3D-objects!

So what we want is not only a projection but a projection which retains as much information as possible. A good strategy is to find the longest axis of the object and after that turn the object around this axis so that we again find the second longest axis.

Mathematically this can be done by calculating the so called eigenvectors of the covariance matrix, after that we use just use the n eigenvectors with the biggest eigenvalues to project the object into n-dimensional space (in our example n = 2) – but as I said I won’t go into the mathematical details here (many good explanations can be found here on stackexchange).

To make matters concrete we will now recreate the example from the video above! The teapot data is from https://www.khanacademy.org/computer-programming/3d-teapot/971436783 – there you can also turn it around interactively.

Have a look at the following code:

library(scatterplot3d)

dat <- matrix(unlist(teapot), ncol = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,] 40.6 28.3 -1.1
## [2,] 40.1 30.4 -1.1
## [3,] 40.7 31.1 -1.1
## [4,] 42.0 30.4 -1.1
## [5,] 43.5 28.3 -1.1
## [6,] 37.5 28.3 14.5

# teapot in 3D
scatterplot3d(dat[ , 1], dat[ , 2], dat[ , 3], highlight.3d = TRUE, angle = 75, pch = 19, lwd = 15, xlab = "", ylab = "", zlab = "", main = "teapot in 3D")


And now we project that teapot from 3D-space onto a 2D-plane while retaining as much information as possible by taking the two eigenvectors with biggest eigenvalues:

# PCA
(eigenval <- eigen(cov(dat))$values) ## [1] 2050.6935 747.5764 553.4070 eigenvec <- eigen(cov(dat))$vectors
# take only the two eigenvectors with biggest eigenvalues
PCA_2 <- dat %*% eigenvec[ , 1:2]
plot(PCA_2, ylim = c(50, -55), pch = 19, lwd = 35, col= "blue", xlab = "", ylab = "", main = "teapot in 2D")


round(cumsum(eigenval)/sum(eigenval) * 100, 2) # cumulative percentage of retrieved information
## [1]  61.18  83.49 100.00


That was fun, wasn’t it!

In data science you will often use PCA to condense large datasets into smaller datasets while retaining most of the information (which means retaining as much variance as possible in this case). PCA is like clustering an unsupervised learning method.

## OneR – Fascinating Insights through Simple Rules

We already saw the power of the OneR package in the preceding post, One Rule (OneR) Machine Learning Classification in under One Minute. Here we want to give some more examples to gain some fascinating, often counter-intuitive, insights.

Shirin Glander of Muenster University tested the OneR package with data from the World Happiness Report to find out what makes people happy: https://shiring.github.io/machine_learning/2017/04/23/one_r.

The whole analysis is very sophisticated so I present a simplified approach here. Have a look at the following code:

library(OneR)
data$Happiness.Score <- bin(data$Happiness.Score, nbins = 3, labels = c("low", "middle", "high"), method = "content")
data <- maxlevels(data) # removes all columns that have more than 20 levels (Country)
data <- optbin(formula = Happiness.Score ~., data = data, method = "infogain")
data <- data[ , -(1:4)] # remove columns with redundant data
model <- OneR(formula = Happiness.Score ~., data = data, verbose = TRUE)
##
##     Attribute                     Accuracy
## 1 * Economy..GDP.per.Capita.      71.97%
## 2   Health..Life.Expectancy.      70.06%
## 3   Family                        65.61%
## 4   Dystopia.Residual             59.87%
## 5   Freedom                       57.32%
## 6   Trust..Government.Corruption. 55.41%
## 7   Generosity                    47.13%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model)
##
## Call:
## OneR.formula(formula = Happiness.Score ~ ., data = data, verbose = TRUE)
##
## Rules:
## If Economy..GDP.per.Capita. = (-0.00182,0.675] then Happiness.Score = low
## If Economy..GDP.per.Capita. = (0.675,1.32]     then Happiness.Score = middle
## If Economy..GDP.per.Capita. = (1.32,1.83]      then Happiness.Score = high
##
## Accuracy:
## 113 of 157 instances classified correctly (71.97%)
##
## Contingency table:
##                Economy..GDP.per.Capita.
## Happiness.Score (-0.00182,0.675] (0.675,1.32] (1.32,1.83] Sum
##          low                * 36           16           0  52
##          middle                4         * 46           2  52
##          high                  0           22        * 31  53
##          Sum                  40           84          33 157
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 130.99, df = 4, p-value < 2.2e-16

plot(model)


prediction <- predict(model, data)
eval_model(prediction, data)
##
## Confusion matrix (absolute):
##           Actual
## Prediction high low middle Sum
##     high     31   0      2  33
##     low       0  36      4  40
##     middle   22  16     46  84
##     Sum      53  52     52 157
##
## Confusion matrix (relative):
##           Actual
## Prediction high  low middle  Sum
##     high   0.20 0.00   0.01 0.21
##     low    0.00 0.23   0.03 0.25
##     middle 0.14 0.10   0.29 0.54
##     Sum    0.34 0.33   0.33 1.00
##
## Accuracy:
## 0.7197 (113/157)
##
## Error rate:
## 0.2803 (44/157)
##
## Error rate reduction (vs. base rate):
## 0.5769 (p-value < 2.2e-16)


As you can see we get more than 70% accuracy with three rules based on GDP per capita. So it seems that money CAN buy happiness after all!

Dr. Glander comes to the following conclusion:

All in all, based on this example, I would confirm that OneR models do indeed produce sufficiently accurate models for setting a good baseline. OneR was definitely faster than random forest, gradient boosting and neural nets. However, the latter were more complex models and included cross-validation.
If you prefer an easy to understand model that is very simple, OneR is a very good way to go. You could also use it as a starting point for developing more complex models with improved accuracy.
When looking at feature importance across models, the feature OneR chose – Economy/GDP per capita – was confirmed by random forest, gradient boosting trees and neural networks as being the most important feature. This is in itself an interesting conclusion! Of course, this correlation does not tell us that there is a direct causal relationship between money and happiness, but we can say that a country’s economy is the best individual predictor for how happy people tend to be.

A well known reference data set is the titanic data set which gives all kind of additional information on the passengers of the titanic and whether they survived the tragedy. We can conveniently use the titanic package from CRAN to get access to the data set. Have a look at the following code:

library(titanic)
data <- bin(maxlevels(titanic_train), na.omit = FALSE)
model <- OneR(Survived ~., data = data, verbose = TRUE)
## Warning in OneR.data.frame(x = data, ties.method = ties.method, verbose =
## verbose, : data contains unused factor levels
##
##     Attribute   Accuracy
## 1 * Sex         78.68%
## 2   Pclass      67.9%
## 3   Fare        64.42%
## 4   Embarked    63.86%
## 5   Age         62.74%
## 6   Parch       61.73%
## 7   PassengerId 61.62%
## 7   SibSp       61.62%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model)
##
## Call:
## OneR.formula(formula = Survived ~ ., data = data, verbose = TRUE)
##
## Rules:
## If Sex = female then Survived = 1
## If Sex = male   then Survived = 0
##
## Accuracy:
## 701 of 891 instances classified correctly (78.68%)
##
## Contingency table:
##         Sex
## Survived female  male Sum
##      0       81 * 468 549
##      1    * 233   109 342
##      Sum    314   577 891
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 260.72, df = 1, p-value < 2.2e-16

plot(model)


prediction <- predict(model, data)
eval_model(prediction, data\$Survived)
##
## Confusion matrix (absolute):
##           Actual
## Prediction   0   1 Sum
##        0   468 109 577
##        1    81 233 314
##        Sum 549 342 891
##
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.53 0.12 0.65
##        1   0.09 0.26 0.35
##        Sum 0.62 0.38 1.00
##
## Accuracy:
## 0.7868 (701/891)
##
## Error rate:
## 0.2132 (190/891)
##
## Error rate reduction (vs. base rate):
## 0.4444 (p-value < 2.2e-16)


So, somewhat contrary to popular belief, it were not necessarily the rich that survived but the women!

A more challenging data set is one about the quality of red wine (https://archive.ics.uci.edu/ml/datasets/Wine+Quality), have a look at the following code:

data <- read.csv("data/winequality-red.csv", header = TRUE, sep = ";")
data <- optbin(data, method = "infogain")
## Warning in optbin.data.frame(data, method = "infogain"): target is numeric
model <- OneR(data, verbose = TRUE)
##
##     Attribute            Accuracy
## 1 * alcohol              56.1%
## 2   sulphates            51.22%
## 3   volatile.acidity     50.59%
## 4   total.sulfur.dioxide 48.78%
## 5   density              47.22%
## 6   citric.acid          46.72%
## 7   chlorides            46.47%
## 8   free.sulfur.dioxide  44.47%
## 9   fixed.acidity        43.96%
## 10  residual.sugar       43.46%
## 11  pH                   43.21%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model)
##
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
##
## Rules:
## If alcohol = (8.39,8.45] then quality = 3
## If alcohol = (8.45,10]   then quality = 5
## If alcohol = (10,11]     then quality = 6
## If alcohol = (11,12.5]   then quality = 6
## If alcohol = (12.5,14.9] then quality = 6
##
## Accuracy:
## 897 of 1599 instances classified correctly (56.1%)
##
## Contingency table:
##        alcohol
## quality (8.39,8.45] (8.45,10] (10,11] (11,12.5] (12.5,14.9]  Sum
##     3           * 1         5       4         0           0   10
##     4             0        28      13        11           1   53
##     5             0     * 475     156        41           9  681
##     6             1       216   * 216     * 176        * 29  638
##     7             0        19      53       104          23  199
##     8             0         2       2         6           8   18
##     Sum           2       745     444       338          70 1599
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 546.64, df = 20, p-value < 2.2e-16

prediction <- predict(model, data)
eval_model(prediction, data, zero.print = ".")
##
## Confusion matrix (absolute):
##           Actual
## Prediction    3    4    5    6    7    8  Sum
##        3      1    .    .    1    .    .    2
##        4      .    .    .    .    .    .    .
##        5      5   28  475  216   19    2  745
##        6      4   25  206  421  180   16  852
##        7      .    .    .    .    .    .    .
##        8      .    .    .    .    .    .    .
##        Sum   10   53  681  638  199   18 1599
##
## Confusion matrix (relative):
##           Actual
## Prediction    3    4    5    6    7    8  Sum
##        3      .    .    .    .    .    .    .
##        4      .    .    .    .    .    .    .
##        5      . 0.02 0.30 0.14 0.01    . 0.47
##        6      . 0.02 0.13 0.26 0.11 0.01 0.53
##        7      .    .    .    .    .    .    .
##        8      .    .    .    .    .    .    .
##        Sum 0.01 0.03 0.43 0.40 0.12 0.01 1.00
##
## Accuracy:
## 0.561 (897/1599)
##
## Error rate:
## 0.439 (702/1599)
##
## Error rate reduction (vs. base rate):
## 0.2353 (p-value < 2.2e-16)


That is an interesting result, isn’t it: the more alcohol the higher the perceived quality!

We end this post with an unconventional use case for OneR: finding the best move in a strategic game, i.e. tic-tac-toe. We use a dataset with all possible board configurations at the end of such games, have a look at the following code:

data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/tic-tac-toe/tic-tac-toe.data", header = FALSE)
names(data) <- c("top-left-square", "top-middle-square", "top-right-square", "middle-left-square", "middle-middle-square", "middle-right-square", "bottom-left-square", "bottom-middle-square", "bottom-right-square", "Class")
model <- OneR(data, verbose = TRUE)
##
##     Attribute            Accuracy
## 1 * middle-middle-square 69.94%
## 2   top-left-square      65.34%
## 2   top-middle-square    65.34%
## 2   top-right-square     65.34%
## 2   middle-left-square   65.34%
## 2   middle-right-square  65.34%
## 2   bottom-left-square   65.34%
## 2   bottom-middle-square 65.34%
## 2   bottom-right-square  65.34%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model)
##
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
##
## Rules:
## If middle-middle-square = b then Class = positive
## If middle-middle-square = o then Class = negative
## If middle-middle-square = x then Class = positive
##
## Accuracy:
## 670 of 958 instances classified correctly (69.94%)
##
## Contingency table:
##           middle-middle-square
## Class          b     o     x Sum
##   negative    48 * 192    92 332
##   positive * 112   148 * 366 626
##   Sum        160   340   458 958
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 115.91, df = 2, p-value < 2.2e-16

prediction <- predict(model, data)
eval_model(prediction, data)
##
## Confusion matrix (absolute):
##           Actual
## Prediction negative positive Sum
##   negative      192      148 340
##   positive      140      478 618
##   Sum           332      626 958
##
## Confusion matrix (relative):
##           Actual
## Prediction negative positive  Sum
##   negative     0.20     0.15 0.35
##   positive     0.15     0.50 0.65
##   Sum          0.35     0.65 1.00
##
## Accuracy:
## 0.6994 (670/958)
##
## Error rate:
## 0.3006 (288/958)
##
## Error rate reduction (vs. base rate):
## 0.1325 (p-value = 0.001427)


Perhaps it doesn’t come as a surprise that the middle-middle-square is strategically the most important one – but still it is encouraging to see that OneR comes to the same conclusion.

## One Rule (OneR) Machine Learning Classification in under One Minute

You can also watch this video which goes through the following example step-by-step:

After installing the OneR package from CRAN load it

library(OneR)


Use the famous Iris dataset and determine optimal bins for numeric data

data <- optbin(iris)


Build model with best predictor

model <- OneR(data, verbose = TRUE)
##
##     Attribute    Accuracy
## 1 * Petal.Width  96%
## 2   Petal.Length 95.33%
## 3   Sepal.Length 74.67%
## 4   Sepal.Width  55.33%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'


Show learned rules and model diagnostics

summary(model)
##
## Call:
## OneR(data = data, verbose = TRUE)
##
## Rules:
## If Petal.Width = (0.0976,0.791] then Species = setosa
## If Petal.Width = (0.791,1.63]   then Species = versicolor
## If Petal.Width = (1.63,2.5]     then Species = virginica
##
## Accuracy:
## 144 of 150 instances classified correctly (96%)
##
## Contingency table:
##             Petal.Width
## Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
##   setosa               * 50            0          0  50
##   versicolor              0         * 48          2  50
##   virginica               0            4       * 46  50
##   Sum                    50           52         48 150
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 266.35, df = 4, p-value < 2.2e-16


Plot model diagnostics

plot(model)


Use model to predict data

prediction <- predict(model, data)


Evaluate prediction statistics

eval_model(prediction, data)
##
## Confusion matrix (absolute):
##             Actual
## Prediction   setosa versicolor virginica Sum
##   setosa         50          0         0  50
##   versicolor      0         48         4  52
##   virginica       0          2        46  48
##   Sum            50         50        50 150
##
## Confusion matrix (relative):
##             Actual
## Prediction   setosa versicolor virginica  Sum
##   setosa       0.33       0.00      0.00 0.33
##   versicolor   0.00       0.32      0.03 0.35
##   virginica    0.00       0.01      0.31 0.32
##   Sum          0.33       0.33      0.33 1.00
##
## Accuracy:
## 0.96 (144/150)
##
## Error rate:
## 0.04 (6/150)
##
## Error rate reduction (vs. base rate):
## 0.94 (p-value < 2.2e-16)


Please note that the very good accuracy of 96% is reached effortlessly.

“Petal.Width” is identified as the attribute with the highest predictive value. The cut points of the intervals are found automatically (via the included optbin function). The results are three very simple, yet accurate, rules to predict the respective species.

The nearly perfect separation of the areas in the diagnostic plot give a good indication of the model’s ability to separate the different species.

The whole code of this post:

library(OneR)
data <- optbin(iris)
model <- OneR(data, verbose = TRUE)
summary(model)
plot(model)
prediction <- predict(model, data)
eval_model(prediction, data)


More sophisticated examples will follow in upcoming posts… so stay tuned!

Update: For more examples see e.g. this post: OneR – Fascinating Insights through Simple Rules

## Help

From within R:

help(package = OneR)


…or as a pdf here: OneR.pdf

The package vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models

Issues can be posted here: https://github.com/vonjd/OneR/issues

## Feedback

I would love to hear about your experiences with the OneR package. Please drop a line or two in the comments – Thank you!