Customers who bought…


One of the classic examples in data science (called data mining at the time) is the beer and diapers example: when a big supermarket chain started analyzing their sales data they encountered not only trivial patterns, like toothbrushes and toothpaste being bought together, but also quite strange combinations like beer and diapers. Now, the trivial ones are reassuring that the method works but what about the more extravagant ones? Does it mean that young parents are alcoholics? Or that instead of breastfeeding they give their babies beer? Obviously, they had to get to the bottom of this.

As it turned out in many cases they following happened: stressed out mummy sends young daddy to the supermarket because they had run out of diapers. Young daddy seizes the opportunity to not only buy the much needed diapers but also to stock up on some beer! So what the supermarket did was to place the beer directly on the way from the aisle with the diapers – the result was a significant boost in beer sales (for all the young daddies who might have forgotten what they really wanted when buying diapers…).

So, to reproduce this example in a simplified way have a look at the following code:

# some example data for items bought together (market baskets)
Diapers <- c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0)
Baby_Oil <- c(1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0)
Ham <- c(rep(0, 6), rep(1, 2), rep(0, 7))
Beer <- c(rep(0, 3), 1, rep(0, 11))
(basket <- cbind(Diapers, Baby_Oil, Ham, Beer))
##       Diapers Baby_Oil Ham Beer
##  [1,]       1        1   0    0
##  [2,]       0        1   0    0
##  [3,]       1        1   0    0
##  [4,]       1        0   0    1
##  [5,]       0        0   0    0
##  [6,]       1        1   0    0
##  [7,]       1        0   1    0
##  [8,]       0        0   1    0
##  [9,]       1        1   0    0
## [10,]       1        1   0    0
## [11,]       0        1   0    0
## [12,]       0        1   0    0
## [13,]       1        1   0    0
## [14,]       0        0   0    0
## [15,]       0        0   0    0

# analysis of items bought together
round(cor_basket <- cor(basket), 2) # cor is the core of the method! (no pun intended)
##          Diapers Baby_Oil   Ham  Beer
## Diapers     1.00     0.33 -0.03  0.25
## Baby_Oil    0.33     1.00 -0.48 -0.33
## Ham        -0.03    -0.48  1.00 -0.10
## Beer        0.25    -0.33 -0.10  1.00

diag(cor_basket) <- 0 # we don't want to recommend the same products to the customers who already bought them
round(cor_basket, 2)
##          Diapers Baby_Oil   Ham  Beer
## Diapers     0.00     0.33 -0.03  0.25
## Baby_Oil    0.33     0.00 -0.48 -0.33
## Ham        -0.03    -0.48  0.00 -0.10
## Beer        0.25    -0.33 -0.10  0.00

# printing items bought together
for (i in 1:ncol(cor_basket)) {
  col <- cor_basket[ , i, drop = FALSE]
  col <- col[order(col, decreasing = TRUE), , drop = FALSE]
  cat("Customers who bought", colnames(col), "also bought", rownames(col)[col > 0], "\n")
}
## Customers who bought Diapers also bought Baby_Oil Beer 
## Customers who bought Baby_Oil also bought Diapers 
## Customers who bought Ham also bought  
## Customers who bought Beer also bought Diapers

What we are looking for is some kind of dependance pattern within the shopping baskets, in this case we use the good old correlation function. Traditionally other (dependance) measures are used, namely support, confidence and lift. We will come to that later on in this post.

Wikipedia offers the following fitting description of association rule learning:

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify rules discovered in databases using some measures of interestingness.

For example, the rule \{onions,potatoes\}\Rightarrow \{burger\} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat.

Such information can be used as the basis for decisions about marketing activities such as, e.g. promotional pricing or product placements.

In addition to the above example from market basket analysis association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics.

So, this is also the basis of popular functions on ecommerce sites (“customers who bought this item also bought…”) or movie streaming platforms (“customers who watched this film also watched…”).

A very good package for real-world datasets is the arules package (on CRAN). Have a look at the following code:

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write

data("Groceries")
rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules_conf <- arules::sort(rules, by = "confidence", decreasing = TRUE)
inspect(head(rules_conf, 10))
##      lhs                     rhs                    support confidence     lift count
## [1]  {rice,                                                                          
##       sugar}              => {whole milk}       0.001220132          1 3.913649    12
## [2]  {canned fish,                                                                   
##       hygiene articles}   => {whole milk}       0.001118454          1 3.913649    11
## [3]  {root vegetables,                                                               
##       butter,                                                                        
##       rice}               => {whole milk}       0.001016777          1 3.913649    10
## [4]  {root vegetables,                                                               
##       whipped/sour cream,                                                            
##       flour}              => {whole milk}       0.001728521          1 3.913649    17
## [5]  {butter,                                                                        
##       soft cheese,                                                                   
##       domestic eggs}      => {whole milk}       0.001016777          1 3.913649    10
## [6]  {citrus fruit,                                                                  
##       root vegetables,                                                               
##       soft cheese}        => {other vegetables} 0.001016777          1 5.168156    10
## [7]  {pip fruit,                                                                     
##       butter,                                                                        
##       hygiene articles}   => {whole milk}       0.001016777          1 3.913649    10
## [8]  {root vegetables,                                                               
##       whipped/sour cream,                                                            
##       hygiene articles}   => {whole milk}       0.001016777          1 3.913649    10
## [9]  {pip fruit,                                                                     
##       root vegetables,                                                               
##       hygiene articles}   => {whole milk}       0.001016777          1 3.913649    10
## [10] {cream cheese ,                                                                 
##       domestic eggs,                                                                 
##       sugar}              => {whole milk}       0.001118454          1 3.913649    1

The algorithm used here is the so called Apriori algorithm. It ameliorates the problem with real-world datasets that when you want to test all combinations of all possible items you very soon run into performance problems – even with very fast computers – because there are just too many possibilities to be tested.

The Apriori algorithm aggressively prunes the possibilities to be tested by making use of the fact that if you are only interested in rules that are supported by a certain number of instances you can start with testing the support of individual items – which is easy to do – and work your way up to more complicated rules.

The trick is that you don’t test more complicated rules with items which don’t have enough support on the individual level: this is because if you don’t have enough instances on the individual level you don’t even have to look at more complicated combinations with those items included (which would be even more scarce). What sounds like an obvious point brings about huge time savings for real-world datasets which couldn’t be analyzed without this trick.

As mentioned above important concepts to assess the quality (or interestingness) of association rules are support, confidence and lift:

  • Support of \{items X\}: percentage of X for all cases
  • Confidence of \{items X\}\Rightarrow \{items Y\}: percentage of Y for all X
  • Lift of \{items X\}\Rightarrow \{items Y\}: ratio of the observed support of X and Y to what would be expected if X and Y were independent

To understand these concepts better we are going to rebuild the examples given in the Wikipedia article in R. Have a look at the parts “Definition” and “Useful Concepts” of the article and after that at the following code, which should be self-explanatory:

M <- matrix(c(1, 1, 0, 0, 0,
              0, 0, 1, 0, 0,
              0, 0, 0, 1, 1,
              1, 1, 1, 0, 0,
              0, 1, 0, 0, 0), ncol = 5, byrow = TRUE)
colnames(M) <- c("milk", "bread", "butter", "beer", "diapers")
M
##      milk bread butter beer diapers
## [1,]    1     1      0    0       0
## [2,]    0     0      1    0       0
## [3,]    0     0      0    1       1
## [4,]    1     1      1    0       0
## [5,]    0     1      0    0       0

supp <- function(X) {
  sum(rowSums(M[ , X, drop = FALSE]) == length(X)) / nrow(M) # "rowSums == length" mimics logical AND for the selected columns
}
conf <- function(X, Y) {
  supp(c(X, Y)) / supp(X) # conf(X => Y)
}
lift <- function(X, Y) {
  supp(c(X, Y)) / (supp(X) * supp(Y)) # lift(X => Y)
}

supp(c("beer", "diapers"))         # percentage of X for all cases
## [1] 0.2

conf(c("butter", "bread"), "milk") # percentage of Y for all X
## [1] 1

lift(c("milk", "bread"), "butter") # ratio of the observed support of X and Y to what would be expected if X and Y were independent
## [1] 1.25

You should conduct your own experiments by playing around with different item combinations so that you really understand the mechanics of those important concepts.

If all of those analyses are being done for perfecting your customer experience or just outright manipulation to lure you into buying stuff you don’t really need is obviously a matter of perspective…

To understand recursion you have to understand recursion…


Sorting values is one of the bread and butter tasks in computer science: this post uses it as a use case to learn what recursion is all about. It starts with some nerd humour… and ends with some more, so read on!

Have a look at the following code and try to understand it:

bogosort <- function(x) {
  while(is.unsorted(x)) x <- sample(x)
  x
}

set.seed(123)
(n <- sample(1:100, 10))
##  [1] 29 79 41 86 91  5 50 83 51 42

bogosort(n)
##  [1]  5 29 41 42 50 51 79 83 86 91

Obviously bogosort did its job… but in a ludicrously inefficient way! It just shuffles all values over and over until they are sorted by chance! In the worst case it just shuffles forever (i.e. until the sun swallows the earth or your partner switches the computer off… whatever comes first).

To create a more efficient sorting algorithm one could use the following strategy:

  1. Draw the smallest number, put it into a vector and
  2. Sort the remaining numbers by going back to 1.

The algorithm (which is called selection sort, see below) effectively sorts by calling itself! This is the main idea behind recursion.

To dive deeper into how recursion works watch the following video:

So, basically

Recursion means defining a function in terms of itself!

Additionally you need some stopping criterion or base case, otherwise you would create an infinite loop.

The recursive version of the factorial function is:

    \[1!=1, n!=n(n-1)!\]

In the video this formula is written in C, here is the R version:

fact <- function(n) {
  if (n == 1) 1
  else n * fact(n-1)
}
fact(4)
## [1] 24

factorial(4) #inbuilt factorial function
## [1] 24

After we have seen how recursion actually works, let us implement the above selection sort algorithm in R. First watch this, admittedly unconventional, video:
We now implement the main idea recursively:

selsort <- function(x) {
  if(length(x) > 1) {
    mini <- which.min(x)
    c(x[mini], selsort(x[-mini]))
  } else x
}

set.seed(123)
(v <- sample(1:100, 20))
##  [1] 29 79 41 86 91  5 50 83 51 42 87 98 60 94  9 77 21  4 27 78

selsort(v)
##  [1]  4  5  9 21 27 29 41 42 50 51 60 77 78 79 83 86 87 91 94 98

Another, even more efficient sorting algorithm, quicksort, also works recursively. Watch the following video to get the idea:

So, the main idea behind the quicksort algorithm is to find a pivot for each unsorted subset of numbers which splits it into two similarly sized subsets. Then the sorting function calls itself for each subset.

Have a look at the following code:

qsort <- function(v) {
  if (length(v) > 1) {
    pivot <- (min(v) + max(v)) / 2
    c(qsort(v[v < pivot]), v[v == pivot], qsort(v[v > pivot])) 
  } else v
}

qsort(v)
##  [1]  4  5  9 21 27 29 41 42 50 51 60 77 78 79 83 86 87 91 94 98

Now for some speed comparisons between selection sort, quicksort and R’s sort function (on my lame computer…):

options(expressions = 7500)
N <- 7000

set.seed(123)
vs <- runif(N)

system.time(u <- selsort(vs))
##    user  system elapsed 
##    0.59    0.14    0.73

system.time(u <- qsort(vs))
##    user  system elapsed 
##    0.01    0.00    0.01

system.time(u <- sort(vs))
##    user  system elapsed 
##       0       0       0

Wow, although quicksort was implemented in R (and not in C as the sort function) it is impressively fast. Why? Because each subset of unsorted numbers is again split into two subsets only about half as big. This pushes down the sizes of subsets that still have to be sorted down pretty fast. In the case of selection sort the subset only gets smaller by one number (the smallest one) in each step.

Let us end this post with a little easter egg from google – do you get it?

So, what is AI really?


One of the topics that is totally hyped at the moment is obviously Artificial Intelligence or AI for short. There are many self-proclaimed experts running around trying to sell you the stuff they have been doing all along under this new label. When you ask them what AI means you will normally get some convoluted explanations (which is a good sign that they don’t get it themselves) and some “success stories”. The truth is that many of those talking heads don’t really know what they are talking about, yet happen to have a friend who knows somebody who picked up a book at the local station bookshop… ok, that was nasty but unfortunately often not too far away from the truth.

So, what is AI really? This post tries to give some guidance.

The traditional way coding a computer program worked was through carefully analyzing the problem, trying to separate its different parts into simpler sub-problems, put the solution into an algorithmic form (kind of a recipe) and finally code it. Let’s have a look at an example!

Let’s say you want to program an app for mushroom pickers with warnings for certain qualities. The idea is to find an easy to follow guide in the form of “if your mushroom has this quality then DO NOT eat it!” (in computer lingo called “conditional statements” or just “conditionals”). As any mushroom picker can attest this is not an easy task. For the matter have a look at the following dataset (you can find it here: mushrooms, originally it is from https://archive.ics.uci.edu/ml/datasets/mushroom):

mushrooms <- read.csv("data/mushrooms.csv")
str(mushrooms)
## 'data.frame':    8124 obs. of  23 variables:
##  $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
##  $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
##  $ cap_color               : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
##  $ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill_color              : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
##  $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
##  $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil_type               : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
##  $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
##  $ type                    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...

head(mushrooms)
##   cap_shape cap_surface cap_color bruises    odor gill_attachment
## 1    convex      smooth     brown     yes pungent            free
## 2    convex      smooth    yellow     yes  almond            free
## 3      bell      smooth     white     yes   anise            free
## 4    convex       scaly     white     yes pungent            free
## 5    convex      smooth      gray      no    none            free
## 6    convex       scaly    yellow     yes  almond            free
##   gill_spacing gill_size gill_color stalk_shape stalk_root
## 1        close    narrow      black   enlarging      equal
## 2        close     broad      black   enlarging       club
## 3        close     broad      brown   enlarging       club
## 4        close    narrow      brown   enlarging      equal
## 5      crowded     broad      black    tapering      equal
## 6        close     broad      brown   enlarging       club
##   stalk_surface_above_ring stalk_surface_below_ring stalk_color_above_ring
## 1                   smooth                   smooth                  white
## 2                   smooth                   smooth                  white
## 3                   smooth                   smooth                  white
## 4                   smooth                   smooth                  white
## 5                   smooth                   smooth                  white
## 6                   smooth                   smooth                  white
##   stalk_color_below_ring veil_type veil_color ring_number  ring_type
## 1                  white   partial      white         one    pendant
## 2                  white   partial      white         one    pendant
## 3                  white   partial      white         one    pendant
## 4                  white   partial      white         one    pendant
## 5                  white   partial      white         one evanescent
## 6                  white   partial      white         one    pendant
##   spore_print_color population habitat      type
## 1             black  scattered   urban poisonous
## 2             brown   numerous grasses    edible
## 3             brown   numerous meadows    edible
## 4             black  scattered   urban poisonous
## 5             brown   abundant grasses    edible
## 6             black   numerous grasses    edible

The dataset consists of 8124 examples of mushrooms with 22 qualities each (plus the attribute whether the respective mushroom is edible or poisonous). Well, obviously this is not going to be easy…

A naive approach would be to formulate rules for every instance: if the cap shape is convex and the cap surface smooth and the cap colour brown… and so on for all 22 attributes, then DO NOT eat it! This would obviously not be very helpful. Another approach would be to go through every attribute and see whether it is helpful in determining the type, so for example:

table(mushrooms$cap_shape, mushrooms$type)
##          
##           edible poisonous
##   bell       404        48
##   conical      0         4
##   convex    1948      1708
##   flat      1596      1556
##   knobbed    228       600
##   sunken      32         0

Obviously this attribute isn’t very helpful, in many cases it just gives a “maybe, maybe not”-answer. Perhaps the approach itself is not so bad after all but you would have to try it for all 22 attributes, interpret the results, pick the best one, formulate if-then-rules and code them… tiresome and error-prone.

Wouldn’t it be nice to do it the other way around: just show the computer all of the examples and it magically programs itself by finding the appropriate if-then-rules automatically? This is what AI is all about:

Artificial Intelligence (AI): Showing a computer examples of a problem so that it programs itself to solve it.

So, let us throw AI at our problem in the form of the OneR package (on CRAN):

library(OneR)
OneR(mushrooms, verbose = TRUE)
## 
##     Attribute                Accuracy
## 1 * odor                     98.52%  
## 2   spore_print_color        86.8%   
## 3   gill_color               80.5%   
## 4   ring_type                77.55%  
## 5   stalk_surface_above_ring 77.45%  
## 6   stalk_surface_below_ring 76.61%  
## 7   gill_size                75.63%  
## 8   bruises                  74.4%   
## 9   population               72.18%  
## 10  stalk_color_above_ring   71.64%  
## 11  stalk_color_below_ring   71.44%  
## 12  habitat                  69.03%  
## 13  stalk_root               64.6%   
## 14  gill_spacing             61.6%   
## 15  cap_color                59.53%  
## 16  cap_surface              58.05%  
## 17  cap_shape                56.43%  
## 18  stalk_shape              55.29%  
## 19  ring_number              53.82%  
## 20  veil_color               51.9%   
## 21  gill_attachment          51.8%   
## 21  veil_type                51.8%   
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
## 
## Call:
## OneR.data.frame(x = mushrooms, verbose = TRUE)
## 
## Rules:
## If odor = almond   then type = edible
## If odor = anise    then type = edible
## If odor = creosote then type = poisonous
## If odor = fishy    then type = poisonous
## If odor = foul     then type = poisonous
## If odor = musty    then type = poisonous
## If odor = none     then type = edible
## If odor = pungent  then type = poisonous
## If odor = spicy    then type = poisonous
## 
## Accuracy:
## 8004 of 8124 instances classified correctly (98.52%)

Wow! Within the blink of an eye and with just one command (OneR) we got all the rules we need for our app! It is the odour: if it smells poisonous it probably is. The accuracy of this is nearly 99%, not too bad for such a simple rule… we wouldn’t even need an app for that.

For more examples, some deeper explanation (and even a video) on the OneR package go here: OneR – Establishing a New Baseline for Machine Learning Classification Models.

By the way, in the words “programs itself” – impressive as it may be – is still the term “programming”, so we are not talking about machines developing intentions, feelings or even consciousness anytime soon. This is the domain of Hollywood and not AI!

As with every hype there are many terms flying around, like Machine Learning (ML), Data Science and (Predictive) Analytics and somewhat older terms, like Data Mining, Knowledge Discovery and Business Intelligence… and many, many more. Of course you can define them in all sorts of ways but to be honest with you in essence they all mean the same (see definition above). I would argue that Data Science is a somewhat broader term which comprises also e.g. the handling of data, interpretation by domain experts and presentation to the business but that is just one definition. There is an old joke that what is Machine Learning when put on powerpoint becomes Artificial Intelligence – it just sounds so much better than any technical term.

We can now answer another question: why now? Why is there a hype now? I will share a secret with you: most of the AI methods used today are quite old! For example the core principles of (artificial) neural networks (a.k.a. deep learning) are from the 60’s of the last century! Now, more than half a century later we’ve got the hype. The reason is:

It’s the data, stupid!

Because AI is all about learning from examples, we need lots and lots of data and because of the internet and mobile revolution we are now drowning in those data. Combine this with more and more powerful hardware (also in the form of cheap cloud services) and you’ve got a revolution at your hands.

And there is yet another lesson to be learned: when I said “we” are drowning in data it was not entirely correct: tech companies like Google, Amazon, Facebook, Apple (GAFA) and their Chinese counterparts, like Baidu, Alibaba, and Tencent (BAT) are drowning in our (!) data. This is the reason why they are also leading the field of AI and not because they necessarily have the brightest AI geniuses.

This point is best illustrated by the example of DeepL: a small German startup in the area of machine translation. Many tests conducted by professional translators came to the conclusion that the translations by DeepL are far better than any other machine translations (like Google Translate). Why? Not because there necessarily is a secret sauce in their algorithms but because the company behind it (Linguee) had been amassing hand-curated bilingual language pairs for many, many years. The quality of their data is just so much better than of all the other translation services out there. This is their main differentiator.

That is not to say that there haven’t been any major developments in the area of AI algorithms or computer hardware or that you don’t need bright people to set up and finetune those systems (quite to the contrary!) but many of the algorithms are open-source and computing power is relatively cheap via cloud services and, yes, there is a shortage of good people at the moment but there are still many highly qualified data scientists around: the real difference is in (the quality and amount of) the data!

One last thing: many people think that AI is more objective than humans, after all its maths and cool logic, right? Wrong! When you only learn from examples and the examples given are racist the learned rules will be racist too! A very good book on the subject is Weapons of Math Destruction by Cathy O’Neal.

Hope that this gave you some perspective on the ongoing hype… perhaps hype is not such a good word after all because when you look at the underlying reasons you can see that this mega trend is here to stay!

Learning Data Science: Modelling Basics


Data Science is all about building good models, so let us start by building a very simple model: we want to predict monthly income from age (in a later post we will see that age is indeed a good predictor for income).

For illustrative purposes we just make up some numbers for age and income, load both into a data frame, plot it and calculate its correlation matrix:

age <- c(21, 46, 55, 35, 28)
income <- c(1850, 2500, 2560, 2230, 1800)
data <- data.frame(age, income)
plot(data, pch = 16)

cor(data) # correlation
##              age    income
## age    1.0000000 0.9464183
## income 0.9464183 1.0000000

Just by looking at the plot we can see that the points seem to be linearly dependent. The correlation matrix corroborates this with a correlation of nearly 95\%. To build a linear regression model we use the lm function (for linear model). To specify the model we use the standard R formula interface: on the left hand side comes the dependent variable, i.e. the variable that we want to predict (income) and on the right hand side comes the independent variable, i.e. the variable that we want to use to predict income, viz. age – in between both variables we put a tilde (~). After that we plot the model as a line and look at some model diagnostics:

# building linear regression model
LinReg <- lm(income ~ age) # assign model to variable
abline(LinReg, col = "green", lwd = 2) # add regression line

LinReg # coefficients
## 
## Call:
## lm(formula = income ~ age)
## 
## Coefficients:
## (Intercept)          age  
##     1279.37        24.56

summary(LinReg) # model summary
## 
## Call:
## lm(formula = income ~ age)
## 
## Residuals:
##       1       2       3       4       5 
##   54.92   90.98  -70.04   91.12 -166.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1279.367    188.510   6.787  0.00654 **
## age           24.558      4.838   5.076  0.01477 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132.1 on 3 degrees of freedom
## Multiple R-squared:  0.8957, Adjusted R-squared:  0.8609 
## F-statistic: 25.77 on 1 and 3 DF,  p-value: 0.01477

The general line equation is y=mx+b, where m is the slope (in this case 24.58) and b the intercept (here 1279.367). The p-values in the last column show us that both parameters are significant, so with a high probability not due to randomness. If we now want to predict the income of a 20-year old we just have to calculate 1279.37+20\cdot24.59=1770.52.

To do the same calculation in R we can conveniently use the predict function which does this calculation for us automatically. We can also use vectors as input and get several predictions at once. Important is that we use the independent variable with its name as a data frame, otherwise R will throw an error.

# predicting income with linear model
predict(LinReg, data.frame(age = 20))
##       1 
## 1770.52
pred_LinReg <- predict(LinReg, data.frame(age = seq(from = 0, to = 80, by = 5)))
names(pred_LinReg) <- seq(0, 80, 5)
pred_LinReg
##        0        5       10       15       20       25       30       35 
## 1279.367 1402.155 1524.944 1647.732 1770.520 1893.308 2016.097 2138.885 
##       40       45       50       55       60       65       70       75 
## 2261.673 2384.461 2507.249 2630.038 2752.826 2875.614 2998.402 3121.190 
##       80 
## 3243.979

So far so good, but wouldn’t it be nice to actually have a model that is even more accurate, a model where the line actually goes through all the data points so that all the available information is being used. Let us do that with a so called polynomial regression (in red):

# polynomial regression
PolyReg <- lm(income ~ poly(age, 4, raw = TRUE))
lines(c(20:55), predict(PolyReg, data.frame(age = c(20:55))), col = "red")

pred_PolyReg <- predict(PolyReg, data.frame(age = seq(0, 80, 5)))
names(pred_PolyReg) <- seq(0, 80, 5)
pred_PolyReg
##         0         5        10        15        20        25        30 
## 19527.182 11103.509  5954.530  3164.488  1959.098  1705.541  1912.468 
##        35        40        45        50        55        60        65 
##  2230.000  2449.726  2504.705  2469.464  2560.000  3133.778  4689.732 
##        70        75        80 
##  7868.267 13451.255 22362.037

What do you think? Is this a better model? Well, there is no good reason why income should go up the younger or older you are, and there is no good reason why there should be a bump between 50 and 55. In other words, this model takes the noise as legitimate input which actually makes it worse! To have a good model you always need some rigidity to generalize the data.

This illustration is a good example of one of the big problems of machine learning: overfitting the data! The other extreme would be underfitting, which is included in the following plot as a blue line:

# mean income as a model
abline(h = mean(income), col = "blue")

title(main = "All models are wrong, some are useful")
legend("bottomright", c("Mean: Underfitting", "Linear fit", "Polynomial Reg.: Overfitting"), col = c("blue", "green", "red"), lwd = 3)

mean(income)
## [1] 2188

We just took the mean value of income and claimed that no matter the age everybody would earn about 2200. This also obviously doesn’t make the cut. In practice and in all data science projects you always have to look for a model that hits the sweet spot between over- and underfitting. This is as much an art as a science, although there are certain best practices like dividing the data into a training and a test set, cross validation etc.

Another important point it that all models have their area of applicability. In physics for example quantum physics works well for the very small but not very good for the very big, it is the other way round for relativity theory. In this case we didn’t have data for people of very young or very old age, so it is in general very dangerous to make prediction for those people, i.e. for values outside of the original observation range (also called extrapolation – in contrast to interpolation = predictions within the original range). In case of the linear model we got an income of about 1280 for newborns (the intercept)! And the income gets higher the older people get, also something that doesn’t make sense. This behaviour was a lot worse for the non-linear model.

So, here you can see that it is one thing to just let R run and spit something out and quite another one to find a good model and interpret it correctly – even for such a simple problem as this one. In a later post we will use real-world data to predict income brackets with OneR, decision trees and random forests, so stay tuned!

Addendum
The post on now online: Learning Data Science: Predicting Income Brackets!