Intuition for principal component analysis (PCA)

Principal component analysis (PCA) is a dimension-reduction method that can be used to reduce a large set of (often correlated) variables into a smaller set of (uncorrelated) variables, called principal components, which still contain most of the information.

PCA is a concept that is traditionally hard to grasp so instead of giving you the n’th mathematical derivation I will provide you with some intuition.

Basically PCA is nothing else but a projection of some higher dimensional object into a lower dimension. What sounds complicated is really something we encounter every day: when we watch TV we see a 2D-projection of 3D-objects!

Please watch this instructive video:

So what we want is not only a projection but a projection which retains as much information as possible. A good strategy is to find the longest axis of the object and after that turn the object around this axis so that we again find the second longest axis.

Mathematically this can be done by calculating the so called eigenvectors of the covariance matrix, after that we use just use the n eigenvectors with the biggest eigenvalues to project the object into n-dimensional space (in our example n = 2) – but as I said I won’t go into the mathematical details here (many good explanations can be found here on stackexchange).

To make matters concrete we will now recreate the example from the video above! The teapot data is from https://www.khanacademy.org/computer-programming/3d-teapot/971436783 – there you can also turn it around interactively.

Have a look at the following code:

library(scatterplot3d)

teapot <- read.csv("data/teapot.csv", header = FALSE)
dat <- matrix(unlist(teapot), ncol = 3, byrow = TRUE)
head(dat)
##      [,1] [,2] [,3]
## [1,] 40.6 28.3 -1.1
## [2,] 40.1 30.4 -1.1
## [3,] 40.7 31.1 -1.1
## [4,] 42.0 30.4 -1.1
## [5,] 43.5 28.3 -1.1
## [6,] 37.5 28.3 14.5

# teapot in 3D
scatterplot3d(dat[ , 1], dat[ , 2], dat[ , 3], highlight.3d = TRUE, angle = 75, pch = 19, lwd = 15, xlab = "", ylab = "", zlab = "", main = "teapot in 3D")

And now we project that teapot from 3D-space onto a 2D-plane while retaining as much information as possible by taking the two eigenvectors with biggest eigenvalues:

# PCA
(eigenval <- eigen(cov(dat))$values)
## [1] 2050.6935  747.5764  553.4070

eigenvec <- eigen(cov(dat))$vectors
# take only the two eigenvectors with biggest eigenvalues
PCA_2 <- dat %*% eigenvec[ , 1:2] 
plot(PCA_2, ylim = c(50, -55), pch = 19, lwd = 35, col= "blue", xlab = "", ylab = "", main = "teapot in 2D")

round(cumsum(eigenval)/sum(eigenval) * 100, 2) # cumulative percentage of retrieved information
## [1]  61.18  83.49 100.00

That was fun, wasn’t it!

In data science you will often use PCA to condense large datasets into smaller datasets while retaining most of the information (which means retaining as much variance as possible in this case). PCA is like clustering an unsupervised learning method.

OneR – fascinating insights through simple rules


We already saw the power of the OneR package in the preceding post. Here we want to give some more examples to gain some fascinating, often counter-intuitive, insights.

Shirin Glander of Muenster University tested the OneR package with data from the World Happiness Report to find out what makes people happy: https://shiring.github.io/machine_learning/2017/04/23/one_r.

The whole analysis is very sophisticated so I present a simplified approach here. Have a look at the following code:

library(OneR)
data <- read.csv("data/2016.csv", sep = ",", header = TRUE)
data$Happiness.Score <- bin(data$Happiness.Score, nbins = 3, labels = c("low", "middle", "high"), method = "content")
data <- maxlevels(data) # removes all columns that have more than 20 levels (Country)
data <- optbin(formula = Happiness.Score ~., data = data, method = "infogain")
data <- data[ , -(1:4)] # remove columns with redundant data
model <- OneR(formula = Happiness.Score ~., data = data, verbose = TRUE)
## 
##     Attribute                     Accuracy
## 1 * Economy..GDP.per.Capita.      71.97%  
## 2   Health..Life.Expectancy.      70.06%  
## 3   Family                        65.61%  
## 4   Dystopia.Residual             59.87%  
## 5   Freedom                       57.32%  
## 6   Trust..Government.Corruption. 55.41%  
## 7   Generosity                    47.13%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
summary(model)
## 
## Call:
## OneR.formula(formula = Happiness.Score ~ ., data = data, verbose = TRUE)
## 
## Rules:
## If Economy..GDP.per.Capita. = (-0.00182,0.675] then Happiness.Score = low
## If Economy..GDP.per.Capita. = (0.675,1.32]     then Happiness.Score = middle
## If Economy..GDP.per.Capita. = (1.32,1.83]      then Happiness.Score = high
## 
## Accuracy:
## 113 of 157 instances classified correctly (71.97%)
## 
## Contingency table:
##                Economy..GDP.per.Capita.
## Happiness.Score (-0.00182,0.675] (0.675,1.32] (1.32,1.83] Sum
##          low                * 36           16           0  52
##          middle                4         * 46           2  52
##          high                  0           22        * 31  53
##          Sum                  40           84          33 157
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 130.99, df = 4, p-value < 2.2e-16
plot(model)

prediction <- predict(model, data)
eval_model(prediction, data)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction high low middle Sum
##     high     31   0      2  33
##     low       0  36      4  40
##     middle   22  16     46  84
##     Sum      53  52     52 157
## 
## Confusion matrix (relative):
##           Actual
## Prediction high  low middle  Sum
##     high   0.20 0.00   0.01 0.21
##     low    0.00 0.23   0.03 0.25
##     middle 0.14 0.10   0.29 0.54
##     Sum    0.34 0.33   0.33 1.00
## 
## Accuracy:
## 0.7197 (113/157)
## 
## Error rate:
## 0.2803 (44/157)
## 
## Error rate reduction (vs. base rate):
## 0.5769 (p-value < 2.2e-16)

As you can see we get more than 70% accuracy with three rules based on GDP per capita. So it seems that money CAN buy happiness after all!

Dr. Glander comes to the following conclusion:

All in all, based on this example, I would confirm that OneR models do indeed produce sufficiently accurate models for setting a good baseline. OneR was definitely faster than random forest, gradient boosting and neural nets. However, the latter were more complex models and included cross-validation.
If you prefer an easy to understand model that is very simple, OneR is a very good way to go. You could also use it as a starting point for developing more complex models with improved accuracy.
When looking at feature importance across models, the feature OneR chose – Economy/GDP per capita – was confirmed by random forest, gradient boosting trees and neural networks as being the most important feature. This is in itself an interesting conclusion! Of course, this correlation does not tell us that there is a direct causal relationship between money and happiness, but we can say that a country’s economy is the best individual predictor for how happy people tend to be.

A well known reference data set is the titanic data set which gives all kind of additional information on the passengers of the titanic and whether they survived the tragedy. We can conveniently use the titanic package from CRAN to get access to the data set. Have a look at the following code:

library(titanic)
data <- bin(maxlevels(titanic_train), na.omit = FALSE)
model <- OneR(Survived ~., data = data, verbose = TRUE)
## Warning in OneR.data.frame(x = data, ties.method = ties.method, verbose =
## verbose, : data contains unused factor levels
## 
##     Attribute   Accuracy
## 1 * Sex         78.68%  
## 2   Pclass      67.9%   
## 3   Fare        64.42%  
## 4   Embarked    63.86%  
## 5   Age         62.74%  
## 6   Parch       61.73%  
## 7   PassengerId 61.62%  
## 7   SibSp       61.62%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
summary(model)
## 
## Call:
## OneR.formula(formula = Survived ~ ., data = data, verbose = TRUE)
## 
## Rules:
## If Sex = female then Survived = 1
## If Sex = male   then Survived = 0
## 
## Accuracy:
## 701 of 891 instances classified correctly (78.68%)
## 
## Contingency table:
##         Sex
## Survived female  male Sum
##      0       81 * 468 549
##      1    * 233   109 342
##      Sum    314   577 891
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 260.72, df = 1, p-value < 2.2e-16
plot(model)

prediction <- predict(model, data)
eval_model(prediction, data$Survived)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction   0   1 Sum
##        0   468 109 577
##        1    81 233 314
##        Sum 549 342 891
## 
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.53 0.12 0.65
##        1   0.09 0.26 0.35
##        Sum 0.62 0.38 1.00
## 
## Accuracy:
## 0.7868 (701/891)
## 
## Error rate:
## 0.2132 (190/891)
## 
## Error rate reduction (vs. base rate):
## 0.4444 (p-value < 2.2e-16)

So, somewhat contrary to popular belief, it were not necessarily the rich that survived but the women!

A more challenging data set is one about the quality of red wine (https://archive.ics.uci.edu/ml/datasets/Wine+Quality), have a look at the following code:

data <- read.csv("data/winequality-red.csv", header = TRUE, sep = ";")
data <- optbin(data, method = "infogain")
## Warning in optbin.data.frame(data, method = "infogain"): target is numeric
model <- OneR(data, verbose = TRUE)
## 
##     Attribute            Accuracy
## 1 * alcohol              56.1%   
## 2   sulphates            51.22%  
## 3   volatile.acidity     50.59%  
## 4   total.sulfur.dioxide 48.78%  
## 5   density              47.22%  
## 6   citric.acid          46.72%  
## 7   chlorides            46.47%  
## 8   free.sulfur.dioxide  44.47%  
## 9   fixed.acidity        43.96%  
## 10  residual.sugar       43.46%  
## 11  pH                   43.21%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
summary(model)
## 
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
## 
## Rules:
## If alcohol = (8.39,8.45] then quality = 3
## If alcohol = (8.45,10]   then quality = 5
## If alcohol = (10,11]     then quality = 6
## If alcohol = (11,12.5]   then quality = 6
## If alcohol = (12.5,14.9] then quality = 6
## 
## Accuracy:
## 897 of 1599 instances classified correctly (56.1%)
## 
## Contingency table:
##        alcohol
## quality (8.39,8.45] (8.45,10] (10,11] (11,12.5] (12.5,14.9]  Sum
##     3           * 1         5       4         0           0   10
##     4             0        28      13        11           1   53
##     5             0     * 475     156        41           9  681
##     6             1       216   * 216     * 176        * 29  638
##     7             0        19      53       104          23  199
##     8             0         2       2         6           8   18
##     Sum           2       745     444       338          70 1599
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 546.64, df = 20, p-value < 2.2e-16
prediction <- predict(model, data)
eval_model(prediction, data, zero.print = ".")
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    3    4    5    6    7    8  Sum
##        3      1    .    .    1    .    .    2
##        4      .    .    .    .    .    .    .
##        5      5   28  475  216   19    2  745
##        6      4   25  206  421  180   16  852
##        7      .    .    .    .    .    .    .
##        8      .    .    .    .    .    .    .
##        Sum   10   53  681  638  199   18 1599
## 
## Confusion matrix (relative):
##           Actual
## Prediction    3    4    5    6    7    8  Sum
##        3      .    .    .    .    .    .    .
##        4      .    .    .    .    .    .    .
##        5      . 0.02 0.30 0.14 0.01    . 0.47
##        6      . 0.02 0.13 0.26 0.11 0.01 0.53
##        7      .    .    .    .    .    .    .
##        8      .    .    .    .    .    .    .
##        Sum 0.01 0.03 0.43 0.40 0.12 0.01 1.00
## 
## Accuracy:
## 0.561 (897/1599)
## 
## Error rate:
## 0.439 (702/1599)
## 
## Error rate reduction (vs. base rate):
## 0.2353 (p-value < 2.2e-16)

That is an interesting result, isn’t it: the more alcohol the higher the perceived quality!

We end this post with an unconventional use case for OneR: finding the best move in a strategic game, i.e. tic-tac-toe. We use a dataset with all possible board configurations at the end of such games, have a look at the following code:

data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/tic-tac-toe/tic-tac-toe.data", header = FALSE)
names(data) <- c("top-left-square", "top-middle-square", "top-right-square", "middle-left-square", "middle-middle-square", "middle-right-square", "bottom-left-square", "bottom-middle-square", "bottom-right-square", "Class")
model <- OneR(data, verbose = TRUE)
## 
##     Attribute            Accuracy
## 1 * middle-middle-square 69.94%  
## 2   top-left-square      65.34%  
## 2   top-middle-square    65.34%  
## 2   top-right-square     65.34%  
## 2   middle-left-square   65.34%  
## 2   middle-right-square  65.34%  
## 2   bottom-left-square   65.34%  
## 2   bottom-middle-square 65.34%  
## 2   bottom-right-square  65.34%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
summary(model)
## 
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
## 
## Rules:
## If middle-middle-square = b then Class = positive
## If middle-middle-square = o then Class = negative
## If middle-middle-square = x then Class = positive
## 
## Accuracy:
## 670 of 958 instances classified correctly (69.94%)
## 
## Contingency table:
##           middle-middle-square
## Class          b     o     x Sum
##   negative    48 * 192    92 332
##   positive * 112   148 * 366 626
##   Sum        160   340   458 958
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 115.91, df = 2, p-value < 2.2e-16
prediction <- predict(model, data)
eval_model(prediction, data)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction negative positive Sum
##   negative      192      148 340
##   positive      140      478 618
##   Sum           332      626 958
## 
## Confusion matrix (relative):
##           Actual
## Prediction negative positive  Sum
##   negative     0.20     0.15 0.35
##   positive     0.15     0.50 0.65
##   Sum          0.35     0.65 1.00
## 
## Accuracy:
## 0.6994 (670/958)
## 
## Error rate:
## 0.3006 (288/958)
## 
## Error rate reduction (vs. base rate):
## 0.1325 (p-value = 0.001427)

Perhaps it doesn’t come as a surprise that the middle-middle-square is strategically the most important one – but still it is encouraging to see that OneR comes to the same conclusion.

One Rule (OneR) Machine Learning Classification in under One Minute

You can also watch this video which goes through the following example step-by-step:

After installing the OneR package from CRAN load it

library(OneR)

Use the famous Iris dataset and determine optimal bins for numeric data

data <- optbin(iris)

Build model with best predictor

model <- OneR(data, verbose = TRUE)
## 
##     Attribute    Accuracy
## 1 * Petal.Width  96%     
## 2   Petal.Length 95.33%  
## 3   Sepal.Length 74.67%  
## 4   Sepal.Width  55.33%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

Show learned rules and model diagnostics

summary(model)
## 
## Call:
## OneR(data = data, verbose = TRUE)
## 
## Rules:
## If Petal.Width = (0.0976,0.791] then Species = setosa
## If Petal.Width = (0.791,1.63]   then Species = versicolor
## If Petal.Width = (1.63,2.5]     then Species = virginica
## 
## Accuracy:
## 144 of 150 instances classified correctly (96%)
## 
## Contingency table:
##             Petal.Width
## Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
##   setosa               * 50            0          0  50
##   versicolor              0         * 48          2  50
##   virginica               0            4       * 46  50
##   Sum                    50           52         48 150
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 266.35, df = 4, p-value < 2.2e-16

Plot model diagnostics

plot(model)

Use model to predict data

prediction <- predict(model, data)

Evaluate prediction statistics

eval_model(prediction, data)
## 
## Confusion matrix (absolute):
##             Actual
## Prediction   setosa versicolor virginica Sum
##   setosa         50          0         0  50
##   versicolor      0         48         4  52
##   virginica       0          2        46  48
##   Sum            50         50        50 150
## 
## Confusion matrix (relative):
##             Actual
## Prediction   setosa versicolor virginica  Sum
##   setosa       0.33       0.00      0.00 0.33
##   versicolor   0.00       0.32      0.03 0.35
##   virginica    0.00       0.01      0.31 0.32
##   Sum          0.33       0.33      0.33 1.00
## 
## Accuracy:
## 0.96 (144/150)
## 
## Error rate:
## 0.04 (6/150)
## 
## Error rate reduction (vs. base rate):
## 0.94 (p-value < 2.2e-16)

Please note that the very good accuracy of 96% is reached effortlessly.

“Petal.Width” is identified as the attribute with the highest predictive value. The cut points of the intervals are found automatically (via the included optbin function). The results are three very simple, yet accurate, rules to predict the respective species.

The nearly perfect separation of the areas in the diagnostic plot give a good indication of the model’s ability to separate the different species.

The whole code of this post:


library(OneR)
data <- optbin(iris)
model <- OneR(data, verbose = TRUE)
summary(model)
plot(model)
prediction <- predict(model, data)
eval_model(prediction, data)

More sophisticated examples will follow in upcoming posts… so stay tuned!

Help

From within R:

help(package = OneR)

…or as a pdf here: OneR.pdf

The package vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models

Issues can be posted here: https://github.com/vonjd/OneR/issues

Feedback

I would love to hear about your experiences with the OneR package. Please drop a line or two in the comments – Thank you!