Intuition for principal component analysis (PCA)

Principal component analysis (PCA) is a dimension-reduction method that can be used to reduce a large set of (often correlated) variables into a smaller set of (uncorrelated) variables, called principal components, which still contain most of the information.

PCA is a concept that is traditionally hard to grasp, so instead of giving you the n-th mathematical derivation I will provide you with some intuition.

Basically, PCA is nothing more than a projection of some higher-dimensional object into a lower dimension. What sounds complicated is really something we encounter every day: when we watch TV, we see a 2D projection of 3D objects!

Please watch this instructive YouTube video.

So what we want is not just any projection but one that retains as much information as possible. A good strategy is to find the longest axis of the object, and then rotate the object around this axis to find the second-longest axis.

Mathematically this can be done by calculating the so-called eigenvectors of the covariance matrix; we then use just the n eigenvectors with the biggest eigenvalues to project the object into n-dimensional space (in our example n = 2). But, as I said, I won't go into the mathematical details here (many good explanations can be found on Stack Exchange).
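
To get a feel for what the eigenvectors represent, here is a tiny sketch with made-up data: the first eigenvector of the covariance matrix points along the direction in which the data varies the most.

set.seed(1)
x <- rnorm(1000)
y <- 0.5 * x + rnorm(1000, sd = 0.2)  # y strongly correlated with x
eigen(cov(cbind(x, y)))$vectors[ , 1] # points (roughly) along the line y = 0.5 * x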

To make matters concrete, we will now recreate the example from the video above! The teapot data is from https://www.khanacademy.org/computer-programming/3d-teapot/971436783 – there you can also rotate it interactively.

Have a look at the following code:

library(scatterplot3d)

teapot <- read.csv("data/teapot.csv", header = FALSE)
dat <- matrix(unlist(teapot), ncol = 3, byrow = TRUE)
head(dat)
##      [,1] [,2] [,3]
## [1,] 40.6 28.3 -1.1
## [2,] 40.1 30.4 -1.1
## [3,] 40.7 31.1 -1.1
## [4,] 42.0 30.4 -1.1
## [5,] 43.5 28.3 -1.1
## [6,] 37.5 28.3 14.5

# teapot in 3D
scatterplot3d(dat[ , 1], dat[ , 2], dat[ , 3],
              highlight.3d = TRUE, angle = 75, pch = 19, lwd = 15,
              xlab = "", ylab = "", zlab = "", main = "teapot in 3D")

And now we project the teapot from 3D space onto a 2D plane while retaining as much information as possible by taking the two eigenvectors with the biggest eigenvalues:

# PCA
(eigenval <- eigen(cov(dat))$values)
## [1] 2050.6935  747.5764  553.4070

eigenvec <- eigen(cov(dat))$vectors
# take only the two eigenvectors with the biggest eigenvalues
PCA_2 <- dat %*% eigenvec[ , 1:2]
plot(PCA_2, ylim = c(50, -55), pch = 19, lwd = 35, col = "blue",
     xlab = "", ylab = "", main = "teapot in 2D")

round(cumsum(eigenval)/sum(eigenval) * 100, 2) # cumulative percentage of retained information
## [1]  61.18  83.49 100.00
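
As a quick sanity check (a sketch, not part of the original example), base R's prcomp function performs essentially the same decomposition: its squared standard deviations are exactly the eigenvalues from above, and its scores give the same projection up to centering and possible sign flips.

pca <- prcomp(dat)    # centers the data by default
round(pca$sdev^2, 4)  # matches eigenval from above
scores_2 <- pca$x[ , 1:2] # same picture as PCA_2, possibly shifted/mirrored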

That was fun, wasn’t it?

In data science you will often use PCA to condense large datasets into smaller ones while retaining most of the information. Like clustering, PCA is an unsupervised learning method.

Why R for data science – and not Python?

There are literally hundreds of programming languages out there – e.g. the whole alphabet of one-letter programming language names is already taken. In the area of data science there are two big contenders: R and Python. So why is this blog about R and not Python?

I have to make a confession: I really wanted to like Python. I dug deep into the language and some of its extensions, yet it never really worked for me. I think one of the problems is that Python tries to be everybody’s darling. It can do everything… and its opposite. No, really, it is a great language to learn programming, but I think it has some serious flaws. Here are some of them:

  • It starts with which version to use! The current version has release number 3, but there is still a lot of code based on the former version 2. The problem is that there is no backward compatibility – even the syntax of the print command changed (from a statement to a function)!
  • The next thing is which distribution to choose! What seems like a joke to R users is a sad reality for Python users: there are all kinds of different distributions out there. The best known for data science is Anaconda: https://www.anaconda.com/. One of the reasons for this is that the whole package system in Python is a mess. To give you just a taste, have a look at the official documentation: https://packaging.python.org/tutorials/installing-packages/ – seven (!) pages for what is basically one command in R: install.packages() (I know, this is not entirely fair, but you get the idea).
  • There are several GUIs out there, and admittedly which one to use is partly a matter of taste, but in my opinion, when it comes to data science tasks – where you need a combination of interactive work and scripts – there is no better GUI than RStudio (there is now Rodeo, free download here: https://www.yhat.com/products/rodeo, but I don’t know how mature it is).
  • There is no general rule for when to use a function and when to use a method on an object. The reason for this problem is what I stated above: Python wants to be everybody’s darling and tries to achieve everything at the same time. That it is not only me can be seen in this illuminating discussion, where people scramble to find criteria for when to use which: https://stackoverflow.com/questions/8108688/in-python-when-should-i-use-a-function-instead-of-a-method. A concrete example can be found here, where it is explained why the function any(df2 == 1) gives the wrong result and you have to use e.g. the method (df2 == 1).any() instead. Very confusing and error-prone.
  • More sophisticated data science structures are not part of the core language. For example, you need the NumPy package for vectors and the pandas package for data frames. That in itself is not the problem but the inconsistencies that this brings: whereas vectorized code is supported by NumPy and pandas, it is not supported in base Python, where you have to use good old loops instead.
  • Neither Python nor R is the fastest of languages, but R’s integration with one of the fastest, namely C++, is so much better (via Rcpp by Dirk Eddelbuettel) than Python’s that it can by now be considered a standard approach. All R data structures are supported by corresponding C++ classes, and there is a generic way to write ultra-fast C++ functions that can be called like regular R functions:
    library(Rcpp)
    
    bmi_R <- function(weight, height) {
      weight / (height * height)
    }
    bmi_R(80, 1.85) # body mass index of a person weighing 80 kg at a height of 1.85 m
    ## [1] 23.37473
    
    cppFunction("
      float bmi_cpp(float weight, float height) {
        return weight / (height * height);
      }
    ")
    bmi_cpp(80, 1.85) # same with the C++ function
    ## [1] 23.37473
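    
    If you want to time the two versions yourself, here is a quick sketch (assuming the microbenchmark package from CRAN is installed; note that for such a tiny function the call overhead dominates – the real gains of Rcpp show up in heavy loops over large data):
    
    library(microbenchmark)
    microbenchmark(bmi_R(80, 1.85), bmi_cpp(80, 1.85))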
    

    One of the main reasons for using Python in the data science arena is shrinking by the day: neural networks. The main frameworks like TensorFlow and APIs like Keras used to be controlled by Python, but there are now excellent wrappers available for R too (https://tensorflow.rstudio.com/ and https://keras.rstudio.com/).

    All in all, I think that R is really the best choice for most data science applications. The learning curve may be a little steeper at the beginning, but when you get to more sophisticated concepts, it becomes easier to use than Python.

    OneR – fascinating insights through simple rules

    We already saw the power of the OneR package in the preceding post. Here we want to give some more examples to gain some fascinating, often counter-intuitive, insights.

    Shirin Glander of Muenster University tested the OneR package with data from the World Happiness Report to find out what makes people happy: https://shiring.github.io/machine_learning/2017/04/23/one_r.

    The whole analysis is quite sophisticated, so I present a simplified approach here. Have a look at the following code:

    library(OneR)
    data <- read.csv("data/2016.csv", sep = ",", header = TRUE)
    data$Happiness.Score <- bin(data$Happiness.Score, nbins = 3, labels = c("low", "middle", "high"), method = "content")
    data <- maxlevels(data) # removes all columns that have more than 20 levels (Country)
    data <- optbin(formula = Happiness.Score ~., data = data, method = "infogain")
    data <- data[ , -(1:4)] # remove columns with redundant data
    model <- OneR(formula = Happiness.Score ~., data = data, verbose = TRUE)
    ## 
    ##     Attribute                     Accuracy
    ## 1 * Economy..GDP.per.Capita.      71.97%  
    ## 2   Health..Life.Expectancy.      70.06%  
    ## 3   Family                        65.61%  
    ## 4   Dystopia.Residual             59.87%  
    ## 5   Freedom                       57.32%  
    ## 6   Trust..Government.Corruption. 55.41%  
    ## 7   Generosity                    47.13%  
    ## ---
    ## Chosen attribute due to accuracy
    ## and ties method (if applicable): '*'
    
    summary(model)
    ## 
    ## Call:
    ## OneR.formula(formula = Happiness.Score ~ ., data = data, verbose = TRUE)
    ## 
    ## Rules:
    ## If Economy..GDP.per.Capita. = (-0.00182,0.675] then Happiness.Score = low
    ## If Economy..GDP.per.Capita. = (0.675,1.32]     then Happiness.Score = middle
    ## If Economy..GDP.per.Capita. = (1.32,1.83]      then Happiness.Score = high
    ## 
    ## Accuracy:
    ## 113 of 157 instances classified correctly (71.97%)
    ## 
    ## Contingency table:
    ##                Economy..GDP.per.Capita.
    ## Happiness.Score (-0.00182,0.675] (0.675,1.32] (1.32,1.83] Sum
    ##          low                * 36           16           0  52
    ##          middle                4         * 46           2  52
    ##          high                  0           22        * 31  53
    ##          Sum                  40           84          33 157
    ## ---
    ## Maximum in each column: '*'
    ## 
    ## Pearson's Chi-squared test:
    ## X-squared = 130.99, df = 4, p-value < 2.2e-16
    
    plot(model)
    

    prediction <- predict(model, data)
    eval_model(prediction, data)
    ## 
    ## Confusion matrix (absolute):
    ##           Actual
    ## Prediction high low middle Sum
    ##     high     31   0      2  33
    ##     low       0  36      4  40
    ##     middle   22  16     46  84
    ##     Sum      53  52     52 157
    ## 
    ## Confusion matrix (relative):
    ##           Actual
    ## Prediction high  low middle  Sum
    ##     high   0.20 0.00   0.01 0.21
    ##     low    0.00 0.23   0.03 0.25
    ##     middle 0.14 0.10   0.29 0.54
    ##     Sum    0.34 0.33   0.33 1.00
    ## 
    ## Accuracy:
    ## 0.7197 (113/157)
    ## 
    ## Error rate:
    ## 0.2803 (44/157)
    ## 
    ## Error rate reduction (vs. base rate):
    ## 0.5769 (p-value < 2.2e-16)
    

    As you can see, we get more than 70% accuracy with three simple rules based on GDP per capita alone. So it seems that money CAN buy happiness after all!
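    
    As an aside, the method = "content" option used in the preprocessing above creates equal-frequency bins, so "low", "middle" and "high" each contain about a third of the countries. A tiny sketch with made-up numbers:
    
    # equal-content binning: each bin receives (roughly) the same number of values
    table(bin(rnorm(99), nbins = 3, labels = c("low", "middle", "high"), method = "content"))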

    Dr. Glander comes to the following conclusion:

    All in all, based on this example, I would confirm that OneR models do indeed produce sufficiently accurate models for setting a good baseline. OneR was definitely faster than random forest, gradient boosting and neural nets. However, the latter were more complex models and included cross-validation.
    If you prefer an easy to understand model that is very simple, OneR is a very good way to go. You could also use it as a starting point for developing more complex models with improved accuracy.
    When looking at feature importance across models, the feature OneR chose – Economy/GDP per capita – was confirmed by random forest, gradient boosting trees and neural networks as being the most important feature. This is in itself an interesting conclusion! Of course, this correlation does not tell us that there is a direct causal relationship between money and happiness, but we can say that a country’s economy is the best individual predictor for how happy people tend to be.

    A well-known reference data set is the Titanic data set, which gives all kinds of additional information on the passengers of the Titanic and whether they survived the tragedy. We can conveniently use the titanic package from CRAN to get access to the data set. Have a look at the following code:

    library(titanic)
    data <- bin(maxlevels(titanic_train), na.omit = FALSE)
    model <- OneR(Survived ~., data = data, verbose = TRUE)
    ## Warning in OneR.data.frame(x = data, ties.method = ties.method, verbose =
    ## verbose, : data contains unused factor levels
    ## 
    ##     Attribute   Accuracy
    ## 1 * Sex         78.68%  
    ## 2   Pclass      67.9%   
    ## 3   Fare        64.42%  
    ## 4   Embarked    63.86%  
    ## 5   Age         62.74%  
    ## 6   Parch       61.73%  
    ## 7   PassengerId 61.62%  
    ## 7   SibSp       61.62%  
    ## ---
    ## Chosen attribute due to accuracy
    ## and ties method (if applicable): '*'
    
    summary(model)
    ## 
    ## Call:
    ## OneR.formula(formula = Survived ~ ., data = data, verbose = TRUE)
    ## 
    ## Rules:
    ## If Sex = female then Survived = 1
    ## If Sex = male   then Survived = 0
    ## 
    ## Accuracy:
    ## 701 of 891 instances classified correctly (78.68%)
    ## 
    ## Contingency table:
    ##         Sex
    ## Survived female  male Sum
    ##      0       81 * 468 549
    ##      1    * 233   109 342
    ##      Sum    314   577 891
    ## ---
    ## Maximum in each column: '*'
    ## 
    ## Pearson's Chi-squared test:
    ## X-squared = 260.72, df = 1, p-value < 2.2e-16
    
    plot(model)
    

    prediction <- predict(model, data)
    eval_model(prediction, data$Survived)
    ## 
    ## Confusion matrix (absolute):
    ##           Actual
    ## Prediction   0   1 Sum
    ##        0   468 109 577
    ##        1    81 233 314
    ##        Sum 549 342 891
    ## 
    ## Confusion matrix (relative):
    ##           Actual
    ## Prediction    0    1  Sum
    ##        0   0.53 0.12 0.65
    ##        1   0.09 0.26 0.35
    ##        Sum 0.62 0.38 1.00
    ## 
    ## Accuracy:
    ## 0.7868 (701/891)
    ## 
    ## Error rate:
    ## 0.2132 (190/891)
    ## 
    ## Error rate reduction (vs. base rate):
    ## 0.4444 (p-value < 2.2e-16)
    

    So, somewhat contrary to popular belief, it was not necessarily the rich who survived but the women!
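    
    Note that, as in the other examples of this post, the model is evaluated on the very data it was trained on. Here is a quick sketch of an out-of-sample check with a random train/test split (hypothetical, not part of the original analysis – the exact numbers will vary with the seed):
    
    set.seed(12) # for reproducibility
    idx <- sample(nrow(data), round(0.7 * nrow(data))) # 70% of rows for training
    model_oos <- OneR(Survived ~., data = data[idx, ])
    eval_model(predict(model_oos, data[-idx, ]), data[-idx, "Survived"])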

    A more challenging data set is one about the quality of red wine (https://archive.ics.uci.edu/ml/datasets/Wine+Quality). Have a look at the following code:

    data <- read.csv("data/winequality-red.csv", header = TRUE, sep = ";")
    data <- optbin(data, method = "infogain")
    ## Warning in optbin.data.frame(data, method = "infogain"): target is numeric
    model <- OneR(data, verbose = TRUE)
    ## 
    ##     Attribute            Accuracy
    ## 1 * alcohol              56.1%   
    ## 2   sulphates            51.22%  
    ## 3   volatile.acidity     50.59%  
    ## 4   total.sulfur.dioxide 48.78%  
    ## 5   density              47.22%  
    ## 6   citric.acid          46.72%  
    ## 7   chlorides            46.47%  
    ## 8   free.sulfur.dioxide  44.47%  
    ## 9   fixed.acidity        43.96%  
    ## 10  residual.sugar       43.46%  
    ## 11  pH                   43.21%  
    ## ---
    ## Chosen attribute due to accuracy
    ## and ties method (if applicable): '*'
    
    summary(model)
    ## 
    ## Call:
    ## OneR.data.frame(x = data, verbose = TRUE)
    ## 
    ## Rules:
    ## If alcohol = (8.39,8.45] then quality = 3
    ## If alcohol = (8.45,10]   then quality = 5
    ## If alcohol = (10,11]     then quality = 6
    ## If alcohol = (11,12.5]   then quality = 6
    ## If alcohol = (12.5,14.9] then quality = 6
    ## 
    ## Accuracy:
    ## 897 of 1599 instances classified correctly (56.1%)
    ## 
    ## Contingency table:
    ##        alcohol
    ## quality (8.39,8.45] (8.45,10] (10,11] (11,12.5] (12.5,14.9]  Sum
    ##     3           * 1         5       4         0           0   10
    ##     4             0        28      13        11           1   53
    ##     5             0     * 475     156        41           9  681
    ##     6             1       216   * 216     * 176        * 29  638
    ##     7             0        19      53       104          23  199
    ##     8             0         2       2         6           8   18
    ##     Sum           2       745     444       338          70 1599
    ## ---
    ## Maximum in each column: '*'
    ## 
    ## Pearson's Chi-squared test:
    ## X-squared = 546.64, df = 20, p-value < 2.2e-16
    
    prediction <- predict(model, data)
    eval_model(prediction, data, zero.print = ".")
    ## 
    ## Confusion matrix (absolute):
    ##           Actual
    ## Prediction    3    4    5    6    7    8  Sum
    ##        3      1    .    .    1    .    .    2
    ##        4      .    .    .    .    .    .    .
    ##        5      5   28  475  216   19    2  745
    ##        6      4   25  206  421  180   16  852
    ##        7      .    .    .    .    .    .    .
    ##        8      .    .    .    .    .    .    .
    ##        Sum   10   53  681  638  199   18 1599
    ## 
    ## Confusion matrix (relative):
    ##           Actual
    ## Prediction    3    4    5    6    7    8  Sum
    ##        3      .    .    .    .    .    .    .
    ##        4      .    .    .    .    .    .    .
    ##        5      . 0.02 0.30 0.14 0.01    . 0.47
    ##        6      . 0.02 0.13 0.26 0.11 0.01 0.53
    ##        7      .    .    .    .    .    .    .
    ##        8      .    .    .    .    .    .    .
    ##        Sum 0.01 0.03 0.43 0.40 0.12 0.01 1.00
    ## 
    ## Accuracy:
    ## 0.561 (897/1599)
    ## 
    ## Error rate:
    ## 0.439 (702/1599)
    ## 
    ## Error rate reduction (vs. base rate):
    ## 0.2353 (p-value < 2.2e-16)
    

    That is an interesting result, isn’t it? The more alcohol, the higher the perceived quality!

    We end this post with an unconventional use case for OneR: finding the best move in a strategy game, namely tic-tac-toe. We use a data set with all possible board configurations at the end of such games. Have a look at the following code:

    data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/tic-tac-toe/tic-tac-toe.data", header = FALSE)
    names(data) <- c("top-left-square", "top-middle-square", "top-right-square", "middle-left-square", "middle-middle-square", "middle-right-square", "bottom-left-square", "bottom-middle-square", "bottom-right-square", "Class")
    model <- OneR(data, verbose = TRUE)
    ## 
    ##     Attribute            Accuracy
    ## 1 * middle-middle-square 69.94%  
    ## 2   top-left-square      65.34%  
    ## 2   top-middle-square    65.34%  
    ## 2   top-right-square     65.34%  
    ## 2   middle-left-square   65.34%  
    ## 2   middle-right-square  65.34%  
    ## 2   bottom-left-square   65.34%  
    ## 2   bottom-middle-square 65.34%  
    ## 2   bottom-right-square  65.34%  
    ## ---
    ## Chosen attribute due to accuracy
    ## and ties method (if applicable): '*'
    
    summary(model)
    ## 
    ## Call:
    ## OneR.data.frame(x = data, verbose = TRUE)
    ## 
    ## Rules:
    ## If middle-middle-square = b then Class = positive
    ## If middle-middle-square = o then Class = negative
    ## If middle-middle-square = x then Class = positive
    ## 
    ## Accuracy:
    ## 670 of 958 instances classified correctly (69.94%)
    ## 
    ## Contingency table:
    ##           middle-middle-square
    ## Class          b     o     x Sum
    ##   negative    48 * 192    92 332
    ##   positive * 112   148 * 366 626
    ##   Sum        160   340   458 958
    ## ---
    ## Maximum in each column: '*'
    ## 
    ## Pearson's Chi-squared test:
    ## X-squared = 115.91, df = 2, p-value < 2.2e-16
    
    prediction <- predict(model, data)
    eval_model(prediction, data)
    ## 
    ## Confusion matrix (absolute):
    ##           Actual
    ## Prediction negative positive Sum
    ##   negative      192      148 340
    ##   positive      140      478 618
    ##   Sum           332      626 958
    ## 
    ## Confusion matrix (relative):
    ##           Actual
    ## Prediction negative positive  Sum
    ##   negative     0.20     0.15 0.35
    ##   positive     0.15     0.50 0.65
    ##   Sum          0.35     0.65 1.00
    ## 
    ## Accuracy:
    ## 0.6994 (670/958)
    ## 
    ## Error rate:
    ## 0.3006 (288/958)
    ## 
    ## Error rate reduction (vs. base rate):
    ## 0.1325 (p-value = 0.001427)
    

    Perhaps it doesn’t come as a surprise that the middle-middle-square is strategically the most important one – but it is still encouraging to see that OneR comes to the same conclusion.

    One Rule (OneR) Machine Learning Classification in under One Minute

    You can also watch this video which goes through the following example step-by-step: Quick Start Guide for the OneR package (Video)

    After installing the OneR package from CRAN, load it

    library(OneR)
    

    Use the famous Iris dataset and determine optimal bins for numeric data

    data <- optbin(iris)
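    
    To see what optbin did, a quick sketch (not part of the original quick start): the four numeric columns have been discretized into factors (the bins), while Species is untouched.
    
    sapply(data, class) # the four measurements are now factors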
    

    Build model with best predictor

    model <- OneR(data, verbose = TRUE)
    ## 
    ##     Attribute    Accuracy
    ## 1 * Petal.Width  96%     
    ## 2   Petal.Length 95.33%  
    ## 3   Sepal.Length 74.67%  
    ## 4   Sepal.Width  55.33%  
    ## ---
    ## Chosen attribute due to accuracy
    ## and ties method (if applicable): '*'
    

    Show learned rules and model diagnostics

    summary(model)
    ## 
    ## Call:
    ## OneR(data = data, verbose = TRUE)
    ## 
    ## Rules:
    ## If Petal.Width = (0.0976,0.791] then Species = setosa
    ## If Petal.Width = (0.791,1.63]   then Species = versicolor
    ## If Petal.Width = (1.63,2.5]     then Species = virginica
    ## 
    ## Accuracy:
    ## 144 of 150 instances classified correctly (96%)
    ## 
    ## Contingency table:
    ##             Petal.Width
    ## Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
    ##   setosa               * 50            0          0  50
    ##   versicolor              0         * 48          2  50
    ##   virginica               0            4       * 46  50
    ##   Sum                    50           52         48 150
    ## ---
    ## Maximum in each column: '*'
    ## 
    ## Pearson's Chi-squared test:
    ## X-squared = 266.35, df = 4, p-value < 2.2e-16
    

    Plot model diagnostics

    plot(model)
    

    Use model to predict data

    prediction <- predict(model, data)
    

    Evaluate prediction statistics

    eval_model(prediction, data)
    ## 
    ## Confusion matrix (absolute):
    ##             Actual
    ## Prediction   setosa versicolor virginica Sum
    ##   setosa         50          0         0  50
    ##   versicolor      0         48         4  52
    ##   virginica       0          2        46  48
    ##   Sum            50         50        50 150
    ## 
    ## Confusion matrix (relative):
    ##             Actual
    ## Prediction   setosa versicolor virginica  Sum
    ##   setosa       0.33       0.00      0.00 0.33
    ##   versicolor   0.00       0.32      0.03 0.35
    ##   virginica    0.00       0.01      0.31 0.32
    ##   Sum          0.33       0.33      0.33 1.00
    ## 
    ## Accuracy:
    ## 0.96 (144/150)
    ## 
    ## Error rate:
    ## 0.04 (6/150)
    ## 
    ## Error rate reduction (vs. base rate):
    ## 0.94 (p-value < 2.2e-16)
    

    Please note that the very good accuracy of 96% is reached effortlessly.

    “Petal.Width” is identified as the attribute with the highest predictive value. The cut points of the intervals are found automatically (via the included optbin function). The results are three very simple, yet accurate, rules to predict the respective species.
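    
    To get an idea of what optbin buys you, you could compare it with plain equal-length binning – a quick sketch (the resulting accuracy will typically be lower because the cut points are not tuned to the target):
    
    model_naive <- OneR(bin(iris, nbins = 3), verbose = TRUE)
    summary(model_naive)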

    The nearly perfect separation of the areas in the diagnostic plot gives a good indication of the model’s ability to separate the different species.

    The whole code of this post:

    
    library(OneR)
    data <- optbin(iris)
    model <- OneR(data, verbose = TRUE)
    summary(model)
    plot(model)
    prediction <- predict(model, data)
    eval_model(prediction, data)
    
    

    More sophisticated examples will follow in upcoming posts… so stay tuned!

    Help

    From within R:

    help(package = OneR)
    

    …or as a pdf here: OneR.pdf

    The package vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models

    Issues can be posted here: https://github.com/vonjd/OneR/issues

    Feedback

    I would love to hear about your experiences with the OneR package. Please drop a line or two in the comments – Thank you!