Check Machin-like formulae with arbitrary-precision arithmetic

Happy New Year to all of you! Let us start the year with something for your inner maths nerd 🙂

Rosetta Stone
Source: Wikimedia

For those of you who don’t yet know Rosetta Code: it is a really cool site where you can find lots of interesting code examples in all kinds of different languages for many different tasks. Of course R is also represented in a big way (426 code examples at the time of writing!): Rosetta Code for R.

The name of the site is inspired by the famous Rosetta Stone of Ancient Egypt, which is inscribed with three versions of the same text: in Ancient Egyptian hieroglyphs, Demotic script, and Ancient Greek. This proved invaluable in deciphering Egyptian hieroglyphs and thereby opened a window into ancient Egyptian history.

Now, a few days ago I again added an example (I will write more posts on the other tasks I have solved in the future, so stay tuned!). The task is to verify the correctness of Machin-like formulae using exact arithmetic.

A little bit of mathematical background is in order, so Wikipedia to the rescue:

Machin-like formulae are a popular technique for computing \pi to a large number of digits. They are generalizations of John Machin's formula from 1706:

    \[\frac{\pi}{4} = 4 \arctan \frac{1}{5} - \arctan \frac{1}{239}\]

which he used to compute \pi to 100 decimal places.

Machin-like formulae have the form

    \[c_0 \frac{\pi}{4} = \sum_{n=1}^N c_n \arctan \frac{a_n}{b_n}\]

where a_n and b_n are positive integers such that a_n < b_n, c_n is a signed non-zero integer, and c_0 is a positive integer.
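
Before verifying anything exactly, here is a quick double-precision sanity check of Machin's original formula in R (my own illustrative snippet, not part of the task): floating-point arithmetic gets very close but of course carries rounding error, which is the whole point of the task below.

4 * atan(1/5) - atan(1/239) # right-hand side of Machin's formula
## [1] 0.7853982
pi / 4                      # agrees, but only up to double precision
## [1] 0.7853982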

The exact task is to verify that the following Machin-like formulae are correct by calculating the value of tan(right-hand side) for each equation using exact arithmetic and showing that it equals one:

{\pi\over4} = \arctan{1\over2} + \arctan{1\over3}
{\pi\over4} = 2 \arctan{1\over3} + \arctan{1\over7}
{\pi\over4} = 4 \arctan{1\over5} - \arctan{1\over239}
{\pi\over4} = 5 \arctan{1\over7} + 2 \arctan{3\over79}
{\pi\over4} = 5 \arctan{29\over278} + 7 \arctan{3\over79}
{\pi\over4} = \arctan{1\over2} + \arctan{1\over5} + \arctan{1\over8}
{\pi\over4} = 4 \arctan{1\over5} - \arctan{1\over70} + \arctan{1\over99}
{\pi\over4} = 5 \arctan{1\over7} + 4 \arctan{1\over53} + 2 \arctan{1\over4443}
{\pi\over4} = 6 \arctan{1\over8} + 2 \arctan{1\over57} + \arctan{1\over239}
{\pi\over4} = 8 \arctan{1\over10} - \arctan{1\over239} - 4 \arctan{1\over515}
{\pi\over4} = 12 \arctan{1\over18} + 8 \arctan{1\over57} - 5 \arctan{1\over239}
{\pi\over4} = 16 \arctan{1\over21} + 3 \arctan{1\over239} + 4 \arctan{3\over1042}
{\pi\over4} = 22 \arctan{1\over28} + 2 \arctan{1\over443} - 5 \arctan{1\over1393} - 10 \arctan{1\over11018}
{\pi\over4} = 22 \arctan{1\over38} + 17 \arctan{7\over601} + 10 \arctan{7\over8149}
{\pi\over4} = 44 \arctan{1\over57} + 7 \arctan{1\over239} - 12 \arctan{1\over682} + 24 \arctan{1\over12943}

The same should be done for the last and most complicated case…

{\pi\over4} = 88 \arctan{1\over172} + 51 \arctan{1\over239} + 32 \arctan{1\over682} + 44 \arctan{1\over5357} + 68 \arctan{1\over12943}

… but it should be confirmed that the following, slightly changed, formula is incorrect by showing that tan(right-hand side) is not one:

{\pi\over4} = 88 \arctan{1\over172} + 51 \arctan{1\over239} + 32 \arctan{1\over682} + 44 \arctan{1\over5357} + 68 \arctan{1\over12944}

This is what I contributed to Rosetta Code:

library(Rmpfr)
prec <- 1000 # precision in bits
`%:%` <- function(e1, e2) '/'(mpfr(e1, prec), mpfr(e2, prec)) # operator %:% for high precision division
# function for checking identity of tan of expression and 1, making use of high precision division operator %:%
tanident_1 <- function(x) identical(round(tan(eval(parse(text = gsub("/", "%:%", deparse(substitute(x)))))), (prec/10)), mpfr(1, prec))
 
tanident_1( 1*atan(1/2)    +  1*atan(1/3) )
## [1] TRUE
tanident_1( 2*atan(1/3)    +  1*atan(1/7))
## [1] TRUE
tanident_1( 4*atan(1/5)    + -1*atan(1/239))
## [1] TRUE
tanident_1( 5*atan(1/7)    +  2*atan(3/79))
## [1] TRUE
tanident_1( 5*atan(29/278) +  7*atan(3/79))
## [1] TRUE
tanident_1( 1*atan(1/2)    +  1*atan(1/5)   +   1*atan(1/8) )
## [1] TRUE
tanident_1( 4*atan(1/5)    + -1*atan(1/70)  +   1*atan(1/99) )
## [1] TRUE
tanident_1( 5*atan(1/7)    +  4*atan(1/53)  +   2*atan(1/4443))
## [1] TRUE
tanident_1( 6*atan(1/8)    +  2*atan(1/57)  +   1*atan(1/239))
## [1] TRUE
tanident_1( 8*atan(1/10)   + -1*atan(1/239) +  -4*atan(1/515))
## [1] TRUE
tanident_1(12*atan(1/18)   +  8*atan(1/57)  +  -5*atan(1/239))
## [1] TRUE
tanident_1(16*atan(1/21)   +  3*atan(1/239) +   4*atan(3/1042))
## [1] TRUE
tanident_1(22*atan(1/28)   +  2*atan(1/443) +  -5*atan(1/1393) + -10*atan(1/11018))
## [1] TRUE
tanident_1(22*atan(1/38)   + 17*atan(7/601) +  10*atan(7/8149))
## [1] TRUE
tanident_1(44*atan(1/57)   +  7*atan(1/239) + -12*atan(1/682)  +  24*atan(1/12943))
## [1] TRUE

tanident_1(88*atan(1/172)  + 51*atan(1/239) +  32*atan(1/682)  +  44*atan(1/5357) + 68*atan(1/12943))
## [1] TRUE
tanident_1(88*atan(1/172)  + 51*atan(1/239) +  32*atan(1/682)  +  44*atan(1/5357) + 68*atan(1/12944))
## [1] FALSE

As you can see all statements are TRUE except for the last one!

In the code I make use of the Rmpfr package (by Martin Maechler of ETH Zürich, Switzerland), which is based on the MPFR library for multiple-precision floating-point arithmetic (which in turn builds on the excellent GMP, GNU Multiple Precision, library). I define a new infix operator %:% for high-precision division, then convert all standard divisions in the formulae into high-precision divisions and calculate the tan. Before checking whether the result is identical to one I round it to 100 decimal places, which is more than enough given that the 1000 bits of precision correspond to log_{10}(2^{1000})=301.03, so about 300, decimal places in this example.
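
If you want to double-check that precision argument, here is a tiny calculation (my own addition, not part of the Rosetta Code entry):

prec * log10(2) # number of decimal digits corresponding to 1000 bits
## [1] 301.03
1 %:% 3         # high-precision division returns an 'mpfr' number instead of a double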

Please let me know in the comments what you think of this approach and whether you see room for improvement for the code – Thank you!

Clustering the Bible


During this time of year there is obviously a lot of talk about the Bible. As most people know, the New Testament comprises four different Gospels written by anonymous authors 40 to 70 years after Jesus’ supposed crucifixion. Unfortunately we have lost all of the originals and only retain copies of copies of copies (and so on), which date from hundreds of years after the originals were written and come in all kinds of different versions (renowned Biblical scholar Professor Bart Ehrman states that there are more differences among these versions than there are words in the New Testament). Just as a fun fact: there are many more Gospels, but only those four were included in the official Bible.

So in general it is interesting to learn a little bit more about this stuff, especially because, for better or worse, it is at the core of Western civilization. One interesting question is how the four Gospels relate to each other. To find that out you could either study Greek (i.e. the language of the New Testament) and Christian theology for years, if not decades – or you could use the stylo package and do a cluster analysis within seconds. Obviously we go for the latter…

To that end we download a Greek version of the four Gospels from Project Gutenberg and put it into a separate directory called ‘corpus’ (which is the default directory name of the package). The famous beginning of the Gospel of John (“In the beginning was the Word, and the Word was with God, and the Word was God…”) reads as follows in Greek:

Στην αρχή ‘ταν ο λόγος κι’ ο λόγος είτανε με το
Θεό και Θεός είταν ο λόγος. Είταν εκείνος στην αρχή
με το Θεό. Όλα τα πάντα μέσο του έγιναν, και χωρίς
του τίποτα δεν έγινε που γίνηκε. Μέσα του είτανε ζωή
κι’ η ζωή ‘τανε το φως των ανθρώπων, και το φως
μέσα στο σκοτάδι φέγγει και το σκοτάδι δεν το κυρίεψε.

For your convenience I provide you with the curated texts here: John, Luke, Mark, Matthew.

After that we can already call the main function of the package:

library(stylo)

## ### stylo version: 0.6.8 ###
## 
## If you plan to cite this software (please do!), use the following reference:
##     Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R:
##     a package for computational text analysis. R Journal 8(1): 107-121.
##     <https://journal.r-project.org/archive/2016/RJ-2016-007/index.html>
## 
## To get full BibTeX entry, type: citation("stylo")

stylo(corpus.lang = "Other", encoding = "UTF-8", mfw.min = 200, mfw.max = 200, custom.graph.title = "Ancient Greek Gospels", gui = FALSE)

## using current directory...
## Performing no sampling (using entire text as sample)
## loading John.txt ...
## loading Luke.txt ...
## loading Mark.txt ...
## loading Matthew.txt  ...
## slicing input text into tokens...
## 
## turning words into features, e.g. char n-grams (if applicable)...
## 
## Total nr. of samples in the corpus: 4
## ....
## The corpus consists of 70943 tokens
## 
## processing  4  text samples
##        
## combining frequencies into a table...
## 
## 
## culling @ 0  available features (words) 5000
## Calculating z-scores... 
## 
## Calculating classic Delta distances...
## MFW used: 
## 200

## 
## Function call:
## stylo(gui = FALSE, corpus.lang = "Other", encoding = "UTF-8",     mfw.min = 200, mfw.max = 200, custom.graph.title = "Ancient Greek Gospels")
## 
## Depending on your chosen options, some results should have been written
## into a few files; you should be able to find them in your current
## (working) directory. Usually, these include a list of words/features
## used to build a table of frequencies, the table itself, a file containing
## recent configuration, etc.
## 
## Advanced users: you can pipe the results to a variable, e.g.:
##   I.love.this.stuff = stylo()
## this will create a class "I.love.this.stuff" containing some presumably
## interesting stuff. The class created, you can type, e.g.:
##   summary(I.love.this.stuff)
## to see which variables are stored there and how to use them.
## 
## 
## for suggestions how to cite this software, type: citation("stylo")

As you can see from the resulting cluster analysis, the Gospels of Matthew and Luke are most similar to each other, followed by Mark (the oldest Gospel), while John (the last Gospel that made it into the Bible) is farthest away. This is indeed what Biblical scholars have found out by diligently studying the New Testament: Mark, Luke and Matthew are called the “Synoptic Gospels” because they are so similar, whereas John kind of stands on its own. The shared pieces of the Synoptic Gospels can be seen here (picture from Wikipedia, Synoptic Gospels):

As you can see, Biblical scholars have found out that Luke and Matthew share about 60 to 70 percent of the text (through the so-called double tradition), whereas Mark shares about 40 percent of the text (through the triple tradition). For comparison, John only shares about 10 percent with the Synoptic Gospels. Now, this is exactly what our cluster analysis from above shows – what an impressive feat!

By the way, there is an even simpler way to interact with the stylo package: through the GUI. Just call stylo() and the following window will open where you can play around with different arguments:

It has never been easier to do sophisticated textual analyses, even in ancient languages one cannot understand! You might ask: how did R do that?

Well, behind the scenes R basically creates frequency tables for the words used in the different documents. It then clusters the documents based on a similarity (or distance) measure. Intuitively that means: the more often authors use the same words, the closer their texts are. As you can see, R doesn’t really have to understand the words or the language used. It just applies statistical measures, which seems to be quite effective (not only) in this case. These statistical measures often yield a kind of characteristic signature of an author or a source. In a way this is the quantified version of fuzzy concepts like ‘writing style’, yet R can compute it much more quickly, even for long texts, than any human ever could!
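
To make the idea a bit more tangible, here is a minimal toy sketch in base R (my own illustration – stylo’s actual pipeline uses the most frequent words, z-scores and Delta distances, as seen in the output above, but the principle is the same):

# three tiny "documents"
docs <- c(A = "in the beginning was the word and the word was with god",
          B = "the word became flesh and lived among us and we saw the word",
          C = "blessed are the poor in spirit for theirs is the kingdom")
tokens <- strsplit(tolower(docs), "\\s+")  # split into words
vocab  <- sort(unique(unlist(tokens)))     # common vocabulary
freqs  <- t(sapply(tokens, function(w) table(factor(w, levels = vocab)) / length(w))) # relative word frequencies
plot(hclust(dist(freqs)), main = "Toy clustering by word frequencies")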

Happy New Year… stay tuned for much more to come in 2019!

Your AI journey… and Happy Holidays!

I want to draw your attention to a very valuable (and short!) whitepaper from my colleague, Professor Andrew Ng, where he shares important insights on how to lead companies into the AI era.

The five steps outlined in the paper are:

  1. Execute pilot projects to gain momentum
  2. Build an in-house AI team
  3. Provide broad AI training
  4. Develop an AI strategy
  5. Develop internal and external communications

Many points raised in the paper coincide with my own consulting experience in different industries: AI Transformation Playbook

I also recently gave an interview on the same topic, you can find it here:
Artificial intelligence: What is the best way for companies to innovate?

Happy reading!

If you need qualified support/consulting for your own AI journey, especially for crafting an AI strategy, setting up and managing AI projects, building an AI organization, and planning/conducting AI coaching, please contact me here: Prof. Dr. Holger K. von Jouanne-Diedrich

…and now for something completely different: like every year, Christmas arrives totally unexpectedly!

So, I wish you a Merry Christmas/Happy Holidays (whatever applies 😉 ) with the following little ctree function:

# ctree: prints a Christmas tree with N levels of branches
ctree <- function(N = 7) {
  for (i in 1:N) cat(rep("", N-i+1), rep("*", i), "\n")
  cat(rep("", N), "*\n")
}

ctree(5)
##      *
##     * *
##    * * *
##   * * * *
##  * * * * *
##      *
ctree()
##        *
##       * *
##      * * *
##     * * * *
##    * * * * *
##   * * * * * *
##  * * * * * * *
##        *
ctree(9)
##          *
##         * *
##        * * *
##       * * * *
##      * * * * *
##     * * * * * *
##    * * * * * * *
##   * * * * * * * *
##  * * * * * * * * *
##          *

Learning R: A gentle introduction to higher-order functions

Have you ever thought about why the definition of a function in R differs from that in many other programming languages? The part that causes the biggest difficulties (especially for beginners of R) is that you state the name of the function at the beginning and use the assignment operator – as if functions were like any other data type, like vectors, matrices or data frames…

Congratulations! You just encountered one of the big ideas of functional programming: functions are indeed like any other data type, they are not special – or in programming lingo, functions are first-class members. Now, you might ask: So what? Well, there are many ramifications, for example that you could use functions on other functions by using one function as an argument for another function. Sounds complicated?

In mathematics most of you will be familiar with taking the derivative of a function. When you think about it you could say that you put one function into the derivative function (or operator) and get out another function!
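
To see this in R, here is a little toy example of my own (a crude numerical approximation, just to illustrate the idea): a function that takes a function and returns another function, namely (an approximation of) its derivative.

# returns a new function that numerically approximates the derivative of f
derivative <- function(f, h = 1e-6) {
  function(x) (f(x + h) - f(x - h)) / (2 * h)
}
dsin <- derivative(sin) # dsin is itself a function...
dsin(0)                 # ...approx. 1, because the derivative of sin is cos and cos(0) = 1
dsin(pi)                # approx. -1 = cos(pi)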

In R there are many applications as well, let us go through a simple example step by step.

Let’s say I want to apply the mean function on the first four columns of the iris dataset. I could do the following:

mean(iris[ , 1])
## [1] 5.843333
mean(iris[ , 2])
## [1] 3.057333
mean(iris[ , 3])
## [1] 3.758
mean(iris[ , 4])
## [1] 1.199333

Quite tedious and not very elegant. Of course, we can use a for loop for that:

for (x in iris[1:4]) {
  print(mean(x))
}
## [1] 5.843333
## [1] 3.057333
## [1] 3.758
## [1] 1.199333

This works fine but there is an even more intuitive approach. Just look at the original task: “apply the mean function on the first four columns of the iris dataset” – so let us do just that:

apply(iris[1:4], 2, mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

Wow, this is very concise and works perfectly (the 2 just stands for “go through the data column wise”, 1 would be for “row wise”). apply is called a “higher-order function” and we could use it with all kinds of other functions:

apply(iris[1:4], 2, sd)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8280661    0.4358663    1.7652982    0.7622377
apply(iris[1:4], 2, min)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          4.3          2.0          1.0          0.1
apply(iris[1:4], 2, max)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          7.9          4.4          6.9          2.5

You can also use user-defined functions:

midrange <- function(x) (min(x) + max(x)) / 2
apply(iris[1:4], 2, midrange)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         6.10         3.20         3.95         1.30

We can even use new functions that are defined “on the fly” (or in functional programming lingo “anonymous functions”):

apply(iris[1:4], 2, function(x) (min(x) + max(x)) / 2)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         6.10         3.20         3.95         1.30

Let us now switch to another inbuilt data set, the mtcars dataset with 11 different variables of 32 cars (if you want to find out more, please consult the documentation):

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

To see the power of higher-order functions let us create a (numeric) matrix with minimum, first quartile, median, mean, third quartile and maximum for all 11 columns of the mtcars dataset with just one command!

apply(mtcars, 2, summary)
##              mpg    cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
## Min.    10.40000 4.0000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
## 1st Qu. 15.42500 4.0000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
## Median  19.20000 6.0000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
## Mean    20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
## 3rd Qu. 22.80000 8.0000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
## Max.    33.90000 8.0000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000

Wow, that was easy and the result is quite impressive, is it not?

Or if you want to perform a linear regression for all ten variables separately against mpg and want to get a table with all coefficients – there you go:

sapply(mtcars, function(x) round(coef(lm(mpg ~ x, data = mtcars)), 3))
##             mpg    cyl   disp     hp   drat     wt   qsec     vs     am  gear   carb
## (Intercept)   0 37.885 29.600 30.099 -7.525 37.285 -5.114 16.617 17.147 5.623 25.872
## x             1 -2.876 -0.041 -0.068  7.678 -5.344  1.412  7.940  7.245 3.923 -2.056

Here we used another higher-order function, sapply, together with an anonymous function. sapply goes through all the columns of a data frame (i.e. elements of a list) and tries to simplify the result (here you get back a nice matrix).

Often you might not even realise that you are using higher-order functions! I can tell you that in many programming languages it is quite a hassle to program a simple function plotter, i.e. a function which plots another function. In R it has already been done for you: you just use the higher-order function curve and give it the function you want to plot as an argument:

curve(sin(x) + cos(1/2 * x), -10, 10)

I want to give you one last example of another very helpful higher-order function (which not too many people know or use): by. It comes in very handy when you want to apply a function on different attributes split by a factor. So let’s say you want to get a summary of all the attributes of iris split by (!) species – here it comes:

by(iris[1:4], iris$Species, summary)
## iris$Species: setosa
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
## --------------------------------------------------------------
## iris$Species: versicolor
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width   
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800  
## --------------------------------------------------------------
## iris$Species: virginica
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500

This was just a very brief first look at this huge topic. There are more very powerful higher-order functions in R, like lapply, aggregate, replicate (very handy for numerical simulations) and many more. A good overview can be found in the answers to this question on Stack Overflow (my answer there is on the rather elusive switch function: switch).
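
Just to whet your appetite, here is a one-liner of my own with replicate: estimating the probability of rolling a total of seven with two dice by simulation (the exact value is 1/6).

mean(replicate(1e5, sum(sample(1:6, 2, replace = TRUE)) == 7)) # approx. 0.167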

For some reason people tend to confuse higher-order functions with recursive functions but that is the topic of another post, so stay tuned…

Intuition for principal component analysis (PCA)

Principal component analysis (PCA) is a dimension-reduction method that can be used to reduce a large set of (often correlated) variables into a smaller set of (uncorrelated) variables, called principal components, which still contain most of the information.

PCA is a concept that is traditionally hard to grasp, so instead of giving you the n-th mathematical derivation I will provide you with some intuition.

Basically PCA is nothing else but a projection of some higher dimensional object into a lower dimension. What sounds complicated is really something we encounter every day: when we watch TV we see a 2D-projection of 3D-objects!

Please watch this instructive video:

So what we want is not just any projection but a projection which retains as much information as possible. A good strategy is to find the longest axis of the object and then rotate the object around this axis so that we can also find the second-longest axis.

Mathematically this can be done by calculating the so-called eigenvectors of the covariance matrix; after that we just use the n eigenvectors with the biggest eigenvalues to project the object into n-dimensional space (in our example n = 2) – but as I said, I won’t go into the mathematical details here (many good explanations can be found here on stackexchange).

To make matters concrete we will now recreate the example from the video above! The teapot data is from https://www.khanacademy.org/computer-programming/3d-teapot/971436783 – there you can also turn it around interactively.

Have a look at the following code:

library(scatterplot3d)

teapot <- read.csv("data/teapot.csv", header = FALSE)
dat <- matrix(unlist(teapot), ncol = 3, byrow = TRUE)
head(dat)
##      [,1] [,2] [,3]
## [1,] 40.6 28.3 -1.1
## [2,] 40.1 30.4 -1.1
## [3,] 40.7 31.1 -1.1
## [4,] 42.0 30.4 -1.1
## [5,] 43.5 28.3 -1.1
## [6,] 37.5 28.3 14.5

# teapot in 3D
scatterplot3d(dat[ , 1], dat[ , 2], dat[ , 3], highlight.3d = TRUE, angle = 75, pch = 19, lwd = 15, xlab = "", ylab = "", zlab = "", main = "teapot in 3D")

And now we project that teapot from 3D-space onto a 2D-plane while retaining as much information as possible by taking the two eigenvectors with biggest eigenvalues:

# PCA
(eigenval <- eigen(cov(dat))$values)
## [1] 2050.6935  747.5764  553.4070

eigenvec <- eigen(cov(dat))$vectors
# take only the two eigenvectors with biggest eigenvalues
PCA_2 <- dat %*% eigenvec[ , 1:2] 
plot(PCA_2, ylim = c(50, -55), pch = 19, lwd = 35, col= "blue", xlab = "", ylab = "", main = "teapot in 2D")

round(cumsum(eigenval)/sum(eigenval) * 100, 2) # cumulative percentage of retrieved information
## [1]  61.18  83.49 100.00
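
By the way, base R’s prcomp does the same heavy lifting for us. Here is a quick cross-check of my own (using the same dat matrix as above; note that prcomp centers the data, so the scores can differ from the manual projection by a constant shift and by the signs of the eigenvectors):

pca <- prcomp(dat)
summary(pca)          # proportions of variance correspond to the eigenvalue percentages above
plot(pca$x[ , 1:2], pch = 19, col = "blue", xlab = "", ylab = "", main = "teapot in 2D via prcomp")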

That was fun, wasn’t it!

In data science you will often use PCA to condense large datasets into smaller ones while retaining most of the information (which in this case means retaining as much variance as possible). PCA is, like clustering, an unsupervised learning method.

Why R for Data Science – and not Python?


There are literally hundreds of programming languages out there, e.g. the whole alphabet of one-letter programming languages is already taken. In the area of data science there are two big contenders: R and Python. Now, why is this blog about R and not Python?

I have to make a confession: I really wanted to like Python. I dug deep into the language and some of its extensions. Yet, it never really worked for me. I think one of the problems is that Python tries to be everybody’s darling. It can do everything… and its opposite. No, really, it is a great language to learn programming but I think it has some really serious flaws. I list some of them here:

  • It starts with which version to use! The current version has release number 3 but there is still a lot of code based on the former version 2. The problem is that Python 3 is not backward compatible with Python 2. Even the syntax of the print command got changed!
  • The next thing is which distribution to choose! What seems like a joke to R users is a sad reality for Python users: there are all kinds of different distributions out there. The best known for data science is Anaconda: https://www.anaconda.com/. One of the reasons for this is that the whole package system in Python is a mess. To give you just a taste, have a look at the official documentation: https://packaging.python.org/tutorials/installing-packages/ – seven (!) pages for what is basically one command in R: install.packages() (I know, this is not entirely fair, but you get the idea).
  • There are several GUIs out there and admittedly which one to use is also a matter of taste, but in my opinion, when it comes to data science tasks – where you need a combination of interactive work and scripts – there is no better GUI than RStudio (for Python there is now Rodeo, free download here: https://www.yhat.com/products/rodeo, but I don’t know how mature it is).
  • There is no general rule for when to use a function and when to use a method on an object. The reason for this problem is what I stated above: Python wants to be everybody’s darling and tries to achieve everything at the same time. That it is not just me can be seen in this illuminating discussion where people scramble to find criteria for when to use which: https://stackoverflow.com/questions/8108688/in-python-when-should-i-use-a-function-instead-of-a-method. A concrete example can be found here, where it is explained why the function any(df2 == 1) gives the wrong result and you have to use e.g. the method (df2 == 1).any() instead. Very confusing and error-prone.
  • More sophisticated data science data structures are not part of the core language. For example, you need the NumPy package for vectors and the pandas package for data frames. That in itself is not the problem but rather the inconsistencies this brings. To give you just one example: whereas vectorized code is supported by NumPy and pandas, it is not supported in base Python, where you have to fall back on good old loops instead.
  • Both Python and R are not the fastest of languages, but the integration with one of the fastest, namely C++, is so much better in R (via Rcpp by Dirk Eddelbuettel) than in Python that it can by now be considered a standard approach. All R data structures are supported by corresponding C++ classes and there is a generic way to write ultra-fast C++ functions that can be called like regular R functions:
    library(Rcpp)

    bmi_R <- function(weight, height) {
      weight / (height * height)
    }
    bmi_R(80, 1.85) # body mass index of person with 80 kg and 185 cm
    ## [1] 23.37473

    cppFunction("
      float bmi_cpp(float weight, float height) {
        return weight / (height * height);
      }
    ")
    bmi_cpp(80, 1.85) # same with the C++ function
    ## [1] 23.37473
    

    One of the main reasons for using Python in the data science arena is shrinking by the day: neural networks. The main frameworks, like TensorFlow, and APIs, like Keras, used to be accessible mainly from Python, but there are now excellent wrappers available for R too (https://tensorflow.rstudio.com/ and https://keras.rstudio.com/).


    All in all I think that R is really the best choice for most data science applications. The learning curve may be a little bit steeper at the beginning but when you get to more sophisticated concepts it becomes easier to use than Python.

    If you now want to learn R see this post of mine: Learning R: The Ultimate Introduction (incl. Machine Learning!)

    OneR – Fascinating Insights through Simple Rules


    We already saw the power of the OneR package in the preceding post, One Rule (OneR) Machine Learning Classification in under One Minute. Here we want to give some more examples to gain some fascinating, often counter-intuitive, insights.

    Shirin Glander of Muenster University tested the OneR package with data from the World Happiness Report to find out what makes people happy: https://shiring.github.io/machine_learning/2017/04/23/one_r.

    The whole analysis is very sophisticated so I present a simplified approach here. Have a look at the following code:

    library(OneR)
    data <- read.csv("data/2016.csv", sep = ",", header = TRUE)
    data$Happiness.Score <- bin(data$Happiness.Score, nbins = 3, labels = c("low", "middle", "high"), method = "content")
    data <- maxlevels(data) # removes all columns that have more than 20 levels (Country)
    data <- optbin(formula = Happiness.Score ~., data = data, method = "infogain")
    data <- data[ , -(1:4)] # remove columns with redundant data
    model <- OneR(formula = Happiness.Score ~., data = data, verbose = TRUE)
    ## 
    ##     Attribute                     Accuracy
    ## 1 * Economy..GDP.per.Capita.      71.97%  
    ## 2   Health..Life.Expectancy.      70.06%  
    ## 3   Family                        65.61%  
    ## 4   Dystopia.Residual             59.87%  
    ## 5   Freedom                       57.32%  
    ## 6   Trust..Government.Corruption. 55.41%  
    ## 7   Generosity                    47.13%  
    ## ---
    ## Chosen attribute due to accuracy
    ## and ties method (if applicable): '*'
    
    summary(model)
    ## 
    ## Call:
    ## OneR.formula(formula = Happiness.Score ~ ., data = data, verbose = TRUE)
    ## 
    ## Rules:
    ## If Economy..GDP.per.Capita. = (-0.00182,0.675] then Happiness.Score = low
    ## If Economy..GDP.per.Capita. = (0.675,1.32]     then Happiness.Score = middle
    ## If Economy..GDP.per.Capita. = (1.32,1.83]      then Happiness.Score = high
    ## 
    ## Accuracy:
    ## 113 of 157 instances classified correctly (71.97%)
    ## 
    ## Contingency table:
    ##                Economy..GDP.per.Capita.
    ## Happiness.Score (-0.00182,0.675] (0.675,1.32] (1.32,1.83] Sum
    ##          low                * 36           16           0  52
    ##          middle                4         * 46           2  52
    ##          high                  0           22        * 31  53
    ##          Sum                  40           84          33 157
    ## ---
    ## Maximum in each column: '*'
    ## 
    ## Pearson's Chi-squared test:
    ## X-squared = 130.99, df = 4, p-value < 2.2e-16
    
    plot(model)
    

    prediction <- predict(model, data)
    eval_model(prediction, data)
    ## 
    ## Confusion matrix (absolute):
    ##           Actual
    ## Prediction high low middle Sum
    ##     high     31   0      2  33
    ##     low       0  36      4  40
    ##     middle   22  16     46  84
    ##     Sum      53  52     52 157
    ## 
    ## Confusion matrix (relative):
    ##           Actual
    ## Prediction high  low middle  Sum
    ##     high   0.20 0.00   0.01 0.21
    ##     low    0.00 0.23   0.03 0.25
    ##     middle 0.14 0.10   0.29 0.54
    ##     Sum    0.34 0.33   0.33 1.00
    ## 
    ## Accuracy:
    ## 0.7197 (113/157)
    ## 
    ## Error rate:
    ## 0.2803 (44/157)
    ## 
    ## Error rate reduction (vs. base rate):
    ## 0.5769 (p-value < 2.2e-16)
    

    As you can see we get more than 70% accuracy with three rules based on GDP per capita. So it seems that money CAN buy happiness after all!

    Dr. Glander comes to the following conclusion:

    All in all, based on this example, I would confirm that OneR models do indeed produce sufficiently accurate models for setting a good baseline. OneR was definitely faster than random forest, gradient boosting and neural nets. However, the latter were more complex models and included cross-validation.
    If you prefer an easy to understand model that is very simple, OneR is a very good way to go. You could also use it as a starting point for developing more complex models with improved accuracy.
    When looking at feature importance across models, the feature OneR chose – Economy/GDP per capita – was confirmed by random forest, gradient boosting trees and neural networks as being the most important feature. This is in itself an interesting conclusion! Of course, this correlation does not tell us that there is a direct causal relationship between money and happiness, but we can say that a country’s economy is the best individual predictor for how happy people tend to be.

    A well-known reference data set is the Titanic data set, which gives all kinds of additional information on the passengers of the Titanic and whether they survived the tragedy. We can conveniently use the titanic package from CRAN to get access to the data set. Have a look at the following code:

    library(titanic)
    data <- bin(maxlevels(titanic_train), na.omit = FALSE)
    model <- OneR(Survived ~., data = data, verbose = TRUE)
    ## Warning in OneR.data.frame(x = data, ties.method = ties.method, verbose =
    ## verbose, : data contains unused factor levels
    ## 
    ##     Attribute   Accuracy
    ## 1 * Sex         78.68%  
    ## 2   Pclass      67.9%   
    ## 3   Fare        64.42%  
    ## 4   Embarked    63.86%  
    ## 5   Age         62.74%  
    ## 6   Parch       61.73%  
    ## 7   PassengerId 61.62%  
    ## 7   SibSp       61.62%  
    ## ---
    ## Chosen attribute due to accuracy
    ## and ties method (if applicable): '*'
    
    summary(model)
    ## 
    ## Call:
    ## OneR.formula(formula = Survived ~ ., data = data, verbose = TRUE)
    ## 
    ## Rules:
    ## If Sex = female then Survived = 1
    ## If Sex = male   then Survived = 0
    ## 
    ## Accuracy:
    ## 701 of 891 instances classified correctly (78.68%)
    ## 
    ## Contingency table:
    ##         Sex
    ## Survived female  male Sum
    ##      0       81 * 468 549
    ##      1    * 233   109 342
    ##      Sum    314   577 891
    ## ---
    ## Maximum in each column: '*'
    ## 
    ## Pearson's Chi-squared test:
    ## X-squared = 260.72, df = 1, p-value < 2.2e-16
    
    plot(model)
    

    prediction <- predict(model, data)
    eval_model(prediction, data$Survived)
    ## 
    ## Confusion matrix (absolute):
    ##           Actual
    ## Prediction   0   1 Sum
    ##        0   468 109 577
    ##        1    81 233 314
    ##        Sum 549 342 891
    ## 
    ## Confusion matrix (relative):
    ##           Actual
    ## Prediction    0    1  Sum
    ##        0   0.53 0.12 0.65
    ##        1   0.09 0.26 0.35
    ##        Sum 0.62 0.38 1.00
    ## 
    ## Accuracy:
    ## 0.7868 (701/891)
    ## 
    ## Error rate:
    ## 0.2132 (190/891)
    ## 
    ## Error rate reduction (vs. base rate):
    ## 0.4444 (p-value < 2.2e-16)
    

    So, somewhat contrary to popular belief, it was not necessarily the rich who survived but the women!

    A more challenging data set is one about the quality of red wine (https://archive.ics.uci.edu/ml/datasets/Wine+Quality); have a look at the following code:

    data <- read.csv("data/winequality-red.csv", header = TRUE, sep = ";")
    data <- optbin(data, method = "infogain")
    ## Warning in optbin.data.frame(data, method = "infogain"): target is numeric
    model <- OneR(data, verbose = TRUE)
    ## 
    ##     Attribute            Accuracy
    ## 1 * alcohol              56.1%   
    ## 2   sulphates            51.22%  
    ## 3   volatile.acidity     50.59%  
    ## 4   total.sulfur.dioxide 48.78%  
    ## 5   density              47.22%  
    ## 6   citric.acid          46.72%  
    ## 7   chlorides            46.47%  
    ## 8   free.sulfur.dioxide  44.47%  
    ## 9   fixed.acidity        43.96%  
    ## 10  residual.sugar       43.46%  
    ## 11  pH                   43.21%  
    ## ---
    ## Chosen attribute due to accuracy
    ## and ties method (if applicable): '*'
    
    summary(model)
    ## 
    ## Call:
    ## OneR.data.frame(x = data, verbose = TRUE)
    ## 
    ## Rules:
    ## If alcohol = (8.39,8.45] then quality = 3
    ## If alcohol = (8.45,10]   then quality = 5
    ## If alcohol = (10,11]     then quality = 6
    ## If alcohol = (11,12.5]   then quality = 6
    ## If alcohol = (12.5,14.9] then quality = 6
    ## 
    ## Accuracy:
    ## 897 of 1599 instances classified correctly (56.1%)
    ## 
    ## Contingency table:
    ##        alcohol
    ## quality (8.39,8.45] (8.45,10] (10,11] (11,12.5] (12.5,14.9]  Sum
    ##     3           * 1         5       4         0           0   10
    ##     4             0        28      13        11           1   53
    ##     5             0     * 475     156        41           9  681
    ##     6             1       216   * 216     * 176        * 29  638
    ##     7             0        19      53       104          23  199
    ##     8             0         2       2         6           8   18
    ##     Sum           2       745     444       338          70 1599
    ## ---
    ## Maximum in each column: '*'
    ## 
    ## Pearson's Chi-squared test:
    ## X-squared = 546.64, df = 20, p-value < 2.2e-16
    
    prediction <- predict(model, data)
    eval_model(prediction, data, zero.print = ".")
    ## 
    ## Confusion matrix (absolute):
    ##           Actual
    ## Prediction    3    4    5    6    7    8  Sum
    ##        3      1    .    .    1    .    .    2
    ##        4      .    .    .    .    .    .    .
    ##        5      5   28  475  216   19    2  745
    ##        6      4   25  206  421  180   16  852
    ##        7      .    .    .    .    .    .    .
    ##        8      .    .    .    .    .    .    .
    ##        Sum   10   53  681  638  199   18 1599
    ## 
    ## Confusion matrix (relative):
    ##           Actual
    ## Prediction    3    4    5    6    7    8  Sum
    ##        3      .    .    .    .    .    .    .
    ##        4      .    .    .    .    .    .    .
    ##        5      . 0.02 0.30 0.14 0.01    . 0.47
    ##        6      . 0.02 0.13 0.26 0.11 0.01 0.53
    ##        7      .    .    .    .    .    .    .
    ##        8      .    .    .    .    .    .    .
    ##        Sum 0.01 0.03 0.43 0.40 0.12 0.01 1.00
    ## 
    ## Accuracy:
    ## 0.561 (897/1599)
    ## 
    ## Error rate:
    ## 0.439 (702/1599)
    ## 
    ## Error rate reduction (vs. base rate):
    ## 0.2353 (p-value < 2.2e-16)
    

    That is an interesting result, isn’t it: the more alcohol the higher the perceived quality!

    We end this post with an unconventional use case for OneR: finding the best move in a strategy game, namely tic-tac-toe. We use a dataset with all possible board configurations at the end of such games; have a look at the following code:

    data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/tic-tac-toe/tic-tac-toe.data", header = FALSE)
    names(data) <- c("top-left-square", "top-middle-square", "top-right-square", "middle-left-square", "middle-middle-square", "middle-right-square", "bottom-left-square", "bottom-middle-square", "bottom-right-square", "Class")
    model <- OneR(data, verbose = TRUE)
    ## 
    ##     Attribute            Accuracy
    ## 1 * middle-middle-square 69.94%  
    ## 2   top-left-square      65.34%  
    ## 2   top-middle-square    65.34%  
    ## 2   top-right-square     65.34%  
    ## 2   middle-left-square   65.34%  
    ## 2   middle-right-square  65.34%  
    ## 2   bottom-left-square   65.34%  
    ## 2   bottom-middle-square 65.34%  
    ## 2   bottom-right-square  65.34%  
    ## ---
    ## Chosen attribute due to accuracy
    ## and ties method (if applicable): '*'
    
    summary(model)
    ## 
    ## Call:
    ## OneR.data.frame(x = data, verbose = TRUE)
    ## 
    ## Rules:
    ## If middle-middle-square = b then Class = positive
    ## If middle-middle-square = o then Class = negative
    ## If middle-middle-square = x then Class = positive
    ## 
    ## Accuracy:
    ## 670 of 958 instances classified correctly (69.94%)
    ## 
    ## Contingency table:
    ##           middle-middle-square
    ## Class          b     o     x Sum
    ##   negative    48 * 192    92 332
    ##   positive * 112   148 * 366 626
    ##   Sum        160   340   458 958
    ## ---
    ## Maximum in each column: '*'
    ## 
    ## Pearson's Chi-squared test:
    ## X-squared = 115.91, df = 2, p-value < 2.2e-16
    
    prediction <- predict(model, data)
    eval_model(prediction, data)
    ## 
    ## Confusion matrix (absolute):
    ##           Actual
    ## Prediction negative positive Sum
    ##   negative      192      148 340
    ##   positive      140      478 618
    ##   Sum           332      626 958
    ## 
    ## Confusion matrix (relative):
    ##           Actual
    ## Prediction negative positive  Sum
    ##   negative     0.20     0.15 0.35
    ##   positive     0.15     0.50 0.65
    ##   Sum          0.35     0.65 1.00
    ## 
    ## Accuracy:
    ## 0.6994 (670/958)
    ## 
    ## Error rate:
    ## 0.3006 (288/958)
    ## 
    ## Error rate reduction (vs. base rate):
    ## 0.1325 (p-value = 0.001427)
    

    Perhaps it doesn’t come as a surprise that the middle-middle-square is strategically the most important one – but still it is encouraging to see that OneR comes to the same conclusion.

    One Rule (OneR) Machine Learning Classification in under One Minute

    You can also watch this video which goes through the following example step-by-step:

    After installing the OneR package from CRAN, load it

    library(OneR)
    

    Use the famous Iris dataset and determine optimal bins for numeric data

    data <- optbin(iris)
    

    Build model with best predictor

    model <- OneR(data, verbose = TRUE)
    ## 
    ##     Attribute    Accuracy
    ## 1 * Petal.Width  96%     
    ## 2   Petal.Length 95.33%  
    ## 3   Sepal.Length 74.67%  
    ## 4   Sepal.Width  55.33%  
    ## ---
    ## Chosen attribute due to accuracy
    ## and ties method (if applicable): '*'
    

    Show learned rules and model diagnostics

    summary(model)
    ## 
    ## Call:
    ## OneR(data = data, verbose = TRUE)
    ## 
    ## Rules:
    ## If Petal.Width = (0.0976,0.791] then Species = setosa
    ## If Petal.Width = (0.791,1.63]   then Species = versicolor
    ## If Petal.Width = (1.63,2.5]     then Species = virginica
    ## 
    ## Accuracy:
    ## 144 of 150 instances classified correctly (96%)
    ## 
    ## Contingency table:
    ##             Petal.Width
    ## Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
    ##   setosa               * 50            0          0  50
    ##   versicolor              0         * 48          2  50
    ##   virginica               0            4       * 46  50
    ##   Sum                    50           52         48 150
    ## ---
    ## Maximum in each column: '*'
    ## 
    ## Pearson's Chi-squared test:
    ## X-squared = 266.35, df = 4, p-value < 2.2e-16
    

    Plot model diagnostics

    plot(model)
    

    Use model to predict data

    prediction <- predict(model, data)
    

    Evaluate prediction statistics

    eval_model(prediction, data)
    ## 
    ## Confusion matrix (absolute):
    ##             Actual
    ## Prediction   setosa versicolor virginica Sum
    ##   setosa         50          0         0  50
    ##   versicolor      0         48         4  52
    ##   virginica       0          2        46  48
    ##   Sum            50         50        50 150
    ## 
    ## Confusion matrix (relative):
    ##             Actual
    ## Prediction   setosa versicolor virginica  Sum
    ##   setosa       0.33       0.00      0.00 0.33
    ##   versicolor   0.00       0.32      0.03 0.35
    ##   virginica    0.00       0.01      0.31 0.32
    ##   Sum          0.33       0.33      0.33 1.00
    ## 
    ## Accuracy:
    ## 0.96 (144/150)
    ## 
    ## Error rate:
    ## 0.04 (6/150)
    ## 
    ## Error rate reduction (vs. base rate):
    ## 0.94 (p-value < 2.2e-16)
    

    Please note that the very good accuracy of 96% is reached effortlessly.

    “Petal.Width” is identified as the attribute with the highest predictive value. The cut points of the intervals are found automatically (via the included optbin function). The results are three very simple, yet accurate, rules to predict the respective species.

    The nearly perfect separation of the areas in the diagnostic plot gives a good indication of the model’s ability to separate the different species.

    The whole code of this post:

    library(OneR)
    data <- optbin(iris)
    model <- OneR(data, verbose = TRUE)
    summary(model)
    plot(model)
    prediction <- predict(model, data)
    eval_model(prediction, data)
    

    More sophisticated examples will follow in upcoming posts… so stay tuned!

    Update: For more examples see e.g. this post: OneR – Fascinating Insights through Simple Rules

    Help

    From within R:

    help(package = OneR)
    

    …or as a pdf here: OneR.pdf

    The package vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models

    Issues can be posted here: https://github.com/vonjd/OneR/issues

    Feedback

    I would love to hear about your experiences with the OneR package. Please drop a line or two in the comments – Thank you!