Today the world’s biggest book fair starts again in Frankfurt, Germany. I thought this might be a good opportunity to do you some good!

*Springer* is one of the most renowned scientific publishing companies in the world. Normally their books are quite expensive, but *Open Access* is a megatrend in the publishing business too.

If you want to use R in a fun little project to find the latest open access additions to their program, read on!

The idea is to create an R script which you can run from time to time to see whether there are new titles available. So, we need some place to store the retrieved data in a persistent manner: a *database*! For our purposes here most database systems would be total overkill, but there is one great solution available: the amazing `RSQLite` package (on CRAN).

This package brings its own lightweight database with it, no need to install any additional software! And it is fully *SQL* compatible (for *Structured Query Language*, the industry standard of *relational database management systems*) like any decent database software.
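To get a feel for how this works before diving into the full script, here is a minimal, self-contained sketch (using the built-in `mtcars` data set, nothing from the Springer search yet): we create a throwaway in-memory database, write a data frame to it and query it with plain SQL:

```r
library(DBI) # the RSQLite package must be installed as well

# create a temporary in-memory database: no file, no server needed
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# write a built-in data frame to the database as a table...
dbWriteTable(con, "mtcars", mtcars)

# ...and query it with standard SQL
res <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
res # one row per number of cylinders: 11 cars with 4 cyl, 7 with 6, 14 with 8

dbDisconnect(con)
```

The same pattern (connect, write table, query, disconnect) is all the Springer script below needs.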

So, you only have to install the `RSQLite` package and then load the `DBI` package (for *database interface*). To render the output table in an appealing form we will use the `htmlTable` package (on CRAN).

Have a look at the following fully documented code which should (hopefully) be quite clear:

```r
library(DBI)
library(htmlTable)

# initial search for English books from 2019
springer_initial <- read.csv("https://link.springer.com/search/csv?facet-content-type=%22Book%22&previous-end-year=2019&date-facet-mode=in&facet-language=%22En%22&showAll=false&query=&facet-end-year=2019&previous-start-year=2019&facet-start-year=2019", encoding = "UTF-8")

# current search for English books from 2020 - has to be updated in the following years!
springer_search <- read.csv("https://link.springer.com/search/csv?previous-end-year=2020&facet-content-type=%22Book%22&date-facet-mode=in&previous-start-year=2020&facet-language=%22En%22&showAll=false&query=&facet-start-year=2020&facet-end-year=2020", encoding = "UTF-8")

# open database connection
springer_db <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")

# initialize database
if (!dbExistsTable(springer_db, "search")) {
  dbWriteTable(springer_db, "search", springer_initial)
}

# read current search table, replace it with new search and compare both
springer_search_old <- dbReadTable(springer_db, "search")
dbRemoveTable(springer_db, "search")
dbWriteTable(springer_db, "search", springer_search)
new_books <- setdiff(springer_search_old, dbReadTable(springer_db, "search"))
if (nrow(new_books) > 0) htmlTable(new_books[c("Item.Title", "Authors", "URL")])
```

[showing only a subset of the more than 200 (!) free titles in 2019]

| | Item.Title | Authors | URL |
|---|---|---|---|
| 47 | Disrupting Finance | Theo Lynn, Prof. John G. Mooney, Dr. Pierangelo Rosati, Prof. Mark Cummins | http://link.springer.com/book/10.1007/978-3-030-02330-0 |
| 84 | Understanding Statistics and Experimental Design | Prof. Dr. Michael H. Herzog, Prof. Dr. Gregory Francis, Ph.D. Aaron Clarke | http://link.springer.com/book/10.1007/978-3-030-03499-3 |
| 85 | Information—Consciousness—Reality | Dr. James B. Glattfelder | http://link.springer.com/book/10.1007/978-3-030-03633-1 |
| 133 | Modelling our Changing World | Dr. Jennifer L. Castle, Prof. Dr. David F. Hendry | http://link.springer.com/book/10.1007/978-3-030-21432-6 |
| 147 | Fundamentals of Clinical Data Science | Dr. Pieter Kubben, Michel Dumontier, Prof. Dr. Andre Dekker | http://link.springer.com/book/10.1007/978-3-319-99713-1 |
| 169 | Reality Lost | Vincent F. Hendricks, Mads Vestergaard | http://link.springer.com/book/10.1007/978-3-030-00813-0 |
| 172 | The Brownian Motion | Prof. Dr. Andreas Löffler, Prof. Dr. Lutz Kruschwitz | http://link.springer.com/book/10.1007/978-3-030-20103-6 |
| 186 | Automated Machine Learning | Prof. Dr. Frank Hutter, Lars Kotthoff, Ph.D. Joaquin Vanschoren | http://link.springer.com/book/10.1007/978-3-030-05318-5 |
| 209 | Lithium-Ion Batteries | Beta Writer | http://link.springer.com/book/10.1007/978-3-030-16800-1 |

```r
# close database connection
dbDisconnect(springer_db)
```

And you thought Christmas was still to come, right?

As an aside, the last entry is an especially interesting case: it is the first machine-generated research book! The “author” *Beta Writer* was developed in a joint effort and in collaboration between Springer and researchers from Goethe University, Frankfurt. The book is a cross-corpora *auto-summarization* of current texts from SpringerLink, organized by means of a *similarity-based clustering* routine in coherent chapters and sections. It automatically condenses a large set of papers into a reasonably short book. More technical details of this fascinating endeavor, with the potential to revolutionize scientific publishing, can be found in the preface of the book.

By clicking on a link you will be taken directly to the respective book page, where you can download the *pdf* and in most cases also an *epub* file (bonus tip: in most cases you can also download a free Kindle version of the book on amazon.com). To get clickable links you need to render the script as an *HTML* document with *R Markdown*; if you run it directly in RStudio you will have to copy and paste the links into your browser.

You just have to run the script from time to time to see what is new!

If you want to customize the data retrieved from link.springer.com have a look at their search interface:

You can customize your search by changing the values in the blue boxes. To get the URL which you can paste into the `read.csv` function above, just right-click on the button with the down arrow in the upper right corner (marked by the blue arrow) and choose “Copy link address” in the context menu.

In case you want to completely reset the database you can use the following function (with care):

```r
# function for resetting the springer database
reset_springer_db <- function() {
  springer_db <- dbConnect(RSQLite::SQLite(), "my-db.sqlite")
  dbRemoveTable(springer_db, "search")
  dbDisconnect(springer_db)
}
```

One small thing: although I tried my very best there still seems to be an issue with the encoding… some special characters, like the German umlauts äöüÄÖÜ, are just not rendered. If you have a solution for me please leave it in the comments and I will add it to the post (or perhaps even write a post on the issues of encoding in R, RStudio and Windows).


The two most disruptive political events of the last few years are undoubtedly the Brexit referendum to leave the European Union and the election of Donald Trump. Both are commonly associated with the political consulting firm *Cambridge Analytica* and a technique known as *Microtargeting*.

If you want to understand the *data science* behind the *Cambridge Analytica/Facebook data scandal* and Microtargeting (i.e. *LASSO regression*) by building a toy example in R read on!

The following post is mainly based on the excellent case study “Cambridge Analytica: Rise and Fall” by my colleague Professor Oliver Gassmann (who was so kind as to email it to me) and Raphael Boemelburg, both from the University of St. Gallen, Switzerland (where I did my PhD), and “Weapons of Micro Destruction: How Our ‘Likes’ Hijacked Democracy” by Data Scientist Dave Smith (the data for the toy example is also from that article).

Also well worth a watch is the Netflix documentary “The Great Hack”. I encourage all of my colleagues from academia and all teachers to consider screening it for their students/pupils. Not many know that Netflix is kind enough to provide a license to do this legally: the Netflix grant of permission for educational screenings. In this documentary, it becomes clear that microtargeting can be considered a form of *psychological* and *information warfare*, i.e. a military-grade *weapon* with the potential to *destroy our democratic system*:

So, how does it actually work?

Basically, *Microtargeting* means predicting psychological profiles on the basis of social media activity and using that knowledge to address different personality types with customized ads. Microtargeting is not only used in the political arena but of course also in *Marketing* and *Customer Relationship Management (CRM)*.

A well-known psychological model is the so-called OCEAN model:

The five personality traits are:

- *Openness to experience* (inventive/curious vs. consistent/cautious)
- *Conscientiousness* (efficient/organized vs. easy-going/careless)
- *Extraversion* (outgoing/energetic vs. solitary/reserved)
- *Agreeableness* (friendly/compassionate vs. challenging/detached)
- *Neuroticism* (sensitive/nervous vs. secure/confident)

You can find out about your own personality by taking this free, anonymous test: The Big Five Project Personality Test.

If you had the psychological profiles of many individuals, you could use them together with modern advertising technology (which lets you show different ads to different people) to cater to their individual needs, with the aim of manipulating them very efficiently:

So far, so standard… the difficult part is predicting psychological traits with high accuracy! And this is where data science comes into play, namely a technique called *LASSO regression* (for *Least Absolute Shrinkage and Selection Operator*), or simply *the LASSO*.

If you are not familiar with *Classical Linear Regression (CLR)* please read my post Learning Data Science: Modelling Basics first. The difference between CLR and the LASSO is that with the latter you simultaneously minimize the error of the regression and the sum of the absolute values of the coefficients, so that some coefficients become exactly zero. Thereby you only retain the important variables: LASSO regression provides automatic *variable selection*! The other effect is that by effectively reducing the complexity of the model (also called *shrinkage* in this context, a form of *regularization*) you prevent *overfitting*.

Mathematically, you minimize the following expression:

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j|$$

The first summand is the error term of classical linear regression (the squared difference between the real values and the predicted values), the second is the regularization term. $\lambda$ (lambda) is a *hyperparameter* which controls how big the shrinkage of the coefficients gets (the bigger $\lambda$, the smaller the coefficients).

Enough of the theory, let’s get to a toy example! We use the psychological factor “Openness” as the trait to predict. We have seven individuals with their openness score and their individual likes on certain Facebook posts (you can find the data for this example here: cambridge-analytica.csv). We use five for the *training set* and two for the *test set*:

```r
data <- read.csv("data/cambridge-analytica.csv")
data
## Openness Person The_Colbert_Report TED George_Takei Meditation
## 1 1.85 Adam 1 1 1 1
## 2 1.60 Bob 1 1 1 1
## 3 -0.26 Cathy 0 1 1 0
## 4 -2.00 Donald 0 1 0 0
## 5 -2.50 Erin 0 0 0 0
## 6 1.77 Hilary 1 1 1 1
## 7 -2.20 Mike 0 0 0 0
## Bass_Pro_Shops NFL_Network The_Bachelor
## 1 0 0 0
## 2 0 0 0
## 3 0 0 1
## 4 1 1 0
## 5 1 1 1
## 6 0 0 0
## 7 1 1 0
## Ok_If_we_get_caught_heres_the_story
## 1 0
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1

x_train <- as.matrix(data[1:5, 3:10])
x_test <- as.matrix(data[6:7, 3:10])
y_train <- as.matrix(data[1:5, 1])
y_test <- as.matrix(data[6:7, 1])
```

Now we build the actual model with the excellent `glmnet` package (on CRAN), which has been co-developed by one of the inventors of the LASSO, my colleague Professor Robert Tibshirani from Stanford University. After that we plot how the coefficients get smaller as $\lambda$ gets bigger:

```r
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-18

LASSO <- glmnet(x_train, y_train, alpha = 1) # alpha = 1 for LASSO regression
plot(LASSO, xvar = "lambda")
legend("bottomright", lwd = 1, col = 1:6, legend = colnames(x_train), cex = .7)
```

In the plot, you can see how a growing $\lambda$ lets the coefficients shrink. To find a good value for $\lambda$ we use a technique called *cross-validation*: it automatically builds many different training and test sets and averages the error over all of them for different values of $\lambda$. After that, we plot the resulting errors with upper and lower standard deviations:

```r
cv_LASSO <- cv.glmnet(x_train, y_train)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
plot(cv_LASSO)
```

In this case, we see that the minimal $\lambda$ gives the smallest error, so we choose it for the prediction of the openness score of our test set:

```r
round(predict(LASSO, x_test, s = cv_LASSO$lambda.min), 2)
## 1
## 6 1.58
## 7 -2.41
y_test
## [,1]
## [1,] 1.77
## [2,] -2.20
```

Not too bad! In reality, with only 70 likes the algorithm can assess a personality better than a friend of the person could, with 150 likes better than the parents, and with 300 likes even better than the spouse! Creepy, isn’t it?

Another big advantage is the *interpretability* of LASSO regression. It is easily discernible that “The Colbert Report” and “George Takei” (a former Star Trek actor who became a gay rights and left-leaning political activist) are the biggest drivers here:

```r
round(coef(LASSO, s = cv_LASSO$lambda.min), 2)
## 9 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -2.23
## The_Colbert_Report 1.84
## TED 0.43
## George_Takei 1.72
## Meditation 0.00
## Bass_Pro_Shops .
## NFL_Network .
## The_Bachelor .
## Ok_If_we_get_caught_heres_the_story -0.18
```

It is no overstatement to say that with those new possibilities of microtargeting we have entered a new era of potential (and real!) manipulation. I hope that you now understand the data science behind it better.

In my opinion, this knowledge is important for taking part in the necessary conversation about the consequences for our society. This conversation has only just begun… I am looking forward to your comments on this topic!


A few months ago I posted about *market basket analysis* (see Customers who bought…). In this post we will see another form of it, done with *logistic regression*, so read on…

A big supermarket chain wanted to *target* (wink, wink) certain customer groups better. In this special case we are talking about *pregnant* women. The story goes that they identified a young girl as being pregnant and kept sending her coupons for baby care products. Now, the father got angry because she was “too young”… and complained to the supermarket. The whole story took a turn when his daughter confessed that… well, you know what! We are now going to reproduce a similar model here!

In this example, we have a dataset with products bought by customers with the additional information whether the respective buyer was pregnant or not. This is coded in the last column as 1 for pregnant and 0 for not pregnant, 500 instances each. As always all kinds of analyses could be used but we stick with good old logistic regression because, first, it works quite well, and second, as we will see, the results are *interpretable* in this case.

Have a look at the following code (the data is from the book Data Smart by John Foreman and can be downloaded here: ch06.zip):

```r
RetailMart <- read.csv("data/RetailMart.csv") # load data
head(RetailMart)
## Male Female Home Apt Pregnancy.Test Birth.Control Feminine.Hygiene
## 1 1 0 0 1 1 0 0
## 2 1 0 1 0 1 0 0
## 3 1 0 1 0 1 0 0
## 4 0 0 1 0 0 0 0
## 5 0 1 0 1 0 0 0
## 6 0 1 1 0 0 0 0
## Folic.Acid Prenatal.Vitamins Prenatal.Yoga Body.Pillow Ginger.Ale
## 1 0 1 0 0 0
## 2 0 1 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 1
## 5 0 0 1 0 0
## 6 0 1 0 0 0
## Sea.Bands Stopped.buying.ciggies Cigarettes Smoking.Cessation
## 1 0 0 0 0
## 2 0 0 0 0
## 3 1 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 1 0 0
## Stopped.buying.wine Wine Maternity.Clothes PREGNANT
## 1 0 0 0 1
## 2 0 0 0 1
## 3 0 0 0 1
## 4 0 0 0 1
## 5 1 0 0 1
## 6 0 0 0 1

tail(RetailMart)
## Male Female Home Apt Pregnancy.Test Birth.Control Feminine.Hygiene
## 995 1 0 1 0 0 0 1
## 996 1 0 0 1 0 0 0
## 997 0 1 0 1 0 0 0
## 998 1 0 1 0 0 0 1
## 999 0 0 1 0 0 0 0
## 1000 1 0 0 1 0 0 0
## Folic.Acid Prenatal.Vitamins Prenatal.Yoga Body.Pillow Ginger.Ale
## 995 0 0 0 0 0
## 996 0 0 0 0 0
## 997 0 0 0 0 0
## 998 0 0 0 0 0
## 999 0 0 0 0 0
## 1000 0 0 0 0 1
## Sea.Bands Stopped.buying.ciggies Cigarettes Smoking.Cessation
## 995 1 0 0 0
## 996 0 0 0 0
## 997 0 0 0 0
## 998 0 0 0 0
## 999 0 0 0 0
## 1000 0 0 0 0
## Stopped.buying.wine Wine Maternity.Clothes PREGNANT
## 995 0 0 0 0
## 996 0 0 0 0
## 997 0 0 0 0
## 998 0 0 0 0
## 999 0 0 0 0
## 1000 0 0 1 0

table(RetailMart$PREGNANT)
##
## 0 1
## 500 500

str(RetailMart)
## 'data.frame': 1000 obs. of 20 variables:
## $ Male : int 1 1 1 0 0 0 1 0 0 0 ...
## $ Female : int 0 0 0 0 1 1 0 1 1 1 ...
## $ Home : int 0 1 1 1 0 1 1 1 1 1 ...
## $ Apt : int 1 0 0 0 1 0 0 0 0 0 ...
## $ Pregnancy.Test : int 1 1 1 0 0 0 0 0 0 0 ...
## $ Birth.Control : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Feminine.Hygiene : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Folic.Acid : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Prenatal.Vitamins : int 1 1 0 0 0 1 1 0 0 1 ...
## $ Prenatal.Yoga : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Body.Pillow : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Ginger.Ale : int 0 0 0 1 0 0 0 0 1 0 ...
## $ Sea.Bands : int 0 0 1 0 0 0 0 0 0 0 ...
## $ Stopped.buying.ciggies: int 0 0 0 0 0 1 0 0 0 0 ...
## $ Cigarettes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Smoking.Cessation : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Stopped.buying.wine : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Wine : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Maternity.Clothes : int 0 0 0 0 0 0 0 1 0 1 ...
## $ PREGNANT : int 1 1 1 1 1 1 1 1 1 1 ...
```

The *metadata* for each *feature* are the following:

- Account holder is Male/Female/Unknown by matching surname to census data.
- Account holder address is a home, apartment, or PO box.
- Recently purchased a pregnancy test
- Recently purchased birth control
- Recently purchased feminine hygiene products
- Recently purchased folic acid supplements
- Recently purchased prenatal vitamins
- Recently purchased prenatal yoga DVD
- Recently purchased body pillow
- Recently purchased ginger ale
- Recently purchased Sea-Bands
- Bought cigarettes regularly until recently, then stopped
- Recently purchased cigarettes
- Recently purchased smoking cessation products (gum, patch, etc.)
- Bought wine regularly until recently, then stopped
- Recently purchased wine
- Recently purchased maternity clothing

For building the actual model we use `glm` (for *generalized linear model*):

```r
logreg <- glm(PREGNANT ~ ., data = RetailMart, family = binomial) # logistic regression - glm stands for generalized linear model
summary(logreg)
##
## Call:
## glm(formula = PREGNANT ~ ., family = binomial, data = RetailMart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2012 -0.5566 -0.0246 0.5127 2.8658
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.204470 0.422738 -0.484 0.628613
## Male -0.595820 0.315546 -1.888 0.058997 .
## Female -0.141939 0.307588 -0.461 0.644469
## Home -0.170115 0.334798 -0.508 0.611375
## Apt 0.002813 0.336432 0.008 0.993329
## Pregnancy.Test 2.370554 0.521781 4.543 5.54e-06 ***
## Birth.Control -2.300272 0.365270 -6.297 3.03e-10 ***
## Feminine.Hygiene -2.028558 0.342398 -5.925 3.13e-09 ***
## Folic.Acid 4.077666 0.761888 5.352 8.70e-08 ***
## Prenatal.Vitamins 2.479469 0.369063 6.718 1.84e-11 ***
## Prenatal.Yoga 2.922974 1.146990 2.548 0.010822 *
## Body.Pillow 1.261037 0.860617 1.465 0.142847
## Ginger.Ale 1.938502 0.426733 4.543 5.55e-06 ***
## Sea.Bands 1.107530 0.673435 1.645 0.100053
## Stopped.buying.ciggies 1.302222 0.342347 3.804 0.000142 ***
## Cigarettes -1.443022 0.370120 -3.899 9.67e-05 ***
## Smoking.Cessation 1.790779 0.512610 3.493 0.000477 ***
## Stopped.buying.wine 1.383888 0.305883 4.524 6.06e-06 ***
## Wine -1.565539 0.348910 -4.487 7.23e-06 ***
## Maternity.Clothes 2.078202 0.329432 6.308 2.82e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1386.29 on 999 degrees of freedom
## Residual deviance: 744.11 on 980 degrees of freedom
## AIC: 784.11
##
## Number of Fisher Scoring iterations: 7
```

Concerning interpretability, have a look at the output of the `summary` function above. First, you can see that some features have more stars than others. This has to do with their statistical significance (see also here: From Coin Tosses to p-Hacking: Make Statistics Significant Again!) and hints at whether the respective feature has some real influence on the outcome or is just random noise. We see that e.g. *Pregnancy.Test*, *Birth.Control* and *Folic.Acid*, but also the alcohol- and cigarette-related features, get the maximum of three stars and are therefore considered highly significant for the model.

Another value is the estimate given for each feature, which shows how strongly each feature influences the final prediction and in which direction (because all feature values are either 0 or 1, the estimates are directly comparable). We can e.g. see that buying pregnancy tests and quitting smoking are quite strong predictors of being pregnant (no surprises here). An interesting case is the sex of the account holder: neither is statistically significant and both coefficients point in the same direction. The answer to this seeming paradox is of course that men also buy items for their pregnant girlfriends or wives.

The predictions coming out of the model are probabilities of being pregnant. Now, because a woman is obviously either pregnant or not, and the supermarket has to decide whether to send a coupon, we employ a naive approach which draws the line at 50%:

```r
pred <- ifelse(predict(logreg, RetailMart[ , -ncol(RetailMart)], "response") < 0.5, 0, 1) # naive approach to predict whether pregnant
results <- data.frame(actual = RetailMart$PREGNANT, prediction = pred)
results[460:520, ]
## actual prediction
## 460 1 1
## 461 1 1
## 462 1 1
## 463 1 1
## 464 1 1
## 465 1 1
## 466 1 1
## 467 1 0
## 468 1 1
## 469 1 1
## 470 1 1
## 471 1 1
## 472 1 1
## 473 1 1
## 474 1 1
## 475 1 1
## 476 1 1
## 477 1 1
## 478 1 0
## 479 1 0
## 480 1 1
## 481 1 1
## 482 1 0
## 483 1 0
## 484 1 0
## 485 1 1
## 486 1 1
## 487 1 1
## 488 1 0
## 489 1 1
## 490 1 0
## 491 1 1
## 492 1 0
## 493 1 1
## 494 1 1
## 495 1 1
## 496 1 1
## 497 1 1
## 498 1 0
## 499 1 1
## 500 1 0
## 501 0 1
## 502 0 1
## 503 0 0
## 504 0 0
## 505 0 0
## 506 0 1
## 507 0 0
## 508 0 0
## 509 0 0
## 510 0 0
## 511 0 0
## 512 0 0
## 513 0 0
## 514 0 0
## 515 0 0
## 516 0 0
## 517 0 0
## 518 0 0
## 519 0 0
## 520 0 0
```

As can be seen in the next code section, the *accuracy* (which is all *correct* predictions divided by *all* predictions) is well over 80 percent which is not too bad for a naive out-of-the-box model:

```r
(conf <- table(pred, RetailMart$PREGNANT)) # create confusion matrix
##
## pred 0 1
## 0 450 115
## 1 50 385
sum(diag(conf)) / sum(conf) # calculate accuracy
## [1] 0.835
```

Now, how does logistic regression actually work? One hint lies in the name of the function: generalized linear model. Whereas standard *linear regression* (see e.g. here: Learning Data Science: Modelling Basics) in the 2D case tries to find the best-fitting *line* for all points, logistic regression tries to find the line which best *separates* the two classes (in this case pregnant vs. not pregnant). In the n-dimensional case (i.e. with n features) the line becomes a *hyperplane*, e.g. a plane in the 3D case:
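To make the idea of a separating line concrete, here is a minimal sketch with simulated data (the two classes and all numbers are made up for illustration; this is not the RetailMart data): we fit a logistic regression with `glm` and derive the separating line from its coefficients. It lies where the predicted log-odds are zero:

```r
set.seed(1)

# simulate two classes in 2D (hypothetical data for illustration)
x1 <- c(rnorm(50, mean = 0), rnorm(50, mean = 2))
x2 <- c(rnorm(50, mean = 0), rnorm(50, mean = 2))
y  <- rep(0:1, each = 50)

# fit the logistic regression
fit <- glm(y ~ x1 + x2, family = binomial)

# the separating line lies where the predicted log-odds are zero:
# b0 + b1*x1 + b2*x2 = 0  =>  x2 = -b0/b2 - (b1/b2)*x1
b <- coef(fit)
plot(x1, x2, col = y + 1, pch = 19)
abline(a = -b[1] / b[3], b = -b[2] / b[3], lwd = 2)

# training accuracy with the naive 50% threshold from above
acc <- mean(ifelse(predict(fit, type = "response") < 0.5, 0, 1) == y)
acc
```

The plot shows the two classes and the fitted separating line; points on one side get a predicted probability above 50%, points on the other side below.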

One lesson from all of this is, again, that simple models are often quite good and more interpretable than complicated ones! Another lesson is that even with simple models and enough data, very revealing (and sometimes embarrassing) information can be inferred… you should keep that in mind too!


It was November last year when I seriously started blogging and it is time to share with you some experiences and highlights before the summer break… so read on!

The first thing that really surprised me (and still surprises me) is the popularity of my blog – and I say this without false modesty: when I started out I thought that a few dozen people would visit my blog per day (most of them my students looking for clues for the upcoming exams…). Instead nearly 20,000 people visit my blog every month (each of them viewing about 3 pages per session), sometimes more than two thousand per day! Some of the posts get nearly one thousand *Facebook* likes and are very actively reshared, people have commented hundreds of times, and some posts alone got dozens of comments. All of this still blows me away, what an honour! I cannot thank you enough for your commitment!

The most popular posts plus some I am especially proud of:

- Why R for Data Science – and not Python?
- Learning R: The Ultimate Introduction (incl. Machine Learning!)
- So, what is AI really?
- Understanding the Magic of Neural Networks
- OneR – fascinating insights through simple rules
- Teach R to read handwritten Digits with just 4 Lines of Code
- Causation doesn’t imply Correlation either
- Clustering the Bible

So, what are my motivations for blogging? They are actually manifold:

- *R* really is my passion: I want to push its popularity even further and give back to the community which has given me so much!
- We are living in a complex world and I want to explain things most of us use on a regular basis as simply as possible (e.g. see Google’s Eigenvector… or how a Random Surfer finds the most relevant Webpages, Customers who bought… or Understanding the Maths of Computed Tomography (CT) scans).
- I want to create an online script for my students on some of the topics I teach.
- I am always eager to learn and I am always interested in feedback (btw, also concerning my English because I am not a native speaker!) – so, please use the comment section frequently!
- And finally, I want to create some publicity for my consulting work: I know a thing or two about Data Science, as a manager and as a “nerd”, and can support you on your AI journey (which I successfully did already for many renowned companies). If you are interested or want to know more: Your AI journey… and Happy Holidays!

At the moment I publish one post per week, which is a murderous pace if you want to create high-quality content (and still have a day job). I think that I will have to slow down a bit. However, there are ways you can support me:

- If you have any ideas or suggestions for posts, please let me know in the comments below!
- If you want to contribute to this blog via guest posts (to e.g. promote your books or your own blogs), also please let me know in the comments or contact me here: About.

Concerning the technical side, it was quite a learning curve: as we all know, *WordPress* is pretty sophisticated and writing a blog post is not a problem… yet, as always, the devil is in the detail. It really took some time to get rid of some nasty problems, e.g. showing R code correctly. I found out that it works best for me to only use the text view of my editor to have full control over the *HTML*, *LaTeX* and R code. I don’t want to bore you with all the technical details: if you have some specific questions please leave them in the comments and I will try to answer them as best as I can.

Yet one unresolved issue remains, so perhaps one of you can help me:

SyntaxHighlighter Evolved: PHP Versions >7.0 don’t work properly.

If you have a solution you can either answer directly on *Stack Overflow* or in the comments – Thank you!

**The last thing I want to do in this post is thank you again all very, very much for your great support during the last few months! You are amazing! I am overwhelmed and humbled and I am looking forward to interacting with you guys after the summer break again!**

I’ll be back on the first of October, so stay tuned!

Here you see stuff my kids made for me… **I love you guys!**

A few months ago I published a quite popular post on Clustering the Bible… one well-known *clustering* algorithm is *k-means*. If you want to learn how *k*-means works and how to apply it in a real-world example, read on…

*k*-means (not to be confused with *k-nearest neighbours* or *KNN*: Teach R to read handwritten Digits with just 4 Lines of Code) is a simple, yet often very effective *unsupervised learning* algorithm to find similarities in large amounts of data and cluster them accordingly. The *hyperparameter k* stands for the number of clusters which has to be set beforehand.

The guiding principles are:

- The *distance* between data points within clusters should be as small as possible.
- The distance between the *centroids* (= centres of the clusters) should be as big as possible.

Because there are far too many possible cluster assignments to try them all, *k*-means follows an *iterative* approach:

- Initialization: assign clusters randomly to all data points
- E-step (for expectation): assign each observation to the “nearest” cluster (based on the *Euclidean distance* to the centroids)
- M-step (for maximization): determine new centroids based on the mean of the assigned objects
- Repeat the E- and M-steps until no further changes occur

As can be seen above *k*-means is an example of a so-called *expectation-maximization algorithm*.

To implement *k*-means in R we first assign some variables and define a helper function for plotting the steps:

```r
n <- 3 # no. of centroids
set.seed(1415) # set seed for reproducibility

M1 <- matrix(round(runif(100, 1, 5), 1), ncol = 2)
M2 <- matrix(round(runif(100, 7, 12), 1), ncol = 2)
M3 <- matrix(round(runif(100, 20, 25), 1), ncol = 2)
M <- rbind(M1, M2, M3)

C <- M[1:n, ] # define centroids as first n objects
obs <- length(M) / 2
A <- sample(1:n, obs, replace = TRUE) # assign objects to centroids at random
colors <- seq(10, 200, 25)

clusterplot <- function(M, C, txt) {
  plot(M, main = txt, xlab = "", ylab = "")
  for(i in 1:n) {
    points(C[i, , drop = FALSE], pch = 23, lwd = 3, col = colors[i])
    points(M[A == i, , drop = FALSE], col = colors[i])
  }
}

clusterplot(M, C, "Initialization")
```

Here comes the *k*-means algorithm as described above (the circles are the data points, diamonds are the centroids and the three colours symbolize cluster assignments):

```r
repeat {
  # calculate Euclidean distance between objects and centroids
  D <- matrix(data = NA, nrow = n, ncol = obs)
  for(i in 1:n) {
    for(j in 1:obs) {
      D[i, j] <- sqrt((M[j, 1] - C[i, 1])^2 + (M[j, 2] - C[i, 2])^2)
    }
  }
  O <- A

  ## E-step: parameters are fixed, distributions are optimized
  A <- max.col(t(-D)) # assign objects to centroids
  if(all(O == A)) break # if no change stop
  clusterplot(M, C, "E-step")

  ## M-step: distributions are fixed, parameters are optimized
  # determine new centroids based on mean of assigned objects
  for(i in 1:n) {
    C[i, ] <- apply(M[A == i, , drop = FALSE], 2, mean)
  }
  clusterplot(M, C, "M-step")
}
```

As can be seen, the clusters wander slowly but surely until all three are stable. We now compare the result with the `kmeans` function in Base R:

```r
cl <- kmeans(M, n)
clusterplot(M, cl$centers, "Base R")
```

```r
(custom <- C[order(C[ , 1]), ])
## [,1] [,2]
## [1,] 3.008 2.740
## [2,] 9.518 9.326
## [3,] 22.754 22.396
(base <- cl$centers[order(cl$centers[ , 1]), ])
## [,1] [,2]
## 2 3.008 2.740
## 1 9.518 9.326
## 3 22.754 22.396
round(base - custom, 13)
## [,1] [,2]
## 2 0 0
## 1 0 0
## 3 0 0
```

As you can see, the result is the same!

Now, for some real-world application: clustering wholesale customer data. The data set refers to clients of a wholesale distributor. It includes the annual spending on diverse product categories and is from the renowned *UCI Machine Learning Repository* (I guess the category “Delicassen” should rather be “Delicatessen”).

Have a look at the following code:

```r
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale customers data.csv", header = TRUE)
head(data)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 2 3 12669 9656 7561 214 2674 1338
## 2 2 3 7057 9810 9568 1762 3293 1776
## 3 2 3 6353 8808 7684 2405 3516 7844
## 4 1 3 13265 1196 4221 6404 507 1788
## 5 2 3 22615 5410 7198 3915 1777 5185
## 6 2 3 9413 8259 5126 666 1795 1451

set.seed(123)
k <- kmeans(data[ , -c(1, 2)], centers = 4) # remove columns 1 and 2, create 4 clusters

(centers <- k$centers) # display cluster centers
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 8149.837 18715.857 27756.592 2034.714 12523.020 2282.143
## 2 20598.389 3789.425 5027.274 3993.540 1120.142 1638.398
## 3 48777.375 6607.375 6197.792 9462.792 932.125 4435.333
## 4 5442.969 4120.071 5597.087 2258.157 1989.299 1053.272

round(prop.table(centers, 2) * 100) # percentage of sales per category
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 10 56 62 11 76 24
## 2 25 11 11 22 7 17
## 3 59 20 14 53 6 47
## 4 7 12 13 13 12 11

table(k$cluster) # number of customers per cluster
##
## 1 2 3 4
## 49 113 24 254
```

One interpretation could be the following for the four clusters:

- Big general shops
- Small food shops
- Big food shops
- Small general shops

As you can see, the interpretation of some clusters found by the algorithm can be quite a challenge. If you have a better idea of how to interpret the result please tell me in the comments below!


It can be argued that the most important decisions in life are some variant of an *exploitation-exploration* problem. Shall I stick with my current job or look for a new one? Shall I stay with my partner or seek a new love? Shall I continue reading the book or watch the movie instead? In all of those cases, the question is always whether I should “exploit” the thing I have or whether I should “explore” new things. If you want to learn how to tackle this most basic trade-off read on…

At the core this can be stated as the problem a gambler has who wants to play a *one-armed bandit*: if there are several machines with different winning probabilities (a so-called *multi-armed bandit* problem) the question the gambler faces is: which machine to play? He could “exploit” one machine or “explore” different machines. So what is the best strategy given a limited amount of time… and money?

There are two extreme cases: no exploration, i.e. playing only one randomly chosen bandit, or no exploitation, i.e. playing all bandits randomly – so obviously we need some middle ground between those two extremes. We have to start with one randomly chosen bandit, try different ones after that and compare the results. So in the simplest case, the first variable is the probability rate with which to switch to a random bandit – or to stick with the best bandit found so far.

Let us create an example with five bandits, which return 4 units on average, except the second one, which returns 5 units. So the best strategy would obviously be to choose the second bandit right away and stick with it, but of course, we don’t know the average returns of each bandit so this won’t work. Instead, we need another vector which tallies the results of each bandit so far. This vector has to be updated after each game, so we need an *update function* which gets as arguments the current bandit and the return of the game.

The intelligence of the strategy lies in this update function, so how should we go about it? The big idea behind this strategy is called the *Bellman equation* and in its simplest form it works as follows: the adjustment of the former result vector is the difference between the current result and the former result, weighted by some factor, in this case the inverse of the number of games played on the respective machine. This learning strategy is called *Q-learning* and is a so-called *reinforcement learning* technique.

Have a look at the following example implementation:

set.seed(3141) # for reproducibility

# Q-learning update function
update <- function(i, r) {
  Q[i] <<- Q[i] + 1/(k[i]+1) * (r-Q[i]) # Q-learning function
  k[i] <<- k[i] + 1 # one more game played on i'th bandit
}

# simulate game on one-armed bandit i
ret <- function(i) {
  round(rnorm(1, mean = rets[i]))
}

# choose which bandit to play
which.bandit <- function() {
  p <- runif(1)
  ifelse(p >= epsilon, which.max(Q), sample(1:n, 1))
}

epsilon <- 0.1 # switch in epsilon percent of cases
rets <- c(4, 5, 4, 4, 4) # average returns of bandits
n <- length(rets)
Q <- rep(0, n) # initialize return vector
k <- rep(0, n) # initialize vector for games played on each bandit
N <- 1000 # no. of runs
R <- 0 # sum of returns

for (j in 1:N) {
  i <- which.bandit() # choose bandit
  r <- ret(i) # simulate bandit
  R <- R + r # add return of bandit to overall sum of returns
  update(i, r) # call the Q-learning update function
}

which.max(Q) # which bandit has the highest return?
## [1] 2

Q
## [1] 4.000000 5.040481 4.090909 4.214286 3.611111

k
## [1]  32 914  22  14  18

N * max(rets) # theoretical max. return
## [1] 5000

R
## [1] 4949

R / (N * max(rets)) # percent reached of theoretical max
## [1] 0.9898

So, obviously, the algorithm found a nearly perfect strategy all on its own!

Now, this is the simplest possible application of reinforcement learning. Let us now implement a more sophisticated example: a robot navigating a maze. Whereas the difficulty in the first example was that the feedback was blurred (because the return of each one-armed bandit is only an average return) here we only get definitive feedback after several steps (when the robot reaches its goal). Because this situation is more complicated we need more memory to store the intermediate results. In our multi-armed bandit example, the memory was a vector, here we will need a matrix.

The robot will try to reach the goal in the following maze (i.e. to get out of the maze to reach *F* which can be done via *B* or *E*) and find the best strategy for each room it is placed in:

Have a look at the code (it is based on the *Matlab* code from the same tutorial the picture is from, which is why the variables and functions are named the same way, for consistency):

# find all possible actions
AllActions <- function(state, R) {
  which(R[state, ] >= 0)
}

# choose one action out of all possible actions by chance
PossibleAction <- function(state, R) {
  sample(AllActions(state, R), 1)
}

# Q-learning function
UpdateQ <- function(state, Q, R, gamma, goalstate) {
  action <- PossibleAction(state, R)
  Q[state, action] <- R[state, action] + gamma * max(Q[action, AllActions(action, R)]) # Bellman equation (learning rate implicitly = 1)
  if (action != goalstate) Q <- UpdateQ(action, Q, R, gamma, goalstate)
  Q
}

# recursive function to get the action with the maximum Q value
MaxQ <- function(state, Q, goalstate) {
  action <- which.max(Q[state[length(state)], ])
  if (action != goalstate) action <- c(action, MaxQ(action, Q, goalstate))
  action
}

# representation of the maze
R <- matrix(c(-Inf, -Inf, -Inf, -Inf,    0, -Inf,
              -Inf, -Inf, -Inf,    0, -Inf,  100,
              -Inf, -Inf, -Inf,    0, -Inf, -Inf,
              -Inf,    0,    0, -Inf,    0, -Inf,
                 0, -Inf, -Inf,    0, -Inf,  100,
              -Inf,    0, -Inf, -Inf,    0,  100), ncol = 6, byrow = TRUE)
colnames(R) <- rownames(R) <- LETTERS[1:6]
R
##      A    B    C    D    E    F
## A -Inf -Inf -Inf -Inf    0 -Inf
## B -Inf -Inf -Inf    0 -Inf  100
## C -Inf -Inf -Inf    0 -Inf -Inf
## D -Inf    0    0 -Inf    0 -Inf
## E    0 -Inf -Inf    0 -Inf  100
## F -Inf    0 -Inf -Inf    0  100

Q <- matrix(0, nrow = nrow(R), ncol = ncol(R))
colnames(Q) <- rownames(Q) <- LETTERS[1:6]
Q
##   A B C D E F
## A 0 0 0 0 0 0
## B 0 0 0 0 0 0
## C 0 0 0 0 0 0
## D 0 0 0 0 0 0
## E 0 0 0 0 0 0
## F 0 0 0 0 0 0

gamma <- 0.8 # discount factor
goalstate <- 6
N <- 50000 # no. of episodes

for (episode in 1:N) {
  state <- sample(1:goalstate, 1)
  Q <- UpdateQ(state, Q, R, gamma, goalstate)
}

Q
##      A    B    C    D   E   F
## A -Inf -Inf -Inf -Inf 400   0
## B    0    0    0  320   0 500
## C -Inf -Inf -Inf  320   0   0
## D    0  400  256    0 400   0
## E  320    0    0  320   0 500
## F    0  400    0    0 400 500

Q / max(Q) * 100
##      A    B    C    D  E   F
## A -Inf -Inf -Inf -Inf 80   0
## B    0    0  0.0   64  0 100
## C -Inf -Inf -Inf   64  0   0
## D    0   80 51.2    0 80   0
## E   64    0  0.0   64  0 100
## F    0   80  0.0    0 80 100

# print all learned routes for all rooms
for (i in 1:goalstate) {
  cat(LETTERS[i], LETTERS[MaxQ(i, Q, goalstate)], sep = " -> ")
  cat("\n")
}
## A -> E -> F
## B -> F
## C -> D -> B -> F
## D -> B -> F
## E -> F
## F -> F

So again, the algorithm has found the best route for each room!

Recently the combination of Neural Networks (see also Understanding the Magic of Neural Networks) and Reinforcement Learning has become quite popular. For example *AlphaGo*, the machine from Google that defeated a Go world champion for the first time in history, is based on this powerful combination!

What is the best way for me to find out whether you are rich or poor, when the only thing I know is your address? Looking at your *neighbourhood*! That is the big idea behind the *k-nearest neighbours (or KNN)* algorithm, where *k* stands for the *number of neighbours* to look at. The idea couldn’t be any simpler yet the results are often very impressive indeed – so read on…

Let us take a task that is very hard to code, like identifying handwritten numbers. We will be using the Semeion Handwritten Digit Data Set from the UCI Machine Learning Repository and, as a first step, separate the *training* and *test set* for the upcoming task:

# helper function for plotting images of digits in a nice way + returning the respective number
plot_digit <- function(digit) {
  M <- matrix(as.numeric(digit[1:256]), nrow = 16, ncol = 16, byrow = TRUE)
  image(t(M[nrow(M):1, ]), col = c(0, 1), xaxt = "n", yaxt = "n", useRaster = TRUE)
  digit[257]
}

# load data and choose some digits as examples
semeion <- read.table("data/semeion.data", quote = "\"", comment.char = "") # put in right path here!
digit_data <- semeion[ , 1:256]
which_digit <- apply(semeion[ , 257:266], 1, function(x) which.max(x) - 1)
semeion_new <- cbind(digit_data, which_digit)

# choose training and test set by chance
set.seed(123) # for reproducibility
data <- semeion_new
random <- sample(1:nrow(data), 0.8 * nrow(data)) # 80%: training data, 20%: test data
train <- data[random, ]
test <- data[-random, ]

# plot example digits
old_par <- par(mfrow = c(4, 6), oma = c(5, 4, 0, 0) + 0.1, mar = c(0, 0, 1, 1) + 0.1)
matrix(apply(train[1:24, ], 1, plot_digit), 4, 6, byrow = TRUE)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    3    1    2    5    7    3
## [2,]    1    5    1    6    7    6
## [3,]    6    2    8    5    9    3
## [4,]    5    7    5    7    5    9

par(old_par)

As you can see teaching a computer to read those digits is a task that would take considerable effort and easily hundreds of lines of code. You would have to intelligently identify different regions in the images and find some boundaries to try to identify which number is being shown. You could expect to do a lot of tweaking before you would get acceptable results.

The real magic behind *machine learning* and *artificial intelligence* is that when something is too complicated to code let the machine program itself by just showing it lots of examples (see also my post So, what is AI really?). We will do just that with the nearest neighbour algorithm.

When talking about neighbours it is implied already that we need some kind of *distance metric* to define what constitutes a neighbour. As in real life, the simplest one is the so-called *Euclidean distance* which is just how far different points are apart from each other as the crow flies. The simple formula that is used for this is just the good old *Pythagorean theorem* (in this case in a vectorized way) – you can see what maths at school was good for after all:

dist_eucl <- function(x1, x2) {
  sqrt(sum((x1 - x2) ^ 2)) # Pythagorean theorem!
}

The *k*-nearest neighbours algorithm is pretty straightforward: it just compares the digit which is to be identified with all other digits and chooses the *k* nearest ones. In case the *k* nearest ones don’t all come up with the same answer, the *majority vote* (or mathematically the *mode*) is taken:

mode <- function(NNs) {
  names(sort(-table(NNs[ncol(NNs)])))[1] # mode = majority vote
}

knn <- function(train, test, k = 5) {
  dist_sort <- order(apply(train[-ncol(train)], 1,
                           function(x) dist_eucl(as.numeric(x), x2 = as.numeric(test[-ncol(test)]))))
  mode(train[dist_sort[1:k], ])
}

So, the algorithm itself comprises barely *4* lines of code! Now, let us see how it performs on this complicated task with *k = 9* out of sample (first a few examples are shown and after that we have a look at the overall performance):

# show a few examples
set.seed(123) # for reproducibility
no_examples <- 24
examples <- sample(dim(test)[1], no_examples)
old_par <- par(mfrow = c(4, 6), oma = c(5, 4, 0, 0) + 0.1, mar = c(0, 0, 1, 1) + 0.1)
matrix(apply(test[examples, ], 1, plot_digit), 4, 6, byrow = TRUE)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    4    1    1    5    7    3
## [2,]    0    5    1    4    7    3
## [3,]    6    2    7    4    0    2
## [4,]    5    5    3    6    3    7

par(old_par)

prediction <- integer(no_examples)
for (i in 1:no_examples) {
  prediction[i] <- knn(train, test[examples[i], ], k = 9)
}
print(matrix(prediction, 4, 6, byrow = TRUE), quote = FALSE, right = TRUE)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    4    1    1    5    7    3
## [2,]    0    5    1    4    7    3
## [3,]    6    2    7    4    0    2
## [4,]    5    5    3    6    3    7

# now for the overall accuracy
library(OneR) # just for eval_model function to evaluate the model's accuracy
prediction <- integer(nrow(test))
ptm <- proc.time()
for (i in 1:nrow(test)) {
  prediction[i] <- knn(train, test[i, ], k = 9)
}
proc.time() - ptm
##    user  system elapsed
##   26.74    0.82   27.59

eval_model(prediction, test[ncol(test)], zero.print = ".")
##
## Confusion matrix (absolute):
##           Actual
## Prediction   0   1   2   3   4   5   6   7   8   9 Sum
##        0    34   .   .   .   .   .   1   .   .   .  35
##        1     .  36   1   .   2   .   .   1   .   1  41
##        2     .   .  36   .   .   .   .   .   1   1  38
##        3     .   1   .  32   .   .   .   .   .   2  35
##        4     .   .   .   .  29   .   .   .   .   .  29
##        5     .   .   .   .   .  35   2   .   1   .  38
##        6     .   .   .   .   .   1  23   .   .   .  24
##        7     .   .   .   .   .   .   .  22   .   1  23
##        8     .   .   .   .   .   .   .   .  31   .  31
##        9     .   .   .   .   .   .   .   .   2  23  25
##        Sum  34  37  37  32  31  36  26  23  35  28 319
##
## Confusion matrix (relative):
##           Actual
## Prediction    0    1    2    3    4    5    6    7    8    9  Sum
##        0   0.11    .    .    .    .    .    .    .    .    . 0.11
##        1      . 0.11    .    . 0.01    .    .    .    .    . 0.13
##        2      .    . 0.11    .    .    .    .    .    .    . 0.12
##        3      .    .    . 0.10    .    .    .    .    . 0.01 0.11
##        4      .    .    .    . 0.09    .    .    .    .    . 0.09
##        5      .    .    .    .    . 0.11 0.01    .    .    . 0.12
##        6      .    .    .    .    .    . 0.07    .    .    . 0.08
##        7      .    .    .    .    .    .    . 0.07    .    . 0.07
##        8      .    .    .    .    .    .    .    . 0.10    . 0.10
##        9      .    .    .    .    .    .    .    . 0.01 0.07 0.08
##        Sum 0.11 0.12 0.12 0.10 0.10 0.11 0.08 0.07 0.11 0.09 1.00
##
## Accuracy:
## 0.9436 (301/319)
##
## Error rate:
## 0.0564 (18/319)
##
## Error rate reduction (vs. base rate):
## 0.9362 (p-value < 2.2e-16)

Wow, it achieves an accuracy of nearly *95%* out of the box while some of the digits are really hard to read even for humans! And we haven’t even given it the information that those images are two-dimensional because we coded all the images simply as (one-dimensional) binary numbers.

To get the idea where it failed have a look at the digits that were misclassified:

# show misclassified digits
err <- which(as.integer(prediction) != unlist(test[ncol(test)]))
old_par <- par(mfrow = c(3, 6), oma = c(5, 4, 0, 0) + 0.1, mar = c(0, 0, 1, 1) + 0.1)
matrix(apply(test[err, ], 1, plot_digit), 3, 6, byrow = TRUE)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    6    9    8    9    5
## [2,]    6    6    7    4    8    8
## [3,]    9    9    1    4    8    9

par(old_par)

# show what was predicted
print(matrix(prediction[err], 3, 6, byrow = TRUE), quote = FALSE, right = TRUE)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    5    1    9    3    6
## [2,]    5    0    1    1    9    5
## [3,]    3    7    3    1    2    2

Most of us would have difficulties reading at least some of those digits too, e.g. the third digit in the first row is supposed to be a *9*, yet it could also be a distorted *1* – same with the first digit in the last row: some people would read a *3* (like our little program) or nothing at all really, but it is supposed to be a *9*. So even the mistakes the system makes are understandable.

Sometimes the simplest methods are – perhaps not the best but – very effective indeed, you should keep that in mind!

The ESA provided the world with datasets of the comet, which we will use to create an animated gif in R… so read on!

The only prerequisites we need are the `rgl` package (on CRAN) and an installed version of *ImageMagick*: http://www.imagemagick.org/script/download.php.

The script itself is very short due to the powerful functions of the `rgl` package; the dataset is loaded directly from the internet site of ESA:

library(rgl)

comet <- readOBJ(url("http://sci.esa.int/science-e/www/object/doc.cfm?fobjectid=54726")) # may take some time to load
open3d()
## wgl
##   1
rgl.bg(color = c("black"))
shade3d(comet, col = ("gray"))
movie3d(spin3d(axis = c(0, 0, 1)), duration = 12, dir = getwd())
## Writing 'movie000.png'
## Writing 'movie001.png'
## Writing 'movie002.png'
## Writing 'movie003.png'
[...]
## Writing 'movie118.png'
## Writing 'movie119.png'
## Writing 'movie120.png'
## Loading required namespace: magick

system("cmd.exe", input = paste("cd", getwd(), "&& convert -loop 0 movie.gif philae.gif")) # change GIF animation cycles to inf
Microsoft Windows [Version 10.0.17134.706]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\Users\Holger\[...]>cd C:/Users/Holger/[...] && convert -loop 0 movie.gif philae.gif

C:\Users\Holger\[...]>

The result is quite impressive:

I also used the dataset to create a 3D printout which is now on my desk in the office at my university:

What a symbol of the power of science to have a 3D printout of a cosmic roamer on your desk and an animated image of it on your computer!


You may have misread the title as the old *correlation does not imply causation* mantra, but the opposite is also true! If you don’t believe me, read on…

First I want to provide you with some intuition on what *correlation* is really all about! For many people (and many of my students for sure) the implications of the following formula for the *correlation coefficient* of two variables x and y are not immediately clear:

r = 1/(n-1) * Σ (x_i - x̄)(y_i - ȳ) / (s_x * s_y)

In fact the most interesting part is this: (x_i - x̄)(y_i - ȳ). We see a product of two differences. The differences consist of the data points minus the respective *means* (average values): in effect, this leads to the origin being moved to the means of both variables (as if you moved the crosshair right into the centre of all data points).

There are now four possible *quadrants* for every data point: top or bottom, left or right. Top right means that both differences are positive, so the result will be positive too. The same is true for the bottom left quadrant because minus times minus equals plus (it often boils down to simple school maths)! The other two quadrants give negative results because minus times plus and plus times minus equals minus.

After that, we sum over all products and normalize them by dividing by the respective *standard deviations* (how much the data are spread out), so that we will only get values between -1 and +1.
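To see that this recipe really is all there is to it, here is a minimal sketch (the helper name `corr_manual` is my own, not part of any package) that computes the coefficient directly from the formula and checks it against R's built-in `cor` function:

```r
# correlation computed directly from the formula: sum of the products of the
# differences from the means, normalized by (n - 1) and both standard deviations
corr_manual <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
}

x <- c(21, 46, 55, 35, 28)
y <- c(1850, 2500, 2560, 2230, 1800)
all.equal(corr_manual(x, y), cor(x, y))
## [1] TRUE
```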

Let us see this in action with an example. First we define a helper function for visualizing this intuition:

cor.plot <- function(data) {
  x_mean <- mean(data[ , 1])
  y_mean <- mean(data[ , 2])
  plot(data, type = "n") # plot invisibly
  limits <- par()$usr # limits of plot
  # plot correlation quadrants
  rect(x_mean, y_mean, limits[2], limits[4], col = "lightgreen")
  rect(x_mean, y_mean, limits[1], limits[4], col = "orangered")
  rect(x_mean, y_mean, limits[1], limits[3], col = "lightgreen")
  rect(x_mean, y_mean, limits[2], limits[3], col = "orangered")
  points(data, pch = 16) # plot scatterplot on top
  colnames(data) <- c("x", "y") # rename cols instead of dynamic variable names in lm
  abline(lm(y ~ x, data), lwd = 2) # add regression line
  title(paste("cor =", round(cor(data[1], data[2]), 2))) # add cor as title
}

Now for the actual example (in fact the same example we had in this post: Learning Data Science: Modelling Basics):

age <- c(21, 46, 55, 35, 28)
income <- c(1850, 2500, 2560, 2230, 1800)
data <- data.frame(age, income)
plot(data, pch = 16)

cor.plot(data)

The correlation is very high because most of the data points are in the positive (green) quadrants and the data is close to its *regression line* (linear regression and correlation are closely related mathematically).
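This close mathematical relationship can even be verified numerically: for simple linear regression, the slope of the regression line equals the correlation coefficient times the ratio of the standard deviations (a standard identity, shown here with the same toy data as above):

```r
age <- c(21, 46, 55, 35, 28)
income <- c(1850, 2500, 2560, 2230, 1800)
fit <- lm(income ~ age)
# standard identity: slope = cor(x, y) * sd(y) / sd(x)
all.equal(unname(coef(fit)[2]), cor(age, income) * sd(income) / sd(age))
## [1] TRUE
```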

Now, let us get to the actual topic of this post: *Causation doesn’t imply Correlation either*. What could be “more causal” than a parabolic shot? When you shoot a projectile without air resistance the trajectory will form a perfect *parabola*! This *is* in fact rocket science!

Let us simulate such a shot and calculate the correlation between time and altitude, two variables that are perfectly causally dependent:

t <- c(-30:30)
x <- -t^2
data <- data.frame(t, x)
plot(data, pch = 16)

cor.plot(data)

The correlation is exactly zero, zip, nada! And it is clear why: the data points in the positive and in the negative quadrants cancel each other out completely because of the perfect symmetry!
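We can verify this cancellation numerically: the sum of the quadrant products (the numerator of the correlation coefficient) is exactly zero for this perfectly symmetric data:

```r
t <- c(-30:30)
x <- -t^2
# the products from the positive and negative quadrants cancel out completely
sum((t - mean(t)) * (x - mean(x)))
## [1] 0
```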

This leads us to the following very important insight:

Correlation is a measure of linear dependence (and linear dependence only!).

Even a strong causal relationship can be overlooked by correlation because of its *non-linear nature* (as in this case with the *quadratic* relationship). The following example conveys the same idea in a somewhat more humorous manner – it is the by now infamous *datasaurus*:

library(datasauRus) # on CRAN

dino <- datasaurus_dozen[datasaurus_dozen$dataset == "dino", 2:3]
plot(dino, pch = 16, cex = 2)

cor.plot(dino)

As with the above example, we can clearly see why the correlation is so low, although there is a whole dinosaur hiding in your data…

The lesson is that you should never blindly trust statistical measures on their own; always visualize your data when possible: there might be some real beauties hiding inside your data, waiting to be discovered…


Many of you might have heard of the concept of the “Wisdom of the Crowd”: when many people independently guess some quantity, e.g. the number of marbles in a glass jar, the average of their guesses is often pretty accurate – even though many of the individual guesses are totally off.
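This effect is easy to simulate; the numbers below are made up purely for illustration: 1,000 people guess the number of marbles, each with a large individual error, yet the average lands close to the truth:

```r
set.seed(123) # for reproducibility
true_count <- 500 # assumed number of marbles in the jar
guesses <- round(rnorm(1000, mean = true_count, sd = 150)) # very noisy individual guesses
range(guesses) # individual guesses are all over the place...
mean(guesses)  # ...yet the crowd average is close to the true count
```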

The same principle is at work in so-called *ensemble methods*, like *bagging* and *boosting*. If you want to know more about boosting and how to turn pseudocode of a scientific paper into valid R code read on…

We start from an original paper of one of the authors of the first practical boosting algorithm, i.e. *AdaBoost*: Robert E. Schapire: Explaining AdaBoost. The first sentence of the introduction gives the big idea:

Boosting is an approach to machine learning based on the idea of creating a highly accurate prediction rule by combining many relatively weak and inaccurate rules.

The second page gives the pseudocode of AdaBoost…:

Given: (x_1, y_1), …, (x_m, y_m) where x_i ∈ X, y_i ∈ {−1, +1}.

Initialize: D_1(i) = 1/m for i = 1, …, m.

For t = 1, …, T:

- Train weak learner using distribution D_t.
- Get weak hypothesis h_t: X → {−1, +1}.
- Aim: select h_t with low weighted error: ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i].
- Choose α_t = 1/2 · ln((1 − ε_t) / ε_t).
- Update, for i = 1, …, m:
D_{t+1}(i) = D_t(i) · exp(−α_t · y_i · h_t(x_i)) / Z_t,
where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution).

Output the final hypothesis:

H(x) = sign(Σ_{t=1}^{T} α_t · h_t(x)).

… with some explanation:

[…] we are given labeled training examples (x_1, y_1), …, (x_m, y_m) where the x_i are in some domain X, and the labels y_i ∈ {−1, +1}. On each round t = 1, …, T, a distribution D_t is computed as in the figure over the m training examples, and a given weak learner or weak learning algorithm is applied to find a weak hypothesis h_t: X → {−1, +1}, where the aim of the weak learner is to find a weak hypothesis with low weighted error ε_t relative to D_t. The final or combined hypothesis H computes the sign of a weighted combination of weak hypotheses

F(x) = Σ_{t=1}^{T} α_t · h_t(x).

This is equivalent to saying that H is computed as a weighted majority vote of the weak hypotheses h_t where each is assigned weight α_t. ([…] we use the terms “hypothesis” and “classifier” interchangeably.)

So, AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous ones. But to really understand what is going on, my approach has always been that you haven’t really understood something until you have built it yourself…
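Before translating the pseudocode, the adaptivity can be illustrated with a toy version of the weight update (the numbers are made up for illustration and are not from the paper): four equally weighted examples, of which the weak learner misclassifies one.

```r
D <- rep(1/4, 4) # D_t: four equally weighted examples
agreement <- c(1, 1, -1, 1) # y_i * h_t(x_i); -1 marks the misclassified example
e_t <- sum(D[agreement == -1]) # weighted error: here 0.25
a_t <- 0.5 * log((1 - e_t) / e_t) # weight of this weak learner
D_new <- D * exp(-a_t * agreement) # misclassified example gains weight, the rest lose
D_new <- D_new / sum(D_new) # normalize so that D_t+1 is a distribution
round(D_new, 4)
## [1] 0.1667 0.1667 0.5000 0.1667
```

After the update the misclassified example carries half of the total weight (a well-known property of this reweighting), so the next weak learner has to focus on it – this is exactly the sense in which AdaBoost is adaptive.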

Perhaps you might want to try to translate the pseudocode into R code before reading on… (to increase your motivation I frankly admit that I also had some errors in my first implementation… which provides a good example of how strong the R community is, because I posted it on StackOverflow and got a perfect answer two hours later: What is wrong with my implementation of AdaBoost?)

Anyway, here is my implementation (the data can be found here: http://freakonometrics.free.fr/myocarde.csv):

library(rpart)
library(OneR)

maxdepth <- 1
T <- 100 # number of rounds

# Given: (x_1, y_1),...,(x_m, y_m) where x_i element of X, y_i element of {-1, +1}
myocarde <- read.table("data/myocarde.csv", header = TRUE, sep = ";")
y <- (myocarde[ , "PRONO"] == "SURVIE") * 2 - 1
x <- myocarde[ , 1:7]
m <- nrow(x)
data <- data.frame(x, y)

# Initialize: D_1(i) = 1/m for i = 1,...,m
D <- rep(1/m, m)
H <- replicate(T, list())
a <- vector(mode = "numeric", T)
set.seed(123)

# For t = 1,...,T
for (t in 1:T) {
  # Train weak learner using distribution D_t
  # Get weak hypothesis h_t: X -> {-1, +1}
  H[[t]] <- rpart(y ~ ., data = data, weights = D, maxdepth = maxdepth, method = "class")
  # Aim: select h_t with low weighted error: e_t = Pr_i~D_t[h_t(x_i) != y_i]
  h <- predict(H[[t]], x, type = "class")
  e <- sum((h != y) * D)
  # Choose a_t = 0.5 * log((1-e) / e)
  a[t] <- 0.5 * log((1-e) / e)
  # Update for i = 1,...,m: D_t+1(i) = (D_t(i) * exp(-a_t * y_i * h_t(x_i))) / Z_t
  # where Z_t is a normalization factor (chosen so that D_t+1 will be a distribution)
  D <- D * exp(-a[t] * y * as.numeric(as.character(h)))
  D <- D / sum(D)
}

# Output the final hypothesis: H(x) = sign(sum of a_t * h_t(x) for t = 1 to T)
newdata <- x
H_x <- sapply(H, function(x) as.numeric(as.character(predict(x, newdata = newdata, type = "class"))))
H_x <- t(a * t(H_x))
pred <- sign(rowSums(H_x))

eval_model(pred, y)
##
## Confusion matrix (absolute):
##           Actual
## Prediction -1  1 Sum
##        -1  29  0  29
##        1    0 42  42
##        Sum 29 42  71
##
## Confusion matrix (relative):
##           Actual
## Prediction   -1    1  Sum
##        -1  0.41 0.00 0.41
##        1   0.00 0.59 0.59
##        Sum 0.41 0.59 1.00
##
## Accuracy:
## 1 (71/71)
##
## Error rate:
## 0 (0/71)
##
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)

Let’s compare this with the result from the `JOUSBoost` package (on CRAN):

library(JOUSBoost)
## JOUSBoost 2.1.0

boost <- adaboost(as.matrix(x), y, tree_depth = maxdepth, n_rounds = T)
pred <- predict(boost, x)
eval_model(pred, y)
##
## Confusion matrix (absolute):
##           Actual
## Prediction -1  1 Sum
##        -1  29  0  29
##        1    0 42  42
##        Sum 29 42  71
##
## Confusion matrix (relative):
##           Actual
## Prediction   -1    1  Sum
##        -1  0.41 0.00 0.41
##        1   0.00 0.59 0.59
##        Sum 0.41 0.59 1.00
##
## Accuracy:
## 1 (71/71)
##
## Error rate:
## 0 (0/71)
##
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)

As you can see: zero errors as with my implementation. Two additional remarks are in order:

An accuracy of 100% hints at one of the problems of boosting: it is prone to *overfitting* (see also Learning Data Science: Modelling Basics).

The second problem is the lack of *interpretability*: whereas decision trees are normally well interpretable, ensembles of them are not. This is also known under the name *Accuracy-Interpretability Trade-Off* (another often-used ensemble method is random forests, see also here: Learning Data Science: Predicting Income Brackets).

I hope that this post was helpful for you to understand the widely used boosting methodology better and to see how you can get from pseudocode to valid R code. If you have any questions or feedback please let me know in the comments – Thank you and stay tuned!
