
When Apple first introduced the shuffle function on the iPod, customers were irritated and complained that it was not truly random: some titles appeared to be repeated too often while others seemed to have disappeared completely. What was going on?

To illustrate the point I sometimes show my students the following two pictures and ask them which was generated by randomness and which by a deterministic rule (the `randtoolbox` package used below is on CRAN):

```r
library(randtoolbox)
## Loading required package: rngWELL
## This is randtoolbox. For an overview, type 'help("randtoolbox")'.

n <- 200
set.seed(2345)
x <- runif(n)
y <- runif(n)
oldpar <- par(mar = c(2, 2, 2, 2) + 0.1)
plot(x, y, ylim = c(0, 1), xlim = c(0, 1), xaxs = "i", yaxs = "i",
     axes = FALSE, frame.plot = TRUE, pch = 16, cex = 2.1)
```

```r
s <- sobol(n, 2, scrambling = 3)
plot(s, ylim = c(0, 1), xlim = c(0, 1), xaxs = "i", yaxs = "i",
     axes = FALSE, frame.plot = TRUE, pch = 16, cex = 2.1)
```

Many a student thinks that the first picture was created by some underlying pattern (because its points clump together in some areas while leaving others empty) and that the second one is “more” random. The truth is that, technically, neither is random (both are only *pseudo-random*), but the first resembles “true” randomness more closely while the second is a *low-discrepancy sequence*.

I will come to pseudo-randomness in a moment; for now, note that “true” randomness has a tendency to occur in clusters or clumps (technically called *Poisson clumping*). This is the effect seen (or shall I say heard) in the iPod shuffle function. Apple changed it to a more regular behaviour (in the spirit of the second picture)… which was then perceived to be *more* random (as with my students)!
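A quick numerical way to see the clumping (a sketch using the same seed as in the first plot): divide the unit square into a 10 × 10 grid and count the points per cell. With 200 uniform points each cell should contain 2 points on average, yet typically quite a few cells are empty while others are crowded:

```r
set.seed(2345)            # same seed as in the first plot above
n <- 200
x <- runif(n)
y <- runif(n)
# assign each point to one of 10 x 10 = 100 cells (expected count: 2 per cell)
cell <- ceiling(x * 10) + 10 * (ceiling(y * 10) - 1)
counts <- tabulate(cell, nbins = 100)
table(counts)  # typically several empty cells and a few crowded ones
```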

Now imagine that the first picture represents a map showing, let’s say, leukaemia cases in children. Wouldn’t we want to know whether there is some underlying reason for those clusters?! Now imagine that there is a nuclear power plant near one of the more prominent clusters… just by chance! Oh, dear! Of course, it *could* be the reason for the cancer cases, but just by looking at the map no real conclusions can be drawn! The takeaway message is that randomness often shows more pronounced patterns than purely deterministic sequences.

Another area where people are easily fooled by randomness is the stock market! Have a look at the following chart:

```r
set.seed(3141)
run <- sample(c(-1, 1), 1e5, replace = TRUE)
plot(cumsum(run), type = "l", xaxs = "i", yaxs = "i",
     axes = FALSE, frame.plot = TRUE)
```

```r
par(oldpar)
```

So-called *technical analysts* will clearly see what they call a *Double Top pattern* (basically the letter M in the chart) which they interpret as a bearish (= sell) signal. Now before you sell all of your stocks when you encounter something like this remember that the above chart was created purely by chance (as can be seen in the code)! Yet it seems as if all kinds of bullish and bearish trends can be observed.

Every *quantitative analyst* (or just *quant*) knows that stock charts (in most cases) cannot be distinguished from ones created by the toss of a coin. Yet we are evolutionarily trained to see all kinds of patterns, even when there are none. We see faces in fronts of cars and animals (or other funny things) in clouds… and buy and sell signals in random sequences.

While I will not get into the thorny (and philosophical) issue of what constitutes “true” randomness (perhaps some other time…) one thing is clear: computers are notoriously bad at creating it. Why? Because under the hood computers are purely deterministic animals, working on one command at a time. So they are only able to create something that *looks like* randomness: *pseudo-randomness*. On the positive side, that means that this kind of randomness is reproducible: in R you use the `set.seed()` function to get the same “random” sequence every time.
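A minimal demonstration of this reproducibility:

```r
set.seed(42)
a <- runif(3)   # three "random" numbers

set.seed(42)    # reset the generator to the same state...
b <- runif(3)   # ...and get exactly the same numbers again

identical(a, b)  # TRUE
```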

In the old days of computing (only a few decades ago) whole books of “good” random numbers were published! The following one can still be bought for over 50 bucks as a paperback and has over 600 pages! I guess it is the most unread book ever (even more than James Joyce’s Ulysses).

The following *xkcd* cartoon takes the idea of pseudo-randomness to its absurd extreme (as usual):

About two years ago the renowned medical journal “The Lancet” came out with the rather sensational conclusion that there is no safe level of alcohol consumption, so every little hurts! For example, drinking a bottle of beer per day (half a litre) would increase your risk of developing a serious health problem within one year by a whopping 7%! When I read that I had to calm my nerves by having a drink!

Ok, kidding aside: in this post, you will learn how to lie with statistics by deviously mixing up *relative* and *absolute* changes in risks, so read on!

The meta-study “Risk thresholds for alcohol consumption” adheres to the highest scientific standards, that is not the problem. The problem is how they chose to communicate the associated changes in risks for consuming alcohol.

For example, they tell you that by drinking a bottle of beer a day (half a litre) your risk of developing a serious health problem (like cardiovascular disease, cancer, cirrhosis of the liver, inflammation of the pancreas or diabetes) within one year would increase by 7%, i.e. 63 additional people on top of the 914 (per 100,000) who would get a serious health problem anyway:

```r
63 / 914 * 100 # shock horror: nearly 7% more with health problems when drinking half a litre of beer per day!
## [1] 6.892779
```

So, what does that mean? That about one in fourteen beer drinkers is going to bite the dust (no pun intended) next year? Fortunately not!

The problem is that this is a *relative change in risk*! It does not really help to assess the real danger. Only *absolute changes in risk* can do that!
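Both numbers can be computed side by side (using the figures from the study, per 100,000 people):

```r
base_cases  <- 914   # cases per 100,000 people without alcohol
extra_cases <- 63    # additional cases per 100,000 with half a litre of beer per day

relative_change <- extra_cases / base_cases   # the headline figure: "nearly 7% more!"
absolute_change <- extra_cases / 100000       # the absolute change in risk

100 * relative_change  # 6.892779
100 * absolute_change  # 0.063
```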

To illustrate, we use the `personograph` package (on CRAN) to show what is really going on. Out of 2,000 people, about 18 would develop a serious health issue within one year anyway…

```r
library(personograph)
## Loading required package: grImport
## Loading required package: grid
## Loading required package: XML

n <- 2000
probl_wo_alc <- 18 / n
data <- list(first = probl_wo_alc, second = 1 - probl_wo_alc)
personograph(data, colors = list(first = "black", second = "#efefef"),
             fig.title = "18 of 2000 people with health problems",
             draw.legend = FALSE, n.icons = n, dimensions = c(20, 100),
             plot.width = 0.97)
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x,
## x$y, : font family not found in Windows font database
```

…by consuming about 20 grams of alcohol per day (i.e. about 25 mL) a little more than one person would become seriously sick on top of that:

```r
probl_w_alc <- 1 / n
data_2 <- list(first = probl_wo_alc, second = probl_w_alc,
               third = 1 - (probl_wo_alc + probl_w_alc))
personograph(data_2, colors = list(first = "black", second = "red", third = "#efefef"),
             fig.title = "About 1 additional case with half a litre of beer per day",
             draw.legend = FALSE, n.icons = n, dimensions = c(20, 100),
             plot.width = 0.97)
```

As you can see, this doesn’t look spectacular at all! Yet, this would have been a good way to communicate the results so that everybody could get a feeling for what they really mean (but as I said, this doesn’t look spectacular at all, go figure!).

Doing the numbers also gives an absolute change in risk of only 0.063%! It is about 50% more probable to die in a house fire (and how many people do you personally know who actually died in a house fire? I don’t know anybody…)!

```r
63 / 100000 * 100 # only 0.063% in absolute numbers!
## [1] 0.063
```

Please note: I do not say that it is safe to drink alcohol! But you have to put the numbers in perspective and the risk doesn’t seem to be overly high (to put it mildly) when you drink responsibly!

You see that you can use statistics not only to “lie” but to clarify things and communicate facts transparently. So, the problem lies not so much in statistics but in dishonesty and manipulation per se, which is the idea of one of my favorite cartoons (found here: CrossValidated):

We all know the classical Sci-Fi trope of intelligent machines becoming conscious.

This post will be a little different from the others because it is highly speculative. You may even call me mental afterwards but let us not get ahead of ourselves. The big problem with consciousness is that really nobody has any clue whatsoever what it really is. This is why the philosopher David Chalmers has labeled it “the hard problem”. I start with that because there are some intellectuals who will try to tell you that it is not that great of a problem, that it is only an illusion (oh, really?) or that it will be solved in the near future (the big breakthrough: always near, never here…).

One of the big systematic problems to do research on it is that traditionally science differentiates between the subject and the object. When you investigate something you try to exclude the observer from it and try to be as objective as possible. In the case of consciousness, this has to fail since we can only observe our own, our subjective consciousness. We cannot even know whether anybody else has consciousness like we do… it only seems reasonable to assume that.

The optimists say that, although we don’t really know how consciousness works, it will just be a matter of time till an *artificial neural network (ANN)* will develop it (something along the lines of “more is different”). When one looks at how ANNs work (you can read all about it here: Understanding the Magic of Neural Networks) I would be very skeptical! Basically, at the core ANNs do nothing else but add and multiply stuff! As impressive as it may be when a sophisticated ANN is able to label things (see e.g. Teach R to see by borrowing a Brain) I see no reason whatsoever why it would develop consciousness that way. Now, you could argue that at the core our brains also do nothing else than ANNs do but in my opinion, we are still missing something crucial here because I see no road to consciousness by just adding and multiplying numbers…

In my opinion, consciousness is something ontologically fundamental in our universe. I don’t think that it is some emergent phenomenon but that it is at least as much at the core of things as other fundamental entities of nature, like elementary particles or forces like gravity. This philosophical position is called *idealistic monism*, and one popular manifestation of it, which postulates that consciousness is present to varying degrees everywhere in the cosmos, is called *panpsychism*. One prominent advocate of this position is my colleague Professor Philip Goff from Durham University. The following article of his gives a good overview of panpsychism: Panpsychism is crazy, but it’s also most probably true. He recently also published a popular science book on the matter (no pun intended): Galileo’s Error: Foundations for a New Science of Consciousness.

One of the problems classical panpsychism faces is how all those little conscious entities combine to form e.g. our consciousness. This is called the *combination problem*. This is why some theories of panpsychism go even further by postulating that consciousness may be even more fundamental than e.g. matter and that the basis of all being, including our inner life, is one cosmic consciousness. Philip Goff has also written on this subject, called *cosmopsychism*: Is the Universe a conscious mind?, a more academic version was published by Springer: Did the universe design itself?.

It is parsimonious to suppose that the Universe has a consciousness-involving nature.

Philip Goff

Other researchers even try to make core elements of panpsychism mathematically rigorous: Markus P. Müller, who is Research Group Leader at the Institute for Quantum Optics and Quantum Information, Vienna, Austria, and Visiting Fellow of the Perimeter Institute for Theoretical Physics, Canada, has built a whole theory based on *algorithmic information theory* and *Solomonoff induction* to describe how a “mind” can cause “matter” (and not the other way around!). A popular version of his research can be found here: Mind before matter: reversing the arrow of fundamentality; the technical version here: Law without law: from observer states to physics via algorithmic information theory.

I suggest that it is sometimes a step forward to reverse our intuition on “what is fundamental”, a move that is somewhat reminiscent of the idea of noncommutative geometry. I argue that some foundational conceptual problems in physics and related fields motivate us to attempt such a reversal of perspective, and to take seriously the idea that an information-theoretic notion of observer (“mind”) could in some sense be more fundamental than our intuitive idea of a physical world (“matter”).

Markus P. Müller

As I said at the beginning, all of this may seem mental to you… but in a way, every explanation of consciousness is crazy (Nabokov’s “Only one letter divides the comic from the cosmic” comes to mind) and I see at least no evidence that cosmopsychism is wrong. It is, in my opinion, as respectable as it gets for a theory of consciousness. In a way, the thought that consciousness might be more fundamental than anything else is not so foreign from an epistemological perspective; Descartes formulated this idea hundreds of years ago. The question is, what can we really know for sure? If an *evil demon* were to deceive me, really everything could be an illusion (or, in modern lingo, Fake News!); only one thing is really certain, i.e. absolute: the fact that I think (that is me thinking!) and that therefore I am… or in good ol’ Latin:

Cogito, ergo sum.

RenĂ© Descartes

Seeing it this way it seems almost strange to believe that the things we have no direct access to (i.e. matter and fields) should be the most fundamental entities… and then wonder how consciousness emerges out of that. Who seems mental now?

Now, where does this leave us concerning our question from the beginning? Well, if cosmopsychism were true, simply adding and multiplying numbers – however many operations and however fast – would never lead to consciousness because it is the wrong answer to an ill-formed question. It is a little bit like those *cargo cults* where underdeveloped peoples try to rebuild crude copies of airports and runways and wonder why no planes show up to provide them with desired Western cargo. I think that believing that you could build some consciousness with ANNs is cargo cult-like! We are on a completely wrong track here, or runway if you like.

What do you think about this whole matter? What do you think about panpsychism and cosmopsychism? And what do you think about machines becoming conscious?

Please leave your thoughts in the comments!

**UPDATE May 14, 2020**

An earlier version of this article contained some imprecise characterization of cargo cults. Hat tip to Tom for his helpful comment.

Star Wars is somewhat nerdy, R definitely is… what could be more worthwhile than combining both?

This Sunday was Star Wars Day (May the 4th be with you!) and suitable for the occasion we will do a little fun project and implement the following *xkcd* flowchart, which can give us more than 2 million different Star Wars plots.

Even if you are new to R, the code used should be comprehensible, so read on!

First, I provide you with the surrounding phrases…

```r
phrase_1 <- "In this Star Wars movie, our heroes return to take on the First Order and new villain"
phrase_2 <- "with help from their new friend"
phrase_3 <- "Rey builds a new lightsaber with a"
phrase_4 <- "blade, and they head out to confront the First Order's new superweapon, the"
phrase_5 <- "a space station capable of"
phrase_6 <- "They unexpectedly join forces with their old enemy"
phrase_7 <- "and destroy the superweapon in a battle featuring"
phrase_8 <- "P.S. Rey's parents are"
phrase_9 <- "and"
```

…and the different options:

```r
villain_name <- c("Kyle Ren", "Malloc", "Darth Sebelius", "Theranos", "Lord Juul")
friend_name <- c("Kim Spacemeasurer", "Teen Yoda", "Dab Tweetdeck", "Yaz Progestin", "TI-83")
color <- c("beige", "ochre", "mauve", "aquamarine", "taupe")
superweapon_name <- c("Sun Obliterator", "Moonsquisher", "World Eater", "Planet Zester",
                      "Superconducting Supercollider")
evil_plan <- c("blowing up a planet with a bunch of beams of energy that combine into one",
               "blowing up a bunch of planets with one beam of energy that splits into many",
               "cutting a planet in half and smashing the halves together like two cymbals",
               "increasing the CO2 levels in a planet's atmosphere, causing rapid heating",
               "triggering the end credits before the movie is done")
character_1 <- c("Boba Fett", "Salacious Crumb", "The Space Slug",
                 "the bottom half of Darth Maul", "Youtube commenters")
strange_event <- c("a bow that shoots little lightsaber-headed arrows",
                   "X-Wings and TIE fighters dodging the giant letters of the opening crawl",
                   "a Sith educational display that uses Force Lightning to demonstrate the dielectric breakdown of air",
                   "Kylo Ren putting on another helmet over his smaller one",
                   "a Sith car wash where the bristles on the brushes are little lightsabers")
character_2 <- c("Luke", "Leia", "Han", "Obi-Wan", "a random junk trader")
character_3 <- c("Poe", "BB-8", "Amilyn Holdo", "Laura Dern", "a random junk trader",
                 "that one droid from the Jawa Sandcrawler that says Gonk")
```

A quick calculation reveals that we have 2,343,750 possibilities of different Star Wars stories.
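You can verify this by multiplying the number of options per slot (the counts below mirror the vectors above):

```r
# number of options in each slot of the generator
n_options <- c(villain = 5, friend = 5, color = 5, superweapon = 5, plan = 5,
               old_enemy = 5, battle_event = 5, parent_1 = 5, parent_2 = 6)
prod(n_options)  # 2343750
```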

Now, perhaps with the data readily available, you might want to try to implement the SWSG (Star Wars Spoiler Generator) yourself…

Here is one way to do it: we create a function with `sample` to choose one of the given options, `paste` to concatenate the different parts, and `cat` to output the result:

```r
SWSG <- function() {
  cat(paste(phrase_1, sample(villain_name, 1), phrase_2, sample(friend_name, 1)), "\n")
  cat(paste(phrase_3, sample(color, 1), phrase_4, sample(superweapon_name, 1),
            phrase_5, sample(evil_plan, 1)), "\n")
  cat(paste(phrase_6, sample(character_1, 1), phrase_7, sample(strange_event, 1)), "\n")
  cat(paste(phrase_8, sample(character_2, 1), phrase_9, sample(character_3, 1)))
}
```

Every call of this function generates a new random storyline:

```r
SWSG()
## In this Star Wars movie, our heroes return to take on the First Order and new villain Kyle Ren with help from their new friend Kim Spacemeasurer
## Rey builds a new lightsaber with a ochre blade, and they head out to confront the First Order's new superweapon, the Moonsquisher a space station capable of blowing up a bunch of planets with one beam of energy that splits into many
## They unexpectedly join forces with their old enemy Boba Fett and destroy the superweapon in a battle featuring a bow that shoots little lightsaber-headed arrows
## P.S. Rey's parents are Leia and Amilyn Holdo

SWSG()
## In this Star Wars movie, our heroes return to take on the First Order and new villain Theranos with help from their new friend Teen Yoda
## Rey builds a new lightsaber with a aquamarine blade, and they head out to confront the First Order's new superweapon, the Moonsquisher a space station capable of blowing up a bunch of planets with one beam of energy that splits into many
## They unexpectedly join forces with their old enemy the bottom half of Darth Maul and destroy the superweapon in a battle featuring X-Wings and TIE fighters dodging the giant letters of the opening crawl
## P.S. Rey's parents are Han and a random junk trader

SWSG()
## In this Star Wars movie, our heroes return to take on the First Order and new villain Malloc with help from their new friend Kim Spacemeasurer
## Rey builds a new lightsaber with a aquamarine blade, and they head out to confront the First Order's new superweapon, the Sun Obliterator a space station capable of cutting a planet in half and smashing the halves together like two cymbals
## They unexpectedly join forces with their old enemy Youtube commenters and destroy the superweapon in a battle featuring Kylo Ren putting on another helmet over his smaller one
## P.S. Rey's parents are a random junk trader and Poe
```

That was easy, right? May the FoRce be with you!

In one of my most popular posts, So, what is AI really?, I showed that

In this post, I create the simplest possible *classifier*, called *ZeroR*, to show that even this classifier can achieve surprisingly high values for *accuracy* (i.e. the ratio of correctly predicted instances)… and why this is not necessarily a good thing, so read on!

In the above-mentioned post, I gave an example of a classifier that was able to give you some guidance on whether a certain mushroom is edible or not. The basis for this was rules, which separated the examples based on the given attributes:

```r
## Rules:
## If odor = almond then type = edible
## If odor = anise then type = edible
## If odor = creosote then type = poisonous
## If odor = fishy then type = poisonous
## If odor = foul then type = poisonous
## If odor = musty then type = poisonous
## If odor = none then type = edible
## If odor = pungent then type = poisonous
## If odor = spicy then type = poisonous
```

Obviously, the more rules a classifier has, the more complex it is. In the example above we used the so-called OneR classifier, which bases its decision on one attribute alone. Here, I will give you an even simpler classifier! The ZeroR classifier bases its decision on no attribute whatsoever… zero, zilch, zip, nada! How can this be? Easy: it just takes the majority class of the target attribute! Let me give you an example.
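In base R, the heart of ZeroR can be sketched in a single line (a simplified illustration, not the implementation used below):

```r
# ZeroR in a nutshell: always predict the most frequent class
zero_r <- function(y) names(which.max(table(y)))

y <- factor(c("edible", "edible", "poisonous", "edible"))
zero_r(y)  # "edible"
```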

First, we build a function for the classifier by using the `OneR` package (on CRAN) and some S3-class magic:

```r
library(OneR)

ZeroR <- function(x, ...) {
  output <- OneR(cbind(dummy = TRUE, x[ncol(x)]), ...)
  class(output) <- c("ZeroR", "OneR")
  output
}

predict.ZeroR <- function(object, newdata, ...) {
  class(object) <- "OneR"
  predict(object, cbind(dummy = TRUE, newdata[ncol(newdata)]), ...)
}
```

As an example we take the well-known German Credit Dataset (originally from my old alma mater, the University of Hamburg) and divide it into a training and a test set:

```r
data <- read.table("data/german.data", header = FALSE)
data <- data.frame(data[ , 1:20], creditrisk = factor(data[ , 21]))
table(data$creditrisk)
## 
##   1   2 
## 700 300

set.seed(805)
random <- sample(1:nrow(data), 0.6 * nrow(data))
data_train <- data[random, ]
data_test <- data[-random, ]
```

We see that 700 customers have a good credit risk while 300 have a bad one. The ZeroR classifier now takes the majority class (good credit risk) and uses it as the prediction *every time*! You have read correctly, it just predicts that *every* customer is a good credit risk!

Seems a little crazy, right? Well, it illustrates an important point: many of my students, as well as some of my consulting clients, often ask me what a good classifier is and how long it does take to build one. Many people in the area of data science (even some “experts”) will give you something like the following answer (source: A. Burkov):

Machine learning accuracy rule:

0-80%: one day

80-90%: one week

90-95%: one month

95-97%: three months

97-99%: one year (or never)

Well, to be honest with you: this is not a very good answer. Why? Because it very much depends on… the share of the majority class! To understand that, let us have a look at how the ZeroR classifier performs on our dataset:

```r
model <- ZeroR(data_train)
summary(model)
## 
## Call:
## OneR.data.frame(x = cbind(dummy = TRUE, x[ncol(x)]))
## 
## Rules:
## If dummy = TRUE then creditrisk = 1
## 
## Accuracy:
## 481 of 700 instances classified correctly (68.71%)
## 
## Contingency table:
##           dummy
## creditrisk  TRUE Sum
##        1   * 481 481
##        2     219 219
##        Sum   700 700
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 98.063, df = 1, p-value < 2.2e-16

plot(model)
```

```r
prediction <- predict(model, data_test)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction   1   2 Sum
##        1   219  81 300
##        2     0   0   0
##        Sum 219  81 300
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2  Sum
##        1   0.73 0.27 1.00
##        2   0.00 0.00 0.00
##        Sum 0.73 0.27 1.00
## 
## Accuracy:
## 0.73 (219/300)
## 
## Error rate:
## 0.27 (81/300)
## 
## Error rate reduction (vs. base rate):
## 0 (p-value = 0.5299)
```

So, because 70% of the customers are good risks we get an accuracy of about 70%! You can take this example to extremes: for example, if you have a dataset with credit card transactions where 0.1% of the transactions are fraudulent (which is about the actual number) you will get an accuracy of 99.9% just by using the ZeroR classifier! Concretely, just by saying that *no fraud exists* (!) you get an accuracy even beyond the “one year (or never)” bracket (according to the above scheme)!

Another example even concerns life and death: the probability of dying within one year lies at about 0.8% (averaged over all the people worldwide, according to “The World Factbook” by the CIA). So, by declaring that we are all immortal, we are right in more than 99% of all cases! Many medical studies have a much higher error rate…
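Expressed as a sketch, the accuracy of ZeroR is simply the share of the majority class (the fraud and mortality figures are the ones quoted above):

```r
# ZeroR accuracy = share of the majority class = 1 - share of the minority class
zero_r_accuracy <- function(p_minority) 1 - p_minority

zero_r_accuracy(0.001)  # 0.999: "no fraud exists"
zero_r_accuracy(0.008)  # 0.992: "we are all immortal"
```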

Now, let us try the OneR classifier on our credit dataset:

```r
model <- OneR(optbin(data_train))
summary(model)
## 
## Call:
## OneR.data.frame(x = optbin(data_train))
## 
## Rules:
## If V3 = A30 then creditrisk = 2
## If V3 = A31 then creditrisk = 2
## If V3 = A32 then creditrisk = 1
## If V3 = A33 then creditrisk = 1
## If V3 = A34 then creditrisk = 1
## 
## Accuracy:
## 492 of 700 instances classified correctly (70.29%)
## 
## Contingency table:
##           V3
## creditrisk  A30  A31   A32  A33   A34 Sum
##        1     10   14 * 247 * 37 * 173 481
##        2   * 16 * 19   124   21    39 219
##        Sum   26   33   371   58   212 700
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 39.504, df = 4, p-value = 5.48e-08

plot(model)
```

```r
# Attribute 3: (qualitative)
# Credit history
# A30 : no credits taken/
#       all credits paid back duly
# A31 : all credits at this bank paid back duly
# A32 : existing credits paid back duly till now
# A33 : delay in paying off in the past
# A34 : critical account/
#       other credits existing (not at this bank)

prediction <- predict(model, data_test)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction   1   2 Sum
##        1   207  63 270
##        2    12  18  30
##        Sum 219  81 300
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2  Sum
##        1   0.69 0.21 0.90
##        2   0.04 0.06 0.10
##        Sum 0.73 0.27 1.00
## 
## Accuracy:
## 0.75 (225/300)
## 
## Error rate:
## 0.25 (75/300)
## 
## Error rate reduction (vs. base rate):
## 0.0741 (p-value = 0.2388)
```

Here, we see that we get an out-of-sample accuracy of 75%, two percentage points better than the 73% we got with the ZeroR classifier, here called the *base rate* (an error rate reduction of about 7.4%). Yet, this improvement is not statistically significant (for an introduction to statistical significance see From Coin Tosses to p-Hacking: Make Statistics Significant Again!).
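The error rate reduction can be reproduced by hand from the two error rates above (my reading of the metric: the relative reduction of the model's error rate versus the base rate):

```r
error_base  <- 81 / 300   # ZeroR error rate (0.27)
error_model <- 75 / 300   # OneR error rate (0.25)

1 - error_model / error_base  # ~0.0741, matching the output above
```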

Because the concept of “error rate reduction” compared to ZeroR (= base rate) and its statistical significance is so relevant, it is displayed by default in the `eval_model()` function of the `OneR` package.

To end this post, we build a random forest model with the `randomForest` package (on CRAN) on the dataset (for some more information on random forests see Learning Data Science: Predicting Income Brackets):

```r
set.seed(78)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.

model <- randomForest(creditrisk ~ ., data = data_train, ntree = 2000)
prediction <- predict(model, data_test)
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction   1   2 Sum
##        1   209  43 252
##        2    10  38  48
##        Sum 219  81 300
## 
## Confusion matrix (relative):
##           Actual
## Prediction    1    2  Sum
##        1   0.70 0.14 0.84
##        2   0.03 0.13 0.16
##        Sum 0.73 0.27 1.00
## 
## Accuracy:
## 0.8233 (247/300)
## 
## Error rate:
## 0.1767 (53/300)
## 
## Error rate reduction (vs. base rate):
## 0.3457 (p-value = 9.895e-05)
```

The out-of-sample accuracy is over 80% here and the error rate reduction (compared to ZeroR) of about one third is statistically significant. Yet 80% is still not that impressive when you keep in mind that 70% is the base rate!

You should now be able to spot why this is one of the worst scientific papers I have ever seen: Applications of rule based Classification Techniques for Thoracic Surgery (2015). This also shows a more general problem: although this is a medical topic, not many medical professionals would be able to spot the elephant in the room here… and this will be true for most other areas, too, where machine learning is used ever more frequently. (Just as an aside: this type of blunder wouldn’t have happened had the authors used the `OneR` package: One Rule (OneR) Machine Learning Classification in under One Minute.)

As you can imagine, there are many strategies to deal with the above challenges of *imbalanced/unbalanced data*, e.g. other *model metrics* (like *recall* or *precision*) and other sampling strategies (like *undersampling* the majority class or *oversampling* the minority class)… but those are topics for another post, so stay tuned!
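To give a first taste of those metrics: recall and precision for the minority class (bad credit risk, class 2) can be read off the random forest confusion matrix above (a quick sketch):

```r
# from the random forest confusion matrix (rows: predicted, columns: actual)
tp <- 38  # predicted class 2, actually class 2
fp <- 10  # predicted class 2, actually class 1
fn <- 43  # predicted class 1, actually class 2

recall    <- tp / (tp + fn)  # ~0.47: fewer than half the bad risks are caught
precision <- tp / (tp + fp)  # ~0.79: most "bad risk" predictions are correct
```

Despite the 82% accuracy, the recall shows how much of the minority class the model still misses.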

The global lockdown has slowed down mobility considerably. This can be seen in the data produced by our ubiquitous mobile phones.

Apple is kind enough to make those anonymized and aggregated data available to the public. If you want to learn how to get a handle on those data and analyze trends with R read on!

To download the current data set go to the following website, click on “All Data CSV”: Apple Maps: Mobility Trends Reports and move the file to your data folder.

Apple explains:

The CSV file and charts on this site show a relative volume of directions requests per country/region or city compared to a baseline volume on January 13th, 2020.

We define our day as midnight-to-midnight, Pacific time. Cities represent usage in greater metropolitan areas and are stably defined during this period. In many countries/regions and cities, the relative volume has increased since January 13th, consistent with normal, seasonal usage of Apple Maps. Day of week effects are important to normalize as you use this data.

Data that is sent from users’ devices to the Maps service is associated with random, rotating identifiers so Apple doesn’t have a profile of your movements and searches. Apple Maps has no demographic information about our users, so we can’t make any statements about the representativeness of our usage against the overall population.
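As a quick sketch of what normalizing those day-of-week effects could look like (with toy data standing in for one region's daily index; `stats::filter` computes a centered 7-day moving average):

```r
# toy data with a weekly cycle around the baseline of 100
values <- 100 + 10 * sin(seq_len(28) * 2 * pi / 7)

# centered 7-day moving average smooths out the day-of-week pattern
smoothed <- stats::filter(values, rep(1/7, 7), sides = 2)

range(smoothed, na.rm = TRUE)  # the weekly oscillation flattens out
```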

To get an overview we first load the data into R and print the available regions (data for countries and many cities are available) and transportation types (“driving”, “transit” and “walking”):

```r
mobility <- read.csv("data/applemobilitytrends-2020-04-19.csv") # change path and file name accordingly
levels(mobility$region)
##   [1] "Albania"                  "Amsterdam"
##   [3] "Argentina"                "Athens"
##   [5] "Atlanta"                  "Auckland"
##   [7] "Australia"                "Austria"
##   [9] "Baltimore"                "Bangkok"
##  [11] "Barcelona"                "Belgium"
##  [13] "Berlin"                   "Birmingham - UK"
##  [15] "Bochum - Dortmund"        "Boston"
##  [17] "Brazil"                   "Brisbane"
##  [19] "Brussels"                 "Buenos Aires"
##  [21] "Bulgaria"                 "Cairo"
##  [23] "Calgary"                  "Cambodia"
##  [25] "Canada"                   "Cape Town"
##  [27] "Chicago"                  "Chile"
##  [29] "Cologne"                  "Colombia"
##  [31] "Copenhagen"               "Croatia"
##  [33] "Czech Republic"           "Dallas"
##  [35] "Delhi"                    "Denmark"
##  [37] "Denver"                   "Detroit"
##  [39] "Dubai"                    "Dublin"
##  [41] "Dusseldorf"               "Edmonton"
##  [43] "Egypt"                    "Estonia"
##  [45] "Finland"                  "France"
##  [47] "Frankfurt"                "Fukuoka"
##  [49] "Germany"                  "Greece"
##  [51] "Guadalajara"              "Halifax"
##  [53] "Hamburg"                  "Helsinki"
##  [55] "Hong Kong"                "Houston"
##  [57] "Hsin-chu"                 "Hungary"
##  [59] "Iceland"                  "India"
##  [61] "Indonesia"                "Ireland"
##  [63] "Israel"                   "Istanbul"
##  [65] "Italy"                    "Jakarta"
##  [67] "Japan"                    "Johannesburg"
##  [69] "Kuala Lumpur"             "Latvia"
##  [71] "Leeds"                    "Lille"
##  [73] "Lithuania"                "London"
##  [75] "Los Angeles"              "Luxembourg"
##  [77] "Lyon"                     "Macao"
##  [79] "Madrid"                   "Malaysia"
##  [81] "Manchester"               "Manila"
##  [83] "Melbourne"                "Mexico"
##  [85] "Mexico City"              "Miami"
##  [87] "Milan"                    "Montreal"
##  [89] "Morocco"                  "Moscow"
##  [91] "Mumbai"                   "Munich"
##  [93] "Nagoya"                   "Netherlands"
##  [95] "New York City"            "New Zealand"
##  [97] "Norway"                   "Osaka"
##  [99] "Oslo"                     "Ottawa"
## [101] "Paris"                    "Perth"
## [103] "Philadelphia"             "Philippines"
## [105] "Poland"                   "Portugal"
## [107] "Republic of Korea"        "Rio de Janeiro"
## [109] "Riyadh"                   "Romania"
## [111] "Rome"                     "Rotterdam"
## [113] "Russia"                   "Saint Petersburg"
## [115] "San Francisco - Bay Area" "Santiago"
## [117] "Sao Paulo"                "Saudi Arabia"
## [119] "Seattle"                  "Seoul"
## [121] "Serbia"                   "Singapore"
## [123] "Slovakia"                 "Slovenia"
## [125] "South Africa"             "Spain"
## [127] "Stockholm"                "Stuttgart"
## [129] "Sweden"                   "Switzerland"
## [131] "Sydney"                   "Taichung"
## [133] "Taipei"                   "Taiwan"
## [135] "Tel Aviv"                 "Thailand"
## [137] "Tijuana"                  "Tokyo"
## [139] "Toronto"                  "Toulouse"
## [141] "Turkey"                   "UK"
## [143] "Ukraine"                  "United Arab Emirates"
## [145] "United States"            "Uruguay"
## [147] "Utrecht"                  "Vancouver"
## [149] "Vienna"                   "Vietnam"
## [151] "Washington DC"            "Zurich"

levels(mobility$transportation_type)
## [1] "driving" "transit" "walking"
```

We now create a function `mobi_trends` to return the data in a well-structured format. The default `plot = TRUE` plots the data; `plot = FALSE` returns a named vector with the raw data for further investigation:

mobi_trends <- function(reg = "United States", trans = "driving", plot = TRUE, addsmooth = TRUE) {
  data <- subset(mobility, region == reg & transportation_type == trans)[4:ncol(mobility)]
  dates <- as.Date(sapply(names(data), function(x) substr(x, start = 2, stop = 11)), "%Y.%m.%d")
  values <- as.numeric(data)
  series <- setNames(values, dates)
  if (plot) {
    plot(dates, values, main = paste("Mobility Trends", reg, trans),
         xlab = "", ylab = "", type = "l", col = "blue", lwd = 3)
    if (addsmooth) {
      lines(dates, values, col = "lightblue", lwd = 3)
      lines(supsmu(dates, values), col = "blue", lwd = 2)
    }
    abline(h = 100)
    abline(h = c(0, 20, 40, 60, 80, 120, 140, 160, 180, 200), lty = 3)
    invisible(series)
  } else series
}
mobi_trends()

The drop is quite dramatic… by 60%! Even more dramatic, of course, is the situation in Italy:

mobi_trends(reg = "Italy")

A drop by 80%! The same plot for Frankfurt:

mobi_trends(reg = "Frankfurt")

Obviously, people in Germany have been taking those measures less seriously lately; there seems to be a clear upward trend. This can also be seen in the German “walking” data:

mobi_trends(reg = "Germany", trans = "walking")

What is interesting is that before the lockdown “transit” mobility seems to have accelerated before plunging:

mobi_trends(reg = "Germany", trans = "transit")

You can also plot the raw numbers only, without an added smoother (option `addsmooth = FALSE`):

mobi_trends(reg = "London", trans = "walking", addsmooth = FALSE)

And as I said, you can conduct your own analyses on the formatted vector of the time series (option `plot = FALSE`)…

mobi_trends(reg = "London", trans = "walking", plot = FALSE)
## 2020-01-13 2020-01-14 2020-01-15 2020-01-16 2020-01-17 2020-01-18 
##     100.00     108.89     116.84     118.82     132.18     160.29 
## 2020-01-19 2020-01-20 2020-01-21 2020-01-22 2020-01-23 2020-01-24 
##     105.12     108.02     120.52     124.81     127.01     137.38 
## 2020-01-25 2020-01-26 2020-01-27 2020-01-28 2020-01-29 2020-01-30 
##     162.41      97.16     100.01     113.27     122.75     124.96 
## 2020-01-31 2020-02-01 2020-02-02 2020-02-03 2020-02-04 2020-02-05 
##     144.13     161.17     103.93     105.67     115.03     125.42 
## 2020-02-06 2020-02-07 2020-02-08 2020-02-09 2020-02-10 2020-02-11 
##     128.43     140.65     167.80      76.79     100.51     115.26 
## 2020-02-12 2020-02-13 2020-02-14 2020-02-15 2020-02-16 2020-02-17 
##     125.35     124.69     150.77     149.35      96.03     131.20 
## 2020-02-18 2020-02-19 2020-02-20 2020-02-21 2020-02-22 2020-02-23 
##     131.72     137.59     136.05     153.95     170.22     104.41 
## 2020-02-24 2020-02-25 2020-02-26 2020-02-27 2020-02-28 2020-02-29 
##     104.32     119.88     125.12     123.88     133.76     153.92 
## 2020-03-01 2020-03-02 2020-03-03 2020-03-04 2020-03-05 2020-03-06 
##     109.26     103.64     114.68     114.25     106.50     142.09 
## 2020-03-07 2020-03-08 2020-03-09 2020-03-10 2020-03-11 2020-03-12 
##     167.10      96.86      97.50     105.54     106.91      98.87 
## 2020-03-13 2020-03-14 2020-03-15 2020-03-16 2020-03-17 2020-03-18 
##     104.19     117.44      64.28      64.53      48.95      43.31 
## 2020-03-19 2020-03-20 2020-03-21 2020-03-22 2020-03-23 2020-03-24 
##      38.76      37.49      37.36      30.76      31.25      24.63 
## 2020-03-25 2020-03-26 2020-03-27 2020-03-28 2020-03-29 2020-03-30 
##      24.09      22.89      23.40      23.40      17.83      19.72 
## 2020-03-31 2020-04-01 2020-04-02 2020-04-03 2020-04-04 2020-04-05 
##      22.29      22.19      22.76      24.34      28.49      26.06 
## 2020-04-06 2020-04-07 2020-04-08 2020-04-09 2020-04-10 2020-04-11 
##      21.63      24.64      23.87      26.13      28.59      28.58 
## 2020-04-12 2020-04-13 2020-04-14 2020-04-15 2020-04-16 2020-04-17 
##      22.86      22.80      25.66      27.44      26.40      23.27 
## 2020-04-18 2020-04-19 
##      26.36      30.40

We have only scratched the surface of the many possibilities here; there are many interesting analyses left, like including the data in epidemiological models or simply calculating correlations with new infections/deaths. Please share your findings in the comments below!

One widely used graphical plot to assess the quality of a machine learning classifier or the accuracy of a medical test is the so-called *ROC curve* (*Receiver Operating Characteristic* curve).

Many machine learning classifiers give you some kind of score or probability of the predicted class. One example is logistic regression (see also Learning Data Science: The Supermarket knows you are pregnant before your Dad does). If we had a perfect (binary) classifier, the true (actual) classes (`C`), ordered according to the values of the score of the classifier (`Score`), would show two distinct blocks: the true positive classes (`P`) followed by the true negative classes (`N`):

(perfect <- data.frame(C = c(rep("P", 10), rep("N", 10)), Score = seq(0.95, 0, -1/20)))
##    C Score
## 1  P  0.95
## 2  P  0.90
## 3  P  0.85
## 4  P  0.80
## 5  P  0.75
## 6  P  0.70
## 7  P  0.65
## 8  P  0.60
## 9  P  0.55
## 10 P  0.50
## 11 N  0.45
## 12 N  0.40
## 13 N  0.35
## 14 N  0.30
## 15 N  0.25
## 16 N  0.20
## 17 N  0.15
## 18 N  0.10
## 19 N  0.05
## 20 N  0.00

On the other hand the ordered scores of a completely useless classifier would give some arbitrary distribution of the true classes:

(useless <- data.frame(C = rep(c("P", "N"), 10), Score = seq(0.95, 0, -1/20)))
##    C Score
## 1  P  0.95
## 2  N  0.90
## 3  P  0.85
## 4  N  0.80
## 5  P  0.75
## 6  N  0.70
## 7  P  0.65
## 8  N  0.60
## 9  P  0.55
## 10 N  0.50
## 11 P  0.45
## 12 N  0.40
## 13 P  0.35
## 14 N  0.30
## 15 P  0.25
## 16 N  0.20
## 17 P  0.15
## 18 N  0.10
## 19 P  0.05
## 20 N  0.00

As a third example let us take the ordered scores of a more realistic classifier and see how the true classes are distributed:

(realistic <- data.frame(
  C = c("P", "P", "N", "P", "P", "P", "N", "N", "P", "N",
        "P", "N", "P", "N", "N", "N", "P", "N", "P", "N"),
  Score = c(0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505,
            0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.3, 0.1)))
##    C Score
## 1  P 0.900
## 2  P 0.800
## 3  N 0.700
## 4  P 0.600
## 5  P 0.550
## 6  P 0.540
## 7  N 0.530
## 8  N 0.520
## 9  P 0.510
## 10 N 0.505
## 11 P 0.400
## 12 N 0.390
## 13 P 0.380
## 14 N 0.370
## 15 N 0.360
## 16 N 0.350
## 17 P 0.340
## 18 N 0.330
## 19 P 0.300
## 20 N 0.100

To get some handle on the quality of classifiers (and medical tests) two measures are widely used:

- **Sensitivity** or **True Positive Rate (TPR)**: the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).
- **Specificity** or **True Negative Rate (TNR)**: the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
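Both measures can be computed with a few lines of R. A small numerical sketch (toy labels and scores with a hypothetical threshold of 0.5, none of which appear in this post):

```r
# toy data: true classes and classifier scores
C <- c("P", "P", "N", "P", "N", "N")
Score <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)

pred_pos <- Score >= 0.5 # predict "positive" above the threshold
TPR <- sum(pred_pos & C == "P") / sum(C == "P")  # sensitivity
TNR <- sum(!pred_pos & C == "N") / sum(C == "N") # specificity
c(TPR = TPR, TNR = TNR)
```

Moving the threshold changes both measures in opposite directions, which is exactly the trade-off that ROC curves visualize.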

ROC curves plot both of those measures against each other! More concretely, the curve goes along the ordered scores and draws a line up for each true positive example and a line to the right for each true negative example (for historical reasons not the True Negative Rate (TNR) but the False Positive Rate (FPR) is plotted on the x-axis; because FPR = 1 - TNR the plot would look the same if the x-axis ran from 1 to 0).

Let us first create a small function for plotting the ROC curve and try it with our “perfect” classifier:

roc <- function(x) {
  C <- x$C; Score <- x$Score
  Labels <- C[order(Score, decreasing = TRUE)]
  Labels <- Labels == "P" # convert to logical (works for factors and character vectors alike)
  plot(data.frame(FPR = c(0, cumsum(!Labels) / sum(!Labels)),
                  TPR = c(0, cumsum(Labels) / sum(Labels))),
       type = "o", pch = 18, xlim = c(0, 1), ylim = c(0, 1))
  lines(c(0, 1), c(0, 1), lty = 2)
}
roc(perfect)

Because all of the first examples are true positive examples the line goes all the way up and when it reaches the middle (the threshold) it goes all the way to the right: a perfect classifier.

Let us now plot our completely useless classifier:

roc(useless)

As can be seen, the useless classifier runs along the diagonal line from (0, 0) to (1, 1). The better a classifier, the more its curve looks like a turned-around “L”; the worse, the more it resembles this diagonal line.
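How far a curve rises above the diagonal can also be captured in a single number, the Area Under the Curve (AUC), which is touched upon at the end of this post. A minimal sketch using the rank-based formulation, i.e. the probability that a randomly chosen positive scores higher than a randomly chosen negative (the helper name `auc` is made up, not from this post):

```r
# AUC as the probability that a random positive example outscores
# a random negative one (ties count half)
auc <- function(labels, scores) {
  pos <- scores[labels == "P"]
  neg <- scores[labels == "N"]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

auc(rep(c("P", "N"), each = 10), seq(0.95, 0, -1/20))  # perfect classifier: 1
auc(rep(c("P", "N"), times = 10), seq(0.95, 0, -1/20)) # near-useless: about 0.5
```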

Realistic cases are between those two extremes:

roc(realistic)

To summarize, have a look at the following animation, which builds the last plot step by step:

There would be much more to say about concepts like *Area Under the Curve (AUC)*, finding the *optimal threshold* (“cutoff” point) for the scores and more *performance measures* but this will have to wait for another post, so stay tuned!

One of the biggest problems of the COVID-19 pandemic is that there are no reliable numbers of infections. This fact renders many model projections next to useless.

If you want to learn a simple method to roughly estimate the real number of infections and expected deaths in the US, read on!

As we have seen many times on this blog: simple doesn’t always mean inferior, it only means more comprehensible! The following estimation is based on a simple idea from an article in DER SPIEGEL (H. Dambeck: Was uns die Zahl der Toten verrät).

The general idea goes like this:

- The number of people having died from COVID-19 is much more reliable than the number of infections.
- Our best estimate of the true fatality rate of COVID-19 still is 0.7% of the number of infected persons and
- we know that the time from reporting of an infection to death is about 10 days.
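As a toy calculation (with made-up round numbers, not the data used below): if 700 deaths are reported today, then roughly 700 / 0.007 = 100,000 people must already have been infected ten days earlier:

```r
deaths_today <- 700 # hypothetical number of reported deaths
ifr <- 0.007        # assumed infection fatality rate (0.7%)
reporting_lag <- 10 # days from reported infection to death

deaths_today / ifr  # estimated infections ten days ago (roughly 100,000)
```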

With this knowledge, we can infer the people that got *actually infected 10 days ago* and deduce the *percentage of confirmed vs. actually infected persons*:

# https://en.wikipedia.org/wiki/Template:2019%E2%80%9320_coronavirus_pandemic_data/United_States_medical_cases
new_inf <- c(1, 1, 1, 2, 1, 1, 1, 3, 1, 0, 2, 0, 1, 4, 5, 18, 15, 28, 26, 64,
             77, 101, 144, 148, 291, 269, 393, 565, 662, 676, 872, 1291, 2410,
             3948, 5417, 6271, 8631, 10410, 9939, 12226, 17050, 19046, 20093,
             19118, 20463, 25396, 26732, 28812, 32182, 34068, 25717, 29362)
deaths <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 4, 3, 2, 0, 3, 5,
            2, 5, 5, 6, 4, 8, 7, 6, 14, 21, 26, 52, 55, 68, 110, 111, 162, 225,
            253, 433, 447, 392, 554, 821, 940, 1075, 1186, 1352, 1175, 1214)
data <- data.frame(new_inf, deaths)
n <- length(new_inf)

# shift a series n positions to the left, padding with NA
# (the deaths at day i+10 correspond to the real infections at day i)
shift <- function(x, n = 10) {
  c(x[-(seq(n))], rep(NA, n))
}

data$real_inf <- shift(round(data$deaths / 0.007))
data$perc_real <- round(data$new_inf / data$real_inf, 4)
data
##    new_inf deaths real_inf perc_real
## 1        1      0        0       Inf
## 2        1      0        0       Inf
## 3        1      0        0       Inf
## 4        2      0        0       Inf
## 5        1      0      143    0.0070
## 6        1      0      143    0.0070
## 7        1      0      571    0.0018
## 8        3      0      429    0.0070
## 9        1      0      286    0.0035
## 10       0      0        0       NaN
## 11       2      0      429    0.0047
## 12       0      0      714    0.0000
## 13       1      0      286    0.0035
## 14       4      0      714    0.0056
## 15       5      1      714    0.0070
## 16      18      1      857    0.0210
## 17      15      4      571    0.0263
## 18      28      3     1143    0.0245
## 19      26      2     1000    0.0260
## 20      64      0      857    0.0747
## 21      77      3     2000    0.0385
## 22     101      5     3000    0.0337
## 23     144      2     3714    0.0388
## 24     148      5     7429    0.0199
## 25     291      5     7857    0.0370
## 26     269      6     9714    0.0277
## 27     393      4    15714    0.0250
## 28     565      8    15857    0.0356
## 29     662      7    23143    0.0286
## 30     676      6    32143    0.0210
## 31     872     14    36143    0.0241
## 32    1291     21    61857    0.0209
## 33    2410     26    63857    0.0377
## 34    3948     52    56000    0.0705
## 35    5417     55    79143    0.0684
## 36    6271     68   117286    0.0535
## 37    8631    110   134286    0.0643
## 38   10410    111   153571    0.0678
## 39    9939    162   169429    0.0587
## 40   12226    225   193143    0.0633
## 41   17050    253   167857    0.1016
## 42   19046    433   173429    0.1098
## 43   20093    447       NA        NA
## 44   19118    392       NA        NA
## 45   20463    554       NA        NA
## 46   25396    821       NA        NA
## 47   26732    940       NA        NA
## 48   28812   1075       NA        NA
## 49   32182   1186       NA        NA
## 50   34068   1352       NA        NA
## 51   25717   1175       NA        NA
## 52   29362   1214       NA        NA

We see that only **up to 10% of actual infections are being officially registered** (although fortunately this ratio is growing). Based on this percentage, we can extrapolate the number of *actual infections* from the number of *confirmed infections* and multiply it by the death rate to arrive at the number of *projected deaths for the next 10 days*, i.e. over the Easter weekend:

# how many are actually newly infected?
(real_inf <- round(tail(data$new_inf, 10) / mean(data$perc_real[(n-12):(n-10)])))
##  [1] 219436 208788 223477 277350 291940 314656 351460 372057 280855 320663

# how many will die in the coming 10 days?
round(real_inf * 0.007)
##  [1] 1536 1462 1564 1941 2044 2203 2460 2604 1966 2245

Unfortunately, the numbers do not bode well: this simple projection shows that, with **over 300,000 new infections per day**, there is a realistic possibility to **break the 2,000 deaths-per-day barrier at Easter**.

Remember: this is not based on some fancy model but only on the number of people who probably got infected already! This is why the method cannot project beyond the 10-day horizon, yet it should be more accurate than many a model tossed around at the moment (most of which are based on unreliable data).

We will soon see how all of this pans out… please share your thoughts and your own calculations in the comments below.

I wish you, despite the grim circumstances, a Happy Easter!

…and heed what Jesus would do in times of social distancing!

**UPDATE April 14, 2020**

Unfortunately, my prediction of breaking the 2,000 deaths-per-day barrier came true.

The topic of this post will be fitting with the R function `optim`. Food? That sounds like a rather unlikely match for a post on a blog about quantitative analysis; however, there is a surprising overlap between these disciplinary fields. For example, whether you model the transport of a flavour molecule or the transport of a virus, the mathematical equations and the ways to treat the data are very similar.

This contribution will be split into two parts. In the first part, we pick up the fitting described in a previous blog post (see Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)?). These fits are sometimes difficult to perform. How can we analyse that difficult behaviour, and how can we make further improvements? In the second part, we will see that all these efforts to make the fitting algorithm perform nicely are actually not of much use in the current case. Just because we use a mathematical model, which sounds rigorous, does not mean that our conclusions/predictions are trustworthy.

These two parts will be accompanied by the R script covid.R.

With the outbreak of COVID-19, one thing is certain: never before has a virus gone so viral on the internet. In particular, a lot of data about the spread of the virus is going around, much of it in the form of fancy coronavirus trackers that look like weather forecasts or overviews of sports results. Many people have started to try predicting the evolution of the epidemiological curve, and along with it the reproduction number R0, but can this be done with this type of data?

In this blog-post, we describe the fitting of the data with the SIR model and explain the tricky parts of the fitting methodology and how we can mitigate some of the problems that we encounter.

The general problem is that the fitting algorithm does not always find its way to the best solution. Below is a graph that shows an out-of-the-box fit of the data with `optim` (it's the one from the previous blog post Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)?). Next to it, we show a result that is more optimal. Why did we not find this result directly with `optim`?

There are two main reasons why the model is not converging well.

The first reason is that the `optim` algorithm (which updates the model parameters starting from an initial guess, moving towards the optimal solution) stops too early, before it has found the right solution.

How does `optim` find a solution? The gradient methods used by `optim` find the optimum by repeatedly improving the current estimate, finding a new solution with a lower residual sum of squares (RSS) each time. Gradient methods do this by computing, for a small change of the parameters, in which direction the RSS changes the fastest, and then, in the case of the BFGS method used here, computing (via a line search) where in that direction the lowest value of the RSS lies. This is repeated until no further improvement can be made, or until the improvement is below some desired/sufficient minimal level.

In the two images below we see how the algorithm solves the fit stepwise, for a SIR model that uses the parameters K and R0 (these parameters were explained in the previous blog post and are repeated below). The images are contour plots (lines) and surface plots (colours) of the RSS as a function of the model parameters. The minimum, where the algorithm should eventually end, is clearly visible.

In these images we see effects that make it difficult for the algorithm to approach the optimum quickly in a few steps; it may even get blocked before reaching that point (it may also end up in a local optimum, which is a slightly different case, although we have one of those here as well).

**Computation of the gradient** If the function that we pass to the optimizer does not provide an expression for the gradient (which is needed to find the direction of movement), then `optim` computes it numerically by evaluating the function at nearby points.

How nearby do these points need to be? `optim` uses the scale of the parameters for this. This scale does not always work out of the box, and when it is too large the algorithm does not compute the gradient accurately.

In the image below the path taken by the algorithm is shown by red and black arrows. The red arrows show the path when we do not fine-tune the optimization; the black arrows show the path when we reduce the scale of the parameters manually. This is done with the `control` parameter. In the code of the file covid.R you see this in the call:

OptTemp <- optim(new, RSS2, method = "L-BFGS-B",
                 lower = c(0, 1.00001), hessian = TRUE,
                 control = list(parscale = c(10^-4, 10^-4), factr = 1))

By using `parscale = c(10^-4, 10^-4)` we let the algorithm compute the gradient at a smaller scale (we could also use the `ndeps` parameter for this). In addition, we used `factr = 1`, a factor that determines when the algorithm stops (in this case when the improvement is less than one times the machine precision).

So by changing the parameter `parscale` we can often push the algorithm to get closer to the optimal solution.
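To see what these two control parameters do, here is a self-contained toy example (an ill-scaled quadratic with made-up numbers, not the SIR fit from covid.R):

```r
# a toy objective whose two parameters live on very different scales
f <- function(p) (p[1] - 2)^2 + 10^6 * (p[2] - 3 * 10^-3)^2

# parscale tells optim the typical size of each parameter;
# factr = 1 makes the stopping criterion (almost) as strict as possible
fit <- optim(c(0, 0), f, method = "L-BFGS-B",
             control = list(parscale = c(1, 10^-3), factr = 1))
fit$par # close to c(2, 0.003)
```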

**A zigzag path towards the optimum** may occur when the surface plot of the RSS has a long, stretched valley shape. The algorithm then computes a path towards the optimum like a snowboarder in a half-pipe: lots of movement along the direction of the curvature of the half-pipe, and much less movement downhill towards the bottom.

In the case above the starting point of the algorithm had been chosen on purpose for the illustration. But we do not always make such a good initial guess. In the image below we see what happens with a less favourable starting condition (note that the image should be stretched out along the y-axis because of the different ranges of K and R0, in which case the change of the RSS is much faster/stronger in the left-right direction than in the up-down direction).

The red curve, which shows the result of the algorithm without the fine-tuning, stops after one step, where it hits the bottom of the curvature of the valley/half-pipe, and does not accurately find out that there is still a long gradient in the other direction. We can improve the situation by changing the `parscale` parameter, in which case the algorithm will determine the slope more precisely and continue its path (see the black arrows). But in the direction of the y-axis it only takes small steps, so it will converge only slowly to the optimal solution.

We can often improve this situation by changing the relative scale of the parameters; however, in this particular case it is not easy, because of the L-shape of the “valley” (see the image above). We could change the relative scales of the parameters to improve convergence in the beginning, but then the convergence at the end becomes more difficult.

The second reason for the bad convergence behaviour of the algorithm is that the problem is ill-conditioned. That means that a small change of the data will have a large influence on the outcome of the parameter estimates.

In that case, the data is not very useful to differentiate between different parameters of the model. A large range of variation in the parameters can more or less explain the same data.

An example of this is in the image below, where we see that for different values of R0 we can still fit the data without much difference in the residual sum of squares (RSS). Each time we get more or less the same value for K, and the shape of the curve does not depend much on the value of R0.

This value for K relates to the initial growth rate. Let's look at the differential equations to see why variations in R0 have so little effect on the beginning of the curve. In terms of the parameters K and R0 (with beta = K * R0/(R0 - 1) and gamma = K/(R0 - 1)) the equations are now:

dS/dt = -K * R0/(R0 - 1) * S/N * I

dI/dt = K/(R0 - 1) * (R0 * S/N - 1) * I

dR/dt = K/(R0 - 1) * I

Here we see that, when S is approximately equal to N (which is the case in the beginning), we get approximately dI/dt = K * I, and the beginning of the curve will be approximately exponential.

Thus, for a large range of values of R0, the beginning of the epidemiological curve will resemble exponential growth that is independent of the value of R0. In the opposite direction: when we observe exponential growth (initially), we cannot use this observation to derive a value for R0.

With these ill-conditioned problems, it is often difficult to get the algorithm to converge to the minimum. This is because changes in some parameter (in our case R0) result in only a small improvement of the RSS, and a large range of the parameters has more or less the same RSS.
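The effect can be mimicked with a toy model (made-up numbers, not the SIR model itself): when, early on, only the difference of two parameters enters the model, very different parameter pairs produce essentially the same RSS:

```r
# early exponential growth only pins down the growth rate K = beta - gamma,
# not beta and gamma separately
t <- 0:10
y <- exp(0.3 * t) # "data" growing with rate 0.3
rss <- function(p) sum((exp((p[1] - p[2]) * t) - y)^2)

rss(c(0.5, 0.2)) # beta = 0.5, gamma = 0.2: RSS essentially 0
rss(c(1.0, 0.7)) # a very different pair, same difference: RSS essentially 0
```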

So if small variations in the data occur, due to measurement errors, how much impact will this have on the estimates of the parameters? Here we show the results for two different ways to determine this. In the file covid.R the execution of the methods is explained in more detail.

**Using an estimate of the Fisher information.** We can determine an estimate for the (lower bound of the) variance of the parameters by considering the Cramér-Rao bound, which states that the variance of (unbiased) parameter estimates is equal to or larger than the inverse of the Fisher information matrix. The Fisher information is a matrix with the second-order partial derivatives of the log-likelihood function, evaluated at the true parameter values.

The log-likelihood function is this thing (for normally distributed errors):

log L = -(n/2) * log(2 * pi * sigma^2) - 1/(2 * sigma^2) * sum_i (y_i - f_i(K, R0))^2

We do not know this log-likelihood function and its dependence on the parameters K and R0, because we do not have the true parameter values, and we also do not know the variance of the random error of the data points (the sigma term in the likelihood function). But we can estimate it based on the Hessian, a matrix with the second-order partial derivatives of our objective function, evaluated at our final estimate.

#####################
##
## computing variance with Hessian
##
#####################

### The output of optim will store values for RSS and the hessian
mod <- optim(c(0.3, 1.04), RSS2, method = "L-BFGS-B", hessian = TRUE,
             control = list(parscale = c(10^-4, 10^-4), factr = 1))

# unbiased estimate of the standard deviation:
# we divide by n - p,
# where n is the number of data points
# and p is the number of estimated parameters
sigma_estimate <- sqrt(mod$value / (length(Infected) - 2))

# compute the inverse of the hessian
# (the hessian = the second-order partial derivatives of the objective
# function, in our case the RSS, multiplied by 1/(2 * sigma^2))
covpar <- solve(1 / (2 * sigma_estimate^2) * mod$hessian)
covpar
#               [,1]          [,2]
# [1,]  1.236666e-05 -2.349611e-07
# [2,] -2.349611e-07  9.175736e-09

## the standard deviation of R0 is then approximately
covpar[2,2]^0.5
# [1] 9.579006e-05

**Using a Monte Carlo estimation.** A formula to compute exactly the propagation of errors/variance in the data to the errors/variance in the estimates of the parameters is often very complex. The Hessian only gives us a lower bound (personally I find it more useful for spotting strong correlations between parameters), and it is not so easy to implement. There is, however, a very blunt but effective way to get an idea of the propagation of errors: performing a random simulation.
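The idea in its simplest form, sketched with a toy linear model (all numbers made up; the real simulation for the SIR fit is in covid.R):

```r
set.seed(101)
x <- 1:50
true_a <- 2; true_b <- 0.5; sigma <- 1 # assumed "true" model and noise level

# simulate new data many times and refit the model each time
estimates <- replicate(1000, {
  y <- true_a + true_b * x + rnorm(length(x), 0, sigma)
  coef(lm(y ~ x))
})

# the spread of the refitted parameters estimates the propagated error
apply(estimates, 1, sd)
```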

The full details of this method are explained in the covid.R file. Here we will show just the results of the simulation:

In this simulation, we repeatedly simulated new data based on a true model, with the variance of the data points corresponding to the observed RSS of our fit. We also show in the right graph how the parameters beta and gamma are distributed for the same simulation. The parameters beta and gamma are strongly correlated. This results in them having a large marginal/individual error, while the values K and R0 have much less relative variation (this is why we changed the fitting parameters from beta and gamma to K and R0).

Now, we are almost at the end of this post, and we make a new attempt to fit the epidemiological curve, this time based on newer data.

What we do this time is make some small adaptations:

- The data is the total number of people that have gotten sick. This is different from the I (infectious) and R (recovered) output of the model. We therefore compare the modelled I + R with the data (the total number that have gone sick).
- In this comparison, we use a scaling factor, because the reported number of infected/infectious people is an underestimation of the true value, and this latter value is what the model computes. We use two scaling factors: one for before and one for after February 12 (because at that time the definition for reporting cases was changed).
- We make the population size a fitting variable. This corrects for the two assumptions that we have homogeneous mixing among the entire population of China and that the entire population is susceptible. In addition, we make the number of infected people at the start a fitting variable. In this model we fit I + R; there is data for a separate R, but it is not an accurate variable (because the recovery and the infectious phase are not easy to define/measure/determine).

Because the computation of all these parameters in a single `optim` call is too difficult, we solve the parameters separately in a nested way. In the innermost loop, we solve the scaling parameters (which can be done with a simple linear model); in the middle loop, we solve K and R0 with the `optim` function; in the outer loop, we do a brute-force search for the optimal starting conditions.

To obtain a starting condition we use a result from *Harko, Lobo and Mak 2014* (Exact analytical solutions of the Susceptible-Infected-Recovered (SIR) epidemic model and of the SIR model with equal death and birth rates), who derived expressions for S, I and R in terms of a single differential equation, which we express in slightly different terms.

We can solve this equation as a linear equation, which gives us a good starting condition (a small sidenote: using some form of the differential equation is a general way of getting starting conditions, but the derivatives might be noisy; in that case, one could integrate the expression).
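As a generic illustration of this trick (a toy example, not the Harko-based equation from covid.R): an initial growth rate can be obtained by solving a linearized version of the model, here a simple linear fit on the log scale:

```r
# toy data following exact exponential growth with rate 0.25
t <- 0:9
infected <- 5 * exp(0.25 * t)

# log(infected) = log(5) + 0.25 * t is linear in t,
# so lm() recovers the growth rate as a starting value
fit <- lm(log(infected) ~ t)
unname(coef(fit)[2]) # close to 0.25
```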

The further details of the computation can be found in the covid.R script. Below you see the result of the brute-force search in the outer loop and, next to it, a fitted curve for the resulting parameters.

In this new fit, we again get a low reproduction number R0. One potential reason for this is that, due to the measures that have been taken, the Chinese have been able to reduce the rate of the spread of the virus. The model is unaware of this and interprets the slowdown as a reduction due to immunity (a decrease of susceptible people). However, only a very small fraction of the people have gained immunity (only a small percentage of the population got sick). For the virus to stop spreading at such a low fraction of sick people, the R0 must be very low.

Thus, an estimation of the parameters, based on this type of data, is difficult. When we see a decrease in the growth rate, one or more of the following four effects play a role: (1) The number of susceptible people has decreased sufficiently to overcome the reproduction rate R0. This relative decrease in susceptible people happens faster when the total number of people is smaller. (2) Something has changed about the conditions; the reproduction rate is not constant in time. For instance, with respiratory infections it is common that transfer rates depend on the weather and are higher during winter. (3) The measures that are being taken against the disease are taking effect. (4) The model is too simple, with several assumptions that overestimate the effect of the initial growth rate. This growth rate is very high: we observe a doubling every three days. This means that the time between generations is very short, something that is not believed to be true. It is likely that the increase in numbers is partially due to a variable time delay in the occurrence of the symptoms, as well as sampling bias.

For statisticians, it is difficult to estimate what causes the changes in the epidemic curves. We would need more *detailed* information in order to fill in the gaps, which do not seem to go away by just having more data (and this coronavirus creates a lot of data, possibly too much). But as human beings under threat of a nasty disease, we can at least consider ourselves lucky that there are many ways for the disease to fade away. And we can be lucky that we see a seemingly/effective reproduction rate that is very low, and that only a fraction of the population is susceptible.

So now we have done all this nice mathematics, and we can accurately draw a modelled curve through all our data points. But is this useful when we model the wrong data with the wrong model? The difference between statistics and mathematics is that statisticians need to look beyond the computations.

- We need to consider what the data actually represents, how it is sampled, whether there are biases and how strongly they will influence our analysis. Ideally, we should do this *before* we start throwing computations at the data. Otherwise such computations are at most exploratory analysis, and they should not start to live a life of their own detached from the data.
- And we need to consider how good a representation our models are. We can make expressions based on the variance in the data, but the error is also determined by the bias in our models.

At the present time, COVID-19 is making an enormous impact on our lives, with an unclear effect on the future (we do not even know when the measures will stop: end of April, end of May, maybe even June?). Only time will tell what the economic aftermath of this coronavirus will be, and how great its impact on our health and quality of life. But one thing we can assure ourselves of is that the ominous view of unlimited exponential growth (currently going around on social media) is not data-driven.

In this post, I have explained some of the mathematics of fitting. However, I would like to warn against the blunt use of these mathematical formulas. Just because we use a mathematical model does not mean that our conclusions/predictions are trustworthy. We need to challenge the premises, which are the underlying data and models. So in a next post, “Contagiousness of COVID-19 Part 2: Why the Result of Part 1 is Useless”, I hope to explain what sort of considerations about the data and the models one should take into account, and to make some connections with other cases where statistics went in the wrong direction.

Correlation and its associated challenges don’t lose their fascination: most people know that correlation does not imply causation.

If you want to learn about a paradoxical effect nearly nobody is aware of, where correlation between two uncorrelated random variables is introduced just by sampling, read on!

Let us just get into an example (inspired by When Correlation Is Not Causation, But Something Much More Screwy): for all intents and purposes, let us assume that hotness (i.e. physical attractiveness) and IQ are normally distributed and uncorrelated:

    set.seed(1147)
    hotness <- rnorm(1000, 100, 15)
    IQ <- rnorm(1000, 100, 15)
    pop <- data.frame(hotness, IQ)
    plot(hotness ~ IQ, main = "The general population")
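As a sanity check (the seed and data generation are repeated here so the snippet runs on its own), the correlation in the full population is indeed indistinguishable from zero:

```r
set.seed(1147)
pop <- data.frame(hotness = rnorm(1000, 100, 15), IQ = rnorm(1000, 100, 15))
cor.test(pop$hotness, pop$IQ)  # correlation close to 0
```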

Now, we can ask ourselves: why does somebody become famous? One plausible assumption (besides luck, see also: The Rich didn’t earn their Wealth, they just got Lucky) would be that this person has some combination of attributes. To stick with our example, let us assume some combination of hotness and intelligence, and let us sample some “celebrities” on the basis of this combination:

    pop$comb <- pop$hotness + pop$IQ # some combination of hotness and IQ
    celebs <- pop[pop$comb > 235, ] # sample celebs on the basis of this combination
    plot(celebs$hotness ~ celebs$IQ, xlab = "IQ", ylab = "hotness", main = "Celebrities")
    abline(lm(celebs$hotness ~ celebs$IQ), col = "red")

Wow, a clear negative relationship between hotness and IQ! Even a highly significant one (to understand significance, see also: From Coin Tosses to p-Hacking: Make Statistics Significant Again!):

    cor.test(celebs$hotness, celebs$IQ) # highly significant
    ## 
    ##  Pearson's product-moment correlation
    ## 
    ## data:  celebs$hotness and celebs$IQ
    ## t = -14.161, df = 46, p-value < 2.2e-16
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.9440972 -0.8306163
    ## sample estimates:
    ##       cor 
    ## -0.901897

How can this be? Well, the basis (the combination of hotness and IQ) on which we sample from our (uncorrelated) population is what is called a *collider* (variable) in statistics. Whereas a *confounder* (variable) influences (at least) two variables (A ← C → B), a collider is the opposite: it is influenced by (at least) two variables (A → C ← B).

In our simple case, it is the sum of our two independent variables. The result is a spurious correlation introduced by a special form of *selection bias*, namely *endogenous selection bias*. The same effect also goes under the name *Berkson’s paradox*, *Berkson’s fallacy*, *selection-distortion effect*, *conditioning on a collider (variable)*, *collider stratification bias*, or just *collider bias*.
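The mechanism can be distilled into a minimal simulation (a sketch with made-up numbers, independent of the example above): two independent standard normal variables become clearly negatively correlated the moment we condition on their sum, i.e. on the collider.

```r
set.seed(42)
A <- rnorm(1e5)
B <- rnorm(1e5)
C <- A + B               # the collider: influenced by both A and B
cor(A, B)                # approximately 0: independent by construction
cor(A[C > 2], B[C > 2])  # strongly negative once we condition on C
```

Intuitively: among observations with a large sum, a low value of A must be compensated by a high value of B, and vice versa, which is exactly what happened with our celebrities.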

To understand this effect intuitively we are going to combine the two plots from above:

    plot(hotness ~ IQ, main = "The general population & Celebrities")
    points(celebs$hotness ~ celebs$IQ, col = "red")
    abline(a = 235, b = -1, col = "blue")

In reality, things are often not so simple. When you google the above search terms you will find all kinds of examples, e.g. the so-called *obesity paradox* (an apparent preventive effect of obesity on mortality in individuals with cardiovascular disease (CVD)), a supposed health-protective effect of neuroticism or biased deep learning predictions of lung cancer.

As a takeaway: if a statistical result implies a relationship that seems too strange to be true, it possibly is! To check whether collider bias might be present, check whether sampling was conducted on the basis of a variable that is influenced by the variables that seem to be correlated. Otherwise, you might falsely conclude that beautiful people are generally stupid and intelligent people ugly…
