The topic of this post will be fitting with the R package `optim`. Food? That sounds like a rather unlikely match for writing a post on a blog about quantitative analysis; however, there is a surprising overlap between these disciplinary fields. For example, whether you model the transport of a flavour molecule or the transport of a virus, the type of mathematical equations and the ways to treat the data are very similar.

This contribution will be split into two parts. In the first part, we pick up on the fitting described in a previous blog post (see Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)?). These fits are sometimes difficult to perform. How can we analyse that difficult behaviour, and how can we make further improvements? In the second part, we will see that all these efforts to build a well-performing fitting algorithm are actually not very useful for the current case. Just because we use a mathematical model, which sounds rigorous, does not mean that our conclusions/predictions are trustworthy.

These two parts will be accompanied by the R-script covid.r.

With the outbreak of COVID-19, one thing is certain: never before has a virus gone so viral on the internet. In particular, a lot of data about the spread of the virus is going around. A large amount of data is available in the form of fancy coronavirus trackers that look like weather forecasts or overviews of sports results. Many people have started to try predicting the evolution of the epidemiological curve and, along with that, the reproduction number R0, but can this be done with this type of data?

In this blog-post, we describe the fitting of the data with the SIR model and explain the tricky parts of the fitting methodology and how we can mitigate some of the problems that we encounter.

The general problem is that the fitting algorithm does not always find its way to the best solution. Below is a graph that shows an out-of-the-box fit of the data with the `optim` package (it is the one from the previous blog post Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)?). Next to it, we show a result that is more optimal. Why did we not find this result directly with the `optim` package?

There are two main reasons why the model is not converging well.

The first reason is that the `optim` algorithm (which updates the model parameters, starting from an initial guess and moving towards the optimal solution) stops too early, before it has found the right solution.

How does the `optim` package find a solution? The gradient methods used by the `optim` package find the optimum estimate by repeatedly improving the current estimate, finding a new solution with a lower residual sum of squares (RSS) each time. Gradient methods do this by computing in which direction the RSS changes the fastest for a small change of the parameters and then, in the case of the BFGS method used by the `optim` package, computing (via a line search method) where in that direction the RSS is lowest. This is repeated until no further improvement can be made, or until the improvement is below some desired/sufficient minimal level.
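As a minimal illustration of this repeat-until-no-improvement loop (a toy example of my own, not the covid fit itself), base R's `optim` with BFGS minimizes an RSS in exactly this way:

```r
# Toy illustration: fit a straight line by minimizing the residual sum of
# squares with the BFGS gradient method.
set.seed(1)
x <- 1:10
y <- 2 + 3 * x + rnorm(10, sd = 0.1)  # true intercept 2, true slope 3

rss <- function(p) sum((y - (p[1] + p[2] * x))^2)
fit <- optim(c(0, 0), rss, method = "BFGS")
fit$par          # close to the true intercept 2 and slope 3
fit$convergence  # 0 means the stopping criterion was reached
```

Each iteration computes a descent direction from the (numerically approximated) gradient and then line-searches along it, exactly the scheme described above.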

In the two images below we see how the algorithm solves the fit step by step, for a SIR model that uses the parameters K (the initial growth rate) and R0 (the reproduction number); these parameters were explained in the previous blog post and are repeated further below. The images are a contour plot (lines) and a surface plot (colours) of the value of the RSS as a function of the model parameters. The minimum, where the algorithm should eventually end, is visible in the plots.

In these images we see effects that make it difficult for the algorithm to approach the optimum quickly in a few steps, or that may even block it before that point (it may also end up in a local optimum, which is a somewhat different case, although we have one here as well).

**Computation of the gradient** If the function that we use for the optimization does not provide an expression for the gradient (which is needed to find the direction of movement), then the `optim` package will compute it numerically by evaluating the function at nearby points.

How near do these points need to be? The `optim` package uses the scale of the parameters for this. This scale does not always work out of the box, and when it is too large the algorithm does not compute the gradient accurately.

In the image below the path taken by the algorithm is shown by the red and black arrows. The red arrows show the path when we do not fine-tune the optimization; the black arrows show the path when we reduce the scale of the parameters manually. This is done with the `control` parameter. In the code of the file covid.R you see this in the function call:

OptTemp <- optim(new, RSS2, method = "L-BFGS-B", lower = c(0, 1.00001),
                 hessian = TRUE,
                 control = list(parscale = c(10^-4, 10^-4), factr = 1))

By using `parscale = c(10^-4, 10^-4)` we let the algorithm compute the gradient at a smaller scale (we could also use the `ndeps` parameter for this). In addition, we used `factr = 1`, a factor that determines when the algorithm stops (in this case, when the improvement is less than one times the machine precision).

So by changing the parameter `parscale` we can often push the algorithm to get closer to the optimal solution.
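To see what these two settings do in isolation, here is a self-contained toy example; the valley-shaped function below is made up for illustration (it is not the covid RSS), with its true optimum at p1 = 2e-4, p2 = 2:

```r
# A badly scaled "valley": the first parameter lives on a scale of 1e-4,
# the second on a scale of 1. parscale tells optim about this difference
# (finite-difference steps are taken on the par/parscale scale), and
# factr = 1 makes the stopping criterion very strict.
f <- function(p) 1e8 * (p[1] - 2e-4)^2 + (1e4 * p[1] - p[2])^2

fit <- optim(c(1e-3, 1), f, method = "L-BFGS-B",
             control = list(parscale = c(1e-4, 1), factr = 1))
fit$par  # close to c(2e-4, 2)
```

On the rescaled parameters the function becomes a well-conditioned quadratic, so the finite-difference gradient is accurate and the algorithm walks straight to the optimum.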

**A zigzag path towards the optimum** may occur when the surface plot of the RSS has a long, stretched valley shape. The algorithm then computes a path that moves towards the optimum like a snowboarder on a half-pipe: lots of movement from side to side along the curvature of the half-pipe, and much less movement downhill towards the bottom.

In the case above we let the algorithm start at a point that was chosen on purpose for the illustration. But we do not always make such a good initial guess. In the image below we see what happens when we choose a worse starting condition (note that the image should be stretched out along the y-axis due to the different ranges of K and R0, in which case the change of the RSS is much faster/stronger in the left-right direction than in the up-down direction).

The red curve, which shows the result of the algorithm without the fine-tuning, stops after just one step, where it hits the bottom of the curvature of the valley/half-pipe and does not accurately detect that there is still a long path/gradient in the other direction. We can improve the situation by changing the `parscale` parameter, in which case the algorithm will determine the slope more precisely and continue its path (see the black arrows). But in the direction of the y-axis it does this only in small steps, so it will converge only slowly to the optimal solution.

We can often improve this situation by changing the relative scale of the parameters, however, in this particular case, it is not easy, because of the L-shape of the ‘valley’ (see the above image). We could change the relative scales of the parameters to improve convergence in the beginning, but then the convergence at the end becomes more difficult.

The second reason for the bad convergence behaviour of the algorithm is that the problem is ill-conditioned. That means that a small change of the data will have a large influence on the outcome of the parameter estimates.

In that case, the data is not very useful to differentiate between different parameters of the model. A large range of variation in the parameters can more or less explain the same data.

An example of this is in the image below, where we see that for different values of R0 we can still fit the data without much difference in the residual sum of squares (RSS). Each time we get more or less the same value for K (and the shape of the curve does not depend much on the value of R0).

This value for K relates to the initial growth rate. Let's look at the differential equations to see why variations in R0 have so little effect on the beginning of the curve. In terms of the parameters K and R0 (with $K = \beta - \gamma$ and $R_0 = \beta/\gamma$) the equations are now:

$$\frac{dS}{dt} = -\frac{K R_0}{R_0 - 1} \frac{S}{N} I, \qquad \frac{dI}{dt} = \frac{K}{R_0 - 1}\left(R_0 \frac{S}{N} - 1\right) I, \qquad \frac{dR}{dt} = \frac{K}{R_0 - 1} I$$

Here we see that, when $S$ is approximately equal to $N$ (which is the case in the beginning), then we get approximately $\frac{dI}{dt} \approx K I$ and the beginning of the curve will be approximately exponential.

Thus, for a large range of values of R0, the beginning of the epidemiological curve will resemble an exponential growth that is independent of the value of R0. In the opposite direction: when we (initially) observe exponential growth, we cannot use this observation to derive a value for R0.
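This near-independence of the early curve from R0 can be checked numerically. The sketch below is my own illustration (a simple Euler-stepped SIR with made-up parameter values), assuming the parametrisation beta = K*R0/(R0-1), gamma = K/(R0-1):

```r
# Simulate the early phase of a SIR model for two very different R0 values
# but the same growth rate K, with a crude Euler scheme.
sir_early <- function(K, R0, N = 1e6, I0 = 1, t_end = 5, dt = 0.01) {
  beta  <- K * R0 / (R0 - 1)  # assumed parametrisation
  gamma <- K / (R0 - 1)
  S <- N - I0; I <- I0
  for (i in seq_len(round(t_end / dt))) {
    dS <- -beta * S * I / N
    dI <-  beta * S * I / N - gamma * I
    S <- S + dS * dt
    I <- I + dI * dt
  }
  I
}
I_a <- sir_early(K = 0.4, R0 = 2)   # R0 = 2
I_b <- sir_early(K = 0.4, R0 = 10)  # R0 = 10, same K
c(I_a, I_b, exp(0.4 * 5))           # all three nearly equal
```

As long as S is close to N, both runs track the pure exponential exp(K*t), so the early data carry almost no information about R0.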

With these ill-conditioned problems, it is often difficult to get the algorithm to converge to the minimum. This is because changes in some parameter (in our case R0) result in only a small improvement of the RSS, and a large range of the parameter values has more or less the same RSS.

So if small variations in the data occur, due to measurement errors, how much impact will this have on the estimates of the parameters? Here we show the results for two different ways to determine this. In the file covid.R the execution of the methods is explained in more detail.

**Using an estimate of the Fisher information.** We can determine an estimate for the (lower bound of the) variance of the parameters by considering the Cramér-Rao bound, which states that the variance of (unbiased) parameter estimates is equal to or larger than the inverse of the Fisher information matrix. The Fisher information is a matrix with the second-order partial derivatives of the log-likelihood function, evaluated at the true parameter values.

The log-likelihood function (for independent Gaussian errors, which is what least-squares fitting assumes) is this thing:

$$\log \mathcal{L} = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - f(x_i; K, R_0) \right)^2$$

We do not know this log-likelihood function and its dependence on the parameters K and R0, because we do not have the true parameter values and we also do not know the variance of the random error of the data points (the $\sigma^2$ term in the likelihood function). But we can estimate it based on the Hessian, a matrix with the second-order partial derivatives of our objective function, evaluated at our final estimate.

#####################
##
## computing variance with Hessian
##
###################

### The output of optim will store values for RSS and the hessian
mod <- optim(c(0.3, 1.04), RSS2, method = "L-BFGS-B", hessian = TRUE,
             control = list(parscale = c(10^-4, 10^-4), factr = 1))

# unbiased estimate of standard deviation
# we divide by n-p
# where n is the number of data points
# and p is the number of estimated parameters
sigma_estimate <- sqrt(mod$value / (length(Infected) - 2))

# compute the inverse of the hessian
# The hessian = the second order partial derivative of the objective function
# in our case this is the RSS
# we multiply by 1/(2 * sigma^2)
covpar <- solve(1/(2 * sigma_estimate^2) * mod$hessian)
covpar
#              [,1]          [,2]
#[1,]  1.236666e-05 -2.349611e-07
#[2,] -2.349611e-07  9.175736e-09

## the standard deviation of R0 is then approximately
covpar[2,2]^0.5
#[1] 9.579006e-05

**Using a Monte Carlo estimation.** A formula to compute exactly how the errors/variance in the data propagate to the errors/variance in the estimates of the parameters is often very complex. The Hessian only gives us a lower bound (personally, I find it more useful for spotting potential strong correlations between parameters), and it is not so easy to implement. There is, however, a very blunt but effective way to get an idea of the propagation of errors, and that is by performing a random simulation.

The full details of this method are explained in the covid.R file. Here we will show just the results of the simulation:
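The idea can be sketched with a toy model (this is my own generic sketch, not the covid.R code): fit once, then refit many times on data simulated from the fitted "truth" and look at how the estimates scatter.

```r
# Monte Carlo error propagation for a toy exponential-growth model.
set.seed(1)
day <- 1:20
y <- 5 * exp(0.2 * day) + rnorm(length(day), sd = 5)  # true I0 = 5, K = 0.2

rss <- function(p, data) sum((data - p[1] * exp(p[2] * day))^2)
fit <- optim(c(5, 0.2), rss, data = y)

# simulate new data from the fitted model and refit each time
sims <- replicate(500, {
  y_sim <- fit$par[1] * exp(fit$par[2] * day) + rnorm(length(day), sd = 5)
  optim(fit$par, rss, data = y_sim)$par
})
apply(sims, 1, sd)  # Monte Carlo standard errors of I0 and K
```

The spread (and correlation) of the 500 refitted parameter pairs directly shows how sensitive the estimates are to noise in the data.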

In this simulation, we repeatedly simulated new data based on a true model with the fitted parameter values, and with the variance of the data points corresponding to the observed RSS of our fit. We also show in the right graph how the parameters β and γ are distributed for the same simulation. The parameters β and γ are strongly correlated. This results in them having a large marginal/individual error, while the values K and R0 have much less relative variation (this is why we changed the fitting parameters from β and γ to K and R0).

Now, we are almost at the end of this post, and we will make a new attempt to fit the epidemiological curve, but now based on newer data.

What we do this time is make some small adaptations:

- The data is the total number of people that have gotten sick. This is different from the I (infectious) and R (recovered) output of the model. We therefore compare the modelled I + R with the data (the total that have gone sick).
- In this comparison, we will use a scaling factor, because the reported number of infected/infectious people is an underestimation of the true value, and this latter value is what the model computes. We use two scaling factors, one for before and one for after February 12 (because at that time the definition for reporting cases was changed).
- We make the population size N a fitting variable. This corrects for the two assumptions that we have homogeneous mixing among the entire population of China and that the entire population is susceptible. In addition, we make the number of infected people at the start a fitting variable. In this model, we will fit I + R. There is data for a separate R, but it is not such an accurate variable (because the recovery and the infectious phase are not easy to define/measure/determine).

Because the computation of all these parameters is too difficult in a single `optim` function, we solve the parameters separately in a nested way. In the innermost loop, we solve the scaling parameters (which can be done with a simple linear model); in the middle loop we solve K and R0 with the `optim` function; in the outer loop we do a brute-force search for the optimal starting value of N.

To obtain a starting condition we use a result from *Harko, Lobo and Mak 2014* (Exact analytical solutions of the Susceptible-Infected-Recovered (SIR) epidemic model and of the SIR model with equal death and birth rates), who derived expressions for S, I and R in terms of a single differential equation. The equation below is based on their equations but expressed in slightly different terms:

We can solve this equation as a linear equation, which gives us a good starting condition (small sidenote: using some form of the differential equation is a general way of getting starting conditions, but the differentiated data might be noisy, in which case one could integrate the expression instead).
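A simpler variant of the same idea, as a hedged illustration with simulated data (not the Harko-based equation from covid.R): a starting value for the growth rate K can be read off a linear model on the logarithm of the early case counts.

```r
# Quick starting value for K from a log-linear fit of simulated early counts.
set.seed(42)
day <- 1:15
cases <- round(3 * exp(0.25 * day) * exp(rnorm(length(day), sd = 0.05)))

start_fit <- lm(log(cases) ~ day)
K_start <- unname(coef(start_fit)["day"])
K_start  # close to the true growth rate 0.25 used in the simulation
```

Such crude linearized estimates are usually good enough to start a nonlinear optimizer in the right basin.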

The further details of the computation can be found in the covid.R script. Below you see a result of the outer loop, where we did a brute-force search for the optimal population size N, and next to it a fitted curve for the resulting parameters.

In this new fit, we again get a low reproduction number R0. One potential reason for this is that, due to the measures that have been taken, the Chinese have been able to reduce the rate of the spread of the virus. The model is unaware of this and interprets the slowdown as a reduction that is due to immunity (a decrease of susceptible people). However, only a very small fraction of the people have gained immunity. For the virus to stop spreading at such a low fraction of sick people, R0 must be very low.

Thus, an estimation of the parameters based on this type of data is difficult. When we see a decrease in the growth rate, then one or more of the following four effects play a role: (1) The number of susceptible people has decreased sufficiently to overcome the reproduction rate R0. This relative decrease in susceptible people happens faster when the total number of people is smaller. (2) Something has changed about the conditions; the reproduction rate is not constant in time. For instance, with respiratory infections it is common that the transfer rates depend on the weather and are higher during winter. (3) The measures that are being taken against the disease are taking effect. (4) The model is too simple, with several assumptions that overestimate the effect of the initial growth rate. This growth rate is very high: we observe a doubling every three days. This means that the time between generations is very short, something that is not believed to be true. It may be likely that the increase in numbers is partially due to a variable time delay in the occurrence of the symptoms, as well as sampling bias.

For statisticians, it is difficult to estimate what causes the changes in the epidemic curves. We would need more *detailed* information in order to fill in the gaps, which do not seem to go away by just having more data (and this coronavirus creates a lot of data, possibly too much). But as human beings under threat of a nasty disease, we can at least consider ourselves lucky that there are many ways in which the disease can fade away. And we can be lucky that we see a seemingly/effective reproduction rate that is very low, and that only a fraction of the population is susceptible.

So now we have done all this nice mathematics and we can draw accurately a modelled curve through all our data points. But is this useful when we model the wrong data with the wrong model? The difference between statistics and mathematics is that statisticians need to look beyond the computations.

- We need to consider what the data actually represents, how it is sampled, whether there are biases and how strongly they are going to influence our analysis. Ideally, we should do this *before* we start throwing computations at the data. Otherwise such computations will at most be exploratory analysis, and they should not start to live their own life without the data.
- And we need to consider how good a representation our models are. We can make expressions based on the variance in the data, but the error is also determined by the bias in our models.

At the present time, COVID-19 is making an enormous impact on our lives, with an unclear effect on the future (we do not even know when the measures are going to stop: end of April, end of May, maybe even June?). Only time will tell what the economic aftermath of this coronavirus is going to be, and how large its impact will be on our health and quality of life. But one thing that we can assure ourselves of is that the ominous view of unlimited exponential growth (currently going around on social media) is not data-driven.

In this post, I have explained some mathematics about fitting. However, I would like to warn against the blunt use of these mathematical formulas. Just because we use a mathematical model does not mean that our conclusions/predictions are trustworthy. We need to challenge the premises, which are the underlying data and models. So in a next post, “Contagiousness of COVID-19 Part 2: Why the Result of Part 1 is Useless”, I hope to explain what sort of considerations about the data and the models one should take into account, and make some connections with other cases where statistics went in a wrong direction.

Correlation and its associated challenges don’t lose their fascination: most people know that correlation is not causation.

If you want to learn about a paradoxical effect nearly nobody is aware of, where correlation between two uncorrelated random variables is introduced just by sampling, read on!

Let us just get into an example (inspired by When Correlation Is Not Causation, But Something Much More Screwy): for all intents and purposes let us assume that appearance and IQ are normally distributed and are uncorrelated:

set.seed(1147)
hotness <- rnorm(1000, 100, 15)
IQ <- rnorm(1000, 100, 15)
pop <- data.frame(hotness, IQ)
plot(hotness ~ IQ, main = "The general population")

Now, we can ask ourselves: why does somebody become famous? One plausible assumption (besides luck, see also: The Rich didn’t earn their Wealth, they just got Lucky) would be that this person has some combination of attributes. To stick with our example, let us assume some combination of hotness and intelligence and let us sample some “celebrities” on the basis of this combination:

pop$comb <- pop$hotness + pop$IQ # some combination of hotness and IQ
celebs <- pop[pop$comb > 235, ]  # sample celebs on the basis of this combination

plot(celebs$hotness ~ celebs$IQ, xlab = "IQ", ylab = "hotness", main = "Celebrities")
abline(lm(celebs$hotness ~ celebs$IQ), col = "red")

Wow, a clear negative relationship between hotness and IQ! Even a highly significant one (to understand significance, see also: From Coin Tosses to p-Hacking: Make Statistics Significant Again!):

cor.test(celebs$hotness, celebs$IQ) # highly significant
## 
##  Pearson's product-moment correlation
## 
## data:  celebs$hotness and celebs$IQ
## t = -14.161, df = 46, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9440972 -0.8306163
## sample estimates:
##       cor 
## -0.901897

How can this be? Well, the basis (the combination of hotness and IQ) on which we sample from our (uncorrelated) population is what is called a *collider* (variable) in statistics. Whereas a *confounder* (variable) influences (at least) two variables (A ← C → B), a collider is the opposite: it is influenced by (at least) two variables (A → C ← B).

In our simple case, it is the sum of our two independent variables. The result is a spurious correlation introduced by a special form of *selection bias*, namely *endogenous selection bias*. The same effect also goes under the name *Berkson’s paradox*, *Berkson’s fallacy*, *selection-distortion effect*, *conditioning on a collider (variable)*, *collider stratification bias*, or just *collider bias*.
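The effect can also be quantified numerically with a fresh simulation (the threshold below mirrors the 235 used above, the rest is an illustrative sketch): the correlation is near zero in the full population and clearly negative once we condition on the collider.

```r
# Two independent traits; selecting on their sum (the collider) induces a
# spurious negative correlation within the selected group.
set.seed(2024)
x <- rnorm(10000, 100, 15)
y <- rnorm(10000, 100, 15)
cor_full <- cor(x, y)           # approximately 0 in the full population

sel <- (x + y) > 235            # condition on the collider x + y
cor_sel <- cor(x[sel], y[sel])  # clearly negative among the selected
round(c(cor_full, cor_sel), 2)
```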

To understand this effect intuitively we are going to combine the two plots from above:

plot(hotness ~ IQ, main = "The general population & Celebrities")
points(celebs$hotness ~ celebs$IQ, col = "red")
abline(a = 235, b = -1, col = "blue")

In reality, things are often not so simple. When you google the above search terms you will find all kinds of examples, e.g. the so-called *obesity paradox* (an apparent preventive effect of obesity on mortality in individuals with cardiovascular disease (CVD)), a supposed health-protective effect of neuroticism or biased deep learning predictions of lung cancer.

As a takeaway: if a statistical result implies a relationship that seems too strange to be true, it possibly is! To check whether collider bias might be present check if sampling was being conducted on the basis of a variable that is influenced by the variables that seem to be correlated! Otherwise, you might not only falsely conclude that beautiful people are generally stupid and intelligent people ugly…

So, why not a post from the new epicentre of the global COVID-19 pandemic, Central Europe, more exactly where I live: Germany?! Indeed, if you want to find out what the numbers tell us about how things might develop here, read on!

We will use the same model we already used in this post: Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)?. You can find all the details there and in the comments.

library(deSolve)

# https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Germany#Statistics
Infected <- c(16, 18, 21, 26, 53, 66, 117, 150, 188, 240, 349, 534, 684, 847,
              1110, 1458, 1881, 2364, 3057, 3787, 4826, 5999)
Day <- 1:(length(Infected))
N <- 83149300 # population of Germany acc. to Destatis

old <- par(mfrow = c(1, 2))
plot(Day, Infected, type = "b")
plot(Day, Infected, log = "y")
abline(lm(log10(Infected) ~ Day))
title("Total infections COVID-19 Germany", outer = TRUE, line = -2)

This clearly shows that we have an exponential development here, unfortunately as expected.
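The slope of the regression line in the log plot directly gives the growth rate, from which a back-of-the-envelope doubling time follows (this little calculation is my own addition; the data is repeated so the snippet is self-contained):

```r
# Doubling time from the log-linear trend of the German case counts.
Infected <- c(16, 18, 21, 26, 53, 66, 117, 150, 188, 240, 349, 534, 684, 847,
              1110, 1458, 1881, 2364, 3057, 3787, 4826, 5999)
Day <- 1:length(Infected)

slope <- coef(lm(log10(Infected) ~ Day))["Day"]
doubling_time <- log10(2) / slope  # in days
unname(doubling_time)              # roughly 2 to 3 days at this stage
```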

SIR <- function(time, state, parameters) {
  par <- as.list(c(state, parameters))
  with(par, {
    dS <- -beta/N * I * S
    dI <- beta/N * I * S - gamma * I
    dR <- gamma * I
    list(c(dS, dI, dR))
  })
}

init <- c(S = N-Infected[1], I = Infected[1], R = 0)
RSS <- function(parameters) {
  names(parameters) <- c("beta", "gamma")
  out <- ode(y = init, times = Day, func = SIR, parms = parameters)
  fit <- out[ , 3]
  sum((Infected - fit)^2)
}

Opt <- optim(c(0.5, 0.5), RSS, method = "L-BFGS-B", lower = c(0, 0), upper = c(1, 1)) # optimize with some sensible conditions
Opt$message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Opt_par <- setNames(Opt$par, c("beta", "gamma"))
Opt_par
##      beta     gamma 
## 0.6427585 0.3572415 

t <- 1:80 # time in days
fit <- data.frame(ode(y = init, times = t, func = SIR, parms = Opt_par))
col <- 1:3 # colour

matplot(fit$time, fit[ , 2:4], type = "l", xlab = "Day", ylab = "Number of subjects",
        lwd = 2, lty = 1, col = col)
matplot(fit$time, fit[ , 2:4], type = "l", xlab = "Day", ylab = "Number of subjects",
        lwd = 2, lty = 1, col = col, log = "y")
## Warning in xy.coords(x, y, xlabel, ylabel, log = log): 1 y value <= 0
## omitted from logarithmic plot

points(Day, Infected)
legend("bottomright", c("Susceptibles", "Infecteds", "Recovereds"),
       lty = 1, lwd = 2, col = col, inset = 0.05)
title("SIR model COVID-19 Germany", outer = TRUE, line = -2)

par(old)

R0 <- setNames(Opt_par["beta"] / Opt_par["gamma"], "R0")
R0
##       R0 
## 1.799227 

fit[fit$I == max(fit$I), "I", drop = FALSE] # height of pandemic
##          I
## 54 9765121

max_infected <- max(fit$I)
max_infected / 5 # severe cases
## [1] 1953024

max_infected * 0.06 # cases with need for intensive care
## [1] 585907.3

# https://www.newscientist.com/article/mg24532733-700-why-is-it-so-hard-to-calculate-how-many-people-will-die-from-covid-19/
max_infected * 0.007 # deaths with supposed 0.7% fatality rate
## [1] 68355.85

So, according to this model, the height of the pandemic will be reached by the end of April, beginning of May. About 10 million people would be infected by then, which translates to about 2 million severe cases, about 600,000 cases in need of intensive care (there are about 28,000 ICU beds in Germany with an average utilization rate of around 80%) and up to 70,000 deaths.
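To put the ICU figure into perspective, a quick arithmetic check (using only the bed count, utilization rate and peak ICU demand quoted above) shows how far demand would exceed the free capacity:

```r
# Back-of-the-envelope ICU capacity check with the figures quoted above.
icu_beds <- 28000
utilization <- 0.80
free_beds <- icu_beds * (1 - utilization)  # about 5600 free ICU beds
icu_cases <- 585907                        # peak ICU demand from the model
round(icu_cases / free_beds)               # demand exceeds free capacity ~100-fold
```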

Those are the numbers our model produces and nobody knows whether they are correct while everybody hopes they are not. One thing has to be kept in mind though: the numbers used in the model are from before the shutdown (for details see here: DER SPIEGEL: Germany Moves To Shut Down Most of Public Life). So hopefully those measures will prove effective and the actual numbers will turn out to be much, much lower.

I wish you all the best and stay healthy!

**UPDATE March 17, 2020**

The German Disease Control Center “Robert Koch Institute” (RKI) just announced that there could be up to 10 million infections in the next few months. This agrees with my model above.

Also, the current number of confirmed infections (about 8,200 cases) is within a small error margin (about 3%) of what my model predicted for today.

Source: DER SPIEGEL: Coronavirus Could Infect Up to 10 Million in Germany

Valentine’s Day is around the corner and love is in the air… but, shock horror, nearly every second marriage ends in a divorce! Unfortunately, I can tell you first hand that this is an experience you’d rather not have. In this post, we see how the `OneR` package and an interesting new data set might potentially help you to avoid that tragedy… so read on!

In a scientific study last year in Turkey, nearly 200 participants (married as well as divorced) were asked to rate how important they find the following statements (some of them seem to have got a little lost in translation from the original Turkish version):

1. When one of our apologies apologizes when our discussions go in a bad direction, the issue does not extend.

2. I know we can ignore our differences, even if things get hard sometimes.

3. When we need it, we can take our discussions with my wife from the beginning and correct it.

4. When I argue with my wife, it will eventually work for me to contact him.

5. The time I spent with my wife is special for us.

6. We don’t have time at home as partners.

7. We are like two strangers who share the same environment at home rather than family.

8. I enjoy our holidays with my wife.

9. I enjoy traveling with my wife.

10. My wife and most of our goals are common.

11. I think that one day in the future, when I look back, I see that my wife and I are in harmony with each other.

12. My wife and I have similar values in terms of personal freedom.

13. My husband and I have similar entertainment.

14. Most of our goals for people (children, friends, etc.) are the same.

15. Our dreams of living with my wife are similar and harmonious

16. We’re compatible with my wife about what love should be

17. We share the same views with my wife about being happy in your life

18. My wife and I have similar ideas about how marriage should be

19. My wife and I have similar ideas about how roles should be in marriage

20. My wife and I have similar values in trust

21. I know exactly what my wife likes.

22. I know how my wife wants to be taken care of when she’s sick.

23. I know my wife’s favorite food.

24. I can tell you what kind of stress my wife is facing in her life.

25. I have knowledge of my wife’s inner world.

26. I know my wife’s basic concerns.

27. I know what my wife’s current sources of stress are.

28. I know my wife’s hopes and wishes.

29. I know my wife very well.

30. I know my wife’s friends and their social relationships.

31. I feel aggressive when I argue with my wife.

32. When discussing with my wife, I usually use expressions such as “you always” or “you never”.

33. I can use negative statements about my wife’s personality during our discussions.

34. I can use offensive expressions during our discussions.

35. I can insult our discussions.

36. I can be humiliating when we argue.

37. My argument with my wife is not calm.

38. I hate my wife’s way of bringing it up.

39. Fights often occur suddenly.

40. We’re just starting a fight before I know what’s going on.

41. When I talk to my wife about something, my calm suddenly breaks.

42. When I argue with my wife, it only snaps in and I don’t say a word.

43. I’m mostly thirsty to calm the environment a little bit.

44. Sometimes I think it’s good for me to leave home for a while.

45. I’d rather stay silent than argue with my wife.

46. Even if I’m right in the argument, I’m thirsty not to upset the other side.

47. When I argue with my wife, I remain silent because I am afraid of not being able to control my anger.

48. I feel right in our discussions.

49. I have nothing to do with what I’ve been accused of.

50. I’m not actually the one who’s guilty about what I’m accused of.

51. I’m not the one who’s wrong about problems at home.

52. I wouldn’t hesitate to tell her about my wife’s inadequacy.

53. When I discuss it, I remind her of my wife’s inadequate issues.

54. I’m not afraid to tell her about my wife’s incompetence.

Now, the question is whether one can decide – on the basis of their ratings alone – whether a person will actually get divorced. Let us see if data science can help us in this love related matter!

The data and a link to the corresponding article can be found here: Divorce Predictors data set. I unpacked the data for your convenience; you can download it here: divorce.csv. Let us now use the `OneR` package (on CRAN) to analyse it:

library(OneR)
divorce <- read.csv("data/divorce.csv", sep = ";")
divorce$Class <- factor(ifelse(divorce$Class == 0, "married", "divorced"))

data <- optbin(divorce)
model <- OneR(data, verbose = TRUE) # 18. My wife and I have similar ideas about how marriage should be
## 
##     Attribute Accuracy
## 1 * Atr18     98.24%  
## 2   Atr11     97.65%  
## 2   Atr17     97.65%  
## 2   Atr19     97.65%  
## 5   Atr9      97.06%  
## 5   Atr16     97.06%  
## 5   Atr20     97.06%  
## 5   Atr40     97.06%  
## 9   Atr26     96.47%  
## 10  Atr12     95.88%  
## 10  Atr14     95.88%  
## 10  Atr15     95.88%  
## 10  Atr25     95.88%  
## 10  Atr30     95.88%  
## 15  Atr29     95.29%  
## 15  Atr36     95.29%  
## 15  Atr39     95.29%  
## 18  Atr4      94.71%  
## 18  Atr8      94.71%  
## 18  Atr21     94.71%  
## 18  Atr27     94.71%  
## 22  Atr5      94.12%  
## 22  Atr37     94.12%  
## 22  Atr38     94.12%  
## 25  Atr41     93.53%  
## 25  Atr44     93.53%  
## 27  Atr1      92.94%  
## 27  Atr2      92.94%  
## 27  Atr10     92.94%  
## 27  Atr24     92.94%  
## 31  Atr22     92.35%  
## 31  Atr28     92.35%  
## 31  Atr31     92.35%  
## 31  Atr33     92.35%  
## 35  Atr13     91.76%  
## 35  Atr32     91.76%  
## 35  Atr35     91.76%  
## 38  Atr23     91.18%  
## 38  Atr34     91.18%  
## 40  Atr54     90.59%  
## 41  Atr50     89.41%  
## 42  Atr3      88.82%  
## 43  Atr42     87.65%  
## 44  Atr51     87.06%  
## 45  Atr49     84.71%  
## 45  Atr53     84.71%  
## 47  Atr7      82.35%  
## 48  Atr47     81.76%  
## 49  Atr48     80.59%  
## 50  Atr52     80%     
## 51  Atr43     78.82%  
## 52  Atr45     77.06%  
## 53  Atr6      74.12%  
## 54  Atr46     68.24%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'

summary(model)
## 
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
## 
## Rules:
## If Atr18 = (-0.004,1.19] then Class = married
## If Atr18 = (1.19,4] then Class = divorced
## 
## Accuracy:
## 167 of 170 instances classified correctly (98.24%)
## 
## Contingency table:
##           Atr18
## Class      (-0.004,1.19] (1.19,4] Sum
##   divorced             3     * 81  84
##   married           * 86        0  86
##   Sum                 89       81 170
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 154.56, df = 1, p-value < 2.2e-16

plot(model)

```r
prediction <- predict(model, data)
eval_model(prediction, data)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction divorced married Sum
##   divorced       81       0  81
##   married         3      86  89
##   Sum            84      86 170
## 
## Confusion matrix (relative):
##           Actual
## Prediction divorced married  Sum
##   divorced     0.48    0.00 0.48
##   married      0.02    0.51 0.52
##   Sum          0.49    0.51 1.00
## 
## Accuracy:
## 0.9824 (167/170)
## 
## Error rate:
## 0.0176 (3/170)
## 
## Error rate reduction (vs. base rate):
## 0.9643 (p-value < 2.2e-16)
```

So, the best predictor is the rating on statement 18. The question you should ask your partner before marrying him or her is, therefore, the following:

What is a good marriage for you?

A simple question but one that might reveal some major differences between your conceptions of what a good marriage is. In that case, the outlook is not good. The accuracy of the prediction is a whopping 98.24%! By the way, this is even slightly better than the 98.23% given in the paper (which is achieved by an artificial neural network).

Had I only known this 20 years ago…

Happy Valentine’s Day and stay tuned: we will take a little break and hopefully see you back on March 17th!

A new invisible enemy, only 30kb in size, has emerged and is on a killing spree around the world:

It has already killed more people than the SARS pandemic and its outbreak has been declared a Public Health Emergency of International Concern (PHEIC) by the World Health Organization (WHO).

If you want to learn how epidemiologists estimate how contagious a new virus is and how to do it in R read on!

There are many epidemiological models around, we will use one of the simplest here, the so-called *SIR model*. We will use this model with the latest data from the current outbreak of 2019-nCoV (from here: Wikipedia: Case statistics). You can use the following R code as a starting point for your own experiments and estimations.

Before we start to calculate a forecast let us begin with what is confirmed so far:

```r
Infected <- c(45, 62, 121, 198, 291, 440, 571, 830, 1287, 1975, 2744, 4515,
              5974, 7711, 9692, 11791, 14380, 17205, 20440)
Day <- 1:(length(Infected))
N <- 1400000000 # population of mainland china

old <- par(mfrow = c(1, 2))
plot(Day, Infected, type = "b")
plot(Day, Infected, log = "y")
abline(lm(log10(Infected) ~ Day))
title("Confirmed Cases 2019-nCoV China", outer = TRUE, line = -2)
```

On the left, we see the confirmed cases in mainland China and on the right the same but with a log scale on the y-axis (a so-called *semi-log plot* or more precisely *log-linear plot* here), which indicates that the epidemic is in an exponential phase, although at a slightly smaller rate than at the beginning. By the way: many people were not alarmed at all at the beginning. Why? Because an exponential function looks linear in the beginning. It was the same with HIV/AIDS when it first started.
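To see just how deceptive this is, here is a small illustration (my own toy numbers, not the outbreak data): an exponential curve with a growth rate of 0.3 per day versus its tangent line at day 0.

```r
# Why exponential growth is easy to miss at first: compare exponential
# growth (rate 0.3 per day) with its linear approximation at day 0.
t <- 0:20
expo <- exp(0.3 * t)  # exponential growth
lin  <- 1 + 0.3 * t   # tangent line at t = 0

round(expo - lin, 1)[1:4]  # deviations are tiny during the first days...
expo[21] / lin[21]         # ...but after three weeks the curves have diverged massively
```

For the first few days the two curves are almost indistinguishable, which is why exponential outbreaks tend to be underestimated early on.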

Now we come to the prediction part with the SIR model, whose basic idea is quite simple. There are three groups of people: those that are healthy but susceptible to the disease (S), the infected (I) and the people who have recovered (R):

To model the dynamics of the outbreak we need three *differential equations*, one for the change in each group, where β is the parameter that controls the transition between S and I and γ the transition between I and R:

dS/dt = −β I S / N
dI/dt = β I S / N − γ I
dR/dt = γ I

This can easily be put into R code:

```r
SIR <- function(time, state, parameters) {
  par <- as.list(c(state, parameters))
  with(par, {
    dS <- -beta/N * I * S
    dI <- beta/N * I * S - gamma * I
    dR <- gamma * I
    list(c(dS, dI, dR))
  })
}
```

To fit the model to the data we need two things: a *solver* for differential equations and an *optimizer*. To solve differential equations, the function `ode` from the `deSolve` package (on CRAN) is an excellent choice; to optimize, we will use the `optim` function from base R. Concretely, we will minimize the sum of the squared differences between the number of infected at time t, I(t), and the corresponding number of cases predicted by our model, Î(t):

RSS(β, γ) = Σₜ (I(t) − Î(t))²

Putting it all together:

```r
library(deSolve)

init <- c(S = N-Infected[1], I = Infected[1], R = 0)
RSS <- function(parameters) {
  names(parameters) <- c("beta", "gamma")
  out <- ode(y = init, times = Day, func = SIR, parms = parameters)
  fit <- out[ , 3]
  sum((Infected - fit)^2)
}

Opt <- optim(c(0.5, 0.5), RSS, method = "L-BFGS-B", lower = c(0, 0),
             upper = c(1, 1)) # optimize with some sensible conditions
Opt$message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Opt_par <- setNames(Opt$par, c("beta", "gamma"))
Opt_par
##      beta     gamma 
## 0.6746089 0.3253912

t <- 1:70 # time in days
fit <- data.frame(ode(y = init, times = t, func = SIR, parms = Opt_par))
col <- 1:3 # colour
matplot(fit$time, fit[ , 2:4], type = "l", xlab = "Day",
        ylab = "Number of subjects", lwd = 2, lty = 1, col = col)
matplot(fit$time, fit[ , 2:4], type = "l", xlab = "Day",
        ylab = "Number of subjects", lwd = 2, lty = 1, col = col, log = "y")
## Warning in xy.coords(x, y, xlabel, ylabel, log = log): 1 y value <= 0
## omitted from logarithmic plot
points(Day, Infected)
legend("bottomright", c("Susceptibles", "Infecteds", "Recovereds"),
       lty = 1, lwd = 2, col = col, inset = 0.05)
title("SIR model 2019-nCoV China", outer = TRUE, line = -2)
```

We see in the right log-linear plot that the model seems to fit the values quite well. We can now extract some interesting statistics. One important number is the so-called *basic reproduction number* (also basic reproduction ratio) R₀ (pronounced “R naught”), which basically shows how many healthy people get infected by a sick person on average:

```r
par(old)
R0 <- setNames(Opt_par["beta"] / Opt_par["gamma"], "R0")
R0
##       R0 
## 2.073224

fit[fit$I == max(fit$I), "I", drop = FALSE] # height of pandemic
##           I
## 50 232001865

max(fit$I) * 0.02 # max deaths with supposed 2% fatality rate
## [1] 4640037
```

So, R₀ is slightly above 2, which is the number many researchers and the WHO give, and which is in about the same range as SARS, influenza or Ebola (although Ebola is transmitted via bodily fluids and not airborne droplets). Additionally, according to this model, the height of a possible pandemic would be reached by the beginning of March (50 days after it started), with over 200 million Chinese infected and over 4 million dead!

**Do not panic!** All of this is preliminary and hopefully (probably!) false. When you play along with the above model you will see that the fitted parameters are far from stable. On the one hand, the purpose of this post was just to give an illustration of how such analyses are done in general with a very simple (probably too simple!) model, on the other hand, we are in good company here; the renowned scientific journal *nature* writes:

Researchers are struggling to accurately model the outbreak and predict how it might unfold.

On the other hand, I wouldn’t go so far as to say that the numbers are impossibly high. H1N1, also known as swine flu, infected up to 1.5 billion people during 2009/2010 and nearly 600,000 died. And this wasn’t the first pandemic of this proportion in history (think of the Spanish flu). Yet, this is one of the few times where I hope that my model is wrong and that we will all stay healthy!

**UPDATE March 16, 2020**

I posted an updated version for the situation in Germany:

COVID-19: The Case of Germany

We have all watched with great horror the catastrophic fires in Australia. Over many years, scientists have been studying simulations to understand the underlying dynamics better. They tell us that “what Australia needs is more fires, but of the right kind”. What do they mean by that?

One such simulation of fire is based on *Multi-Agent Systems (MAS)*, also called *Agent-Based Modelling (ABM)*. An excellent piece of free software (and in fact the de facto standard) is NetLogo. Even better is that NetLogo can be fully controlled by R… and we will use this feature to learn some crucial lessons!

If you want to understand more about the dynamics of fire in particular and about some fascinating properties of *dynamical systems* in general via controlling NetLogo with R, read on!

The model we will use can be found here: NetLogo Models Library: Fire. There it reads:

This project simulates the spread of a fire through a forest. It shows that the fire’s chance of reaching the right edge of the forest depends critically on the density of trees. This is an example of a common feature of complex systems, the presence of a non-linear threshold or critical parameter.

Similar effects can be observed in other phenomena, namely the spread of diseases (the medical area of infectiology or epidemiology), the diffusion of information (or memes) within a population (e.g. via social media) or the diffusion of innovation within an economy.

Those effects can best be understood by looking at some examples:

Set the density of trees to 55%. At this setting, there is virtually no chance that the fire will reach the right edge of the forest.

Set the density of trees to 70%. At this setting, it is almost certain that the fire will reach the right edge.

There is a sharp transition around 59% density: at 59% density, the fire has a 50/50 chance of reaching the right edge.

Now, we are going to control NetLogo with R via the excellent `nlrx` package (on CRAN). The following analysis was inspired by the article “Agent Based Models and RNetLogo” from the Revolutions blog (a Microsoft company), yet the `RNetLogo` package used therein is not compatible with current NetLogo versions (and is therefore increasingly useless).

You can use the following piece of code as a template for your own experiments with all kinds of NetLogo models:

```r
library(nlrx)

# Windows default NetLogo installation path (adjust to your needs)
netlogopath <- file.path("C:/Program Files/NetLogo 6.1.1")
modelpath <- file.path(netlogopath, "app/models/Sample Models/Earth Science/Fire.nlogo")
outpath <- file.path("data") # adjust to your needs

nl <- nl(nlversion = "6.1.1", nlpath = netlogopath,
         modelpath = modelpath, jvmmem = 1024)

# set up experiment
nl@experiment <- experiment(expname = "fire",
                            outpath = outpath,
                            repetition = 1,
                            tickmetrics = "false",
                            idsetup = "setup",
                            idgo = "go",
                            runtime = 0,
                            metrics = c("ifelse-value (initial-trees > 0) [(burned-trees / initial-trees) * 100][0]"),
                            variables = list('density' = list(values = seq(0, 100, 10))), # use seq(0, 100, 1) to simulate over 100 different density values
                            constants = list())

# set nseeds = 10 to simulate over 10 different random seeds (replicates)
set.seed(123)
nl@simdesign <- simdesign_distinct(nl, nseeds = 2)
## Creating distinct simulation design

# run experiment
results <- run_nl_all(nl)

# plot results as boxplots
boxplot(results$`ifelse-value (initial-trees > 0) [(burned-trees / initial-trees) * 100][0]` ~ results$density,
        xlab = "Density", ylab = "Percent Burned",
        main = "NetLogo Fire Simulation")
```

The boxplots corroborate the point of a highly non-linear behaviour, also called a *phase transition*, *critical threshold* or *tipping point*, around the 59% density level.

One practical consequence is to ensure that the density of forests doesn’t grow beyond some critical threshold. In an interview, my colleague Professor Stephen Pyne, an American fire expert, answers the question of what role Aborigines played in Australia’s history of fire:

It’s a key role. Evidence suggests that they burned the landscape for 50,000 years after humans first found their way to Australia. These fires resulted in a new equilibrium. But when the arrival of the Europeans caused the Aboriginal population to collapse, this equilibrium was thrown off. Among the many changes, woody vegetation increased in some places, which provided more fuel for the flames. That shifted the system towards larger, more intense fires. In a landscape dominated by Aborigines, there would be more charred areas, but fewer intense fires. What Australia needs is more fires, but of the right kind. We see too many bad fires, and too few good ones.

To sum up, the counter-intuitive lesson of many dynamical systems is that something most people take as a given is not true: that in the real world, small changes in one part of a system will only ever have a small overall impact. Multi-agent simulations can give us a feeling for why small changes can sometimes have a huge impact (depending on the overall situation the system is in).

By controlling NetLogo with R we have its full statistical power to analyze this often mind-blowing behaviour and conduct serious research. It will not be the last time that we cover a project in this fascinating area, so stay tuned!

We already covered the so-called

This is especially true for *Neural Networks*: while often delivering outstanding results, they are basically *black boxes* and notoriously hard to interpret (see also: Understanding the Magic of Neural Networks).

There is a new hot area of research to make black-box models interpretable, called *Explainable Artificial Intelligence (XAI)*, if you want to gain some intuition on one such approach (called *LIME*), read on!

Before we dive right into it, it is important to point out when and why you would need interpretability of an AI. While it might be a desirable goal in itself, it is not necessary in many fields, at least not for users of an AI: with text translation, character and speech recognition, for example, it is not that important why they do what they do but simply that they work.

In other areas, like medical applications (determining whether tissue is malignant), financial applications (granting a loan to a customer) or applications in the criminal-justice system (gauging the risk of recidivism) it is of the utmost importance (and sometimes even required by law) to know *why* the machine arrived at its conclusions.

One approach to make AI models explainable is called *LIME* for *Local Interpretable Model-Agnostic Explanations*. There is already a lot in this name!

- First, you are able to make *any*(!) model interpretable (so not only neural networks).
- Second, the explanation always concerns a specific case: the method tries to explain, e.g., why this specific customer wasn’t approved for a loan; it gives no general explanation of what is important for getting a loan.
- Third, and now we are getting to the heart of the method already, it approximates the complex, unintelligible model with a linear model.

This idea is, in my opinion, quite sexy because it has its equivalent in calculus: if you zoom in deep enough you can build most (even very complicated) functions out of linear building blocks. This is what LIME basically does!
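To make this analogy concrete, here is a small toy example (my own illustration, not part of LIME itself): we approximate a non-linear function by the line obtained from its local slope. The surrogate is near-perfect close to the chosen point and useless far away from it, which is exactly the “locally (but not globally) faithful” property described in the paper.

```r
# Toy calculus analogue of LIME: zoom in on a non-linear function
# and replace it locally by a line.
f  <- function(x) sin(x)   # stand-in for a complicated black-box model
x0 <- 1                    # the "instance" we zoom in on
h  <- 1e-3
slope <- (f(x0 + h) - f(x0 - h)) / (2 * h)    # numerical local slope at x0
f_lin <- function(x) f(x0) + slope * (x - x0) # local linear surrogate

x <- seq(0.9, 1.1, 0.01)
max(abs(f(x) - f_lin(x)))            # tiny error close to x0...
abs(f(x0 + 2) - f_lin(x0 + 2))       # ...large error far away from x0
```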

To gain some intuition we will use a very similar method and compare that with the results of LIME. We will even use the same illustrative picture used in the original paper (“Why Should I Trust You?”: Explaining the Predictions of Any Classifier) and create a toy-example out of it:

The paper explains (p. 4):

Toy example to present intuition for LIME. The black-box model’s complex decision function f (unknown to LIME) is represented by the blue/pink background, which cannot be approximated well by a linear model. The bold red cross is the instance being explained. LIME samples instances, gets predictions using f, and weighs them by the proximity to the instance being explained (represented here by size). The dashed line is the learned explanation that is locally (but not globally) faithful.

We are now taking this picture as our actual black-box model and approximate the decision boundary linearly. We do this by using *logistic regression* (see also: Learning Data Science: The Supermarket knows you are pregnant before your Dad does). LIME itself uses *LASSO regression* (see also: Cambridge Analytica: Microtargeting or How to catch voters with the LASSO).

Another difference is that we don’t weigh instances by proximity but instead randomly create more data points that are near the respective case (by sampling from a multivariate normal distribution); the idea, however, is the same.

Now we are ready to get started (you can find the prepared image here: lime2.jpg; the packages `jpeg` and `lime` are on CRAN):

```r
library(jpeg)
library(lime)

img <- readJPEG("pics/lime2.jpg") # adjust path
border_col <- mean(img[ , , 1])
model <- ifelse(img[ , , 1] < border_col, 0, 1)
image(model, axes = FALSE, xlab = "x1", ylab = "x2",
      col = c("#B9CDE5", "#F1DCDB"))
axis(1, at = seq(0, 1, .1), labels = seq(0, nrow(model), round(nrow(model)/10)))
axis(2, at = seq(0, 1, .1), labels = seq(0, ncol(model), round(ncol(model)/10)))

# some S3 magic
class(model) <- "black_box"
predict.black_box <- function(model, newdata, type = "prob") {
  newdata <- as.matrix(newdata)
  apply(newdata, 1, function(x) model[x[1], x[2]])
}

# the case to be analyzed
x1 <- 140; x2 <- 145
points(x1/nrow(model), x2/ncol(model), pch = 3, lwd = 6, col = "red")
predict(model, cbind(x1, x2))
## [1] 1

# approximate locally by logistic regression
set.seed(123)
x1_prox <- round(rnorm(100, x1, 18))
x2_prox <- round(rnorm(100, x2, 18))
data <- cbind(x1_prox, x2_prox)
df <- cbind.data.frame(y = predict(model, data), data)
logreg <- glm(y ~ ., data = df, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(logreg)
## 
## Call:
## glm(formula = y ~ ., family = binomial, data = df)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -6.423e-04   2.000e-08   2.000e-08   2.000e-08   5.100e-04  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  12378.05  735651.83   0.017    0.987
## x1_prox        -94.11    5606.33  -0.017    0.987
## x2_prox         15.67     952.47   0.016    0.987
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.0279e+02  on 99  degrees of freedom
## Residual deviance: 8.5369e-07  on 97  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25

slope <- -coef(logreg)[2] / coef(logreg)[3]
intercept <- -coef(logreg)[1] / coef(logreg)[3]
segments(0, intercept/ncol(model), 1,
         (intercept + nrow(model) * slope)/ncol(model),
         lty = 2, lwd = 6, col = "grey40")
```

Here we can clearly see the locally approximated linear decision boundary. Now for the interpretation of the coefficients of the linear model:

```r
# interpretation
barplot(coef(logreg)[3:2], horiz = TRUE,
        col = ifelse(coef(logreg)[3:2] < 0, "firebrick", "steelblue"),
        border = NA, xlab = "Weight", ylab = "Feature",
        main = "Coefficients of linear model")
legend("bottom", horiz = TRUE, c("Supports", "Contradicts"),
       fill = c("steelblue", "firebrick"))
```

The bar chart can be interpreted like so: attribute `x1` has a strong negative influence on the resulting class (i.e. when you increase it the class will quickly change), while attribute `x2` has a comparatively mild positive influence (i.e. when you increase it the class won’t change but the model will become even more confident, though only mildly so). This interpretation can also readily be understood by looking at the decision boundary above.

We are now going to compare this with the original LIME method:

```r
# compare with original lime
data <- expand.grid(1:nrow(model), 1:ncol(model))
colnames(data) <- c("x1", "x2")
train <- data.frame(data, y = predict(model, data))
explainer <- lime(train, model)
## Warning: y does not contain enough variance to use quantile binning. Using
## standard binning instead.

model_type.black_box <- function(x, ...) 'classification'
explanation <- explain(data.frame(x1 = 140, x2 = 145), explainer,
                       n_labels = 1, n_features = 2)
explanation
## # A tibble: 2 x 13
##   model_type case  label label_prob model_r2 model_intercept
##   <chr>      <chr> <chr>      <dbl>    <dbl>           <dbl>
## 1 classific~ 1     p              1   0.0515           0.511
## 2 classific~ 1     p              1   0.0515           0.511
## # ... with 7 more variables: model_prediction <dbl>, feature <chr>,
## #   feature_value <dbl>, feature_weight <dbl>, feature_desc <chr>,
## #   data <list>, prediction <list>

plot_features(explanation, ncol = 1)
```

While the results are not a full match (because of the slightly different approach taken here) the direction and ratio of the magnitude are very similar.

All in all, I think LIME is a very powerful and intuitive method to whiten any black-box model and XAI will be one of the most important and relevant research areas in machine learning and artificial intelligence in the future!

While Excel remains the tool of choice among analysts for business case analysis, presumably because of its low barrier to entry and ease of use, it nonetheless continues to be the source of many vexing and disastrous errors. As the research of Panko and others has revealed, the sources of these errors arise in part from the following:

- Analysts persistently lack QA standards and practices to ensure that spreadsheets are as error-free as possible.
- Cell referencing and replicating formulas across relevant dimensions propagate errors at exponential rates.
- Effective auditing is obscured because the conceptual meaning of a formula is difficult to interpret when it is constructed from grid coordinates.

To suggest a means to remedy that situation, in the spring of 2018, I published Business Case Analysis with R: Simulation Tutorials to Support Complex Business Decisions (*BCAwR*). I also wanted to showcase how the popular programming language **R** can be employed to evaluate business case opportunities that are fraught with uncertainty and not supported by an abundance of well-structured, pedigreed, and relevant data, a situation that plagues just about all business decisions faced in the real world. *BCAwR* takes the reader through the process of analyzing a rather complex commercial investment problem of bringing a chemical processing plant on line. For my post here, I will present a simplified model to give a preview of the contents of the book. I will also provide a little surprise at the end that I hope will contribute to bridging the efforts of the decision and data science communities.

But before we delve into any coding or quantitative activity related to the analysis of any business case, we ought to thoroughly frame our problem: narrowing the scope of our inquiry to the appropriate level for the context of the problem at hand, identifying the goals and objectives we want to achieve when we eventually commit to action, partitioning the set of actionable decisions we can make to achieve our goals, and creating an inventory of the influences and relationships that connect decisions and uncertainties to the measures of success. Although we will look at a much more simplified business problem here, a little framing will still serve us well.

Imagine that we are considering investing in a project that solves a particular market problem with the ultimate desire to generate a financial return to our organization. We identify three possible strategic pathways to compare for the investment, each with a different level of capital commitment required to address certain nuances of the market problem. Each (1) investment, in turn, will generate (2) an initial post investment net cash flow, (3) a peak cash flow, and varying durations of (4) time to reach the peak from the initial cash flow.

To complicate the matter, for each strategy, we do not know the precise quantitative value of the four parameters that characterize the potential cash flow. For this reason, when we evaluate the three strategies, we will employ a Monte Carlo simulation approach that samples a large number of potential realizations for each parameter from probability distributions. With these ideas in mind, we can now set up the coding to handle the simulation problem.

```r
library(leonRdo)   # https://www.incitedecisiontech.com/packages/packages.html
library(tidyverse) # (on CRAN)

# Set the simulation seed.
set.seed(42)

# The number of simulation runs per uncertain variable.
trials <- 1000

# The time frame across which our cash flow simulation will run.
# The initial investment will occur in year 0.
year <- 0:15

# The weighted average cost of capital without any additional risk premium.
discount.rate <- 0.1

# The distinctive capital allocation strategic decisions.
strategy <- c("Strategy 1", "Strategy 2", "Strategy 3")
```

**R** employs simple Monte Carlo techniques, which don’t typically yield stable means for result distributions until many thousands of trials are used. However, when developing a model like this, I suggest setting the trials to 50-100 to make sure that the logic runs correctly in short, responsive time frames. Then, when you’re ready to produce real results, change the value to greater than 10,000. Given that Monte Carlo in **R** is noisy even for reasonably large samples (~1000) and for different seed settings, I recently developed the `leonRdo` package (R Packages: leonRdo & inteRest) to provide a median Latin hypercube approach to sampling that produces much more stable means with approximately 1/10th the number of trials, regardless of the seed value used.
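For intuition, here is a minimal sketch of the idea behind median Latin hypercube sampling. Note that this is my own illustration, not `leonRdo`’s actual implementation, and `rnorm_mlhs_sketch` is a made-up helper: split the probability scale into n equal strata, take the median of each stratum, map them through the quantile function, and shuffle.

```r
# Sketch of median Latin hypercube sampling for a normal distribution
# (illustrative only -- not leonRdo's actual code).
rnorm_mlhs_sketch <- function(n, mean = 0, sd = 1) {
  p <- (seq_len(n) - 0.5) / n               # stratum medians in (0, 1)
  qnorm(sample(p), mean = mean, sd = sd)    # map to quantiles, shuffle order
}

set.seed(42)
mean(rnorm_mlhs_sketch(100, mean = 300, sd = 20)) # pinned almost exactly at 300
mean(rnorm(100, mean = 300, sd = 20))             # plain Monte Carlo mean is noisier
```

Because the strata are symmetric around the median, the sample mean is stabilized by construction, which is why far fewer trials are needed for stable results.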

In this particular showcase discussion, I have included the `tidyverse` package (on CRAN) to help with data manipulation and graphing. In the original *BCAwR*, I based all the code on base **R** so that readers new to **R** and simulation could focus on the key principles rather than the idiosyncratic syntax of secondary packages.

The `set.seed()` function ensures that we will produce the same sample trial set on each run of our model. This will help us debug problems as we develop code. Eventually, as others interact with our code, they will be able to observe the same set of results we work with. Later, we can reset the seed to other values to make sure that our inferences on a given set of trials remain consistent on a different trial set.

Next, we need to declare some important functions that we will call in our model. The first is a function that provides an analytic solution to the differential equation that relates the proportional absorption of an entity into a fixed environment. This ramp-up function takes as a parameter the amount of time `Tp` required to go from, say, `Y` = 1% absorption to `Y` = 99% across `Time`. We will use this function to model the growth of our cash flow over the `year` index.

```r
# Ramp up function.
calcRampUp <- function(Y0, Time, Tp) {
  # Y0 = initial absorption
  # Time = time index
  # Tp = time to Y = 1 - Y0, the time to peak
  Y <- 1 / (1 + (Y0 / (1 - Y0)) ^ (2 * (Time - 1) / (Tp - 1) - 1))
  return(Y)
}
```

The second function we need is the actual business model we want to represent. The business model logic will remain fixed across strategies, and the results will vary only on the basis of the strategy-conditional parameters we supply. By “functionalizing” our model, we can iterate the entire model over variable ranges to understand the sensitivity of the output function to those variables, as we shall see later. Please note that this is a toy model we are using for illustration purposes only. *BCAwR* demonstrates a much more complex consideration for the implementation of a chemical processing plant.

```r
# Business model cash flow.
calcCashFlow <- function(I, SCF, PCF, YP, Y) {
  # I = initial investment
  # SCF = starting cash flow
  # PCF = peak cash flow
  # YP = year of peak cash flow
  # Y = year index
  cf <- (Y == 0) * (-I) +
    (Y > 0) * (SCF + (PCF - SCF) * calcRampUp(0.01, Y, YP))
  return(cf)
}
```

The last function we will employ is the net present value (NPV) function, which we will apply to the individual trial results of the cash flow simulation across strategies.

```r
# The net present value of the cash flow function.
calcNPV <- function(CF, Y, DR) {
  # CF = cash flow vector
  # Y = year index
  # DR = discount rate
  npv <- sum(CF / (1 + DR) ^ Y)
  return(npv)
}
```

Our next task is to create trial samples for the business model cash flow function parameters for the three project strategy investments. For this particular example, I have chosen to use canonical distributions as if their parameters were based on empirical or historical data and to utilize a simple method for generating trials. However, typically I would use a combination of both empirical data and subject matter expert guidance reflected as cumulative probabilities across the range of the assessed variables’ potential values. I explain how to take this latter approach in greater detail in **Section III** of *BCAwR*.

```r
investment1 <- rlnorm_mlhs(n = trials, meanlog = log(800), sdlog = log(1 + 200 / 800))
investment2 <- rlnorm_mlhs(n = trials, meanlog = log(700), sdlog = log(1 + 50 / 700))
investment3 <- rlnorm_mlhs(n = trials, meanlog = log(1000), sdlog = log(1 + 150 / 1000))

start.cash.flow1 <- rnorm_mlhs(n = trials, mean = -100, sd = 20)
start.cash.flow2 <- rnorm_mlhs(n = trials, mean = -90, sd = 5)
start.cash.flow3 <- rnorm_mlhs(n = trials, mean = -120, sd = 15)

peak.cash.flow1 <- rnorm_mlhs(n = trials, mean = 300, sd = 20)
peak.cash.flow2 <- rnorm_mlhs(n = trials, mean = 280, sd = 5)
peak.cash.flow3 <- rnorm_mlhs(n = trials, mean = 375, sd = 15)

yr.peak1 <- rnorm_mlhs(n = trials, mean = 8.5, sd = 0.75)
yr.peak2 <- rnorm_mlhs(n = trials, mean = 10, sd = 0.85)
yr.peak3 <- rnorm_mlhs(n = trials, mean = 11, sd = 1)

# Store the business model parameter samples in a list of data frames.
proj.data <- list(
  investment = data.frame(investment1, investment2, investment3),
  start.cash.flow = data.frame(start.cash.flow1, start.cash.flow2, start.cash.flow3),
  peak.cash.flow = data.frame(peak.cash.flow1, peak.cash.flow2, peak.cash.flow3),
  yr.peak = data.frame(yr.peak1, yr.peak2, yr.peak3)
)
```

For each strategy, we apply the samples from each parameter to the business model cash flow function. This will result in a list of cash flows for the three project strategies. For each strategy, there will be as many cash flows as defined by `trials`, and each cash flow trial will be as long as the `year` index. We can use the `lapply()` and `sapply()` functions to avoid for loops [see also Learning R: A gentle introduction to higher-order functions].

```r
proj.cf <- lapply(1:length(strategy), function(s) {
  sapply(1:trials, function(t) {
    calcCashFlow(
      I   = proj.data$investment[[s]][t],
      SCF = proj.data$start.cash.flow[[s]][t],
      PCF = proj.data$peak.cash.flow[[s]][t],
      YP  = proj.data$yr.peak[[s]][t],
      Y   = year
    )
  })
})
names(proj.cf) <- strategy
```

By running the first five trials of `Strategy 1`, we can see what the cash flows look like.

```r
head(round(proj.cf[[1]][, 1:5], digits = 1), n = length(year))
##         [,1]   [,2]   [,3]   [,4]   [,5]
##  [1,] -852.7 -852.2 -766.7 -974.3 -912.6
##  [2,]  -85.8 -103.1 -117.2 -116.1  -88.1
##  [3,]  -77.8  -93.6 -107.3 -106.9  -79.9
##  [4,]  -55.3  -64.3  -76.7  -79.9  -57.1
##  [5,]    0.0    9.4    0.2  -13.1   -1.6
##  [6,]   98.0  128.7  123.7  102.2   97.9
##  [7,]  201.5  230.2  227.2  215.6  206.8
##  [8,]  265.3  279.2  276.2  279.8  276.9
##  [9,]  292.5  296.5  293.3  305.3  308.0
## [10,]  302.3  301.8  298.5  314.1  319.5
## [11,]  305.6  303.4  300.1  316.9  323.5
## [12,]  306.7  303.9  300.5  317.8  324.8
## [13,]  307.1  304.1  300.7  318.1  325.2
## [14,]  307.2  304.1  300.7  318.2  325.4
## [15,]  307.2  304.1  300.7  318.2  325.4
## [16,]  307.2  304.1  300.7  318.2  325.5
```

We can calculate some summary results for the cash flows for each strategy and plot them. First, we might like to plot the mean cash flow over time.

```r
proj.cf.mean <- as.data.frame(lapply(strategy, function(s) {
  rowMeans(proj.cf[[s]])
}))
names(proj.cf.mean) <- strategy
proj.cf.mean <- cbind(year, proj.cf.mean)

# Plot the mean strategy cash flows.
gg.mean.cf <- ggplot(gather(proj.cf.mean, "strategy", "mean", -year)) +
  geom_line(aes(x = year, y = mean, color = strategy)) +
  geom_point(aes(x = year, y = mean, color = strategy)) +
  labs(title = "Mean Strategy Cash Flow", y = "[$M]")
plot(gg.mean.cf)
```

Then we can calculate the cumulative mean cash flow and plot that for each strategy, too.

```r
proj.ccf.mean <- proj.cf.mean %>%
  mutate_at(vars(strategy), list(~ cumsum(.)))

# Plot the cumulative mean strategy cash flows.
gg.mean.ccf <- ggplot(gather(proj.ccf.mean, "strategy", "mean", -year)) +
  geom_line(aes(x = year, y = mean, color = strategy)) +
  geom_point(aes(x = year, y = mean, color = strategy)) +
  labs(title = "Cumulative Mean Strategy Cash Flow", y = "[$M]")
plot(gg.mean.ccf)
```

Now we can observe the risk profile of each strategy by calculating the NPV of every trial's cash flow.

```r
proj.npv <- as.data.frame(lapply(1:length(strategy), function(s) {
  sapply(1:trials, function(t) {
    calcNPV(CF = proj.cf[[s]][, t], Y = year, DR = discount.rate)
  })
}))
names(proj.npv) <- strategy

# Plot the CDF of the strategies' sample NPVs.
gg.ecdf <- ggplot(gather(proj.npv, "strategy", "NPV"),
                  aes(x = NPV, color = strategy)) +
  stat_ecdf(geom = "point") +
  labs(title = "Strategy NPV Risk Profile",
       x = "Strategy NPV [$M]", y = "Cumulative Probability")
plot(gg.ecdf)
```
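The `calcNPV()` and `calcCashFlow()` functions are defined earlier in the full post and are not shown in this excerpt. For readers of this section alone, here is a minimal sketch of what a discounted-cash-flow helper with `calcNPV()`'s signature might look like; the end-of-year discounting convention is an assumption, not necessarily the one used in the post:

```r
# Hypothetical sketch of an NPV helper with the same signature as the
# calcNPV() used above. CF is a vector of yearly cash flows aligned with
# the years in Y; DR is the annual discount rate.
calcNPV <- function(CF, Y, DR) {
  t <- Y - Y[1] + 1        # discount periods: 1, 2, 3, ...
  sum(CF / (1 + DR)^t)     # sum of discounted cash flows
}

# Example: a $100M investment followed by two $60M returns, at 10%.
round(calcNPV(CF = c(-100, 60, 60), Y = 2020:2022, DR = 0.1), 2)
## [1] 3.76
```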

And we can calculate the mean NPV of the cash flows.

```r
proj.npv.mean <- round(colMeans(proj.npv), digits = 1)
print(proj.npv.mean)
## Strategy 1 Strategy 2 Strategy 3
##      157.4       59.3     -123.6
```

The first observation we make about the risk profiles of the strategies is that we can dismiss `Strategy 3` immediately: it presents a negative mean NPV, the probability that it produces a negative economic value is ~75% (i.e., P(NPV ≤ 0) ≈ 0.75), and practically none of its trials present any opportunity for dominance over any other strategy.
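The ~75% failure probability quoted here is simply the empirical CDF of the simulated NPVs evaluated at zero, i.e., the fraction of trials that end up at or below zero. A toy illustration with made-up sample NPVs (not the post's actual simulation):

```r
# The failure probability is just the empirical CDF at zero.
# Made-up sample NPVs standing in for a strategy's simulated trials:
set.seed(42)
npv.sample <- rnorm(1000, mean = -50, sd = 80)  # hypothetical NPV draws [$M]
mean(npv.sample <= 0)                            # P(NPV <= 0), roughly 0.73 here
```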

When we observe the relative volatility and dominance of the remaining strategies, we realize that we face a bit of ambiguity about how to choose the best pathway forward. `Strategy 1` exhibits the best overall mean NPV, but it does so with the greatest relative volatility. While `Strategy 2` exhibits approximately the same risk of failure (~25%) as `Strategy 1` (~27%), it also exhibits the least maximum exposure and relative volatility. To reduce the ambiguity of choosing, we might like to know which uncertainty, given the overall quality of the information we possess about it, contributes most to switching dominance from `Strategy 1` to `Strategy 2`. Knowing which uncertainties our strategy values are most sensitive to gives us the ability to stop worrying about the others for the purpose of choosing clearly, and to focus only on improving our understanding of those that matter most.

To accomplish that latter feat, we run our functionalized model to test the sensitivity of the strategy means to the 80th percentile range in each of the variables. We start by initializing the lists to hold the sensitivity responses.

```r
data.temp <- proj.data
proj.npv.sens <- list(p10 = proj.data, p90 = proj.data)
d <- data.frame(0, 0, 0)
dd <- list(
  investment = d,
  start.cash.flow = d,
  peak.cash.flow = d,
  yr.peak = d
)
proj.npv.sens.mean <- list(p10 = dd, p90 = dd)
```

We calculate the sensitivity of the strategies’ mean NPVs by fixing each variable to its p10 and p90 quantile values sequentially while all the other variables run according to their defined variation. The result is a chart that shows how sensitive our decision to pursue the best strategy over the next best might be to the outcome of a given uncertainty.

```r
p1090 <- c(0.1, 0.9)
for (q in 1:length(p1090)) {
  for (v in 1:length(proj.data)) {
    for (s in 1:length(strategy)) {
      data.temp[[v]][s] <- rep(quantile(unlist(proj.data[[v]][s]),
                                        probs = p1090[q]), trials)
      for (t in 1:trials) {
        proj.npv.sens[[q]][[v]][t, s] <- calcNPV(
          CF = calcCashFlow(
            I = data.temp$investment[[s]][t],
            SCF = data.temp$start.cash.flow[[s]][t],
            PCF = data.temp$peak.cash.flow[[s]][t],
            YP = data.temp$yr.peak[[s]][t],
            Y = year
          ),
          Y = year,
          DR = discount.rate
        )
      }
      proj.npv.sens.mean[[q]][[v]][s] <- mean(proj.npv.sens[[q]][[v]][, s])
      data.temp <- proj.data
    }
  }
}

# Recast the sensitivity values to a form that can be plotted.
variable.names <- c("investment", "start.cash.flow", "peak.cash.flow", "yr.peak")
proj.npv.sens.mean2 <- as.data.frame(sapply(1:length(p1090), function(q) {
  sapply(1:length(proj.data), function(v) {
    sapply(1:length(strategy), function(s) {
      proj.npv.sens.mean[[q]][[v]][[s]]
    })
  })
})) %>%
  rename(p10 = V1, p90 = V2) %>%
  mutate(
    variable = rep(variable.names, each = length(strategy)),
    strategy = rep(strategy, times = length(proj.data)),
    strategy.mean = rep(t(proj.npv.mean), times = length(proj.data))
  ) %>%
  select(variable, strategy, strategy.mean, p10, p90)

gg.sens <- ggplot(proj.npv.sens.mean2,
                  aes(x = variable, y = p10, yend = p90, color = strategy)) +
  geom_segment(aes(x = variable, xend = variable, y = p10, yend = p90),
               color = "gray50", size = 0.75) +
  geom_point(aes(x = variable, y = strategy.mean, color = strategy),
             size = 1, color = "black", fill = "black", shape = 23) +
  geom_point(aes(x = variable, y = p90), size = 3) +
  geom_point(aes(x = variable, y = p10), size = 3) +
  labs(title = "Sensitivity of Strategy Mean NPV to Uncertain Variable Ranges",
       y = "Strategy NPV [$M]") +
  coord_flip() +
  theme(legend.position = "none") +
  facet_wrap(~ strategy)
plot(gg.sens)
```

This sensitivity chart is valuable not only to the decision analyst but to the data analyst as well, for one simple reason: now that we know which uncertainties potentially expose us to the greatest risk or reward, we can calculate the value of information on those uncertainties. This value of information represents the rational maximum budget we should allocate to acquiring better data and insight on those uncertainties. While a typical business case analysis of sufficient complexity may contain 10-100 uncertainties, we should focus our research budget only on those that are critical to making a decision. This simple “stopping function” helps ensure that our data science resources are allocated properly and most efficiently. **Section IV** of *BCAwR* provides a means to calculate the value of information on continuous variables like the ones we have utilized in our example here.

Business decision analysts can definitely use **R** effectively to conduct many of their business case analyses in a more transparent and less error-prone environment than their current tools typically allow, especially through the use of clearly named variables, array abstraction, and functionalized structural models. However, the real power that R provides is the means to simulate information when high-quality, large-scale data may be scarce, infeasible, or impossible to obtain. Then, by using concepts such as the value of information, data scientists can be called upon to refine the quality of information on those critical uncertainties that may frustrate success when commitment to action turns into real decisions. I hope the simple example presented here inspires you to learn more.

Happy New Year to all of you! 2020 is here and it seems that we are being overwhelmed by more and more irrationality, especially in the form of conspiracy theories.

In this post, I will give you some indication that this might actually not be the case (shock horror: good news alert!). We will be using *Google Trends* for that. If you want to know what Google Trends is, and how to query it from within R and process the retrieved data, read on!

Google handles nearly 90% of all search queries worldwide (which might be a problem in itself, but that is not our concern in this post)! Wouldn’t it be interesting to see what people search for on Google and how that varies over time? Google Trends provides just that: it offers a glimpse into Google’s big database so you can observe the ups and downs of any search term you are interested in, going back to 2004! And the best thing: it is completely free!

…and of course, there is an R package for that too: the `gtrendsR` package (on CRAN). We create a little function which retrieves global Google search data from 2004 until today, plots it, and calculates its growth rate, just to illustrate how easily we can use those data for further analyses:

```r
library(gtrendsR)

google_trends <- function(keyword) {
  pres_data <- gtrends(keyword = keyword, time = "all", onlyInterest = TRUE)
  plot(pres_data)
  hits <- pres_data$interest_over_time$hits
  last <- length(hits)
  round((mean(hits[(last-10):last]) / mean(hits[1:10]) - 1) * 100) # percentage rise
}
```
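The growth-rate line of the function compares the mean of the last eleven observations with the mean of the first ten. A toy illustration of just that calculation, using a made-up `hits` vector rather than real Google Trends data:

```r
# Toy illustration of the growth-rate line in google_trends(): compare
# the mean of the last eleven observations with the mean of the first ten.
hits <- c(rep(50, 10), rep(0, 20), rep(75, 11))  # made-up "hits", not real data
last <- length(hits)
round((mean(hits[(last - 10):last]) / mean(hits[1:10]) - 1) * 100)
## [1] 50
```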

Now, we are going to use this function to find out how certain search terms for well-known conspiracy theories developed over time.

Let us start with “chemtrails”, the conspiracy theory that the condensation trails of planes are some kind of chemical or biological agent being sprayed for nefarious purposes to harm the general public. Admittedly, it is obviously not the cleanest air that comes out of big airplane turbines, but the claim that there is some hidden agenda implemented by some evil forces is just ludicrous. Let us see how this search term develops over time (all data were obtained January 1, 2020):

```r
google_trends("chemtrails")
## [1] -26
```

This seems to be good news: after a peak around 2016, fewer and fewer people seem to be interested in this kind of stuff (perhaps they are brainwashed by those chemicals so that they don’t google them anymore).

Another well-known conspiracy theory is the “flat earth”, the belief that we don’t live on a globe but that the earth is literally flat! I cannot even imagine how one can believe such a thing but obviously, there are people out there (called “flat earthers”)… Let us have a look:

```r
google_trends("flat earth")
## [1] 182
```

It used to be on the decline, but all of a sudden, from 2015 on, it was trending strongly; fortunately, it seems to be in sharp decline again. Just to be clear: those numbers are always relative because Google doesn’t give any absolute numbers, and not everybody who searches for this stuff is a believer… yet the more people search for it, the more this topic is, well, sought after.

Another big area where conspiracy theories thrive is the topic of *extraterrestrials*. Let us see how the search for “Unidentified Flying Objects” a.k.a. “UFOs” is trending:

```r
google_trends("ufos")
## [1] -77
```

Fewer and fewer people seem to be interested… A special place in the minds of many conspiracy theorists is held by the infamous “Area 51”, a highly classified United States Air Force facility located in Nevada. Here, so the story goes, the US government hides evidence of aliens who have visited earth. Let us have a look:

```r
google_trends("area 51")
## [1] 277
```

This one is interesting because of the huge peak in 2019. What had happened? Another hoax, this time a real one: an anonymous Facebook post called for a storming of the area at the end of September 2019 to look for extraterrestrial life, and the post went viral. Wikipedia knows how it ended:

While the event was intended as comedic, some took it seriously, and traveled to areas surrounding the facility. Beginning on September 19, the day before the planned event date, people were reported to be showing up and camping around Rachel in preparation for the raid.

The Lincoln County Sheriff stated about 1,500 people showed up at the festivals, while over 150 people made the journey over several miles of rough roads to get near the gates to Area 51. While only one person ever crossed the boundary, receiving a warning, six others were arrested for crimes including public urination, alcohol related offences and indecent exposure.

Now, many of those conspiracy theories boil down to dark forces that act in the shadows… but who are they? Especially through the books (and subsequent movies) by Dan Brown one name was brought into the limelight (literally so to speak):

```r
google_trends("Illuminati")
## [1] -45
```

We can clearly see a peak in 2009, which should be due to the film. Around 2012 there is a huge surge (does anybody know why? Please leave your ideas in the comments below!) and from there on a constant decline. Altogether we can see a decline of 45% since 2004.

Finally, we come to a particularly disgusting variant of this kind of conspiracy theory, those based on *antisemitism*. “The Protocols of the Elders of Zion” is a fabricated text purporting to describe a Jewish plan for world domination. It was first published in Russia in 1903 and afterwards translated into many languages. Among other things, it was used by the Nazis to justify the genocide of millions of Jews. So let us see how many people are interested in this kind of dubious stuff:

```r
google_trends("Elders of Zion")
## [1] -86
```

It is good to see that interest seems to go down considerably! A more recent manifestation of this idea is what is called the “New World Order (NWO)”. It is often also heavily associated with antisemitic undercurrents. Let us see:

```r
google_trends("New World Order")
## [1] -75
```

After some peaks in the period between 2008 and 2010, there is a pronounced decline, 75% since 2004! The so-called “New World Order” seems pretty old now… and that’s a good thing!

So, all in all, it seems that, at least for the terms we tested here, interest goes down, often considerably. I would think that this could at least be an indication that people are, on average, getting more rational… in contrast to the pessimistic trope that we are overwhelmed by *fake news* and *trolls*.

What do you think about the analysis above? Do you have any sources that could corroborate or falsify this hypothesis? And do you have other interesting search terms that reveal more interesting patterns? Please put everything in the comments below!

After my little rant (which went viral!) about the

ASCII Art is the art of painting pictures with letters, or more specifically, with ASCII characters. This works because different characters cover smaller or bigger areas, e.g. compare a dot “.” to the “@” sign! So the basic idea is to evaluate the brightness of each pixel (or block of pixels) and replace it with an ASCII character. Simon Barthelmé (who is also the author of the `imager` package!) has written an excellent post on how to do that with R: Generating ASCII art using imager and ggplot2.

Here I will give you a 100% tidyverse-free version of this code in base R! I don’t add any additional comments because it is more or less a direct translation of the original code, and I don’t go into the details either because that would go beyond the scope of this post.

You can find the image we will be using on Pixabay: Santa Claus. Before using `imager` (on CRAN) it might be necessary to install GraphicsMagick first.

```r
library(imager)
## Loading required package: magrittr
##
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
##
##     add
## The following objects are masked from 'package:stats':
##
##     convolve, spectrum
## The following object is masked from 'package:graphics':
##
##     frame
## The following object is masked from 'package:base':
##
##     save.image

im <- load.image("D:/Downloads/santa-claus-1628845_1920.jpg")
plot(im)
```

```r
asc <- gtools::chr(38:126)
g_chr <- function(chr) mean(grayscale(implot(imfill(50, 50, val = 1),
                                             text(25, 25, chr, cex = 5))))
g_chr <- Vectorize(g_chr)
g <- g_chr(asc)
n <- length(g)
char <- asc[order(g)]

d <- as.data.frame(imresize(grayscale(im), 0.1))
d$qv <- cut(d$value, c(quantile(d$value, seq(0, 1, 1/n))),
            include.lowest = TRUE, labels = FALSE)
d$char <- char[d$qv]

plot(d$x, d$y, xlab = "x", ylab = "y", ylim = rev(range(d$y)), type = "n")
text(d$x, d$y, d$char, cex = 0.25)
```
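The heart of the code above is the `cut()` call that bins brightness values into character indices. A miniature, self-contained sketch of that mapping, needing no image package; the ten-step dark-to-light ramp below is a commonly used example, not the ramp the post computes with `g_chr()`:

```r
# The core mapping without any image package: bin brightness into as many
# levels as there are characters, then index a dark-to-light ramp.
# The ten-step ramp below is a common example, not the ramp derived above.
ramp <- c("@", "%", "#", "*", "+", "=", "-", ":", ".", " ")  # dark -> light
brightness <- c(0.05, 0.45, 0.95)  # toy pixel values in [0, 1]
bins <- cut(brightness,
            breaks = seq(0, 1, length.out = length(ramp) + 1),
            include.lowest = TRUE, labels = FALSE)
ramp[bins]
## [1] "@" "+" " "
```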

I think the base R code is certainly different but no less elegant, and in some cases it is even clearer what is going on. If you have any comments, please let me know below!

Merry Christmas, Happy Holidays and a Happy New Year to you all! I will be back on the 7th of January 2020… stay tuned!
