In this post, we will create a small network (also called a *graph* in mathematics) and ask which is the “most important” node (also called *vertex*, pl. *vertices*). If you want to understand important concepts of *network centrality* and how to calculate such measures in R, read on!

This post is based on a LinkedIn post by renowned data scientist Dr. Keith McNulty. Let us (re-)create the small example network from there by first defining the *adjacency matrix* and then plotting it with the `igraph` package (on CRAN). We have used this package already in another post on networks: Google’s Eigenvector… or: How a Random Surfer Finds the Most Relevant Webpages.

```r
library(igraph)
## Warning: package 'igraph' was built under R version 4.0.2
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union

# define simple network
#              A, B, C, D, E, F, G, H, I, J, K, L, M, N
A <- matrix(c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # A
              1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # B
              0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # C
              1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,  # D
              0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # E
              0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,  # F
              0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,  # G
              0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,  # H
              0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0,  # I
              0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,  # J
              0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,  # K
              0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,  # L
              0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,  # M
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0   # N
              ), nrow = 14)
colnames(A) <- rownames(A) <- LETTERS[1:ncol(A)]
A # adjacency matrix
##   A B C D E F G H I J K L M N
## A 0 1 0 1 0 0 0 0 0 0 0 0 0 0
## B 1 0 0 1 0 0 0 0 0 0 0 0 0 0
## C 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## D 1 1 1 0 1 1 1 1 0 0 0 0 0 0
## E 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## F 0 0 0 1 0 0 1 0 0 0 0 0 0 0
## G 0 0 0 1 0 1 0 1 1 0 0 0 0 0
## H 0 0 0 1 0 0 1 0 1 0 0 0 0 0
## I 0 0 0 0 0 0 1 1 0 1 0 0 1 0
## J 0 0 0 0 0 0 0 0 1 0 1 1 0 0
## K 0 0 0 0 0 0 0 0 0 1 0 1 0 0
## L 0 0 0 0 0 0 0 0 0 1 1 0 0 0
## M 0 0 0 0 0 0 0 0 1 0 0 0 0 1
## N 0 0 0 0 0 0 0 0 0 0 0 0 1 0

g <- graph_from_adjacency_matrix(A, mode = "undirected")
set.seed(258)
oldpar <- par(mar = c(1, 1, 1, 1))
plot(g)
par(oldpar)
```

McNulty writes in his post:

I love to use this example when I teach about network analysis. I ask the group: who is the most important person in this network?

Now, what does “most important” person mean? It of course depends on the definition and this is where *network centrality measures* come into play. We will have a look at three of those (there are many more out there…).

McNulty explains:

Degree centrality tells you the most connected person: it is simply the number of nodes connected to each node, and it’s easy to see that

D has the highest (7).

This is often the only metric given to identify “influencers”: how many followers do they have?

Degree centrality is easy to calculate in R (first “by hand”, then with the `igraph` package):

```r
rowSums(A)
## A B C D E F G H I J K L M N 
## 2 2 1 7 1 2 4 3 4 3 2 2 2 1

degree(g)
## A B C D E F G H I J K L M N 
## 2 2 1 7 1 2 4 3 4 3 2 2 2 1
```

McNulty explains:

Closeness centrality tells you who can propagate information quickest: you sum the path lengths from your node to each other node and then inverse it.

G has four paths of length 1, 6 of length 2 and 3 of length 3. Which gives it a closeness centrality of 1/25. With the other main candidates, I is 1/26, H is 1/26 and D is 1/27.
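These hand counts are easy to verify with one line of arithmetic (the path counts are taken straight from the quote above):

```r
# G: 4 paths of length 1, 6 of length 2, 3 of length 3
1 / (4 * 1 + 6 * 2 + 3 * 3) # = 1/25
## [1] 0.04
```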

One application that comes to mind is identifying so-called superspreaders of infectious diseases, like COVID-19.

This is a little bit more involved. The simplest approach is to first convert the adjacency matrix into a *distance matrix*, which contains the lengths of the shortest paths between each pair of nodes (I won’t go into the details; some pointers are given in the comments of the code):

```r
# exponentiate the n x n adjacency matrix to the n'th power in the min-plus algebra,
# i.e. instead of adding take the minimum and instead of multiplying take the sum
# more details: https://en.wikipedia.org/wiki/Distance_matrix#Non-metric_distance_matrices

# distance product
"%C%" <- function(A, B) {
  n <- nrow(A)
  A[A == 0] <- Inf
  diag(A) <- 0
  B[B == 0] <- Inf
  diag(B) <- 0
  C <- matrix(0, nrow = n, ncol = n)
  for (i in 1:n) {
    for (j in 1:n) {
      tmp <- numeric(n)
      for (k in 1:n) {
        tmp[k] <- A[i, k] + B[k, j]
      }
      C[i, j] <- min(tmp)
    }
  }
  colnames(C) <- rownames(C) <- rownames(A)
  C
}

# calculate distance matrix by repeated min-plus squaring
DM <- function(A) {
  D <- A
  for (n in 1:nrow(A)) {
    D <- D %C% D
  }
  D
}

D <- DM(A) # distance matrix
D
##   A B C D E F G H I J K L M N
## A 0 1 2 1 2 2 2 2 3 4 5 5 4 5
## B 1 0 2 1 2 2 2 2 3 4 5 5 4 5
## C 2 2 0 1 2 2 2 2 3 4 5 5 4 5
## D 1 1 1 0 1 1 1 1 2 3 4 4 3 4
## E 2 2 2 1 0 2 2 2 3 4 5 5 4 5
## F 2 2 2 1 2 0 1 2 2 3 4 4 3 4
## G 2 2 2 1 2 1 0 1 1 2 3 3 2 3
## H 2 2 2 1 2 2 1 0 1 2 3 3 2 3
## I 3 3 3 2 3 2 1 1 0 1 2 2 1 2
## J 4 4 4 3 4 3 2 2 1 0 1 1 2 3
## K 5 5 5 4 5 4 3 3 2 1 0 1 3 4
## L 5 5 5 4 5 4 3 3 2 1 1 0 3 4
## M 4 4 4 3 4 3 2 2 1 2 3 3 0 1
## N 5 5 5 4 5 4 3 3 2 3 4 4 1 0

1 / rowSums(D)
##          A          B          C          D          E          F          G 
## 0.02631579 0.02631579 0.02564103 0.03703704 0.02564103 0.03125000 0.04000000 
##          H          I          J          K          L          M          N 
## 0.03846154 0.03846154 0.02941176 0.02222222 0.02222222 0.02777778 0.02083333

closeness(g)
##          A          B          C          D          E          F          G 
## 0.02631579 0.02631579 0.02564103 0.03703704 0.02564103 0.03125000 0.04000000 
##          H          I          J          K          L          M          N 
## 0.03846154 0.03846154 0.02941176 0.02222222 0.02222222 0.02777778 0.02083333
```

McNulty explains:

Betweenness centrality tells you who is most important in maintaining connection throughout the network: it is the number of times your node is on the shortest path between any other pair of nodes.

I uniquely connects all nodes on the left with all nodes on the right, which means it connects 8×5 = 40 pairs, plus any node in the top right with the bottom right, a further 6 pairs, so 46 in total. If you follow a similar process for D, H and G you’ll see that they don’t come close to this.

For example in protein-interaction networks, betweenness centrality can be used to find important proteins in signalling pathways which can form targets for drug discovery.

The actual algorithm to calculate betweenness centrality is much too involved to show here (if you have a simple-to-understand algorithm, please let me know on StackOverflow or in the comments), so we will just make use of the `igraph` package to calculate it:

```r
betweenness(g)
##    A    B    C    D    E    F    G    H    I    J    K    L    M    N 
##  0.0  0.0  0.0 41.5  0.0  0.0 21.5 15.0 46.0 22.0  0.0  0.0 12.0  0.0
```
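Since the post above explicitly asks for a simple-to-understand algorithm, here is a naive sketch of my own (neither McNulty’s nor igraph’s method, and far from efficient): first count the number of shortest paths between all pairs by dynamic programming over path lengths, then sum up, for every node, the fraction of shortest paths passing through it. It assumes the adjacency matrix `A` and the distance matrix `D` from above:

```r
# sigma[s, t]: number of shortest paths between s and t,
# built up by increasing path length
naive_betweenness <- function(A, D) {
  n <- nrow(A)
  sigma <- diag(n) # one (trivial) path from each node to itself
  for (len in 1:max(D)) {
    for (s in 1:n) for (t in 1:n) {
      if (D[s, t] == len) {
        # neighbours of t that lie one step closer to s
        preds <- which(A[, t] == 1 & D[s, ] == len - 1)
        sigma[s, t] <- sum(sigma[s, preds])
      }
    }
  }
  # a node v lies on a shortest s-t path iff D[s, v] + D[v, t] == D[s, t];
  # add the fraction of shortest s-t paths that pass through v
  b <- numeric(n)
  for (v in 1:n) for (s in 1:n) for (t in 1:n) {
    if (s < t && s != v && t != v && D[s, v] + D[v, t] == D[s, t]) {
      b[v] <- b[v] + sigma[s, v] * sigma[v, t] / sigma[s, t]
    }
  }
  setNames(b, rownames(A))
}

# naive_betweenness(A, D) # should reproduce betweenness(g)
```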

As we have seen, there is more than one definition of “most important”. It will depend on the context (and the available information) which one to choose.

Please let me know your thoughts in the comments and please share further possible applications with us.

We’re taking our summer break! Look forward to the next post on October 6, 2020… and stay healthy!

So, is there a method to find the respective proportion of people without putting them on the spot? Actually, there is! If you want to learn about *randomized response* (and how to create flowcharts in R along the way), read on!

The question is how you can get a truthful result overall without being able to attribute a certain answer to any single individual. As it turns out, there is a very elegant and ingenious method called *randomized response*. The big idea is, as the name suggests, to add noise to every individual answer, but in such a way that the noise *cancels out overall*, leaving the overall proportion largely intact!

Big tech companies like Google and Microsoft also use this method, e.g. to collect telemetry data in a *privacy-preserving* manner. The broader concept is called *differential privacy*, which comes in *local* and *global* variants: global methods are used when the data has already been collected, while local methods concern the process of data collection itself. We will focus on the latter here.

One way to do randomized response is the following: each participant is asked to flip two coins. If the first coin comes up heads, she answers the question truthfully. Otherwise, she answers “yes” if the second coin came up heads and “no” if it came up tails. This way nobody can find out whether the participant answered truthfully or not: her answer could have been produced by randomness alone. Yet, the great thing is that when asking a whole group of people it is possible to recover the true proportion because the added random noise cancels out… we will see how later in the post.

The process is depicted in the following flowchart, created with R and the `diagram` package (on CRAN):

```r
library(diagram)
## Loading required package: shape

oldpar <- par(mar = c(1, 1, 1, 1))
openplotmat()
elpos <- coordinates(c(1, 1, 2, 4))
fromto <- matrix(ncol = 2, byrow = TRUE,
                 data = c(1, 2, 2, 3, 2, 4, 4, 7, 4, 8))
nr <- nrow(fromto)
arrpos <- matrix(ncol = 2, nrow = nr)
for (i in 1:nr)
  arrpos[i, ] <- straightarrow(to = elpos[fromto[i, 2], ],
                               from = elpos[fromto[i, 1], ],
                               lwd = 2, arr.pos = 0.6, arr.length = 0.5)
textellipse(elpos[1, ], 0.1, lab = "START", box.col = "green",
            shadow.col = "darkgreen", shadow.size = 0.005, cex = 1.5)
textrect(elpos[2, ], 0.15, 0.05, lab = "1st coin flip", box.col = "grey",
         shadow.col = "black", shadow.size = 0.005, cex = 1.5)
textrect(elpos[4, ], 0.15, 0.05, lab = "2nd coin flip", box.col = "grey",
         shadow.col = "black", shadow.size = 0.005, cex = 1.5)
textellipse(elpos[3, ], 0.1, 0.1, lab = c("True", "Answer"), box.col = "orange",
            shadow.col = "red", shadow.size = 0.005, cex = 1.5)
textellipse(elpos[7, ], 0.1, 0.1, lab = "Yes", box.col = "orange",
            shadow.col = "red", shadow.size = 0.005, cex = 1.5)
textellipse(elpos[8, ], 0.1, 0.1, lab = "No", box.col = "orange",
            shadow.col = "red", shadow.size = 0.005, cex = 1.5)
text(arrpos[2, 1] + 0.06, arrpos[2, 2], "50%")
text(arrpos[3, 1] - 0.06, arrpos[3, 2], "50%")
text(arrpos[4, 1] - 0.03, arrpos[4, 2] + 0.05, "25%")
text(arrpos[5, 1] + 0.03, arrpos[5, 2] + 0.05, "25%")
par(oldpar)
```

Of course, this is just one potential way of doing it, and it doesn’t have to be real coin flips: the whole process could run transparently in the background while personal data is collected automatically. Yet, to give you some feeling for the method, let us do a toy example next.

We now simulate a survey of 200 people who get asked an embarrassing question, e.g. whether they are an alcoholic. In our sample, about 15% are alcoholics (which is about the real rate in Western countries, believe it or not!):

```r
set.seed(123)
n <- 200
true_responses <- sample(c(TRUE, FALSE), size = n, replace = TRUE,
                         prob = c(0.15, 0.85)) # only responders know
table(true_responses) / n
## true_responses
## FALSE  TRUE 
##  0.85  0.15
```

We now use the above method to get an estimate of the proportion without embarrassing single individuals:

```r
randomized_response <- function(true_response)
  ifelse(sample(c(TRUE, FALSE), 1), true_response, sample(c(TRUE, FALSE), 1))
randomized_response <- Vectorize(randomized_response)

randomized_responses <- randomized_response(true_responses) # what comes out of survey
```

How can we actually get rid of the noise? Mathematician Dr. John D. Cook explains (source: Randomized response, privacy, and Bayes theorem):

How can [one] estimate p, the [true] proportion […]? Around half […] will get a head on the first coin and answer truthfully; the rest will look at the second coin and answer yes or no with equal probability. So the expected proportion of yes answers is Y = 0.5p + 0.25, and we can estimate p as 2Y – 0.5.

Let us actually do this for our little example:

```r
2 * sum(randomized_responses) / n - 0.5
## [1] 0.14
```

With an estimate of 14% we are obviously not far off from the true proportion (15%). To find out how many people answered truthfully, we do another quick calculation:

```r
sum(true_responses == randomized_responses) / n
## [1] 0.75
```

This makes sense since 50% give the true answer anyway and 25% by chance, which makes 75% altogether.
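With only 200 respondents there is, of course, sampling noise in these figures. A quick sanity check of Cook’s formula with a much larger (hypothetical) sample of my own shows the estimator homing in on the true proportion:

```r
set.seed(42)
p_true <- 0.15 # assumed true proportion
n_big <- 1e6
truth  <- runif(n_big) < p_true
first  <- runif(n_big) < 0.5 # first coin: heads -> answer truthfully
second <- runif(n_big) < 0.5 # second coin: heads -> "yes"
answer <- ifelse(first, truth, second)

Y <- mean(answer) # proportion of "yes", expected to be 0.5 * p_true + 0.25
2 * Y - 0.5       # estimate of p_true, very close to 0.15
```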

Who would have thought that a little bit of randomness in the right place could reveal hidden truths people are normally ashamed to admit?


There are many so-called *Computer Algebra Systems (CAS)* out there, commercial as well as open-source. One very mature one is called *YACAS* (for *Yet Another Computer Algebra System*). You can find the documentation here: Yacas Documentation (many of the following examples are taken from there).

You can use its full power in R by installing the `Ryacas` package from CRAN. You can issue Yacas commands directly by invoking the `yac_str` function; the `as_r` function converts the output to R. Let us first *simplify a mathematical expression*:

```r
library(Ryacas)
## 
## Attaching package: 'Ryacas'
## The following object is masked from 'package:stats':
## 
##     deriv
## The following objects are masked from 'package:base':
## 
##     %*%, determinant, diag, diag<-, I, lower.tri, upper.tri

# simplify expressions
as_r(yac_str("Simplify(a*b*a^2/b-a^3)"))
## [1] 0
```

Or *solve an equation*:

```r
as_r(yac_str("Solve(a+x*y==z,x)"))
## [1] "x==-(a-z)/y"
```

And you can do all kinds of tedious stuff that is quite error-prone when done by hand, e.g. *expanding* expressions like (x − 2)^20 by using the binomial theorem:

```r
as_r(yac_str("Expand((x-2)^20)"))
## expression(x^20 - 40 * x^19 + 760 * x^18 - 9120 * x^17 + 77520 * 
##     x^16 - 496128 * x^15 + 2480640 * x^14 - 9922560 * x^13 + 
##     32248320 * x^12 - 85995520 * x^11 + 189190144 * x^10 - 343982080 * 
##     x^9 + 515973120 * x^8 - 635043840 * x^7 + 635043840 * x^6 - 
##     508035072 * x^5 + 317521920 * x^4 - 149422080 * x^3 + 49807360 * 
##     x^2 - 10485760 * x + 1048576)
```

To demonstrate how easily the results can be integrated into R, let us do some *curve sketching* on a function. First, we define two helper functions: one for converting an expression into a function (which can then be used to plot it) and one for determining the *derivative of order n* of some function (we redefine the `D` function for that):

```r
as_function <- function(expr) {
  as.function(alist(x =, eval(parse(text = expr))))
}

# redefine D function
D <- function(eq, order = 1) {
  yac_str(paste("D(x,", order, ")", eq))
}
```

Now, we define the function (in this case a simple *polynomial*), determine the first and second derivatives symbolically and plot everything:

```r
xmin <- -5
xmax <- 5
eq <- "2*x^3 - 3*x^2 + 4*x - 5"

eq_f <- as_function(eq)
curve(eq_f, xmin, xmax, ylab = "y(x)")
abline(h = 0, lty = 2)
abline(v = 0, lty = 2)

D_eq <- D(eq)
D_eq
## [1] "6*x^2-6*x+4"

D_eq_f <- as_function(D_eq)
curve(D_eq_f, xmin, xmax, add = TRUE, col = "blue")

D2_eq <- D(eq, 2)
D2_eq
## [1] "12*x-6"

D2_eq_f <- as_function(D2_eq)
curve(D2_eq_f, xmin, xmax, add = TRUE, col = "green")
```
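As a cross-check (my own addition, not from the Yacas documentation): base R’s `stats::D` can differentiate such simple expressions too. We call it by its full name because we redefined `D` above:

```r
# symbolic derivative of the polynomial with base R
d <- stats::D(expression(2*x^3 - 3*x^2 + 4*x - 5), "x")
d # unsimplified, but equivalent to Yacas' 6*x^2 - 6*x + 4

# evaluate the derivative at x = 2: 6*4 - 6*2 + 4 = 16
x <- 2
eval(d)
## [1] 16
```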

Impressive, isn’t it? Yacas can also determine *limits* and *integrals*:

```r
# determine limits
yac_str("Limit(x,0) 1/x")
## [1] "Undefined"
yac_str("Limit(x,0,Left) 1/x")
## [1] "-Infinity"
yac_str("Limit(x,0,Right) 1/x")
## [1] "Infinity"

# integration
yac_str("Integrate(x) Cos(x)")
## [1] "Sin(x)"
yac_str("Integrate(x,a,b) Cos(x)")
## [1] "Sin(b)-Sin(a)"
```

As an example, we can prove in no time that the famous approximation 22/7 ≈ π is actually too big (more details can be found here: Proof that 22/7 exceeds π):

```r
yac_str("Integrate(x,0,1) x^4*(1-x)^4/(1+x^2)")
## [1] "22/7-Pi"
```
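Written out, the identity behind this one-liner is the classical integral

```latex
\int_0^1 \frac{x^4(1-x)^4}{1+x^2}\,dx = \frac{22}{7} - \pi
```

Since the integrand is strictly positive on (0, 1), the right-hand side must be positive as well, hence 22/7 > π.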

And, as the grand finale of this post, Yacas is even able to solve *ordinary differential equations* symbolically! Let us first take the simplest of them all:

```r
as_r(yac_str("OdeSolve(y' == y)"))
## expression(C115 * exp(x))
```

It correctly returns the *exponential function* (the `C`-term is just an arbitrary *constant*). And finally a more complex, higher-order one:

```r
as_r(yac_str("OdeSolve(y'' - 4*y == 0)"))
## expression(C154 * exp(2 * x) + C158 * exp(-2 * x))
```

I still find CAS amazing and extremely useful… and an especially powerful one can be used from within R!


The following post is based on the post “Das Kalman-Filter einfach erklärt” (“The Kalman Filter Simply Explained”), which is written in German and uses Matlab code (so basically two languages nobody is interested in any more). That post is itself based on the online course “Artificial Intelligence for Robotics” by my colleague Professor Sebastian Thrun of Stanford University.

Because we are dealing with uncertainty here, we need a probability distribution. A good choice often is a *Gaussian* or *normal distribution*. It is defined by two parameters, the *mean* and the *variance* (or *standard deviation* which is just the square root of the variance). These two parameters have to be updated by incoming information which is itself uncertain.

This can be interpreted as some form of *Bayesian updating* (if you want to learn more about Bayes and his famous formula, you can read two former posts of mine here: Base Rate Fallacy – or why No One is justified to believe that Jesus rose and Learning Data Science: Sentiment Analysis with Naive Bayes).

Professor Thrun explains (just ignore the quiz at the end of the video):

The *update step* in R code:

```r
update <- function(mean1, var1, mean2, var2) {
  # calculates new position as multiplication of two Gaussians:
  # prior probability and new information (noisy measurement)
  new_mean <- (var2*mean1 + var1*mean2) / (var1 + var2)
  new_var <- 1/(1/var1 + 1/var2)
  return(c(new_mean, new_var))
}
```
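A quick numerical sanity check with made-up values (my own, not from the course): combining a prior N(10, 8) with a more precise measurement N(13, 2) should pull the estimate towards the measurement and shrink the variance below that of either input:

```r
mean1 <- 10; var1 <- 8 # prior position estimate
mean2 <- 13; var2 <- 2 # noisy measurement

(new_mean <- (var2 * mean1 + var1 * mean2) / (var1 + var2))
## [1] 12.4
(new_var <- 1 / (1/var1 + 1/var2))
## [1] 1.6
```

The new mean (12.4) lies closer to the more certain measurement, and the new variance (1.6) is smaller than both input variances: that is exactly what the precision-weighted combination of two Gaussians does.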

After calculating the position based on new information, we also have to take the motion itself into account; this is done in the *prediction step*:

Here is the R implementation:

```r
predict <- function(mean1, var1, mean2, var2) {
  # calculates new position as sum (= convolution) of two Gaussians:
  # prior probability and new information (noisy movement)
  new_mean <- mean1 + mean2
  new_var <- var1 + var2
  return(c(new_mean, new_var))
}
```

Now, we are putting both steps together to form a cycle:

As an example, let us create a trajectory along a sine wave, with measurements and motion affected by noise (= uncertainty):

```r
var_measure <- 5   # variance measurement
var_motion <- 2    # variance motion
pos <- c(0, 10000) # starting values for position and variance

## Kalman calculation
set.seed(123)
pos_real <- 10 * sin(seq(1, 20, 0.1))
motion_real <- diff(pos_real)
measure <- pos_real + rnorm(length(pos_real), 0, sqrt(var_measure))
motion <- motion_real + rnorm(length(motion_real), 0, sqrt(var_motion))

kalman_update <- c()
for (i in 1:length(measure)) {
  pos <- update(pos[1], pos[2], measure[i], var_measure)
  kalman_update <- c(kalman_update, pos[1])
  pos <- predict(pos[1], pos[2], motion[i], var_motion)
}

plot(measure, col = "red")
lines(kalman_update, col = "blue")
lines(pos_real, col = "black")
```

As you can see, the resulting blue curve is much more stable than the noisy measurements (small red circles)!

If you want to read about another algorithm (RANSAC) that is able to handle noisy measurements, you can find it here: Separating the Signal from the Noise: Robust Statistics for Pedestrians.

In any case, stay tuned for more to come!

Forecasting the future has always been one of man’s biggest desires and many approaches have been tried over the centuries. In this post we will look at a simple statistical method for doing just that: so-called *autoregressive (AR)* models.

Let us dive directly into the matter and build an AR model out of the box. We will use the inbuilt `BJsales` dataset, which contains 150 observations of sales data (for more information consult the R documentation). Conveniently enough, AR models can be built directly in base R with the `ar.ols()` function (*OLS* stands for *Ordinary Least Squares*, the method used to fit the actual model). Have a look at the following code:

```r
data <- BJsales
head(data)
## [1] 200.1 199.5 199.4 198.9 199.0 200.2

N <- 3        # how many periods lookback
n_ahead <- 10 # how many periods forecast

# build autoregressive model with ar.ols()
model_ar <- ar.ols(data, order.max = N) # ar-model
pred_ar <- predict(model_ar, n.ahead = n_ahead)
pred_ar$pred
## Time Series:
## Start = 151 
## End = 160 
## Frequency = 1 
##  [1] 263.0299 263.3366 263.6017 263.8507 264.0863 264.3145 264.5372
##  [8] 264.7563 264.9727 265.1868

plot(data, xlim = c(1, length(data) + 15), ylim = c(min(data), max(data) + 10))
lines(pred_ar$pred, col = "blue", lwd = 5)
```

Well, this seems to be good news for the sales team: rising sales! Yet, how does this model arrive at those numbers? To understand what is going on, we will now rebuild the model. Basically, everything is in the name already: *auto-regressive*, i.e. a *(linear) regression* on (a delayed copy of) itself (*auto* from the Ancient Greek for *self*)!

So, what we are going to do is create a delayed copy of the time series and run a linear regression on it. We will use the `lm()` function from base R for that (see also Learning Data Science: Modelling Basics). Have a look at the following code:

```r
# reproduce with lm()
df_data <- data.frame(embed(data, N+1) - mean(data))
head(df_data)
##        X1      X2      X3      X4
## 1 -31.078 -30.578 -30.478 -29.878
## 2 -30.978 -31.078 -30.578 -30.478
## 3 -29.778 -30.978 -31.078 -30.578
## 4 -31.378 -29.778 -30.978 -31.078
## 5 -29.978 -31.378 -29.778 -30.978
## 6 -29.678 -29.978 -31.378 -29.778

model_lm <- lm(X1 ~ ., data = df_data) # lm-model

coeffs <- cbind(c(model_ar$x.intercept, model_ar$ar), coef(model_lm))
coeffs <- cbind(coeffs, coeffs[ , 1] - coeffs[ , 2])
round(coeffs, 12)
##                   [,1]       [,2] [,3]
## (Intercept)  0.2390796  0.2390796    0
## X2           1.2460868  1.2460868    0
## X3          -0.0453811 -0.0453811    0
## X4          -0.2042412 -0.2042412    0

data_pred <- df_data[nrow(df_data), 1:N]
colnames(data_pred) <- names(model_lm$coefficients)[-1]
pred_lm <- numeric()
for (i in 1:n_ahead) {
  data_pred <- cbind(predict(model_lm, data_pred), data_pred)
  pred_lm <- cbind(pred_lm, data_pred[ , 1])
  data_pred <- data_pred[ , 1:N]
  colnames(data_pred) <- names(model_lm$coefficients)[-1]
}

preds <- cbind(pred_ar$pred, as.numeric(pred_lm) + mean(data))
preds <- cbind(preds, preds[ , 1] - preds[ , 2])
colnames(preds) <- NULL
round(preds, 9)
## Time Series:
## Start = 151 
## End = 160 
## Frequency = 1 
##         [,1]     [,2] [,3]
## 151 263.0299 263.0299    0
## 152 263.3366 263.3366    0
## 153 263.6017 263.6017    0
## 154 263.8507 263.8507    0
## 155 264.0863 264.0863    0
## 156 264.3145 264.3145    0
## 157 264.5372 264.5372    0
## 158 264.7563 264.7563    0
## 159 264.9727 264.9727    0
## 160 265.1868 265.1868    0
```

As you can see, the coefficients and predicted values are the same (except for some negligible rounding errors)!

A few things warrant further attention: `embed(data, N+1)` puts the current value in the first column (`X1`) and the `N` delayed copies in the remaining columns (`N` being the number of lookback periods), so the formula `X1 ~ .` regresses the current value on all its lags. To be more precise, it is not just a simple linear regression but a *multiple regression*, because each column (representing a different time delay) enters the model as a separate *(independent) variable*. Additionally, the regression is performed on the *demeaned* data, meaning that the mean is subtracted first.

So, under the hood, what sounds so impressive (“autoregressive model”… wow!) is nothing else but good ol’ linear regression. For this method to work, there must be some *autocorrelation* in the data, i.e. some repeating linear pattern.
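To convince yourself that this equivalence is no accident, here is a small sketch of my own on synthetic data (not from the post): simulate an AR(1) process with a known coefficient of 0.7 and recover that coefficient with plain `lm()`:

```r
set.seed(1)
n <- 500
phi <- 0.7 # true AR(1) coefficient
x <- numeric(n)
for (t in 2:n) x[t] <- phi * x[t - 1] + rnorm(1)

df <- data.frame(embed(x, 2)) # X1 = current value, X2 = lagged copy
coef(lm(X1 ~ X2, data = df))["X2"] # close to 0.7, up to estimation noise
```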

As you can imagine, there are instances where this will not work. For example, in financial time series there is next to no autocorrelation (otherwise it would be too easy, right? – see also my question and the answers on Quant.SE here: Why aren’t econometric models used more in Quant Finance?).

In order to use this model to predict `n_ahead` periods ahead, the predict function first uses the last `N` periods and then feeds the newly predicted values back in as input for the next prediction, and so forth, `n_ahead` times. After that, the mean is added again. Obviously, the farther we predict into the future, the more uncertain the forecast becomes, because the basis of the prediction comprises more and more values that were predicted themselves. The values for both parameters were chosen here for demonstration purposes only. A realistic scenario would be to take more lookback periods than predicted periods, and you would, of course, take domain knowledge into account, e.g. when you have monthly data take at least twelve periods as your `N`.

This post only barely scratched the surface of forecasting time series data. Basically, many of the standard approaches of statistics and machine learning can be adapted for use on time series. Yet, even the most sophisticated method is not able to foresee external shocks (like the current COVID-19 pandemic) or feedback loops, where the very forecasts change the actual behaviour of people.

So, all methods have to be taken with a grain of salt because of those systematic challenges. You should always keep that in mind when you get the latest sales forecast!

Many people think that when they get a positive result on such an antibody test, they are immune to the virus with high probability. If you want to find out why nothing could be further from the truth, read on!

The following article provides a good intuition of why the accuracy of screening tests is highly dependent on the infection rate: Scientific American: Coronavirus Antibody Tests Have a Mathematical Pitfall. They write:

A mathematical wrinkle makes these tests—and in fact, all screening tests—hard to interpret: even with a very accurate test, the fewer people in a population who have a condition, the more likely it is that an individual’s positive result is wrong. If it is, people might think they have the antibodies (and thus may have immunity), when in fact they do not.

We covered this concept in another post already (see Base Rate Fallacy – or why No One is justified to believe that Jesus rose) but because of the importance of the topic let us replicate their analysis with *personograph plots* (also called *Kuiper-Marshall plots*) which are an excellent way to communicate risks (see also Lying with Statistics: One Beer a Day will Kill you!).

Let us first repeat the definitions of *sensitivity* and *specificity* (we covered those here already: Learning Data Science: Understanding ROC Curves):

- **Sensitivity** or **True Positive Rate (TPR)**: the proportion of actual (= true) positives (TP) that are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having (had) the condition).
- **Specificity** or **True Negative Rate (TNR)**: the proportion of actual (= true) negatives (TN) that are correctly identified as such (e.g. the percentage of healthy people who are correctly identified as not having (had) the condition).

That might be quite interesting, but what really counts in the end isn’t the test result itself: it is whether you actually have (had) the virus or not! Those are two different things and therefore we define two additional measures:

- **Positive predictive value (PPV)**: the probability that subjects with a positive screening test truly have (had) the disease.
- **Negative predictive value (NPV)**: the probability that subjects with a negative screening test truly don’t have (have not had) the disease.

Please note that all four figures are *conditional probabilities*, but are going in different directions. For example, TPR gives the probability of a positive test result *given you are sick*, while PPV gives the probability of you being sick *given a positive test result*. We will see that both numbers can (and actually do!) diverge quite considerably.
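This divergence is just Bayes’ theorem at work. A minimal helper function (my own notation, not from the post) makes the dependence of the PPV on the infection rate explicit:

```r
# P(disease | positive test) via Bayes' theorem,
# as a function of sensitivity, specificity and prevalence
ppv <- function(sens, spec, prev) {
  sens * prev / (sens * prev + (1 - spec) * (1 - prev))
}

ppv(0.95, 0.95, 0.05) # at  5% prevalence a positive test is a coin toss
## [1] 0.5
ppv(0.95, 0.95, 0.25) # at 25% prevalence it is quite informative
## [1] 0.8636364
```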

Now we have everything in place to replicate the analysis with R (the `personograph` package can be found on CRAN). First with an infection rate of 5% (which is in the ballpark of the likely infection rate for most countries at the moment):

```r
library(personograph)
## Loading required package: grImport
## Loading required package: grid
## Loading required package: XML

colorlist <- list(FN = "darkred", TP = "red", FP = "darkgreen", TN = "green")
TPR <- 0.95 # true positive rate = sensitivity
TNR <- 0.95 # true negative rate = specificity
IR <- 0.05  # infection rate

data <- list(FN = (1-TPR)*IR, TP = TPR*IR, FP = (1-TNR)*(1-IR), TN = TNR*(1-IR))
NPV <- round(100 * data$TN / (data$TN + data$FN), 1) # negative predictive value
PPV <- round(100 * data$TP / (data$TP + data$FP), 1) # positive predictive value

personograph(data, colors = colorlist,
             fig.title = paste0("PPV = ", PPV, "%, NPV = ", NPV, "%"),
             n.icons = 500, dimensions = c(20, 25))
```

So we can see that even with antibody tests with high values for sensitivity and specificity, a positive test result is no better than the toss of a coin, given the current infection rate! Why? Because for every true positive case (TP, in red) there is a false positive one (FP, in dark green).

The situation gets better with an infection rate of 25%…

```r
IR <- 0.25 # infection rate

data <- list(FN = (1-TPR)*IR, TP = TPR*IR, FP = (1-TNR)*(1-IR), TN = TNR*(1-IR))
NPV <- round(100 * data$TN / (data$TN + data$FN), 1) # negative predictive value
PPV <- round(100 * data$TP / (data$TP + data$FP), 1) # positive predictive value

personograph(data, colors = colorlist,
             fig.title = paste0("PPV = ", PPV, "%, NPV = ", NPV, "%"),
             n.icons = 500, dimensions = c(20, 25))
```

…where you can be nearly 90% confident that a positive test result means that you have (had) the virus! All of this shows the massive impact infection rates can have on the predictive value of these tests for individuals.

If you still doubt this I give you another intuition: the less probable it is that you have (had) the virus (which is the case right now because of the low infection rate) the more evidence you will need to prove the contrary, either by better tests (which we either don’t have at the moment or which are prioritized for the sickest patients) or by additional tests (which are scarcely done because of still limited resources).

On the other hand, if you get a negative result you can be pretty sure that you indeed have not had the virus!

In concrete terms, what would a postdigital world look like in which people — both individually and as a society — used digital technologies sovereignly? With respect to the four primary areas of the techlash critique, a scenario for the future is emerging.

- We’ll only reach for our smartphones when there’s a reason and not because each new reflexive use creates new reasons for the next ritualistic unlocking of the home screen.
- Companies will no longer digitalize their production, internal processes and communications because digitalization is the first commandment of management at the moment, but instead rigorously test each step in the digitalization process against the most important criterion of value creation: What would this change really add to the bottom line? Sham innovations that in truth only complicate things — essential things like production, internal processes and communications — will be eliminated. Intelligent regulation will ensure that competition returns to digital markets and the superprofitable superstar firms will finally pay their fair share of taxes. That would help pay for the public infrastructure that the digital superstars also use in their highly profitable business models. If necessary, some essential digital services would be regulated much like water companies, natural gas suppliers and other basic utilities are today. At least the left wing of the Democratic Party is having similar thoughts, as exemplified by Elizabeth Warren and Bernie Sanders.
- Democratic discourse needs moderation online and offline. The value of opinion in discussion is significantly lower than commonly supposed by those who constantly give their own. Never fear, there will continue to be unmoderated discussion forums. PewDiePie will still have unfettered opportunity to riff on social topics (in between game play videos) on YouTube. But in the postdigital age, essential political debates will take place on discussion platforms where the discourse is serious, fact-based and nuanced, using real names and with pauses built in for reflection instead of a mad rush to respond. The fact check would be an inherent and accepted part of this culture of digital discussion. Truth would be negotiated via some kind of Super-Wikipedia. Even the powerful would express themselves on these platforms instead of through media rooted in visceral emotion like Twitter, or media designed for aesthetic self-promotion like Instagram. Perhaps software and platforms such as “Liquid Democracy” would come into use to enable new forms of direct, grass-roots decision making, especially at the local level. In the meantime, government and public administration will have learned to improve through digital technology.

Technology is never good or bad. It depends on what we use it for. This statement, constantly repeated by tech-fixated idealists, is […] naive. Technology is usually developed for a specific purpose in a socio-technical context.

- Technology is never good or bad. It depends on what we use it for. This statement, constantly repeated by tech-fixated idealists, is on the one hand naive. Technology is usually developed for a specific purpose in a socio-technical context. It’s more suitable for this purpose than for others, and therefore it’s not neutral. On the other hand, it’s of course still true: You can use machine learning to promote the sale of a new digital device with a dreadful environmental impact. Amazon bred its recommendation algorithms with just this kind of value-maximizing function. By using similar systems that learn from data, however, supply and demand could also be better balanced in decentralized energy networks.

In a green postdigital world, the fight against climate change will be a pressing goal, and perhaps the most important goal of new technological development. Perhaps in the fight against global warming, digitalization might even gain a second chance to radically improve the world: with decentralized energy networks, energy-efficient autonomous vehicles or a digitally controlled Cradle to Cradle (circular) economy. New green technologies might not only delay climate change. Digitally supported innovations that promote resilience — that is, that make it easier for people to deal with the actual consequences of climate change — will also become increasingly important.

The future can’t be predicted, but it can be created.

The future can’t be predicted, but it can be created. What’s stopping us from embracing a postdigital future with a radical paradigm for evaluating technology’s costs and benefits? Or we could formulate the question positively: What would we need to make this vision of a postdigital future a reality? Gesche Joost, a scholar of design, Germany’s former Internet Ambassador and a member of SAP’s board of directors, answers the question with two words: “digital sovereignty.” But Joost also wishes that critiques of technology were more constructive. I share this view.

No matter how attractive or even justified the ideas of techlash may be, the perpetual demands for absolute data protection and absolute minimization of data use are as wearying as an endless wait on hold for the next available agent, and as unproductive. The answer to techlash, says Joost, can’t be for people to withdraw into a technological backwater. Constructive criticism of technology, on the other hand, will help us regain our digital sovereignty. This can only succeed if we develop digital technology that corresponds to our own desires and values, improve our ability to use it through digital education and finally turn the concept of a common digital market for like-minded countries into economic reality. More specifically, this means:

- Areas of the western world that have not invested enough in their own technological development, Europe in particular, need a massive and coordinated effort to turn things around. In the discussion of 5G cellphone networks and the potential for Chinese espionage or even sabotage by Huawei, an urgent question has received far too little attention: Why are there no European firms able to compete with Huawei on price and quality?
- Digital education needs a kickstart, beginning with elementary school. This includes data science, coding, and an understanding of social media and the platform economy. The concepts and materials for teaching these topics have been tested and proven successful, but unfortunately are rarely used in public schools. You’re more likely to find them in private or non-profit initiatives such as those of the Open Knowledge Foundation.

If superstar firms construct data monopolies, then the data needs to be made open.

- The third major area of action involves promoting innovation and regulation of digital markets. An important element of this is a consistent data policy. As described in the second section, if superstar firms construct data monopolies, then the data needs to be made open. But that’s only the first important step. Open technical standards, if necessary enforced by law, have the same aim. Creating massive policy incentives for voluntary sharing of data, data cooperation and cross-sector data pools needs to be on the agendas of national legislatures and regulators. And, of course, digitalization’s biggest winners need to pay their corresponding share of taxes, whether in the form of a digital tax using the French model or in the form of a minimum global tax based on international agreements, as the German Foreign Ministry has proposed.

All of that is not just desirable, but the prerequisite for bringing the rebound effects of digitalization under control in the postdigital age.

**Thomas Ramge** is a research fellow at the Weizenbaum Institute in Berlin. He has authored more than a dozen nonfiction books, including *Who’s Afraid of AI?*, *Reinventing Capitalism in the Age of Big Data*, coauthored with Viktor Mayer-Schönberger, and *The Global Economy as You’ve Never Seen It*. Ramge has been honored with multiple journalism and literary awards, including the Axiom Business Book Award’s Gold Medal, the getAbstract International Book Award, strategy+business magazine’s Best Business Book of the Year (in Technology and Innovation), the Herbert Quandt Media Prize, and the German Business Book Prize. He lives in Berlin with his wife and son.

You can follow him on LinkedIn: Thomas Ramge.

Google does it! Facebook does it! Amazon does it for sure!

Especially in the areas of web design and online advertising, everybody is talking about *A/B testing*. If you quickly want to understand what it is and how you can do it with R, read on!

The basic idea of A/B testing is to systematically (and normally automatically) test *two different alternatives*, e.g. two different web designs, and decide which one does better, e.g. in terms of *conversion rate* (i.e. how many people click on a button or buy a product):

The bad news is that you have to understand a little bit about *statistical hypothesis testing*; the good news is that if you read the following post, you have everything you need (plus, as an added bonus, R has all the tools you need already at hand!): From Coin Tosses to p-Hacking: Make Statistics Significant Again! (ok, reading it would make it over one minute…).
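By the way, if you want to experiment before downloading any data, you can simulate comparable A/B data yourself. This is just a sketch: the group size and the conversion rates of 17% and 38% are illustrative values, and all variable names are made up.

```r
set.seed(123)
n <- 294                              # visitors per group (illustrative)
control_clicks <- rbinom(n, 1, 0.17)  # assumed 17% true conversion rate
test_clicks    <- rbinom(n, 1, 0.38)  # assumed 38% true conversion rate
c(control = mean(control_clicks), test = mean(test_clicks))  # observed rates
```

Each simulated visitor either clicks (1) or doesn’t (0), exactly like the `clicked_adopt_today` column in the real dataset below.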

To give you a practical example we will use a dataset from DataCamp’s course on “A/B Testing in R” (experiment_data.csv), which shows for each person in the two groups (`control` and `test`) whether they clicked on the respective offer or not (`clicked_adopt_today`):

```r
experiment <- read.csv("data/experiment_data.csv")
experiment <- experiment[, 2:3]
head(experiment, 10)
##    condition clicked_adopt_today
## 1    control                   0
## 2    control                   1
## 3    control                   0
## 4    control                   0
## 5       test                   0
## 6       test                   0
## 7       test                   1
## 8       test                   0
## 9       test                   0
## 10      test                   1
```

Let us create two tables with the *absolute* and the *relative proportions*:

```r
prop <- table(experiment)
prop_abs <- addmargins(prop)
prop_abs
##          clicked_adopt_today
## condition   0   1 Sum
##   control 245  49 294
##   test    181 113 294
##   Sum     426 162 588

prop_rel <- prop.table(prop, 1)
prop_rel <- round(addmargins(prop_rel, 2), 2)
prop_rel
##          clicked_adopt_today
## condition    0    1  Sum
##   control 0.83 0.17 1.00
##   test    0.62 0.38 1.00
```

Now for the actual test: conveniently enough, R has the `prop.test` function, which tests whether two proportions are *significantly* different (by performing a so-called *Pearson’s chi-squared test* under the hood). We only have to put our original table into the function and R does the rest for us:

```r
prop.test(prop)
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  prop
## X-squared = 33.817, df = 1, p-value = 6.055e-09
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.1442390 0.2911352
## sample estimates:
##    prop 1    prop 2 
## 0.8333333 0.6156463
```

Voilà, that was it already! Because the *p-value* is way below the common threshold of 0.05 the difference is highly significant, so we can reject the *null hypothesis* (that the difference is just due to chance)!

As a consequence, we would definitely go for the design that was presented to the test group in the future.
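To see what `prop.test` does under the hood, we can compute Pearson’s chi-squared statistic by hand, including Yates’ continuity correction, which `prop.test` applies by default to 2×2 tables. This is just an illustrative sketch using the counts from the table above:

```r
# Pearson's chi-squared test with Yates' continuity correction, by hand
obs <- matrix(c(245,  49,
                181, 113), nrow = 2, byrow = TRUE,
              dimnames = list(condition = c("control", "test"),
                              clicked_adopt_today = c("0", "1")))
# expected counts under the null hypothesis of equal proportions
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
X2 <- sum((abs(obs - expected) - 0.5)^2 / expected)  # Yates' correction
X2        # ~33.817, the X-squared value reported by prop.test above
p_value <- pchisq(X2, df = 1, lower.tail = FALSE)
p_value   # ~6.055e-09, the p-value reported by prop.test above
```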

It is not easy to create secure passwords. The best way is to let a computer do it by randomly combining lower- and upper-case letters, digits and other printable characters.

If you want to learn how to write a small function to achieve that, read on!

The exact task is again taken from Rosetta Code:

Create a password generation program which will generate passwords containing random ASCII characters from the following groups:

lower-case letters: a ──► z

upper-case letters: A ──► Z

digits: 0 ──► 9

other printable characters: !"#$%&'()*+,-./:;<=>?@[]^_{|}~

(the above character list excludes white-space, backslash and grave)

The generated password(s) must include at least one of each of the four groups:

lower-case letter,

upper-case letter,

digit (numeral), and

one “other” character.

The user must be able to specify the password length and the number of passwords to generate.

The passwords should be displayed or written to a file, one per line.

The randomness should be from a system source or library.

The program should implement a help option or button which should describe the program and options when invoked.

As often in our “Learning R” posts (for more see here: Category: Learning R) we will give a few hints but let you try to solve it yourself first (so as not to spoil the fun) before showing a possible solution:

- Create a function with the arguments `nl` for the password length and `npw` for the number of passwords to be created. Also, include a logical `help` for the help functionality: `passwords <- function(nl = 8, npw = 1, help = FALSE) {}`
- Use an `if` statement to return some text if `help` is `TRUE`
- Use a `for` loop for creating `npw` passwords
- Use `letters`, `LETTERS`, `0:9` and `c("!", "\"", "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "<", "=", ">", "?", "@", "[", "]", "^", "_", "{", "|", "}", "~")` for the different groups
- Use the `sample` function for sampling from the four different groups
- Start by sampling one character from each of the four groups to ensure that you include at least one of each
- Use `cat(..., "\n", sep = "")` for printing the passwords

This should be more than enough help, now please try to build the function yourself!

Here I give one possible solution (which I also posted on Rosetta Code):

```r
passwords <- function(nl = 8, npw = 1, help = FALSE) {
  if (help) return("gives npw passwords with nl characters each")
  if (nl < 4) nl <- 4
  spch <- c("!", "\"", "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", "-",
            ".", "/", ":", ";", "<", "=", ">", "?", "@", "[", "]", "^", "_",
            "{", "|", "}", "~")
  for (i in 1:npw) {
    # one character from each group guarantees the "at least one" requirement
    pw <- c(sample(letters, 1), sample(LETTERS, 1),
            sample(0:9, 1), sample(spch, 1))
    # fill the rest from the pooled groups, then shuffle the whole password
    pw <- c(pw, sample(c(letters, LETTERS, 0:9, spch), nl - 4, replace = TRUE))
    cat(sample(pw), "\n", sep = "")
  }
}

set.seed(123)
passwords(help = TRUE)
## [1] "gives npw passwords with nl characters each"
passwords(8)
## S2XnQoy*
passwords(14, 5)
## :.iJ=Q7_gP?Cio
## !yUu7OL|eH;}1p
## y2{DNvV^Zl^IFe
## Tj@T19L.q1;I*]
## 6M+{)xV?i|1UJ/
```
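As a quick sanity check (not part of the original post; `has_all_groups` is a hypothetical helper), we can verify with regular expressions that a password contains at least one character from each of the four groups:

```r
# Hypothetical helper: does a password contain at least one lower-case
# letter, one upper-case letter, one digit and one "other" character?
has_all_groups <- function(pw) {
  grepl("[a-z]", pw) && grepl("[A-Z]", pw) &&
    grepl("[0-9]", pw) && grepl("[^A-Za-z0-9]", pw)
}

has_all_groups("S2XnQoy*")  # sample output from above: TRUE
has_all_groups("abcdefgh")  # lower-case letters only: FALSE
```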

When you compare that to most of the other solutions in different programming languages shown on the Rosetta Code page you will appreciate how very powerful R is!

If you have any questions or suggestions please let me know in the comments.

Hope you learned something new today, stay tuned!

Our intuition concerning randomness often leads us astray.

When Apple first introduced its shuffling function on the iPod, customers were irritated and complained that it was not truly random. Oftentimes some titles appeared to be repeated too often while others seemed to have disappeared completely. What was going on?

To illustrate the point I sometimes show my students the following two pics and ask them which was generated by randomness and which by a deterministic rule (you find the used `randtoolbox` package on CRAN):

```r
library(randtoolbox)
## Loading required package: rngWELL
## This is randtoolbox. For an overview, type 'help("randtoolbox")'.

n <- 200
set.seed(2345)
x <- runif(n)
y <- runif(n)

oldpar <- par(mar = c(2, 2, 2, 2) + 0.1)
plot(x, y, ylim = c(0, 1), xlim = c(0, 1), xaxs = "i", yaxs = "i",
     axes = FALSE, frame.plot = TRUE, pch = 16, cex = 2.1)
```

```r
s <- sobol(n, 2, scrambling = 3)
plot(s, ylim = c(0, 1), xlim = c(0, 1), xaxs = "i", yaxs = "i",
     axes = FALSE, frame.plot = TRUE, pch = 16, cex = 2.1)
```

Many a student thinks that the first pic was created by some underlying pattern (because of its points clumping together in some areas while leaving others empty) and that the second one is “more” random. The truth is that technically both are not random (but only *pseudo-random*) but the first resembles “true” randomness more closely while the second is a *low-discrepancy sequence*.

We will come to the point of pseudo-randomness in a moment; for now, note that “true” randomness has a tendency to occur in clusters or clumps (technically called *Poisson clumping*). This is the effect seen (or shall I say heard) in the iPod shuffling function. Apple changed it to a more regular behaviour (in the spirit of the second picture)… which was then perceived to be *more* random (as with my students)!
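This clumping can be made tangible with a small simulation (a sketch, not from the original post; all names are illustrative): shuffle a playlist of 10 artists with 10 songs each and count how often the same artist plays twice in a row.

```r
set.seed(123)
playlist <- rep(1:10, each = 10)   # 10 artists with 10 songs each
back_to_back <- replicate(10000, {
  s <- sample(playlist)            # one random shuffle of all 100 songs
  sum(s[-1] == s[-length(s)])      # adjacent songs by the same artist
})
mean(back_to_back)                 # about 9 same-artist repeats per shuffle
```

So a perfectly random shuffle of this playlist plays the same artist back-to-back about nine times on average, which is exactly the behaviour iPod users complained about.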

Now imagine that the first pic represents some map showing, let’s say, leukaemia in children. Wouldn’t we want to know whether there is some underlying reason for those clusters?!? Now imagine that there is a nuclear power plant near one of the more prominent clusters… just by chance! Oh, dear! Of course, it *could* be the reason for the cancer cases but just by looking at the map no real conclusions can be drawn! The takeaway message is that randomness often seems to have more pronounced patterns than purely deterministic sequences.

Another area where people are easily fooled by randomness is the stock market! Have a look at the following chart:

```r
set.seed(3141)
run <- sample(c(-1, 1), 1e5, replace = TRUE)
plot(cumsum(run), type = "l", xaxs = "i", yaxs = "i",
     axes = FALSE, frame.plot = TRUE)

par(oldpar)
```

So-called *technical analysts* will clearly see what they call a *Double Top pattern* (basically the letter M in the chart) which they interpret as a bearish (= sell) signal. Now before you sell all of your stocks when you encounter something like this remember that the above chart was created purely by chance (as can be seen in the code)! Yet it seems as if all kinds of bullish and bearish trends can be observed.

Every *quantitative analyst* (or just *quant*) knows that stock charts (in most cases) cannot be distinguished from ones created by the toss of a coin. Yet we are evolutionarily trained to see all kinds of patterns, even when there are none. We see faces in fronts of cars and animals (or other funny things) in clouds… and buy and sell signals in random sequences.

While I will not get into the thorny (and philosophical) issue of what constitutes “true” randomness (perhaps some other time…), one thing is clear: computers are notoriously bad at creating it. Why? Because under the hood computers are purely deterministic animals, working on one command at a time. So they are only able to create something that *looks like* randomness: *pseudo-randomness*. On the positive side, that means that this kind of randomness is reproducible: in R you use the `set.seed()` function to get the same “random” sequence every time.
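A minimal illustration of this reproducibility:

```r
set.seed(42)
a <- runif(3)
set.seed(42)   # resetting the seed restarts the pseudo-random sequence
b <- runif(3)
identical(a, b)
## [1] TRUE
```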

In the old days of computers (basically only a few decades ago) whole books with “good” random numbers were published! The following one can still be bought for over 50 bucks as a paperback and has over 600 pages! I guess it is the most unread book ever (even more so than James Joyce’s *Ulysses*).

The following *xkcd* cartoon takes the idea of pseudo-randomness to its absurd extreme (as usual):