Last week, I showed you a method to find the fastest path from A to B: Finding the Shortest Path with Dijkstra’s Algorithm. To make use of that, we need a way to determine our position at any point in time.

For that purpose, many devices use the so-called *Global Positioning System (GPS)*. If you want to understand how it works and do some simple calculations in R, read on!

Nowadays most of us have several GPS devices: in our cars, our smartphones, and our smartwatches. These are receivers of signals coming from over 30 satellites orbiting the earth. You need that many so that enough of them are always in sight:

Many of the technical details are interesting, yet here I want you to understand the core principle of how it works. With the real GPS, you need three satellites to determine your position: their signals narrow it down to two points, one of which can be discarded. Because the clock in the GPS receiver is not fully synchronized, you need a fourth satellite to compensate for that. The resulting system of equations has to be solved to determine the position. In our toy example, we will only treat the 2D case with two satellites, yet the principle stays the same.

The following example is taken from the excellent article From Barns to Satellites: An Introduction to the Mathematics of Global Positioning Systems by my colleague Professor Kyle Schultz from the University of Mary Washington.

The basic principle of determining the position is finding the *intersection* of the two *circles* of the two satellites (in this case) and matching this with a third circle, given by the radius of the earth. We get circles because when we receive the signal of a satellite we receive the *time* when it sent the signal and the *position* where it was when it sent the signal. From that, we can infer a circle of all the positions the signal could have traveled to during the time interval until we received it.
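As a small numerical illustration of this idea (not part of the original article, and with made-up times), the radius of such a circle is just the signal’s travel time multiplied by the speed of light:

```r
# Sketch: infer the circle radius from the signal travel time.
# The times below are invented values for illustration only.
c_light <- 299792458                  # speed of light in m/s
t_sent <- 0.00                        # time the satellite sent the signal (s)
t_received <- 0.07                    # time our receiver got it (s)
r <- c_light * (t_received - t_sent)  # radius of the circle of possible positions
r                                     # roughly 21,000 km
```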

When we have two satellites we get two circles. Two circles have either no, one, or two intersections. When the system is set up correctly, we will have two intersections most of the time. One of those can in nearly all cases be discarded because it will be somewhere in space or beneath the earth (`x` and `y` are the coordinates, `r` is the radius of the circle):

```r
draw_circle <- function(x = 0, y = 0, r, xlim = NULL, ylim = NULL,
                        col = "black", add = FALSE) {
  h <- x
  k <- y
  if (is.null(ylim)) ylim <- c(k - r, k + r)
  curve(+ sqrt(r^2 - (x - h)^2) + k, h - r, h + r, xlim = xlim, ylim = ylim,
        xlab = "", ylab = "", col = col, add = add)
  curve(- sqrt(r^2 - (x - h)^2) + k, h - r, h + r, xlim = xlim, ylim = ylim,
        xlab = "", ylab = "", col = col, add = TRUE)
}

earth <- c(x = 0, y = 0, r = 13)   # earth
sat_1 <- c(x = 21, y = 0, r = 20)  # satellite 1
sat_2 <- c(x = 9, y = 15, r = 5)   # satellite 2

draw_circle(r = earth["r"], xlim = c(-13, 41), ylim = c(-20, 20)) # earth
draw_circle(x = sat_1["x"], y = sat_1["y"], r = sat_1["r"], xlim = c(-13, 41),
            ylim = c(-20, 20), col = "green", add = TRUE) # satellite 1
draw_circle(x = sat_2["x"], y = sat_2["y"], r = sat_2["r"],
            ylim = c(-20, 20), col = "blue", add = TRUE) # satellite 2
```

The method for finding the intersection is called *trilateration*. You can find the exact mathematical derivation in the above article. We put the resulting formula into an R function (the `d`’s stand for distances; `p`, `q` and `r` are the coordinates of the satellites; we do not need the y-value of the first satellite because of the alignment of the diagram, see the article for details):

```r
trilaterate <- function(d1, d2, d3, p, q, r) {
  x <- (p^2 + d1^2 - d2^2) / (2 * p)
  y <- (q^2 + r^2 + d1^2 - d3^2 - (q * (p^2 + d1^2 - d2^2)) / p) / (2 * r)
  return(c(x, y))
}

(P <- trilaterate(earth["r"], sat_1["r"], sat_2["r"],
                  sat_1["x"], sat_2["x"], sat_2["y"]))
# don't need sat_1["y"] because of alignment of satellites with earth
##  x  y 
##  5 12

lines(c(earth["x"], P["x"]), c(earth["y"], P["y"]))
lines(c(sat_1["x"], P["x"]), c(sat_1["y"], P["y"]), col = "green")
lines(c(sat_2["x"], P["x"]), c(sat_2["y"], P["y"]), col = "blue")
points(P["x"], P["y"], pch = 16, col = "red")
```

As you can see, we got a unique position *(5, 12)* by this method.
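We can sanity-check this result with a quick verification sketch (not in the original post): the point *(5, 12)* must satisfy all three circle equations.

```r
# check that the trilaterated point lies on all three circles
earth <- c(x = 0, y = 0, r = 13)
sat_1 <- c(x = 21, y = 0, r = 20)
sat_2 <- c(x = 9, y = 15, r = 5)
P <- c(x = 5, y = 12)

on_circle <- function(P, circ)
  isTRUE(all.equal(unname((P["x"] - circ["x"])^2 + (P["y"] - circ["y"])^2),
                   unname(circ["r"]^2)))

on_circle(P, earth) # TRUE: 5^2 + 12^2 = 13^2
on_circle(P, sat_1) # TRUE: (5-21)^2 + 12^2 = 20^2
on_circle(P, sat_2) # TRUE: (5-9)^2 + (12-15)^2 = 5^2
```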

Another fascinating aspect of GPS is that it is a practical application of *Einstein’s Theory of Relativity*! That is because the satellites are moving very fast relative to the surface of the earth and gravity is much weaker up there. Both effects have to be incorporated into the equations; otherwise, we would accumulate an error of more than 10 km per day! The system would obviously be totally useless without Einstein and his brilliant theory.

I have to make a confession: when it comes to my sense of orientation I am a total failure… sometimes it feels like GPS and Google Maps were actually invented for me!

Well, nowadays everybody uses those practical little helpers. But how do they actually manage to find the shortest path from A to B?

If you want to understand the father of all routing algorithms, *Dijkstra’s algorithm*, and want to know how to program it in R, read on!

This post is partly based on the essay *Python Patterns – Implementing Graphs*; the example is from the German book “Das Geheimnis des kürzesten Weges” (“The Secret of the Shortest Path”) by my colleagues Professor Gritzmann and Dr. Brandenberg. For finding the most elegant way to convert data frames into igraph objects I got help (once again!) from the wonderful R community over at StackOverflow.

Dijkstra’s algorithm is a *recursive algorithm*. If you are not familiar with *recursion* you might want to read my post To understand Recursion you have to understand Recursion… first.

First, we are going to define the *graph* in which we want to navigate and attach *weights* for the time it takes to cover each edge. We use the excellent `igraph` package (on CRAN) for visualizing the graph:

```r
library(igraph)
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union

graph <- list(s = c("a", "b"),
              a = c("s", "b", "c", "d"),
              b = c("s", "a", "c", "d"),
              c = c("a", "b", "d", "e", "f"),
              d = c("a", "b", "c", "e", "f"),
              e = c("c", "d", "f", "z"),
              f = c("c", "d", "e", "z"),
              z = c("e", "f"))

weights <- list(s = c(3, 5),
                a = c(3, 1, 10, 11),
                b = c(5, 3, 2, 3),
                c = c(10, 2, 3, 7, 12),
                d = c(15, 7, 2, 11, 2),
                e = c(7, 11, 3, 2),
                f = c(12, 2, 3, 2),
                z = c(2, 2))

# create edgelist with weights
G <- data.frame(stack(graph), weights = stack(weights)[[1]])

set.seed(500)
el <- as.matrix(stack(graph))
g <- graph_from_edgelist(el)
oldpar <- par(mar = c(1, 1, 1, 1))
plot(g, edge.label = stack(weights)[[1]])
par(oldpar)
```

Next, we create a helper function to calculate the path length:

```r
path_length <- function(path) {
  # if path is NULL return infinite length
  if (is.null(path)) return(Inf)
  # get all consecutive nodes
  pairs <- cbind(values = path[-length(path)], ind = path[-1])
  # join with G and sum over weights
  sum(merge(pairs, G)[ , "weights"])
}
```

And now for the core of the matter, Dijkstra’s algorithm. The general idea is very simple and elegant: start at the starting node and call the algorithm recursively for all nodes linked from there as new starting nodes, thereby building the path step by step. Only keep the shortest path and stop when reaching the end node (the base case of the recursion). In case you reach a dead end in between, assign infinity as length (via the `path_length` function above).

I added a lot of documentation to the code so it is hopefully possible to understand how it works:

```r
find_shortest_path <- function(graph, start, end, path = c()) {
  # if there are no nodes linked from current node (= dead end) return NULL
  if (is.null(graph[[start]])) return(NULL)
  # add next node to path so far
  path <- c(path, start)
  # base case of recursion: if end is reached return path
  if (start == end) return(path)
  # initialize shortest path as NULL
  shortest <- NULL
  # loop through all nodes linked from the current node (given in start)
  for (node in graph[[start]]) {
    # proceed only if linked node is not already in path
    if (!(node %in% path)) {
      # recursively find the shortest path with node as new start
      newpath <- find_shortest_path(graph, node, end, path)
      # if newpath is shorter than shortest so far keep it
      if (path_length(newpath) < path_length(shortest)) shortest <- newpath
    }
  }
  # return shortest path
  shortest
}
```

Now, we can finally test the algorithm by calculating the shortest path from **s** to **z** and back:

```r
find_shortest_path(graph, "s", "z") # via b
## [1] "s" "b" "c" "d" "f" "z"
find_shortest_path(graph, "z", "s") # back via a
## [1] "z" "f" "d" "b" "a" "s"
```

Note that the two routes are actually different because of the different weights in both directions (e.g. think of some construction work in one direction but not the other).
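To double-check, we can recompute the length of the forward route by hand from the weight lists defined above (a small verification sketch, not in the original post):

```r
# edge weights along the route s -> b -> c -> d -> f -> z,
# read off the directed weight lists defined above
step_weights <- c("s->b" = 5, "b->c" = 2, "c->d" = 3, "d->f" = 2, "f->z" = 2)
sum(step_weights)
## [1] 14
```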

Next week we will learn how the *Global Positioning System (GPS)* works, so stay tuned!

We are living in a complex world and it is often not easy to distinguish “alternative facts” from the truth. In this post, I am giving you the statistical analog of a Swiss army knife that you could use with minimal effort to sort out false claims even on the back of an envelope (sometimes you might need a pocket calculator which should be no problem in the age of ubiquitous smartphones). The post is inspired by the excellent book “A Universal ‘Pocket’ Statistical Tool – Based on the Pearson Chi-Square Goodness of Fit Test” by Dr. Frederick Ruland.

The big idea behind the method is to find out whether something is actually the case due to some underlying effect – or just due to pure chance. Relevant questions like the following can be answered:

- Gambling: is that die fair or loaded?
- Business: is there an unusually large proportion of men in management positions?
- Politics: is there an unusually large proportion of Afro-Americans stopped and searched by police?
- Education: are there unusually large discrepancies in the scoring of essays between two teachers?
- Production: are the box weights unusually different from the target goal?

You may have noticed that the word “unusually” is used unusually often (no pun intended). With “unusual” we mean something happening with a probability of only 5% or less (for details again see From Coin Tosses to p-Hacking: Make Statistics Significant Again!).

To achieve this a very simple calculation is done:

- Square the difference between the observed value and the expected value of the first group.
- Divide by the expected value of that group.
- Square the difference between the observed value and the expected value of the other group.
- Divide by the expected value of that group.
- Add both values.
- If the result is bigger than 3.84, the probability of the observed result happening just by chance is below 5% -> we are on to something!
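The recipe above can be wrapped into a small helper function (a sketch; the name `pocket_chisq` is my own invention, not from the book):

```r
pocket_chisq <- function(observed, expected) {
  # sum of squared differences between observed and expected counts,
  # each divided by the expected count
  sum((observed - expected)^2 / expected)
}

pocket_chisq(observed = c(9, 1), expected = c(5, 5))
## [1] 6.4
```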

In the mentioned blog post we had the example of a coin being tossed ten times and showing nine times heads (and only one time tails). Let us repeat the calculation here:

- Observed number of heads = 9
- Expected number of heads = 5
- Observed number of tails = 1
- Expected number of tails = 5

Let us use R as a calculator:

```r
(((9 - 5)^2) / 5) + (((1 - 5)^2) / 5)
## [1] 6.4
```

Because the result 6.4 is bigger than 3.84 we can conclude that the coin is most likely unfair!

What seems like magic is actually based on solid mathematics: what we are performing here is, in statistical lingo, a so-called *Pearson’s chi-squared goodness-of-fit test*. We won’t go into the mathematical details but it can be shown that, the way the calculation is set up, the *test statistic* follows a so-called *chi-squared distribution*. The magic number 3.84 is just the 5% mark of the *p-value*:
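That 3.84 really is the relevant threshold can be checked directly in R with the quantile function of the chi-squared distribution:

```r
# 95% quantile of the chi-squared distribution with one degree of freedom
qchisq(0.95, df = 1)
## [1] 3.841459
```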

```r
curve(pchisq(x, df = 1, lower.tail = FALSE), from = 0, to = 10, axes = FALSE,
      xlab = "X-squared", ylab = "p-value",
      main = "X-squared = 3.84 <=> p-value = 0.05")
axis(1, at = c(0, 2, 3.84, seq(4, 10, 2)),
     labels = c("0", "2", "3.84", "4", "6", "8", "10"))
axis(2, at = c(0, 0.05, seq(0.2, 1, 0.2)),
     labels = c("0", "0.05", "0.2", "0.4", "0.6", "0.8", "1.0"))
lines(c(3.84, 3.84), c(0, 0.05), col = "blue")
lines(c(0, 3.84), c(0.05, 0.05), col = "blue")
```

This is why we can also do the above calculation directly in R:

```r
chisq.test(c(9, 1))
## 
##  Chi-squared test for given probabilities
## 
## data:  c(9, 1)
## X-squared = 6.4, df = 1, p-value = 0.01141
```

As expected, we not only get the same result but also a p-value below 0.05!

Let us do another example, this time from the above-mentioned book:

You take your child to kindergarten on the first day. Of the 21 students, there are only 6 boys. You wonder if this is unusual. Based on your knowledge of biology, you expect something closer to half boys and half girls.

The values for our formula:

- Observed number of boys = 6
- Expected number of boys = 21 / 2 = 10.5
- Observed number of girls = 15
- Expected number of girls = 21/2 = 10.5

And now for the calculation (by “hand” and afterwards again with the `chisq.test` function):

```r
(((6 - 10.5)^2) / 10.5) + (((15 - 10.5)^2) / 10.5)
## [1] 3.857143

chisq.test(c(6, 15))
## 
##  Chi-squared test for given probabilities
## 
## data:  c(6, 15)
## X-squared = 3.8571, df = 1, p-value = 0.04953
```

Dr. Ruland interprets the result for us:

Since our calculated value is greater than 3.84, our observed frequencies (6 boys and 15 girls) do not fit the expected frequencies very closely. (Perhaps there is another school in town for boys only?)

Isn’t it fascinating how you can perform some serious statistical analysis with this simple tool!

In this post, we will create a small network (also called *graph* mathematically) and ask some questions about which is the “most important” node (also called *vertex*, pl. *vertices*). If you want to understand important concepts of *network centrality* and how to calculate those in R, read on!

This post is based on a LinkedIn post by renowned data scientist Dr. Keith McNulty. Let us (re-)create the small example network from there by first defining the *adjacency matrix* and after that plotting it with the `igraph` package (on CRAN). We have used this package already in another post on networks: Google’s Eigenvector… or: How a Random Surfer Finds the Most Relevant Webpages.

```r
library(igraph)
## Warning: package 'igraph' was built under R version 4.0.2
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union

# define simple network
#             A, B, C, D, E, F, G, H, I, J, K, L, M, N
A <- matrix(c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # A
              1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # B
              0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # C
              1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,  # D
              0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # E
              0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,  # F
              0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,  # G
              0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,  # H
              0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0,  # I
              0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,  # J
              0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,  # K
              0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,  # L
              0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,  # M
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0   # N
              ), nrow = 14)
colnames(A) <- rownames(A) <- LETTERS[1:ncol(A)]
A # adjacency matrix
##   A B C D E F G H I J K L M N
## A 0 1 0 1 0 0 0 0 0 0 0 0 0 0
## B 1 0 0 1 0 0 0 0 0 0 0 0 0 0
## C 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## D 1 1 1 0 1 1 1 1 0 0 0 0 0 0
## E 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## F 0 0 0 1 0 0 1 0 0 0 0 0 0 0
## G 0 0 0 1 0 1 0 1 1 0 0 0 0 0
## H 0 0 0 1 0 0 1 0 1 0 0 0 0 0
## I 0 0 0 0 0 0 1 1 0 1 0 0 1 0
## J 0 0 0 0 0 0 0 0 1 0 1 1 0 0
## K 0 0 0 0 0 0 0 0 0 1 0 1 0 0
## L 0 0 0 0 0 0 0 0 0 1 1 0 0 0
## M 0 0 0 0 0 0 0 0 1 0 0 0 0 1
## N 0 0 0 0 0 0 0 0 0 0 0 0 1 0

g <- graph_from_adjacency_matrix(A, mode = "undirected")
set.seed(258)
oldpar <- par(mar = c(1, 1, 1, 1))
plot(g)
par(oldpar)
```

McNulty writes in his post:

I love to use this example when I teach about network analysis. I ask the group: who is the most important person in this network?

Now, what does “most important” person mean? It of course depends on the definition and this is where *network centrality measures* come into play. We will have a look at three of those (there are many more out there…).

McNulty explains:

Degree centrality tells you the most connected person: it is simply the number of nodes connected to each node, and it’s easy to see that

D has the highest (7).

This is often the only metric given to identify “influencers”: how many followers do they have?

Degree centrality is easy to calculate in R (first “by hand”, after that with the `igraph` package):

```r
rowSums(A)
## A B C D E F G H I J K L M N 
## 2 2 1 7 1 2 4 3 4 3 2 2 2 1 

degree(g)
## A B C D E F G H I J K L M N 
## 2 2 1 7 1 2 4 3 4 3 2 2 2 1
```

McNulty explains:

Closeness centrality tells you who can propagate information quickest: you sum the path lengths from your node to each other node and then inverse it.

G has four paths of length 1, 6 of length 2 and 3 of length 3, which gives it a closeness centrality of 1/25. Among the other main candidates, I is 1/26, H is 1/26 and D is 1/27.

One application that comes to mind is identifying so-called superspreaders of infectious diseases, like COVID-19.

This is a little bit more involved; the simplest approach is to first convert the adjacency matrix into a *distance matrix* which contains the lengths of the shortest paths between all pairs of nodes (I won’t go into the details, some pointers are given in the comments of the code):

```r
# Exponentiate the n x n adjacency matrix to the n'th power in the min-plus
# algebra, i.e. instead of adding take the minimum and instead of multiplying
# take the sum. More details:
# https://en.wikipedia.org/wiki/Distance_matrix#Non-metric_distance_matrices

# distance product
"%C%" <- function(A, B) {
  n <- nrow(A)
  A[A == 0] <- Inf
  diag(A) <- 0
  B[B == 0] <- Inf
  diag(B) <- 0
  C <- matrix(0, nrow = n, ncol = n)
  for (i in 1:n) {
    for (j in 1:n) {
      tmp <- numeric(n) # numeric, not integer: entries can be Inf
      for (k in 1:n) {
        tmp[k] <- A[i, k] + B[k, j]
      }
      C[i, j] <- min(tmp)
    }
  }
  colnames(C) <- rownames(C) <- rownames(A)
  C
}

# calculate distance matrix by repeated squaring
DM <- function(A) {
  D <- A
  for (n in 1:nrow(A)) {
    D <- D %C% D
  }
  D
}

D <- DM(A) # distance matrix
D
##   A B C D E F G H I J K L M N
## A 0 1 2 1 2 2 2 2 3 4 5 5 4 5
## B 1 0 2 1 2 2 2 2 3 4 5 5 4 5
## C 2 2 0 1 2 2 2 2 3 4 5 5 4 5
## D 1 1 1 0 1 1 1 1 2 3 4 4 3 4
## E 2 2 2 1 0 2 2 2 3 4 5 5 4 5
## F 2 2 2 1 2 0 1 2 2 3 4 4 3 4
## G 2 2 2 1 2 1 0 1 1 2 3 3 2 3
## H 2 2 2 1 2 2 1 0 1 2 3 3 2 3
## I 3 3 3 2 3 2 1 1 0 1 2 2 1 2
## J 4 4 4 3 4 3 2 2 1 0 1 1 2 3
## K 5 5 5 4 5 4 3 3 2 1 0 1 3 4
## L 5 5 5 4 5 4 3 3 2 1 1 0 3 4
## M 4 4 4 3 4 3 2 2 1 2 3 3 0 1
## N 5 5 5 4 5 4 3 3 2 3 4 4 1 0

1 / rowSums(D)
##          A          B          C          D          E          F          G 
## 0.02631579 0.02631579 0.02564103 0.03703704 0.02564103 0.03125000 0.04000000 
##          H          I          J          K          L          M          N 
## 0.03846154 0.03846154 0.02941176 0.02222222 0.02222222 0.02777778 0.02083333

closeness(g)
##          A          B          C          D          E          F          G 
## 0.02631579 0.02631579 0.02564103 0.03703704 0.02564103 0.03125000 0.04000000 
##          H          I          J          K          L          M          N 
## 0.03846154 0.03846154 0.02941176 0.02222222 0.02222222 0.02777778 0.02083333
```

McNulty explains:

Betweenness centrality tells you who is most important in maintaining connection throughout the network: it is the number of times your node is on the shortest path between any other pair of nodes.

I uniquely connects all nodes on the left with all nodes on the right, which means it connects 8×5 = 40 pairs, plus any node in the top right with the bottom right, a further 6 pairs, so 46 in total. If you follow a similar process for D, H and G you’ll see that they don’t come close to this.

For example in protein-interaction networks, betweenness centrality can be used to find important proteins in signalling pathways which can form targets for drug discovery.

The actual algorithm for calculating betweenness centrality is much too involved to show here (if you have a simple-to-understand algorithm please let me know on StackOverflow or in the comments), so we will just make use of the `igraph` package to calculate it:

```r
betweenness(g)
##    A    B    C    D    E    F    G    H    I    J    K    L    M    N 
##  0.0  0.0  0.0 41.5  0.0  0.0 21.5 15.0 46.0 22.0  0.0  0.0 12.0  0.0
```

As we have seen, there is more than one definition of “most important”. It will depend on the context (and the available information) which one to choose.

Please let me know your thoughts in the comments and please share further possible applications with us.

We’re taking our summer break! Look forward to the next post on October 6, 2020… and stay healthy!

So, is there a method to find the respective proportion of people without putting them on the spot? Actually, there is! If you want to learn about *randomized response* (and how to create flowcharts in R along the way), read on!

The question is: how can you get a truthful result overall without being able to attribute a certain answer to any single individual? As it turns out, there is a very elegant and ingenious method, called *randomized response*. The big idea is, as the name suggests, to add random noise to every individual answer in such a way that it *cancels out overall*, leaving the overall proportion largely intact!

Big tech companies like Google and Microsoft also use this method to e.g. collect telemetry data in a *privacy-preserving* manner. The broader concept is called *differential privacy*, which is differentiated into *local* and *global* methods. Global methods are being used when the data is already collected, local methods concern the process of data collection itself. We will be focusing on the latter here.

One way to do randomized response is the following: each participant is asked to flip two coins. If the first coin comes up heads, she answers the question truthfully. Otherwise, she answers “yes” if the second coin came up heads and “no” if it came up tails. This way nobody can find out whether the participant answered truthfully or not, her answers could have been produced by randomness. Yet, the great thing is that when asking a whole group of people it is possible to calculate the true proportions because the added random noise cancels out… we will see how later in the post.

The process is depicted in the following flowchart, created with R and the `diagram` package (on CRAN):

```r
library(diagram)
## Loading required package: shape

oldpar <- par(mar = c(1, 1, 1, 1))
openplotmat()
elpos <- coordinates(c(1, 1, 2, 4))
fromto <- matrix(ncol = 2, byrow = TRUE,
                 data = c(1, 2, 2, 3, 2, 4, 4, 7, 4, 8))
nr <- nrow(fromto)
arrpos <- matrix(ncol = 2, nrow = nr)
for (i in 1:nr)
  arrpos[i, ] <- straightarrow(to = elpos[fromto[i, 2], ],
                               from = elpos[fromto[i, 1], ],
                               lwd = 2, arr.pos = 0.6, arr.length = 0.5)
textellipse(elpos[1, ], 0.1, lab = "START", box.col = "green",
            shadow.col = "darkgreen", shadow.size = 0.005, cex = 1.5)
textrect(elpos[2, ], 0.15, 0.05, lab = "1st coin flip", box.col = "grey",
         shadow.col = "black", shadow.size = 0.005, cex = 1.5)
textrect(elpos[4, ], 0.15, 0.05, lab = "2nd coin flip", box.col = "grey",
         shadow.col = "black", shadow.size = 0.005, cex = 1.5)
textellipse(elpos[3, ], 0.1, 0.1, lab = c("True", "Answer"), box.col = "orange",
            shadow.col = "red", shadow.size = 0.005, cex = 1.5)
textellipse(elpos[7, ], 0.1, 0.1, lab = "Yes", box.col = "orange",
            shadow.col = "red", shadow.size = 0.005, cex = 1.5)
textellipse(elpos[8, ], 0.1, 0.1, lab = "No", box.col = "orange",
            shadow.col = "red", shadow.size = 0.005, cex = 1.5)
text(arrpos[2, 1] + 0.06, arrpos[2, 2], "50%")
text(arrpos[3, 1] - 0.06, arrpos[3, 2], "50%")
text(arrpos[4, 1] - 0.03, arrpos[4, 2] + 0.05, "25%")
text(arrpos[5, 1] + 0.03, arrpos[5, 2] + 0.05, "25%")
par(oldpar)
```

Of course, this is just one potential way of doing it and it doesn’t have to be real coin flips: the whole process could be done transparently in the background while collecting personal data automatically, yet to give you some feeling for the method let us actually do a toy example next.

We now simulate a survey of 200 people who get asked an embarrassing question, e.g. whether they are an alcoholic. In our sample, about 15% are alcoholics (which is about the real rate in Western countries, believe it or not!):

```r
set.seed(123)
n <- 200
true_responses <- sample(c(TRUE, FALSE), size = n, replace = TRUE,
                         prob = c(0.15, 0.85)) # only responders know
table(true_responses) / n
## true_responses
## FALSE  TRUE 
##  0.85  0.15
```

We now use the above method to get an estimate of the proportion without embarrassing single individuals:

```r
randomized_response <- function(true_response)
  ifelse(sample(c(TRUE, FALSE), 1), true_response, sample(c(TRUE, FALSE), 1))
randomized_response <- Vectorize(randomized_response)

randomized_responses <- randomized_response(true_responses) # what comes out of survey
```

How can we actually get rid of the noise? Mathematician Dr. John D. Cook explains (source: Randomized response, privacy, and Bayes theorem):

How can [one] estimate p, the [true] proportion […]? Around half […] will get a head on the first coin and answer truthfully; the rest will look at the second coin and answer yes or no with equal probability. So the expected proportion of yes answers is Y = 0.5p + 0.25, and we can estimate p as 2Y – 0.5.

Let us actually do this for our little example:

```r
2 * sum(randomized_responses) / n - 0.5
## [1] 0.14
```

With an estimate of 14% we are obviously not so far off from the true proportion (15%). To find out how many people answer truthfully, we do another quick calculation:

```r
sum(true_responses == randomized_responses) / n
## [1] 0.75
```

This makes sense since 50% give the true answer anyway and 25% by chance, which makes 75% altogether.

Who would have thought that a little bit of randomness at the right place could reveal hidden truths people are normally ashamed to admit?


There are many so-called *Computer Algebra Systems (CAS)* out there, commercial but also open-source. One very mature one is called *YACAS* (for *Yet Another Computer Algebra System*). You find the documentation here: Yacas Documentation (many of the following examples are taken from there).

You can use the full power of it in R by installing the `Ryacas` package from CRAN. You can invoke Yacas commands directly with the `yac_str` function; the `as_r` function converts the output to R. Let us first *simplify a mathematical expression*:

```r
library(Ryacas)
## 
## Attaching package: 'Ryacas'
## The following object is masked from 'package:stats':
## 
##     deriv
## The following objects are masked from 'package:base':
## 
##     %*%, determinant, diag, diag<-, I, lower.tri, upper.tri

# simplify expressions
as_r(yac_str("Simplify(a*b*a^2/b-a^3)"))
## [1] 0
```

Or *solve an equation*:

```r
as_r(yac_str("Solve(a+x*y==z,x)"))
## [1] "x==-(a-z)/y"
```

And you can do all kinds of tedious stuff that is quite error-prone when done by hand, e.g. *expanding* expressions like `(x-2)^20` by using the binomial theorem:

```r
as_r(yac_str("Expand((x-2)^20)"))
## expression(x^20 - 40 * x^19 + 760 * x^18 - 9120 * x^17 + 77520 * 
##     x^16 - 496128 * x^15 + 2480640 * x^14 - 9922560 * x^13 + 
##     32248320 * x^12 - 85995520 * x^11 + 189190144 * x^10 - 343982080 * 
##     x^9 + 515973120 * x^8 - 635043840 * x^7 + 635043840 * x^6 - 
##     508035072 * x^5 + 317521920 * x^4 - 149422080 * x^3 + 49807360 * 
##     x^2 - 10485760 * x + 1048576)
```

To demonstrate how easily the results can be integrated into R, let us do some *curve sketching* on a function. First, we define two helper functions: one for converting an expression into a function (which can then be used to plot it) and one for determining the *derivative of order n* of some function (we redefine the `D` function for that):

```r
as_function <- function(expr) {
  as.function(alist(x = , eval(parse(text = expr))))
}

# redefine D function
D <- function(eq, order = 1) {
  yac_str(paste("D(x,", order, ")", eq))
}
```

Now, we define the function (in this case a simple *polynomial*), determine the first and second derivatives symbolically and plot everything:

```r
xmin <- -5
xmax <- 5
eq <- "2*x^3 - 3*x^2 + 4*x - 5"
eq_f <- as_function(eq)
curve(eq_f, xmin, xmax, ylab = "y(x)")
abline(h = 0, lty = 2)
abline(v = 0, lty = 2)

D_eq <- D(eq)
D_eq
## [1] "6*x^2-6*x+4"
D_eq_f <- as_function(D_eq)
curve(D_eq_f, xmin, xmax, add = TRUE, col = "blue")

D2_eq <- D(eq, 2)
D2_eq
## [1] "12*x-6"
D2_eq_f <- as_function(D2_eq)
curve(D2_eq_f, xmin, xmax, add = TRUE, col = "green")
```

Impressive, isn’t it? Yacas can also determine *limits* and *integrals*:

```r
# determine limits
yac_str("Limit(x,0) 1/x")
## [1] "Undefined"
yac_str("Limit(x,0,Left) 1/x")
## [1] "-Infinity"
yac_str("Limit(x,0,Right) 1/x")
## [1] "Infinity"

# integration
yac_str("Integrate(x) Cos(x)")
## [1] "Sin(x)"
yac_str("Integrate(x,a,b) Cos(x)")
## [1] "Sin(b)-Sin(a)"
```

As an example, we can prove in no time that the famous approximation 22/7 of π is actually too big (more details can be found here: Proof that 22/7 exceeds π):

```r
yac_str("Integrate(x,0,1) x^4*(1-x)^4/(1+x^2)")
## [1] "22/7-Pi"
```

And, as the grand finale of this post, Yacas is even able to solve *ordinary differential equations* symbolically! Let us first take the simplest of them all:

```r
as_r(yac_str("OdeSolve(y' == y)"))
## expression(C115 * exp(x))
```

It correctly returns the *exponential function* (the `C` term is just an arbitrary *constant*). And finally a more complex, higher-order one:

```r
as_r(yac_str("OdeSolve(y'' - 4*y == 0)"))
## expression(C154 * exp(2 * x) + C158 * exp(-2 * x))
```

I still find CAS amazing and extremely useful… and an especially powerful one can be used from within R!


The following post is based on the post “Das Kalman-Filter einfach erklärt” (“The Kalman Filter Simply Explained”), which is written in German and uses Matlab code (so basically two languages nobody is interested in anymore). That post is itself based on the online course “Artificial Intelligence for Robotics” by my colleague Professor Sebastian Thrun of Stanford University.

Because we are dealing with uncertainty here, we need a probability distribution. A good choice often is a *Gaussian* or *normal distribution*. It is defined by two parameters, the *mean* and the *variance* (or *standard deviation* which is just the square root of the variance). These two parameters have to be updated by incoming information which is itself uncertain.

This can be interpreted as some form of *Bayesian updating* (if you want to learn more about Bayes and his famous formula, you can read two former posts of mine here: Base Rate Fallacy – or why No One is justified to believe that Jesus rose and Learning Data Science: Sentiment Analysis with Naive Bayes).

Professor Thrun explains (just ignore the quiz at the end of the video):

The *update step* in R code:

```r
update <- function(mean1, var1, mean2, var2) {
  # calculates new position as multiplication of two Gaussians:
  # prior probability and new information (noisy measurement)
  new_mean <- (var2 * mean1 + var1 * mean2) / (var1 + var2)
  new_var <- 1 / (1/var1 + 1/var2)
  return(c(new_mean, new_var))
}
```
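As a quick numeric sanity check with made-up values (the function is redefined here so the snippet is self-contained): combining a prior at 10 with a measurement at 12, both with variance 4, yields a mean halfway in between and a smaller variance, i.e. we become more certain:

```r
# update step: product of two Gaussians (prior and measurement)
update <- function(mean1, var1, mean2, var2) {
  new_mean <- (var2 * mean1 + var1 * mean2) / (var1 + var2)
  new_var <- 1 / (1/var1 + 1/var2)
  c(new_mean, new_var)
}

update(10, 4, 12, 4) # mean 11, variance 2: more certain than either input
## [1] 11  2
```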

After calculating the position based on new information, we also have to take the motion itself into account; this is done in the *prediction step*:

Here is the R implementation:

```r
predict <- function(mean1, var1, mean2, var2) {
  # calculates new position as sum (= convolution) of two Gaussians:
  # prior probability and new information (noisy movement)
  new_mean <- mean1 + mean2
  new_var <- var1 + var2
  return(c(new_mean, new_var))
}
```

Now, we are putting both steps together to form a cycle:

As an example let us create a trajectory along a sine wave with measurements and motion affected by noise (= uncertainty):

```r
var_measure <- 5   # variance measurement
var_motion <- 2    # variance motion
pos <- c(0, 10000) # starting values position and variance

## Kalman calculation
set.seed(123)
pos_real <- 10 * sin(seq(1, 20, 0.1))
motion_real <- diff(pos_real)
measure <- pos_real + rnorm(length(pos_real), 0, sqrt(var_measure))
motion <- motion_real + rnorm(length(motion_real), 0, sqrt(var_motion))

kalman_update <- c()
for (i in 1:length(measure)) {
  pos <- update(pos[1], pos[2], measure[i], var_measure)
  kalman_update <- c(kalman_update, pos[1])
  pos <- predict(pos[1], pos[2], motion[i], var_motion)
}

plot(measure, col = "red")
lines(kalman_update, col = "blue")
lines(pos_real, col = "black")
```

As you can see the resulting blue curve is much more stable than the noisy measurements (small red circles)!

If you want to read about another algorithm (RANSAC) that is able to handle noisy measurements, you can find it here: Separating the Signal from the Noise: Robust Statistics for Pedestrians.

In any case, stay tuned for more to come!

Forecasting the future has always been one of man’s biggest desires, and many approaches have been tried over the centuries. In this post, we will look at a simple statistical method for forecasting time series data: so-called *autoregressive (AR) models*.

Let us dive directly into the matter and build an AR model out of the box. We will use the inbuilt `BJsales` dataset, which contains 150 observations of sales data (for more information consult the R documentation). Conveniently enough, AR models can be built directly in base R with the `ar.ols()` function (*OLS* stands for *Ordinary Least Squares*, which is the method used to fit the actual model). Have a look at the following code:

```r
data <- BJsales
head(data)
## [1] 200.1 199.5 199.4 198.9 199.0 200.2

N <- 3        # how many periods lookback
n_ahead <- 10 # how many periods forecast

# build autoregressive model with ar.ols()
model_ar <- ar.ols(data, order.max = N) # ar-model
pred_ar <- predict(model_ar, n.ahead = n_ahead)
pred_ar$pred
## Time Series:
## Start = 151
## End = 160
## Frequency = 1
##  [1] 263.0299 263.3366 263.6017 263.8507 264.0863 264.3145 264.5372
##  [8] 264.7563 264.9727 265.1868

plot(data, xlim = c(1, length(data) + 15), ylim = c(min(data), max(data) + 10))
lines(pred_ar$pred, col = "blue", lwd = 5)
```

Well, this seems to be good news for the sales team: rising sales! Yet, how does this model arrive at those numbers? To understand what is going on, we will now rebuild the model. Basically, everything is in the name already: *auto-regressive*, i.e. a *(linear) regression* on (a delayed copy of) itself (*auto* is Ancient Greek for *self*)!

So, what we are going to do is create a delayed copy of the time series and run a linear regression on it. We will use the `lm()` function from base R for that (see also Learning Data Science: Modelling Basics). Have a look at the following code:

```r
# reproduce with lm()
df_data <- data.frame(embed(data, N+1) - mean(data))
head(df_data)
##        X1      X2      X3      X4
## 1 -31.078 -30.578 -30.478 -29.878
## 2 -30.978 -31.078 -30.578 -30.478
## 3 -29.778 -30.978 -31.078 -30.578
## 4 -31.378 -29.778 -30.978 -31.078
## 5 -29.978 -31.378 -29.778 -30.978
## 6 -29.678 -29.978 -31.378 -29.778

model_lm <- lm(X1 ~ ., data = df_data) # lm-model

coeffs <- cbind(c(model_ar$x.intercept, model_ar$ar), coef(model_lm))
coeffs <- cbind(coeffs, coeffs[ , 1] - coeffs[ , 2])
round(coeffs, 12)
##                   [,1]       [,2] [,3]
## (Intercept)  0.2390796  0.2390796    0
## X2           1.2460868  1.2460868    0
## X3          -0.0453811 -0.0453811    0
## X4          -0.2042412 -0.2042412    0

data_pred <- df_data[nrow(df_data), 1:N]
colnames(data_pred) <- names(model_lm$coefficients)[-1]
pred_lm <- numeric()
for (i in 1:n_ahead) {
  data_pred <- cbind(predict(model_lm, data_pred), data_pred)
  pred_lm <- cbind(pred_lm, data_pred[ , 1])
  data_pred <- data_pred[ , 1:N]
  colnames(data_pred) <- names(model_lm$coefficients)[-1]
}

preds <- cbind(pred_ar$pred, as.numeric(pred_lm) + mean(data))
preds <- cbind(preds, preds[ , 1] - preds[ , 2])
colnames(preds) <- NULL
round(preds, 9)
## Time Series:
## Start = 151
## End = 160
## Frequency = 1
##         [,1]     [,2] [,3]
## 151 263.0299 263.0299    0
## 152 263.3366 263.3366    0
## 153 263.6017 263.6017    0
## 154 263.8507 263.8507    0
## 155 264.0863 264.0863    0
## 156 264.3145 264.3145    0
## 157 264.5372 264.5372    0
## 158 264.7563 264.7563    0
## 159 264.9727 264.9727    0
## 160 265.1868 265.1868    0
```

As you can see, the coefficients and predicted values are the same (except for some negligible rounding errors)!

A few things warrant further attention: in the formula `X1 ~ .`, the *dependent variable* is the first column `X1` (the current value), and the `.` picks up all remaining columns as regressors, however many the lookback `N` (the number of lookback periods) produces. To be more precise, it is not just a simple linear regression but a *multiple regression*, because each column (representing a different time delay) goes into the model as a separate *(independent) variable*. Additionally, the regression is performed on the *demeaned* data, meaning that the mean is subtracted first.
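A tiny illustration of how `embed()` builds those delayed copies (toy data of my own, not from the sales example):

```r
# embed() turns a series into a matrix of the current value (column 1)
# and its delayed copies (columns 2, 3, ...):
x <- c(1, 2, 3, 4, 5)
embed(x, 2)
##      [,1] [,2]
## [1,]    2    1
## [2,]    3    2
## [3,]    4    3
## [4,]    5    4
```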

So, under the hood, what sounds so impressive (“autoregressive model”… wow!) is nothing else but good ol’ linear regression. For this method to work, then, there must be some *autocorrelation* in the data, i.e. some repeating linear pattern.
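You can check this assumption with base R’s `acf()` function; for the `BJsales` series used above, the lag-1 autocorrelation is very high:

```r
# lag-1 autocorrelation of the BJsales sales data
r1 <- acf(BJsales, lag.max = 1, plot = FALSE)$acf[2]
r1  # close to 1: each value depends strongly on its predecessor
```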

As you can imagine, there are instances where this will not work. For example, in financial time series there is next to no autocorrelation (otherwise it would be too easy, right? – see also my question and answers on Quant.SE here: Why aren’t econometric models used more in Quant Finance?).

In order to use this model to predict `n_ahead` periods ahead, the predict function first uses the last `N` periods and then feeds the new predicted values back in as input for the next prediction, and so forth, `n_ahead` times. After that, the mean is added again. Obviously, the farther we predict into the future, the more uncertain the forecast becomes, because the basis of the prediction comprises more and more values that were predicted themselves. The values for both parameters were chosen here for demonstration purposes only. A realistic scenario would be to use more lookback periods than predicted periods, and you would, of course, take domain knowledge into account, e.g. when you have monthly data, take at least twelve periods as your `N`.
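The recursion can be sketched in a few lines (assuming, as the coefficient comparison above suggests, that `ar.ols()` stores the intercept of the demeaned series in `x.intercept`, the subtracted mean in `x.mean`, and the coefficients with `ar[1]` multiplying the most recent value):

```r
model <- ar.ols(BJsales, order.max = 3)
m <- model$x.mean                       # the mean that was subtracted
x <- tail(as.numeric(BJsales), 3) - m   # last N demeaned values, newest last

fc <- numeric(10)
for (i in 1:10) {
  # one-step forecast: intercept plus weighted sum of the last N demeaned values
  nxt <- model$x.intercept + sum(as.numeric(model$ar) * rev(x))
  fc[i] <- nxt + m                      # add the mean back
  x <- c(x[-1], nxt)                    # slide the window: feed the forecast back in
}
round(fc, 4)                            # reproduces pred_ar$pred from above
```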

This post only barely scratched the surface of forecasting time series data. Basically, many of the standard approaches of statistics and machine learning can be modified so that they can be used on time series data. Yet, even the most sophisticated method is not able to foresee external shocks (like the current COVID-19 pandemic) and feedback loops when the very forecasts change the actual behaviour of people.

So, all methods have to be taken with a grain of salt because of those systematic challenges. You should always keep that in mind when you get the latest sales forecast!

Many people think that when they get a positive result on a COVID-19 antibody test, they are immune to the virus with high probability. If you want to find out why nothing could be further from the truth, read on!

The following article provides a good intuition of why the accuracy of screening tests is highly dependent on the infection rate: Scientific American: Coronavirus Antibody Tests Have a Mathematical Pitfall. They write:

A mathematical wrinkle makes these tests—and in fact, all screening tests—hard to interpret: even with a very accurate test, the fewer people in a population who have a condition, the more likely it is that an individual’s positive result is wrong. If it is, people might think they have the antibodies (and thus may have immunity), when in fact they do not.

We covered this concept in another post already (see Base Rate Fallacy – or why No One is justified to believe that Jesus rose) but because of the importance of the topic let us replicate their analysis with *personograph plots* (also called *Kuiper-Marshall plots*) which are an excellent way to communicate risks (see also Lying with Statistics: One Beer a Day will Kill you!).

Let us first repeat the definitions of *sensitivity* and *specificity* (we covered those here already: Learning Data Science: Understanding ROC Curves):

- **Sensitivity** or **True Positive Rate (TPR)**: the proportion of actual (= true) positives (TP) that are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having (had) the condition).
- **Specificity** or **True Negative Rate (TNR)**: the proportion of actual (= true) negatives (TN) that are correctly identified as such (e.g. the percentage of healthy people who are correctly identified as not having (had) the condition).

That might be quite interesting but what really counts in the end aren’t any test results but whether you actually have (had) the virus or not! Those are two different things and therefore we define two additional measures:

- **Positive predictive value (PPV)**: the probability that subjects with a positive screening test truly have (had) the disease.
- **Negative predictive value (NPV)**: the probability that subjects with a negative screening test truly don’t have (have not had) the disease.

Please note that all four figures are *conditional probabilities*, but are going in different directions. For example, TPR gives the probability of a positive test result *given you are sick*, while PPV gives the probability of you being sick *given a positive test result*. We will see that both numbers can (and actually do!) diverge quite considerably.
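The bridge between the two directions is just Bayes’ theorem. As a quick sketch (the `ppv()` helper is mine, not part of the original analysis), the PPV can be computed directly from sensitivity, specificity and the infection rate:

```r
# PPV via Bayes' theorem:
# P(sick | positive) = P(positive | sick) * P(sick) / P(positive)
ppv <- function(tpr, tnr, prevalence) {
  tpr * prevalence / (tpr * prevalence + (1 - tnr) * (1 - prevalence))
}

ppv(0.95, 0.95, 0.05)  # exactly 0.5: a coin toss at 5% prevalence
ppv(0.95, 0.95, 0.25)  # about 0.86 at 25% prevalence
```

These are the same numbers the personograph analysis below arrives at by counting icons.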

Now we have everything in place to replicate the analysis with R (the `personograph` package can be found on CRAN). First with an infection rate of 5% (which is in the ballpark of the likely infection rate for most countries at the moment):

```r
library(personograph)
## Loading required package: grImport
## Loading required package: grid
## Loading required package: XML

colorlist <- list(FN = "darkred", TP = "red", FP = "darkgreen", TN = "green")

TPR <- 0.95 # true positive rate = sensitivity
TNR <- 0.95 # true negative rate = specificity
IR  <- 0.05 # infection rate

data <- list(FN = (1-TPR)*IR, TP = TPR*IR, FP = (1-TNR)*(1-IR), TN = TNR*(1-IR))
NPV <- round(100 * data$TN / (data$TN + data$FN), 1) # negative predictive value
PPV <- round(100 * data$TP / (data$TP + data$FP), 1) # positive predictive value

personograph(data, colors = colorlist,
             fig.title = paste0("PPV = ", PPV, "%, NPV = ", NPV, "%"),
             n.icons = 500, dimensions = c(20, 25))
```

So we can see that even with antibody tests with high values for sensitivity and specificity a positive test result is not better than the toss of a coin, given the current infection rate! Why? Because for every true positive case (TP, in red) there is a false positive one (FP, in dark green).

The situation gets better with an infection rate of 25%…

```r
IR <- 0.25 # infection rate

data <- list(FN = (1-TPR)*IR, TP = TPR*IR, FP = (1-TNR)*(1-IR), TN = TNR*(1-IR))
NPV <- round(100 * data$TN / (data$TN + data$FN), 1) # negative predictive value
PPV <- round(100 * data$TP / (data$TP + data$FP), 1) # positive predictive value

personograph(data, colors = colorlist,
             fig.title = paste0("PPV = ", PPV, "%, NPV = ", NPV, "%"),
             n.icons = 500, dimensions = c(20, 25))
```

…where you can be nearly 90% confident that a positive test result means that you have (had) the virus! All of this shows the massive impact infection rates can have on the predictive value of these tests for individuals.

If you still doubt this I give you another intuition: the less probable it is that you have (had) the virus (which is the case right now because of the low infection rate) the more evidence you will need to prove the contrary, either by better tests (which we either don’t have at the moment or which are prioritized for the sickest patients) or by additional tests (which are scarcely done because of still limited resources).

On the other hand, if you get a negative result you can be pretty sure that you indeed have not had the virus!

In concrete terms, what would a postdigital world look like in which people — both individually and as a society — used digital technologies sovereignly? With respect to the four primary areas of the techlash critique, a scenario for the future is emerging.

- We’ll only reach for our smartphones when there’s a reason and not because each new reflexive use creates new reasons for the next ritualistic unlocking of the home screen.
- Companies will no longer digitalize their production, internal processes and communications because digitalization is the first commandment of management at the moment, but instead rigorously test each step in the digitalization process against the most important criterion of value creation: What would this change really add to the bottom line? Sham innovations that in truth only complicate things — essential things like production, internal processes and communications — will be eliminated. Intelligent regulation will ensure that competition returns to digital markets and the superprofitable superstar firms will finally pay their fair share of taxes. That would help pay for the public infrastructure that the digital superstars also use in their highly profitable business models. If necessary, some essential digital services would be regulated much like water companies, natural gas suppliers and other basic utilities are today. At least the left wing of the Democratic Party is having similar thoughts, as exemplified by Elizabeth Warren and Bernie Sanders.
- Democratic discourse needs moderation online and offline. The value of opinion in discussion is significantly lower than commonly supposed by those who constantly give their own. Never fear, there will continue to be unmoderated discussion forums. PewDiePie will still have unfettered opportunity to riff on social topics (in between game play videos) on YouTube. But in the postdigital age, essential political debates will take place on discussion platforms where the discourse is serious, fact-based and nuanced, using real names and with pauses built in for reflection instead of a mad rush to respond. The fact check would be an inherent and accepted part of this culture of digital discussion. Truth would be negotiated via some kind of Super-Wikipedia. Even the powerful would express themselves on these platforms instead of through media rooted in visceral emotion like Twitter, or media designed for aesthetic self-promotion like Instagram. Perhaps software and platforms such as “Liquid Democracy” would come into use to enable new forms of direct, grass-roots decision making, especially at the local level. In the meantime, government and public administration will have learned to improve through digital technology.


- Technology is never good or bad. It depends on what we use it for. This statement, constantly repeated by tech-fixated idealists, is on the one hand naive. Technology is usually developed for a specific purpose in a socio-technical context. It’s more suitable for this purpose than for others, and therefore it’s not neutral. On the other hand, it’s of course still true: You can use machine learning to promote the sale of a new digital device with a dreadful environmental impact. Amazon bred its recommendation algorithms with just this kind of value-maximizing function. By using similar systems that learn from data, however, supply and demand could also be better balanced in decentralized energy networks.

In a green postdigital world, the fight against climate change will be a pressing goal, and perhaps the most important goal of new technological development. Perhaps in the fight against global warming, digitalization might even gain a second chance to radically improve the world: with decentralized energy networks, energy-efficient autonomous vehicles or a digitally controlled Cradle to Cradle (circular) economy. New green technologies might do more than merely delay climate change. Digitally supported innovations that promote resilience, that is, innovations that make it easier for people to deal with the actual consequences of climate change, will also become increasingly important.


The future can’t be predicted, but it can be created. What’s stopping us from embracing a postdigital future with a radical paradigm for evaluating technology’s costs and benefits? Or we could formulate the question positively: What would we need to make this vision of a postdigital future a reality? Gesche Joost, a scholar of design, Germany’s former Internet Ambassador and a member of SAP’s board of directors, answers the question with two words: “digital sovereignty.” But Joost also wishes that critiques of technology were more constructive. I share this view.

No matter how attractive or even justified the ideas of techlash may be, the perpetual demands for absolute data protection and absolute minimization of data use are as wearying as an endless wait on hold for the next available agent, and as unproductive. The answer to techlash, says Joost, can’t be for people to withdraw into a technological backwater. Constructive criticism of technology, on the other hand, will help us regain our digital sovereignty. This can only succeed if we develop digital technology that corresponds to our own desires and values, improve our ability to use it through digital education and finally turn the concept of a common digital market for like-minded countries into economic reality. More specifically, this means:

- Areas of the western world that have not invested enough in their own technological development, Europe in particular, need a massive and coordinated effort to turn things around. In the discussion of 5G cellphone networks and the potential for Chinese espionage or even sabotage by Huawei, an urgent question has received far too little attention: Why are there no European firms able to compete with Huawei on price and quality?
- Digital education needs a kickstart, beginning with elementary school. This includes data science and coding and understanding social media and the platform economy. The concepts and materials used for teaching these topics have been tested and proven to be successful, but unfortunately are rarely used in public schools. You’re more likely to find them used in private or non-profit initiatives such as those of the Open Knowledge Foundation.


- The third major area of action involves promoting innovation and regulation of digital markets. An important element of this is a consistent data policy. As described in the second section, if superstar firms construct data monopolies, then the data needs to be made open. But that’s only the first important step. Open technical standards, if necessary enforced by law, have the same aim. Creating massive policy incentives for voluntary sharing of data, data cooperation and cross-sector data pools need to be on the agendas of national legislatures and regulators. And, of course, digitalization’s biggest winners need to pay their corresponding share of taxes, whether in the form of a digital tax using the French model or in the form of a minimum global tax based on international agreements, as the German Foreign Ministry has proposed.

All of that is not just desirable, but the prerequisite for bringing the rebound effects of digitalization under control in the postdigital age.

**Thomas Ramge** is a research fellow at the Weizenbaum Institute in Berlin. He has authored more than a dozen nonfiction books, including *Who‘s Afraid of AI*, *Reinventing Capitalism in the Age of Big Data*, coauthored with Viktor Mayer-Schönberger, and *The Global Economy as You’ve Never Seen It*. Ramge has been honored with multiple journalism and literary awards, including the Axiom Business Book Award’s Gold Medal, the getAbstract International Book Award, strategy+business magazine’s Best Business Book of the Year (in Technology and Innovation), the Herbert Quandt Media Prize, and the German Business Book Prize. He lives in Berlin with his wife and son.

You can follow him on LinkedIn: Thomas Ramge.
