Over one billion dollars have been spent in the US to split up big schools into smaller ones because small schools regularly show up in rankings as top performers.

In this post, I will show you why that money was wasted because of a widespread (but not so well known) statistical artifact, so read on!

Why do small schools perform better? Many are quick to point out that an intimate, family-like atmosphere helps facilitate learning and smaller units can cater more to the special needs of their pupils. But is this really the case?

Let us conduct a little experiment where we distribute *randomly* performing pupils on one thousand schools, where half of them are big and the other half small. To be clear: we distribute those pupils by chance alone so that every school should have the same shot at getting top, average-, and low performers!

As a proxy for performance let us just take the *intelligence quotient (IQ)* which is defined to be normally distributed with a mean of 100 and a standard deviation of 15.

Let us now have a look at the fifty best performing schools:

n <- 1000 # no of schools sm <- 50 # no of pupils small school bg <- 1000 # no of pupils big school set.seed(12345) small <- matrix(rnorm(n/2 * sm, mean = 100, sd = 15), nrow = n/2, ncol = sm) big <- matrix(rnorm(n/2 * bg, mean = 100, sd = 15), nrow = n/2, ncol = bg) df_small <- data.frame(size = "small", performance = rowMeans(small)) df_big <- data.frame(size = "big", performance = rowMeans(big)) df <- rbind(df_small, df_big) df_order <- df[order(df$performance, decreasing = TRUE), ] df_order |> head(50) ## size performance ## 296 small 106.3838 ## 464 small 105.9089 ## 128 small 105.5734 ## 146 small 105.2287 ## 36 small 104.9394 ## 479 small 104.7963 ## 406 small 104.7407 ## 126 small 104.7316 ## 386 small 104.6500 ## 15 small 104.6092 ## 183 small 104.5761 ## 492 small 104.5029 ## 106 small 104.5003 ## 330 small 104.3905 ## 456 small 104.2457 ## 84 small 104.2373 ## 474 small 104.1203 ## 89 small 103.9073 ## 315 small 103.7264 ## 108 small 103.5928 ## 19 small 103.5878 ## 268 small 103.5433 ## 435 small 103.4099 ## 481 small 103.2921 ## 70 small 103.2447 ## 110 small 103.2089 ## 96 small 103.1939 ## 497 small 103.1700 ## 103 small 103.1609 ## 262 small 103.1533 ## 33 small 103.1376 ## 293 small 103.1008 ## 252 small 103.0873 ## 240 small 103.0657 ## 170 small 103.0553 ## 220 small 103.0485 ## 185 small 103.0196 ## 195 small 103.0042 ## 98 small 102.9625 ## 294 small 102.9349 ## 51 small 102.9339 ## 317 small 102.9308 ## 403 small 102.9258 ## 202 small 102.9255 ## 463 small 102.9072 ## 321 small 102.8631 ## 124 small 102.8468 ## 380 small 102.8341 ## 273 small 102.8147 ## 217 small 102.8005

In fact, all of them are small! How is that possible?

You are in for another surprise when you also look at the fifty worst performers:

df_order |> tail(50) ## size performance ## 221 small 97.43354 ## 271 small 97.43122 ## 420 small 97.42636 ## 77 small 97.40313 ## 141 small 97.38554 ## 192 small 97.38214 ## 45 small 97.36123 ## 331 small 97.35636 ## 133 small 97.15977 ## 400 small 97.12790 ## 350 small 97.09748 ## 161 small 96.97677 ## 395 small 96.97156 ## 17 small 96.95732 ## 50 small 96.94013 ## 353 small 96.77378 ## 376 small 96.63066 ## 140 small 96.60068 ## 80 small 96.59475 ## 115 small 96.56996 ## 362 small 96.52821 ## 16 small 96.35869 ## 344 small 96.29919 ## 290 small 96.23309 ## 41 small 96.21083 ## 246 small 96.13112 ## 345 small 96.06559 ## 104 small 96.03273 ## 425 small 96.02648 ## 32 small 96.01366 ## 7 small 95.99434 ## 364 small 95.99399 ## 397 small 95.95282 ## 374 small 95.92229 ## 18 small 95.80126 ## 184 small 95.73809 ## 127 small 95.65754 ## 270 small 95.60595 ## 356 small 95.39492 ## 433 small 95.33532 ## 475 small 95.30683 ## 66 small 95.30076 ## 445 small 94.94566 ## 61 small 94.88395 ## 500 small 94.85970 ## 442 small 94.85128 ## 6 small 94.56465 ## 347 small 94.43273 ## 261 small 94.23249 ## 171 small 93.82268

Again, all of them are small! The following plot reveals the overall pattern:

(df_order$performance - 100) |> barplot(main = "Sorted relative performance of schools", col = ifelse(df_order$size == "small", "blue", "red"), border = NA) abline(h = 0) legend("bottomleft", legend = c("Small schools", "Big schools"), fill = c("blue", "red"), bty = "n")

The statistical effect at work here is what my colleague Professor Howard Wainer from the University of Pennsylvania coined “The Most Dangerous Equation” in an article in the renowned American Scientist.

To understand this let us first look at some statistical measures:

summary(df_small$performance) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 93.82 98.59 100.06 100.01 101.42 106.38 summary(df_big$performance) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 98.58 99.71 100.02 99.99 100.32 101.19

We can see that, while both have the same mean performance, the performance variation is bigger for small schools than for big ones where the extreme cases are averaged out. *De Moivre’s equation* gives the exact relation in connection to the sample size, i.e. the number of pupils in this case:

where is the standard error of the mean, is the standard deviation of the sample and is the size of the sample.

Let us calculate those numbers for our little example and compare them to the actual standard deviations of the school performance:

sd(df_small$performance) ## [1] 2.144287 15 / sqrt(sm) ## [1] 2.12132 sd(df_big$performance) ## [1] 0.4677121 15 / sqrt(bg) ## [1] 0.4743416

We can see that those numbers fit very well.

Basically, de Moivre’s equation tells us that the smaller the sample size (= the number of pupils per school) the bigger the variation (= standard error/standard deviation) which results in the effect that small schools inhabit both extremes of the performance rankings. When you only look at the top performers you will falsely conclude that you have to split up big schools!

In fact, later studies (which took this effect into account) even found that larger schools are better performing after all because they are able to offer a wider range of classes with teachers who can focus on fewer subjects. The “de Moivre”-effect was hiding this and turned it on its head!

Other examples of this widespread effect include:

- Disease rates in small and big cities, e.g. for cancer or COVID-19 infections. Small cities simultaneously lead and lag statistics compared to large cities. Sometimes even one single case can make all the difference.
- Traffic safety rates: same here, small cities are at each end of the spectrum.
- Quality of hospitals: regularly small houses end up being at the top
*and*at the bottom. - Mutual fund performance: there are often large swings in the performance of small funds which lets them show up in rankings on both extremes.
- …and the list goes on and on.

Of course, commentators are always fast to find reasons for each extreme, e.g. in the case of cancer rates: either “clean living of the rural lifestyle — no air pollution, no water pollution, access to fresh food without additives”, and so on vs. “poverty of the rural lifestyle — no access to good medical care, a high-fat diet, and too much alcohol and tobacco”, where in fact both results are mainly due to the statistical artifact explained by de Moivre’s equation!

This equation is so dangerous exactly because people don’t know it yet its effects are so far-reaching.

Another example of what can be called “Only looking at the winners-fallacy” can be found here: How to be Successful! The Role of Risk-taking: A Simulation Study. For another interesting fallacy due to sampling see: Collider Bias, or Are Hot Babes Dim and Eggheads Ugly?.

Do you have other examples where the “de Moivre”-effect creeps up? Please share your thoughts in the comments.

]]>One of the most fiercely fought debates in quantitative finance is whether the stock market (or financial markets in general) is (are)

If you want to learn about an ingenious method (that is already present in anyone’s computer) to approach that question, read on!

The general idea of market efficiency is that markets are conceptionally information processing systems that incorporate all available information to arrive at the most accurate price, i.e. the best estimate of the current value, of a company. The only possibility for a price change is that new information becomes available. Because, as the name says, the information is *new*, it cannot be anticipated, and therefore it is impossible to beat the market: price changes are *un*predictable!

The other side of the debate argues that there are certain patterns hidden in the ups and downs of the charts and you only have to understand the underlying logic to make use of that. One prominent candidate is the so-called *technical analysis* that tries to discern all kinds of structures within the data, e.g. head and shoulders or double top/bottom reversal patterns, lines of support or resistance, and channels… and much more: price changes are *un*predictable!

So, is it *randomness* vs. *pattern recognition*, or *noise* vs. *signal*: who is right?

**Enter (algorithmic) information theory! **

One of the basic ideas of information theory is that random sequences are incompressible. Let us illustrate this with a *very* simple example. A simple compression method is called *run length encoding (RLE)*. It just computes the lengths and values of runs of equal values in a sequence:

# simple pattern rle(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)) ## Run Length Encoding ## lengths: int 10 ## values : num 1 # "random" rle(c(0, 1, 1, 0, 1, 0, 1, 0, 0, 1)) ## Run Length Encoding ## lengths: int [1:8] 1 2 1 1 1 1 2 1 ## values : num [1:8] 0 1 0 1 0 1 0 1

As you can see, the encoding of the simple pattern is much shorter than the encoding of the “random” series. Put another way there are way fewer “surprises” in the first series than in the second. Or in the lingo of information theory, the *algorithmic complexity (AC)* of the second series is much higher.

Now, this was a very simple pattern recognition engine, we all have something way more sophisticated on our computers.

**Enter the ZIP compression tool!**

Most of us use this little piece of software to compress files that we e.g. want to send over the internet. Only a few people know that it is a very advanced piece of technology, that is a master at spotting patterns in the files it is supposed to compress. Well, we are not the first to recognize this. In fact, there are many renowned papers out there that examine all kinds of complex dynamical systems with this little tool!

Concerning the exact inner working of the ZIP tool, we won’t go into the details but you can think of the general idea as a further development of RLE, where not only simple runs of values are being considered but all kinds of more complicated combinations/blocks of values. This is perfectly suited to spot any patterns in all kinds of data.

In the following analysis, we load the Standard & Poors price data series beginning of 1990 till today, transform it into returns, scale (detrend) it and “binarize” it into up and downtick data:

library(quantmod) ## Loading required package: xts ## Loading required package: zoo ## ## Attaching package: 'zoo' ## The following objects are masked from 'package:base': ## ## as.Date, as.Date.numeric ## Loading required package: TTR ## Registered S3 method overwritten by 'quantmod': ## method from ## as.zoo.data.frame zoo library(BMS) # for function bin2hex() getSymbols("^GSPC", from = "1990-01-01") ## 'getSymbols' currently uses auto.assign=TRUE by default, but will ## use auto.assign=FALSE in 0.5-0. You will still be able to use ## 'loadSymbols' to automatically load data. getOption("getSymbols.env") ## and getOption("getSymbols.auto.assign") will still be checked for ## alternate defaults. ## ## This message is shown once per session and may be disabled by setting ## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details. ## [1] "^GSPC" returns <- GSPC |> Cl() |> ROC() |> coredata() |> na.omit() if (length(returns) %% 8 != 0) returns <- returns[(length(returns) %% 8 + 1):length(returns)] # trim for hex conversion returns_zscores <- returns |> scale() returns_zscores_bin <- ifelse(returns_zscores > 0, 1, 0) returns_zscores_bin |> as.vector() |> head(100) ## [1] 0 0 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 0 0 ## [38] 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 0 1 1 0 1 1 0 0 ## [75] 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0

After that we transform the binary data further into chunks of hexadecimal data because this is the format we need later on:

returns_zscores_hex <- returns_zscores_bin |> bin2hex() |> {\(x) substring(x, seq(1, nchar(x), 2), seq(2, nchar(x), 2))}() returns_zscores_hex |> head(100) ## [1] "12" "5a" "0b" "57" "07" "d5" "78" "f1" "1b" "03" "7f" "f3" "e7" "c3" "0e" ## [16] "3e" "de" "4c" "c3" "b9" "1a" "e9" "30" "9d" "46" "74" "5c" "e9" "59" "e5" ## [31] "1a" "b0" "37" "9c" "dd" "ac" "d5" "04" "2e" "2a" "7c" "63" "cd" "36" "f8" ## [46] "16" "33" "4c" "e4" "2e" "23" "8f" "20" "1a" "e8" "c2" "fa" "0e" "1b" "92" ## [61] "48" "8c" "bf" "e4" "69" "45" "8a" "35" "11" "9d" "89" "b3" "e1" "3a" "34" ## [76] "cb" "54" "71" "4d" "5f" "42" "7c" "23" "86" "b1" "aa" "6e" "0b" "7c" "f4" ## [91] "ac" "77" "8e" "06" "24" "4b" "8e" "56" "30" "dd"

To have some comparison we again create a long sequence of ones, as in our first example (here directly in hexadecimal form)

ones_hex <- rep(1, length(returns)) |> bin2hex() |> {\(x) substring(x, seq(1, nchar(x), 2), seq(2, nchar(x), 2))}() ones_hex |> head(100) ## [1] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" ## [16] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" ## [31] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" ## [46] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" ## [61] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" ## [76] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" ## [91] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff"

As the last sequence we create a pseudo-random sequence of zeros and ones:

set.seed(123) pseudorandom_hex <- sample(c(0, 1), length(returns), replace = TRUE) |> bin2hex() |> {\(x) substring(x, seq(1, nchar(x), 2), seq(2, nchar(x), 2))}() pseudorandom_hex |> head(100) ## [1] "17" "3a" "84" "35" "61" "61" "21" "a6" "42" "1b" "dc" "b6" "46" "c7" "9d" ## [16] "46" "2f" "ee" "ed" "40" "74" "bc" "78" "66" "65" "ee" "e7" "a8" "c9" "be" ## [31] "51" "d8" "2e" "a0" "50" "c7" "41" "cd" "22" "bc" "49" "70" "86" "88" "91" ## [46] "8a" "8c" "97" "fa" "03" "02" "be" "80" "31" "aa" "37" "ee" "da" "68" "c2" ## [61] "f6" "d7" "8f" "0a" "bb" "d4" "39" "e3" "9a" "ef" "6f" "a9" "64" "99" "53" ## [76] "e1" "4b" "bf" "81" "47" "d9" "43" "d2" "0e" "3d" "16" "a7" "f8" "8d" "5c" ## [91] "f9" "41" "a2" "17" "1a" "bf" "c9" "30" "3d" "47"

And now for the great finale, we compress all three sequences with *gzip* and calculate the compression rate:

n <- length(returns_zscores_hex) round(length(memCompress(as.raw(as.hexmode(ones_hex)))) / n * 100, 2) ## [1] 1.7 round(length(memCompress(as.raw(as.hexmode(pseudorandom_hex)))) / n * 100, 2) ## [1] 101.1 round(length(memCompress(as.raw(as.hexmode(returns_zscores_hex)))) / n * 100, 2) ## [1] 101.1

Now, that is interesting: while the sequence of ones is compressed by nearly 98% of its original size two things jump out at us:

- The compression rate of the (pseudo-)random sequence and market sequence are the same!
- Both are
*over*100%!

The first point means that up and downtick market data are indistinguishable from randomness, the second point is due to the fact that the zipped data contains some additional metadata. Because no compression was possible (= randomness) this boils down to an *inflation* of the original size!

Does that mean, that markets are 100% efficient (= random)? It is at least another indication.

There remain some loopholes though:

- We were only looking at up and downtick data. It is a well-known (stylized) fact that certain market regimes exist. Taking volatility data into account could change the picture.
- We examine the whole time series at once. It could very well be that there are some
*pockets of predictability*when we slice it up into smaller subsequences (e.g. yearly windows). - We only look at one example, it could be that other indices, e.g. from developing countries, or single stocks are less efficient.
- Bascially we are only looking at technical analysis. Taking other information into account, from other markets, company information, the economy, etc., i.e.
*fundamental analysis*, could also bring back some predictability.

Still, I think this is an interesting analysis with quite an unexpected result. Who would have thought that such an innocuously looking tool has such analytical power!

Please share your thoughts on market efficiency and this analysis, or even the results of your own analyses, with us in the comments.

]]>What is the

Many real-world processes have this self-reinforcing property, e.g. leading to the distribution of wealth or the number of followers on social media. If you want to learn how to simulate such a process with R and encounter some surprising results, read on!

The basic idea of the Pólya urn model is to return two balls of the same colour to the urn which you have just drawn. Technically this process results asymptotically in a Dirichlet distribution. While sampling with and without replacement already exists in R (the `sample()`

function and its argument `replace = TRUE`

or `FALSE`

) it is a nice little coding exercise to implement this model.

The following function takes as arguments the colours of the balls at the start `cols`

and the number of draws `n`

(technically n-1 draws because the initial filling of the urn is defined as the first step):

# needs R 4.1.0 or higher to run polya_urn <- function(cols = c("black", "blue", "green", "red", "yellow"), n = 100) { urn <- cols |> table() |> as.matrix() |> t() urn <- rbind(urn, matrix(NA, nrow = n-1, ncol = ncol(urn))) cols_unique <- colnames(urn) # n-1 draws from Dirichelet distribution for (i in seq_len(n-1)) { urn[i+1, ] <- urn[i, ] col_sample <- sample(cols_unique, size = 1, prob = (urn[i, ] / sum(urn[i, ]))) # sample ball urn[i+1, col_sample] <- urn[i+1, col_sample] + 1 # add ball with same colour } plot(x = NA, xlim = c(1, n), ylim = c(1, ceiling(max(urn[n, ], na.rm = TRUE) / 10)*10), main = "Polya urn model", xlab = "no. of draws", ylab = "no. of balls", type = "n") sapply(cols_unique, \(x) lines(urn[ , x], col = x, lwd = 2)) invisible(urn) } polya_urn()

As expected little random differences result in one colour dominating the rest, in this case, green. Please note that results can and will differ because of the involved randomness. To get a feel for the possible, sometimes surprisingly complex, dynamics, let’s run some more simulations:

polya_urn()

Again, green but this time with a strong contender, black! Now, let us increase the number of draws to 200:

polya_urn(n = 200)

The effect is now much more pronounced, blue leaves all other colours clearly behind. Let’s do this one more time:

polya_urn(n = 200)

This one is really interesting! Black and red are neck and neck! Obviously, there is more going on than just one colour dominating the rest from start to finish.

Now, let us make the initial filling more pronounced, three balls of blue instead of one:

polya_urn(cols = c("black", "blue", "blue", "blue", "green", "red", "yellow"))

Clearly, blue is the winner. But that does not have to be the case, here with three red balls instead of one:

polya_urn(cols = c("black", "blue", "green", "red", "red", "red", "yellow"))

At the beginning (till about step 20) red and blue are neck and neck, then blue starts to dominate and red and green are neck and neck. After about step 90 green even overtakes red!

Sometimes, even in these “biased” models, randomness takes its toll and we see more complicated dynamics playing out. This interplay of randomness and the self-reinforcing property makes these models so interesting and fascinating.

I hope that you enjoyed this post and it got you thinking. Please share your thoughts and the results of your own experiments in the comments below!

]]>The Bundesliga is Germany’s primary football league. It is one of the most important football leagues in the world, broadcast on television in over 200 countries.

If you want to get your hands on a tool to forecast the result of any game (and perform some more statistical analyses), read on!

The basis of our forecasting tool was laid in this blog post: Euro 2020: Will Switzerland kick out Spain too?. There we also explained the methodology. For this post, we adapted the parameters for the Bundesliga (the sources are given in the code below) to forecast the result of the upcoming game Herta BSC (Berlin) against the international top team Bayern Munich on August 28 as an example. The tool can also easily be adapted to other football leagues, e.g. the English Premier League.

On top of that, we made the model even more accurate by adding a home advantage. This effect is surprisingly stable across the main European football leagues at about 0.4 goals extra for the home team. By the way: in times of Corona, when no spectators were allowed in the stadiums, the home advantage disappeared!

Another thing we added is a probability calculation for all possible outcomes. We do this by assuming that the goals scored for each team are independent of each other (it can be discussed whether this is a reasonable assumption) so that all marginal probabilities can just be multiplied. This can easily be done in R with the `outer()`

(product) function (= `%o%`

). The most probable outcome can then easily be extracted:

mean_total_score <- 3.03 # https://de.statista.com/statistik/daten/studie/1622/umfrage/bundesliga-entwicklung-der-durchschnittlich-erzielten-tore-pro-spiel/ # https://www.transfermarkt.de/bundesliga/marktwerteverein/wettbewerb/L1 team1 = "Bayern Munich"; colour1 <- "red" ; value1 <- 818.5 # rows team2 = "Herta BSC" ; colour2 <- "blue"; value2 <- 176.75 # columns # https://www.saechsische.de/mehr-auswaerts-tore-bei-geisterspielen-5219318.html ratio <- value1 / (value1 + value2) mean_goals1 <- ratio * mean_total_score + 0.4 # 0.4 = home advantage mean_goals2 <- (1 - ratio) * mean_total_score - 0.4 goals <- 0:7 prob_goals1 <- dpois(goals, mean_goals1) prob_goals2 <- dpois(goals, mean_goals2) probs <- round((prob_goals1 %o% prob_goals2) * 100, 1) # outer product colnames(probs) <- rownames(probs) <- goals parbkp <- par(mfrow=c(1, 2)) max_ylim <- max(prob_goals1, prob_goals2) plot(goals, prob_goals1, type = "h", ylim = c(0, max_ylim), xlab = team1, ylab = "Probability", col = colour1, lwd = 10) plot(goals, prob_goals2, type = "h", ylim = c(0, max_ylim), xlab = team2, ylab = "", col = colour2, lwd = 10) title(paste(team1, paste(goals[which(probs == max(probs), arr.ind = TRUE)], collapse = ":"), team2), line = -2, outer = TRUE) par(parbkp)

So, the most probable outcome will be Bayern Munich 2:0 Herta BSC. Let us have a look at the probabilities in more detail:

probs ## 0 1 2 3 4 5 6 7 ## 0 4.8 0.7 0.0 0 0 0 0 0 ## 1 14.0 1.9 0.1 0 0 0 0 0 ## 2 20.2 2.8 0.2 0 0 0 0 0 ## 3 19.5 2.7 0.2 0 0 0 0 0 ## 4 14.1 1.9 0.1 0 0 0 0 0 ## 5 8.1 1.1 0.1 0 0 0 0 0 ## 6 3.9 0.5 0.0 0 0 0 0 0 ## 7 1.6 0.2 0.0 0 0 0 0 0

The number of goals of Bayern Munich is in the rows, Herta BSC is in the columns. The 2:0 result has a probability of over twenty percent, which is quite high. But even a result of 3:0 still has a probability of nearly 20 percent!

To calculate the overall probabilities for a win for each team and a draw we can conveniently use the `lower.tri()`

, `upper.tri()`

, and `diag()`

functions:

sum(probs[lower.tri(probs)]) # probability team 1 wins ## [1] 91 sum(diag(probs)) # probability for a draw ## [1] 6.9 sum(probs[upper.tri(probs)]) # probability team 2 wins ## [1] 0.8

So, to answer the original question, Herta BSC’s chance to beat Bayern Munich is below one percent: they need nothing less than a miracle to win in Munich!

**DISCLAIMER**

*This post is written on an “as is” basis for educational purposes only and comes without any warranty. The findings and interpretations are exclusively those of the author and are not endorsed by or affiliated with any third party.*

In particular, this post provides no sports betting advice! No responsibility is taken whatsoever if you lose money.

*(If you make any money though I would be happy if you would buy me a coffee… that is not too much to ask, is it? )*

**UPDATE August 28, 2021**

Bayern Munich won 5:0! As shown in the post it was to be expected that they win but this result is a real demonstration of power: the probability for this result was only 8.1%… but this is football!

In view of the current dramatic events in Afghanistan many wonder why the extensive international efforts to bring some stability to the country have failed so miserably.

In this post, we will present and analytically examine a fascinating theory that seems to be able to explain political (in-)stability almost mono-causally, so read on!

The theory we are talking about states that a so-called “youth bulge”, which describes an *excess in the young population of a country*, is the real (and main) hidden force behind political instability. The term was first used by Gary Fuller in 1995, then adopted by the renowned political scientist Samuel Huntington, and later extended and popularized by German sociologist and economist Gunnar Heinsohn in his book “Söhne und Weltmacht” (2003, new ed. 2019).

The basic idea is that resources are limited and are owned and managed by (older) adults. This is also true for (political) power. Young people want a piece of the action. The bigger the imbalance, i.e. the more young people there are, the bigger the conflicts because more young people fight for their place in the economy and society. If there is a balance the transition of power and resources can be organized more smoothly and if a society is even dominated by older people (like e.g. in Germany) the young just don’t have a say in those matters

Professor Heinsohn provides several convincing (and also many historical) examples in his book, here we will (as always) take a current data-centric approach and use official data from the World Bank to see whether there is some merit to the theory. We will also look at the case of Afghanistan (and Germany) in this context.

There are several definitions for “youth bulge” out there, we will use the following: “Proportion of age group 0-14 (% of total population)” whose data can directly be downloaded from the World Bank via the excellent `WDI`

package (on CRAN) which makes use of their official API. Heinsohn uses the age group 15-24 (and especially males) but both proportions are highly correlated and those numbers are much harder to come by.

The other indicator we need is an estimate of the political stability of a country, we will also use the official World Bank indicator for that: “Political Stability And Absence Of Violence/Terrorism: Estimate (PV.EST)”:

Political Stability and Absence of Violence/Terrorism measures perceptions of the likelihood of political instability and/or politically-motivated violence, including terrorism. Estimate gives the country’s score on the aggregate indicator, in units of a standard normal distribution, i.e. ranging from approximately -2.5 to 2.5.

So, let us get our hands on the data and do some inspection of it first (the 2020 data are not yet available at the time of writing but that should not bother us because we are concerned with a relatively robust demographic structure here):

# needs R 4.1.0 or higher to run library(WDI) age0_14 <- WDI(indicator = "SP.POP.0014.TO.ZS", start = 2019, end = 2019) # proportion of 0-14 year olds pol_stab <- WDI(indicator = "PV.EST", start = 2019, end = 2019) # political stability indicator data <- merge(age0_14, pol_stab)[c(2, 4, 5)] |> na.omit() colnames(data) <- c("country", "age0_14", "pol_stab") data |> summary() ## country age0_14 pol_stab ## Length:188 Min. :12.33 Min. :-2.76829 ## Class :character 1st Qu.:17.62 1st Qu.:-0.64972 ## Mode :character Median :26.20 Median :-0.08217 ## Mean :27.43 Mean :-0.10738 ## 3rd Qu.:37.10 3rd Qu.: 0.69295 ## Max. :49.84 Max. : 1.65590 data[order(data$pol_stab), ] |> head(15) ## country age0_14 pol_stab ## 201 Yemen, Rep. 39.22337 -2.768294 ## 175 Syrian Arab Republic 31.05803 -2.727996 ## 3 Afghanistan 42.47227 -2.649407 ## 113 Libya 28.06828 -2.565753 ## 86 Iraq 38.02093 -2.564838 ## 172 South Sudan 41.56805 -2.558902 ## 170 Somalia 46.37951 -2.383227 ## 147 Pakistan 35.05438 -2.246336 ## 34 Central African Republic 43.92002 -2.182968 ## 121 Mali 47.30470 -2.153436 ## 150 West Bank and Gaza 38.64966 -1.942078 ## 135 Nigeria 43.68753 -1.934674 ## 33 Congo, Dem. Rep. 46.00140 -1.808007 ## 87 Iran, Islamic Rep. 24.65466 -1.698151 ## 162 Sudan 40.15571 -1.673614 data[order(data$age0_14), ] |> head(15) ## country age0_14 pol_stab ## 77 Hong Kong SAR, China 12.32671 -0.2677673 ## 164 Singapore 12.33150 1.5316020 ## 92 Japan 12.57303 1.0412250 ## 99 Korea, Rep. 12.74644 0.4780881 ## 89 Italy 13.16666 0.4626993 ## 151 Portugal 13.25209 1.1282480 ## 154 Qatar 13.59590 0.7046954 ## 47 Germany 13.79949 0.5828653 ## 72 Greece 13.89660 0.2909985 ## 124 Macao SAR, China 14.01947 1.2506430 ## 126 Malta 14.32879 1.0901870 ## 10 Austria 14.36186 0.9801227 ## 81 Hungary 14.43322 0.7325271 ## 79 Croatia 14.56222 0.7554738 ## 57 Spain 14.57687 0.3228728

Altogether we have data for 188 countries: the first table shows the first 15 countries ordered by political instability. As we can see, those are the countries that keep making it into the news because of violence, terrorism, and war. It is notable that the proportion of young people is consistently very high (around 40%). Afghanistan is in third position here!

The second table shows the first 15 countries ordered by the proportion of young people: With the notable exception of Hong Kong, those are all politically stable countries.

Now let us analyze the dependence structure of those two variables:

plot(data$age0_14, data$pol_stab, xlab = "Proportion of young people (in %)", ylab = "Political stability", main = "Youth bulge") lm(pol_stab ~ age0_14, data = data) |> abline(col = "blue", lwd = 3) country <- "Afghanistan" points(data$age0_14[data$country == country], data$pol_stab[data$country == country], col = "red", lwd = 8) text(data$age0_14[data$country == country], data$pol_stab[data$country == country], labels = country, pos = 4)

cor.test(data$age0_14, data$pol_stab) ## ## Pearson's product-moment correlation ## ## data: data$age0_14 and data$pol_stab ## t = -9.382, df = 186, p-value < 2.2e-16 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## -0.6566176 -0.4610512 ## sample estimates: ## cor ## -0.5667656

As we can see, both variables are indeed highly negatively (nearly -0.6) and significantly correlated! As we know correlation doesn’t necessarily mean causation but it is a clear indication that there really is some kind of (linear) association between a high proportion of young people and political unrest.

The red dot represents Afghanistan which unfortunately seems to be a (negative) role model for this phenomenon. The theory could be interpreted in a way that the radical Taliban are not the “real” reason for the political unrest but that the youth bulge is the hidden force behind terrorism and political instability.

To bring the point home, the Taliban would according to this theory “only” be some kind of historically path-dependent but ultimately contingent representation of political unrest. If it weren’t for the Taliban there would be some other group of political extremists spreading fear and terror.

Let us finally determine a *tipping point* (a.k.a. *cut-off value*) for the proportion of young people which signifies the shift from stable to unstable political circumstances. We will use the `OneR`

package (on CRAN) for that:

library(OneR) data$pol_stab_bin <- cut(data$pol_stab, breaks = c(-Inf, 0, Inf), labels = c("unstable", "stable")) optbin(pol_stab_bin ~ age0_14, data = data, method = "infogain") |> OneR() |> summary() ## ## Call: ## OneR.data.frame(x = optbin(pol_stab_bin ~ age0_14, data = data, ## method = "infogain")) ## ## Rules: ## If age0_14 = (12.3,23.8] then pol_stab_bin = stable ## If age0_14 = (23.8,49.9] then pol_stab_bin = unstable ## ## Accuracy: ## 149 of 188 instances classified correctly (79.26%) ## ## Contingency table: ## age0_14 ## pol_stab_bin (12.3,23.8] (23.8,49.9] Sum ## unstable 15 * 82 97 ## stable * 67 24 91 ## Sum 82 106 188 ## --- ## Maximum in each column: '*' ## ## Pearson's Chi-squared test: ## X-squared = 62.242, df = 1, p-value = 3.037e-15

The tipping point is at about 24%, which is lower than the often stated number 30% for the age group 0-15 (which is one year more). This number is highly significant and gives an accuracy of nearly 80% when using it as a forecasting instrument for political instability.

On the other hand, this would be good news for countries like Germany (under 14%), where the so-called “Querdenker” COVID protest movement makes some headlines. As serious as one has to take this (German intelligence agencies watch the movement already), the risk of substantial political unrest is very low.

What do you think about the youth bulge theory? Please let us know in the comments!

]]>Over the course of the last two and a half years, I have written over one hundred posts for my blog “Learning Machines” on the topics of data science, i.e. statistics, artificial intelligence, machine learning, and deep learning.

I use many of those in my university classes and in this post, I will give you the first part of a learning path for the knowledge that has accumulated on this blog over the years to become a well-rounded data scientist, so read on!

We start by explaining why R is the best choice to do data science. This is still one of my most popular posts!

Why R for Data Science – and not Python?

This post will get you up to speed by introducing you to important concepts of R.

Learning R: The Ultimate Introduction (incl. Machine Learning!)

Data science is all about building good models, so we start by building a very simple *linear model* here and learn about the important concept of *overfitting*.

Learning Data Science: Modelling Basics

In all of data science the concept of *correlation* is an important one, so you should develop some intuition about it.

Causation doesn’t imply Correlation either

This post shows that even with simple correlations powerful models can be built.

The OneR package (which I developed) is a very simple *classification* algorithm which can serve as a great use case for machine learning.

One Rule (OneR) Machine Learning Classification in under One Minute

A few more examples of how to use the OneR classification algorithm to gain valuable insights from your data.

OneR – Fascinating Insights through Simple Rules

A consistent example where we apply three classification algorithms (OneR, *decision trees*, *random forests*) on the same data set to understand the *Accuracy-Interpretability Trade-Off*.

Learning Data Science: Predicting Income Brackets

We are now in a good position to really understand what *Artifical Intelligence* is all about!

Diving even deeper into the concepts of Artificial Intelligence, *Machine Learning* and *Deep Learning* – written by an advanced AI!

The Most Advanced AI in the World explains what AI, Machine Learning, and Deep Learning are!

Everybody is talking about *Neural Networks* (a.k.a. deep learning), we explain how the underlying technology works!

Understanding the Magic of Neural Networks

One of the biggest problems of neural networks is that they are *black boxes*. One remedy is *Explainable AI (XAI)* which renders those systems interpretable.

Explainable AI (XAI)… Explained! Or: How to whiten any Black Box with LIME

We already covered simple *linear regression*. Here we introduce another linear model, *logistic regression*, and explain its connection with neural networks.

Logistic Regression as the Smallest Possible Neural Network

Learn about a fascinating real-life application of logistic regression.

Learning Data Science: The Supermarket knows you are pregnant before your Dad does

Another fascinating real-life application of yet another linear model, the *LASSO*, and how it got Trump elected (kind of)!

Cambridge Analytica: Microtargeting or How to catch voters with the LASSO

**Bonus Post**

Now that we are already talking about some controversial uses of AI, let us dive a little bit deeper into what AI could mean for us and for the society as a whole.

Thomas Ramge: Postdigital (Book Excerpt)

Many data science-techniques are based on some form of *distance measure*. Here we learn about the *k-nearest neighbours (KNN)* algorithm.

Teach R to read handwritten Digits with just 4 Lines of Code

Another distance measure-based algorithm in the area of *unsupervised learning* is the *k-means clustering algorithm*.

Learning Data Science: Understanding and Using k-means Clustering

Besides supervised and unsupervised learning there is a third category which is getting more and more popular (especially in combination with neural networks): *Reinforcement Learning*.

Reinforcement Learning: Life is a Maze

A glimpse into another important topic: *Sentiment Analysis* which is an important application area of *Natural Language Processing (NLP)*.

Learning Data Science: Sentiment Analysis with Naive Bayes

An important real-world application in the area of *Customer Relationship Management (CRM)*: how to retain your customers!

Data Science on Rails: Analyzing Customer Churn

Another important real-world application in the area of Finance: how to get your money back!

Will I get my Money back? Credit Scoring with OneR

Quite a controversial application in the area of law enforcement: how to only let those guys free who have become good citizens again.

Recidivism: Identifying the Most Important Predictors for Re-offending with OneR

Medical research has become more and more important for the application of machine learning (I also provide a link to an open-source paper I published on Covid-19)!

OneR in Medical Research: Finding Leading Symptoms, Main Predictors and Cut-Off Points

Here we use clustering to find a pattern in the Greek Gospels only Biblical Scholars normally know about – fascinating stuff!

With this application of sentiment analysis you can automatically extract plots from novels!

Extracting Basic Plots from Novels: Dracula is a Man in a Hole

**Bonus Post**

After all that nerdy stuff some practical advice on how to date – which proves the point that data scientist is the sexiest job of the 21st century!

Cupid’s Arrow: How to Boost your Chances at Speed Dating!

If you work through this learning path you will have a good basic understanding of important foundations, techniques, and real-life applications of machine learning.

Yet this is only a selection of topics covered on this blog so far. Feel free to discover other gems around here… and please share your feedback and suggestions in the comment section below.

In the future, I plan to publish another post on a learning path for *statistics* which is also an important foundation for becoming a data scientist but which would have overwhelmed this post, so stay tuned!

**UPDATE September 20, 2021**

I created a video for this post (in German):

Everybody is talking about

If you want to learn how to determine the range of the typical value of a dataset (i.e. the *median*) with just five values and why this works, read on!

This blog post is inspired by a chapter from the wonderful book “Alles kein Zufall! Liebe, Geld, Fußball” (“No coincidence! Love, Money, Football”, only available in German at the moment) by my colleague Professor Christian Hesse from the University of Stuttgart, Germany.

Let us dive directly into the matter, the *Small Data Rule* states:

In a sample of five numerical values from any unknown population, the median of this population lies between the smallest and the largest sample value with 94 percent certainty.

The “population” can be anything, like data about age in a population, income in a country, television consumption, donation amounts, body sizes, temperatures and so on.

The median is the “middle value” and thereby a good representation of a population’s “typical value”. It is calculated by sorting all of the values and then dividing them into two halves of the same size. The value that lies exactly between those two halves is the median. Contrary to the *mean* (often simply called the “average”) the median is robust with regard to outliers:

x <- 0:10 median(x) ## [1] 5 mean(x) ## [1] 5 x <- c(0:9, 10000) median(x) ## [1] 5 mean(x) ## [1] 913.1818

Obviously, the median is quite useful for getting a quick overview of a large dataset. So, it seems almost magical that you could determine the range of it by just five randomly drawn numbers. Yet, the rationale is quite straightforward:

The probability of drawing a random value from a population that is *above* the median is 50 percent or 1/2. The probability that all five values are above the median is 1/2 x 1/2 x 1/2 x 1/2 x 1/2. Of course, this is the same probability that all of those values are *below* the median. To cover both cases just add those probabilities.

But we are interested in the complementary event, i.e. that at least one value lies on each side of the median so that we get an interval that encloses it. We get that by subtracting the above probability from one:

1-2*(0.5^5) ## [1] 0.9375

The result is a high degree of certainty of nearly 94% that this will indeed be the case!

If you don’t believe this let us conduct a little experiment for illustrative purposes. We enumerate all possibilities of drawing five values from the range of zero to one hundred and see how often the median (= 50) falls within the interval of the minimum and the maximum of the samples (to understand how to do this, this post might be helpful: Learning R: Permutations and Combinations with Base R).

Beware, the following code will run for quite a while (about three to four minutes on an average computer) because there are nearly 80 million possibilities that have to be created and after that evaluated:

# needs at least version 4.1.0 of R M <- combn(0:100, 5) between <- apply(M, 2, \(x) min(x) < 50 && max(x) > 50) sum(between) / ncol(M) ## [1] 0.9406869

As you can see: 94% indeed! (The resulting value is not exactly the same as above because it only asymptotically reaches that value the bigger the underlying population is.)

Professor Hesse gives a nice example of how to use the small data rule in practice:

The manager of a company is interested in the distance his employees have to commute to work. He plans to open another branch if the distances are too long for many. He could, of course, ask his entire staff about the distance to their place of residence. That would be costly, generate a lot of data and provide more information than the manager actually needs. Instead, he surveys only five randomly selected employees. They live 7, 19, 13, 18, and 9 km away from the company. Thus, the manager can be 94 per cent sure that his employees have to commute a median distance of 7 to 19 kilometres to the company. He considers this acceptable and decides against an additional location.

As an aside, not many people know the `range`

function which might come in handy in contexts like these:

range(c(7, 19, 13, 18, 9)) ## [1] 7 19

So you see, small data can help you determine the big picture!

For another handy tool to infer whether something unusual is going on see this post: 3.84 or: How to Detect BS (Fast).

]]>In this post, we will first give some intuition for and then demonstrate what is often called the most beautiful formula in mathematics, Euler’s identity, in R – first numerically with base R and then also symbolically, so read on!

Euler’s identity (also known as Euler’s equation) is the equality:

where

- is Euler’s number, the base of natural logarithms
- is the imaginary unit, which satisfies
- is the ratio of the circumference of a circle to its diameter

It is often credited as the most beautiful formula in mathematics, nerds sport it on T-shirts and even get tattoos with it.

It combines three of the basic arithmetic operations: addition, multiplication, and exponentiation and links five fundamental mathematical constants:

- The number 0
- The number 1
- The number
- The number
- The number the imaginary unit of the complex numbers

We won’t go into the mathematical details here (when you google it you can find literally thousands of posts, articles, videos, and even whole books on it) but just give you some hand-waving intuition: as stated above is the ratio of the circumference of a circle to its diameter, which means that when you have a radius of 1 (= unit circle) you will need to go full circle. This is illustrated in the following animation:

Many of you know the exponential function (thanks to Covid anyway) which is nothing else but taking Euler’s number to some power. Something magical happens when you take imaginary/complex instead of the “normal” real numbers: the exponential function starts to rotate:

As we have seen above a rotation by boils down to a rotation by degrees. So when you start at 0 and put that into the exponential function you get 1 (because ) and when you then do a one-eighty (=) you will end up at -1. To get to the right-hand side of the identity you just have to add 1 to that -1 which equals 0. So Euler’s identity basically means:

**When you turn around, you will look in the opposite direction!**

Seen this way, it is easy, isn’t it!

Now for the R part. The following is the original task from Rosetta Code (for more solved Rosetta code tasks see the respective Category: Rosetta Code on this blog):

Show in your language that Euler’s identity is true. As much as possible and practical, mimic the Euler’s identity equation.

Most languages are limited to IEEE 754 floating point calculations so will have some error in the calculation.

If that is the case, or there is some other limitation, show that is approximately equal to zero and show the amount of error in the calculation.

If your language is capable of symbolic calculations, show that is exactly equal to zero for bonus kudos points.

First, as always, you should give it a try yourself…

…and now for the solution!

For coding the left-hand side of the identity we have to know the following:

- is not a built-in constant. Instead, the exponential function
`exp()`

is used (if you want to get Euler’s number just use`exp(1)`

) - R can handle complex numbers! You can use the
`complex()`

function for that, or the`Re()`

and`Im()`

functions for the real and the imaginary parts. In this case it is even easier because we will only need and this is exactly the way we code it in R:`1i`

! - is an built-in constant:
`pi`

Putting it all together:

exp(1i * pi) + 1 ## [1] 0+1.224606e-16i

Besides the small rounding error, this is the whole solution!

Now for the symbolic solution to also get the bonus kudos points.

We will use the fantastic `Ryacas`

package (on CRAN) to finish the job (for an introduction to this package see: Doing Maths Symbolically: R as a Computer Algebra System (CAS)).

library(Ryacas) ## ## Attaching package: 'Ryacas' ## The following object is masked from 'package:stats': ## ## integrate ## The following objects are masked from 'package:base': ## ## %*%, diag, diag<-, lower.tri, upper.tri as_r(yac_str("Exp(I * Pi) + 1")) ## [1] 0

And this finishes the task. It is also the solution I contributed to Rosetta code.

Isn’t maths beautiful! And isn’t R beautiful!

]]>One of the big sensations of the UEFA Euro 2020 is that Switzerland kicked out world champion France. We take this as an opportunity to share with you a simple statistical model to predict football (soccer) results with R, so read on!

Football is a highly stochastic game, which is one of the reasons for its appeal: anything can happen! But there are still some known patterns that can be used to build a predictive model.

First, it is well known that the probability for the number of goals in a game can be well approximated by a *Poisson distribution*.

Second, it is also well known that one of the best predictors of the strength of a team is its *market value* (in German there is the saying “Geld schießt Tore”, which translates to “money scores goals”). We can find the market value of the different teams e.g. here: transfermarkt.de.

The third ingredient that we need is the *average number of goals scored per game*. Wikipedia tells us that this is about 2.8 for the current tournament.

The main idea is to divide this average number according to the market values of both teams to get the average number of goals per team and feed that into two Poisson distributions to determine the probabilities for each potential number of goals scored:

mean_total_score <- 2.8 # https://en.wikipedia.org/wiki/UEFA_Euro_2020_statistics # https://www.transfermarkt.de/europameisterschaft-2020/teilnehmer/pokalwettbewerb/EM20 country1 = "Switzerland"; colour1 <- "red" ; value1 <- 0.29 country2 = "Spain" ; colour2 <- "orange" ; value2 <- 0.92 ratio <- value1 / (value1 + value2) mean_goals1 <- ratio * mean_total_score mean_goals2 <- (1 - ratio) * mean_total_score prob_goals1 <- dpois(0:7, mean_goals1) prob_goals2 <- dpois(0:7, mean_goals2) parbkp <- par(mfrow=c(1, 2)) max_ylim <- max(prob_goals1, prob_goals2) plot(0:7, prob_goals1, type = "h", ylim = c(0, max_ylim), xlab = country1, ylab = "Probability", col = colour1, lwd = 10) plot(0:7, prob_goals2, type = "h", ylim = c(0, max_ylim), xlab = country2, ylab = "", col = colour2, lwd = 10) title(paste0(country1, " ", which.max(prob_goals1) - 1, ":", which.max(prob_goals2) - 1, " ", country2), line = -2, outer = TRUE) par(parbkp)

So the most probable prediction is that Spain will win this one clearly… but you never know! And apart from those hard numbers, many will root for the underdog anyway (me too because I have a special relationship with Switzerland since I did my Ph.D. there at the University of St. Gallen and still have a lot of friends from that time).

I have played around with this simple model for nearly ten years now and it often proved surprisingly accurate. Its biggest shortcoming is of course that it treats both distributions independently. Another one is that it doesn’t include the home advantage (although this effect seems to be fading). A third point is that it is based on only one variable (market value), but there are of course others that are also important (e.g. the ratio of goal shots of both teams or the World Football ELO Ratings).

Any ideas on how to improve the above model are highly welcome, please share them in the comments below.

**DISCLAIMER**

*This post is written on an “as is” basis for educational purposes only and comes without any warranty. The findings and interpretations are exclusively those of the author and are not endorsed by or affiliated with any third party.*

In particular, this post provides no sports betting advice! No responsibility is taken whatsoever if you lose money.

*(If you make any money though I would be happy if you would buy me a coffee… that is not too much to ask, is it? )*

**UPDATE July 2, 2021**

Unfortunately for Switzerland, our prediction got it (nearly) right. Spain won by a two-goal difference. The end result was 1:3 (after penalty shoot-out), instead of the predicted 0:2.

**UPDATE July 3, 2021**

A (slightly) better method would be not to take the total market value of the whole team but the average value per player (same source). In this case, the prediction would have been the same but in other cases, it could be different.

**UPDATE July 12, 2021**

The result of the final (before the following penalty shoot-out) would have been correctly predicted as Italy 1:1 England! Penalty shoot-outs are really only better coin tosses, the only thing one might conclude is that the stronger (= more expensive) team should have some edge. In this case that would have been England. But as we all know things turned out differently and Italy is the new European champion – Congratulations!

**UPDATE August 25, 2021**

I further improved the tool and adapted it for the German Bundesliga:

New Bundesliga Forecasting Tool: Can Underdog Herta Berlin beat Bayern Munich?

If you want to see how to do that in at least seven different ways in R, read on!

There are many different solutions possible, making use of several aspects of the R language. So this blog post can be seen as a fun exercise to recap some of the concepts explained in our introduction to R: Learning R: The Ultimate Introduction (incl. Machine Learning!).

First, as usual, you should try this for yourself…

Ok, so if you didn’t have any ideas whatsoever you could have done it by hand, i.e. use *R as a calculator*:

2^2 + 4^2 + 6^2 + 8^2 + 10^2 + 12^2 + 14^2 + 16^2 + 18^2 + 20^2 + 22^2 + 24^2 + 26^2 + 28^2 + 30^2 + 32^2 + 34^2 + 36^2 + 38^2 + 40^2 + 42^2 + 44^2 + 46^2 + 48^2 + 50^2 + 52^2 + 54^2 + 56^2 + 58^2 + 60^2 + 62^2 + 64^2 + 66^2 + 68^2 + 70^2 + 72^2 + 74^2 + 76^2 + 78^2 + 80^2 + 82^2 + 84^2 + 86^2 + 88^2 + 90^2 + 92^2 + 94^2 + 96^2 + 98^2 + 100^2 - 99^2 - 97^2 - 95^2 - 93^2 - 91^2 - 89^2 - 87^2 - 85^2 - 83^2 - 81^2 - 79^2 - 77^2 - 75^2 - 73^2 - 71^2 - 69^2 - 67^2 - 65^2 - 63^2 - 61^2 - 59^2 - 57^2 - 55^2 - 53^2 - 51^2 - 49^2 - 47^2 - 45^2 - 43^2 - 41^2 - 39^2 - 37^2 - 35^2 - 33^2 - 31^2 - 29^2 - 27^2 - 25^2 - 23^2 - 21^2 - 19^2 - 17^2 - 15^2 - 13^2 - 11^2 - 9^2 - 7^2 - 5^2 - 3^2 - 1^2 ## [1] 5050

The result is 5050. But there are of course many much more elegant solutions. The first solution I thought of was the following, it makes use of the `seq`

function:

sum(seq(2, 100, 2)^2 - seq(99, 1, -2)^2) ## [1] 5050

An integral part is splitting the numbers into 50 even and 50 odd ones. There are several ways to do that. One way is to create both with the formulas *2n* for even and *2n-1* for odd numbers:

n <- 1:50 even <- 2 * n odd <- 2 * n - 1 sum(even^2 - odd^2) ## [1] 5050

Another possibility is by *subsetting* with *recycling*…

x <- 1:100 even <- x[c(FALSE, TRUE)] # subsetting with recycling odd <- x[c(TRUE, FALSE)] sum(even^2 - odd^2) ## [1] 5050

…or elegantly by creating a *matrix*:

M <- matrix(1:100, nrow = 2) sum(M[2, ]^2 - M[1, ]^2) ## [1] 5050

If you come from another language, especially C and its derivatives you might have wanted to use a *loop*. This is of course also possible but discouraged in R (some say that you then “speak R with a C accent”):

s <- 0 for (x in 1:100) { if (x %% 2) s <- s - x^2 else s <- s + x^2 } s ## [1] 5050

As you can see, inside of the loop is a *conditional statement* and the *modulo operator* (`%%`

) to get the remainder of a division. The loop can easily be *vectorized* which is the preferred method in R:

x <- 1:100 sum(ifelse(x %% 2, -x^2, x^2)) # vectorized if statement ## [1] 5050

Those were seven ways to get to the same correct result… and now for the bonus: if you think long enough about this little riddle you will see that is equivalent to adding up the original numbers (I leave this as an exercise, it is not too hard to see). The resulting code out of this analysis couldn’t be any simpler:

sum(1:100) # analytical ## [1] 5050

That was fun, wasn’t it! If you want to share your own solution please do so in the comments below. If it is an especially clever, elegant, or creative one you will get an honorary mention in an update of this post!

P.S.: The Bart Simpson blackboard pic was created with the code provided in this post: Create Bart Simpson Blackboard Memes with R.

**UPDATE June 24, 2021**

A very concise and elegant solution came from Rob in the comments:

sum(c(-1, 1) * (1:100)^2) ## [1] 5050

**UPDATE July 28, 2021**

The most elegant and concise solution came from NelaTo in the comments – well done!

x <- 1:100 sum((-1)^x * x^2) ## [1] 5050]]>