Everybody is talking about *Big Data*.

If you want to learn how to determine the range of the typical value of a dataset (i.e. the *median*) with just five values and why this works, read on!

This blog post is inspired by a chapter from the wonderful book “Alles kein Zufall! Liebe, Geld, Fußball” (“No coincidence! Love, Money, Football”, only available in German at the moment) by my colleague Professor Christian Hesse from the University of Stuttgart, Germany.

Let us dive directly into the matter, the *Small Data Rule* states:

In a sample of five numerical values from any unknown population, the median of this population lies between the smallest and the largest sample value with 94 percent certainty.

The “population” can be anything, e.g. ages, incomes in a country, television consumption, donation amounts, body heights, temperatures, and so on.

The median is the “middle value” and thereby a good representation of a population’s “typical value”. It is calculated by sorting all of the values and then dividing them into two halves of the same size. The value that lies exactly between those two halves is the median. Contrary to the *mean* (often simply called the “average”) the median is robust with regard to outliers:

```r
x <- 0:10
median(x)
## [1] 5
mean(x)
## [1] 5

x <- c(0:9, 10000)
median(x)
## [1] 5
mean(x)
## [1] 913.1818
```

Obviously, the median is quite useful for getting a quick overview of a large dataset. So it seems almost magical that you could determine its range from just five randomly drawn numbers. Yet the rationale is quite straightforward:

The probability of drawing a random value from a population that is *above* the median is 50 percent or 1/2. The probability that all five values are above the median is therefore 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = 1/32. Of course, the probability that all five values are *below* the median is the same. To cover both cases, just add those probabilities.

But we are interested in the complementary event, i.e. that at least one value lies on each side of the median so that we get an interval that encloses it. We get that by subtracting the above probability from one:

```r
1 - 2 * (0.5^5)
## [1] 0.9375
```

The result is a high degree of certainty of nearly 94% that this will indeed be the case!

If you don’t believe this let us conduct a little experiment for illustrative purposes. We enumerate all possibilities of drawing five values from the range of zero to one hundred and see how often the median (= 50) falls within the interval of the minimum and the maximum of the samples (to understand how to do this, this post might be helpful: Learning R: Permutations and Combinations with Base R).

Beware, the following code will run for quite a while (about three to four minutes on an average computer) because nearly 80 million combinations have to be created and then evaluated:

```r
M <- combn(0:100, 5)
between <- apply(M, 2, \(x) min(x) < 50 && max(x) > 50)
sum(between) / ncol(M)
## [1] 0.9406869
```

As you can see: 94% indeed! (The resulting value is not exactly the same as above because it only approaches that value asymptotically as the underlying population grows.)
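The enumeration above uses a finite 0-to-100 population. As a quicker sanity check (a sketch of my own, not from the original post), we can run a Monte Carlo simulation: draw many samples of five values from a deliberately skewed continuous population and count how often its true median falls inside the sample range:

```r
# Monte Carlo check of the Small Data Rule: draw many samples of five
# values from a skewed Exp(1) population and count how often the true
# population median lies inside the sample range.
set.seed(123)
pop_median <- qexp(0.5)  # true median of the Exp(1) population
inside <- replicate(100000, {
  s <- rexp(5)
  min(s) < pop_median && pop_median < max(s)
})
mean(inside)  # close to 1 - 2 * 0.5^5 = 0.9375
```

Because the population here is continuous, the hit rate matches the theoretical 93.75% almost exactly.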

Professor Hesse gives a nice example of how to use the small data rule in practice:

The manager of a company is interested in the distance his employees have to commute to work. He plans to open another branch if the distances are too long for many. He could, of course, ask his entire staff about the distance to their place of residence. That would be costly, generate a lot of data and provide more information than the manager actually needs. Instead, he surveys only five randomly selected employees. They live 7, 19, 13, 18, and 9 km away from the company. Thus, the manager can be 94 per cent sure that his employees have to commute a median distance of 7 to 19 kilometres to the company. He considers this acceptable and decides against an additional location.

As an aside, not many people know the `range` function, which might come in handy in contexts like these:

```r
range(c(7, 19, 13, 18, 9))
## [1] 7 19
```

So you see, small data can help you determine the big picture!

For another handy tool to infer whether something unusual is going on see this post: 3.84 or: How to Detect BS (Fast).

In this post, we will first give some intuition for and then demonstrate what is often called the most beautiful formula in mathematics, Euler’s identity, in R – first numerically with base R and then also symbolically, so read on!

Euler’s identity (also known as Euler’s equation) is the equality:

e^(iπ) + 1 = 0

where

- *e* is Euler’s number, the base of natural logarithms
- *i* is the imaginary unit, which satisfies i² = −1
- *π* is the ratio of the circumference of a circle to its diameter

It is often credited as the most beautiful formula in mathematics, nerds sport it on T-shirts and even get tattoos with it.

It combines three of the basic arithmetic operations (addition, multiplication, and exponentiation) and links five fundamental mathematical constants:

- The number 0
- The number 1
- The number π
- The number e
- The number i, the imaginary unit of the complex numbers

We won’t go into the mathematical details here (when you google it you can find literally thousands of posts, articles, videos, and even whole books on it) but just give you some hand-waving intuition: as stated above, π is the ratio of the circumference of a circle to its diameter, which means that when you have a radius of 1 (= unit circle) you will need 2π to go full circle. This is illustrated in the following animation:

Many of you know the exponential function (thanks to Covid anyway) which is nothing else but taking Euler’s number to some power. Something magical happens when you take imaginary/complex instead of the “normal” real numbers: the exponential function starts to rotate:
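A quick numerical illustration of this rotation (my own sketch): points of the form exp(1i · θ) all lie on the unit circle, and θ = π lands exactly at −1:

```r
# Walk a quarter turn at a time around the unit circle:
theta <- c(0, pi / 2, pi, 3 * pi / 2, 2 * pi)
z <- exp(1i * theta)
round(Mod(z), 10)  # every point has distance 1 from the origin
round(z[3], 10)    # theta = pi ends up at -1
```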

As we have seen above, a rotation by π boils down to a rotation by 180 degrees. So when you start at 0 and put that into the exponential function you get 1 (because e^0 = 1), and when you then do a one-eighty (= e^(iπ)) you will end up at -1. To get to the right-hand side of the identity you just have to add 1 to that -1, which equals 0. So Euler’s identity basically means:

**When you turn around, you will look in the opposite direction!**

Seen this way, it is easy, isn’t it!

Now for the R part. The following is the original task from Rosetta Code (for more solved Rosetta code tasks see the respective Category: Rosetta Code on this blog):

Show in your language that Euler’s identity is true. As much as possible and practical, mimic the Euler’s identity equation.

Most languages are limited to IEEE 754 floating point calculations so will have some error in the calculation.

If that is the case, or there is some other limitation, show that e^(iπ) + 1 is approximately equal to zero and show the amount of error in the calculation.

If your language is capable of symbolic calculations, show that e^(iπ) + 1 is exactly equal to zero for bonus kudos points.

First, as always, you should give it a try yourself…

…and now for the solution!

For coding the left-hand side of the identity we have to know the following:

- *e* is not a built-in constant. Instead, the exponential function `exp()` is used (if you want to get Euler’s number just use `exp(1)`).
- R can handle complex numbers! You can use the `complex()` function for that, or the `Re()` and `Im()` functions for the real and the imaginary parts. In this case it is even easier because we will only need *i*, and this is exactly the way we code it in R: `1i`!
- *π* is a built-in constant: `pi`
Putting it all together:

```r
exp(1i * pi) + 1
## [1] 0+1.224606e-16i
```

Besides the small rounding error, this is the whole solution!

Now for the symbolic solution to also get the bonus kudos points.

We will use the fantastic `Ryacas` package (on CRAN) to finish the job (for an introduction to this package see: Doing Maths Symbolically: R as a Computer Algebra System (CAS)).

```r
library(Ryacas)
## 
## Attaching package: 'Ryacas'
## The following object is masked from 'package:stats':
## 
##     integrate
## The following objects are masked from 'package:base':
## 
##     %*%, diag, diag<-, lower.tri, upper.tri

as_r(yac_str("Exp(I * Pi) + 1"))
## [1] 0
```

And this finishes the task. It is also the solution I contributed to Rosetta code.

Isn’t maths beautiful! And isn’t R beautiful!

One of the big sensations of the UEFA Euro 2020 is that Switzerland kicked out world champion France. We take this as an opportunity to share with you a simple statistical model to predict football (soccer) results with R, so read on!

Football is a highly stochastic game, which is one of the reasons for its appeal: anything can happen! But there are still some known patterns that can be used to build a predictive model.

First, it is well known that the probability for the number of goals in a game can be well approximated by a *Poisson distribution*.

Second, it is also well known that one of the best predictors of the strength of a team is its *market value* (in German there is the saying “Geld schießt Tore”, which translates to “money scores goals”). We can find the market value of the different teams e.g. here: transfermarkt.de.

The third ingredient that we need is the *average number of goals scored per game*. Wikipedia tells us that this is about 2.8 for the current tournament.

The main idea is to divide this average number according to the market values of both teams to get the average number of goals per team and feed that into two Poisson distributions to determine the probabilities for each potential number of goals scored:

```r
mean_total_score <- 2.8 # https://en.wikipedia.org/wiki/UEFA_Euro_2020_statistics

# https://www.transfermarkt.de/europameisterschaft-2020/teilnehmer/pokalwettbewerb/EM20
country1 = "Switzerland"; colour1 <- "red"   ; value1 <- 0.29
country2 = "Spain"      ; colour2 <- "orange"; value2 <- 0.92

ratio <- value1 / (value1 + value2)
mean_goals1 <- ratio * mean_total_score
mean_goals2 <- (1 - ratio) * mean_total_score

prob_goals1 <- dpois(0:7, mean_goals1)
prob_goals2 <- dpois(0:7, mean_goals2)

parbkp <- par(mfrow = c(1, 2))
max_ylim <- max(prob_goals1, prob_goals2)
plot(0:7, prob_goals1, type = "h", ylim = c(0, max_ylim),
     xlab = country1, ylab = "Probability", col = colour1, lwd = 10)
plot(0:7, prob_goals2, type = "h", ylim = c(0, max_ylim),
     xlab = country2, ylab = "", col = colour2, lwd = 10)
title(paste0(country1, " ", which.max(prob_goals1) - 1, ":",
             which.max(prob_goals2) - 1, " ", country2),
      line = -2, outer = TRUE)
par(parbkp)
```

So the most probable prediction is that Spain will win this one clearly… but you never know! And apart from those hard numbers, many will root for the underdog anyway (me too because I have a special relationship with Switzerland since I did my Ph.D. there at the University of St. Gallen and still have a lot of friends from that time).
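Since the model treats the two Poisson distributions as independent, we can also multiply them into a full score matrix and read off win/draw/loss probabilities (a small extension of my own, reusing the means from above):

```r
# Full score matrix under the independence assumption (goals 0..10),
# reusing the market-value-based Poisson means from above.
mean_total_score <- 2.8
value1 <- 0.29; value2 <- 0.92  # Switzerland vs Spain
ratio <- value1 / (value1 + value2)
mean_goals1 <- ratio * mean_total_score
mean_goals2 <- (1 - ratio) * mean_total_score

# P[i, j] = P(team 1 scores i-1 goals AND team 2 scores j-1 goals)
P <- outer(dpois(0:10, mean_goals1), dpois(0:10, mean_goals2))
round(c(win1 = sum(P[lower.tri(P)]),  # team 1 scores more goals
        draw = sum(diag(P)),
        win2 = sum(P[upper.tri(P)])), 3)
```

This makes the model's verdict explicit: Spain (team 2) gets by far the largest share of the probability mass.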

I have played around with this simple model for nearly ten years now and it often proved surprisingly accurate. Its biggest shortcoming is of course that it treats both distributions independently. Another one is that it doesn’t include the home advantage (although this effect seems to be fading). A third point is that it is based on only one variable (market value), but there are of course others that are also important (e.g. the ratio of goal shots of both teams or the World Football ELO Ratings).

Any ideas on how to improve the above model are highly welcome, please share them in the comments below.

**DISCLAIMER**

*This post is written on an “as is” basis for educational purposes only and comes without any warranty. The findings and interpretations are exclusively those of the author and are not endorsed by or affiliated with any third party.*

In particular, this post provides no sports betting advice! No responsibility is taken whatsoever if you lose money.

*(If you gain money though I would be happy if you would buy me a coffee… that is not too much to ask, is it? )*

**UPDATE July 2, 2021**

Unfortunately for Switzerland, our prediction got it (nearly) right. Spain won by a two-goal difference. The end result was 1:3 (after penalty shoot-out), instead of the predicted 0:2.

**UPDATE July 3, 2021**

A (slightly) better method would be not to take the total market value of the whole team but the average value per player (same source). In this case, the prediction would have been the same but in other cases, it could be different.

**UPDATE July 12, 2021**

The result of the final (before the following penalty shoot-out) would have been correctly predicted as Italy 1:1 England! Penalty shoot-outs are really little more than coin tosses; the only thing one might conclude is that the stronger (= more expensive) team should have some edge. In this case that would have been England. But as we all know things turned out differently and Italy is the new European champion – Congratulations!

If you want to see how to calculate the alternating sum of squares 2² − 1² + 4² − 3² + … + 100² − 99² in at least seven different ways in R, read on!

There are many different solutions possible, making use of several aspects of the R language. So this blog post can be seen as a fun exercise to recap some of the concepts explained in our introduction to R: Learning R: The Ultimate Introduction (incl. Machine Learning!).

First, as usual, you should try this for yourself…

Ok, so if you didn’t have any ideas whatsoever you could have done it by hand, i.e. use *R as a calculator*:

```r
2^2 + 4^2 + 6^2 + 8^2 + 10^2 + 12^2 + 14^2 + 16^2 + 18^2 + 20^2 +
  22^2 + 24^2 + 26^2 + 28^2 + 30^2 + 32^2 + 34^2 + 36^2 + 38^2 + 40^2 +
  42^2 + 44^2 + 46^2 + 48^2 + 50^2 + 52^2 + 54^2 + 56^2 + 58^2 + 60^2 +
  62^2 + 64^2 + 66^2 + 68^2 + 70^2 + 72^2 + 74^2 + 76^2 + 78^2 + 80^2 +
  82^2 + 84^2 + 86^2 + 88^2 + 90^2 + 92^2 + 94^2 + 96^2 + 98^2 + 100^2 -
  99^2 - 97^2 - 95^2 - 93^2 - 91^2 - 89^2 - 87^2 - 85^2 - 83^2 - 81^2 -
  79^2 - 77^2 - 75^2 - 73^2 - 71^2 - 69^2 - 67^2 - 65^2 - 63^2 - 61^2 -
  59^2 - 57^2 - 55^2 - 53^2 - 51^2 - 49^2 - 47^2 - 45^2 - 43^2 - 41^2 -
  39^2 - 37^2 - 35^2 - 33^2 - 31^2 - 29^2 - 27^2 - 25^2 - 23^2 - 21^2 -
  19^2 - 17^2 - 15^2 - 13^2 - 11^2 - 9^2 - 7^2 - 5^2 - 3^2 - 1^2
## [1] 5050
```

The result is 5050. But there are of course many much more elegant solutions. The first solution I thought of makes use of the `seq` function:

```r
sum(seq(2, 100, 2)^2 - seq(99, 1, -2)^2)
## [1] 5050
```

An integral part is splitting the numbers into 50 even and 50 odd ones. There are several ways to do that. One way is to create both with the formulas *2n* for even and *2n-1* for odd numbers:

```r
n <- 1:50
even <- 2 * n
odd <- 2 * n - 1
sum(even^2 - odd^2)
## [1] 5050
```

Another possibility is by *subsetting* with *recycling*…

```r
x <- 1:100
even <- x[c(FALSE, TRUE)] # subsetting with recycling
odd <- x[c(TRUE, FALSE)]
sum(even^2 - odd^2)
## [1] 5050
```

…or elegantly by creating a *matrix*:

```r
M <- matrix(1:100, nrow = 2)
sum(M[2, ]^2 - M[1, ]^2)
## [1] 5050
```

If you come from another language, especially C and its derivatives you might have wanted to use a *loop*. This is of course also possible but discouraged in R (some say that you then “speak R with a C accent”):

```r
s <- 0
for (x in 1:100) {
  if (x %% 2) s <- s - x^2 else s <- s + x^2
}
s
## [1] 5050
```

As you can see, inside the loop there is a *conditional statement* and the *modulo operator* (`%%`) to get the remainder of a division. The loop can easily be *vectorized*, which is the preferred method in R:

```r
x <- 1:100
sum(ifelse(x %% 2, -x^2, x^2)) # vectorized if statement
## [1] 5050
```

Those were seven ways to get to the same correct result… and now for the bonus: if you think long enough about this little riddle you will see that it is equivalent to adding up the original numbers from 1 to 100 (I leave this as an exercise, it is not too hard to see). The resulting code out of this analysis couldn’t be any simpler:

```r
sum(1:100) # analytical
## [1] 5050
```
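If you want a hint for the exercise: each pair (2n)² − (2n − 1)² simplifies to 4n − 1, which is exactly 2n + (2n − 1), i.e. the two original numbers themselves. A quick check (my own addition):

```r
# Each even/odd pair of squares collapses to the sum of the two numbers:
# (2n)^2 - (2n - 1)^2 = 4n^2 - (4n^2 - 4n + 1) = 4n - 1 = 2n + (2n - 1)
n <- 1:50
all((2 * n)^2 - (2 * n - 1)^2 == 2 * n + (2 * n - 1))
## [1] TRUE
```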

That was fun, wasn’t it! If you want to share your own solution please do so in the comments below. If it is an especially clever, elegant, or creative one you will get an honorary mention in an update of this post!

P.S.: The Bart Simpson blackboard pic was created with the code provided in this post: Create Bart Simpson Blackboard Memes with R.

**UPDATE June 24, 2021**

A very concise and elegant solution came from Rob in the comments:

```r
sum(c(-1, 1) * (1:100)^2)
## [1] 5050
```

More and more decisions by banks on who gets a loan are being made by artificial intelligence. These systems base their decisions on models of whether the customer will pay back the loan or default, i.e. they determine the customer's creditworthiness. If you want to learn how to build such a model in R yourself (with the latest R ≥ 4.1.0 syntax as a bonus), read on!

As always we need data to build our model. In this case we will use credit scoring data from a *kaggle* competition: Give Me Some Credit.

The goal is to predict whether somebody will experience financial distress in the next two years, the full list of variables includes:

- Serious delinquency in 2 years (`SeriousDlqin2yrs`)
- Revolving Utilization Of Unsecured Lines
- Age of borrower in years
- Number Of Time 30–59 Days Past Due Not Worse
- Debt Ratio
- Monthly Income
- Number Of Open Credit Lines And Loans
- Number Of Times 90 Days Late
- Number of Real Estate Loans Or Lines
- Number Of Time 60–89 Days Past Due Not Worse
- Number of Dependents

We start by reading the data into R and inspecting it:

```r
cs <- read.csv("data/cs-training.csv")
data <- cs[ , -1]
str(data)
## 'data.frame': 150000 obs. of 11 variables:
##  $ SeriousDlqin2yrs                    : int 1 0 0 0 0 0 0 0 0 0 ...
##  $ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
##  $ age                                 : int 45 40 38 30 49 74 57 39 27 57 ...
##  $ NumberOfTime30.59DaysPastDueNotWorse: int 2 0 1 0 1 0 0 0 0 0 ...
##  $ DebtRatio                           : num 0.803 0.1219 0.0851 0.036 0.0249 ...
##  $ MonthlyIncome                       : int 9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
##  $ NumberOfOpenCreditLinesAndLoans     : int 13 4 2 5 7 3 8 8 2 9 ...
##  $ NumberOfTimes90DaysLate             : int 0 0 1 0 0 0 0 0 0 0 ...
##  $ NumberRealEstateLoansOrLines        : int 6 0 0 0 1 1 3 0 0 4 ...
##  $ NumberOfTime60.89DaysPastDueNotWorse: int 0 0 0 0 0 0 0 0 0 0 ...
##  $ NumberOfDependents                  : int 2 1 0 0 0 1 0 0 NA 2 ...
```

We see that we have 150,000 observations altogether, which we split randomly into a training (80%) and a test (20%) set:

```r
set.seed(3141) # for reproducibility
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- data[random, ]
data_test <- data[-random, ]
```

We will build our model with the `OneR` package (on CRAN; for more posts on this package see Category: OneR). I will use the new native pipe operator `|>`, in combination with the new shorthand syntax `\(x)` for anonymous functions, to build the model. You will need at least R version 4.1.0 to run the code. For comparison, I include the traditional form in the comments above the new syntax:

```r
library(OneR)

# same as 'model <- OneR(optbin(SeriousDlqin2yrs ~., data = data_train))'
model <- data_train |>
  {\(x) optbin(SeriousDlqin2yrs ~., data = x)}() |>
  OneR()
## Warning in optbin.data.frame(x = data, method = method, na.omit = na.omit):
## target is numeric
## Warning in optbin.data.frame(x = data, method = method, na.omit = na.omit):
## 23698 instance(s) removed due to missing values

# same as summary(model)
model |> summary()
## 
## Call:
## OneR.data.frame(x = {
##     function(x) optbin(SeriousDlqin2yrs ~ ., data = x)
## }(data_train))
## 
## Rules:
## If NumberOfTimes90DaysLate = (-0.098,1.6] then SeriousDlqin2yrs = 0
## If NumberOfTimes90DaysLate = (1.6,98.1] then SeriousDlqin2yrs = 1
## 
## Accuracy:
## 89774 of 96302 instances classified correctly (93.22%)
## 
## Contingency table:
##                 NumberOfTimes90DaysLate
## SeriousDlqin2yrs (-0.098,1.6] (1.6,98.1]   Sum
##              0 *        88719        863 89582
##              1           5665     * 1055  6720
##              Sum        94384       1918 96302
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 6946.5, df = 1, p-value < 2.2e-16

# same as plot(model)
model |> plot()
```

We see that the data are extremely unbalanced, which is quite normal for credit scoring data because most customers (thankfully!) pay back their loans. Still, OneR was able to find a simple rule: If the borrower has been 90 days or more past due more than once chances are that (s)he will default on the loan. Let us see how well this model fares with the test set:

```r
# same as 'eval_model(prediction = predict(model, data_test), actual = data_test$SeriousDlqin2yrs)'
data_test |>
  {\(x) eval_model(prediction = predict(model, x), actual = x$SeriousDlqin2yrs)}()
## 
## Confusion matrix (absolute):
##           Actual
## Prediction     0     1   Sum
##        0   27745  1621 29366
##        1     269   365   634
##        Sum 28014  1986 30000
## 
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.92 0.05 0.98
##        1   0.01 0.01 0.02
##        Sum 0.93 0.07 1.00
## 
## Accuracy:
## 0.937 (28110/30000)
## 
## Error rate:
## 0.063 (1890/30000)
## 
## Error rate reduction (vs. base rate):
## 0.0483 (p-value = 0.01284)
```

The accuracy of the model of nearly 94% is quite good. One important point to keep in mind though: because of the extreme imbalance of the data this accuracy is a little bit misleading. Yet the last line of the output above (error rate reduction with p < 0.05) tells us that despite this imbalance the model really is able to give better predictions than a naive approach (if you want to know more about this please consult: ZeroR: The Simplest Possible Classifier… or: Why High Accuracy can be Misleading).
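To see concretely why accuracy alone can mislead here, consider a naive "classifier" that always predicts the majority class. With the class counts from the test set above it already scores about 93% without learning anything (a small illustration of my own):

```r
# A "model" that always predicts "no default" on the 28014/1986 test
# split already achieves ~93% accuracy, which is why the error rate
# reduction vs. the base rate is the more honest yardstick.
actual <- rep(c(0, 1), times = c(28014, 1986))  # class counts from the test set
naive_accuracy <- mean(actual == 0)
round(naive_accuracy, 4)
## [1] 0.9338
```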

Anyway, it is way better than this use case from “DataRobot” on the same dataset which had an accuracy of just about 77%: Predicting financial delinquency using credit scoring data. That model is way more complex, less interpretable, needed more tweaking, and was built using proprietary software on a 72 core private cloud for about five hours. We built our way better model out of the box with the freely available OneR-package on a 10-year-old computer within seconds!

Not many people understand the financial alchemy of modern financial investment vehicles, like hedge funds, that often use sophisticated trading strategies. But everybody understands the meaning of rising and falling markets. Why not simply translate one into the other?

If you want to get your hands on a simple R script that creates an easy-to-understand plot (a *profit & loss profile* or *payoff diagram*) out of any price series, read on!

Once again we will stand on the shoulders of giants by using the mighty `quantmod` package (on CRAN) and a not so well-known function from base R, `scatter.smooth` (to run the code you must have R ≥ 4.1.0 installed):

```r
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

loessplot <- function(comp, benchm) {
  data <- merge(benchm, comp, all = FALSE) |>
    ROC() |>
    coredata() |>
    na.omit() |>
    data.frame()
  names(data) <- c("benchmark", "comparison")
  with(data, scatter.smooth(benchmark, comparison, evaluation = 200,
                            xlab = names(benchm), ylab = names(comp),
                            main = paste("Profit & Loss Profile, Correlation:",
                                         benchmark |> cor(comparison) |> round(2)),
                            lpars = list(col = "red", lwd = 3)))
  abline(h = 0); abline(v = 0); abline(0, -1); abline(0, 1, col = "blue")
}
```

What this code, i.e. the `loessplot` function, does is create a scatter plot from the respective price series and a benchmark (normally an index to compare it with) and superimpose a payoff diagram. The payoff diagram is created by a *local regression*, or more precisely *locally estimated scatterplot smoothing* or *LOESS*. LOESS can be seen as a generalization of polynomial regression, which is itself a generalization of linear regression, which is closely related to correlation (the correlation coefficient is additionally provided in the title of the plot to give some context).

It is best explained by showing a few examples. Let us start with a simple index tracker of the S&P 500:

```r
SP500 <- getSymbols("^GSPC", auto.assign = FALSE)
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.

getSymbols("IVV") # iShares Core S&P 500 ETF
## [1] "IVV"
loessplot(IVV$IVV.Adjusted, SP500$GSPC.Adjusted)
```

On the x-axis we have the benchmark, in this case the S&P 500, the y-axis shows the performance of the ETF. The blue line in the diagram signifies a perfect replication of the underlying, while the red line is the average payoff profile of the price series for each market phase. We can see that the tracking error is quite small, the nearly perfect positive correlation corroborates this.

It is common wisdom that combining stocks with bonds can be worthwhile. Let us have a look at their profit & loss profile:

```r
getSymbols("TLT") # iShares 20+ Year Treasury Bond ETF
## [1] "TLT"
loessplot(TLT$TLT.Adjusted, SP500$GSPC.Adjusted)
```

We can clearly see that they are, certainly not perfectly, but reasonably negatively correlated. So a combination is indeed a good idea.

How about gold:

```r
Gold <- getSymbols("GC=F", auto.assign = FALSE)
## Warning: GC=F contains missing values. Some functions will not work if objects
## contain missing values in the middle of the series. Consider using na.omit(),
## na.approx(), na.fill(), etc to remove or replace them.
loessplot(Gold$`GC=F.Adjusted`, SP500$GSPC.Adjusted)
```

No correlation whatsoever! So, adding it to a portfolio is also a good idea diversification-wise.

Now for a more complicated trading-strategy based on the Nasdaq 100:

```r
getSymbols("^NDX") # Nasdaq 100
## [1] "^NDX"
getSymbols("NUSI") # Nationwide Risk-Managed Income ETF
## [1] "NUSI"
loessplot(NUSI$NUSI.Adjusted, NDX$NDX.Adjusted)
```

This is indeed an interesting profile: Losses are capped beyond a certain point – as are profits. The typical profile of a well-known options strategy, a so-called collar: holding an underlying, buying an out-of-the-money put option, and selling an out-of-the-money call option. Even without reading any further documents about or from this fund, we can clearly dissect their trading strategy: financial X-rays!

Let us examine another hedge fund strategy ETF:

```r
getSymbols("QMN") # iM DBi Hedge Strategy ETF
## [1] "QMN"
loessplot(QMN$QMN.Adjusted, SP500$GSPC.Adjusted)
```

Well, this doesn’t look too impressive: while holding this fund might be quite expensive (which I don’t know) a similar profile should also be achievable by investing just about 60% of your money in a cheap index tracker (like the one seen at the beginning of this post)!

To make our little collection complete there are of course also instruments with which you can short the market, in this case the Russell 2000-index:

```r
Russell2000 <- getSymbols("RTY=F", auto.assign = FALSE)
## Warning: RTY=F contains missing values. Some functions will not work if objects
## contain missing values in the middle of the series. Consider using na.omit(),
## na.approx(), na.fill(), etc to remove or replace them.
getSymbols("RWM") # ProShares Short Russell2000
## [1] "RWM"
loessplot(RWM$RWM.Adjusted, Russell2000$`RTY=F.Adjusted`)
```

I will end this post with – of course – Bitcoin! As a benchmark we take the S&P 500 again:

```r
BTC <- getSymbols("BTC-USD", auto.assign = FALSE)
## Warning: BTC-USD contains missing values. Some functions will not work if
## objects contain missing values in the middle of the series. Consider using
## na.omit(), na.approx(), na.fill(), etc to remove or replace them.
loessplot(BTC$`BTC-USD.Adjusted`, SP500$GSPC.Adjusted)
```

As you can see it is nearly uncorrelated – but not entirely. In this respect, gold seems to be a better alternative. And considering that gold will be there even if the lights go out and doesn’t have such an abysmal CO2 footprint underscores this: Bitcoin is like gold – only worse!

I hope that you find this useful and I would love it if you could share some of your own analyses with us in the comments below.

The next logical step would be to consider replicating the found payoff structures, especially of high-cost hedge funds in a cost-effective manner. I published another post some time ago on how to do just that: Financial Engineering: Static Replication of any Payoff Function.

**DISCLAIMER**

*This post is written on an “as is” basis for educational purposes only and comes without any warranty. The findings and interpretations are exclusively those of the author and are not endorsed by or affiliated with any third party.*

In particular, this post provides no investment advice! No responsibility is taken whatsoever if you lose money.

*(If you gain money though I would be happy if you would buy me a coffee… that is not too much to ask, is it? )*

One problem in cryptography is that you often need a shared secret key to encrypt and decrypt messages (e.g. credit card information), but you of course cannot just send the secret key to the other party *unencrypted* – now you see the problem!

We want to find a method to generate a shared secret key by only sending publicly observable information. “Impossible”, you say? No, only ingenious!

Let me start by presenting a riddle to you: say, you want to have a box with some valuables delivered to a friend of yours but you don’t trust the carrier. You and your friend are allowed to use locks, but there is no lock for which both you and your friend have a key. You can also send the box back and forth as often as you want. What to do? Think about it for a moment, I’ll wait…

…ok, here is a possible solution: You put a lock on the box, for which only you have the key. The box is delivered to your friend. Your friend puts on another lock, for which only he has the key and sends it back to you, now with two locks on. You remove your lock and send it back to him one last time, problem solved!

The method we are going to explain is designed in very much the same spirit. It is called “Diffie–Hellman key exchange”; it was one of the first so-called *public-key protocols* and is still widely used today.

The following explanation based on mixing colours is not new but here we not only do it illustratively but demonstrate it by actually mixing colours with R!

We will use the `MixColor()` function from the versatile `DescTools` package (on CRAN). First we write a small helper function for mixing the colours, plotting the result and returning the RGB (Red-Green-Blue) code of the new colour:

```r
library(DescTools)

mix_col <- function(col1, col2 = col1, amount1 = 0.5, main = "") {
  mix_col <- MixColor(col1, col2, amount1)
  plot(0, type = 'n', xlim = c(0, 100), ylim = c(0, 100),
       axes = FALSE, xlab = "", ylab = "", main = main)
  rect(0, 0, 100, 100, col = mix_col)
  mix_col
}
```

Both parties, traditionally called Alice and Bob, start out with their own private colour which they will keep secret:

```r
Alices_private_col <- "red"
mix_col(Alices_private_col, main = "Alice's private colour")
## [1] "#FF0000FF"
```

```r
Bobs_private_col <- "blue"
mix_col(Bobs_private_col, main = "Bob's private colour")
## [1] "#0000FFFF"
```

On top of that, we need a public colour:

```r
public_col <- "green"
mix_col(public_col, main = "Public colour")
## [1] "#00FF00FF"
```

Now, the fun can begin! Alice takes her private colour (red), mixes it with the public colour (green) and sends the result publicly to Bob…

```r
(Alice2Bob <- mix_col(Alices_private_col, public_col,
                      main = "Alice's private colour with public colour"))
## [1] "#7F7F00FF"
```

…and Bob takes his private colour (blue), mixes it with the public colour (green) and sends the result publicly to Alice:

(Bob2Alice <- mix_col(Bobs_private_col, public_col,
                      main = "Bob's private colour with public colour"))
## [1] "#007F7FFF"

So far, so good. And now for the final step: Both take the colour the other party has sent them and mix it with their own private secret colour (in the ratio 2/3 to 1/3, so that each of the three base colours makes up 1/3 of the total):

mix_col(Bob2Alice, Alices_private_col, amount1 = 2/3, main = "Shared secret colour")
## [1] "#545454FF"

mix_col(Alice2Bob, Bobs_private_col, amount1 = 2/3, main = "Shared secret colour")
## [1] "#545454FF"

As can be seen, the result is the same secret – but now shared – colour. This method works because there is some asymmetry involved: It is easy to mix colours but very hard to undo that! Even if a hostile third party, traditionally called Eve, was listening in, she couldn’t make sense of the intercepted data.

Now that you understand the general principle, here is, as promised, the nerd version:

We need a mathematical equivalent of our colour mixing: a function that is simple to calculate in one direction but hard to invert. Such functions indeed exist and are called *one-way functions*.

You don’t have to look any further than taking a simple exponentiation, dividing it by some number and only keeping the remainder (which is the modulo operation, `%%` in R). The numbers involved have to fulfil certain prerequisites, like being prime and being very long random numbers, but the principle is the same: simple to calculate in one direction but very hard in the other (here the reason is what is known as the “discrete logarithm problem”, but we won’t give any more details here).
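To get a feeling for why reversing modular exponentiation is hard, here is a little brute-force search (our own illustration, not part of the protocol): the only general way to recover the exponent is to try one candidate after another, and the cost of that search grows with the size of the prime.

```r
# Brute-force discrete logarithm: find x such that (g^x) %% p equals target.
# Feasible only for tiny primes -- with the very long primes used in practice,
# this search becomes hopeless, which is what keeps the private numbers safe.
dlog_brute <- function(g, target, p) {
  val <- 1
  for (x in 1:(p - 1)) {
    val <- (val * g) %% p  # multiply step by step to avoid huge powers
    if (val == target) return(x)
  }
  NA  # no solution found
}

dlog_brute(3, 6, 17)  # recovers the exponent: 3^15 %% 17 == 6
## [1] 15
```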

When you have a look at the following fully documented toy example you will recognize the same principle at work as with the colours above:

Alices_private_no <- 15
Bobs_private_no <- 13
prime <- 17     # public prime
generator <- 3  # public generator

# Alice selects her private random number (15) and sends the result (6) publicly to Bob
# (note: ^ binds tighter than %% in R)
(Alice2Bob <- generator^Alices_private_no %% prime)
## [1] 6

# Bob selects his private random number (13) and sends the result (12) publicly to Alice
(Bob2Alice <- generator^Bobs_private_no %% prime)
## [1] 12

# shared secret key
# Alice takes Bob's public result (12) and raises it to her private random number (15) modulo the public prime (17)
Bob2Alice^Alices_private_no %% prime
## [1] 10

# Bob takes Alice's public result (6) and raises it to his private random number (13) modulo the public prime (17)
Alice2Bob^Bobs_private_no %% prime
## [1] 10

Fascinating, isn’t it? And now you can understand the principle behind this foundational technology!
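One caveat for anyone experimenting with bigger numbers: `generator^x` quickly exceeds the precision of R’s doubles, so in practice the power must be reduced modulo the prime at every step. A minimal sketch of modular exponentiation by repeated squaring (the function name `pow_mod` is our own):

```r
# Modular exponentiation: computes (base^exp) %% mod without ever
# forming the huge intermediate power.
pow_mod <- function(base, exp, mod) {
  result <- 1
  base <- base %% mod
  while (exp > 0) {
    if (exp %% 2 == 1) result <- (result * base) %% mod  # use this bit of exp
    base <- (base * base) %% mod  # square for the next bit
    exp <- exp %/% 2
  }
  result
}

pow_mod(3, 15, 17)  # same result as 3^15 %% 17
## [1] 6
```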

Let me end this post with a well-kept secret about nerds:

A short one for today: in this post we will learn how to easily create truth tables with R!

We have covered bits of code that I contributed to Rosetta Code on this blog before (see Category: Rosetta Code). This time we want to solve the following task:

Truth table

A truth table is a display of the inputs to, and the output of a Boolean function organized as a table where each row gives one combination of input values and the corresponding value of the function.

Task

- Input a Boolean function from the user as a string, then calculate and print a formatted truth table for the given function. (One can assume that the user input is correct.)
- Print and show output for Boolean functions of two and three input variables, but any program should not be limited to that many variables in the function.
- Either reverse-polish or infix notation expressions are allowed.

The core of a truth table is the set of all combinations of `TRUE` and `FALSE` values for all variables (= letters), which we extract from the Boolean function `x`. Fortunately, we created such a combination-generating function a few posts ago (see Learning R: Permutations and Combinations with Base R), so we can adapt it accordingly: `expand.grid(rep(list(c(FALSE, TRUE)), length(vars)))`. We then add another column with the resulting evaluation of the Boolean function and return the resulting table:

truth_table <- function(x) {
  vars <- unique(unlist(strsplit(x, "[^a-zA-Z]+")))
  vars <- vars[vars != ""]
  perm <- expand.grid(rep(list(c(FALSE, TRUE)), length(vars)))
  names(perm) <- vars
  perm[ , x] <- with(perm, eval(parse(text = x)))
  perm
}

Now, let us try some examples:

"%^%" <- xor  # define infix xor operator

truth_table("!A")  # not
##       A    !A
## 1 FALSE  TRUE
## 2  TRUE FALSE

truth_table("A | B")  # or
##       A     B A | B
## 1 FALSE FALSE FALSE
## 2  TRUE FALSE  TRUE
## 3 FALSE  TRUE  TRUE
## 4  TRUE  TRUE  TRUE

truth_table("A & B")  # and
##       A     B A & B
## 1 FALSE FALSE FALSE
## 2  TRUE FALSE FALSE
## 3 FALSE  TRUE FALSE
## 4  TRUE  TRUE  TRUE

truth_table("A %^% B")  # xor
##       A     B A %^% B
## 1 FALSE FALSE   FALSE
## 2  TRUE FALSE    TRUE
## 3 FALSE  TRUE    TRUE
## 4  TRUE  TRUE   FALSE

truth_table("S | (T %^% U)")  # 3 variables with brackets
##       S     T     U S | (T %^% U)
## 1 FALSE FALSE FALSE         FALSE
## 2  TRUE FALSE FALSE          TRUE
## 3 FALSE  TRUE FALSE          TRUE
## 4  TRUE  TRUE FALSE          TRUE
## 5 FALSE FALSE  TRUE          TRUE
## 6  TRUE FALSE  TRUE          TRUE
## 7 FALSE  TRUE  TRUE         FALSE
## 8  TRUE  TRUE  TRUE          TRUE

truth_table("A %^% (B %^% (C %^% D))")  # 4 variables with nested brackets
##        A     B     C     D A %^% (B %^% (C %^% D))
## 1  FALSE FALSE FALSE FALSE                    FALSE
## 2   TRUE FALSE FALSE FALSE                     TRUE
## 3  FALSE  TRUE FALSE FALSE                     TRUE
## 4   TRUE  TRUE FALSE FALSE                    FALSE
## 5  FALSE FALSE  TRUE FALSE                     TRUE
## 6   TRUE FALSE  TRUE FALSE                    FALSE
## 7  FALSE  TRUE  TRUE FALSE                    FALSE
## 8   TRUE  TRUE  TRUE FALSE                     TRUE
## 9  FALSE FALSE FALSE  TRUE                     TRUE
## 10  TRUE FALSE FALSE  TRUE                    FALSE
## 11 FALSE  TRUE FALSE  TRUE                    FALSE
## 12  TRUE  TRUE FALSE  TRUE                     TRUE
## 13 FALSE FALSE  TRUE  TRUE                    FALSE
## 14  TRUE FALSE  TRUE  TRUE                     TRUE
## 15 FALSE  TRUE  TRUE  TRUE                     TRUE
## 16  TRUE  TRUE  TRUE  TRUE                    FALSE

Looks good! The full code can also be found here: Rosetta Code: Truth Table: R.
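As a quick plausibility check of such tables, one can verify a classic identity like De Morgan’s law directly over all input combinations, using the same `expand.grid()` building block as above:

```r
# Verify De Morgan's law: !(A & B) is equivalent to !A | !B
perm <- expand.grid(A = c(FALSE, TRUE), B = c(FALSE, TRUE))
lhs <- with(perm, !(A & B))
rhs <- with(perm, !A | !B)
identical(lhs, rhs)
## [1] TRUE
```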

I suspect that this function will come in handy for solving further tasks in the future, so stay tuned!

I sometimes joke that as an Aries I don’t believe in zodiac signs. But could there still be some pattern, e.g. in the sense that people born in spring are more prone to success than those born during the winter months?

In this post, we will provide a definitive answer with one of the most fascinating datasets I have ever encountered, so read on!

The data we will be using is from the extraordinary Pantheon project:

Pantheon is an observatory of collective memory focused on biographies with a presence in at least 15 languages in Wikipedia. We have data on more than 85,000 biographies, organized by countries, cities, occupations, and eras. Explore this data to learn about the characters that shape human culture. Pantheon began as a project at the Collective Learning group at MIT.

To test whether zodiac signs have any bearing on success we will do the following three steps:

- Load the Pantheon project data of famous people, subset all living persons born in the US and calculate their zodiac signs.
- Load the distribution of zodiac signs of all US citizens.
- Test whether there is a statistically significant difference between both distributions.

The latest Pantheon data can be loaded from here: Pantheon datasets. It is a bzip2-compressed comma-delimited file, which can be loaded directly into R without any intermediate steps:

pantheon <- read.csv("data/person_2020_update.csv.bz2", encoding = "UTF-8")
data <- pantheon[pantheon$bplace_country == "United States" & pantheon$alive == TRUE, ]
nrow(data)
## [1] 10106

head(data$name, 25)
##  [1] "Donald Trump"       "Jimmy Carter"       "Sylvester Stallone"
##  [4] "Hillary Clinton"    "Steven Spielberg"   "Martin Scorsese"
##  [7] "Clint Eastwood"     "Al Pacino"          "Bill Gates"
## [10] "Cher"               "Robert De Niro"     "Morgan Freeman"
## [13] "Warren Buffett"     "Jack Nicholson"     "Al Gore"
## [16] "Noam Chomsky"       "Danny DeVito"       "Woody Allen"
## [19] "Stephen King"       "Joe Biden"          "Dustin Hoffman"
## [22] "Bob Dylan"          "Tina Turner"        "Meryl Streep"
## [25] "Bill Clinton"

We see that we get more than 10,000 famous persons born in the US; the first 25 are listed above.

Now for the data on the distribution of zodiac signs for the US population as a whole. You can download the file from here: distribution_zodiac_US (source).

distr_zodiac_US <- read.csv("data/distribution_zodiac_US.csv")  # change path accordingly
distr_zodiac_US <- structure(distr_zodiac_US$Percent.of.US.Population,
                             names = distr_zodiac_US$Zodiac.Sign)

After having all the data available, we will calculate the zodiac signs with the `DescTools` package (on CRAN) and create a table with both distributions:

library(DescTools)

birthdates <- substr(data$birthdate[nchar(data$birthdate) > 0], 6, 10)
zodiac <- table(Zodiac(as.Date(birthdates, "%m-%d")))
zodiacs <- rbind(prop.table(zodiac), distr_zodiac_US[names(zodiac)])
row.names(zodiacs) <- c("celeb US", "whole US")
zodiacs <- 100 * zodiacs
round(zodiacs, 2)
##          Capricorn Aquarius Pisces Aries Taurus Gemini Cancer  Leo Virgo Libra
## celeb US      8.20     6.30   9.00  8.10   8.30    9.2   8.40 7.10  9.30  8.70
## whole US      7.54     7.56   8.75  8.72   7.74    8.7   9.01 8.97  8.49  8.63
##          Scorpio Sagittarius
## celeb US    9.40        7.30
## whole US    8.03        7.87

We see that there are differences, e.g. there are more celebrities with the zodiac sign Leo than in the general population, but are those differences statistically significant (to understand the concept of statistical significance please see: From Coin Tosses to p-Hacking: Make Statistics Significant Again!)?

To evaluate this, we perform a chi-squared goodness-of-fit test:

chisq.test(x = zodiacs["celeb US", ], p = zodiacs["whole US", ]/100)
## 
##  Chi-squared test for given probabilities
## 
## data:  zodiacs["celeb US", ]
## X-squared = 1.1756, df = 11, p-value = 0.9999

The result is crystal clear: the two distributions are not statistically significantly different (the p-value is way above the significance level of 0.05); or, put another way, your zodiac sign says nothing about your success in life! The numbers don’t lie (and the stars are silent).
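For those who want to see what `chisq.test()` actually computes: the test statistic is just the sum of squared deviations between observed and expected values, scaled by the expected values. A small sketch with made-up numbers (not the Pantheon data):

```r
# Chi-squared goodness-of-fit statistic by hand: sum((O - E)^2 / E)
observed <- c(30, 20, 25, 25)            # hypothetical observed counts
p_expected <- c(0.25, 0.25, 0.25, 0.25)  # hypothesized probabilities
expected <- sum(observed) * p_expected   # expected counts under the null
x2 <- sum((observed - expected)^2 / expected)
x2
## [1] 2

# matches the built-in test
unname(chisq.test(observed, p = p_expected)$statistic)
## [1] 2
```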

If you have more ideas on how to use this fantastic dataset, please let me know in the comments!

This is our 101st blog post here on

Oftentimes the different concepts of *data science*, namely *artificial intelligence (AI)*, *machine learning (ML)*, and *deep learning (DL)*, are confused… so we asked the most advanced AI in the world, **OpenAI GPT-3**, to write a guest post for us to provide some clarification on their definitions and how they are related.

We are most delighted to present this very impressive (and only slightly redacted) essay to you – enjoy!

Artificial intelligence (AI), machine learning (ML), and deep learning (DL) are related concepts that are often used interchangeably. They are also three distinct and different concepts. In this blog post, we will define artificial intelligence, machine learning, and deep learning and explain why they are all different and how they are related.

AI is a broad and complex concept that has been around for decades. AI is used to describe a concept or a system that mimics the cognitive functions of the human brain. It can be used to describe a situation where machines can act or behave in a way that mimics human behavior. AI is often used to describe a system that can learn from experience, use knowledge to perform tasks, reason, and make decisions.

There are many different types of AI. For example, there are expert systems, neural networks, and fuzzy logic. In this blog post, we are going to focus on the different types of machine learning. A machine learning model is an AI system that can learn from a dataset and can make predictions or decisions based on the data (see also So, what is AI really?).

Machine learning is a subset of AI and is a method for algorithms to learn from data. It can be used to build models that can predict future behavior based on past experience. Machine learning is used to analyze large datasets and to find patterns in the data. An example of a machine learning model is a spam filter that learns to differentiate between spam and non-spam messages.

There are three different types of machine learning. Each of them is used for a different type of problem.

Supervised learning is the most common type of machine learning. It is used to find patterns in data and is used to predict future behavior based on past experience. In supervised learning, the data is split into two parts, known as the training set and the test set. The training set is used to train the model and the test set is used to evaluate the accuracy of the model.

The goal of supervised learning is to find a relationship between independent variables and dependent variables. The independent variables are the things that we know about the data. For example, the independent variables are the features that are used to describe a customer. The dependent variables are the things we want to know about the data. For example, the dependent variable is the profit made from a specific customer. The relationship between the independent variables and the dependent variables is known as the model.

Supervised learning can be used to predict the profit of a customer based on the features of the customer.
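To make this concrete, here is a tiny supervised-learning sketch in R (our own addition, with the built-in `mtcars` data standing in for customer data): a linear model is fitted on a training set and then evaluated on a held-out test set:

```r
set.seed(42)                             # for reproducibility
idx <- sample(nrow(mtcars), 22)          # ~2/3 of the rows as training set
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]                  # held-out test set

model <- lm(mpg ~ wt + hp, data = train) # independent: wt, hp; dependent: mpg
pred  <- predict(model, newdata = test)

rmse <- sqrt(mean((test$mpg - pred)^2))  # root mean squared error on the test set
rmse
```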

Unsupervised learning is used to find structure in the data. In unsupervised learning, there are no labelled outcomes to learn from; the model has to find structure in the data on its own. It can be used to find groups or clusters in the data or to identify anomalies in the data.

Unsupervised learning can be used to find groups of similar customers.
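Again a small sketch (ours, with the built-in `iris` measurements standing in for customer features): k-means clustering finds groups of similar observations without being told any labels:

```r
set.seed(42)                            # for reproducibility
km <- kmeans(iris[, 1:4], centers = 3)  # find 3 clusters in the 4 numeric columns
table(km$cluster)                       # sizes of the discovered groups
```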

Reinforcement learning is a type of machine learning that is used to find good actions or decisions based on the data. Reinforcement learning is used to find an optimal action or decision that will maximize the reward. It is used to find the optimal solution to a problem. The optimal solution depends on the reward function.

Reinforcement learning can be used to optimize different types of problems. For example, it can be used to optimize a non-linear function or to find the shortest route in a network (see also Reinforcement Learning: Life is a Maze).
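As a minimal reinforcement-learning sketch (our own toy example, not from the essay): an epsilon-greedy agent facing two actions with unknown rewards learns, by trial and error, which action maximizes the reward:

```r
set.seed(42)                                       # for reproducibility
reward <- function(a) rnorm(1, mean = c(1, 2)[a])  # action 2 pays more on average
q <- c(0, 0)  # estimated value of each action
n <- c(0, 0)  # how often each action was tried
for (i in 1:1000) {
  a <- if (runif(1) < 0.1) sample(2, 1) else which.max(q)  # explore vs exploit
  r <- reward(a)
  n[a] <- n[a] + 1
  q[a] <- q[a] + (r - q[a]) / n[a]  # incremental mean update of the estimate
}
which.max(q)  # the agent settles on the better action (2)
```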

Deep learning is a subset of machine learning that uses artificial neural networks. Artificial neural networks are computational models that are inspired by the architecture of the human brain. They are used to develop algorithms that can learn from data (see also Understanding the Magic of Neural Networks).

Deep learning is used to build models that can classify data or find patterns in the data. Deep learning is used to perform complex tasks such as object recognition, speech recognition, and translation. Deep learning is the most popular type of machine learning.

In this blog post, we explained the difference between artificial intelligence, machine learning, and deep learning.

We also covered the three different types of machine learning (supervised learning, unsupervised learning, and reinforcement learning) and explained how they are related.

*It seems almost impossible but this whole post was really written by a very advanced AI and only slightly redacted. You won’t find it anywhere else on the internet, it is unique. I think it is fair to say that we haven’t even begun to understand the full potential of this new technology…*