How lucrative stocks are in the long run depends not only on the length of the investment period but even more on the actual dates the investment starts and ends!

*Return Triangle Plots* are a great way to visualize this phenomenon. If you want to learn more about them and how to create them with R read on!

If you had invested in the Standard & Poor's 500 index at the beginning of 2000, you would have had to wait 14 years until you were back in the black! The reason was, of course, the so-called dot-com bubble, which was at its peak then and crashed soon afterwards. On the other hand, if you had invested in the same index at the beginning of 2003, you would never have dropped below your initial investment and would have a return of more than 300% by now!

Return triangle plots are a great way to get to grips with this. The following function returns a return triangle for any ticker symbol (`Symbol`) and for any start (`from`) and end year (`to`). For retrieving the stock or index data we use the wonderful `quantmod` package (on CRAN):

```r
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Version 0.4-0 included new data defaults. See ?getSymbols.

return_triangle <- function(Symbol = "^GSPC", from = 2000, to = 2020) {
  symbol <- getSymbols(Symbol, from = paste0(from, "-01-01"),
                       to = paste0(to, "-12-31"), auto.assign = FALSE)
  symbol_y <- coredata(to.yearly(symbol)[ , c(1, 4)])
  from_to <- seq(from, to)
  M <- matrix(NA, nrow = length(from_to), ncol = length(from_to))
  rownames(M) <- colnames(M) <- from_to
  for (buy in seq_along(from_to)) {
    for (sell in seq(buy, length(from_to))) {
      M[buy, sell] <- (symbol_y[sell, 2] - symbol_y[buy, 1]) / symbol_y[buy, 1]
    }
  }
  round(100 * M, 1)
}

rt <- return_triangle(from = 2009, to = 2020)
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
```
```r
print(rt, na.print = "")
##      2009 2010 2011 2012  2013  2014  2015  2016  2017  2018  2019  2020
## 2009 23.5 39.3 39.3 57.9 104.7 128.0 126.4 147.9 196.1 177.6 257.8 313.3
## 2010      12.6 12.6 27.7  65.5  84.4  83.1 100.5 139.5 124.5 189.4 234.2
## 2011            0.0 13.4  47.0  63.7  62.5  78.0 112.6  99.3 156.9 196.8
## 2012                13.3  46.8  63.6  62.4  77.8 112.4  99.1 156.6 196.5
## 2013                      29.6  44.4  43.3  57.0  87.5  75.8 126.5 161.7
## 2014                            11.5  10.7  21.3  44.8  35.8  75.0 102.2
## 2015                                  -0.7   8.7  29.9  21.8  56.9  81.3
## 2016                                         9.8  31.2  23.0  58.5  83.1
## 2017                                              18.7  11.3  43.5  65.8
## 2018                                                    -6.6  20.4  39.1
## 2019                                                          30.4  50.7
## 2020                                                                15.0
```

The rows represent the buy dates (beginning of the respective year), the columns the sell dates (end of the respective year). To create a return triangle plot out of that data we use the fantastic `plot.matrix` package (on CRAN):

```r
library(plot.matrix)
rt <- return_triangle(from = 2000, to = 2020)
bkp_par <- par(mar = c(5.1, 4.1, 4.1, 4.1)) # adapt margins
plot(rt, digits = 1, text.cell = list(cex = 0.5), breaks = 15,
     col = colorRampPalette(c("red", "white", "green1", "green2",
                              "green3", "green4", "darkgreen")),
     na.print = FALSE, border = NA, key = NULL,
     main = "S&P 500", xlab = "sell", ylab = "buy")
par(bkp_par)
```

As it stands, some care needs to be taken with setting the `breaks` and `col` arguments when creating your own triangle plots. It might help to remove `key = NULL` so that you can see in the legend whether values below zero are in red and above in green. If you know an elegant method to set those values automatically, please share it with us in the comments below. I will update the post with an honourable mention of you!
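One simple idea, offered here only as a sketch (the matrix `rt` below is a toy stand-in for the real return triangle): make the breaks symmetric around the largest absolute value with an even number of intervals, so that zero is exactly one of the break points and a diverging palette splits cleanly into red below and green above zero.

```r
rt <- matrix(c(-10, 5, NA, 120), 2, 2)     # toy stand-in for the real return triangle
rng <- max(abs(range(rt, na.rm = TRUE)))   # largest absolute return
breaks <- seq(-rng, rng, length.out = 17)  # 16 equal intervals, symmetric around zero
breaks[9]                                  # the middle break point is exactly zero
## [1] 0
```

Such a `breaks` vector, together with a diverging palette of 16 colours, could then be passed to the `plot` call above.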

Back to the triangle plot itself: you can clearly see how whole periods form clusters of positive and negative returns… like a heat map with green hills and red valleys.

In the long run, all investments get into the green, but with huge differences of sometimes several hundred percentage points! So even for long-term investors timing is indeed important, but, as every quant knows, unfortunately very, very hard (if not outright impossible).

The workhorse of Machine Learning is *Gradient Descent*.

Gradient Descent is a mathematical algorithm to optimize functions, i.e. finding their minima or maxima. In Machine Learning it is used to *minimize* the *cost function* of many learning algorithms, e.g. artificial neural networks a.k.a. deep learning. The cost function simply is the function that measures how good a set of predictions is compared to the actual values (e.g. in regression problems).

The gradient (technically the *negative* gradient) is the direction of steepest descent. Just imagine a skier standing on top of a hill: the direction in which the slope falls away most steeply is the (negative) gradient!

Mathematically, the gradient “stores” all the partial derivative information of a multivariable function, basically the slopes with respect to the directions of all axes. So, the Gradient Descent algorithm always moves in the direction of the *negative* gradient to reach the minimum of the function (NB: in some cases unfortunately only a *local minimum*!), like our skier always taking the steepest route to reach the valley as fast as possible!
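To make this “bundle of partial derivatives” concrete, here is a small base-R sketch (the function `f` below is just an illustrative example, not taken from the post) that approximates the gradient numerically via central differences:

```r
f <- function(x, y) x^2 + 2 * y^2  # a simple bowl-shaped example function

# numerical gradient via central differences
num_grad <- function(f, x, y, h = 1e-6) {
  c(df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h),
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h))
}

num_grad(f, 1, 1)                  # analytic gradient at (1, 1) is c(2, 4)
c(1, 1) - 0.1 * num_grad(f, 1, 1)  # one descent step moves towards the minimum at (0, 0)
```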

The question is: why does a bundle of partial derivatives point in the direction of the steepest descent? I answered this question some time ago here: Math.SE: Why is gradient the direction of steepest ascent?

To give some intuition why the gradient (technically the negative gradient) has to point in the direction of steepest descent I created the following animation.

It shows all of the points that can be reached by a vector of a given length, and a very simple linear function of two variables x and y, each multiplied by a constant and summed up (which gives very simple directional derivatives).

I then vary the constants relative to each other: when the constant of x goes up (down) the constant of y goes down (up). The red area marks the highest point, which means that you have the steepest descent from there.

As can be seen, this point varies smoothly with the proportion of the constants which represent the derivatives in each direction!

Only when one constant equals zero do we have a corner solution; when both constants are the same, the red area is exactly in the middle. There is no good reason why the red area (= steepest descent) should jump around between those points.

This means that the gradient will always point in the direction of steepest descent (NB: this is of course not a proof but a hand-waving indication of its behaviour, to give some intuition only!).

I created the animation with the `rgl` package (on CRAN) and *ImageMagick* (see also this post: Creating a Movie with Data from Outer Space in R).

The following fully documented code can be taken as a template to create plots and animations of 3D-functions with constraints (or inequalities). `f` is defined as follows:

f(x, y) = (3 - n) * x + n * y for sqrt(x^2 + y^2) < 10 (and NA otherwise),

where n is varied between 0 and 3.

```r
library(rgl)

# create many x and y pairs as input of function
x <- y <- seq(0, 10, 0.01)

# define 3D-function with constraint
f <- function(x, y) {
  ifelse(sqrt(x^2 + y^2) < 10, (3-n)*x + n*y, NA)
}

# for rainbow colouring the function
nbcol <- 100
color <- rev(rainbow(nbcol, start = 0/6, end = 4/6))

# loop for 3D-plots of function with n from 0 to 3 and back
olddir <- setwd("anim") # change path accordingly
nlim <- 3
step <- 0.1
niter <- c(seq(0, nlim, step), seq(nlim-step, step, -step))
for (i in 1:length(niter)) {
  n <- niter[i]
  z <- outer(x, y, f)
  zcol <- cut(z, nbcol)
  persp3d(x, y, z, col = color[zcol])
  filename <- paste0("pic", ifelse((i-1) < 10, paste0("0", (i-1)), (i-1)), ".png")
  rgl.snapshot(filename, fmt = "png", top = TRUE)
}

# make animated gif with ImageMagick
system("cmd.exe", input = "convert -delay 20 -loop 0 pic*.png gradient.gif")
setwd(olddir)
```

I am always fascinated by how versatile R is!

In this year’s end post I will give you a little programming challenge!

Everybody knows the Christmas song “The Twelve Days of Christmas”! Your task is to write an R script that creates the lyrics!

The lyrics are the following:

On the first day of Christmas

My true love gave to me:

A partridge in a pear tree.

On the second day of Christmas

My true love gave to me:

Two turtle doves and

A partridge in a pear tree.

On the third day of Christmas

My true love gave to me:

Three french hens

Two turtle doves and

A partridge in a pear tree.

On the fourth day of Christmas

My true love gave to me:

Four calling birds

Three french hens

Two turtle doves and

A partridge in a pear tree.

…

On the twelfth day of Christmas,

My true love gave to me:

Twelve drummers drumming

Eleven pipers piping

Ten lords a-leaping

Nine ladies dancing

Eight maids a-milking

Seven swans a-swimming

Six geese a-laying

Five golden rings

Four calling birds

Three french hens

Two turtle doves and

A partridge in a pear tree.

Your challenge is to write an R script to create the above lyrics. I provide the building blocks here as a starting point:

```r
gifts <- c("A partridge in a pear tree.", "Two turtle doves and", "Three french hens",
           "Four calling birds", "Five golden rings", "Six geese a-laying",
           "Seven swans a-swimming", "Eight maids a-milking", "Nine ladies dancing",
           "Ten lords a-leaping", "Eleven pipers piping", "Twelve drummers drumming")
days <- c("first", "second", "third", "fourth", "fifth", "sixth",
          "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth")
```

Hint: for the output you can use the `cat` function, to concatenate (combine) strings you can use `paste`, and for a new line use `"\n"`.
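Just to illustrate how those building blocks fit together (this is not the solution yet):

```r
# paste combines strings, cat prints them, "\n" starts a new line
cat(paste("On the", "first", "day of Christmas"), "\n")
## On the first day of Christmas
```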

I will provide my solution below but you should give it a try… Have fun!

Here is my solution, which I also posted on Rosetta Code: The Twelve Days of Christmas:

```r
for (i in 1:length(days)) {
  cat("On the", days[i], "day of Christmas\n")
  cat("My true love gave to me:\n")
  cat(paste(gifts[i:1], collapse = "\n"), "\n\n")
}
```

I am always amazed at how elegantly one can code with R! If you have other solutions please don’t hesitate to share them with us in the comment section below.

I wish you all a Merry Christmas, Happy Holidays and A Happy New Year! (Hopefully, 2021 will be a better one than 2020!)

And above all: Please stay safe!

We will be taking our Christmas break and will be back on January 12, 2021!

COVID-19 has the world more than ever in its grip – but there is hope: several vaccines have been developed which promise to deliver “95% efficacy”.

When people read this, many assume that it means that 95% of vaccinated persons will be protected from infection – but that is not true. Even many (science) journalists get it wrong! If you want to understand what it really means, read on!

Let us take the example of the German biotech company *BioNTech* which developed its mRNA vaccine *BNT162b2* in cooperation with pharmaceutical giant *Pfizer*. First of all, let me make it very clear that this is an extraordinary achievement and I am hopeful that it will prove to be a real game-changer in the fight against this horrible virus. I stress this because nowadays you cannot be cautious enough not to be instrumentalized by all kinds of strange people who have an interest in downplaying this devastating pandemic.

Now, what does a so-called *efficacy* rate of 95% mean? Let me start by stating what it *doesn’t* mean: it doesn’t mean that 95 out of 100 vaccinated persons will be protected from COVID-19, nor does it mean that it will reduce the severity of the illness in case you contract the virus despite being vaccinated.

To understand the real meaning let us go through the press release of Pfizer:

Primary efficacy analysis demonstrates BNT162b2 to be 95% effective against COVID-19 beginning 28 days after the first dose; 170 confirmed cases of COVID-19 were evaluated, with 162 observed in the placebo group versus 8 in the vaccine group

What Pfizer did was to select about 43,000 voluntary participants, of whom about half received the vaccine and the other half (= the *control group*) received only a *placebo*, without any active substance. About a month after the first dose (i.e. one week after the second dose), they started counting the number of confirmed COVID-19 cases in each group: in the placebo group 162 cases were confirmed, whereas in the vaccine group only 8 cases appeared.

To get to the 95%-number the following calculation was performed:

```r
(1 - 8/162) * 100
## [1] 95.06173
```

So the 95% is the *relative* risk-reduction in infections, it doesn’t tell us the *absolute* probability of not getting infected despite being vaccinated!
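To see the difference, we can also compute the *absolute* risk reduction from the same trial numbers (assuming, as a rough approximation, that the roughly 43,000 participants were split into two groups of about 21,500 each):

```r
n_group <- 21500                        # assumption: ~43,000 participants, split in half
risk_placebo <- 162 / n_group           # observed infection risk without vaccine
risk_vaccine <- 8 / n_group             # observed infection risk with vaccine
rrr <- 1 - risk_vaccine / risk_placebo  # relative risk reduction
arr <- risk_placebo - risk_vaccine      # absolute risk reduction
round(100 * c(rrr = rrr, arr = arr), 2)
##   rrr   arr
## 95.06  0.72
```

So while the relative reduction is 95%, the absolute infection risk within the trial period dropped by less than one percentage point, which is exactly what the personograph plots below illustrate.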

We already covered relative vs. absolute risk reductions (and their dangers) on this blog. There we saw that *personograph plots* (also called *Kuiper-Marshall plots*) are an excellent way to communicate risks.

Let us start by illustrating the confirmed COVID-19 cases in the placebo (= control) group for a more manageable group size of 2,500…

```r
library(personograph) # first install from CRAN
## Loading required package: grImport
## Loading required package: grid
## Loading required package: XML

n <- 2500
inf_wo_vac <- 20 / n
data <- list(first = inf_wo_vac, second = 1-inf_wo_vac)
personograph(data, colors = list(first = "red", second = "lightgrey"),
             fig.title = "20 of 2500 infected without vaccine",
             draw.legend = FALSE, n.icons = n, dimensions = c(25, 100),
             plot.width = 0.97)
```

…and now for the vaccine group of the same size:

```r
inf_w_vac <- 1 / n
data <- list(first = inf_w_vac, second = 1-inf_w_vac)
personograph(data, colors = list(first = "red", second = "lightgrey"),
             fig.title = "1 of 2500 infected despite of vaccine",
             draw.legend = FALSE, n.icons = n, dimensions = c(25, 100),
             plot.width = 0.97)
```

I think that those plots really put matters in perspective.

The main problem is that vaccine studies cannot directly measure what we really want to know: the *effectiveness* of the vaccine in the real world, i.e. how well it protects us from contracting the disease. Therefore they use *efficacy* as a proxy instead, i.e. relative risk-reduction of infections in the two study groups. This can give a good indication of the order of magnitude of the real-world effect but is not the same!

So while “95% effective” indeed does mean something different from what most people think it means let us hope that it translates into a real turning point for the better.

We already had a lot of examples that make use of the `OneR` package (on CRAN), which can be found in the respective Category: OneR. Here we will give you some concrete examples in the area of research on *Type 2 Diabetes Mellitus (DM)* to show that the package is especially well suited to the field of medical research, so read on!

One of the big advantages of the package is that the resulting models are often not only highly accurate but also very easy to interpret:

- the predictors are ordered from best to worst (based on accuracy) and *the best one is chosen*,
- the model is given in the form of *simple if-then rules*,
- the rules contain *exact cut-off points*.

An additional advantage, compared to other methods, is that with the included `optbin` function you find as many cut-off points as are needed to separate all the classes, instead of just one (as e.g. with decision trees).

For more advantages, a quick introduction, and a real-world example in the area of *histology* (the study of the microscopic structure of tissues) for *breast cancer detection* have a look at the official vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models.

The first example is based on the *early-stage diabetes risk prediction dataset* from the Queen Mary University of London, which contains the sign and symptom data of newly diabetic or would-be diabetic patients (diabetes_data_upload.csv). We use this dataset to find the leading symptoms of diabetes:

```r
library(OneR)

# leading symptoms
data1 <- read.csv("data/diabetes_data_upload.csv") # adjust path accordingly
OneR(data1, verbose = TRUE)
##
##     Attribute           Accuracy
## 1 * Polyuria            82.31%
## 2   Polydipsia          80.19%
## 3   partial.paresis     69.23%
## 4   sudden.weight.loss  69.04%
## 5   Gender              68.08%
## 6   Alopecia            65.96%
## 7   Polyphagia          65.58%
## 8   Age                 64.42%
## 9   weakness            63.65%
## 10  Genital.thrush      61.54%
## 10  visual.blurring     61.54%
## 10  Itching             61.54%
## 10  Irritability        61.54%
## 10  delayed.healing     61.54%
## 10  muscle.stiffness    61.54%
## 10  Obesity             61.54%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
##
## Call:
## OneR.data.frame(x = data1, verbose = TRUE)
##
## Rules:
## If Polyuria = No  then class = Negative
## If Polyuria = Yes then class = Positive
##
## Accuracy:
## 428 of 520 instances classified correctly (82.31%)
```

As we can see in the table the leading symptoms are *polyuria* (excessive urination volume) and *polydipsia* (excessive thirst) with an accuracy of over 80 percent each. This result is corroborated by the medical literature.

The next dataset is the quite famous *Pima Indians Diabetes Database* which is often used as a benchmark for machine learning methods. It can be found in the `mlbench` package (on CRAN):

```r
# glucose
library(mlbench)
data("PimaIndiansDiabetes")
data2 <- PimaIndiansDiabetes
OneR(optbin(data2))
##
## Call:
## OneR.data.frame(x = optbin(data2))
##
## Rules:
## If glucose = (-0.199,141] then diabetes = neg
## If glucose = (141,199]    then diabetes = pos
##
## Accuracy:
## 573 of 768 instances classified correctly (74.61%)
```

*Glucose* (blood sugar) with a cut-off value of 141 is identified as the main predictor of DM; the “official” cut-off point is at 140 mg/dl.

The last dataset is from a *National Health and Nutrition Examination Survey (NHANES)*: nhgh.rda (here you can find more info on the attributes of the dataset).

```r
# HbA1c
load("data/nhgh.rda") # adjust path accordingly
data3 <- nhgh[ , !names(nhgh) %in% c("seqn", "tx")]
OneR(optbin(dx ~., data = data3, method = "infogain"))
## Warning in optbin.data.frame(x = data, method = method, na.omit = na.omit):
## target is numeric
## Warning in optbin.data.frame(x = data, method = method, na.omit = na.omit): 1452
## instance(s) removed due to missing values
##
## Call:
## OneR.data.frame(x = optbin(dx ~ ., data = data3, method = "infogain"))
##
## Rules:
## If gh = (3.99,6.4] then dx = 0
## If gh = (6.4,15.5] then dx = 1
##
## Accuracy:
## 4955 of 5343 instances classified correctly (92.74%)
```

Here *HbA1c* (glycated hemoglobin, measured primarily to determine the three-month average blood sugar level) with a cut-off value of 6.4 is identified as the main predictor of DM with an accuracy of nearly 93%; the “official” cut-off point lies at 6.5%.

In fact, several researchers around the world use the OneR package already. To give you just one publication: Computational prediction of diagnosis and feature selection on mesothelioma patient health records by D. Chicco and C. Rovelli, PLoS One, 2019.

I myself have a paper on COVID-19 under review which was submitted in cooperation with Dr. med. Anna Laura Herzog and Prof. Dr. med. Patrick Meybohm, both from the renowned University Hospital Würzburg, where we used the OneR package among other machine learning methods.

I hope that you can see that the OneR package is well worth a try (not only) in the field of medical research. If you have a project in mind where you are looking for a cooperation partner please leave a note in the comments or contact me directly: About.

We already covered artificial neural networks in the post “Understanding the Magic of Neural Networks”.

If you want to gain an even deeper understanding of the fascinating connection between those two popular machine learning techniques, neural networks and logistic regression, read on!

Let us recap what an artificial neuron looks like:

Mathematically it is some kind of *non-linear activation function* applied to the *scalar product* of the *input vector* and the *weight vector*. One of the inputs, the so-called *bias* (neuron), is fixed at 1.

The activation function can e.g. be (and often is) the *logistic* function (which is an example of a *sigmoid* function):

```r
logistic <- function(x) 1 / (1 + exp(-x))
curve(logistic, -6, 6, lwd = 3, col = "blue")
abline(h = c(0, 0.5, 1), v = 0)
```

Written mathematically, an artificial neuron performs the following function:

output = logistic(w · x) = 1 / (1 + exp(-w · x))

Written this way, that is nothing else but a logistic regression as a *Generalized Linear Model (GLM)*, which is itself basically the logistic function of a *simple linear regression*! More precisely, it is the probability, given by a binary logistic regression, that the actual class is equal to 1. So, basically:

neuron = logistic regression = logistic(linear regression)
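We can check this equivalence directly in R with the built-in `mtcars` data set (an arbitrary illustrative example): the response of a fitted binary `glm` is just the logistic function (`plogis` in base R) applied to its linear predictor:

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)  # transmission type vs. car weight
linear_part <- predict(fit, type = "link")             # the "linear regression" inside
all.equal(plogis(linear_part),                         # logistic of the linear part...
          predict(fit, type = "response"))             # ...equals the predicted probability
## [1] TRUE
```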

The following table translates the terms used in each domain:

| Neural network | Logistic regression |
|---|---|
| Activation function | Link function |
| Weights | Coefficients |
| Bias | Intercept |
| Learning | Fitting |

Interestingly enough, there is no closed-form solution for logistic regression either, so the fitting is also done via a numeric optimization algorithm like *gradient descent*. Gradient descent is also widely used for the training of neural networks (if you want to understand how it works see here: Why Gradient Descent Works (and How To Animate 3D-Functions in R)).
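As a small illustration (a sketch on simulated data, not code from the original post), the following base-R snippet fits a logistic regression by plain gradient ascent on the log-likelihood and compares the result with `glm`:

```r
set.seed(123)
x <- cbind(1, rnorm(100))                    # design matrix: intercept + one predictor
y <- rbinom(100, 1, plogis(x %*% c(-1, 2)))  # simulate binary outcomes

beta <- c(0, 0)                              # start at zero
for (i in 1:10000) {
  p <- plogis(x %*% beta)                    # current predicted probabilities
  beta <- beta + 0.001 * t(x) %*% (y - p)    # step along the gradient of the log-likelihood
}

cbind(gradient = beta, glm = coef(glm(y ~ x[ , 2], family = binomial)))
```

Both columns should agree closely: gradient descent (here ascent, because we maximize the likelihood) finds the same coefficients as the iteratively reweighted least squares algorithm used by `glm`.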

To illustrate this connection in practice we will again take the example from “Understanding the Magic of Neural Networks” to classify points in a plane, but this time with the logistic function and some more learning cycles. Have a look at the following table:

| Input 1 | Input 2 | Output |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 0 | 1 | 1 |

If you plot those points with the colour coded pattern you get the following picture:

The task for the neuron is to find a separating line and thereby classify the two groups. Have a look at the following code:

```r
# inspired by Kubat: An Introduction to Machine Learning, p. 72
plot_line <- function(w, col = "blue", add = FALSE, type = "l")
  curve(-w[1] / w[2] * x - w[3] / w[2], xlim = c(-0.5, 1.5), ylim = c(-0.5, 1.5),
        col = col, lwd = 3, xlab = "Input 1", ylab = "Input 2", add = add, type = type)

# logistic function on scalar product of weights and input
neuron <- function(input) as.vector(logistic(input %*% weights))

eta <- 0.7 # learning rate

# examples
input <- matrix(c(1, 0, 0, 0, 1, 1, 0, 1), ncol = 2, byrow = TRUE)
input <- cbind(input, 1) # bias for intercept of line
output <- c(0, 1, 0, 1)
weights <- rep(0.2, 3) # initial weights

plot_line(weights, type = "n"); grid()
points(input[ , 1:2], pch = 16, col = (output + 2))

# training of weights of neuron
for (i in 1:1e5) {
  for (example in 1:length(output)) {
    weights <- weights + eta * (output[example] - neuron(input[example, ])) * input[example, ]
  }
}
plot_line(weights, add = TRUE, col = "black")
```

```r
# test: applying neuron on input
round(apply(input, 1, neuron))
## [1] 0 1 0 1
```

As you can see, the result matches the desired output, graphically the black line separates the green from the red points: the neuron has learned this simple classification task. Now let us do the same with logistic regression:

```r
# logistic regression - glm stands for generalized linear model
logreg <- glm(output ~ ., data = data.frame(input, output), family = binomial)

# test: prediction logreg on input
round(predict(logreg, data.frame(input), "response"), 3)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## 1 2 3 4
## 0 1 0 1

plot_line(weights, type = "n"); grid()
points(input[ , 1:2], pch = 16, col = (output + 2))
plot_line(weights, add = TRUE, col = "darkgrey")
plot_line(coef(logreg)[c(2:3, 1)], add = TRUE, col = "black")
```

Here we used the same plotting function with the coefficients of the logistic regression (instead of the weights of the neuron) as input: as you can see the black line perfectly separates both groups, even a little bit better than the neuron (in dark grey). Fiddling with the initialization, the learning rate, and the number of learning cycles should move the neuron’s line even further towards the logistic regression’s perfect solution.

I always find it fascinating to understand the hidden connections between different realms and I think this insight is especially cool: logistic regression really is the smallest possible neural network!

Many services on the internet provide a *web service* so that it is easier for machines to get access to their data. The access point is called an *Application Programming Interface (API)* and there are different types of APIs. One especially widespread type is known under the name *REST* (for *REpresentational State Transfer*) and a data format that is often used is called *JSON* (for *JavaScript Object Notation*). So this is what we will do first here: Fetching JSON data from a REST API!

As a simple example, we will use the JSON interface of the xkcd webcomics; the documentation is very concise:

If you want to fetch comics and metadata automatically, you can use the JSON interface. The URLs look like this:

http://xkcd.com/info.0.json (current comic)

or:

http://xkcd.com/614/info.0.json (comic #614)

Those files contain, in a plaintext and easily-parsed format: comic titles, URLs, post dates, transcripts (when available), and other metadata.

To access the data we will use the wonderful `jsonlite` package (on CRAN):

```r
library(jsonlite)

# call api
xkcd <- fromJSON("http://xkcd.com/1838/info.0.json")
str(xkcd)
## List of 11
##  $ month     : chr "5"
##  $ num       : int 1838
##  $ link      : chr ""
##  $ year      : chr "2017"
##  $ news      : chr ""
##  $ safe_title: chr "Machine Learning"
##  $ transcript: chr ""
##  $ alt       : chr "The pile gets soaked with data and starts to get mushy over time, so it's technically recurrent."
##  $ img       : chr "https://imgs.xkcd.com/comics/machine_learning.png"
##  $ title     : chr "Machine Learning"
##  $ day       : chr "17"
```

It couldn’t be any easier, right?

To download the PNG image as a raw file we use `download.file` from Base R:

```r
# download file
download.file(xkcd$img, destfile = "images/xkcd.png", mode = 'wb')
```

Finally, to plot the image we use the `png` package (on CRAN):

```r
library(png)

# plot png
plot(1:2, type = 'n', main = xkcd$title, xlab = "", ylab = "")
rasterImage(readPNG("images/xkcd.png"), 1, 1, 2, 2)
```

If this doesn’t work on your system please consult the documentation, there might be system-related differences.

We can, of course, display the other data directly:

```r
xkcd$alt
## [1] "The pile gets soaked with data and starts to get mushy over time, so it's technically recurrent."
```

This has hopefully given you some inspiration for your own experiments with more sophisticated APIs… if you have interesting examples and use cases please post them in the comments below!

**UPDATE November 24, 2020**

Jerry shared another, very elegant, method with us in the comments. It is based on the excellent `magick` package (on CRAN):

```r
library(magick)
image_read(xkcd$img)
```

One of them happened to approach me to offer me a once in a lifetime investment opportunity. Or so it seemed. Now, there is this old saying that when something seems too good to be true it probably is. If you want to learn what *Benford’s law* is and how to apply it to uncover fraud, read on!

Here are Bernie’s monthly returns (you can find them here: madoff_returns.csv):

```r
madoff_returns <- read.csv("Data/madoff_returns.csv")
equity_curve <- cumprod(c(100, (1 + madoff_returns$Return)))
plot(equity_curve, main = "Bernie's equity curve", ylab = "$", type = "l")
```

An equity curve with annual returns of over 10% as if drawn with a ruler! Wow… and Double Wow! What a hell of a fund manager!

I set off to understand how Bernie accomplished those high and especially extraordinarily stable returns. And found: Nothing! I literally rebuilt his purported split-strike strategy and backtested it, it of course didn’t work. And therefore I didn’t invest with him. A wise decision as history proved. And yet, I learned so much along the way, especially on trading and options strategies.

A very good and detailed account of the Madoff fraud can be read in the excellent book “No One Would Listen: A True Financial Thriller” by whistleblower Harry Markopolos, who was on Bernie’s heels for many years but, as the title says, no one would listen… The reason is some variant of the above wisdom “What seems too good…”: people told him that Bernie could not be a fraud because his fund was so big and other people would have realized that!

One of the red flags that those returns were made up could have been raised by applying Benford’s law. It states that the frequency of the leading digits of many real-world data sets follows a very distinct pattern:

```r
theory <- log10(2:9) - log10(1:8)
theory <- round(c(theory, 1-sum(theory)), 3)
data.frame(theory)
##   theory
## 1  0.301
## 2  0.176
## 3  0.125
## 4  0.097
## 5  0.079
## 6  0.067
## 7  0.058
## 8  0.051
## 9  0.046
```

The discovery of Benford’s law goes back to 1881, when the astronomer Simon Newcomb noticed that in logarithm tables the earlier pages were much more worn than the later pages. It was re-discovered in 1938 by the physicist Frank Benford and subsequently named after him. It is thereby just another instance of *Stigler’s law*, which states that no scientific discovery is named after its original discoverer (Stigler’s law is, by the way, itself an instance of Stigler’s law because the idea goes back at least as far as Mark Twain).

The following analysis is inspired by the great book “Analytics Stories” by my colleague Professor em. Wayne L. Winston from Kelley School of Business at Indiana University. Professor Winston gives an insightful explanation of why Benford’s law holds for many real-world data sets:

Many quantities (such as population and a company’s sales revenue) grow by a similar percentage (say, 10%) each year. If this is the case, and the first digit is a 1, it may take several years to get to a first digit of 2. If your first digit is 8 or 9, however, growing at 10% will quickly send you back to a first digit of 1. This explains why smaller first digits are more likely than larger first digits.

We are going to simulate a growth process by sampling some random numbers as a starting value and a growth rate and letting it grow a few hundred times, each time extracting the first digit of the resulting number, tallying everything up, and comparing it to the above distribution at the end:

```r
# needs dataframe with actual and theoretic distribution
plot_benford <- function(benford) {
  colours = c("red", "blue")
  bars <- t(benford)
  colnames(bars) <- 1:9
  barplot(bars, main = "Frequency analysis of first digits",
          xlab = "Digits", ylab = "Frequency", beside = TRUE,
          col = colours, ylim = c(0, max(benford) * 1.2))
  legend('topright', fill = colours, legend = c("Actual", "Theory"))
}

set.seed(123)
start <- sample(1:9000000, 1)
growth <- 1 + sample(1:50, 1) / 100
n <- 500
sim <- cumprod(c(start, rep(growth, (n-1)))) # vectorize recursive simulation
first_digit <- as.numeric(substr(sim, 1, 1))
actual <- as.vector(table(first_digit) / n)
benford_sim <- data.frame(actual, theory)
benford_sim
##   actual theory
## 1  0.300  0.301
## 2  0.174  0.176
## 3  0.126  0.125
## 4  0.098  0.097
## 5  0.078  0.079
## 6  0.068  0.067
## 7  0.058  0.058
## 8  0.050  0.051
## 9  0.048  0.046
plot_benford(benford_sim)
```

We can see a nearly perfect fit!

We are now doing the same kind of analysis with Bernie’s made-up returns:

```r
first_digit <- as.numeric(substr(abs(madoff_returns$Return * 10000), 1, 1))
actual <- round(as.vector(table(first_digit) / length(first_digit)), 3)
madoff <- data.frame(actual = actual[2:10], theory)
madoff
##   actual theory
## 1  0.391  0.301
## 2  0.135  0.176
## 3  0.093  0.125
## 4  0.060  0.097
## 5  0.051  0.079
## 6  0.079  0.067
## 7  0.065  0.058
## 8  0.070  0.051
## 9  0.051  0.046
plot_benford(madoff)
```

Just by inspection, we can see that something doesn’t seem to be quite right. This is of course no proof but another indication that something could be amiss.
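To go beyond visual inspection one can run a chi-squared goodness-of-fit test against the Benford probabilities. The following is a sketch with hypothetical digit counts (they roughly mimic the skew seen above); for a real analysis you would plug in the actual tally of `first_digit`:

```r
theory <- log10(1 + 1 / (1:9))  # Benford probabilities for digits 1 to 9
observed <- c(84, 29, 20, 13, 11, 17, 14, 15, 11)  # hypothetical first-digit counts
chisq.test(observed, p = theory)  # small p-value: counts deviate from Benford
```

A p-value below the usual significance levels would flag the numbers for closer scrutiny, which is, again, an indication rather than proof of fraud.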

Benford’s law has become one of the standard methods used for fraud detection, *forensic analytics* and *forensic accounting* (also called *forensic accountancy* or *financial forensics*). There are several R packages with which you can fine-tune the above analysis, yet the principle stays the same. Because this has become common knowledge, many of the more sophisticated fraudsters now tailor their numbers to Benford’s law, so that it may become an instance of yet another law, *Goodhart’s law*:

> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

Let us hope that this law doesn’t lead to more lawlessness!

Everybody knows the Simpsons, everybody loves the Simpsons, and everybody can laugh about Bart Simpson writing funny lines on the blackboard! If you want to create your own Bart Simpson chalkboard memes with R, read on!

Conveniently enough, there is a package for creating *memes* already (who would have thought otherwise, because there is a package for *everything*!): the `meme` package by my colleague Professor Guangchuang Yu from the University of Hong Kong. After installing it from CRAN we load it and, assuming that you work on a Windows machine, load the *Comic Sans* font and a clean Bart Simpson blackboard pic into R:

```r
library(meme)

if (.Platform$OS.type == "windows") {
  windowsFonts(Comic = windowsFont("Comic Sans MS"))
}

bart <- "pics/bart_simpson_chalkboard-5157.gif"
# source: http://free-extras.com/images/bart_simpson_chalkboard-5157.htm
```

We can start right away by using the `meme` function with the pic, text, font size, and font as arguments (you might have to adapt the font size; for a new line use the `"\n"` escape sequence):

```r
meme(bart, "\nfor (i in 1:100) {\n print(\"I will not use loops in R\")\n}",
     size = 1.8, font = "Comic", vjust = 0, r = 0)
```

As an aside: to fully appreciate the joke you should know something about *loops* and how to avoid them by making use of *vectorization* in R (see here: Learning R: The Ultimate Introduction (incl. Machine Learning!)). Another method to avoid loops is to use the *apply family of functions* (for those so-called *higher-order functions* see here: Learning R: A Gentle Introduction to Higher-Order Functions).
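For illustration, here is a minimal sketch (the variable names are my own) showing the same repetitive task done with an explicit loop, with vectorized `rep`, and with `sapply` from the apply family:

```r
line <- "I will not use loops in R"

# explicit loop (what Bart would do)
loop_version <- character(100)
for (i in 1:100) loop_version[i] <- line

# vectorized: no loop needed at all
vectorized <- rep(line, 100)

# higher-order function from the apply family
applied <- sapply(1:100, function(i) line)

identical(loop_version, vectorized)  # TRUE
identical(loop_version, applied)     # TRUE
```

All three produce the same character vector, but the vectorized version is the shortest and the most idiomatic R.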

But if you want to go full “Bart Simpson” you, of course, need to repeat the lines several times (the `rep` function comes in handy here). The whole (punitive) work is done by this short piece of code (just change `text` for your own memes):

```r
text <- "I will not waste chalk"
text <- paste(rep(text, 8), collapse = "\n")
text <- paste0("\n", text)

meme(bart, text, size = 1.6, font = "Comic", vjust = 0, r = 0)
```

Happy memeing with BaRt!

In the area of *derivatives* and *structured products* you often want to replicate a given payoff profile with a portfolio of standard instruments such as zero bonds, calls, and puts.

Many years ago there was such a replication tool online, but it is long gone and its domain is inactive. So, based on the old project paper from that website, I decided to program it in R and make it available for free here!

The project paper the algorithm is based on, and which is translated to R here, can be found at: Financial Engineering Tool: Replication Strategy and Algorithm. I will not explain the algorithm as such and how it works because this is done brilliantly in the paper. I won’t get into any details concerning *derivatives* and *structured products* either; you can find tons of material on the web just by googling. So, without further ado, let’s get started!

First, we need a way to define the *payoff function*: for each kink we provide two values, one for the *underlying* which goes from 0 to infinity and one for the payoff we want to replicate. We will use the names used in the paper for all the needed variables for clarity. Let us start by defining a *plain vanilla call*:

```r
payoff <- data.frame(pi = c(0, 100, 110, Inf), f_pi = c(0, 0, 10, Inf))
payoff
##    pi f_pi
## 1   0    0
## 2 100    0
## 3 110   10
## 4 Inf  Inf
```
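As a quick second example (my own, not from the paper): a *plain vanilla put* with strike 100 pays 100 at S(T) = 0, falls linearly to 0 at the strike, and stays flat afterwards, so its last payoff value simply repeats the penultimate one:

```r
# plain vanilla put with strike 100 in the same kink representation
payoff_put <- data.frame(pi = c(0, 100, Inf), f_pi = c(100, 0, 0))
payoff_put
##    pi f_pi
## 1   0  100
## 2 100    0
## 3 Inf    0
```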

The last value of the payoff must be either equal to the penultimate value (= payoff staying flat at the given value) or must be (minus) infinity for a linear continuation in the given direction. Next we want to plot this payoff:

```r
plot_payoff <- function(payoff, xtrpol = 1.5) {
  k <- nrow(payoff) - 1
  payoff_prt <- payoff
  # linear extrapolation of last kink
  payoff_prt$pi[k+1] <- payoff$pi[k] * xtrpol
  slope <- diff(c(payoff$f_pi[k-1], payoff$f_pi[k])) / diff(c(payoff$pi[k-1], payoff$pi[k]))
  payoff_prt$f_pi[k+1] <- ifelse(payoff$f_pi[k] == payoff$f_pi[k+1],
                                 payoff$f_pi[k+1],
                                 payoff$f_pi[k] + payoff$pi[k] * (xtrpol - 1) * slope)
  plot(payoff_prt,
       ylim = c(-max(abs(payoff_prt$f_pi) * xtrpol), max(abs(payoff_prt$f_pi) * xtrpol)),
       main = "Payoff diagram", xlab = "S(T)", ylab = "f(S(T))", type = "l")
  abline(h = 0, col = "blue")
  grid()
  lines(payoff_prt, type = "l")
  invisible(payoff_prt)
}

plot_payoff(payoff)
```

Now comes the actual replication. We need two functions for that: a helper function to calculate some parameters…

```r
calculate_params <- function(payoff) {
  params <- payoff
  k <- nrow(params) - 1
  # add additional columns s_f_pi, lambda and s_lambda
  params$s_f_pi <- ifelse(params$f_pi < 0, -1, 1)
  # linear extrapolation of last kink
  slope <- diff(c(params$f_pi[k-1], params$f_pi[k])) / diff(c(params$pi[k-1], params$pi[k]))
  f_pi_k <- ifelse(params$f_pi[k] == params$f_pi[k+1], params$f_pi[k+1], slope)
  params$lambda <- c(diff(params$f_pi) / diff(params$pi), f_pi_k)
  params$s_lambda <- ifelse(params$lambda < 0, -1, 1)
  # consolidate
  params[k, ] <- c(params[k, 1:3], params[(k+1), 4:5])
  params <- params[1:k, ]
  params
}
```

…and the main function with the replication algorithm:

```r
replicate_payoff <- function(payoff) {
  params <- calculate_params(payoff)
  suppressMessages(attach(params))
  k <- nrow(params)
  portfolios <- as.data.frame(matrix("", nrow = k, ncol = 6))
  colnames(portfolios) <- c("zerobonds", "nominal", "calls", "call_strike", "puts", "put_strike")
  # step 0 (initialization)
  i <- 1
  i_r <- 1
  i_l <- 1
  while (i <= k) {
    # step 1 (leveling)
    if (f_pi[i] != 0) {
      portfolios[i, "zerobonds"] <- s_f_pi[i]
      portfolios[i, "nominal"] <- abs(f_pi[i])
    }
    # step 2 (replication to the right)
    while (i_r <= k) {
      if (i_r == i) {
        if (lambda[i] != 0) {
          portfolios[i, "calls"] <- paste(portfolios[i, "calls"], lambda[i])
          portfolios[i, "call_strike"] <- paste(portfolios[i, "call_strike"], pi[i])
        }
        i_r <- i_r + 1
        next
      }
      if ((lambda[i_r] - lambda[i_r-1]) != 0) {
        portfolios[i, "calls"] <- paste(portfolios[i, "calls"], (lambda[i_r] - lambda[i_r-1]))
        portfolios[i, "call_strike"] <- paste(portfolios[i, "call_strike"], pi[i_r])
      }
      i_r <- i_r + 1
    }
    # step 3 (replication to the left)
    while (i_l != 1) {
      if (i_l == i) {
        if (-lambda[i_l-1] != 0) {
          portfolios[i, "puts"] <- paste(portfolios[i, "puts"], -lambda[i_l-1])
          portfolios[i, "put_strike"] <- paste(portfolios[i, "put_strike"], pi[i_l])
        }
      } else {
        if ((lambda[i_l] - lambda[i_l-1]) != 0) {
          portfolios[i, "puts"] <- paste(portfolios[i, "puts"], (lambda[i_l] - lambda[i_l-1]))
          portfolios[i, "put_strike"] <- paste(portfolios[i, "put_strike"], pi[i_l])
        }
      }
      i_l <- i_l - 1
    }
    # step 4
    i <- i + 1
    i_r <- i
    i_l <- i
  }
  # remove duplicate portfolios
  portfolios <- unique(portfolios)
  # renumber rows after removal
  row.names(portfolios) <- 1:nrow(portfolios)
  portfolios
}
```
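As a side note on the helper function: the `lambda` column is nothing but the piecewise slope between consecutive kinks, with the slope of the linear continuation appended. A standalone sketch for the plain vanilla call from above:

```r
pis  <- c(0, 100, 110)  # kinks of the plain vanilla call (the Inf row gets consolidated)
f_pi <- c(0, 0, 10)
lambda <- c(diff(f_pi) / diff(pis), 1)  # append the continuation slope of 1
lambda
## [1] 0 1 1
```

The payoff is flat (slope 0) up to the strike and then rises with slope 1, which is exactly what the algorithm uses to place calls and puts.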

Let us test our function for the plain vanilla call:

```r
replicate_payoff(payoff)
##   zerobonds nominal calls call_strike  puts put_strike
## 1                       1         100                 
## 2         1      10     1         110  -1 1    110 100
```

There are always several possibilities for replication. In this case, the first is just our call with a strike of 100. Another possibility is buying a zerobond with a nominal of 10, going long a call with strike 110 and simultaneously going short a put with strike 110 and long another put with strike 100.
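We can sanity-check that the two portfolios really produce the same payoff by evaluating both on a grid of terminal prices (a quick sketch; `call` and `put` are my own helper functions for the payoffs at maturity):

```r
call <- function(S, K) pmax(S - K, 0)  # call payoff at maturity
put  <- function(S, K) pmax(K - S, 0)  # put payoff at maturity

S <- seq(80, 130, by = 5)  # grid of terminal prices
p1 <- call(S, 100)                                   # portfolio 1
p2 <- 10 + call(S, 110) - put(S, 110) + put(S, 100)  # portfolio 2
all.equal(p1, p2)  # TRUE
```

Since both payoffs are piecewise linear with kinks only at 100 and 110, agreement on this grid means agreement everywhere.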

Let us try a more complicated payoff, a classic *bear spread* (which is also the example given in the paper):

```r
payoff <- data.frame(pi = c(0, 90, 110, Inf), f_pi = c(20, 20, 0, 0))
payoff
##    pi f_pi
## 1   0   20
## 2  90   20
## 3 110    0
## 4 Inf    0

plot_payoff(payoff)
```

```r
replicate_payoff(payoff)
##   zerobonds nominal calls call_strike puts put_strike
## 1         1      20  -1 1      90 110                
## 2                                     1 -1     110 90
```

Or for a so-called *airbag note*:

```r
payoff <- data.frame(pi = c(0, 80, 100, 200, Inf), f_pi = c(0, 100, 100, 200, Inf))
payoff
##    pi f_pi
## 1   0    0
## 2  80  100
## 3 100  100
## 4 200  200
## 5 Inf  Inf

plot_payoff(payoff, xtrpol = 1)
```

```r
replicate_payoff(payoff)
##   zerobonds nominal        calls call_strike       puts put_strike
## 1                   1.25 -1.25 1    0 80 100                      
## 2         1     100            1         100      -1.25         80
## 3         1     200            1         200 -1 1 -1.25 200 100 80
```

As a final example: how to replicate the underlying itself? Let’s see:

```r
payoff <- data.frame(pi = c(0, 100, Inf), f_pi = c(0, 100, Inf))
payoff
##    pi f_pi
## 1   0    0
## 2 100  100
## 3 Inf  Inf

plot_payoff(payoff, 1)
```

```r
replicate_payoff(payoff)
##   zerobonds nominal calls call_strike puts put_strike
## 1                       1           0                
## 2         1     100     1         100   -1        100
```

The first solution correctly gives us what is called a *zero-strike call*, i.e. a call with a strike of zero!
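That this works is easy to see: a call with strike 0 pays max(S(T) − 0, 0) = S(T), i.e. exactly the underlying. A one-line check (a sketch on an arbitrary price grid):

```r
S <- seq(0, 200, by = 10)
all.equal(pmax(S - 0, 0), S)  # a zero-strike call pays exactly S(T)
```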

I hope you find this helpful! If you have any questions or comments, please leave them below.

I even think it might be worthwhile to turn this into a package and put it on CRAN, but I don’t have the time to do that at the moment… if you are interested in cooperating on that, please leave a note in the comments too. Thank you!

**UPDATE November 3, 2020**

For another (a little bit more involved) example, see my answer on Quant.SE: Replicate a Portfolio with Given Payoff.