# The Small Data Rule: Infer the Big Picture from only Five Values!

Everybody is talking about big data, but the real skill lies in the art of inferring useful information from only a handful of values!

If you want to learn how to pin down the range of a dataset's typical value (i.e. its median) with just five values, and why this works, read on!

This blog post is inspired by a chapter from the wonderful book “Alles kein Zufall! Liebe, Geld, Fußball” (“No coincidence! Love, Money, Football”, only available in German at the moment) by my colleague Professor Christian Hesse from the University of Stuttgart, Germany.

Let us dive directly into the matter, the Small Data Rule states:

In a sample of five numerical values from any unknown population, the median of this population lies between the smallest and the largest sample value with 94 percent certainty.

The “population” can be anything, like data about age in a population, income in a country, television consumption, donation amounts, body sizes, temperatures and so on.

The median is the “middle value” and thereby a good representation of a population’s “typical value”. It is calculated by sorting all of the values and taking the one in the middle (for an even number of values, the mean of the two middle values). Unlike the mean (often simply called the “average”), the median is robust with regard to outliers:

```
x <- 0:10
median(x)
## [1] 5

mean(x)
## [1] 5

x <- c(0:9, 10000)
median(x)
## [1] 5

mean(x)
## [1] 913.1818
```

Obviously, the median is quite useful for getting a quick overview of a large dataset. So, it seems almost magical that you could pin down a range for it from just five randomly drawn numbers. Yet, the rationale is quite straightforward:

The probability of drawing a random value from a population that is above the median is 50 percent, or 1/2. The probability that all five values are above the median is therefore 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = (1/2)^5. Of course, this is the same as the probability that all five values are below the median. To cover both cases, just add those probabilities: 2 × (1/2)^5 = 1/16.

But we are interested in the complementary event, i.e. that at least one value lies on each side of the median so that we get an interval that encloses it. We get that by subtracting the above probability from one:

```
1-2*(0.5^5)
## [1] 0.9375
```

The result is a high degree of certainty of nearly 94% that this will indeed be the case!
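The same argument works for any sample size n: the chance that the smallest and largest of n random values bracket the median is 1 − 2 × (1/2)^n. A quick sketch (the helper name `coverage` is my own, just for illustration):

```
# probability that the min and max of n random draws bracket the median
coverage <- function(n) 1 - 2 * (0.5^n)

coverage(5)
## [1] 0.9375

coverage(8)
## [1] 0.9921875
```

So with eight values instead of five you would already be over 99 percent certain.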

If you don’t believe this, let us conduct a little experiment for illustrative purposes. We enumerate all possibilities of drawing five values from the range of zero to one hundred and see how often the median (= 50) falls within the interval between the minimum and the maximum of the sample (to understand how to do this, this post might be helpful: Learning R: Permutations and Combinations with Base R).

Beware, the following code will run for quite a while (about three to four minutes on an average computer) because there are nearly 80 million possibilities that have to be created and then evaluated:

```
# needs at least version 4.1.0 of R (for the \(x) lambda shorthand)
M <- combn(0:100, 5)
between <- apply(M, 2, \(x) min(x) < 50 && max(x) > 50)
sum(between) / ncol(M)
## [1] 0.9406869
```

As you can see: 94% indeed! (The resulting value is not exactly the same as above because here we draw without replacement from a finite population; it approaches 0.9375 as the underlying population gets bigger.)
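For large or continuous populations, a Monte Carlo simulation is a much faster check than full enumeration. The following sketch (my own, not part of the original derivation) draws five values from a standard normal distribution, whose median is 0, and counts how often the sample range covers it:

```
set.seed(123)  # for reproducibility
covers <- replicate(100000, {
  x <- rnorm(5)             # five draws from a population with median 0
  min(x) < 0 && max(x) > 0  # does the sample range bracket the median?
})
mean(covers)
## close to 0.9375
```

This runs in a second or two and lands very close to the theoretical 93.75 percent.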

Professor Hesse gives a nice example of how to use the small data rule in practice:

The manager of a company is interested in the distance his employees have to commute to work. He plans to open another branch if the distances are too long for many. He could, of course, ask his entire staff about the distance to their place of residence. That would be costly, generate a lot of data and provide more information than the manager actually needs. Instead, he surveys only five randomly selected employees. They live 7, 19, 13, 18, and 9 km away from the company. Thus, the manager can be 94 percent sure that his employees have to commute a median distance of 7 to 19 kilometres to the company. He considers this acceptable and decides against an additional location.

As an aside, not many people know the `range` function which might come in handy in contexts like these:

```
range(c(7, 19, 13, 18, 9))
## [1]  7 19
```

So you see, small data can help you determine the big picture!

For another handy tool to infer whether something unusual is going on see this post: 3.84 or: How to Detect BS (Fast).

## 10 thoughts on “The Small Data Rule: Infer the Big Picture from only Five Values!”

1. Rob Walker says:

Your example at the bottom, of the manager and average commute time, is incorrect unless you are using the terms average and median interchangeably. That the median commute lies between 7 and 19 km with probability 1 – 2*(0.5^5) is true.

1. Thank you, Rob!

Although it is indeed sometimes used interchangeably I agree that it is a little bit unfortunate in this context.

The German original says “im Mittel”, which literally translates to “in the middle” but is generally translated as “on average”. Because there is no good equivalent in English, I changed it to “median distance”.

2. I love this–so simple but powerful. I also like the focus on “small data.” As a biostatistician that analyzes ecological data, I’m often faced with trying to draw conclusions from very few data points. The nature of the beast.

I have to think about this some more, but the immediate question that comes to mind in terms of the work that I do is how important it is for those five data points to be truly random with respect to the full distribution of values. I think that might be the crux. In a simulation it’s easy to select numbers at random. But if I was given five data points representing the measured height of a certain tree species, my confidence in those values encompassing the median would be completely dependent on the data collection methodology.

1. Thank you, Luka! Yes, “simple but powerful” is a recurring theme on my blog, could even be my motto!

This is also one of the reasons I developed the OneR-package. If you don’t know it yet, please have a look: One Rule (OneR) Machine Learning Classification in under One Minute.

Concerning bias: in what way might your data be biased? Perhaps certain biases and their consequences for the range of the median could be simulated too…
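For example, here is a minimal sketch (my own assumption of one possible bias, purely illustrative): if the sampling systematically misses the lower part of the distribution, the 94 percent guarantee no longer holds:

```
set.seed(123)
pop <- rnorm(10000)          # population with median near 0
true_median <- median(pop)

# biased sampling: values below the 30th percentile are never selected
biased_pool <- pop[pop > quantile(pop, 0.3)]

covers <- replicate(10000, {
  x <- sample(biased_pool, 5)
  min(x) < true_median && max(x) > true_median
})
mean(covers)
## well below 0.9375 (around 0.81 in this setup)
```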

1. I’ve been reading a bunch of your posts and I really like your blog a lot (for that “simple but powerful” component).

I didn’t know about the OneR package, but excited to check it out. Thanks!

I’ll think about the bias question a bit more and let you know.

3. Joe says:

…if we draw 8 random numbers, does the probability increase to 99%? 1-2*(0.5^8)

4. Professor Dr. Karl-Werner Hansmann says:

My compiler rejects the second line of your little program
`between <- apply(M, 2, \(x) min(x) < 50 && max(x) > 50)`,
in particular, it says `\(x)` is not a function.
Is there a misprint?
Best regards
Karl-Werner Hansmann

1. Dear colleague,