Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)?

A new invisible enemy, only 30kb in size, has emerged and is on a killing spree around the world: 2019-nCoV, the Novel Coronavirus!

It has already killed more people than the SARS pandemic and its outbreak has been declared a Public Health Emergency of International Concern (PHEIC) by the World Health Organization (WHO).

If you want to learn how epidemiologists estimate how contagious a new virus is and how to do it in R read on!

There are many epidemiological models around, we will use one of the simplest here, the so-called SIR model. We will use this model with the latest data from the current outbreak of 2019-nCoV (from here: Wikipedia: Case statistics). You can use the following R code as a starting point for your own experiments and estimations.

Before we start to calculate a forecast let us begin with what is confirmed so far:

Infected <- c(45, 62, 121, 198, 291, 440, 571, 830, 1287, 1975, 2744, 4515, 5974, 7711, 9692, 11791, 14380, 17205, 20440)
Day <- 1:(length(Infected))
N <- 1400000000 # population of mainland china

old <- par(mfrow = c(1, 2))
plot(Day, Infected, type ="b")
plot(Day, Infected, log = "y")
abline(lm(log10(Infected) ~ Day))
title("Confirmed Cases 2019-nCoV China", outer = TRUE, line = -2)

On the left, we see the confirmed cases in mainland China and on the right the same but with a log scale on the y-axis (a so-called semi-log plot or more precisely log-linear plot here), which indicates that the epidemic is in an exponential phase, although at a slightly smaller rate than at the beginning. By the way: many people were not alarmed at all at the beginning. Why? Because an exponential function looks linear in the beginning. It was the same with HIV/AIDS when it first started.

Now we come to the prediction part with the SIR model, which basic idea is quite simple. There are three groups of people: those that are healthy but susceptible to the disease (S), the infected (I) and the people who have recovered (R):

Source: wikimedia

To model the dynamics of the outbreak we need three differential equations, one for the change in each group, where \beta is the parameter that controls the transition between S and I and \gamma which controls the transition between I and R:

    \[\frac{dS}{dt} = - \frac{\beta I S}{N}\]

    \[\frac{dI}{dt} = \frac{\beta I S}{N}- \gamma I\]

    \[\frac{dR}{dt} = \gamma I\]

This can easily be put into R code:

SIR <- function(time, state, parameters) {
  par <- as.list(c(state, parameters))
  with(par, {
    dS <- -beta/N * I * S
    dI <- beta/N * I * S - gamma * I
    dR <- gamma * I
    list(c(dS, dI, dR))

To fit the model to the data we need two things: a solver for differential equations and an optimizer. To solve differential equations the function ode from the deSolve package (on CRAN) is an excellent choice, to optimize we will use the optim function from base R. Concretely, we will minimize the sum of the squared differences between the number of infected I at time t and the corresponding number of predicted cases by our model \^{I}(t):

    \[RSS(\beta, \gamma) = \sum_{t} \left( I(t)-\^{I}(t) \right)^2\]

Putting it all together:

init <- c(S = N-Infected[1], I = Infected[1], R = 0)
RSS <- function(parameters) {
  names(parameters) <- c("beta", "gamma")
  out <- ode(y = init, times = Day, func = SIR, parms = parameters)
  fit <- out[ , 3]
  sum((Infected - fit)^2)

Opt <- optim(c(0.5, 0.5), RSS, method = "L-BFGS-B", lower = c(0, 0), upper = c(1, 1)) # optimize with some sensible conditions

Opt_par <- setNames(Opt$par, c("beta", "gamma"))
##      beta     gamma 
## 0.6746089 0.3253912

t <- 1:70 # time in days
fit <- data.frame(ode(y = init, times = t, func = SIR, parms = Opt_par))
col <- 1:3 # colour

matplot(fit$time, fit[ , 2:4], type = "l", xlab = "Day", ylab = "Number of subjects", lwd = 2, lty = 1, col = col)
matplot(fit$time, fit[ , 2:4], type = "l", xlab = "Day", ylab = "Number of subjects", lwd = 2, lty = 1, col = col, log = "y")
## Warning in xy.coords(x, y, xlabel, ylabel, log = log): 1 y value <= 0
## omitted from logarithmic plot

points(Day, Infected)
legend("bottomright", c("Susceptibles", "Infecteds", "Recovereds"), lty = 1, lwd = 2, col = col, inset = 0.05)
title("SIR model 2019-nCoV China", outer = TRUE, line = -2)

We see in the right log-linear plot that the model seems to fit the values quite well. We can now extract some interesting statistics. One important number is the so-called basic reproduction number (also basic reproduction ratio) R_0 (pronounced “R naught”) which basically shows how many healthy people get infected by a sick person on average:

    \[R_0 = \frac{\beta}{\gamma}\]


R0 <- setNames(Opt_par["beta"] / Opt_par["gamma"], "R0")
##       R0 
## 2.073224

fit[fit$I == max(fit$I), "I", drop = FALSE] # height of pandemic
##            I
## 50 232001865

max(fit$I) * 0.02 # max deaths with supposed 2% mortality rate
## [1] 4640037

So, R_0 is slightly above 2, which is the number many researchers and the WHO give and which is around the same range of SARS, Influenza or Ebola (while transmission of Ebola is via bodily fluids and not airborne droplets). Additionally, according to this model, the height of a possible pandemic would be reached by the beginning of March (50 days after it started) with over 200 million Chinese infected and over 4 million dead!

Do not panic! All of this is preliminary and hopefully (probably!) false. When you play along with the above model you will see that the fitted parameters are far from stable. On the one hand, the purpose of this post was just to give an illustration of how such analyses are done in general with a very simple (probably too simple!) model, on the other hand, we are in good company here; the renowned scientific journal nature writes:

Researchers are struggling to accurately model the outbreak and predict how it might unfold.

On the other hand, I wouldn’t go that far that the numbers are impossibly high. H1N1, also known as swine flu, infected up to 1.5 billion people during 2009/2010 and nearly 600,000 died. And this wasn’t the first pandemic of this proportion in history (think Spanish flu). Yet, this is one of the few times where I hope that my model is wrong and we will all stay healthy!

28 thoughts on “Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)?”

  1. Problem is the data. No idea how many mild cases never seek treatment. No idea how many cases die waiting for a bed. I did see one Chinese paper a week ago where they let slip that fatalities were in the thousands. Didn’t bookmark it.

  2. > Because an exponential function looks linear in the beginning.
    Nonsense, an exponential function looks exponential at all scales. Meaning if you look at the function in real time, or if you cut any part of it, it looks exponential, not linear.

    1. Harsh words, my friend 😉
      Mathematically you are of course right but not in practice (and that is what we are dealing with here!)

      See e.g. the following quote from Dr. John D. Cook, a well known applied mathematician:
      “How do you know whether you’re on an exponential curve? This is not as easy as it sounds. Because of random noise, it may be hard to tell from a small amount of data whether growth is linear or exponential”.
      Source: https://www.johndcook.com/blog/2008/01/17/coping-with-exponential-growth/

      Another thing is that we, as humans, are not really used to exponential thinking and therefore underestimate such developments, exponential growth indeed looks linear to us in the beginning (although it is not mathematically).

      1. Well let me apologize in advance because I agree with Dr Cook but I’m going to be annoyingly nitpicky: if the number of new cases / previous day increases every day, you are most likely on an exponential trend, if the delta of new cases is randomly greater or smaller than the previous day, then you may be on a linear growth.
        So the reason is not “because the exponential function looks linear at the beginning”, but because at the start of the epidemy, there may be 0 new contamination one day, or more likely because we haven’t identified all the contaminations yet. Aka it’s a problem of lack of data which makes it very hard to identify the trend.

        1. Yes, lack of data, data quality problems, and psychology because we, as humans, “think” linearly and not exponentially (also because the absolute change is so small at the beginning!)… which is why I think “looks linear at the beginning” might not be the best of characterizations (and is mathematically wrong!) but delivers the point concisely. Yet, in essence, we agree!

  3. Very nice analysis of an worrisome situation. I learned a great deal from your presentation and explanation. I’m accustomed to solving systems of differential equations the old fashioned way, with FORTRAN ;-), so seeing how you set this up and solved it in R, which I’m still learning, was very useful.

    Now let’s all hope that your numbers are indeed incorrect, and are over-estimates.

  4. Hello,
    Thanks for posting.
    With the EpiDynamics package we have:

    > parameters N initials sir max(sir$results$I)
    [1] 0.165818
    > 0.165818*N
    [1] 232145200
    > 0.002*232145200
    [1] 464290.4

    All the best,

  5. Sorry,

    > parameters N initials sir max(sir$results$I)
    [1] 0.165818
    > 0.165818*N
    [1] 232145200
    > 0.002*232145200
    [1] 464290. #about the same result
      1. Thanks for great article and quick reply.

        Apologies as im not really a coder but have just started studying stats at uni.

        I assumed you were inputting beta and gamma manually…

        Forgive me for being dumb but im struggling to understand how you can calculate gamma from this?

        Would you mind explaining for someone who is not familiar?

        1. Thank you for your great feedback, I really appreciate it.

          No, you give the optimizer some sensible starting values for beta and gamma and it then “intelligently” (some maths magic….) tries to find better and better values for both parameters which fit the given data best.

          Is it clearer now?

          1. Yes thats really helpful thank you.

            My only question now would be what the beta and gamma are actually quantities of?

            Also i really want to emphasise that you taking time to answer these questions is very appreciated!

  6. Cross posted from Stackexchange
    This is only marginally related to the detailed coding discussion, but seems highly relevant to the original question concerning modeling of the current 2019-nCoV epidemic. Please see arxiv:2002.00418v1 (paper at https://arxiv.org/pdf/2002.00418v1.pdf ) for a delayed diff equation system ~5 component model, with parameter estimation and predictions using dde23 in MatLab. These are compared to daily published reports of confirmed cases, number cured, etc. To me, it is quite worthy of discussion, refinement, and updating. It concludes that there is a bifurcation in the solution space dependent upon the efficacy of isolation, thus explaining the strong public health measures recently taken, which have a fair chance of success so far.

  7. Thanks a lot for the amazing model. I recreated it but there is still something that I not entirely understand. The model has three states S,I,R. If I sum these states up they come to a total of N (total population) at each time point. As there are also people who (unfortunately) die. Should there not also be a D state in the model? like dD/dt = (1-gamma)*I. So that always the following is true –> N = S(t) + I(t) +R(t) + D(t)

  8. Hi
    How can i create dynamic line chart and dynamic scatterplot(like gapminder) for Corona virus data???
    Please help me.
    I need to this plots.
    Very very necessary
    My gmail for answer me:m.rostami4543@gmail.com

  9. Hi, thanks a lot for this post. I have been looking for a while how to fit ode to real data.

    I was wondering
    – 1° Why we need 1/N in the equations. As N is constant I would assume we don’t need it and it could be implicit in the parameters.
    – 2° Why we need 1/N in equations with beta but not with gamma.
    – 3° So are beta and gamma really comparable (same units), given the fact that 1/N is only in the equations with beta?

    I tried
    – to remove 1/N in dS/dt and dI/dt and I obtained estimated values of beta and gamma of 0.
    – to include 1/N in the equation of dR/dt : dR/dt=(gamma/N)*I, and it didn’t change the estimated parameters. Weird.

    Sorry in advance if these are dumb questions.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.