Published: July 19, 2021
I recently grew an interest in data science and artificial intelligence, and we are still in a pandemic (although hopefully at the end, fingers crossed for the Delta Variant), so I chose to test out some models to work with data from Covid-19.
Here are the tools that I used:
By using those tools I can come up with a setup like this:
I generally use the light theme for visibility of the graphs, here I used the dark theme because it looks better in screenshots.
dygraph
:Using this library, we can produce graphs that react depending on cursor events.
The data comes from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
Here’s how the data is structured:
Each row is a country/province, while each column after column 4 is a day. This specific screenshot is taken from the confirmed cases csv file.
We can easily pull this data using the covid19.analysis
package described above.
# reads time series data
all_confirmed_cases <- covid19.data("ts-confirmed")
all_confirmed_deaths <- covid19.data("ts-deaths")
all_confirmed_recoveries <- covid19.data("ts-recovered")
We can easily build a cell to manipulate this data:
indexList <- c()
countryList <- c()
# Get all rows
for (i in rownames(all_confirmed_cases)) {
# print(c(i, all_confirmed_cases[i, 2]))
indexList <- c(indexList, i)
countryList <- c(countryList, all_confirmed_cases[i, 2])
}
country_index_list <- as.data.frame(cbind(indexList, countryList))
country_index_list[154, ]
# We can see that Italy is index 154, so we are going to use that
italy_index <- 154
After this, we’ll just extract the relevant rows.
it_confirmed_cases <- all_confirmed_cases[italy_index, ]
it_confirmed_deaths <- all_confirmed_deaths[italy_index, ]
it_confirmed_recoveries <- all_confirmed_recoveries[italy_index, ]
print("Cases:")
View(it_confirmed_cases)
print("Deaths:")
View(it_confirmed_deaths)
print("Recoveries:")
View(it_confirmed_recoveries)
firstCaseDate <- "2020-01-31"
We will also need the day of the first case, and we will use that as Day 0 for our models.
# Find index of first case
firstInfection <- 0
for (i in 1:ncol(it_confirmed_cases)) {
if (class(it_confirmed_cases[, i]) == 'integer' && it_confirmed_cases[, i] >= 1) {
print(paste0("Index of the first infection is: ", i, ", Number of infections is: ", it_confirmed_cases[, i]))
firstInfection <- it_confirmed_cases[, i]
break
}
}
Let’s proceed to analyse this data.
First let’s define some model-agnostic variables, such as the time frame.
# Days that I'm analyzing
analysis_days <- 365
# Date list
dates <- seq(as.Date(firstCaseDate), by = "days", length.out = analysis_days)
The SIR is one of the simplest compartmental models used in epidemiology. You can read more about it here, but simply put, the SIR model divides the population in 3 categories:
It can be described by these 3 differential equations, which will show the relations between the Susceptible \(S(t)\), Infected \(I(t)\) and Recovered \(R(t)\) as functions of time.
There are also two parameters, \(\beta\) and \(\gamma\), where beta is the effective transmission rate, and gamma is the effective recovery rate. We can use these two parameters to obtain \(R_0\), which some of you may have heard if you followed the news.
\[R_0 = \frac{\beta}{\gamma}\]Now that we know the math behind the model, we can code it in R.
susceptible <- 60e+06 # Source: https://www.statista.com/statistics/786485/population-by-gender-in-italy/#:~:text=Population%20in%20Italy%20in%202020%2C%20by%20gender&text=As%20of%20January%202020%2C%2060.2,roughly%2016%20million%20people%20lived.
infected <- firstInfection
recovered <- 0
initial_state_values = c(S = susceptible, I = infected, R = recovered)
# Parameters
# The beta for covid is estimated to be ranging from 1.5 to 6.68. With median of 2.79. Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7751056/#:~:text=R0%20of%20COVID%2D19,-R0%20of&text=compared%2012%20studies%20published%20from,an%20interquartile%20range%20of%201.16.
# We can simulate this scenario by having our beta as 0.27 and our gamma as 0.1
parameters = c(gamma = 0.1, beta = 0.27)
# Time points
time = seq(from = 1,to = analysis_days, by = 1)
# SIR model function
sir_model <- function(time,state,parameters){
with(as.list(c(state,parameters)),{
N = S + I + R
lambda = beta*(I/N)
dS =- lambda*S
dI = lambda*S - gamma*I
dR = gamma*I
return(list(c(dS,dI,dR)))
}
)
}
#Solving the differential equations
output <- as.data.frame(ode(y = initial_state_values, func = sir_model, parms = parameters, times = time))
out_long = melt(output , id = "time")
colnames(out_long) <- c("Time", "Variable", "Value")
And let’s graph the model’s output.
Susceptible <- out_long[1:analysis_days, 3]
Infected <- out_long[(analysis_days + 1):(analysis_days*2), 3]
Recovered <- out_long[(analysis_days*2 + 1):(analysis_days*3), 3]
plot_data <- as.data.frame(cbind(Susceptible/1000000, Infected/1000000, Recovered/1000000))
colnames(plot_data) <- c("Susceptbile", "Infected", "Recovered")
rownames(plot_data) <- dates
plot_data <- as.xts(plot_data)
# head(plot_data)
dygraph(plot_data) %>%
dyAxis("y", label = "People (Millions)") %>%
dyAxis("x", label = "Date")
The SEIR model is an expansion on the SIR model. The E stands for Exposed. An exposed person is a person that was exposed to the virus, but isn’t Infectious yet.
From their website: jupyter.org ↩