Get the data

For data we will use the coronavirus dataset, a conveniently tidied table based on data from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). The original data is available at the CSSEGISandData/COVID-19 repository if you prefer to do the cleanup yourself.

The coronavirus package is published to CRAN, but it is updated on GitHub on a daily basis. To get the latest version, we run:


remotes::install_github("RamiKrispin/coronavirus", dependencies = TRUE)

Another available dataset comes with the nCov2019 package (Wu et al. 2020), available from the authors' GitHub site, which also provides a useful vignette. This package does more than get data: it also provides utility functions for mapping and plotting the cases. We can obtain the latest version by running:


remotes::install_github("GuangchuangYu/nCov2019", dependencies = TRUE)

The coronavirus dataset

We can have a look at the data. Each row is a record with the number of cases of a given type (confirmed, dead or recovered) for a given region. Geographical coordinates are included for map plots.


library(coronavirus)
# View the data
paged_table(coronavirus)

Each row of this table gives the number of cases reported for each day and region between January 22, 2020 and March 31, 2020.

The nCov2019 dataset


library(nCov2019)

# as_tibble() replaces the deprecated tbl_df()
all.ncov <- load_nCov2019(lang = "en") %>% .['global'] %>% as_tibble()

# Filter ncov for the same countries we're interested in
all.ncov %>%
  filter(country %in% countries_list) %>% 
  paged_table()

This dataset is also collected from a GitHub repo, and contains data from December 1, 2019 to March 31, 2020. Because it is usually updated faster, all.ncov tends to have more recent data than the coronavirus dataset, so all.ncov is the dataset we will use.

Exploratory Analysis


# Summarise per country and add a cumulative count
all.corona <- coronavirus %>% 
  select(-Lat, -Long) %>% 
  rename(country = Country.Region) %>% 
  group_by(country, date, type) %>% 
  summarise(cases = sum(cases)) %>% 
  arrange(country, date, type) %>% 
  group_by(country, type) %>% 
  # Add a cumulative sum of cases
  mutate(cumcases = cumsum(cases)) %>% 
  arrange(desc(cumcases))

# Visualise the case data for all countries
paged_table(all.corona)
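To make the cumulative step concrete, here is a minimal sketch on made-up numbers (the `toy` table and its values are purely illustrative and not part of either dataset). It shows why the rows are sorted before `cumsum()` and why the grouping matters: the running total restarts for each country.

```r
library(dplyr)

# Toy daily counts for two made-up countries (illustrative values only)
toy <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  date    = as.Date("2020-03-01") + c(0, 1, 2, 0, 1, 2),
  cases   = c(1, 4, 2, 0, 3, 5)
)

toy_cum <- toy %>%
  arrange(country, date) %>%            # cumsum() depends on row order
  group_by(country) %>%                 # restart the running total per country
  mutate(cumcases = cumsum(cases)) %>%
  ungroup()

toy_cum
```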

We immediately see that in China the total number of confirmed cases is still high, but the number of new cases is extremely low. If we look instead at the data from nCov2019, we get very similar numbers (though not exactly identical, as is to be expected).


all.ncov %>% 
  arrange(desc(cum_confirm)) %>% 
  paged_table()

A limitation of the nCov2019 dataset is that it gives only cumulative counts, not the number of new cases per day, but we can easily work around that by taking the difference between each day and the previous one:


all.ncov <- all.ncov %>% 
  arrange(country, time) %>% 
  group_by(country) %>% 
  mutate(
    cases_confirmed = cum_confirm - lag(cum_confirm), 
    cases_recovered = cum_heal - lag(cum_heal), 
    cases_death = cum_dead - lag(cum_dead)
  ) %>% 
  rename(cum_confirmed = cum_confirm, 
         cum_recovered = cum_heal, 
         cum_death = cum_dead,
         date = time) %>% 
  # Remove the rows where lag() returned NA (the first date of each country);
  # filtering on the confirmed cases alone should be enough
  filter(!is.na(cases_confirmed))
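The differencing step can be checked on a small made-up table (the `toy_cum` data below is illustrative only; the column names simply mirror those used above). Note that `lag()` has no previous value for the first date of each country, so that row ends up as `NA` and is dropped by the filter.

```r
library(dplyr)

# Toy cumulative counts for two made-up countries (illustrative values only)
toy_cum <- tibble(
  country     = c("A", "A", "A", "B", "B", "B"),
  time        = as.Date("2020-03-01") + c(0, 1, 2, 0, 1, 2),
  cum_confirm = c(10, 15, 15, 2, 2, 9)
)

toy_daily <- toy_cum %>%
  arrange(country, time) %>%
  group_by(country) %>%                  # lag() must not cross country boundaries
  mutate(cases_confirmed = cum_confirm - lag(cum_confirm)) %>%
  ungroup() %>%
  filter(!is.na(cases_confirmed))        # drops the first date of each country

toy_daily
```

If losing the first day is not acceptable, one option is `lag(cum_confirm, default = 0)`, which treats the first cumulative value as that day's new cases; whether that is appropriate depends on when the series actually starts.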

The coronavirus dataset from Johns Hopkins and the data from the nCov2019 package come from different sources, which can be updated at different times and with different degrees of accuracy. It is therefore a good idea to compare both datasets, to see if there are any major discrepancies between them. Comparing the coronavirus and the nCov2019 datasets requires a little bit of data manipulation, as they are in different formats.


long.ncov <- inner_join(
  all.ncov %>% 
    select(-starts_with("cases_")) %>% 
    pivot_longer(cum_confirmed:cum_death, 
                 names_to = "type",
                 names_prefix = "cum_",
                 values_to = "cumcases"), 
  all.ncov %>% 
    select(-starts_with("cum_")) %>% 
    pivot_longer(cases_confirmed:cases_death, 
                 names_to = "type", 
                 names_prefix = "cases_",
                 values_to = "cases"), 
  by = c("date", "country", "type")
)
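As a small illustration of the reshaping used here, the sketch below applies `pivot_longer()` with `names_prefix` to a made-up wide table (the `wide` data is illustrative only). The prefix is stripped from the column names, so `cum_confirmed` becomes the value `"confirmed"` in the new `type` column.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one column per case type (illustrative values only)
wide <- tibble(
  country       = c("A", "B"),
  date          = as.Date(c("2020-03-30", "2020-03-30")),
  cum_confirmed = c(100, 40),
  cum_recovered = c(20, 5),
  cum_death     = c(3, 1)
)

long <- wide %>%
  pivot_longer(cum_confirmed:cum_death,
               names_to = "type",        # former column names go here
               names_prefix = "cum_",    # ...with this prefix removed
               values_to = "cumcases")   # former cell values go here

long
```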


joined <- full_join(all.corona, 
                    long.ncov, 
                    by = c("date", "country", "type"), 
                    suffix = c("_coronavirus", "_nCov2019")) %>% 
  pivot_longer(cases_coronavirus:cases_nCov2019,
               names_to = "dataset") %>% 
  separate(dataset, c("class", "dataset"), "_")


# Plot the comparison for the countries of interest
joined %>% 
  filter(country %in% countries_list, 
         type == "confirmed", 
         class == "cumcases",
         date >= "2020-03-01") %>% 
  ggplot(aes(date, value, color = dataset))+
  geom_line(alpha = .7)+
  theme(axis.text.x = element_text(angle = 90))+
  facet_wrap(country~., scales = "free")+
  scale_color_viridis_d(end = .7)+
  scale_y_continuous("Number of cases", labels = scales::comma)+
  labs(title = "Coronavirus (Johns Hopkins) vs. nCov2019 (China) COVID-19 datasets", 
       subtitle = "Cumulative cases comparison. Y axes not to scale.", 
       caption = "Data limited to dates from March 1 onwards to enhance readability")
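A numeric check can complement the visual comparison. The sketch below is only an illustration on made-up data: `toy_joined` mimics the long layout with `dataset` and `value` columns, and the `abs_diff` and `rel_diff` columns are hypothetical names introduced here. The idea is to put the two datasets side by side and compute the difference per country and date.

```r
library(dplyr)
library(tidyr)

# Toy data in the same long layout: one row per date and dataset
# (illustrative values only)
toy_joined <- tibble(
  country = "A",
  date    = rep(as.Date("2020-03-29") + 0:1, each = 2),
  type    = "confirmed",
  class   = "cumcases",
  dataset = rep(c("coronavirus", "nCov2019"), times = 2),
  value   = c(100, 98, 120, 120)
)

discrepancy <- toy_joined %>%
  # One column per dataset, so the two values sit on the same row
  pivot_wider(names_from = dataset, values_from = value) %>%
  mutate(abs_diff = abs(coronavirus - nCov2019),
         rel_diff = abs_diff / pmax(coronavirus, nCov2019))

discrepancy
```

Sorting the result by `rel_diff` would surface the dates where the two sources disagree most.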

The two datasets are in remarkable agreement as to the number of cumulative cases, and the minor discrepancies won't much affect our modelling. As pointed out above, we will use the nCov2019 dataset, as it is usually updated faster. We will first have a look at the nCov2019 data, restricting our analysis to those European countries for which we have a reasonably large number of cases.


1. Denmark
2. Finland
3. France
4. Germany
5. Italy
6. Norway
7. Poland
8. Portugal
9. Spain
10. Sweden
11. Switzerland
12. United Kingdom
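The code in this post filters with `country %in% countries_list`. A plausible definition of that vector, matching the list above, could look like the sketch below. This is an assumption: the exact spelling of the country names has to match the one used in the datasets, which may differ between packages.

```r
# Assumed definition of countries_list, based on the 12 countries listed above.
# The spellings must match the `country` column of the dataset being filtered.
countries_list <- c(
  "Denmark", "Finland", "France", "Germany", "Italy", "Norway",
  "Poland", "Portugal", "Spain", "Sweden", "Switzerland", "United Kingdom"
)

length(countries_list)
```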

We will focus this analysis on these 12 European countries only. We can get a table of the new confirmed cases today and the total confirmed cases to date:


countries <- long.ncov %>% 
  filter(country %in% countries_list) 

countries %>% 
  ungroup() %>% 
  filter(date == max(date), 
         type %in% c("confirmed", "death")) %>% 
  select(country, type, cases, cumcases) %>% 
  arrange(type, desc(cumcases)) %>% 
  paged_table()
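The ungrouping before `filter(date == max(date))` matters. On grouped data, `max(date)` is evaluated per group, so each country would keep its own latest date; after ungrouping, only rows from the overall latest date remain. A minimal sketch on made-up data (the `toy` table is illustrative only):

```r
library(dplyr)

# Toy data where country B lags one day behind country A (illustrative only)
toy <- tibble(
  country  = c("A", "A", "B", "B"),
  date     = as.Date(c("2020-03-30", "2020-03-31", "2020-03-29", "2020-03-30")),
  cumcases = c(10, 12, 5, 7)
) %>%
  group_by(country)

# Grouped: max(date) is computed within each country
latest_per_country <- toy %>% filter(date == max(date))

# Ungrouped: max(date) is the single latest date in the whole table
latest_overall <- toy %>% ungroup() %>% filter(date == max(date))

nrow(latest_per_country)
nrow(latest_overall)
```

Which behaviour is preferable depends on the question: the ungrouped version compares countries on the same day, but silently drops any country that has not yet reported for that day.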

We clearly see that Italy is still the most gravely affected EU country, with 102106 accumulated cases as of March 31. By this date, 14620 patients have already died. The number of new confirmed cases in Italy yesterday (the last report) was 4417.

Spain is catching up fast. The country has accumulated 94417 cases so far, with 9222 new confirmed cases as of March 31. So far there have been 19259 deaths due to the virus.

My country of residence, Poland, is comparatively much better off. There are 2215 confirmed cases so far, with 310 new cases as of March 31. The number of deaths is still low, but 13 people have already died due to the virus.

We can plot the number of cumulative confirmed cases up to each day:


(cumconfplot <- countries %>% 
   filter(date >= "2020-03-01") %>% 
   mutate(end_label = ifelse(date == max(date), country, NA)) %>% 
   ggplot(aes(date, cumcases, color = country, linetype = type)) + 
   facet_wrap(type~., scales = "free")+
   geom_line()+
   geom_point()+
   geom_text_repel(aes(label = end_label), 
                   nudge_x = .1, 
                   nudge_y = .1, 
                   size = 3
   )+
   theme(legend.position = "bottom", 
         strip.text.y = element_text(angle = 0))+
   scale_color_viridis_d()+
   scale_y_continuous("Number of cases", labels = scales::comma)+
   labs(
     x = "Date of report",
     y = "",
     title = "Cases of COVID-19 per country", 
     subtitle = "Note: Y axes not to scale", 
     caption = glue("Data last updated on {format(Sys.Date(), '%B %d %Y')}"))
)

This is definitely not looking good. Another way of looking at this is to see how many new cases are being confirmed each day, that is, the daily incremental incidence. This is referred to as the epidemic curve, and it is usually plotted as a bar chart:


joined %>% 
  filter(country %in% countries_list, 
         type == "confirmed", 
         class == "cases") %>% 
  ggplot(aes(date, value, fill = dataset, color = dataset)) + 
  # geom_point()+
  # geom_line()+
  geom_bar(stat = "identity", position = "identity", alpha = 1)+
  theme(legend.position = "top", 
        strip.text.y = element_text(angle = 0), 
        axis.text.x.bottom = element_text(angle = 90)
        )+
  scale_fill_viridis_d(begin = .2, end = .8)+
  scale_color_viridis_d(begin = .2, end = .8)+
  scale_x_date(limits = c(as.Date("2020-02-20"), NA))+
  scale_y_continuous(labels = scales::comma)+
  facet_wrap(country~dataset, scales = "free", ncol = 4)+
  labs(
    x = "Date of report",
    y = "Number of cases",
    title = "Epidemic curve for each country",
    subtitle = "Number of new cases per country. Y axes not to scale.", 
    caption = "Dataset comparison")

This looks bad. The largest numbers of new confirmed cases fall on the latest reporting dates, which means that the epidemic is far from being controlled. (We also see what seems to be a common pattern of missing data around March 12, as it seems unlikely that there were suddenly few or no cases on that date. It is also possible that the data from March 13 reflect a dump of the 12th and 13th combined.)
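One rough way to look for such reporting gaps is to flag days with zero new cases that sit between two days with non-zero counts. The sketch below does this on made-up numbers (the `toy_curve` data and the `possible_gap` column are illustrative assumptions, not part of either dataset); it is a simple heuristic, not a validated method for detecting missing reports.

```r
library(dplyr)

# Toy epidemic curve with a suspicious zero on March 12 (illustrative values only)
toy_curve <- tibble(
  date  = as.Date("2020-03-10") + 0:4,
  cases = c(50, 60, 0, 150, 90)
)

flagged <- toy_curve %>%
  arrange(date) %>%
  mutate(
    # A zero day flanked by non-zero days may be a missing report
    possible_gap = cases == 0 &
      lag(cases, default = 0) > 0 &
      lead(cases, default = 0) > 0
  )

filter(flagged, possible_gap)
```

For real data with several countries, the same `mutate()` would need a `group_by(country)` first so that `lag()` and `lead()` do not cross country boundaries.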

In comparison, let's have a look at the situation in South Korea. South Korea was hit hard by the 2015 MERS outbreak and learned how to deal with such epidemics. For most of the time, they had the situation under control, until a super spreader (known as Patient 31) in the city of Daegu infected a large number of people in a short time, causing the number of cases to explode.

Since then, however, the situation seems to be back under relative control, as the number of new cases is dropping quickly.


(s.korea <- long.ncov %>% 
   filter(country == "South Korea",
          type == "confirmed",
          date >= "2020-01-01") %>%  
   ggplot(aes(date, cases, fill = "coronavirus"))+
   geom_bar(stat = "identity")+
   theme(legend.position = "none", 
         strip.text.y = element_text(angle = 0))+
   scale_fill_viridis_d(begin = .2)+
   labs(title = "Epidemic curve for South Korea")+
   scale_x_date("Date of report")+
   scale_y_continuous("Number of cases", labels = scales::comma)
)

So it seems that the number of confirmed cases in South Korea is really coming down, although we would have to check the individual provinces to understand the situation there better.

Lastly, we can have a look at the situation in China:


(china <- long.ncov %>% 
   filter(country == "China",
          type == "confirmed",
          date >= "2020-01-01") %>%  
   ggplot(aes(date, cases, fill = "coronavirus"))+
   geom_bar(stat = "identity")+
   theme(legend.position = "none", 
         strip.text.y = element_text(angle = 0))+
   scale_fill_viridis_d(begin = .2)+
   scale_x_date("Date of report")+
   scale_y_continuous("Number of cases", labels = scales::comma)+
   labs(title = "Epidemic curve for China")
)

China is actually doing pretty well, as the number of new confirmed cases has dropped dramatically since March 3. Once they implemented the movement restrictions, the number of new cases dropped to a trickle, even if the total number of cases is still high (not shown). (The spike represents a change in the counting methodology.)

In a future post we can start applying some modelling to predict the future situation of Europe, but I leave you with the following warning:

You know one thing I learned after doing a PhD? That devoting 3 years of your life to one topic doesn't make you an expert on it. You need a lifetime. So if you are a data scientist with no health domain knowledge, keep your naive analysis to yourself.
— Pelayo Arbués (@pelayoarbues) March 30, 2020

🤷

Corrections

If you find any mistakes or want to suggest changes, please open an issue on the source repository.

Links

Wu, Tianzhi, Xijin Ge, Guangchuang Yu, and Erqiang Hu. 2020. “Open-Source Analytics Tools for Studying the COVID-19 Coronavirus Outbreak.” medRxiv, March, 2020.02.25.20027433.

COVID-19 in Europe