AI4D Predict the Global Spread of COVID-19
$5,000 USD
The accuracy of your models will be assessed on future data
869 data scientists enrolled, 45 on the leaderboard
13 March—19 April

The objective of this challenge is to build an epidemiological model that predicts the spread of COVID-19 throughout the world. The target variable is the cumulative number of deaths caused by COVID-19 in each country by each date.
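Because the target is cumulative, a forecast for any one country should never decrease from one date to the next. A minimal sketch of a post-processing step that enforces this, assuming the predictions for a single country sit in a date-ordered array (the function name `enforce_cumulative` is hypothetical and not part of the competition materials):

```python
import numpy as np

def enforce_cumulative(pred):
    """Clip negative values to zero, then make the series non-decreasing."""
    pred = np.clip(np.asarray(pred, dtype=float), 0.0, None)
    # Running maximum: each value becomes the largest value seen so far.
    return np.maximum.accumulate(pred)

# Illustrative numbers only, not real data.
raw = [0.0, 1.2, 0.9, 3.4, 3.1, 5.0]
fixed = enforce_cumulative(raw)
```

The running maximum is only one possible repair; it flattens dips rather than re-fitting the curve, so it is best treated as a final sanity check.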

We have selected the cumulative number of fatalities rather than the number of reported infections as the target variable because the real number of infections is unknown and will perhaps never be known. The number of reported cases is understood to be underestimated and largely biased by the availability of tests, which varies from location to location and country to country.

We encourage participants to engage with the literature available on approaches and considerations when modelling the spread of diseases.

For this competition, we have used the publicly available data from the Coronavirus COVID-19 Global Cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU), which is updated daily.

For your reference:

  • The JHU raw data is published here.
  • The data has been visualized by JHU in this dashboard.
  • An early blog post on modeling the spread of COVID-19 by JHU.

For this competition, you can suggest other publicly available datasets to use in your model. Please post them on the discussion forum for approval. We will update this page to include new datasets as they are suggested. Because you are predicting the future, virtually any dataset will be allowed as long as everyone has equal access to it.

This challenge is to build a model that actually looks into the future. Recognising that all of the data is publicly available and grows every day, we have structured this challenge a bit differently from other Zindi challenges:

The Public Leaderboard will be updated once a week with the most recent seven days of actual data, and scores will be recalculated. While the competition is open, the Public Leaderboard will rank submitted solutions by the accuracy score they achieve on only the most recent seven days. Once submissions are closed, the most recent Public Leaderboard will remain visible until the final close of the competition. At the final close, the Private Leaderboard will be revealed. It scores only the data from the time submissions closed until the time the competition closed, and it determines the final ranking for the competition.
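To mirror the weekly leaderboard locally, you can score a forecast against only the most recent seven days of actual data. The sketch below is an illustration under stated assumptions: the column names `territory`, `date`, and `target` are hypothetical, and mean absolute error is used as a stand-in because this page does not name the official metric.

```python
import pandas as pd

def last_week_mae(actual, predicted, days=7):
    """Mean absolute error over the most recent `days` dates in `actual`.

    Assumes both frames have columns: territory, date, target.
    MAE is a stand-in metric, not the official competition score.
    """
    actual = actual.copy()
    predicted = predicted.copy()
    actual["date"] = pd.to_datetime(actual["date"])
    predicted["date"] = pd.to_datetime(predicted["date"])

    # Keep only the final `days` calendar days of the actual data.
    cutoff = actual["date"].max() - pd.Timedelta(days=days - 1)
    recent = actual[actual["date"] >= cutoff]

    # Inner join: rows with no matching prediction are dropped.
    merged = recent.merge(
        predicted, on=["territory", "date"], suffixes=("_true", "_pred")
    )
    return (merged["target_true"] - merged["target_pred"]).abs().mean()

# Toy example with made-up numbers: errors before the final week are ignored.
dates = pd.date_range("2020-03-01", periods=10, freq="D")
actual = pd.DataFrame({"territory": "A", "date": dates, "target": range(10)})
predicted = actual.copy()
predicted["target"] = [v + 100 for v in range(3)] + [v + 1 for v in range(3, 10)]
score = last_week_mae(actual, predicted)
```

Note that the inner join silently ignores missing predictions, so it is worth checking separately that every territory and date in the scoring window is covered.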

Files available for download

  • SampleSubmission.csv - an example of what your submission file should look like. The order of the rows does not matter, but the names in “Territory X Date” must be correct. The submission must contain every day from 6 March 2020 to 7 June 2020 for each country.
  • SampleSubmissionLocal.csv - an example of what your submission file should look like for the current week under review. You can use this file for local validation. The order of the rows does not matter, but the names in “Territory X Date” must be correct.
  • Train.csv - the data you will use to train your model. It contains data from 22 January to 5 March 2020 and will be updated every seven days to include the new actual data. “Target” is the cumulative number of confirmed deaths and “Cases” is the number of reported infections.
  • Covid-19-Data-Prep.ipynb - a Python notebook that generates the Train.csv file as well as the reference file against which the Public Leaderboard will be calculated (approximately the most recent seven days). You can run it to generate these files yourself. Here is the link to the Colab Notebook.
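As a rough illustration of the submission requirement above, the following sketch builds an empty submission frame with one row per territory per day from 6 March 2020 to 7 June 2020. The ID layout (`<territory> X <month>/<day>/<two-digit year>`) and the column names `Territory X Date` and `target` are assumptions made for the example; check them against SampleSubmission.csv before use. The function name is hypothetical.

```python
import pandas as pd

def submission_skeleton(territories, start="2020-03-06", end="2020-06-07"):
    """One row per territory per day, with a placeholder target of 0."""
    dates = pd.date_range(start, end, freq="D")  # both endpoints included
    rows = [
        {
            # Assumed ID layout; verify against SampleSubmission.csv.
            "Territory X Date": f"{t} X {d.month}/{d.day}/{d.year % 100}",
            "target": 0.0,
        }
        for t in territories
        for d in dates
    ]
    return pd.DataFrame(rows)

# Two example territories; the real file covers every country in the data.
skeleton = submission_skeleton(["Kenya", "Italy"])
```

Starting from a complete skeleton and filling in predictions afterwards makes it harder to submit a file with missing days.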

Other learning resources from the community:

Two relevant learning opportunities from the Johns Hopkins Bloomberg School of Public Health.

Additional datasets from the community:

Population density mapping from Facebook: