25 Nov 2020, 14:41

Meet Mathurin Aché, winner of the Predict the Spread of COVID-19 Challenge and experienced ML practitioner

Catch up with seasoned data science competitor and Product Manager at Prevision.io, Mathurin Aché (mathurin) as he shares a few of his secrets to winning the AI4D Predict the Global Spread of COVID-19 challenge.

Hi Mathurin, please introduce yourself to the Zindi community.

My name is Mathurin Aché (mathurin), I am 39 years old, and I live in France.

Tell us a bit about your data science journey.

I have been a data scientist for 15 years. I am currently Product Manager at Prevision.io, a publisher of machine learning software. I have participated in more than 200 competitions on Kaggle (around 20th in the world ranking) and other data science platforms. See my profile here.

What do you like about competing on Zindi?

When I participate in data science contests, I have two objectives:

  • to work on new data or problems, ideally with a positive impact for people
  • to learn new techniques and data science methods, so I can keep progressing in the field

So I like that Zindi sets up competitions with tangible human objectives.

Tell us about the solution you built for the AI4D Predict the Global Spread of COVID-19 challenge.

I discovered the AI4D Predict the Global Spread of COVID-19 contest just two days before the end. I had just taken part in an equivalent competition on Kaggle, with some differences:

  • the metric used (MAE vs RMSLE)
  • the longer forecast window on Zindi
  • the different lists of countries and regions between the two competitions
  • the target: the cumulative number of deaths by country and region of the world (on Kaggle, the number of confirmed COVID-19 cases was also predicted)

Since the data span very different ranges, from a few deaths to several tens of thousands of deaths, it is common practice to work with log(fatalities) rather than the raw counts.
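A minimal sketch of that transform in Python; the column name and values are illustrative, not taken from the actual competition data:

```python
import numpy as np
import pandas as pd

# Fatality counts spanning several orders of magnitude (made-up values)
df = pd.DataFrame({"fatalities": [2, 150, 3400, 48000]})

# Train on log1p(fatalities) so countries with huge counts don't dominate;
# log1p also handles zero-death rows that a plain log() would reject
df["log_fatalities"] = np.log1p(df["fatalities"])

# Invert with expm1 when converting model output back to death counts
recovered = np.expm1(df["log_fatalities"])
```

Predictions made on the log scale are converted back with `expm1` before scoring.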

In terms of external data, I used data from the "country_codes.csv" metadata.

In terms of explanatory variables, I manually created lag features at 1, 3 and 7 days within a sliding window.
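One way to build such lag features with pandas, shifting within each country so values never leak across series; the column names and numbers here are assumptions, not the actual competition schema:

```python
import pandas as pd

# Toy per-country daily fatality series (made-up numbers)
df = pd.DataFrame({
    "country": ["A"] * 5 + ["B"] * 5,
    "date": list(pd.date_range("2020-03-01", periods=5)) * 2,
    "fatalities": [0, 1, 2, 4, 8, 10, 20, 30, 50, 80],
})

# Lag features at 1, 3 and 7 days, shifted within each country's series
for lag in (1, 3, 7):
    df[f"lag_{lag}"] = df.groupby("country")["fatalities"].shift(lag)
```

Rows near the start of each country's history get NaN lags, which tree models such as xgboost can handle natively.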

In terms of my algorithm, I took the average of 6 xgboost models. Each model was trained with sample weights equal to 1 / days ** WEIGHT_NORM, with WEIGHT_NORM between 0.15 and 0.3 and a DECAY of 0.99. I also varied the following parameters across the models: min_child_weight, eta, colsample_bytree, max_depth, subsample, NROUND.
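The weighting scheme can be sketched as follows; the interpretation of `days` as distance from the most recent training date is my assumption, as are the toy values:

```python
import numpy as np

WEIGHT_NORM = 0.2   # varied between 0.15 and 0.3 across the 6 models
DECAY = 0.99

# days = 1 for the most recent training day, growing as we go back in time
days = np.arange(1, 31)

# Recent observations get the largest sample weight; older ones decay away
weights = (1.0 / days ** WEIGHT_NORM) * DECAY ** (days - 1)

# Each of the 6 xgboost models would then receive these weights via the
# sample_weight argument of fit(), and their predictions would be averaged.
```

Both factors shrink as `days` grows, so the most recent pandemic data dominates training, which matters when the growth regime changes week to week.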

What do you think set your approach apart?

I was mainly inspired by the solutions proposed by the winners of the Kaggle COVID forecasting contest.