
Economic Well-Being Prediction Challenge · Helping Africa

2000 Points · Prediction · 742 joined · 140 active

Start: Apr 16, 2021 · Close: Aug 15, 2021 · Reveal: Aug 15, 2021
CapitainData (UM6P)

4th place solution

Notebooks · 17 Aug 2021, 03:07

Solution

Thanks to @Zindi and AIMS for this awesome competition. I've learnt a lot.

When executing the code, if possible, switch the trainer function from CPU to GPU for faster and better results.
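For the two libraries used in step 3 below, the CPU-to-GPU switch is a single constructor argument. A minimal illustration (assuming a GPU-enabled LightGBM build):

```python
import lightgbm as lgb
from catboost import CatBoostRegressor

# CPU is the default in both libraries; these flags move training to the GPU.
cat_model = CatBoostRegressor(task_type="GPU")
lgb_model = lgb.LGBMRegressor(device="gpu")
```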

I'd like to share some insights I gained from this competition that could help you reach a similar score:

1. Preprocess the data correctly:

* You will notice after a little exploration that some variables are on a scale of 100 and others on a scale of 1, so bring every percentage variable to a scale of 1. Scale the other numeric features where required, using any scaler from sklearn: I used RobustScaler on "ghsl_pop_density", "nighttime_lights", "dist_to_shoreline" and "dist_to_capital", the variables that weren't percentages.

* Apply CountEncoder(normalize=True) from the category_encoders library to the country, urban_or_rural and year variables. Both preprocessing steps are sketched right after this list.
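Here is a minimal sketch of step 1, assuming the data loads as pandas DataFrames. The percentage-detection heuristic and the Target column name are my assumptions; only the quoted column names come from the description above:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler
import category_encoders as ce

NON_PCT = ["ghsl_pop_density", "nighttime_lights",
           "dist_to_shoreline", "dist_to_capital"]  # the non-percentage numerics
CAT_COLS = ["country", "urban_or_rural", "year"]
TARGET = "Target"  # assumed name of the target column

def preprocess(train: pd.DataFrame, test: pd.DataFrame):
    # Heuristic: any remaining numeric column whose values exceed 1 is
    # treated as a percentage on a 0-100 scale and brought down to 0-1.
    pct_cols = [c for c in train.select_dtypes("number").columns
                if c not in NON_PCT + CAT_COLS + [TARGET]
                and train[c].max() > 1.0]
    for df in (train, test):
        df[pct_cols] = df[pct_cols] / 100.0

    # Robust scaling (fit on train only) for the non-percentage numerics.
    scaler = RobustScaler()
    train[NON_PCT] = scaler.fit_transform(train[NON_PCT])
    test[NON_PCT] = scaler.transform(test[NON_PCT])

    # Normalized count encoding for the categorical variables.
    enc = ce.CountEncoder(cols=CAT_COLS, normalize=True)
    train[CAT_COLS] = enc.fit_transform(train[CAT_COLS])
    test[CAT_COLS] = enc.transform(test[CAT_COLS])
    return train, test
```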

2. Use an adequate cross-validation scheme; here we are trying to learn from one set of countries and predict on another. Three options, all sketched in code after this list:

- You can split the data by country groups. Your leaderboard scores will be closer to your training scores, but the scores themselves won't be great.

- Or you can split the data with stratification on a discretized version of the target, to make sure a representative set of Target values is represented in every fold of the cross-validation. This option slightly improved my results.

- Or you can simply split the data with a plain KFold method.
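A minimal sketch of the three schemes with scikit-learn; the number of target bins (q=10) and the random seeds are illustrative choices, not tuned values:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold, StratifiedKFold, KFold

def make_folds(df: pd.DataFrame, target: str = "Target",
               scheme: str = "stratified", n_splits: int = 5):
    if scheme == "group":
        # Option 1: country-grouped folds, so no country appears in both
        # the training and the validation part of any fold.
        return GroupKFold(n_splits=n_splits).split(df, groups=df["country"])
    if scheme == "stratified":
        # Option 2: stratify on a discretized target so every fold sees a
        # representative spread of target values (the option that helped me).
        bins = pd.qcut(df[target], q=10, labels=False, duplicates="drop")
        return StratifiedKFold(n_splits=n_splits, shuffle=True,
                               random_state=42).split(df, bins)
    # Option 3: a plain KFold split.
    return KFold(n_splits=n_splits, shuffle=True, random_state=42).split(df)
```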

3. Train multiple models relevant to the task, then pick the best ones and ensemble them. I used CatBoost and LightGBM, trained with custom functions to avoid overfitting and underfitting.
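My exact training functions aren't reproduced here; the sketch below shows the general shape, using early stopping on a validation fold as the guard against overfitting (the hyperparameter values are illustrative, not the ones I tuned):

```python
import lightgbm as lgb
from catboost import CatBoostRegressor

def train_and_blend(X_tr, y_tr, X_va, y_va, X_test):
    # LightGBM with early stopping on the validation fold.
    lgb_model = lgb.LGBMRegressor(n_estimators=5000, learning_rate=0.03)
    lgb_model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)],
                  callbacks=[lgb.early_stopping(200, verbose=False)])

    # CatBoost with the same early-stopping guard (task_type="GPU" if available).
    cat_model = CatBoostRegressor(iterations=5000, learning_rate=0.03,
                                  verbose=False)
    cat_model.fit(X_tr, y_tr, eval_set=(X_va, y_va),
                  early_stopping_rounds=200)

    # Simple average blend of the two models on the test set.
    return (lgb_model.predict(X_test) + cat_model.predict(X_test)) / 2
```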
