
Economic Well-Being Prediction Challenge · Helping Africa

2000 Points · Prediction · 742 joined · 140 active

Start: Apr 16, 2021 · Close: Aug 15, 2021 · Reveal: Aug 15, 2021
CapitainData (UM6P)

4th place solution

Notebooks · 17 Aug 2021, 03:07

Solution

Thanks to @Zindi and AIMS for this awesome competition. I've learnt a lot.

When executing the code, if possible, switch the trainer function from CPU to GPU for faster and better results.
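For the two libraries used in step 3 below, the CPU-to-GPU switch is a single constructor argument. A minimal illustration (assuming a GPU-enabled LightGBM build):

```python
import lightgbm as lgb
from catboost import CatBoostRegressor

# CPU is the default in both libraries; these flags move training to the GPU.
cat_model = CatBoostRegressor(task_type="GPU")
lgb_model = lgb.LGBMRegressor(device="gpu")
```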

I'd like to share some insights I gained from this competition that could help you reach a similar score:

1. Preprocess the data correctly:

* You will notice after a little exploration that some variables are on a scale of 100 and others on a scale of 1, so bring every percentage variable to a scale of 1. Scale the other numeric features where required, using any scaler from sklearn: I used RobustScaler on "ghsl_pop_density", "nighttime_lights", "dist_to_shoreline" and "dist_to_capital", the variables that weren't percentages.

* Apply CountEncoder(normalize=True) from the category_encoders library to the country, urban_or_rural and year variables. Both preprocessing steps are sketched right after this list.
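Here is a minimal sketch of step 1, assuming the data loads as pandas DataFrames. The percentage-detection heuristic and the Target column name are my assumptions; only the quoted column names come from the description above:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler
import category_encoders as ce

NON_PCT = ["ghsl_pop_density", "nighttime_lights",
           "dist_to_shoreline", "dist_to_capital"]  # the non-percentage numerics
CAT_COLS = ["country", "urban_or_rural", "year"]
TARGET = "Target"  # assumed name of the target column

def preprocess(train: pd.DataFrame, test: pd.DataFrame):
    # Heuristic: any remaining numeric column whose values exceed 1 is
    # treated as a percentage on a 0-100 scale and brought down to 0-1.
    pct_cols = [c for c in train.select_dtypes("number").columns
                if c not in NON_PCT + CAT_COLS + [TARGET]
                and train[c].max() > 1.0]
    for df in (train, test):
        df[pct_cols] = df[pct_cols] / 100.0

    # Robust scaling (fit on train only) for the non-percentage numerics.
    scaler = RobustScaler()
    train[NON_PCT] = scaler.fit_transform(train[NON_PCT])
    test[NON_PCT] = scaler.transform(test[NON_PCT])

    # Normalized count encoding for the categorical variables.
    enc = ce.CountEncoder(cols=CAT_COLS, normalize=True)
    train[CAT_COLS] = enc.fit_transform(train[CAT_COLS])
    test[CAT_COLS] = enc.transform(test[CAT_COLS])
    return train, test
```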

2. Use an adequate cross-validation scheme; here we are trying to learn from one set of countries and predict on another. Three options, all sketched in code after this list:

- You can split the data by country groups. Your leaderboard scores will be closer to your training scores, but the scores themselves won't be great.

- Or you can split the data with stratification on a discretized version of the target, to make sure a representative set of Target values is represented in every fold of the cross-validation. This option slightly improved my results.

- Or you can simply split the data with a plain KFold method.
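A minimal sketch of the three schemes with scikit-learn; the number of target bins (q=10) and the random seeds are illustrative choices, not tuned values:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold, StratifiedKFold, KFold

def make_folds(df: pd.DataFrame, target: str = "Target",
               scheme: str = "stratified", n_splits: int = 5):
    if scheme == "group":
        # Option 1: country-grouped folds, so no country appears in both
        # the training and the validation part of any fold.
        return GroupKFold(n_splits=n_splits).split(df, groups=df["country"])
    if scheme == "stratified":
        # Option 2: stratify on a discretized target so every fold sees a
        # representative spread of target values (the option that helped me).
        bins = pd.qcut(df[target], q=10, labels=False, duplicates="drop")
        return StratifiedKFold(n_splits=n_splits, shuffle=True,
                               random_state=42).split(df, bins)
    # Option 3: a plain KFold split.
    return KFold(n_splits=n_splits, shuffle=True, random_state=42).split(df)
```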

3. Train multiple models relevant to the task, then pick the best ones and ensemble them. I used CatBoost and LightGBM, trained with custom functions to avoid overfitting and underfitting.
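My exact training functions aren't reproduced here; the sketch below shows the general shape, using early stopping on a validation fold as the guard against overfitting (the hyperparameter values are illustrative, not the ones I tuned):

```python
import lightgbm as lgb
from catboost import CatBoostRegressor

def train_and_blend(X_tr, y_tr, X_va, y_va, X_test):
    # LightGBM with early stopping on the validation fold.
    lgb_model = lgb.LGBMRegressor(n_estimators=5000, learning_rate=0.03)
    lgb_model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)],
                  callbacks=[lgb.early_stopping(200, verbose=False)])

    # CatBoost with the same early-stopping guard (task_type="GPU" if available).
    cat_model = CatBoostRegressor(iterations=5000, learning_rate=0.03,
                                  verbose=False)
    cat_model.fit(X_tr, y_tr, eval_set=(X_va, y_va),
                  early_stopping_rounds=200)

    # Simple average blend of the two models on the test set.
    return (lgb_model.predict(X_test) + cat_model.predict(X_test)) / 2
```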
