First of all, thanks to Togo's Ministry of the Digital Economy and Transformation for organizing this very interesting competition. This kind of initiative should definitely be pursued! Thanks also to Zindi for hosting this.
I approached this task as a tabular binary classification problem. My final submission is a combination of a LightGBM and a CatBoost model, using KNN-based features (more on this later) and cluster-based stratified k-fold validation.
At the beginning of this competition, I tried several models (boosting models, random forest and KNN). I was surprised to see that a simple KNN classifier significantly outperformed all the other models. Knowing this, my bet was that there had to be a way to intelligently combine KNN with a more sophisticated tabular-data model. After several tries, I ended up using KNN to retrieve the 80 nearest neighbours of each data point, then adding the average of their features (weighted by proximity) as new columns. These newly constructed features proved very effective, gaining me around +2 on the leaderboard.
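For readers who want a concrete picture, here is a minimal sketch of how such KNN features could be built with scikit-learn. The inverse-distance weighting and whether a point is excluded from its own neighbour set are assumptions on my part, not necessarily the exact scheme used:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def add_knn_features(X_train, X_test, n_neighbors=80):
    """Append proximity-weighted averages of each point's neighbours as new features."""
    knn = NearestNeighbors(n_neighbors=n_neighbors)
    knn.fit(X_train)

    def weighted_neighbour_means(X):
        # dist, idx: (n_samples, n_neighbors); idx points into X_train
        dist, idx = knn.kneighbors(X)
        # inverse-distance weights (for training rows, the point itself appears
        # at distance 0 and dominates; one may want to drop that first column)
        w = 1.0 / (dist + 1e-6)
        w /= w.sum(axis=1, keepdims=True)
        neigh_feats = X_train[idx]                      # (n_samples, k, n_features)
        return (neigh_feats * w[..., None]).sum(axis=1)  # weighted neighbour average

    X_train_new = np.hstack([X_train, weighted_neighbour_means(X_train)])
    X_test_new = np.hstack([X_test, weighted_neighbour_means(X_test)])
    return X_train_new, X_test_new
```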
While the above recipe worked well, I still faced some randomness in the validation-to-leaderboard correlation. Hence, another important ingredient to find was a robust and trustworthy validation scheme (see below).
I use a cluster-based stratified k-fold validation strategy. Recall that the provided data is dominated by geographical features, so clustering can capture geographical proximity and neighbourhoods. I therefore clustered the data into 300 clusters and used the resulting cluster labels as the stratification variable. The obtained validation folds showed a very consistent correlation with the public leaderboard and, ultimately, the private one.
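A rough sketch of that validation scheme, assuming KMeans for the clustering and 5 folds (both assumptions on my part):

```python
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedKFold

# Cluster the (geography-dominated) training data into 300 groups
kmeans = KMeans(n_clusters=300, random_state=42)
cluster_labels = kmeans.fit_predict(X)  # X: training feature matrix

# Stratify the folds on the cluster labels so each fold covers all regions
# (very small clusters may need to be merged for stratification to work)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, cluster_labels)):
    X_tr, X_va = X[train_idx], X[valid_idx]
    y_tr, y_va = y[train_idx], y[valid_idx]
    # ... train and evaluate a model on this fold
```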
Apart from the KNN features mentioned above, I also use Singular Value Decomposition (TruncatedSVD) to reduce the 4000 geographical features to 400. Not only does this make my algorithms faster, it also makes them more stable. I also apply row-wise feature scaling and normalization: I first weight each feature group (i.e. geographical, categorical, numerical, ...) and then normalize each row by its Euclidean norm. I use no feature selection procedure. All categorical features are one-hot encoded, and null values are encoded as an additional modality.
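To illustrate the SVD reduction and the row-wise normalization, here is a minimal sketch; the group weights and variable names (geo_train, cat_train_ohe, num_train) are hypothetical placeholders:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

# Reduce the ~4000 geographical columns to 400 components
svd = TruncatedSVD(n_components=400, random_state=42)
geo_train_svd = svd.fit_transform(geo_train)   # geo_train: (n_samples, ~4000)
geo_test_svd = svd.transform(geo_test)

# Weight each feature group, then L2-normalize every row
w_geo, w_cat, w_num = 1.0, 1.0, 1.0            # hypothetical group weights
X_train = np.hstack([w_geo * geo_train_svd,
                     w_cat * cat_train_ohe,    # one-hot-encoded categoricals
                     w_num * num_train])       # numerical features
X_train = normalize(X_train, norm="l2", axis=1)  # divide each row by its Euclidean norm
```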
Nothing special here: I just use boosting models (LightGBM & CatBoost) on the initial and KNN-based features.
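For completeness, a simple sketch of how the two models could be trained and blended; the hyperparameters and the equal 50/50 averaging are assumptions, not the exact final configuration:

```python
import lightgbm as lgb
from catboost import CatBoostClassifier

# Train both boosting models on the same (initial + KNN-based) feature matrix
lgbm = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
lgbm.fit(X_tr, y_tr, eval_set=[(X_va, y_va)])

cat = CatBoostClassifier(iterations=1000, learning_rate=0.05, verbose=0)
cat.fit(X_tr, y_tr, eval_set=(X_va, y_va))

# Blend by averaging the predicted probabilities of the positive class
pred = 0.5 * lgbm.predict_proba(X_test)[:, 1] + 0.5 * cat.predict_proba(X_test)[:, 1]
```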
Unfortunately, I can't share my scripts yet, but if the host allows it, I will share whatever I can here once the leaderboard is finalized.
And once again, thanks to all the competitors, this was an amazing challenge. Don't hesitate to ask if you have any questions.
Thank you for sharing 👍
Thanks @array_nd
Thank you for sharing. Very clever, the KNN trick. I also tried KNN, but not in this way. Congratulations on the 1st place.
Thanks @KodjoDjehouty. Yeah, it's all about intuition and a bit of luck :).
Thanks for sharing. Great job. I also used LightGBM & CatBoost, but separately... thanks for the explanation.
Thanks @Armand_PY_Kdp
Thank you for sharing
Thanks @Folly
Thank you for sharing
Thanks @sdo
What do you think about creating a community?
Thanks for sharing, goat
Thanks @Urek