
Togo Fiber Optics Uptake Prediction Challenge
Helping Togo · 3,000,000 XOF to be shared
Completed · Prediction · 212 joined · 93 active
Start: 07 May 2024 · Close: 23 Jun 2024 · Reveal: 23 Jun 2024

1st place write-up
Platform · 28 Jun 2024, 10:50

First of all, thanks to Togo's Ministry of the Digital Economy and Transformation for organizing this very interesting competition. These kinds of initiatives should definitely be pursued! Thanks also to Zindi for hosting this.

Summary

I approached this task as a tabular binary classification problem. My final submission is a combination of LightGBM and CatBoost models, using KNN-based features (more on this later) and cluster-based stratified k-fold validation.

First steps

At the beginning of this competition, I tried several models (boosting models, random forest and KNN). I was surprised to see that a simple KNN classifier significantly outperformed all the other models. Knowing this, my bet was that there should be a way to intelligently combine KNN with a more sophisticated tabular-data model. After several tries, I ended up using KNN to retrieve the 80 nearest neighbours for each data point, then adding the average features (weighted by proximity) from these neighbours. These newly constructed features proved to be very effective, allowing me to gain around +2 on the leaderboard.
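Here is a minimal sketch of the neighbour-feature idea, assuming scikit-learn. Only the 80 neighbours and the proximity weighting come from the description above; the helper name `knn_features`, the inverse-distance weighting formula and `eps` are my own illustration, not my exact code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_features(X, n_neighbors=80, eps=1e-8):
    """Proximity-weighted average of each row's nearest neighbours (hypothetical helper).

    X is a NumPy feature matrix of shape (n, d).
    """
    # Ask for one extra neighbour: a point's closest match is itself.
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]        # drop the self-match
    w = 1.0 / (dist + eps)                     # closer neighbours get larger weights
    w /= w.sum(axis=1, keepdims=True)          # normalise weights per row
    return np.einsum('ij,ijk->ik', w, X[idx])  # (n, k) x (n, k, d) -> (n, d)

# Concatenate the neighbour averages with the original features.
X_aug = np.hstack([X, knn_features(X)])
```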

While the above recipe seemed to work well, I still faced some randomness in the validation-to-leaderboard correlation. Hence another important ingredient to find was a robust and trustworthy validation scheme (see below).

Model Validation

I used a cluster-based stratified k-fold validation strategy. Recall that the provided data is dominated by geographical features, so clustering can reveal geographical proximity and neighbourhood structure. I decided to cluster the provided data into 300 clusters and to use the cluster labels as the stratification variable. The resulting validation folds showed a very consistent correlation with the public leaderboard and, finally, with the private one.
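A sketch of what such a scheme could look like with scikit-learn; only the 300 clusters are stated above, so `X_geo`, the number of splits and the KMeans settings are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedKFold

# Cluster rows on the geographical features, then stratify folds on the
# cluster labels. X_geo is the geographical block; n_splits=5 is an assumption.
clusters = KMeans(n_clusters=300, random_state=0).fit_predict(X_geo)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr_idx, va_idx) in enumerate(skf.split(X_aug, clusters)):
    # Train on tr_idx, validate on va_idx; every fold covers every cluster,
    # provided each cluster has at least n_splits members.
    ...
```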

More on Feature Engineering

Apart from the KNN features mentioned above, I also used Singular Value Decomposition (TruncatedSVD) to reduce the 4,000 geographical features to 400. Not only does this make my algorithms faster, it also makes them more stable. I also did row-wise feature scaling and normalization: I first weighted each feature group (i.e. geographical, categorical, numerical, ...) and then normalized each row by its Euclidean norm. I used no feature selection procedure. All categorical features were one-hot encoded, with null values encoded as a new modality.
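A hedged sketch of this preprocessing, assuming scikit-learn and NumPy; `X_geo`, `X_cat_onehot`, `X_num` and the group weights are placeholders introduced for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Reduce the ~4,000 geographical columns to 400 components.
svd = TruncatedSVD(n_components=400, random_state=0)
X_geo_reduced = svd.fit_transform(X_geo)

# Weight each feature group, then normalise each row by its Euclidean norm.
w_geo, w_cat, w_num = 1.0, 1.0, 1.0  # per-group weights (values are assumptions)
X_all = np.hstack([w_geo * X_geo_reduced,
                   w_cat * X_cat_onehot,  # one-hot categoricals, nulls as their own modality
                   w_num * X_num])
X_all /= np.linalg.norm(X_all, axis=1, keepdims=True)
```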

Modeling

Nothing special here: I just used boosting models (LightGBM & CatBoost) on the initial and KNN-based features.
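For concreteness, a minimal blend along those lines; the hyper-parameters and the 50/50 probability average are assumptions, not my actual settings.

```python
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Fit both boosters on the same (initial + KNN-based) feature matrix.
lgbm = LGBMClassifier(n_estimators=1000, learning_rate=0.05)
cat = CatBoostClassifier(iterations=1000, learning_rate=0.05, verbose=0)
lgbm.fit(X_train, y_train)
cat.fit(X_train, y_train)

# Average the positive-class probabilities of the two models.
pred = 0.5 * lgbm.predict_proba(X_test)[:, 1] \
     + 0.5 * cat.predict_proba(X_test)[:, 1]
```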

What did not work

  • Label encoding
  • Pseudo Labelling
  • Custom Transformer Neural Network (quickly discarded given the poor results)

Code & resources

Unfortunately, I can't share my scripts yet, but if allowed by the host, I will share whatever is possible here after the finalization of the leaderboard.

And once again, thanks to all the competitors; this was an amazing challenge. Feel free to ask any questions.

Discussion (13 answers)

Thank you for sharing 👍

28 Jun 2024, 10:59
Upvotes 1

Thank you for sharing. Very clever, the KNN trick. I also tried KNN, but not in this way. Congratulations on the 1st place.

28 Jun 2024, 11:02
Upvotes 1

Thanks @KodjoDjehouty. Yeah, it's all about intuition and sufficient luck :) .

Armand_PY_Kdp

Thanks for sharing. Great job. I used LightGBM & CatBoost, but separately ... thanks for the explanation.

28 Jun 2024, 11:06
Upvotes 1

Thank you for sharing

28 Jun 2024, 11:13
Upvotes 1
sdo

Thank you for sharing

28 Jun 2024, 11:21
Upvotes 1
sdo

What do you think about creating a community?

Thanks for sharing, goat

28 Jun 2024, 12:29
Upvotes 1