First of all, thanks to Togo's Ministry of the Digital Economy and Transformation for organizing this very interesting competition. This kind of initiative should definitely be pursued! Thanks also to Zindi for hosting this.
I approached this task as a tabular binary classification problem. My final submission is a combination of a LightGBM and a CatBoost model, using KNN-based features (more on this later) and cluster-based stratified k-fold validation.
At the beginning of this competition, I tried several models (boosting models, random forest and KNN). I was surprised to see that a simple KNN classifier significantly outperformed all the other models. Knowing this, my bet was that there had to be a way to intelligently combine KNN with a more sophisticated tabular-data model. After several tries, I ended up using KNN to retrieve the 80 nearest neighbours of each data point, then adding the average of their features (weighted by proximity) as new columns. These newly constructed features proved very effective, gaining me around +2 on the leaderboard.
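For readers who want a concrete picture, here is a minimal sketch of how such KNN features could be built with scikit-learn. The inverse-distance weighting and whether a point is excluded from its own neighbour set are assumptions on my part, not necessarily the exact scheme used:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def add_knn_features(X_train, X_test, n_neighbors=80):
    """Append proximity-weighted averages of each point's neighbours as new features."""
    knn = NearestNeighbors(n_neighbors=n_neighbors)
    knn.fit(X_train)

    def weighted_neighbour_means(X):
        # dist, idx: (n_samples, n_neighbors); idx points into X_train
        dist, idx = knn.kneighbors(X)
        # inverse-distance weights (for training rows, the point itself appears
        # at distance 0 and dominates; one may want to drop that first column)
        w = 1.0 / (dist + 1e-6)
        w /= w.sum(axis=1, keepdims=True)
        neigh_feats = X_train[idx]                      # (n_samples, k, n_features)
        return (neigh_feats * w[..., None]).sum(axis=1)  # weighted neighbour average

    X_train_new = np.hstack([X_train, weighted_neighbour_means(X_train)])
    X_test_new = np.hstack([X_test, weighted_neighbour_means(X_test)])
    return X_train_new, X_test_new
```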
While the above recipe worked well, I still faced some randomness in the validation-to-leaderboard correlation. Hence, another important ingredient to find was a robust and trustworthy validation scheme (see below).
I use a cluster-based stratified k-fold validation strategy. Recall that the provided data is dominated by geographical features, so clustering can capture geographical proximity and neighbourhoods. I therefore clustered the data into 300 clusters and used the resulting cluster labels as the stratification variable. The obtained validation folds showed a very consistent correlation with the public leaderboard and, ultimately, the private one.
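A rough sketch of that validation scheme, assuming KMeans for the clustering and 5 folds (both assumptions on my part):

```python
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedKFold

# Cluster the (geography-dominated) training data into 300 groups
kmeans = KMeans(n_clusters=300, random_state=42)
cluster_labels = kmeans.fit_predict(X)  # X: training feature matrix

# Stratify the folds on the cluster labels so each fold covers all regions
# (very small clusters may need to be merged for stratification to work)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, cluster_labels)):
    X_tr, X_va = X[train_idx], X[valid_idx]
    y_tr, y_va = y[train_idx], y[valid_idx]
    # ... train and evaluate a model on this fold
```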
Apart from the KNN features mentioned above, I also use Singular Value Decomposition (TruncatedSVD) to reduce the 4000 geographical features to 400. Not only does this make my algorithms faster, it also makes them more stable. I also apply row-wise feature scaling and normalization: I first weight each feature group (i.e. geographical, categorical, numerical, ...) and then normalize each row by its Euclidean norm. I use no feature selection procedure. All categorical features are one-hot encoded, and null values are encoded as an additional modality.
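To illustrate the SVD reduction and the row-wise normalization, here is a minimal sketch; the group weights and variable names (geo_train, cat_train_ohe, num_train) are hypothetical placeholders:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

# Reduce the ~4000 geographical columns to 400 components
svd = TruncatedSVD(n_components=400, random_state=42)
geo_train_svd = svd.fit_transform(geo_train)   # geo_train: (n_samples, ~4000)
geo_test_svd = svd.transform(geo_test)

# Weight each feature group, then L2-normalize every row
w_geo, w_cat, w_num = 1.0, 1.0, 1.0            # hypothetical group weights
X_train = np.hstack([w_geo * geo_train_svd,
                     w_cat * cat_train_ohe,    # one-hot-encoded categoricals
                     w_num * num_train])       # numerical features
X_train = normalize(X_train, norm="l2", axis=1)  # divide each row by its Euclidean norm
```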
Nothing special here: I just use boosting models (LightGBM & CatBoost) on the initial and KNN-based features.
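For completeness, a simple sketch of how the two models could be trained and blended; the hyperparameters and the equal 50/50 averaging are assumptions, not the exact final configuration:

```python
import lightgbm as lgb
from catboost import CatBoostClassifier

# Train both boosting models on the same (initial + KNN-based) feature matrix
lgbm = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
lgbm.fit(X_tr, y_tr, eval_set=[(X_va, y_va)])

cat = CatBoostClassifier(iterations=1000, learning_rate=0.05, verbose=0)
cat.fit(X_tr, y_tr, eval_set=(X_va, y_va))

# Blend by averaging the predicted probabilities of the positive class
pred = 0.5 * lgbm.predict_proba(X_test)[:, 1] + 0.5 * cat.predict_proba(X_test)[:, 1]
```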
Unfortunately, I can't share my scripts yet, but if the host allows it, I will share whatever I can here once the leaderboard is finalized.
And once again, thanks to all the competitors, this was an amazing challenge. Don't hesitate to ask if you have any questions.
Thank you for sharing 👍
Thanks @array_nd
Thank you for sharing. Very clever, the KNN trick. I also tried KNN, but not in this way. Congratulations on the 1st place.
Thanks @KodjoDjehouty. Yeah, it's all about intuition and a bit of luck :).
Thanks for sharing. Great job. I also used LightGBM & CatBoost, but separately... thanks for the explanation.
Thanks @Armand_PY_Kdp
Thank you for sharing
Thanks @Folly
Thank you for sharing
Thanks @sdo
What do you think about creating a community?
Thanks for sharing, goat
Thanks @Urek