# 1) SOLUTION OVERVIEW
During this interesting challenge, I tried many ideas, and as you may expect, some worked and some didn't.
### WHAT DIDN'T WORK
* Dimensionality reduction via PCA
* KNN features in general
* Neural networks: both a simple MLP and sequence-based networks (CNN, GRU, LSTM, etc.) on KNN-augmented features
### WHAT WORKED
* Feature selection: by boosting-model importance or OLS weights
* Blending boosting models with smooth models (SVM and Lasso), a technique well known among Kagglers
* Feature engineering: cross features, time features
* Post-processing inspired by [this discussion](https://www.kaggle.com/competitions/playground-series-s3e20/discussion/433567) to address what was, for me, the biggest challenge of this solution
* Training with a clipped target
* Hyperparameter optimisation (with Optuna)
* Pseudo-labelling
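To illustrate the clipped-target idea from the list above, here is a minimal sketch. The clipping quantile (99th percentile) and the plain least-squares model are assumptions for illustration, not the author's actual values or models:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
y[:5] += 50.0  # a few extreme target spikes, like the high-PM2.5 sites

# Clip the target at an upper quantile before fitting; the 99th
# percentile is an assumed threshold, not the author's exact value.
upper = np.quantile(y, 0.99)
y_clipped = np.minimum(y, upper)

# Ordinary least squares on the clipped target, standing in for the
# boosting / Lasso / SVM models used in the real solution.
coef, *_ = np.linalg.lstsq(X, y_clipped, rcond=None)
preds = X @ coef
```

Clipping keeps the extreme outliers from dominating the squared-error loss while still letting the model see those rows.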
### INFRASTRUCTURE
* **Kaggle**: Though I started locally, I finished on Kaggle to use the 2xT4 GPUs to fit my RAPIDS SVM quickly. This let me save time for the Optuna optimisation of XGBoost and CatBoost.
* The Kaggle notebook is pinned to the original environment of 2022-04-27 to avoid RAPIDS issues.
### THE BIGGEST PART OF THE CHALLENGE (for me)
First of all, like many other participants, I noticed that the train and test sets do not cover the same sites. This makes it very challenging to generalize predictions to places (sites) with very idiosyncratic characteristics. For instance, some sites in the training set, especially in Nigeria, have very high PM2.5 values, yet judging from the non-missing data provided, they look similar to other sites with low PM2.5. That's why I decided to drop the longitude and latitude features: they roughly "force" the model to memorize the idiosyncratic characteristics of training sites, which is very risky for the model's ability to generalize to new sites.

To illustrate what I mean by idiosyncratic characteristics, I also checked on Google Maps the places in Lagos with the highest PM2.5. Some of them were around the same river, and I guessed that this river might be polluted. I suspect the post-processing applied [here](https://www.kaggle.com/competitions/playground-series-s3e20/discussion/433567) attempted to dampen this issue. I also noticed that these sites have a very high coefficient of variation (CoV, hereafter). So after retrieving the predictions from my models, I blend them, compute the CoV by site, randomly pick 5 or 6 of the sites with the highest CoV and largest size, and adapt their multipliers.
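The per-site CoV post-processing described above could look roughly like this. The site labels, prediction values, size cutoff, number of sites picked, and the 1.3 multiplier are all illustrative assumptions, not the author's actual numbers (the real multipliers would be tuned, e.g. against the leaderboard):

```python
import numpy as np
import pandas as pd

# Toy blended predictions for three sites (values are illustrative).
df = pd.DataFrame({
    "site": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "pred": [10.0, 11.0, 9.0, 10.0, 5.0, 50.0, 8.0, 60.0, 7.0, 7.0, 8.0, 7.0],
})

# Coefficient of variation (std / mean) and row count per site.
stats = df.groupby("site")["pred"].agg(["mean", "std", "size"])
stats["cov"] = stats["std"] / stats["mean"]

# Keep sites with enough rows, take the highest-CoV ones, and scale
# their predictions by a multiplier (hypothetical value here).
top_sites = stats[stats["size"] >= 4].nlargest(1, "cov").index
multiplier = 1.3
df.loc[df["site"].isin(top_sites), "pred"] *= multiplier
```

In the toy data, site B has wildly varying predictions, so it is the one selected and rescaled.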
### QUICK MODEL OVERVIEW
* Step 1: prediction = 0.2 * lightgbm + 0.2 * xgboost + 0.2 * catboost + 0.2 * Lasso + 0.2 * SVM, with normal training
* Step 2: post-process the Step 1 output
* Step 3: prediction = 0.2 * lightgbm + 0.2 * xgboost + 0.2 * catboost + 0.2 * Lasso + 0.2 * SVM, trained with pseudo-labels from Step 2
* Step 4: post-process the Step 3 output
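The equal-weight blend in Steps 1 and 3 above can be sketched as follows; the prediction arrays are made-up placeholders standing in for the five models' actual outputs:

```python
import numpy as np

# Hypothetical per-model test predictions; in the real solution these
# come from lightgbm, xgboost, catboost, Lasso and a RAPIDS SVM.
preds = {
    "lgbm": np.array([20.0, 35.0]),
    "xgb": np.array([22.0, 33.0]),
    "cat": np.array([21.0, 34.0]),
    "lasso": np.array([19.0, 36.0]),
    "svm": np.array([23.0, 32.0]),
}

# Steps 1 and 3: simple equal-weight blend (0.2 each).
blend = sum(0.2 * p for p in preds.values())

# Steps 2 and 4: apply the CoV post-processing to the blend; for the
# pseudo-labelling step, the post-processed blend is then used as the
# target when refitting the five models.
```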
**Note**: I will share the notebook afterwards. (I haven't finished documenting and restructuring my code because of my job; honestly, I'm a little lost in the midst of all my notebooks trying to gather everything in one place.)
Thanks for sharing
This is excellent. You deserved it. Congrats !
Excellent! Thank you, sir.
Thank you for this.
And congratulations on winning the challenge.
Thank you for sharing.
thank u for sharing
If you don't mind sharing the code, @machine_learning
Hello @marching_learning, I have two questions for you. Firstly, did you perform a single optimization for all five models, or did you build and optimize each one separately and then combine the results? Secondly, for the 5-6 sites with high CoV values, how did you determine the scaling factors used in post-processing? I tried using similar multipliers in my own solution but didn't get good results. I'm really curious about how you found the values of these multipliers. Any help would be appreciated. Have a great day!
Thank you for this...congratulations.
Great job buddy.