# 1) SOLUTION OVERVIEW
During this interesting challenge, I tried many ideas, and as you may expect, some worked and some didn't.
### WHAT DIDN'T WORK
* Dimensionality reduction via PCA
* KNN features in general
* Neural networks: both a simple MLP and sequence-based networks (CNN, GRU, LSTM, etc.) on KNN-augmented features
### WHAT WORKED
* Feature selection: by boosting-model importance or OLS weights
* Blending boosting models with smooth models (SVM and Lasso), a technique well known among Kagglers
* Feature engineering: cross features, time features
* Post-processing inspired by [this discussion](https://www.kaggle.com/competitions/playground-series-s3e20/discussion/433567) to address what was, for me, the biggest challenge of this solution
* Training with a clipped target
* Hyperparameter optimisation (with Optuna)
* Pseudo-labelling
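To illustrate the clipped-target idea from the list above, here is a minimal sketch. The clipping quantile (99th percentile) and the plain least-squares model are assumptions for illustration, not the author's actual values or models:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
y[:5] += 50.0  # a few extreme target spikes, like the high-PM2.5 sites

# Clip the target at an upper quantile before fitting; the 99th
# percentile is an assumed threshold, not the author's exact value.
upper = np.quantile(y, 0.99)
y_clipped = np.minimum(y, upper)

# Ordinary least squares on the clipped target, standing in for the
# boosting / Lasso / SVM models used in the real solution.
coef, *_ = np.linalg.lstsq(X, y_clipped, rcond=None)
preds = X @ coef
```

Clipping keeps the extreme outliers from dominating the squared-error loss while still letting the model see those rows.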
### INFRASTRUCTURE
* **Kaggle**: Though I started locally, I finished on Kaggle to use the 2xT4 GPUs to fit my RAPIDS SVM quickly. This let me save time for the Optuna optimisation of XGBoost and CatBoost.
* The Kaggle notebook is pinned to the original environment of 2022-04-27 to avoid RAPIDS issues.
### THE BIGGEST PART OF THE CHALLENGE (for me)
First of all, like many other participants, I noticed that the train and test sets do not cover the same sites. This makes it very challenging to generalize predictions to places (sites) with very idiosyncratic characteristics. For instance, some sites in the training set, especially in Nigeria, have very high PM2.5 values, yet judging from the non-missing data provided, they look similar to other sites with low PM2.5. That's why I decided to drop the longitude and latitude features: they roughly "force" the model to memorize the idiosyncratic characteristics of training sites, which is very risky for the model's ability to generalize to new sites.

To illustrate what I mean by idiosyncratic characteristics, I also checked on Google Maps the places in Lagos with the highest PM2.5. Some of them were around the same river, and I guessed that this river might be polluted. I suspect the post-processing applied [here](https://www.kaggle.com/competitions/playground-series-s3e20/discussion/433567) attempted to dampen this issue. I also noticed that these sites have a very high coefficient of variation (CoV, hereafter). So after retrieving the predictions from my models, I blend them, compute the CoV by site, randomly pick 5 or 6 of the sites with the highest CoV and largest size, and adapt their multipliers.
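The per-site CoV post-processing described above could look roughly like this. The site labels, prediction values, size cutoff, number of sites picked, and the 1.3 multiplier are all illustrative assumptions, not the author's actual numbers (the real multipliers would be tuned, e.g. against the leaderboard):

```python
import numpy as np
import pandas as pd

# Toy blended predictions for three sites (values are illustrative).
df = pd.DataFrame({
    "site": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "pred": [10.0, 11.0, 9.0, 10.0, 5.0, 50.0, 8.0, 60.0, 7.0, 7.0, 8.0, 7.0],
})

# Coefficient of variation (std / mean) and row count per site.
stats = df.groupby("site")["pred"].agg(["mean", "std", "size"])
stats["cov"] = stats["std"] / stats["mean"]

# Keep sites with enough rows, take the highest-CoV ones, and scale
# their predictions by a multiplier (hypothetical value here).
top_sites = stats[stats["size"] >= 4].nlargest(1, "cov").index
multiplier = 1.3
df.loc[df["site"].isin(top_sites), "pred"] *= multiplier
```

In the toy data, site B has wildly varying predictions, so it is the one selected and rescaled.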
### QUICK MODEL OVERVIEW
* Step 1: prediction = 0.2 * lightgbm + 0.2 * xgboost + 0.2 * catboost + 0.2 * Lasso + 0.2 * SVM, with normal training
* Step 2: post-process the Step 1 output
* Step 3: prediction = 0.2 * lightgbm + 0.2 * xgboost + 0.2 * catboost + 0.2 * Lasso + 0.2 * SVM, trained with pseudo-labels from Step 2
* Step 4: post-process the Step 3 output
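The equal-weight blend in Steps 1 and 3 above can be sketched as follows; the prediction arrays are made-up placeholders standing in for the five models' actual outputs:

```python
import numpy as np

# Hypothetical per-model test predictions; in the real solution these
# come from lightgbm, xgboost, catboost, Lasso and a RAPIDS SVM.
preds = {
    "lgbm": np.array([20.0, 35.0]),
    "xgb": np.array([22.0, 33.0]),
    "cat": np.array([21.0, 34.0]),
    "lasso": np.array([19.0, 36.0]),
    "svm": np.array([23.0, 32.0]),
}

# Steps 1 and 3: simple equal-weight blend (0.2 each).
blend = sum(0.2 * p for p in preds.values())

# Steps 2 and 4: apply the CoV post-processing to the blend; for the
# pseudo-labelling step, the post-processed blend is then used as the
# target when refitting the five models.
```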
**Note**: I will share the notebook afterwards. (I haven't finished documenting and restructuring my code because of my job; honestly, I'm a little lost in the midst of all my notebooks trying to gather everything in one place.)
Thanks for sharing
This is excellent. You deserved it. Congrats !
Excellent! Thank you, sir.
Thank you for this.
And congratulations on winning the challenge.
Thank you for sharing.
thank u for sharing
If you don't mind sharing the code, @machine_learning
Hello @marching_learning, I have two questions for you. Firstly, did you perform a single optimization for all five models, or did you build and optimize each one separately and then combine the results? Secondly, for the 5-6 sites with high CoV values, how did you determine the scaling factors used in post-processing? I tried using similar multipliers in my own solution but didn't get good results. I'm really curious about how you found the values of these multipliers. Any help would be appreciated. Have a great day!
Thank you for this...congratulations.
Great job buddy.