Primary competition visual

TIC-HEAP Cirta Particle Classification Challenge

Helping Switzerland
2 000 Zindi Points
Challenge completed over 5 years ago
Classification
174 joined
31 active
Starti
Oct 18, 19
Closei
Feb 16, 20
Reveali
Feb 17, 20
First place solution
Notebooks · 21 Feb 2020, 12:39 · edited 1 minute later · 4

This challenge aimed to predict which type of particle is present in 10x10 images. The images are collected using detectors in the collider following many events ( collisions ). the metric for this challenge is: Log Loss And this is a description of our classification solution that achieved first place with a score of 1.521101146 on private leaderboard.

Final solution :

The final solution which we used to reach such a score was a fine-tuned Catboost Classifier. We used an ensemble of catboosts each train on a different portion of the data for some classes ( the majority ones ).

Approach :

The main problem in this challenge was the class imbalance. You can quickly notice that there are 3 dominant classes with tens of thousands of samples, while 2 classes were having over 1300 samples and the other one around 3k samples. It is also mentioned that the test set was made to be balanced. If you train a model with the data as it is, your model will be predicting the majority classes only due to the severe class imbalance.

Data Preprocessing : the images' pixels were flattened and stacked in a dataframe, and each column represented a single pixel, and each row represented an image. We then were able to use sklearn and boosting algorithms on the newly created tabular data to perform classification.

1st approach : Oversampling

The oversampling approach didn't yield good results since there were not too many samples in the minority classes to be able to match the number of samples in the majority classes.

2nd approach: Undersampling

By undersampling all the classes to 1300 samples ( equal to least populated class ) , you get a score of 1.56 with a default catboost classifier. It is a good score compared to what people achieved back then, but there was another way to improve that score a lot more. Undersampling to the 2nd least populated class. Setting the data to 3k samples per class except for the 1k3 one yielded much better results. A default classifier with this approach yielded a 1.535 score.

Conclusion: 1300 samples weren't enough for the other 4 classes to be well classified by the model. Setting the undersampling threshold to 3k helped the model classify these 4 classes better than the first approach, but you definitely lose some power in predicting the least populated class.

Github link:https://github.com/JedidiMohamed/First_place_TIC-HEAP-cirta-particle_solution

Discussion 4 answers

Thanks for posting

21 Feb 2020, 16:06
Upvotes 0

Thanks for sharing.

Can you clarify this line?

"We used an ensemble of catboosts each train on a different portion of the data for some classes ( the majority ones )"

If you trained some models on the portion of the majority classes, how did you eventually combine the models?

21 Feb 2020, 17:07
Upvotes 0

Hey Alchemi, We still train on all classes, but each model is trained on a random sample of the majority classes + the same minorty classes sample. That way , the final model can generalize better to the majority classes so that even if by chance a single sample is not good enough, it gets corrected by the rest of the models.