10 Mar 2020, 11:06

Meet the winners of the TIC-HEAP Cirta Particle Challenge

Zindi is excited to announce the winners of the TIC-HEAP Cirta Particle Challenge. The challenge attracted 142 data scientists from across the continent and around the world, with 32 placing on the leaderboard.

The goal of this challenge was to build a machine learning model to predict which type of particle is present in images of simulated collisions from the ATLAS experiment at the Large Hadron Collider at CERN in Geneva. The metric for this challenge was log loss.

The winners of this challenge are: Heisenberg_was_right (Mohamed Jedidi, Belamine Medamine and Youssef Fadhloun) from Tunisia in 1st place, Jonathan Whitaker from Zimbabwe in 2nd place and Temitope Mariam Atoyebi from Nigeria in 4th place.

A special thank you to the winners for their generous feedback. Here are their insights.

Heisenberg_was_right (1st place)

Zindi handle: Mohamed_Salem_Jedidi, Blenz and FADHLOUN

Where are you from? Tunisia

Tell us about the approach you took. (Github Repo)

The final solution was a fine-tuned CatBoost classifier. We used an ensemble of CatBoost models, each trained on a different portion of the data.
The main problem in this challenge was the class imbalance. You can quickly notice that there are three dominant classes with thousands of samples each, while two classes have only around 1,300 samples. It is also mentioned that the test set was made to be balanced. If you train a model on the data as it is, it will mostly predict the majority classes due to the severe class imbalance.
Data Preprocessing: the images' pixels were flattened and stacked into a dataframe, with each row representing an image and each column a single pixel. We were then able to run scikit-learn and boosting algorithms on the newly created tabular data to perform classification.
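To make that preprocessing step concrete, here is a minimal sketch of the flattening, assuming the events are already loaded as a NumPy array of images; the array and column names are illustrative, not taken from the winners' repo.

```python
import numpy as np
import pandas as pd

# Sketch only: `images` is assumed to be a (n_samples, height, width) array of
# event images and `labels` the corresponding particle classes.
def images_to_dataframe(images: np.ndarray, labels: np.ndarray) -> pd.DataFrame:
    n_samples = images.shape[0]
    flat = images.reshape(n_samples, -1)  # one row per image
    df = pd.DataFrame(flat, columns=[f"px_{i}" for i in range(flat.shape[1])])
    df["label"] = labels                  # one column per pixel, plus the target
    return df
```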
1st approach: Oversampling
The oversampling approach did not yield good results due to the small number of samples in the underrepresented particle classes compared to the larger classes.
2nd approach: Undersampling
By undersampling all the classes to 1,300 samples (equal to the smallest class), you get a score of 1.56 with a default CatBoost classifier. That is a good score, but there was a way to improve it: undersampling to the second least populated class. Capping every class at 3,000 samples, while keeping the 1,300-sample class whole, yielded much better results; a default classifier with this approach scored 1.535.
Conclusion: 1,300 samples per class were not enough for the other four classes to be classified well by the model. Raising the undersampling cap to 3,000 helped the model classify these four classes better than the first approach, but you definitely lose some power in predicting the least populated class.
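A rough sketch of the capped undersampling described above is shown below; it assumes the flattened dataframe has a "label" column, and the 3,000-sample cap and random seed are illustrative choices rather than the team's exact code.

```python
import pandas as pd

# Cap each class at `cap` samples; smaller classes (e.g. the ~1,300-sample
# ones) are kept whole. This mirrors the second undersampling approach.
def undersample(df: pd.DataFrame, cap: int = 3000, seed: int = 42) -> pd.DataFrame:
    parts = []
    for label, group in df.groupby("label"):
        n = min(len(group), cap)
        parts.append(group.sample(n=n, random_state=seed))
    # Concatenate and shuffle the rows before training.
    return pd.concat(parts).sample(frac=1.0, random_state=seed)
```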

Jonathan Whitaker (2nd place)

Zindi handle: Johnowhitaker

Where are you from? Zimbabwe

Tell us a bit about yourself.

I'm a data science consultant and independent researcher living in Zimbabwe. I love learning new skills, and sharing what I find with others. You can see some of my ML-related experiments on datasciencecastnet.home.blog. Besides staring at a computer, I enjoy spending time in nature, and am always looking for work that gives me an excuse for 'field trips' :)

Tell us about the approach you took. (Colab notebook)

This task was tricky due to the whole class (im)balance thing. It bugged me that my first attempt at a model was barely better than simply predicting 0.2 everywhere (which scores ~1.6). So, when I saw the contest was about to close, I couldn't resist having another go!
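As a quick sanity check of that ~1.6 baseline: predicting a flat 0.2 for each of the five classes gives a log loss of ln(5) ≈ 1.609 regardless of the true labels. The labels below are randomly generated purely for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Uniform 0.2 predictions over 5 classes score -log(0.2) = ln(5) ~= 1.609.
y_true = np.random.randint(0, 5, size=1000)
y_pred = np.full((1000, 5), 0.2)
print(log_loss(y_true, y_pred, labels=[0, 1, 2, 3, 4]))  # ~1.6094
```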
I didn't spend much time on the modelling part - I threw a couple of models into an ensemble, planning to come back and tune everything later. Turned out they worked OK!
The main trick was keeping things fairly balanced, but also taking advantage of the large amounts of data for the more common classes by training each model on a different subset of the data, and by re-training models on just the more common classes to refine the probability estimates for those.
With that already giving top 3 scores, I found an extra bit of performance by scaling the probabilities that the ensemble produced. Since the data was so noisy, and the exact class balance of the test set was unclear, it made sense to dial back the more confident predictions. I wish I'd done more experimenting here, but even my random choices made a bit of a difference to the final score.
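One way to "dial back" over-confident predictions, in the spirit described above, is to blend the ensemble's probabilities with the uniform distribution. The blend weight below is a guess for illustration, not the value used in the winning notebook.

```python
import numpy as np

# Soften predictions: alpha=1.0 keeps them unchanged, smaller alpha pulls
# every row toward the uniform distribution over the classes.
def soften(probs: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    n_classes = probs.shape[1]
    uniform = np.full_like(probs, 1.0 / n_classes)
    return alpha * probs + (1.0 - alpha) * uniform
```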

What were the things that made the difference for you that you think others can learn from?

You see bold claims of x% accuracy all the time in ML, and it's always important to interrogate how the training/validation data matches up to the real world. If it's drawn from the same distribution as new data will be, then such claims make sense. But if you've changed the class balance, or worked on a nice clean 'pretend' dataset, real-world results can be a shock.
In this case, we had an interesting situation where the training data and the test data were totally different in terms of class balance - training on the imbalanced dataset as given would result in a model that does poorly on the test data. So it was key to think about how to bring the two closer together, and how to validate model performance on a dataset that matched the test set as closely as possible.
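A hedged sketch of how such a validation set could be built: hold out an equal number of samples per class before any resampling of the training data, so the holdout mirrors the balanced test set. The per-class size and column name are illustrative, not the notebook's exact code.

```python
import pandas as pd

# Build a class-balanced holdout set; assumes a unique index and a "label" column.
def balanced_holdout(df: pd.DataFrame, per_class: int = 200, seed: int = 0):
    val = df.groupby("label").sample(n=per_class, random_state=seed)
    train = df.drop(val.index)
    return train, val
```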

What are the biggest areas of opportunity you see for AI in Africa over the next few years?

Looking at the tools and tutorials available now, it's evident to me that it is easier than ever to teach oneself some AI skills. I think we'll see a new generation of ML practitioners appearing from unexpected places, without the need for large institutions or central hubs. My hope is that those who already have the skills will take the opportunity to pass on their knowledge, so that this new influx of problem-solvers can start to address the many challenges we all face.

What are you looking forward to most about the Zindi community?

It's fun seeing the community grow. I hope we can work together to focus on sharing knowledge as opposed to winning prizes, and I'm looking forward to hearing data scientists in years to come talk fondly about their beginner days trying things out on Zindi and getting help from the community with their first code baby-steps :)

Temitope Mariam Atoyebi (4th place)

Zindi handle: mariamtemi

Where are you from? Nigeria

Tell us a bit about yourself.

I am simply a machine learning enthusiast and I wish to become excellent at it.

Tell us about the approach you took. (Github Repo)

My solution was quite simple and straightforward. I used the built-in imbalance handling that comes with the random forest classifier instead of creating synthetic data points. I also tuned some of the hyperparameters in order to get the optimum solution. Random forest has a superb way of handling imbalanced data, whether it is binary or multiclass, and that is what I built my model on.
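The built-in imbalance handling mentioned here most likely refers to the class_weight option of scikit-learn's RandomForestClassifier; the other hyperparameters in this sketch are placeholders, not the winning values.

```python
from sklearn.ensemble import RandomForestClassifier

# "balanced_subsample" re-weights classes inside each bootstrap sample,
# which counteracts the skewed class frequencies without synthetic data.
clf = RandomForestClassifier(
    n_estimators=500,
    class_weight="balanced_subsample",
    random_state=42,
    n_jobs=-1,
)
# clf.fit(X_train, y_train); probs = clf.predict_proba(X_test)
```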

What were the things that made the difference for you that you think others can learn from?

The simplicity of my model. The random forest classifier's built-in imbalance handling has always been there, but we tend to ignore it; I would implore you to try it in subsequent solutions.

What are the biggest areas of opportunity you see for AI in Africa over the next few years?

Mainly in the agriculture and health industries.

Github repo for the 3rd place solution.

This competition was hosted by ATLAS.

What are your thoughts on our winners' feedback? Engage via the Discussion page or leave a comment on social media.