Special thanks to the organizers of Deep Learning Indaba, Zindi, and InstaDeep for providing such a fun and challenging competition. We are grateful for the opportunity to participate, learn, and collaborate with fellow participants!
We also tried using the InstaNovo+ diffusion model to optimize the confidence scores and correct beam 0 predictions. It took considerable time to train, but the end result was not great.
We noticed that none of the filtering techniques were applied to the test data and that the starter notebook simply returns the beam_0 values and their associated confidences as the predictions. This was the benchmark we had to surpass.
Result - [Above Starter NoteBook Baseline]
We rigorously applied all the filtering techniques that were implemented in the starter notebook to the test data [with the exception of the Logistic Regression approach]. This gave a score above the benchmark.
Results - [Further Improvement in Score]
The next phase was tweaking the filter settings to squeeze more performance out of the starter solution. We experimented with threshold optimization to fine-tune the targets and filter out more false positives. This led to a noticeable improvement in our score.
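The threshold sweep can be sketched roughly as follows. This is an illustrative toy, not the starter notebook's code: the record layout, the candidate grid, and the precision-times-recall proxy score are all assumptions.

```python
def best_threshold(confidences, is_correct, candidates):
    """Pick the confidence cutoff that maximizes precision * recall
    (a simple proxy score) on a labelled validation set."""
    total_correct = sum(is_correct)
    best_t, best_score = None, -1.0
    for t in candidates:
        kept = [ok for c, ok in zip(confidences, is_correct) if c >= t]
        if not kept:
            continue
        precision = sum(kept) / len(kept)
        recall = sum(kept) / total_correct
        score = precision * recall
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# toy validation data: beam_0 confidences and whether beam_0 matched the target
conf = [0.95, 0.90, 0.85, 0.60, 0.40, 0.30]
correct = [True, True, True, False, False, False]
chosen = best_threshold(conf, correct, [0.1, 0.5, 0.8])  # → 0.8 on this toy data
```

In practice the candidate grid was evaluated against the validation metric, and the winning cutoff was then applied to the test predictions.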
Results - [Further Improvement in Score]
All of the false positives identified in step 1 were removed (i.e. replaced with empty targets, since they were incorrect) and their confidence was reset to negative infinity (zero probability). This enhanced the model's precision and pushed our score further.
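A minimal sketch of that step, assuming a simple dict-per-prediction layout (the field names here are illustrative, not from our actual pipeline):

```python
import math

def zero_out_false_positives(predictions, false_positive_ids):
    """Blank out flagged false positives and push their confidence to
    -inf (zero probability) so they never outrank genuine predictions."""
    for p in predictions:
        if p["id"] in false_positive_ids:
            p["sequence"] = ""           # replace the incorrect target with empty
            p["confidence"] = -math.inf  # i.e. zero probability
    return predictions

preds = [
    {"id": 0, "sequence": "PEPTIDEK", "confidence": 0.91},
    {"id": 1, "sequence": "WRONGSEQ", "confidence": 0.42},
]
zero_out_false_positives(preds, {1})
```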
Results - [Further Boost in Performance]
We created a retention time filter. Combining the first-stage filter with a retention time filter provided a significant performance boost, as it reduced noise and false positives more effectively.
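We are not reproducing the exact rule here, but the idea can be sketched as an outlier filter on retention time: predictions whose retention time sits far outside the bulk of the run are treated as likely noise. The median/MAD cutoff below is an illustrative choice, not our exact implementation.

```python
import statistics

def retention_time_filter(records, k=5.0):
    """Keep (retention_time, row) pairs whose retention time lies within
    k median-absolute-deviations of the run's median retention time."""
    rts = [rt for rt, _ in records]
    med = statistics.median(rts)
    mad = statistics.median(abs(rt - med) for rt in rts)
    if mad == 0:
        return list(records)  # degenerate case: no spread, keep everything
    return [(rt, row) for rt, row in records if abs(rt - med) <= k * mad]

# toy run: three plausible elution times and one obvious outlier
data = [(12.1, "a"), (12.4, "b"), (11.9, "c"), (250.0, "d")]
kept = retention_time_filter(data)  # the 250.0 record is dropped
```

Combining this with the confidence-based filtering removed noise the confidence score alone could not catch.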
Results - [Absolute Banger]
We created additional, simple features based on the beam predictions and summary statistics (mean, median, standard deviation) of the mass-to-charge and intensity arrays. Training a LightGBM classifier on these features led to a major leap in performance, proving to be the key differentiator in our solution. This model was responsible for fine-tuning the confidence scores.
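A hedged sketch of the feature step (the function and field names are assumptions, not our exact code): each spectrum carries m/z and intensity arrays, which we summarise with simple statistics; a table of such rows, labelled by whether beam_0 matched the true peptide, is what the LightGBM classifier (`lightgbm.LGBMClassifier`) was trained on.

```python
import numpy as np

def spectrum_features(mz, intensity, confidence, seq_len):
    """Summary-statistic features for one (spectrum, beam-0 prediction) pair."""
    mz, intensity = np.asarray(mz, float), np.asarray(intensity, float)
    return {
        "conf": confidence,                  # beam_0 confidence from InstaNovo
        "seq_len": seq_len,                  # length of the predicted peptide
        "n_peaks": mz.size,
        "mz_mean": mz.mean(), "mz_median": np.median(mz), "mz_std": mz.std(),
        "int_mean": intensity.mean(), "int_median": np.median(intensity),
        "int_std": intensity.std(),
    }

feats = spectrum_features([100.0, 200.0, 300.0], [1.0, 2.0, 3.0], 0.9, 8)
```

The classifier's predicted probabilities then replaced the raw beam confidences when ranking and filtering the test predictions.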
Reading the InstaNovo paper and attending the office hours with the InstaDeep team were key. The session on "Snake venom analysis and sequencing using Large Language Models" with Nicolas Lopez at Indaba also gave clarity to the whole essence of peptides and mass spectrometry.
The winning strategy was applying the right combination of filtering and feature engineering to build a strong machine learning model, and doing so in the right order. Iteratively testing and optimizing these components helped us steadily climb the leaderboard and ultimately finish in first place in one of the most challenging hackathons we have ever done.
Notebook: https://github.com/aifenaike/kasa_instadeep_Hack_DLI_2024
Great work, Team Kasa!!