Special thanks to the organizers of Deep Learning Indaba, Zindi, and InstaDeep for providing such a fun and challenging competition. We are grateful for the opportunity to participate, learn, and collaborate with fellow participants!
We also tried using the InstaNovo+ diffusion model to optimize the confidence scores and correct beam 0 predictions. It took considerable time to train, but the end result was not great.
We noticed that none of the filtering techniques were applied to the test data and that the starter notebook simply returns the beam_0 values and their associated confidences as the predictions. This was the benchmark we had to surpass.
Result - [Above Starter NoteBook Baseline]
We rigorously applied all the filtering techniques that were implemented in the starter notebook to the test data [with the exception of the Logistic Regression approach]. This gave a score above the benchmark.
Results - [Further Improvement in Score]
The next phase was tweaking the filter settings to squeeze more performance out of the starter solution. We experimented with threshold optimization to fine-tune the targets and filter out more false positives. This led to a noticeable improvement in our score.
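The threshold sweep can be sketched roughly as follows. This is an illustrative toy, not the starter notebook's code: the record layout, the candidate grid, and the precision-times-recall proxy score are all assumptions.

```python
def best_threshold(confidences, is_correct, candidates):
    """Pick the confidence cutoff that maximizes precision * recall
    (a simple proxy score) on a labelled validation set."""
    total_correct = sum(is_correct)
    best_t, best_score = None, -1.0
    for t in candidates:
        kept = [ok for c, ok in zip(confidences, is_correct) if c >= t]
        if not kept:
            continue
        precision = sum(kept) / len(kept)
        recall = sum(kept) / total_correct
        score = precision * recall
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# toy validation data: beam_0 confidences and whether beam_0 matched the target
conf = [0.95, 0.90, 0.85, 0.60, 0.40, 0.30]
correct = [True, True, True, False, False, False]
chosen = best_threshold(conf, correct, [0.1, 0.5, 0.8])  # → 0.8 on this toy data
```

In practice the candidate grid was evaluated against the validation metric, and the winning cutoff was then applied to the test predictions.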
Results - [Further Improvement in Score]
All of the false positives identified in step 1 were removed (i.e. replaced with empty targets, since they were incorrect) and their confidence was reset to negative infinity (zero probability). This enhanced the model's precision and pushed our score further.
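A minimal sketch of that step, assuming a simple dict-per-prediction layout (the field names here are illustrative, not from our actual pipeline):

```python
import math

def zero_out_false_positives(predictions, false_positive_ids):
    """Blank out flagged false positives and push their confidence to
    -inf (zero probability) so they never outrank genuine predictions."""
    for p in predictions:
        if p["id"] in false_positive_ids:
            p["sequence"] = ""           # replace the incorrect target with empty
            p["confidence"] = -math.inf  # i.e. zero probability
    return predictions

preds = [
    {"id": 0, "sequence": "PEPTIDEK", "confidence": 0.91},
    {"id": 1, "sequence": "WRONGSEQ", "confidence": 0.42},
]
zero_out_false_positives(preds, {1})
```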
Results - [Further Boost in Performance]
We created a retention time filter. Combining the first-stage filter with a retention time filter provided a significant performance boost, as it reduced noise and false positives more effectively.
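We are not reproducing the exact rule here, but the idea can be sketched as an outlier filter on retention time: predictions whose retention time sits far outside the bulk of the run are treated as likely noise. The median/MAD cutoff below is an illustrative choice, not our exact implementation.

```python
import statistics

def retention_time_filter(records, k=5.0):
    """Keep (retention_time, row) pairs whose retention time lies within
    k median-absolute-deviations of the run's median retention time."""
    rts = [rt for rt, _ in records]
    med = statistics.median(rts)
    mad = statistics.median(abs(rt - med) for rt in rts)
    if mad == 0:
        return list(records)  # degenerate case: no spread, keep everything
    return [(rt, row) for rt, row in records if abs(rt - med) <= k * mad]

# toy run: three plausible elution times and one obvious outlier
data = [(12.1, "a"), (12.4, "b"), (11.9, "c"), (250.0, "d")]
kept = retention_time_filter(data)  # the 250.0 record is dropped
```

Combining this with the confidence-based filtering removed noise the confidence score alone could not catch.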
Results - [Absolute Banger]
We created additional, simple features based on the beam predictions and summary statistics (mean, median, standard deviation) of the mass-to-charge and intensity arrays. Training a LightGBM classifier on these features led to a major leap in performance, proving to be the key differentiator in our solution. This model was responsible for fine-tuning the confidence scores.
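A hedged sketch of the feature step (the function and field names are assumptions, not our exact code): each spectrum carries m/z and intensity arrays, which we summarise with simple statistics; a table of such rows, labelled by whether beam_0 matched the true peptide, is what the LightGBM classifier (`lightgbm.LGBMClassifier`) was trained on.

```python
import numpy as np

def spectrum_features(mz, intensity, confidence, seq_len):
    """Summary-statistic features for one (spectrum, beam-0 prediction) pair."""
    mz, intensity = np.asarray(mz, float), np.asarray(intensity, float)
    return {
        "conf": confidence,                  # beam_0 confidence from InstaNovo
        "seq_len": seq_len,                  # length of the predicted peptide
        "n_peaks": mz.size,
        "mz_mean": mz.mean(), "mz_median": np.median(mz), "mz_std": mz.std(),
        "int_mean": intensity.mean(), "int_median": np.median(intensity),
        "int_std": intensity.std(),
    }

feats = spectrum_features([100.0, 200.0, 300.0], [1.0, 2.0, 3.0], 0.9, 8)
```

The classifier's predicted probabilities then replaced the raw beam confidences when ranking and filtering the test predictions.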
Reading the InstaNovo paper and attending the office hours with the InstaDeep team were key. The session on "Snake venom analysis and sequencing using Large Language Models" with Nicolas Lopez at Indaba also gave clarity to the whole essence of peptides and mass spectrometry.
The winning strategy was applying the right combination of filtering and feature engineering to build a strong machine learning model, and doing so in the right order. Iteratively testing and optimizing these components helped us steadily climb the leaderboard and ultimately finish in first place in one of the most challenging hackathons we have ever done.
Notebook: https://github.com/aifenaike/kasa_instadeep_Hack_DLI_2024
Great work, Team Kasa!!