
UmojaHack Africa 2022: African Snake Antivenom Binding Challenge (ADVANCED)

Helping Africa · $3 000 USD
Natural Language Processing · Classification
252 joined · 112 active
Start: 19 Mar 2022 · Close: 20 Mar 2022 · Reveal: 20 Mar 2022
2nd Place Solution
Notebooks · 21 Mar 2022, 14:42

Hello Zindians,

This is the 2nd place solution for the advanced challenge of UmojaHack 2022.

I tried several different approaches, but many of them failed (the baseline was not bad after all).

Key Points:

  • For my validation strategy, I first binned the signal into intervals (step = 0.1) and split my data with stratification on these intervals (first sketch after this list). I also tried grouping by ProteinId (StratifiedGroupKFold), which seemed logical, but both my CV and LB scores got worse, so I dropped it.
  • I tried to turn the problem into a classification problem (predicting the interval), but this method also failed.
  • I tried introducing 1D CNNs after the embedding layers (second sketch below). This approach didn't improve performance on its own, but I think it helps when blending.
  • I made the architecture bigger (larger embeddings, more LSTM layers, ...).
  • I didn't use only MSE loss; I also used Huber loss with different deltas (third sketch below). Huber loss behaves like MSE while the error is smaller than delta and switches to a linear, MAE-like penalty beyond it. This makes the model more robust to outliers and gives it more freedom to explore (double-edged, since large errors are penalized less).
  • I tried ProtBERT in place of the embedding layers (for feature extraction), but it was so slow that I couldn't train for many epochs, so I didn't use it.
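
A minimal sketch of the interval-based stratified split, assuming a pandas DataFrame with a continuous target; the column name `signal` is illustrative, not the competition's:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data standing in for the real training set.
df = pd.DataFrame({"signal": np.random.rand(1000)})

# Bin the continuous target into intervals of width 0.1 and stratify on the bin.
df["bin"] = (df["signal"] // 0.1).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(df, df["bin"])):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```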
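
A rough PyTorch sketch of the embedding → 1D CNN → LSTM idea; the vocabulary size, layer widths, and pooling head are illustrative assumptions, not the exact architecture:

```python
import torch
import torch.nn as nn

class EmbedCNNLSTM(nn.Module):
    def __init__(self, vocab_size=30, emb_dim=128, conv_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Conv1d expects (batch, channels, length), so we transpose around it.
        self.conv = nn.Conv1d(emb_dim, conv_dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                                  # x: (batch, seq_len) token ids
        h = self.emb(x)                                    # (batch, seq_len, emb_dim)
        h = self.conv(h.transpose(1, 2)).relu().transpose(1, 2)
        out, _ = self.lstm(h)                              # (batch, seq_len, 2 * hidden)
        return self.head(out.mean(dim=1)).squeeze(-1)      # mean-pooled regression output

model = EmbedCNNLSTM()
print(model(torch.randint(0, 30, (4, 50))).shape)          # torch.Size([4])
```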
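
And a small illustration of Huber loss next to MSE using PyTorch's built-in `nn.HuberLoss`; the delta values are placeholders, not the ones used in the solution:

```python
import torch
import torch.nn as nn

preds = torch.randn(8)
targets = torch.randn(8)

# MSE plus Huber at a few deltas: below delta the penalty is quadratic
# (MSE-like), above it the penalty grows linearly (MAE-like), which softens
# the pull of outliers.
losses = {"mse": nn.MSELoss()}
for delta in (0.5, 1.0, 2.0):
    losses[f"huber_{delta}"] = nn.HuberLoss(delta=delta)

for name, fn in losses.items():
    print(name, fn(preds, targets).item())
```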

Link: https://github.com/Mo5mami/UmojaHack-Africa-2022-African-Snake-Antivenom-Binding-Challenge

Discussion (7 answers)
_MUFASA_

Interesting... thanks for sharing, and congrats @mo5mami.

btw, I was able to fine-tune ProtAlbert as well as ProtBert_bfd (on an RTX 3090, lol), but the results were not so great. My guess is that the sequences are too short, so even freezing all the layers did not really help.

Anyways, I'll keep investigating. Once again, kudos!

21 Mar 2022, 15:06
Insat

I tried freezing the layers, using a classifier head after ProtBert_bfd, and removing the LSTM, but the loss was too bad, so I thought it was a bad idea. Then I removed the LSTM and swapped the k-mer embedding layer for ProtBERT (and of course lowered the LR, either for ProtBERT only or for the whole architecture). After 10 epochs the results didn't seem that bad (a bit worse than my main pipeline), so I stopped trying. A rough sketch of that setup is below.
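
A minimal sketch of that second setup, assuming the Hugging Face Rostlab/prot_bert_bfd checkpoint, a mean-pooled regression head, and illustrative learning rates; none of these specifics are confirmed by the post:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class ProtBertRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("Rostlab/prot_bert_bfd")
        self.head = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state.mean(dim=1)).squeeze(-1)

model = ProtBertRegressor()
# Lower LR for the pretrained backbone, higher LR for the fresh head.
optimizer = torch.optim.AdamW([
    {"params": model.bert.parameters(), "lr": 1e-5},
    {"params": model.head.parameters(), "lr": 1e-3},
])
```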

DanielBruintjies

Awesome and congrats @mo5mami!

21 Mar 2022, 15:14

Thank you for sharing and

21 Mar 2022, 17:18
100i
Ghana Health Service

Impressive approach and nice experiments! Congrats @mo5mami!!

Extracting ProtBert_bfd and BLOSUM62 amino-acid-composition features were promising FE ideas that I also explored, but they were very compute-intensive despite the short, fixed sequence length. Eventually, fast encoding and extracting embeddings in chunks helped a bit (rough sketch below), but I couldn't get an SVR or a CatBoost regressor to beat the benchmark. It would've been interesting to see how the starter baseline architecture performed with these features in place of the LSTMs, or a combination.
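
A rough sketch of the chunked embedding extraction feeding a classical regressor, assuming the Rostlab/prot_bert_bfd checkpoint; the toy sequences, targets, and chunk size are illustrative:

```python
import numpy as np
import torch
from sklearn.svm import SVR
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert_bfd").eval()

# ProtBERT expects space-separated residues; these sequences are toy examples.
sequences = ["M K T A Y I A K Q R", "G A V L I P F W"]
targets = np.array([0.3, 0.7])

embeddings = []
with torch.no_grad():
    for i in range(0, len(sequences), 16):                     # extract in chunks
        batch = tokenizer(sequences[i:i + 16], padding=True, return_tensors="pt")
        pooled = model(**batch).last_hidden_state.mean(dim=1)  # mean-pool tokens
        embeddings.append(pooled.numpy())

X = np.vstack(embeddings)
svr = SVR().fit(X, targets)                                    # classical regressor on top
```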

Insat

I think the short sequences are boosting the abilities of the LSTM, since that layer seems to struggle with longer sequences. After all, the LSTM struggled with longer protein sequences in the last two UmojaHacks (sequence lengths were 300 to 500, if I remember correctly).

100i
Ghana Health Service

That is correct! I completely agree!