
UmojaHack Africa 2022: African Snake Antivenom Binding Challenge (ADVANCED)

Helping Africa · $3 000 USD
Natural Language Processing · Classification
252 joined · 112 active
Start: 19 Mar 2022 · Close: 20 Mar 2022 · Reveal: 20 Mar 2022
2nd Place Solution
Notebooks · 21 Mar 2022, 14:42

Hello Zindians,

This is the 2nd place solution for the advanced challenge of UmojaHack 2022.

I tried several different approaches, but many of them failed (the baseline was not bad after all).

Key Points:

  • For my validation strategy, I first binned the signal into intervals (step = 0.1) and split my data with stratification on these intervals (first sketch after this list). I also tried grouping by ProteinId (StratifiedGroupKFold), which seemed logical, but both my CV and LB scores got worse, so I dropped it.
  • I tried to turn the problem into a classification problem (predicting the interval), but this method also failed.
  • I tried introducing 1D CNNs after the embedding layers (second sketch below). This approach didn't improve performance on its own, but I think it helps when blending.
  • I made the architecture bigger (larger embeddings, more LSTM layers, ...).
  • I didn't use only MSE loss; I also used Huber loss with different deltas (third sketch below). Huber loss behaves like MSE while the error is smaller than delta and switches to a linear, MAE-like penalty beyond it. This makes the model more robust to outliers and gives it more freedom to explore (double-edged, since large errors are penalized less).
  • I tried ProtBERT in place of the embedding layers (for feature extraction), but it was so slow that I couldn't train for many epochs, so I didn't use it.
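
A minimal sketch of the interval-based stratified split, assuming a pandas DataFrame with a continuous target; the column name `signal` is illustrative, not the competition's:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data standing in for the real training set.
df = pd.DataFrame({"signal": np.random.rand(1000)})

# Bin the continuous target into intervals of width 0.1 and stratify on the bin.
df["bin"] = (df["signal"] // 0.1).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(df, df["bin"])):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```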
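
A rough PyTorch sketch of the embedding → 1D CNN → LSTM idea; the vocabulary size, layer widths, and pooling head are illustrative assumptions, not the exact architecture:

```python
import torch
import torch.nn as nn

class EmbedCNNLSTM(nn.Module):
    def __init__(self, vocab_size=30, emb_dim=128, conv_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Conv1d expects (batch, channels, length), so we transpose around it.
        self.conv = nn.Conv1d(emb_dim, conv_dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                                  # x: (batch, seq_len) token ids
        h = self.emb(x)                                    # (batch, seq_len, emb_dim)
        h = self.conv(h.transpose(1, 2)).relu().transpose(1, 2)
        out, _ = self.lstm(h)                              # (batch, seq_len, 2 * hidden)
        return self.head(out.mean(dim=1)).squeeze(-1)      # mean-pooled regression output

model = EmbedCNNLSTM()
print(model(torch.randint(0, 30, (4, 50))).shape)          # torch.Size([4])
```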
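
And a small illustration of Huber loss next to MSE using PyTorch's built-in `nn.HuberLoss`; the delta values are placeholders, not the ones used in the solution:

```python
import torch
import torch.nn as nn

preds = torch.randn(8)
targets = torch.randn(8)

# MSE plus Huber at a few deltas: below delta the penalty is quadratic
# (MSE-like), above it the penalty grows linearly (MAE-like), which softens
# the pull of outliers.
losses = {"mse": nn.MSELoss()}
for delta in (0.5, 1.0, 2.0):
    losses[f"huber_{delta}"] = nn.HuberLoss(delta=delta)

for name, fn in losses.items():
    print(name, fn(preds, targets).item())
```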

Link: https://github.com/Mo5mami/UmojaHack-Africa-2022-African-Snake-Antivenom-Binding-Challenge

Discussion (7 answers)
_MUFASA_

Interesting... thanks for sharing, and congrats @mo5mami.

btw, I was able to fine-tune ProtAlbert as well as ProtBert_bfd (on an RTX 3090, lol), but the results were not so great. My guess is that the sequences are too short, so even freezing all the layers did not really help.

Anyways, I'll keep investigating. Once again, kudos!

21 Mar 2022, 15:06
Insat

I tried freezing the layers, using a classifier head after ProtBert_bfd, and removing the LSTM, but the loss was too bad, so I thought it was a bad idea. Then I removed the LSTM and swapped the k-mer embedding layer for ProtBERT (and of course lowered the LR, either for ProtBERT only or for the whole architecture). After 10 epochs the results didn't seem that bad (a bit worse than my main pipeline), so I stopped trying. A rough sketch of that setup is below.
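
A minimal sketch of that second setup, assuming the Hugging Face Rostlab/prot_bert_bfd checkpoint, a mean-pooled regression head, and illustrative learning rates; none of these specifics are confirmed by the post:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class ProtBertRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("Rostlab/prot_bert_bfd")
        self.head = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state.mean(dim=1)).squeeze(-1)

model = ProtBertRegressor()
# Lower LR for the pretrained backbone, higher LR for the fresh head.
optimizer = torch.optim.AdamW([
    {"params": model.bert.parameters(), "lr": 1e-5},
    {"params": model.head.parameters(), "lr": 1e-3},
])
```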

DanielBruintjies

Awesome and congrats @mo5mami!

21 Mar 2022, 15:14

Thank you for sharing and

21 Mar 2022, 17:18
100i
Ghana Health Service

Impressive approach and nice experiments! Congrats @mo5mami!!

Extracting ProtBert_bfd and BLOSUM62 amino-acid-composition features were promising FE ideas that I also explored, but they were very compute-intensive despite the short, fixed sequence length. Eventually, fast encoding and extracting embeddings in chunks helped a bit (rough sketch below), but I couldn't get an SVR or a CatBoost regressor to beat the benchmark. It would've been interesting to see how the starter baseline architecture performed with these features in place of the LSTMs, or a combination.
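
A rough sketch of the chunked embedding extraction feeding a classical regressor, assuming the Rostlab/prot_bert_bfd checkpoint; the toy sequences, targets, and chunk size are illustrative:

```python
import numpy as np
import torch
from sklearn.svm import SVR
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert_bfd").eval()

# ProtBERT expects space-separated residues; these sequences are toy examples.
sequences = ["M K T A Y I A K Q R", "G A V L I P F W"]
targets = np.array([0.3, 0.7])

embeddings = []
with torch.no_grad():
    for i in range(0, len(sequences), 16):                     # extract in chunks
        batch = tokenizer(sequences[i:i + 16], padding=True, return_tensors="pt")
        pooled = model(**batch).last_hidden_state.mean(dim=1)  # mean-pool tokens
        embeddings.append(pooled.numpy())

X = np.vstack(embeddings)
svr = SVR().fit(X, targets)                                    # classical regressor on top
```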

Insat

I think the short sequences are boosting the abilities of the LSTM, since that layer seems to struggle with longer sequences. After all, the LSTM struggled with longer protein sequences in the last two UmojaHacks (sequence lengths were 300 to 500, if I remember correctly).

100i
Ghana Health Service

That is correct! I completely agree!