My model is a fine-tuned pretrained model that was trained on billions of protein sequences (the BFD dataset). It is based on the Transformer architecture - BERT in particular. The pretrained model can be found here. There was no need for me to use the given unlabelled sequences. There is a similar pretrained model trained on less data; both give good results. As they are integrated into the HuggingFace library, one can easily use them in the same fashion as other popular models like BERT.
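As a small illustration of how little extra work the HuggingFace integration needs: protein BERT models of this family typically expect the input sequence as space-separated amino acids, with rare residues mapped to `X`. This is a sketch of that preprocessing convention (the exact rules depend on the checkpoint you load, so treat the details as an assumption, not my exact code):

```python
import re

def preprocess_sequence(seq: str) -> str:
    """Format a raw protein sequence for a ProtBert-style tokenizer:
    uppercase, rare amino acids (U, Z, O, B) replaced by X, and
    residues separated by single spaces."""
    seq = seq.upper()
    seq = re.sub(r"[UZOB]", "X", seq)
    return " ".join(seq)

# A short raw sequence becomes tokenizer-ready text.
print(preprocess_sequence("MKTUFF"))  # -> "M K T X F F"
```

After this step the string can be passed straight to the model's tokenizer, exactly as one would with an English sentence and BERT.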
There was no need for much fine-tuning, as one could reach 90%+ accuracy with little to no tuning. The dataset is also large enough for the model to learn varied patterns.
Due to the number of model parameters (~420M), the large data size, and the high maximum sequence length (384), I had to use a TPU (v3-8) for fast model training. One epoch takes ~60 minutes.
I used TensorFlow as it's more TPU friendly.
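For anyone wanting to reproduce the setup, the standard TensorFlow TPU initialization looks roughly like this. It is a sketch, not my exact training code - the resolver arguments and the model inside the scope depend on your environment and task; here it falls back to the default CPU/GPU strategy when no TPU is available:

```python
import tensorflow as tf

try:
    # On a Colab TPU runtime or GCP TPU VM the resolver finds the TPU
    # automatically; elsewhere this raises and we fall back to the
    # default (CPU/GPU) strategy.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
except ValueError:
    strategy = tf.distribute.get_strategy()

# Build the model (e.g. the fine-tuning head on top of the pretrained
# encoder) inside the scope so its variables land on the TPU replicas.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])

print("replicas:", strategy.num_replicas_in_sync)
```

On a v3-8 this reports 8 replicas, which is what makes one epoch feasible in about an hour at this model size.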
I will post a link to the code after review by Zindi.
Thank you for this explanation
Congrats and thanks for sharing
Congratulations and thank you for sharing
Nice work, bro. Congratulations!
Thank you brother! Looking forward to the code as well! Best of luck on the interview! :)
The rules of the competition stated: "Specifically, we should be able to re-create your submission on a single-GPU machine (eg Nvidia P100) with less than 8 hours training and two hours inference." For that same reason, I didn't use transformers.