InstaDeep Enzyme Classification Challenge
Job Interview
Can you predict the class of an enzyme using only its amino acid sequence?
346 data scientists enrolled, 70 on the leaderboard
Biology, Classification, Structured
17 November 2020—21 February 2021
97 days
7th Place Solution Approach
published 23 Feb 2021, 06:14

Congratulations to the winners and many thanks to Zindi and InstaDeep for hosting this interesting competition.

My solution is a single transformer model, with no ensembling or fold averaging. The code is written in PyTorch.

Approach

  • Fine-tune the pre-trained prot_bert_bfd model from the Hugging Face model hub (prot_bert_bfd was pre-trained on amino acid sequences).
  • Map rare amino acids (U, Z, O, B) to X. This is how the data was pre-processed when prot_bert_bfd was trained.
  • Use a max sequence length of 256.
  • Use at most 11,000 samples per class and train for 2 epochs.
  • Use dropout and weight decay to regularize the model.
  • Set aside creature4 and creature5 data for validation.
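
The rare-residue mapping above can be sketched in a few lines. This is a minimal illustration, not my competition code: the function name `preprocess_sequence` is made up for the example, and the step that puts a space between residues is an assumption based on how the prot_bert_bfd tokenizer is normally fed.

```python
import re

# Rare amino acid codes that get mapped to the unknown residue "X".
RARE_RESIDUES = re.compile(r"[UZOB]")


def preprocess_sequence(sequence: str) -> str:
    """Map U, Z, O, B to X and separate residues with spaces.

    The spacing step is an assumption: the prot_bert_bfd tokenizer is
    usually given one residue per whitespace-separated token.
    """
    cleaned = RARE_RESIDUES.sub("X", sequence.strip().upper())
    return " ".join(cleaned)


print(preprocess_sequence("MKUZOBAL"))
```

The resulting string could then be passed to the model's tokenizer with `truncation=True` and `max_length=256` to apply the length limit from the list above.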

I trained the model in Kaggle notebooks (1 x P100 GPU). Training took approx. 8.3 hours per epoch and inference took approx. 2.2 hours. I limited training to 11,000 samples per class and a max length of 256 in order to stay within Kaggle's 9-hour GPU runtime limit.
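
For readers who want to reproduce the data setup, here is a rough sketch of the validation holdout and the per-class cap. It is not the original code: the function name `split_and_cap`, the row keys `creature` and `label`, and the choice to sample randomly after removing the holdout creatures are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_and_cap(rows, holdout=("creature4", "creature5"), cap=11000, seed=0):
    """Split rows into train/validation and cap the train rows per class.

    `rows` is a list of dicts with hypothetical keys "creature" and "label".
    Rows from the holdout creatures go to validation; the remaining rows are
    randomly subsampled so that no class has more than `cap` examples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    valid = []
    for row in rows:
        if row["creature"] in holdout:
            valid.append(row)
        else:
            by_class[row["label"]].append(row)

    train = []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)          # random subsample within the class
        train.extend(members[:cap])   # classes smaller than cap are kept whole
    rng.shuffle(train)                # mix classes before training
    return train, valid


# Tiny synthetic example with made-up rows.
rows = [
    {"creature": f"creature{i % 6}", "label": f"class{i % 3}", "sequence": "MK"}
    for i in range(60)
]
train, valid = split_and_cap(rows, cap=5)
print(len(train), len(valid))
```

Holding out whole creatures rather than random rows keeps the validation score closer to what the model sees on data from unseen organisms.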