InstaDeep Enzyme Classification Challenge
Job Interview
Can you predict the class of an enzyme using only its amino acid sequence?
364 data scientists enrolled, 70 on the leaderboard
Biology, Classification, Structured
17 November 2020 to 21 February 2021
97 days

The data for this competition consists of labelled amino acid sequences. Each record has a unique ID, the amino acid sequence, the organism it came from, and the label. You must predict the label for each sequence in the test set. Each label is one of 20 classes. There are ten organisms: eight in the training set and two in the test set. Sequences above a set length have been excluded from this dataset.
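Because the two test organisms do not appear in the training set, a random row-level split may give an optimistic estimate of test performance. The following is a minimal sketch of an organism-level hold-out split, not a prescribed method. The function name `leave_organisms_out` and the organism names in the toy example are hypothetical; the real organism values come from the competition files.

```python
from collections import defaultdict


def leave_organisms_out(organisms, n_holdout=2):
    """Yield (train_idx, valid_idx) pairs, holding out n_holdout organisms per fold.

    `organisms` is a list with one organism name per training row.
    """
    by_org = defaultdict(list)
    for i, org in enumerate(organisms):
        by_org[org].append(i)

    names = sorted(by_org)  # fixed order so folds are reproducible
    for start in range(0, len(names), n_holdout):
        held = set(names[start:start + n_holdout])
        valid = [i for o in names if o in held for i in by_org[o]]
        train = [i for o in names if o not in held for i in by_org[o]]
        yield train, valid


# Toy example with hypothetical organism names (one entry per row).
organisms = ["org_a", "org_b", "org_a", "org_c", "org_d", "org_b"]
folds = list(leave_organisms_out(organisms, n_holdout=2))
```

With eight training organisms and `n_holdout=2`, this would produce four folds, each mimicking the situation of predicting on two unseen organisms.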

In addition to the labelled data, you are also provided with a large set of unlabelled sequences. You may use these for any model pre-training or data augmentation methods you choose. You may NOT use any external data for this competition.
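One common way to use unlabelled sequences for pre-training is a masked-residue objective, where some positions are hidden and a model learns to recover them. The sketch below only builds such training pairs and assumes nothing about the model. The mask symbol `#`, the 15% mask rate, the function name `mask_sequence`, and the example sequence are all illustrative assumptions, not part of the competition materials.

```python
import random

MASK = "#"  # assumed placeholder symbol; not an amino acid letter


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Return (masked_seq, targets), where targets maps position -> original residue."""
    rng = rng or random.Random()
    # Mask at least one position, but never more than the sequence has.
    n_mask = min(len(seq), max(1, round(len(seq) * mask_rate)))
    positions = rng.sample(range(len(seq)), n_mask)

    chars = list(seq)
    targets = {}
    for p in positions:
        targets[p] = chars[p]
        chars[p] = MASK
    return "".join(chars), targets


# Illustrative sequence only; real sequences come from UnlabelledSequences.zip.
original = "MKTAYIAKQRQISFVKSHFSRQ"
masked, targets = mask_sequence(original, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the masking reproducible, which helps when comparing pre-training runs.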

Files available for download:

  • Train.csv - contains an ID, a string indicating the protein, and the target. This is the dataset that you will use to train your model.
  • Test.csv - resembles Train.csv but without the target-related column. This is the dataset to which you will apply your model.
  • SampleSubmission.csv - shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv and the target columns containing your predictions. The order of the rows does not matter, but the ID values must be correct.
  • UnlabelledSequences.zip - Additional unlabelled sequences for language modelling or other unsupervised learning tasks.
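Since row order does not matter but the IDs must match, it may be worth checking a submission's IDs against Test.csv before uploading. The sketch below uses only the standard library and assumes the ID column is named `ID`, as described above. The other column names (`SEQUENCE`, `LABEL`), the ID values, and the helper names are hypothetical; in-memory text stands in for the real files.

```python
import csv
import io
from collections import Counter


def read_ids(handle, id_column="ID"):
    """Read the ID column from an open CSV file handle."""
    return [row[id_column] for row in csv.DictReader(handle)]


def check_submission_ids(test_ids, submission_ids):
    """Report IDs that are missing, unexpected, or duplicated in a submission."""
    test_set, sub_set = set(test_ids), set(submission_ids)
    counts = Counter(submission_ids)
    return {
        "missing": sorted(test_set - sub_set),
        "unexpected": sorted(sub_set - test_set),
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
    }


# In-memory stand-ins for Test.csv and a submission file (hypothetical contents).
test_csv = io.StringIO("ID,SEQUENCE\nid_1,MKT\nid_2,AYI\nid_3,QRQ\n")
submission_csv = io.StringIO("ID,LABEL\nid_3,x\nid_1,y\nid_1,y\nid_9,z\n")

report = check_submission_ids(read_ids(test_csv), read_ids(submission_csv))
```

For real files, the same functions can be called with handles from `open("Test.csv", newline="")` and the submission path; a submission is consistent when all three lists in the report are empty.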