The data for this competition consists of labelled amino acid sequences. Each record has a unique ID, the amino acid sequence, the organism it came from, and the label. You must predict the label for each sequence in the test set. Each label is one of 20 classes. There are 10 organisms: 8 in the training set and 2 in the test set. Sequences above a set length have been excluded from this dataset.
In addition to the labelled data, you are also provided with a large set of unlabelled sequences. You may use these for any model pre-training or data augmentation methods you choose. You may NOT use any external data for this competition.
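Because the two test organisms do not appear in the training set, a validation split that holds out whole organisms may reflect the test condition more closely than a random split. The following is a minimal sketch of that idea, not a prescribed method. The field names (`id`, `sequence`, `organism`, `label`), the helper `split_by_organism`, and the toy records are all hypothetical and should be adapted to the actual files.

```python
import random
from collections import defaultdict


def split_by_organism(records, n_holdout=2, seed=0):
    """Hold out every sequence from `n_holdout` randomly chosen organisms.

    `records` is a list of dicts with an "organism" key (assumed field name).
    Returns (train, valid, held_out_organisms).
    """
    by_org = defaultdict(list)
    for rec in records:
        by_org[rec["organism"]].append(rec)

    organisms = sorted(by_org)  # sorted so the split is reproducible
    if n_holdout >= len(organisms):
        raise ValueError("n_holdout must be smaller than the number of organisms")

    rng = random.Random(seed)
    held_out = set(rng.sample(organisms, n_holdout))

    train = [r for r in records if r["organism"] not in held_out]
    valid = [r for r in records if r["organism"] in held_out]
    return train, valid, held_out


# Toy stand-in for the labelled training data: 8 organisms, 20 classes.
records = [
    {"id": f"seq{i}", "sequence": "MKV", "organism": f"org{i % 8}", "label": i % 20}
    for i in range(40)
]

train, valid, held_out = split_by_organism(records, n_holdout=2, seed=0)
train_orgs = {r["organism"] for r in train}
valid_orgs = {r["organism"] for r in valid}
```

Holding out 2 of the 8 training organisms mimics the 8/2 train/test organism split described above; repeating the split with different seeds gives a rough picture of how performance varies across unseen organisms.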
Files available for download: