The data was extracted from UniProtKB/Swiss-Prot, the expertly curated component of UniProtKB (produced by the UniProt consortium). It contains hundreds of thousands of protein descriptions, including function, domain structure, subcellular location, post-translational modifications and functionally characterized variants. For this task, we selected 18206 protein sequences from this database, together with their subcellular locations. UniProt is one of the most widely used protein information resources in the world.
The training data contains ~18 000 protein sequences of varying length, each labelled with its cell location. The test data contains ~6 500 protein sequences whose cell locations are withheld.
In this challenge, your task is to predict where in the cell each protein is likely to be located, using the protein's amino acid sequence.
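As a starting point, one common way to turn a variable-length amino acid sequence into a fixed-length input for a classifier is its amino acid composition. The sketch below is only an illustrative baseline, not the organizers' method: the function name `composition_features` and the choice to ignore non-standard residue codes are assumptions made for this example.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Hypothetical helper for illustration: residues outside the 20
    standard codes (e.g. X, B, Z, U) are ignored.
    """
    counts = Counter(sequence.strip().upper())
    # Normalize by the number of standard residues only.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        # Empty sequence or no standard residues: return all zeros.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Example: a 20-dimensional vector regardless of sequence length.
features = composition_features("MKTAYIAKQR")
```

Because every sequence maps to a vector of the same length, proteins of different lengths can be fed to any standard classifier. Composition discards residue order, so it should be treated as a simple baseline rather than a competitive model.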
Variable Definitions