Protein Location Prediction Challenge by AI Hack

5 000 TND
Challenge completed over 3 years ago
Classification
133 joined
51 active
Start: Aug 29, 22
Close: Aug 30, 22
Reveal: Aug 31, 22
About

The data was extracted from UniProtKB/Swiss-Prot, the expertly curated component of UniProtKB (produced by the UniProt consortium). It contains hundreds of thousands of protein descriptions, including function, domain structure, subcellular location, post-translational modifications and functionally characterized variants. We selected 18206 protein sequences, together with their subcellular locations, from this database for this task. UniProt is one of the most widely used protein information resources in the world.

The train data contains ~18 000 protein sequences of varying lengths and cell locations. The test data contains ~6 500 protein sequences with the cell locations withheld.

In this challenge, you are tasked with predicting where in the cell a protein is likely to be located, using its amino acid sequence.

Variable Definitions

  • ID: the unique identifier for each sequence
  • Sequence: protein sequence
  • Kingdom: the species/kingdom of the protein
  • Seq_length: length of the protein sequence
  • Targets: SubCellular Location [Nucleus, Cytoplasm, Membrane, Cell membrane, Extracellular]
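As an illustration of how the Sequence column might be turned into model inputs, here is a minimal sketch that computes amino acid composition (the fraction of each of the 20 standard residues). This is only one simple baseline idea, not a method prescribed by the challenge; the function name `composition_features` and the toy row are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_features(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    # Non-standard letters (e.g. X, U) are ignored in the denominator.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Hypothetical toy row shaped like the columns described above.
row = {"ID": "seq_0", "Sequence": "MKKLAA", "Kingdom": "Bacteria", "Seq_length": 6}
features = composition_features(row["Sequence"])
```

The resulting 20-dimensional vector could be fed to any standard classifier, alone or alongside the provided embeddings.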

Files

  • ESM_1b classifier token embeddings for each sequence
  • Test.csv: resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
  • Train.csv: contains the target. This is the dataset that you will use to train your model.
  • Mean embeddings across the sequence
  • Sample submission file: shows the submission format for this competition, with the ‘ID’ column mirroring that of Test.csv. The order of the rows does not matter, but the values in the ‘ID’ column must be correct.
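Since row order does not matter but the IDs must match Test.csv exactly, it can help to validate the ID set before writing a submission. The sketch below shows one way to do that with the standard library. The column name "Target" and the single-label layout are assumptions (the actual layout is defined by the sample submission file), and `write_submission` is a hypothetical helper.

```python
import csv
import io

def write_submission(test_ids, predictions, out):
    """Write one row per test ID; `predictions` maps ID -> predicted label."""
    missing = set(test_ids) - set(predictions)
    extra = set(predictions) - set(test_ids)
    if missing or extra:
        raise ValueError(
            f"ID mismatch: {len(missing)} missing, {len(extra)} unexpected"
        )
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ID", "Target"])  # "Target" is an assumed column name
    for seq_id in test_ids:
        writer.writerow([seq_id, predictions[seq_id]])

# Hypothetical IDs and labels, for illustration only.
buffer = io.StringIO()
write_submission(
    ["seq_1", "seq_2"],
    {"seq_2": "Nucleus", "seq_1": "Membrane"},
    buffer,
)
submission_text = buffer.getvalue()
```

In practice `out` would be a file opened with `open(path, "w", newline="")`, and the header would be copied from the sample submission file rather than hard-coded.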