
Unveiling Cassava's Secrets

Helping Ghana
$3 000 USD
Completed (over 2 years ago)
Classification
Natural Language Processing
148 joined
103 active
Start
Sep 05, 23
Close
Sep 08, 23
Reveal
Sep 09, 23
About

Each training sample is a 1000-base pair (bp) DNA sequence fetched from the cassava genome. The sequences are represented as a string of 1000 letters corresponding to the nucleotides (i.e., A, G, C, & T for the 4 nucleotide bases). For each sample, the associated class is labeled as 1 (i.e., positive) if the middle 200 bp region of the given sequence overlaps with an enhancer region by more than 50% of its length. Otherwise, it is labeled as 0 (i.e., negative). To strictly divide the training and test sets in a non-overlapping manner, we split them up by chromosomes, which are distinct organizational units of the cassava genome.
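The labeling rule above can be sketched in a few lines of Python. This is only an illustration, not the organizers' code. It assumes 0-based half-open coordinates, reads "more than 50% of its length" as more than 100 bp of the 200 bp middle window, and checks each enhancer separately rather than summing overlaps. The function names and the `enhancers` list of (start, end) pairs are hypothetical.

```python
from typing import Iterable, Tuple


def middle_window(seq_start: int, seq_len: int = 1000, window: int = 200) -> Tuple[int, int]:
    """Return the half-open [start, end) interval of the central window."""
    offset = (seq_len - window) // 2
    return seq_start + offset, seq_start + offset + window


def label_sequence(
    seq_start: int,
    enhancers: Iterable[Tuple[int, int]],
    seq_len: int = 1000,
    window: int = 200,
    min_fraction: float = 0.5,
) -> int:
    """Return 1 if the central window overlaps one enhancer by more than
    min_fraction of the window length, otherwise 0 (assumed reading of the rule)."""
    win_start, win_end = middle_window(seq_start, seq_len, window)
    for enh_start, enh_end in enhancers:
        # Length of the intersection of two half-open intervals (0 if disjoint).
        overlap = max(0, min(win_end, enh_end) - max(win_start, enh_start))
        if overlap > min_fraction * window:
            return 1
    return 0


# A sequence starting at position 0 has its middle window at [400, 600).
print(label_sequence(0, [(400, 501)]))  # 101 bp overlap
print(label_sequence(0, [(400, 500)]))  # exactly 100 bp overlap
```

Under these assumptions the first call is labeled positive and the second negative, because the threshold is strict ("more than 50%").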

The dataset consists of a CSV file with the following columns:

  • ID: Unique ID for each row.
  • Sequence: 1000-base pair DNA sequence
  • Chromosome: Distinct organizational units of the cassava genome
  • Region: The region of the considered Sequence
  • Target: The value you are predicting: 1 if the middle 200 bp region of the given sequence overlaps with an enhancer region by more than 50% of its length, otherwise 0
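Since the training and test sets are separated by chromosome, one option for local validation is to hold out whole chromosomes using the Chromosome column. The sketch below shows the idea on toy rows. The helper name `split_by_chromosome` and the chromosome labels are hypothetical, and the real label format in the CSV may differ.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Dict[str, str]


def split_by_chromosome(rows: Iterable[Row], holdout: Set[str]) -> Tuple[List[Row], List[Row]]:
    """Send every row from a held-out chromosome to validation, the rest to training."""
    train_rows: List[Row] = []
    valid_rows: List[Row] = []
    for row in rows:
        if row["Chromosome"] in holdout:
            valid_rows.append(row)
        else:
            train_rows.append(row)
    return train_rows, valid_rows


# Toy rows using the column names listed above; all values are made up.
toy_rows = [
    {"ID": "a", "Sequence": "ACGT", "Chromosome": "chrA", "Target": "1"},
    {"ID": "b", "Sequence": "TTGA", "Chromosome": "chrB", "Target": "0"},
    {"ID": "c", "Sequence": "GGCA", "Chromosome": "chrA", "Target": "0"},
]
train_rows, valid_rows = split_by_chromosome(toy_rows, holdout={"chrB"})
print(len(train_rows), len(valid_rows))
```

On the real data, the rows could come from `csv.DictReader` over Train.csv. Holding out whole chromosomes keeps sequences from the same chromosome out of both sides of the split.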

The training dataset consists of 13225 sequences with roughly balanced classes, containing 6464 positive and 6761 negative sequences. The test dataset contains 5668 sequences. The public leaderboard represents approximately 30% of the test data.
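As a quick sanity check on the figures above, the arithmetic can be reproduced directly. The 30% public share is stated as approximate, so the public and private row counts below are estimates only.

```python
n_pos, n_neg = 6464, 6761      # class counts quoted above
n_train = n_pos + n_neg        # should match the stated training size
n_test = 5668
public_fraction = 0.30         # "approximately 30%", so treat the result as an estimate

pos_rate = n_pos / n_train
public_rows = round(public_fraction * n_test)
private_rows = n_test - public_rows

print(f"training size: {n_train}")
print(f"positive rate: {pos_rate:.1%}")
print(f"estimated public / private rows: {public_rows} / {private_rows}")
```

The positive rate comes out just under one half, which is why the classes are described as roughly balanced.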

In addition to the CSV files, we provide mean-pooled embeddings from InstaDeep’s AgroNT model. We compute these by passing each nucleotide sequence through AgroNT, taking the hidden representation at the final layer (layer 40), and averaging it along the sequence length. This yields a 1500-dimensional embedding for each sequence. The embeddings are provided in `Train_embeddings.npy` and `Test_embeddings.npy`, in the same row order as the Train and Test CSV files respectively. The files are saved as NumPy arrays; use `np.load(...)` to read them. See the starter notebook for an example of how to use them!
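The sketch below illustrates the two ideas in this paragraph: mean pooling over the sequence axis, and loading an embedding file while checking that its rows line up with the CSV. It then fits a nearest-centroid classifier as a deliberately simple stand-in baseline. This is not the starter notebook's method, and the helper names (`mean_pool`, `load_embeddings`, `fit_centroids`, `predict`) are hypothetical. Only NumPy is assumed.

```python
import numpy as np


def mean_pool(hidden: np.ndarray) -> np.ndarray:
    """Average a (tokens, dim) hidden-state matrix over the token axis."""
    return hidden.mean(axis=0)


def load_embeddings(path: str, expected_rows=None) -> np.ndarray:
    """Load a 2-D embedding array and optionally check its row count against the CSV."""
    emb = np.load(path)
    if emb.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {emb.shape}")
    if expected_rows is not None and emb.shape[0] != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {emb.shape[0]}")
    return emb


def fit_centroids(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return a (2, dim) array holding the mean embedding of class 0 and class 1."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])


def predict(X: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each row to the class whose centroid is closest (squared Euclidean)."""
    dists = np.stack([((X - c) ** 2).sum(axis=1) for c in centroids], axis=1)
    return dists.argmin(axis=1)
```

With the real files, usage might look like `X = load_embeddings("Train_embeddings.npy", expected_rows=13225)`, with `y` read from the Target column of Train.csv in the same row order.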

Files

  • Starter notebook: helps you make your first submission.
  • Introduction presentation.
  • Variable definitions: describes the variables found in train and test.
  • Sample submission: an example of what your submission file should look like. The order of the rows does not matter, but the "ID" values must be correct.
  • Test.csv: resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
  • Test_embeddings.npy: AgroNT embeddings of the nucleotide sequences, in the same order as Test.csv.
  • Train_embeddings.npy: AgroNT embeddings of the nucleotide sequences, in the same order as Train.csv.
  • Train.csv: contains the target. This is the dataset that you will use to train your model.