Each training sample is a 1000-base-pair (bp) DNA sequence taken from the cassava genome. Each sequence is represented as a string of 1000 letters, one per nucleotide (A, G, C, or T for the four bases). A sample is labeled 1 (positive) if the middle 200 bp region of the sequence overlaps an enhancer region by more than 50% of its length; otherwise it is labeled 0 (negative). To keep the training and test sets strictly non-overlapping, we split them by chromosome; chromosomes are distinct organizational units of the cassava genome.
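The labeling rule can be illustrated with a small sketch. This is not the organizers' actual pipeline: the function names, the half-open coordinate convention, and the reading of "50% of its length" as 50% of the 200 bp middle window are all assumptions made for illustration.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Length of the overlap between half-open intervals [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def label_sequence(seq_start, enhancers, seq_len=1000, core_len=200, threshold=0.5):
    """Return 1 if the middle window overlaps any single enhancer by more than
    `threshold` of the window length, else 0.

    seq_start : genomic start coordinate of the 1000 bp sequence (hypothetical input)
    enhancers : list of (start, end) enhancer intervals on the same chromosome
    """
    # The middle 200 bp of a 1000 bp sequence spans offsets 400..600.
    core_start = seq_start + (seq_len - core_len) // 2
    core_end = core_start + core_len
    # Assumption: the rule is applied per enhancer region, so take the best single overlap.
    best = max(
        (overlap_length(core_start, core_end, s, e) for s, e in enhancers),
        default=0,
    )
    return 1 if best > threshold * core_len else 0
```

For example, with `seq_start=0` the middle window is `[400, 600)`, so an enhancer at `(450, 700)` covers 150 bp of it and gives a positive label, while one at `(500, 700)` covers exactly 100 bp and does not exceed the 50% threshold.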
The dataset consists of a CSV file with the following columns:
The training dataset consists of 13225 sequences with roughly balanced classes, containing 6464 positive and 6761 negative sequences. The test dataset contains 5668 sequences. The public leaderboard represents approximately 30% of the test data.
In addition to the CSV files, we provide mean-pooled embeddings from InstaDeep’s AgroNT model. We compute these by passing each nucleotide sequence through AgroNT, taking the hidden representation of the sequence at the final layer (layer 40), and averaging it along the sequence length. This results in a 1500-dimensional embedding for each sequence. The embeddings are provided in the `Train_embeddings.npy` and `Test_embeddings.npy` files, in the same order as the rows of the Train and Test CSV files respectively. These files are saved as NumPy arrays; use `np.load(...)` to read them in. See the starter notebook for an example of how to use them!
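As a rough sketch of working with these files, the snippet below shows what mean pooling does to a matrix of per-token hidden states and how an embedding file might be loaded and sanity-checked. The helper names `mean_pool` and `load_embeddings` are hypothetical, and the pooling function assumes an unpadded `(num_tokens, hidden_dim)` array; the starter notebook remains the reference for the intended usage.

```python
import numpy as np


def mean_pool(hidden_states):
    """Average hidden states of shape (num_tokens, hidden_dim) over the token axis.

    Assumption: the input holds no padding tokens, so a plain mean is enough.
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    return hidden_states.mean(axis=0)


def load_embeddings(path, expected_rows=None, expected_dim=1500):
    """Load a .npy embedding file and check that its shape matches expectations."""
    emb = np.load(path)
    if emb.ndim != 2 or emb.shape[1] != expected_dim:
        raise ValueError(f"expected shape (n, {expected_dim}), got {emb.shape}")
    if expected_rows is not None and emb.shape[0] != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {emb.shape[0]}")
    return emb
```

With the provided files, a call such as `load_embeddings("Train_embeddings.npy", expected_rows=13225)` would return an array whose row `i` corresponds to row `i` of the Train CSV, so the CSV should not be shuffled or filtered before the two are paired.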