
Snakes and Sequences: Senegalese Serpent Venom Sequencing Hackathon

Helping Senegal
$5 000 USD
Completed (over 1 year ago)
Classification
101 joined
43 active
Start: Sep 02, 24
Close: Sep 06, 24
Reveal: Sep 07, 24
About

The data provided comes from tandem mass spectrometry experiments. Each row contains an MS2 spectrum representing a peptide to be identified. Additional data about each spectrum is also provided, including precursor information and retention time.

The dataset consists of a CSV file with the following columns:

  • ID: Unique ID for each row
  • exp_id: Unique identifier for each experiment
  • precursor_mz: Mass-to-charge (m/z) of the precursor MS1 reading
  • precursor_mass: Mass of the precursor MS1 reading
  • precursor_charge: Charge of the precursor MS1 reading
  • retention_time: Time of elution from the HPLC
  • mz_array: Mass-to-charge values of the MS2 spectrum (peaks x-axis)
  • intensity_array: Intensity values of the MS2 spectrum (peaks y-axis)
  • target: This is what you are predicting: the ground truth peptide
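As a rough illustration of how one row might be read, the sketch below parses the two spectrum columns and derives a neutral mass from precursor_mz and precursor_charge. It assumes the arrays are serialized in the CSV as list-style strings such as "[101.07, 175.12]", which the description does not confirm, and the example row is made up. The helper names parse_array and neutral_mass are hypothetical, and whether precursor_mass in the file follows exactly this neutral-mass convention is also an assumption.

```python
import ast

PROTON_MASS = 1.007276  # Da, approximate mass of a proton


def parse_array(cell):
    """Parse a CSV cell like "[101.07, 175.12]" into a list of floats.

    Assumption: arrays are stored as list-style strings.
    """
    return [float(x) for x in ast.literal_eval(cell)]


def neutral_mass(precursor_mz, precursor_charge):
    """Neutral mass M of an ion carrying z protons.

    m/z = (M + z * m_p) / z   =>   M = z * (m/z - m_p)
    """
    return precursor_charge * (precursor_mz - PROTON_MASS)


# Made-up example row, keyed by the column names listed above.
row = {
    "precursor_mz": "500.5",
    "precursor_charge": "2",
    "mz_array": "[101.07, 175.12, 304.16]",
    "intensity_array": "[0.2, 1.0, 0.5]",
}

mz = parse_array(row["mz_array"])
intensity = parse_array(row["intensity_array"])
peaks = list(zip(mz, intensity))            # (m/z, intensity) pairs
base_peak = max(peaks, key=lambda p: p[1])  # most intense peak
mass = neutral_mass(float(row["precursor_mz"]), int(row["precursor_charge"]))
```

Comparing the computed value against the precursor_mass column on a few real rows is a quick way to check which mass convention the file actually uses.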

The .csv file also includes 10 additional columns containing the beam-search outputs of InstaDeep’s InstaNovo model. These columns are labeled preds_beam_0 to preds_beam_4 and log_probs_beam_0 to log_probs_beam_4, and contain the model predictions and their log-likelihoods, respectively. These predictions were generated using a similar script to the one found here.
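One simple way to use the beam columns is to pick, per spectrum, the candidate with the highest log-probability and turn that value into a confidence score. The sketch below does this in plain Python. It assumes the log-probabilities are natural logarithms, that missing beams appear as empty cells, and that beams are not necessarily sorted; none of this is stated in the description. The function best_beam and the example row are hypothetical.

```python
import math

N_BEAMS = 5  # preds_beam_0 .. preds_beam_4


def best_beam(row):
    """Return (peptide, confidence) for the highest-scoring beam.

    Beams with an empty prediction or score are skipped. The confidence
    is exp(log_prob), assuming natural-log probabilities.
    """
    candidates = []
    for i in range(N_BEAMS):
        pred = row.get(f"preds_beam_{i}")
        logp = row.get(f"log_probs_beam_{i}")
        if pred in (None, "") or logp in (None, ""):
            continue
        candidates.append((float(logp), pred))
    if not candidates:
        return None, 0.0
    logp, pred = max(candidates)  # highest log-probability wins
    return pred, math.exp(logp)


# Made-up example row with two usable beams and one empty beam.
row = {
    "preds_beam_0": "PEPTIDEK", "log_probs_beam_0": "-0.105",
    "preds_beam_1": "PEPTLDEK", "log_probs_beam_1": "-2.3",
    "preds_beam_2": "", "log_probs_beam_2": "",
}
peptide, confidence = best_beam(row)
```

A reranking model could replace the max step here with a learned score that also looks at the spectrum and precursor columns.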

The goal of this challenge is to improve the area under the precision-recall curve (AUC).
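For local validation, a precision-recall curve can be approximated by ranking predictions by confidence and sweeping a cutoff down the ranking. The sketch below follows one common convention in de novo peptide sequencing: precision is the fraction of correct predictions among those kept, recall is the fraction of all spectra that are correctly identified, and the area is taken with the trapezoidal rule. This is an assumption about the metric and may differ from the leaderboard's exact scoring; pr_curve and pr_auc are hypothetical helper names.

```python
def pr_curve(confidences, correct):
    """Precision and recall at every cutoff of the confidence ranking.

    confidences: one score per spectrum (higher means more confident)
    correct:     one bool per spectrum (prediction matches the target)
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    n = len(order)
    precision, recall = [], []
    hits = 0
    for k, i in enumerate(order, start=1):
        hits += bool(correct[i])
        precision.append(hits / k)  # correct among the top k kept
        recall.append(hits / n)     # correct among all spectra
    return precision, recall


def pr_auc(precision, recall):
    """Trapezoidal area under precision as a function of recall."""
    if not precision:
        return 0.0
    area = 0.0
    prev_r, prev_p = 0.0, precision[0]
    for p, r in zip(precision, recall):
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area


# Made-up example: four spectra, three predicted correctly.
confidences = [0.9, 0.8, 0.6, 0.3]
correct = [True, True, False, True]
precision, recall = pr_curve(confidences, correct)
auc = pr_auc(precision, recall)
```

Under this convention the score rewards both correct peptides and well-calibrated confidences, since a wrong prediction ranked near the top lowers precision for every cutoff below it.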

The training dataset consists of 100,000 MS2 spectra where the provided predictions have an accuracy of roughly 66%. These training samples come from the validation set used to train InstaNovo. The test set contains samples from real snake venoms that we want to identify! The public leaderboard represents approximately 30% of the test data.
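To see how much room the provided predictions leave, one could compare the accuracy of the first beam with an "oracle" accuracy that counts a spectrum as solved if any of the five beams matches the target. The sketch below assumes exact string equality between a prediction and target, which may not match the organisers' definition of a correct peptide, and assumes preds_beam_0 is the model's top prediction. The rows are made up and the function names are hypothetical.

```python
def top1_accuracy(rows):
    """Fraction of rows where preds_beam_0 equals the target exactly."""
    hits = sum(r.get("preds_beam_0") == r["target"] for r in rows)
    return hits / len(rows)


def oracle_accuracy(rows, n_beams=5):
    """Fraction of rows where any of the beams equals the target."""
    hits = sum(
        any(r.get(f"preds_beam_{i}") == r["target"] for i in range(n_beams))
        for r in rows
    )
    return hits / len(rows)


# Made-up rows: the first is solved by beam 0, the second only by beam 1,
# the third by no beam at all.
rows = [
    {"target": "PEPTIDEK", "preds_beam_0": "PEPTIDEK", "preds_beam_1": "PEPTLDEK"},
    {"target": "SAMPLER", "preds_beam_0": "SAMPELR", "preds_beam_1": "SAMPLER"},
    {"target": "VENOMK", "preds_beam_0": "VENMOK", "preds_beam_1": "VNEOMK"},
]
top1 = top1_accuracy(rows)
oracle = oracle_accuracy(rows)
```

The gap between the two numbers on Train gives an upper bound on what pure reranking of the supplied beams could gain.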

Files
This file describes the variables found in train and test.
An example of what your submission file should look like. The order of the rows does not matter, but the "ID" values must be correct.
Test resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.
Train contains the target. This is the dataset that you will use to train your model.
This is the submission file you will obtain from running the starter notebook.
This is a starter notebook to help you make your first submission.