Snakes and Sequences: Senegalese Serpent Venom Sequencing Hackathon 🐝

Helping Senegal

$5 000 USD

Completed (over 1 year ago)

Skills you will learn

Classification

101 joined

43 active

Start

Sep 02, 24

Sep 06, 24

Reveal

Sep 07, 24

About

The data provided comes from tandem mass spectrometry experiments. Each row contains an MS2 spectrum representing a peptide to be identified. Additional data about each spectrum is also provided, including precursor information and retention time.

The dataset consists of a CSV file with the following columns:

ID: Unique ID for each row.
exp_id: Unique identifier for each experiment
precursor_mz: Mass-to-charge (m/z) of the precursor MS1 reading
precursor_mass: Mass of the precursor MS1 reading
precursor_charge: Charge of the precursor MS1 reading
retention_time: Time of elution from the HPLC
mz_array: Mass-to-charge values of the MS2 spectrum (peaks x-axis)
intensity_array: Intensity values of the MS2 spectrum (peaks y-axis)
target: The ground-truth peptide sequence. This is what you are predicting.
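
As a rough illustration of how these columns might be read, the sketch below parses one row using only the Python standard library. It assumes that mz_array and intensity_array are stored as bracketed, comma-separated strings and that precursor_mass is the neutral mass derived from precursor_mz and precursor_charge; neither assumption is confirmed by the description above. The sample row, parse_row and neutral_mass are hypothetical names and made-up values, not part of the competition files.

```python
import ast
import csv
import io

PROTON_MASS = 1.007276  # mass of a proton in Da

# Made-up single-row sample for illustration only; not real competition data.
# Assumption: the array columns are stored as bracketed strings.
SAMPLE_CSV = '''ID,exp_id,precursor_mz,precursor_mass,precursor_charge,retention_time,mz_array,intensity_array
ID_0,exp_a,500.5,998.985448,2,12.3,"[100.1, 200.2, 300.3]","[10.0, 50.0, 5.0]"
'''


def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed values."""
    mz = ast.literal_eval(row["mz_array"])
    intensity = ast.literal_eval(row["intensity_array"])
    if len(mz) != len(intensity):
        raise ValueError("mz_array and intensity_array differ in length")
    return {
        "ID": row["ID"],
        "precursor_mz": float(row["precursor_mz"]),
        "precursor_mass": float(row["precursor_mass"]),
        # int(float(...)) also accepts a charge written as "2.0"
        "precursor_charge": int(float(row["precursor_charge"])),
        "retention_time": float(row["retention_time"]),
        "peaks": list(zip(mz, intensity)),  # (m/z, intensity) pairs
    }


def neutral_mass(precursor_mz, charge):
    """Neutral mass implied by an m/z value and a positive charge state."""
    return (precursor_mz - PROTON_MASS) * charge


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
first = rows[0]
mass_error = abs(neutral_mass(first["precursor_mz"], first["precursor_charge"])
                 - first["precursor_mass"])
```

If the real file stores the arrays differently (for example, space-separated values), only the two ast.literal_eval lines would need to change.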

The CSV file also includes 10 additional columns containing the beam-search outputs of InstaDeep’s InstaNovo model. The columns preds_beam_0 to preds_beam_4 hold the model predictions, and log_probs_beam_0 to log_probs_beam_4 hold the corresponding log-probabilities. These predictions were generated using a script similar to the one found here.
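
The beam columns can be turned into a ranked list of candidates per spectrum. The sketch below is one possible way to do that, assuming each row is available as a dict keyed by column name and that the log-probability is a natural logarithm, so exp(log_prob) gives a confidence between 0 and 1. The function name ranked_beams and the example row are hypothetical.

```python
import math

NUM_BEAMS = 5  # preds_beam_0 .. preds_beam_4


def ranked_beams(row):
    """Return (peptide, log_prob, probability) tuples, most likely first."""
    beams = []
    for k in range(NUM_BEAMS):
        pred = row.get(f"preds_beam_{k}")
        log_prob = row.get(f"log_probs_beam_{k}")
        if pred in (None, "") or log_prob in (None, ""):
            continue  # skip beams that are missing or empty
        log_prob = float(log_prob)
        beams.append((pred, log_prob, math.exp(log_prob)))
    # Sort explicitly rather than trusting the column order.
    return sorted(beams, key=lambda b: b[1], reverse=True)


# Made-up row for illustration only; peptides and scores are not real data.
example_row = {
    "preds_beam_0": "PEPTIDEK", "log_probs_beam_0": "-0.1",
    "preds_beam_1": "PEPTLDEK", "log_probs_beam_1": "-2.5",
    "preds_beam_2": "", "log_probs_beam_2": "",
}

best_peptide, best_log_prob, best_confidence = ranked_beams(example_row)[0]
```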

The goal of this challenge is to improve the area under the precision-recall curve (AUC).
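
The exact scoring formula is not given here, so the following is only a sketch of one common peptide-level definition: predictions are ranked by confidence, precision is the fraction of correct predictions among those kept at each cut-off, and recall is the number of correct predictions kept divided by the total number of spectra. The function name precision_recall_auc and the toy inputs are hypothetical, and the official metric may differ in detail.

```python
def precision_recall_auc(confidences, is_correct):
    """Step-wise area under a peptide-level precision-recall curve.

    Assumption: recall is measured against all spectra, since every
    spectrum has a ground-truth peptide.
    """
    n = len(is_correct)
    if n == 0:
        return 0.0
    # Rank predictions from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    auc = 0.0
    correct = 0
    prev_recall = 0.0
    for rank, i in enumerate(order, start=1):
        if is_correct[i]:
            correct += 1
        precision = correct / rank
        recall = correct / n
        auc += precision * (recall - prev_recall)
        prev_recall = recall
    return auc


# Toy example: the same three predictions, scored with two confidence rankings.
labels = [True, True, False]
auc_good = precision_recall_auc([0.9, 0.8, 0.1], labels)  # wrong answer ranked last
auc_bad = precision_recall_auc([0.1, 0.8, 0.9], labels)   # wrong answer ranked first
```

Under this definition the score can improve in two ways: by getting more peptides right, or by assigning lower confidence to the predictions that are wrong.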

The training dataset consists of 100,000 MS2 spectra for which the provided predictions have an accuracy of roughly 66%. These training samples come from the validation set used when training InstaNovo. The test set contains samples from real snake venoms that we want to identify! The public leaderboard is scored on approximately 30% of the test data.
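
To check the baseline on the training data, one could compare the provided predictions with the target column. The sketch below assumes that "accuracy" means an exact string match between a prediction and the target, which the description does not state, and it also reports how often the target appears in any of the five beams. The function name beam_accuracy and the rows are hypothetical, made-up examples.

```python
def beam_accuracy(rows, num_beams=5):
    """Return (top-1 accuracy, any-beam accuracy) using exact string matches."""
    if not rows:
        return 0.0, 0.0
    top1 = 0
    any_beam = 0
    for row in rows:
        target = row["target"]
        beams = [row.get(f"preds_beam_{k}") for k in range(num_beams)]
        top1 += int(beams[0] == target)
        any_beam += int(target in beams)
    return top1 / len(rows), any_beam / len(rows)


# Made-up rows for illustration only; not real competition data.
toy_rows = [
    {"target": "AAK", "preds_beam_0": "AAK", "preds_beam_1": "AAR"},
    {"target": "GGR", "preds_beam_0": "GGK", "preds_beam_1": "GGR"},
    {"target": "LLK", "preds_beam_0": "LIK", "preds_beam_1": "ILK"},
]

top1_acc, any_beam_acc = beam_accuracy(toy_rows)
```

The gap between the two numbers indicates how much could be gained by re-ranking the existing beams alone.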

Files

This file describes the variables found in train and test.

This is an example of what your submission file should look like. The order of the rows does not matter, but the values in the "ID" column must be correct.

Test resembles Train.csv but without the target-related columns. This is the dataset to which you will apply your model.

Train contains the target. This is the dataset that you will use to train your model.

This is the submission file you will obtain from running the starter notebook.

This is a starter notebook to help you make your first submission.
