The data provided comes from tandem mass spectrometry experiments. Each row contains an MS2 spectrum representing a peptide to be identified. Additional data about each spectrum is also provided, including precursor information and retention time.
The dataset consists of a CSV file with the following columns:
The CSV file also includes 10 additional columns containing the beam-search outputs of InstaDeep’s InstaNovo model. These columns are labeled preds_beam_0 to preds_beam_4 and log_probs_beam_0 to log_probs_beam_4, and contain the model's predicted peptide sequences and their log-likelihoods, respectively. These predictions were generated using a similar script to the one found here.
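As a rough illustration of how the beam columns might be consumed, the sketch below reads one row and ranks its beam candidates by log-probability. Only the column names preds_beam_0 to preds_beam_4 and log_probs_beam_0 to log_probs_beam_4 come from the description above; the sample row, the helper name beam_candidates, and the assumption that unused beams are left empty are all hypothetical.

```python
import csv
import io
import math

N_BEAMS = 5  # preds_beam_0 .. preds_beam_4, as described above


def beam_candidates(row):
    """Return (peptide, log_prob) pairs for one spectrum, most likely first.

    Assumption: a beam with no prediction has empty cells; such beams are skipped.
    """
    candidates = []
    for i in range(N_BEAMS):
        peptide = row.get(f"preds_beam_{i}", "")
        log_prob = row.get(f"log_probs_beam_{i}", "")
        if not peptide or not log_prob:
            continue
        candidates.append((peptide, float(log_prob)))
    # Sort explicitly rather than relying on the column order being ranked.
    return sorted(candidates, key=lambda c: c[1], reverse=True)


# Made-up sample data, for illustration only (not taken from the real file).
header = [f"preds_beam_{i}" for i in range(N_BEAMS)]
header += [f"log_probs_beam_{i}" for i in range(N_BEAMS)]
values = ["PEPTIDEK", "PEPTLDEK", "PEPTIDER", "", "",
          "-0.1", "-1.2", "-2.5", "", ""]
sample_csv = ",".join(header) + "\n" + ",".join(values) + "\n"

rows = list(csv.DictReader(io.StringIO(sample_csv)))
best_peptide, best_log_prob = beam_candidates(rows[0])[0]

# exp(log-probability) gives a confidence in [0, 1] for ranking spectra.
confidence = math.exp(best_log_prob)
print(best_peptide, round(confidence, 3))
```

Keeping all five candidates per spectrum, rather than only the top one, leaves room for rescoring approaches that promote a lower-ranked beam.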
The goal of this challenge is to improve the area under the curve (AUC) for the precision-recall plot.
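The exact scoring procedure is not spelled out here, so the following is only a minimal sketch of one common way to compute the area under a precision-recall curve for this kind of task. It assumes one prediction per spectrum, a confidence score per prediction, and a known correct/incorrect label; precision is taken as correct predictions kept over predictions kept, and recall as correct predictions kept over all spectra. The function name precision_recall_auc and these definitions are assumptions, not the challenge's official metric.

```python
def precision_recall_auc(confidences, is_correct):
    """Step-wise area under the precision-recall curve.

    Assumed definitions (not confirmed by the challenge):
      precision = correct predictions kept / predictions kept
      recall    = correct predictions kept / all spectra
    Predictions are added in order of decreasing confidence; ties are
    broken by input order.
    """
    if len(confidences) != len(is_correct):
        raise ValueError("confidences and is_correct must have equal length")
    n_total = len(confidences)
    if n_total == 0:
        return 0.0

    order = sorted(range(n_total), key=lambda i: confidences[i], reverse=True)

    auc = 0.0
    n_correct = 0
    prev_recall = 0.0
    for n_kept, idx in enumerate(order, start=1):
        if is_correct[idx]:
            n_correct += 1
        precision = n_correct / n_kept
        recall = n_correct / n_total
        # Recall only grows when a correct prediction is added.
        auc += precision * (recall - prev_recall)
        prev_recall = recall
    return auc


# Same predictions, two different confidence rankings (made-up values).
labels = [True, True, False, False]
well_ranked = precision_recall_auc([0.9, 0.8, 0.3, 0.1], labels)
badly_ranked = precision_recall_auc([0.1, 0.3, 0.8, 0.9], labels)
print(well_ranked, badly_ranked)
```

Under these definitions the score rewards both correct sequences and well-calibrated confidences: the two calls above use identical predictions, yet the ranking that places correct predictions first scores higher.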
The training dataset consists of 100,000 MS2 spectra, for which the provided predictions have an accuracy of roughly 66%. These training samples come from the validation set used when training InstaNovo. The test set contains samples from real snake venoms that we want to identify! The public leaderboard is computed on approximately 30% of the test data.