
Snakes and Sequences: Senegalese Serpent Venom Sequencing Hackathon

Helping Senegal
$5 000 USD
Completed (over 1 year ago)
Classification
101 joined
43 active
Start: Sep 02, 24
Close: Sep 06, 24
Reveal: Sep 07, 24
A potential "data bug" in the starter notebook and a solution
Data · 6 Sep 2024, 07:55 · 1

We have this cell in the starter notebook:

train_df["mz_array"] = train_df["mz_array"].map(
    lambda x: list(map(float, re.findall(r"\d+\.\d*", x))) if isinstance(x, str) else x
)
train_df["intensity_array"] = train_df["intensity_array"].map(
    lambda x: list(map(float, re.findall(r"\d+\.\d*", x))) if isinstance(x, str) else x
)
test_df["mz_array"] = train_df["mz_array"].map(
    lambda x: list(map(float, re.findall(r"\d+\.\d+", x))) if isinstance(x, str) else x
)
test_df["intensity_array"] = train_df["intensity_array"].map(
    lambda x: list(map(float, re.findall(r"\d+\.\d*", x))) if isinstance(x, str) else x
)

Look closely at the last two assignments, for example test_df["intensity_array"] = train_df["intensity_array"].map(...)

This looks wrong, probably a copy-paste error. Do you see what it means? We are overwriting these columns of the test set with values from the train set!
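
One way to avoid this kind of copy-paste slip is to define the parser once and apply it to each frame's own columns. The sketch below is only an illustration: parse_float_list and FLOAT_PATTERN are hypothetical names that are not in the starter notebook, and the pattern is simply the one used in the train cells above, applied to every column.

```python
import re

# Same pattern as the starter notebook's train cells. Note that it only
# matches numbers written with a decimal point.
FLOAT_PATTERN = re.compile(r"\d+\.\d*")


def parse_float_list(value):
    """Turn a stringified array such as "[1.5 2.25]" into a list of floats.

    Non-string values are returned unchanged, as in the original lambdas.
    """
    if isinstance(value, str):
        return [float(token) for token in FLOAT_PATTERN.findall(value)]
    return value


# Intended use in the notebook (assumes train_df and test_df already exist):
# for df in (train_df, test_df):
#     for column in ("mz_array", "intensity_array"):
#         df[column] = df[column].map(parse_float_list)
```

Because each frame is mapped from its own column, the test set can no longer be overwritten with train values by accident.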

However, changing train_df to test_df in these assignments breaks the code.

Specifically, this cell fails:

train_df["intensity_array_normalised"] = train_df.apply(
    lambda x: remove_precursor(x["mz_array"], x["intensity_array"], x["precursor_mz"]),
    axis=1,
)

After debugging, I found that for some rows the intensity_array is longer than the mz_array.
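
To see which rows are affected, a small check along these lines may help. It is a sketch, not part of the starter notebook: find_length_mismatches is a hypothetical helper, and it assumes both columns have already been parsed into lists.

```python
def find_length_mismatches(mz_column, intensity_column):
    """Return (row position, len(mz), len(intensity)) for rows whose lengths differ."""
    mismatches = []
    for position, (mz, intensity) in enumerate(zip(mz_column, intensity_column)):
        if len(mz) != len(intensity):
            mismatches.append((position, len(mz), len(intensity)))
    return mismatches


# Intended use in the notebook (assumes test_df has been parsed already):
# find_length_mismatches(test_df["mz_array"], test_df["intensity_array"])
```

An empty result means every row has matching lengths.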

If you run into this, the solution is to modify this line in the remove_precursor function:

intensity_array[indices_to_zero] = 0

to:

intensity_array[:len(diff)][indices_to_zero] = 0
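
For context, here is a hedged sketch of how the patched line might sit inside the function. The real body of remove_precursor is not shown in this post, so everything except the names diff, indices_to_zero and the patched line itself is an assumption: the tolerance parameter is hypothetical, and indices_to_zero is assumed to be a boolean mask with one entry per m/z value.

```python
import numpy as np


def remove_precursor(mz_array, intensity_array, precursor_mz, tolerance=1.0):
    """Zero the intensities of peaks close to the precursor m/z.

    Hypothetical reconstruction; `tolerance` is not taken from the notebook.
    """
    mz = np.asarray(mz_array, dtype=float)
    intensity = np.array(intensity_array, dtype=float)  # work on a copy

    diff = np.abs(mz - precursor_mz)
    indices_to_zero = diff < tolerance  # boolean mask, same length as mz

    # intensity[:len(diff)] is a view of the first len(diff) entries, so the
    # masked assignment writes through to `intensity` even when it is longer
    # than the m/z array. Trailing extra intensities are left untouched.
    intensity[:len(diff)][indices_to_zero] = 0
    return intensity
```

Under these assumptions the patch only covers rows where intensity_array is at least as long as mz_array; if it were shorter, the mask would still not fit the slice.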

Good luck to all!

Don't forget that the ultimate goal is to learn!

Discussion 1 answer
lightonkalumba

I also faced problems, specifically with this cell.

6 Sep 2024, 09:30
Upvotes 1