Snakes and Sequences: Senegalese Serpent Venom Sequencing Hackathon

Ismael_Kone

A potential "data bug" in the starter notebook and a solution

Data · 6 Sep 2024, 07:55 · 1

We have this cell in the starter notebook:

train_df["mz_array"] = train_df["mz_array"].map(
    lambda x: list(map(float, re.findall(r"\d+\.\d*", x))) if isinstance(x, str) else x
)
train_df["intensity_array"] = train_df["intensity_array"].map(
    lambda x: list(map(float, re.findall(r"\d+\.\d*", x))) if isinstance(x, str) else x
)
test_df["mz_array"] = train_df["mz_array"].map(
    lambda x: list(map(float, re.findall(r"\d+\.\d+", x))) if isinstance(x, str) else x
)
test_df["intensity_array"] = train_df["intensity_array"].map(
    lambda x: list(map(float, re.findall(r"\d+\.\d*", x))) if isinstance(x, str) else x
)

Look closely at the last two assignments: test_df["mz_array"] = train_df["mz_array"].map... and test_df["intensity_array"] = train_df["intensity_array"].map...

This is weird, and is perhaps a copy-paste error. Do you see what that means? It means we are overwriting these columns of the test set with values from the train set!
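Here is a minimal sketch of how the parsing cell could look once each dataframe reads from its own columns. The helper names parse_float_list and parse_array_columns and the toy dataframes are my own and are not part of the starter notebook. Note that the quoted cell uses `\d+\.\d+` for one column and `\d+\.\d*` for the others; the two patterns differ on values such as `500.`, so this sketch uses a single pattern everywhere.

```python
import re

import pandas as pd

# One shared pattern for every column (same as most of the quoted cell).
FLOAT_PATTERN = re.compile(r"\d+\.\d*")


def parse_float_list(value):
    """Turn a stringified array into a list of floats; leave other values as-is."""
    if isinstance(value, str):
        return [float(token) for token in FLOAT_PATTERN.findall(value)]
    return value


def parse_array_columns(df, columns=("mz_array", "intensity_array")):
    """Parse each array column of df from df itself, never from another frame."""
    for col in columns:
        df[col] = df[col].map(parse_float_list)
    return df


# Toy data standing in for the real train/test files (hypothetical values).
train_df = pd.DataFrame(
    {"mz_array": ["[100.5 200.25]"], "intensity_array": ["[1.0 2.0]"]}
)
test_df = pd.DataFrame(
    {"mz_array": ["[300.5 400.75 500.]"], "intensity_array": ["[3.0 4.0 5.0]"]}
)

train_df = parse_array_columns(train_df)
test_df = parse_array_columns(test_df)
```

Putting the logic in one function that only ever touches the dataframe it receives makes this kind of train/test mix-up harder to reintroduce.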

But when I change train_df to test_df in these assignments, the code breaks.

Specifically, this cell fails:

train_df["intensity_array_normalised"] = train_df.apply(
    lambda x: remove_precursor(x["mz_array"], x["intensity_array"], x["precursor_mz"]),
    axis=1,
)

After debugging, I found that for some rows, the list in the intensity_array column is longer than the list in the mz_array column.
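To see which rows are affected, a quick check along these lines may help. It is only a sketch: the function name find_length_mismatches and the toy dataframe are hypothetical, and it assumes both columns have already been parsed into lists.

```python
import pandas as pd


def find_length_mismatches(df, mz_col="mz_array", intensity_col="intensity_array"):
    """Return the index labels of rows whose two arrays have different lengths."""
    mz_len = df[mz_col].map(len)
    intensity_len = df[intensity_col].map(len)
    return df.index[mz_len != intensity_len]


# Toy data: the second row has one more intensity than m/z values.
toy_df = pd.DataFrame(
    {
        "mz_array": [[100.5, 200.25], [300.5, 400.75]],
        "intensity_array": [[1.0, 2.0], [3.0, 4.0, 5.0]],
    }
)

mismatched_rows = find_length_mismatches(toy_df)
print(list(mismatched_rows))
```

Inspecting the raw strings for those rows can show whether the mismatch comes from the data or from the parsing step.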

If you run into this, the fix is to replace this line in the remove_precursor function:

intensity_array[indices_to_zero] = 0

with:

intensity_array[:len(diff)][indices_to_zero] = 0
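For context, here is a rough sketch of what a remove_precursor function with this change might look like. The body is an assumption on my part, not the starter notebook's actual code: I assume diff is the absolute distance of each m/z value from precursor_mz, that indices_to_zero is a boolean mask built from it, and that a tolerance parameter (hypothetical, default 0.5) decides what counts as "near the precursor".

```python
import numpy as np


def remove_precursor(mz_array, intensity_array, precursor_mz, tolerance=0.5):
    """Zero the intensities of peaks close to the precursor m/z (assumed logic)."""
    mz = np.asarray(mz_array, dtype=float)
    # np.array copies, so the list stored in the dataframe is not modified.
    intensity = np.array(intensity_array, dtype=float)

    diff = np.abs(mz - precursor_mz)
    indices_to_zero = diff < tolerance

    # The slice is a view of the first len(diff) entries, so assigning through
    # it updates `intensity` even when intensity is longer than mz.
    intensity[: len(diff)][indices_to_zero] = 0
    return intensity


# Toy row where intensity_array has two more entries than mz_array.
result = remove_precursor([100.0, 200.0, 300.0], [1.0, 2.0, 3.0, 4.0, 5.0], 200.0)
print(result)
```

The slice only covers the case where intensity_array is the longer of the two. The extra trailing intensities are left untouched, and a row where intensity_array is shorter than mz_array would still raise an error.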

Good luck to all!

Don't forget that the ultimate goal is to learn!

Discussion 1 answer

lightonkalumba

I also ran into problems, specifically with this cell.

6 Sep 2024, 09:30

Upvotes 1
