We have this cell in the starter notebook:
train_df["mz_array"] = train_df["mz_array"].map(
lambda x: list(map(float, re.findall(r"\d+\.\d*", x))) if isinstance(x, str) else x
)
train_df["intensity_array"] = train_df["intensity_array"].map(
lambda x: list(map(float, re.findall(r"\d+\.\d*", x))) if isinstance(x, str) else x
)
test_df["mz_array"] = train_df["mz_array"].map(
lambda x: list(map(float, re.findall(r"\d+\.\d+", x))) if isinstance(x, str) else x
)
test_df["intensity_array"] = train_df["intensity_array"].map(
lambda x: list(map(float, re.findall(r"\d+\.\d*", x))) if isinstance(x, str) else x
)
Look closely at the last two assignments, e.g. test_df["intensity_array"] = train_df["intensity_array"].map...
This is weird. Perhaps a copy-paste error. Do you know what that means? It means we are overwriting these columns of the test set with the train set!
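For reference, here is a minimal sketch of what the cell was probably meant to do, with each dataframe parsed from its own columns. The helper name parse_float_list and the tiny stand-in dataframes are my own assumptions, not part of the starter notebook; the regex is the one used in most of the original lines.

```python
import re

import pandas as pd


def parse_float_list(x):
    """Turn a stringified list such as "[1.5, 2.25]" into a list of floats.

    Hypothetical helper: it only factors out the lambda from the notebook cell.
    """
    if isinstance(x, str):
        return [float(v) for v in re.findall(r"\d+\.\d*", x)]
    return x


# Stand-in frames for illustration; the real ones come from the competition files.
train_df = pd.DataFrame(
    {"mz_array": ["[100.1, 200.2]"], "intensity_array": ["[1.0, 2.0]"]}
)
test_df = pd.DataFrame(
    {"mz_array": ["[300.3, 400.4]"], "intensity_array": ["[3.0, 4.0]"]}
)

# Each frame is parsed from its own columns, so test_df is never
# overwritten with values taken from train_df.
for df in (train_df, test_df):
    for col in ("mz_array", "intensity_array"):
        df[col] = df[col].map(parse_float_list)
```

Looping over both frames means the column names and the source dataframe are written once, so this kind of copy-paste slip is harder to make.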
But when I change train_df to test_df in these assignments, the code breaks.
Specifically, this cell fails:
train_df["intensity_array_normalised"] = train_df.apply(
lambda x: remove_precursor(x["mz_array"], x["intensity_array"], x["precursor_mz"]),
axis=1,
)
After debugging, I found that for some rows the intensity_array is longer than the mz_array.
If you run into that, the solution is to replace this line in the remove_precursor function:
intensity_array[indices_to_zero] = 0
with:
intensity_array[:len(diff)][indices_to_zero] = 0
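To see why this works, here is a small sketch. I don't have the exact body of remove_precursor in front of me, so the function below is a hypothetical stand-in: I am assuming diff is the absolute distance of each m/z value to the precursor, indices_to_zero is a boolean mask built from it, and tol is a made-up tolerance parameter.

```python
import numpy as np


def remove_precursor_sketch(mz_array, intensity_array, precursor_mz, tol=1.0):
    """Hypothetical stand-in for remove_precursor (tol is an assumed parameter)."""
    mz_array = np.asarray(mz_array, dtype=float)
    # np.array makes a copy, so the caller's data is left untouched.
    intensity_array = np.array(intensity_array, dtype=float)

    # Assumed definitions: distance of each peak to the precursor m/z,
    # and a boolean mask of the peaks that sit on the precursor.
    diff = np.abs(mz_array - precursor_mz)
    indices_to_zero = diff < tol

    # The mask has len(mz_array) entries. Slicing a NumPy array returns a
    # view, so assigning through the slice updates intensity_array itself
    # while keeping the mask and the indexed array the same length.
    intensity_array[: len(diff)][indices_to_zero] = 0
    return intensity_array


# Intensity has one more entry than m/z, as in the problematic rows.
out = remove_precursor_sketch(
    [100.0, 250.0, 400.0], [5.0, 9.0, 3.0, 7.0], precursor_mz=250.0
)
```

Note that this relies on intensity_array being a NumPy array: with a plain Python list the slice would be a copy and the assignment would not work. It also only covers the case where intensity_array is the longer one; if it is shorter than mz_array, the mask would still not fit.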
Good luck to all!
Don't forget the ultimate goal is to learn!
I also faced problems, specifically with this cell