Sendy Logistics Challenge
$7,000 USD
Predict the estimated time of arrival (ETA) for motorbike deliveries in Nairobi
23 August–25 November 2019 23:59
974 data scientists enrolled, 357 on the leaderboard
Impute missing values in test data?
published 6 Nov 2019, 14:40

Hello all, I want to start a discussion about missing values. If I impute missing values in the training data, can I also do the same for the test data? Is that considered leakage or not?

It's part of preprocessing. You are allowed to do so.

But the imputation needs to be mapped from the train values. Do not use test values to impute missing test values.
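A minimal sketch of what "mapped from the train values" could look like, using only the Python standard library. The column name and numbers are made up for illustration and are not taken from the competition data; `fit_mean` and `impute` are hypothetical helpers, not part of any library.

```python
from statistics import mean


def fit_mean(values):
    """Return the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    return mean(observed)


def impute(values, fill_value):
    """Replace every missing (None) entry with fill_value."""
    return [fill_value if v is None else v for v in values]


# Hypothetical column with missing entries (illustrative numbers only).
train_temp = [20.0, None, 24.0, 22.0]
test_temp = [None, 30.0, None]

# The statistic is learned from the training data only ...
train_mean = fit_mean(train_temp)

# ... and the same stored value fills the gaps in both sets.
train_filled = impute(train_temp, train_mean)
test_filled = impute(test_temp, train_mean)

# A single incoming row is filled with the same value, no matter how
# many other test rows exist.
single_row_filled = impute([None], train_mean)
```

The point of the design is that `train_mean` is computed once and then treated as part of the model, so the test set never contributes to it.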

Thanks Champ_R and Blenz for the response.

1. So does that mean whatever processing I do on the training set is also valid on the test set? Another example would be removing outliers.

2. " but the imputation needs to be mapped from the train values, do not use test values to impute missing test values. "

From the explanation, I take it that if I impute the mean value in the training set, I must use that same mean from train to impute the test set, and not a mean re-calculated on the test set. So it is the values that are mapped, not the method. Why is that?

replying to Fredrick_Neo
edited 2 minutes later

1. In a real-life situation, if you have a train and a test set, and you feel the outliers do not make sense and cannot be modeled (in other words, random noise), you can remove them from both sets, since those rows won't reappear for your model (because again it's noise, maybe an input mistake or whatever that's re-occurring). But here, in the context of a challenge, you're forced to make a prediction for each row in the test set, so you can't remove rows from the test set. You'll have to work on the train data, either by removing the outliers or by building a model that is robust to outliers.

2. Again, in a real-life situation your model should be able to process a single row as input. If you recalculate the mean at each inference, you'll be using the exact value the test row has, which doesn't make sense. Another reason why you shouldn't impute using test values is that you're trying to create a model that is independent of the test set. By using the test data to recalculate a feature, you make your model depend on the test data size: if you have 1 row, the calculated feature will have one value; if you have 10 rows, the value will change accordingly. I hope that makes sense.
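For point 1, here is a rough sketch of working on the train data only. It assumes a simple IQR rule to decide what counts as an outlier, which is just one common choice and not something prescribed in this thread; the feature name, the numbers, and the `iqr_bounds` helper are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds from the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical feature values (illustrative numbers only).
train_distance = [10, 12, 11, 13, 12, 300]
test_distance = [11, 250, 12]

# Bounds are learned from the training data only.
lower, upper = iqr_bounds(train_distance)

# Outlier rows are dropped from train ...
train_kept = [v for v in train_distance if lower <= v <= upper]

# ... but every test row is kept, because each one needs a prediction.
test_kept = list(test_distance)
```

The alternative mentioned above, a model that is robust to outliers, would leave the training rows in place and handle the extreme values inside the model instead.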

Thanks a lot, Blenz, for the insights. Yes, it makes sense now.

Thank you too, Blenz. Your explanation of how the model will be used in real life is on point.