Sendy Logistics Challenge

Helping Kenya

$7 000 USD

Completed (over 6 years ago)

2192 joined

448 active

Start

Aug 23, 19

Nov 25, 19

Reveal

Nov 26, 19

freya20

Impute missing values in test data?

Data · 6 Nov 2019, 14:40 · 6

Hello all, I want to start a discussion about missing values. If I impute missing values in the training data, can I do the same for the test data? Is that considered leakage or not?

Discussion 6 answers

Champ_R

It's part of preprocessing. You are allowed to do so.

6 Nov 2019, 14:53

Upvotes 0

Blenz

But the imputation needs to be mapped from the train values; do not use test values to impute missing test values.
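A minimal sketch of what "mapped from the train values" could look like, using only the Python standard library. The column name `temperature` and its values are hypothetical and chosen for illustration; they are not taken from the competition data. The fill value is computed once from the train column and then reused unchanged on the test column.

```python
from statistics import mean


def fit_mean(values):
    """Return the mean of the non-missing (non-None) values."""
    observed = [v for v in values if v is not None]
    return mean(observed)


def impute(values, fill_value):
    """Replace each missing (None) entry with fill_value."""
    return [fill_value if v is None else v for v in values]


# Hypothetical feature column; None marks a missing value.
train_temperature = [20.0, 22.0, None, 24.0]
test_temperature = [30.0, None]

# The statistic is learned from the train column only.
train_mean = fit_mean(train_temperature)

# The same train-derived value fills the gaps in both sets.
train_filled = impute(train_temperature, train_mean)
test_filled = impute(test_temperature, train_mean)

print(train_mean)   # 22.0
print(test_filled)  # [30.0, 22.0]
```

In scikit-learn the same pattern is usually written as calling `fit` on the train data with `SimpleImputer` and then `transform` on the test data.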

6 Nov 2019, 15:02

Upvotes 0

freya20

Thanks Champ_R and Blenz for the response.

1. So does that mean whatever processing I do on the training set is also valid on the test set? Removing outliers, for example.

2. "but the imputation needs to be mapped from the train values, do not use test values to impute missing test values."

From the explanation, I take it that if I impute the mean value in the training set, I must use the same mean from train to impute in the test set, and not a mean re-calculated on the test set. So it is the values that are mapped, not the method. Why is it like that?

6 Nov 2019, 15:13

Upvotes 0

Blenz

1. In a real-life situation, if you have a train and a test set and you feel the outliers do not make sense and cannot be modeled (in other words, they are random noise), you can remove them from both sets, since those rows won't reappear for your model (again, it's noise, maybe an input mistake or something else that won't recur). But here, in the context of a challenge, you're forced to make a prediction for each row in the test set, so you can't remove rows from the test set. You'll have to work on the train data, either by removing the outliers or by building a model that is robust to outliers.

2. Again, in a real-life situation your model should be able to process a single row as input. If you recalculate the mean at each inference, you'll be using the exact value the test row has, which doesn't make sense. Another reason not to impute using test values is that you're trying to create a model that is independent of the test set. By using the test data to recalculate a feature, you make your model depend on the test data size: with 1 row the calculated feature will have one value, and with 10 rows the value will change accordingly. I hope it makes sense.
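The two points above can be sketched roughly as follows, in plain Python. All numbers are made up for illustration, and the 1.5 × IQR rule is just one common way to flag outliers; it is an assumption here, not something prescribed in this thread.

```python
from statistics import mean, quantiles


def iqr_bounds(values, k=1.5):
    """Lower and upper fences from the quartiles of the given values."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def mean_of_observed(values):
    """Mean of the non-missing (non-None) values."""
    return mean([v for v in values if v is not None])


# Point 1: outliers are filtered in train only (hypothetical target values).
train_target = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0]
low, high = iqr_bounds(train_target)
train_kept = [y for y in train_target if low <= y <= high]
# Test rows are never dropped: each one still needs a prediction.

# Point 2: a fill value recomputed on test data moves with the batch.
batch_small = [30.0, None]
batch_large = [30.0, None, 10.0, 14.0]
fill_small = mean_of_observed(batch_small)  # 30.0
fill_large = mean_of_observed(batch_large)  # 18.0

# A train-derived fill value is fixed, so a single row can be processed.
train_feature = [20.0, 22.0, None, 24.0]
train_mean = mean_of_observed(train_feature)
single_row = [None]
single_filled = [train_mean if v is None else v for v in single_row]

print(len(train_kept), fill_small, fill_large, single_filled)
```

Note that a single test row whose value is missing has nothing to compute a mean from at all, which is another reason to carry the statistic over from train.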

replied to freya20 · 6 Nov 2019, 15:22 (edited 2 minutes later)

Upvotes 0

jennykathambi90

Thank you too, Blenz. Your explanation of how the model will be used in real life is on point.

replied to Blenz · 7 Nov 2019, 06:36

Upvotes 0

freya20

Thanks a lot, Blenz, for the insights. Yes, it makes sense now.

6 Nov 2019, 15:29

Upvotes 0
