How would you guys recommend dealing with NaN values?
I noticed that dropping them makes the models perform better than when you impute them. However, the test files also contain NaNs and dropping those will remove IDs as well and you don't get scored with missing IDs.
Any advice?
XGBoost and LightGBM automatically handle NaN values, so there is no actual need to impute them with anything. It is done for you. (If I am not mistaken, both LightGBM and XGBoost treat NaN values as a separate category and include them as part of the splits during the tree-building process.)
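For example, something along these lines works with the NaNs left in place (untested sketch; the file and column names "train.csv", "id" and "target" are just placeholders, and it assumes the features are numeric):

```python
import pandas as pd
import xgboost as xgb
import lightgbm as lgb

train = pd.read_csv("train.csv")          # hypothetical file/column names
X = train.drop(columns=["id", "target"])  # NaNs in the features are left untouched
y = train["target"]

# XGBoost learns a default split direction for missing values at each node
xgb_model = xgb.XGBClassifier(n_estimators=200, learning_rate=0.05)
xgb_model.fit(X, y)

# LightGBM likewise routes missing values during tree building (use_missing=True by default)
lgb_model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
lgb_model.fit(X, y)
```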
But personally, if it's a numeric column I fill it in with the mean, and if it's a categorical column I just take the most frequent value (the mode). Be warned that some of the imputation methods do change the distribution of the data, and in turn this affects model performance. Sometimes for the better, other times you end up like England losing it all at the last minute.
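A rough sketch of that mean / most-frequent fill, assuming a pandas DataFrame `df` and selecting columns by dtype (the function name is just for illustration):

```python
import pandas as pd

def simple_impute(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.select_dtypes(exclude="number").columns

    # numeric columns: fill with the column mean
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

    # categorical columns: fill with the most frequent value (mode)
    for col in cat_cols:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```

sklearn's `SimpleImputer` with `strategy="mean"` or `"most_frequent"` does the same thing if you prefer to keep it inside a pipeline.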
Makes sense. I was using hist_grad_boost but I'll play around with XGBoost as well. Thank you :)
Just to cover the whole family, *catboost* also handles NaN seamlessly, except they are not allowed in categoricals.
Ahhh yes, CatBoost, the "middle child" of them all. Always left out.
In the data-prep section, you could try selecting all the columns except the ID column; then in the modelling section you add it back, but it isn't considered as a predictor.
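Something like this (untested sketch; "id", "target" and the file names are placeholders), so the test IDs are kept for the submission but never fed to the model:

```python
import pandas as pd

train = pd.read_csv("train.csv")   # hypothetical paths
test = pd.read_csv("test.csv")

feature_cols = [c for c in train.columns if c not in ("id", "target")]

X_train, y_train = train[feature_cols], train["target"]
X_test = test[feature_cols]        # test["id"] is untouched, so no rows are lost

# ... fit a model on X_train / y_train, then build the submission with the kept IDs:
# submission = pd.DataFrame({"id": test["id"], "target": model.predict(X_test)})
```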
After a lot of tests, I found that the best way for me is to fill the numerical columns with the mean of the column grouped by the 'region' variable, and the categoricals with the mode grouped on the same 'region' variable. Of course you need to fill 'region' before doing this; I filled it with a dummy name, 'Nan'. There will be a few missing values left after this due to the distribution of the feature values, and filling those with 99999 did a great job.
I did the process at my baseline before adding any engineered features and it gave me a nice improvement in both CV and LB. [This method did well even for GBM models in both CV and LB.]
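A sketch of that region-grouped fill, assuming a DataFrame `df` with a 'region' column (the function name and the choice to use "99999" as a string for leftover categoricals are my own additions; the rest follows the description above):

```python
import pandas as pd

def impute_by_region(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # 1) 'region' itself has to be filled first; a dummy label is used here
    df["region"] = df["region"].fillna("Nan")

    num_cols = df.select_dtypes(include="number").columns
    cat_cols = [c for c in df.select_dtypes(exclude="number").columns if c != "region"]

    # 2) numeric columns: mean within each region
    for col in num_cols:
        df[col] = df[col].fillna(df.groupby("region")[col].transform("mean"))

    # 3) categorical columns: mode within each region
    for col in cat_cols:
        mode_by_region = df.groupby("region")[col].transform(
            lambda s: s.mode().iloc[0] if not s.mode().empty else pd.NA
        )
        df[col] = df[col].fillna(mode_by_region)

    # 4) anything still missing (e.g. a region where the column is entirely NaN)
    df[num_cols] = df[num_cols].fillna(99999)
    df[cat_cols] = df[cat_cols].fillna("99999")
    return df
```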
Wow Satti, that is precious - thanks for sharing. You say this did well for GBM models. Does this mean you are using other models here? I was wondering about, and wanted to perhaps try, GAM. GBM seems the right tool for this one, especially if you don't fill the NaNs, but if you fill them, anything goes ... fwiw I also did some tests and found *procuct* (1 then 2) to be the most important variables (no, that is not an NSFW word, it is actually in the data).
Great! Thanks for sharing.
You are welcome, skaak. Yes, I am now using LDA; it scored better than LightGBM and XGBoost (in both CV and LB) in the baseline by a big margin, and it really has good potential to do better.
You are welcome wuuthraad
I suggest that before you decide to drop all the columns with missing values, you look at how many missing values each column actually has. Drop only those with more than 50-60% missing, and impute the remaining ones with the median or mean, depending on your choice.
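A sketch of that two-step rule, assuming a DataFrame `df` and a 50% cutoff (the threshold and function name are placeholders; the choice of median for numeric columns is one option):

```python
import pandas as pd

def drop_then_impute(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    df = df.copy()

    # drop only the columns whose missing ratio exceeds the threshold
    missing_ratio = df.isna().mean()
    df = df.drop(columns=missing_ratio[missing_ratio > threshold].index)

    # impute the rest: median for numeric columns, mode for everything else
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```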
One more thing: before imputing the missing values with the mean or median, check for the presence of outliers. The median is not affected by outliers, so I would suggest using the median rather than the mean.
Before filling or dropping NaNs, another thing that could be useful is to add a null-counter feature over all features or some of them; in this kind of dataset, this feature tends to add some value.
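Minimal sketch of that counter, assuming a DataFrame `df` (the "nan_count" column name is just a placeholder); it has to be computed before any filling, otherwise the information is gone:

```python
import pandas as pd

def add_nan_count(df: pd.DataFrame, cols=None) -> pd.DataFrame:
    df = df.copy()
    cols = cols if cols is not None else df.columns
    # number of missing values per row, over all columns or a chosen subset
    df["nan_count"] = df[cols].isna().sum(axis=1)
    return df
```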
Well, I've tried every trick I know to get GBM to give me the best result I can. Because of all the NaNs I think something like GBM, which can handle them, should have a tiny advantage, but if I look at the LB then either my GBM fu is lacking or I have to switch to something else.
@Satti_Tareq, perhaps I'll try some imputing-a-la-Satti a bit and see if that can give the boost I am looking for.
Yes, it is always good to try new things. I have no big experience with ML, but for me GBM was the solution to every question. Now I see that choosing an algorithm more carefully can give a better result than engineering hundreds of features and fitting a GBM on them.