
Sasol Customer Retention Recruitment Competition
Helping South Africa

R10 000 ZAR · Prediction · Job Opportunity
Challenge completed ~2 years ago
253 joined · 56 active
Start: Oct 05, 23 · Close: Nov 26, 23 · Reveal: Nov 26, 23
Dealing with NaN
Help · 22 Oct 2023, 16:02 · 15

How would you guys recommend dealing with NaN values?

I noticed that dropping them makes the models perform better than imputing them does. However, the test files also contain NaNs, and dropping those rows removes their IDs as well; you don't get scored with missing IDs.

Any advice?

Discussion · 15 answers
wuuthraad

XGBoost and LightGBM handle NaN values automatically, so there is no real need to impute them with anything; it is done for you. (If I am not mistaken, both LightGBM and XGBoost treat NaN values as a separate category and include them as part of the splits during the tree-building process.)
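A minimal sketch of what that looks like in practice (assuming the xgboost and lightgbm packages are installed; the toy data below is not the competition data):

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Toy frame with missing values left exactly as they are
X = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0, 2.0, np.nan],
    "f2": [np.nan, 0.5, 0.7, np.nan, 0.1, 0.9],
})
y = pd.Series([0, 1, 0, 1, 0, 1])

# Both libraries accept NaN in the feature matrix directly,
# so no imputation step is needed before fitting.
XGBClassifier(n_estimators=50).fit(X, y)
LGBMClassifier(n_estimators=50, min_child_samples=1).fit(X, y)
```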

But personally, if it's a numeric column I fill it in with the mean, and if it's a categorical column I just take the most frequent value (the mode). Be warned: some imputation methods do change the distribution of the data, and in turn this affects model performance. Sometimes for the better, other times you end up like England, losing it all at the last minute.
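In case it helps, a quick sketch of that simple per-column strategy (assuming a pandas DataFrame called df; the columns here are made up):

```python
import pandas as pd

# Hypothetical frame with numeric and categorical gaps
df = pd.DataFrame({
    "tenure": [12.0, None, 30.0, 7.0],
    "channel": ["web", None, "web", "branch"],
})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())      # numeric -> mean
    else:
        df[col] = df[col].fillna(df[col].mode()[0])   # categorical -> most frequent
```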

22 Oct 2023, 16:09
Upvotes 2

Makes sense. I was using hist_grad_boost but I'll play around with XGBoost as well. Thank you :)
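For reference, if hist_grad_boost means scikit-learn's histogram-based gradient boosting, it also accepts NaNs natively; a minimal sketch:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 0.7],
              [4.0, 0.9], [1.5, 0.2], [3.5, np.nan]])
y = np.array([0, 1, 0, 1, 0, 1])

# During training, samples with missing values are sent to whichever
# child of each split gives the better gain, so no imputation is required.
clf = HistGradientBoostingClassifier(max_iter=50).fit(X, y)
```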

skaak
Ferra Solutions

Just to cover the whole family, *catboost* also handles NaN seamlessly, except they are not allowed in categoricals.
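A small sketch of that behaviour (assuming the catboost package; data and column names are made up): NaN is fine in numeric features, but a categorical column has to be filled, e.g. with a placeholder string, before fitting.

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool

df = pd.DataFrame({
    "num": [1.0, np.nan, 3.0, 4.0],   # NaN in numeric features is handled natively
    "cat": ["a", None, "a", "b"],
})
# Categorical features may not contain NaN, so fill them with a placeholder first:
df["cat"] = df["cat"].fillna("missing")
y = [0, 1, 0, 1]

pool = Pool(df, y, cat_features=["cat"])
CatBoostClassifier(iterations=50, verbose=0).fit(pool)
```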

wuuthraad

Ahhh yes, CatBoost, the "middle child" of them all. Always left out.

In the data-prep section, you could try selecting all the columns except the ID column; then in the modelling section you add it back, but it isn't used as a predictor.
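Something like this, with placeholder file and column names (not necessarily the competition ones):

```python
import pandas as pd

train = pd.read_csv("Train.csv")   # assumed file names
test = pd.read_csv("Test.csv")

# Keep the ID out of the predictors, but don't drop the rows that carry it
feature_cols = [c for c in train.columns if c not in ("ID", "target")]
X_train, y_train = train[feature_cols], train["target"]
X_test = test[feature_cols]

# ... fit on X_train / y_train, predict on X_test ...
# The ID column only comes back when building the submission file:
# submission = pd.DataFrame({"ID": test["ID"], "target": preds})
```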

22 Oct 2023, 19:57
Upvotes 0
Satti_Tareq

After a lot of tests, I found that the best approach for me is to fill the numerical columns with the mean of the column grouped by the 'region' variable, and the categoricals with the mode grouped by the same 'region' variable. Of course, you need to fill 'region' itself before doing this; I filled it with a dummy name, 'Nan'. There will be a few missing values left after this due to the distribution of the feature values; filling those with 99999 did a great job.
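Roughly, in pandas terms (a sketch, assuming a DataFrame df with a 'region' column; the exact handling is simplified):

```python
import pandas as pd

def impute_by_region(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Fill 'region' itself first with a dummy category
    df["region"] = df["region"].fillna("Nan")

    for col in df.columns:
        if col == "region":
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            # numeric: mean of the column within each region
            df[col] = df[col].fillna(df.groupby("region")[col].transform("mean"))
        else:
            # categorical: mode of the column within each region
            df[col] = df[col].fillna(
                df.groupby("region")[col].transform(
                    lambda s: s.mode()[0] if not s.mode().empty else pd.NA
                )
            )

    # The few values still missing (e.g. a region where a column is entirely NaN)
    return df.fillna(99999)
```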

I did the process at my baseline, before adding any engineered features, and it gave me a nice improvement in both CV and LB. [This method did well even for GBM models, in both CV and LB.]

23 Oct 2023, 05:43
Upvotes 3
skaak
Ferra Solutions

Wow Satti, that is precious - thanks for sharing. You say this did well for GBM models. Does this mean you are using other models here? I was wondering about, and wanted to perhaps try, GAM. GBM seems the right tool for this one, especially if you don't fill the NaNs, but if you fill them, anything goes ... fwiw I also did some tests and found *procuct* (1, then 2) to be the most important variables (no, that is not an NSFW word, it is actually in the data).

wuuthraad

Great! Thanks for sharing.

Satti_Tareq

You are welcome, skaak. Yes, I am now using LDA; it scored better than LightGBM and XGBoost (in both CV and LB) at the baseline by a big margin, and it really has good potential to do better.
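(In case anyone wants to try it: assuming "LDA" means scikit-learn's LinearDiscriminantAnalysis, a minimal sketch would look like the following; note that unlike the GBMs it needs a complete matrix, so impute first. The data here is synthetic.)

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan   # sprinkle in some missing values

# LDA cannot handle NaN, hence the imputer in front of it
model = make_pipeline(SimpleImputer(strategy="mean"), LinearDiscriminantAnalysis())
model.fit(X, y)
print(model.score(X, y))
```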

Satti_Tareq

You are welcome wuuthraad

University of Johannesburg

I suggest that before you decide to drop all the columns with missing values, you first look for the columns with the most missing values, say more than 50-60%, and drop those in one go; the remaining ones you impute with the median or mean, depending on your choice.
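A rough sketch of that two-step approach (assuming a pandas DataFrame df and a 60% cut-off):

```python
import pandas as pd

def reduce_missing(df: pd.DataFrame, threshold: float = 0.6) -> pd.DataFrame:
    df = df.copy()
    missing_ratio = df.isna().mean()   # fraction of missing values per column
    # Drop the very sparse columns in one go
    df = df.drop(columns=missing_ratio[missing_ratio > threshold].index)
    # Impute what remains (median here; mean is the other obvious choice)
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
    return df
```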

23 Oct 2023, 08:04
Upvotes 0
University of Johannesburg

Once more: before imputing the missing values with the mean or median, check for the presence of outliers. The median is not affected by outliers, so I would suggest you use the median rather than the mean.
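A tiny illustration of why (made-up numbers):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 9, 1000])   # one extreme outlier
print(s.mean())    # ~175.8, dragged far up by the outlier
print(s.median())  # 11.5, essentially unaffected
```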

23 Oct 2023, 08:07
Upvotes 0
Satti_Tareq

Before filling or dropping NaNs, another thing that could be useful is a null-counter feature over all features, or some of them; in this kind of dataset such a feature tends to add some value.
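For example, something along these lines (a sketch; compute it before any imputation or row dropping):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 1.0]})

# Row-wise count of missing values, added as an extra feature
df["null_count"] = df.isna().sum(axis=1)
```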

25 Oct 2023, 03:34
Upvotes 1
skaak
Ferra Solutions

Well, I've tried every trick I know to get GBM to give me the best result I can. Because of all the NaNs I think something like GBM, which can handle them, should have a tiny advantage, but if I look at the LB then either my GBM fu is lacking or I have to switch to something else.

@Satti_Tareq, perhaps I'll try a bit of imputing à la Satti and see if that can give the boost I am looking for.

Satti_Tareq

Yes, it is always good to try new things. I don't have much experience with ML, but for me GBM was the solution to every question; now I see that choosing an algorithm more carefully can give a better result than engineering hundreds of features and fitting a GBM on them.