How would you guys recommend dealing with NaN values?
I noticed that dropping them makes the models perform better than when you impute them. However, the test files also contain NaNs and dropping those will remove IDs as well and you don't get scored with missing IDs.
Any advice?
XGBoost and LightGBM automatically handle NaN values, so there is no actual need to impute them with anything. It is done for you. (If I am not mistaken, both LightGBM and XGBoost treat NaN values as a separate category and include them as part of the splits during the tree-building process.)
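For example, something along these lines works with the NaNs left in place (untested sketch; the file and column names "train.csv", "id" and "target" are just placeholders, and it assumes the features are numeric):

```python
import pandas as pd
import xgboost as xgb
import lightgbm as lgb

train = pd.read_csv("train.csv")          # hypothetical file/column names
X = train.drop(columns=["id", "target"])  # NaNs in the features are left untouched
y = train["target"]

# XGBoost learns a default split direction for missing values at each node
xgb_model = xgb.XGBClassifier(n_estimators=200, learning_rate=0.05)
xgb_model.fit(X, y)

# LightGBM likewise routes missing values during tree building (use_missing=True by default)
lgb_model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
lgb_model.fit(X, y)
```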
But personally, if it's a numeric column I fill it in with the mean, and if it's a categorical column I just take the most frequent value (the mode). Be warned that some of the imputation methods do change the distribution of the data, and in turn this affects model performance. Sometimes for the better, other times you end up like England losing it all at the last minute.
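A rough sketch of that mean / most-frequent fill, assuming a pandas DataFrame `df` and selecting columns by dtype (the function name is just for illustration):

```python
import pandas as pd

def simple_impute(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.select_dtypes(exclude="number").columns

    # numeric columns: fill with the column mean
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

    # categorical columns: fill with the most frequent value (mode)
    for col in cat_cols:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```

sklearn's `SimpleImputer` with `strategy="mean"` or `"most_frequent"` does the same thing if you prefer to keep it inside a pipeline.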
Makes sense. I was using hist_grad_boost but I'll play around with XGBoost as well. Thank you :)
Just to cover the whole family, *catboost* also handles NaN seamlessly, except they are not allowed in categoricals.
Ahhh yes, CatBoost, the "middle child" of them all. Always left out.
In the data-prep section, you could try selecting all the columns except the ID column; then in the modelling section you add it back, but it isn't considered as a predictor.
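Something like this (untested sketch; "id", "target" and the file names are placeholders), so the test IDs are kept for the submission but never fed to the model:

```python
import pandas as pd

train = pd.read_csv("train.csv")   # hypothetical paths
test = pd.read_csv("test.csv")

feature_cols = [c for c in train.columns if c not in ("id", "target")]

X_train, y_train = train[feature_cols], train["target"]
X_test = test[feature_cols]        # test["id"] is untouched, so no rows are lost

# ... fit a model on X_train / y_train, then build the submission with the kept IDs:
# submission = pd.DataFrame({"id": test["id"], "target": model.predict(X_test)})
```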
After a lot of tests, I found that the best way for me is to fill the numerical columns with the mean of the column grouped by the 'region' variable, and the categoricals with the mode grouped on the same 'region' variable. Of course you need to fill 'region' before doing this; I filled it with a dummy name, 'Nan'. There will be a few missing values left after this due to the distribution of the feature values, and filling those with 99999 did a great job.
I did the process at my baseline before adding any engineered features and it gave me a nice improvement in both CV and LB. [This method did well even for GBM models in both CV and LB.]
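A sketch of that region-grouped fill, assuming a DataFrame `df` with a 'region' column (the function name and the choice to use "99999" as a string for leftover categoricals are my own additions; the rest follows the description above):

```python
import pandas as pd

def impute_by_region(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # 1) 'region' itself has to be filled first; a dummy label is used here
    df["region"] = df["region"].fillna("Nan")

    num_cols = df.select_dtypes(include="number").columns
    cat_cols = [c for c in df.select_dtypes(exclude="number").columns if c != "region"]

    # 2) numeric columns: mean within each region
    for col in num_cols:
        df[col] = df[col].fillna(df.groupby("region")[col].transform("mean"))

    # 3) categorical columns: mode within each region
    for col in cat_cols:
        mode_by_region = df.groupby("region")[col].transform(
            lambda s: s.mode().iloc[0] if not s.mode().empty else pd.NA
        )
        df[col] = df[col].fillna(mode_by_region)

    # 4) anything still missing (e.g. a region where the column is entirely NaN)
    df[num_cols] = df[num_cols].fillna(99999)
    df[cat_cols] = df[cat_cols].fillna("99999")
    return df
```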
Wow Satti, that is precious - thanks for sharing. You say this did well for GBM models. Does this mean you are using other models here? I was wondering about, and wanted to perhaps try, GAM. GBM seems the right tool for this one, especially if you don't fill the NaNs, but if you fill them, anything goes ... fwiw I also did some tests and found *procuct* (1 then 2) to be the most important variables (no, that is not an NSFW word, it is actually in the data).
Great! Thanks for sharing.
You are welcome, skaak. Yes, I am now using LDA; it scored better than LightGBM and XGBoost (in both CV and LB) in the baseline by a big margin, and it really has good potential to do better.
You are welcome wuuthraad
I suggest that before you decide to drop all the columns with missing values, you look at how many missing values each column actually has. Drop only those with more than 50-60% missing, and impute the remaining ones with the median or mean, depending on your choice.
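A sketch of that two-step rule, assuming a DataFrame `df` and a 50% cutoff (the threshold and function name are placeholders; the choice of median for numeric columns is one option):

```python
import pandas as pd

def drop_then_impute(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    df = df.copy()

    # drop only the columns whose missing ratio exceeds the threshold
    missing_ratio = df.isna().mean()
    df = df.drop(columns=missing_ratio[missing_ratio > threshold].index)

    # impute the rest: median for numeric columns, mode for everything else
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```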
One more thing: before imputing the missing values with the mean or median, check for the presence of outliers. The median is not affected by outliers, so I would suggest using the median rather than the mean.
Before filling or dropping NaNs, another thing that could be useful is to add a null-counter feature over all features or some of them; in this kind of dataset, this feature tends to add some value.
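Minimal sketch of that counter, assuming a DataFrame `df` (the "nan_count" column name is just a placeholder); it has to be computed before any filling, otherwise the information is gone:

```python
import pandas as pd

def add_nan_count(df: pd.DataFrame, cols=None) -> pd.DataFrame:
    df = df.copy()
    cols = cols if cols is not None else df.columns
    # number of missing values per row, over all columns or a chosen subset
    df["nan_count"] = df[cols].isna().sum(axis=1)
    return df
```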
Well, I've tried every trick I know to get GBM to give me the best result I can. Because of all the NaNs I think something like GBM, which can handle them, should have a tiny advantage, but if I look at the LB then either my GBM fu is lacking or I have to switch to something else.
@Satti_Tareq, perhaps I'll try some imputing-a-la-Satti a bit and see if that can give the boost I am looking for.
Yes, it is always good to try new things. I have no big experience with ML, but for me GBM was the solution to every question. Now I see that choosing an algorithm more carefully can give a better result than engineering hundreds of features and fitting a GBM on them.