Please, I'm in a fix. How do I handle the null values? I have tried filling with the mean and with arbitrary negative values like -99.
My score isn't even improving in the slightest.
I have only been using CatBoost.
Drop the observations (or columns) in which less than 70% of the values are present, using the thresh argument of pandas dropna().
Not all columns should be filled with -999, though. Do some in-depth analysis: some missing values call for 0, others for backfill, the mean, etc. Look carefully.
Tip 3: Avoid using hold-out cross-validation.
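The threshold-based drop from the first tip could look roughly like this in pandas. This is only a sketch: the toy frame, the column names and the helper drop_sparse are made up for illustration, and 0.7 is taken from the 70% mentioned above. Note that thresh counts non-missing values, so it has to be converted from a fraction to a count.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

def drop_sparse(frame, min_frac=0.7, axis=1):
    """Keep columns (axis=1) or rows (axis=0) with at least min_frac non-missing values."""
    # For columns we count over rows, and vice versa.
    n = frame.shape[0] if axis == 1 else frame.shape[1]
    # dropna's thresh is the minimum number of non-missing values required to keep.
    thresh = int(np.ceil(min_frac * n))
    return frame.dropna(axis=axis, thresh=thresh)

cols_kept = drop_sparse(df, 0.7, axis=1)  # drops columns that are too empty
rows_kept = drop_sparse(df, 0.7, axis=0)  # drops rows that are too empty
print(list(cols_kept.columns))
print(list(rows_kept.index))
```

Whether to drop by row or by column depends on the dataset; with many sparse columns, dropping columns first usually loses less data.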
I already used a mix of mean and zero for a few columns, but I could not improve any more than 0.8283
What strategy would you suggest to decide on filling so many NA values, please? I mean, how to decide when to use mean, when to use zero or backfill, or any other method?
If I don't use hold-out cross-validation, what's the alternative? And how can I implement it? Thanks in advance.
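One common alternative to a single hold-out split is k-fold cross-validation, where every row is used for validation exactly once and the scores are averaged. Below is a minimal sketch with scikit-learn, assuming a binary classification task scored by ROC AUC (the thread does not say which metric is used). The synthetic data and the GradientBoostingClassifier are stand-ins only; a CatBoost model with the scikit-learn interface could be dropped in the same place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the competition data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Stand-in model; replace with the model actually being tuned.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# 5 stratified folds: each fold keeps roughly the same class balance as the full data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold; the mean is a steadier estimate than a single hold-out score.
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("fold scores:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

Comparing imputation strategies on the mean fold score, rather than on one split, makes it easier to tell a real improvement from noise.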
Okay, for this: if the missing values are important and make up a large percentage of the column, fill them with -1 or -999. Filling with 0 depends on business insight or a hypothesis. For example, if an exam-score column has missing values, the real value in practice is very likely 0. If the column containing missing values has a narrow range, say a min of 36 and a max of 40, filling with the mean will work quite well. And if the data tends to repeat its neighbouring (lagged) values, use backfill. There is no perfect way of filling missing values, but filling with the most probable real value is what matters.
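The per-column rules above can be sketched in pandas as follows. The frame and its column names are invented for illustration, and which rule fits which column is an assumption that has to come from looking at the real data.

```python
import numpy as np
import pandas as pd

# Toy data; all column names are hypothetical.
df = pd.DataFrame({
    "exam_score":   [55.0, np.nan, 70.0, np.nan],   # missing most likely means 0
    "narrow_range": [36.0, 40.0, np.nan, 38.0],     # values sit between 36 and 40
    "repeating":    [np.nan, 10.0, np.nan, 12.0],   # tends to repeat neighbouring values
    "mostly_empty": [np.nan, np.nan, 5.0, np.nan],  # large share of missing values
})

filled = df.copy()
# Business hypothesis: a missing exam score is really a 0.
filled["exam_score"] = df["exam_score"].fillna(0)
# Narrow range: the mean is close to any plausible true value.
filled["narrow_range"] = df["narrow_range"].fillna(df["narrow_range"].mean())
# Repeating trend: take the next observed value (backfill).
filled["repeating"] = df["repeating"].bfill()
# Many missing values: use a sentinel so a tree model can split on "was missing".
filled["mostly_empty"] = df["mostly_empty"].fillna(-999)

print(filled)
```

When a mean is used, computing it on the training rows only and reusing it for validation and test rows avoids leaking information between splits.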
Thanks too :)