DSN AI Bootcamp Qualification Hackathon by Data Science Nigeria

Knowledge

Predict customers who will default on a loan

888 data scientists enrolled, 520 on the leaderboard

Nigeria

9 September—3 October

Pre-processing

Please, I'm in a fix. How do I handle the null values? I have tried filling with the mean and with arbitrary negative values like -99.

My score isn't even improving in the slightest.

I have only been using CatBoost.

Tip1:

Drop observations (rows) in which fewer than 70% of the values are present, using the thresh parameter of pandas dropna().

Tip2:

Not all columns should be filled with -999 throughout. Kindly do some in-depth analysis: some missing values call for 0, others for backfill, the mean, etc. Look carefully.

Tip3: Avoid using hold-out cross validation
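For Tip1, here is a minimal sketch of how the thresh parameter of pandas dropna() could be used. The toy DataFrame and its column names are made up for illustration, and reading the 70% cutoff as "at least 70% of a row's values present" is an assumption about what the tip means.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (column names are hypothetical).
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [1.0, np.nan, np.nan, 4.0],
    "d": [1.0, 2.0, 3.0, 4.0],
})

# thresh is the minimum number of non-null values a row needs to be kept.
# 70% of 4 columns is 2.8, so round up to 3.
min_present = int(np.ceil(0.7 * df.shape[1]))

# Rows with fewer than min_present non-null values are dropped.
kept = df.dropna(thresh=min_present)
print(kept)
```

Only the training set should be filtered like this, since every row of the test set still needs a prediction.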

I already used a mix of mean and zero for a few columns, but I could not improve any more than 0.8283

What strategy would you suggest to decide on filling so many NA values, please? I mean, how to decide when to use mean, when to use zero or backfill, or any other method?

If I don't use hold-out cross-validation, what's the alternative, and how can I implement it? Thanks in advance.
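The thread does not name an alternative, but a common one is stratified k-fold cross-validation, where every row is used for validation exactly once. A minimal sketch with scikit-learn follows. The synthetic data, the LogisticRegression model (a CatBoost classifier could be swapped in), and the choice of AUC as the metric are all assumptions for illustration, not details taken from the competition.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced binary data standing in for the loan-default set.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Stratification keeps the default rate roughly equal in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

oof = np.zeros(len(y))  # out-of-fold predicted probabilities
fold_scores = []
for train_idx, valid_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Score only on rows the model did not see during fitting.
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    fold_scores.append(roc_auc_score(y[valid_idx], oof[valid_idx]))

print("AUC per fold:", np.round(fold_scores, 4))
print("Mean AUC:", round(float(np.mean(fold_scores)), 4))
```

The mean across folds is usually a steadier guide than a single hold-out score when comparing different ways of filling missing values.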

Okay, for this: if a column is important and a large percentage of its values are missing, fill it with -1 or -999. Filling with 0 depends on business insight or a hypothesis. For example, if an exam-score column has missing values, it is quite likely that the real value was 0. If the min and max of a column containing missing values are close together, say 36 and 40, filling with the mean will usually work better. And if the data follows a repetitive trend in which each value tracks its lagged neighbours, use backfill. There is no perfect way of filling missing values, but filling with what the real value most likely was is what matters.
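The rules of thumb above could be sketched in pandas roughly as follows. The column names and values are invented purely to mirror the four cases described (zero, mean, backfill, sentinel), so treat this as an illustration rather than a recipe for the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, one per case described above.
df = pd.DataFrame({
    "exam_score": [55.0, np.nan, 70.0, np.nan],         # missing likely means 0
    "narrow_range": [36.0, 40.0, np.nan, 38.0],         # min 36, max 40
    "repeating_trend": [100.0, np.nan, np.nan, 130.0],  # ordered, tracks neighbours
    "mostly_missing": [np.nan, np.nan, 5.0, np.nan],    # important, mostly empty
})

# Share of missing values per column, a useful first look before deciding.
print(df.isna().mean())

filled = df.copy()
# Business hypothesis: no recorded score means a score of 0.
filled["exam_score"] = filled["exam_score"].fillna(0)
# Values sit in a narrow band, so the mean is a reasonable stand-in.
filled["narrow_range"] = filled["narrow_range"].fillna(filled["narrow_range"].mean())
# Repetitive trend: take the next observed value (backfill).
filled["repeating_trend"] = filled["repeating_trend"].bfill()
# Large share missing: use a sentinel the model can split on.
filled["mostly_missing"] = filled["mostly_missing"].fillna(-999)

print(filled)
```

Sentinels such as -999 tend to suit tree-based models like CatBoost, which can isolate them with a split; a linear model would treat -999 as a real magnitude.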

Thanks

Thanks too :)