Financial Inclusion in Africa
Can you predict who in Africa is most likely to have a bank account?
Prize: Knowledge
Time: Active
Participants: 1153 active · 3717 enrolled
Helping: Africa
Tags: Good for beginners, Prediction, Financial Services
Dependent categories in variables: what to do in the data cleaning process?
Data · 16 Aug 2019, 15:50 · 14

In the data cleaning process, I'm facing an issue where I find someone married with household_size == 1. Should I set both values to NaN, or should I drop the row? I'm guessing that assuming the marital status is right and correcting household_size is a bad approach, since both can be equally wrong (no other variable, except maybe age in some cases, reliably supports the assumption that one is more correct than the other). What do you think? I want to avoid dropping rows since the test set shares the same flaws, at least the public one.
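The two options in the question can be sketched with pandas. This is only a minimal illustration on toy rows: the column names `marital_status` and `age_of_respondent`, the category label "Married", and the helper column `married_alone_flag` are assumptions made for the example, and only `household_size` is named in the post.

```python
import pandas as pd

# Toy rows. Column names and category labels are assumptions, not the real schema.
df = pd.DataFrame({
    "marital_status": ["Married", "Single", "Married", "Widowed"],
    "household_size": [1, 1, 4, 2],
    "age_of_respondent": [34, 22, 16, 70],
})

# Detect the contradiction without deciding which field is the wrong one.
inconsistent = (df["marital_status"] == "Married") & (df["household_size"] == 1)

# Option A: keep every row and expose the contradiction as a feature.
df["married_alone_flag"] = inconsistent.astype(int)

# Option B: blank both fields on the flagged rows.
# where() keeps values where the condition is True and writes NaN elsewhere.
cleaned = df.copy()
cleaned["household_size"] = cleaned["household_size"].where(~inconsistent)
cleaned["marital_status"] = cleaned["marital_status"].where(~inconsistent)
```

Option A never discards information, so it can be applied identically to the test set; option B needs an imputation step afterwards.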

Discussion 14 answers

The dataset has been changed to a new one.

I have the v2 set. They fixed some weird values regarding household_size only, I think.

Also, there are many respondents who are 16 years old and are listed as a parent or spouse.

That's correct, there are many similar observations of that sort. How did you deal with them? Did you drop them? Impute the numerical features? I tried to interpolate the numerical features but that seems to change nothing.

Seems like you knew what to do in data cleaning in the end, so what did you do with the nonsense values?

It makes sense not to do anything, since the test data and the train set are sampled from the same dataset. Cleaning made my score worse. So the answer is nothing, I did no cleaning whatsoever. But there might be a way of cleaning that could improve your score which I don't know yet :D

Leaving it like that and focusing more on modeling?

I'm not sure it's the way to go, but since it gave me a better LB score than when I tried to clean, I stopped setting nonsense values to NaN and then imputing. It might be the way I'm doing the cleaning, or how the data is, or it could be just the public data and the private one will be different.
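For reference, the "set to NaN, then impute" step could look roughly like the sketch below. It assumes median imputation (the post does not say which imputer was used), and `blank_and_impute`, `married_alone` and the `marital_status` column are hypothetical names. The fill value is learned on the train set only and reused on the test set, so both get the same treatment.

```python
import numpy as np
import pandas as pd

def blank_and_impute(train, test, mask_fn, column):
    """Set `column` to NaN where mask_fn(frame) is True, then fill with the train median."""
    train, test = train.copy(), test.copy()
    for part in (train, test):
        part[column] = part[column].astype(float)  # float so NaN can be stored
        part.loc[mask_fn(part), column] = np.nan
    fill_value = train[column].median()  # learned on train only
    train[column] = train[column].fillna(fill_value)
    test[column] = test[column].fillna(fill_value)
    return train, test

# Hypothetical rule: a married respondent cannot live alone.
def married_alone(frame):
    return (frame["marital_status"] == "Married") & (frame["household_size"] == 1)

train = pd.DataFrame({"marital_status": ["Married", "Married", "Single", "Married"],
                      "household_size": [1, 4, 2, 6]})
test = pd.DataFrame({"marital_status": ["Married", "Single"],
                     "household_size": [1, 1]})

train_out, test_out = blank_and_impute(train, test, married_alone, "household_size")
```

Comparing cross-validation scores with and without this step is the quickest way to check whether the cleaning helps on this data.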

I even dropped the nonsense values but still no improvement. Anyway, thanks.

I haven't tried dropping the 'dirty' rows yet; that was something I was going to try. I will do that and see if it improves my score.

Dropping the nonsense values decreases the score because the classifier, despite the 'dirt', is predicting most of those data points correctly. They are mostly 0 with some 1s, and the classifier is predicting almost all the zero rows correctly and getting all the ones wrong. If you drop them, you'll remove the ones that got misclassified, but you'll also be removing the majority of zeros that get classified correctly, hence the decrease in the score.
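The averaging argument can be checked with a small calculation. All counts below are invented purely for illustration (they are not taken from the competition data), and the example only covers accuracy measured on a validation set that still contains the dirty rows.

```python
def accuracy(correct, total):
    return correct / total

# Invented counts for illustration only.
clean_total, clean_correct = 900, 765   # 85% accuracy on the consistent rows
dirty_zeros, dirty_ones = 90, 10        # dirty rows are mostly class 0
dirty_correct = dirty_zeros             # every zero predicted right, every one wrong

# Accuracy when the dirty rows are kept in the evaluation.
acc_with_dirty = accuracy(clean_correct + dirty_correct,
                          clean_total + dirty_zeros + dirty_ones)

# Accuracy when the dirty rows are dropped.
acc_without_dirty = accuracy(clean_correct, clean_total)

print(acc_with_dirty, acc_without_dirty)
```

With these numbers the dirty subset is predicted at 90% accuracy, above the 85% of the rest, so removing it lowers the overall figure even though every misclassified 1 goes with it.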

What about dropping features?

Tried it. It doesn't work very well unless you feature engineer something that could replace the information coming from that feature. I also tried binning the numerical variables, still no improvement. The only thing that improved my CV a little was feature interactions.
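A rough sketch of the two ideas mentioned here, binning a numerical variable and building a feature interaction, is shown below. The column names (`age_of_respondent`, `location_type`, `cellphone_access`), the bin edges, and the choice of which features to cross are all assumptions for the example; the post does not say which interactions were used.

```python
import pandas as pd

# Toy rows. Column names and values are assumptions for illustration.
df = pd.DataFrame({
    "age_of_respondent": [16, 25, 40, 67],
    "location_type": ["Rural", "Urban", "Urban", "Rural"],
    "cellphone_access": ["No", "Yes", "Yes", "No"],
})

# Binning: turn age into ordered buckets (intervals are right-inclusive by default).
df["age_bin"] = pd.cut(
    df["age_of_respondent"],
    bins=[0, 18, 35, 60, 120],
    labels=["<=18", "19-35", "36-60", "61+"],
)

# Interaction: cross two categorical features into a single combined category.
df["location_x_cellphone"] = df["location_type"] + "_" + df["cellphone_access"]

# One-hot encode the engineered columns so a model can consume them.
encoded = pd.get_dummies(df[["age_bin", "location_x_cellphone"]])
```

Tree-based models can learn such crosses on their own to some extent, so the gain from explicit interactions is worth checking with cross-validation rather than assuming.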