Financial Inclusion in Africa
Who is most likely to have a bank account?
1521 data scientists enrolled, 740 on the leaderboard
29 July 2019
Dependent categories in variables: what to do in the data cleaning process?
published 16 Aug 2019, 15:50

During data cleaning, I'm finding rows where someone is married but has household_size == 1. Should I set both values to NaN, or should I drop the row? I suspect that assuming the marital status is right and correcting household_size is a bad approach, since both can be equally wrong. No other variable supports the assumption that one is more correct than the other, except maybe age in some cases, and that is not always viable. What do you think? I want to avoid dropping rows, since the test set (at least the public one) shares the same flaws.
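One option besides setting values to NaN or dropping rows is to keep every row and add an indicator column for the inconsistency. Below is a minimal sketch on toy data. The column name `marital_status`, the label `"Married"`, and the new column `married_alone` are assumptions for illustration and may not match the competition files.

```python
import pandas as pd

# Toy data; column names and category labels are assumptions for illustration.
df = pd.DataFrame({
    "marital_status": ["Married", "Single", "Married", "Widowed"],
    "household_size": [1, 1, 4, 2],
})

# Flag rows where a married respondent has a household of one.
# The row is kept, so the model can still use its other features.
df["married_alone"] = (
    (df["marital_status"] == "Married") & (df["household_size"] == 1)
).astype(int)

n_flagged = int(df["married_alone"].sum())
print(f"{n_flagged} of {len(df)} rows flagged")
```

Because the test set has the same flaws, the same flag can be computed there too, so no rows need to be removed from either set.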

The dataset has been changed to a new one.


I have the v2 set. I think they only fixed some weird values in household_size.

Also, there are many respondents who are 16 years old and are listed as a parent or spouse.

That's correct, there are many similar observations. How did you deal with them? Did you drop them, or impute the numerical features? I tried interpolating the numerical features, but that seems to change nothing.

Seems like you figured out what to do about data cleaning in the end, so what did you do with the nonsense values?

It makes sense not to do anything, since the test and train sets are sampled from the same dataset. Cleaning made my score worse. So the answer is nothing: I did no cleaning whatsoever. But there might be a way of cleaning that could improve your score that I don't know of yet :D

So leave it as it is and focus more on modeling?

I'm not sure it's the way to go, but since it gave me a better LB score than when I tried to clean, I stopped setting nonsense values to NaN and then imputing them. It might be the way I'm doing the cleaning, or how the data is, or it could apply only to the public data, and the private data will be different.
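For reference, the "set to NaN, then impute" approach described here could look roughly like the sketch below. It uses toy data and median imputation, which is only one possible choice; the column name `marital_status`, the label `"Married"`, and the output column `household_size_imputed` are assumptions, not the poster's actual code.

```python
import pandas as pd

# Toy data; column names and category labels are assumptions for illustration.
df = pd.DataFrame({
    "marital_status": ["Married", "Single", "Married", "Married", "Single"],
    "household_size": [1, 1, 4, 6, 2],
})

suspicious = (df["marital_status"] == "Married") & (df["household_size"] == 1)

# Series.mask replaces values with NaN wherever the condition is True.
cleaned = df["household_size"].mask(suspicious)

# Impute the blanked values with the median of the remaining ones.
median_size = cleaned.median()
df["household_size_imputed"] = cleaned.fillna(median_size)

print(df)
```

Comparing cross-validation scores with the original and the imputed column is a simple way to check whether the cleaning helps or hurts.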

I even dropped the nonsense values, but still no improvement. Anyway, thanks.

I haven't tried dropping the 'dirty' rows yet; that was something I was going to try. I will do that and see if it improves my score.

Dropping the nonsense values decreases the score because the classifier, despite the 'dirt', is predicting most of those data points correctly. They are mostly 0s with some 1s, and the classifier is predicting almost all the 0 rows correctly and getting all the 1s wrong. If you drop them, you remove the 1s that got misclassified, but you also remove the majority of 0s that get classified correctly, hence the decrease in the score.
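A quick way to check this kind of claim is to look at the predictions on the flagged rows only. The sketch below uses made-up labels and predictions (not competition results) that mimic the pattern described: mostly 0s predicted correctly and the few 1s all missed.

```python
import numpy as np

# Made-up labels and predictions for the flagged rows only (toy values).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# Share of flagged rows predicted correctly.
accuracy = float((y_true == y_pred).mean())

# Share of the true 1s that the classifier actually found.
positives = y_true == 1
recall_pos = float((y_pred[positives] == 1).mean())

print(f"accuracy on flagged rows: {accuracy:.2f}")
print(f"recall on the 1s: {recall_pos:.2f}")
```

In this toy case the flagged rows are predicted correctly 80% of the time even though none of the 1s are found, which is why removing them would take away more correct predictions than errors.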

Tried it. It doesn't work very well unless you feature engineer something that could replace the information coming from that feature. I also tried binning the numerical variables; still no improvement. The only thing that improved my CV a little was feature interactions.
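For anyone wanting to try binning and feature interactions, here is a minimal sketch with pandas on toy data. The column names `age` and `marital_status`, the bin edges, and the labels are all assumptions for illustration, not the features actually used above.

```python
import pandas as pd

# Toy data; column names, bin edges and labels are assumptions for illustration.
df = pd.DataFrame({
    "age": [16, 25, 40, 70],
    "marital_status": ["Married", "Single", "Married", "Widowed"],
})

# Bin the numerical variable into labelled ranges (right edge inclusive).
df["age_bin"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60, 120],
    labels=["0-18", "19-35", "36-60", "61+"],
)

# Interaction feature: combine two categorical variables into one.
df["age_x_marital"] = df["age_bin"].astype(str) + "_" + df["marital_status"]

print(df["age_x_marital"].tolist())
```

An interaction like this gives the model a single category for combinations such as a very young married respondent, which may capture some of the odd cases discussed in this thread.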