DSN AI Bootcamp Qualification Hackathon by Data Science Nigeria
Knowledge
Predict customers who will default on a loan
876 data scientists enrolled, 506 on the leaderboard
Financial Services, Prediction, Structured
Nigeria
9 September—3 October
Preprocessing
published 11 Sep 2020, 14:01

Hi, is there any way to handle the missing values without dropping any columns?

Fill them in, probably with the mean, mode, median, or zeros.

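You can fill with zeros, but that is not advisable here because some of the features already contain genuine zeros, so the filled values would be indistinguishable from real ones. Filling with the mean, median, or mode is usually safer. If you are using tree-based models, imputation is often unnecessary: gradient-boosted tree libraries such as XGBoost, LightGBM, and CatBoost handle missing values natively. Check the documentation of the specific model, though, since not every tree implementation accepts NaN.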

Do you know how to fill all columns with the mean in one line of code, or in a more efficient way than a for loop?

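I usually use df['column_name'] = df['column_name'].fillna(df['column_name'].mean()). Note that fillna returns a new Series, so the result has to be assigned back.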
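For the "one line, no for loop" question, one possible approach is to pass the column means to DataFrame.fillna, which matches them to columns by name. The small dataframe below is invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical column names and values)
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "income": [100.0, 200.0, np.nan],
    "state": ["Lagos", None, "Kano"],
})

# Fill every numeric column with its own mean in a single call.
# numeric_only=True skips text columns, which are left untouched.
df_filled = df.fillna(df.mean(numeric_only=True))
```

Non-numeric columns such as "state" still contain their missing values afterwards and need a separate strategy, for example the mode.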

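You can use the SimpleImputer class provided in sklearn.impute:

import pandas as pd
from sklearn.impute import SimpleImputer

# strategy='mean' fills with the column mean (works on numeric columns only)
imputer = SimpleImputer(strategy='mean')
imputer.fit(df)

# transform returns a NumPy array, so wrap it to get a DataFrame back
filled_df = pd.DataFrame(imputer.transform(df), columns=df.columns, index=df.index)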
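One reason to prefer an imputer object over a plain fillna is that it remembers the statistics it was fitted on, so the training means can be reused on the test set. A small sketch of that pattern, with made-up train and test frames:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train/test frames (hypothetical values)
train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})
test = pd.DataFrame({"a": [np.nan, 5.0], "b": [np.nan, 40.0]})

# Learn the column means from the training data only
imputer = SimpleImputer(strategy="mean")
imputer.fit(train)

# Apply the same training means to both sets
train_filled = pd.DataFrame(
    imputer.transform(train), columns=train.columns, index=train.index
)
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)
```

Here the missing test values are filled with the training means (2.0 for "a", 15.0 for "b"), so no information from the test set leaks into preprocessing.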

You can also forward fill (ffill) and/or backward fill (bfill), e.g. df.ffill() or df.bfill(). The older spelling df.fillna(method='ffill') is deprecated in recent pandas versions. Check https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html for more.
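A quick sketch of what forward and backward fill do on a toy column (the values are invented). Both depend on row order, so they make most sense when the rows have a meaningful sequence, such as dates.

```python
import numpy as np
import pandas as pd

# Toy column with gaps at the start, middle, and end
df = pd.DataFrame({"x": [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan]})

# Forward fill: carry the last seen value downwards (a leading NaN stays)
ffilled = df.ffill()

# Backward fill: pull the next seen value upwards (a trailing NaN stays)
bfilled = df.bfill()

# Chaining both fills every gap as long as the column has at least one value
both = df.ffill().bfill()
```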

If I want to fill all the columns with the same method, say the mean, I sometimes write a function and use aggregate to apply it.

def fill_with_mean(col):
    return col.fillna(col.mean())

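You can then apply it to every column of the dataframe (df) and keep the result by writing:

df = df.agg(fill_with_mean)

This assumes all columns are numeric, since the mean is not defined for text columns. df.apply(fill_with_mean) gives the same result and is the more common spelling for a column-wise transformation.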
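If the dataframe mixes numeric and text columns, the same idea can be extended so each column picks a strategy by its dtype. This is only a sketch: fill_by_dtype is a hypothetical helper name, and the mean/mode split is one reasonable choice among those mentioned earlier in the thread.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fill_by_dtype(col):
    """Fill numeric columns with the mean and other columns with the mode."""
    if is_numeric_dtype(col):
        return col.fillna(col.mean())
    mode = col.mode(dropna=True)
    # An all-missing column has no mode, so it is returned unchanged
    return col.fillna(mode.iloc[0]) if not mode.empty else col


# Toy data (hypothetical column names and values)
df = pd.DataFrame({
    "amount": [100.0, np.nan, 300.0],
    "branch": ["A", "A", None],
})

# apply calls the helper once per column and reassembles a dataframe
df_filled = df.apply(fill_by_dtype)
```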