22 Apr 2021, 14:47

How to approach a machine learning project (Part 2)

This is the second part of a series on how to approach a machine learning project, using the Zimnat Insurance Assurance Challenge as the example. This article focuses on data preprocessing and visualisation.

You can read part 1 here.

In this tutorial, we focus on data preparation, which entails cleaning the data, engineering features and selecting the data that will be used for analysis.

Feature engineering refers to the process of extracting and transforming variables that will be used for analysis. First, we need to check all the files provided: the training file, the testing file and the submission file. In most cases, the training set and the test set have the same format. It is important to check the submission file to understand how the final results should be presented.
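As a minimal sketch of this first check, the snippet below loads three tiny in-memory stand-ins for the files and compares their columns. The rows are made up for illustration, and the column names (including the `ID X PCODE` submission column) are assumptions about the competition files rather than something shown in this article. With the real data you would pass the file paths to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the competition files (made-up rows;
# column names are assumptions for illustration).
train_csv = io.StringIO(
    "ID,join_date,sex,birth_year,P5DA,RIBP\n"
    "A1,1/2/2019,F,1987,0,1\n"
    "B2,1/6/2019,M,1990,1,0\n"
)
test_csv = io.StringIO(
    "ID,join_date,sex,birth_year,P5DA,RIBP\n"
    "C3,1/8/2019,F,1979,0,0\n"
)
submission_csv = io.StringIO(
    "ID X PCODE,Label\n"
    "C3 X P5DA,0\n"
    "C3 X RIBP,0\n"
)

train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)
submission = pd.read_csv(submission_csv)

# Print the shape and column names of each file side by side.
for name, frame in [("train", train), ("test", test), ("submission", submission)]:
    print(name, frame.shape, list(frame.columns))

# Check whether train and test really share the same format.
same_format = list(train.columns) == list(test.columns)
print("train and test share the same columns:", same_format)
```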

The training set:

The test set:

The submission file:

Feature engineering

Our data has many features, which can hurt a machine learning model: too many features may cause overfitting. The submission file has fewer columns, which hints at which columns to focus on during feature engineering. In data preparation, it is good practice to combine the train set and the test set. This is especially helpful with categorical variables, because it keeps their encoding consistent between the two sets.

It's worth noting that after concatenation the dataset still has the same number of columns.
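The combination step could look roughly like this. The toy frames stand in for the real train and test sets, and keeping `n_train` to split the rows apart again later is an illustrative choice, not necessarily what the author did.

```python
import pandas as pd

# Toy stand-ins for the real train and test sets (made-up rows).
train = pd.DataFrame({"ID": ["A1", "B2"], "sex": ["F", "M"], "P5DA": [0, 1]})
test = pd.DataFrame({"ID": ["C3"], "sex": ["F"], "P5DA": [0]})

# Remember how many rows came from train so the sets can be split again later.
n_train = len(train)

# Stack the rows; the columns are unchanged because both frames share them.
full_df = pd.concat([train, test], ignore_index=True)
print(full_df.shape)

# Recover the two parts after preprocessing.
train_part = full_df.iloc[:n_train]
test_part = full_df.iloc[n_train:]
```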

The describe() function gives a summary of all numerical variables in the dataset. Next, we get the age of the customers and visualize it on a graph.
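A rough sketch of these two steps follows. It assumes the dataset stores a `birth_year` column and derives the age against a fixed reference year; both the column name and `REFERENCE_YEAR` are assumptions for illustration, and the rows are made up. The ages are grouped into ten-year bands so the distribution can be read without a plotting library.

```python
import pandas as pd

# Toy data; `birth_year` is an assumed column name.
full_df = pd.DataFrame(
    {
        "ID": ["A1", "B2", "C3", "D4", "E5"],
        "birth_year": [1987, 1990, 1979, 1985, 1962],
    }
)

# Summary statistics for every numerical column.
print(full_df.describe())

REFERENCE_YEAR = 2020  # assumption: the year used to compute age
full_df["age"] = REFERENCE_YEAR - full_df["birth_year"]

# Group ages into bands: [0, 20), [20, 30), [30, 40), ...
age_bands = pd.cut(full_df["age"], bins=[0, 20, 30, 40, 50, 60, 120], right=False)
band_counts = age_bands.value_counts().sort_index()
print(band_counts)
```

If matplotlib is installed, `full_df["age"].plot.hist()` would draw the same distribution as a histogram.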

Most of the customers are between 30 and 40 years old.

The biggest problem with our data is that it has too many features, which can lead to the curse of dimensionality. It is hard to visualise and analyse data with too many dimensions. A practical way to handle this here is the pandas melt() function, which reshapes a dataset from wide format to long format: the 21 product columns become one product-code column and one value column.

Collect the names of the product columns in a list:

products = ['P5DA', 'RIBP', '8NN1', '7POT', '66FJ', 'GYSR', 'SOP4', 'RVSZ', 'PYUQ', 'LJR9', 'N2MW', 'AHXO', 'BSTQ', 'FM3X', 'K6QO', 'QBOL', 'JWFN', 'JZ9D', 'J9JW', 'GHYX', 'ECY3']

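Apply the melt function:

final_df = full_df.melt(id_vars=full_df.columns[:8], value_vars=products, var_name='PCODE', value_name='Label')
  • id_vars are the columns that are used as identifier variables (here, the first eight columns).
  • value_vars are the columns to unpivot.
  • var_name is the name to use for the ‘variable’ column, which holds the original column names.
  • value_name is the name to use for the ‘value’ column, which holds the values from those columns.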
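To make the wide-to-long reshape concrete, here is a small sketch on a toy frame with two customers and two products. The rows are made up; only the `melt` call mirrors the one used on the full dataset.

```python
import pandas as pd

# Wide format: one row per customer, one column per product.
wide_df = pd.DataFrame(
    {
        "ID": ["A1", "B2"],
        "sex": ["F", "M"],
        "P5DA": [0, 1],
        "RIBP": [1, 0],
    }
)

# Long format: one row per (customer, product) pair.
long_df = wide_df.melt(
    id_vars=["ID", "sex"],        # columns repeated on every output row
    value_vars=["P5DA", "RIBP"],  # columns to unpivot
    var_name="PCODE",             # holds the former column names
    value_name="Label",           # holds the former cell values
)
print(long_df)
```

Two customers and two products give four rows, each with the customer's identifier columns, a product code and a 0/1 label.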

Next, we add the column combiner which has a constant value of “X”.

We need to concatenate the two columns in our dataset as per the submission file:
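One way this could be done is sketched below. It assumes the submission identifier has the form `ID X PCODE` with single spaces around the combiner, and that the new column carries that same name; both are assumptions about the submission file, and the rows are made up.

```python
import pandas as pd

# Toy long-format data (made-up rows).
final_df = pd.DataFrame(
    {"ID": ["A1", "B2"], "PCODE": ["P5DA", "RIBP"], "Label": [0, 1]}
)

# Constant combiner column, as described above.
final_df["combiner"] = "X"

# Join the customer ID, the combiner and the product code into one key.
# The separator and the column name "ID X PCODE" are assumptions.
final_df["ID X PCODE"] = (
    final_df["ID"] + " " + final_df["combiner"] + " " + final_df["PCODE"]
)
print(final_df[["ID X PCODE", "Label"]])
```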

The Label column consists of two values: 0 and 1.

Label encoding

Next, our dataset has several categorical features that need to be label encoded.

Label encoding converts each category into a number so that a model can read it easily. The columns we should pay attention to are: branch_code, occupation_code, occupation_category_code, PCODE, sex and marital_status.

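from sklearn.preprocessing import LabelEncoder

# Keep each fitted encoder so the codes can be mapped back to labels later.
label_object = {}
categorical_columns = ['branch_code', 'occupation_code', 'occupation_category_code', 'PCODE', 'sex', 'marital_status']
for col in categorical_columns:
    labelencoder = LabelEncoder()
    # fit_transform learns the classes and encodes the column in one step.
    final_df[col] = labelencoder.fit_transform(final_df[col])
    label_object[col] = labelencoder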

From this we get a fully encoded dataset!
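Storing the fitted encoders in label_object is useful because the numeric codes can later be turned back into the original labels, for example when writing product codes into a submission. A minimal sketch on toy data, using only two of the columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data with two categorical columns (made-up rows).
final_df = pd.DataFrame(
    {"sex": ["F", "M", "F"], "PCODE": ["RIBP", "P5DA", "RIBP"]}
)

label_object = {}
for col in ["sex", "PCODE"]:
    encoder = LabelEncoder()
    final_df[col] = encoder.fit_transform(final_df[col])
    label_object[col] = encoder

print(final_df)

# Map the numeric codes back to the original product codes.
decoded = label_object["PCODE"].inverse_transform(final_df["PCODE"])
print(list(decoded))
```

LabelEncoder assigns codes in sorted order of the class labels, so the codes carry no meaning beyond identity.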

Part 3 of this series will focus on modelling and deployment.

Important Resources:

Pandas.melt() Function

Zimnat Competition

Recommender Systems

Special thanks to Nairobi Women in Machine Learning & Data Science and AI Kenya.

About the author

Wawira is a software developer and data scientist. Follow her on Twitter, @jlcodes.

Read the original article here.