4 Mar 2021, 11:40

How to approach a machine learning project (Part 1)

A complete guide to approaching typical Machine Learning projects.


This tutorial series covers the following:

  1. Basic Data Science Concepts.
  2. Introduction to the CRISP-DM Methodology (each step explained).
  3. Solving the Machine Learning Problem using the CRISP-DM.

In this guide, we will use the Zimnat insurance competition on Zindi to explain some data science concepts.

The Challenge

'For insurance markets to work well, insurance companies need to be able to pool and spread risk across a broad customer base. This works best where the population to be insured is diverse and large. In Africa, formal insurance against risk has been hampered by lack of private sector companies offering insurance, with no way to diversify and pool risk across populations.
Understanding the varied insurance needs of a population, and matching them to appropriate products offered by insurance companies, makes insurance more effective and makes insurance companies more successful.
At the heart of this, understanding the consumer of insurance products helps insurance companies refine, diversify, and market their product offerings. Increased data collection and improved data science tools offer the chance to greatly improve this understanding.
In this competition, you will leverage data and ML methods to improve market outcomes for insurance provider Zimnat, by matching consumer needs with product offerings in the Zimbabwean insurance market'

Type Of Problem

The competition focuses on building a predictive model that matches insurance products to the individual needs of each customer, so that their personal preferences are prioritized.

In other words, we are building a recommendation engine: a subclass of information filtering that seeks to predict the "rating" or "preference" a user would give to an item.

Content-based filtering will be used in this competition. The system measures the similarity between products based on their content or description, and uses the customer's previous history, along with their personal information, to find similar products the customer may like.
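To make the idea of "similarity between products" concrete, here is a minimal sketch in plain Python. It uses cosine similarity, which is one common choice and not necessarily the measure used later in this series. The product names and their features are made up for illustration and do not come from the Zimnat data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical products described by made-up binary features:
# [covers_vehicle, covers_health, covers_life]
products = {
    "motor_basic": [1, 0, 0],
    "motor_plus": [1, 0, 1],
    "health_cover": [0, 1, 0],
}


def most_similar(name):
    """Return the other product whose features are closest to `name`."""
    target = products[name]
    scores = {
        other: cosine_similarity(target, features)
        for other, features in products.items()
        if other != name
    }
    return max(scores, key=scores.get)


print(most_similar("motor_basic"))
```

A customer who already holds the first product would be shown the product that shares the most features with it.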

CRISP-DM Methodology

CRISP-DM (Cross-Industry Standard Process for Data Mining) is a structured approach to data mining and machine learning projects. This series walks through the following phases:

  1. Business Understanding - Part 1
  2. Data Understanding - Part 1
  3. Data Preparation - Part 2
  4. Modelling - Part 2
  5. Deployment - Part 3

More information on this can be found here

Business Understanding

In this phase, the business objective is defined and translated into a data mining problem, and a plan is designed to achieve it.

The business problem is: How can we make services reliable, increase customer traffic, deal with the supply and demand of different products and reduce company losses?

Understanding the business problem gives you a clear guideline on what actually needs to be solved, and helps you build a model that is meaningful to the business.

It's worth noting that machine learning can't solve every business problem; defining the business problem will tell you whether machine learning is needed at all.

Convert It To a Data Problem

Building a good recommender system is one of the ways Zimnat can improve its business. Customers are likely to get hooked on sites that provide personalized views and prioritize items that are likely to interest them. A good example is Netflix.

This brings us to the analysis question: Can we predict the products a customer is likely to use from both external and internal factors?

Data Understanding

In this stage, we focus mainly on the data. We check its quality and completeness, explore the variables and their relationships, and get a brief description of the data.

This stage is important because it:

  • Identifies data quality problems.
  • Helps you get familiar with the data by understanding the variables and their relationships.
  • Helps in transforming the variables into their correct formats.

Data understanding has the following steps:

  1. Set up your work environment.
  2. Describe and Explore Data.

Set Up Your Work Environment

Import the necessary libraries and load the required datasets.

Next, we read all the CSV files …
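A minimal sketch of this step, assuming pandas is installed. The rows and column names below are made-up stand-ins held in memory so the snippet runs on its own; with the competition download you would pass a file path such as "Train.csv" to read_csv instead (that file name is an assumption).

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the competition files (made-up rows).
# With the real data, pass a path instead, e.g. pd.read_csv("Train.csv");
# the file name is an assumption.
train_csv = io.StringIO(
    "ID,join_date,sex,marital_status,birth_year\n"
    "A1,1/2/2019,F,M,1987\n"
    "A2,1/6/2019,M,U,1990\n"
    "A3,1/8/2013,M,M,1975\n"
)
test_csv = io.StringIO(
    "ID,join_date,sex,marital_status,birth_year\n"
    "B1,1/5/2018,F,U,1992\n"
)

train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)

# Peek at the first rows to confirm the files were parsed as expected
print(train.head())
```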

The shape attribute of a pandas DataFrame returns the number of rows and columns, in that order. The training dataset is usually the larger of the two, since more training examples generally help the model learn and improve prediction accuracy, provided the data is well randomized. Some data scientists, however, prefer a larger test set because it gives a more reliable estimate of how the model performs on unseen data, which makes overfitting easier to detect. The sample file simply shows the format the final output should follow.
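As a small illustration, shape can be read like this. The two data frames are made-up stand-ins, not the competition data.

```python
import pandas as pd

# Made-up stand-ins for the training and test sets
train = pd.DataFrame(
    {"ID": ["A1", "A2", "A3", "A4"], "birth_year": [1987, 1990, 1975, 1982]}
)
test = pd.DataFrame({"ID": ["B1", "B2"], "birth_year": [1992, 1968]})

# shape is an attribute (no parentheses) holding a (rows, columns) tuple
n_rows, n_cols = train.shape
print(f"train: {n_rows} rows, {n_cols} columns")
print(f"test: {test.shape[0]} rows, {test.shape[1]} columns")
```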

Exploratory Data Analysis

Next, we check the different columns in the data…
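One way to list the columns and their data types is sketched below. The data frame is a made-up stand-in, and PROD_A and PROD_B are hypothetical placeholders for the product code columns.

```python
import pandas as pd

# Made-up stand-in; PROD_A and PROD_B are hypothetical product code columns
train = pd.DataFrame(
    {
        "ID": ["A1", "A2"],
        "join_date": ["1/2/2019", "1/6/2019"],
        "sex": ["F", "M"],
        "birth_year": [1987, 1990],
        "PROD_A": [0, 1],
        "PROD_B": [1, 0],
    }
)

# Column names as a plain Python list
columns = train.columns.tolist()
print(columns)

# Data type pandas inferred for each column
print(train.dtypes)
```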

Assess the different columns and try to understand what each variable means in the real world:

  1. ID  -  Unlikely to affect the kind of product a user is likely to use.
  2. join_date  -  Join date can affect a customer's product preference based on the popularity and novelty of the product.
  3. Sex  -  Likely to affect which products an individual uses. For example, women are likely to buy more beauty products than men.
  4. Marital status  -  This can affect the products a user is likely to use.
  5. Birth year  -  Age can affect the products a user is likely to use.
  6. Branch code  -  Unlikely to affect the recommendations for a user.
  7. Occupation code, occupation category code  -  Can affect which products a customer can afford.
  8. The other columns are just codes for the different products that the insurance company offers.

Forming hypotheses like these helps you think the problem through, and gives you a clearer picture of which columns to focus on.

Next, use the DataFrame's describe() method to view summary statistics of the numeric variables.
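A short sketch on a made-up data frame. By default, describe() summarises only the numeric columns, so the text columns are left out of the result.

```python
import pandas as pd

# Made-up stand-in for the training data
train = pd.DataFrame(
    {
        "ID": ["A1", "A2", "A3", "A4"],
        "sex": ["F", "M", "M", "F"],
        "birth_year": [1987, 1990, 1975, 1982],
    }
)

# count, mean, std, min, quartiles and max for each numeric column
summary = train.describe()
print(summary)
```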

You can also use the DataFrame's info() method to print a concise summary of the data frame.

With this, it's easy to identify the columns with null values and the columns that need to be transformed.
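The sketch below shows both checks on a made-up data frame: counting nulls per column, and converting a date column that was read as text. The day/month/year format is an assumption about how join_date is stored.

```python
import pandas as pd

# Made-up stand-in with one missing join_date
train = pd.DataFrame(
    {
        "ID": ["A1", "A2", "A3", "A4"],
        "join_date": ["1/2/2019", None, "1/8/2013", "1/5/2018"],
        "birth_year": [1987, 1990, 1975, 1982],
    }
)

# Column names, non-null counts and dtypes in one printout
train.info()

# Number of missing values per column, keeping only columns that have any
missing = train.isnull().sum()
print(missing[missing > 0])

# join_date was read as text; convert it to a proper datetime column.
# The day/month/year format is an assumption.
train["join_date"] = pd.to_datetime(train["join_date"], format="%d/%m/%Y")
print(train.dtypes)
```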

In this step, it is also important to identify the categorical and numerical variables.

If you are dealing with categorical data:

  • Identify the different classes
  • Use the value_counts() method to get the counts of all unique values
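These two steps could look like the following on a made-up data frame. Splitting columns with select_dtypes is one possible approach, not necessarily the one used in the original notebook.

```python
import pandas as pd

# Made-up stand-in for the training data
train = pd.DataFrame(
    {
        "sex": ["F", "M", "M", "F", "M"],
        "marital_status": ["M", "U", "M", "M", "S"],
        "birth_year": [1987, 1990, 1975, 1982, 1968],
    }
)

# Split the columns into numeric and non-numeric (treated as categorical)
numeric_cols = train.select_dtypes(include="number").columns.tolist()
categorical_cols = train.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)

# For each categorical column, list the classes and how often each occurs
for col in categorical_cols:
    print(train[col].value_counts())
```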

It's true to say that our data is long, which can be a problem when it comes to modelling.

Data understanding is the step that gives you a blueprint for what to work on in data preparation.

In the next part of this series, we will focus on data preparation and modelling.

Important Resources…

CRISP-DM Methodology

Zimnat Competition

Recommender Systems

Data Science Jargon

Special thanks to Nairobi Women in Machine Learning & Data Science and AI Kenya.

About the author

Wawira is a software developer and data scientist. Follow her on Twitter, @jlcodes.

Read the original article here.