This tutorial series covers the following:
In this guide, we will use the Zimnat insurance competition on Zindi to explain some data science concepts.
'For insurance markets to work well, insurance companies need to be able to pool and spread risk across a broad customer base. This works best where the population to be insured is diverse and large. In Africa, formal insurance against risk has been hampered by lack of private sector companies offering insurance, with no way to diversify and pool risk across populations.
Understanding the varied insurance needs of a population, and matching them to appropriate products offered by insurance companies, makes insurance more effective and makes insurance companies more successful.
At the heart of this, understanding the consumer of insurance products helps insurance companies refine, diversify, and market their product offerings. Increased data collection and improved data science tools offer the chance to greatly improve this understanding.
In this competition, you will leverage data and ML methods to improve market outcomes for insurance provider Zimnat, by matching consumer needs with product offerings in the Zimbabwean insurance market'
The competition centres on building a predictive model that caters to each customer's individual needs and prioritizes their personal preferences.
It's fair to say that we are building a recommendation engine: a subclass of information filtering systems that seeks to predict the "rating" or "preference" a user would give to an item.
Content-based filtering will be used in this competition. The system measures the similarity between products based on their content or description, then uses the customer's previous history, along with their personal information, to find similar products they may like.
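To make the idea of product similarity concrete, here is a minimal sketch using cosine similarity. The product names and feature values are invented for illustration and do not come from the Zimnat data; cosine similarity is one common choice for content-based filtering, not necessarily the measure used later in this series.

```python
import numpy as np

# Hypothetical products described by made-up numeric attributes
# (rows: products, columns: attributes). Not taken from the competition data.
product_names = ["funeral_cover", "motor_cover", "life_cover"]
features = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.5],
])


def cosine_similarity_matrix(x):
    """Return the pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / norms
    return unit @ unit.T


def most_similar(product, names, sim):
    """Return the name of the product most similar to `product`."""
    idx = names.index(product)
    scores = sim[idx].copy()
    scores[idx] = -np.inf  # exclude the product itself
    return names[int(np.argmax(scores))]


similarity = cosine_similarity_matrix(features)
print(most_similar("funeral_cover", product_names, similarity))
```

A customer who already holds one product could then be shown the products whose feature vectors are closest to it.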
CRISP-DM (Cross-Industry Standard Process for Data Mining) is a structured approach to machine learning projects, made up of the following tasks:
More information on this can be found here
Here the business objective is set and converted into a data mining problem, and a plan is designed to achieve it.
The business problem is: How can we make services reliable, increase customer traffic, deal with the supply and demand of different products and reduce company losses?
Understanding the business problem gives you a clear guideline on what needs to be solved, and helps you build a model that matters to the business.
It's worth noting that machine learning can't solve every business problem; defining the problem clearly will tell you whether machine learning is needed at all.
Building a good recommender system is one way Zimnat can improve its business. Customers tend to get hooked on sites that offer personalized views and prioritize items likely to interest them. Netflix is a good example.
This brings us to the analysis question: Can we predict the products a customer is likely to use, based on both external and internal factors?
In this stage, we focus mainly on the data. We check its quality and completeness, explore the variables and their relationships, and get a brief description of the data.
The importance of this stage is:
Data understanding has the following steps:
Import the necessary libraries and load the required datasets.
Next, we read all the CSV files …
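As a rough sketch of this step, the snippet below imports pandas and reads three small CSV sources. The in-memory strings stand in for the competition files so the example runs on its own; the column names and the file names in the comments are assumptions, not taken from the article.

```python
import io

import pandas as pd

# Stand-ins for the competition files. In practice you would pass file paths,
# e.g. pd.read_csv("Train.csv") (file names and columns here are assumed).
train_csv = io.StringIO(
    "ID,sex,birth_year,product_a\n"
    "A1,F,1987,1\n"
    "A2,M,1990,0\n"
    "A3,F,,1\n"
)
test_csv = io.StringIO("ID,sex,birth_year,product_a\nB1,M,1975,0\n")
sample_csv = io.StringIO("ID X PCODE,Label\nB1 X product_a,0\n")

train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)
sample_submission = pd.read_csv(sample_csv)

# Peek at the first rows to confirm the data loaded as expected.
print(train.head())
```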
The shape attribute of a pandas DataFrame returns the number of rows and columns, in that order. The training set is usually the larger of the two, since a model generally learns more, and predicts better, from a larger and well-randomized sample. Some data scientists, however, prefer a larger test set because it gives a more reliable check for overfitting. The sample file simply shows the format the final output should take once the model has produced its predictions.
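A small sketch of checking dataset sizes, using tiny made-up frames in place of the real training and test sets:

```python
import pandas as pd

# Tiny hypothetical frames standing in for the real training and test sets.
train = pd.DataFrame({
    "ID": ["A1", "A2", "A3"],
    "sex": ["F", "M", "F"],
    "birth_year": [1987, 1990, 1979],
})
test = pd.DataFrame({"ID": ["B1"], "sex": ["M"], "birth_year": [1975]})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = train.shape
print(f"train: {n_rows} rows, {n_cols} columns")
print(f"test: {test.shape[0]} rows, {test.shape[1]} columns")
```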
Next, we check the different columns in the data…
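One way to list the columns, sketched on a hypothetical frame (the column names are illustrative only):

```python
import pandas as pd

# Hypothetical frame; the real competition columns may differ.
train = pd.DataFrame({
    "ID": ["A1", "A2"],
    "join_date": ["1/5/2019", "1/8/2018"],
    "sex": ["F", "M"],
    "birth_year": [1987, 1990],
})

# columns holds the column labels; tolist() turns them into a plain list.
column_names = train.columns.tolist()
print(column_names)
```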
Assess the different columns and try to understand each variable in its real-world context:
Coming up with your own hypotheses helps you think the problem through and gives you a clearer picture of which columns to focus on.
Next, use the DataFrame.describe() method in pandas to view summary statistics of the numeric variables.
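A minimal illustration on made-up data. By default, describe() summarizes only the numeric columns when the frame contains any:

```python
import pandas as pd

# Hypothetical data with one text column and two numeric columns.
train = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "birth_year": [1980, 1990, 2000],
    "product_a": [1, 0, 1],
})

# Returns count, mean, std, min, quartiles and max for each numeric column.
summary = train.describe()
print(summary)
```

Passing include="all" to describe() would add the non-numeric columns to the summary as well.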
You can also use the DataFrame.info() method to print a concise summary of the data frame.
With this, it's easy to identify the columns with null values and the columns that need to be transformed.
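A short sketch on invented data: info() prints the dtype and non-null count of each column, and isnull().sum() gives the missing-value count per column directly. The column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with a missing value and a date stored as text.
train = pd.DataFrame({
    "ID": ["A1", "A2", "A3"],
    "join_date": ["1/5/2019", "1/8/2018", "1/3/2017"],
    "sex": ["F", "M", "F"],
    "birth_year": [1987.0, None, 1990.0],
})

# Prints column names, non-null counts and dtypes; returns None.
train.info()

# Number of missing values in each column.
missing = train.isnull().sum()
print(missing)
```

Here join_date is read as text rather than as a date, which marks it as a column that would need transforming later.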
In this step, it is also important to identify the categorical and numerical variables.
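One possible way to split columns by type is select_dtypes, sketched here on hypothetical data. This is a first pass based on storage type only, so the result still needs a manual check.

```python
import pandas as pd

# Hypothetical frame; column names are illustrative only.
train = pd.DataFrame({
    "ID": ["A1", "A2", "A3"],
    "sex": ["F", "M", "F"],
    "birth_year": [1987.0, 1990.0, 1979.0],
    "product_a": [1, 0, 1],
})

# Columns stored as numbers.
numerical_cols = train.select_dtypes(include="number").columns.tolist()

# Everything else (text and similar) is treated as categorical for now.
categorical_cols = train.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

Note that an identifier such as ID lands in the categorical list, and a 0/1 flag such as product_a lands in the numerical list, even though neither is a typical member of its group.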
If you are dealing with categorical data:
It's fair to say that our data is long, which can be a problem when it comes to modelling.
Data understanding is the step that provides a blueprint for what to work on in data preparation.
In the next part of this series, we will focus on data preparation and modelling.
Important Resources…
Special thanks to Nairobi Women in Machine Learning & Data Science and AI Kenya.
About the author
Wawira is a software developer and data scientist. Follow her on Twitter, @jlcodes.
Read the original article here.