24 Dec 2020, 06:08

Meet Agnes Mueni, winner of the Zimnat Insurance Prediction Challenge and aspiring data scientist

Join Zimnat Insurance Prediction Challenge winner Agnes Mueni (Pandas) as we talk about her winning machine learning solution and starting her data science career.

Hi Agnes, please introduce yourself to the Zindi community.

My name is Agnes Mueni (Pandas), and I am from Nairobi, Kenya. I am a recent graduate of Moringa School in Nairobi. I am an aspiring data scientist who is passionate about converting data into smart strategies and finding solutions that change situations for the better. I aspire to be an industry expert whom others can approach for help or strategies, and to help them improve their businesses by solving problems with a positive attitude.

Tell us a bit about your data science journey.

I stumbled upon data science while doing my thesis. After further research, I learnt that I could perform data analysis and analytics and gather impactful insights, and this propelled me into taking a Data Science course at Moringa School. I completed the course last year, and since then I have done a number of data science projects and taken part in competitions, which has helped me improve my skills a great deal.

What do you like about competing on Zindi?

Zindi has provided a good platform where aspiring data scientists as well as experienced ones can put their skills into action to provide solutions to existing problems. Personally, I have gained a lot from Zindi in my data science journey as I am gaining confidence in the field to tackle problems with data. I am fascinated by the way data is proving to be a real asset for most organizations, whereby most solutions can be generated from the right data.

Tell us about your solution for the Zimnat Insurance Recommendation Challenge.

I approached the challenge as a multi-class classification problem:

Data pre-processing:

- Wrangled and manipulated the training set to be similar to the test set, i.e. for every observation I removed one product and set the removed product as the target.

- Dropping duplicated customers didn’t help

- Dropping missing values didn’t help, so I filled the missing values with the mean

- Dropping outlier customers, i.e. customers with more than 8 products, didn’t help
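The first pre-processing step above can be sketched roughly as follows. This is a minimal illustration on toy data, not the winning code: the column names (`customer_id`, `birth_year`, `prod_a` and so on) are hypothetical, and products are assumed to be stored as 0/1 flags, one column per product.

```python
import pandas as pd

# Hypothetical product columns; the real dataset's names are not given here.
PRODUCT_COLS = ["prod_a", "prod_b", "prod_c"]

# Toy training data: one row per customer, one 0/1 flag per product.
train = pd.DataFrame(
    {
        "customer_id": ["c1", "c2", "c3"],
        "birth_year": [1985.0, 1990.0, None],
        "prod_a": [1, 0, 0],
        "prod_b": [1, 1, 0],
        "prod_c": [0, 1, 1],
    }
)

# Fill missing numeric values with the column mean instead of dropping rows.
train["birth_year"] = train["birth_year"].fillna(train["birth_year"].mean())


def make_masked_rows(df, product_cols):
    """For every product a customer holds, emit one row with that product hidden."""
    rows = []
    for record in df.to_dict("records"):
        for col in product_cols:
            if record[col] == 1:
                masked = dict(record)
                masked[col] = 0          # hide the product, mimicking the test set
                masked["target"] = col   # the hidden product becomes the label
                rows.append(masked)
    return pd.DataFrame(rows)


masked_train = make_masked_rows(train, PRODUCT_COLS)
print(masked_train)
```

Each customer contributes one row per product held, so the same customer can appear several times in the reshaped training set.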

Feature engineering:

- Sum of total products purchased by each customer

- Age of customer, i.e. 2020 minus the birth year

- Date features – year, month and week of joining

- Pairwise product combinations (combinations of two products); combinations of more than two products didn’t help

- Used one hot encoding for all categorical features
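A rough sketch of these features in pandas is shown below. The column names, the day/month/year date format, and the reading of a pairwise combination as "customer holds both products" are all assumptions made for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical column names on toy data.
PRODUCT_COLS = ["prod_a", "prod_b", "prod_c"]

df = pd.DataFrame(
    {
        "join_date": ["1/5/2019", "1/8/2018"],  # assumed day/month/year format
        "birth_year": [1985, 1990],
        "sex": ["F", "M"],
        "prod_a": [1, 0],
        "prod_b": [1, 1],
        "prod_c": [0, 1],
    }
)

# Total number of products held by each customer.
df["n_products"] = df[PRODUCT_COLS].sum(axis=1)

# Age as of 2020.
df["age"] = 2020 - df["birth_year"]

# Date features: year, month and ISO week of joining.
joined = pd.to_datetime(df["join_date"], format="%d/%m/%Y")
df["join_year"] = joined.dt.year
df["join_month"] = joined.dt.month
df["join_week"] = joined.dt.isocalendar().week.astype(int)

# Pairwise product combinations: 1 when the customer holds both products.
for a, b in combinations(PRODUCT_COLS, 2):
    df[f"{a}_x_{b}"] = df[a] * df[b]

# One-hot encode the categorical features.
features = pd.get_dummies(df.drop(columns=["join_date"]), columns=["sex"])
print(features.columns.tolist())
```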

Modelling:

- A blend of CatBoost and LightGBM gave the lowest loss

- CatBoost was trained on GPU across 10 seeds, and predictions averaged

- LightGBM was trained across five random states using five folds of customer groups
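The seed averaging and blending described above might look something like the sketch below. To keep it self-contained, the two models are replaced by a stub that returns random class probabilities; in practice those matrices would come from the trained CatBoost and LightGBM models. The equal blend weight and the multi-class log loss metric are assumptions, not details from the interview.

```python
import numpy as np


def average_over_seeds(predict_with_seed, seeds):
    """Average the probability matrices one model produces under different seeds."""
    return np.mean([predict_with_seed(seed) for seed in seeds], axis=0)


def blend(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two probability matrices, renormalised per row."""
    mixed = weight_a * probs_a + (1.0 - weight_a) * probs_b
    return mixed / mixed.sum(axis=1, keepdims=True)


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(picked)))


def make_stub_model(bias):
    """Stand-in for a real model: random probabilities nudged by `bias`."""
    def predict_with_seed(seed):
        rng = np.random.default_rng(seed)
        raw = rng.random(bias.shape) + bias
        return raw / raw.sum(axis=1, keepdims=True)
    return predict_with_seed


y_valid = np.array([0, 1, 2, 1])      # toy validation labels, 3 classes
one_hot = np.eye(3)[y_valid]

# Stubs standing in for the two models' seed-averaged predictions.
catboost_like = average_over_seeds(make_stub_model(2.0 * one_hot), seeds=range(10))
lightgbm_like = average_over_seeds(make_stub_model(1.0 * one_hot), seeds=range(5))

blended = blend(catboost_like, lightgbm_like, weight_a=0.5)
loss = multiclass_log_loss(y_valid, blended)
print(loss)
```

Averaging over seeds reduces the variance that comes from random initialisation and sampling, and the blend weight would normally be chosen on local validation rather than fixed at 0.5.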

What do you think set your approach apart?

For competitions, a robust validation strategy is a necessity. It’s challenging, but trusting your local validation score rather than relying on the leaderboard is very important.
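One way to make a local validation score trustworthy when a customer can appear in several rows is to keep all of a customer's rows in the same fold, which is one plausible reading of the "five folds of customer groups" mentioned earlier. The splitter below is a minimal NumPy sketch of that idea; the function name and toy customer IDs are hypothetical.

```python
import numpy as np


def grouped_folds(groups, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs where no group spans both sides."""
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(groups))
    # Deal the shuffled groups round-robin into folds.
    fold_of_group = {g: i % n_splits for i, g in enumerate(shuffled)}
    fold_ids = np.array([fold_of_group[g] for g in groups])
    for k in range(n_splits):
        yield np.flatnonzero(fold_ids != k), np.flatnonzero(fold_ids == k)


# Toy example: some customers contribute more than one row.
customer_ids = np.array(["c1", "c1", "c2", "c2", "c3", "c4", "c5", "c5"])

folds = list(grouped_folds(customer_ids, n_splits=5, seed=42))
for train_idx, valid_idx in folds:
    # A customer never appears on both sides of a split.
    assert not set(customer_ids[train_idx]) & set(customer_ids[valid_idx])
```

Repeating the split with several seeds and averaging the fold scores gives a steadier estimate than a single split.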