Hi Agnes, please introduce yourself to the Zindi community.
My name is Agnes Mueni (Pandas), from Nairobi, Kenya. I am a recent graduate of Moringa School in Nairobi and an aspiring data scientist, passionate about converting data into smart strategies and finding solutions that change situations for the better. I aspire to be an industry expert whom others can approach for help and strategies, and to help them improve their businesses by solving problems with a positive attitude.
Tell us a bit about your data science journey.
I stumbled upon data science while working on my thesis. After researching it further, I learned that I could perform data analysis and analytics and gather impactful insights, which propelled me to take a Data Science course at Moringa School. I completed the course last year, and since then I have done a number of data science projects and participated in competitions, which has improved my skills a great deal.
What do you like about competing on Zindi?
Zindi provides a good platform where aspiring and experienced data scientists alike can put their skills into action solving existing problems. Personally, I have gained a lot from Zindi on my data science journey, and I am gaining confidence in tackling problems with data. I am fascinated by the way data is proving to be a real asset for most organizations: with the right data, solutions can be generated for most problems.
Tell us about your solution for the Zimnat Insurance Recommendation Challenge.
I approached the challenge as a multi-class classification problem:
- Wrangled the training set to resemble the test set, i.e. for every observation I removed one product and set the removed product as the target.
- Dropping duplicated customers didn't help.
- Dropping rows with missing values didn't help; I filled missing values with the mean instead.
- Dropping outlier customers (those with more than eight products) didn't help.
- Added the total number of products purchased by each customer as a feature.
- Added customer age, i.e. 2020 minus the birth year.
- Added date features: year, month, and week of joining.
- Added two-product combinations as features; combinations of more than two products didn't help.
- One-hot encoded all categorical features.
- A blend of CatBoost and LightGBM gave the lowest loss.
- CatBoost was trained on GPU across 10 seeds, and the predictions averaged.
- LightGBM was trained across five random states using five folds of customer groups.
What do you think set your approach apart?
For competitions, a robust validation strategy is a necessity. It's challenging, but not relying on the leaderboard and trusting your local validation score is very important.
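One way to build such a local validation, consistent with the "five folds of customer groups" mentioned above, is a grouped split so that all rows derived from the same customer stay on one side of each fold. A minimal sketch with `GroupKFold` (the customer IDs here are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical expanded training set: several rows per customer (one per
# hidden product), so all of a customer's rows must share a fold to
# avoid leakage between train and validation.
groups = np.array(["A", "A", "B", "B", "C", "C", "D", "E", "E", "F"])
X = np.arange(len(groups)).reshape(-1, 1)
y = np.zeros(len(groups))

gkf = GroupKFold(n_splits=5)
for fold, (tr_idx, va_idx) in enumerate(gkf.split(X, y, groups)):
    # No customer appears on both sides of the split.
    assert set(groups[tr_idx]).isdisjoint(groups[va_idx])
    print(f"fold {fold}: validate on {sorted(set(groups[va_idx]))}")
```

With leakage-free folds like these, the local score tracks the hidden test set far better than repeated leaderboard probing.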