Tony Mipawa, who won second place in Zindi's UmojaHack Africa 2021 Intermediate Challenge, shares some insights on how he got to his win.
For UmojaHack2021, I represented my university and country in the Intermediate Challenge (Sendy Delivery Rider Response Challenge), where the objective was to create a machine learning model that predicts whether a rider will accept, decline or ignore an order sent by a customer. The challenge ran for two days, and at the end I ranked 6th out of 266 participants on the final leaderboard, taking 2nd prize for this challenge overall.
On 27 and 28 March 2021, Zindi organised UmojaHack Africa 2021, the biggest inter-university machine learning hackathon for African undergraduate and postgraduate students. The hackathon involved 126 universities from 21 countries, and students could choose from three challenges:
From UmojaHack2021, I was able to learn many things from other participants and from the challenge itself. I would love to share some of these things with others in the community. I hope they will be helpful for any machine learning challenges or the next UmojaHack hackathon.
The best way to start any machine learning challenge is by understanding the problem you're going to solve. You can approach this through its context (the domain, such as finance or health) or through the type of challenge (for example, a supervised task such as classification or regression). This understanding plays an important role in forming your initial assumptions, and during feature engineering it helps you create new features from the most important ones, those that directly affect the result (target). I recommend doing this before you start any machine learning challenge. If the topic is complex, do some research online to make sure you understand it before you check out the datasets.
Once you are comfortable with the problem context, it's time to gather the data and start working. Create your baseline notebook as soon as you can, and make your first submission. This gives you a reference score to improve on, along with the confidence to push your model's performance and your position on the leaderboard.
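As a rough illustration of what a first baseline can look like, the sketch below always predicts the most frequent class and measures its accuracy. The labels are made up for illustration (they are not competition data), and the function names are hypothetical. A real baseline notebook would train a simple model, but even this gives you a score that every later model has to beat.

```python
from collections import Counter


def fit_majority_baseline(labels):
    """Return the most frequent label in the training targets."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical rider responses, invented for this example.
train_labels = ["accept", "decline", "accept", "ignore", "accept", "decline"]
valid_labels = ["accept", "ignore", "accept", "decline"]

majority = fit_majority_baseline(train_labels)
baseline_preds = [majority] * len(valid_labels)
baseline_score = accuracy(valid_labels, baseline_preds)
print(majority, baseline_score)
```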
You should make sure you understand all the features well, and how they relate to or affect the result (target). This will help you do well at feature engineering. You can get a clear understanding of your features by reading the variable definitions file and then carrying out Exploratory Data Analysis (EDA). For fast analysis you can use the pandas_profiling tool, which helps you gain a deep understanding of the features and the overall dataset quickly.
Once you import the pandas_profiling package, you can analyse the whole dataset with a single line of Python code. It gives you an easy way to see the relationships between features along with a summary of the overall dataset, so you can pick up insights and patterns in the different features with little effort. Learn more about pandas_profiling here.
Organising your code in a simple, correct way that follows the ML workflow makes it easier for you to trace your own work and for others to read and understand it. You can achieve this by commenting your code, following the ML workflow step by step, and defining functions named after the task they perform, such as a calculate_distance function that only calculates the distance from one point to another.
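Here is a minimal sketch of a function named after its task. It assumes the points are given as latitude and longitude in degrees and uses the haversine (great-circle) formula. That formula is my choice for illustration, not necessarily the one used in the competition.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def calculate_distance(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometres between two points.

    All arguments are in decimal degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is a little over 111 km.
print(round(calculate_distance(0.0, 0.0, 0.0, 1.0), 2))
```

Keeping the calculation in one well-named, documented function means the rest of the notebook can call it without repeating the maths.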
This is feature engineering, which offers real potential for improving machine learning model performance. How far you get depends on how well you understand the problem context and on the analysis you have done so far. From existing features you can generate new ones, such as a distance from latitudes and longitudes, or the day, year, hour, minute and second from datetime features.
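To make the datetime part concrete, here is a small sketch that splits a timestamp string into separate numeric features. The timestamp format and the function name are assumptions for illustration, so adjust them to whatever the dataset actually uses.

```python
from datetime import datetime


def extract_datetime_features(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into simple numeric features."""
    dt = datetime.strptime(timestamp, fmt)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "weekday": dt.weekday(),  # Monday is 0, Sunday is 6
        "hour": dt.hour,
        "minute": dt.minute,
        "second": dt.second,
    }


# Hypothetical order timestamp, invented for this example.
features = extract_datetime_features("2021-03-27 14:05:30")
print(features)
```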
From UmojaHack2021 I discovered that IDs also have potential for improving an ML model. Before encoding categorical features, try to generate new features from them, such as how frequently each value occurs. You can check out a great article covering this technique here.
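One way to turn an ID column into a frequency feature is sketched below. It assumes the frequencies are learned from the training rows only and then applied to any other rows, with unseen IDs mapped to 0.0. The rider IDs and function names are hypothetical.

```python
from collections import Counter


def fit_frequency_map(train_values):
    """Map each category to its relative frequency in the training data."""
    counts = Counter(train_values)
    total = len(train_values)
    return {value: count / total for value, count in counts.items()}


def apply_frequency_map(values, freq_map, default=0.0):
    """Replace each category by its learned frequency (default if unseen)."""
    return [freq_map.get(value, default) for value in values]


# Hypothetical rider IDs, invented for this example.
train_rider_ids = ["r1", "r2", "r1", "r3", "r1", "r2"]
test_rider_ids = ["r1", "r4"]

freq_map = fit_frequency_map(train_rider_ids)
train_rider_freq = apply_frequency_map(train_rider_ids, freq_map)
test_rider_freq = apply_frequency_map(test_rider_ids, freq_map)
print(train_rider_freq, test_rider_freq)
```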
Don't train your model with parameters copied from someone else's model, as this can give you misleading results. I recommend starting with the algorithm's default parameters, then using hyperparameter tuning tools and methods such as Optuna, grid search and random search to find the best parameters for your model, or tuning the hyperparameters manually.
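The idea of starting from default parameters and then searching can be sketched with a plain random search. Everything here is illustrative: `toy_validation_score` is a made-up stand-in for "train the model with these parameters and return its validation score", and the parameter names and values are hypothetical.

```python
import random


def random_search(objective, search_space, start_params, n_trials=30, seed=0):
    """Random search that starts from, and never does worse than, start_params."""
    rng = random.Random(seed)
    best_params, best_score = dict(start_params), objective(start_params)
    for _ in range(n_trials):
        # Sample one candidate value per hyperparameter.
        params = {name: rng.choice(options) for name, options in search_space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Made-up objective: higher is better, peaks at learning_rate=0.1, depth=6."""
    return -((params["learning_rate"] - 0.1) ** 2) - ((params["depth"] - 6) ** 2) / 100


search_space = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "depth": [4, 6, 8, 10],
}
default_params = {"learning_rate": 0.03, "depth": 4}

default_score = toy_validation_score(default_params)
best_params, best_score = random_search(toy_validation_score, search_space, default_params)
print(default_score, best_params, best_score)
```

Scoring the defaults first gives you a fair reference point, so a tuned configuration is only kept when it actually beats them on validation data.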
When modelling, try to focus on important parameters such as the number of iterations, and use detectors or regularisation against overfitting and underfitting to help your model give strong predictions. You can use strong tree-based algorithms such as CatBoost, XGBoost, LightGBM and gradient boosting, which offer built-in safeguards against overfitting, such as early stopping, and tend to generalise well. You can read more about this technique in this article.
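The logic behind an overfitting detector based on early stopping can be sketched in a few lines. This is a simplified illustration of the idea, not the implementation used by any of the libraries above, and the validation losses are invented. Training stops once the validation loss has failed to improve for `patience` consecutive iterations, and the best iteration seen so far is kept.

```python
def best_iteration_with_early_stopping(validation_losses, patience=3):
    """Return (best_iteration, best_loss), stopping after `patience` bad rounds."""
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop training here.
            break
    return best_iter, best_loss


# Hypothetical validation loss per boosting iteration.
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.54, 0.50]
best_iter, best_loss = best_iteration_with_early_stopping(losses, patience=3)
print(best_iter, best_loss)
```

In this example the search stops at iteration 6 and never sees the later improvements, which shows why the patience value is itself worth tuning together with the number of iterations.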
Lastly, I would like to say thank you to Zindi, all the sponsors and all the participants of UmojaHack2021. In mastering data science and machine learning there is no silver bullet, but practising on different challenges such as hackathons and competitions can help you gain experience of working with data and achieve your career goals.
A big thank you to everyone for participating in UmojaHack2021. This competition was all about quick and structured thinking, coding, experimentation, and finding the one approach that got you up the leaderboard. In short, what machine learning is all about!
Missed out this time? Don't worry, you can check out all upcoming competitions and hackathons on the Zindi platform, and register today!
I'm Anthony Mipawa, a fourth-year Software Engineering student at the University of Dodoma, Tanzania, a Zindi Ambassador and a Microsoft Learn Student Ambassador. I'm passionate about data science and machine learning, mostly NLP and computer vision. It is my pleasure to support African students and the young generation interested in data science, to increase our productivity and solve real-life problems. You can find me on Twitter or LinkedIn.