17 Mar 2020, 07:59

Meet the winners of the Womxn in Big Data South Africa: Female-Headed Households in South Africa Challenge

Zindi is excited to announce the winners of the Womxn in Big Data South Africa: Female-Headed Households in South Africa Challenge. The challenge attracted 452 data scientists (almost 40% women!) from across the continent and around the world, with 201 placing on the leaderboard.

The objective of this challenge was to build a predictive model that accurately estimates what percentage of households per ward are female-headed and living below a particular income threshold, by using data that can be collected without intensive household surveys.

This solution could potentially reduce the cost and improve the accuracy of monitoring key population indicators such as female household headship and income level in between census years.

The winners of this challenge are: Ansem Chaieb from Tunisia in 1st place, Lucille Kaleha from Kenya in 2nd place, and Sirine Bouslama from Tunisia in 3rd place.

A big thank you to Microsoft for sponsoring the competition, to all the participants, and especially to the winners for their generous feedback. Here are their insights.

Ansem Chaieb (1st place)

Zindi handle: Ansem_chaieb

Where are you from? Tunisia

GitHub Repo

Tell us about the approach you took:

Feature engineering:

  • I started by dropping duplicate features ({'dw_00', 'dw_02', 'dw_06', 'dw_12', 'dw_13', 'psa_02', ..}); I found that they were not really useful.
  • I clustered the geolocation coordinates into 5 clusters, then for each cluster I measured the percentage of poverty using the features most correlated with the target.
  • I tried binarization and rounding (it often makes sense to round these high-precision percentages to integers) and some meaningful feature encodings. I also tried to find interactions between features, but that did not work.
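
Clustering the coordinates might look roughly like this sketch (the data, column ranges, and the per-cluster poverty aggregation are synthetic stand-ins, not the competition's):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in for ward centroids (latitude, longitude) over South Africa.
coords = rng.uniform(low=[-35.0, 16.0], high=[-22.0, 33.0], size=(200, 2))

# Cluster the geolocation coordinates into 5 groups, as described above.
km = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_id = km.fit_predict(coords)

# Each ward now carries a cluster label that cluster-level statistics
# (e.g. a poverty percentage) can be aggregated over and joined back on.
```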

Modeling:

I trained a single LGBM model with 5-fold cross-validation. In order to improve my results and to reduce overfitting to training data, I used model stacking.
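
A minimal sketch of the out-of-fold loop behind 5-fold cross-validation, with scikit-learn's GradientBoostingRegressor standing in for LightGBM and synthetic data in place of the competition's:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for LGBM
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=300, n_features=10, noise=0.3, random_state=0)

# 5-fold CV: each fold is held out once, so every row receives an
# out-of-fold prediction (exactly what a stacking meta-model trains on).
oof = np.zeros(len(y))
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    oof[val_idx] = model.predict(X[val_idx])
```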

What were the things that made the difference for you that you think others can learn from?

Trying to understand the subject you are working on is the most important part. It will help you design your features well. Not all of your assumptions will be good, so keep trying until you get the best result.

Lucille Kaleha (2nd place)

Zindi handle: kaleha

Where are you from? Kenya

GitHub Repo

Tell us a bit about yourself:

I am a data scientist and a recent actuarial science graduate who likes playing with data and drawing meaningful insights that can improve the way we live.

Tell us about the approach you took:

I focused more on feature engineering and modelling. I realized that features extracted from the latitude and longitude columns significantly improved the model scores. To reduce over-fitting I used an ensemble of eight different models and a Catboost Regressor as the meta learner.
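
An ensemble of this shape can be sketched with scikit-learn's StackingRegressor; here three base learners stand in for the eight models, and Ridge stands in for the CatBoost meta-learner, on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=12, noise=0.5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Base learners produce out-of-fold predictions; the meta-learner
# (final_estimator) learns how to combine them.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=1)),
        ("gbr", GradientBoostingRegressor(random_state=1)),
        ("ridge", Ridge()),
    ],
    final_estimator=Ridge(),
    cv=5,
)
stack.fit(X_tr, y_tr)
score = stack.score(X_te, y_te)  # R^2 on held-out data
```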

What were the things that made the difference for you that you think others can learn from?

Ensembling, blending and averaging models really does reduce overfitting and produce more accurate predictions.

What are the biggest areas of opportunity you see for AI in Africa over the next few years and what are you looking forward to most about the Zindi community?

Education and healthcare are the biggest areas of opportunity for AI in Africa. I also enjoy the sharing of ideas and learning from one another on Zindi.

Sirine Bouslama (3rd place)

Zindi handle: sbs

Where are you from? Tunisia

GitHub Repo

Tell us a bit about yourself:

I am an Artificial Intelligence Engineer. I've been working on building several predictive models that tackle real-world problems. I like finding solutions to business problems and cracking applications where machine learning and deep learning concepts provide the best fit.

Tell us about the approach you took:

Processing:

The data was clean and didn't need any pre-processing steps. Most of the features were highly skewed, so I applied the Box-Cox transformation to bring them closer to a normal distribution, but this technique didn't improve the results.
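
A minimal Box-Cox sketch on a synthetic right-skewed feature (SciPy fits the lambda parameter automatically; note that Box-Cox requires strictly positive values):

```python
from scipy.stats import boxcox, skew
import numpy as np

rng = np.random.default_rng(2)
# Synthetic right-skewed feature standing in for the competition data.
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# boxcox returns the transformed values and the fitted lambda.
x_bc, lam = boxcox(x)
# The transformed feature is far less skewed than the original.
```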

Feature Engineering:

I tried to engineer many features but only a few of them helped to get better results:
  • Clusters: I created 10 clusters of the regions using K-means (++impact)
  • Concatenated features indicating ownership of luxury items (++impact)
  • Area of the top administrative levels (third and second levels) (+impact)
  • Dimensionality reduction for the dwelling and language features (no impact, so I dropped them)
  • Average distance from the centre of each region to POIs of different categories such as 'Facilities', 'Education Facility', 'Public Transport', etc. (tended to overfit on the training set, so I dropped them)
  • Target-encoded features
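
Of the list above, target encoding is worth a short sketch. A smoothed version (toy data and column names, not the competition's) blends each category's mean target with the global mean, weighted by category count:

```python
import pandas as pd

# Toy frame: a categorical "province" column and the poverty target.
df = pd.DataFrame({
    "province": ["A", "A", "B", "B", "B", "C"],
    "target":   [0.2, 0.4, 0.8, 0.6, 0.7, 0.5],
})

# Smoothed target encoding: rare categories shrink toward the global
# mean (m is the smoothing strength), which limits overfitting.
m = 2.0
global_mean = df["target"].mean()
stats = df.groupby("province")["target"].agg(["mean", "count"])
encoding = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["province_te"] = df["province"].map(encoding)
```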

Models:

I ran different gradient boosting models, and the Catboost model was the best.
For individual-model cross-validation, I used random CV row selections of the data. The CV score tended to differ from the public leaderboard score, due to the difference in administrative levels between the train and test sets.
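
One common remedy for this kind of CV/leaderboard gap, not necessarily the one used here, is to build folds that respect the grouping, e.g. scikit-learn's GroupKFold with administrative areas as the groups (the labels below are made up):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
# Hypothetical administrative-area labels: 10 areas, 10 rows each.
groups = np.repeat(np.arange(10), 10)

# Rows from the same area always land on the same side of a split,
# mimicking the disjoint areas in the train and test sets.
folds = list(GroupKFold(n_splits=5).split(X, groups=groups))
for train_idx, val_idx in folds:
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```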

Hyperparameter Tuning:

Effective hyperparameter tuning proved to be a major challenge in this project, since the CV score and the leaderboard score were not correlated.

What were the things that made the difference for you that you think others can learn from?

Seed diversification: trying different random seeds when training the model, since the CV score and the leaderboard score were not correlated.
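
Seed diversification is usually paired with averaging the resulting predictions; a sketch with scikit-learn's GradientBoostingRegressor standing in for the actual model, on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, noise=1.0, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4)

# Train the same model under several random seeds (subsample < 1 makes
# the seed matter) and average the predictions to smooth out
# seed-dependent variance.
preds = [
    GradientBoostingRegressor(subsample=0.8, random_state=seed)
    .fit(X_tr, y_tr)
    .predict(X_te)
    for seed in (0, 1, 2)
]
blended = np.mean(preds, axis=0)
```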

What are the biggest areas of opportunity you see for AI in Africa over the next few years?

As Andrew Ng says: AI is the new electricity. So AI can create opportunities in many different areas. Among today's problems, healthcare is the first area that can benefit from AI.

What are you looking forward to most about the Zindi community?

Having great data scientists that have a hunger for learning and sharing knowledge.