DataDrive2030 Early Learning Predictors Challenge

Helping South Africa

$3 000 USD

Completed (~3 years ago)

Start

Feb 01, 23

Apr 30, 23

Reveal

Apr 30, 23

Koleshjr

Multimedia University of Kenya

4TH BEST SOLUTION

Notebooks · 19 May 2023, 15:59 · 6

Hello guys, sorry this took so long. I am pro sharing your solution so that others can learn; that way we will build a better community. My teammate @REACHER and I worked very hard in this competition, but too bad, we weren't lucky. We had also asked for help creating the sensitivity report, but no one responded, which is okay. If anyone still wants to help now that the competition has ended, please share your approach on the platform by starting a new discussion; that would be very helpful. Key takeaways: dropping columns whose share of nulls exceeded a threshold of 0.3, except the latitude and longitude columns, really made a difference, and so did not filling the remaining nulls. Other key features: combinations of other columns with the ID Enumerator worked magic. Anyway, enjoy the code:

/helping_kids_model.ipynb
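The notebook itself is not reproduced in this thread. As a rough illustration of the two takeaways above, here is a minimal pandas sketch. It is not the team's actual code: the column names (`id_enumerator`, `language`, `mostly_null`), the reading of the 0.3 threshold as a per-column null fraction, and the string-concatenation style of combining columns are all assumptions made for the example.

```python
import pandas as pd


def drop_sparse_columns(df, threshold=0.3, keep=("latitude", "longitude")):
    """Drop columns whose fraction of nulls exceeds `threshold`, except those in `keep`."""
    null_fraction = df.isna().mean()
    to_drop = [c for c in df.columns if null_fraction[c] > threshold and c not in keep]
    return df.drop(columns=to_drop)


def add_enumerator_combo(df, other_col, enumerator_col="id_enumerator"):
    """Concatenate the enumerator ID with another column into one combined feature."""
    out = df.copy()
    out[f"{enumerator_col}_{other_col}"] = (
        out[enumerator_col].astype(str) + "_" + out[other_col].astype(str)
    )
    return out


# Toy data with hypothetical column names
toy = pd.DataFrame({
    "id_enumerator": ["e1", "e1", "e2", "e2"],
    "language": ["zu", "xh", "zu", "zu"],
    "mostly_null": [None, None, None, 1.0],   # 75% null, so it is dropped
    "latitude": [None, None, -33.9, -26.2],   # 50% null, but explicitly kept
})

cleaned = drop_sparse_columns(toy)
features = add_enumerator_combo(cleaned, "language")
```

The remaining nulls are left unfilled here, in line with the post; tree-based libraries such as CatBoost can generally handle missing numeric values natively.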

Discussion 6 answers

JakubFigura

Amazing job, and great feature engineering ideas!

19 May 2023, 16:04

Upvotes 0

Koleshjr

Multimedia University of Kenya

Thanks

replied to JakubFigura · 19 May 2023, 16:16

Upvotes 0

Juliuss

Freelance

This is one competition that was quite tough and a little frustrating, to me at least. I must say, your discoveries and solutions were nothing short of genius. Your team did very well and this is no mean feat. Congratulations and thank you for always sharing.

19 May 2023, 17:03

Upvotes 0

jagstang

Thanks for leading the way by sharing, and with a great approach too.

With a mix of data sources merged and very limited rows, I found carefully treating the data types (ensure categorical features like id_facility were actually set to categorical) and null filling to be important (e.g. observe_total had missing values but the values used to calculate the total were available so filling with the sum of attentive, concentrated etc. helped clean it up).
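A minimal sketch of those two cleaning steps, under stated assumptions: `id_facility` and `observe_total` are named in this thread, but the component column names (`observe_attentive`, `observe_concentrated`) and the rule of filling only when every component is present are guesses for illustration, not the linked script.

```python
import pandas as pd


def clean_types_and_totals(df, categorical_cols, total_col, component_cols):
    """Cast ID-like columns to categorical and rebuild a missing total from its parts."""
    out = df.copy()
    for col in categorical_cols:
        out[col] = out[col].astype("category")
    # skipna=False: the sum is NaN unless every component is present,
    # so a total is only reconstructed when it can be computed in full.
    component_sum = out[component_cols].sum(axis=1, skipna=False)
    out[total_col] = out[total_col].fillna(component_sum)
    return out


# Toy data; component column names are hypothetical
toy = pd.DataFrame({
    "id_facility": [101, 102, 101],
    "observe_attentive": [2.0, 3.0, None],
    "observe_concentrated": [1.0, 2.0, 3.0],
    "observe_total": [3.0, None, None],
})

cleaned = clean_types_and_totals(
    toy,
    categorical_cols=["id_facility"],
    total_col="observe_total",
    component_cols=["observe_attentive", "observe_concentrated"],
)
```

In this toy frame the second row's total is rebuilt from its parts, while the third stays missing because one component is unknown.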

Survey/assessment data often has an ordinal nature to it, so encoding those responses logically (strongly disagree = 1, strongly agree = 5) seemed to help on my end.
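One way to sketch that ordinal encoding in pandas is shown below. Only the endpoints (strongly disagree = 1, strongly agree = 5) come from the comment above; the intermediate labels and the case/whitespace normalisation are assumptions, and the real survey columns may use different wording.

```python
import pandas as pd

# Assumed five-point scale; only the two endpoints are given in the thread
LIKERT_ORDER = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]


def encode_likert(series, order=LIKERT_ORDER):
    """Map ordered survey labels to 1..len(order); unknown or missing labels become NaN."""
    mapping = {label: rank for rank, label in enumerate(order, start=1)}
    return series.str.strip().str.lower().map(mapping)


responses = pd.Series(["Strongly Disagree", "agree", None, "Strongly agree"])
encoded = encode_likert(responses)
```

Missing responses stay missing rather than being forced onto the scale, which keeps the choice of how to handle nulls separate from the encoding.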

Summary stat features by enumerator, facility, and ward were definitely helpful, as you show here (I believe too much of this caused overfitting on my part). Other features for me were things like the number of null values in each row, a binary feature indicating whether the child spoke more than one language, the difference from Cape Town's longitude, a binary feature indicating whether text is present in open-response features, etc.
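The features listed above could be sketched roughly as follows. Every column name here (`id_enumerator`, `child_age`, `languages`, `comment`, `longitude`) and the comma-separated language format are hypothetical, and the Cape Town longitude is an approximate value; this is an illustration, not the code from the linked repo.

```python
import pandas as pd

CAPE_TOWN_LONGITUDE = 18.42  # approximate, degrees east


def add_row_features(df):
    """Add simple row-level and group-level features to a copy of `df`."""
    out = df.copy()
    # Number of missing values per row, counted on the original columns only
    out["n_nulls"] = df.isna().sum(axis=1)
    # 1 if more than one language is listed (assumes a comma-separated string)
    out["multilingual"] = (
        df["languages"].fillna("").str.split(",").str.len() > 1
    ).astype(int)
    # 1 if the open-response field contains any non-blank text
    out["has_comment"] = df["comment"].fillna("").str.strip().ne("").astype(int)
    # Signed east-west offset from Cape Town
    out["lon_diff_cape_town"] = df["longitude"] - CAPE_TOWN_LONGITUDE
    # Summary statistic per enumerator, broadcast back to each row
    out["enumerator_mean_age"] = (
        df.groupby("id_enumerator")["child_age"].transform("mean")
    )
    return out


toy = pd.DataFrame({
    "id_enumerator": ["e1", "e1", "e2"],
    "child_age": [50.0, 60.0, 55.0],
    "languages": ["zu,en", "xh", None],
    "comment": ["shy at first", None, "  "],
    "longitude": [18.42, 28.05, None],
})

features = add_row_features(toy)
```

As noted in the comment, piling on many group-level statistics can overfit on a dataset this small, so it is probably worth adding them one at a time and checking validation scores.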

In the spirit of sharing, here is a simplified version of my approach: https://github.com/fitzpk/zindi_repo/blob/main/early_learning_predictors/catboost_starter.py

Did feature selection make an impactful difference for your model? Also, max_depth=16 with limited rows and no regularization parameters to help prevent overfitting? Love it. Definitely a takeaway for me.

Hoping we see some of the other top performers share like you!

19 May 2023, 21:22

Upvotes 1

newnomad

Hey all, thanks for sharing! My solution is quite similar, though I did not do as much feature engineering. It ranked 6th and the link to the GitHub repo is below. I'd agree with the conclusions brought up by the others (dropping columns with too many nulls, categoricals, ordinal rankings, etc.) and would add that tuning had less of an impact than I thought, probably because of the highly noisy data.

Regarding the sensitivity pipeline: it's tricky. I think one could come close with counterfactual explanations (e.g. DiCE) by restricting the constraints to the 'important' features. However, there is no reason why the globally important features should best explain feature importance for an individual. And if changes are made to one globally important feature, there is no guarantee one actually reaches a different outcome class, as required by the specification.

Link to repo: https://github.com/danielseussler/zindi-earlylearning-py

20 May 2023, 08:54

Upvotes 1

JAADARIX

Ensi

Great Job Reacher and Koleshjr.

23 May 2023, 18:25

Upvotes 0
