Zindi User Behaviour Birthday Challenge

Helping Africa
$3 000 USD
Completed (~4 years ago)
Prediction
871 joined
174 active
Start: Sep 24, 2021
Close: Jan 23, 2022
Reveal: Jan 23, 2022
flamethrower
3RD PLACE SOLUTION APPROACH
Connect · 25 Jan 2022, 07:37 · edited 6 months later · 25

3rd place solution

Problem Highlight:

We approached the problem as a binary classification problem.

We formulated the problem as developing a separate model for each test condition after feature engineering. Setting up distinct CV evaluations and feature selections for the different test conditions gave us a reliable CV estimate.
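As a rough illustration of the per-condition setup, here is a minimal sketch. The column names ("condition", "target") and the helper function are assumptions; the write-up does not publish the actual code, and the real solution trained LightGBM/CatBoost per condition with its own CV and feature selection, whereas a positive-rate "model" stands in here.

```python
# Hedged sketch: one model per test condition.
# "condition" and "target" are hypothetical column names.
import pandas as pd

def train_per_condition(df: pd.DataFrame, conditions: list) -> dict:
    """Fit one stand-in model (the positive rate) per test condition."""
    models = {}
    for cond in conditions:
        subset = df[df["condition"] == cond]
        # The real solution fit a boosted-tree model here, with
        # condition-specific features and CV.
        models[cond] = subset["target"].mean()
    return models
```

The key design point is that each condition gets its own model object and its own evaluation, so an improvement in one condition cannot be masked by another.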

We also created a target sum feature, the sum of all the individual activity columns in the train data (Disc, Comments, Compart, Submissions), so that the amount of user activity (discussions, submissions, comments, competition participation) could be extracted from historical months.
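A minimal sketch of the target sum feature, assuming the four activity columns are named as in the write-up's abbreviations (the exact column names are an assumption):

```python
# Per-row sum of the four activity columns; column names assumed.
import pandas as pd

ACTIVITY_COLS = ["Disc", "Comments", "Compart", "Submissions"]

def add_target_sum(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["target_sum"] = out[ACTIVITY_COLS].sum(axis=1)
    return out
```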

FEATURE ENGINEERING:

Most of the model's performance was attributable to feature engineering. We extracted information from all the datasets, taking into account only historical information.

The key features are listed below:

1. Creation of a join date feature; this was critical, especially for new users. Users are most likely to join Zindi in order to interact with a competition at the time of joining. This feature alone could give a boost of 0.2.

2. Historical user activity cumulative sum features and ratios.

Cumulative sums were taken over the entire user period and over the last 3 and 6 months. The cumulative number of historical months was also extracted.

3. Months elapsed (the user's duration on Zindi at the time of prediction).

4. Creation of user consistency features for users with zero standard deviation, plus weighting factors for overall user behaviour based on consistency, number of historical months, and total activity ratio over the months. The weighting puts a high positive weight on users with consistent activity and enough historical months, and a high negative weight on users with little activity despite enough historical months. It was developed for two thresholds of the user total activity ratio: 0.6 and 0.8.

5. Extraction of user lags from the last 6 months: target lags, target sum lags, and historical Compart, Submissions, Discussions, and Comments lags.

6. Creation of an overall user behaviour feature for the entire period and for the last 3 and 6 months: the sum of historical Compart, Comments, Submissions, and Discussions. This helped strengthen the user behaviour signal.

7. Competition participation, discussions, submissions, and comments statistics over the entire user historical period, the previous month, and the last 3 and 6 months (counts, min, median, max; number of submissions, discussions, comments, and Compart per month).

8. Number of days since the user's last activity, over the entire user period and over a 6-month window.

9. User active-duration statistics (median, mean, min): the timeline from a period of no activity to the next decline in activity.

10. Features capturing apparent changes in user behaviour: how often behaviour remained positive, remained negative, changed from positive to negative, or changed from negative to positive.

11. FeatureX, FeatureY, Country and points, and UserDate month, year, and day of week.

12. Statistics on already-started competitions that would still be active at the time of prediction. Users are more likely to continue competitions they have already started, especially in view of the number of submissions made, the points reward, and current public rankings.
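The historical activity features in items 2, 3, and 5 above can be sketched as follows. This is a hedged illustration, not the authors' code: the column names ("user_id", "month", "target_sum") are assumptions, and every window is shifted by one month so that a row sees only strictly historical information.

```python
# Per-user cumulative sum, 3-month rolling sum, 1-month lag, and
# months elapsed. shift(1) excludes the current month from every window.
import pandas as pd

def add_history_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(["user_id", "month"]).copy()
    g = df.groupby("user_id")["target_sum"]
    df["cum_activity"] = g.transform(lambda s: s.shift(1).cumsum())
    df["last3_activity"] = g.transform(
        lambda s: s.shift(1).rolling(3, min_periods=1).sum()
    )
    df["lag_1"] = g.shift(1)
    df["months_elapsed"] = df.groupby("user_id").cumcount()
    return df
```

The same pattern extends to 6-month windows, per-activity columns, and the ratio features by swapping the window length or the aggregated column.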

CROSS VALIDATION STRATEGY:

We performed Stratified K-Fold after removing the time-series component (the year and month features). We also validated on the last 3 months of Year 3 data at intervals; we made this choice after observing that the previous months' information was the most predictive of the next month.

For all test-time conditions, we made sure to select features for training exactly as they would appear at test time.
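A minimal sketch of the CV setup described above, assuming the time columns are named "year" and "month" (an assumption; the actual column names are not given):

```python
# Stratified K-Fold on the binary target after dropping time columns.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def cv_splits(X: pd.DataFrame, y, n_splits: int = 5, seed: int = 42):
    X = X.drop(columns=["year", "month"], errors="ignore")
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return list(skf.split(X, y))
```

Dropping year and month before splitting prevents folds from keying on the time index, while the separate last-3-months holdout preserves a time-aware sanity check.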

MODEL TRAINING:

Our final submission was based on averaging the 5-fold predictions of 3 LightGBM models and 1 CatBoost model. A single model scored 0.9112 on the public LB; averaging yielded only a small boost, to 0.9119.

We observed good CV improvements that moved in sync with the LB score.

We tuned each test condition separately. An improvement in just one condition could yield a boost on the LB, so it was worthwhile to set things up this way.
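The final blend reduces to averaging per-fold probability vectors across models. A minimal sketch (stand-in arrays replace the trained LightGBM/CatBoost predictions, which are not published):

```python
# Average a list of equal-length per-fold probability vectors.
import numpy as np

def blend(fold_preds: list) -> np.ndarray:
    """Uniform average over (model, fold) prediction vectors."""
    return np.mean(np.stack(fold_preds), axis=0)
```

In the described setup the list would hold 20 vectors (5 folds for each of 3 LightGBM models and 1 CatBoost model).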

UNSUCCESSFUL IDEAS:

1. We attempted to build a model predicting user interest in a competition, labelling competitions within a user's timeline that they did not interact with as 0 and competitions they interacted with as 1. This didn't work well due to the high class imbalance: the model predicted some cases correctly but was too optimistic compared to simply using the user's historical information. Under the AUC metric, over-optimistic predictions are heavily penalised.

2. User_ID as a feature.

Thank you @Zindi for an exciting competition. Shoutout to @holar, who teamed up and helped shape the approach at the last minute.

Congratulations to all that participated.

Discussion (25 answers)

Thanks for this; very useful insights that will definitely help in other competitions, along with the lessons captured. Congratulations too.

25 Jan 2022, 07:51
Upvotes 0
flamethrower

I'm glad some learning points could be picked up, @psonyango. Thank you.

21db

Wow, nice work, and what a feature/stats combo! I was also trying to get the test-set distribution under each condition as similar as possible to the train set, but I couldn't acquire those skills in time to make that happen. I hope to learn a great deal more when you share your code.

Congratz!

25 Jan 2022, 09:08
Upvotes 0
flamethrower

Thank you @DanielBruintjies. I will definitely share once Zindi confirms everything. Hope it helps

21db

@flamethrower Can you share your code please?

flamethrower

Apologies for the late response, I will have it on Github and share the link.

21db

Awesome!

Nicely done. Thanks for sharing.

25 Jan 2022, 10:31
Upvotes 0
