
Zindi User Behaviour Birthday Challenge

Helping Africa
$3 000 USD
Completed (over 4 years ago)
Prediction
874 joined
174 active
Start: Sep 24, 21
Close: Jan 23, 22
Reveal: Jan 23, 22
flamethrower
3RD PLACE SOLUTION APPROACH
Connect · 25 Jan 2022, 07:37 · edited 6 months later · 9

3rd place solution

Problem Highlight:

We approached the problem as a binary classification problem.

We built separate models for each test condition after feature engineering. Setting up separate CV evaluations and feature selections for the different test conditions gave us a reliable CV estimate.

We also created a target sum feature, which is the sum of all individual activity indicators in the train data (Disc, Comments, Compart, Submissions). Extracted over historical months, it captures the overall amount of user activity (discussions, submissions, comments, competition participation).
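A minimal sketch of the target sum idea, in plain Python to stay dependency-free. The column names (CompPart, Comment, Sub, Disc) and the row layout are assumptions for illustration, not the competition's actual schema.

```python
# Hypothetical column names for the four activity indicators.
ACTIVITY_COLS = ("CompPart", "Comment", "Sub", "Disc")


def add_target_sum(rows):
    """Return copies of the rows with TargetSum = sum of the four activity values."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row["TargetSum"] = sum(row[c] for c in ACTIVITY_COLS)
        out.append(new_row)
    return out


# Made-up example rows: one user, two months.
rows = [
    {"User_ID": "u1", "month": 1, "CompPart": 1, "Comment": 0, "Sub": 1, "Disc": 0},
    {"User_ID": "u1", "month": 2, "CompPart": 0, "Comment": 0, "Sub": 0, "Disc": 0},
]
for r in add_target_sum(rows):
    print(r["User_ID"], r["month"], r["TargetSum"])
```

With a dataframe library, the same thing would be a row-wise sum over the four columns.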

FEATURE ENGINEERING:

Most of the model performance came from feature engineering. We extracted information from all the datasets, taking into account only historical information.

The key features are listed below:

1. Creation of a join date feature. This was critical, especially for new users: users are more likely to be joining Zindi to interact with a competition at the time of joining. Using this feature alone could give a boost of 0.2.

2. Historical user activity cumulative sum features and ratios.

Cumulative sums were taken over the entire user period and over the last 3 and 6 months. Historical number of months cumulative was extracted from

3. Months elapsed (user duration on Zindi at time of prediction).

4. Creation of user consistency features for users with zero standard deviation, plus weighting factors of overall user behaviour based on user consistency, historical months and total activity ratio over the months. The weighting puts a high positive weight on users with consistent activity and enough historical months, and a high negative weight on users with little activity and enough historical months. It was developed for two thresholds of the user total activity ratio, 0.6 and 0.8.

5. Extraction of user lags from the last 6 months: Target lags, Target Sum lags, Hist Compart, Sub, Disc and Comments.

6. Creation of an overall user behaviour feature for the entire period, 3 months and 6 months: sum of Hist ComPart, Comments, Subs and Disc. This helped augment the user behaviour signal.

7. Competition Participation, Discussions, Submissions and Comments feature statistics for the entire user historical period, previous month, last 3 months and last 6 months (Counts, Min, Median, Max, number of submissions, discussions, comments and ComPart per month).

8. Number of days since the user's last activity, over the overall user period and over a 6-month time window.

9. User active duration feature statistics (Median, Mean, Min)- Timeline between periods of no activity to the next decline in activity.

10. Features on apparent changes in user behaviour - How often user behaviour remained positive, negative, changed from positive to negative, negative to positive.

11. FeatureX, featureY, Country and points, UserDate Month, Year, DayofWeek.

12. Statistics on already started competitions that will still be active at time of prediction. Users are more likely to continue already started competitions, especially given the number of submissions made, the points reward and current public rankings.
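Several of the features above (cumulative sums, 3 and 6 month windows, lags, months elapsed, time since last activity) share one pattern: at each month, look only at strictly earlier months. The sketch below illustrates that pattern for a single user in plain Python. The function name, feature names and the use of months rather than days for recency are assumptions for illustration, not the authors' code.

```python
def history_features(monthly_activity, n_lags=6):
    """Build per-month features for one user from strictly earlier months.

    monthly_activity: activity totals, one per consecutive month, oldest first
    (index 0 is assumed to be the user's first month on the platform).
    """
    features = []
    for t in range(len(monthly_activity)):
        past = monthly_activity[:t]  # excludes month t itself, so no leakage
        row = {
            "months_elapsed": t,
            "cum_all": sum(past),
            "cum_3m": sum(past[-3:]),
            "cum_6m": sum(past[-6:]),
        }
        # lag_1 is the previous month; None when that month does not exist.
        for k in range(1, n_lags + 1):
            row[f"lag_{k}"] = past[-k] if len(past) >= k else None
        # Months since the most recent active month; None if never active.
        active = [i for i, v in enumerate(past) if v > 0]
        row["months_since_active"] = t - active[-1] if active else None
        # Share of all-time activity that falls in the last 3 months.
        row["ratio_3m"] = row["cum_3m"] / row["cum_all"] if row["cum_all"] else 0.0
        features.append(row)
    return features


# Made-up activity counts for seven consecutive months.
feats = history_features([2, 0, 3, 1, 0, 0, 4])
print(feats[6])
```

Slicing with `[:t]` is what keeps the features historical: the value for the month being predicted never enters its own feature row.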

CROSS VALIDATION STRATEGY:

We performed Stratified KFold after removing the time series component (year and month features). We also validated on the last 3 months of Year 3 data at intervals; we chose this because the previous month's information was the most predictive of the next month.

For all test time conditions, we ensured that the features selected for training matched exactly what would be available at test time.
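The two validation schemes can be sketched as follows. This is a hand-rolled, dependency-free stand-in: in practice a library splitter such as scikit-learn's StratifiedKFold would normally be used, and the function names and the exact holdout rule here are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign each sample index to a fold, keeping class ratios similar per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % n_splits
    return fold_of


def last_months_holdout(months, n_holdout=3):
    """Split indices into (train, valid), with the latest n_holdout months as valid."""
    cutoff = sorted(set(months))[-n_holdout]
    train = [i for i, m in enumerate(months) if m < cutoff]
    valid = [i for i, m in enumerate(months) if m >= cutoff]
    return train, valid


# Made-up labels: 20 inactive (0) and 5 active (1) rows.
labels = [0] * 20 + [1] * 5
folds = stratified_folds(labels)

# Made-up month index per row: one row per month, months 1 to 12.
train_idx, valid_idx = last_months_holdout(list(range(1, 13)))
print(train_idx, valid_idx)
```

The stratified split ignores time on purpose, matching the removal of year and month features, while the holdout split checks that the model still works when predicting later months from earlier ones.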

MODEL TRAINING:

Our final submission was the average of 5-fold predictions, averaged across 3 LightGBM models and 1 CatBoost model. A single model gave 0.9112 on the public LB; averaging yielded only a small boost to 0.9119.

We observed that CV improvements were in sync with the LB score.

We tuned each test condition separately. An improvement in only one condition could yield a boost on the LB, so it was worthwhile to set it up this way.
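The two-level averaging can be sketched as below: average the fold predictions within each model, then average across models. The model names and probabilities are made up, and equal weighting is an assumption, since the post does not mention any weights.

```python
def mean_predictions(prediction_sets):
    """Element-wise mean of several equally long prediction lists."""
    n = len(prediction_sets)
    return [sum(vals) / n for vals in zip(*prediction_sets)]


def blend(fold_preds_by_model):
    """Average over folds within each model, then average across models."""
    per_model = [mean_predictions(folds) for folds in fold_preds_by_model.values()]
    return mean_predictions(per_model)


# Hypothetical test-set probabilities: 2 models, 2 folds each, 2 test rows.
fold_preds_by_model = {
    "lgbm_a": [[0.2, 0.8], [0.4, 0.6]],
    "catboost": [[0.1, 0.9], [0.3, 0.7]],
}
final = blend(fold_preds_by_model)
print([round(p, 3) for p in final])
```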

UNSUCCESSFUL IDEAS:

1. We attempted to build a model that predicts user interest in a competition, labelling competitions within the user's timeline that the user did not interact with as 0 and competitions the user interacted with as 1. This didn't work well due to the high class imbalance. The model predicted some cases correctly but was too optimistic compared to just using the user's historical information. Under the AUC metric, over-optimistic predictions are heavily penalised.

2. User_ID as a feature.
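To illustrate the AUC point in item 1: AUC depends only on how predictions are ranked, so it drops whenever an inactive user is scored above an active one. The sketch below uses the pairwise definition of AUC on made-up labels and scores; it is an illustration, not the competition's scoring code.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0, 0, 0]
well_ranked = [0.9, 0.7, 0.3, 0.2, 0.1, 0.1]
# Same scores, except two inactive users now receive over-optimistic scores.
optimistic = [0.9, 0.7, 0.95, 0.8, 0.1, 0.1]

print(roc_auc(y_true, well_ranked))  # 1.0
print(roc_auc(y_true, optimistic))   # 0.625
```

Raising the scores of just two inactive users costs three of the eight positive-negative pairs, which is why a model that is too optimistic about inactive users loses AUC quickly.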

Thank you @Zindi for an exciting competition. Shoutout to @holar, who teamed up and helped goal the approach at the last minute.

Congratulations to all that participated.

Discussion 9 answers

Thanks for this, very useful insights that will definitely help for other competitions and just for the lessons captured and congratulations too.

25 Jan 2022, 07:51
flamethrower

I'm glad some learning points could be picked @psonyango. Thank you

21db

Wow nice work, The feature/stats combo! I was also trying to get the testset distribution to be as similar as possible to the train under each condition but I just couldn't acquire those skills in time to make that happen, I hope to learn a great deal further when you share your code.

Congratz!

25 Jan 2022, 09:08
flamethrower

Thank you @DanielBruintjies. I will definitely share once Zindi confirms everything. Hope it helps

21db

@flamethrower Can you share your code please?

flamethrower

Apologies for the late response, I will have it on Github and share the link.

21db

Awesome!

nicely done. thanks for sharing.

25 Jan 2022, 10:31
flamethrower