The problem formulation is not a major aspect of our solution, but we feel it is worth mentioning.
This competition can be formulated in many ways. The problem statement defines an active user as one who enters a competition, makes a submission, or engages through the discussion forums. The target variable is therefore an outcome of four different activities, and this gave us an opportunity to try out different formulation techniques, as discussed below:
1. We solved a binary classification problem. The outcome is 1 if a user is active in a given month, else 0. This formulation is probably the one most participants used, and it worked like a charm for us as well.
2. We solved a forecasting problem. The idea was to predict the number of different activities that a user might perform in the next three months. To avoid building a separate model per user, we formulated the problem as a regression problem and used lag features, time features and other extraneous variables for model training. The performance of this approach was good enough, but it didn't beat the performance of approach #1.
3. We solved the problem as a multi-class classification problem where each class corresponds to one of the four activities performed by the user. This approach was competing well with approach #1. We improved it further by combining low-frequency classes, ending up with three main classes as described below:
a. A class to determine if a user participated in a competition
b. A class to determine if a user submitted, discussed or commented
c. A class for users who performed none of the above activities
This approach gave us "slightly" better performance than approach #1, so we decided to stick with it. Later we created two separate models based on formulations #1 and #3 and blended them together for the final submission.
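For reference, here is a minimal sketch of how the three-class target can be derived from the four activity flags. The column names below are placeholders rather than the actual dataset columns, and the way overlapping activities are resolved is just one possible choice:

```python
import pandas as pd

def build_multiclass_target(df: pd.DataFrame) -> pd.Series:
    """Map the four activity flags to the three combined classes.

    Column names are placeholders. Class 1: participated in a competition,
    class 2: submitted, discussed or commented, class 0: none of the above.
    """
    participated = df["comp_participation"] == 1
    engaged = df[["submission", "discussion", "comment"]].max(axis=1) == 1

    target = pd.Series(0, index=df.index)  # class c: no activity
    target[engaged] = 2                    # class b: submitted / discussed / commented
    target[participated] = 1               # class a: competition participation
    return target
```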
Please note that using any one of these formulations would likely have kept us at the same position on the private LB. In real-life problems, if the right formulation is unclear, it is always good to try several formulation techniques and discuss the solution with seniors and domain experts.
Most of the time was spent on feature engineering. To understand the features below, consider an activity by a user to be a competition participation, a submission, a discussion or a comment. Separate features were created for each activity.
A list of some of the key features is given below:
1. The number of activities by a user in the previous month
2. The momentum in the number of activities, captured by taking the difference between the most recent month's activity and the previous month's
3. The cumulative sum of the number of activities by the user was calculated and the number of activities per month was derived for all activities except submissions.
Instead of just using the number of submissions, we used the number of submissions per competition, which gave a slightly better score.
4. The number of months since the last activity by the user
5. The number of months between joining Zindi and the user's last activity
6. The number of active competitions of a user in the given month
7. The rank of the user's interest in the competitions running in the given month, based on the user's competition history
8. The number of months since joining Zindi
9. The total number of users and the number of new users in each month
10. The mean, deviation and max experience of users in each month
11. Count-based features for country and user_id
12. User-based features like featureX, featureY, country and points
13. Time-based features like year and month
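As a minimal sketch, a few of the features above could be built with pandas along these lines, assuming a long-format frame with one row per user per month and hypothetical helper columns such as user_id, timestamp_idx, months_since_joining and one count column per activity:

```python
import pandas as pd

def add_activity_features(df: pd.DataFrame, activity: str) -> pd.DataFrame:
    """Lag, momentum, cumulative and recency features for one activity type.

    Assumes df is sorted by user_id and timestamp; all column names are
    placeholders for illustration only.
    """
    grp = df.groupby("user_id")[activity]

    # 1. activities in the previous month
    df[f"{activity}_lag1"] = grp.shift(1)
    # 2. momentum: previous month minus the month before it
    df[f"{activity}_momentum"] = grp.shift(1) - grp.shift(2)
    # 3. cumulative activities so far and activities per month since joining
    df[f"{activity}_cumsum"] = grp.cumsum() - df[activity]
    df[f"{activity}_per_month"] = (
        df[f"{activity}_cumsum"] / df["months_since_joining"].clip(lower=1)
    )
    # 4. months since the last activity by the user
    last_active = df["timestamp_idx"].where(df[activity] > 0)
    df[f"{activity}_months_since_last"] = (
        df["timestamp_idx"] - last_active.groupby(df["user_id"]).ffill()
    )
    return df
```

A function like this would be called once per activity type, which matches the "separate features for each activity" idea above.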
We performed GroupKFold on the timestamp column (a unique combination of year and month). Our final submission was based on the average of the 5 fold predictions. Our CV score improvements were consistent with the LB score. At regular intervals, we verified our score on a holdout dataset made up of the last 3 months of training data.
1. Model Used: LightGBM
2. Metric: auc_mu
3. We spent some time tuning the hyperparameters
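Putting the validation scheme and model together, a rough sketch of the setup could look like the following (X, y, timestamps and X_test are placeholders, and the parameters shown are illustrative rather than our tuned values):

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import GroupKFold

def train_cv(X, y, timestamps, X_test, num_class=3):
    """5-fold GroupKFold grouped by timestamp; test predictions averaged over folds."""
    params = {
        "objective": "multiclass",
        "num_class": num_class,
        "metric": "auc_mu",        # the metric we tracked
        "learning_rate": 0.05,     # illustrative, not the tuned value
        "verbosity": -1,
    }
    gkf = GroupKFold(n_splits=5)
    test_pred = np.zeros((len(X_test), num_class))
    for train_idx, valid_idx in gkf.split(X, y, groups=timestamps):
        train_set = lgb.Dataset(X.iloc[train_idx], y.iloc[train_idx])
        valid_set = lgb.Dataset(X.iloc[valid_idx], y.iloc[valid_idx])
        model = lgb.train(
            params,
            train_set,
            num_boost_round=1000,
            valid_sets=[valid_set],
            callbacks=[lgb.early_stopping(50)],
        )
        # average the fold predictions for the final submission
        test_pred += model.predict(X_test) / gkf.n_splits
    return test_pred
```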
1. We tried different models like XGBoost and CatBoost but didn't see a lot of improvement in the final scores. We also tried blending different types of boosting algorithms but didn't see much improvement.
2. User_ID as a feature
3. Count-based features of various categorical features
4. Using only the latest year of data for training
Thank you @zindi for the amazing competition.
Last but not least, a special shout-out to my teammate @devnikhilmishra.
Nice work! And congratulations to you and your teammate!
It's awesome you got the multiclass to work. I didn't get to try that.
How did you manage to get it to one target? Sum/mean?
For binary classification, the existing target in the dataset was used.
For the regression problem, the average was used to keep the scale between 0 and 1.
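Roughly something like this (column names are placeholders):

```python
# Average the four activity flags into a single regression target in [0, 1]
activity_cols = ["comp_participation", "submission", "discussion", "comment"]
df["regression_target"] = df[activity_cols].mean(axis=1)
```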
ooh okay
Thanks for sharing the amazing idea that got you the 2nd place position. I'm all for the learning. If you don't mind, you might as well share the complete solution.
Our code is yet to be evaluated by Zindi, and only after that will the final LB be frozen. I think making the code public might not be the right idea for now.
OK that's understandable.
Remarkable approach. I considered training individual interaction components but never tried it once binary worked well. The multi-class approach is a more efficient way to do it. I will always try multiple formulations moving forward. Please, how did you determine the rank of user interest feature? Thank you for sharing, and congratulations.
The rank of user interest can be determined using these steps:
1. Take the competitions data (the best part is that this data has information about future competitions as well). We convert it into a format which looks similar to one-hot encoding (OHE), so we now have CompID and all of its features, say FeatureA_1, FeatureA_2, etc.
2. Get all the competitions happening in each month. Merge this data with the data from step #1, then sum the competition features for each month to generate a matrix [timestamp x competition_features].
3. Concatenate the train and test data. Merge this data with the data from step #1, then sum the competition features for each user to generate a matrix [user_id x competition_features].
4. Perform matrix multiplication of [user_id x competition_features] * [competition_features x timestamp]
5. We get a final matrix [user_id x timestamp] with the respective user interest values.
6. Rank the user interest in descending order within each timestamp. The user with the highest interest is rank #1, and so on.
You can always try a separate interest rank for each competition feature, but we stuck with a single rank across the 5 competition features (FeatureA, FeatureB, FeatureC, FeatureD, and FeatureE).
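Roughly, in code (the frames and column names below are placeholders for illustration; the real data uses its own names):

```python
import pandas as pd

def user_interest_rank(comp_ohe, month_comps, user_comps):
    """Steps 2-6 above, sketched with placeholder frames.

    comp_ohe:    one row per CompID with its one-hot encoded features (step 1)
    month_comps: rows of (timestamp, CompID) for competitions running that month
    user_comps:  rows of (user_id, CompID) from the concatenated train + test data
    """
    # step 2: [timestamp x competition_features]
    month_feats = (
        month_comps.merge(comp_ohe, on="CompID")
        .drop(columns="CompID")
        .groupby("timestamp").sum()
    )
    # step 3: [user_id x competition_features]
    user_feats = (
        user_comps.merge(comp_ohe, on="CompID")
        .drop(columns="CompID")
        .groupby("user_id").sum()
    )
    # align feature columns before multiplying
    month_feats = month_feats.reindex(columns=user_feats.columns, fill_value=0)
    # steps 4-5: [user_id x competition_features] @ [competition_features x timestamp]
    interest = pd.DataFrame(
        user_feats.values @ month_feats.values.T,
        index=user_feats.index,
        columns=month_feats.index,
    )
    # step 6: rank users by interest within each timestamp (1 = highest interest)
    return interest.rank(axis=0, ascending=False)
```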
Hope this helps!
Wow! This is very interesting. The matrix multiplication approach calculates a rank score that acts as a similarity score between the competitions running in a month and a user's historical competition feature encodings.
I really wanted to take the user competition score into account; this is again a more efficient method compared to attempting to build a model for it.
Thank you for the clarification.
Great description and wonderful work, thanks @eat-sleep-ai-repeat. It was really fun working together.
@devnikhilmishra @eat-sleep-ai-repeat Can you please share your code? Really interested in seeing how you calculated the rank of the user interest. Thanks
@DanielBruintjies - The code is here
The confirmation email from Zindi arrived yesterday, hence the delay in my response.
Hope this helps!
Thank you so much! @eat-sleep-ai-repeat