For some reason it still doesn't work
So you can find full text here
Congratulations on such remarkable work, and thanks a lot for sharing your ideas.
Thanks for sharing. Nice work.
And just in case: sorry for the bad English, I'm not a native speaker 😅
Relax, bbb, write in Russian; if people want to understand, they'll translate it with Google :)
Well, I wrote it the way I wrote it ))
Thanks! Great work!
Is using user_id as a feature allowed? If it was hashed, it was hashed for a reason; otherwise Zindi could have shared the unhashed version. @Zindi, kindly confirm.
It seems like it doesn't break any rules of @Zindi or this competition. Also, it is common practice to check this in big competitions ¯\_(ツ)_/¯
It can even be partially used in production: it's like an EMA of churn over the last N days (if you use user_id_int for churn smoothing).
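A minimal sketch of the churn-smoothing idea, assuming that a larger user_id_int means a later registration (the thread suggests this but does not confirm it) and that ids are unique. The function name `ema_churn_by_id` and the `alpha` parameter are hypothetical, not taken from the actual solution.

```python
def ema_churn_by_id(user_ids, churned, alpha=0.1):
    """Map each user id to an EMA of churn labels of users with smaller ids.

    Assumption: user ids are unique integers that grow over time, so
    "smaller id" stands in for "registered earlier".
    """
    # Visit users in increasing id order (a stand-in for time order).
    order = sorted(range(len(user_ids)), key=lambda i: user_ids[i])
    ema = None
    features = {}
    for i in order:
        # The feature only sees earlier users, so a user's own label
        # never leaks into their own feature value.
        features[user_ids[i]] = ema
        label = float(churned[i])
        ema = label if ema is None else alpha * label + (1 - alpha) * ema
    return features
```

The first user gets `None` because no earlier churn has been observed yet. In a live system the same running average could be kept over the last N days of users instead of over id order.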
@bbb... to be very honest, I do not think user_id should be allowed in any model. I have been in ML for many years, and using user_id seems really irrelevant and gives a very bad impression. It may be masked, but if the competition presents it as user_id, it is just an identification label. @Zindi, I would request that you discourage users from using user_id or customer_id as features. This gives a really wrong impression to junior data scientists and is not at all fair to fellow competitors. You can consult any practically experienced DS; they would echo the same thoughts.
@bbb Congratulations on your approach, especially the use of the "Null Importance" feature selection technique. I have used this method before. A great technique, I must say.
@ravinder Yeah... You're quite right about the use of "user_id" and "customer_id" as features. I am sure @Zindi will do the needful on this.
I mean, okay: of course it's not a production-ready feature, and this is more about the competition. But hey, we all have the very same dataset and can use it any way we like as long as it doesn't break the rules. Everyone can use the exact same feature. If this is not permitted, there should be a rule prohibiting it. Otherwise it's not very fair to me.
Also, I think this feature is not as leaky as it seems: in production you can use the dynamics of churn with an EMA. You can also track major changes in the product's prices/advertisement/... and adjust the model accordingly. My boosting models just did this implicitly by creating splits on user_id_int.
I also bet that the first three places use it somehow (given the very close scores).
Thank you so much for sharing these great hints.
Thank you and Congratulations!!!
Congrats @bbb, and thanks a lot for the solution writeup! Great solution.
- Could you give examples of the groupby features that improved the score? I did not manage to find any. Thanks for the website link.
- What do you assume were the reasons that the unhashed integer user_ids helped improve the score?
Update:
- And what was your ensembling technique? All the methods I tried (averaging, median, linear and logistic regression, lgb, NN, Ridge) gave me only marginal improvement (e.g. from 0.9316 to 0.9317 AUC).
Thank you :)
1) Here are some:
2) I think it helps because it works like a time feature, so models can implicitly capture changes in advertisement/prices/... Churn can change a lot at these points.
3) When I started to explore how scores vary from split to split (even when the data is stratified), I found that the std is large. So I decided not to use linear models or anything else for ensembling, due to possible overfitting issues. When the scores vary a lot, it's often more useful to just create more good, diverse models and mix them with equal weights. I did not have enough time to do it more precisely and properly, as in this video (it is in Russian but there are subtitles): https://www.youtube.com/watch?v=HT3QpRp2ewA
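A rough sketch of the two ideas in point 3, assuming binary labels and per-row predicted probabilities: measure how much AUC spreads across folds, and blend models with equal weights instead of fitting a stacker. The helper names (`auc`, `equal_blend`, `fold_spread`) are hypothetical and plain Python is used for illustration; this is not the code from the actual solution.

```python
from statistics import mean, pstdev


def auc(labels, scores):
    """ROC AUC by comparing every positive/negative pair (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def equal_blend(predictions):
    """Average several models' predictions row by row with equal weights."""
    return [mean(row) for row in zip(*predictions)]


def fold_spread(fold_aucs):
    """Mean and standard deviation of per-fold AUC scores."""
    return mean(fold_aucs), pstdev(fold_aucs)


# Toy data: each model mis-ranks a different pair of rows.
labels = [0, 0, 1, 1]
model_a = [0.1, 0.6, 0.4, 0.9]
model_b = [0.5, 0.1, 0.4, 0.9]
blended = equal_blend([model_a, model_b])
print(auc(labels, model_a), auc(labels, model_b), auc(labels, blended))
```

If the fold-to-fold std from `fold_spread` is larger than the gain a fitted blender shows on validation, that gain is hard to distinguish from noise, which is the argument for equal weights here.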
Thanks a lot. Great feature engineering to learn.
@bbb great work, and thanks for sharing libraries, ideas and techniques!
Congratulations