☎️ Hot Topic: 3rd place solution

Is using user_id as a feature allowed? If it was hashed, it was hashed for a reason; otherwise Zindi could have shared the unhashed version. @Zindi, kindly confirm.

29 Nov 2021, 17:23

Upvotes 0

bbb

It seems like it doesn't break any rules of @Zindi or this competition. Also, it is common practice to check for this in big competitions ¯\_(ツ)_/¯

replied to salimshaikh786, 29 Nov 2021, 17:28

Upvotes 0

bbb

This can even be used partially in production: it's like an EMA of churn over the last N days (if you use user_id_int for churn smoothing).
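
A minimal sketch of the smoothing idea, assuming that user_id_int (the decoded integer id) grows roughly with time, which this thread suggests but does not confirm. The function name `ema_churn`, the smoothing factor `alpha`, and the toy records are illustrative and not taken from the solution.

```python
def ema_churn(records, alpha=0.1):
    """Exponential moving average of churn labels, ordered by user_id_int.

    records: iterable of (user_id_int, churn) pairs, churn in {0, 1}.
    Returns a dict mapping each user_id_int to the EMA of the labels seen
    strictly before it, so a row's own label never leaks into its feature.
    """
    ordered = sorted(records)
    ema = None
    features = {}
    for user_id_int, churn in ordered:
        features[user_id_int] = ema  # None for the first row: no history yet
        ema = churn if ema is None else alpha * churn + (1 - alpha) * ema
    return features


# Toy data (hypothetical): later ids churn more often than earlier ones.
toy = [(3, 1), (1, 0), (2, 0), (4, 1)]
smoothed = ema_churn(toy, alpha=0.5)
```

In production the same recurrence would run over labels ordered by observation date rather than by id, using only labels that are already known at prediction time.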

replied to bbb, 29 Nov 2021, 17:31 (edited 5 minutes later)

Upvotes 0

ravinder

@bbb... to be very honest, I do not think user_id should be allowed in any model. I have been in ML for many years, and using user_id seems really irrelevant and gives a very bad impression. It may be masked, but if the competition presents it as user_id, it is just an identification label. @Zindi, I would request that you discourage users from using user_id or customer_id as features. It gives a really wrong impression to junior Data Scientists and is not at all justified towards fellow competitors. You can consult any DS with practical experience; they would echo the same thoughts.

29 Nov 2021, 18:17

Upvotes 0

MICADEE

LAHASCOM (Freelance)

@bbb Congratulations on your approach, especially the use of the "Null Importance" feature selection technique. I have used this method before. A great technique, I must say.

@ravinder Yeah... You're quite right about the use of "user_id" and "customer_id" as features. I am sure @Zindi will do the needful on this.
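
For readers who have not met null importance feature selection: the idea is to compare a feature's importance on the real target with the importances it gets when the target is randomly shuffled, and to keep only features that clearly beat that "null" distribution. The thread does not show bbb's implementation, so the following is only a sketch under stated assumptions: absolute Pearson correlation stands in for the model-based importance (the technique is normally run with tree-model importances), and the names `abs_corr` and `null_importance_select`, the 0.95 quantile, and the toy features are all hypothetical.

```python
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return abs(sxy) / (sxx * syy) ** 0.5


def null_importance_select(features, target, n_runs=50, quantile=0.95, seed=0):
    """Keep features whose real importance beats their null distribution.

    features: dict of name -> list of values; target: list of labels.
    """
    rng = random.Random(seed)
    null_scores = {name: [] for name in features}
    for _ in range(n_runs):
        shuffled = target[:]
        rng.shuffle(shuffled)  # break any real feature/target relationship
        for name, values in features.items():
            null_scores[name].append(abs_corr(values, shuffled))
    kept = []
    for name, values in features.items():
        actual = abs_corr(values, target)
        ranked = sorted(null_scores[name])
        threshold = ranked[int(quantile * (n_runs - 1))]
        if actual > threshold:
            kept.append(name)
    return kept


# Toy data (hypothetical): one informative feature, one constant feature.
rng = random.Random(1)
target = [rng.randint(0, 1) for _ in range(200)]
features = {
    "signal": [t + rng.gauss(0, 0.1) for t in target],
    "constant": [1.0] * 200,
}
kept = null_importance_select(features, target)
```

With a real model, `abs_corr` would be replaced by refitting the model on each shuffled target and reading its feature importances, which is slower but captures interactions that a correlation cannot.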

replied to ravinder, 29 Nov 2021, 20:35

Upvotes 1

bbb

I mean, okay: of course it's not a production-ready feature, and this one is more about the competition. But hey: we all have the very same dataset and can use it any way we like as long as it doesn't break the rules. Everyone can use the exact same feature. If this is not permitted, there should be a rule prohibiting it. Otherwise it's not very fair to me.

Also, I think this feature is not as leaky as it seems: in production you can use the dynamics of churn via an EMA. You can also track fundamental changes in the product's prices/advertising/... and adjust the model accordingly. My boosting models just did this implicitly by creating splits on user_id_int.

I also bet that the top three solutions all use it in some way (given the very close scores).

replied to MICADEE, 30 Nov 2021, 01:16

Upvotes 0

Yassine-Student

Thank you so much for sharing these great hints.

29 Nov 2021, 18:50

Upvotes 0

Vahe

Thank you and Congratulations!!!

29 Nov 2021, 19:37

Upvotes 0

isa_k_dsmlkz

Congrats @bbb, and thanks a lot for the solution writeup! Great solution.

- Could you give examples of the groupby features which improved the score? I did not manage to find any. Thanks for the website link.

- What do you think is the reason the unhashed integer user_ids helped to improve the score?

Update:

- And what was your ensembling technique? All the methods I used (averaging, median, linear and logistic regression, lgb, NN, Ridge) gave me only a marginal improvement (e.g. from 0.9316 to 0.9317 AUC).

30 Nov 2021, 05:16 (edited 34 minutes later)

Upvotes 0

bbb

Thank you :)

1) Here are some:

 'GB_DIFF_FEATURE__REGION__DATA_VOLUME',
 'GB_DIFF_FEATURE__REGION__MONTANT',
 'GB_DIFF_FEATURE__REGION__ON_NET',
 'GB_DIFF_FEATURE__REGION__REVENUE',
 'GB_DIFF_FEATURE__REGION__ZONE1',
 'GB_DIFF_FEATURE__REGION__ZONE2',
 'GB_DIFF_FEATURE__TENURE__ARPU_SEGMENT',
 'GB_DIFF_FEATURE__TENURE__DATA_VOLUME',
 'GB_DIFF_FEATURE__TENURE__MONTANT',
 'GB_DIFF_FEATURE__TENURE__ON_NET',
 'GB_DIFF_FEATURE__TENURE__REVENUE',
 'GB_DIFF_FEATURE__TENURE__ZONE1',
 'GB_DIFF_FEATURE__TENURE__ZONE2',
 'GB_FEATURE__REGION__DATA_VOLUME',
 'GB_FEATURE__REGION__REVENUE',
 'GB_FEATURE__REGION__ZONE2',
 'GB_FEATURE__TENURE__ARPU_SEGMENT',
 'GB_FEATURE__TENURE__DATA_VOLUME',
 'GB_FEATURE__TENURE__MONTANT',
 'GB_FEATURE__TENURE__ON_NET',
 'GB_FEATURE__TENURE__ORANGE',
 'GB_FEATURE__TENURE__REVENUE',
 'GB_FEATURE__TENURE__ZONE1',
 'GB_FEATURE__TENURE__ZONE2'
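
The post does not say how these columns were computed. Going only by the names, one plausible reading is that GB_FEATURE__KEY__COLUMN is an aggregate of COLUMN within each KEY group and GB_DIFF_FEATURE__KEY__COLUMN is the row's own value minus that aggregate. The sketch below assumes the aggregate is the group mean; the function name `add_groupby_features` and the toy rows are hypothetical.

```python
from collections import defaultdict


def add_groupby_features(rows, key, column):
    """Add a group-mean feature and a difference-from-group-mean feature.

    rows: list of dicts. Missing values (None) are ignored when computing
    group means and produce None for the difference feature.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        if row.get(key) is not None and row.get(column) is not None:
            sums[row[key]] += row[column]
            counts[row[key]] += 1
    gb_name = f"GB_FEATURE__{key}__{column}"
    diff_name = f"GB_DIFF_FEATURE__{key}__{column}"
    for row in rows:
        group, value = row.get(key), row.get(column)
        mean = sums[group] / counts[group] if counts.get(group) else None
        row[gb_name] = mean
        row[diff_name] = None if mean is None or value is None else value - mean
    return rows


# Toy rows (hypothetical values, real column names from the list above).
rows = [
    {"REGION": "A", "REVENUE": 100.0},
    {"REGION": "A", "REVENUE": 300.0},
    {"REGION": "B", "REVENUE": 50.0},
]
add_groupby_features(rows, "REGION", "REVENUE")
```

The difference feature tells the model how a user compares with peers in the same REGION or TENURE group, which a tree cannot easily derive from the raw column alone.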

2) I think it helps because it works like a time feature, so models can implicitly capture changes in advertising/prices/... Churn can change a lot at these points.

3) When I started to explore how scores vary from split to split (even with stratified data), I found that the standard deviation is large. So I decided not to use linear models or anything else on top, because of possible overfitting. When the scores vary a lot, it's often more useful to just build more good, diverse models and blend them with equal weights. I did not have enough time to do it more precisely and properly, as in this video (it is in Russian, but there are subtitles): https://www.youtube.com/watch?v=HT3QpRp2ewA
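
A small sketch of equal-weight blending and of measuring split-to-split spread, using only the standard library. The predictions and per-split scores are made-up toy numbers chosen so that the two models err on different rows; `auc` and `equal_blend` are illustrative helper names, not functions from the solution.

```python
from statistics import mean, stdev


def auc(labels, scores):
    """ROC AUC via pairwise comparison (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def equal_blend(*model_scores):
    """Average the predictions of several models with equal weights."""
    return [mean(row) for row in zip(*model_scores)]


labels = [0, 0, 1, 1]
model_a = [0.1, 0.6, 0.5, 0.9]  # hypothetical: misranks rows 2 and 3
model_b = [0.4, 0.1, 0.3, 0.8]  # hypothetical: misranks rows 1 and 3
blend = equal_blend(model_a, model_b)

# Hypothetical per-split validation scores; a large spread relative to the
# gain from a learned stacker is the reason to prefer equal weights.
fold_aucs = [0.930, 0.934, 0.927]
spread = stdev(fold_aucs)
```

Equal weights have no parameters to fit, so they cannot overfit to a noisy validation split, whereas a stacker fitted on out-of-fold predictions can.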

replied to isa_k_dsmlkz, 30 Nov 2021, 08:49

Upvotes 0

isa_k_dsmlkz

Thanks a lot. Great feature engineering to learn.

replied to bbb, 30 Nov 2021, 11:31

Upvotes 0

goldentom42

@bbb great work, and thanks for sharing libraries, ideas and techniques!

Congratulations

1 Dec 2021, 07:06

Upvotes 0
