I want to congratulate all the winners and thank the organizers for this competition.
My solution is a blend of two types of models.
For each ID, one of the products equal to 1 was randomly replaced with 0. To reduce the impact of randomness I repeated the substitution 5 times, so my training set was 5 times bigger than the original.
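The substitution step could be sketched like this (a minimal illustration, assuming products are stored as 0/1 flags per row; `augment` and the toy data are hypothetical names, not the author's actual code):

```python
import random

def augment(rows, n_repeats=5, seed=0):
    """For each row (a dict of product -> 0/1 flags), create n_repeats
    copies where one randomly chosen owned product (value 1) is flipped
    to 0 and becomes the training target for that copy."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        owned = [p for p, v in row.items() if v == 1]
        if not owned:
            continue
        for _ in range(n_repeats):
            target = rng.choice(owned)
            features = dict(row)
            features[target] = 0  # hide the product we want to predict
            out.append((features, target))
    return out

# toy example: one customer owning products A and C
rows = [{"A": 1, "B": 0, "C": 1}]
train = augment(rows)
print(len(train))  # 5 augmented copies of the single input row
```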
Then 20 binary classifiers were trained (for the product with only 4 positive values, all predictions were set to 0). A separate set of features was selected for each model.
After all models were tuned, I applied a softmax to scale the predictions across the zero-valued products of each ID.
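The softmax rescaling, as I understand it, only runs over the products the customer does not already own. A minimal sketch (function and toy values are my own, not from the author):

```python
import math

def softmax_over_zeros(row, probs):
    """Rescale per-product probabilities with a softmax, but only over
    the products the customer does not currently own (flag == 0)."""
    zero_products = [p for p, v in row.items() if v == 0]
    exps = {p: math.exp(probs[p]) for p in zero_products}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}

row = {"A": 1, "B": 0, "C": 0}
probs = {"A": 0.9, "B": 0.2, "C": 0.6}
scaled = softmax_over_zeros(row, probs)
print(scaled)  # probabilities for B and C now sum to 1
```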
For the 2nd-stage models, instead of the products with 0s I used the predictions from the 1st stage, plus a separate feature: the probability for the current product.
Predictions from the 2nd stage were blended with predictions from a multiclass model.
The main models were LightGBM, but for stability I repeated the whole training process with XGBoost and CatBoost and blended the predictions.
Validation: 5-fold cross-validation, with all rows for the same ID kept in one fold.
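Keeping all rows of an ID in one fold prevents the augmented copies from leaking across the train/validation split. A ready-made option is scikit-learn's `GroupKFold`; the grouping logic itself can be sketched in plain Python (hypothetical helper, not the author's code):

```python
def grouped_folds(ids, n_folds=5):
    """Assign every row to a fold by its ID, so all rows sharing an ID
    land in the same fold."""
    unique_ids = sorted(set(ids))
    fold_of_id = {cid: i % n_folds for i, cid in enumerate(unique_ids)}
    return [fold_of_id[cid] for cid in ids]

ids = [101, 101, 102, 103, 103, 104, 105]
folds = grouped_folds(ids)
print(folds)  # rows with the same ID always share a fold
```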
Original features, the sum of all products, and combinations of products in pairs and triples - for example, the combination for products A and B was 100 * A + 10 * B (for the second stage I rounded predictions to 1 decimal point). I guess it may be worse than target encoding, but it is much faster, simpler, and avoids overfitting.
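The pair encoding packs two flags (or two rounded probabilities, for the 2nd stage) into a single low-cardinality value. A sketch of the scheme as described (function names are mine):

```python
def pair_feature(a, b):
    """Encode a pair of binary product flags (a, b) as one categorical
    value, per the 100 * A + 10 * B scheme."""
    return 100 * a + 10 * b

def pair_feature_stage2(pa, pb):
    """For the 2nd stage the inputs are probabilities, rounded to one
    decimal before combining so the feature stays low-cardinality."""
    return 100 * round(pa, 1) + 10 * round(pb, 1)

print(pair_feature(1, 0))               # flags (1, 0) -> 100
print(pair_feature_stage2(0.87, 0.24))  # 100*0.9 + 10*0.2 = 92.0
```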
For each model, features were selected one by one.
For predictions of .995 and higher there were no false positives. This means that all .995+ predictions could be rounded to 1 and the other products for these IDs could be rounded to 0. This group was nearly one third of the dataset, so it helped me move up a few places.
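Combined with the point raised below about the rules, this "rounding" would in practice mean pushing values very close to (but not exactly) 0 and 1, so log loss stays finite. A hedged sketch (my own function, with an assumed epsilon):

```python
def sharpen(preds, threshold=0.995, eps=1e-15):
    """If any product's prediction for an ID is >= threshold, push it
    toward 1 and push every other product for that ID toward 0; values
    stay strictly inside (0, 1) so log loss remains finite."""
    if any(p >= threshold for p in preds.values()):
        top = max(preds, key=preds.get)
        return {p: (1 - eps if p == top else eps) for p in preds}
    return dict(preds)

preds = {"A": 0.997, "B": 0.40, "C": 0.02}
print(sharpen(preds))  # A pushed to ~1, B and C pushed to ~0
```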
Thank you for sharing the outline of your solution!
According to the rules, it is not allowed to round predictions to 0 and 1. However, my reading of the rules is that it is allowed to replace the predictions with 1e-53 and 1 - 1e-53.
I don't think that is a strict rule.
Without 0s and 1s it is much easier to calculate log loss, but we already have a lot of 1s, so you would have to tweak the metric calculation anyway. It is already done in scikit-learn, so I didn't expect a problem.
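The tweak in question is clipping the predictions away from exactly 0 and 1 before taking logs, which is what scikit-learn's `log_loss` has traditionally done internally. A minimal pure-Python sketch of that clipping (my own function, with an assumed epsilon):

```python
import math

def safe_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss with predictions clipped into (eps, 1 - eps),
    so exact 0s and 1s do not produce infinite loss."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

loss = safe_log_loss([1, 0, 1], [1.0, 0.0, 0.8])
print(loss)  # finite despite the exact 1.0 and 0.0 predictions
```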
Hey there, can you share your repo? It would make things clearer.
Hi, nice approach. Don't you want to share your repo?
Very elegant solution! However, it is a bit hard for me to understand this phrase: "For the 2nd-stage models, instead of the products with 0s I used the predictions from the 1st stage, plus a separate feature: the probability for the current product." If I read it correctly, you used probabilities instead of 0s and trained several GBDTs with a multiclass Y. However, what did your "probability of the current product" feature look like? It seems that information is already included in the probabilities.
No - the 2nd stage was also 20 binary models, so for each model I used the probability for the current product.
The multiclass model was a separate approach, which I blended with the 2nd-stage predictions.