Hello everyone! First, congrats to the winners and thanks to Zindi for hosting this competition. I want to give a brief description of my solution, and I hope others will do so as well. On the private LB I got a very bad score (2.0) because I used a specific postprocessing function for my submission. Without it my score is 1.34, which would have put me 6th.
This postprocessing function relied on the rows duplicated between the training and test sets, plus merchants that have only one class in the training data. It failed because of the inconsistency between the target values in training and testing: one sample with the same user ID and merchant name can get different target values.
Aside from that function, my solution depended heavily on feature engineering. I used statistics based on aggregations of the data, derived per User_Id and Merchant_Name.
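A minimal sketch of this kind of aggregation feature, using pandas. The column names and toy values below are assumptions for illustration, not the competition's actual schema:

```python
import pandas as pd

# Toy transactions table; column names are illustrative assumptions.
df = pd.DataFrame({
    "User_Id": [1, 1, 2, 2, 2],
    "Merchant_Name": ["coffee shop", "coffee shop",
                      "book store", "book store", "coffee shop"],
    "Purchase_Value": [5.0, 7.0, 20.0, 15.0, 6.0],
})

# Compute group-level statistics per user and per merchant, then
# merge them back so every row carries its group's summary.
for key in ["User_Id", "Merchant_Name"]:
    stats = (df.groupby(key)["Purchase_Value"]
               .agg(["mean", "std", "min", "max", "count"])
               .add_prefix(f"{key}_"))
    df = df.merge(stats, left_on=key, right_index=True, how="left")
```

Each row now has features like `User_Id_mean` (that user's average purchase value) alongside the raw columns.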
Another important feature idea was using the merchant name properly. My first thought was to simply label-encode it, but after some thinking I decided to treat it as NLP text. So I added a lot of text features (number of words, length, etc.). I also used CountVectorizer to count the frequency of each word, then applied a topic-modeling algorithm, which clusters the texts by topic. These features gave quite a good boost.
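A sketch of that pipeline with scikit-learn, assuming LDA as the topic model (the post doesn't name the specific algorithm) and made-up merchant names:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical merchant names standing in for the real column.
merchants = pd.Series([
    "starbucks coffee downtown",
    "corner coffee house",
    "city book store",
    "online book shop",
])

# Simple text statistics on the merchant name.
features = pd.DataFrame({
    "num_words": merchants.str.split().str.len(),
    "name_length": merchants.str.len(),
})

# Word-frequency counts, then topic modeling (LDA here) to place
# each merchant name on a small set of latent topics.
counts = CountVectorizer().fit_transform(merchants)
topics = LatentDirichletAllocation(n_components=2,
                                   random_state=0).fit_transform(counts)
features[["topic_0", "topic_1"]] = topics
```

The topic columns are per-document topic proportions, so similar merchant names end up with similar topic vectors.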
Finally, I used a simple weighted average of random forest, CatBoost, and KNN. I didn't use XGBoost or LightGBM because they tend to give bad results on small numbers of samples.
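A probability-blending sketch of that ensemble. To keep it scikit-learn-only, GradientBoostingClassifier stands in for CatBoost, and the weights are illustrative, not the ones actually used:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the competition features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Fit each model and take a weighted average of class probabilities.
# GradientBoostingClassifier is a stand-in for CatBoost here;
# the weights are made up for illustration.
models = [
    (RandomForestClassifier(random_state=0), 0.5),
    (GradientBoostingClassifier(random_state=0), 0.3),
    (KNeighborsClassifier(), 0.2),
]
blend = sum(w * m.fit(X, y).predict_proba(X) for m, w in models)
pred = blend.argmax(axis=1)
```

Because the weights sum to 1 and each model's probabilities sum to 1 per row, the blended matrix is still a valid probability distribution.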
For validation I used StratifiedKFold with 2 folds (since one of the classes has only 2 instances in the training data), but with 5 different seeds. This gave me 10 folds with quite consistent results that tracked the leaderboard very well. I used these folds to tune my models and test new features.
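The 2-folds-times-5-seeds scheme can be sketched like this, with toy labels that mimic the rare class:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with a rare class of only 2 instances, which caps
# the number of stratified folds at 2.
y = np.array([0] * 18 + [1] * 2)
X = np.arange(len(y)).reshape(-1, 1)

folds = []
for seed in range(5):  # 5 seeds x 2 splits = 10 train/valid pairs
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    for train_idx, valid_idx in skf.split(X, y):
        folds.append((train_idx, valid_idx))
```

Stratification guarantees each validation fold contains exactly one instance of the rare class, so every split is usable for scoring.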
My solution was quite good; my only mistake was the submission postprocessing, which I didn't expect to hurt the score so badly.
Although I didn't get a good rank, the amount of feature-engineering and modeling ideas I got from this competition was a treasure. I hope I can make it to the top next time.
I really want to read others' solutions as well, so please share yours. Thanks for reading!
Thanks bro, I had a similar approach to yours 👍 It was a great competition.
Well done on the detailed work. Like you, I have experienced a case where my approach worked really well on the public LB but failed on the private LB.
Over time, I have learned it's important to find a way to validate any implementation as closely as possible to the deployment scenario. Beyond competitions, our decisions and evaluations have a huge impact on business outcomes, so we need to build robust solutions. I try to approach these contests in the same vein, so I don't get used to ungeneralisable practices.
In retrospect, do you think there is any way you could have set up your validation differently to really see whether your function worked well across the different instances you mentioned?
Nice! I think I should be more cautious about any ungeneralisable practices.
I don't think I could do that with my training set, since the problem was in the test set, not the training set. I tried two submissions with the same values but changed the label of one row that had the same user ID, merchant name, and purchase value as a row in the training set. I gave it the same class as in training, but the result was worse.
I think the inconsistency of the target is part of the competition itself, so you cannot be very sure about the value of any sample. Being too confident could lead to very bad results, so spreading the probabilities across the classes is better.
In any case where you are unable to validate a technique, it's probably riskier.
Organisers can be really strategic about the public/private test split because they understand it's possible to overfit a tail of the distribution. Their aim is a solution that minimizes error on any tail of the distribution. We need to find a way to achieve this with the training set, since it's probably more comprehensive than the test set's possible distributions.
For the sake of the competition, since you have two submissions, you can probably select one that might work even though you can't validate that it will generalise to the private set, and another that's validated across the ends of the distribution.
All the best in your future work.
Yeah, I should have chosen a different second submission. I'll try to be more cautious about that in the future.
Thanks for your comments🙏
Thanks, @Mohamed-Eltayeb, for the high-level description of your approach and the insights on where you erred. I also used CountVectorizer features and StratifiedKFold, but with n_folds=3, as well as other time-series features such as day, month, and day of the week. However, I found a huge disparity between train and validation performance. Was it the same for you? How did you avoid overfitting CatBoost?
1- Try sorting the data by the PURCHASE_DATE feature.
2- Try dropping all the time-series features; I found they decreased performance on this dataset.
3- Make sure your CV is stratified by the target, not by another feature. I don't think you can use more than 2 folds, since there is a class with only two instances.
4- For CountVectorizer, don't add the resulting features to your model directly; instead, use something like PCA to reduce the number of features, then add them to your sets.
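A sketch of point 4. TruncatedSVD is used here instead of PCA because it works directly on the sparse count matrix (PCA would require densifying it first); the merchant names and component count are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical merchant names; the real column would go here.
names = [
    "starbucks coffee downtown",
    "corner coffee house",
    "city book store",
    "online book shop",
    "airport coffee kiosk",
]

# Wide, sparse word-count matrix (one column per vocabulary word).
counts = CountVectorizer().fit_transform(names)

# Reduce to a handful of dense components before feeding the model.
# TruncatedSVD handles sparse input natively; 3 components is
# illustrative, not a tuned value.
reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(counts)
```

The reduced matrix has one row per merchant and only a few columns, so it can be concatenated onto the main feature set without blowing up the feature count.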