My solution consists of two types of models:
- a classifier that predicts the likelihood of a short target (the rider forgot to mark that he had picked up the order and did so just before delivering it to the client). This was predicted quite well: ROC AUC is above 0.95.
- regression models that predict the main target, without the short targets.
I used data from ArcGIS: time and distance for the fastest, shortest, and walking routes, plus the fastest time at a certain hour. I'm not sure all of the data was useful; I just downloaded everything.
The main feature engineering was target encoding. I used TE not only with the original target, but also with the average speed, calculated as the target divided by the distances (the original distance plus the distances from ArcGIS). The dataset is small, so I checked many different combinations of features for the target encoding. There was no killer feature; everything added a little.
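A minimal sketch of out-of-fold target encoding, written in Python rather than the R used in this solution. The function name `oof_target_encode`, the smoothing scheme, the fold assignment by row index, and the example columns (`rider_ids`, `distance_km`, `target_sec`) are all assumptions made for illustration, not the author's code. Each row is encoded using statistics from the other folds only, which is one common way to limit the overfitting that target encoding can cause on a small dataset.

```python
from collections import defaultdict


def oof_target_encode(keys, values, n_folds=5, smoothing=10.0):
    """Out-of-fold smoothed mean of `values` grouped by `keys`.

    Assumes n_folds >= 2, len(keys) >= n_folds and smoothing > 0.
    Folds are assigned by row index modulo n_folds (an illustrative choice).
    """
    n = len(keys)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        total, n_train = 0.0, 0
        # Collect statistics from every row outside the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[keys[i]] += values[i]
                counts[keys[i]] += 1
                total += values[i]
                n_train += 1
        prior = total / n_train  # out-of-fold global mean
        # Encode rows inside the fold; unseen keys fall back to the prior.
        for i in range(n):
            if i % n_folds == fold:
                s = sums.get(keys[i], 0.0)
                c = counts.get(keys[i], 0)
                encoded[i] = (s + smoothing * prior) / (c + smoothing)
    return encoded


# Hypothetical toy data: encode the target-to-distance ratio per rider.
rider_ids = ["a", "a", "b", "b", "a", "c"]
distance_km = [2.0, 3.5, 1.2, 4.0, 2.8, 5.1]
target_sec = [600.0, 900.0, 400.0, 1300.0, 700.0, 1500.0]
ratio = [t / d for t, d in zip(target_sec, distance_km)]
te_ratio = oof_target_encode(rider_ids, ratio, n_folds=3)
```

The same function can be applied to the raw target or to any other grouping column by changing `keys` and `values`.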
I selected features one by one, so all models are compact (fewer than 50 features) and fast.
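The one-by-one selection can be sketched as a greedy forward search. This is an illustrative Python version, not the author's R code: `forward_select`, `toy_cv_rmse`, and the feature names are hypothetical, and the scoring function here is a toy stand-in for a real cross-validated RMSE.

```python
def forward_select(candidates, score_fn, max_features=50):
    """Greedy forward selection; `score_fn(features)` returns a loss (lower is better)."""
    selected = []
    remaining = list(candidates)
    best_score = float("inf")
    while remaining and len(selected) < max_features:
        # Score every candidate added on top of the current feature set.
        trials = [(score_fn(selected + [f]), f) for f in remaining]
        trial_score, trial_feature = min(trials)
        if trial_score >= best_score:
            break  # no candidate improves the score any further
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best_score = trial_score
    return selected, best_score


# Toy stand-in for a cross-validated RMSE (hypothetical numbers).
gains = {"time_to_pickup": 5.0, "te_speed": 2.0, "noise": -0.5}


def toy_cv_rmse(features):
    return 10.0 - sum(gains[f] for f in features)


selected, score = forward_select(list(gains), toy_cv_rmse)
```

In practice `score_fn` would train a boosting model on the given features and return its validation error, which makes each iteration expensive but keeps the final feature set small.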
I used different boosting libraries (lightgbm, xgboost, catboost), each with different sets of features and parameters. RMSE and Fair loss objectives were used for regression, and log loss for classification. I also used DART mode, a square-root transformation of the target, and different weights. I created more than 10 different models for regression and 2 models for classification.
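For reference, a sketch of the Fair loss and the gradient and hessian that a boosting library works with, in plain Python. The function names are illustrative, and the scale parameter `c` is set to 1.0 here as an assumption; the post does not say which value was used. LightGBM also ships this objective built in (`objective="fair"` with the `fair_c` parameter), so a custom implementation is only needed for libraries without it.

```python
import math


def fair_loss(pred, y, c=1.0):
    """Fair loss: c^2 * (|r|/c - log(1 + |r|/c)), with r = pred - y."""
    r = abs(pred - y)
    return c * c * (r / c - math.log1p(r / c))


def fair_grad_hess(preds, targets, c=1.0):
    """First and second derivatives of the Fair loss with respect to each prediction."""
    grads, hess = [], []
    for p, y in zip(preds, targets):
        r = p - y
        denom = abs(r) + c
        grads.append(c * r / denom)            # behaves like r for small |r|, bounded by c for large |r|
        hess.append(c * c / (denom * denom))   # always positive, shrinks for large residuals
    return grads, hess


grads, hess = fair_grad_hess([2.0, -2.0], [1.0, 1.0], c=1.0)
```

Because the gradient is bounded by `c`, large residuals pull on the model less than they do under squared error, which is why this loss is less sensitive to outliers than RMSE.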
The predictions of the regression models were multiplied by (1 - probability) from the classification model, which helped adjust the predictions for outliers. After that, the predictions were stacked with linear regression.
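The adjustment step amounts to an element-wise product. A minimal Python sketch, with a hypothetical function name and made-up numbers:

```python
def adjust_for_short_targets(reg_preds, short_probs):
    """Scale each regression prediction by (1 - P(short target))."""
    return [pred * (1.0 - p) for pred, p in zip(reg_preds, short_probs)]


# Hypothetical values: predicted times in seconds and classifier probabilities.
reg_preds = [1200.0, 900.0]
short_probs = [0.05, 0.90]
adjusted = adjust_for_short_targets(reg_preds, short_probs)
# The first prediction barely changes; the second shrinks to about a tenth.
```

A row that the classifier is confident is a short target gets a prediction close to zero, while ordinary rows are left almost untouched.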
Good job and congrats, Evgeny. Thank you for sharing your approach! Outlier classification didn't work out so well for me. What features did you use for the classification?
Can you please share a link to GitHub, if possible?
Innovative approach with the classification task. Well done, and thanks for sharing.
The main feature is time to pickup, plus many others, including target-encoded features.
I also used that for my classification and reached an F1 score of 0.7 in validation. The only difference is that I fed the predictions of the classifier as a binary feature to my regression model. Maybe I should've tried your approach of multiplying the regression predictions by the probability from the classifier. Well done.
Yes, multiplying by the probability from the classifier helped much more than using it as just a feature, because you need to decrease the predicted time for outliers, and this is a simple way to do it.
Thanks for sharing. What threshold did you use to decide whether an observation was a short target for the classification part?
as.integer((w_distance*1000/target)>25), where w_distance is the walking distance (shortest) and 25 is just an approximate border.
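A Python equivalent of the R expression above, as a small sketch. It assumes `w_distance` is in kilometers and `target` is in seconds (so the ratio is an implied speed in meters per second) and that `target` is positive; the units are not stated in the thread, and the function name is hypothetical.

```python
def short_target_label(w_distance, target, speed_threshold=25.0):
    """Return 1 when the implied speed exceeds the threshold, else 0.

    Assumes w_distance in km and target in seconds, with target > 0.
    """
    return int((w_distance * 1000 / target) > speed_threshold)


# Hypothetical rows: a 3 km trip "completed" in 60 s versus in 900 s.
labels = [short_target_label(3.0, 60.0), short_target_label(3.0, 900.0)]
```

The first row implies an implausible speed and is labeled as a short target; the second is labeled as a normal delivery.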
Congratulations, and thanks for this clear and concise description.
Is w_distance the same as the walking distance (shortest), and does 25 just signify an approximate border? I mean, is the formula just: as.integer((w_distance*1000/target)>25)?
First of all, thanks for sharing.
I have three questions:
1) How did you implement Fair loss with catboost? Did you use C++ or this method: https://catboost.ai/docs/concepts/python-usages-examples.html#custom-objective-function?
2) How did you choose which feature to use in each of your 8 regression models?
3) Did you use stacking or just the mean of your predictions?
1. I used Fair loss with lightgbm. I used R, and the catboost R package doesn't support custom loss so far.
2. I added features one by one and chose the best at every iteration. At some points I decided to freeze a model for a reproducible result, and for additional tuning I started a new model from the previous one. Many of them were not successful, but a few were good and different enough to be useful for stacking.
3. I used both ways: stacking with linear regression and also blending with weights.
For catboost in R, how did you implement cross-validation? Can you share the piece of code for R catboost cross-validation, or a link to an online resource detailing how to cross-validate with catboost in R?
Awesome, thanks a lot for this feedback.
Thanks for sharing!
I tried target encoding, but the model was badly overfitting. Did you encounter that problem?
Thank you for sharing. Truly great insights
Thank you for sharing your solution!