Code for the solution (it's a bit messy): https://github.com/FADHLOUN-Y/Sendy-Logistics-Challenge
Final model: For this regression task, our final model was a weighted blend of four types of models: several XGBoost models bagged with sklearn's BaggingRegressor and averaged, several LightGBM models bagged the same way, a single CatBoost model, and sklearn's GradientBoostingRegressor. All of these models made their predictions within a 10-fold cross-validation scheme. The blending weights were chosen according to the best performance on the leaderboard.
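A rough sketch of the bagging-plus-cross-validation scheme described above, not the code from the repository. It uses synthetic data, wraps sklearn's GradientBoostingRegressor in BaggingRegressor as a stand-in for the XGBoost/LightGBM models, and uses 3 folds instead of 10 to stay fast. The function name `kfold_bagged_predict` and every hyperparameter are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import KFold


def kfold_bagged_predict(X, y, X_test, n_splits=3, seed=0):
    """Return out-of-fold train predictions and fold-averaged test predictions."""
    oof_train = np.zeros(len(X))
    test_sum = np.zeros(len(X_test))
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(X):
        # Stand-in base learner; the post bagged XGBoost and LightGBM models.
        base = GradientBoostingRegressor(n_estimators=20, random_state=seed)
        model = BaggingRegressor(base, n_estimators=3, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        oof_train[valid_idx] = model.predict(X[valid_idx])
        # Accumulate test predictions; they are averaged over folds below.
        test_sum += model.predict(X_test)
    return oof_train, test_sum / n_splits


# Synthetic data, only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=120)
X_test = rng.normal(size=(30, 4))
oof_train, test_pred = kfold_bagged_predict(X, y, X_test)
```

The out-of-fold train predictions are what you would score locally, while the fold-averaged test predictions are what would go into a blend.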
Time features:
We started off by computing time features and interactions between them, such as subtractions, additions and multiplications; we kept the subtractions because they gave the best performance. We also extracted the pickup hours and binned them to better capture the signal of congestion versus non-congestion. The usual time features were extracted as well, such as weekend or not, week of the month, and start/end of month. We also tried an interaction between the day of the week and the pickup hour to better represent congestion tied to specific days and hours of the week.
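A minimal sketch of these time features, using only the standard library. The hour-bin boundaries, the month start/end thresholds, and all feature names are assumptions chosen for illustration; the post does not give the actual values.

```python
from datetime import datetime


def time_features(placement, pickup):
    """Build a few time features from two timestamps of the same order."""
    hour = pickup.hour
    # Assumed congestion bins; the real boundaries are not given in the post.
    if 7 <= hour < 10:
        hour_bin = "morning_rush"
    elif 16 <= hour < 20:
        hour_bin = "evening_rush"
    else:
        hour_bin = "off_peak"
    return {
        # Subtraction between two time columns, in seconds.
        "placement_to_pickup_s": (pickup - placement).total_seconds(),
        "pickup_hour": hour,
        "pickup_hour_bin": hour_bin,
        "is_weekend": int(pickup.weekday() >= 5),
        "week_of_month": (pickup.day - 1) // 7 + 1,
        "is_month_start": int(pickup.day <= 3),
        "is_month_end": int(pickup.day >= 28),
        # Day-of-week x hour interaction as a categorical key.
        "dow_x_hour": f"{pickup.weekday()}_{hour}",
    }


features = time_features(datetime(2019, 6, 3, 8, 5), datetime(2019, 6, 3, 8, 35))
```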
Rider features:
The rider data was the most interesting part of this challenge for us, because it potentially contained information about the many outliers in both datasets. We started by creating interactions between the rider features, and we target-encoded the rider Id, which gave a slight boost. We also computed the speed of each ride in meters per second by multiplying the distance by 1000 and dividing by the target. Using that speed, we set a logical speed limit of 18 meters per second that a ride shouldn't exceed; every ride above that speed was given the value 1 in a new feature called 'error_ride'. We then computed, for each rider, the share of his rides in the train set that were errors, which gave us a new feature between 0 and 1 called 'error_rate_rider' that improved performance considerably. Finally, we removed all the outliers, computed each rider's average speed so as to get logical speeds, and imputed the anomalous average speeds with the mean.
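The 'error_ride' and 'error_rate_rider' features can be sketched as follows. This is an illustration under assumptions, not the repository code: distances are taken to be in kilometers and durations in seconds, and the function name and the handling of zero-second rides are hypothetical.

```python
from collections import defaultdict

SPEED_LIMIT_MS = 18.0  # speed threshold quoted in the post


def rider_error_rates(rides):
    """rides: iterable of (rider_id, distance_km, duration_s) from the train set."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rider_id, distance_km, duration_s in rides:
        # A zero-second ride has no defined speed; treat it as an error (assumption).
        speed = distance_km * 1000 / duration_s if duration_s > 0 else float("inf")
        totals[rider_id] += 1
        errors[rider_id] += int(speed > SPEED_LIMIT_MS)  # the 'error_ride' flag
    # 'error_rate_rider': share of each rider's rides flagged as errors.
    return {rider: errors[rider] / totals[rider] for rider in totals}


rides = [
    ("rider_a", 5, 900),   # about 5.6 m/s
    ("rider_a", 10, 100),  # 100 m/s, flagged as an error
    ("rider_b", 3, 600),   # 5 m/s
]
rates = rider_error_rates(rides)
```

Because the rate depends on the target, it can only be computed from train rides and then mapped onto test rides by rider Id.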
Geospatial features:
This was the most annoying part of the challenge. Personally, this was my first time working with coordinates, and I didn't even use them when I started the competition. I didn't know that raw coordinates could be fed to a model, and by the end of the challenge I realized that much more could be done with them. We performed our own clustering of all the pickup and destination points in the data using KMeans, and we also tried Principal Component Analysis (PCA). I also tried creating an interaction between longitude and latitude to pinpoint a single point on the map, and then a 'from_to' interaction feature to highlight recurring trajectories, but it severely overfit because it was a high-cardinality categorical feature with few data points per level. We then moved on to external data: we got the sublocations from the Uber data and computed the distance from each point in our data to every sublocation, which resulted in over 300 features. We then removed the useless features by hand, following LightGBM's feature importance. We also gathered popular points in Nairobi from Google Maps that we knew were responsible for a lot of congestion and delays, and we computed every point's distance to those points.
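A sketch of the "distance to reference points" idea. The post does not say which distance metric was used; the haversine (great-circle) distance here is an assumption, and the reference coordinates are placeholders rather than real sublocations or landmarks.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Placeholder reference points, not real landmark coordinates.
REFERENCE_POINTS = {"point_a": (-1.28, 36.82), "point_b": (-1.30, 36.80)}


def distance_features(lat, lon, prefix):
    """One distance feature per reference point, e.g. 'pickup_dist_point_a'."""
    return {
        f"{prefix}_dist_{name}": haversine_km(lat, lon, ref_lat, ref_lon)
        for name, (ref_lat, ref_lon) in REFERENCE_POINTS.items()
    }


features = distance_features(-1.28, 36.82, "pickup")
```

Calling the same function with a "dest" prefix on the destination coordinates doubles the feature count, which is how a few hundred reference points quickly turn into hundreds of columns that then need pruning.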
Things we tried that did not work out so well:
Classification: Using the binary feature 'error_ride', we initially approached the problem as a binary classification. We tried to predict the outliers (short time, long distance) by training an XGBoost classifier focused on the rider data, and then used the classifier's predictions as a new binary feature for our regression. We also tried training the regressor only on 'normal' rides and manually setting to 0 the rides classified as outliers, but that didn't work well. We reached an F1 score of 0.7 in validation, and using the prediction as a feature gave a small boost in performance. However, we felt the model wasn't robust or stable, since we were spending too much time tuning the classifier, and we worried it might do slightly better on the public LB and then tank on the private one, which was a huge risk to take. We decided to drop the classification from our pipeline.
Data Removal: Judging by the residuals, rides above 5k seconds were 'pulling' the predictions higher, so we decided to remove them all. It gave a good boost in performance, but we're not sure it did well on the private leaderboard.
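The removal step amounts to a filter on the training target. A minimal sketch, assuming each training row is a dict with a hypothetical `duration_s` key holding the target in seconds:

```python
MAX_DURATION_S = 5000  # cutoff quoted in the post


def drop_long_rides(rows, target_key="duration_s", max_duration=MAX_DURATION_S):
    """Keep only training rows whose target does not exceed the cutoff."""
    return [row for row in rows if row[target_key] <= max_duration]


train_rows = [
    {"order": "a", "duration_s": 1200},
    {"order": "b", "duration_s": 7400},  # dropped: above the cutoff
    {"order": "c", "duration_s": 5000},
]
kept = drop_long_rides(train_rows)
```

The filter only makes sense on the train set, since the target is unknown for test rides.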
Nice one blenze... I tried a classification approach as well, but it didn't yield anything significant.
Thanks. The 2nd place solution used outlier classification too, though a bit differently from my approach, and it worked out fine for him.
Thanks a lot for sharing!!!!
"The blending weights were chosen following the best performance on the leaderboard." Do you mean that for every model you manually chose the weight to give it depending on its score, and then just took a weighted average of the different models' predictions?
Not manually, we searched for the combination of model weights that was the best.
Yes, but how do you choose the weights automatically? Do you train many models to combine the predictions and then choose the one that scores best on the leaderboard?
No, we start by assigning random weights to the 4 out-of-fold predictions, and we probe the leaderboard a number of times until we get the weights that perform best.
Check the code: here we went with 0.9 on the second model and 0.1 on the third.
(oof_test/10)*0.00+(oof_test_2/10)*0.9+(oof_test_3/10)*0.1 + (oof_test_4/10)*0.00
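A small sketch of that weighted blend. The division by 10 presumably reflects test predictions summed over the 10 folds; that reading, the function name `blend`, and the toy numbers are assumptions.

```python
def blend(fold_sums, weights, n_folds=10):
    """Weighted blend of per-model test predictions that were summed over folds."""
    if len(fold_sums) != len(weights):
        raise ValueError("need exactly one weight per model")
    blended = [0.0] * len(fold_sums[0])
    for preds, weight in zip(fold_sums, weights):
        for i, value in enumerate(preds):
            # Divide by n_folds to turn the fold sum into a fold average.
            blended[i] += weight * value / n_folds
    return blended


# Toy fold-summed predictions for two test rides (made-up numbers).
oof_test = [10000.0, 20000.0]
oof_test_2 = [12000.0, 18000.0]
oof_test_3 = [8000.0, 22000.0]
oof_test_4 = [11000.0, 19000.0]
result = blend([oof_test, oof_test_2, oof_test_3, oof_test_4], [0.0, 0.9, 0.1, 0.0])
```

With weights of 0.0 on the first and fourth models, only the second and third contribute to the final prediction.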
Ok thanks will check the code incha allah
"We also gathered popular points in Nairobi from Google Maps that we knew were responsible for a lot of congestion and delays in the rides, and we computed every point's distance to those points." Do we have the right to use Google Maps data in this competition?
Thank you for sharing your process with us. It's great when peeps who do well share their knowledge. Cheers!
I think so, it's publicly available data. However you need to check with Zindi to make sure.
Yes, but they say you can’t use data not listed on the competition data page.
Can Zindi weigh in on this? For future challenges, so we know
thierno, my understanding is that if the resource is opensource, and freely available, it's fair game.
Well, not that fair, because the solution needs to be used by the competition owner and we don't know what they can use. That's why Zindi lists all the open data that we can use; otherwise they could just say that if the data is open, you can use it.
thierno, I'm going to assume you're replying to me. I'm not sure I understand your post. I'm saying that if you find publicly available (and free) resources, it should be allowed. This is because publicly available, free resources are available to everyone (who looks for them). Or have I misunderstood your post?
Well, competitions don’t work that way. If you find open data, you have to ask whether you can use it, and if it’s OK the data is announced to everyone. That’s how it is even on Kaggle, unless stated otherwise in the description of the competition.
You might be right, and I'm sure we'll hear from a Zindi official soon enough. I just find the idea of people researching tools and resources available to everyone (making an effort on their own) needing to share the work they've done with other competitors a bit odd.
Where does one draw the line? If you know of a library that might be useful, do you need to tell the other contestants? I imagine the competitive rule-of-thumb to be that if one person can find it, so could the others. The prize goes to those that find the way to get the best results (using anything that the other competitors could have used if they found/thought of it).
Hi Thierno, Blenz and Ivor!
Thanks for your query. For some challenges we do allow you to source your own data sets, but unless it has been stated on the competition page, please check with us first to make sure that they are publicly available and everyone has access to them. If the data sets look okay, we'll post the link on the data page of the competition for everyone's benefit.