Congratulations to all winners!
Big thanks to the organizers for the chance to work with real business data; I hope some of our ideas will help Akeed improve their business.
This is a summary of my solution.
Clustering
The initial data was separated into clusters based on customer and restaurant locations:
- cluster 0 - most of the locations with correct coordinates were assigned to the main cluster 0.
- cluster 2 - locations from the main cluster were reassigned to cluster 2 if the haversine distance between the location and the restaurants was more than 15 km.
- cluster 3 - locations around vendor_id = 907 were assigned to cluster 3; for these locations only one vendor was predicted by the model, and predictions for all other restaurants were set to 0.
- cluster 5 - locations with only one coordinate (lon = lat).
- clusters 6-8 - locations very far from the main cluster and restaurant vendor_id = 231 - predicted as 0.
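The cluster assignments above hinge on the haversine (great-circle) distance between a customer location and a restaurant. A minimal sketch of that distance function (the function name and Earth radius of 6371 km are my choices, not from the original writeup):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula: a is the squared half-chord length between the points.
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))
```

With this, the cluster-2 rule is just `haversine_km(cust_lat, cust_lon, rest_lat, rest_lon) > 15`.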
Three types of models were created:
- models for cluster 0 (xgb + lgb)
- model for cluster 5 (xgb + lgb)
- model for clusters 0, 2, 3, 5 (lgb) - its predictions were used for clusters 2 and 5
Features
The main part of the features is based on coordinates:
- lat/lon; min, max, avg, and std of coordinates by customer; haversine distance and coordinate differences to the restaurant
- customer properties (created_at, total number of locations, number of good locations)
- restaurant properties from orders (mean and std of preparation time, net delivery time, first order date), total tags
- for the cluster without longitude, the min/max/avg of coordinates by customer played the main role
- for one model I used features created by mean target encoding: coordinates were rounded to 2-3 digits and combined with customers and restaurants, and the mean target was calculated. To prevent overfitting, the TE was calculated for each fold with smoothing and a double fold split
- models used from 20 to 30 features.
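The smoothed mean target encoding mentioned above can be sketched roughly as follows (this is my illustration, not the author's code; `alpha` and the function name are hypothetical, and the per-fold / double-split machinery is omitted for brevity):

```python
import pandas as pd

def smoothed_target_mean(df, key_cols, target_col, alpha=20.0):
    """Smoothed mean target per key: (sum + alpha * global_mean) / (count + alpha).

    Keys with few rows are pulled toward the global mean, which is the
    "smooth" part that reduces overfitting of the encoding.
    """
    global_mean = df[target_col].mean()
    agg = df.groupby(key_cols)[target_col].agg(["sum", "count"])
    return (agg["sum"] + alpha * global_mean) / (agg["count"] + alpha)

# Hypothetical usage: keys built from rounded coordinates plus the vendor id.
# df["lat2"] = df["latitude"].round(2)
# enc = smoothed_target_mean(df, ["lat2", "vendor_id"], "target")
```

In the actual solution this statistic was computed out-of-fold (per CV fold, with a second nested split) so that a row never sees its own target.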
Train process
Used LightGBM and XGBoost (XGBoost as a second model for the blend).
To speed up model development, negative downsampling was used - all rows with target==1 and only 40% of rows with target==0. It significantly reduced training time, and the models' accuracy was still good enough for tuning.
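The negative downsampling step is simple to express with pandas (a minimal sketch; the function name, seed, and the final shuffle are my assumptions):

```python
import pandas as pd

def downsample_negatives(df, target_col="target", neg_frac=0.4, seed=42):
    """Keep all positive rows and a random fraction of the negative rows."""
    pos = df[df[target_col] == 1]
    neg = df[df[target_col] == 0].sample(frac=neg_frac, random_state=seed)
    # Shuffle so positives and negatives are interleaved for training.
    return pd.concat([pos, neg]).sample(frac=1.0, random_state=seed)
```

Note that downsampling shifts the predicted probabilities; that is harmless when only the ranking of vendors per customer matters, but scores would need recalibration if used as probabilities.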
Features were selected with a step-wise approach - added/deleted one by one. Results were checked on 5-fold CV, and only features with a stable improvement (most folds improved) were included.
For cross-validation, the data was split into 5 folds by customer_id so that all locations of one customer were assigned to one fold (similar to GroupKFold).
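The grouped split can be reproduced with scikit-learn's `GroupKFold`; the toy arrays below are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: one row per (customer location, vendor) pair.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 0, 1, 0])
customer_id = np.array([10, 10, 11, 11, 12, 12])  # grouping key

gkf = GroupKFold(n_splits=3)
for train_idx, valid_idx in gkf.split(X, y, groups=customer_id):
    # All rows of a given customer land on the same side of the split,
    # so the model is always validated on unseen customers.
    assert set(customer_id[train_idx]).isdisjoint(customer_id[valid_idx])
```

This prevents leakage: without grouping, two locations of the same customer could end up in train and validation, inflating CV scores.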
I guess the main advantage of my approach is the clustering and the separate models for the main and bad-coordinate clusters.
Excellent!
This is superb. Absolute gold mine of information. Idea of clustering is very
Great, thanks a lot. Which library do you recommend, LightGBM or XGBoost?
LightGBM is much faster and usually has better accuracy. I used XGBoost only as a second model for the blend; it adds some stability.
Excellent!!
Excellent!
Congrats!!
Can you please share the code?
It would be a big help.
Hey, congrats.
Can you please let me know how you found the right coordinates for cluster 0? There are a lot of errors in the coordinates. When I check them, almost all of the coordinates are outside Oman.
It was a rough approximation - the main goal was not to find the exact coordinates themselves but to calculate the haversine distance.
In which format are the coordinates available, degrees or radians?
Hey,
- for creating cluster 1, what metric did you choose for identifying correct coordinates?
- for creating cluster 2, when I look at the data, only 2.7 percent of CID X LOC_NUM X VENDOR rows have a distance less than 15 km.
Same problem
I was also having the same problem with clustering, because very few values have a distance less than 15 km.