Akeed Restaurant Recommendation Challenge

Helping Oman
$3 000 USD
Completed (over 5 years ago)
Prediction
Collaborative Filtering
1420 joined
242 active
Start: May 18, 20
Close: Aug 16, 20
Reveal: Aug 16, 20
summary of 1st place solution
Notebooks · 19 Aug 2020, 12:14 · edited 10 minutes later · 15

Congratulations to all winners!

Big thanks to the organizers for the chance to work with real business data. I hope that some of our ideas will help Akeed improve their business.

This is a summary of my solution.

Clustering

The initial data was separated into clusters based on customer and restaurant locations:

  • cluster 0 - the main cluster; most locations with correct coordinates were assigned here.
  • cluster 2 - locations in the main cluster whose haversine distance to the restaurants was more than 15 km were reassigned to cluster 2.
  • cluster 3 - locations around vendor_id = 907 were assigned to cluster 3. For these locations only one vendor was predicted by the model; predictions for all other restaurants were set to 0.
  • cluster 5 - locations with only one coordinate (lon=lat).
  • clusters 6-8 - locations very far from the main cluster, and restaurant vendor_id = 231; predicted as 0.
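The 15 km rule for cluster 2 can be sketched as below. This is only an illustration, not the code used in the solution: it assumes coordinates in degrees, and it assumes "distance to the restaurants" means the distance to the nearest vendor. The names `haversine_km` and `assign_cluster` are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km. Inputs are degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point overshoot before the square root.
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def assign_cluster(loc_lat, loc_lon, vendor_lat, vendor_lon, max_km=15.0):
    """Return 0 for locations with a vendor within max_km, else 2.

    Assumption: the rule is applied to the nearest vendor.
    """
    loc_lat = np.asarray(loc_lat, dtype=float)[:, None]
    loc_lon = np.asarray(loc_lon, dtype=float)[:, None]
    v_lat = np.asarray(vendor_lat, dtype=float)[None, :]
    v_lon = np.asarray(vendor_lon, dtype=float)[None, :]
    # Pairwise distance matrix of shape (n_locations, n_vendors).
    dist = haversine_km(loc_lat, loc_lon, v_lat, v_lon)
    nearest = dist.min(axis=1)
    return np.where(nearest > max_km, 2, 0)
```

The same distance matrix can also serve as the haversine feature between each location and each restaurant.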

Three types of models were created:

  1. models for cluster 0 (xgb + lgb)
  2. model for cluster 5 (xgb + lgb)
  3. model for clusters 0,2,3,5 (lgb) - prediction used for clusters 2 and 5

Features

Most of the features were based on coordinates.

  • lat/lon; min, max, mean and std of coordinates per customer; haversine distance and per-coordinate differences to the restaurant
  • customer properties (created_at, total number of locations, number of good locations)
  • restaurant properties from orders (mean and std of preparation time, net delivery time, first order date), total tags
  • for the cluster without longitude, the min, max and mean of coordinates per customer played the main role
  • for one model I used features created by mean target encoding (TE): coordinates were rounded to 2 and 3 digits and combined with customers and restaurants, then the mean target was calculated. To prevent overfitting, TE was calculated for each fold with smoothing and a double fold split
  • the models used between 20 and 30 features.
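A minimal sketch of fold-wise smoothed mean target encoding follows. It is an illustration under assumptions, not the original code: the function name `smoothed_te_oof`, the additive smoothing formula and the `smoothing` value are all hypothetical, and the double fold split mentioned above is not reproduced (only a single out-of-fold pass is shown).

```python
import numpy as np
import pandas as pd


def smoothed_te_oof(df, key_cols, target_col, fold_col, smoothing=20.0):
    """Out-of-fold smoothed mean target encoding.

    Rows of each fold are encoded using statistics from the other folds only,
    so a row never sees its own target.
    """
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in df[fold_col].unique():
        is_val = df[fold_col] == fold
        train = df.loc[~is_val]
        prior = train[target_col].mean()  # global mean of the training part
        stats = train.groupby(key_cols)[target_col].agg(["sum", "count"]).reset_index()
        # Shrink group means toward the prior; small groups are shrunk the most.
        stats["te"] = (stats["sum"] + prior * smoothing) / (stats["count"] + smoothing)
        merged = df.loc[is_val, key_cols].merge(
            stats[key_cols + ["te"]], on=key_cols, how="left"
        )
        # Keys unseen in the training part fall back to the prior.
        encoded.loc[is_val] = merged["te"].fillna(prior).to_numpy()
    return encoded
```

The key columns could be, for example, a rounded latitude (`df["lat"].round(2)`) combined with a vendor id; those column names are placeholders.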

Training process

I used LightGBM, with XGBoost as a second model for the blend.

To speed up model development, negative downsampling was used: all rows with target==1 were kept, but only 40% of rows with target==0. This significantly reduced training time, while the models' accuracy remained good enough for tuning.
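Negative downsampling of this kind can be sketched in a few lines of pandas. The function name, the `target` column name and the seed are assumptions made for the example.

```python
import pandas as pd


def downsample_negatives(df, target_col="target", keep_frac=0.4, seed=42):
    """Keep every positive row and a random fraction of the negative rows."""
    pos = df[df[target_col] == 1]
    neg = df[df[target_col] == 0].sample(frac=keep_frac, random_state=seed)
    # Shuffle so positives and negatives are not stored in two blocks.
    return pd.concat([pos, neg]).sample(frac=1.0, random_state=seed).reset_index(drop=True)
```

Note that a model trained on the downsampled data sees a higher share of positives than the full data contains, so its predicted probabilities are shifted upward relative to the true base rate.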

Features were selected with a step-wise approach: they were added or deleted one by one, and the results were checked on 5-fold CV. Only features with a stable improvement (most folds improved) were included.
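The addition half of such a step-wise search could look like the sketch below. It is not the original procedure: `forward_select` and `cv_scores` are hypothetical names, `cv_scores` is assumed to return one score per fold with higher meaning better, and "stable improvement" is interpreted as at least `min_folds_improved` folds improving plus a better total score.

```python
from typing import Callable, List, Sequence


def forward_select(
    candidates: Sequence[str],
    base: Sequence[str],
    cv_scores: Callable[[List[str]], List[float]],
    min_folds_improved: int = 3,
) -> List[str]:
    """Add candidate features one by one, keeping only stable improvements."""
    selected = list(base)
    best = cv_scores(selected)
    for feat in candidates:
        trial = selected + [feat]
        scores = cv_scores(trial)
        # Count the folds where the trial beats the current best feature set.
        improved = sum(new > old for new, old in zip(scores, best))
        if improved >= min_folds_improved and sum(scores) > sum(best):
            selected, best = trial, scores
    return selected
```

A deletion pass would mirror this loop, dropping one selected feature at a time and keeping the removal when the same criterion is met.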

For cross-validation, the data was split into 5 folds by customer_id, so that all locations of one customer were assigned to the same fold (similar to GroupKFold).
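One way to build such customer-level folds by hand is sketched below. The function name and the round-robin assignment after shuffling are assumptions; scikit-learn's `GroupKFold` gives a comparable grouping guarantee.

```python
import numpy as np
import pandas as pd


def assign_customer_folds(customer_ids, n_folds=5, seed=0):
    """Give every row a fold id so that each customer lands in exactly one fold."""
    customer_ids = pd.Series(customer_ids)
    unique = customer_ids.unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique)
    # Round-robin over shuffled customers keeps fold sizes balanced by customer count.
    fold_of = {cid: i % n_folds for i, cid in enumerate(shuffled)}
    return customer_ids.map(fold_of).to_numpy()
```

Splitting by customer rather than by row keeps per-customer aggregate features from leaking information between training and validation folds.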

I think the main advantage of my approach is the clustering, with separate models for the main cluster and the bad-coordinate clusters.

Discussion 15 answers

Excellent!

19 Aug 2020, 12:16
Upvotes 0

This is superb. Absolute gold mine of information. Idea of clustering is very

19 Aug 2020, 12:28
Upvotes 0

Great, thanks a lot. Which library do you recommend: LightGBM or XGBoost?

LightGBM is much faster and usually has better accuracy. I used XGBoost only as the second model for the blend; it adds some stability.

Vidya Jyothi Institute of Technology

Excellent!!

19 Aug 2020, 13:45
Upvotes 0
Kamenialexnea
Ecole nationale superieure polytechnique yaounde

Excellent!

19 Aug 2020, 18:05
Upvotes 0

Congrats!!

19 Aug 2020, 18:15
Upvotes 0

Can you please share the code?

20 Aug 2020, 04:47
Upvotes 0

It would be a great help.

20 Aug 2020, 04:50
Upvotes 0

Hey, congrats.

Can you please let me know how you found the right co-ordinates for cluster 0? There are a lot of errors in the co-ordinates. When I check them, almost all of the co-ordinates are outside Oman.

It was a rough approximation - the main goal was not to find the exact co-ordinates themselves but to calculate the haversine distance.

In which format are the co-ordinates available: degrees or radians?

Hey,

- for creating cluster 1, what metric did you choose for identifying correct co-ordinates?

- for creating cluster 2, when I look at the data, only 2.7 percent of CID X LOC_NUM X VENDOR combinations have a distance of less than 15 km.

25 Aug 2020, 15:47
Upvotes 0

I also had the same problem with clustering, because very few values have a distance of less than 15 km.

26 Aug 2020, 17:25
Upvotes 0