Congratulations to all winners!
Big thanks to the organizers for the chance to work with real business data; I hope some of our ideas will help Akeed improve their business.
This is a summary of my solution.
Clustering
The initial data was separated into clusters based on customer and restaurant locations:
- cluster 0 - most of the locations with correct coordinates were assigned to the main cluster 0.
- cluster 2 - locations from the main cluster were reassigned to cluster 2 if the haversine distance between the location and the restaurants was more than 15 km.
- cluster 3 - locations around vendor_id = 907 were assigned to cluster 3; for these locations only one vendor was predicted by the model, and predictions for all other restaurants were set to 0.
- cluster 5 - locations with only one coordinate (lon = lat).
- clusters 6-8 - locations very far from the main cluster and restaurant vendor_id = 231 - predicted as 0.
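The cluster assignments above hinge on the haversine (great-circle) distance between a customer location and a restaurant. A minimal sketch of that distance function (the function name and Earth radius of 6371 km are my choices, not from the original writeup):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula: a is the squared half-chord length between the points.
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))
```

With this, the cluster-2 rule is just `haversine_km(cust_lat, cust_lon, rest_lat, rest_lon) > 15`.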
Three types of models were created:
- models for cluster 0 (xgb + lgb)
- model for cluster 5 (xgb + lgb)
- model for clusters 0, 2, 3, 5 (lgb) - its predictions were used for clusters 2 and 5
Features
The main part of the features is based on coordinates:
- lat/lon; min, max, avg, and std of coordinates by customer; haversine distance and coordinate differences to the restaurant
- customer properties (created_at, total number of locations, number of good locations)
- restaurant properties from orders (mean and std of preparation time, net delivery time, first order date), total tags
- for the cluster without longitude, the min/max/avg of coordinates by customer played the main role
- for one model I used features created by mean target encoding: coordinates were rounded to 2-3 digits and combined with customers and restaurants, and the mean target was calculated. To prevent overfitting, the TE was calculated for each fold with smoothing and a double fold split
- models used from 20 to 30 features.
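The smoothed mean target encoding mentioned above can be sketched roughly as follows (this is my illustration, not the author's code; `alpha` and the function name are hypothetical, and the per-fold / double-split machinery is omitted for brevity):

```python
import pandas as pd

def smoothed_target_mean(df, key_cols, target_col, alpha=20.0):
    """Smoothed mean target per key: (sum + alpha * global_mean) / (count + alpha).

    Keys with few rows are pulled toward the global mean, which is the
    "smooth" part that reduces overfitting of the encoding.
    """
    global_mean = df[target_col].mean()
    agg = df.groupby(key_cols)[target_col].agg(["sum", "count"])
    return (agg["sum"] + alpha * global_mean) / (agg["count"] + alpha)

# Hypothetical usage: keys built from rounded coordinates plus the vendor id.
# df["lat2"] = df["latitude"].round(2)
# enc = smoothed_target_mean(df, ["lat2", "vendor_id"], "target")
```

In the actual solution this statistic was computed out-of-fold (per CV fold, with a second nested split) so that a row never sees its own target.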
Train process
Used LightGBM and XGBoost (XGBoost as a second model for the blend).
To speed up model development, negative downsampling was used - all rows with target==1 and only 40% of rows with target==0. It significantly reduced training time, and the models' accuracy was still good enough for tuning.
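The negative downsampling step is simple to express with pandas (a minimal sketch; the function name, seed, and the final shuffle are my assumptions):

```python
import pandas as pd

def downsample_negatives(df, target_col="target", neg_frac=0.4, seed=42):
    """Keep all positive rows and a random fraction of the negative rows."""
    pos = df[df[target_col] == 1]
    neg = df[df[target_col] == 0].sample(frac=neg_frac, random_state=seed)
    # Shuffle so positives and negatives are interleaved for training.
    return pd.concat([pos, neg]).sample(frac=1.0, random_state=seed)
```

Note that downsampling shifts the predicted probabilities; that is harmless when only the ranking of vendors per customer matters, but scores would need recalibration if used as probabilities.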
Features were selected with a step-wise approach - added/deleted one by one. Results were checked on 5-fold CV, and only features with a stable improvement (most folds improved) were included.
For cross-validation, the data was split into 5 folds by customer_id so that all locations of one customer were assigned to one fold (similar to GroupKFold).
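The grouped split can be reproduced with scikit-learn's `GroupKFold`; the toy arrays below are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: one row per (customer location, vendor) pair.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 0, 1, 0])
customer_id = np.array([10, 10, 11, 11, 12, 12])  # grouping key

gkf = GroupKFold(n_splits=3)
for train_idx, valid_idx in gkf.split(X, y, groups=customer_id):
    # All rows of a given customer land on the same side of the split,
    # so the model is always validated on unseen customers.
    assert set(customer_id[train_idx]).isdisjoint(customer_id[valid_idx])
```

This prevents leakage: without grouping, two locations of the same customer could end up in train and validation, inflating CV scores.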
I guess the main advantage of my approach is the clustering and the separate models for the main and bad-coordinate clusters.
Excellent!
This is superb. Absolute gold mine of information. Idea of clustering is very
Great, thanks a lot. Which library do you recommend, LightGBM or XGBoost?
LightGBM is much faster and usually has better accuracy. I used XGBoost only as a second model for the blend; it adds some stability.
Excellent!!
Excellent!
Congrats!!
Can you please share the code?
It would be a big help.
Hey, congrats.
Can you please let me know how you found the right coordinates for cluster 0? There are a lot of errors in the coordinates. When I check them, almost all of the coordinates are outside Oman.
It was a rough approximation - the main goal was not to find the exact coordinates themselves but to calculate the haversine distance.
In which format are the coordinates available, degrees or radians?
Hey,
- for creating cluster 1, what metric did you choose for identifying correct coordinates?
- for creating cluster 2, when I look at the data, only 2.7 percent of CID X LOC_NUM X VENDOR rows have a distance less than 15 km.
Same problem
I was also having the same problem with clustering, because very few values have a distance less than 15 km.