Catch up with Kaggle GM and power data science competitor johnpateha as he shares a few of his secrets to winning the Akeed Restaurant Recommendation Challenge and getting to #15 on the Zindi leaderboard.
Hi Evgeny, please introduce yourself to the Zindi community.
I'm Evgeny Pateha, or johnpateha. I'm an economist living in Moscow, Russia.
Tell us a bit about your data science journey.
I started studying data science five years ago through Coursera. Four years ago I started to participate in Kaggle competitions, and now I am a Kaggle Grandmaster. Achievements in various ML competitions helped me find my first job in data science. Now I'm Lead Data Scientist at Ozon, a Russian e-commerce company. I think that data science competitions give us an opportunity to learn ML faster, so I still try to participate in different competitions despite limited time.
What do you like about competing on Zindi?
Zindi always has good competition datasets without data leaks.
Tell us about the solution you built for the Akeed Restaurant Recommendation Challenge.
1. Clustering
The initial data was separated into clusters based on customer and restaurant locations:
cluster 0 - most locations with correct coordinates were assigned to this main cluster
cluster 2 - locations more than 15 km (haversine distance) from the main cluster were assigned to cluster 2
cluster 3 - locations around vendor_id = 907; for these locations only that one vendor was predicted by the model, and predictions for all other restaurants were set to 0
cluster 5 - locations with only one coordinate (lon = lat)
clusters 6-8 - locations very far from the main cluster and from restaurant vendor_id = 231; predictions were set to 0
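The distance rule behind this split can be sketched as follows. The 15 km threshold and the lon = lat check come from the description above; the function names, the main-cluster center argument, and the exact assignment order are my own illustrative choices, not the author's code.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two (lat, lon) points, in kilometres
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def assign_cluster(lat, lon, main_center, threshold_km=15.0):
    # illustrative version of the split: degenerate coordinates -> cluster 5,
    # within 15 km of the main cluster -> cluster 0, otherwise -> cluster 2
    if lat == lon:
        return 5
    if haversine_km(lat, lon, *main_center) <= threshold_km:
        return 0
    return 2
```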
Three types of models were created:
models for cluster 0 - xgb + lgb
model for cluster 5 - xgb + lgb
model for clusters 0, 2, 3, 5 - lgb (prediction used for clusters 2 and 5)
2. Features
Most features are based on coordinates: lat/lon; min, max, mean and std of coordinates per customer; and the haversine distance and coordinate differences to the restaurant. Customer properties included created_at, the total number of locations, and the number of good locations.
Restaurant properties were derived from orders (mean and std of preparation time, net delivery time, first order date), plus total tags.
For the cluster without a valid longitude (cluster 5), the per-customer min, max and mean of coordinates played the main role.
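The per-customer coordinate aggregates might look like this in pandas; the toy data and derived column names are illustrative, not taken from the solution code.

```python
import pandas as pd

locations = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "lat": [55.70, 55.75, 23.60],
    "lon": [37.60, 37.62, 58.40],
})

# min/max/mean/std of coordinates per customer, joined back to each location
agg = locations.groupby("customer_id")[["lat", "lon"]].agg(["min", "max", "mean", "std"])
agg.columns = ["_".join(col) for col in agg.columns]
features = locations.join(agg, on="customer_id")
```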
For one model I used features created by mean target encoding: coordinates were rounded to 2-3 digits, combined with the customer and restaurant, and the mean target was calculated for each combination. To prevent overfitting, the encoding was calculated per fold, with smoothing and a nested (double) fold split.
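A minimal sketch of out-of-fold mean target encoding with additive smoothing. The function name, smoothing constant, and random fold assignment are my assumptions; the write-up's version also rounds coordinates and uses a nested fold split on top of this idea.

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, key, target, n_folds=5, smooth=20.0, seed=0):
    # out-of-fold encoding: each row's value is computed only from the other
    # folds, smoothed toward the global mean to limit overfitting on rare keys
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, len(df))
    prior = df[target].mean()
    enc = np.full(len(df), prior, dtype=float)
    for f in range(n_folds):
        stats = df[fold != f].groupby(key)[target].agg(["sum", "count"])
        smoothed = (stats["sum"] + smooth * prior) / (stats["count"] + smooth)
        enc[fold == f] = df.loc[fold == f, key].map(smoothed).fillna(prior).to_numpy()
    return enc
```

Keys unseen in the training folds fall back to the global mean, which is what the smoothing term also pulls rare keys toward.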
Models used 20 to 30 features.
3. Model training
Primarily I used LightGBM, with XGBoost as a second model for blending.
To speed up model development, negative downsampling was used: all rows with target == 1 were kept, along with only 40% of rows with target == 0. This significantly reduced training time, while the models' accuracy remained good enough for tuning.
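Negative downsampling as described (keep every positive, sample 40% of the negatives) can be sketched like this; the function name and seed are mine:

```python
import pandas as pd

def negative_downsample(df, target="target", neg_frac=0.4, seed=42):
    # keep all positive rows, sample a fraction of the negatives, then shuffle
    pos = df[df[target] == 1]
    neg = df[df[target] == 0].sample(frac=neg_frac, random_state=seed)
    return pd.concat([pos, neg]).sample(frac=1.0, random_state=seed).reset_index(drop=True)
```

Note that downsampling shifts the base rate, so raw predicted probabilities are biased upward; for tuning and ranking purposes this usually does not matter.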
Features were selected using a step-wise approach (added or removed one by one), with results checked on 5-fold CV. I included only features with a stable improvement (improving on most folds).
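A toy version of the forward half of that step-wise loop; here `cv_score` stands in for the real 5-fold CV evaluation, and the stability check across folds is omitted for brevity:

```python
def forward_select(candidates, cv_score):
    # greedily keep adding the next feature only while it improves the CV score
    selected, best = [], cv_score([])
    improved = True
    while improved:
        improved = False
        for feat in [c for c in candidates if c not in selected]:
            score = cv_score(selected + [feat])
            if score > best:
                selected, best = selected + [feat], score
                improved = True
    return selected, best
```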
For cross-validation, the data was split into 5 folds by customer_id, so that all locations of one customer were assigned to the same fold (similar to GroupKFold).
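The grouped split can be sketched as a hand-rolled equivalent; in practice `sklearn.model_selection.GroupKFold` does the same job, and the random assignment here is my simplification:

```python
import numpy as np

def customer_folds(customer_ids, n_folds=5, seed=0):
    # assign each unique customer to a single fold, so that every location of
    # one customer lands in the same fold (a GroupKFold-style split)
    rng = np.random.default_rng(seed)
    uniq = np.unique(customer_ids)
    fold_of = dict(zip(uniq, rng.integers(0, n_folds, len(uniq))))
    return np.array([fold_of[c] for c in customer_ids])
```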
What do you think set your approach apart?
I think the main advantage of my approach was the clustering, with separate models for the main cluster and the bad-coordinates clusters.