17 Jan 2020, 08:56

Meet the winners of the Sendy Logistics Challenge

Zindi is excited to announce the winners of the Sendy Logistics Challenge. The challenge attracted 1127 data scientists from across the continent and around the world, of whom 431 were on the leaderboard.

The objective of the competition was to predict the estimated time of delivery of orders, helping Sendy enhance customer communication, improve the reliability of its service, and reduce the cost of doing business through better resource management and planning for order scheduling.

The three winners of this challenge are: Roman from Russia in 1st place, Evgeny Patekha from Russia in 2nd place, and Yury from the Czech Republic in 3rd place.

A special thank you to the winners for their generous feedback. Here are their insights.

Name: Roman (1st place)

Zindi handle: TheRealRoman

Where are you from? Saint Petersburg, Russia

Tell us a bit about yourself.

I am a graduate student at St. Petersburg State University. In my free time, I solve various data science challenges.

Tell us about the approach you took.

My final solution was a stacking ensemble with Lasso regression as the meta-model. As base models, I used LightGBM models with various preprocessing (different outlier-filtering strategies). Of the base models, LightGBM's Poisson regression trained without outlier filtering worked very well and gave the best score as a solo model.
For features, I used target encoding, time differences, rounded coordinates, and value counts on the rounded coordinates.
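
Here is a minimal sketch of the stacking setup Roman describes, on synthetic data: out-of-fold predictions from LightGBM base models (one with the Poisson objective) are stacked with Lasso regression. The parameters and data are illustrative, not the winning configuration.

```python
import numpy as np
import lightgbm as lgb
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

# Synthetic stand-in data: the Poisson objective needs a non-negative target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.poisson(lam=np.exp(X[:, 0]))

def oof_predictions(X, y, params, n_splits=5):
    """Out-of-fold predictions from one LightGBM configuration."""
    oof = np.zeros(len(X))
    for trn_idx, val_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
        model = lgb.train(params, lgb.Dataset(X[trn_idx], y[trn_idx]),
                          num_boost_round=200)
        oof[val_idx] = model.predict(X[val_idx])
    return oof

# Base models differ in objective (and, in the real solution, in outlier filtering).
poisson_params = {"objective": "poisson", "learning_rate": 0.05, "verbosity": -1}
rmse_params = {"objective": "regression", "learning_rate": 0.05, "verbosity": -1}

base_preds = np.column_stack([
    oof_predictions(X, y, poisson_params),
    oof_predictions(X, y, rmse_params),
])

# Level-2 model: Lasso regression stacks the base-model predictions.
stacker = Lasso(alpha=0.01, positive=True)
stacker.fit(base_preds, y)
```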

What were the things that made the difference for you that you think others can learn from?

There were many participants in this challenge, so it is difficult for me to say which particular detail allowed me to take first place.

What are the biggest areas of opportunity you see for AI in Africa over the next few years?

I believe that the most important areas for AI in Africa are the same as for the rest of the world: medicine, ecology, business, and so on.

Name: Evgeny Patekha (2nd place)

Zindi handle: johnpateha

Where are you from? Moscow, Russia

Tell us a bit about yourself.

I am an economist by training, but four years ago I changed professions to data science. I studied ML through Coursera and Kaggle. Now I am a data science team lead at QIWI, and I am also a Kaggle Grandmaster.

Tell us about the approach you took.

My solution consists of two types of models:
- a classifier predicting the likelihood of short targets (cases where the rider forgot to mark that he had taken the order and only did so just before delivering it to the client), and
- a regression model predicting the main target, trained without the short targets.
I used data from ArcGIS: time and distance for the fastest, shortest, and walking routes, and the fastest time at a certain hour. I'm not sure all of the data was useful; I just downloaded it all.
My main feature engineering was target encoding (TE). I used TE not only with the original target, but also with the average speed, calculated as the target divided by distance (the original distance plus the distances from ArcGIS). The dataset is small, so I checked many different combinations of features for target encoding. There was no killer feature; everything added a little.
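
As a rough illustration of target encoding with a derived speed, here is a sketch on a hypothetical DataFrame; the column names and the smoothing scheme are assumptions, not Evgeny's exact code. In a real pipeline, the encodings should be computed out-of-fold to avoid leakage.

```python
import pandas as pd

# Hypothetical toy data: a categorical feature, the delivery-time target,
# and a route distance.
df = pd.DataFrame({
    "pickup_zone": ["A", "A", "B", "B", "B", "A"],
    "target":      [600, 720, 1500, 1380, 1620, 660],
    "distance":    [2.0, 2.5, 6.0, 5.5, 6.5, 2.2],
})

# Per-order 'speed' as Evgeny describes it: target divided by distance.
df["speed"] = df["target"] / df["distance"]

def target_encode(df, cat, value, weight=10):
    """Smoothed mean encoding: shrink category means toward the global
    mean so rare categories don't overfit (the weight is an arbitrary choice)."""
    global_mean = df[value].mean()
    agg = df.groupby(cat)[value].agg(["mean", "count"])
    smoothed = (agg["count"] * agg["mean"] + weight * global_mean) / (agg["count"] + weight)
    return df[cat].map(smoothed)

# Encode the category with both the raw target and the derived speed.
df["zone_te_target"] = target_encode(df, "pickup_zone", "target")
df["zone_te_speed"] = target_encode(df, "pickup_zone", "speed")
```
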
I selected features one by one, so all models are compact (fewer than 50 features) and fast.
I used different boosting libraries (LightGBM, XGBoost, CatBoost), each with different sets of features and parameters. RMSE and Fair loss objectives were used for the regressions, and log loss for classification. I also used DART mode, a square-root target transformation, and different sample weights. In total, I created more than 10 regression models and 2 classification models.
The predictions of the regression models were multiplied by (1 - probability) from the classification model, which helped adjust predictions for outliers. After that, the predictions were stacked with linear regression.
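
A minimal sketch of this final combination step, with illustrative arrays standing in for the out-of-fold predictions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative out-of-fold predictions from two regression models (rows are
# orders, columns are models) and the classifier's P(short target) per order.
reg_preds = np.array([[900.0, 950.0],
                      [1500.0, 1450.0],
                      [130.0, 200.0]])
p_short = np.array([0.02, 0.05, 0.90])

# Multiply each regression prediction by (1 - P(short target)) to pull
# likely mislabeled, very short targets toward zero.
adjusted = reg_preds * (1.0 - p_short)[:, None]

# Stack the adjusted predictions with linear regression against the true target.
y_true = np.array([880.0, 1420.0, 20.0])
stacker = LinearRegression().fit(adjusted, y_true)
final_pred = stacker.predict(adjusted)
```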

What were the things that made the difference for you that you think others can learn from?

I used a two-model approach (a classifier plus a regression model), and target encoding for feature engineering.

Name: Yury (3rd place)

Zindi handle: kss

Where are you from? Prague, Czechia

Tell us a bit about yourself.

I'm currently working as a Data Scientist at one of the top banks in the Czech Republic.

Tell us about the approach you took.

Tricks I've used:

  1. I used binary classification for outlier detection, then isotonic regression to calibrate the predictions of my classification model (LightGBM). The final prediction was calculated as: the probability of an observation being an outlier * 1 + (1 - that probability) * the prediction of my regression model (LightGBM), which was trained on the training dataset without outliers (see the first sketch after this list).
  2. Tweedie and Poisson losses in LightGBM (they actually worked better than RMSE loss)
  3. Target encoding of some categorical variables with speed (Distance (KM) / Time from pickup to arrival)
  4. I divided all geographical coordinates into 50 clusters to get 'districts' and calculated different aggregated features for those clusters. I also concatenated the 'cluster of pickup' with the 'cluster of destination' to get a 'route' and calculated aggregated features for these routes (see the second sketch after this list).
  5. I used ridge regression for blending
  6. I had a model for speed prediction
  7. I scaled the target variable (time from pickup to arrival) with MinMaxScaler and then built a LightGBM model with the xentropy loss. I used it for blending too.
  8. I also tried augmenting the training data with SMOTE, which helped a little in the final blending.
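
Here is a minimal sketch of the calibration-and-blend step from trick 1, on synthetic data; the scores, labels, and regression outputs are stand-ins, not Yury's actual models.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: raw (uncalibrated) outlier scores from a classifier
# and the true outlier labels on a validation fold.
raw_score = rng.uniform(0.0, 1.0, 500)
is_outlier = (rng.uniform(0.0, 1.0, 500) < raw_score).astype(float)

# Isotonic regression learns a monotone map from raw scores to
# calibrated probabilities.
calibrator = IsotonicRegression(out_of_bounds="clip").fit(raw_score, is_outlier)
p_outlier = calibrator.predict(raw_score)

# Regression model trained without outliers (illustrative predictions).
reg_pred = rng.normal(1200.0, 300.0, 500)

# Final prediction exactly as described in trick 1:
# P(outlier) * 1 + (1 - P(outlier)) * regression prediction.
final_pred = p_outlier * 1.0 + (1.0 - p_outlier) * reg_pred
```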
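
And a sketch of the district/route features from trick 4; the column names and coordinate ranges are hypothetical stand-ins for the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical orders with pickup/destination coordinates and delivery time.
df = pd.DataFrame({
    "pickup_lat": rng.uniform(-1.35, -1.25, 500),
    "pickup_lng": rng.uniform(36.70, 36.90, 500),
    "dest_lat":   rng.uniform(-1.35, -1.25, 500),
    "dest_lng":   rng.uniform(36.70, 36.90, 500),
    "time":       rng.normal(1200.0, 300.0, 500),
})

# Fit one KMeans over all points so pickups and destinations share the
# same 50 'districts'.
points = np.vstack([df[["pickup_lat", "pickup_lng"]].to_numpy(),
                    df[["dest_lat", "dest_lng"]].to_numpy()])
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(points)

df["pickup_cluster"] = km.predict(df[["pickup_lat", "pickup_lng"]].to_numpy())
df["dest_cluster"] = km.predict(df[["dest_lat", "dest_lng"]].to_numpy())

# Concatenate the two cluster ids into a 'route' key and aggregate over it.
df["route"] = df["pickup_cluster"].astype(str) + "_" + df["dest_cluster"].astype(str)
df["route_mean_time"] = df.groupby("route")["time"].transform("mean")
```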

What were the things that made the difference for you that you think others can learn from?

It was quite important to understand the nature of the outliers and to choose the most appropriate way to handle them.

This competition was hosted by Sendy (www.sendyit.com) and sponsored by insight2impact (www.i2ifacility.com).

What are your thoughts on our winners' feedback? Engage via the Discussions page or leave a comment on social media.