Sponsored by Uber, insight2impact, and Mobiticket, the Traffic Jam Challenge was Zindi's most popular competition to date. The competition lasted for four months, during which time we had nearly 600 data scientists registered for the challenge and just over 200 on the leaderboard!
The cash prizes up for grabs totalled $12,000 USD. We challenged data scientists to combine data from Mobiticket with any datasets extracted from the Uber Movement platform and build a model to predict the number of bus tickets that Mobiticket would sell on a given day, time, and route.
A huge congratulations to the top three winners Mohamed Jedidi, Sirine Bchini, and Stephen Oni! Look out for a feature on these Zindians in a future newsletter from Uber.
Name: Mohamed Jedidi, Tunisia
This is Mohamed's 2nd win on Zindi! Read about his previous success with social media data here.
Meet Mohamed here:
How did he win? We will be posting a tutorial from Mohamed shortly on FEATURE ENGINEERING for you to learn more about his winning approach.
Name: Sirine Bchini, Tunisia/France
Sirine is our first female winner on Zindi!
Meet Sirine here:
Tell us a bit about your solution and the approach you took. I partnered with a friend of mine and we tried two approaches. First, we took the provided features and applied different feature engineering tricks. For example, we computed statistical measures (the mean, the maximum and the minimum) per city, day, day of week, year and distance, and per combinations of these. Then we used the engineered features with common algorithms. Second, we tried an auto-encoder approach: we took the features resulting from feature engineering, scaled them with a min-max scaler and passed them through an auto-encoder to find correlations between features and generate more features. The second approach didn't give remarkable results, and we think that was because the dataset was small. So we used the engineered features as they were. We trained XGBoost, random forest and neural network models on those features with a group k-fold strategy. We chose the hyperparameters intuitively to avoid over-fitting.
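The group-wise statistics described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Sirine's actual code: the helper `add_group_stats`, the column names and the toy rows are all hypothetical, and the interview does not say which quantity was summarised, so the `tickets` column is an assumption.

```python
from collections import defaultdict
from statistics import mean


def add_group_stats(rows, key_cols, value_col):
    """Return copies of rows with the mean/max/min of value_col per group of key_cols."""
    # Collect the values of value_col for every distinct combination of key_cols.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in key_cols)].append(row[value_col])

    prefix = value_col + "_by_" + "_".join(key_cols)
    out = []
    for row in rows:
        values = groups[tuple(row[c] for c in key_cols)]
        new_row = dict(row)  # leave the input rows untouched
        new_row[prefix + "_mean"] = mean(values)
        new_row[prefix + "_max"] = max(values)
        new_row[prefix + "_min"] = min(values)
        out.append(new_row)
    return out


# Hypothetical toy data: names and numbers are made up for illustration.
rides = [
    {"city": "A", "day_of_week": 0, "tickets": 2},
    {"city": "A", "day_of_week": 0, "tickets": 4},
    {"city": "B", "day_of_week": 1, "tickets": 10},
]

# Statistics per single key and per combination of keys.
with_city = add_group_stats(rides, ["city"], "tickets")
with_city_dow = add_group_stats(rides, ["city", "day_of_week"], "tickets")
```

If the summarised column is the prediction target, the statistics would normally be computed on the training folds only, so that validation rows do not leak into their own features.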
What were the things that made the difference for you that you think others can learn from? I think feature engineering was key. Getting a grasp of the data, understanding it and making sense of it was crucial. XGBoost, random forest and neural networks are go-to algorithms: they always give good results.
Name: Stephen Oni, Nigeria
This is Stephen's 2nd win on Zindi! Read about his previous success with natural language processing here.
Tell us a bit about your solution and the approach you took. My first model was a random forest fed with the processed data: the travel time was converted to hours, and travel_from, car_type and max_capacity were converted to categorical variables. This gave me a score of 3.9. For my final model I used about four different models. Let me start with the models written in R. There, I used two-way interactions between travel_from and other variables such as max_capacity and the converted time factors. I also obtained the longitude and latitude of each travel_from location, and its distance to Nairobi from Google Maps. With the longitude and latitude I calculated the haversine and Manhattan distances, which were then used to form three-way interactions with the previously formed two-way interactions. I also split the travel time into morning, afternoon and evening, which were then encoded as a categorical variable. I also frequency-encoded travel_from and travel_time, and after the frequency encoding I binned them to reduce the categorical space: the travel_time frequency encoding was binned into 4 bins and the travel_from encoding into 2 bins, based on specific thresholds. To understand how the haversine and Manhattan distance calculations work, check out New York taxi prediction; the intuition was drawn from there. These generated features were fed into XGBoost, Cubist and SVM models. Second, my Python models. The first is XGBoost: I extracted the arrival time, obtained by converting travel_time to hours, and split it into late and early arrival, such that it is 1 if it is between 15 and 19 and 0 if it is greater than that.
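For readers unfamiliar with the haversine and Manhattan distances mentioned above, here is a minimal sketch of how they might be computed from latitude and longitude. It is not Stephen's R code. The Manhattan variant (a north-south leg plus an east-west leg, each measured with haversine) follows a formulation commonly seen in New York taxi notebooks and is an assumption about what was used; the function names and the approximate coordinates are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def manhattan_km(lat1, lon1, lat2, lon2):
    """Manhattan-style distance: east-west leg plus north-south leg from the origin."""
    east_west = haversine_km(lat1, lon1, lat1, lon2)
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    return east_west + north_south


# Approximate, illustrative coordinates (not taken from the competition data).
NAIROBI = (-1.2864, 36.8172)
origin = (-0.6773, 34.7796)

features = {
    "haversine_km": haversine_km(*origin, *NAIROBI),
    "manhattan_km": manhattan_km(*origin, *NAIROBI),
}
```

The Manhattan-style value is never smaller than the haversine value, so together the two features give a model both a straight-line distance and a rough detour-style distance.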
I also used the daily average travel time per month from the Uber data. I extracted the month, year, day of the year, quarter and hour, and I generated a set of assorted features by grouping the data by features such as month, weekday and travel_from and computing the sum, count and mean of the number of tickets. I also used the distance obtained from Google Maps. I then removed highly correlated features, after which the remaining features were scored with f_regression in sklearn and passed through sklearn's SelectKBest; the best features were then fed into the XGBoost model. The same method was repeated for LightGBM. I did the same thing, slightly modified, for CatBoost: since CatBoost is good with categorical variables, most of its features are categorical. I used two-way interactions between max_capacity and car_type, and between travel_from and car_type. I categorized the month into 4 seasons, then found the minimum and maximum travel time grouped by travel_from. I used no encoding here since CatBoost handles all that. In conclusion, I did a weighted ensemble of all the predictions from the models.
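The final weighted ensemble can be sketched as a weighted average of each model's predictions. This is a minimal illustration only: the interview does not give the weights or how they were chosen, so the function `weighted_ensemble`, the weights and the prediction values below are all hypothetical.

```python
def weighted_ensemble(predictions, weights):
    """Weighted average of several models' predictions, row by row.

    predictions: list of equal-length lists, one list per model.
    weights: one non-negative weight per model; they need not sum to 1.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("all models must predict the same number of rows")
    return [
        sum(w * p[i] for p, w in zip(predictions, weights)) / total
        for i in range(n_rows)
    ]


# Hypothetical predictions from three models for the same two rides.
xgb_preds = [10.0, 20.0]
lgbm_preds = [12.0, 18.0]
catboost_preds = [11.0, 22.0]

blended = weighted_ensemble(
    [xgb_preds, lgbm_preds, catboost_preds],
    weights=[0.5, 0.25, 0.25],  # made-up weights for illustration
)
```

In practice the weights are often tuned on held-out validation predictions rather than picked by hand.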
What were the things that made the difference for you that you think others can learn from? I think the things that made the difference were the two- and three-way interactions between features; the latitude and longitude plus the distance from Google Maps, and the haversine and Manhattan distances; using different features for different models; and using SelectKBest from sklearn.
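One common way to build two- and three-way interactions between categorical features is to concatenate their values into a new categorical column. The sketch below assumes that formulation; the interview does not say exactly how the interactions were constructed (in particular, how the numeric distances entered the three-way terms), and the helper `add_interaction` and the sample row are hypothetical.

```python
def add_interaction(rows, cols, sep="_x_"):
    """Return copies of rows with a new column that crosses the values of cols."""
    name = sep.join(cols)
    return [
        {**row, name: sep.join(str(row[c]) for c in cols)}
        for row in rows
    ]


# Hypothetical sample row using the column names mentioned in the interview.
rides = [
    {"travel_from": "Kisii", "car_type": "Bus", "max_capacity": 49},
]

# Two-way interaction between travel_from and car_type.
two_way = add_interaction(rides, ["travel_from", "car_type"])

# Three-way interaction adding max_capacity.
three_way = add_interaction(rides, ["travel_from", "car_type", "max_capacity"])
```

The crossed column can then be handed to a model that handles categorical inputs directly, or encoded (for example by frequency) for models that need numeric features.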