Traffic Jam: Predicting People's Movement into Nairobi
$12,000 USD
Uber and Mobiticket team up to predict demand for public transportation into Nairobi
6 September 2018–13 January 2019 23:59
687 data scientists enrolled, 204 on the leaderboard
Target Value
published 17 Nov 2018, 14:58

But seriously, guys, I have been looking at the dataset for a while now and I'm still very confused. Where is the target value (number of tickets)? Or are we supposed to generate it ourselves and then use it for training?

Hi. The number of tickets is determined by the number of people on a specific bus at a specific time, which you can infer from the given data. If you cant make that bit up........ 🤷🏿‍♂️

I think your question might deserve more credit than the response by @dkaila. If everyone "makes that bit up", trivially or otherwise, how are they going to declare a winner? I'm assuming they will validate the results on the test set in some way?

Different people are bound to clean the data in different ways. Without a given target, the whole data-to-information pipeline runs backward if the answers are validated against an assumed target, because the target in this case is itself a function of the data.

Now I haven't attempted this challenge myself, but it seems to me that if 5 people calculated the response differently, they could all report high accuracies on their own data but receive ambiguous scores on the test set, all because the response itself was treated as "trivial"?

Now I understand this should be trivial. Just remarking on the fact that the answer to this question may not be so simple as to provoke a retort -> ` If you cant make that bit up........ 🤷🏿‍♂️ `.

edited 1 minute later

The data doesn't explicitly have a target value; you create it. Group by the ride_id column and aggregate by count, and you will get the number of tickets that were sold per ride_id. Or just follow the notebook that was shared by one of us: https://github.com/pawelmorawiecki/traffic_jam_Nairobi/blob/master/RandomForest.ipynb
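For anyone who wants to see the groupby step in pandas, here is a minimal sketch. Only the ride_id column comes from this thread; the other column names, the number_of_tickets name and all the values are made up for illustration, so adjust them to the real training file.

```python
import pandas as pd

# Hypothetical miniature of the training file: one row per ticket sold.
# Only ride_id is taken from the thread; other columns and values are invented.
tickets = pd.DataFrame({
    "ride_id": [1442, 1442, 5437, 5710, 5710, 5710],
    "seat_number": ["15A", "14A", "8B", "19A", "11A", "18B"],
    "travel_from": ["Migori", "Migori", "Keroka", "Homa Bay", "Homa Bay", "Homa Bay"],
})

# Each row is one ticket, so the row count per ride_id is the target.
target = (
    tickets.groupby("ride_id")
    .size()
    .reset_index(name="number_of_tickets")
)

# Keep one row per ride (ride-level columns only), then attach the target.
rides = tickets.drop(columns=["seat_number"]).drop_duplicates(subset="ride_id")
train = rides.merge(target, on="ride_id", how="left")
print(train)
```

Since the count depends only on how many rows share a ride_id, everyone who builds the target this way should end up with the same values.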

If you are interested in R, you can use dplyr:

library(dplyr)

# Count the rows (tickets) per ride_id
Tickets_data <- data %>%
  group_by(ride_id) %>%
  summarise(Total = n())
Tickets_data <- arrange(Tickets_data, ride_id)

# Merge the counts back onto the original data frame
merged_data <- merge(data, Tickets_data, by = "ride_id")
# Drop columns 2 to 4 by position (check that your column order matches first)
merged_data <- merged_data[ , -c(2,3,4)]

# Finally, make it unique so that you don't have duplicate ride_ids
c_merged_data <- unique(merged_data)

# Save it as CSV so that it can be read in Python
write.csv(c_merged_data, file = "train_revised2.csv")

Hello Stefan, you won't get different targets, since we are all using the count of rows per ride_id.