But seriously, I have been looking at the dataset for a while now and I'm still very confused. Where is the target value (num_of_tickets)? Or are we supposed to generate that ourselves and then use it for training?
Hi. The number of tickets is determined by the number of people on a specific bus at a specific time, which you can infer from the given data. If you can't make that bit up... 🤷🏿♂️
I think your question might deserve more credit than the response by @dkaila. If everyone "makes that bit up" - trivially or otherwise - how are they going to declare a winner? I'm assuming they will validate the results on the test set in some way?

Different people are bound to make different adjustments to the data in order to clean it. Without a given target, your whole data-to-information pipeline runs backward if you validate answers against an assumed target (because your target is, in this case, a function of the data).

Now, I haven't attempted this challenge myself, but it seems to me that if 5 people calculated the response differently, they could all train to high accuracy yet receive ambiguous scores on the test results - all because the response itself is "trivial" to derive.

I understand this should be trivial. I'm just remarking that the answer to this question may not be so simple as to provoke a retort like `If you can't make that bit up... 🤷🏿♂️`.
The data doesn't explicitly have a target value; you create it. Use the groupby function on the ride_id column and aggregate by count, and you will get the number of tickets that were sold per ride. Or just follow the notebook that was shared by one of us: https://github.com/pawelmorawiecki/traffic_jam_Nairobi/blob/master/RandomForest.ipynb
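For example, here is a minimal pandas sketch of that idea (the file name `train_revised.csv` and the target column name `number_of_tickets` are my assumptions; adjust them to your copy of the data):

```python
import pandas as pd

# Load the raw training data (one row per ticket sold; file name is an assumption)
train = pd.read_csv("train_revised.csv")

# Counting rows per ride_id gives the target: tickets sold per ride
tickets = (
    train.groupby("ride_id")
         .size()
         .reset_index(name="number_of_tickets")
)

# Attach the target back onto a single row per ride
rides = train.drop_duplicates(subset="ride_id").merge(tickets, on="ride_id")
```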
Hello Stefan, you won't have different targets, since we are all using the count of ride_ids.
Thanks for this helpful information.
If you're interested in R, you can use dplyr:

```r
library(dplyr)

# `data` is the raw training set, e.g. data <- read.csv("train_revised.csv")
# (adjust the file name to your copy)
Tickets_data <- data %>%
  group_by(ride_id) %>%
  summarise(Total = n())
Tickets_data <- arrange(Tickets_data, ride_id)

# Merge the ticket counts back onto the original data frame
merged_data <- merge(data, Tickets_data, by = "ride_id")
merged_data <- merged_data[, -c(2, 3, 4)]

# Finally, make it unique so that you don't have duplicate ride_ids
c_merged_data <- unique(merged_data)

# Save it as a CSV so that we can read it in Python
# (row.names = FALSE avoids a spurious index column on the Python side)
write.csv(c_merged_data, file = "train_revised2.csv", row.names = FALSE)
```
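On the Python side, reading it back is then straightforward (the file name matches the `write.csv` call above; `Total` is the per-ride ticket count created in R):

```python
import pandas as pd

# Read the CSV exported from R and inspect the target column
train = pd.read_csv("train_revised2.csv")
print(train[["ride_id", "Total"]].head())
```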