In the spirit of open source and exchanging ideas, I'm sharing my first take on a random forest model: https://github.com/pawelmorawiecki/traffic_jam_Nairobi/blob/master/RandomForest.ipynb
It scores around 3.93 on the competition metric (mean absolute error), which is currently (5th Oct) #2 in the competition. Please feel free to use the code and experiment with it. For example, one might add extra features derived from the date (year, month, holiday, weekend). For now it uses only the day of the week.
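To illustrate the date-derived features mentioned above, here is a minimal sketch with pandas. The column name `travel_date`, the date format, and the sample rows are assumptions made for illustration and are not taken from the notebook. A holiday flag would need an external calendar, so it is left out.

```python
import pandas as pd

# Hypothetical sample rows; "travel_date" and its format are assumptions.
df = pd.DataFrame({"travel_date": ["2018-04-20", "2018-04-21", "2018-04-23"]})
dates = pd.to_datetime(df["travel_date"], format="%Y-%m-%d")

df["year"] = dates.dt.year
df["month"] = dates.dt.month
df["day_of_week"] = dates.dt.dayofweek                     # Monday=0 ... Sunday=6
df["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)   # Saturday or Sunday

print(df)
```

Whether these features help is something to check against a validation score; they can just as easily add noise.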
Thanks! Your solution makes me ponder whether I'm overthinking my features and model.
With this and other approaches I've tried, my biggest worry is overfitting the limited dataset.
Typically, you construct a validation set to detect overfitting. With random forests, you can also use oob_score (out-of-bag score) to check how much the model overfits.
model = RandomForestRegressor(criterion="absolute_error", oob_score=True)  # spelled criterion="mae" before scikit-learn 1.0
and then, after calling model.fit(X, y), you read model.oob_score_: this score (R^2 coefficient) is computed on samples which were NOT used in training for a given tree. Compare it with model.score(X, y) on the training data; if they differ a lot, it's a sign of overfitting.
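A small sketch of the out-of-bag check, on synthetic data that I made up for illustration (it is not the competition data). It keeps the default criterion so it runs quickly and does not depend on the scikit-learn version; the `n_estimators` and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=300)

model = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
model.fit(X, y)

train_r2 = model.score(X, y)   # R^2 on the data the forest was trained on
oob_r2 = model.oob_score_      # R^2 from trees that did not see each sample
gap = train_r2 - oob_r2

print(f"train R^2 = {train_r2:.3f}, OOB R^2 = {oob_r2:.3f}, gap = {gap:.3f}")
```

The training score is normally the higher of the two; a large gap suggests the forest is memorising the training rows.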
Thanks! I usually use validation with test sets. I'll take a closer look at the oob_score.
I am using cross validation, but there is still a lot of overfitting on unseen data. I have 9 sets of features for training the model; the training accuracy is 95% and the validation accuracy is 94%, the mae is 1.19. This shows that there is less overfitting, but when you feed the model with the data given, the mae is around 4.4 on the leaderboard. What would cause this trend on unseen data?
Thanks for sharing Pawel - this was a reminder that simple can sometimes be better! I threw out half of my engineered features and my score went way up :)
@rwambu >the mae is 1.19. This shows that there is less overfitting, but when you feed the model with the data given, the mae is around 4.4 on the leaderboard.
This might not be overfitting (especially if it is actually doing this well on unseen validation data - I’d check that first). If you’re encoding variables, make sure you do it consistently for both the training data and the test set. I used `pd.Categorical(test["travel_from"])` but because not all towns from train are in test, and because they appear in a different order, the wrong towns were being encoded by a given number. This meant my predictions for the test set were off and my submission score was much lower than expected.
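One way to keep the encoding consistent, as a sketch: build the category list once from the training data and reuse it for the test set. The town names other than Kisii and the `town_code` column are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the "travel_from" column name comes from the thread.
train = pd.DataFrame({"travel_from": ["Kisii", "Migori", "Homa Bay", "Kisii"]})
test = pd.DataFrame({"travel_from": ["Migori", "Kisii"]})

# Fix the category list once, from the training data.
towns = sorted(train["travel_from"].unique())

# Reusing the same list means a given code stands for the same town in both frames.
train["town_code"] = pd.Categorical(train["travel_from"], categories=towns).codes
test["town_code"] = pd.Categorical(test["travel_from"], categories=towns).codes

print(train)
print(test)
```

With an explicit category list, a town that appears only in the test set gets the code -1, which at least makes the mismatch visible.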
@rwambu > I am using cross validation, but there is still a lot of overfitting on unseen data. I have 9 sets of features for training the model; the training accuracy is 95% and the validation accuracy is 94%,
I would suggest using the MAE metric across all your experiments. Also, it's not clear what you mean by accuracy in this context. Typically accuracy refers to classification problems, but here we have regression.
The second thing that might help: when you construct the validation set, please make sure it resembles the test set. For example, if the test set has very few buses (and a lot of shuttles), then your validation set should preserve that distribution.
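A rough sketch of building a validation set that follows the test set's mix of vehicle types. The column name `car_type`, its values, and the counts are assumptions for illustration; the same idea applies to any column whose distribution differs between train and test.

```python
import pandas as pd

# Hypothetical data: train is mostly buses, test is mostly shuttles.
train = pd.DataFrame({
    "car_type": ["Bus"] * 60 + ["shuttle"] * 40,
    "tickets": range(100),
})
test = pd.DataFrame({"car_type": ["Bus"] * 2 + ["shuttle"] * 8})

val_size = 20
test_share = test["car_type"].value_counts(normalize=True)

# Sample each vehicle type from train in the proportion seen in test.
parts = []
for car_type, share in test_share.items():
    n = int(round(share * val_size))
    pool = train[train["car_type"] == car_type]
    parts.append(pool.sample(n=n, random_state=0))

valid = pd.concat(parts)
train_rest = train.drop(valid.index)   # remaining rows are used for fitting

print(valid["car_type"].value_counts())
```

This assumes train has enough rows of each type to sample from; otherwise `sample` raises an error and the validation size has to be reduced.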
Thanks. I was using pandas factorize and had not noticed that not all towns from train are in test.
Thanks. Ah, by accuracy I meant regression.score(X, y) for both the validation and training sets.
Hello Pawel, I don't understand why you are converting travel time (I thought travel time is the time of departure?) to minutes.
Right, travel_time is the time of departure. You need to translate it to a number in a meaningful way. One way would be to translate 5:09 into 5.09 (a float); another way is to convert it to the number of minutes from midnight. For a random forest they are equivalent, because both preserve the ordering of departure times and tree splits depend only on that ordering. Yet another way would be to make categorical variables (morning, noon, evening, night) and then encode them as 0, 1, 2 and 3, respectively.
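For illustration, a short sketch of the two numeric conversions described above. The sample times, the helper names and the new column names are made up; only `travel_time` comes from the dataset.

```python
import pandas as pd

# Hypothetical departure times in "H:MM" form.
df = pd.DataFrame({"travel_time": ["5:09", "7:15", "19:04", "23:10"]})

def minutes_from_midnight(hhmm: str) -> int:
    """Convert 'H:MM' to the number of minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def hour_dot_minute(hhmm: str) -> float:
    """Convert 'H:MM' to a float such as 5.09 (hours, then minutes as decimals)."""
    hours, minutes = hhmm.split(":")
    return int(hours) + int(minutes) / 100

df["departure_minutes"] = df["travel_time"].apply(minutes_from_midnight)
df["departure_float"] = df["travel_time"].apply(hour_dot_minute)

print(df)
```

Both columns order the departures identically, which is why a tree-based model treats them the same way.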
Aha, I get it now. So you have converted it to the number of minutes from midnight. Perfect! Thanks.
I love your solution. However, check the travel_time column. I think this should be departure time. For some towns, say Kisii, it would be unreasonable to have a travel time of 7 hrs for one ride and 19 hrs for another. Thanks.
You're right. This column should be called "departure time" to avoid any confusion.
Is your target variable categorical or continuous?
Thanks Pawel. Great and simple. Did you submit your predictions as decimals? Since we are predicting the number of tickets, shouldn't they be whole numbers?