Currently the evaluation metric for this competition is RMSE. One important characteristic of RMSE is that it is sensitive to outliers: because errors are squared, the penalty grows quadratically the further a prediction is off. This can be problematic when working with an imbalanced dataset, as a small number of large errors will significantly increase the overall RMSE.
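To make the squaring effect concrete, here is a minimal sketch (with made-up numbers, not this competition's data) showing how a single huge miss dominates RMSE even when the other 99 predictions are fine:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square inside means big misses dominate."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# 99 predictions off by 10, plus a single prediction off by 1000
y_true = np.zeros(100)
y_pred = np.array([10.0] * 99 + [1000.0])

print(rmse(y_true, y_pred))            # ≈ 100.5, dominated by the one outlier
print(rmse(y_true[:99], y_pred[:99]))  # exactly 10.0 without it
```

One row out of a hundred multiplies the score by ten, which is exactly the behavior described above.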
The dataset for this competition is VERY imbalanced. Looking at the training set, most rows have a Yield somewhere between 100 and 1,000; however, there are roughly a dozen rows far outside this distribution, with extremely high Yields going as far as 16,800 (I will assume this is the case in the test set as well). In my experience so far these outliers are difficult to predict, but getting them wrong massively impacts the overall RMSE.
Example: after some very basic cleaning and feature engineering, my model currently achieves an RMSE of 435 in 10-fold cross-validation on the training dataset. When I delete the top 20 rows with the highest Yield from the training dataset, this same model improves to an RMSE of 156(!!) in 10-fold CV. So by removing only about 0.5% of the training dataset, the measured performance of the same model almost triples. If I drop only the top 4 rows with the highest Yield, I get an RMSE of 275 — still an enormous improvement over my initial 435 from removing just 0.1% of the training data.
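For anyone who wants to reproduce this kind of experiment, here is a hedged sketch of the drop-the-top-N comparison. The `Yield` column name is from the competition data; the RandomForest model and the exact CV setup are my own placeholder choices, not necessarily what was used above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def cv_rmse(df, target="Yield", drop_top_n=0):
    """10-fold CV RMSE, optionally after dropping the N rows with the highest target."""
    if drop_top_n:
        df = df.drop(df[target].nlargest(drop_top_n).index)
    X, y = df.drop(columns=[target]), df[target]
    scores = cross_val_score(
        RandomForestRegressor(random_state=0), X, y,
        cv=10, scoring="neg_root_mean_squared_error",
    )
    return -scores.mean()

# e.g. compare cv_rmse(train) against cv_rmse(train, drop_top_n=20)
```

Whatever model you plug in, comparing the two numbers shows how much of your CV score is driven by those few extreme rows.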
My point here is that the distribution of the dependent variable in this dataset is massively skewed, causing a very small number of entries to have a disproportionately large impact on the overall RMSE of a model. A much more suitable evaluation metric for a dataset like this would be something like MAPE, which still punishes your score when you are way off on this small number of outliers, but scales each error by the true value instead of squaring it, so a single extreme miss cannot dominate the score.
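A quick side-by-side of the two metrics on toy numbers (made up to mimic the Yield range described above, including one 16,800-scale outlier) illustrates the difference:

```python
import numpy as np

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

def mape(y, p):
    # mean absolute percentage error; undefined if any true value is 0
    return float(np.mean(np.abs((y - p) / y)))

y_true = np.array([200.0, 400.0, 800.0, 16800.0])  # one extreme Yield
y_pred = np.array([210.0, 390.0, 820.0, 5000.0])   # badly misses the outlier

print(rmse(y_true, y_pred))  # ≈ 5900, almost entirely from the one outlier
print(mape(y_true, y_pred))  # ≈ 0.20, the outlier contributes only its relative error
```

Under RMSE the outlier row is essentially the whole score; under MAPE it is one term among four.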
I completely agree with you — I've encountered the same situation. I've also experimented with adversarial validation to assess the similarity between the training and test data, and I got a score of around 73% (though it could potentially be lower). As you said, whoever manages to handle the outliers effectively will likely succeed in this competition! (And of course, in reality that wouldn't actually be the best model!)
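For anyone unfamiliar with adversarial validation: you label train rows 0 and test rows 1, fit a classifier to tell them apart, and look at its AUC — near 0.5 means the two sets look alike, higher means they are distinguishable. A minimal sketch (the RandomForest and 5-fold setup are my own placeholder choices):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_auc(train, test):
    """Cross-validated AUC of a classifier distinguishing train rows from test rows."""
    X = pd.concat([train, test], ignore_index=True)
    y = np.r_[np.zeros(len(train)), np.ones(len(test))]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```

A score around 0.73, as mentioned above, suggests a moderate train/test shift — not identical distributions, but not a completely different test set either.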
I agree with you, man! I was also wondering how I could get a lower RMSE in CV, yet when submitting, the model becomes even worse.
I like your points.
My view is that this problem should not tolerate under-predicted yields. Under-predicting could lead to insufficient preparation by farmers, resulting in shortages. This is where the sensitivity of RMSE to large errors becomes an advantage.
Furthermore, India may face years with extreme adverse weather (drought or flooding). In these situations, we'd want our error metric to be more sensitive to such unusual data points, as they can represent significant yet rare events.
Maybe the main goal is to avoid really bad yield years. This is important especially if the impact of one bad year is greater than the benefits of several good years.
I find your POV interesting, and I agree with it.
In the end, your choice of metric depends FIRST on your business objective--which in this case might be to ensure the model makes quite good predictions even in cases where the crop yield is quite high. This could be the rationale behind using the MSE.
Does improvement on CV also improve the LB?
I tried removing the top Yield outliers, as OP suggested, and my CV did improve from 350 down to about 80 — but then my LB score went from around 350 up to 430. So no... :(
With some basic models, it seems like the prediction errors (residuals) are relatively flat until Yield exceeds about 1400, after which they grow roughly linearly.
So maybe splitting the model horizontally could work. But I haven't found any simple proxy for high Yield.
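One way to try the horizontal split without a hand-picked proxy is a hypothetical two-stage model: a classifier learns to route rows into "normal" vs. "high-yield" regimes (using the ~1400 cutoff from the residual observation above), and a separate regressor is fit per regime. Everything here — the threshold, the RandomForest components, the class name — is my own sketch, not something from the competition:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

class SplitRegressor:
    """Sketch of a two-stage model: route rows to a 'normal' or 'high-yield'
    regressor, split at a fixed target threshold (1400 here)."""

    def __init__(self, threshold=1400.0):
        self.threshold = threshold
        self.router = RandomForestClassifier(random_state=0)
        self.low = RandomForestRegressor(random_state=0)
        self.high = RandomForestRegressor(random_state=0)

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        hi = y > self.threshold
        self.router.fit(X, hi)          # learn the proxy for "high Yield"
        self.low.fit(X[~hi], y[~hi])
        if hi.any():
            self.high.fit(X[hi], y[hi])
        return self

    def predict(self, X):
        X = np.asarray(X)
        route = self.router.predict(X).astype(bool)
        pred = np.empty(len(X))
        if (~route).any():
            pred[~route] = self.low.predict(X[~route])
        if route.any():
            pred[route] = self.high.predict(X[route])
        return pred
```

The catch, of course, is exactly what's noted above: this only helps if the features actually let the router separate the high-yield rows, and a misrouted outlier still produces a huge error.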