
IBM SkillsBuild Hydropower Climate Optimisation Challenge

Helping the World
$3,000 USD
Completed (12 months ago)
Prediction
Forecast
1231 joined
466 active
Start
Mar 03, 25
Close
Apr 13, 25
Reveal
Apr 14, 25
EL_YOUNES
Congratulations, everyone!
14 Apr 2025, 20:38 · 3

I want to congratulate all of you. We expected some shake-up in the leaderboard, but not to this extent. I'm disappointed with how the test set was split between the public and private leaderboards. We knew only three consumer devices were used in the public leaderboard, but I never imagined that optimizing our models based on those three would lead to even worse results.

Surprisingly, using simpler models like a last-lag value or lagged moving average could actually yield better results on the private leaderboard.
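To make the comparison concrete, here is a minimal sketch of the two baselines mentioned above. The column names (`device_id`, `date`, `kwh`) and the toy values are illustrative assumptions, not the challenge's actual schema:

```python
import pandas as pd

# Hypothetical per-device consumption series; column names and values
# are made up for illustration only.
df = pd.DataFrame({
    "device_id": ["A"] * 6 + ["B"] * 6,
    "date": pd.date_range("2025-01-01", periods=6).tolist() * 2,
    "kwh": [1.0, 1.2, 0.9, 1.1, 1.0, 1.3,
            2.0, 2.1, 1.8, 2.2, 2.0, 1.9],
})

# Last-lag baseline: predict each device's most recent observed value.
last_lag = df.sort_values("date").groupby("device_id")["kwh"].last()

# Lagged moving-average baseline: mean of the last 3 observations per device.
moving_avg = (
    df.sort_values("date")
      .groupby("device_id")["kwh"]
      .apply(lambda s: s.tail(3).mean())
)
```

Either series can then be broadcast as a flat forecast over the private test window, which is exactly the kind of "dumb" model that survives a distribution shift better than one tuned to three public devices.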

The important question is: if the organizers knew the consumer devices in the public leaderboard were inversely correlated with those in the private leaderboard, why was the dataset split that way? What was the purpose of the public leaderboard then?

Discussion · 3 answers
rafael_zimmermann

In my opinion, this is one of the biggest debates in the world of machine learning competitions. The key question is: should the private dataset have the same distribution as the public one?

This is a central issue because, in practice, we want models that are robust — not just good at fitting a specific dataset, but capable of generalizing well to external data. In fact, many competitions are already moving in that direction, especially in time series, where it’s natural (and even expected) for the series’ behavior to change over time. That means the test data distribution tends to be different, and the model needs to handle that.

Of course, there is a strong school of thought that supports traditional statistical validation, where public and private sets share the same distribution — mainly to ensure a more “fair” and predictable evaluation. But I personally don’t agree with applying this approach rigidly. If our true goal is to develop intelligent and robust algorithms, then testing on different distributions makes perfect sense. Otherwise, we’re just rewarding whoever best overfits to the public leaderboard, not those who build truly generalizable solutions.

It’s a complex discussion, with no clear right or wrong. These are tough decisions that will inevitably lead to heated debates. But honestly, I don’t see any issue — especially in time series competitions with multiple outliers and where the metric is RMSE — in having a different public and private distribution. In fact, it better reflects real-world challenges and is expected in most modern competitions.

Many competitions today don’t even use a private leaderboard anymore. What they do instead is end the competition and then evaluate submissions on real data collected in the following 3 to 6 months, which implicitly makes it clear that the distribution will be different.

14 Apr 2025, 21:04
Upvotes 4
EL_YOUNES

I agree with you, but the issue is that the consumer devices differ significantly from each other. Let me explain:

If we use the exact same model, with the same features and the same cross-validation method, and simply drop weeks 39 to 41 from the training data, we get a top score on the public leaderboard.

This suggests that the training data without weeks 39 to 41 aligns well with the consumer devices in the public leaderboard, but performs poorly on the others.

On the flip side, if we train only on data from weeks 39 to 41, we achieve the best results on the private leaderboard and the worst on the public one.

This shows that the issue isn’t about model robustness — it’s about the methodology used to split the dataset. That’s the core of the problem.
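The two experiments described above amount to nothing more than filtering the training frame by week before fitting. A minimal sketch, assuming a `week` column exists in the training data (the column name and toy values are my own, not the challenge's actual files):

```python
import pandas as pd

# Illustrative training frame: 10 weeks of data, two rows per week.
# The "week" and "target" columns are assumptions for this sketch.
train = pd.DataFrame({
    "week": list(range(35, 45)) * 2,
    "target": range(20),
})

# Variant 1: drop weeks 39-41 before fitting
# (reportedly scores well on the public leaderboard, poorly on private).
drop_mid = train[~train["week"].between(39, 41)]

# Variant 2: train only on weeks 39-41
# (reportedly the reverse: best on private, worst on public).
only_mid = train[train["week"].between(39, 41)]
```

That a single boolean mask flips the ranking on the two leaderboards is the heart of the complaint: the split itself, not the model, drives the score.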

rafael_zimmermann

I totally understand your frustration — it’s really frustrating when small changes in the data cause big variations on the leaderboard, especially with RMSE, which heavily penalizes outliers.

But I think the main point is that the model became very sensitive to how the data was split. When simply adding or removing a few weeks drastically changes the performance, it suggests some level of overfitting to that specific data distribution. So, it's not just about model robustness — it's also about the robustness of the training data itself.

At the end of the day, everyone hates shake-ups — even the competition platforms themselves usually try to avoid them because they know users dislike them (and honestly, I hate them too). So yeah, I get your frustration, but I also understand Zindi’s position with how they split the data. To be honest, I don’t think it was wrong.