Hey Zindians,
Quick question about the leaderboard 👀
I’ve been noticing quite a few submissions hitting ~0.99, which honestly feels very high given we’re supposed to be working with TerraClimate data only.
Just wanted to sanity check with everyone:
Not accusing anyone at all, just trying to understand how these scores are being achieved because under the current constraints it feels a bit surprising.
If it’s all legit, then clearly there’s something important about this dataset/setup that I’m missing 😅 would be great to learn.
Curious to hear what others think.
They're probably overfitting by using the wrong cross-validation method.
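For anyone who wants to check their own validation setup, here is a minimal sketch of a spatially grouped K-fold, where nearby sightings always land in the same fold so the validation score is not inflated by near-duplicate neighbours. This is only one possible scheme, not the organizers' evaluation method. The function names, the 1-degree block size, and the sample coordinates are all assumptions for illustration.

```python
import math
from collections import defaultdict


def spatial_block_id(lat, lon, block_deg=1.0):
    """Map a coordinate to a grid-cell id; nearby points share an id."""
    return (math.floor(lat / block_deg), math.floor(lon / block_deg))


def spatial_group_kfold(coords, n_splits=5, block_deg=1.0):
    """Yield (train_idx, val_idx) so that no grid cell appears in both sets.

    coords: list of (lat, lon) tuples. The coordinates are used only to
    build the folds, never as model features.
    """
    blocks = defaultdict(list)
    for i, (lat, lon) in enumerate(coords):
        blocks[spatial_block_id(lat, lon, block_deg)].append(i)
    if len(blocks) < n_splits:
        raise ValueError("need at least n_splits distinct blocks")

    # Greedy balancing: largest blocks first, each into the smallest fold so far.
    folds = [[] for _ in range(n_splits)]
    for _, idx in sorted(blocks.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        min(folds, key=len).extend(idx)

    for k in range(n_splits):
        val = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, val


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    coords = [(-33.9, 18.4), (-33.8, 18.5), (-26.2, 28.0),
              (-26.1, 28.1), (-29.8, 31.0), (-1.3, 36.8)]
    for train, val in spatial_group_kfold(coords, n_splits=3):
        print("train:", train, "val:", val)
```

If the score under a split like this is much lower than under a plain random K-fold, that gap is a hint that the random split was leaking spatial neighbours.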
Hello @Koleshjr. I wanted to have a discussion on this. I think it is data leakage from the data collector's website.
To prevent this, I think the challenge should be framed as a time series challenge, @AJoel: given potential frog appearances (which we may not have for some places) and historical TerraClimate data, can we identify frog presence at a given place?
Thanks @marching_learning. That’s a really solid point 🙏
The idea of potential leakage from the data collection side and the suggestion to frame it as a time series problem both make a lot of sense.
It would be great if the organizers @AJoel @meganomaly could take a closer look at this together with you to confirm whether this is happening and, if so, how best to address it.
Really appreciate you bringing this up 👏
Yes, but the leakage would still be there. After all, if I download that data, I would have the test set's future data.
Not necessarily, since updates are posted every day. The solution is to go with all public data and rerun the top solutions on newly collected data, as in forecast challenges (e.g. AgriBora).
If I am not mistaken, TerraClimate doesn't have 2026 data yet.
You're right. It'll be hard to fully solve this; it will take patience. I think during code review, Zindi will have a lot of work to do to make sure people are not using externally available datasets to inflate their scores. That being said, my true score is around 0.92xx 😄
Nice!
Hello @marching_learning
I spent some more time digging into this and went down a bit of a rabbit hole trying to validate the earlier concerns around potential leakage. Based on the experiments I’ve run, it does appear that there is indeed a leakage issue; your earlier point was spot on.
That said, it may be manageable if the stricter constraints already in place (not using raw lat/lon as features, and using only the TerraClimate features) are enforced.
Sharing this in good faith so we can collectively ensure the competition remains fair and aligned with its objectives.
@AJoel @meganomaly
Very high scores are likely the result of using Lat-Lon as features. This is not a valid modeling approach and does not reflect a generalized model based entirely on climatic and environmental features. Any model that uses Lat-Lon will be disqualified during the post-challenge evaluation period.
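As a practical guard on the participant side, a small check like the one below can keep raw coordinates out of the feature list before training. It is a sketch under assumptions: the column names in `COORD_NAMES` are hypothetical and need to match the actual dataset, and it only catches literal coordinate columns, not features derived from them.

```python
# Hypothetical coordinate column names; adjust to the actual dataset schema.
COORD_NAMES = {"lat", "lon", "latitude", "longitude"}


def climate_only_features(columns):
    """Return the columns that are not raw coordinates, preserving order."""
    return [c for c in columns if c.strip().lower() not in COORD_NAMES]


def assert_no_coordinates(columns):
    """Raise if any raw coordinate column is still in the feature list."""
    leaked = [c for c in columns if c.strip().lower() in COORD_NAMES]
    if leaked:
        raise ValueError(f"coordinate columns used as features: {leaked}")


if __name__ == "__main__":
    # Hypothetical column list, for illustration only.
    features = climate_only_features(["Latitude", "Longitude", "tmax", "ppt"])
    assert_no_coordinates(features)
    print(features)
```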
I understand your concern about the use of latitude and longitude as features. However, achieving a score as high as 0.99 is extremely unlikely to be explained by lat–lon alone, unless there is some form of data leakage involved.
While lat–lon can indeed be powerful features, they typically do not lead to near-perfect performance in a properly validated setup. I’ve also checked this using cross-validation, and the results support this conclusion.
Yes, I have reached the same conclusion as you, @CodeJoe. The 0.99 score is not only due to lat/lon use; the gains from using lat/lon are very marginal.
I agree with both of you. In the end, the Zindi-EY team will review the models and submissions to ensure they meet the terms and conditions of the challenge. So, I suggest participants continue to submit valid model entries as they might end up being winners!
Haha, the only way to achieve 0.99 is by using lat/lon, and there is also a high chance of data leakage.
There are some patterns in the data that can also be manipulated to achieve such a score, which also involves using lat/lon.