EY Biodiversity Challenge

$3 500 USD
25 days left
Classification
Feature Engineering
Geospatial Data
Geospatial Analysis
989 joined
353 active
Start: Mar 27, 26
Close: May 24, 26
Reveal: May 24, 26
Koleshjr
Multimedia university of kenya
🚨 0.99 Scores… Are We Missing Something?
23 Apr 2026, 11:59 · 15

Hey Zindians,

Quick question about the leaderboard 👀

I’ve been noticing quite a few submissions hitting ~0.99, which honestly feels very high given we’re supposed to be working with TerraClimate data only.

Just wanted to sanity check with everyone:

  • Are people strictly sticking to the provided features (the TerraClimate features)?
  • Is anyone using lat/lon which we have been told not to use?
  • Or could there be some kind of leakage through stations / splits / time?
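
One rough way to probe the third possibility, sketched under assumptions: TerraClimate is a gridded product, so nearby points can share identical feature values, and a model could memorise them. The function below counts how many test rows have a feature vector that also appears in train. The name `shared_feature_rows` and the row layout are hypothetical, not part of the challenge starter code.

```python
def shared_feature_rows(train_rows, test_rows, ndigits=6):
    """Return the fraction of test rows whose feature vector also appears in train.

    Rows are sequences of numeric feature values (hypothetical layout); values are
    rounded to `ndigits` so tiny floating-point differences do not hide a match.
    """
    # Set of rounded train feature vectors for O(1) membership checks.
    seen = {tuple(round(v, ndigits) for v in row) for row in train_rows}
    if not test_rows:
        return 0.0
    # Count test rows whose rounded feature vector was already seen in train.
    hits = sum(tuple(round(v, ndigits) for v in row) in seen for row in test_rows)
    return hits / len(test_rows)
```

A high fraction would not prove leakage on its own, but it would suggest that a random split is scoring memorised grid cells rather than generalisation.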

Not accusing anyone at all, just trying to understand how these scores are being achieved because under the current constraints it feels a bit surprising.

If it’s all legit, then clearly there’s something important about this dataset/setup that I’m missing 😅 would be great to learn.

Curious to hear what others think.

Discussion 15 answers

They're probably overfitting by using the wrong cross-validation method.
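
A minimal sketch of one alternative, assuming the problem is random K-fold splits that place neighbouring observations in both the train and validation folds. Here whole grid cells are kept in a single fold, and the coordinates are used only to build the folds, never as model inputs. `spatial_block_folds`, `cell_deg`, and `n_folds` are hypothetical names and defaults, not values from the challenge.

```python
import math
from collections import defaultdict


def spatial_block_folds(coords, cell_deg=0.5, n_folds=5):
    """Return a list of (train_idx, valid_idx) pairs; each grid cell stays in one fold.

    coords: list of (lat, lon) tuples, used only to build folds, never as features.
    cell_deg and n_folds are illustrative defaults, not values from the challenge.
    """
    # Group row indices by the grid cell their coordinates fall into.
    cells = defaultdict(list)
    for i, (lat, lon) in enumerate(coords):
        key = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        cells[key].append(i)

    # Assign whole cells to folds, largest first, always to the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for key in sorted(cells, key=lambda k: (-len(cells[k]), k)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(cells[key])

    # Each fold is the validation set once; the remaining folds form the train set.
    splits = []
    for f in range(n_folds):
        valid = sorted(folds[f])
        train = sorted(i for g in range(n_folds) if g != f for i in folds[g])
        splits.append((train, valid))
    return splits
```

If scores drop sharply when moving from random folds to blocked folds like these, the random-fold score was probably optimistic.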

23 Apr 2026, 12:10
Upvotes 0
marching_learning
Nostalgic Mathematics

Hello @Koleshjr. I wanted to have a discussion on this. I think it is data leakage from the data collector's website.

For me, to prevent this, the challenge should be set up as a time series challenge @AJoel: given potential frog appearances (which we may not have for some places) and historical TerraClimate data, can we identify frog presence at a given place?

23 Apr 2026, 12:26
Upvotes 2
Koleshjr
Multimedia university of kenya

Thanks @marching_learning. That’s a really solid point, thanks for sharing this 🙏

The idea of potential leakage from the data collection side and the suggestion to frame it as a time series problem both make a lot of sense.

It would be great if the organizers @AJoel @meganomaly could take a closer look at this together with you to confirm whether this is happening and, if so, how best to address it.

Really appreciate you bringing this up 👏

CodeJoe

Yes, but the problem is that the leakage would still be there. After all, if I download that data, I would have the test set's future data.

marching_learning
Nostalgic Mathematics

Not necessarily, since updates are posted every day. The solution is to go with all public data and rerun the top solutions on newly collected data, as is done for forecasting challenges (e.g. AgriBora).

CodeJoe

I don't think TerraClimate has 2026 data yet, if I am not mistaken.

marching_learning
Nostalgic Mathematics

You're right. It'll be hard to fully solve this; it will just take patience. I think during code review, Zindi will have a lot of work to do to make sure people are not using externally available datasets to inflate their scores. That being said, my true score is around 0.92xx 😄

CodeJoe

Nice!

Koleshjr
Multimedia university of kenya

Hello @marching_learning

I spent some more time digging into this and went down a bit of a rabbit hole trying to validate the earlier concerns around potential leakage. Based on the experiments I’ve run, there does appear to be a leakage issue. Your earlier point was spot on.

That said, it may be manageable if the stricter constraints already in place are enforced: no raw lat/lon as features, and only the TerraClimate features.

Sharing this in good faith so we can collectively ensure the competition remains fair and aligned with its objectives.

@AJoel @meganomaly

MICADEE
LAHASCOM
I wonder why people are using lat/lon as features when this is not allowed. My best ensemble OOF F1 is 0.9446 without lat/lon, and this actually translated to 0.9475 on the LB.
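
For readers unfamiliar with the term, an out-of-fold (OOF) score pools the predictions each row received while it was held out, then scores them once. A minimal sketch follows, assuming binary 0/1 labels and positive-class F1; `fit_predict` is a hypothetical callable standing in for whatever model is used, and this is not the poster's actual pipeline.

```python
def f1_binary(y_true, y_pred):
    """F1 score for the positive class (label 1); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def oof_f1(X, y, splits, fit_predict):
    """Pool out-of-fold predictions over all folds, then score them once.

    splits: list of (train_idx, valid_idx) pairs.
    fit_predict: hypothetical callable (X_train, y_train, X_valid) -> list of 0/1 labels.
    """
    oof = [None] * len(y)
    for train_idx, valid_idx in splits:
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in valid_idx])
        # Store each prediction at the position of the row it was made for.
        for i, p in zip(valid_idx, preds):
            oof[i] = p
    # Score only rows that were held out in at least one fold.
    scored = [(t, p) for t, p in zip(y, oof) if p is not None]
    return f1_binary([t for t, _ in scored], [p for _, p in scored])
```

An OOF score that tracks the leaderboard closely, as reported here, is a reasonable sign that the validation scheme is not leaking.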

Very high scores are likely the result of using Lat-Lon as features. This is not a valid model approach and does not reflect a generalized model based entirely on climatic and environmental features. Any model that uses Lat-Lon will be disqualified during the post-challenge evaluation period.

23 Apr 2026, 12:59
Upvotes 0
CodeJoe

I understand your concern about the use of latitude and longitude as features. However, achieving a score as high as 0.99 is extremely unlikely to be explained by lat–lon alone, unless there is some form of data leakage involved.

While lat–lon can indeed be powerful features, they typically do not lead to near-perfect performance in a properly validated setup. I’ve also checked this using cross-validation, and the results support this conclusion.

marching_learning
Nostalgic Mathematics

Yes, I have reached the same conclusion as you, @CodeJoe. This 0.99 score is not only due to lat/lon use. The gains from using lat/lon are very marginal.

I agree with both of you. In the end, the Zindi-EY team will review the models and submissions to ensure they meet the terms and conditions of the challenge. So, I suggest participants continue to submit valid model entries as they might end up being winners!

Haha, the only way to achieve 0.99 is by using lat/lon, and there is also a high chance of data leakage.

There are some patterns in the data that can also be manipulated to achieve such a score, which also involves using lat/lon

23 Apr 2026, 17:11
Upvotes 0