It's obvious that Zindi changed the private dataset by removing all the outliers without announcing it to the participants. I stopped working on this competition long ago because of its setup (bad data and a bad metric), so I don't feel very concerned; but for all the other participants: what a waste of time!
That's a very brave statement. Do you have any evidence to back it up?
The private score (~100) is the score you get if you remove all outliers from your training dataset and run cross-validation (CV).
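A minimal sketch of that check, under stated assumptions: the data below is synthetic (the real competition data is not used), the metric is assumed to be RMSE, the model is a plain least-squares fit on a single hypothetical `acre` column, and "outlier" is assumed to mean a target value outside the usual 1.5 × IQR fences. The helper names `kfold_rmse` and `iqr_mask` are made up for illustration. It only shows how the CV score can be compared with and without outliers.

```python
import numpy as np


def kfold_rmse(X, y, k=5, seed=0):
    """Mean RMSE of an ordinary least-squares fit over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Append a column of ones so the fit includes an intercept.
        A_train = np.column_stack([X[train], np.ones(len(train))])
        A_val = np.column_stack([X[val], np.ones(len(val))])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        pred = A_val @ coef
        scores.append(np.sqrt(np.mean((pred - y[val]) ** 2)))
    return float(np.mean(scores))


def iqr_mask(y, factor=1.5):
    """Boolean mask of target values inside the IQR fences (assumed outlier rule)."""
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    return (y >= q1 - factor * iqr) & (y <= q3 + factor * iqr)


# Synthetic stand-in for the competition data (all numbers are assumptions).
rng = np.random.default_rng(42)
acre = rng.uniform(0.1, 2.0, size=1000)
yield_ = 500 * acre + rng.normal(0, 50, size=1000)
bad = rng.choice(1000, size=20, replace=False)
yield_[bad] *= 10  # simulated data-entry errors

X = acre.reshape(-1, 1)
rmse_all = kfold_rmse(X, yield_)

keep = iqr_mask(yield_)
rmse_clean = kfold_rmse(X[keep], yield_[keep])

print(f"CV RMSE with outliers:    {rmse_all:.1f}")
print(f"CV RMSE without outliers: {rmse_clean:.1f}")
```

If the private score sits close to the second number rather than the first, that is consistent with a private test set that contains no outliers, though it does not by itself show how the set was built.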
Yes, removing outliers gives contrasting results on the two leaderboards. They should balance the test data; otherwise, you cannot draw good conclusions.
The villain of the competition is the 'Acre' feature. Without it, we would never have discovered that the outliers were actually data entry errors, and Zindi wouldn't have had to change the private test set. Also, I still strongly believe that this feature has the target leaked into it.
While I agree that the Acre feature is strongly correlated with the target, I think that's how it is in real life. Crop yield is strongly a function of land size. So the feature naturally explains what the target may look like, but it isn't a target leak.
I strongly disagree with your statement. If there were a single feature with a 1:1 correlation to Yield, and we had access to this feature before predicting Yield, then there would be no need for any competition. In fact, this single feature could solve many world problems. We wouldn't need to know about the weather or the soil type for planting; we would just need to know about 'Acre'.
No, no, no. That's why the correlation wasn't 1:1. It would of course be high in real life, maybe even up to 90%. Plus, this is a feature we have access to in real life before predicting the yield; that's why I think it isn't a leak.
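To put a number on "high but not 1:1", here is a rough sketch on synthetic data (the `acre` and `yield_` arrays and every constant below are assumptions, not the competition data). For a single-feature linear fit, R² equals the squared Pearson correlation, so a correlation of about 0.9 still leaves roughly 19% of the variance in the target unexplained by that feature alone.

```python
import numpy as np

# Synthetic data tuned so that the correlation lands near 0.9 (assumed values).
rng = np.random.default_rng(0)
acre = rng.uniform(0.1, 2.0, size=2000)
yield_ = 500 * acre + rng.normal(0, 130, size=2000)

# Pearson correlation between the feature and the target.
r = np.corrcoef(acre, yield_)[0, 1]

# R^2 of a straight-line fit of yield on acre.
slope, intercept = np.polyfit(acre, yield_, 1)
pred = slope * acre + intercept
ss_res = np.sum((yield_ - pred) ** 2)
ss_tot = np.sum((yield_ - yield_.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"correlation r = {r:.3f}")
print(f"R^2 = {r2:.3f}, unexplained share = {1 - r2:.3f}")
```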
Hi mchahhou, I understand your perspective. However, the public leaderboard is different from the private one, and they do not intersect. The Zindi team most likely did not change anything. The public LB contained outliers, but the private one did not. The public LB deceived us all; CV was the way.
Evidence has been shown in other threads that they did indeed change the private data on purpose. This changes the whole purpose of the competition: instead of an outlier detection problem, we now have a standard regression problem.
I strongly agree with your statement! This challenge wasted too much time...
this is just like a lucky draw🎉