Hello everyone,
In the discussion titled “Digging Deeper: Investigating Potential Data Entry Errors”, @amyflorida626 stated, “To guide you in your approach to this problem, we will reveal that the private leaderboard will show a distribution that will be useful to the client, where potential outliers are taken into consideration. It is up to you to determine the best way to deal with outliers in the datasets, both in your modeling and your predictions.” I have put considerable effort into detecting and treating outliers. It’s frustrating to learn that outliers have been removed after being told that they are taken into consideration.
I hope that @Zindi can look into this issue and resolve it, please.
Nice confirmation of how Zindi misled all the competitors.
Yes, why say ‘deal with outliers’ if they would not be present in the private test set?
Zindi maintains a public and a private leaderboard. The ID you mention was in the public LB, and the public LB is not used in calculating your final private score. The IDs not present in the public LB are the ones used to calculate your private score, and I can confidently say that the private LB had no outliers.
Thank you for this clarification @Koleshjr. However, this doesn’t change the fact that there was a misleading statement.
Well, even though I also hadn't read the discussion by amy in depth, I don't see a misleading statement, since she said “in the private leaderboard, not the public leaderboard”. So the public LB still had the outliers, but since the private LB is what matters, they said they would take them into consideration, which they have. I wish I had read that discussion in detail; too late now.
Yes, you could see it that way. However, for me, the statement ‘It is up to you to determine the best way to deal with outliers in the datasets, both in your modeling and your predictions’ was quite misleading.
By the way, great work detecting the outliers. How did you go about it? I would really love to know, because that was impressive.
It was misleading because, for some reason, they didn't want to directly tell us that the data contained errors which would render any model useless.
The way I interpreted the statement was: ‘We can't remove outliers from the public dataset; it's too late for that. However, they will be removed from the private test.’
To be honest, the proper way to handle the situation would have been to cancel the competition, fix the data, and start over.
Regardless, the best strategy to win this competition was to probe the entire public test and, for the private test, make predictions as if the outliers didn’t exist.
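To make the “predict as if the outliers didn’t exist” idea concrete, here is a minimal sketch of one common approach: dropping training rows whose target falls outside the interquartile range before fitting a model. The column name `target`, the toy data, and the 1.5× multiplier are all illustrative assumptions, not the actual competition setup or anyone’s winning solution.

```python
# Minimal, hypothetical sketch: IQR-based outlier filtering before training.
# "target", the toy values, and k=1.5 are illustrative assumptions only.
import pandas as pd

def drop_iqr_outliers(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows whose `col` value lies within k*IQR of the quartiles."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return df[df[col].between(lo, hi)]

train = pd.DataFrame({"target": [10, 12, 11, 13, 9, 500]})  # 500 is a planted outlier
clean = drop_iqr_outliers(train, "target")
print(len(clean))  # → 5 (the planted outlier row is dropped)
```

Whether filtering like this helps obviously depends on whether the hidden test set actually contains such values, which is exactly why the ambiguity in the organizers’ statement mattered.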
Thank you @Koleshjr. I will organize my work and share it in another thread!
If they cared about the valuable time of all competitors, they could simply have stated explicitly in the forum that outliers would be removed from the private data.
Your interpretation was correct @yanteixeira, but I believe that in such a competition, clarity is essential to avoid any misunderstanding.
Thank you @kamelyamani, I will be waiting for it.
@yanteixeira yeah sure, I agree with you. But the problem was the misinterpretation. And after @kamelyamani clarified how he interpreted it, it's true that the statement could have been read in different ways. A more straightforward answer like “The private test set does not contain outliers” would have helped, and at least then the best model could have won (a fair ground for all). I feel bad for all of us who assumed that the private test also had outliers and decided to post-process. I am pretty sure some solutions without post-processing of the private test would have ranked much higher.
Yes, the statement should be clear.
I think Zindi and other platforms underestimate the time we spend in competitions. We pour our souls into the problem and suddenly find ourselves affected by a miscommunication issue. It's not fair at all.
This is a very accurate and honest position 😅
Rather unfortunate that all the effort put into handling outliers yielded nothing substantial. Nevertheless, it has been a wonderful learning experience designing clever ways to detect and deal with outliers. Going forward, it would be really helpful if @Zindi made it a priority to accurately inform and keep participants updated as competitions progress, giving participants a fair chance of seeing their best models win.
I agree with you. I believe all serious competitors here now know all the techniques to deal with outliers x)