What a competition!!!
I thought post-processing was going to be the winning trick.
Note: it turns out the private test set had no outliers at all. (This turned out to be misleading; see the update below.)
Update: there were outliers, but their IDs were not used in calculating the private score.
Proof: change the prediction for this ID ["ID_ECWVAC40SNWB"] to 0 and notice that neither the private nor the public score changes. Very concerning!
With the above in mind, it's evident that I post-processed an ID that was not an outlier and it hurt my score, while others correctly post-processed the right IDs; they surely deserve their placement.
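The probe described above boils down to a one-line post-processing step on the submission file before resubmitting. A minimal sketch follows; the column names ("ID", "Yield") and the example rows are assumptions for illustration, since the actual submission format isn't shown in this thread:

```python
# Hedged sketch of the leaderboard probe: zero out the prediction for a single
# ID and resubmit. If neither score changes, that ID is not being scored.
PROBE_ID = "ID_ECWVAC40SNWB"

def zero_out_prediction(rows, target_id):
    """Return a copy of the submission with target_id's prediction set to 0."""
    return [
        {**row, "Yield": 0.0} if row["ID"] == target_id else dict(row)
        for row in rows
    ]

# Hypothetical submission rows (the second ID is made up for illustration).
submission = [
    {"ID": "ID_ECWVAC40SNWB", "Yield": 512.3},
    {"ID": "ID_FAKE0001", "Yield": 731.8},
]
probed = zero_out_prediction(submission, PROBE_ID)
```

Resubmitting `probed` and comparing scores against the original submission is exactly the test described above.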
Anyway, my solution without post-processing would have ranked top 10, and my CV and the private LB agreed almost perfectly. Funny, huh?
Here is the link to the GitHub repo with the solution. If you find it helpful, please star it; I would really appreciate it:
https://github.com/koleshjr/Digital-Green-Crop-Yield-Estimate-Challenge/tree/main
My theory is that there were outliers in the private test set, but @VIRADUS's post and all the comments made Zindi realize that the host would otherwise end up with a useless model.
If you look at Amy's response to that post, he said: '...we will reveal that the private leaderboard will show a distribution that will be useful to the client, where potential outliers are taken into consideration...'
This statement made it clear to me that they would remove the outliers from the private test.
@yanteixeira I wish I had drawn that conclusion too. Anyway, it was a great learning experience, and I would also love for you to post all the insights you uncovered; you started some very insightful discussions. Summing it all up would be great!
Great one as always.
Yeah, great comp. The outliers in the public LB confused most people. This is my first time seeing outliers intentionally injected into the (public) test set. But still, it's really interesting and models a real-world scenario pretty well.
#keeplearning
Yes, we keep learning!
Boosting was the best method. In my CV, bagging did much better, but when I saw the private score, my first CatBoost benchmark (105), trained on only the numeric columns, turned out to be the best. I don't know how to test how the train and test data correlate. Useless effort!
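Comparing "my CV" against the private score, as described above, amounts to computing an out-of-fold RMSE and checking how it tracks the leaderboard. Here is a minimal sketch; the data, the feature/target construction, and the use of scikit-learn's `GradientBoostingRegressor` (as a stand-in for the CatBoost model mentioned above) are all illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for CatBoost
from sklearn.metrics import mean_squared_error

# Synthetic placeholder data standing in for the numeric columns and yield target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(size=200)

# Out-of-fold predictions: each sample is predicted by a model that never saw it.
oof = np.zeros_like(y)
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    oof[val_idx] = model.predict(X[val_idx])

cv_rmse = np.sqrt(mean_squared_error(y, oof))
print(f"out-of-fold RMSE: {cv_rmse:.3f}")
```

If this out-of-fold RMSE moves in step with the private score across candidate models, the CV is trustworthy; the divergence people report in this thread is exactly what injected outliers can cause.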
Not a useless effort; we were just unlucky. We read the outliers in the public set the wrong way, assuming they would also be present in the private set. Others did too, but they were clever about it. Come to think of it, I should have chosen one final submission without post-processing and one with it. Anyway, no regrets, we learned!
For me, I was so focused on handling outliers at the pre-processing level that I forgot the main thing: having a solid model that generalizes across the entire dataset. An error of judgment on my part, especially since Zindi told us "where potential outliers are taken into consideration...".
So disappointed to learn that my first submission, a simple boosting model that generalized well, would have led to a private score of 125.
But I keep learning from this competition. Great experience.
Yes great learning experience!
Everyone in this thread: please see the updated discussion with the new findings, and feel free to comment. @yanteixeira, what do you think about the updated findings?
This is getting more interesting 😅. I don't even know what to say.
I'm actually speechless.
This is disturbing! To be fair to both the host and the participants, changing the datasets was fine, but they should have communicated it to us.
@JuliusFx Exactly, I agree with that. In fact, I am also speechless, like @yanteixeira said earlier. 😵😕