I notice that there is a large gap between CV and LB scores. I have seen some good discussions on the topic; the gap can be partly explained by the fact that less than 20% of the test dataset is actually scored. Anyway, would you like to share your CV against your LB?
I'll dive in first:
CV = 0.2632 & LB = 0.0158
CV = 0.3602 & LB = 0.0438
Thanks
Please explain further what you mean by "this gap can be explained by the fact that less than 20% of the test dataset is actually scored".
Somebody tried a submission predicting `new_turtle` all the time and it scored 0.01360544217687075, and they deduced that 2 / 0.01360544217687075 = 147 images are scored on the current leaderboard. This represents about 20% of the test data. Hope it helps.
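The deduction above can be sketched in a few lines. The assumption (not confirmed anywhere in the thread) is that exactly 2 of the scored images really are `new_turtle`, each contributing an average precision of 1 when `new_turtle` is ranked first, so the score is simply 2 / n:

```python
# A constant "new_turtle" submission scored s on the public LB.
# If k of the n scored images truly are new_turtle, each contributes
# an average precision of 1 (correct ID ranked first), so s = k / n.
s = 0.01360544217687075  # reported score of the constant submission
k = 2                    # assumed count of true new_turtle images (a guess)

n = k / s
print(round(n))  # -> 147 images estimated in the public LB subset
```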
I don't know if the 20% thing alone can fully explain it. Holding out around 30% of the images, I have seen validation scores above 0.8 that get 0.03 when scored on the LB. Random samples of the images don't produce such low scores locally. I have not found many ways to get a consistently higher score on the LB.
You're right. I also suspect the 20% estimate for the public LB is wrong and that the real fraction is lower. I validate on roughly 600 images per fold, and train and validation scores are close. That's why I suspect the percentage of images involved in the public LB is even lower, maybe less than 5%. If that's not the case, we should seriously worry, because the private LB could be a lottery.
For sure there is something strange in the LB scores (I tried to tell Zindi, but they say it's OK). I think the best option is to trust CV (making submissions doesn't make sense).
CV: 0.79 LB: 0.028
Don't know what's going on here. Can anybody help me with some suggestions? Thanks.
I think you're doing fine. All you have to do is trust your CV, as long as your cross-validation is set up correctly.
One thing that might help is to track other, "harder" metrics alongside APK, such as macro accuracy. I've seen the APK go up in validation while these other metrics drop, and models that also do better on the other metrics tend to do better on the leaderboard.
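For anyone who wants to try this, here is a minimal sketch of macro accuracy (per-class top-1 accuracy averaged over classes, so rare turtles weigh as much as frequent ones). The function name and the toy IDs are my own, not from any tutorial:

```python
from collections import defaultdict

def macro_accuracy(actual, predicted):
    """Per-class top-1 accuracy, averaged over classes.
    `actual` holds the true IDs, `predicted` the top-1 predictions."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for a, p in zip(actual, predicted):
        totals[a] += 1
        hits[a] += int(a == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# "t1" is predicted perfectly, "t2" is always missed:
print(macro_accuracy(["t1", "t1", "t2"], ["t1", "t1", "t1"]))  # -> 0.5
```

A plain (micro) accuracy would report 2/3 here; the macro version exposes that one class is never recovered.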
Hi @astenuz, what is APK?
It's the average precision at k. I think that's the metric we're using to score entries, right?
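For reference, here is my understanding of average precision at k for this kind of re-identification task, assuming each image has exactly one true ID (in that case AP@k reduces to 1/rank of the true ID if it appears in the top k, else 0). This is a sketch, not the organisers' exact scoring code:

```python
def apk(actual, predicted, k=5):
    """Average precision at k for one image with a single true ID:
    1 / (rank of the true ID) if it is in the top k, else 0."""
    for rank, p in enumerate(predicted[:k], start=1):
        if p == actual:
            return 1.0 / rank
    return 0.0

def mapk(actuals, predictions, k=5):
    """Mean of apk over all images."""
    return sum(apk(a, p, k) for a, p in zip(actuals, predictions)) / len(actuals)

# true ID ranked second among the 5 predictions:
print(apk("t42", ["t07", "t42", "t99"], k=5))  # -> 0.5
```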
Okay, thanks.
Is there any class in the test data that is not present in the training dataset? I guess images of new classes are assigned to `new_turtle`, but there might be more such classes. I hope I'm wrong.
MAPK on validation set (~430 images): 0.5587412587412587
Submission score: 0.03129251700680272
I copied and pasted the MAPK code from the tutorial. Looking at the test images next to the predicted turtles' images, I can believe an accuracy closer to 50% than to 5%, and yet that leaderboard score implies very bad performance.
One hypothesis: they take the sum of the APK scores over the public test set (147 images, the estimate derived elsewhere in this thread), but divide it by the total number of images, 2635, instead of 147. 0.5587 * 147 / 2635 ≈ 0.0312. This could just be a numerical coincidence, though.
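The hypothesis above is just arithmetic, and it reproduces the observed submission score to four decimal places (the 147 and 2635 figures are the thread's estimates, not confirmed numbers):

```python
# If the scorer sums APK over only the public images but divides by the
# full test-set size, the reported score shrinks by n_public / n_total.
local_mapk = 0.5587  # validation MAP@5 reported above
n_public = 147       # estimated images actually scored (from this thread)
n_total = 2635       # total test images

buggy_score = local_mapk * n_public / n_total
print(round(buggy_score, 4))  # -> 0.0312, matching the observed LB score
```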
Whatever the case, let's hope the issue is cleared up by the team. In the meantime, local validation seems the way to go.