I did oversampling with SKF and checked my validation set twice; there are no data leaks in it. The validation set uses only data that weren't oversampled, so no leak is introduced there. However, the gap between my CV and LB is big:
CV = 0.45
LB = 0.64
Is it possible that the way the public LB and the private LB were split resulted in unequal representation of classes? For example, maybe there are classes not present in the public LB but present in the private LB. Or is there another explanation for my gap between CV and LB?
Did you oversample only on the train set, i.e. after splitting into train and validation sets? If not, you have a leak.
Yes, that's exactly what I did: the oversampling was done only on the train set of the SKF, the validation set contained only original data, and there's no overlap between the two sets.
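For anyone following along, the pattern described above (oversample inside each fold, after the split, so validation folds stay untouched) can be sketched roughly like this. The toy data, class counts, and the naive random-oversampling helper here are all illustrative assumptions, not the poster's actual pipeline:

```python
# Sketch (assumed setup): oversample ONLY the training fold of each
# StratifiedKFold split; the validation fold keeps original data only.
import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)       # toy features (placeholder)
y = np.array([0] * 15 + [1] * 5)       # imbalanced toy labels (placeholder)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]  # untouched, original data

    # Naive random oversampling: duplicate minority-class rows in the
    # train fold until every class matches the majority-class count.
    counts = Counter(y_tr)
    max_n = max(counts.values())
    parts_X, parts_y = [X_tr], [y_tr]
    for cls, n in counts.items():
        if n < max_n:
            extra = np.random.choice(np.where(y_tr == cls)[0], max_n - n)
            parts_X.append(X_tr[extra])
            parts_y.append(y_tr[extra])
    X_bal = np.concatenate(parts_X)
    y_bal = np.concatenate(parts_y)
    # train on (X_bal, y_bal); evaluate on (X_val, y_val)
```

The key point is that oversampled (duplicated) rows can never land in a validation fold, which is what would otherwise inflate CV. Libraries like imbalanced-learn offer ready-made oversamplers, but the fold-then-oversample order is the same.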
I even mapped the IDs of the submission file to the test CSV file to see if my predictions made sense to the eye.
Most of my predictions looked correct; at least, the results I've seen shouldn't give me a 0.64.
What is your CV / LB without oversampling?
CV 0.62 → LB 0.61
I think curating the dataset (hand-checking the images in each class and removing the ambiguous ones) might help; I think that's what the 0.25 guys were saying. There is a huge overlap in the data, and 0.64 is close to the baseline. I'm not sure what everyone is calling a "data leak".
Yes, that may help, but I don't think that alone will jump you up the LB to a score around 0.2.
You haven't done it yet, I guess? Because he said it in the initial discussion.
Not yet, but I will.
Yes. You should.