I've shared my solution, which placed 6th on the private leaderboard. Hopefully that won't change after the code review :)
https://github.com/letfoolsdie/zindi-agricultural
In summary, the final solution is a geometric mean of several ImageNet-pretrained models, trained with different parameters on spectrograms and mel spectrograms. Predictions are averaged first across folds and then across models.
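For anyone who wants to see the blending idea in code, here is a minimal sketch. It is not copied from the repository: the function name `ensemble_predictions` and the input layout (one array of shape `(n_folds, n_samples, n_classes)` per model) are assumptions made for illustration, and so is the choice of a plain arithmetic mean over folds followed by a geometric mean across models.

```python
import numpy as np


def ensemble_predictions(preds_by_model, eps=1e-15):
    """Blend class probabilities from several models (hypothetical helper).

    preds_by_model: list of arrays, each shaped (n_folds, n_samples, n_classes).
    """
    # Step 1 (assumed): arithmetic mean over folds, separately for each model.
    per_model = [np.mean(p, axis=0) for p in preds_by_model]

    # Step 2: geometric mean across models, computed in log space for stability.
    logs = [np.log(np.clip(p, eps, 1.0)) for p in per_model]
    geo = np.exp(np.mean(logs, axis=0))

    # A geometric mean of probability rows no longer sums to 1, so renormalise.
    return geo / geo.sum(axis=1, keepdims=True)
```

The renormalisation at the end matters for log loss, since the geometric mean of valid probability rows generally sums to less than 1.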
There's also a postprocessing step: I try to find junk test files (containing just noise) using a pretrained PANN and replace the models' predictions for them with a constant prediction based on the frequency of each class. It reduced the loss a bit (by ~0.01-0.015 points).
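A rough sketch of the replacement part of that postprocessing, under stated assumptions: the boolean mask `is_junk` is taken as given (how the PANN flags noise-only files is not shown here), class frequencies are assumed to come from integer training labels, and the helper name `replace_junk_predictions` is made up for illustration.

```python
import numpy as np


def replace_junk_predictions(preds, is_junk, train_labels, n_classes):
    """Overwrite predictions for flagged files with the class prior (hypothetical helper).

    preds: (n_samples, n_classes) probabilities from the ensemble.
    is_junk: boolean array of length n_samples, True for noise-only files.
    train_labels: integer class ids of the training set.
    """
    # Constant prediction: relative frequency of each class in the training labels.
    counts = np.bincount(train_labels, minlength=n_classes)
    prior = counts / counts.sum()

    # Work on a copy so the original ensemble output stays untouched.
    out = np.array(preds, dtype=float, copy=True)
    out[is_junk] = prior
    return out
```

Under log loss, predicting the class prior is the safest constant guess when a file carries no usable signal, which is consistent with the small improvement mentioned above.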
Thanks a lot. Did you also experiment with self-supervised pretraining on spectrograms (for example, using spectrograms to predict duration) and then using those weights for classification? I tried that approach but could only manage 1.71 on the private leaderboard.
I hadn't thought of that, actually, and it seems like a good idea to me :) I wish I had tried it; I guess it should improve predictions at least a little. That said, I would try adding audio duration as a separate input to the model instead of training the model to predict duration.
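To make the "duration as a separate input" idea concrete, here is one untested way it could look. Everything in it is an assumption: the helper name `add_duration_feature`, the use of pooled backbone embeddings, and the log-plus-standardise scaling are illustrative choices, not something I ran in this competition.

```python
import numpy as np


def add_duration_feature(embeddings, durations_sec, eps=1e-8):
    """Append a scaled duration column to per-clip embeddings (hypothetical helper).

    embeddings: (n_samples, n_features) pooled features from the image backbone.
    durations_sec: (n_samples,) clip lengths in seconds, all > 0.
    """
    # Log-scale the durations so very long clips do not dominate, then standardise.
    log_dur = np.log(np.asarray(durations_sec, dtype=float))
    scaled = (log_dur - log_dur.mean()) / (log_dur.std() + eps)

    # The classifier head would then see the embedding plus one extra feature.
    return np.concatenate([embeddings, scaled[:, None]], axis=1)
```

The classification head would then take `n_features + 1` inputs instead of `n_features`.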
Thanks
Thanks so much for sharing and congratulations!