Our solution (Team Neural Beans with @100i and me) is conceptually simple: just a single object detection model. We fine-tuned a DINO model with a Swin-Base backbone using the MMDetection library.
There are two main pre-trained versions of this model in mmdet: a version with a ResNet50 backbone and 4 scales of feature maps (DINO-4scale-R-50), and a more performant version with a Swin-Large backbone and 5 scales of feature maps (DINO-5scale-Swin-L). The DINO-5scale-Swin-L model is too big and slow given the resource restrictions of this challenge. We performed experiments with different backbones (including ConvNext (Tiny and Small), Swin (Small, Base and Large), and SwinV2 (Base)), 4 and 5 feature scales, and different image sizes. Our best combination uses a Swin-Base backbone, 4 feature scales and square 640x640 images.
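To make the backbone swap concrete, here is a hypothetical sketch of how the mmdet DINO-4scale-R-50 config could be adapted to a Swin-Base backbone. The key names mirror the official DINO configs, but the base config filename, `out_indices`, and channel numbers are assumptions to verify against the mmdet model zoo, not our exact config:

```python
# Hypothetical sketch: adapt mmdet's DINO-4scale-R-50 config to Swin-Base.
# Key names follow the official DINO configs; exact values are assumptions.
_base_ = ['dino-4scale_r50_8xb2-12e_coco.py']  # assumed base config name

model = dict(
    backbone=dict(
        _delete_=True,               # drop the ResNet-50 settings entirely
        type='SwinTransformer',
        embed_dims=128,              # Swin-Base width
        depths=[2, 2, 18, 2],        # Swin-Base stage depths
        num_heads=[4, 8, 16, 32],
        window_size=7,
        out_indices=(1, 2, 3),       # three backbone stages feed the 4-scale neck
    ),
    neck=dict(in_channels=[256, 512, 1024]),  # Swin-Base stage output channels
)
```

The last (4th) feature scale is produced by the neck downsampling the deepest backbone stage, which is why only three backbone stages are exposed.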
Our training pipeline includes random horizontal and vertical flips, colour variations (using mmdet's YOLOXHSVRandomAug augmentation), and different image scales. Experiments with mosaic and mixup didn't improve the model. We utilised an Exponential Moving Average (EMA) of the weights during training, which improved validation performance.
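As a sketch (not our exact config), the augmentations described above map onto an MMDetection v3-style train pipeline roughly like this; the specific resize scales and the EMA momentum value are illustrative assumptions:

```python
# Hypothetical MMDetection (v3.x) train pipeline sketch for the augmentations
# described above; scale values and EMA momentum are illustrative only.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RandomFlip', prob=0.5,
         direction=['horizontal', 'vertical']),   # random H/V flips
    dict(type='YOLOXHSVRandomAug'),               # colour (HSV) jitter
    dict(type='RandomChoiceResize',               # vary the image scale
         scales=[(576, 576), (608, 608), (640, 640)],
         keep_ratio=True),
    dict(type='PackDetInputs'),
]

# EMA of the model weights via mmengine's hook mechanism.
custom_hooks = [
    dict(type='EMAHook', ema_type='ExpMomentumEMA', momentum=0.0002),
]
```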
The model was trained for 12 epochs with a learning rate of 0.0001, with linear warmup over the first epoch, and cosine annealing beginning after the 6th epoch. The model was trained with mixed precision to reduce GPU memory usage.
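The learning-rate schedule can be sketched in plain Python to show how it evolves over the 12 epochs (the minimum LR of 0 and the per-epoch granularity are assumptions; in practice mmengine's LinearLR and CosineAnnealingLR param schedulers express the same idea per iteration):

```python
import math

def lr_at(epoch, base_lr=1e-4, total_epochs=12, warmup_epochs=1,
          anneal_start=6, min_lr=0.0):
    """Sketch of the schedule described above: linear warmup over the first
    epoch, flat until the 6th epoch, then cosine annealing to min_lr at the
    end of training. Epoch may be fractional."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs          # linear warmup
    if epoch < anneal_start:
        return base_lr                                  # constant plateau
    progress = (epoch - anneal_start) / (total_epochs - anneal_start)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```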
Resources:
Every time I come across that library, it's like seeing stars: very difficult to understand. The expertise behind it is undeniable. A huge congratulations to you guys.
Yeah, that library definitely needs some maintenance because it's getting harder and harder to manage the dependencies, especially in environments like Kaggle and Colab.
Very true. Anyway, great work, really learnt from your solution.
How do you manage slow inference speed when using MMDetection?
Thanks for the question. We had no problems with inference speed, so it is not something we spent much time thinking about. A few points: