+1. For us, the 9-hour training limit allowed us to train and ensemble two folds: one from a 20-fold split (CV = 0.491) and the other from a 24-fold split (CV = 0.52).
The ensemble used WBF, and we trained RT-DETR models for around 40 epochs.
Thanks @nymfree, that's impressive. I personally avoided ensembles, but I'm glad to hear one could fit in the 9-hour limit. Also, what IoU threshold did you use for the WBF ensemble, and what skip threshold?
Also, what image size and batch size did you use?
thank you
We used a 640 image size and a 0.65 IoU threshold, and kept the default value for the skip threshold.
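For readers unfamiliar with WBF: unlike NMS, which keeps one box per cluster and discards the rest, WBF averages the overlapping boxes, weighted by confidence. Below is a minimal toy sketch of the idea, not the ensemble-boxes library's implementation (which also normalizes coordinates and rescales scores by the number of models):

```python
# Toy sketch of Weighted Boxes Fusion (WBF) over pooled model predictions.
# Boxes are greedily clustered by IoU against each cluster's seed box, and
# each cluster is fused into one box whose coordinates are the
# confidence-weighted average of its members.

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def wbf(boxes, scores, iou_thr=0.65, skip_thr=0.0):
    """Fuse overlapping boxes; boxes/scores are pooled over all models."""
    # Drop boxes below the skip threshold, then sort by score (descending).
    order = sorted(
        (i for i, s in enumerate(scores) if s >= skip_thr),
        key=lambda i: -scores[i],
    )
    clusters = []  # each cluster is a list of (box, score) pairs
    for i in order:
        for c in clusters:
            if iou(c[0][0], boxes[i]) > iou_thr:  # compare to cluster seed
                c.append((boxes[i], scores[i]))
                break
        else:
            clusters.append([(boxes[i], scores[i])])
    fused = []
    for c in clusters:
        total = sum(s for _, s in c)
        box = [sum(b[k] * s for b, s in c) / total for k in range(4)]
        fused.append((box, total / len(c)))  # mean score over members
    return fused
```

In practice you would use `weighted_boxes_fusion` from the ensemble-boxes package, which expects normalized coordinates and per-model box lists rather than one pooled list.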
Great, thanks!
What did you use? Which model and image resolution?
A single YOLO11x, image size 800, trained for 52 epochs.
All along I was training for 100 epochs 😭😭. I used YOLO11s throughout with 1024px imgsz on a 10-fold split (CV = 49.3).
@Koleshjr @nymfree Is there a trick to boost your model's score on the test set after training?
Computational cost scales roughly quadratically with image resolution (1024² vs. 640² is about a 2.6× factor), so very likely only one model can be trained within 9 hours at 1024 resolution.
Yes, it took about 7 hours to train. I thought it gave a better result, so why not 😌
Other than the TTA built into Ultralytics, there is SAHI, where you basically infer on patches of high-resolution images. For example, if the image has 2048x2048 resolution, you don't resize it to 1024; you infer on four 1024x1024 patches instead, which might preserve high-resolution details. I tried it early on when I had weak models and it didn't seem to make things better. In hindsight, I should have revisited it.
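The patch-based inference idea can be sketched without the SAHI library itself: compute tile origins that cover the full-resolution image, run the detector on each tile, and shift the resulting boxes back into full-image coordinates (then merge duplicates with NMS or WBF). The helpers below are illustrative, not SAHI's actual API:

```python
# Sketch of sliced inference: tile a large image, detect per tile, then
# map each tile's boxes back to full-image coordinates.

def tile_origins(width, height, tile=1024, overlap=0.0):
    """Top-left corners of tile-sized windows covering the image.
    Windows step by tile*(1-overlap); the final window on each axis is
    clamped to the image edge so the borders are always covered."""
    step = max(1, int(tile * (1 - overlap)))
    def axis(size):
        if size <= tile:
            return [0]
        origins = list(range(0, size - tile + 1, step))
        if origins[-1] != size - tile:  # clamp a last window to the edge
            origins.append(size - tile)
        return origins
    return [(x, y) for y in axis(height) for x in axis(width)]

def shift_box(box, origin):
    """Map an (x1, y1, x2, y2) box from tile coords to image coords."""
    x0, y0 = origin
    return (box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0)
```

A 2048x2048 image with `tile=1024` and no overlap yields exactly the four patches from the example above; in practice SAHI defaults to some overlap (e.g. 20%) so objects cut by a tile boundary are still seen whole in a neighboring tile.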
Actually, you could train two models at 1024px given that he was using the smaller version of YOLO11, but then you would have to sacrifice the number of epochs you train the two models for.
I also tried SAHI; in this competition it wasn't helpful at all.
First of all, everyone who couldn't crack a score of 40 was most likely using the default YOLO confidence threshold of 0.25. I really struggled with this at the beginning of the comp, so using a really small threshold helped. Then playing with the IoU in YOLO's predict call helped too: for example, 0.5 worked better for me than the default 0.7. Applying TTA also helped.
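Why the default threshold hurts: mAP is computed over the full ranked list of predictions, so filtering at conf=0.25 deletes the low-confidence tail before evaluation ever sees it, which can only lose recall. A toy illustration with made-up numbers:

```python
# Toy illustration: the default conf=0.25 discards low-confidence true
# positives that an mAP-style metric would otherwise still credit.
# preds: (confidence, hits_a_ground_truth_box) pairs, already matched.
preds = [
    (0.92, True), (0.81, True), (0.40, False),
    (0.18, True), (0.07, True), (0.03, False),
]
num_gt = 5  # ground-truth boxes in this made-up evaluation set

def recall_at(thresh):
    """Recall after dropping predictions below a confidence threshold."""
    kept = [hit for conf, hit in preds if conf >= thresh]
    return sum(kept) / num_gt

print(recall_at(0.25))   # 0.4 -- two true positives were filtered out
print(recall_at(0.001))  # 0.8 -- the low-confidence tail is kept
```

This is why near-zero confidence thresholds (like the 0.001 mentioned below) are standard for mAP-scored competitions, even though they would be useless for a deployed detector.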
I did all of those things. My IoU was 0.559 and my confidence was 0.001, but I still couldn't crack the 50 score.
Okay, I felt it wouldn't give me the high score I needed. Actually, my model reached a 50.1 mAP50 at one checkpoint, yet I still submitted the 49.3 mAP50 checkpoint because its mAP50-95 was around 0.243, higher than that of the 50.1 mAP50 checkpoint.
I tried SAHI as well; it didn't help in any way.
I was quite surprised to see I got to 3rd on the private LB from 16th (I think) on the public LB. I never tried submitting an ensemble because I just couldn't get two good models trained on a T4 in 9 hours. I used the MMDetection library to fine-tune a single DINO model. To get it to train for a reasonable number of epochs on a T4, I started with a pretrained model with a ResNet-50 backbone, but I replaced the backbone with ConvNeXt-tiny, which made a big difference. I also trained on square 800x800 images with fp16, batch size 4, and 2 gradient-accumulation steps (effective batch size 8). After doing CV tests, I trained on all the training data for 13 epochs. End to end takes about 7 to 7.5 hours on a single T4. I'll do a more detailed write-up in the next few days.
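For the backbone swap, here is a rough sketch of what this might look like in an MMDetection 3.x config. The base config name comes from mmdetection's `configs/dino/` directory, but the ConvNeXt settings, neck channel widths, and checkpoint path are assumptions to verify against your mmdet/mmpretrain versions, not the author's actual config:

```python
# Hypothetical MMDetection 3.x config: DINO with its ResNet-50 backbone
# replaced by ConvNeXt-tiny. Channel widths, out_indices, and the
# checkpoint path are assumptions -- check them against your versions.
_base_ = ['./dino-4scale_r50_8xb2-12e_coco.py']

# ConvNeXt lives in mmpretrain, so register its models first.
custom_imports = dict(imports=['mmpretrain.models'],
                      allow_failed_imports=False)

model = dict(
    backbone=dict(
        _delete_=True,                  # drop the inherited ResNet-50 block
        type='mmpretrain.ConvNeXt',
        arch='tiny',
        out_indices=(1, 2, 3),          # stride-8/16/32 feature maps
        init_cfg=dict(type='Pretrained',
                      checkpoint='convnext-tiny.pth',  # placeholder path
                      prefix='backbone.')),
    neck=dict(in_channels=[192, 384, 768]))  # ConvNeXt-tiny stage widths

# fp16 training with gradient accumulation (batch 4 x 2 steps = effective 8)
optim_wrapper = dict(type='AmpOptimWrapper', accumulative_counts=2)
train_dataloader = dict(batch_size=4)
```

The `_delete_=True` flag is what lets a child config discard the inherited ResNet block entirely instead of merging fields into it.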
that was a huge jump @stefan027
Congrats 👏. I tried Co-DETR at the beginning of the comp, but it was too slow, so I gave up on it midway and focused on YOLO.
we would love to see it.