Team: Brain Blend
By @koleshjr & @Sodiq_Babawale_
The Amini Cocoa Contamination Challenge tasked participants with developing machine learning models capable of identifying multiple plant diseases from images of cocoa leaves. But there was a catch: these models needed to run on low-resource smartphones typically used by subsistence farmers in Africa, without sacrificing accuracy.
As Team Brain Blend, we built a lightweight, robust pipeline using YOLO11s and ranked 3rd overall. Here's how we did it 👇
The challenge was a blend of computer vision, model efficiency, and practical deployment, something we were genuinely excited to tackle.
We started by organizing the provided dataset using a stratified cross-validation approach:
This ensured each training round had a diverse and balanced dataset.
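The fold assignment can be sketched with scikit-learn's `StratifiedKFold`. The DataFrame layout and class names below are illustrative placeholders, not the competition's actual label schema:

```python
# Minimal sketch of stratified fold assignment. Column names and disease
# classes are illustrative, not the real competition schema.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy label table: one row per image, one disease class per image.
df = pd.DataFrame({
    "Image_ID": [f"img_{i}.jpg" for i in range(30)],
    "class": ["anthracnose"] * 10 + ["cssvd"] * 10 + ["healthy"] * 10,
})

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
df["fold"] = -1
for fold, (_, val_idx) in enumerate(skf.split(df, df["class"])):
    df.loc[val_idx, "fold"] = fold

# Every fold now holds a class-balanced slice of the data.
```

With a split like this, each training round sees every disease class in both its training and validation slices.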
We chose YOLO11s due to its speed, performance, and compatibility with edge devices. Its tiny size allowed us to meet the deployment constraint without compromising much on accuracy.
We trained on folds 6, 7, and 8 of the dataset. Each fold was trained for ~2 hours and 30 minutes, keeping us well within the 9-hour limit.
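A per-fold run with the Ultralytics API might look like the sketch below. The dataset YAML names, epoch count, and batch size are assumptions, not our exact settings; only `yolo11s.pt` and the fold IDs come from the writeup:

```python
# Sketch of one fold's training run. Hyperparameters and the per-fold
# dataset YAML names are illustrative assumptions. The import is lazy so
# the sketch stands alone without the `ultralytics` package installed.
FOLDS = (6, 7, 8)

def train_fold(fold: int):
    from ultralytics import YOLO         # requires the `ultralytics` package
    model = YOLO("yolo11s.pt")           # small variant, edge-friendly
    model.train(
        data=f"cocoa_fold_{fold}.yaml",  # hypothetical per-fold data config
        epochs=50,                       # illustrative; tune to the time budget
        imgsz=640,
        batch=16,
        name=f"yolo11s_fold{fold}",
    )
    return model
```

Calling `train_fold` once per fold in `FOLDS` produces the three checkpoints used in the ensemble; at ~2h30 each, the three runs fit comfortably inside the 9-hour cap.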
We created an ensemble using Weighted Box Fusion (WBF) to combine predictions from all three models, improving robustness and detection confidence.
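To make the fusion step concrete, here is a minimal NumPy illustration of the WBF idea: greedily cluster boxes by IoU, then average each cluster's coordinates weighted by confidence. This is a teaching sketch, not a production implementation (a library such as `ensemble-boxes` is typically used), and it fuses only coordinates, not scores:

```python
# Minimal illustration of Weighted Box Fusion: boxes that overlap above an
# IoU threshold are merged into a confidence-weighted average box.
import numpy as np

def iou(a, b):
    """IoU between two xyxy boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def simple_wbf(boxes, scores, iou_thr=0.55):
    """Fuse overlapping boxes into score-weighted average boxes."""
    order = np.argsort(scores)[::-1]          # highest confidence first
    fused = []                                # entries: [weighted sum, weight]
    for i in order:
        for f in fused:
            if iou(f[0] / f[1], boxes[i]) >= iou_thr:
                f[0] += scores[i] * boxes[i]  # fold box into the cluster
                f[1] += scores[i]
                break
        else:
            fused.append([scores[i] * boxes[i].astype(float), scores[i]])
    return np.array([f[0] / f[1] for f in fused])

boxes = np.array([
    [0.10, 0.10, 0.50, 0.50],   # model A: a detection
    [0.12, 0.11, 0.52, 0.50],   # model B: near-duplicate of the same box
    [0.70, 0.70, 0.90, 0.90],   # model C: a separate detection
])
scores = np.array([0.9, 0.8, 0.6])
fused = simple_wbf(boxes, scores)   # two boxes: the near-duplicates merge
```

Unlike NMS, which keeps only the single highest-scoring box of a cluster, WBF blends the cluster's geometry, which is what makes it attractive for combining three independently trained folds.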
We performed inference across multiple image sizes to enhance generalization to unseen disease patterns:
[640, 800, 960, 1120, 1280, 1440]
💡 Special thanks to @kiminya for inspiring the multi-scale strategy.
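The multi-scale loop might look like this sketch; the weights path, confidence threshold, and helper name are illustrative assumptions. Boxes are returned in normalized coordinates so detections from every scale live in the same frame and can be fed straight into WBF:

```python
# Sketch of multi-scale inference. The scales list and max_det=600 come
# from the writeup; the weights path, conf threshold, and helper name are
# illustrative. The import is lazy so the sketch stands alone.
SCALES = [640, 800, 960, 1120, 1280, 1440]

def predict_multiscale(weights: str, image_path: str, conf: float = 0.01):
    from ultralytics import YOLO      # requires the `ultralytics` package
    model = YOLO(weights)
    boxes, scores, labels = [], [], []
    for imgsz in SCALES:
        res = model.predict(image_path, imgsz=imgsz, conf=conf, max_det=600)[0]
        boxes.append(res.boxes.xyxyn.cpu().numpy())   # normalized xyxy
        scores.append(res.boxes.conf.cpu().numpy())
        labels.append(res.boxes.cls.cpu().numpy())
    return boxes, scores, labels      # per-scale lists, ready for WBF
```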
Training: 8h 33min
Inference: 40min
This made our solution fully compliant with the challenge's time constraints (≤9h training, ≤3h inference).
To understand what our models were actually looking at, we implemented EigenCAM:
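A sketch of that visualization using the `grad-cam` package's `EigenCAM` class; the helper's name, the choice of target layer, and the preprocessing are assumptions, and a YOLO model typically needs its underlying `torch` module and a backbone layer passed in:

```python
# Hedged sketch of an EigenCAM heatmap overlay. Assumes the `grad-cam`
# package (import name `pytorch_grad_cam`); the helper name and the idea
# of passing a single backbone layer are ours, not the writeup's code.
def eigencam_overlay(torch_model, target_layer, input_tensor):
    from pytorch_grad_cam import EigenCAM               # optional dependency
    from pytorch_grad_cam.utils.image import show_cam_on_image

    cam = EigenCAM(torch_model, target_layers=[target_layer])
    grayscale = cam(input_tensor)[0]                    # HxW map in [0, 1]
    rgb = input_tensor[0].permute(1, 2, 0).numpy()      # CHW -> HWC, [0, 1]
    return show_cam_on_image(rgb, grayscale, use_rgb=True)
```

EigenCAM needs no gradients or class targets; it projects the chosen layer's activations onto their first principal component, which makes it a quick sanity check of what the detector actually attends to on a leaf.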
Welldone @koleshjr. This is detailed. It was nice learning from you.
It was nice collaborating with you @Sodiq_Babawale_
Woah, we did similar things🔥. Thanks for sharing @Koleshjr @Sodiq_Babawale_
Yeah pretty much the same thing 😅, thanks
🔥🔥
Congratulations to Brain Blend team and thank you for sharing your solutions
Congrats guys. I also did the same thing with yolo11s, except that I trained 5 folds at a resolution of 448.
One important detail I noticed in your inference is that you set max_det=600, i.e., double the default of 300. Was this critical to the final score? I noticed towards the end that subs with more detections scored a little better, but didn't pursue it further.
Not really; using the default leads to similar or slightly worse results.
In mine, it gives slightly worse results too. I feel the more the merrier works here.
@koleshjr Congratulations, your solution is insightful and thank you for sharing.
Thank you for sharing your detailed solution @koleshjr.
If I may ask, why did you choose folds 6,7,8 instead of other folds?
We were training subsets of 3 folds since that is what fit within the set time limit, and that combination had the best local validation scores; the LB score was not bad as well.
Well done + thanks for sharing