Ghana Crop Disease Detection Challenge

Helping Ghana
$8 000 USD
Completed (over 1 year ago)
Computer Vision
Object Detection
2205 joined
344 active
Start: Oct 04, 24
Close: Dec 15, 24
Reveal: Dec 15, 24
Train dataset: data augmented?
Data · 12 Dec 2024, 23:58 · 4

Hi everyone,

During my EDA of the Ghana Crop Disease Detection Challenge dataset, I've noticed something interesting about the bounding boxes that I'd like to discuss with the community.

I've discovered numerous instances of perfectly overlapping bounding boxes with identical coordinates but different crop labels. The most striking example is with Septoria disease, where I found 2,280 bounding boxes that are duplicated between tomato and pepper crops.

What's particularly intriguing is the pattern:

  • For example, when an image of an actually diseased tomato has, say, 5 bounding boxes labeled tomato_septoria
  • you'll find 5 identical "twin" bounding boxes labeled pepper_septoria at exactly the same positions

Here's a visual example: https://ibb.co/NSdSBmN

While I've developed a method to remove these duplicates, I'm hesitant to apply it. This seems too systematic to be an error, making me wonder if it's an intentional part of the dataset creation process.
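
For anyone who wants to check this on their own copy of the data, here is a minimal sketch of one way such "twin" boxes could be detected. It is only an illustration, not necessarily the method mentioned above. The field names (Image_ID, class, xmin, ymin, xmax, ymax) and the toy rows are assumptions; adjust them to the actual annotation file.

```python
from collections import defaultdict


def find_twin_boxes(rows):
    """Return boxes that appear with more than one label on the same image.

    `rows` is an iterable of dicts. The keys used here (Image_ID, class,
    xmin, ymin, xmax, ymax) are assumed names, not confirmed column names.
    """
    labels_per_box = defaultdict(set)
    for row in rows:
        # A box is identified by its image and its exact coordinates.
        key = (row["Image_ID"], row["xmin"], row["ymin"], row["xmax"], row["ymax"])
        labels_per_box[key].add(row["class"])
    # Keep only boxes that carry at least two distinct labels.
    return {key: sorted(labels) for key, labels in labels_per_box.items() if len(labels) > 1}


# Toy annotations (made up for illustration only).
rows = [
    {"Image_ID": "img_1", "class": "Tomato_Septoria", "xmin": 10, "ymin": 20, "xmax": 50, "ymax": 60},
    {"Image_ID": "img_1", "class": "Pepper_Septoria", "xmin": 10, "ymin": 20, "xmax": 50, "ymax": 60},
    {"Image_ID": "img_1", "class": "Tomato_Septoria", "xmin": 70, "ymin": 80, "xmax": 90, "ymax": 95},
]

twins = find_twin_boxes(rows)
for box, labels in twins.items():
    print(box, labels)
```

This only flags exact coordinate matches; boxes that differ by a pixel or so would need a tolerance or an IoU threshold instead.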

Has anyone else noticed this pattern? What are your thoughts on handling these overlapping annotations? Should we treat this as intentional data augmentation or a data quality issue that needs to be addressed?

Looking forward to your insights!

Discussion 4 answers

I came across this and several other issues, for example over 1000 duplicate or near-duplicate images with different annotations. The dataset seems to be auto-labelled, and developing a model that addresses these issues results in worse performance on the leaderboard.

13 Dec 2024, 11:06
Upvotes 1
CodeJoe

That is true. I guess leaving those annotations in gives a good score on the LB.

All of the images in the training dataset are repeated; the main difference is the bounding boxes. Even with different bounding box coordinates, the image IDs are the same. I think they spotted multiple instances of a particular disease on a single image.

13 Dec 2024, 12:31
Upvotes 0

By duplicated images I meant two or more images with different image_ids that are exactly the same or slightly translated versions of each other. The sets of bounding boxes in duplicates sometimes intersect. The effect is a lot of false positives that are in fact true positives.
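
As a rough illustration, exact duplicates of this kind could be found by hashing the image file contents and grouping image_ids that share a hash. This is only a sketch: the `images` dict below is a hypothetical stand-in for reading each file's bytes from disk, and the ids are made up.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(images):
    """Group image ids whose raw bytes are identical.

    `images` maps image_id -> file contents as bytes (a hypothetical
    stand-in for reading each image file from disk).
    """
    ids_by_hash = defaultdict(list)
    for image_id, data in images.items():
        digest = hashlib.sha256(data).hexdigest()
        ids_by_hash[digest].append(image_id)
    # Only hashes shared by two or more ids indicate duplicates.
    return [sorted(ids) for ids in ids_by_hash.values() if len(ids) > 1]


# Toy data: id_a and id_b have identical contents, id_c differs.
images = {
    "id_a": b"\x00\x01\x02",
    "id_b": b"\x00\x01\x02",
    "id_c": b"\xff",
}

duplicates = group_exact_duplicates(images)
print(duplicates)
```

Byte-level hashing only catches exact copies. The slightly translated versions would not match and would need something like a perceptual hash or feature matching instead.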