🌾 Let's Talk About: Train Dataset data augmented ?...

Ghana Crop Disease Detection Challenge

Mlamalerie

Train Dataset data augmented ???

Data · 12 Dec 2024, 23:58 · 4

Hi everyone,

During my EDA of the Ghana Crop Disease Detection Challenge dataset, I've noticed something interesting about the bounding boxes that I'd like to discuss with the community.

I've discovered numerous instances of perfectly overlapping bounding boxes with identical coordinates but different crop labels. The most striking example is with Septoria disease, where I found 2,280 bounding boxes that are duplicated between tomato and pepper crops.

What's particularly intriguing is the pattern:

For example, when there are, say, 5 bounding boxes labeled tomato_septoria on an image of an actual diseased tomato, you'll find 5 identical "twin" bounding boxes labeled pepper_septoria at exactly the same positions.
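
A check for the pattern above could be sketched as follows. This is a minimal sketch, not necessarily the method used for the numbers in this post. It assumes the annotations are loaded into a pandas DataFrame with one row per box; the column names `Image_ID`, `class`, `xmin`, `ymin`, `xmax`, `ymax`, the helper `find_twin_boxes`, and the toy rows are all assumptions made for illustration.

```python
import pandas as pd

# Assumed coordinate column names; adjust to match the actual Train.csv.
COORDS = ["xmin", "ymin", "xmax", "ymax"]


def find_twin_boxes(df, image_col="Image_ID", label_col="class"):
    """Return rows whose (image, box coordinates) occur with more than one label."""
    keys = [image_col] + COORDS
    # For every row, count the distinct labels sharing the same image and box.
    n_labels = df.groupby(keys)[label_col].transform("nunique")
    return df[n_labels > 1].sort_values(keys)


# Toy data (hypothetical): image a.jpg has one box carrying two labels.
toy = pd.DataFrame({
    "Image_ID": ["a.jpg", "a.jpg", "a.jpg", "b.jpg"],
    "class": ["tomato_septoria", "pepper_septoria",
              "tomato_septoria", "pepper_septoria"],
    "xmin": [10.0, 10.0, 50.0, 10.0],
    "ymin": [20.0, 20.0, 60.0, 20.0],
    "xmax": [30.0, 30.0, 90.0, 30.0],
    "ymax": [40.0, 40.0, 95.0, 40.0],
})

twins = find_twin_boxes(toy)
print(twins)
```

Matching on exact coordinate equality only catches perfectly overlapping boxes; boxes that differ by a rounding error would need a tolerance or an IoU threshold instead.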

Here's a visual example: https://ibb.co/NSdSBmN

While I've developed a method to remove these duplicates, I'm hesitant to apply it. This seems too systematic to be an error, making me wonder if it's an intentional part of the dataset creation process.

Has anyone else noticed this pattern? What are your thoughts on handling these overlapping annotations? Should we treat this as intentional data augmentation or a data quality issue that needs to be addressed?

Looking forward to your insights!

Discussion 4 answers

da_

I came across this and several other issues, for example over 1,000 duplicated or near-duplicate images with different annotations. The dataset seems to be auto-labelled, and developing a model to address these issues results in worse performance on the leaderboard.

13 Dec 2024, 11:06

Upvotes 1

CodeJoe

That is true. I guess leaving those annotations in gives a good score on the LB.

replied to da_ · 13 Dec 2024, 11:39

Upvotes 0

Mentoni

All of the images in the training dataset are repeated; the main difference is the bounding boxes. Even when the bounding box coordinates differ, the image IDs are the same. I think they spotted multiple instances of a particular disease on a single image.

13 Dec 2024, 12:31

Upvotes 0

da_

By duplicated images I meant two or more images with different image_ids that are exactly the same or slightly translated versions of each other. The sets of bounding boxes in duplicates sometimes intersect. The effect is a lot of false positives that are in fact true positives.
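
The exact-duplicate case can be checked with a rough sketch like the one below, which groups image files by a hash of their bytes. It only finds byte-identical files; slightly translated copies would need a perceptual hash or feature matching instead. The function name `group_exact_duplicates`, the directory layout, and the `*.jpg` pattern are assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_exact_duplicates(image_dir, pattern="*.jpg"):
    """Group file names in image_dir whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for path in sorted(Path(image_dir).glob(pattern)):
        # Identical bytes give identical digests, whatever the file name is.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    # Keep only digests shared by at least two files.
    return [names for names in groups.values() if len(names) > 1]
```

Each returned group lists image_ids that point to the same picture, so their bounding boxes can be compared side by side before deciding whether to merge or drop annotations.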

replied to Mentoni · 13 Dec 2024, 14:09

Upvotes 2
