[EDIT TO ORIGINAL POST] See this explanation by @Muhamed_Tuo below.
---
ORIGINAL POST:
The labels (bounding boxes) for this challenge are found in two places:
The starter notebook uses the .txt files directly, so it is fair to assume those are the ground truth labels. The format of the labels in the .txt files is (xcentre, ycentre, width, height) in relative coordinates. It appears as if these labels were then converted to the (xmin, ymin, xmax, ymax) format (in absolute coordinates) found in Train.csv.
When looking at the bounding boxes in Train.csv, our team noticed a high percentage of boxes that are outside of the image bounds. I then started to compare the bounding boxes from Train.csv to those in the .txt files. I think there is an error (or at least an inconsistency) in how the labels were converted. Specifically, it appears as if the image width and height were 'switched' when converting the relative coordinates to absolute coordinates for some images.
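To make the suspected error concrete, here is a minimal sketch (function name and the example box are mine, not from the dataset) of the standard conversion from relative (xcentre, ycentre, width, height) to absolute (xmin, ymin, xmax, ymax), showing how switching the image width and height during conversion pushes a perfectly valid box out of bounds:

```python
def to_absolute(xc, yc, w, h, img_w, img_h):
    """Convert a relative YOLO-style box to absolute corner coordinates."""
    xmin = (xc - w / 2) * img_w
    ymin = (yc - h / 2) * img_h
    xmax = (xc + w / 2) * img_w
    ymax = (yc + h / 2) * img_h
    return xmin, ymin, xmax, ymax

# A box near the right edge of a 1920x1080 (landscape) image.
box = (0.9, 0.5, 0.15, 0.2)

correct = to_absolute(*box, img_w=1920, img_h=1080)
swapped = to_absolute(*box, img_w=1080, img_h=1920)  # width/height switched

print(correct)  # stays inside the 1920x1080 image
print(swapped)  # ymax = (0.5 + 0.1) * 1920 = 1152 > 1080 -> out of bounds
```

With the correct dimensions the box stays inside the image; with the dimensions swapped, the same relative box lands outside the 1080-pixel height, which matches the out-of-bounds pattern observed in Train.csv.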
While we can use the .txt files to train our models, the concern is that Test.csv is used to evaluate submissions. If the same conversion error or inconsistency occurred there, then the ground truth for this challenge is incorrect.
My step-by-step analysis can be found in this notebook: https://www.kaggle.com/code/stefan87/amini-cocoa-contamination-bb-error
Good day
Thank you very much for this
I personally noticed this while analysing the coordinates in the Train.csv file. Initially, it may be smart to clip the values of ymax where ymax > image height, and do the same for cases where xmax > image width, but even after that there's still a lot to look out for.
As for swapping the width and height, I looked into this too and noticed that it wasn't the case for all the box coordinates. I'm still thinking of a way to properly approach it.
I have personally not tried to train using the initially prepared dataset in YOLO format provided by Zindi in this competition, but as you said, what if there's also an issue with them?
I haven't compared them either.
For me the main problem is that we have two sources of labels that don't seem to match. At the very least we should know which source is correct, and that the test data is also based on that.
Considering this, what if they've also kind of been swapped in the evaluation dataset?
Exactly, that's what I'm wondering about too
And here was me not focusing on the data at all.
lol that's usually me!
You need to! My initial suspicion was that the x and y coordinates might have been swapped (I still kind of think so). I'll share an observation of mine later.
Not only me then XD.
Take a quick look at this:
Please note that I am human as well and might have some misconceptions in my analysis, so do correct me if any of my calculations or computations are wrong:
Firstly: Modifying the path and computing image resolution
Secondly: Checking some cases of inconsistency, for example xmin > image width, which just does not make sense!
At the same time, we can see that there is not a single case where xmin > xmax, and the same holds for ymin and ymax.
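The checks described above can be sketched as a small validation pass (the box tuples here are hypothetical placeholders; Train.csv would supply the real values):

```python
def find_out_of_bounds(boxes):
    """Return boxes whose corners fall outside the image, after verifying
    that the xmin <= xmax and ymin <= ymax invariant holds for every box.

    Each box is (xmin, ymin, xmax, ymax, img_w, img_h).
    """
    bad = []
    for xmin, ymin, xmax, ymax, img_w, img_h in boxes:
        assert xmin <= xmax and ymin <= ymax  # observed to hold for this dataset
        if xmin < 0 or ymin < 0 or xmax > img_w or ymax > img_h:
            bad.append((xmin, ymin, xmax, ymax, img_w, img_h))
    return bad

boxes = [
    (10, 10, 100, 100, 640, 480),   # fine
    (700, 10, 800, 100, 640, 480),  # xmin > image width: does not make sense
]
print(len(find_out_of_bounds(boxes)))  # 1
```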
Thirdly: I have a brief assumption (theory?), nothing serious, I have only just reasoned it out :)
* Now, unless I am wrong, mathematically:
* If xmax > image_width and, in every such case, we clip the excess value to the width, say:
* if xmax > image_width:
*     xmax = image_width
* (otherwise xmax remains unchanged)

The above is not expected to leave some values of xmin greater than xmax after the computation, because we checked initially that xmin is never greater than xmax.
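That clipping rule can be written as a short runnable sketch (function name is mine); as argued above, clipping to the image bounds cannot make xmin exceed xmax when xmin <= xmax already held, since both coordinates are clamped to the same interval:

```python
def clip_box(xmin, ymin, xmax, ymax, img_w, img_h):
    """Clip a box to the image bounds; preserves xmin <= xmax and ymin <= ymax."""
    xmin = min(max(xmin, 0), img_w)
    xmax = min(max(xmax, 0), img_w)
    ymin = min(max(ymin, 0), img_h)
    ymax = min(max(ymax, 0), img_h)
    return xmin, ymin, xmax, ymax

print(clip_box(50, 20, 700, 500, 640, 480))  # -> (50, 20, 640, 480)
```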
Lastly: After computing the base stats as I did earlier in the second step, we can see a change. Why? I have no actual facts, just assumptions:
My first assumption would be that the coordinates were swapped, i.e., xmin for ymin and similarly xmax for ymax.
Secondly, similar to the first assumption, this (swapped coordinates) may only have been an issue in a few cases.
I also plotted some of these boxes, carefully comparing the box width and height against the image resolution. It is quite evident that they are perhaps swapped, or I am mistaken and these are just wrong boxes that look quite deceiving!
I'd very much love to hear your opinions on this and if potentially there is anything wrong in my calculations and assumptions, I'd love to hear them too!
Thanks!
Hi, in preprocessing, perhaps change all images to the same size and standardize.
If we resize the images, we also have to resize the bounding boxes proportionally, which means that we will still have the problem where the bounding boxes in Train.csv are different to those in the .txt label files.
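For reference, proportional rescaling of the boxes under an image resize is straightforward; here is a minimal sketch (function name and the sizes are illustrative, not from the starter notebook):

```python
def resize_box(xmin, ymin, xmax, ymax, old_w, old_h, new_w, new_h):
    """Scale absolute box coordinates when the image is resized."""
    sx, sy = new_w / old_w, new_h / old_h
    return xmin * sx, ymin * sy, xmax * sx, ymax * sy

# Resizing a 1280x720 image to 640x640 halves x and scales y by 640/720.
print(resize_box(100, 72, 300, 360, 1280, 720, 640, 640))  # roughly (50, 64, 150, 320)
```

The scaling is exact and reversible, which is precisely why it cannot repair a label set whose coordinates were wrong before the resize.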
Data cleaning and preprocessing are super important when we have bad or noisy data. By reducing the noise, we can train better models. Even if the test set is also noisy, better models should perform better.
However, if there is a 'systematic error' in the data (e.g., height and width switched during a transformation, as I suspect), and if that systematic error is also present in the test set, then no amount of preprocessing will help improve test set performance. In fact, data cleaning might actually hurt test set performance, because models that are able to replicate the systematic error will do well.
What I'm suspecting is that there is some sort of systematic labeling error here which is different from just bad/noisy data.
@Stefan027, thank you for sharing.
@AJoel @amy_bray @Zindi
Thanks @Muhamed_Tuo for the workaround. It appears that there are still some mislabels -- just keen to confirm whether anyone else has seen the same, and that there isn't a bug in my code?
Eg: ID_skBkBf, ID_gTbZrd, ID_FHDhzz, ID_U0JAu1
Hey @Stefan027
Actually, both data sources are correct. The issue comes from the EXIF metadata stored in the images, or more specifically the `orientation` tag. By default, PIL does not take that rotation information into account, as opposed to OpenCV.
To properly read those images with PIL, we need to:
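A minimal sketch of reading an image with PIL while honouring the orientation tag (using Pillow's `ImageOps.exif_transpose` helper; the function name here is my own choosing) could look like:

```python
from PIL import Image, ImageOps

def load_image_like_opencv(filepath):
    """Open an image and apply its EXIF orientation tag (tag 274),
    so the reported width/height match what OpenCV's default load gives."""
    image = Image.open(filepath)
    # exif_transpose rotates/flips the pixel data according to the
    # orientation flag and strips the flag from the returned image.
    return ImageOps.exif_transpose(image)
```

After this, `image.size` agrees with the dimensions the annotations were made against, so the apparent width/height swap disappears.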
With OpenCV, this is done by default, as I mentioned earlier.
So, if you want to mimic the same behaviour as with PIL, by ignoring the orientation, you can do:
```python
image = cv2.cvtColor(
    cv2.imread(fp, cv2.IMREAD_IGNORE_ORIENTATION | cv2.IMREAD_COLOR),
    cv2.COLOR_BGR2RGB,
)  # This will basically disregard the orientation flag and incorrectly load the image
```

@ZINDI I don't think there's anything to worry about with the test annotations.
Thanks for this information
Really helpful
For those using YOLO, you most likely noticed a fair amount of warnings being displayed at the start of training. What's happening under the hood is that YOLO is processing these images by loading them, applying the correct orientation, and caching them.
```
train: WARNING ⚠️ /kaggle/working/cocoa_diseases/yolo_data/0/train/ID_AJD939.jpg: corrupt JPEG restored and saved
```

Hey @Muhamed_Tuo, this is really great! I just learned a lot about processing images! This completely explains the apparent 'switching' of height and width that I talked about in my post, and why it seemed to apply to some images but not others.
For those using the competition's starter notebook for EDA, you can add this function to the notebook (derived from @Muhamed_Tuo above):
```python
from PIL import Image, ExifTags

def load_image(filepath):
    image = Image.open(filepath)
    # Find the numeric tag ID for 'Orientation' (274, i.e. 0x0112).
    for flag in ExifTags.TAGS.keys():
        if ExifTags.TAGS[flag] == 'Orientation':
            break
    orientation = flag
    exif = image._getexif() or {}  # _getexif() returns None if there is no EXIF data
    orientation_value = exif.get(orientation, None)
    if orientation_value == 3:
        image = image.rotate(180, expand=True)
    elif orientation_value == 6:
        image = image.rotate(270, expand=True)
    elif orientation_value == 8:
        image = image.rotate(90, expand=True)
    return image
```

Then in the plot_image_with_boxes function replace this line:
with this:
Thanks for the clarification @Muhamed_Tuo. Very helpful feedback
Great work @Stefan027. Grateful to you for bringing clarity to this issue.
Thanks for this.
Actually

```python
cv2.cvtColor(
    cv2.imread(filepath, cv2.IMREAD_IGNORE_ORIENTATION | cv2.IMREAD_COLOR),
    cv2.COLOR_BGR2RGB,
)
```

is not working; only handling the orientation with PIL can solve this issue.
Of course that does not work, since you are literally ignoring the orientation. Read the comment in that code block :)
My badddd, sorry missed the comment
Thanks for bringing this to our attention. We are evaluating the data and will respond as soon as we can.
I think the issue was solved in the above thread?
Yes agreed! Have gone through thread and data and I think we're good, no changes at this point 👍