Dear @Amy_Bray and Competition Organizers,
I'm writing to raise a concern regarding the integrity of the provided training dataset, which appears to contain several inconsistencies that will significantly impact the performance of any models developed under the current rules.
The competition rules state that all data manipulation must be done in code, and manual corrections are prohibited. However, the training dataset seems to have been manually labeled with incorrect information. Adhering to the "no manual manipulation" rule would force competitors to train models on flawed data, leading to inaccurate results that don't reflect the true performance of the models.
I've found numerous examples of these discrepancies. Below are a few specific instances for clarity:
Address Field:
Image ID 7707-150: The dataset lists the address as "Development," while the image clearly shows "Lot 225 Union Hall."
Image ID 8612-107: The dataset has the address as "Republic Bank (Barbados) Limited." The correct address from the image is "Lot 3, Ocean City West, Foul Bay."
The above examples point to one of three possible explanations for the state of the train dataset: human error, systematic issues, or deliberate manipulation.
Land Surveyor Name:
Image ID 6716-124: The dataset lists the land surveyor as "Kenneth Ward." The image identifies the surveyor as "Sophia D. Ward."
Image ID 7711-051: The dataset lists "Michael H Hutchinson," but the image shows "Lennox J. Reid."
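For anyone who wants to triage these systematically rather than by eye, here is a rough first-pass sketch: run OCR over each plan and flag rows whose labelled fields never appear in the recognized text. The file name and column names are placeholders rather than the actual competition files, and OCR on noisy scans will miss exact matches, so treat this only as a filter for manual review.

```python
import pandas as pd
import pytesseract
from PIL import Image

# Placeholder file/column names -- adjust to the actual competition CSV.
train = pd.read_csv("Train.csv")  # assumed columns: image_id, address, surveyor

def ocr_text(image_path: str) -> str:
    """Run Tesseract OCR on a plan image and return lower-cased text."""
    return pytesseract.image_to_string(Image.open(image_path)).lower()

suspect = []
for _, row in train.iterrows():
    text = ocr_text(f"images/{row['image_id']}.jpg")
    # Flag the row if neither the labelled address nor the surveyor name
    # shows up anywhere in the OCR output.
    if row["address"].lower() not in text and row["surveyor"].lower() not in text:
        suspect.append(row["image_id"])

print(f"{len(suspect)} images with labels that never appear in the OCR text")
```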
I've also observed issues related to name standardization, which further complicate the data quality.
Based on these findings, and given that we are building vision-based OCR models, I believe the integrity of the competition is at risk. Under the rule against manual data correction, training on this dataset would penalize competitors and lead to misleading evaluations of model performance.
I would like to propose a discussion on the best way forward.
@Amy_Bray I believe clarifying this issue is essential for a fair and productive competition for all participants.
+1. There are also instances of mislabeled polygons.
+1. There are also instances where the dates and surveyor names are cropped off or masked.
Same here, I found them too, but I kind of ignored them.
@Joseph_gitau
Would you kindly point me to the Zindi rule that prohibits manual corrections to the training data? I looked through https://zindi.africa/rules but couldn’t find it. Thank you so much.
@3B Under the info page for this competition there is this rule.
My interpretation might be wrong, but there is no further clarification on the given rule.
It seems this only applies to the test data, at least that’s my experience. It would be great if Amy_Bray could clarify this for us, but she seems quite busy.
@Amy_Bray @Zindi @AJoel does the manual labeling rule only apply to test data?
There should be an allowance for corrections in the training dataset, as long as they are made in your pipeline code to ensure reproducibility.
It's unrealistic to expect a model optimized on incorrect data to magically perform well once deployed, so correcting training labels is in everyone's best interest.
That said, we still need guidance on labelling inconsistencies, e.g. should we drop or keep middle initials?
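If corrections do end up being allowed, one way to keep them reproducible is to apply them in code rather than editing the CSV by hand, e.g. a small versioned mapping that the pipeline applies on load. A minimal sketch, assuming a Train.csv with image_id / address / surveyor columns (placeholders, not the actual schema); the two fixes shown are just the examples from the opening post:

```python
import pandas as pd

# Corrections live in code (or a versioned JSON file), so the raw CSV is never
# edited and anyone re-running the pipeline gets identical training labels.
LABEL_FIXES = {
    # image_id: {column: corrected value} -- taken from the opening post's examples
    "7707-150": {"address": "Lot 225 Union Hall"},
    "6716-124": {"surveyor": "Sophia D. Ward"},
}

def apply_fixes(df: pd.DataFrame) -> pd.DataFrame:
    """Overwrite known-bad labels as an explicit, reviewable pipeline step."""
    df = df.copy()
    for image_id, fixes in LABEL_FIXES.items():
        for column, value in fixes.items():
            df.loc[df["image_id"] == image_id, column] = value
    return df

train = apply_fixes(pd.read_csv("Train.csv"))
```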
I don't think manual corrections should be allowed on the training set, because then how would you scale the solution to a larger training set? And how can we be sure there aren't inconsistencies in the public/private test-set labels as well? I would definitely suggest a corrected version of the dataset instead, plus some validation from @Zindi @Amy_Bray that the test set indeed has accurate labels (middle initials, etc.). That said, I'm not too worried about it, because model accuracy is almost perfect for the text with Vision OCR. I am a bit more worried about the polygons and the heuristics used to create those labels and train models on them, but that's for a different discussion.
That's been my fear all along: does the test set have accurate labels, especially for the polygons? Because in the train set there can be 5 duplicates of just one site plan, with one of them not even correct 😭.
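On the duplicate plans: perceptual hashing makes them fairly easy to group, so at least the conflicting labels for the same plan can be reviewed side by side. A rough sketch, assuming the images sit in an images/ folder and using the imagehash package (exact hash equality is strict; a small Hamming-distance threshold would catch more near-duplicates):

```python
from collections import defaultdict
from pathlib import Path

import imagehash
from PIL import Image

# Group images by perceptual hash; identical hashes are very likely the same plan.
groups = defaultdict(list)
for path in Path("images").glob("*.jpg"):   # assumed image location
    h = imagehash.phash(Image.open(path))   # robust to rescaling/compression
    groups[str(h)].append(path.name)

duplicates = {h: names for h, names in groups.items() if len(names) > 1}
print(f"{len(duplicates)} groups of likely duplicate plans")
```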
For a competition with relatively high prize money, your explanation seems reasonable: the organizer will probably apply the winning solution to a larger dataset. I think it's safer not to risk manually adjusting the labels. We should not expect feedback from Zindi anymore.
It would be interesting to hear the organisers' response regarding manual processing of the training data. The competition ends soon.
@Zindi @Amy_Bray
Hello, I would like to ask whether Zindi is going to give direction on this. Also, are we allowed to manually annotate the train images to fine-tune segmentation models?
It's disappointing that we haven't heard back from the Zindi team yet.
@Amy_Bray @Zindi @AJoel We're really in the dark here and would really appreciate your input.
To all fellow competitors, it's clear that a key challenge in this competition is developing a solution robust enough to handle the significant noise in both the train and test datasets.
With just 12 days left, I want to wish everyone the best of luck in these final stages. Keep pushing forward!
In the spirit of collaboration, I plan to open-source my solution soon. I'm still determining the best way to do this before the competition ends, but I will post an update here when it's available.
Awesome!
Please do not post your solution before the competition ends; rather, post it right after the deadline.
I thought so too, but he has already posted it, so I just had to edit my post. I think it is fine.
I guess it is fine. There are no rules to this game.
Both the public and private test sets have been manually reviewed and corrected to align with the cadastral plans. Minor name discrepancies (such as inclusion or omission of initials) remain, which is why WER was selected as the evaluation metric. The training data may still contain inconsistencies, but the test data have been cleaned to ensure reliable evaluation.
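For anyone unfamiliar with the metric, here is a tiny illustration of why WER tolerates a dropped initial: only the differing words count against the score. The names below are made up, and I'm assuming the jiwer package mirrors how the metric is computed here.

```python
from jiwer import wer

reference = "sophia d ward"     # label with the middle initial
prediction = "sophia ward"      # prediction without it
# One deletion out of three reference words -> WER ≈ 0.33, not a total miss.
print(wer(reference, prediction))
```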