Dear @Amy_Bray and Competition Organizers,
I'm writing to raise a concern regarding the integrity of the provided training dataset, which appears to contain several inconsistencies that will significantly impact the performance of any models developed under the current rules.
The competition rules state that all data manipulation must be done in code, and manual corrections are prohibited. However, the training dataset seems to have been manually labeled with incorrect information. Adhering to the "no manual manipulation" rule would force competitors to train models on flawed data, leading to inaccurate results that don't reflect the true performance of the models.
I've found numerous examples of these discrepancies. Below are a few specific instances for clarity:
Address Field:
Image ID 7707-150: The dataset lists the address as "Development," while the image clearly shows "Lot 225 Union Hall."
Image ID 8612-107: The dataset has the address as "Republic Bank (Barbados) Limited." The correct address from the image is "Lot 3, Ocean City West, Foul Bay."
The above examples point to one of three possible explanations for the state of the train dataset: human error, systematic issues, or deliberate manipulation.
Land Surveyor Name:
Image ID 6716-124: The dataset lists the land surveyor as "Kenneth Ward." The image identifies the surveyor as "Sophia D. Ward."
Image ID 7711-051: The dataset lists "Michael H Hutchinson," but the image shows "Lennox J. Reid."
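For anyone who wants to triage these systematically rather than by eye, here is a rough first-pass sketch: run OCR over each plan and flag rows whose labelled fields never appear in the recognized text. The file name and column names are placeholders rather than the actual competition files, and OCR on noisy scans will miss exact matches, so treat this only as a filter for manual review.

```python
import pandas as pd
import pytesseract
from PIL import Image

# Placeholder file/column names -- adjust to the actual competition CSV.
train = pd.read_csv("Train.csv")  # assumed columns: image_id, address, surveyor

def ocr_text(image_path: str) -> str:
    """Run Tesseract OCR on a plan image and return lower-cased text."""
    return pytesseract.image_to_string(Image.open(image_path)).lower()

suspect = []
for _, row in train.iterrows():
    text = ocr_text(f"images/{row['image_id']}.jpg")
    # Flag the row if neither the labelled address nor the surveyor name
    # shows up anywhere in the OCR output.
    if row["address"].lower() not in text and row["surveyor"].lower() not in text:
        suspect.append(row["image_id"])

print(f"{len(suspect)} images with labels that never appear in the OCR text")
```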
I've also observed issues related to name standardization, which further complicate the data quality.
Based on these findings, and given that we are building vision-based OCR models, I believe the integrity of the competition is at risk. Under the rule against manual data correction, training on this dataset would penalize competitors and lead to misleading evaluations of model performance.
I would like to propose a discussion on the best way forward.
@Amy_Bray I believe clarifying this issue is essential for a fair and productive competition for all participants.
+1. There are also instances of mislabeled polygons.
+1. There are also instances where the dates and surveyor names are cropped off or masked.
Same here, I found them too, but I kind of ignored them.
@Joseph_gitau
Would you kindly point me to the Zindi rule that prohibits manual corrections to the training data? I looked through https://zindi.africa/rules but couldn’t find it. Thank you so much.
@3B Under the info page for this competition there is this rule.
My interpretation might be wrong, but there is no further clarification on the given rule.
It seems this only applies to the test data, at least that’s my experience. It would be great if Amy_Bray could clarify this for us, but she seems quite busy.
@Amy_Bray @Zindi @AJoel does the manual labeling rule only apply to test data?
There should be an allowance for corrections in the training dataset, as long as they are made in your pipeline code to ensure reproducibility.
It's unrealistic to expect a model optimized on incorrect data to magically perform well once deployed, so correcting training labels is in everyone's best interest.
That said, we still need guidance on labelling inconsistencies, e.g. should we drop or keep middle initials?
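If corrections do end up being allowed, one way to keep them reproducible is to apply them in code rather than editing the CSV by hand, e.g. a small versioned mapping that the pipeline applies on load. A minimal sketch, assuming a Train.csv with image_id / address / surveyor columns (placeholders, not the actual schema); the two fixes shown are just the examples from the opening post:

```python
import pandas as pd

# Corrections live in code (or a versioned JSON file), so the raw CSV is never
# edited and anyone re-running the pipeline gets identical training labels.
LABEL_FIXES = {
    # image_id: {column: corrected value} -- taken from the opening post's examples
    "7707-150": {"address": "Lot 225 Union Hall"},
    "6716-124": {"surveyor": "Sophia D. Ward"},
}

def apply_fixes(df: pd.DataFrame) -> pd.DataFrame:
    """Overwrite known-bad labels as an explicit, reviewable pipeline step."""
    df = df.copy()
    for image_id, fixes in LABEL_FIXES.items():
        for column, value in fixes.items():
            df.loc[df["image_id"] == image_id, column] = value
    return df

train = apply_fixes(pd.read_csv("Train.csv"))
```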
I don't think manual corrections should be allowed on the training set, because then how would you scale the solution to a larger training set? And how can we be sure there aren't inconsistencies in the public/private test-set labels as well? I would definitely suggest a corrected version of the dataset instead, plus some validation from @Zindi @Amy_Bray that the test set indeed has accurate labels (middle initials, etc.). That said, I'm not too worried about it, because model accuracy is almost perfect for the text with Vision OCR. I am a bit more worried about the polygons and the heuristics used to create those labels and train models on them, but that's for a different discussion.
That's been my fear all along: does the test set have accurate labels, especially for the polygons? Because in the train set there can be 5 duplicates of just one site plan, with one of them not even correct 😭.
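On the duplicate plans: perceptual hashing makes them fairly easy to group, so at least the conflicting labels for the same plan can be reviewed side by side. A rough sketch, assuming the images sit in an images/ folder and using the imagehash package (exact hash equality is strict; a small Hamming-distance threshold would catch more near-duplicates):

```python
from collections import defaultdict
from pathlib import Path

import imagehash
from PIL import Image

# Group images by perceptual hash; identical hashes are very likely the same plan.
groups = defaultdict(list)
for path in Path("images").glob("*.jpg"):   # assumed image location
    h = imagehash.phash(Image.open(path))   # robust to rescaling/compression
    groups[str(h)].append(path.name)

duplicates = {h: names for h, names in groups.items() if len(names) > 1}
print(f"{len(duplicates)} groups of likely duplicate plans")
```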
For a competition with relatively high prize money, your explanation seems reasonable: the organizer will probably apply the winning solution to a larger dataset. I think it's safer not to risk manually adjusting the labels. We should not expect feedback from Zindi anymore.
It would be interesting to hear the organisers' response regarding manual processing of the training data. The competition ends soon.
@Zindi @Amy_Bray
Hello, I would like to ask whether Zindi is going to give direction on this. Also, are we allowed to manually annotate the train images to fine-tune segmentation models?
It's disappointing that we haven't heard back from the Zindi team yet.
@Amy_Bray @Zindi @AJoel We're really in the dark here and would really appreciate your input.
To all fellow competitors, it's clear that a key challenge in this competition is developing a solution robust enough to handle the significant noise in both the train and test datasets.
With just 12 days left, I want to wish everyone the best of luck in these final stages. Keep pushing forward!
In the spirit of collaboration, I plan to open-source my solution soon. I'm still determining the best way to do this before the competition ends, but I will post an update here when it's available.
Awesome!
Please do not post your solution before the competition ends; rather, post it right after the deadline.
I thought so too, but he has already posted it, so I just had to edit my post. I think it is fine.
I guess it is fine. There are no rules to this game.
Both the public and private test sets have been manually reviewed and corrected to align with the cadastral plans. Minor name discrepancies (such as inclusion or omission of initials) remain, which is why WER was selected as the evaluation metric. The training data may still contain inconsistencies, but the test data have been cleaned to ensure reliable evaluation.
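For anyone unfamiliar with the metric, here is a tiny illustration of why WER tolerates a dropped initial: only the differing words count against the score. The names below are made up, and I'm assuming the jiwer package mirrors how the metric is computed here.

```python
from jiwer import wer

reference = "sophia d ward"     # label with the middle initial
prediction = "sophia ward"      # prediction without it
# One deletion out of three reference words -> WER ≈ 0.33, not a total miss.
print(wer(reference, prediction))
```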