
Barbados Lands and Surveys Plot Automation Challenge

Helping Barbados
$10 000 USD
Completed (5 months ago)
Computer Vision
Geospatial Data
Optical Character Recognition
895 joined
179 active
Start: Aug 01, 25
Close: Oct 19, 25
Reveal: Oct 20, 25
Joseph_gitau
African center for data science and analytics
OCR Pipeline: WER below 0.13 and Accuracy above 0.99
Notebooks · 7 Oct 2025, 23:07 · 27

Hi all,

If you want to achieve a Word Error Rate (WER) similar to mine, this pipeline should get you there. I used these exact notebooks for my best submission. The main variable you'll need to adjust is the prompt; I recommend experimenting with different phrasings.

Data Preparation Notebook: https://colab.research.google.com/drive/1oxCswRQsCnfGt7hwf-CUCYYYerXBZJfJ?usp=sharing

Base Model Notebook:

https://colab.research.google.com/drive/1oGey1j8ur189rxRzmyyCoTYDLQDzkAo5?usp=sharing

Discussion 27 answers
CodeJoe

NO WAY! I am definitely doing this first. You are amazing @Joseph_gitau!!!!

7 Oct 2025, 23:15
Upvotes 1
Knowledge_Seeker101
Freelance

Awesome 👏💯

7 Oct 2025, 23:28
Upvotes 1
Joseph_gitau
African center for data science and analytics

One addition I forgot: you have to remove middle-name initials to get the low WER.

You can use my function when running inference.

import re

import numpy as np
import pandas as pd

def clean_land_surveyor_names(names_array):
    """
    Clean land surveyor names by removing middle initials while preserving:
    - Names that start with initials (like H.A. King)
    - Names with nicknames in quotes
    - Names with compound elements like St. Clair
    - Professional designations like JP
    """

    def clean_single_name(name):
        if pd.isna(name) or not name or name.strip() == '':
            return name

        name = str(name).strip()

        # Skip names that start with initials (like "H.A King" or "H.A. King")
        if re.match(r'^[A-Z]\.?\s*[A-Z]\.?\s+', name):
            return name

        # Skip names with quotes (nicknames like D.C "Vallan" Franklin JP)
        if '"' in name:
            return name

        # Skip single names (like "Simba")
        if len(name.split()) <= 1:
            return name

        # Protect "St." in compound names like "Michelle E. St. Clair"
        name_protected = name.replace(' St. ', ' PROTECTED_ST ')

        # Remove middle initials patterns:
        # 1. Single initials: "Lennox J Reid" → "Lennox Reid"
        name_cleaned = re.sub(r'\s+[A-Z]\.?\s+', ' ', name_protected)

        # 2. Multiple initials: "Jamal K.L. Gaskin" → "Jamal Gaskin"
        name_cleaned = re.sub(r'\s+[A-Z]\.[A-Z]\.?\s+', ' ', name_cleaned)

        # 3. Space-separated initials: "Lee B S Brathwaite" → "Lee Brathwaite"
        name_cleaned = re.sub(r'\s+[A-Z]\s+[A-Z]\s+', ' ', name_cleaned)

        # 4. Complex patterns like "Lee B.S Brathwaite" or "Sekani H.C Franklin"
        name_cleaned = re.sub(r'\s+[A-Z]\.[A-Z]\s+', ' ', name_cleaned)

        # 5. Handle remaining single initials that might be left
        name_cleaned = re.sub(r'\s+[A-Z]\.?\s+', ' ', name_cleaned)

        # Restore protected "St."
        name_cleaned = name_cleaned.replace(' PROTECTED_ST ', ' St. ')

        # Clean up multiple spaces and trim
        name_cleaned = re.sub(r'\s+', ' ', name_cleaned).strip()

        return name_cleaned

    # Apply cleaning to each name in the array
    if isinstance(names_array, np.ndarray):
        return np.array([clean_single_name(name) for name in names_array])
    elif isinstance(names_array, (list, pd.Series)):
        return [clean_single_name(name) for name in names_array]
    else:
        return clean_single_name(names_array)
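
For anyone wondering why this matters for the metric, here is a minimal sketch. It assumes the competition's WER is the usual word-level edit distance divided by the number of reference words (the thread does not spell out the exact scoring), and that the labels omit middle initials. The name pair is a made-up example, and `strip_single_middle_initial` is a simplified stand-in for the function above, not a replacement for it.

```python
import re


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Dynamic-programming Levenshtein distance over words,
    # keeping only the previous row of the table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing
                            curr[j - 1] + 1,      # extra predicted word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def strip_single_middle_initial(name: str) -> str:
    """Simplified stand-in: removes only single initials between words."""
    return re.sub(r"\s+[A-Z]\.?\s+", " ", name).strip()


# Hypothetical label/prediction pair, for illustration only.
reference = "Lennox Reid"
prediction = "Lennox J Reid"

raw_wer = word_error_rate(reference, prediction)
clean_wer = word_error_rate(reference, strip_single_middle_initial(prediction))
print(raw_wer, clean_wer)
```

On this toy pair the single extra initial costs 0.5 WER, and the cleaned prediction scores 0.0, which is why one stray token per name adds up quickly across a test set.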

7 Oct 2025, 23:31
Upvotes 0
Joseph_gitau
African center for data science and analytics

But I guess telling the model that middle-name initials are not needed would be a better option. I'm not sure.

CodeJoe

That's fine. Thank you for sharing! I think you posted too early though. Now the board is going to be funny😂😂

Joseph_gitau
African center for data science and analytics

😂😂 That's the goal. Let's see who can be more creative. It's been a learning experience for me, and that's a plus.

CodeJoe

😂😂😭😭

nymfree

😂😂 go all the way and publish the segmentation part to see real chaos

crossentropy
Federal university of Technology, Akure

You can help us with that Nymfree🤲😂

Knowledge_Seeker101
Freelance

🤣🤣🤣😂 wild

Muhamed_Tuo
Inveniam

@Joseph_gitau

I guess I'll be the one to say this. You've just ruined the efforts of everyone who worked hard to attain that score.

Sharing high-scoring code within two weeks of the end of the competition is not acceptable. Many of us, myself included, invested a great amount of time and resources to reach these scores and build that advantage, and now a large part of that has gone to waste. Now anyone can join the contest and become competitive within a few days. If that was your goal, you could have said so earlier, and I would have sat and waited for you to do so.

I'm really struggling to grasp your thought process behind this. It would have cost you nothing to wait until the end of the competition to share your solution, and everyone would have thanked you for it.

To all those who are ecstatic about this, yes, you might now be competitive. But this is not how you learn and grow. Growth comes from struggling to find a solution or achieving a breakthrough on your own.

8 Oct 2025, 09:51
Upvotes 7
CodeJoe

Honestly, we were also quite surprised and didn't really understand his motive for doing that. But what is done is done. I am extremely sorry for how you feel now, but it is not a wasted effort, @Muhamed_Tuo. Wishing you the best of luck!

nymfree

Well said @Muhamed_Tuo. Also found it strange as he had been sitting on this solution for around a month.

Moujoudix

Totally get the frustration @Muhamed_Tuo.

Quick reality check on Zindi rules: "Code must not be shared privately… Any code that is shared must be made available to all participants on the discussion boards." This notebook was shared publicly, so it's within the rules.

On timing/fairness: early sharing can sting, but it raises the floor, not the ceiling. Copying a score ≠ winning: CV choices, feature tweaks, training stability, ensembling, and error analysis still separate the leaders.

Upside: faster learning, more creativity, and tougher competition. A strong public baseline helps everyone optimize further: better splits, regularization, augmentation, post-processing, etc.

If we want an embargo on sharing code in the last X weeks, let's propose it for future comps. Until then, this share is rule-compliant; let's build on it, not shut it down.

Joseph_gitau
African center for data science and analytics

@Muhamed_Tuo, massive respect for the time and effort you put in—those scores are proof of your skill, and no shared notebook can erase the learning from that journey.

For me, this is what community is all about. A shared high-performing notebook doesn't devalue the work done; it accelerates our collective growth. It shifts the challenge from baseline discovery to true innovation.

I also worked hard from the start, and I believe this is how we push the boundaries. We learn best when we build upon shared ideas.

I believe what he shared gives a baseline score which is similar to the starter notebook Zindi gives sometimes. I don't see anything wrong with it.

Koleshjr
Multimedia university of kenya

@Joseph_gitau The issue isn’t with sharing itself, it’s the timing. Sharing resources is great, but not this close to the end of the competition. If it had been shared a month earlier, that would’ve been fair. But right now, all the effort and resources the top competitors invested to reach those scores have basically gone to waste.

@Bone Let’s be honest: the starter notebook’s score and his notebook’s score aren’t even close. Anyway, what’s done is done… let the chaos begin.

May the best win!

data_style_bender

Thanks a lot, man!

This is exactly what a real competition should be about — collaboration and learning. Platforms like Kaggle have shown that sharing ideas and code helps everyone grow and push boundaries.

I honestly don’t understand why some people think it ruins the top scorers’ efforts. If they had shared even a bit of their insights, this whole situation probably wouldn’t have happened.

14 Oct 2025, 18:03
Upvotes 3
Joseph_gitau
African center for data science and analytics

I have a second OCR pipeline, which I will share later on. It should be better than what I have shared already, but it needs further checks to confirm the structure and functionality.

CodeJoe

@Joseph_gitau Have you realized there's a bug in unsloth?

CodeJoe

It shows up when you run any inference. Check on your side in case I am wrong.

Joseph_gitau
African center for data science and analytics

Yes, and that is why I have a second pipeline. I knew this would happen. I guess this affects many on the leaderboard. With 5 days to the close, there needs to be a solution for this.

CodeJoe

Alright, that's fine. I think everyone should be aware.

Joseph_gitau
African center for data science and analytics

How can we make them aware? Would a separate post or a reply in this thread be OK?

CodeJoe

Let me post on a new thread. Or?

Joseph_gitau
African center for data science and analytics

Should be ok.

CodeJoe

Done!