
Barbados Lands and Surveys Plot Automation Challenge

Helping Barbados

$10 000 USD

Completed (7 months ago)

Skills you will learn

Computer Vision

Geospatial Data

Optical Character Recognition

904 joined

179 active


Start: Aug 01, 25 · Close: Oct 19, 25 · Reveal: Oct 20, 25

Joseph_gitau

African center for data science and analytics

OCR Pipeline: WER below 0.13 and Accuracy above 0.99

Notebooks · 7 Oct 2025, 23:07 · 27

Hi all,

If you want to achieve a Word Error Rate (WER) similar to mine, this pipeline should get you there. I used these exact notebooks for my best submission. The main variable you'll need to adjust is the prompt; I recommend experimenting with different phrasings.

Data Preparation Notebook: https://colab.research.google.com/drive/1oxCswRQsCnfGt7hwf-CUCYYYerXBZJfJ?usp=sharing

Base Model Notebook:

https://colab.research.google.com/drive/1oGey1j8ur189rxRzmyyCoTYDLQDzkAo5?usp=sharing
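For anyone who wants to check predictions locally before submitting, here is a minimal sketch of a word-level WER. It assumes the textbook definition (word edit distance divided by the number of reference words) and plain whitespace tokenization; the competition's scorer may normalize or aggregate differently, and `word_error_rate` is a hypothetical helper name, not something from the notebooks.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: whitespace tokenization and case-sensitive comparison.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # only when the hypothesis is empty too.
        return 0.0 if not hyp else 1.0

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("Lennox Reid", "Lennox J Reid"))  # 0.5
```

Under this definition, one extra word against a two-word reference already costs 0.5, so small formatting differences in short fields can weigh heavily on the score.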

Discussion 27 answers

CodeJoe

NO WAY! I am definitely doing this first. You are amazing @Joseph_gitau!!!!

7 Oct 2025, 23:15

Upvotes 1

Knowledge_Seeker101

Freelance

Awesome 👏💯

7 Oct 2025, 23:28

Upvotes 1

Joseph_gitau

African center for data science and analytics

One thing I forgot to mention: you have to remove middle name initials to get the low WER.

You can use my function when running inference.

import re

import numpy as np   # needed for the np.ndarray branch below
import pandas as pd  # needed for pd.isna and pd.Series


def clean_land_surveyor_names(names_array):
    """
    Clean land surveyor names by removing middle initials while preserving:
    - Names that start with initials (like H.A. King)
    - Names with nicknames in quotes
    - Names with compound elements like St. Clair
    - Professional designations like JP
    """

    def clean_single_name(name):
        # Leave missing or blank values untouched
        if pd.isna(name) or not str(name).strip():
            return name

        name = str(name).strip()

        # Skip names that start with initials (like "H.A King" or "H.A. King")
        if re.match(r'^[A-Z]\.?\s*[A-Z]\.?\s+', name):
            return name

        # Skip names with quotes (nicknames like D.C "Vallan" Franklin JP)
        if '"' in name:
            return name

        # Skip single names (like "Simba")
        if len(name.split()) <= 1:
            return name

        # Protect "St." in compound names like "Michelle E. St. Clair"
        name_protected = name.replace(' St. ', ' PROTECTED_ST ')

        # Remove middle initials patterns:
        # 1. Single initials: "Lennox J Reid" → "Lennox Reid"
        name_cleaned = re.sub(r'\s+[A-Z]\.?\s+', ' ', name_protected)
        # 2. Multiple initials: "Jamal K.L. Gaskin" → "Jamal Gaskin"
        name_cleaned = re.sub(r'\s+[A-Z]\.[A-Z]\.?\s+', ' ', name_cleaned)
        # 3. Space-separated initials: "Lee B S Brathwaite" → "Lee Brathwaite"
        name_cleaned = re.sub(r'\s+[A-Z]\s+[A-Z]\s+', ' ', name_cleaned)
        # 4. Complex patterns like "Lee B.S Brathwaite" or "Sekani H.C Franklin"
        name_cleaned = re.sub(r'\s+[A-Z]\.[A-Z]\s+', ' ', name_cleaned)
        # 5. Second pass for single initials: each match in step 1 consumes the
        #    trailing space, so the second of two adjacent initials is skipped
        name_cleaned = re.sub(r'\s+[A-Z]\.?\s+', ' ', name_cleaned)

        # Restore protected "St."
        name_cleaned = name_cleaned.replace(' PROTECTED_ST ', ' St. ')

        # Clean up multiple spaces and trim
        name_cleaned = re.sub(r'\s+', ' ', name_cleaned).strip()
        return name_cleaned

    # Apply cleaning to each name; a bare string is cleaned directly
    if isinstance(names_array, np.ndarray):
        return np.array([clean_single_name(name) for name in names_array])
    elif isinstance(names_array, (list, pd.Series)):
        return [clean_single_name(name) for name in names_array]
    else:
        return clean_single_name(names_array)
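A few assertions on the example names from the comments make it easier to trust this kind of cleaning before running inference. The sketch below is a condensed, token-based stand-in written for illustration only: `strip_middle_initials` and `INITIALS` are hypothetical names, it uses only the standard library, and it may not match the function above on every edge case.

```python
import re

# A token counts as "initials" if it is a run of single capital letters with
# optional dots: "J", "E.", "K.L.", "B.S". Tokens like "St." or "JP" do not match.
INITIALS = re.compile(r'(?:[A-Z]\.)*[A-Z]\.?')


def strip_middle_initials(name: str) -> str:
    """Drop initial-only tokens that sit between the first and last token."""
    tokens = name.split()
    # Nothing to remove from one- or two-token names; quoted nicknames are left alone.
    if len(tokens) <= 2 or '"' in name:
        return name
    # Names that lead with initials (e.g. "H. A. King") are kept as written.
    if INITIALS.fullmatch(tokens[0]):
        return name
    middle = [t for t in tokens[1:-1] if not INITIALS.fullmatch(t)]
    return " ".join([tokens[0], *middle, tokens[-1]])


examples = {
    "Lennox J Reid": "Lennox Reid",
    "Jamal K.L. Gaskin": "Jamal Gaskin",
    "Lee B S Brathwaite": "Lee Brathwaite",
    "Lee B.S Brathwaite": "Lee Brathwaite",
    "Michelle E. St. Clair": "Michelle St. Clair",
    "H.A. King": "H.A. King",
    'D.C "Vallan" Franklin JP': 'D.C "Vallan" Franklin JP',
    "Simba": "Simba",
}
for raw, expected in examples.items():
    assert strip_middle_initials(raw) == expected, (raw, expected)
```

Working on whole tokens avoids the need for a placeholder around "St." and for repeated regex passes, at the cost of never touching the first or last token.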

7 Oct 2025, 23:31

Upvotes 0

Joseph_gitau

African center for data science and analytics

But I guess telling the model that middle name initials are not needed might be a better option. I'm not sure.

replied to Joseph_gitau · 7 Oct 2025, 23:31

Upvotes 0

CodeJoe

That's fine. Thank you for sharing! I think you posted too early though. Now the board is going to be funny😂😂

replied to Joseph_gitau · 7 Oct 2025, 23:45

Upvotes 2

Joseph_gitau

African center for data science and analytics

😂😂 That's the goal. Let's see who can maybe be more creative than the other. Been a learning experience for me. That's a plus for me.

replied to CodeJoe · 7 Oct 2025, 23:46

Upvotes 0

CodeJoe

😂😂😭😭

replied to Joseph_gitau · 7 Oct 2025, 23:47

Upvotes 0

nymfree

😂😂 go all the way and publish the segmentation part to see real chaos

replied to Joseph_gitau · 8 Oct 2025, 05:03

Upvotes 0

crossentropy

Federal university of Technology, Akure

You can help us with that Nymfree🤲😂

replied to nymfree · 8 Oct 2025, 05:51

Upvotes 1

Knowledge_Seeker101

Freelance

🤣🤣🤣😂 wild

replied to nymfree · 8 Oct 2025, 08:58

Upvotes 0

Muhamed_Tuo

Inveniam

@Joseph_gitau

I guess I'll be the one to say this. You've just ruined the efforts of everyone who worked hard to attain that score.

Sharing high-scoring code within 2 weeks of the end of the competition is not acceptable. Many of us, myself included, invested a great amount of time and resources to reach these scores and build that advantage, and now a large part of that has gone to waste. Now, anyone can join the contest and become competitive within a few days. If that was your goal, you could have said so earlier, and I would have sat and waited for you to do so.

I'm really struggling to grasp your thought process behind this. It would have cost you nothing to wait until the end of the competition to share your solution, and everyone would have thanked you for it.

To all those who are ecstatic about this, yes, you might now be competitive. But this is not how you learn and grow. Growth comes from struggling to find a solution or achieving a breakthrough on your own.

8 Oct 2025, 09:51

Upvotes 7

CodeJoe

Honestly, we were also quite surprised and didn't really understand his motive for doing that. But what is done is done. I am extremely sorry for how you feel now, but it is not a wasted effort, @Muhamed_Tuo. Wishing you the best of luck!

replied to Muhamed_Tuo · 8 Oct 2025, 09:57

Upvotes 0

nymfree

Well said @Muhamed_Tuo. Also found it strange as he had been sitting on this solution for around a month.

replied to Muhamed_Tuo · 8 Oct 2025, 10:03

Upvotes 0

Moujoudix

Totally get the frustration @Muhamed_Tuo.

Quick reality check on Zindi rules: "Code must not be shared privately… Any code that is shared must be made available to all participants on the discussion boards." This notebook was shared publicly, so it's within the rules.

On timing/fairness: early sharing can sting, but it raises the floor, not the ceiling. Copying a score ≠ winning: CV choices, feature tweaks, training stability, ensembling, and error analysis still separate leaders.

Upside: faster learning, more creativity, and tougher competition. A strong public baseline helps everyone optimize further: better splits, regularization, augmentation, post-processing, etc.

If we want an embargo on sharing code in the last X weeks, let's propose it for future comps. Until then, this share is rule-compliant; let's build on it, not shut it down.

replied to Muhamed_Tuo · 8 Oct 2025, 11:16

Upvotes 1

Joseph_gitau

African center for data science and analytics

@Muhamed_Tuo, massive respect for the time and effort you put in—those scores are proof of your skill, and no shared notebook can erase the learning from that journey.

For me, this is what community is all about. A shared high-performing notebook doesn't devalue the work done; it accelerates our collective growth. It shifts the challenge from baseline discovery to true innovation.

I also worked hard from the start, and I believe this is how we push the boundaries. We learn best when we build upon shared ideas.

replied to Muhamed_Tuo · 8 Oct 2025, 11:35

Upvotes 1

Bone

I believe what he shared gives a baseline score which is similar to the starter notebook Zindi gives sometimes. I don't see anything wrong with it.

replied to Muhamed_Tuo · 8 Oct 2025, 13:28

Upvotes 0

Koleshjr

Multimedia university of kenya

@Joseph_gitau The issue isn’t with sharing itself, it’s the timing. Sharing resources is great, but not this close to the end of the competition. If it had been shared a month earlier, that would’ve been fair. But right now, all the effort and resources the top competitors invested to reach those scores have basically gone to waste.

@Bone Let’s be honest, the starter notebook’s score and his notebook’s score aren’t even close. Anyway, what’s done is done… let the chaos begin.

May the best win!

replied to Joseph_gitau · 8 Oct 2025, 14:15

Upvotes 2

data_style_bender

Thanks a lot, man!

This is exactly what a real competition should be about — collaboration and learning. Platforms like Kaggle have shown that sharing ideas and code helps everyone grow and push boundaries.

I honestly don’t understand why some people think it ruins the top scorers’ efforts. If they had shared even a bit of their insights, this whole situation probably wouldn’t have happened.

14 Oct 2025, 18:03

Upvotes 3

Joseph_gitau

African center for data science and analytics

I have a second OCR pipeline which I will share later on. It should be better than what I have shared already, but it needs further checks to confirm the structure and functionality.

replied to data_style_bender · 14 Oct 2025, 21:31

Upvotes 0

CodeJoe

@Joseph_gitau Have you realized there's a bug in unsloth?

replied to Joseph_gitau · 14 Oct 2025, 21:34

Upvotes 0

CodeJoe

It shows up when you try to run any inference. Check on your side in case I am wrong.

replied to CodeJoe · 14 Oct 2025, 21:36

Upvotes 0

Joseph_gitau

African center for data science and analytics

Yes, and that is why I have a second pipeline. I knew this would happen. I guess this affects many on the leaderboard. With 5 days to the close, there needs to be a solution for this.

replied to CodeJoe · 14 Oct 2025, 21:36

Upvotes 1