Hi all,
If you want to achieve a Word Error Rate (WER) similar to mine, this pipeline should get you there. I used these exact notebooks for my best submission. The main variable you'll need to adjust is the prompt; I recommend experimenting with different phrasings.
Data Preparation Notebook: https://colab.research.google.com/drive/1oxCswRQsCnfGt7hwf-CUCYYYerXBZJfJ?usp=sharing
Base Model Notebook:
https://colab.research.google.com/drive/1oGey1j8ur189rxRzmyyCoTYDLQDzkAo5?usp=sharing
NO WAY! I am definitely doing this first. You are amazing @Joseph_gitau!!!!
Awesome 👏💯
One addition I forgot: you have to remove middle-name initials to get the low WER.
You can use my function when running inference.
name_protected = name.replace(' St. ', ' PROTECTED_ST ')
name_cleaned = name_cleaned.replace(' PROTECTED_ST ', ' St. ')
But I guess telling the model that middle name initials are not needed would be a better option. I don't know.
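For anyone who wants to try this, here is a minimal sketch of how the whole cleaning step might look. Only the two `replace` calls that protect and restore ' St. ' come from the post above; the function name `remove_middle_initials`, the regex for what counts as a middle initial (a single capital letter, with or without a period, standing between two other words), and the whitespace cleanup are assumptions for illustration, not the code from the notebooks.

```python
import re

# Assumed definition of a middle initial: one capital letter, optionally
# followed by a period, with whitespace on both sides. Not from the post.
MIDDLE_INITIAL = re.compile(r"(?<=\s)[A-Z]\.?(?=\s)")


def remove_middle_initials(name: str) -> str:
    """Drop middle-name initials from a name, leaving ' St. ' untouched."""
    # Protect ' St. ' before cleaning, as in the post.
    name_protected = name.replace(" St. ", " PROTECTED_ST ")
    # Hypothetical step: strip the initials, then collapse leftover spaces.
    name_cleaned = MIDDLE_INITIAL.sub("", name_protected)
    name_cleaned = re.sub(r"\s+", " ", name_cleaned).strip()
    # Restore ' St. ' after cleaning, as in the post.
    return name_cleaned.replace(" PROTECTED_ST ", " St. ")


if __name__ == "__main__":
    for raw in ["John A. Smith", "Jane B Doe", "Mary St. Clair"]:
        print(raw, "->", remove_middle_initials(raw))
```

Under this pattern a leading initial ("J. Smith") or a trailing one is left alone, since only a letter with words on both sides is treated as a middle initial. Whether that matches the competition's ground-truth labels is something to check against your own validation data.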
That's fine. Thank you for sharing! I think you posted too early though. Now the board is going to be funny😂😂
😂😂 That's the goal. Let's see who can maybe be more creative than the other. Been a learning experience for me. That's a plus for me.
😂😂😭😭
😂😂 go all the way and publish the segmentation part to see real chaos
You can help us with that Nymfree🤲😂
🤣🤣🤣😂 wild
@Joseph_gitau
I guess I'll be the one to say this. You've just ruined the efforts of everyone who worked hard to attain that score.
Sharing high-scoring code within 2 weeks of the end of the competition is not acceptable. Many of us, myself included, invested a great amount of time and resources to reach these scores and build that advantage, and now a large part of that has gone to waste. Now, anyone can join the contest and become competitive within a few days. If that was your goal, you could have said so earlier, and I would have sat and waited for you to do so.
I'm really struggling to grasp your thought process behind this. It would have cost you nothing to wait until the end of the competition to share your solution, and everyone would have thanked you for it.
To all those who are ecstatic about this, yes, you might now be competitive. But this is not how you learn and grow. Growth comes from struggling to find a solution or achieving a breakthrough on your own.
Honestly, we were also quite surprised and didn't really understand his motive for doing that. But what is done is done. I am extremely sorry for how you feel now, but it is not a wasted effort, @Muhamed_Tuo. Wishing you the best of luck!
Well said @Muhamed_Tuo. Also found it strange as he had been sitting on this solution for around a month.
Totally get the frustration @Muhamed_Tuo.
Quick reality check on Zindi rules: "Code must not be shared privately… Any code that is shared must be made available to all participants on the discussion boards." This notebook was shared publicly, so it's within the rules.
On timing/fairness: early sharing can sting, but it raises the floor, not the ceiling. Copying a score ≠ winning: CV choices, feature tweaks, training stability, ensembling, and error analysis still separate leaders.
Upside: faster learning, more creativity, and tougher competition. A strong public baseline helps everyone optimize further: better splits/regularization/augmentation/post-processing, etc.
If we want an embargo on sharing code in the last X weeks, let's propose it for future comps. Until then, this share is rule-compliant; let's build on it, not shut it down.
@Muhamed_Tuo, massive respect for the time and effort you put in—those scores are proof of your skill, and no shared notebook can erase the learning from that journey.
For me, this is what community is all about. A shared high-performing notebook doesn't devalue the work done; it accelerates our collective growth. It shifts the challenge from baseline discovery to true innovation.
I also worked hard from the start, and I believe this is how we push the boundaries. We learn best when we build upon shared ideas.
I believe what he shared gives a baseline score which is similar to the starter notebook Zindi gives sometimes. I don't see anything wrong with it.
@Joseph_gitau The issue isn’t with sharing itself, it’s the timing. Sharing resources is great, but not this close to the end of the competition. If it had been shared a month earlier, that would’ve been fair. But right now, all the effort and resources the top competitors invested to reach those scores have basically gone to waste.
@Bone Let’s be honest, the starter notebook’s score and his notebook’s score aren’t even close. Anyway, what’s done is done… let the chaos begin
May the best win!
Thanks a lot, man!
This is exactly what a real competition should be about — collaboration and learning. Platforms like Kaggle have shown that sharing ideas and code helps everyone grow and push boundaries.
I honestly don’t understand why some people think it ruins the top scorers’ efforts. If they had shared even a bit of their insights, this whole situation probably wouldn’t have happened.
I have got a second OCR pipeline which I will share later on. It should be better than what I have shared already, but it needs further checks to confirm the structure and functionality.
@Joseph_gitau Have you realized there's a bug in unsloth?
It shows up when you try any inference. Check on your side in case I am wrong.
Yes, and that's why I have got a second pipeline. I knew this would happen. I guess this affects many on the leaderboard. With 5 days to the close, there needs to be a solution for this.
Alright that's fine. I think everyone should be aware
How can we make them aware? Would a separate post about it, or a reply in this thread, be OK?
Let me post on a new thread. Or?
Should be ok.
Done!