Have you ever wondered how a machine learns to understand a language — not from text typed neatly on a screen, but from real human speech spoken in buses, markets, busy offices, or quiet homes? Speech full of interruptions, background noise, accents, laughter, and hesitation?
That was exactly the challenge in the Your Voice, Your Device, Your Language Challenge — and the journey that led me to build a lightweight, noise‑robust, edge‑deployable Swahili ASR model that ultimately achieved one of the top performances in the competition.
My name is Abdourahamane Ide Salifou, and this is how I approached the problem, the technical decisions I made, and the lessons I learned while building a practical speech recognition system for one of Africa’s most widely spoken languages.
Swahili is spoken by tens of millions across East Africa, yet high‑quality open ASR tools for the language remain limited. The challenge invited participants to build a solution capable of accurately transcribing real-world Swahili speech while remaining light enough to run on edge devices.
A major twist? No training dataset was provided. Participants had to source their own data and build their training pipelines creatively.
This constraint shaped my entire approach.
One of the first lessons I embraced was that speech recognition models trained only on clean, studio‑recorded audio rarely succeed in real life. Human conversations are messy — full of overlaps, pauses, filler sounds, and background noise.
So instead of using traditional clean datasets, I chose a corpus that reflects natural Swahili speech.
Dataset Used: Sunbird/salt
A multi‑speaker, conversational Swahili dataset that captures the messiness of natural speech: multiple voices, pauses, and spontaneous delivery.
This dataset aligned perfectly with the challenge requirement: "Build a model that works in real-world conditions."
Training on this type of data allowed the model to generalise better, improving robustness in situations like phone calls, student interviews, and outdoor conversations.
Even with conversational speech, real deployments involve unpredictable noise, such as motorbikes, marketplace chatter, wind, and children playing in the background.
To prepare the model, I used a targeted noise augmentation strategy.
Noise pool: Sunbird/urban-noise-uganda-61k
This dataset includes tens of thousands of real urban noise recordings from Uganda: exactly the kind of acoustic clutter the model would face in deployment.
Instead of preprocessing the entire dataset with noise up front, I built a custom DataCollator that mixes a randomly selected noise clip into each utterance on the fly, at a randomly drawn signal-to-noise ratio.
This dynamic augmentation ensures that every training batch is unique, preventing the model from overfitting to any specific noise pattern.
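Here is a minimal sketch of what such a collator can look like. The class name, the 5–20 dB SNR range, and the assumption that noise clips arrive as 16 kHz float arrays are all illustrative choices, not the exact competition code; label tokenization is omitted for brevity.

```python
import random

import numpy as np


class NoiseAugmentCollator:
    """Mixes a random noise clip into each utterance at a random SNR.

    NOTE: `noise_clips` is assumed to be a list of 1-D float32 arrays at
    16 kHz; the 5-20 dB SNR range is an illustrative default.
    """

    def __init__(self, processor, noise_clips, snr_db=(5.0, 20.0)):
        self.processor = processor
        self.noise_clips = noise_clips
        self.snr_db = snr_db

    def _add_noise(self, speech):
        noise = random.choice(self.noise_clips)
        # Tile or trim the noise clip so it covers the whole utterance.
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)[: len(speech)]
        # Scale the noise to hit a randomly drawn signal-to-noise ratio.
        snr = random.uniform(*self.snr_db)
        p_speech = float(np.mean(speech**2)) + 1e-10
        p_noise = float(np.mean(noise**2)) + 1e-10
        gain = np.sqrt(p_speech / (p_noise * 10 ** (snr / 10)))
        return speech + gain * noise

    def __call__(self, features):
        # Draw fresh noise per sample, per batch: no two epochs ever see
        # the same corrupted audio.
        audio = [
            self._add_noise(np.asarray(f["audio"]["array"], dtype=np.float32))
            for f in features
        ]
        return self.processor(audio, sampling_rate=16000, return_tensors="pt")
```

Mixing at a controlled SNR, rather than at a fixed gain, keeps the speech intelligible while still varying how difficult each training example is.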
Before any model training, I conducted an analysis of the inference dataset (Sartify_ITU_Zindi_Testdataset).
Key findings: the test audio was already at 16 kHz and single-channel, exactly the input format Whisper expects.
This simplified the pipeline: no resampling, no channel conversion. Understanding these characteristics early helped avoid format mismatches and inference bugs later.
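An audit like this takes only a few lines. A sketch using soundfile, with placeholder file names:

```python
import soundfile as sf

# Placeholder file names; in practice, iterate over the real test set.
for path in ["test_0001.wav", "test_0002.wav"]:
    info = sf.info(path)
    print(path, info.samplerate, info.channels, round(info.duration, 2))
```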
With the right data strategy in place, the next question was: Which architecture can deliver accuracy without sacrificing efficiency?
I selected Abdoul27/whisper-turbo-v3-model, a distilled variant of Whisper.
Why this model?
1. Strong Swahili foundation
It was already trained on the Common Voice 12 Swahili dataset, giving the model a solid grounding in Swahili vocabulary and pronunciation.
2. Distilled = smaller, faster
This “turbo” version preserves Whisper’s strengths while drastically reducing size, making it ideal for edge deployment, fast inference, and training under a tight memory budget.
This challenge had strict resource constraints. Training Whisper on a regular GPU is expensive, but training it on a single NVIDIA T4 (≤16 GB VRAM)? That required creativity.
I combined two powerful techniques:
A. Parameter-Efficient Fine-Tuning (PEFT) with LoRA
Instead of updating all model weights, LoRA freezes the base model and trains small, low-rank adapter matrices injected into selected layers (typically the attention projections).
This approach drastically reduces memory usage while preserving model knowledge.
B. 8-bit Quantization
Loading the model in 8-bit format roughly halves the weights’ memory footprint relative to fp16, freeing VRAM for activations and optimizer state.
These two methods made training Whisper Turbo feasible under tight constraints.
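A minimal sketch of how the two pieces fit together with transformers and peft follows; the LoRA rank, alpha, dropout, and target modules are illustrative values, not my exact configuration.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration

# B: load the base checkpoint with 8-bit weights so it fits in T4 VRAM.
model = WhisperForConditionalGeneration.from_pretrained(
    "Abdoul27/whisper-turbo-v3-model",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # fp32 norms, grad hooks

# A: attach small trainable LoRA adapters to the attention projections.
lora_config = LoraConfig(
    r=32,                                  # illustrative rank
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```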
To push the model further within hardware limits, I also tuned the training recipe itself, including the learning-rate schedule.
To avoid overfitting and hallucination, I monitored Word Error Rate (WER) closely. Only the checkpoint with the lowest validation WER was kept.
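A sketch of that checkpoint-selection logic with the Hugging Face Seq2SeqTrainer machinery, assuming a WhisperProcessor named processor from the data pipeline above; all argument values are illustrative:

```python
import evaluate
from transformers import Seq2SeqTrainingArguments

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    # Restore the pad token where -100 was used to mask the loss.
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred.predictions, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)
    return {"wer": 100 * wer_metric.compute(predictions=pred_str, references=label_str)}

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-swahili-lora",
    eval_strategy="steps",
    save_strategy="steps",
    predict_with_generate=True,
    load_best_model_at_end=True,     # keep the checkpoint with lowest WER
    metric_for_best_model="wer",
    greater_is_better=False,
)
```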
For inference, Whisper supports both greedy and beam search decoding.
I used beam search.
Beam search explores multiple hypotheses before deciding on the transcription, avoiding early mistakes that greedy decoding often makes.
This improved accuracy at only a small cost to speed.
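In code, the switch is a single argument to generate. A minimal sketch, assuming input_features comes from the processor; num_beams=5 is an illustrative value:

```python
import torch

# input_features: a batch of log-mel spectrograms from the WhisperProcessor.
with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        num_beams=5,            # beam search; num_beams=1 would be greedy
        language="sw",
        task="transcribe",
    )
texts = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```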
The final model demonstrated strong performance, finishing among the top entries in the competition.
Given the challenge constraints and the real-world nature of the data, these results validated the overall approach.
Building this system taught me that developing ASR for African languages requires creative data sourcing, deliberate noise robustness, and efficiency-minded engineering at every step.
This lightweight Swahili ASR model shows that high-performance speech technology for low-resource languages is absolutely achievable — even on constrained hardware.
About Abdourahamane
I am Abdourahamane Ide Salifou, a graduate student in Engineering Artificial Intelligence passionate about the intersection of language, accessibility, and machine learning. I enjoy building practical AI systems — especially those designed for low-resource environments where efficiency and inclusivity matter most.
This challenge was an opportunity to apply that philosophy, and I am excited for what comes next.