Winning Solution to the Swahili ASR Challenge
Meet the winners · 20 Nov 2025, 03:31 · 5 mins read

Have you ever wondered how a machine learns to understand a language — not from text typed neatly on a screen, but from real human speech spoken in buses, markets, busy offices, or quiet homes? Speech full of interruptions, background noise, accents, laughter, and hesitation?

That was exactly the challenge in the Your Voice, Your Device, Your Language Challenge — and the journey that led me to build a lightweight, noise‑robust, edge‑deployable Swahili ASR model that ultimately achieved one of the top performances in the competition.

My name is Abdourahamane Ide Salifou, and this is how I approached the problem, the technical decisions I made, and the lessons I learned while building a practical speech recognition system for one of Africa’s most widely spoken languages.

Follow along on my GitHub.

Swahili is spoken by tens of millions across East Africa, yet high‑quality open ASR tools for the language remain limited. The challenge invited participants to build a solution capable of:

  • Handling real conversational speech
  • Running efficiently on low‑resource hardware
  • Preserving privacy by enabling on‑device inference
  • Maintaining strong accuracy despite noise and speaker variability

A major twist? No training dataset was provided. Participants had to source their own data and build their training pipeline creatively.

This constraint shaped my entire approach.

📚 1. Building the Dataset: Going Beyond "Read Speech"

One of the first lessons I embraced was that speech recognition models trained only on clean, studio‑recorded audio rarely succeed in real life. Human conversations are messy — full of overlaps, pauses, filler sounds, and background noise.

So instead of using traditional clean datasets, I chose a corpus that reflects natural Swahili speech.

Dataset Used: Sunbird/salt

A multi‑speaker, conversational Swahili dataset containing:

  • Spontaneous dialogue
  • Speaker variability
  • Realistic pacing and prosody
  • Environmental variability

This dataset aligned perfectly with the challenge requirement: "Build a model that works in real-world conditions."

Training on this type of data allowed the model to generalise better, improving robustness in situations like phone calls, student interviews, and outdoor conversations.
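For context, loading and resampling this corpus with the Hugging Face datasets library looks roughly like the sketch below. The config name, split, and column names are assumptions; the exact Swahili subset identifier should be taken from the Sunbird/salt dataset card.

```python
from datasets import load_dataset, Audio

# "multispeaker-swa" is a placeholder config name; check the
# Sunbird/salt dataset card for the actual Swahili subset identifier.
salt = load_dataset("Sunbird/salt", "multispeaker-swa", split="train")

# Whisper expects 16 kHz mono audio, so cast the audio column accordingly.
salt = salt.cast_column("audio", Audio(sampling_rate=16_000))
print(salt[0]["audio"]["array"].shape)
```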

🔊 2. Noise Augmentation: Training the Model to Survive Reality

Even with conversational speech, real deployments involve unpredictable noise, such as motorbikes, marketplace chatter, wind, and children playing in the background.

To prepare the model, I used a targeted noise augmentation strategy.

Noise pool: Sunbird/urban-noise-uganda-61k

This dataset includes:

  • Market sounds
  • Traffic
  • People speaking in the background
  • Varying acoustic profiles from East African cities

Instead of preprocessing the entire dataset with noise, I built a custom DataCollator that:

  • Randomly selects a noise clip
  • Overlays it directly on each speech waveform
  • Randomizes the amplitude to simulate different SNR conditions

This dynamic augmentation ensures that every training batch is unique, preventing the model from overfitting to any specific noise pattern.
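A minimal sketch of what such a collator can look like is shown below. It assumes the noise clips have already been loaded as 16 kHz NumPy arrays and that each example carries "audio" and "text" fields; the class and variable names are illustrative rather than the exact competition code.

```python
import random
import numpy as np

class NoisyDataCollator:
    """Mixes a randomly chosen noise clip into each waveform at a
    random amplitude before extracting Whisper input features."""

    def __init__(self, processor, noise_pool, noise_scale=(0.05, 0.3)):
        self.processor = processor      # a WhisperProcessor
        self.noise_pool = noise_pool    # list of 1-D float32 arrays at 16 kHz
        self.noise_scale = noise_scale  # relative noise amplitude range

    def _add_noise(self, speech):
        noise = random.choice(self.noise_pool)
        # Tile or trim the noise clip so it matches the speech length
        if len(noise) < len(speech):
            noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
        noise = noise[: len(speech)]
        return speech + random.uniform(*self.noise_scale) * noise

    def __call__(self, features):
        audio = [self._add_noise(f["audio"]["array"]) for f in features]
        batch = self.processor.feature_extractor(
            audio, sampling_rate=16_000, return_tensors="pt"
        )
        labels = self.processor.tokenizer(
            [f["text"] for f in features], return_tensors="pt", padding=True
        )
        # Replace padding token ids with -100 so they are ignored by the loss
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100
        )
        return batch
```

Because the noise clip and its amplitude are drawn fresh for every example, the same utterance is never heard twice under identical conditions.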

3. Audio Profiling: Understanding the Test Data

Before any model training, I conducted an analysis of the inference dataset (Sartify_ITU_Zindi_Testdataset).

Key findings:

  • All files were 16 kHz
  • All were mono-channel
  • Average duration: 6.08 seconds

This simplified the pipeline — no resampling, no channel conversion. Understanding these characteristics early helped avoid mismatches and inference bugs later.
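A profiling pass along these lines (using the soundfile library, with a placeholder directory path) is enough to confirm those properties before committing to a preprocessing pipeline.

```python
import glob
import numpy as np
import soundfile as sf

durations, rates, channels = [], set(), set()
for path in glob.glob("test_audio/*.wav"):   # placeholder path to the test set
    info = sf.info(path)
    durations.append(info.frames / info.samplerate)
    rates.add(info.samplerate)
    channels.add(info.channels)

print("sample rates:", rates)                # expected {16000}
print("channels:", channels)                 # expected {1}
print("mean duration (s):", round(float(np.mean(durations)), 2))
```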

4. Model Selection: Whisper, But Turbo

With the right data strategy in place, the next question was: Which architecture can deliver accuracy without sacrificing efficiency?

I selected Abdoul27/whisper-turbo-v3-model, a distilled variant of Whisper.

Why this model?

1. Strong Swahili foundation

It was already trained on the Common Voice 12 Swahili dataset, giving the model:

  • Phonetic understanding
  • Vocabulary grounding
  • Accent robustness

2. Distilled = smaller, faster

This “turbo” version preserves Whisper’s strengths while drastically reducing size, making it ideal for:

  • Real-time inference
  • Low-memory GPU environments
  • Edge deployment scenarios

5. Fine-Tuning Strategy: Making a Big Model Fit on a Small GPU

This challenge had strict resource constraints. Training Whisper on a regular GPU is expensive, but training it on a single NVIDIA T4 (≤16 GB VRAM)? That required creativity.

I combined two powerful techniques:

A. Parameter-Efficient Fine-Tuning (PEFT) with LoRA

Instead of updating all model weights, LoRA:

  • Freezes core Whisper parameters
  • Injects small trainable matrices into attention layers
  • Trains only these lightweight adapters

This approach drastically reduces memory usage while preserving model knowledge.

B. 8-bit Quantization

Loading the model in 8-bit format:

  • Cut memory footprint significantly
  • Allowed larger batch sizes
  • Accelerated attention computations

These two methods made training Whisper Turbo feasible under tight constraints.
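Put together, the setup looks roughly like this. The LoRA hyperparameters shown (rank, alpha, dropout, target modules) are typical values for Whisper fine-tuning, not necessarily the exact ones from the winning run.

```python
from transformers import WhisperForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 8-bit to cut its memory footprint
model = WhisperForConditionalGeneration.from_pretrained(
    "Abdoul27/whisper-turbo-v3-model",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Inject small trainable adapters into the attention projections only
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of weights train
```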

6. Memory Optimization & Training Settings

To push the model further within hardware limits, I employed:

  • Gradient checkpointing — reduce activation memory
  • Mixed precision (fp16) — faster math
  • Batch size = 4 (max that fits)
  • Gradient accumulation = 2 (effective batch = 8)

Learning schedule:

  • Learning rate: 1e-5
  • Warmup: 500 steps
  • Epochs: 1 (enough due to strong initialization + data size)

To avoid overfitting and hallucination, I monitored Word Error Rate (WER) closely. Only the checkpoint with the lowest validation WER was kept.
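In Hugging Face terms, these settings translate into training arguments along the following lines; the output path and evaluation cadence are placeholders, and argument names follow recent transformers releases.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-swahili-lora",   # placeholder output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,         # effective batch size of 8
    learning_rate=1e-5,
    warmup_steps=500,
    num_train_epochs=1,
    fp16=True,
    gradient_checkpointing=True,
    predict_with_generate=True,            # needed to compute WER during eval
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    metric_for_best_model="wer",
    greater_is_better=False,               # lower WER is better
    load_best_model_at_end=True,
)
```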

7. Inference: Smarter Decoding With Beam Search

For inference, Whisper supports both greedy and beam search decoding.

I used:

  • num_beams = 3
  • repetition_penalty = 1.2

Beam search explores multiple hypotheses before deciding on the transcription, avoiding early mistakes that greedy decoding often makes.

This improved accuracy at only a small cost to speed.
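A minimal decoding sketch is shown below, assuming processor and model are the fine-tuned objects from the previous steps and waveform is a 16 kHz mono array; forcing the language and task tokens is an assumption here, not a confirmed detail of the submission.

```python
import torch

inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features.to(model.device),
        num_beams=3,              # explore three hypotheses in parallel
        repetition_penalty=1.2,   # discourage looping, repeated phrases
        language="sw",
        task="transcribe",
    )
transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcript)
```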

📊 8. Results

The final model demonstrated strong performance:

  • Public leaderboard WER: 18.22
  • Private leaderboard WER: 17.81
  • Inference time: ~1.24 seconds per audio file
  • Full test set (4,089 files): 1 hour, 24 minutes, 29 seconds

Given the challenge constraints and the real-world nature of the data, these results validated the approach thoroughly.

🧭 Final Thoughts

Building this system taught me that developing ASR for African languages requires:

  • thoughtful dataset choices,
  • robustness-focused augmentation,
  • hardware-aware optimization,
  • and careful fine-tuning of powerful architectures.

This lightweight Swahili ASR model shows that high-performance speech technology for low-resource languages is absolutely achievable — even on constrained hardware.

About Abdourahamane

I am Abdourahamane Ide Salifou, a graduate student in Engineering Artificial Intelligence passionate about the intersection of language, accessibility, and machine learning. I enjoy building practical AI systems — especially those designed for low-resource environments where efficiency and inclusivity matter most.

This challenge was an opportunity to apply that philosophy, and I am excited for what comes next.
