Winning Solution to the Swahili ASR Challenge
Meet the winners · 20 Nov 2025, 03:31 · 5 mins read

Have you ever wondered how a machine learns to understand a language — not from text typed neatly on a screen, but from real human speech spoken in buses, markets, busy offices, or quiet homes? Speech full of interruptions, background noise, accents, laughter, and hesitation?

That was exactly the challenge in the Your Voice, Your Device, Your Language Challenge — and the journey that led me to build a lightweight, noise‑robust, edge‑deployable Swahili ASR model that ultimately achieved one of the top performances in the competition.

My name is Abdourahamane Ide Salifou, and this is how I approached the problem, the technical decisions I made, and the lessons I learned while building a practical speech recognition system for one of Africa’s most widely spoken languages.

Follow along on my GitHub.

Swahili is spoken by tens of millions across East Africa, yet high‑quality open ASR tools for the language remain limited. The challenge invited participants to build a solution capable of:

  • Handling real conversational speech
  • Running efficiently on low‑resource hardware
  • Preserving privacy by enabling on‑device inference
  • Maintaining strong accuracy despite noise and speaker variability

A major twist? No training dataset was provided. Participants had to source their own data and build their training pipeline creatively.

This constraint shaped my entire approach.

📚 1. Building the Dataset: Going Beyond "Read Speech"

One of the first lessons I embraced was that speech recognition models trained only on clean, studio‑recorded audio rarely succeed in real life. Human conversations are messy — full of overlaps, pauses, filler sounds, and background noise.

So instead of using traditional clean datasets, I chose a corpus that reflects natural Swahili speech.

Dataset Used: Sunbird/salt

A multi‑speaker, conversational Swahili dataset containing:

  • Spontaneous dialogue
  • Speaker variability
  • Realistic pacing and prosody
  • Environmental variability

This dataset aligned perfectly with the challenge requirement: "Build a model that works in real-world conditions."

Training on this type of data allowed the model to generalise better, improving robustness in situations like phone calls, student interviews, and outdoor conversations.
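For context, loading and resampling this corpus with the Hugging Face datasets library looks roughly like the sketch below. The config name, split, and column names are assumptions; the exact Swahili subset identifier should be taken from the Sunbird/salt dataset card.

```python
from datasets import load_dataset, Audio

# "multispeaker-swa" is a placeholder config name; check the
# Sunbird/salt dataset card for the actual Swahili subset identifier.
salt = load_dataset("Sunbird/salt", "multispeaker-swa", split="train")

# Whisper expects 16 kHz mono audio, so cast the audio column accordingly.
salt = salt.cast_column("audio", Audio(sampling_rate=16_000))
print(salt[0]["audio"]["array"].shape)
```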

🔊 2. Noise Augmentation: Training the Model to Survive Reality

Even with conversational speech, real deployments involve unpredictable noise, such as motorbikes, marketplace chatter, wind, and children playing in the background.

To prepare the model, I used a targeted noise augmentation strategy.

Noise pool: Sunbird/urban-noise-uganda-61k

This dataset includes:

  • Market sounds
  • Traffic
  • People speaking in the background
  • Varying acoustic profiles from East African cities

Instead of preprocessing the entire dataset with noise, I built a custom DataCollator that:

  • Randomly selects a noise clip
  • Overlays it directly on each speech waveform
  • Randomizes the amplitude to simulate different SNR conditions

This dynamic augmentation ensures that every training batch is unique, preventing the model from overfitting to any specific noise pattern.
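A minimal sketch of what such a collator can look like is shown below. It assumes the noise clips have already been loaded as 16 kHz NumPy arrays and that each example carries "audio" and "text" fields; the class and variable names are illustrative rather than the exact competition code.

```python
import random
import numpy as np

class NoisyDataCollator:
    """Mixes a randomly chosen noise clip into each waveform at a
    random amplitude before extracting Whisper input features."""

    def __init__(self, processor, noise_pool, noise_scale=(0.05, 0.3)):
        self.processor = processor      # a WhisperProcessor
        self.noise_pool = noise_pool    # list of 1-D float32 arrays at 16 kHz
        self.noise_scale = noise_scale  # relative noise amplitude range

    def _add_noise(self, speech):
        noise = random.choice(self.noise_pool)
        # Tile or trim the noise clip so it matches the speech length
        if len(noise) < len(speech):
            noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
        noise = noise[: len(speech)]
        return speech + random.uniform(*self.noise_scale) * noise

    def __call__(self, features):
        audio = [self._add_noise(f["audio"]["array"]) for f in features]
        batch = self.processor.feature_extractor(
            audio, sampling_rate=16_000, return_tensors="pt"
        )
        labels = self.processor.tokenizer(
            [f["text"] for f in features], return_tensors="pt", padding=True
        )
        # Replace padding token ids with -100 so they are ignored by the loss
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100
        )
        return batch
```

Because the noise clip and its amplitude are drawn fresh for every example, the same utterance is never heard twice under identical conditions.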

3. Audio Profiling: Understanding the Test Data

Before any model training, I conducted an analysis of the inference dataset (Sartify_ITU_Zindi_Testdataset).

Key findings:

  • All files were 16 kHz
  • All were mono-channel
  • Average duration: 6.08 seconds

This simplified the pipeline — no resampling, no channel conversion. Understanding these characteristics early helped avoid mismatches and inference bugs later.
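A profiling pass along these lines (using the soundfile library, with a placeholder directory path) is enough to confirm those properties before committing to a preprocessing pipeline.

```python
import glob
import numpy as np
import soundfile as sf

durations, rates, channels = [], set(), set()
for path in glob.glob("test_audio/*.wav"):   # placeholder path to the test set
    info = sf.info(path)
    durations.append(info.frames / info.samplerate)
    rates.add(info.samplerate)
    channels.add(info.channels)

print("sample rates:", rates)                # expected {16000}
print("channels:", channels)                 # expected {1}
print("mean duration (s):", round(float(np.mean(durations)), 2))
```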

4. Model Selection: Whisper, But Turbo

With the right data strategy in place, the next question was: Which architecture can deliver accuracy without sacrificing efficiency?

I selected Abdoul27/whisper-turbo-v3-model, a distilled variant of Whisper.

Why this model?

1. Strong Swahili foundation

It was already trained on the Common Voice 12 Swahili dataset, giving the model:

  • Phonetic understanding
  • Vocabulary grounding
  • Accent robustness

2. Distilled = smaller, faster

This “turbo” version preserves Whisper’s strengths while drastically reducing size, making it ideal for:

  • Real-time inference
  • Low-memory GPU environments
  • Edge deployment scenarios

5. Fine-Tuning Strategy: Making a Big Model Fit on a Small GPU

This challenge had strict resource constraints. Training Whisper on a regular GPU is expensive, but training it on a single NVIDIA T4 (≤16 GB VRAM)? That required creativity.

I combined two powerful techniques:

A. Parameter-Efficient Fine-Tuning (PEFT) with LoRA

Instead of updating all model weights, LoRA:

  • Freezes core Whisper parameters
  • Injects small trainable matrices into attention layers
  • Trains only these lightweight adapters

This approach drastically reduces memory usage while preserving model knowledge.

B. 8-bit Quantization

Loading the model in 8-bit format:

  • Cut memory footprint significantly
  • Allowed larger batch sizes
  • Accelerated attention computations

These two methods made training Whisper Turbo feasible under tight constraints.
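Put together, the setup looks roughly like this. The LoRA hyperparameters shown (rank, alpha, dropout, target modules) are typical values for Whisper fine-tuning, not necessarily the exact ones from the winning run.

```python
from transformers import WhisperForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 8-bit to cut its memory footprint
model = WhisperForConditionalGeneration.from_pretrained(
    "Abdoul27/whisper-turbo-v3-model",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Inject small trainable adapters into the attention projections only
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of weights train
```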

6. Memory Optimization & Training Settings

To push the model further within hardware limits, I employed:

  • Gradient checkpointing — reduce activation memory
  • Mixed precision (fp16) — faster math
  • Batch size = 4 (max that fits)
  • Gradient accumulation = 2 (effective batch = 8)

Learning schedule:

  • Learning rate: 1e-5
  • Warmup: 500 steps
  • Epochs: 1 (enough due to strong initialization + data size)

To avoid overfitting and hallucination, I monitored Word Error Rate (WER) closely. Only the checkpoint with the lowest validation WER was kept.
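In Hugging Face terms, these settings translate into training arguments along the following lines; the output path and evaluation cadence are placeholders, and argument names follow recent transformers releases.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-swahili-lora",   # placeholder output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,         # effective batch size of 8
    learning_rate=1e-5,
    warmup_steps=500,
    num_train_epochs=1,
    fp16=True,
    gradient_checkpointing=True,
    predict_with_generate=True,            # needed to compute WER during eval
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    metric_for_best_model="wer",
    greater_is_better=False,               # lower WER is better
    load_best_model_at_end=True,
)
```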

7. Inference: Smarter Decoding With Beam Search

For inference, Whisper supports both greedy and beam search decoding.

I used:

  • num_beams = 3
  • repetition_penalty = 1.2

Beam search explores multiple hypotheses before deciding on the transcription, avoiding early mistakes that greedy decoding often makes.

This improved accuracy at only a small cost to speed.
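A minimal decoding sketch is shown below, assuming processor and model are the fine-tuned objects from the previous steps and waveform is a 16 kHz mono array; forcing the language and task tokens is an assumption here, not a confirmed detail of the submission.

```python
import torch

inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features.to(model.device),
        num_beams=3,              # explore three hypotheses in parallel
        repetition_penalty=1.2,   # discourage looping, repeated phrases
        language="sw",
        task="transcribe",
    )
transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcript)
```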

📊 8. Results

The final model demonstrated strong performance:

  • Public leaderboard WER: 18.22
  • Private leaderboard WER: 17.81
  • Inference time: ~1.24 seconds per audio file
  • Full test set (4,089 files): 1 hour, 24 minutes, 29 seconds

Given the challenge constraints and the real-world nature of the data, these results validated the approach thoroughly.

🧭 Final Thoughts

Building this system taught me that developing ASR for African languages requires:

  • thoughtful dataset choices,
  • robustness-focused augmentation,
  • hardware-aware optimization,
  • and careful fine-tuning of powerful architectures.

This lightweight Swahili ASR model shows that high-performance speech technology for low-resource languages is absolutely achievable — even on constrained hardware.

About Abdourahamane

I am Abdourahamane Ide Salifou, a graduate student in Engineering Artificial Intelligence passionate about the intersection of language, accessibility, and machine learning. I enjoy building practical AI systems — especially those designed for low-resource environments where efficiency and inclusivity matter most.

This challenge was an opportunity to apply that philosophy, and I am excited for what comes next.
