The test set is a curated collection of seven hours of audio. You will train your model on open-source data and apply it to this test set.
Find the test set here.
To help you get started, we’ve pulled together a rich set of open-source tools and datasets that you’re free to use in this challenge.
The most important resource for your model is the Mozilla Common Voice Swahili dataset. It features over 100 hours of labelled speech recordings contributed by native speakers. This will be your primary dataset for training and evaluation.
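Since Common Voice provides sentence-level transcripts, the standard way to evaluate a model trained on it is word error rate (WER): the edit distance between the reference and hypothesis transcripts, divided by the reference length. A minimal, dependency-free sketch of the metric (libraries such as jiwer offer a production-ready version):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution/match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("ya" -> "za") over three reference words:
print(wer("habari ya asubuhi", "habari za asubuhi"))  # → 0.3333...
```

Normalising text (lowercasing, stripping punctuation) before scoring usually gives a fairer comparison across models.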
To strengthen your language understanding or enrich your pipeline, consider using pawa-min-alpha, a 2-billion-parameter Swahili language model. It’s ready to plug in for language modeling, rescoring, or as a downstream component.
When it comes to tools and frameworks, you have a lot of flexibility. You can start with Whisper, a powerful open-source STT model by OpenAI. Another solid option is Vosk, which is lightweight and great for low-resource devices.
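Whisper expects 16 kHz mono audio and processes it in fixed 30-second windows, so long test clips are usually split before transcription. A sketch, assuming the openai-whisper package (the `transcribe_file` helper is illustrative and requires a model download to actually run):

```python
import numpy as np

SAMPLE_RATE = 16_000   # Whisper expects 16 kHz mono input
CHUNK_SECONDS = 30     # Whisper's fixed attention window

def chunk_audio(samples: np.ndarray, sr: int = SAMPLE_RATE,
                chunk_s: int = CHUNK_SECONDS):
    """Split a long mono waveform into 30-second pieces."""
    step = sr * chunk_s
    return [samples[i:i + step] for i in range(0, len(samples), step)]

def transcribe_file(path: str) -> str:
    """Illustrative only: transcribe with openai-whisper (needs a model download)."""
    import whisper
    model = whisper.load_model("small")  # larger checkpoints trade speed for accuracy
    return model.transcribe(path, language="sw")["text"]

# A 70-second clip yields two full chunks plus a 10-second remainder.
chunks = chunk_audio(np.zeros(SAMPLE_RATE * 70))
print(len(chunks))  # → 3
```

Vosk follows a different, streaming-oriented API, but the same 16 kHz mono preprocessing applies.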
Looking to improve your model’s efficiency to run on edge devices? Take a look at this practical model pruning guide. It’s especially useful if you're aiming for fast inference on limited hardware like the NVIDIA T4 GPU.
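The simplest pruning technique is unstructured magnitude pruning: zero out the fraction of weights with the smallest absolute values, then (optionally) fine-tune to recover accuracy. A minimal NumPy sketch of the core step (frameworks like PyTorch provide this via their own pruning utilities):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the smallest-magnitude `sparsity` fraction of a weight tensor."""
    k = int(weights.size * sparsity)   # number of weights to remove
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.array([0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9, -1.0])
print(np.count_nonzero(magnitude_prune(w, 0.3)))  # → 7
```

Note that pruning alone yields sparse tensors, not smaller ones; realising a speed-up on hardware like the T4 also requires a sparse-aware runtime or structured pruning.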
All data and tools used must be open-source and publicly licensed (CC-BY, MIT, Apache 2.0, or less restrictive). Proprietary or closed data is not allowed.