For this model training challenge, the ASR dataset will be available as soon as the hackathon starts.
Links referenced in the workshop presentation will be added here after the workshop.
For this model training challenge, you will use the Hausa dataset provided by Mozilla Common Voice. To access the data, visit Mozilla Common Voice Datasets and select “Common Voice Corpus 7.0” and “Language: Hausa”. Important note: the data is pre-split into train and test sets (as seen in the downloads). Please train your model using only the “train.tsv” data; do not use the “test.tsv” data for training.
Please use the Test.csv and SampleSubmission.csv files on Zindi to test your model. Test.csv is the same test set as the one on Mozilla Common Voice, but the IDs have been edited to work with the Zindi system. You will need to download train.tsv from Mozilla Common Voice.
About Mozilla Common Voice (commonvoice.mozilla.org/en/faq)
“Voice recognition technology is revolutionizing the way we interact with machines, but the currently available systems are expensive and proprietary. Mozilla Common Voice is an initiative to make voice recognition technologies better and more accessible for everyone. Common Voice is a massive global database of donated voices that lets anyone quickly and easily train voice-enabled apps in potentially every language. We're not only collecting voice samples in widely spoken languages but also in those with a smaller population of speakers. Publishing a diverse dataset of voices will empower developers, entrepreneurs, and communities to address this gap themselves.”
Files in Mozilla’s Common Voice Hausa you can download:
Every .tsv file contains the transcripts, the corresponding audio file names, and, if available, metadata about the speakers.
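As a rough illustration of how one of these .tsv files might be read, here is a minimal Python sketch. It assumes the file is tab-separated with a header row, that the audio file name is in a column called `path`, and that the transcript is in a column called `sentence`. These column names, the `clips` folder name, and the helper `load_transcripts` are assumptions for illustration only; check the header of the file you download and adjust accordingly.

```python
import csv
from pathlib import Path


def load_transcripts(tsv_path, clips_dir="clips"):
    """Read a Common Voice style .tsv file into a list of examples.

    Assumed (not confirmed by this page): a header row, a "path" column
    with the audio file name, and a "sentence" column with the transcript.
    """
    examples = []
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks inside transcripts as plain text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Treat every other non-empty column as optional speaker metadata.
            speaker = {
                key: value
                for key, value in row.items()
                if key not in ("path", "sentence") and value
            }
            examples.append(
                {
                    "audio": Path(clips_dir) / row["path"],
                    "text": row["sentence"],
                    "speaker": speaker,
                }
            )
    return examples
```

Calling `load_transcripts("train.tsv")` would then give one dictionary per recording. Load only the train split this way when building your training set, in line with the note above about not training on test.tsv.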
Low Resource ASR
Mozilla Common Voice
Elpis quick guide
Preparing the corpus