A reflection on the Google NLP Hack Series
Technical · 30 Jun 2022, 07:41

In collaboration with Google, Zindi hosted the Google NLP Hack Series: Intro to ASR Africa Challenge. Zindians were invited to submit datasets in their own languages for use in Automatic Speech Recognition (ASR). The aim of the challenge was to collect written and spoken language data, in a beginner-friendly way, to create speech corpora that can be used to train ASR models. We received submissions in many different languages, and we are glad that this work also helps to preserve local African languages.

The data and model collection challenge was held after the Introduction to ASR Workshop and a week-long hackathon where Zindians practised their newly learned skills. Data scientists were encouraged to collect datasets from their communities in their own languages. The challenge was also an invitation for them to join the ASR community.

Second-place winner Jamiil Ali from Benin collected audio files in Dendi, a language of his elders.

“Being a native of the language, I still don’t know how to write it. But I can read a text written in Dendi. During this data collection, I got the chance to learn the Dendi alphabet and the pronunciation of some letters. Best of all, I discovered how to count in Dendi!”

Jamiil gathered information that might one day be forgotten, and his repository serves as an oral history of his country and the Dendi language.

First-place winner Dunstan Matekenya from Malawi collected audio files in Chichewa. He learned that speech data collection is harder than it seems, but he persevered and won the challenge.

“The most difficult aspect is the transcription. When you have multiple people transcribing the audio, it is difficult to determine if they are doing the same thing. I developed a one-page manual to ensure everyone was following the same procedure. Also, in Chichewa, people combine English a lot and this introduces challenges during transcription in terms of how to transcribe the English words.”
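One way to check whether several transcribers are "doing the same thing" is to have two of them transcribe the same clip and measure how much their transcripts differ. The snippet below is a minimal sketch of that idea, not the procedure Dunstan used: the function names, the normalization rules, and the sample strings are all assumptions made for illustration. It computes a word-level edit distance between two transcripts after basic cleanup.

```python
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split a transcript into words.

    These normalization rules are an assumption; a real transcription
    manual would define its own.
    """
    text = unicodedata.normalize("NFC", text).lower()
    # Keep word characters, whitespace, and apostrophes; replace the rest.
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance computed over words rather than characters."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def disagreement_rate(transcript_a: str, transcript_b: str) -> float:
    """Fraction of words that differ between two transcripts of one clip."""
    a, b = normalize(transcript_a), normalize(transcript_b)
    if not a and not b:
        return 0.0
    return word_edit_distance(a, b) / max(len(a), len(b))


# Illustrative strings only: the two transcribers spell a borrowed
# English word differently, so one of the three words disagrees.
rate = disagreement_rate("Ndili ku office.", "ndili ku ofesi")
```

Clips with a high disagreement rate can then be reviewed by hand, and recurring differences, such as how to spell English words mixed into Chichewa speech, can be written into the shared transcription manual.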

You can find the datasets collected by the winners here: Sudan: Sudanese dialect, Benin: Fongbe, Senegal: Wolof, Kenya: Swahili, Ivory Coast: Baule, Ivory Coast: Dendi, Malawi: Nyanja / Chichewa.

Speaking to future NLP practitioners, Jamiil Ali said, “Amadou Hampaté Bâ once said: ‘an old man who dies is a library that burns.’ This is a call to encourage all native language speakers who are not yet online, to contribute to the development of their language by generating data. Nowadays, we have resources to help us build digital libraries that will never disappear and continue to serve future generations. Most importantly, it helps our children to not forget their roots.”

Missed this challenge but still interested in trying it out? Learn more about ASR and how to build your own model here: Africa ASR Workshop video (covers an introduction to NLP and ASR). You can also check out the Elpis tutorial on how to train your model.