If Whisper proves too opaque for you, here are two step-by-step tutorials from Hugging Face that use Wav2Vec2 to fine-tune an English ASR model on a task similar to AfriSpeech.
This approach gives you more control over the vocabulary of the model, which can be quite important for achieving a lower WER. The tutorials also cover the feature extractor, processor, data collator, and other important components in more detail. You will find a lot of helpful tips in there!
One more caveat. As with most real-world projects, watch out for gotchas in the data such as missing or duplicate audio files, empty or very short transcripts, and long audio or long transcripts that can lead to CUDA OOM errors. It helps to have good error handling in your preprocessing scripts to avoid painful crashes down the road.
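To make the "good error handling" point concrete, here is a minimal sketch of a defensive validation check. The field names (`"audio"` with `array`/`sampling_rate`, `"transcript"`) and the thresholds are assumptions for illustration — adjust them to your actual dataset columns and GPU budget.

```python
MAX_AUDIO_SECONDS = 30      # long clips risk CUDA OOM during training
MIN_TRANSCRIPT_CHARS = 10   # drop near-empty labels
MAX_TRANSCRIPT_CHARS = 400  # very long labels also blow up memory

def is_valid_sample(sample):
    """Return True if the sample is safe to keep, False otherwise."""
    try:
        audio = sample["audio"]
        duration = len(audio["array"]) / audio["sampling_rate"]
        text = (sample.get("transcript") or "").strip()
        if not text:
            return False  # missing or empty transcript
        if not (MIN_TRANSCRIPT_CHARS <= len(text) <= MAX_TRANSCRIPT_CHARS):
            return False  # too short or too long to be useful
        if duration > MAX_AUDIO_SECONDS:
            return False  # skip very long audio to avoid OOM
        return True
    except (KeyError, TypeError, ZeroDivisionError):
        # corrupt or missing fields: skip instead of crashing mid-run
        return False
```

A predicate like this plugs straight into `datasets.Dataset.filter` if you are using the Hugging Face `datasets` library.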
Happy Hacking! Enjoy!
Hi @intron,
Thanks for the resources.
It is both refreshing and nice to have the host involved in the competition.
Thank you so much, intron.
There are a few duplicates and no missing or empty audio files/transcripts, but there are samples that are too short (containing only 6 characters) or too long, like this example where the transcript contains 795 characters and the audio lasts ~131 s:
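The cleaning step described above can be sketched with pandas. This assumes a metadata frame with hypothetical `audio_path`, `transcript`, and `duration` columns — rename them to match whatever your CSV actually uses, and treat the thresholds as starting points, not the competition's official values.

```python
import pandas as pd

def clean_metadata(df, min_chars=10, max_chars=500, max_seconds=60):
    """Drop duplicates and samples with extreme transcript/audio lengths."""
    df = df.drop_duplicates(subset=["audio_path"])              # the few duplicates
    lengths = df["transcript"].str.len()
    df = df[(lengths >= min_chars) & (lengths <= max_chars)]    # e.g. 6-char transcripts
    df = df[df["duration"] <= max_seconds]                      # e.g. the ~131 s clip
    return df.reset_index(drop=True)
```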
So, let's take care of this and let me see you on the LB @Muhamed_Tuo.
I don't know about you, but I decided not to remove the punctuation/special characters, because there are some audio files where the speakers read them out loud (though not in all of them)...
@Siwar_NASRI, deal accepted. See you on the LB.
About the punctuation, I think you should remove it. The reason is that you can't capture these marks by just listening to an audio file. Even for humans, it's not that obvious, unless there's a pattern. You can apply a heuristic later to add them back. But I would just clean them out so as not to put too much noise into the training of the model.
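The cleaning suggested above could look something like this — a hedged sketch with a regex, not the competition's official preprocessing. The exact character set is an assumption; a quick vocabulary scan of your transcripts will tell you what actually appears. Note it deliberately keeps apostrophes (they carry meaning in English) and does not lowercase, since casing is a separate decision.

```python
import re

# punctuation to drop; tweak based on your own vocabulary analysis
PUNCT_RE = re.compile(r"[.,!?;:\"()\[\]-]")

def normalize_transcript(text):
    """Replace punctuation with spaces and collapse extra whitespace."""
    text = PUNCT_RE.sub(" ", text)
    return " ".join(text.split())
```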
I have a feeling that this punctuation is going to mess with us, since the metric is a bit strict on word order.
@Muhamed_Tuo let's spend more time with the data, we're still at the starting point. I'm even thinking of not lowercasing the text, since medical language is not as free as spoken language.
It's kind of tricky with the punctuation, because I have encountered audio samples where the speakers say the punctuation out loud (like "comma" and "full stop") while reading the sentences.
Definitely. There might be some patterns the model could easily capture.
Now you've scared me :)