Intron AfriSpeech-200 Automatic Speech Recognition Challenge
Can you create an automatic speech recognition (ASR) model for African accents, for use by doctors?
Prize
$5 000 USD
Time
2 months to go
Participants
11 active · 193 enrolled
Advanced
Automatic Speech Recognition
Health
Media
Finetune ASR models using Wav2Vec2
Notebooks · 27 Feb 2023, 07:25 · 8

If Whisper proves too opaque for you, here are two step-by-step tutorials from Hugging Face that use Wav2Vec2 to fine-tune an English ASR model on a task similar to AfriSpeech.

This approach gives you more control over the model's vocabulary, which can be quite important for achieving a lower WER. The tutorials also cover the feature extractor, processor, data collator, and other important components in more detail. You will find a lot of helpful tips in there!

One more caveat: as with most real-world projects, watch out for gotchas in the data, such as missing or duplicate audios, empty or very short transcripts, and long audios or long transcripts that can lead to CUDA OOM errors. Good error handling in your preprocessing scripts helps avoid painful crashes down the road.
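To make the gotchas above concrete, here is a minimal sketch of a preprocessing sanity check. The field names (`audio_id`, `transcript`, `duration`) and the thresholds are illustrative assumptions, not the actual AfriSpeech schema or official limits:

```python
# Illustrative sanity filter for an ASR dataset represented as a list of dicts.
# Field names and thresholds are assumptions for the sketch, not the real schema.

MAX_DURATION_S = 30.0   # very long clips are a common cause of CUDA OOM
MIN_TRANSCRIPT_LEN = 3  # drop empty or near-empty transcripts

def clean_samples(samples):
    seen_ids = set()
    kept = []
    for s in samples:
        transcript = (s.get("transcript") or "").strip()
        if len(transcript) < MIN_TRANSCRIPT_LEN:
            continue  # empty or suspiciously short transcript
        if s.get("duration", 0) > MAX_DURATION_S:
            continue  # overly long audio
        if s["audio_id"] in seen_ids:
            continue  # duplicate entry
        seen_ids.add(s["audio_id"])
        kept.append(s)
    return kept
```

The same checks can be run as a `filter` over a Hugging Face `Dataset`; doing them up front is what prevents crashes hours into training.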

Happy Hacking! Enjoy!

https://huggingface.co/blog/fine-tune-wav2vec2-english

https://huggingface.co/docs/transformers/tasks/asr
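On the vocabulary control mentioned above: the linked tutorials build a character-level vocabulary from the training transcripts before creating the tokenizer. A minimal sketch of that step, assuming a plain list of transcript strings (the `|` word delimiter and `[UNK]`/`[PAD]` names follow the tutorials' convention):

```python
# Sketch of building a character vocab for a CTC tokenizer from transcripts.
# Input is assumed to be an iterable of already-normalized transcript strings.

def build_vocab(transcripts):
    chars = set()
    for t in transcripts:
        chars.update(t.lower())
    vocab = {c: i for i, c in enumerate(sorted(chars))}
    # Wav2Vec2's CTC tokenizer conventionally uses "|" as the word delimiter
    vocab["|"] = vocab.pop(" ", len(vocab))
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab
```

Owning this step is where the extra vocabulary control comes from: whatever characters you keep or strip here (punctuation, digits, casing) directly determines what the model can ever output.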

Discussion · 8 answers

Hi @intron,

Thanks for the resources.

It is both refreshing and nice to have the host involved in the competition.

27 Feb 2023, 08:15
Upvotes 1

Thank you so much, intron.

There are a few duplicates, and no missing or empty audios/transcripts, but there are samples that are too short (containing only 6 characters) or too long, like this example, where the transcript contains 795 characters and the audio lasts ~131 seconds:

{'speaker_id': 'c43d9f633fa45740796d9f385ec98138',
 'path': '1f43a334-e3ab-4338-99fa-08e42a2cdd3e/6ee94e7c0b77c8062591d3f6122dc2c3.wav',
 'audio_id': '1f43a334-e3ab-4338-99fa-08e42a2cdd3e/6ee94e7c0b77c8062591d3f6122dc2c3',
 'audio': {'path': '1f43a334-e3ab-4338-99fa-08e42a2cdd3e/6ee94e7c0b77c8062591d3f6122dc2c3.wav',
  'array': array([-0.00653076, -0.0065918 , -0.00662231, ...,  0.00698853,
          0.00695801,  0.00686646]),
  'sampling_rate': 44100},
 'transcript': 'given 0.25 mg dilaudid IV prn Allergies: Morphine Confusion/Delir Last dose of Antibiotics: Piperacillin/Tazobactam Zosyn - Vancomycin - Cefipime - Metronidazole - Infusions: Other ICU medications: Pantoprazole Protonix - Hydromorphone Dilaudid - Other medications: Changes to medical and family history: Review of systems is unchanged from admission except as noted below Review of systems: Flowsheet Data as of  Vital signs Hemodynamic monitoring Fluid balance                 24 hours                Since 12 AM Tmax: 36.7C 98 Tcurrent: 36.6C 97.8 HR: 79 56 - 79 bpm BP: 120/6174 97/4155 - 132/6484 mmHg RR: 24 17 - 28 insp/min SpO2: 92% Heart rhythm: SR Sinus Rhythm    Total In:                 2 963 mL                 912 mL PO:    TF: IVF:                 2 963 mL                 912 mL',
 'age_group': '41-55',
 'gender': 'Male',
 'accent': 'igbo',
 'domain': 'clinical',
 'country': 'NG',
 'duration': 131.32798185941044}

So, let's take care of this, and see you on the LB @Muhamed_Tuo.

I don't know about you, but I decided not to remove the punctuation/special characters, because there are some audios where the speakers read them out loud (though not in all audios)...
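One cheap check worth running on examples like the one above: recompute the duration from the raw waveform and compare it to the stored `duration` field, to catch truncated or corrupted entries. A sketch assuming the dict layout shown in the sample (the tolerance is an illustrative guess):

```python
# Cross-check a sample's stored duration against its decoded audio array,
# assuming the sample-dict layout shown above. Tolerance is illustrative.

def computed_duration(sample):
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]

def is_consistent(sample, tol=0.1):
    return abs(computed_duration(sample) - sample["duration"]) < tol
```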

@Siwar_NASRI, deal accepted. See you on the LB.

About the punctuation, I think you should remove it. The reason is that you can't capture it just by listening to an audio file. Even for humans it's not obvious, unless there's a pattern. You can apply a heuristic later to add it back, but I would clean it out so as not to put too much noise into the training of the model.

I have a feeling these punctuation marks are going to mess with us, since the metric is a bit strict on word order.
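To see why punctuation hurts a word-level metric, here is a tiny from-scratch WER (Levenshtein distance over whitespace tokens) — a sketch for intuition, not the competition's official scorer: a token like "Allergies:" counts as a full substitution against "Allergies" even though the word itself is right.

```python
# Minimal word error rate: word-level Levenshtein distance / reference length.
# From-scratch sketch for illustration, not the official evaluation code.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
        # (tokens must match exactly, punctuation included)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("Allergies: Morphine", "Allergies Morphine"))  # 0.5
```

One attached colon turns an otherwise perfect two-word hypothesis into 50% WER.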

@Muhamed_Tuo let's spend more time with the data; we're still at the starting point. I'm even considering not lowercasing the text, since medical language is not as free as everyday speech.
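The two ideas here (strip punctuation, maybe keep casing) can be sketched as one small normalizer; this is a regex-based illustration, not the official preprocessing, and note that the character class also splits decimals like "0.25" into "0 25", so dosage strings may need special handling:

```python
import re

# Illustrative normalizer: drop punctuation/special characters, collapse
# whitespace, and leave casing alone by default (so "IV", "mg" etc. survive).

def normalize(text: str, lowercase: bool = False) -> str:
    text = re.sub(r"[^\w\s]", " ", text)   # also splits "0.25" -> "0 25"
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower() if lowercase else text
```

Whichever choice you make, the same normalization has to be applied to the vocabulary, the training targets, and your submission, or the WER comparison falls apart.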

It's kind of tricky with the punctuation, because I have encountered audio samples where the speakers say the punctuation marks out loud (like "comma" and "full stop") while reading the sentences.

Definitely. There might be some patterns the model could easily capture.

Now you've scared me :)