Hello everyone, this is the summary of the 12th solution:
Fine-tuning facebook wav2vec2 without any data augmentation or parameter tuning gave 12% WER.
Applying two effects (speed reduction + reverberation) to the data with p=0.5, and reducing both attention_dropout and hidden_dropout to 0.05, brought this down to 9.9% WER (0.099).
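For anyone unsure how the WER numbers are read: WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, so it can be quoted as a fraction or as a percentage. Below is a minimal sketch of that formula. The function name `word_error_rate` is just for illustration, it scores a single utterance, and it is not the competition's official scorer.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("a b c d", "a x c")
    print(f"WER = {wer:.3f} ({wer:.1%})")
```

A value of 0.099 from such a function is the same thing as 9.9% WER.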
Other effects such as adding noise, gain, and pitch shift did not improve the results.
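To make the augmentation step concrete, here is a rough, dependency-free sketch of "slow down + reverb, applied with p=0.5". It is not the code from the repository above. The helper names, the 0.9 speed factor, and the echo delays and gains are all assumptions made for illustration, and it assumes both effects are applied together to a given sample with probability p. The slow-down is a plain resampling (so it also lowers pitch), and the reverb is a crude set of delayed echoes rather than a real room impulse response.

```python
import math
import random


def slow_down(samples, factor=0.9):
    """Stretch the signal by linear interpolation; factor < 1 means slower."""
    if not samples:
        return []
    n_out = int(round(len(samples) / factor))
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = int(pos)
        if lo >= len(samples) - 1:
            out.append(samples[-1])  # clamp at the last input sample
        else:
            frac = pos - lo
            out.append(samples[lo] * (1 - frac) + samples[lo + 1] * frac)
    return out


def add_reverb(samples, sample_rate=16000,
               taps=((0.03, 0.5), (0.05, 0.3), (0.08, 0.2))):
    """Add a few delayed, attenuated copies of the signal (delay in s, gain)."""
    out = list(samples)
    for delay_s, gain in taps:
        delay = int(round(delay_s * sample_rate))
        for n in range(delay, len(samples)):
            out[n] += gain * samples[n - delay]
    return out


def augment(samples, p=0.5, rng=random):
    """With probability p apply slow-down then reverb, else return a copy."""
    if rng.random() >= p:
        return list(samples)
    return add_reverb(slow_down(samples))


if __name__ == "__main__":
    # 0.1 s of a 440 Hz tone at 16 kHz as stand-in audio.
    tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
    augmented = augment(tone, p=1.0)
    print(len(tone), "->", len(augmented), "samples")
```

In a real training pipeline a function like `augment` would be called on each waveform when a batch is built, so roughly half of the samples seen in each epoch are transformed and the rest stay clean.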
Code: https://github.com/anashas/Automatic-Speech-Recognition-in-WOLOF
Great work Sir