A short description of the solution process:
1. First, I fine-tuned a wav2vec2-XLSR model on a random train/validation split. This gave a WER of 0.07 on the validation set and 0.15 on the (public) test set. Validation tracked train very closely, which led me to realise there were only about 700 unique train transcriptions.
2. I fine-tuned wav2vec2-XLSR on a train/validation split with no overlapping transcriptions. This gave a WER of 0.24 on validation. The best model was actually "jonatasgrosman/wav2vec2-large-xlsr-53-french" from the Huggingface hub - XLSR fine-tuned on French Common Voice - which outperformed both French-only and multilingual models. This model scored 0.14 on the test set. For postprocessing, I applied a French spellchecker, which reduced the WER to 0.12.
3. Since the test score was lower than the validation score, I suspected the test set also (partly) consisted of the same 700 train labels. I matched test predictions to the closest preprocessed train transcripts using Levenshtein distance, and about 85% were within 2 edits of the closest preprocessed train transcript. Submitting the closest preprocessed train transcript for all observations within 7 edits scored 0.04 on the test set. Submitting the original train transcripts (i.e. including punctuation and noise) scored 0.021.
4. I retrained the model after adding the well-matched test samples and the validation set to train, repeated step (3), and decoded the test examples that didn't have a close match using a language model trained on the train text. This model scored 0.0202 on the test set and was my final submission.
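The no-overlap split in step 2 can be sketched as follows. This is a minimal illustration (not the author's code; the function name and data layout are my own): assign each unique transcription to exactly one side of the split, so the model never sees a validation sentence during training.

```python
# Sketch: train/validation split with no overlapping transcriptions.
# `samples` is assumed to be a list of (audio_path, transcription) pairs.
import random

def split_by_transcription(samples, val_frac=0.2, seed=0):
    """Split so that each unique transcription lands entirely in
    either train or validation, never both."""
    texts = sorted({text for _, text in samples})
    random.Random(seed).shuffle(texts)
    n_val = int(len(texts) * val_frac)
    val_texts = set(texts[:n_val])
    train = [s for s in samples if s[1] not in val_texts]
    val = [s for s in samples if s[1] in val_texts]
    return train, val
```

With a random per-sample split, the ~700 repeated transcriptions leak across the boundary, which explains why validation tracked train so closely in step 1.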
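The matching trick in steps 3-4 can be sketched like this (my own illustration, not the author's implementation; the 7-edit cutoff is the one mentioned above): compute the Levenshtein distance from each test prediction to every train transcript, and substitute the closest transcript whenever it is within the cutoff.

```python
# Sketch: snap test predictions to the nearest train transcript
# by Levenshtein (edit) distance, as described in step 3.

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def snap_to_train(prediction: str, train_transcripts, max_edits: int = 7):
    """Return the closest train transcript if it is within max_edits,
    otherwise keep the model's prediction unchanged."""
    best = min(train_transcripts, key=lambda t: levenshtein(prediction, t))
    return best if levenshtein(prediction, best) <= max_edits else prediction
```

Predictions without a close match fall through unchanged; in the final submission those were the ones re-decoded with the language model.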
Your solution made me cry. I think you did so much more than me. If I were an organizer I would have given you the MVP award.
I just feel the same.
Congrats.
Please, can (I/we) have the code for this solution?
Well done Sir
Well done! Thank you for sharing.
Now with code: https://github.com/adilism/zindi-ai4d-wolof