Now that the competition is over, it would be very interesting to learn from the solutions of others.
On our side (NLP Zurich, rank 6) we followed this procedure:
- fine-tuned a pretrained XLSR model on the competition dataset
- ensemble of 3 models
- beam search with a 3-gram word language model (20 beams for each of the 3 models)
- nearest-neighbour search of the prediction in the vocabulary extracted from the training set (a rough sketch of the decoding step follows below)
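Roughly, the decoding step could look like the sketch below (not our exact code): it assumes pyctcdecode with a KenLM 3-gram word model for the beam search, averages the ensemble logits before decoding as a simplification, and uses difflib as a stand-in for the nearest-neighbour vocabulary search. Paths and labels are illustrative.

```python
# Rough sketch of the decoding step, not the exact competition code.
# Assumptions: pyctcdecode + a KenLM 3-gram word LM, ensemble logits are
# simply averaged before decoding, and difflib stands in for the
# nearest-neighbour vocabulary search. Paths and labels are illustrative.
import difflib
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = list(" abcdefghijklmnopqrstuvwxyz'")  # character set of the XLSR head

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm_3gram.arpa",  # hypothetical path to the word LM
)

def decode(logits_per_model, train_vocab, beam_width=20):
    """Beam-search decode averaged CTC logits with the LM, then snap each
    predicted word to its nearest neighbour in the training vocabulary."""
    # logits_per_model: list of arrays of shape (time, len(labels)), one per model
    mean_logits = np.mean(np.stack(logits_per_model), axis=0)
    text = decoder.decode(mean_logits, beam_width=beam_width)

    corrected = []
    for word in text.split():
        # get_close_matches is a simple stand-in for an edit-distance index
        match = difflib.get_close_matches(word, train_vocab, n=1, cutoff=0.0)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)
```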
Thanks a lot. Can you share the code once the review is over? On my side, I fine-tuned an XLSR model and got an LB score of 1.60.
Single pretrained XLSR model, but with preprocessed data: conversion from mp3 to wav, then noise removal and silence trimming. I trained for 40 epochs and divided the data into two separate datasets to process it faster, using almost all of the data for training. I finished 10th, but my original rank would have been 8th, since my other selection scored 0.078 but somehow wasn't counted. My future strategy would be to enhance the speech with a pretrained deep-learning model.
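The preprocessing described above could look roughly like this; librosa, noisereduce, and soundfile are illustrative library choices, and the parameters are guesses, not necessarily what was actually used.

```python
# Minimal sketch of the preprocessing described above (mp3 -> wav,
# noise removal, silence trimming). Library choices and parameters are
# illustrative assumptions, not the poster's actual code.
import librosa
import noisereduce as nr
import soundfile as sf

def preprocess(mp3_path, wav_path, target_sr=16000):
    # Load the mp3 and resample to 16 kHz, the rate XLSR expects.
    audio, sr = librosa.load(mp3_path, sr=target_sr)
    # Spectral-gating noise reduction.
    audio = nr.reduce_noise(y=audio, sr=sr)
    # Trim leading/trailing silence more than 30 dB below the peak.
    audio, _ = librosa.effects.trim(audio, top_db=30)
    # Save the cleaned clip as wav for fine-tuning.
    sf.write(wav_path, audio, sr)

# preprocess("clip_0001.mp3", "clip_0001.wav")  # hypothetical file names
```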
Can you share the code, please!
Sure. It will take some time to clean up the GitHub repo, but then I'm happy to share it.
Same here...
I guess moving on to future competitions we're all going to have to trust local CV more than the public LB, given the massive shake-up that just happened.
My solution was based on an NVIDIA pretrained model. The dataset was processed into 4 variants:
synthetic data with noise and synthetic data without noise,
non-synthetic data with noise and non-synthetic data without noise;
the best model was achieved with synthetic data with noise.
The preprocessor for the encoder was a mel spectrogram, and a greedy decoder was used to decode the output (a rough sketch of this is below).
Cutout was used for augmentation, as SpecAugment didn't improve local CV.
Training ran for 24 epochs across 4 notebooks, since I was constrained by Kaggle's 9-hour GPU usage limit.
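For reference, the mel-spectrogram front end plus greedy CTC decoding could be sketched like this; torchaudio and the specific parameter values are my assumptions, not the actual notebook code.

```python
# Sketch of a mel-spectrogram front end and a greedy CTC decoder, as
# described above. torchaudio and the parameter values are assumptions.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)

def greedy_ctc_decode(log_probs, labels, blank_id=0):
    """Collapse repeated frames and drop blanks from the per-frame argmax."""
    # log_probs: (time, num_classes) output of the acoustic model
    ids = log_probs.argmax(dim=-1).tolist()
    out, prev = [], blank_id
    for i in ids:
        if i != prev and i != blank_id:
            out.append(labels[i])
        prev = i
    return "".join(out)

# features = mel(waveform)                        # waveform: (1, num_samples) tensor
# text = greedy_ctc_decode(model_output, labels)  # hypothetical model output
```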
Final position was 11th with a loss of 0.09723334; my best submission scored 0.093545188003.
Congrats to the winners and everyone who participated. I hope to see you all in another challenge soon.