Many thanks to all hosts and Zindi team for such an interesting challenge. Congrats and thanks to all participants!
Despite its simple formulation, the task is really difficult because some classes have close semantics and a small number of training examples. My final solution is an ensemble of 6 MT5 (L, XL) models trained on different sequence lengths from 64 to 256 tokens. Each of the 6 models is a 5-fold self-ensemble. I wasn't aware of any other models pretrained on the Chichewa language, so from the beginning I concentrated on MT5. My cross-validation setup is based on a 5-fold stratified split. The CV score of my best ensemble was 0.7005 (pretty close to the private LB of 0.7097), but its public score was only 0.6419, so choosing final submissions in this competition was a bit tricky.
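For anyone setting up a similar pipeline, the stratified split can be sketched like this (the toy texts/labels are placeholders, not the competition data):

```python
# Minimal sketch of a 5-fold stratified CV split as described above.
# The toy data below is illustrative only.
from sklearn.model_selection import StratifiedKFold

texts = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = []
for train_idx, val_idx in skf.split(texts, labels):
    # each validation fold preserves the overall class ratio
    folds.append((train_idx, val_idx))
```

Training one model per fold and averaging their predictions gives the 5-fold self-ensemble mentioned above.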
One interesting conclusion from my experiments is that a good model can be trained on relatively short sequences (even though almost all texts are quite long). In particular, one of my best single models was trained on 64 tokens. Models trained on 384 and 512 tokens were not better. I also trained some models on different ranges of tokens like [0:256), [256:512), etc. All ranges except the first one gave a much lower CV score.
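The token-range experiment above boils down to slicing the tokenized text before training. A toy sketch (a whitespace split stands in for the real MT5 tokenizer):

```python
# Hedged sketch: take a range [start:end) of tokens from each document,
# e.g. [0:256) or [256:512) as in the experiment above.
# str.split() is a stand-in for the real MT5 tokenizer.
def token_range(text, start, end):
    tokens = text.split()  # placeholder for tokenizer.encode(text)
    return tokens[start:end]

doc = " ".join(f"tok{i}" for i in range(600))
first = token_range(doc, 0, 256)     # [0:256) — the range that worked best
second = token_range(doc, 256, 512)  # later ranges scored much lower
```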
Awesome !
thanks for sharing with the community 😉
great summary
Hi,
Many thanks for sharing with us, and congratulations!
Can you please also share your configuration, GPU, etc. @vecxoz
Thanks!
In general I prefer to run experiments on the free TPUs at Kaggle. If I need more resources, I rent GPUs on Google Cloud. In particular, at the end of this competition I rented a couple of hours on an A100-40GB to train MT5-XL, because this model does not fit in a 16 GB GPU/TPU.
Thanks for the information
Wow..... Great. Thanks for sharing. 👍
Hi!
Thank you for sharing your experiments on that challenge. I really appreciate it. 👏👏👏
👏👏
Congratulations and thanks for sharing!
Congratulations!
Our result is actually also a blend of MT5 models and a linear model with the standard TF-IDF stuff.
We used the first 700 tokens.
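For reference, a TF-IDF + linear baseline like the one blended above can be set up in a few lines; the toy Chichewa-ish texts, labels, and hyperparameters here are assumptions for illustration, not the actual configuration:

```python
# Hedged sketch of a TF-IDF + linear-model baseline (not the exact setup used).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["moni dziko", "nkhani ya masewera", "moni abwenzi", "masewera a mpira"]
train_labels = ["greeting", "sports", "greeting", "sports"]

# word unigrams + bigrams; a character n-gram variant is another common choice
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)
pred = clf.predict(["moni"])[0]
```

Its class probabilities can then be averaged with the MT5 predictions to form the blend.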
> In particular one of my best single models was trained on 64 tokens.
Did you use the first 64 tokens in that case?
> task is really difficult because some classes have close semantics
Yeah, and also I think there are some mislabellings. I've found ~20 of them and tried to fix them.
Thanks!
Yes, the first 64 tokens.
I have also tried the mT5 small model with 60 epochs and a 0.001 learning rate, but unfortunately it couldn't make any predictions.
@Sir-G @vecxoz can you share your hyperparameter configuration please? Maybe my resources weren't enough.
I had problems when all layers were trainable. Maybe try to freeze all layers except 2 or 3 last blocks.
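The freezing idea above amounts to setting `requires_grad = False` on everything except the last few blocks. A toy sketch: the parameter names mimic the MT5 layout (`encoder.block.N...`), but the model here is just a stand-in list of named parameters, not a real `transformers` model:

```python
# Hedged sketch of freezing all but the last few encoder blocks.
# Param is a stand-in for a torch parameter with a requires_grad flag.
class Param:
    def __init__(self):
        self.requires_grad = True

named_params = [(f"encoder.block.{i}.layer.0.weight", Param()) for i in range(12)]

def freeze_except_last(named_params, n_trainable=3, n_blocks=12):
    keep = {f"encoder.block.{i}." for i in range(n_blocks - n_trainable, n_blocks)}
    for name, p in named_params:
        p.requires_grad = any(name.startswith(k) for k in keep)

freeze_except_last(named_params)
trainable = [name for name, p in named_params if p.requires_grad]
```

With a real model, the same loop runs over `model.named_parameters()`.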
Oh great idea!
Thanks a lot for sharing. I tried MT5 on Colab but ran into GPU issues since I didn't have the needed GPU at the time. I was able to attain 0.58 with MT5 small, but I had to use tiny batch sizes, which isn't ideal because it causes unstable learning.
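One common workaround for those tiny batch sizes (not mentioned above, just a general technique) is gradient accumulation: sum scaled gradients over several micro-batches and step the optimizer once, so the effective batch is larger. A toy sketch with plain numbers standing in for gradient tensors:

```python
# Hedged sketch of gradient accumulation: plain floats stand in for tensors.
def train_steps(micro_batch_grads, accum_steps=4):
    """Return the effective gradient applied at each optimizer step."""
    applied = []
    grad_sum = 0.0
    for i, g in enumerate(micro_batch_grads, start=1):
        grad_sum += g / accum_steps   # scale so the sum matches one big batch
        if i % accum_steps == 0:
            applied.append(grad_sum)  # optimizer.step() would fire here
            grad_sum = 0.0
    return applied

grads = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
steps = train_steps(grads, accum_steps=4)
```

With a real training loop, the same pattern is `loss = loss / accum_steps; loss.backward()` with `optimizer.step()` every `accum_steps` iterations.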
Truly, the beauty in NLP is in understanding how to leverage the knowledge of pre-trained models.