Hi,
The MasakhaPOS paper at https://arxiv.org/pdf/2305.13989.pdf discusses various methodologies such as FT Eval, MAD-X and LT-SFT. MAD-X and LT-SFT require datasets in both the source and target languages. The data in the target languages (luo and tsn) can be unlabelled, so that we can train the model for the language part. The notebook does not contain an unlabelled news dataset for the luo and tsn languages. Can we download unlabelled news datasets for luo and tsn and use them to train the language part of MAD-X / LT-SFT?
Unfortunately, the rules say that we can use only the datasets they provided:
"You may use only the datasets provided for this competition."
Hi, that's a great question. Yes, you can use unlabelled data for the target languages. We have updated the README of the MasakhaPOS GitHub repository with a link to some of the monolingual data used for annotation. You are not restricted to this; please feel free to use other monolingual data that you may find online.
@zindi, @amy, can someone from the Zindi team confirm whether we are allowed to use data other than the 18 language folders?
Hello "Zindi Team"
Following up on what @Krishna_Priya said, we are all waiting for your confirmation.