Pre-trained language models have recently driven significant improvements across a wide range of Natural Language Processing (NLP) tasks, and transfer learning is rapidly changing the field. Transfer learning is the process of training a model on a large-scale dataset and then using that pre-trained model as the starting point for learning another, downstream task (i.e., the target task).
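The pre-train-then-fine-tune idea above can be illustrated with a toy sketch. This is plain Python with no real language model involved; the one-feature logistic model, the synthetic "source" and "target" tasks, and all names here are illustrative assumptions, not part of the challenge.

```python
import math
import random

random.seed(0)

def train(weights, data, lr=0.1, epochs=100):
    """One-feature logistic regression trained by gradient descent."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            z = max(-30.0, min(30.0, w * x + b))  # clamp to avoid overflow
            p = 1 / (1 + math.exp(-z))            # predicted probability
            w -= lr * (p - y) * x                 # gradient of logistic loss
            b -= lr * (p - y)
    return [w, b]

# "Pre-training": a large labelled source task (label is 1 when x > 0)
source = [(x, 1 if x > 0 else 0) for x in
          [random.uniform(-2, 2) for _ in range(100)]]
pretrained = train([0.0, 0.0], source)

# "Fine-tuning": a tiny target task starts from the pre-trained weights
# instead of from scratch, so far less target data is needed
target = [(-1.5, 0), (-0.5, 0), (0.5, 1), (1.5, 1)]
finetuned = train(list(pretrained), target, epochs=20)
```

The point of the sketch is the data asymmetry: the source task supplies most of the supervision, and the target task only nudges the inherited weights, which is exactly why the scarcity of African-language pre-training data matters.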
Among the leading architectures for pre-training transfer-learning models in NLP, African languages are barely represented, mainly due to a lack of data. While these architectures are freely available for use, most are data-hungry: the GPT-2 model, for instance, was trained on millions of web pages. (ref)
This gap exists because little data for African languages is available on the Internet. The languages selected for BERT pre-training “were chosen because they are the top languages with the largest Wikipedias”. (ref) Similarly, the pre-trained models fastText makes available for 157 languages were trained on Wikipedia and Common Crawl. (ref)
The objective of this challenge is the creation, curation and collation of good-quality African language datasets for a specific NLP task. These task-specific datasets will serve as downstream tasks on which future language models can be evaluated.
This challenge is being hosted by the Artificial Intelligence for Development Africa (AI4D-Africa) Network.
About AI4D-Africa, the Artificial Intelligence for Development-Africa Network (ai4d.ai):
AI4D-Africa is a network of excellence in AI in sub-Saharan Africa. It aims to strengthen and develop community, scientific and technological excellence in a range of AI-related areas, and is composed of African Artificial Intelligence researchers, practitioners and policymakers.