Note that this competition was updated on 1 June 2020 with a new round of prizes specifically for languages indigenous to Uganda, Ghana, and South Africa.
In recent times, pre-trained language models have led to significant improvements in various Natural Language Processing (NLP) tasks, and transfer learning is rapidly changing the field. Transfer learning is the process of training a model on a large-scale dataset and then using that pre-trained model as the starting point for learning another downstream task (i.e. a target task such as named entity recognition).
Among the leading pre-trained models used for transfer learning in NLP, African languages are barely represented, mainly due to a lack of data. (There are some exceptions, such as this multilingual BERT, which includes languages like Swahili and Yoruba.) While these architectures are freely available for use, most are data-hungry. The GPT-2 model, for instance, was trained on millions, possibly billions, of words of text. (ref)
This gap exists due to a lack of availability of data for African languages on the Internet. The languages selected for BERT pre-training “were chosen because they are the top languages with the largest Wikipedias”. (ref) Similarly, the 157 pre-trained language models made available by fastText were trained on Wikipedia and Common Crawl. (ref)
Therefore, this challenge's objective is the creation, curation and collation of good quality African language datasets for a specific NLP task. This task-specific NLP dataset will serve as the downstream task we can evaluate future language models on.
This challenge is hosted in partnership with GIZ and the FAIR Forward initiative and the Artificial Intelligence for Development Africa (AI4D-Africa) Network.
About FAIR Forward and GIZ (toolkit-digitalisierung.de/en/fair-forward)
The “FAIR Forward – Artificial Intelligence for all” initiative promotes a more open, inclusive and sustainable approach to AI on an international level. It is implemented by the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) on behalf of the German Federal Ministry for Economic Cooperation and Development (BMZ). FAIR Forward seeks to improve the foundations for AI innovation and policy in five partner countries: Rwanda, Uganda, Ghana, South Africa and India. Together with our partners, we focus on three areas of action: (1) strengthen local technical know-how on AI, (2) increase access to open AI training data, (3) develop policy frameworks ready for AI.
About AI4D-Africa; Artificial Intelligence for Development-Africa Network (ai4d.ai)
AI4D-Africa is a network of excellence in AI in sub-Saharan Africa. It is aimed at strengthening and developing community, scientific and technological excellence in a range of AI-related areas. It is composed of African Artificial Intelligence researchers, practitioners and policy makers.
To be eligible for this competition you must register for this competition on Zindi. You must upload your submission to the competition leaderboard. Note that there will be no scores on this leaderboard. The languages covered in the dataset must be indigenous to Uganda, South Africa, or Ghana. The participants in this competition can be from any country in the world.
The judging panel will score all submissions according to the evaluation criteria. The panel’s determination is final. Winners will be announced within 30 working days of the end of the competition.
Our intention is that the datasets are kept free and open for public use under a Creative Commons license 4.0 or similar. Data already licensed under more restrictive terms will not be eligible.
If your dataset wins, by accepting the prize you agree to make the dataset publicly available under a Creative Commons license 4.0 or similar and to allow Zindi to use the dataset for a future challenge. All other participants whose datasets did not win will similarly be encouraged to share their datasets as a public good.
If two datasets are identical, the tie breaker will be the date and time at which the submission was made (the earlier submission will win).
As an individual or team, you are able to make up to THREE unique submissions. If you make more than three submissions, we will evaluate your three most recent submissions.
You acknowledge and agree that Zindi may, without any obligation to do so, remove or disqualify an individual, team, or account if Zindi believes that such individual, team, or account is in violation of Zindi’s Rules.
We reserve the right to modify these rules at any time as necessary.
Update as of June 2020: The datasets MUST be for languages indigenous to Uganda, Ghana, or South Africa. Any other languages will not be evaluated.
The evaluation of datasets will be done by an expert committee and will take into consideration the following criteria:
A corpus should be representative and balanced with respect to particular factors; for example, by genre: newspaper articles, literary fiction, spoken speech, blogs and diaries, and legal documents. A corpus is said to be “representative of a language variety” if the content of the corpus can be generalized to that variety (Leech 1991). Basically, if the content of the corpus, defined by specifications of linguistic phenomena examined or studied, reflects that of the larger population from which it is taken, then we can say that it “represents that language variety.”
The notion of a corpus being balanced is an idea that has been around since the 1980s, but it is still a rather fuzzy notion and difficult to define strictly. Atkins and Ostler (1992) propose a formulation of attributes that can be used to define the types of text, and thereby contribute to creating a balanced corpus.
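As a rough illustration of how a submitter might inspect balance before submitting, the sketch below counts tokens per genre and reports each genre's share of the corpus. It is not part of the official evaluation: the function name genre_token_shares, the (genre, text) input layout, and whitespace tokenization are all assumptions made for this example, and the sample texts are placeholders.

```python
from collections import Counter


def genre_token_shares(documents):
    """Return (total_tokens, {genre: share_of_tokens}) for (genre, text) pairs.

    Hypothetical helper for illustration only; the judging panel's actual
    procedure is not specified in this document.
    """
    counts = Counter()
    for genre, text in documents:
        # Assumption: whitespace splitting is a good enough token count.
        # Many languages need a language-specific tokenizer instead.
        counts[genre] += len(text.split())
    total = sum(counts.values())
    # Avoid dividing by zero when the corpus is empty.
    shares = {genre: n / total for genre, n in counts.items()} if total else {}
    return total, shares


# Placeholder corpus: three short documents in two genres.
sample = [
    ("news", "one two three four"),
    ("fiction", "five six"),
    ("news", "seven eight"),
]

total, shares = genre_token_shares(sample)
print(f"total tokens: {total}")
for genre, share in sorted(shares.items()):
    print(f"{genre}: {share:.0%}")
```

A heavily skewed share for one genre does not by itself make a corpus unbalanced for its intended task, but it is a useful figure to report in the dataset documentation alongside the total token count.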
the researcher credibly demonstrates that given a research grant of $1,500 in addition to a $500 upfront prize (for a total of $2,000), the resulting dataset will be delivered in a reasonable timeframe and perform well against the same criteria (representative and balanced, annotated for a specific downstream task, number of tokens, and underrepresentation of the language). This score will be based on the plan for building the dataset in the future as articulated in the documentation as well as other indications of the researcher’s commitment and understanding of the project from the submitted dataset and documentation.
The dataset should be designed to enable certain downstream tasks. Any downstream task will be considered and no preference will be given to one over the other. The following are examples of downstream tasks that we would be interested in seeing:
Updated June 2020: This competition contains only one round of data submissions and awards.
At the end of the competition, $500 USD will be awarded to the individual or team with the top language dataset from each of the following countries: Uganda, Ghana, and South Africa.
In addition to the $500 prize, a $1,500 research grant may be awarded to each of the three winners to help fund further development of their dataset over the following two to three months.
The specific terms and timeline will be agreed upon between Zindi and the winners. Winners who receive the research grant will receive technical guidance from NLP experts from AI4D. Zindi reserves the right to award the research grant to other individuals or teams if an agreement for further development of the datasets is not reached with the top winners. Regardless of the research grant, the top winners will still receive the $500 prize.
Competition closes on 2 August 2020.
Final submissions must be received by 11:59 PM GMT.
We reserve the right to update the contest timeline if necessary.
Is speech recognition data accepted in this competition?
Yes, but only if the speech data is accompanied by a text transcription of the audio.
Is textual data in image format eligible for this competition?
No. Taking pictures of written or printed text and then performing the task of optical character recognition to get the digital version of this text is a creative way to go about creating data for the challenge. We will however not accept the image data as is.
Can one submit a dataset found on the net (i.e. the person did not create it themselves for submission), provided the data meets the criteria defined in the competition?
Yes, you can, as long as it is in the public domain. Scraping data from sites already on the internet, for example, is one of the things we expect people to do.
Congratulations to the winners of the 1st Round of AI4D-African Language Dataset Challenge!
Here is the link to the recording of the first installment of this webinar series, along with other resources!
Masakhane, Saarland University, Ghana NLP, University of Cape Town, InstaDeep and DFKI discuss Natural Language Processing in low-resourced languages and collecting language data in African languages.