GIZ Kinyarwanda Text Cleaning and Augmentation Competition by GIZ
Can you collect and curate English-Kinyarwanda text data for machine translation tasks?
Prize: $3 500
Time: Ended 4 months ago
Participants: 127 enrolled
Helping: Rwanda
Tags: Good for beginners, Collection, Research
Description

Machine translation has been around since the 1950s. However, it was only in 2016 that Google assigned a research team to look into neural networks for machine translation instead of the traditional statistical database method. The research team found that neural networks work across multiple languages and are faster than previous methods. This led to the birth of modern machine translation.

The cornerstone of good machine translation models is quality training data. Parallel and/or monolingual corpora used in building translation models often suffer from a wide range of issues, including syntactic and semantic errors, wrong translations, incomplete sentences, and noise such as control characters. These issues result in inaccurate translations and discourage users from using the deployed services.

In this challenge, you are tasked with uncovering and remediating issues in the provided parallel corpus of English-Kinyarwanda sentence pairs. You are also tasked with finding additional data to clean, through means such as data augmentation or web scraping. The cleaned dataset will later be used to build high-quality machine translation models.

We encourage participants from different domains, such as translation, interpretation, (computational) linguistics or natural language processing.

About the GIZ Digital Transformation Center Rwanda

The Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) GmbH is a federally owned international cooperation enterprise for sustainable development with worldwide operations. GIZ has worked in Rwanda for over 30 years. The primary objectives of the cooperation between the Government of Rwanda and the Federal Republic of Germany are poverty reduction and the promotion of sustainable development. To achieve these objectives, GIZ Rwanda is active in the sectors of Decentralization and Good Governance, Economic Development and Employment Promotion, Climate & Energy, as well as Digitalization & ICT (Information and Communications Technology).

The program "Digital Solutions for Sustainable Development" (DSSD) aims to promote the development of digital solutions, digital inclusion, and professional ICT skills and capacities. In 2019, DSSD opened the Digital Transformation Center Rwanda (DTC) as a hub for innovation and collaboration among the public and private sectors, academia, and civil society.

One of the focus areas of the Digital Transformation Center is Artificial Intelligence. Against this background, the AI Hub Rwanda was founded to bundle all AI initiatives implemented by GIZ in Rwanda. It comprises two projects: the global FAIR Forward program and the Machine Translation component of the DSSD program. The vision of the AI Hub Rwanda is to co-create a vibrant and inclusive ecosystem in Rwanda that harnesses the benefits of open and ethical AI for sustainable development. Our mission is to empower our partners to build community-driven AI solutions by providing open AI training data, capacity building, and the development of ethical policy frameworks.

Evaluation

Note that there are no leaderboard scores for this competition, as evaluations will be completed by a panel of judges.

In this challenge, you are tasked with uncovering and remediating issues in the provided parallel corpus of English-Kinyarwanda sentence pairs. You are also tasked with finding additional data to clean.

You will need to submit your cleaning script or notebook, the cleaned data, and documentation. The documentation can be a Word, PDF, or slide-show document that gives a high-level overview of the steps you used to clean the dataset. You will need to report the number of sentences cleaned and comment on each of the points below.

Your solution can be as creative and innovative as you like. Submissions will be evaluated within 7 working days of the close of the competition by a panel of judges on the following criteria:

  • Number of sentences created and cleaned
  • Standardizing (case normalization) and Spell Check
  • Removing & Finding Email ID
  • Removing Unicode Characters, Punctuation, script and HTML tags, etc.
  • Removing & Finding URL
  • Removing English and other foreign languages from Kinyarwanda
  • Reproducibility: nothing hard coded; the logic should be in functions
  • Efficiency - how long did your code take to run and memory usage
  • Complexity and Overall package: Solution is logical and ideas are in the form of functions that can be applied to whole corpus.
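
As an illustration of what several of the criteria above could look like as reusable functions, here is a minimal Python sketch. It is not the organisers' reference solution: the function names, the regular expressions, and the choice to drop exact duplicate pairs are all assumptions made for this example. Case normalization, spell checking, and foreign-language detection are left out because they depend on tools and decisions the challenge text does not specify.

```python
import html
import re
import unicodedata

# Illustrative patterns only (assumptions): real corpora may need stricter or looser rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def find_urls(text):
    """Return every URL-like substring, so the count can be reported."""
    return URL_RE.findall(text)


def find_emails(text):
    """Return every email-like substring, so the count can be reported."""
    return EMAIL_RE.findall(text)


def strip_markup(text):
    """Drop script/style blocks with their content, then other tags, then decode entities."""
    text = SCRIPT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return html.unescape(text)


def strip_control_chars(text):
    """Remove characters in Unicode 'C' categories (control, format, ...), keeping plain whitespace."""
    return "".join(
        ch for ch in text
        if ch in "\t\n " or not unicodedata.category(ch).startswith("C")
    )


def clean_sentence(text):
    """Apply the cleaning steps to one sentence and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = strip_markup(text)
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = strip_control_chars(text)
    return SPACE_RE.sub(" ", text).strip()


def clean_pairs(pairs):
    """Clean (english, kinyarwanda) pairs; skip pairs left empty and case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for en, rw in pairs:
        en, rw = clean_sentence(en), clean_sentence(rw)
        if not en or not rw:
            continue
        key = (en.casefold(), rw.casefold())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((en, rw))
    return cleaned
```

Because each step is a separate function, the same pipeline can be applied to the whole corpus, and the find_* helpers make it straightforward to report how many URLs or email addresses were found before removal.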

Please label your files:

username_submission_XXX

Where XXX is a unique ID to identify when your submission was made.

Our intention is that the datasets are kept free and open for public use under a Creative Commons license 4.0 or similar. If your dataset wins, by accepting the prize, you thereby agree to making the dataset publicly available under a Creative Commons license 4.0 or similar. All other datasets that did not win will similarly be encouraged to share their datasets as a public good.

If two datasets are identical, the tie-breaker will be the date and time at which the submission was made (the earlier solution will win).

Prizes

1st place: $1 750

2nd place: $1 050

3rd place: $700

Timeline

The competition closes on 7 August 2022.

Final submissions must be received by 11:59 PM GMT.

We reserve the right to update the contest timeline if necessary.

Rules

Teams and collaboration

You may participate in competitions as an individual or in a team of up to four people. When creating a team, the team must have a total submission count less than or equal to the maximum allowable submissions as of the formation date. A team will be allowed the maximum number of submissions for the competition, minus the total number of submissions among team members at team formation. Prizes are transferred only to the individual players or to the team leader.

Multiple accounts per user are not permitted, and neither is collaboration or membership across multiple teams. Individuals and their submissions originating from multiple accounts will be immediately disqualified from the platform.

Code must not be shared privately outside of a team. Any code that is shared must be made available to all competition participants through the platform (i.e. on the discussion boards).

The Zindi user who sets up a team is the default Team Leader. The Team Leader can invite other data scientists to their team. Invited data scientists can accept or reject invitations. Until a second data scientist accepts an invitation to join a team, the data scientist who initiated a team remains an individual on the leaderboard. No additional members may be added to teams within the final 5 days of the competition or the last hour of a hackathon, unless otherwise stated in the competition rules.

A team can be disbanded if it has not yet made a submission. Once a submission is made, individual members cannot leave the team.

Submissions and winning

You may make a maximum of 1 submission per day.

You may make a maximum of 5 submissions for this competition.

There is no public/private leaderboard.

If your solution places 1st, 2nd, or 3rd on the final leaderboard, you will be required to submit your winning solution code to us for verification, and you thereby agree to assign all worldwide rights of copyright in and to such winning solution to Zindi.

If two solutions earn identical scores, the tiebreaker will be the date and time at which the submission was made (the earlier solution will win).

The winners will be paid via bank transfer, PayPal, or other international money transfer platform. International transfer fees will be deducted from the total prize amount, unless the prize money is under $500, in which case the international transfer fees will be covered by Zindi. In all cases, the winners are responsible for any other fees applied by their own bank or other institution for receiving the prize money. All taxes imposed on prizes are the sole responsibility of the winners. The top 3 winners or team leaders will be required to present Zindi with proof of identification, proof of residence and a letter from your bank confirming your banking details. Winners will be paid in USD or the currency of the competition. If your account cannot receive US Dollars or the currency of the competition then your bank will need to provide proof of this and Zindi will try to accommodate this.

Payment will be made after code review.

You acknowledge and agree that Zindi may, without any obligation to do so, remove or disqualify an individual, team, or account if Zindi believes that such individual, team, or account is in violation of these rules. Entry into this competition constitutes your acceptance of these official competition rules.

Zindi is committed to providing solutions of value to our clients and partners. To this end, we reserve the right to disqualify your submission on the grounds of usability or value. This includes but is not limited to the use of data leaks or any other practices that we deem to compromise the inherent value of your solution.

Zindi also reserves the right to disqualify you and/or your submissions from any competition if we believe that you violated the rules or violated the spirit of the competition or the platform in any other way. The disqualifications are irrespective of your position on the leaderboard and completely at the discretion of Zindi.

Please refer to the FAQs and Terms of Use for additional rules that may apply to this competition. We reserve the right to update these rules at any time.

Consequences of breaking any rules of the competition or submission guidelines:

  • First offence: No prizes for 6 months and 2000 points will be removed from your profile (probation period). If you are caught cheating, all individuals involved in cheating will be disqualified from the challenge(s) you were caught in and you will be disqualified from winning any competitions for the next six months and 2000 points will be removed from your profile. If you have fewer than 2000 points on your profile, your points will be set to 0.
  • Second offence: Banned from the platform. If you are caught for a second time your Zindi account will be disabled and you will be disqualified from winning any competitions or Zindi points using any other account.
  • Teams with individuals who are caught cheating will not be eligible to win prizes or points in the competition in which the cheating occurred, regardless of the individuals’ knowledge of or participation in the offence.
  • Teams with individuals who have previously committed an offence will not be eligible for any prizes for any competitions during the 6-month probation period.