Approximately 55 parallel Kinyarwanda-English sentences will be provided for data cleaning. This data is intended to help you get started on the process.
You are tasked with finding additional data sources of parallel Kinyarwanda-English sentences. You must clearly document where and how you downloaded the data; however, it is preferable to load the data directly into your script through an API.
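As an illustration of loading data directly in the script while keeping a record of where it came from, here is a minimal Python sketch. It assumes the source serves UTF-8 tab-separated text with one Kinyarwanda-English pair per line, which may not match your source. The function names `read_pairs` and `fetch_pairs` and the record keys are hypothetical, and any URL you pass in is your own.

```python
import csv
import io
import urllib.request
from datetime import datetime, timezone
from typing import Dict, Iterator, TextIO


def read_pairs(stream: TextIO, source: str) -> Iterator[Dict[str, str]]:
    """Yield one record per well-formed line, tagged with its provenance.

    Assumption: each line is "<kinyarwanda>TAB<english>".
    """
    retrieved = datetime.now(timezone.utc).isoformat()
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip lines that do not have exactly two columns
        yield {"rw": row[0], "en": row[1], "source": source, "retrieved": retrieved}


def fetch_pairs(url: str) -> Iterator[Dict[str, str]]:
    """Stream pairs from a URL (hypothetical endpoint) without saving to disk."""
    with urllib.request.urlopen(url) as response:
        text = io.TextIOWrapper(response, encoding="utf-8", newline="")
        yield from read_pairs(text, source=url)
```

Storing the source and retrieval time next to every pair means the documentation of where and how the data was obtained is produced by the script itself.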
GIZ is particularly interested in domain-specific text data from fields such as health, agriculture, and tourism. We recommend that you focus on one field; you can even specialize in a specific subfield if you’d like, as both the quantity and the quality of the data contribute to a strong model.
Please do not use the JW300 datasets as this may skew the distribution of data across fields.
The objective of this challenge is to create a script that will clean Kinyarwanda-English parallel sentences.
You are encouraged to use a rules-based approach along with machine learning if you think it is applicable. Remember to consider your script’s efficiency and memory usage during execution.
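A rules-based pass of the kind mentioned above could look like the following Python sketch. The specific rules and thresholds (`max_ratio`, `max_chars`) are illustrative assumptions, not requirements of the challenge, and the function names are hypothetical. Pairs are streamed through a generator so that the full corpus never needs to be held in memory at once.

```python
import re
import unicodedata
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (Kinyarwanda, English)

_WHITESPACE = re.compile(r"\s+")
_HAS_LETTER = re.compile(r"[^\W\d_]")  # any Unicode letter


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return _WHITESPACE.sub(" ", unicodedata.normalize("NFC", text)).strip()


def clean_pairs(
    pairs: Iterable[Pair],
    max_ratio: float = 3.0,   # assumed threshold, tune on your data
    max_chars: int = 500,     # assumed threshold, tune on your data
) -> Iterator[Pair]:
    """Lazily yield pairs that pass a few simple rule-based filters."""
    seen = set()
    for rw, en in pairs:
        rw, en = normalize(rw), normalize(en)
        if not rw or not en:
            continue  # one side is empty
        if not _HAS_LETTER.search(rw) or not _HAS_LETTER.search(en):
            continue  # digits or punctuation only
        if rw.casefold() == en.casefold():
            continue  # likely an untranslated copy (may also drop proper nouns)
        if len(rw) > max_chars or len(en) > max_chars:
            continue  # probably not a single sentence
        longer, shorter = max(len(rw), len(en)), min(len(rw), len(en))
        if longer / shorter > max_ratio:
            continue  # lengths too different to be a plausible translation
        key = (rw.casefold(), en.casefold())
        if key in seen:
            continue  # duplicate pair, ignoring case
        seen.add(key)
        yield rw, en
```

The only state that grows with the input is the `seen` set used for deduplication; for very large corpora, storing a hash of each pair instead of the text itself would be one way to reduce its footprint.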
You are welcome to create a machine translation model, but it will not add to your final score.