Simple Machine Translation on low resource languages: French to Fongbe & Ewe
Technical · 27 Jan 2022, 11:55 · 6 mins read

Using a simple transformer to translate French into Ewe and Fongbe on the Deepnote platform.

Introduction

Ewe and Fongbe are Niger–Congo languages, both part of a cluster of related languages commonly called Gbe. Fongbe is the major Gbe language of Benin (with approximately 4.1 million speakers), while Ewe is spoken in Togo and southeastern Ghana by approximately 4.5 million people as a first language, and by a million others as a second language. They are closely related tonal languages, and both contain diacritics (accents on letters) that can make them difficult to study, understand, and translate. For more information go to Zindi.

Objectives

The objective of this challenge is to create a machine translation system capable of converting text from French into Fongbe or Ewe. I will be using the same model architecture to train and translate both datasets, which keeps processing power and memory requirements manageable.

Simple Transformers

This library is based on the Transformers library by Hugging Face. Simple Transformers lets you quickly train and evaluate Transformer models: only three lines of code are needed to initialize, train, and evaluate a model. For more information, visit the GitHub repo.

Supports:

  • Sequence Classification
  • Token Classification (NER)
  • Question Answering
  • Language Model Fine-Tuning
  • Language Model Training
  • Language Generation
  • T5 Model
  • Seq2Seq Tasks
  • Multi-Modal Classification
  • Conversational AI
  • Text Representation Generation
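
To illustrate the "three lines of code" claim, here is a minimal, hypothetical sequence-classification sketch; the toy data, model name, and argument values are placeholders and not part of the competition code:

from simpletransformers.classification import ClassificationModel
import pandas as pd

# Hypothetical toy dataset: two labelled sentences (illustrative only)
toy_df = pd.DataFrame(
    [["great movie", 1], ["terrible movie", 0]],
    columns=["text", "labels"])

clf = ClassificationModel("bert", "bert-base-cased", use_cuda=False,
                          args={"overwrite_output_dir": True, "num_train_epochs": 1})  # 1. initialize
clf.train_model(toy_df)                                                                # 2. train
result, model_outputs, wrong_preds = clf.eval_model(toy_df)                            # 3. evaluate

The same initialize / train / evaluate pattern is what we use below with the Seq2Seq model for translation.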

Installing and loading libraries

!pip install simpletransformers
!pip install fsspec==2021.5.0

import logging
import pandas as pd
from sklearn.model_selection import train_test_split
from simpletransformers.seq2seq import Seq2SeqModel,Seq2SeqArgs
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

Data

I used only 35k samples so that my GPU doesn't run out of memory, and I used the original data without preprocessing.

df = pd.read_csv("Train.csv")[0:35000]
test = pd.read_csv("Test.csv")
Clean = False  # whether to clean the text

The cleaning function was useful for English translation, but in this case cleaning the data gives poorer results, so it is turned off.

import re
import string

if Clean:
    # converting every letter to lower case
    df["Target"] = df["Target"].apply(lambda x: str(x).lower())
    df["French"] = df["French"].apply(lambda x: str(x).lower())

    # removing apostrophes from the sentences
    df["Target"] = df["Target"].apply(lambda x: re.sub("'", "", x))
    df["French"] = df["French"].apply(lambda x: re.sub("'", "", x))

    # removing all punctuation
    exclude = set(string.punctuation)
    df["Target"] = df["Target"].apply(lambda x: "".join(ch for ch in x if ch not in exclude))
    df["French"] = df["French"].apply(lambda x: "".join(ch for ch in x if ch not in exclude))

    # removing digits from the sentences
    digit = str.maketrans("", "", string.digits)
    df["Target"] = df["Target"].apply(lambda x: x.translate(digit))
    df["French"] = df["French"].apply(lambda x: x.translate(digit))

Divide the train and test data into separate data frames based on the target language.

Fon = df[df.Target_Language=="Fon"]
Ewe = df[df.Target_Language=="Ewe"]
Fon_test = test[test.Target_Language=="Fon"]
Ewe_test = test[test.Target_Language=="Ewe"]

Training Fongbe Model

Using the Simple Transformers Seq2Seq model, I downloaded Helsinki-NLP/opus-mt-en-mul, which worked best in our case, and used Seq2SeqArgs to set the model arguments.

Arguments:

  • num_train_epochs = 30
  • batch_size = 32
  • max_length = 120
  • src_lang = "fr"
  • tgt_lang = "fon"
  • overwrite_output_dir = True

Train / Eval split

train_data = Fon[["French","Target"]]

train_data = train_data.rename(columns={"French":"input_text","Target":"target_text"})
train_df, eval_df = train_test_split(train_data, test_size=0.2, random_state=42)

Arguments:

I experimented with multiple model arguments to find the combination that produced the best results.

model_args = Seq2SeqArgs()
model_args.num_train_epochs = 30
model_args.no_save = True
model_args.evaluate_generated_text = False
model_args.evaluate_during_training = False
model_args.evaluate_during_training_verbose = True
model_args.rag_embed_batch_size = 32
model_args.max_length = 120
model_args.src_lang = "fr"
model_args.tgt_lang = "fon"
model_args.overwrite_output_dir = True

Initializing Model

model_fon = Seq2SeqModel(
    encoder_decoder_type="marian",
    encoder_decoder_name="Helsinki-NLP/opus-mt-en-mul",
    args=model_args,
    use_cuda=True)

Evaluation Metric

def count_matches(labels, preds):
    print(labels)
    print(preds)
    return sum(
        [
            1 if label == pred else 0
            for label, pred in zip(labels, preds)
        ]
    )

Training Model

model_fon.train_model(
    train_df, eval_data=eval_df, matches=count_matches)

Single Prediction

A quick single prediction shows the model is performing well.

# Use the model for prediction
print(
    model_fon.predict(
        [Fon_test["French"].values[25]]
    )
)

Predicting Fongbe test data and saving

Fon_test["Target"] = model_fon.predict(list(Fon_test["French"].values))
Fon_test[["ID","Target"]].to_csv("Fon.csv",index=False)

Saving the Fon model

import torch
torch.save(model_fon, 'model_fon.pkl')
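
This pickles the entire Seq2SeqModel wrapper. To reuse it later, a minimal sketch, assuming the simpletransformers package is installed in the loading environment (the example sentence is a made-up placeholder):

import torch

# Load the pickled wrapper back into memory
model_fon = torch.load('model_fon.pkl')

# Hypothetical French sentence, just to check the reloaded model responds
print(model_fon.predict(["Bonjour à tous"]))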

Training Ewe Model

As with Fongbe, I used the Simple Transformers Seq2Seq model with Helsinki-NLP/opus-mt-en-mul, which worked best in our case, and Seq2SeqArgs to set the model arguments.

Arguments:

  • num_train_epochs = 30
  • batch_size = 32
  • max_length = 120
  • src_lang = "fr"
  • tgt_lang = "ewe"
  • overwrite_output_dir = True

Train / Eval split

train_data = Ewe[["French","Target"]]
train_data = train_data.rename(columns={"French":"input_text","Target":"target_text"})
train_df, eval_df = train_test_split(train_data, test_size=0.20, random_state=42)

Model Arguments

I experimented with multiple model arguments to find the combination that produced the best results.

model_args = Seq2SeqArgs()
model_args.num_train_epochs = 30
model_args.no_save = True
model_args.evaluate_generated_text = False
model_args.evaluate_during_training = False
model_args.evaluate_during_training_verbose = True
model_args.rag_embed_batch_size = 32
model_args.max_length = 120
model_args.src_lang = "fr"
model_args.tgt_lang = "ewe"
model_args.overwrite_output_dir = True

Initializing Ewe model

model_ewe = Seq2SeqModel(
    encoder_decoder_type="marian",
    encoder_decoder_name="Helsinki-NLP/opus-mt-en-mul",
    args=model_args,
    use_cuda=True)

Training model

model_ewe.train_model(
    train_df, eval_data=eval_df, matches=count_matches)

Predicting and saving CSV

Ewe_test["Target"] = model_ewe.predict(list(Ewe_test["French"].values))
Ewe_test[["ID","Target"]].to_csv("Ewe.csv",index=False)

Saving model

torch.save(model_ewe, 'model_ewe.pkl')

Joining both MT predicted translations

ewe = pd.read_csv('Ewe.csv')
fon = pd.read_csv('Fon.csv')
fr_to_targ_lang_sub = pd.concat([ewe, fon])
fr_to_targ_lang_sub.head()

Creating Submission file

fr_to_targ_lang_sub.to_csv(
    "submission.csv", index=False)

Leaderboard

The evaluation metric for this competition is the ROUGE score, specifically ROUGE-1 (unigram) scoring, reporting the F-measure.
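
As a rough illustration of how a ROUGE-1 F-measure is computed, a minimal sketch using the rouge_score package (not part of the competition code; the strings are placeholders):

from rouge_score import rouge_scorer

# ROUGE-1 compares unigram overlap between a reference and a prediction
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=False)
scores = scorer.score("reference translation text", "predicted translation text")
print(scores["rouge1"].fmeasure)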

Final thoughts

Machine translation is underrated in the world of NLP because Google Translate and other giants have made translation seem like a solved problem, yet they don't offer all languages; some low-resource languages don't make it in at all. This was an interesting journey: I started with a seq2seq model with attention, moved on to transformers, and then stumbled upon the Helsinki-NLP language models. I had issues with memory and GPU, and even then my score was not improving, as there were many limitations in processing huge amounts of data. After spending a month on this, I stumbled upon Simple Transformers, which was created to make fine-tuning simple and easy for everyone. In the end, I was happy with the result, having found an effective and fast way to train and predict low-resource languages from Africa.

Code available on GitHub and Deepnote.

About the author

Abid Ali Awan is a certified data scientist professional who loves building machine learning models and blogging about the latest AI technologies. He is currently testing AI products at PEC-PITC, and he has recently participated in 60+ competitions, ranging from data analytics to machine learning, success he credits to his creativity in dealing with challenges. You can reach him on LinkedIn and Polywork.

Read the original article here.
