Next-generation NLP projects like BERT and GPT-3 show that the sky's the limit when it comes to using text data for machine learning. But working with text data brings its own set of challenges, not least of which is the language barrier. In this article, we explore tools for handling text data in different languages.
“In the face of adversity, we have a choice. We can be bitter, or we can be better. Those words are my North Star.”- Caryn Sullivan
Imagine: you, a data scientist, are assigned to work on an NLP project analysing what people post on social media (e.g. Twitter) about COVID-19. One of your first tasks is to find the different hashtags for COVID-19 (e.g. #covid19) and then start collecting all tweets related to COVID-19.
When you start to analyze the collected COVID-19 data, you find that it was generated in different languages from around the world, such as English, Swahili, Spanish, Chinese and Hindi. In this case, you have two problems to solve before you can start analysing the dataset: first, identify the language of each piece of data, and second, translate the data into the language of your choice (e.g. all data should be in English).
So how can we solve these two problems?
First Problem: Language Detection
The first problem is detecting the language of a particular piece of text. For this, you can use a simple Python package called langdetect.
langdetect is a simple Python package developed by Michal Danilák that supports detection of 55 languages out of the box (ISO 639-1 codes):
af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he,
hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl,
pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw
Install langdetect
To install langdetect, run the following command in your terminal:
pip install langdetect
To detect the language of a text, e.g. "Tanzania ni nchi inayoongoza kwa utalii barani afrika", first import the detect method from langdetect, then pass the text to the method:
from langdetect import detect
sentence = "Tanzania ni nchi inayoongoza kwa utalii barani afrika"
print(detect(sentence))
Output: “sw”
The method detects that the provided text is in Swahili ('sw').
You can also find the probabilities of the top languages by using the detect_langs method:
from langdetect import detect_langs
sentence = "Tanzania ni nchi inayoongoza kwa utalii barani afrika"
print(detect_langs(sentence))
Output: [sw:0.9999971710531397]
NOTE: The language detection algorithm is non-deterministic; if you run it on text that is too short or too ambiguous, you may get different results each time.
Call the following code before language detection in order to enforce consistent results.
from langdetect import DetectorFactory
DetectorFactory.seed = 0
Now you can detect the language of any text in your data using the langdetect Python package.
Second Problem: Language Translation
The second problem is translating text from one language into the language of your choice. For this, you will use another useful Python package called google_trans_new.
google_trans_new is a free and unlimited Python package that implements the Google Translate API. It also performs automatic language detection.
Install google_trans_new
To install google_trans_new, run the following command in your terminal:
pip install google_trans_new
Basic example
To translate text from one language to another, import the google_translator class from the google_trans_new module, create an object of the class, then pass the text to the translate method and specify the target language with the lang_tgt parameter, e.g. lang_tgt='en'.
from google_trans_new import google_translator
translator = google_translator()
sentence = "Tanzania ni nchi inayoongoza kwa utalii barani afrika"
translate_text = translator.translate(sentence, lang_tgt='en')
print(translate_text)
In the example above, we translated a Swahili sentence into English. Here is the output after translation:
Tanzania is the leading tourism country in Africa
By default, the translate() method detects the language of the provided text and returns its English translation. If you want to specify the source language of the text explicitly, use the lang_src parameter.
Here are all the language names along with their shorthand notations:
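When presenting results, the two-letter code alone can be cryptic, so it helps to map codes to readable names. Here is a small illustrative sketch (the helper name and the trimmed dictionary are my own; the full code-to-name table is listed below):

```python
# A small subset of the code-to-name mapping shown in the full table;
# the helper name is illustrative, not part of either library's API.
LANG_NAMES = {
    'en': 'english', 'sw': 'swahili', 'es': 'spanish',
    'zh-cn': 'chinese (simplified)', 'hi': 'hindi',
}

def code_to_name(code):
    """Return a readable language name, falling back to the raw code."""
    return LANG_NAMES.get(code, code)

print(code_to_name('sw'))   # swahili
print(code_to_name('xx'))   # xx (unknown code falls through unchanged)
```

Falling back to the raw code keeps the helper safe when the detector returns something outside the table.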
{'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 'kk': 'kazakh', 'km': 'khmer', 'ko': 'korean', 'ku': 'kurdish (kurmanji)', 'ky': 'kyrgyz', 'lo': 'lao', 'la': 'latin', 'lv': 'latvian', 'lt': 'lithuanian', 'lb': 'luxembourgish', 'mk': 'macedonian', 'mg': 'malagasy', 'ms': 'malay', 'ml': 'malayalam', 'mt': 'maltese', 'mi': 'maori', 'mr': 'marathi', 'mn': 'mongolian', 'my': 'myanmar (burmese)', 'ne': 'nepali', 'no': 'norwegian', 'ps': 'pashto', 'fa': 'persian', 'pl': 'polish', 'pt': 'portuguese', 'pa': 'punjabi', 'ro': 'romanian', 'ru': 'russian', 'sm': 'samoan', 'gd': 'scots gaelic', 'sr': 'serbian', 'st': 'sesotho', 'sn': 'shona', 'sd': 'sindhi', 'si': 'sinhala', 'sk': 'slovak', 'sl': 'slovenian', 'so': 'somali', 'es': 'spanish', 'su': 'sundanese', 'sw': 'swahili', 'sv': 'swedish', 'tg': 'tajik', 'ta': 'tamil', 'te': 'telugu', 'th': 'thai', 'tr': 'turkish', 'uk': 'ukrainian', 'ur': 'urdu', 'uz': 'uzbek', 'vi': 'vietnamese', 'cy': 'welsh', 'xh': 'xhosa', 'yi': 'yiddish', 'yo': 'yoruba', 'zu': 'zulu', 'fil': 'Filipino', 'he': 'Hebrew'}

I have created a simple Python function that can both detect and translate text into the language of your choice.
from langdetect import detect
from google_trans_new import google_translator

# Simple function to detect and translate text
def detect_and_translate(text, target_lang):
    result_lang = detect(text)
    if result_lang == target_lang:
        return text
    else:
        translator = google_translator()
        translate_text = translator.translate(text, lang_src=result_lang, lang_tgt=target_lang)
        return translate_text
The Python function receives a text and a target language as parameters. It then detects the language of the provided text; if it matches the target language, the function returns the text unchanged, otherwise it translates the text into the target language.
Example:
sentence = "I hope that, when I've built up my savings, I'll be able to travel to Mexico"
print(detect_and_translate(sentence,target_lang='sw'))
In the source code above, we translate the sentence into Swahili. Here is the output:
Natumai kwamba, nitakapojiwekea akiba, nitaweza kusafiri kwenda Mexico
In this article, you have learned how to solve two language challenges that arise when you have text data in different languages and want to translate it into a single language of your choice. Congratulations 👏, you have made it to the end of this article!
You can download the notebook used in this article here
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post! I can also be reached on Twitter @Davis_McDavid.
About the author
Davis David is Zindi Ambassador for Tanzania and a data scientist at ParrotAI. He is passionate about artificial intelligence, machine learning, deep learning and big data. He is a co-organizer and facilitator of the AI movement in Tanzania; conducting AI meetups, workshops and events with a passion to build a community of data scientists to solve local problems.