On the use of linguistic similarities to improve Neural Machine Translation for African Languages
Pascal Jr. Tikeng Notsawo, Brice Nanda, James Assiene, 5th Black in AI Workshop @ NeurIPS, 2021.
In recent years, there has been a resurgence in research on empirical methods for machine translation. Most of this research has been focused on high-resource, European languages. Despite the fact that around 30% of all languages spoken worldwide are African, the latter have been heavily under investigated and this, partly due to the lack of public parallel corpora online. Furthermore, despite their large number (more than 2,000) and the similarities between them, there is currently no publicly available study on how to use this multilingualism (and associated similarities) to improve machine translation systems performance on African languages. So as to address these issues: We propose a new dataset for African languages that provides parallel data for vernaculars not present in commonly used dataset like JW300. To exploit multilingualism, we first use a historical approach based on historical origins of these languages, their morphologies, their geographical and cultural distributions as well as migrations of population to identify similar vernaculars. We also propose a new metric to automatically evaluate similarities between languages. This new metric does not require word level parallelism like traditional methods but only paragraph level parallelism. We then show that performing Masked Language Modelling and Translation Language Modeling in addition to multi-task learning on a cluster of similar languages leads to a strong boost of performance in translating individual pairs inside this cluster. In particular, we record an improvement of 29 BLEU on the pair Bafia-Ewondo using our approaches compared to previous work methods that did not exploit multilingualism in any way.