Word embeddings

See the tutorial. Below is the genesis of this tutorial.

In 2019, I was just a programmer. As a reminder, in the engineering school where I studied [1], we first complete two intense years of preparatory classes (with lots of mathematics and physical sciences, algorithms, etc.). Then everyone chooses a department (Computer Engineering, Electrical Engineering, Mechanical Engineering, Civil Engineering, Telecommunications Engineering, Industrial Engineering). I chose Computer Engineering and was already in the 2nd year of my engineering training (i.e. my 4th year at university).

When James Assiene Moudie [2] launched the first edition of MLPC (Machine Learning Project Competition) [3], I had the idea of building a system to automatically translate local Cameroonian languages. Unfortunately, at that time, the closest we had come to Artificial Intelligence at school was:

  • September (Fall) 2016 to July (Summer) 2018

    In general, all our mathematics training was useful for understanding machine learning theory: Real Analysis, Linear Algebra, Euclidean Affine Geometry, Probability and Statistics, Series and Generalized Integrals, Multilinear Algebra, Curves and Surfaces, Analysis in Finite-Dimensional Vector Spaces, Numerical Analysis.

    Note: NDONG NGUEMA Eugène Patrice, the teacher who taught us Series and Generalized Integrals (Fall 2017) and Numerical Analysis (Winter & Summer 2017), is the best teacher I have ever known in my life. Beyond that, he is a genius. Unfortunately, he was born and teaches in Cameroon.

  • September (Fall) 2018 to July (Summer) 2019
    • Formal Systems and Foundations of Artificial Intelligence
    • Mathematical Tools for Computer Science …
    • Information Science: (Shannon) entropy, (Huffman, …) coding, …
    • Basic mathematics: measure theory, Laplace transform …
  • September (Fall) 2019 to July (Summer) 2020:

    • Data Analysis, Theory and Practice (with Wilson Toussile, Fall 2019):
      • Statistical Learning: parametric estimation (maximum likelihood estimator, …), confidence intervals, …
      • Supervised, unsupervised and semi-supervised learning formalism
      • Bayesian classification, linear and quadratic regression, bias-variance risk decomposition, cross-validation, k-nearest neighbors, k-means clustering
      • Hierarchical clustering (agglomerative hierarchical clustering, …), hard clustering, fuzzy clustering, similarity and dissimilarity measures, clustering with mixture models (Gaussian case), the EM algorithm.
    • Artificial Intelligence and Applications: Multi-Agent Systems (really old school), …
    • Grammars and Languages: Chomsky hierarchy of grammars, Canonical automata, etc…
  • September-December (Fall) 2020:
    • Advanced Machine Learning: there was nothing advanced about it; the professor simply took Ian Goodfellow, Yoshua Bengio, and Aaron Courville's book and explained it less clearly than the book itself.
    • Image processing, GIS and WebMapping
    • Data mining: unfortunately, Professor Henri Gwet, who taught us this class (a good teacher), passed away a few months after the end of the session.

Note: Unlike in Montreal (UdeM), where I am expected to take 2 to 3 courses per session in the Fall and Winter semesters (teachers rarely offer courses in Summer), i.e. 4 to 6 courses per year (I generally take 4 courses per session to learn faster), in Cameroon we had about 20 courses per school year, with no elective courses like here in Montreal: we took all the courses.

Note: I have listed only the courses that were directly related to my learning of machine learning. Here is the complete list of courses I took at NASEY [1].

So, in 2019, I already knew how to make computers understand vectors over the field of real numbers (linear regression, …), but I didn't know how to make them understand text. That's where I got my first taste of NLP: first with GloVe, Word2Vec, bag-of-words and TF-IDF, and later with BERT and its variants (BERT had been released not long before and was creating a lot of buzz).
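
To make that jump from vectors to text concrete, here is a minimal sketch of the two simplest representations mentioned above, bag-of-words and TF-IDF, using scikit-learn; the toy corpus and variable names are purely illustrative, not taken from the project.

```python
# Minimal sketch: turning text into vectors with bag-of-words and TF-IDF.
# The toy corpus below is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]

# Bag of words: each document becomes a vector of raw word counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X_bow.toarray())

# TF-IDF: counts are re-weighted so that words frequent in one document
# but rare across the corpus receive a higher weight.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```

Unlike GloVe or Word2Vec, these vectors carry no notion of word similarity; that limitation is exactly what dense word embeddings address.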

While the first approaches (GloVe, etc.) were quick to understand, teaching myself how the Transformer works, to the point of being able to implement it, was not an easy task.

Note: I didn't like following tutorials because I found most of them ineffective. I preferred to read the papers: they are difficult for a beginner, but once you have understood something by reading the paper, you have really understood it (whereas you can follow 10 tutorials on a notion and still never understand it).

Faced with the difficulty of understanding how the Transformer works:

  • I went back to read how the (vanilla) attention mechanism works in the paper Neural Machine Translation by Jointly Learning to Align and Translate, Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio, ICLR 2015 (see the short sketch after this list).
  • I read Minh-Thang Luong’s PhD thesis: Neural Machine Translation, Stanford University, December 2016.
  • The Deep Learning book, Ian Goodfellow, Yoshua Bengio, Aaron Courville
  • Coursera’s Machine Learning and NLP specialization
  • Here is the document that lists all the papers I read during this period (roughly from 2019 to 2022), and that helped me get started first in NLP, then in machine learning in general.
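
As a rough reminder of what Bahdanau et al. describe, here is a minimal NumPy sketch of additive (vanilla) attention; the function name, dimensions, and randomly initialized weights are illustrative assumptions, not the setup from the paper.

```python
# Minimal sketch of additive (Bahdanau-style) attention, for illustration only.
import numpy as np

def additive_attention(s, H, W_a, U_a, v_a):
    """Return the context vector and attention weights.

    s   : (d_dec,)      previous decoder hidden state
    H   : (T, d_enc)    encoder hidden states, one row per source position
    W_a : (d_att, d_dec), U_a : (d_att, d_enc), v_a : (d_att,)  learned parameters
    """
    # Alignment scores e_j = v_a^T tanh(W_a s + U_a h_j), one per source position.
    scores = np.tanh(s @ W_a.T + H @ U_a.T) @ v_a        # shape (T,)
    # Softmax turns the scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The context vector is the weighted sum of encoder states.
    context = weights @ H                                  # shape (d_enc,)
    return context, weights

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 8, 8, 16
context, weights = additive_attention(
    rng.normal(size=d_dec), rng.normal(size=(T, d_enc)),
    rng.normal(size=(d_att, d_dec)), rng.normal(size=(d_att, d_enc)),
    rng.normal(size=d_att),
)
print(weights.round(3), context.shape)
```

The Transformer replaces this recurrent, additive scoring with scaled dot-product attention applied in parallel over all positions, which is where most of my difficulty came from.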

During this learning period, I wrote this tutorial on word embeddings.

Back to 2020: I completed my project, and we published our paper: On the use of linguistic similarities to improve Neural Machine Translation for African Languages. “… We propose a new dataset for African languages that provides parallel data for vernaculars not present in commonly used dataset like JW300. To exploit multilingualism, we first use a historical approach based on historical origins of these languages, their morphologies, their geographical and cultural distributions as well as migrations of population to identify similar vernaculars. We also propose a new metric to automatically evaluate similarities between languages. This new metric does not require word level parallelism like traditional methods but only paragraph level parallelism. We then show that performing Masked Language Modelling and Translation Language Modeling in addition to multi-task learning on a cluster of similar languages leads to a strong boost of performance in translating individual pairs inside this cluster. In particular, we record an improvement of 29 BLEU on the pair Bafia-Ewondo using our approaches compared to previous work methods that did not exploit multilingualism in any way…”

I also did an internship at WL Research from July to December 2020, with Mohamed Hassan Kane [4]. We developed and deployed an ML solution that reviews end-user license agreements (EULAs) for terms and conditions that are unacceptable to the government. I also worked on supervised ML with derivatives (Sobolev training, differential ML, SIREN, …).

I made my first contact with Mila in February 2021. Many thanks to Dianbo Liu [5], who introduced me to research.

References

[1] National Advanced School of Engineering Yaounde (NASEY), Cameroon

[2] James Assiene Moudie is a Research Engineer at DeepMind today (2023).

[3] MLPC, Machine Learning Project Competition: NASEY students must use machine learning to solve a local problem in Africa.

[4] Mohamed Hassan Kane: see the video “Independent AI Research in Africa: What role for the diaspora?”

[5] Dianbo Liu was a postdoctoral researcher with Prof. Yoshua Bengio and led the Humanitarian AI team at the Mila-Quebec AI Institute.