← Tutorials

Word embeddings

I wrote this in 2020, in my early days in machine learning. I prefer to leave the words and the tutorial as I wrote them back then. You can read here what led me to machine learning research: and my story.

Code : https://github.com/Tikquuss/word_embeddings

GloVe

GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. The GloVe model learns word vectors by examining word co-occurrences within a text corpus.

Let the matrix of word-word co-occurrence counts be denoted by XX, whose entries XijX_{ij} tabulate the number of times word jj occurs in the context of word ii. Let Xi=kXikX_i = \sum_k X_{ik} be the number of times any word appears in the context of word ii. Finally, let Pij=Xij/XiP_{ij} = X_{ij}/X_i be the probability that word jj appear in the context of word ii.

The relationship between two words ii and jj can be examined by studying the ratio of their co-occurrence probabilities with various probe words, kk. For words kk related to ii but not jj, we expect the ratio Pik/PjkP_{ik}/P_{jk} will be large. Similarly, for words kk related to jj but not ii, the ratio should be small. For words kk that are either related to both ii and jj, or to neither, the ratio should be close to one.

An example relating to the concepts of thermodinamics is given in the original paper with i=icei = ice, j=steamj = steam and k {solid,gas,water,fashion }k \in \ \{solid, gas, water, fashion\ \}

The above argument suggests that the appropriate starting point for word vector learning should be with ratios of co-occurrence probabilities rather than the probabilities themselves. Noting that the ratio Pik/PjkP_{ik}/P_{jk} depends on three words ii, jj, and kk, the most general model takes the form F(wi,wj,wk~)=Pik/PjkF(w_i, w_j,\tilde{w_{k}}) = P_{ik}/P_{jk} where wRdw \in \mathbb{R}^d are word vectors and w~Rd\tilde{w} \in \mathbb{R}^d are separate context word vectors.

To enforce FF to encode the information present the ratio Pik/PjkP_{ik}/P_{jk} in the word vector space, the authors restrict FF to depend only on the difference of the two target words ii and jj, since vector spaces are inherently linear structures. To also avoid FF to obfuscate the linear structure we are trying to capture as it transforms vectors into scalars, the authors passed the dot product (wiwj)Tw~k(w_i - w_j)^T \tilde{w}_k as an FF parameter instead of wiwjw_i - w_j and w~k\tilde{w}_k themselves.

F((wiwj)Tw~k)=Pik/PjkF((w_i - w_j)^T \tilde{w}_k) = P_{ik}/P_{jk}  then \text{ then } F(wiTw~k)=Pik=Xik/Xi (A)F(w_i^T \tilde{w}_k) = P_{ik} = X_{ik}/X_i \text{ (A)}

The authors require that FF be a homomorphism between the groups (R,+)(\mathbb{R},+) and (R>0,×)(\mathbb{R}_{>0}, ×), i.e.,

F((wiwj)Tw~k)=F(wiTw~kwjTw~k)=F(wiTw~k)F(wjTw~k)F((w_i - w_j)^T \tilde{w}_k) = F(w_i^T \tilde{w}_k - w_j^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k)}{F(w_j^T \tilde{w}_k)}  then \text{ then } F=exp (B)F = exp \text{ (B)}

(A) and (B)wiTw~k=log(Pik)=log(Xik)log(Xi)\text{(A) and (B)} \Rightarrow w_i^T \tilde{w}_k = log(P_{ik}) = log(X_{ik}) - log(X_i)

We will then produce vectors with a soft constraint that for each word pair of word ii and word jj

wiTw~j+bi+b~j=logXijw_i^T \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}

where bib_i and b~j\tilde{b}_j are scalar bias terms associated with words ii and jj, respectively.

We’ll do this by minimizing an objective function JJ, which evaluates the sum of all squared errors based on the above equation, weighted with a function ff:

J=i=1Vj=1Vf(Xij)(wiTw~j+bi+b~jlogXij)2J=\sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij}) (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2

We choose an ff that helps prevents common word pairs (i.e., those with large XijX_{ij} values) from skewing our objective too much:

f(Xij)={(Xijxmax)αif Xij<xmax1otherwisef(X_{ij}) = \left\{ \begin{array}{ll} \bigg(\frac{X_{ij}}{x_{max}}\bigg)^{\alpha} & \text{if } X_{ij} \lt x_{max} \\ 1 & \text{otherwise} \end{array} \right.

When we encounter extremely common word pairs (where Xij>xmaxX_{ij} \gt x_{max}) this function will cut off its normal output and simply return 11. For all other word pairs, we return some weight in the range (0,1)(0,1), where the distribution of weights in this range is decided by α\alpha.

The authors use xmax=100 and α=3/4x_{max} = 100 \text{ and } \alpha = 3/4

Word2Vec

Word2vec is a group of related models that are used to produce word embeddings. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.

  • Continuous Bag of Words (CBOW) : In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption).

  • Continuous Skipgram (CS) : In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. This other architecture tries to guess neighboring words using the current word.

According to the authors, CBOW is faster while skip-gram is slower but does a better job for infrequent words.

For a sentence x=x1...xTx = x_1 ... x_T, we want to maximize the given likelihood :

L(θ)=t=1Tmjmfθ(wt,wt+j)\mathcal{L} (\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m } f_{\theta}(w_t, w_{t+j})

where :

fθ(wi,wj)={Pθ(wiwj)for CBOWPθ(wjwi)for CSf_{\theta}(w_i, w_j) = \left\{ \begin{array}{ll} P_{\theta} (w_i | w_j) & \text{for CBOW} \\ P_{\theta} (w_j | w_i) & \text{for CS} \end{array} \right.

Hence, our objective function can be the average negative log likelihood : J(θ)=1Tlog(L(θ))=1Tt=1Tmjmfθ(wt,wt+j)\mathcal{J} (\theta) = - \frac{1}{T} log(\mathcal{L} (\theta)) = - \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m } f_{\theta}(w_t, w_{t+j})

For each word ww, we define two different vector representation vwv_w and uwu_w :

  • vwv_w is used when ww is a center word
  • uwu_w is used when ww is a context word

Hence, for a center word cc and a context word oo :

Pθ(oc)=euoTvcwVeuwTvcP_{\theta} (o \| c) = \frac{e^{u_o^Tv_c}}{\sum_{w \in V} e^{u_w^Tv_c}}

Bag of words

  1. Find N most popular words in train corpus and numerate them. Now we have a dictionary of the most popular words.
  2. For each title in the corpora create a zero vector with the dimension equals to N.
  3. For each text in the corpora iterate over words which are in the dictionary and increase by 1 the corresponding coordinate.

Drawbacks :

  • vocabulary size
  • contain many 0s (thereby resulting in a sparse matrix)
  • We are retaining no information on the grammar of the sentences nor on the ordering of the words in the text.

Let’s try to do it for a toy example. Imagine that we have N = 4 and the list of the most popular words is

['hi', 'you', 'me', 'are']

Then we need to numerate them, for example, like this:

{'hi': 0, 'you': 1, 'me': 2, 'are': 3}

And we have the text, which we want to transform to the vector:

'hi how are you'

For this text we create a corresponding zero vector

[0, 0, 0, 0]

And iterate over all words, and if the word is in the dictionary, we increase the value of the corresponding position in the vector:

'hi':  [1, 0, 0, 0]
'how': [1, 0, 0, 0] # word 'how' is not in our dictionary
'are': [1, 0, 0, 1]
'you': [1, 1, 0, 1]

The resulting vector will be

[1, 1, 0, 1]

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

  • Term Frequency (TF) : It is a measure of how frequently a term, tt, appears in a document, dd: tf(t,d)=number of times the term “t” appears in the document “d”number of terms in the document "d"tf (t, d) = \frac{\text{number of times the term “t” appears in the document “d”}}{\text{number of terms in the document "d"}}

  • Inverse Document Frequency (IDF) : IDF is a measure of how important a term is. We need the IDF value because computing just the TF alone is not sufficient to understand the importance of words.

idf(t)=log(numbers of documentnumber of document with the term "t")idf (t) = log \bigg( \frac{\text{numbers of document}}{\text{number of document with the term "t"}} \bigg)

  • We can now compute the TF-IDF score for each word in the corpus. Words with a higher score are more important, and those with a lower score are less important.

tf_idf(t,d)=tf(t,d)idf(t)tf\_idf(t, d) = tf (t, d) * idf (t)

TF-IDF takes into account total frequencies of words in the corpora. It helps to penalize too frequent words and provide better features space.