Intro to Word Embeddings

Natural Language Processing models very often take words as input. This article focuses on a common method for encoding them.

Representing words

How should you represent a word in a computer? Let’s see some immediate answers.

Each word as a unique number. At first, one could assign a number to each word, for example “the” = 0, “dog” = 1, “barks” = 2, so the sentence “the dog barks” is encoded as [0, 1, 2]. This works in some cases, but it is an arbitrary representation that captures no relationship between words, so it is hard to spot similarities between them.

Each character in a word as a number. A second attempt would be to consider the ASCII character representation of a word. But this is a rigid representation that only tells you what the word is as a mere bunch of characters; it says nothing about its meaning (its semantics).

One-hot encoding. We might encode each word in our vocabulary with a vector whose size is the vocabulary size… yes, if your vocabulary consists of 100k words, you need 100k vectors, each with 100k components. These vectors are of the form [0, 0, …, 0, 1, 0, …, 0]. The index of the entry 1 is, if the vocabulary is lexicographically ordered, the ordinal number corresponding to the word’s position in the vocabulary. For example, for the three-word vocabulary {apple, banana, cherry} we have apple = [1, 0, 0], banana = [0, 1, 0] and cherry = [0, 0, 1]. To create a vector that encodes a sentence we simply concatenate the one-hot vectors of its words. However, this approach is inefficient because it produces a lot of sparse vectors (i.e. most components are zero) and each of these vectors has no particular relation to the others (the vectors all have the same length and are orthogonal to one another). Note that there is an obvious correspondence between one-hot vectors and natural numbers, namely the mapping one-hot vector -> index of the component equal to 1.
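
As a quick illustration, here is a minimal PyTorch sketch of one-hot encoding, using the three-word vocabulary from the example above:

import torch
import torch.nn.functional as F

vocab = {"apple": 0, "banana": 1, "cherry": 2}

# one-hot vector for "banana": length 3, a single entry equal to 1
idx = torch.tensor(vocab["banana"])
print(F.one_hot(idx, num_classes=len(vocab)))  # tensor([0, 1, 0])

# a sentence is the concatenation of the one-hot vectors of its words
sentence = ["apple", "cherry"]
indices = torch.tensor([vocab[w] for w in sentence])
print(F.one_hot(indices, num_classes=len(vocab)).flatten())  # tensor([1, 0, 0, 0, 0, 1])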

These are the most immediate ways of encoding words, but each of these attempts carries very limited semantic power. It would be better to transform words into points of a vector space such that similar words are represented by similar vectors.

Word embeddings. One-hot vectors have all components except one equal to zero. We can use the vector components more efficiently: we can transform each one-hot vector into a low-dimensional dense vector that exploits all of its entries (most components are nonzero). We do not have to specify the encoding by hand, because the new vector entries are trainable parameters! An embedding is a dense vector of floating point values. The transformed vector’s size (the embedding dimension) is specified by the user: for large datasets an embedding dimension between 256 and 1024 works well; for small datasets even an eight-dimensional space may be enough.

Semantic similarity

If we consider the angle between vectors, we obtain a way to encode semantic similarity between words: the cosine of this angle (the cosine similarity) measures how close two words are in the embedding space.

The more similar two words are, the closer their cosine similarity is to 1. Extremely dissimilar words should have a similarity around 0.
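
As a minimal sketch (the embedding values below are made up for illustration), cosine similarity can be computed directly in PyTorch:

import torch
import torch.nn.functional as F

# hypothetical trained embeddings for three words (values are invented)
cat = torch.tensor([0.9, 0.8, 0.1])
kitten = torch.tensor([0.85, 0.75, 0.2])
car = torch.tensor([-0.7, 0.1, 0.9])

# cosine similarity = cosine of the angle between the two vectors
print(F.cosine_similarity(cat, kitten, dim=0))  # close to 1: similar words
print(F.cosine_similarity(cat, car, dim=0))     # much lower: dissimilar words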

One drawback of the word embedding representation is that the components, which should denote features, are generally not interpretable. For example, two embedded vectors may both have a large third component, but it is not clear what that means.

PyTorch implementation

Let v be the size of the vocabulary V and let d be the embedding dimension. Our word embeddings can be thought of as a lookup table of dimensions v × d: the word whose index is i has its embedding in the i-th row of this matrix. The word-to-index mapping is a dictionary named word_to_idx.

PyTorch provides nn.Embedding to perform word embeddings. nn.Embedding holds a Tensor of dimension (v, d). When the embedding layer is created, the nn.Embedding tensor is initialized randomly; only when you train it does similarity between words appear. Here’s a quick example:

import torch
import torch.nn as nn

torch.manual_seed(0)

word_to_idx = {"the": 0, "dog": 1, "barks": 2}
# 3 words in vocab, 2 dimensional embeddings
embeds = nn.Embedding(3, 2)  
lookup_tensor = torch.tensor([word_to_idx["dog"]], dtype=torch.long)
dog_embed = embeds(lookup_tensor)
print(dog_embed)
# output: tensor([[-2.1788,  0.5684]], grad_fn=<EmbeddingBackward>)

So, the word “dog” corresponds to the vector [-2.1788, 0.5684].
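
The lookup-table view can be checked directly: indexing into the layer’s weight matrix returns the same row (a small sanity check, continuing the snippet above):

# the embedding of "dog" is simply row 1 of the (v, d) weight matrix
row = embeds.weight[word_to_idx["dog"]]
print(torch.allclose(row, dog_embed.squeeze(0)))  # True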

R-NET

R-NET is an end-to-end neural network model for reading comprehension style question answering, which aims to answer questions from a given passage. We refer to the paper by the Natural Language Computing Group, Microsoft Research Asia, “R-NET: Machine Reading Comprehension with Self-matching Networks” (2017). The network has been tested on large datasets such as the Stanford Question Answering Dataset (SQuAD) and Microsoft MAchine Reading COmprehension (MS-MARCO).

Description

In reading comprehension style question answering, a passage P and a question Q are given. The task is to predict an answer A to the question Q based on information found in P. Note that the SQuAD dataset constrains answers to a continuous sub-span of the passage P. The answer A often includes non-entities and can be a much longer phrase. Here is an example from the SQuAD dataset:

Passage: Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla’s breach of contract by asking for more funds. Tesla wrote another plea to Morgan, but it was also fruitless. Morgan still owed Tesla money on the original agreement, and Tesla had been facing foreclosure even before construction of the tower began.
Question: On what did Tesla blame for the loss of the initial money?
Answer: Panic of 1901.

The architecture matches the question and passage with gated attention-based recurrent networks to obtain a question-aware passage representation. Then a self-matching attention mechanism refines this representation by matching the passage against itself, which effectively encodes information from the whole passage. Finally, pointer networks are used to locate the position of the answer in the passage.

R-NET Structure

The following image shows the architecture. First, the question and passage are processed separately by a bi-directional recurrent network (BiRNN). Then, question and passage are matched with gated attention-based recurrent networks, yielding a question-aware representation of the passage. On top of that, self-matching attention is applied to aggregate evidence from the whole passage and refine the passage representation, which is then fed into the output layer to predict the interval containing the answer to the question.

Question and Passage encoder

Consider a question Q = \{ w_1^Q, \dots, w_m^Q\} and a passage P = \{ w_1^P, \dots, w_n^P\}. We begin with preprocessing and text encoding. The preprocessing is done in a separate process and is not part of the neural network. First we split the text into tokens, and then we convert all the words to corresponding vectors, the word-level embeddings \{e_1^Q, \dots, e_m^Q\} and \{ e_1^P, \dots, e_n^P\}. Each word is represented by the concatenation of two vectors: its GloVe vector and a vector of character-level embeddings \{c_1^Q, \dots, c_m^Q\} and \{c_1^P, \dots, c_n^P\}. The latter embeddings are generated by taking the final hidden states of a BiRNN applied to the embeddings of the characters in the token.

(Figure: R-NET encoder)

We are ready to use a BiRNN to obtain new representations \{u_1^Q, \dots, u_m^Q\} and \{u_1^P, \dots, u_n^P\} of all the words in the question and passage:

u_t^Q = \mathrm{BiRNN}_Q(u_{t-1}^Q, [e_t^Q, c_t^Q])

u_t^P = \mathrm{BiRNN}_P(u_{t-1}^P, [e_t^P, c_t^P]).

The notation [a, b] represents the concatenation of the vectors a and b. In the implementation, it is fine to use GRUs as the BiRNN cells and to set the word vector to all zeros when a word is missing from GloVe.
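
Here is a minimal sketch of this encoding layer in PyTorch; module names and dimensions are illustrative and not those of the original implementation, and the character-level embedding c_t is taken from the final hidden states of a BiGRU over the characters of each token:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the R-NET encoding layer: character-level BiGRU + word-level BiGRU."""
    def __init__(self, char_vocab=100, char_dim=25, word_dim=300, hidden=75):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # BiGRU over the characters of each token; its final states give c_t
        self.char_rnn = nn.GRU(char_dim, hidden, bidirectional=True, batch_first=True)
        # BiGRU over [e_t, c_t] producing u_t
        self.word_rnn = nn.GRU(word_dim + 2 * hidden, hidden,
                               bidirectional=True, batch_first=True)

    def forward(self, glove, char_ids):
        # glove: (seq_len, word_dim) pre-trained word vectors (zeros for OOV words)
        # char_ids: (seq_len, max_chars) character indices of each token
        _, h = self.char_rnn(self.char_emb(char_ids))   # h: (2, seq_len, hidden)
        c = torch.cat([h[0], h[1]], dim=-1)             # (seq_len, 2 * hidden)
        u, _ = self.word_rnn(torch.cat([glove, c], dim=-1).unsqueeze(0))
        return u.squeeze(0)                             # (seq_len, 2 * hidden)

The same module can be applied to the question and to the passage to obtain u^Q and u^P.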

Question aware representation for the passage

The module after the encoder computes another representation of the passage that takes into account the words of the question: a gated attention-based recurrent network is used to incorporate question information into the passage representation. The difference from a basic attention-based recurrent network lies in an additional gate that determines the importance of information in the passage with respect to a particular question.

Given the question representation \{ u_1^Q,\dots,u_m^Q\} and the passage representation \{ u_1^P,\dots,u_n^P\}, the aim is to generate a question-aware representation of the passage \{ v_1^P,\dots,v_n^P \} by feeding the previous state v_{t-1}^P and an attention-pooling vector c_t into the RNN:

v_{t}^P = \mathrm{RNN}(v_{t-1}^P,c_t)  .

The attention-pooling vector is computed as a weighted sum

\displaystyle c_t = \sum_{i=1}^m a_i^t \,u_i^Q

where the weights (attention scores) result from softmax

\displaystyle a_{i}^t =\frac{\exp(s_{i}^t)}{\sum_{j=1}^m \exp(s_j^t)}

and

s_j^t = \mathrm{v}^T\tanh(W_{u}^Qu_{j}^Q + W_{u}^Pu_{t}^P + W_{v}^Pv_{t-1}^P)  .

So we compute the products of the matrices W_{\square}^\square with the vectors u_\square^\square, v_\square^\square, apply the hyperbolic tangent activation tanh, multiply by a learned weight vector \mathrm{v}, and take a softmax to express the “importance” of each word in the question (this is the similarity score of additive attention). We then take the average of the question word vectors u^Q weighted by the attention scores: c_t is the representation of the part of the question relevant to the current passage word.
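
As a sketch, the attention pooling just described can be written as a small function; the weight matrices are assumed to be nn.Linear layers without bias, and the hidden sizes are taken to be equal for simplicity:

import torch

def attention_pool(uQ, uP_t, v_prev, W_uQ, W_uP, W_vP, v):
    # uQ: (m, d) question representation, uP_t: (d,) current passage word,
    # v_prev: (d,) previous RNN state; W_*: nn.Linear(d, d, bias=False); v: (d,) weight vector
    s = torch.tanh(W_uQ(uQ) + W_uP(uP_t) + W_vP(v_prev)) @ v   # (m,) scores s_j^t
    a = torch.softmax(s, dim=0)                                # attention weights a_j^t
    return a @ uQ                                              # c_t: weighted sum of the u_i^Q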

This is a first setup. Later, S. Wang and J. Jiang (2016) chose to use the concatenation [u_{t}^P, c_{t}] instead of just c_t (so we also keep information from the “original” input u_t^P):

v_t^P = \mathrm{RNN}(v_{t-1}^P, [u_t^P,c_t])  .

To determine the importance of passage parts and attend to the ones relevant to the question, a gate is added to the RNN input [u_{t}^P,c_{t}]. The gate is obtained by multiplying a new weight matrix W_g with the concatenated vector [u_{t}^P,c_t] and applying a sigmoid activation function:

g_t = \mathrm{sigmoid}(W_g [u_{t}^P,c_t])\,.

The gate output g_{t} is a vector with entries between 0 and 1, which is then multiplied element-wise by the original concatenated vector [u_t^P,c_t]:

[u_{t}^P, c_t]^\ast = g_t \odot[u_t^P,c_t] \,.

The result of this multiplication is finally passed to the RNN cell as input.
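
Putting the pieces together, a single step of the gated attention-based recurrent network might look as follows; this reuses the attention_pool sketch above and assumes gru_cell = nn.GRUCell(2 * d, d) and W_g = nn.Linear(2 * d, 2 * d, bias=False):

def gated_step(uP_t, v_prev, uQ, W_uQ, W_uP, W_vP, v, W_g, gru_cell):
    c_t = attention_pool(uQ, uP_t, v_prev, W_uQ, W_uP, W_vP, v)
    x = torch.cat([uP_t, c_t])                  # [u_t^P, c_t]
    g = torch.sigmoid(W_g(x))                   # gate with entries in (0, 1)
    x_star = g * x                              # element-wise gating
    # one RNN step: v_t^P = RNN(v_{t-1}^P, g_t * [u_t^P, c_t])
    return gru_cell(x_star.unsqueeze(0), v_prev.unsqueeze(0)).squeeze(0)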

In the next section we’ll discuss the self-matching attention layer and the output layer.

Applying self-matching attention on the passage

The authors suggest adding an attention mechanism on the passage itself. This is done because the question-aware passage representation \{ v_1^P, v_2^P, \dots, v_n^P\} has very limited knowledge of context. Here the vector v^P is the input to the self-matching attention module, so the question-aware passage representation is matched directly against itself. This dynamically collects evidence from the whole passage for each passage word, and encodes the evidence relevant to the current passage word and its matching question information into the passage representation h_t^P.

The equation for the passage representation h_{t}^P  is

h_t^P = \mathrm{BiRNN}(h_{t-1}^P, [v_t^P, c_t ] )

where c_t is the attention-pooling vector of the whole passage v^P (depending on the whole passage v^P and on v_t^P at time t):

s_j^t = \mathrm{v}^T \tanh(W_{v}^Pv_{j}^P + \tilde{W}_{v}^Pv_{t}^P)

\displaystyle a_i^t = \frac{\exp(s_i^t)}{\sum_{j=1}^n\exp(s_j^t)}

c_t = \displaystyle \sum_{i=1}^n a_{i}^tv_{i}^P\,.

Here \mathrm{v} is a weight vector, and the weight matrices W_v^P and \tilde{W}_v^P are different. An additional gate, as in the gated attention-based recurrent network, is applied to [v_{t}^P, c_t] to adaptively control the RNN input.
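
A vectorized sketch of the self-matching scores and pooling, computing c_t for every passage position at once (names and shapes are illustrative):

def self_match_pool(vP, W_vP, W_vP_tilde, v):
    # vP: (n, d) question-aware passage representation
    # s[t, j] = v^T tanh(W_v^P v_j^P + W~_v^P v_t^P), shape (n, n)
    s = torch.tanh(W_vP(vP).unsqueeze(0) + W_vP_tilde(vP).unsqueeze(1)) @ v
    a = torch.softmax(s, dim=1)      # row t holds the weights a_j^t over the passage
    return a @ vP                    # (n, d): the attention-pooling vector c_t for each t

The gated input [v_t^P, c_t] is then fed into a BiRNN to obtain h^P, exactly as in the previous layer.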

Output Layer: predicting the interval containing the answer to the question

Pointer networks were introduced in a paper by Vinyals et al. to address combinatorial problems where the size of the output dictionary depends on the length of the input sequence. For these kinds of problems, sequence-to-sequence models such as those used in neural machine translation are not effective, because they require the size of the output dictionary to be fixed a priori. In pointer networks the softmax probabilities arising from the attention mechanism are treated as “pointers” to input elements. For example, pointer networks work well for combinatorial problems such as the convex hull and the Delaunay triangulation.

(Figure: convex hull)

In our context, pointer networks are used to predict the beginning and the end of the answer. Given the passage representation \{ h_{1}^P, h_{2}^P, \dots, h_{n}^P\}, the attention mechanism is used as a pointer to select the start position (p^1) and the end position (p^2) in the passage:

s_j^t = \mathrm{v}^T \tanh(W_{h}^P h_{j}^P + W_{h}^a h_{t-1}^a)

\displaystyle a_{i}^t = \frac{\exp(s_{i}^t)}{\sum_{j=1}^n\exp(s_{j}^t)}

p^t = \mathsf{argmax}(a_{1}^t,\dots,a_{n}^t)\,.

Here h_{t-1}^a represents the last hidden state of the answer RNN (the pointer network). The initial hidden state is the output of the question pooling. The a_i^t at time t are probabilities over the words in the passage, and the argmax of the a^1 vector is the predicted start position. The input of the answer RNN is the attention-pooling vector based on the current predicted probabilities a^t:

\displaystyle c_t = \sum_{i=1}^n a_i^t h_i^P

h_{t}^a = \mathrm{RNN}(h_{t-1}^a,c_t)\,.

As the initial state of the answer RNN we use the question vector r^Q. The vector r^Q=\mathrm{att}(u^Q, V_{r}^Q) is an attention-pooling vector of the question based on the parameter V_r^Q (as usual, the equations are those of an attention model):

s_{j} = \mathrm{v}^T \tanh(W_{u}^Q u_{j}^Q + W_{V}^QV_{r}^Q)

\displaystyle a_i = \frac{\exp(s_{i})}{\sum_{j=1}^m \exp(s_{j})}

r^Q = \displaystyle \sum_{i=1}^m a_iu_i^Q \,.
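
A hedged sketch of the whole output layer, with the question pooling providing the initial state and two pointer steps producing the start and end positions; V_rQ is a learned parameter vector and rnn_cell is assumed to be an nn.GRUCell(d, d):

def question_pool(uQ, W_uQ, W_VQ, V_rQ, v):
    # r^Q: attention pooling of the question against the learned parameter V_r^Q
    s = torch.tanh(W_uQ(uQ) + W_VQ(V_rQ)) @ v       # (m,) scores
    a = torch.softmax(s, dim=0)
    return a @ uQ                                   # (d,)

def answer_span(hP, rQ, W_hP, W_ha, v, rnn_cell):
    # hP: (n, d) final passage representation, rQ: (d,) initial answer state
    h_a, positions = rQ, []
    for _ in range(2):                              # first the start, then the end position
        s = torch.tanh(W_hP(hP) + W_ha(h_a)) @ v    # (n,) scores over passage words
        a = torch.softmax(s, dim=0)
        positions.append(torch.argmax(a).item())    # p^1, then p^2
        c = a @ hP                                  # attention-pooling input to the answer RNN
        h_a = rnn_cell(c.unsqueeze(0), h_a.unsqueeze(0)).squeeze(0)
    return positions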

(Figure: R-NET answer prediction)

There are many open-source R-NET implementations; probably one of the simplest, written in Keras (and respecting the original paper’s naming conventions), can be found here.