
Retentive Network – notes

Useful links

Retentive Network: A Successor to Transformer for Large Language Models
Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, F. Wei
arXiv:2307.08621 [cs.CL] (2023)

Official implementation on GitHub (link)

PyTorch implementation of RetNet by Jamie Stirling (link)

xPos paper (link)

Reading code

Photo by Nicole Wolf

Useful links

llama2.c – A. Karpathy’s GitHub page (link)

Introducing Llama 2
– Meta AI (link)

What Is a Transformer Model? – NVIDIA blog (link)

Rotary Embeddings: A Relative Revolution – EleutherAI (link)

Forward-Forward algorithm

Photo by Tolga Ulkan

2022 closed with Hinton's latest effort, The Forward-Forward Algorithm: Some Preliminary Investigations. It is not my intention to stir up a controversy about Hinton, but to this day it still escapes me what his real contribution to neural networks is. The last time I covered an article by Hinton was on Capsule Networks (what happened to them?) a few years ago.

There are some issues with backpropagation: first, even though neural networks are loosely modeled on real neuronal functioning, backpropagation has no biological counterpart; second, everything one plugs into a neural network (as a black box) has to be modeled as a differentiable module to work with backpropagation.

Main idea

Hinton’s last paper introduces the Forward-Forward (FF) learning method with the following key features:

(a) FF replaces the forward and backward passes of backpropagation with two forward passes: one operates on positive data and the other on negative data;

(b) each layer has its own objective function, that is, a measure of goodness for positive and negative data;

(c) FF computes the gradients locally using a local objective function, so there is no need to backpropagate the errors.

Looking at a piece of the implementation code for the layer train method, the input is literally split into positive and negative samples to operate on.

Learning with a simple layer-wise goodness function

The sum of the squared activities in a layer can be used as the “goodness”, but there are many other possibilities, including minus the sum of the squared activities. Specifically, we aim to correctly classify input vectors as positive data or negative data when the probability that an input vector is positive is given by the following (θ is a threshold term and σ denotes the logistic function):

\displaystyle p(\mathsf{positive}) =  \sigma\left( \sum_j y_j^2 - \theta \right)\,.

A single hidden layer can be learned using the following criterion: the sum of squared activities of the hidden units has to be high for positive data (above the threshold value θ) and low for negative data.

A necessary observation: since it would be trivial to distinguish positive from negative data by simply using the length of the activity vector in the first hidden layer as an input to the second hidden layer (no need to learn new features), FF normalizes the length of the hidden vector before using it as input to the next layer. Briefly, the activity vector in the first hidden layer has a length and an orientation: the length is used to define the goodness for that layer, and only the orientation is passed to the next layer.
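As an illustration only (not Hinton's code), here is a minimal PyTorch sketch of a single FF layer putting together the goodness measure, the local objective and the length normalization just described; names like `FFLayer` and the choices of ReLU and Adam are my own assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One fully-connected layer trained with its own local FF objective."""
    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold              # the theta in p(positive)
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # normalize the input length: only the orientation of the previous
        # layer's activity vector is passed on
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return F.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        y_pos, y_neg = self.forward(x_pos), self.forward(x_neg)
        # goodness = sum of squared activities
        g_pos = y_pos.pow(2).sum(dim=1)
        g_neg = y_neg.pow(2).sum(dim=1)
        # push goodness above the threshold for positive data, below for negative data
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()          # gradients stay local to this layer
        self.opt.step()
        # detach so no error is backpropagated to earlier layers
        return y_pos.detach(), y_neg.detach()
```

Stacking several such layers and calling `train_step` layer by layer gives the two forward passes without any global backward pass.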

A supervised example

To implement supervised learning with FF, one way is to include the class labels in the input (see figure below).

An image with the correct label constitutes positive data and an image with an incorrect label constitutes negative data. The only difference between positive and negative data is the label, so FF should ignore all image features that do not correlate with the label.

After training on the MNIST dataset with FF, it is possible to classify a test digit by running the net with a particular label as part of the input and accumulating the goodnesses of all but the first hidden layer. This has to be done for each label separately; the label with the highest accumulated goodness is then chosen. The paper reports that, during training, a forward pass from a neutral label was used to pick hard negative labels.

With MNIST, after training all the layers, to make a prediction for a test image x we evaluate the pair (x, y) for every label y in {0, 1, …, 9} and choose the label that maximizes the network's accumulated goodness.
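A minimal sketch of this label-scoring procedure (the `overlay_label` helper, which writes a candidate label into the input, is hypothetical):

```python
import torch

def predict_digit(layers, x, overlay_label, num_classes=10):
    """Score each candidate label by the goodness accumulated over all but the
    first hidden layer; overlay_label is a hypothetical helper that embeds the
    candidate label into the input batch x."""
    scores = []
    for label in range(num_classes):
        h = overlay_label(x, label)
        goodness = torch.zeros(x.shape[0])
        for k, layer in enumerate(layers):
            h = layer(h)
            if k > 0:                      # skip the first hidden layer, as in the paper
                goodness = goodness + h.pow(2).sum(dim=1)
        scores.append(goodness)
    return torch.stack(scores, dim=1).argmax(dim=1)
```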

Performance

Hinton’s paper reports a brief comparison between FF and backpropagation on CIFAR-10. The test performance of FF is slightly worse than backpropagation. There is also an interesting page about the analysis of performance versus backpropagation.

Code implementations

I’d like to mention two GitHub repositories, one from Nebuly-ai and the other from Mohammad Pezeshki. Both are PyTorch implementations.

Useful Links

The Forward-Forward Algorithm: Some Preliminary Investigations
G. Hinton
arXiv:2212.13345 [cs.LG], 2022.

Code from Nebuly-ai.

Code from M. Pezeshki.

Detailed Backpropagation Algorithm (link).

Interesting performance analysis page.

Notes on the GUIE competition

Brief post on the 1st place solution of the Google Universal Image Embedding competition on Kaggle


The Google Universal Image Embedding (GUIE) competition is, as reported on the competition description page, the first competition on image representations that should work across many object types. Image representations are a key element of computer vision models. In the past, embedding learning techniques were applied to each domain separately, rather than developing generic embedding models applicable to all domains combined.

Representations are very useful. As a simple example, it is well known that autoencoders find representations of images. These representations are usually much smaller than the images from which they originate, and one can easily work on them (for example, comparing them) without going back to the original images.

Some of the types of images evaluated in this competition: apparel & accessories, packaged goods, landmarks, furniture & home decor, storefronts, dishes, artwork, toys, memes, illustrations, and cars. The competition requires contestants to develop a model able to generate a 64-dimensional embedding for each image. The back-end server then retrieves images of the same instance via a k-nearest-neighbor search (k = 5).

The competition ended in October 2022. In this post we examine the 1st place solution by Qinghua Cui and Shihao Shao, reporting some of Shao's comments on the strategies and development that led to the winning model.

First attempts

The competition is a bit atypical in that no dataset is provided. From the discussion it emerges that larger datasets result in better scores, since weights pre-trained on ImageNet-22K perform better than ImageNet-1K ones. So the first strategy was searching for weights pre-trained on very large datasets. A good starting point was CLIP, whose code can be found here.

Cui and Shao adopted the weights of a ViT-H model pre-trained on LAION-2B, a subset of LAION-5B, as their baseline. They added a linear projection layer to squeeze the embedding down to 64 dimensions, together with an ArcFace head. A Dropout layer with a drop rate of 0.2 was inserted between the last and the second-to-last linear layers. SGD with momentum was chosen as the optimizer, with an L2 weight decay of 1.5e-4.
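This is not the winners' code, just a rough PyTorch sketch of the kind of head described above (names like `EmbeddingHead` and `ArcMarginHead` are mine), assuming a `backbone` callable, e.g. an OpenCLIP ViT-H visual tower, that returns a feature vector of size `feat_dim`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    """Projects backbone features to a 64-d embedding, roughly as described above."""
    def __init__(self, backbone, feat_dim, emb_dim=64, p_drop=0.2):
        super().__init__()
        self.backbone = backbone                  # pre-trained visual backbone
        self.pre = nn.Linear(feat_dim, feat_dim)  # second-to-last linear layer
        self.drop = nn.Dropout(p_drop)            # dropout between the two linear layers
        self.proj = nn.Linear(feat_dim, emb_dim)  # last layer: squeeze to 64-d

    def forward(self, x):
        f = self.backbone(x)
        return F.normalize(self.proj(self.drop(self.pre(f))), dim=1)

class ArcMarginHead(nn.Module):
    """ArcFace-style margin head, used only at training time."""
    def __init__(self, emb_dim, num_classes, s=30.0, m=0.3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = torch.cos(theta + self.m)          # add the angular margin to the true class
        logits = torch.where(F.one_hot(labels, cos.size(1)).bool(), target, cos)
        return self.s * logits
```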

Dimensionality reduction algorithms such as random projections, PCA and t-SNE did not work.

Training

One of the first issues participants were concerned about was the strict competition rule banning datasets without a commercial license. Later, the rules were updated: licensing for the winning model was no longer required, only the source code used to generate it had to be licensed. All publicly available datasets were fine for model training as long as they were publicly disclosed on the forum.

Some attempts followed the scheme “choose datasets first, then decide the model and training details”. The winners tested various datasets such as Products-10k, Shopee, MET Artwork Dataset, Alibaba goods, H&M Personalized Fashion, GPR1200, GLDv2-Full, and the Consumer-to-shop Clothes Retrieval Benchmark part of DeepFashion. Datasets were added to the training list iteratively instead of training on every dataset from the very beginning. This pushed the score above 0.610.

The winners decided not to follow the usual LP-FT recipe (linear probing, then full fine-tuning). Instead, they trained the last 2 fully-connected layers to full convergence for 6 epochs, then froze them and trained only the backbone for 3 epochs. We present some of the reasons for that decision below.

They noted that the weights of the last layer changed rapidly when training on all the layers. Furthermore, the central embedding of each class changed rapidly, and so did the Euclidean distances between classes. Hence, they decided to

(a) freeze the final FC layer while training the rest (the backbone);

(b) add dropout to the fully-connected layer, a well-known trick against over-fitting that does not always work but did well in this case.

Products-10k gave the largest improvement, so it was used for fine-tuning, respecting the “first fc, then backbone” order and reaching a score of 0.671.
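In PyTorch, the “first fc, then backbone” schedule could be sketched roughly as follows (assuming the `EmbeddingHead` layout from the previous sketch and a generic `train_fn` training loop; this is not the winners' code):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def first_fc_then_backbone(model, train_fn):
    """'First fc, then backbone' schedule, assuming the EmbeddingHead sketch above
    (attributes .backbone, .pre, .proj) and a generic train_fn(model, epochs)."""
    # Stage 1: train only the last two fully-connected layers until convergence.
    set_trainable(model.backbone, False)
    set_trainable(model.pre, True)
    set_trainable(model.proj, True)
    train_fn(model, epochs=6)
    # Stage 2: freeze the head and fine-tune only the backbone.
    set_trainable(model.backbone, True)
    set_trainable(model.pre, False)
    set_trainable(model.proj, False)
    train_fn(model, epochs=3)
```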

Ensemble strategy

Another odd fact was that model ensembling (by averaging outputs) did not give better results, as also noted by other participants.

Having trained two models on the same datasets, the outputs can still differ greatly due to the noise of random mini-batch selection and of data augmentation. This, presumably, is caused by the oscillating results of the final FC layers. So ensembling, as is, does not work in most cases.

However, ensembling should work when the final FC layers of the different models produce “similar” results. Therefore, the winners tried to apply ensembling while keeping the final two FC layers frozen.

Finally, Shao wrote:

“I need to redo EVERY THING MENTIONED ABOVE to the new laion-2b VIT-H model thanks to this weight:(, except several changes: 1) drop model ensemble, VIT-H is really a huge guy 2) train on all the datasets at the same time, drop products-10k, leave products-10k as the final fine-tuning datasets.”

Overlapping patches can help Vision Transformer models when splitting the image into patches: the last trick was to use a 4-pixel overlap at 290 × 290 resolution.

The final results were 0.732 on the public leaderboard and 0.728 on the private leaderboard.

Useful links

GUIE competition overview page on Kaggle.

1st Place Solution in Google Universal Images Embedding paper.

1st place solution comments by S. Shao.

1st place solution Github repository.

OpenCLIP repo.

ArcFace paper.

Laion-5B dataset page.

Active Dendrites

Avoiding catastrophic forgetting

Photo by Henry Be

The following content is mainly about the article Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments by A. Iyer et al. (December 2021). It is a pleasant paper mixing biology, neuroscience and mathematical modeling; I hope you find it interesting.

Catastrophic forgetting

Standard Artificial Neural Networks (ANNs), based on the (inaccurate) point neuron model [Lapique, 1907] and the backpropagation algorithm, often fail dramatically at multi-task learning. Unlike single-task machine learning, learning multiple distinct tasks introduces new complications. When using gradient-based methods (such as backpropagation), a noteworthy issue is that error gradients and accumulated knowledge from different tasks can interfere with one another: weight updates that reduce the error on one task may lead to suboptimal or ruinous performance on another. This common problem is known as catastrophic forgetting.

The same is true for continual learning, which concerns the ability to acquire new knowledge over time while retaining relevant information from the past. A typical scenario involves training a network on a set of distinct tasks presented in a strict sequence of training phases. As a basic example, consider two different learning tasks: (1) classify dog types and (2) identify Aramaic alphabet letters.

In essence, learning amounts to starting from an initial weight configuration and then moving through weight space to a place where the error on the task being learned is small.

The figure above provides intuitive support for what happens. Consider the sequential learning of the two aforementioned tasks 1 and 2. Starting from the initial weight configuration (yellow dot), after learning to classify dogs we reach a certain minimum region (a). Then we learn to identify letters and the weight configuration is modified to reach a minimum region (b). The network has thus completely forgotten which weight configuration was appropriate for the first task.

Biological neurons and Active Dendrites

The point neuron model postulates that all synapses have a linear impact on the cell. This simple assumption laid the foundations of Rosenblatt's original Perceptron [Rosenblatt, 1958] and continues to form the basis of current deep learning networks.

This artificial neuron has relatively few synapses and no dendrites. Learning occurs by changing the strength, or “weight”, of the synapses, each represented by a scalar that can take positive or negative values. A weighted sum of the point neuron's inputs is computed, and a non-linear function f then determines the neuron's output value. It is now well known that the point neuron assumption is an oversimplified model of biological computation.

Pyramidal neurons (see figure below) are the most common type of neurons in the neocortex. Biological neurons have thousands of synapses arranged along dendrites. Biological synapses are partly stochastic, and therefore are low precision. Learning in a biological neuron mostly involves the formation of new synapses and the removal of unused synapses.

In real neurons, proximal synapses (those close to the cell body) have a linear impact on the neuron, but most synapses occur on distal dendritic segments (away from the cell body). These distal segments are known as active dendrites and process synapses in a non-linear fashion. When the input to an active dendritic segment reaches a threshold, the segment initiates a dendritic spike that travels to the cell body and can depolarize the neuron for an extended period of time, even half a second. During this period the neuron is closer to its firing threshold and any new input is more likely to make it fire. Hence these dendrites, unlike proximal segments, have a modulatory and long-lasting impact on the neuron's activity. An active dendritic segment receives input signals from cells in different layers or in the form of top-down feedback.

Sparse Representations

Neural circuits in the neocortex are highly sparse. Studies reveal that relatively few neurons spike in response to a sensory stimulus. Neural connectivity is also sparse: pyramidal neurons are sparsely connected to each other and receive relatively few signals from neighboring neurons.

This is not the case in neural network modeling, where connections are mostly dense. Sparse neural representations are introduced using vectors where most of the entries are zero. Studies show that sparse representations are more resistant to noise than dense ones, and pattern recognition becomes less sensitive to slight perturbations of the input.

Active Dendrites Neuron

The authors propose a new neuron model. Mimicking what happens in pyramidal neurons, the active dendrites neuron receives two sources of input, in analogy with proximal and distal inputs. The feedforward input is treated exactly as in a point neuron, while multiple dendritic segments process a context vector and their output modulates the feedforward activation. In other words, the magnitude of the response to a given stimulus is highly context-dependent. The image below shows five dendrites processing the context (the weights involved are represented by small discs) and the feedforward input.

Given input x, weights w and bias b, the feedforward signal is, as usual, computed as

\hat{t} = \mathbf{w}^\top \mathbf{x} + b \,.

Note that the weights here do not form a 2-d matrix but a vector containing just the values associated with this particular neuron (needless to say, we are describing the functioning of a single artificial neuron). On the other hand, each dendrite j computes

\mathbf{u}_j^\top \mathbf{c}

where u_j are the weights of the j-th dendrite and c is a context vector (for example, the context vector may encode task ID information). We will not delve too deeply into how such a context vector is computed but, in short, the context vector:

1) is computed using prototype representations for different classes;

2) if the system receives task information during training, then the prototype vector for a certain task is computed by taking the element-wise mean over all the training samples across all features;

3) if the system receives no task information during training, then a statistical clustering approach is used: if the new batch of samples is similar to earlier training samples, they are assigned to an existing prototype; if not, the new batch of samples is assumed to correspond to a new task, and a novel prototype is instantiated.

The figure above illustrates the prototype method. Yellow points represent samples for task A, beige for task B.
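A small sketch of the prototype idea under the assumptions above (function names are mine, not from the paper's code):

```python
import torch

def build_prototype(task_samples: torch.Tensor) -> torch.Tensor:
    """Prototype for one task: the element-wise mean over all its training samples.
    task_samples has shape (num_samples, num_features)."""
    return task_samples.mean(dim=0)

def pick_context(x: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """At inference time, use as context vector c the prototype closest
    (in Euclidean distance) to the sample x of shape (num_features,)."""
    dists = torch.cdist(x.unsqueeze(0), prototypes)   # shape (1, num_tasks)
    return prototypes[dists.argmin()]
```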

Returning to our neuron model, the segment with the strongest response to the context is selected:

\displaystyle d = \max_j \mathbf{u}_j^\top \mathbf{c}\, .

The active dendrites contextual contribution modulates the feedforward activation in the following manner:

\displaystyle y= f(\hat{t},d) = \hat{t} \cdot \sigma(d)\,.

In the expression above, y is the resulting activation and σ is the sigmoid function, which maps a real number into the range [0, 1]. Weak or negative responses to the context vector keep σ(d) small and thus significantly reduce the resulting activation.

Modeling sparsity

To add sparsity to active dendrites neuron architectures, the authors apply the kWTA (k-Winners-Take-All) function, which mimics biological inhibitory networks and is defined as follows:

\displaystyle \mathrm{kWTA}(y_i) = \begin{cases} y_i & \textsf{if}\; y_i\; \textsf{is one of the top}\, k \, \textsf{activations over all} \, i\\ 0 & \textsf{otherwise}\end{cases}

where i indexes neurons in the same layer. Sparsity is ensured by selecting the top k activations and setting all others to zero.
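Putting the previous equations together, a layer of active dendrites neurons with kWTA could be sketched in PyTorch as follows (a simplified illustration of the formulas above, not the authors' implementation):

```python
import torch
import torch.nn as nn

class ActiveDendritesLayer(nn.Module):
    """A layer of active-dendrites neurons with kWTA, following the equations above."""
    def __init__(self, d_in, d_out, d_context, num_segments, k):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)               # feedforward weights w and bias b
        # dendritic segment weights u_j, one set per neuron: (d_out, segments, d_context)
        self.segments = nn.Parameter(0.01 * torch.randn(d_out, num_segments, d_context))
        self.k = k                                          # number of "winner" neurons kept

    def forward(self, x, c):
        t_hat = self.linear(x)                              # (batch, d_out)
        # dendritic responses u_j^T c for every neuron and every segment
        resp = torch.einsum('nsd,bd->bns', self.segments, c)   # (batch, d_out, segments)
        d = resp.max(dim=-1).values                         # strongest segment per neuron
        y = t_hat * torch.sigmoid(d)                        # modulated activation
        # kWTA: keep the top-k activations in the layer, zero out the rest
        kth = y.topk(self.k, dim=1).values[:, -1:]          # k-th largest value per sample
        return torch.where(y >= kth, y, torch.zeros_like(y))
```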

Active Dendrites Network Architecture

The figure below shows a network of active dendrites neurons. All neurons in each hidden layer are active dendrites neurons, and the network is trained by backpropagation.

The neurons selected by the kWTA function are the only ones with nonzero activations (hence nonzero gradients), and they are the only ones updated during the backward pass of backpropagation.

A very small sparse subset of the full network is actually updated for each input, because for each “winner” neuron only the dendritic segment chosen by the max operator is updated (the other segments are not modified).

What do we expect from this model? Different dendritic inputs are expected to activate different subnetworks. If this happens, the backpropagation algorithm only modifies the connections of the neurons in each subnetwork, leaving the rest of the connections in the whole network untouched (see figure below).

Tests carried out on the permuted MNIST dataset give empirical evidence that the network does indeed invoke separate subsets of neurons to learn different tasks. As for the results, the authors claim that, in the multi-task RL setting, a 3-layer active dendrites network can achieve an average accuracy of about 88% when learning 10 Meta-World environment tasks together, while, in the continual learning setting, an almost identical network can achieve greater than 90% accuracy when learning 100 permuted MNIST tasks in sequence.

Useful links

A. Iyer, K. Grewal, A. Velu, L. O. Souza, J. Forest, S. Ahmad
Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments
arXiv:2201.00042v1 [cs.NE], 2021.

L. Lapique’s 1907 paper (translated, 2007).

Why Neural Networks Forget, and Lessons from the Brain [link].

J. Snell, K. Swersky, R. S. Zemel
Prototypical networks for few-shot learning
arXiv:1703.05175v2 [cs.LG], 2017.

Permuted MNIST [link].

T. Yu, D. Quillen, Z. He, R. Julian, A. Narayan, H. Shively, A. Bellathur, K. Hausman, C. Finn, S. Levine
Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning
arXiv:1910.10897v2 [cs.LG], 2019 (v2 revised 2021).

CoAtNets

A class of state-of-the-art computer vision models

Photo by Monisha Selvakumar

This post refers mainly to the paper CoAtNet: Marrying Convolution and Attention for All Data Sizes by Z. Dai et al. (2021).

CoAtNet models (pronounced “coat” net) for computer vision emerge as a combination of the Convolutional and Transformer (a Self-Attention based model) architectures. Experiments show that CoAtNets achieve state-of-the-art performance across various datasets like ImageNet and JFT-3B.

Convolution and Self-Attention

Convolutional neural networks (CNNs) use the convolution operation as follows (check here for a simple intro to the convolution operation). Let x be a given input, think of an image or, more generally, a feature representation, whose dimensions are r × c × d, where r and c are the image (or representation) rows and columns and d is the number of channels. Let \mathcal{L}(i) be a local image patch around pixel xᵢ, where i denotes the coordinates (α, β). Then the convolution output yᵢ is

\displaystyle y_i = \sum_{j \in \mathcal{L}(i)} w_{i-j} \odot x_j

where i − j = (α − m, β − n) and w_{i−j} is a weight (a convolution kernel entry). The index j = (m, n) varies over the patch \mathcal{L}(i). Note that xᵢ can also be considered a 1 × 1 × d tensor, so the product may involve multiple channels. Below, an example with \mathcal{L}(i) = \mathcal{L}(3, 3), a local patch of the representation x.

CNNs employ weight sharing: the kernel matrix is reused to generate the output for all pixel positions (α, β). Weight sharing enforces translation equivariance

convolve(translate(x)) = translate(convolve(x))

and this is a desirable property because if your CNN detects a particular element in an image, it will find that element again when the image is shifted.

For self-attention, consider a 1 × 1 × d “pixel” xᵢ and a region \mathcal{G} whose center is, for simplicity, xᵢ. This is similar to a local image patch, but the letter \mathcal{G} tells us that this region can even be global. The single-headed attention output yᵢ is

\displaystyle y_i = \sum_{j \in \mathcal{G}} \textsf{softmax}_\mathcal{G} \left( q_i^\top k_j \right) v_j

where the queries qᵢ = Qxᵢ, keys kⱼ = Kxⱼ and values vⱼ = Vxⱼ are described here. The matrices Q, K and V are learned. The softmax is applied to the quantities computed from the pixels in the neighborhood \mathcal{G} of xᵢ; the notation j ∈ \mathcal{G} indicates that the sum runs over all indices j corresponding to elements (pixels) of \mathcal{G}. This computation is repeated for every pixel xᵢ to obtain the outputs yᵢ. In practice, multiple attention heads are used to learn distinct representations of the input. Below, an image showing what has just been described.

The dashed lines represent learned transformations, the rest are matrix operations. 
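For concreteness, here is a minimal PyTorch sketch of this single-headed attention over the pixels of a feature map (no positional term and no scaling, to stay close to the formula above; the class name is my own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadPixelAttention(nn.Module):
    """Single-headed self-attention over all pixels of a feature map (global G)."""
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d, bias=False)   # learned matrix Q
        self.k = nn.Linear(d, d, bias=False)   # learned matrix K
        self.v = nn.Linear(d, d, bias=False)   # learned matrix V

    def forward(self, x):
        # x: (batch, rows, cols, d) -> flatten the spatial grid into a sequence of pixels
        b, r, c, d = x.shape
        x = x.reshape(b, r * c, d)
        attn = F.softmax(self.q(x) @ self.k(x).transpose(1, 2), dim=-1)  # (b, rc, rc)
        y = attn @ self.v(x)
        return y.reshape(b, r, c, d)
```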

In the current setting, no positional information is encoded in the attention, which limits the expressiveness of vision models. Positional information can be injected through the well-known positional embeddings based on sinusoidal functions. However, many experiments suggest using relative positional embeddings for better results. Relative attention is defined as follows. Consider the relative distance of the pixel with coordinates i = (α, β) to each position j = (m, n) in \mathcal{G}, so that each position determines two offsets: a row offset m − α and a column offset n − β (see figure below). In the figure, the relative distances are computed with respect to, for example, pixel (0, 0), and their format is row offset (yellow), column offset (gray).

The row and column offsets are associated with embeddings r(m − α) and r(n − β) respectively, each with dimension equal to half the output dimension d_out. Concatenating these vectors into a single vector, the expression for relative attention is

\displaystyle y_i = \sum_{j \in \mathcal{G}} \textsf{softmax}_\mathcal{G} \left( q_i^\top k_j + q_i^\top r_{j-i} \right)v_j\,.

So we have two components in the argument of the softmax: the logit expressing the similarity between the query and an element of \mathcal{G}, and the relative distance of that element from the query. Note that by adding relative position information, self-attention also enjoys translation equivariance, similarly to convolution.

Merging desirable properties

It is worthwhile to compare the relative strengths and weaknesses of convolution and self-attention before asking how to best combine them.

Translation Equivariance. We saw earlier that this is a property satisfied by convolution.

Input-adaptive Weighting. In convolution, kernel entries are static and do not depend on the particular input. Instead, the attention weights (the softmax terms) depend dynamically on the representation of the input.

Global Receptive Field. One of the most crucial differences between self-attention and convolution concerns the size of the receptive field. A larger receptive field, despite the high computational cost involved, provides more contextual information which could lead to higher model capacity.

An ideal model would combine the three previous properties. Taking these properties into account, the authors use the following attention mechanism for their model

\displaystyle y_i = \sum_{j \in \mathcal{G}} \frac{\exp(x_i^\top x_j + w_{i-j})}{\sum_{k \in \mathcal{G}} \exp(x_i^\top x_k + w_{i-k})}\;x_j

which is a kind of relative attention

\displaystyle\sum_{j \in \mathcal{G}} \textsf{softmax}_\mathcal{G} \left( x_i^\top x_j + w_{i-j} \right)x_j

where learned weights take the place of the relative-distance embeddings. Here \mathcal{G} indicates the global spatial space and, for each j, the weight w_{i-j} is a scalar (one per relative offset, so on the order of the size of \mathcal{G} of them).
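A sketch of this simplified relative attention in PyTorch, with one learned scalar per relative offset (an illustration under my own assumptions, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRelativeAttention(nn.Module):
    """softmax(x_i^T x_j + w_{i-j}) x_j over a fixed rows x cols grid of pixels."""
    def __init__(self, rows, cols):
        super().__init__()
        # one scalar weight for every possible (row, col) relative offset
        self.rel = nn.Parameter(torch.zeros(2 * rows - 1, 2 * cols - 1))
        idx_r, idx_c = torch.arange(rows), torch.arange(cols)
        # precompute, for every pair of positions, the index of its relative offset
        self.register_buffer('dr', idx_r[:, None] - idx_r[None, :] + rows - 1)  # (rows, rows)
        self.register_buffer('dc', idx_c[:, None] - idx_c[None, :] + cols - 1)  # (cols, cols)

    def forward(self, x):
        # x: (batch, rows, cols, d); x itself plays the role of queries, keys and values
        b, r, c, d = x.shape
        x = x.reshape(b, r * c, d)
        logits = x @ x.transpose(1, 2)                                   # x_i^T x_j, (b, rc, rc)
        # gather w_{i-j}: the scalar for the relative offset of every pixel pair
        w = self.rel[self.dr[:, None, :, None], self.dc[None, :, None, :]]  # (r, c, r, c)
        logits = logits + w.reshape(r * c, r * c)
        y = F.softmax(logits, dim=-1) @ x
        return y.reshape(b, r, c, d)

# usage: attn = SimpleRelativeAttention(16, 16); y = attn(torch.randn(2, 16, 16, 64))
```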

CoAtNet model

In the case of global attention the complexity is quadratic in the spatial size, so it is not always feasible to use self-attention in vision tasks. Applying the previously defined attention directly to raw images would be excessively slow due to the (usually) large number of pixels involved. Hence the authors state three main options:

(A) perform some down-sampling to reduce the spatial size and employ the global relative attention after the feature map reaches manageable level;

(B) enforce local attention, which restricts the global receptive field \mathcal{G} in attention to a local field \mathcal{L} just like in convolution;

(C) replace the quadratic softmax attention with certain linear attention variant which only has a linear complexity w.r.t. the spatial size.

Some experiments suggest excluding options (B) and (C) and focusing on (A). There are many ways to reduce the image size, leading to different architectures. The model we show uses, as a first stage S0, a simple 2-layer convolutional stem. This is followed by stage S1, employing MBConv blocks with squeeze-excitation (SE), since the spatial size is still too large for global attention. From S2 through S4 it is possible to use either the MBConv or the Transformer block, provided that convolution stages appear before Transformer stages. This leads to 4 different settings: CCCC, CCCT, CCTT and CTTT, where C denotes Convolution and T denotes Transformer. Experiments reveal that the proper configuration is CCTT.

For both the MBConv (yellow) and the Transformer (white) blocks, transformations are of the kind

x ← x + Module(Norm(x))

where Module is MBConv, Self-Attention or FFN (FeedForward Network) and Norm corresponds to BatchNorm for MBConv and LayerNorm for Self-Attention and FFN. As the activation function, the Gaussian Error Linear Unit (GELU) is used in both MBConv and Transformer blocks.
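This pre-norm residual pattern is easy to express generically; a minimal sketch (naming is my own):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Generic x <- x + Module(Norm(x)) transformation used in both block types."""
    def __init__(self, norm: nn.Module, module: nn.Module):
        super().__init__()
        self.norm = norm      # BatchNorm for MBConv, LayerNorm for Self-Attention / FFN
        self.module = module  # MBConv, Self-Attention or FFN

    def forward(self, x):
        return x + self.module(self.norm(x))
```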

Within each stage from S1 to S4, down-sampling is performed independently for both the residual branch and the identity branch.

In the Transformer block, standard max pooling with stride 2 is directly applied to the input states of both branches of the self-attention module, and a channel projection (for example, a 1 × 1 convolution) is applied to the identity branch to enlarge the hidden size. Hence, the down-sampling module can be represented as

x ← x + Proj(Pool(x)) + Attention(Pool(Norm(x))).

For the MBConv block, unlike the standard MBConv block, the residual-branch down-sampling is obtained by applying a stride-2 convolution to the normalized inputs (the standard MBConv uses stride 2 in the depth-wise convolution part). We can express the module as follows:

x ← Proj(Pool(x)) + Conv(DepthConv(Conv(Norm(x), stride=2))).

In depth-wise convolution, the convolution is applied to a single channel at a time, that is, each channel of the input convolves with a dedicated kernel, so the filters/kernels have size k × k × 1 (see figure below).
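In PyTorch, for example, a depth-wise convolution is obtained by setting `groups` equal to the number of channels:

```python
import torch
import torch.nn as nn

# Depth-wise convolution: each input channel is convolved with its own k x k x 1 kernel.
channels, k = 64, 3
depthwise = nn.Conv2d(channels, channels, kernel_size=k,
                      padding=k // 2, groups=channels)

x = torch.randn(1, channels, 32, 32)
print(depthwise(x).shape)   # torch.Size([1, 64, 32, 32])
```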

Results

The original CoAtNet paper reports several strong results. It is worth noting that, as of March 2022, the state of the art in image classification on ImageNet is held by a CoAtNet model (CoAtNet-7, top-1 accuracy 90.88%, 2440M parameters; see here for more).

Useful links

CoAtNet: Marrying Convolution and Attention for All Data Sizes
Z. Dai, H. Liu, Q. V. Le, M. Tan
arXiv:2106.04803v2 [cs.CV] (2021).

Stand-Alone Self-Attention in Vision Models
P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, J. Shlens
arXiv:1906.05909v1 [cs.CV] (2019).

2D Convolution (link).

Multi-Head Attention (link).

Code implementations (PyTorch and TensorFlow) (link).