Forward-Forward algorithm


2022 closed with Hinton’s latest effort, The Forward-Forward Algorithm: Some Preliminary Investigations. It is not my intention to stir up controversy about Hinton, but to this day it still escapes me what his real contribution to neural networks is. The last time I covered an article by Hinton was for Capsule Networks (whatever happened to them?) a few years ago.

Backpropagation has some well-known issues: first, even though neural networks are loosely modeled on real neuronal functioning, there is no evidence that the brain backpropagates errors; second, everything one puts into a neural network (as a black box) has to be modeled as a differentiable module for backpropagation to work.

Main idea

Hinton’s latest paper introduces the Forward-Forward (FF) learning method with the following key features:

(a) FF replaces the forward and backward passes of backpropagation with two forward passes: one operates on positive data and the other on negative data;

(b) each layer has its own objective function, that is, a measure of goodness for positive and negative data;

(c) FF computes the gradients locally using a local objective function, so there is no need to backpropagate the errors.

Looking at a piece of the implementation code for the layer train method, the input is literally split into a positive batch and a negative batch, and each drives its own forward pass, as sketched below.
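As a rough PyTorch sketch of what such a layer might look like (the class name FFLayer, the train_step method, and the default threshold are my own illustrative choices, not the exact code from the repositories linked below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Linear):
    """One FF layer with its own optimizer and local goodness objective (illustrative sketch)."""

    def __init__(self, in_features, out_features, threshold=2.0, lr=0.03):
        super().__init__(in_features, out_features)
        self.relu = nn.ReLU()
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)
        self.threshold = threshold  # the theta in p(positive)

    def forward(self, x):
        return self.relu(super().forward(x))

    def train_step(self, x_pos, x_neg):
        # Two forward passes: one on positive data, one on negative data.
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)  # goodness of the positive batch
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)  # goodness of the negative batch
        # Local objective: push positive goodness above theta, negative goodness below theta.
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()   # gradients stay within this layer
        self.opt.step()
        # Detach the outputs so no error signal propagates to earlier layers.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```

Each layer can be trained like this in isolation; stacking layers only requires feeding the detached positive and negative outputs of one layer into the next.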

Learning with a simple layer-wise goodness function

The sum of the squared activities in a layer can be used as the “goodness”, but there are many other possibilities, including minus the sum of the squared activities. Specifically, the aim is to correctly classify input vectors as positive data or negative data when the probability that an input vector is positive is given by the following (θ is a threshold term, σ denotes the logistic function, and y_j is the activity of hidden unit j):

\displaystyle p(\mathsf{positive}) =  \sigma\left( \sum_j y_j^2 - \theta \right)\,.

A single hidden layer can be learned with the following criterion: the sum of the squared activities of its hidden units has to be high for positive data (well above the threshold θ) and low for negative data.
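A minimal sketch of this probability in PyTorch, assuming activities arranged as a (batch, units) tensor and a hypothetical threshold of 2.0:

```python
import torch

def goodness(y):
    """Sum of squared activities for each row of a layer output y of shape (batch, units)."""
    return y.pow(2).sum(dim=1)

def p_positive(y, theta=2.0):
    """Probability that the input producing activities y is positive data."""
    return torch.sigmoid(goodness(y) - theta)

# A strongly activated layer is judged "positive", a weakly activated one "negative".
y_strong = torch.full((1, 4), 1.5)   # goodness = 9.0  -> p(positive) close to 1
y_weak = torch.full((1, 4), 0.1)     # goodness = 0.04 -> p(positive) well below 0.5
print(p_positive(y_strong), p_positive(y_weak))
```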

A necessary observation: since it would be trivial to distinguish positive from negative data by simply using the length of the activity vector in the first hidden layer as input to the second hidden layer (no new features would need to be learned), FF normalizes the length of the hidden vector before using it as input to the next layer. Briefly, the activity vector in the first hidden layer has a length and an orientation: the length is used to define the goodness for that layer, and only the orientation is passed to the next layer.
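One possible way to express that normalization, assuming row-wise activity vectors and a small epsilon added for numerical stability:

```python
import torch

def normalize_length(h, eps=1e-4):
    """Divide out the length of the activity vector so only its orientation
    is passed to the next layer; the length itself was already used to
    compute this layer's goodness."""
    return h / (h.norm(p=2, dim=1, keepdim=True) + eps)

# next_layer_input = normalize_length(hidden_activities)
```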

A supervised example

To implement supervised learning with FF, one way is to include the class label in the input itself (the paper illustrates this with a figure).

An image paired with the correct label constitutes positive data, and an image paired with an incorrect label constitutes negative data. The only difference between positive and negative data is the label, so FF should ignore all image features that do not correlate with the label.
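One way to embed the label, following the scheme described in the paper, is to overwrite the first 10 pixels of the flattened image with a one-hot label pattern. The helper below is an illustrative sketch of that idea (overlay_label and its arguments are my own naming):

```python
import torch

def overlay_label(images, labels, num_classes=10):
    """Embed a label into a batch of flattened MNIST images (batch, 784) by
    writing a one-hot pattern into the first `num_classes` pixels."""
    x = images.clone()
    x[:, :num_classes] = 0.0
    x[range(images.shape[0]), labels] = images.max()   # mark the label position
    return x

# Positive data: images with their true labels; negative data: images with wrong labels.
# x_pos = overlay_label(x, y)
# x_neg = overlay_label(x, (y + torch.randint(1, 10, y.shape)) % 10)
```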

After training on the MNIST dataset with FF, a test digit can be classified by running the net with a particular label as part of the input and accumulating the goodnesses of all but the first hidden layer. This has to be done for each label separately, and the label with the highest accumulated goodness is chosen. The paper reports that, during training, hard negative labels were picked using a forward pass from a neutral label.

With MNIST, after training all the layers, to make a prediction for a test image x we evaluate the pair (x, y) for every label y in {0, 1, …, 9} and pick the label that maximizes the accumulated goodness, as in the sketch below.
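Putting the pieces together, a hedged sketch of that prediction loop might look as follows (it assumes the overlay_label and normalize_length helpers from the earlier snippets, and layers that behave like the FFLayer sketch above):

```python
import torch

def predict(layers, images, num_classes=10):
    """Try every label for every test image, accumulate the goodness of all
    but the first hidden layer, and return the best-scoring label."""
    scores = []
    with torch.no_grad():
        for label in range(num_classes):
            candidate = torch.full((images.shape[0],), label, dtype=torch.long)
            h = overlay_label(images, candidate)        # image paired with this label
            goodness_sum = torch.zeros(images.shape[0])
            for i, layer in enumerate(layers):
                h = layer(normalize_length(h))
                if i > 0:                               # skip the first hidden layer
                    goodness_sum += h.pow(2).sum(dim=1)
            scores.append(goodness_sum)
    return torch.stack(scores, dim=1).argmax(dim=1)     # best label per image
```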

Performance

Hinton’s paper reports a brief comparison between FF and backpropagation on CIFAR-10: the test performance of FF is slightly worse than that of backpropagation. There is also an interesting page analyzing FF’s performance against backpropagation (linked below).

Code implementations

I’d like to mention two GitHub repositories, one from Nebuly-ai and the other from Mohammad Pezeshki. Both are PyTorch implementations.

Useful Links

The Forward-Forward Algorithm: Some Preliminary Investigations
G. Hinton
arXiv:2212.13345 [cs.LG], 2022.

Code from Nebuly-ai.

Code from M. Pezeshki.

Detailed Backpropagation Algorithm (link).

Interesting performance analysis page.
