PyTorch backward function

This post examines some examples of the backward() function of PyTorch's autograd (automatic differentiation) engine. As you may already know, if you want to compute all the derivatives of a tensor, you can call backward() on it. The torch.Tensor.backward method relies on the autograd function torch.autograd.backward, which computes the sum of gradients of the given tensors with respect to the graph leaves (without returning it).
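For instance, here is a minimal sketch (not part of the examples below) showing that the two calls have the same effect and that the computed gradients are stored in the leaves rather than returned:

import torch

t = torch.tensor([2., 3.], requires_grad=True)
s = (t ** 2).sum()
torch.autograd.backward(s)  # same effect as s.backward(); returns None
print(t.grad)               # tensor([4., 6.]): gradients are accumulated into t.grad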

A first example

In a tutorial fashion, import the torch library

import torch

and consider the matrix

\displaystyle x := \begin{bmatrix} x_1 & x_2 \\ x_3 & x_4\end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 1\end{bmatrix}

coded as

x = torch.ones(2, 2, requires_grad=True)

and y defined as

y := x + 2 = \begin{bmatrix} x_1+2 & x_2 + 2\\ x_3+2 & x_4+2\end{bmatrix} = \begin{bmatrix} y_1 & y_2 \\ y_3 & y_4\end{bmatrix} = \begin{bmatrix} 3 & 3 \\ 3 & 3\end{bmatrix} \,.

Note that, throughout the whole post, the asterisk symbol \ast stands for entry-wise multiplication, not the usual matrix multiplication. Then we define z in terms of y:

z = y*y*3 = 3* \begin{bmatrix} y_1 & y_2 \\ y_3 & y_4 \end{bmatrix} * \begin{bmatrix} y_1 & y_2 \\ y_3 & y_4 \end{bmatrix} = \begin{bmatrix} 3 y_1^2 & 3y_2^2 \\ 3y_3^2 & 3y_4^2\end{bmatrix} = \begin{bmatrix} 27 & 27 \\ 27 & 27 \end{bmatrix} \,.

Define out as the mean of the entries of z:

\displaystyle \texttt{out}=\frac{1}{4}\left(3y_{1}^2 + 3y_{2}^2 + 3y_{3}^2 + 3y_{4}^2\right)\,.

Important: out contains a single real value. This value is the result of a scalar function (in this case, the mean function).

y = x + 2
z = y * y * 3
out = z.mean()

Now, how do we compute the derivative of out with respect to x? First we type

out.backward() 

to calculate the gradient of the current tensor and then, to return \displaystyle \partial \texttt{out}/\partial x , we use

x.grad
tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])

These values are obtained because, for example, taking the derivative w.r.t. x_{2} one gets

\displaystyle \begin{aligned} \frac{\partial \texttt{out}}{\partial x_2} &= \frac{\partial }{\partial x_2} \left( \frac{1}{4} ( 3 y_1^2 + 3y_2^2 + 3y_3^2 + 3y_4^2) \right) \\ & = 0 + \frac{3}{4} \frac{\partial}{\partial x_2} y_2^2 + 0 + 0 \\ &= \frac{3}{4} \frac{\partial}{\partial x_2} (x_2 + 2)^2 \\ & = \frac{3}{2} \cdot (x_2 + 2) \stackrel{x_2=1} {\longrightarrow} \frac{3}{2} \cdot 3 = 4.5 \,. \end{aligned}

The zeros in the second line of the computation are due to the fact that y_1, y_3 and y_4 do not depend on x_2, hence their derivatives are zero. The grad attribute is None by default and becomes a tensor the first time a call to backward() computes gradients for the tensor itself. The attribute will then contain the computed gradients, and future calls to backward() will accumulate (add) gradients into it (a short sketch below illustrates this accumulation). Alternatively, use just

torch.autograd.grad(outputs=out, inputs=x)

instead of x.grad, without calling backward(). What happens if you call, for example, z.grad or y.grad? Neither z nor y is a graph leaf, so you will get no result and a warning (see the next section).
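Here is the promised sketch of gradient accumulation (retain_graph=True keeps the graph alive so that backward() can be called a second time):

x = torch.ones(2, 2, requires_grad=True)
out = (3 * (x + 2) ** 2).mean()
out.backward(retain_graph=True)  # first call: x.grad filled with 4.5
print(x.grad)
out.backward()                   # second call: gradients are added, x.grad now holds 9.0
print(x.grad)
x.grad.zero_()                   # reset the accumulated gradients in place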

A neural network example

Neural networks are trained with the backpropagation algorithm: the network parameters (model weights) are adjusted according to the gradient of the loss function with respect to each parameter. PyTorch has torch.autograd as a built-in engine to compute those gradients; it supports automatic computation of gradients for any computational graph. Consider the simplest one-layer neural network, with input x, parameters W and b, and some loss function.

Fig. 1. Simple Neural Network
x = torch.ones(8)  # input tensor
y = torch.zeros(10)  # expected output
W = torch.randn(8, 10, requires_grad=True) # weights
b = torch.randn(10, requires_grad=True) # bias vector
z = torch.matmul(x, W) + b # output
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

We can represent the code with the following computational graph.

Fig. 2. Computational Graph

W and b are parameters and x is the input. We can only obtain the grad properties for the leaf nodes of the computational graph that have the requires_grad property set to True. Accessing grad on non-leaf nodes elicits a warning message.

Try it yourself typing:

loss.backward()
print(W.grad)
print(b.grad)
print(x.grad)
print(y.grad)
print(z.grad) # WARNING
print(loss.grad) # WARNING

Note that loss is a scalar output. Applying backward() directly on loss (with no arguments) is not a problem, because loss is a single value and it is unambiguous to take its derivative with respect to each leaf variable. The situation changes when trying to call backward() on a non-scalar output. For example, consider the 10-entry tensor z: when calling backward() on it, what do you expect x.grad to be? We will address this problem in the following sections.
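As a quick preview, here is a hedged sketch (it rebuilds z, since the graph above was freed by loss.backward()) showing both the error raised by a plain backward() on a non-scalar output and one possible way around it:

z = torch.matmul(x, W) + b      # rebuild the non-scalar output
# z.backward()                  # RuntimeError: grad can be implicitly created only for scalar outputs
z.backward(torch.ones_like(z))  # supplying a "vector" of the same shape as z works
print(W.grad.shape)             # torch.Size([8, 10])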

Vector-Jacobian product

In general, torch.autograd is an engine for computing vector-Jacobian products, that is, the product J^\top \cdot v where v is any vector and

J^\top = \begin{bmatrix} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{bmatrix} \,.

If v happens to be the gradient of a scalar function l=g(y), that is

\displaystyle v = \begin{bmatrix} \frac{\partial{l}}{\partial{y_1}} \\ \vdots \\ \frac{\partial{l}}{\partial{y_m}} \end{bmatrix} \,,

then the vector-Jacobian product returns the gradient of l with respect to x:

J^\top \cdot v = \begin{bmatrix} \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}} \end{bmatrix} \begin{bmatrix} \frac{\partial l}{\partial y_{1}} \\ \vdots \\ \frac{\partial l}{\partial y_{m}} \end{bmatrix} = \begin{bmatrix} \frac{\partial l}{\partial x_{1}} \\ \vdots \\ \frac{\partial l}{\partial x_{n}} \end{bmatrix}\,.

Let’s focus on our first example to understand what actually happens. Note that out from the first example is a scalar function just like the function l = g(y) cited above: you can think of mean as g and out as l, i.e. out = mean(y₁, y₂, y₃, y₄). In our case, the transposed Jacobian is the following

\displaystyle J^\top = \begin{bmatrix} \frac{\partial y_{1}}{\partial x_{1}} & \frac{\partial y_{2}}{\partial x_{1}} & \frac{\partial y_{3}}{\partial x_{1}} & \frac{\partial y_{4}}{\partial x_{1}} \\ \frac{\partial y_{1}}{\partial x_{2}} & \frac{\partial y_{2}}{\partial x_{2}} & \frac{\partial y_{3}}{\partial x_{2}} & \frac{\partial y_{4}}{\partial x_{2}} \\ \frac{\partial y_{1}}{\partial x_{3}} & \frac{\partial y_{2}}{\partial x_{3}} & \frac{\partial y_{3}}{\partial x_{3}} & \frac{\partial y_{4}}{\partial x_{3}} \\ \frac{\partial y_{1}}{\partial x_{4}} & \frac{\partial y_{2}}{\partial x_{4}} & \frac{\partial y_{3}}{\partial x_{4}} & \frac{\partial y_{4}}{\partial x_{4}} \end{bmatrix} = \begin{bmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{bmatrix}\,,

where yᵢ = xᵢ + 2 for i = 1, 2, 3, 4. Since v is the vector of derivatives of out with respect to the yᵢ, the product J^\top \cdot v is

\displaystyle \begin{bmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{bmatrix} \begin{bmatrix} (3/2)y_1\\(3/2)y_2\\(3/2)y_3\\(3/2)y_4\end{bmatrix} \stackrel{x_i=1}{=} \begin{bmatrix} 4.5\\4.5\\4.5\\4.5\end{bmatrix} \,.

Therefore, the vector-Jacobian product returns x.grad , as expected.
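The same numbers can be reproduced by building the Jacobian explicitly. The following sketch (which flattens x and y into plain vectors and uses torch.autograd.functional.jacobian) is just a cross-check, not what autograd does internally:

from torch.autograd.functional import jacobian

x = torch.ones(4, requires_grad=True)   # x1, ..., x4 as a flat vector
f = lambda t: t + 2                     # y_i = x_i + 2
J = jacobian(f, x)                      # the 4x4 identity matrix
v = 1.5 * f(x).detach()                 # v_i = d(out)/d(y_i) = (3/2) y_i
print(J.T @ v)                          # tensor([4.5000, 4.5000, 4.5000, 4.5000])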

The function torch.autograd.grad computes and returns the sum of gradients of the outputs with respect to the inputs. If the output is not a scalar quantity, then one has to specify v, the “vector” in the vector-Jacobian product. Note that torch.autograd.grad is a function, while torch.Tensor.grad is a tensor attribute.

Example 1

Another example of vector-Jacobian product is the following. Here we suppose that v is the gradient of an unspecified scalar function l = g(y). The tensor v is defined by torch.rand(3).

import torch
x = torch.rand(3, requires_grad=True)
y = x + 2

# y.backward() <---
# RuntimeError: grad can be implicitly 
# created only for scalar outputs
# try ---> y.backward(v) where v is any tensor of length 3

v = torch.rand(3)

y.backward(v)
print(x.grad)

Alternatively, just use

torch.autograd.grad(outputs=y, inputs=x, grad_outputs=v)

instead of x.grad, without calling backward(). The tensor v has to be passed as grad_outputs.
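Keep in mind that torch.autograd.grad returns a tuple with one tensor per input. A small sketch, assuming a fresh forward pass (the graph is freed once backward() has been called):

x = torch.rand(3, requires_grad=True)
y = x + 2
v = torch.rand(3)
(grad_x,) = torch.autograd.grad(outputs=y, inputs=x, grad_outputs=v)
print(grad_x)   # equal to v here, since each dy_i/dx_i is 1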

Example 2

Let x = [x₁, x₂] and define y as

[y_1,y_2,y_3] := [x_1^2,\, x_1^2 + 5x_2^2\,, 3x_2 ]\,.

In this case the transposed Jacobian J^\top is

\displaystyle \begin{bmatrix} \frac{\partial y_{1}}{\partial x_{1}} & \frac{\partial y_{2}}{\partial x_{1}} & \frac{\partial y_{3}}{\partial x_{1}} \\ \frac{\partial y_{1}}{\partial x_{2}} & \frac{\partial y_{2}}{\partial x_{2}} & \frac{\partial y_{3}}{\partial x_{2}} \end{bmatrix} = \begin{bmatrix} 2x_1 & 2x_1 & 0 \\ 0 & 10x_2 & 3 \end{bmatrix}\,.

Now, assign numeric values to x₁ and x₂ by setting x = [1, 2] and, since y is not a single scalar output, choose the vector v to be, for simplicity, [1, 1, 1]. The vector-Jacobian product is

\displaystyle \begin{bmatrix} 2 & 2 & 0 \\ 0 & 20 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ 23 \end{bmatrix} \,.

This example is easily represented by the following code.

import torch
x = torch.tensor([1., 2], requires_grad=True)
print("x: ", x)

y = torch.empty(3)
y[0] = x[0]**2
y[1] = x[0]**2 + 5*x[1]**2
y[2] = 3*x[1]
print('y:', y)

v = torch.tensor([1., 1, 1,])
y.backward(v) 
print('x.grad:', x.grad)
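As a cross-check (a sketch using torch.autograd.functional.jacobian, not part of the original computation), the full Jacobian can also be built explicitly and multiplied by v:

from torch.autograd.functional import jacobian

def f(t):
    return torch.stack([t[0]**2, t[0]**2 + 5*t[1]**2, 3*t[1]])

J = jacobian(f, torch.tensor([1., 2.]))  # shape (3, 2): entry (i, j) is dy_i/dx_j
print(J.T @ v)                           # tensor([ 4., 23.]), matching x.grad above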

The general case

We have seen — and it is also shown on the official autograd page — that if you have a function y = f(x) and a vector v happens to be the gradient of a scalar function l = g(y) , then the vector-Jacobian product would be the gradient of l with respect to x . But what happens when v is not a simple vector? Consider the following code.

x = torch.tensor([[1.,2,3],[4,5,6]], requires_grad=True)
y = torch.log(x)
# y is a 2x3 tensor obtained by taking the logarithm entry-wise

v = torch.tensor([[3.,2,0],[4,0,1]], requires_grad=True)
# v is not a 1D tensor!

y.backward(v)
x.grad # returns dl/dx, as evaluated by "matrix-Jacobian" product v * dy/dx

# therefore we can interpret v as a matrix dl/dy
# for which the chain rule expression dl/dx = dl/dy * dy/dx holds.
tensor([[3.0000, 1.0000, 0.0000],
        [1.0000, 0.0000, 0.1667]])

Here y is a function of the 2×3 matrix x: it is obtained by applying the natural logarithm entry-wise to the elements of x. As you can see, in this case v is not a simple vector, it is a 2×3 matrix. So how do we interpret v? Here v can be interpreted as the matrix

\displaystyle \frac{\mathrm{d} l}{\mathrm{d} y} = \begin{bmatrix} \frac{\partial l}{\partial y_{11}} & \frac{\partial l}{\partial y_{12}} & \frac{\partial l}{\partial y_{13}} \\ \frac{\partial l}{\partial y_{21}} & \frac{\partial l}{\partial y_{22}} & \frac{\partial l}{\partial y_{23}} \end{bmatrix}

such that, performing the entry-wise product with the “Jacobian” \mathrm{d}y/\mathrm{d}x, one gets

\displaystyle \frac{\partial l}{\partial y_{jk}} \frac{\partial y_{jk}}{\partial x_{jk}} = \frac{\partial l}{\partial x_{jk}}

in a chain-rule fashion (each entry y_{jk} depends only on the corresponding x_{jk}). Hence, when calling x.grad, we obtain the result of the entry-wise product of v by \mathrm{d}y / \mathrm{d}x :

\displaystyle \begin{aligned} v * \frac{\mathrm{d} y}{\mathrm{d}x} & = \frac{\mathrm{d} l}{\mathrm{d}y} * \frac{\mathrm{d} y}{\mathrm{d}x} \\ &= \begin{bmatrix} 3 & 2 & 0 \\ 4 & 0 & 1 \end{bmatrix} * \begin{bmatrix} 1 & 1/2 & 1/3 \\ 1/4 & 1/5 & 1/6 \end{bmatrix} \\&= \begin{bmatrix} 3 & 1 & 0 \\ 1 & 0 & 0.1667 \end{bmatrix}\,. \end{aligned}

For the “Jacobian” \mathrm{d}y / \mathrm{d}x (it is not the actual Jacobian, which would be a 6×6 matrix containing all the ∂yᵢ/∂xⱼ; it is the matrix of the Jacobian’s diagonal entries), remember that the derivative of log(z) is 1/z. Note that x.grad is equivalent to the entry-wise product v*(1/x), where the matrix 1/x contains the reciprocals of the entries of x.
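A one-line check of this claim, reusing x and v from the snippet above (detached so that no new graph is recorded):

print(v.detach() * (1.0 / x.detach()))  # same values as x.grad above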

The math in PyTorch autograd’s tutorial page about the vector-Jacobian product is fine, but it may be misleading in cases like the latter example: what PyTorch actually evaluates here is an entry-wise product between v (interpreted as the matrix of derivatives of the function l with respect to y) and the matrix \mathrm{d}y/\mathrm{d}x (the matrix containing the entry-wise derivatives of y with respect to x).

Feel free to email me for comments, questions, suggestions or if you just want to leave a message.