This post is the follow-up to my post on forward-mode AD. There I motivated the use of automatic differentiation by noting that it is helpful for gradient-based optimization. In this post I will discuss how reverse mode allows us to take the derivative of a function output with respect to basically an arbitrary number of parameters in one sweep. To do this comfortably, one often evaluates local derivatives, whose use is facilitated by introducing the bar notation.

To see this, let’s go the other way around. Given our computational graph, we start at the output $y$ and calculate derivatives with respect to intermediate values. For convenience we define the notation

$$\bar{u} = \frac{\partial y}{\partial u}$$

that is, the symbol under the bar is the variable we wish to take the derivative with respect to. Now let’s dive right in and take derivatives from the example. We start out with

$$\bar{v}_6 = \frac{\partial y}{\partial v_6} = 1$$

$$\bar{v}_5 = \frac{\partial y}{\partial v_5} = \frac{\partial y}{\partial v_6} \frac{\partial v_6}{\partial v_5} = \bar{v}_6 \, \frac{\partial\, v_6(v_4, v_5)}{\partial v_5} = \bar{v}_6 \, v_4$$

The first expression is often called the seed gradient and is trivial to evaluate. In order to evaluate $\bar{v}_5$ we had to use the chain rule.

Continuing, we now have to evaluate $\bar{v}_4$. Let's do it first using the chain rule:

$$\frac{\partial y}{\partial v_4} = \frac{\partial y}{\partial v_6} \left( \frac{\partial v_6}{\partial v_5} \frac{\partial v_5}{\partial v_4} + \frac{\partial v_6}{\partial v_4} \right) = v_4 \cdot (-1) + v_5,$$

where we have used the chain rule and that $v_6 = v_6(v_5(v_4), v_4)$. Looking at the computational graph, we see that the derivative splits into a sum precisely where there are multiple paths from the output $y$ to the target node $v_4$: one directly via $v_6$ and one via $v_5$. Now, as trivial as this example is, real-world programs are often much more complicated, and the formulas for the partial derivatives become ever more complex.
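Before moving on, we can sanity-check this multi-path derivative symbolically. The following is a small SymPy sketch (SymPy is my choice here; the intermediate definitions $v_5 = v_2 - v_4$ and $y = v_6 = v_4 v_5$ are taken from the example's evaluation trace further below):

```python
import sympy as sp

# Intermediate values from the example: v5 = v2 - v4 and y = v6 = v4 * v5
v2, v4 = sp.symbols("v2 v4")
v5 = v2 - v4
y = v4 * v5

dy_dv4 = sp.diff(y, v4)
print(sp.expand(dy_dv4))                 # v2 - 2*v4
print(sp.simplify(dy_dv4 - (v5 - v4)))   # 0, i.e. dy/dv4 = v4*(-1) + v5
```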

Now the bar notation is here to make our life easier. Essentially, it gives us an easy way to replace the explicit chain-rule evaluation with local values $\bar{v}_j$ that are stored in all child nodes $v_j$ of the target node $v_i$. Put another way, they are used as cached values when traversing the graph from right to left, as we do in reverse-mode AD. This is illustrated in the sketch below:

Computational graph for reverse-mode autodiff

We see that there are two gradients incoming to the $v_4$ node: $\bar{v}_5 \, \partial v_5 / \partial v_4$ and $\bar{v}_6 \, \partial v_6 / \partial v_4$. Assuming that $\bar{v}_6$ and $\bar{v}_5$ are already evaluated, we just need the partial derivatives $\partial v_5 / \partial v_4$ and $\partial v_6 / \partial v_4$ to find $\bar{v}_4$:

$$\bar{v}_4 = \bar{v}_5 \frac{\partial v_5}{\partial v_4} + \bar{v}_6 \frac{\partial v_6}{\partial v_4} = \bar{v}_5 \cdot (-1) + \bar{v}_6 \, v_5.$$

And indeed, we have recovered the same rule as when applying the chain rule. When we continue to calculate derivatives $\bar{v}_i$, we see that this notation becomes very handy. For example, using the chain rule we have

$$\bar{v}_3 = \frac{\partial y}{\partial v_3} = \bar{v}_6 \, \frac{\partial\, v_6\!\left(v_5(v_4(v_1, v_3)),\, v_4(v_1, v_3)\right)}{\partial v_3}$$

where we again need to use the product rule. Using the bar notation, however, the derivative becomes trivial:

$$\bar{v}_3 = \bar{v}_4 \frac{\partial v_4}{\partial v_3} = \bar{v}_4 \, v_1,$$

given that both $\bar{v}_4$ and $v_1$ have already been evaluated. In typical reverse-mode use this is the case: $v_1$ is known because we have completed one forward pass through the graph, and $\bar{v}_4$ is known because we traverse the barred values from right to left.
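To make the caching idea concrete, here is a minimal Python sketch of this accumulation pattern. It is not meant as a real AD implementation; the Node class and its fields are hypothetical and only illustrate how each child node pushes its contribution $\bar{v}_{\text{child}} \cdot \partial v_{\text{child}} / \partial v_i$ into $\bar{v}_i$:

```python
class Node:
    def __init__(self, value, parents=(), partials=()):
        self.value = value        # forward value v_i
        self.parents = parents    # nodes feeding into this node
        self.partials = partials  # local partials d(self)/d(parent)
        self.bar = 0.0            # adjoint \bar{v}_i, accumulated in reverse

def reverse_sweep(nodes_in_eval_order, output):
    output.bar = 1.0                            # seed gradient
    for node in reversed(nodes_in_eval_order):  # traverse right to left
        for parent, partial in zip(node.parents, node.partials):
            parent.bar += node.bar * partial    # child pushes its contribution

# Tiny usage example: y = a * b
a, b = Node(2.0), Node(3.0)
y = Node(a.value * b.value, parents=(a, b), partials=(b.value, a.value))
reverse_sweep([a, b, y], y)
print(a.bar, b.bar)   # 3.0 2.0, i.e. dy/da = b and dy/db = a
```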

For completeness' sake, let's calculate the reverse-mode AD evaluation trace as we did for the forward mode.

Reverse-mode AD evaluation trace
$v_{-1} = x_1 = 1.0$
$v_0 = x_2 = 1.1$
$v_1 = v_0 \, v_{-1} = (1.1)(1.0) = 1.1$
$v_2 = \exp(v_1) = 3.004$
$v_3 = \cos(v_0) = 0.4536$
$v_4 = v_1 \, v_3 = 0.4990$
$v_5 = v_2 - v_4 = (3.004) - (0.4990) = 2.5052$
$v_6 = v_4 \, v_5 = (0.4990)(2.5052) = 1.2500$
$\bar{v}_6 = \frac{\partial y}{\partial v_6} = 1$
$\bar{v}_5 = \frac{\partial y}{\partial v_5} = \bar{v}_6 \, v_4 = (1)(0.4990) = 0.4990$
$\bar{v}_4 = \frac{\partial y}{\partial v_4} = \bar{v}_6 \frac{\partial v_6}{\partial v_4} + \bar{v}_5 \frac{\partial v_5}{\partial v_4} = \bar{v}_6 \, v_5 + \bar{v}_5 \, (-1) = (1)(2.5052) + (0.4990)(-1) = 2.006$
$\bar{v}_3 = \frac{\partial y}{\partial v_3} = \bar{v}_4 \frac{\partial v_4}{\partial v_3} = \bar{v}_4 \, v_1 = (2.006)(1.1) = 2.2069$
$\bar{v}_2 = \bar{v}_5 \frac{\partial v_5}{\partial v_2} = \bar{v}_5 \cdot (1) = 0.4990$
$\bar{v}_1 = \bar{v}_2 \frac{\partial v_2}{\partial v_1} + \bar{v}_4 \frac{\partial v_4}{\partial v_1} = \bar{v}_2 \exp(v_1) + \bar{v}_4 \, v_3 = (0.4990)(3.004) + (2.006)(0.4536) = 2.4090$
$\bar{v}_0 = \bar{v}_3 \frac{\partial v_3}{\partial v_0} + \bar{v}_1 \frac{\partial v_1}{\partial v_0} = \bar{v}_3 \, (-\sin v_0) + \bar{v}_1 \, v_{-1} = (2.2069)(-0.8912) + (2.4090)(1) = 0.4422$
$\bar{v}_{-1} = \bar{v}_1 \frac{\partial v_1}{\partial v_{-1}} = \bar{v}_1 \, v_0 = (2.4090)(1.1) = 2.6500$

As a first step, we evaluate all the un-barred quantities $v_i$ before we begin the backward pass, in which we evaluate all the $\bar{v}_i$. We also see that we need values from the forward pass to calculate the barred values. For example, to calculate $\bar{v}_3$ we need the value of $v_1$, which feeds into $v_4$, when evaluating the partial derivative $\partial v_4 / \partial v_3 = v_1$.
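To connect the trace to actual code, here is a minimal Python sketch that hard-codes the forward pass and the reverse accumulation for this example (variable names such as vm1 for $v_{-1}$ are my own choice):

```python
import math

# Forward pass of the example program, following the evaluation trace.
vm1 = 1.0                  # v_{-1} = x1
v0 = 1.1                   # v_0    = x2
v1 = v0 * vm1              # v_1 = v_0 * v_{-1}
v2 = math.exp(v1)          # v_2 = exp(v_1)
v3 = math.cos(v0)          # v_3 = cos(v_0)
v4 = v1 * v3               # v_4 = v_1 * v_3
v5 = v2 - v4               # v_5 = v_2 - v_4
v6 = v4 * v5               # v_6 = v_4 * v_5 = y

# Reverse pass: accumulate the adjoints \bar{v}_i = dy/dv_i from right to left.
v6b = 1.0                                  # seed gradient
v5b = v6b * v4                             # dv6/dv5 = v4
v4b = v6b * v5 + v5b * (-1.0)              # dv6/dv4 = v5, dv5/dv4 = -1
v3b = v4b * v1                             # dv4/dv3 = v1
v2b = v5b * 1.0                            # dv5/dv2 = 1
v1b = v2b * math.exp(v1) + v4b * v3        # dv2/dv1 = exp(v1), dv4/dv1 = v3
v0b = v3b * (-math.sin(v0)) + v1b * vm1    # dv3/dv0 = -sin(v0), dv1/dv0 = v_{-1}
vm1b = v1b * v0                            # dv1/dv_{-1} = v_0

print(f"y      = {v6:.4f}")    # ~1.25
print(f"dy/dx1 = {vm1b:.4f}")  # ~2.65, matching the trace above
print(f"dy/dx2 = {v0b:.4f}")   # ~0.44
```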

We also see that this procedure allows us to calculate the derivatives $\partial y / \partial v_i$ for an arbitrary number of $v_i$ in a single reverse pass. This is exactly the behaviour that makes reverse-mode automatic differentiation so powerful in the context of gradient-based optimization, where we want to take the derivative of a scalar loss function with respect to a possibly large number of parameters.

Reverse-mode AD in higher dimensions: vector-Jacobian products

We often work in higher-dimensional spaces; hidden layers in multilayer perceptrons, for example, can have hundreds or even thousands of features. It is therefore instructive to take a look at how reverse-mode AD works in this setting. For this, we consider a new example:

$$x \in \mathbb{R}^n, \qquad f: \mathbb{R}^n \rightarrow \mathbb{R}^m, \qquad g: \mathbb{R}^m \rightarrow \mathbb{R}^l, \qquad u \in \mathbb{R}^m, \qquad y \in \mathbb{R}^l$$

$$u = f(x), \qquad y = g(u) = g(f(x))$$

The computational graph for the program $y = g(f(x))$ is just

Computational graph for simple example with gradient backprop

Now the formulas for the higher-dimensional case arise naturally from the ones derived in the previous section. Since we are discussing reverse mode, we start at the right. The bar quantities for each element of the first hidden layer are given by:

$$\bar{u}_j = \sum_i \bar{y}_i \frac{\partial y_i}{\partial u_j}$$

Using vector notation, this equation can be written as

$$\begin{bmatrix} \bar{y}_1 & \cdots & \bar{y}_l \end{bmatrix}
\begin{bmatrix}
\frac{\partial y_1}{\partial u_1} & \cdots & \frac{\partial y_1}{\partial u_m} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_l}{\partial u_1} & \cdots & \frac{\partial y_l}{\partial u_m}
\end{bmatrix}
=
\begin{bmatrix} \bar{u}_1 & \cdots & \bar{u}_m \end{bmatrix}$$

Now $\bar{u}$ is the vector that we get as an output from that calculation, and it will be propagated leftwards. We obtain it by calculating a vector-Jacobian product (vjp). In practice, one usually does not build the full Jacobian matrix; it is more efficient to calculate the vjp directly.
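To make this concrete, here is a small sketch in JAX (my choice of library; the post itself is library-agnostic). The map g below is made up for illustration; jax.vjp returns the vector-Jacobian product without ever materializing the Jacobian, which we can verify against the explicit Jacobian for this tiny problem:

```python
import jax
import jax.numpy as jnp

# Hypothetical map g: R^3 -> R^2, only for illustration.
A = jnp.array([[1.0, -2.0, 0.5],
               [0.3,  1.5, -1.0]])

def g(u):
    return jnp.tanh(A @ u)

u = jnp.array([0.1, -0.4, 0.7])
y_bar = jnp.array([1.0, -0.5])         # incoming gradient \bar{y}

# Reverse mode: obtain the vjp without building the Jacobian.
y, g_vjp = jax.vjp(g, u)
(u_bar,) = g_vjp(y_bar)

# For a problem this small we can afford the explicit 2x3 Jacobian
# and check that \bar{u} = \bar{y} J.
J = jax.jacobian(g)(u)
print(jnp.allclose(u_bar, y_bar @ J))  # True
```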

Moving leftwards along the graph, we use the same procedure to calculate $\bar{x}$:

$$\begin{bmatrix} \bar{u}_1 & \cdots & \bar{u}_m \end{bmatrix}
\begin{bmatrix}
\frac{\partial u_1}{\partial x_1} & \cdots & \frac{\partial u_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial u_m}{\partial x_1} & \cdots & \frac{\partial u_m}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix} \bar{x}_1 & \cdots & \bar{x}_n \end{bmatrix}$$

Combining these results we find

$$\bar{x} = \bar{y} \left[ \left\{ \frac{\partial y_i}{\partial u_j} \right\}_{i,j} \right] \left[ \left\{ \frac{\partial u_r}{\partial x_s} \right\}_{r,s} \right].$$

This equation is evaluated left-to-right, so that it only requires vector-matrix products and never a matrix-matrix product.
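The cost argument is easy to see with stand-in Jacobians (random matrices here, purely for illustration): evaluating $(\bar{y} J_g) J_f$ uses only vector-matrix products, while forming $J_g J_f$ first would require a matrix-matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
l, m, n = 2, 4, 3
J_g = rng.normal(size=(l, m))   # stand-in for dy/du
J_f = rng.normal(size=(m, n))   # stand-in for du/dx
y_bar = rng.normal(size=l)

# Left-to-right: two vector-matrix products, never a matrix-matrix product.
u_bar = y_bar @ J_g             # shape (m,)
x_bar = u_bar @ J_f             # shape (n,)

# Same result as forming the full chained Jacobian first, but cheaper.
print(np.allclose(x_bar, y_bar @ (J_g @ J_f)))   # True
```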

As a final note, these vector-Jacobian products are often called pullbacks in the context of automatic differentiation software. A notation used for them is

$$\bar{u} = \sum_i \bar{y}_i \frac{\partial y_i}{\partial u} = \mathcal{B}^u_g(\bar{y}).$$

Let’s unpack the expression $\mathcal{B}^u_g(\bar{y})$. The upper index simply denotes that the pullback is a function of $u$, since it needs the values from the forward pass at that node in order to be evaluated. The lower index $g$ denotes that it depends on the mapping $g$, since this is the mapping from $u$ to $y$: $g(u) = y$. Finally, the argument $\bar{y}$ denotes that the pullback needs the incoming gradient from the right.
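In code, a pullback is naturally represented as a closure returned by the forward computation, capturing whatever forward values it needs. Here is a hand-written sketch for a made-up map $g(u) = \tanh(Au)$ (the weights and names are hypothetical); it mirrors what jax.vjp did for us above:

```python
import numpy as np

def g_with_pullback(u):
    """Hypothetical layer g(u) = tanh(A @ u); returns y and its pullback."""
    A = np.array([[1.0, -2.0, 0.5],
                  [0.3,  1.5, -1.0]])   # fixed weights, just for illustration
    z = A @ u
    y = np.tanh(z)

    def pullback(y_bar):
        # u_bar_j = sum_i y_bar_i * dy_i/du_j, with dy_i/du_j = (1 - y_i^2) * A_ij
        return (y_bar * (1.0 - y**2)) @ A

    return y, pullback

u = np.array([0.1, -0.4, 0.7])
y, B_g = g_with_pullback(u)      # forward pass stores what the pullback needs
y_bar = np.ones_like(y)          # incoming gradient from the right
u_bar = B_g(y_bar)               # \bar{u} = pullback applied to \bar{y}
print(u_bar)
```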

In practice, automatic differentiation software uses the chain rule to split the computational graph of a program into finer units until it can identify a pullback for each segment of the graph. In many cases this is exactly what we want, but sometimes a user-defined backpropagation rule is desirable.
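As an example of a user-defined rule, here is a sketch using JAX's custom_vjp mechanism (again, JAX is my assumption; the function log1pexp and its rule are only for illustration):

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def log1pexp(x):
    return jnp.log1p(jnp.exp(x))

def log1pexp_fwd(x):
    # Forward pass: return the primal output and the residuals the pullback needs.
    return log1pexp(x), x

def log1pexp_bwd(x, y_bar):
    # User-defined pullback: d/dx log(1 + e^x) = sigmoid(x).
    return (y_bar * jax.nn.sigmoid(x),)

log1pexp.defvjp(log1pexp_fwd, log1pexp_bwd)

print(jax.grad(log1pexp)(2.0))   # ~0.8808, i.e. sigmoid(2.0)
```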

Summary

We have introduced reverse-mode automatic differentiation. By starting with a seed gradient at the end of the computational graph, this mode allows us to quickly calculate the sensitivity of an output with respect to an arbitrarily large number of intermediate values. To aid the necessary computations, the bar notation has been introduced. Finally, we motivated that reverse-mode AD only needs to calculate vector-Jacobian products and introduced the notion of the pullback as the primitive object of reverse-mode AD. Please check out the references below, which I used for this blog post.

References

[1] C. Rackauckas, 18.337J/6.338J: Parallel Computing and Scientific Machine Learning

[2] S. Radcliffe, Autodiff in Python

[3] R. Grosse, Lecture notes, CSC321

[4] A. Griewank and A. Walther, Evaluating Derivatives, SIAM (2008)