The goal of this post is to show the math of backpropagating a derivative for a fully-connected (FC) neural network layer consisting of matrix multiplication and bias addition. I have briefly mentioned this in an earlier post dedicated to Softmax, but here I want to give some more attention to FC layers specifically.

Here is a fully-connected layer for input vectors with *N* elements, producing
output vectors with *T* elements:

As a formula, we can write:

Presumably, this layer is part of a network that ends up computing some loss
*L*. We'll assume we already have the derivative of the loss w.r.t. the output
of the layer .

We'll be interested in two other derivatives: and .

## Jacobians and the chain rule

As a reminder from The Chain Rule of Calculus,
we're dealing with functions that map from *n* dimensions to *m* dimensions:
. We'll consider the outputs of *f*
to be numbered from 1 to *m* as . For each such
we can compute its partial derivative by any of the *n* inputs as:

Where *j* goes from 1 to *n* and *a* is a vector with *n* components. If *f*
is differentiable at *a* then the derivative of *f* at *a* is the *Jacobian
matrix*:

The multivariate chain rule states: given
and and a point ,
if *g* is differentiable at *a* and *f* is differentiable at then
the composition is differentiable at *a* and its derivative
is:

Which is the matrix multiplication of and .

## Back to the fully-connected layer

Circling back to our fully-connected layer, we have the loss - a
scalar function . We also have the
function . If we're interested in the derivative w.r.t the
weights, what are the dimensions of this function? Our "variable part" is then
*W*, which has *NT* elements overall, and the output has *T* elements, so
[1].

The chain rule tells us how to compute the derivative of *L* w.r.t. *W*:

Since we're backpropagating, we already know ; because of the
dimensionality of the *L* function, the dimensions of are
[1,T] (one row, *T* columns). has *NT* inputs and *T* outputs,
so the dimensions of are [T,NT]. Overall, the dimensions of
are then [1,NT]. This makes sense if you think about it,
because as a function of *W*, the loss has *NT* inputs and a single scalar
output.

What remains is to compute , the Jacobian of *y* w.r.t. *W*. As
mentioned above, it has *T* rows - one for each output element of *y*, and *NT*
columns - one for each element in the weight matrix *W*. Computing such a large
Jacobian may seem daunting, but we'll soon see that it's very easy to generalize
from a simple example. Let's start with :

What's the derivative of this result element w.r.t. each element in *W*? When
the element is in row 1, the derivative is (*j* being the column
of *W*); when the element is in any other row, the derivative is 0.

Similarly for , we'll have non-zero derivatives only for the second
row of *W* (with the same result of being the derivative for the
*j*-th column), and so on.

Generalizing from the example, if we split the index of *W* to *i* and *j*, we
get:

This goes into row *t*, column in the Jacobian matrix. Overall,
we get the following Jacobian matrix with shape [T,NT]:

Now we're ready to finally multiply the Jacobians together to complete the chain rule:

The left-hand side is this row vector:

And we're multiplying it by the matrix shown above. Each item in the
result vector will be a dot product between and the corresponding
column in the matrix . Since has a single non-zero element
in each column, the result is fairly trivial. The first *N* entries are:

The next *N* entries are:

And so on, until the last (*T*-th) set of *N* entries is all *x*-es multiplied
by .

To better see how to apply each derivative to a corresponding element in *W*, we
can "re-roll" this result back into a matrix of shape [T,N]:

## Computational cost and shortcut

While the derivation shown above is complete and mathematically correct, it can
also be computationally intensive; in realistic scenarios, the full Jacobian
matrix can be *really* large. For example, let's say our input is a (modestly
sized) 128x128 image, so *N=16,384*. Let's also say that *T=100*. The weight
matrix then has *NT=1,638,400* elements; respectably big, but nothing out of the
ordinary.

Now consider the size of the full Jacobian matrix: it's *T* by *NT*, or over
160 million elements. At 4 bytes per element that's more than half a GiB!

Moreover, to compute every backpropagation we'd be forced to multiply this full Jacobian matrix by a 100-dimensional vector, performing 160 million multiply-and-add operations for the dot products. That's a lot of compute.

But the final result is the size of *W* - 1.6 million
elements. Do we really need 160 million computations to get to it? No, because
the Jacobian is very *sparse* - most of it is zeros. And in fact, when we look
at the found above - it's fairly straightforward to
compute using a single multiplication per element.

Moreover, if we stare at the matrix a
bit, we'll notice it has a familiar pattern: this is just the outer product between the vectors
and *x*:

If we have to compute this backpropagation in Python/Numpy, we'll likely write code similar to:

```
# Assuming dy (gradient of loss w.r.t. y) and x are column vectors, by
# performing a dot product between dy (column) and x.T (row) we get the
# outer product.
dW = np.dot(dy, x.T)
```

## Bias gradient

We've just seen how to compute weight gradients for a fully-connected layer. Computing the gradients for the bias vector is very similar, and a bit simpler.

This is the chain rule equation applied to the bias vector:

The shapes involved here are: is still [1,T], because the
number of elements in *y* remains *T*. has *T* inputs (bias
elements) and *T* outputs (*y* elements), so its shape is [T,T]. Therefore, the
shape of the gradient is [1,T].

To see how we'd fill the Jacobian matrix , let's go back to the
formula for *y*:

When derived by anything other than , this would be 0; when derived
by the result is 1. The same applies to every other element of *y*:

In matrix form, this is just an identity matrix with dimensions [T,T]. Therefore:

For a given element of *b*, its gradient is just the corresponding element in
.

## Fully-connected layer for a batch of inputs

The derivation shown above applies to a FC layer with a single input vector *x*
and a single output vector *y*. When we train models, we almost always try to
do so in *batches* (or *mini-batches*) to better leverage the parallelism of
modern hardware. So a more typical layer computation would be:

Where the shape of *X* is [N,B]; *B* is the batch size, typically a
not-too-large power of 2, like 32. *W* and *b* still have the same shapes, so
the shape of *Y* is [T,B]. Each column in *X* is a new input vector (for a
total of *B* vectors in a batch); a corresponding column in *Y* is the output.

As before, given , our goal is to find and . While the end results are fairly simple and pretty much what you'd expect, I still want to go through the full Jacobian computation to show how to find the gradiends in a rigorous way.

Starting with the weigths, the chain rule is:

The dimensions are:

- : [1,TB] because
*Y*has*T*outputs for each input vector in the batch. - : [TB,TN] since has
*TB*outputs and*TN*inputs overall. - : [1,TN] same as in the batch-1 case, because the same weight matrix is used for all inputs in the batch.

Also, we'll use the notation to talk about the *i*-th
element in the *b*-th input vector *x* (out of a total of *B* such input
vectors).

With this in hand, let's see how the Jacobians look; starting with , it's the same as before except that we have to take the batch into account. Each batch element is independent of the others in loss computations, so we'll have:

As the Jacobian element; how do we arrange them in a 1-dimensional vector with
shape [1,TB]? We'll just have to agree on a linearization here - same as we did
with *W* before. We'll go for row-major again, so in 1-D the array *Y* would be:

And so on for *T* elements. Therefore, the Jacobian of *L* w.r.t *Y* is:

To find , let's first see how to compute *Y*. The *i*-th element
of *Y* for batch *b* is:

Recall that the Jacobian now has shape [TB,TN]. Previously we had
to unroll the [T,N] of the weight matrix into the rows. Now we'll also have to
unrill the [T,B] of the output into the columns. As before, first all *b*-s for
*t=1*, then all *b*-s for *t=2*, etc. If we carefully compute the derivative,
we'll see that the Jacobian matrix has similar structure to the single-batch
case, just with each line repeated *B* times for each of the batch elements:

Multiplying the two Jacobians together we get the full gradient of *L* w.r.t.
each element in the weight matrix. Where previously (in the non-batch case) we
had:

Now, instead, we'll have:

Which makes total sense, since it's simply taking the loss gradient computed from each batch separately and adds them up. This aligns with our intuition of how gradient for a whole batch is computed - compute the gradient for each batch element separately and add up all the gradients [2].

As before, there's a clever way to express the final gradient using matrix operations. Note the sum across all batch elements when computing . We can express this as the matrix multiplication:

This is a good place to recall the computation cost again. Previously we've seen that for a single-input case, the Jacobian can be extremely large ([T,NT] having about 160 million elements). In the batch case, the Jacobian would be even larger since its shape is [TB,NT]; with a reasonable batch of 32, it's something like 5-billion elements strong. It's good that we don't actually have to hold the full Jacobian in memory and have a shortcut way of computing the gradient.

## Bias gradient for a batch

For the bias, we have:

here has the shape [1,TB]; has the shape [TB,T]. Therefore, the shape of is [1,T], as before.

From the formula for computing *Y*:

We get, for any batch *b*:

So, whereas was an identity matrix in the no-batch case, here it looks like this:

With *B* identical rows at a time, for a total of *TB* rows. Since
is the same as before, their matrix
multiplication result has this in column *j*:

Which just means adding up the gradient effects from every batch element independently.

## Addendum - gradient w.r.t. x

This post started by explaining that the parameters of a fully-connected layer
we're usually looking to optimize are the weight matrix and bias. In most cases
this is true; however, in some other cases we're actually interested in
propagating a gradient through *x* - often when there are more layers before
the fully-connected layer in question.

Let's find the derivative . The chain rule here is:

Dimensions: is [1, T] as before; has T outputs
(elements of *y*) and N inputs (elements of *x*), so its dimensions are [T, N].
Therefore, the dimensions of are [1, N].

From:

We know that . Generalizing
this, we get ; in other words,
is just the weight matrix *W*. So
is the dot product of
with the *i*-th column of *W*.

Computationally, we can express this as follows:

```
dx = np.dot(dy.T, W).T
```

Again, recall that our vectors are *column* vectors. Therefore, to multiply *dy*
from the left by *W* we have to transpose it to a row vector first. The result
of this matrix multiplication is a [1, N] row-vector, so we transpose it again
to get a column.

An alternative method to compute this would transpose *W* rather than *dy* and
then swap the order:

```
dx = np.dot(W.T, dy)
```

These two methods produce exactly the same *dx*; it's important to be familiar
with these tricks, because otherwise it may be confusing to see a transposed *W*
when we expect the actual *W* from gradient computations.

[1] | As explained in the
softmax post,
we This is a fully general approach as we can linearize any-dimensional
arrays. To work with Jacobians, we're interested in |

[2] | In some cases you may hear about averaging the gradients across the
batch. Averaging just means dividing the sum by B; it's a constant
factor that can be consolidated into the learning rate. |