Backpropagation made right, for a certain value of right.
Header by ZEISS Microscopy, CC BY 2.0, via Flickr
I have just taught the SGD algorithm with backpropagation to my NLP master’s student and, as every year, I found that no explanation of it on the web looks satisfying to me. As the French saying goes, “on est jamais mieux servi⋅e que par soi-même” (“you are never better served than by yourself”), so I figured I’d better write one myself. In the worst case, it’ll be useful for a future version of me (hi mate!).
Backpropagation is a very frustrating topic. At its core, it’s nothing very complicated and you never need more than relatively basic Bachelor-level maths (a bit of linear algebra, a bit of multivariate calculus). On the other hand, it does involve manipulating many objects at once, making it hard to keep everything in mind, and even to write the process down, due to the sheer number of quantities involved. I am doing the best I can, but I realise that it can be easy to get bogged down. I can only offer a few words of advice:
- We will split the problem into several smaller problems. Only focus on one of them at a time.
- Trust the process.
- When you have fully understood all the small parts, only then put them back together.
I hate doing things this way and I am not good at it at all, but in my experience, it’s the best way to get around this little frustrating piece of knowledge.
Problem formulation
We want to optimise a neural network, i.e. to find the weights that minimise the loss on a training dataset. Formally, consider the error function:
\[ E(θ) = \sum_{(X, y)∈\mathcal{D}}\operatorname{loss}(\operatorname{net}_θ(X), y) \]
where \(\operatorname{net}_θ\) is the function implemented by the neural network when its weights are \(θ\), \(\operatorname{loss}\) is the chosen per-sample loss and \(\mathcal{D}=\{(X_1, y_1), …, (X_n, y_n)\}\) is the training dataset.
\(E\) measures the overall error made by the network on the training dataset for a given parameter vector \(θ\). In this context “training” the network means finding a value for \(θ\) such that \(E(θ)\) is minimal.
There are many algorithms for that, but for neural networks, most of the time, an efficient (possibly approximate) solution to this problem is to use a gradient-based algorithm, most commonly a variant of the Stochastic Gradient Descent (SGD) algorithm. These algorithms have (obviously) one thing in common: they require computing the gradient of the per-sample losses with respect to \(θ\). In other words, for all \((X, y)∈\mathcal{D}\), we need to be able to compute the gradient of the per-sample error function
\[ ∇_{\!θ}~e_{(X, y)}(θ) = ∇_θ~\operatorname{loss}(\operatorname{net}_θ(X), y) \]
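Just to fix ideas, here is a rough sketch of the SGD loop this gradient feeds into. Nothing here is defined yet: `grad_e` is a hypothetical helper standing for the per-sample gradient that the rest of this post is about computing.

```python
import random

def sgd(theta, dataset, grad_e, lr=1e-2, epochs=10):
    """A minimal SGD sketch: one step per sample, against the per-sample gradient.

    `theta` is a flat numpy vector of weights, and `grad_e(theta, X, y)` stands
    for the per-sample gradient that we still have to compute.
    """
    dataset = list(dataset)
    for _ in range(epochs):
        random.shuffle(dataset)
        for X, y in dataset:
            theta = theta - lr * grad_e(theta, X, y)
    return theta
```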
Notations
We are embarking on a journey with a lot of variables. Absurdly many variables. I’m trying my best to use as few as possible and to keep the notations as clear as I can, but I’m still not very satisfied with it. If you can think of improvements, please let me know!
Let’s assume that we have a neural network with \(N\) fully connected layers and using a non-linearity \(φ\). Given an input \(X\), that network will compute an output \(\hat{y}\) as:
\[ \hat{y} = f_N(f_{N-1}(… f_1(X)…)) \]
Or equivalently
\[ \hat{y} = (f_N∘f_{N-1}∘…∘f_1)(X) \]
Where \(f_ℓ\) is a fully connected neural layer (hence \(ℓ\)): a function of the form
\[ \left\lvert \begin{array}{rrl} f_ℓ:~& ℝ^{d_{ℓ-1}} \longrightarrow & ℝ^{d_ℓ}\\ & U \longmapsto & f_ℓ(U) = φ(Z) \stackrel{\mathrm{def}}{=} \begin{pmatrix}φ(z_1)\\⋮\\φ(z_{d_ℓ})\end{pmatrix} \end{array} \right. \]
Where
\[ Z = W^ℓ×U \]
\(W^ℓ∈\mathcal{Mat}(ℝ^{d_ℓ}, ℝ^{d_{ℓ-1}})\) being the weight matrix of the \(ℓ\)-th layer. We will also use the following notation for the weights of \(W^ℓ\)
\[ W^ℓ = \begin{pmatrix} w^ℓ_{1,1} & … & w^ℓ_{1, d_{ℓ-1}}\\ ⋮ & & ⋮\\ w^ℓ_{d_ℓ,1} & … & w^ℓ_{d_ℓ, d_{ℓ-1}}\\ \end{pmatrix} \]
In theory, of course, \(W^ℓ\) could also be the \(ℓ\)-th power of \(W\), so the notation I use here is ambiguous. On the other hand, we do need to put that index somewhere and it’s a convenient place. In any case, we won’t use any power of anything in this section, so let’s agree that it’s ok.
Note that in general, neural layers are biased: of the form \(f(U) = φ(W×U + b)\). We will see later why this makes no difference. For now, let’s assume that our layers have no bias; it will make our computations easier to follow.
So to sum it up, our neural network is defined by
- The sequence \((d_0, d_1, …, d_N)\) of the dimensions of its layers. \(d_0\) is the dimension of the input, and for all \(ℓ⩾1\), \(d_ℓ\) is the dimension of the \(ℓ\)-th layer.
- The weight matrices \((W^1, …, W^N)\), where \(W^ℓ\) is of dimension \(d_ℓ\) rows by \(d_{ℓ-1}\) columns.
- The non-linearity \(φ\). We will assume that it’s used for every layer except the last, i.e. \(f_N(U) = W^N×U\). It will simplify things a bit and it’s consistent with the general practice of using a specific non-linearity tied to the loss for the last layer (usually a softmax, which is meant to be used with the negative log-likelihood loss).
And we have:
\[ \hat{y} = W^N × \underbrace{φ( \underbrace{W^{N-1} × φ( … × \underbrace{φ(\underbrace{W^1×X}_{Z^1})}_{O^1}… ) }_{Z^{N-1}} )}_{O^{N-1}} \]
I have added a few more notations here:
- \(Z^ℓ=\begin{pmatrix}z^ℓ_1\\⋮\\z^ℓ_{d_ℓ}\end{pmatrix}\) is the intermediary output of the \(ℓ\)-th layer, just before applying the non-linearity.
- \(O^ℓ=\begin{pmatrix}o^ℓ_1\\⋮\\o^ℓ_{d_ℓ}\end{pmatrix}\) is the output of the \(ℓ\)-th layer (including the non-linearity).
In other words, using a recursive definition:
\[ \left\lbrace \begin{aligned} O^0 &= X\\ Z^ℓ &= W^{ℓ}×O^{ℓ-1} & \text{for $1 ⩽ ℓ ⩽ N$}\\ O^ℓ &= φ(Z^ℓ) & \text{for $1 ⩽ ℓ ⩽ N-1$}\\ \hat{y} &= Z^N\\ \end{aligned} \right. \tag{1}\]
And that’s it! Now we have all we need.
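If, like me, you sometimes find code easier to parse than recursive definitions, here is Equation 1 as a small numpy sketch (the function name and argument layout are mine, not a standard API). Note that we keep every \(Z^ℓ\) and \(O^ℓ\) around: we will need them soon.

```python
import numpy as np

def forward(Ws, phi, X):
    """Forward pass of Equation 1; returns ŷ and the cached Zs and Os.

    Ws  : list [W^1, ..., W^N], where W^l has shape (d_l, d_{l-1})
    phi : the elementwise non-linearity (e.g. np.tanh)
    X   : the input, shape (d_0,)
    """
    O = [X]      # O[0] = X
    Z = [None]   # dummy entry so that Z[l] is Z^l, 1-indexed as in the post
    N = len(Ws)
    for l, W in enumerate(Ws, start=1):
        Z.append(W @ O[-1])                       # Z^l = W^l × O^{l-1}
        O.append(phi(Z[-1]) if l < N else Z[-1])  # no φ on the last layer
    return O[-1], Z, O                            # ŷ = Z^N
```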
Doing it with numbers
Remember: our problem here is to compute the gradient of the per-sample error function with respect to the parameters of the network.
\[ ∇_{\!θ}~e_{(X, y)}(θ) = ∇_θ~\operatorname{loss}(\operatorname{net}_θ(X), y) \]
With our notations, the parameters of the networks are the elements of the weight matrices. Therefore:
\[ θ = (w^1_{1,1}, w^1_{1, 2}, …, w^ℓ_{i, j}, …, w^N_{d_N, d_{N-1}}) \]
So, using the definition of the gradient, we have for all \((X, y)\),
\[ ∇_{\!θ}~e_{(X, y)}(θ) = \begin{pmatrix} \frac{∂e_{(X, y)}}{∂w^1_{1,1}}\\ ⋮\\ \frac{∂e_{(X, y)}}{∂w^ℓ_{i,j}}\\ ⋮\\ \frac{∂e_{(X, y)}}{∂w^N_{d_N, d_{N-1}}}\\ \end{pmatrix} \]
In order to make it easier on the eyes, we’ll simplify the notation a bit and write that
\[ ∇\,e(θ) = \left(\frac{∂e}{∂w^ℓ_{i,j}}\right)_{ℓ, i, j} \]
since none of these symbols should be ambiguous.
Great news! That means that in order to reach our goal (training a neural network, remember), all we need is to be able to compute the value of \(\frac{∂e}{∂w^ℓ_{i,j}}\) for any appropriate \(ℓ\), \(i\) and \(j\).
In general this problem is not very hard, but it is very expensive, both in terms of human-powered symbolic derivation and of computer-powered numeric computations. The main reason is that while the influence of one weight on the output of its own layer is fairly simple, its influence on the outputs of the following layers gets gradually more complicated.
Fortunately, we have one great tool, the chain rule, and one angle of attack: there is no backward connection in neural networks.
More concretely, from the notations defined in Equation 1:
\[ \left\lbrace \begin{aligned} e &= \operatorname{loss}\mathopen{}\Big(f_N\big(f_{N-1}(… f_{ℓ+1}(O^ℓ)…)\big), y\Big)\\ Z^ℓ &= W^ℓ×O^{ℓ-1}\\ O^ℓ &= φ(Z^ℓ)\\ \end{aligned} \right. \tag{2}\]
Therefore, using the chain rule¹:
\[ \begin{align} \frac{∂e}{∂w^ℓ_{i,j}} &= \left\langle ∇_{\!O^ℓ}\,e \middle| \frac{∂O^ℓ}{∂w^ℓ_{i,j}} \right\rangle\\ &\stackrel{\mathrm{def}}{=} \left\langle ∇_{\!O^ℓ}\,e \middle| \begin{pmatrix} \frac{∂o^ℓ_1}{∂w^ℓ_{i,j}}\\ ⋮\\ \frac{∂o^ℓ_{d_ℓ}}{∂w^ℓ_{i,j}}\\ \end{pmatrix} \right\rangle\\ &= \sum_k \frac{∂e}{∂o^ℓ_k}\frac{∂o^ℓ_k}{∂w^ℓ_{i,j}}\\ \end{align} \tag{3}\]
where \(⟨⋅|⋅⟩\) is the scalar product.
But we know from Equation 2 that for all \(k\), \(o^ℓ_k=φ(z^ℓ_k)\), so, using the chain rule,
\[ \begin{align} \frac{∂o^ℓ_k}{∂w^ℓ_{i,j}} &= \frac{∂φ(z^ℓ_k)}{∂w^ℓ_{i,j}}\\ &= \frac{∂φ(z^ℓ_k)}{∂z^ℓ_k}\frac{∂z^ℓ_k}{∂w^ℓ_{i,j}}\\ &= φ'(z^ℓ_k)\frac{∂z^ℓ_k}{∂w^ℓ_{i,j}} \end{align} \tag{4}\]
Moreover, using Equation 2 and the definition of the matrix-vector product,
\[ z^ℓ_k = \sum_m w^ℓ_{k,m}o^{ℓ-1}_m \]
And therefore we have
\[ \begin{aligned} \frac{∂z^ℓ_k}{∂w^ℓ_{i,j}} &= \frac{∂}{∂w^ℓ_{i,j}} \left(\sum_m w^ℓ_{k,m}o^{ℓ-1}_m\right)\\ &= \sum_m \frac{∂}{∂w^ℓ_{i,j}}(w^ℓ_{k,m}o^{ℓ-1}_m) \end{aligned} \tag{5}\]
But in our neural network, there is no back-connection: the weights of a layer don’t have any influence on the output of the previous layer. Therefore \(o^{ℓ-1}_m\) does not depend on any \(w^ℓ_{i,j}\) and
\[ \begin{align} \frac{∂}{∂w^ℓ_{i,j}}(w^ℓ_{k,m}o^{ℓ-1}_m) &= \begin{dcases*} o^{ℓ-1}_j & if $(k, m) = (i, j)$\\ 0 & otherwise \end{dcases*}\\ &= 𝟙_{(k, m) = (i, j)}⋅o^{ℓ-1}_j\\ \end{align} \tag{6}\] (where \(𝟙_{(k, m) = (i, j)}\) is \(1\) if \((k, m) = (i, j)\) and \(0\) otherwise)
We plug Equation 6 into Equation 5 to get:
\[ \begin{align} \frac{∂z^ℓ_k}{∂w^ℓ_{i,j}} &= \sum_m \frac{∂}{∂w^ℓ_{i,j}}(w^ℓ_{k,m}o^{ℓ-1}_m)\\ &= \sum_m 𝟙_{(k, m) = (i, j)}⋅o^{ℓ-1}_m\\ &= 𝟙_{k=i}⋅o^{ℓ-1}_j\\ \end{align} \tag{7}\] since all the terms in the sum are \(0\), except for \(m=j\).
Similarly, if we plug Equation 4, then Equation 7 into Equation 3, we get:
\[ \begin{align} \frac{∂e}{∂w^ℓ_{i,j}} &= \sum_k \frac{∂e}{∂o^ℓ_k}\frac{∂o^ℓ_k}{∂w^ℓ_{i,j}}\\ &= \sum_k \frac{∂e}{∂o^ℓ_k}\,φ'(z^ℓ_k)\,\frac{∂z^ℓ_k}{∂w^ℓ_{i,j}}\\ &= \sum_k \frac{∂e}{∂o^ℓ_k}\,φ'(z^ℓ_k)\,𝟙_{k=i}⋅o^{ℓ-1}_j\\ &= \frac{∂e}{∂o^ℓ_i}\,φ'(z^ℓ_i)\,o^{ℓ-1}_j\\ \end{align} \]
Let’s box this last one:
\[ \boxed{ \frac{∂e}{∂w^ℓ_{i,j}}=\frac{∂e}{∂o^ℓ_i}\,φ'(z^ℓ_i)\,o^{ℓ-1}_j } \tag{8}\]
Equation 8 is important because at this point, we have no explicit \(w^ℓ_{i,j}\) left. Instead what we have is:
- \(o^{ℓ-1}_j\), which has already been computed during the forward pass.
- \(φ'(z^ℓ_i)\), which is easy to compute if you know the derivative of \(φ\), which we do, and \(z^ℓ_i\), which we know because it has also been computed during the forward pass.
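In matrix form, Equation 8 says that the whole table \(\left(\frac{∂e}{∂w^ℓ_{i,j}}\right)_{i,j}\) is an outer product. As a numpy sketch, assuming for now that someone hands us \(∇_{\!O^ℓ}\,e\):

```python
import numpy as np

def weight_grad(dE_dO, Z_l, O_prev, phi_prime):
    """Equation 8, vectorised over i and j.

    dE_dO       : ∇_{O^l} e, shape (d_l,) (still to be computed at this point)
    Z_l, O_prev : Z^l and O^{l-1}, cached during the forward pass
    phi_prime   : the derivative φ′, applied elementwise
    """
    return np.outer(dE_dO * phi_prime(Z_l), O_prev)  # shape (d_l, d_{l-1})
```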
All that’s left to compute is \(\frac{∂e}{∂o^ℓ_i}\), which is the interesting part. Let’s take a small step back and consider what that quantity is:
- \(o^ℓ_i\) is the \(i\)-th coordinate of \(O^ℓ\): the output of the \(ℓ\)-th layer.
- \(e\) is the error.
- \(\frac{∂e}{∂o^ℓ_i}\) is the \(i\)-th coordinate of \(∇_{\!O^ℓ}\,e\): the gradient of the error with respect to the output of the \(ℓ\)-th layer.
There is a special case here for \(ℓ=N\): since the last layer has no non-linearity, \(O^N=Z^N\) is \(\hat{y}\), the output of our network, and we have \(e=\operatorname{loss}(\hat{y}, y)\), so:
\[ \begin{aligned} \frac{∂e}{∂o^N_i} &= \frac{∂}{∂o^N_i}\operatorname{loss}(\hat{y}, y)\\ &= \frac{∂}{∂\hat{y}_i}\operatorname{loss}(\hat{y}, y) \end{aligned} \tag{9}\]
And that’s usually easy enough to compute², since the derivatives of the loss function are usually well-known.
Now what about the other values of \(ℓ\)? By definition of the network, we have
\[ e = \operatorname{loss}\mathopen{}\Big(f_N\big(…f_{ℓ+2}(O^{ℓ+1})…\big), y\Big) \]
From the chain rule, we have
\[ \begin{align} \frac{∂e}{∂o^ℓ_i} &= \left\langle ∇_{\!O^{ℓ+1}}\,e \middle| \frac{∂O^{ℓ+1}}{∂o^ℓ_i} \right\rangle\\ &= \sum_k \frac{∂e}{∂o^{ℓ+1}_k}\frac{∂o^{ℓ+1}_k}{∂o^ℓ_i}\\ \end{align} \tag{10}\]
And one more application of the chain rule gives us
\[ \begin{align} \frac{∂o^{ℓ+1}_k}{∂o^ℓ_i} &= \frac{∂φ(z^{ℓ+1}_k)}{∂o^ℓ_i}\\ &= \frac{∂φ(z^{ℓ+1}_k)}{∂z^{ℓ+1}_k}\frac{∂z^{ℓ+1}_k}{∂o^ℓ_i}\\ &= φ'(z^{ℓ+1}_k)\frac{∂}{∂o^ℓ_i}\left(\sum_m w^{ℓ+1}_{k,m}o^ℓ_m\right)\\ &= φ'(z^{ℓ+1}_k)\sum_m w^{ℓ+1}_{k,m}\frac{∂}{∂o^ℓ_i}o^ℓ_m\\ &= φ'(z^{ℓ+1}_k)\sum_m w^{ℓ+1}_{k,m}⋅𝟙_{m=i}\\ &= φ'(z^{ℓ+1}_k)\,w^{ℓ+1}_{k,i}\\ \end{align} \tag{11}\]
And if we plug Equation 11 into Equation 10:
\[ \begin{align} \frac{∂e}{∂o^ℓ_i} &= \sum_k \frac{∂e}{∂o^{ℓ+1}_k}\frac{∂o^{ℓ+1}_k}{∂o^ℓ_i}\\ &= \sum_k \frac{∂e}{∂o^{ℓ+1}_k}\,φ'(z^{ℓ+1}_k)\,w^{ℓ+1}_{k,i}\\ \end{align} \tag{12}\]
And we’re done!
Why? Because we know all of the terms in that last sum:
- \(w^{ℓ+1}_{k,i}\) is just a weight, no need to compute anything.
- We know \(z^{ℓ+1}_k\) from the forward pass and we know the derivative of \(φ\), so \(φ'(z^{ℓ+1}_k)\) is not a problem.
But do we know \(\frac{∂e}{∂o^{ℓ+1}_k}\)? We do for \(ℓ=N-1\): from Equation 9, it’s \(\frac{∂}{∂\hat{y}_k}\operatorname{loss}(\hat{y}, y)\).
Therefore, we can compute \(\frac{∂e}{∂o^{N-1}_i}\), from which we can then compute \(\frac{∂e}{∂o^{N-2}_i}\) and so on: we can recursively compute all the \(\frac{∂e}{∂o^ℓ_i}\), backward from the last layer of the network.
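In code, one step of that backward walk is a single matrix-vector product; here is Equation 12, vectorised over \(i\), as a sketch:

```python
import numpy as np

def backprop_step(dE_dO_next, Z_next, W_next, phi_prime):
    """Equation 12, vectorised: ∇_{O^l} e = (W^{l+1})ᵀ (∇_{O^{l+1}} e ⊙ φ′(Z^{l+1})).

    dE_dO_next : ∇_{O^{l+1}} e, shape (d_{l+1},)
    Z_next     : Z^{l+1}, cached during the forward pass
    W_next     : W^{l+1}, shape (d_{l+1}, d_l)
    """
    return W_next.T @ (dE_dO_next * phi_prime(Z_next))  # shape (d_l,)
```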
And that is why we call this backpropagation: everything we have done up to this point to compute \(\frac{∂e}{∂w^ℓ_{i,j}}\) could be done in parallel for any value of \(ℓ\), \(i\) and \(j\); there was no dependency between the formulas. For Equation 12, in contrast, we need \(\frac{∂e}{∂o^{ℓ+1}_k}\) to compute \(\frac{∂e}{∂o^ℓ_k}\). Therefore, we will need far fewer operations to compute \(\frac{∂e}{∂w^ℓ_{i,j}}\) if we start from \(ℓ=N\), the last layer of the network, and work our way backward, as long as we keep track of the \(\frac{∂e}{∂o^ℓ_k}\) that we have already computed.
Just for fun, let’s put everything together and write a summary formula: for all \(ℓ\), \(i\) and \(j\) (with the convention that, since the last layer has no non-linearity, \(φ\) is read as the identity there, i.e. \(φ'(z^N_i)=1\)),
\[ \left\lbrace \begin{aligned} \frac{∂e}{∂w^ℓ_{i,j}} &= \frac{∂e}{∂o^ℓ_i}\,φ'(z^ℓ_i)\,o^{ℓ-1}_j\\ \frac{∂e}{∂o^ℓ_i} &= \sum_k \frac{∂e}{∂o^{ℓ+1}_k}\,φ'(z^{ℓ+1}_k)\,w^{ℓ+1}_{k,i} & \text{if $1 ⩽ ℓ ⩽ N-1$}\\ \frac{∂e}{∂o^N_i} &= \frac{∂}{∂\hat{y}_i}\operatorname{loss}(\hat{y}, y)\\ \end{aligned} \right. \]
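And since we now know every term, here is the summary formula as an end-to-end sketch, reusing the `forward` sketch from earlier (`dloss` is a hypothetical helper returning \(∇_{\!\hat{y}}\operatorname{loss}(\hat{y}, y)\)):

```python
import numpy as np

def backward(Ws, phi, phi_prime, dloss, X, y):
    """All the ∂e/∂W^l for one sample, via the summary formula.

    dloss(y_hat, y) must return ∇_ŷ loss(ŷ, y), shape (d_N,).
    Returns [dW^1, ..., dW^N], each the same shape as the corresponding W^l.
    """
    y_hat, Z, O = forward(Ws, phi, X)
    N = len(Ws)
    grads = [None] * N
    dE_dO = dloss(y_hat, y)                # ∂e/∂o^N_i = ∂loss/∂ŷ_i
    for l in range(N, 0, -1):              # backward, from layer N down to 1
        # The last layer has no non-linearity, so φ′ is 1 there.
        local = dE_dO * (phi_prime(Z[l]) if l < N else 1.0)
        grads[l - 1] = np.outer(local, O[l - 1])  # Equation 8
        dE_dO = Ws[l - 1].T @ local               # Equation 12
    return grads
```

If you want to convince yourself that the formulas are right, a classic sanity check is to perturb a single weight by \(±ε\) and compare \(\frac{e(w+ε)-e(w-ε)}{2ε}\) with the corresponding entry of the result.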
Footnotes
1. To be precise: the version of the chain rule for \(h=g∘f\), where \(f: ℝ\longrightarrow ℝ^n\) and \(g: ℝ^n\longrightarrow ℝ\). In that case: \[ h'(t) = \left\langle ∇_{\!f(t)}\,g~ \middle|~ f'(t) \right\rangle \] Which is itself simply a special case of the chain rule for vector-valued functions with vector inputs: \[ J_v(g∘f) = J_{f(v)}(g) × J_v(f) \] where \(J\) is the Jacobian operator.↩︎
2. For instance, if the loss is the negative \(\log\)-likelihood applied on a softmax and \(y=c\), with \(1 ⩽ c ⩽ d_N\): \[ \begin{align} \frac{∂}{∂\hat{y}_i}\operatorname{loss}(\hat{y}, y) & = \frac{∂}{∂\hat{y}_i}\Big(-\hat{y}_c + \operatorname{log sum exp}(\hat{y})\Big)\\ &= \begin{dcases*} -1 + \frac{\operatorname{e}^{\hat{y}_c}}{\sum_j \operatorname{e}^{\hat{y}_j}} & if $i=c$\\ \frac{\operatorname{e}^{\hat{y}_i}}{\sum_j \operatorname{e}^{\hat{y}_j}} & otherwise \end{dcases*}\\ &= \operatorname{softmax}(\hat{y})_i - 𝟙_{i=c} \end{align} \] For numerical stability reasons, this is usually evaluated by first computing the \(\operatorname{log softmax}\) of \(\hat{y}\), from which it is easy to compute both the \(\log\)-likelihood (i.e. the loss) and the \(\operatorname{softmax}\) that appears in the last line of this derivation of its gradient.↩︎
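In code, the numerically stable route described above could look like this (a sketch; the function name is mine):

```python
import numpy as np

def nll_softmax_grad(y_hat, c):
    """∇_ŷ loss for the NLL-of-a-softmax loss above: softmax(ŷ)_i − 𝟙_{i=c}.

    Going through the log-softmax keeps things stable: shifting ŷ by its max
    changes neither the softmax nor the log-softmax.
    """
    shifted = y_hat - np.max(y_hat)
    log_softmax = shifted - np.log(np.sum(np.exp(shifted)))
    grad = np.exp(log_softmax)   # = softmax(ŷ)
    grad[c] -= 1.0               # the 𝟙_{i=c} term
    return grad
```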