Lucas V. Schuermann

The Math Behind Backpropagation

Introduction

Previously we wrote a short introduction to neural networks, which discusses backpropagation as the training method of choice, described as: "simply a way of minimizing the loss function, or error, of the network by propagating errors backward through the network and adjusting weights accordingly."

This brief writeup is meant to shed light on the mathematics behind backpropagation, deriving (with substantial justification) the weight changing algorithm for a feedforward neural network by means of a standard gradient descent.

The feedforward algorithm

The activation of a neural network is iteratively defined by

yn=f(xn)xn+1=wnyn\begin{aligned} y_n &= f(x_n) \\ x_{n+1} &= w_n y_n \end{aligned}

where yny_n is the output vector at layer nn, ff is the activation function, xnx_n is the input vector at layer nn, and wnw_n is the weight matrix between layers nn and n+1n+1. The first layer is n=1n=1 and the last layer is n=Nn=N.

The backpropagation algorithm

The error of the network is defined by

c=12(yNt)2\begin{aligned} c &= \frac{1}{2}(y_N - t)^2 \end{aligned}

The error gradient of the input vector at a layer nn is defined as

δn=cxn\begin{aligned} \delta_n = \frac{\partial c}{\partial x_n} \end{aligned}

The error gradient of the input vector at the last layer NN is

δN=cxN=xN12(yNt)2=(yN12(yNt)2)yNxN=(yNt)f(xN)xN=(yNt)f(xN)\begin{aligned} \delta_N &= \frac{\partial c}{\partial x_N} \\ &= \frac{\partial}{\partial x_N} \frac{1}{2}(y_N - t)^2 \\ &= \left( \frac{\partial}{\partial y_N} \frac{1}{2}(y_N - t)^2 \right) \frac{\partial y_N}{\partial x_N} \\ &= (y_N - t) \frac{\partial f(x_N)}{\partial x_N} \\ &= (y_N - t) f'(x_N) \end{aligned}

The error gradient of the input vector at an inner layer nn is

δn=cxn=cxn+1xn+1xn=δn+1xn+1xn=δn+1wnynxn=δn+1wnynynynxn=δn+1wnynynf(xn)xn=δn+1wnf(xn)\begin{aligned} \delta_n &= \frac{\partial c}{\partial x_n} \\ &= \frac{\partial c}{\partial x_{n+1}} \frac{\partial x_{n+1}}{\partial x_n} \\ &= \delta_{n+1} \frac{\partial x_{n+1}}{\partial x_n} \\ &= \delta_{n+1} \frac{\partial w_n y_n}{\partial x_n} \\ &= \delta_{n+1} \frac{\partial w_n y_n}{\partial y_n} \frac{\partial y_n}{\partial x_n} \\ &= \delta_{n+1} \frac{\partial w_n y_n}{\partial y_n} \frac{\partial f(x_n)}{\partial x_n} \\ &= \delta_{n+1} w_n f'(x_n) \end{aligned}

Therefore, the error gradient of the input vector at a layer nn is

δn=f(xn){(yNt)if n=Nδn+1wnif n<N\begin{aligned} \delta_n &= f'(x_n) \begin{cases} (y_N - t) & \text{if } n = N \\ \delta_{n+1} w_n & \text{if } n < N \end{cases} \end{aligned}

Hence, the error gradient of the weight matrix wnw_n is

cwn=cxn+1xn+1wn=δn+1wnynwn=δn+1yn\begin{aligned} \frac{\partial c}{\partial w_n} &= \frac{\partial c}{\partial x_{n+1}} \frac{\partial x_{n+1}}{\partial w_n} \\ &= \delta_{n+1} \frac{\partial w_n y_n}{w_n} \\ &= \delta_{n+1} y_n \end{aligned}

Therefore, the change in weight should be

Δwn=αcwn=αδn+1yn\begin{aligned} \Delta w_n &= -\alpha \frac{\partial c}{\partial w_n} \\ &= -\alpha \delta_{n+1} y_n \end{aligned}

where α\alpha is the learning rate (or rate of gradient descent). Thus, we have shown the necessary weight change, from which the implementation of a training algorithm follows trivially.

— Lucas Schuermann, Carlos Martin



Last updated May 16, 2020