← Blog

Epoch-wise bias-variance decomposition

Let’s suppose we’re training a model parameterized by θ\theta, and let’s denote by θt\theta_t the parameter θ\theta at step tt given by the optimization algorithm of our choice. In machine learning, it is often helpful to be able to decompose the error E(θ)E(\theta) as B2(θ)+V(θ)+N(θ)B^2(\theta)+V(\theta)+N(\theta), where BB represents the bias, VV the variance, and NN the noise (irreducible error). In most cases, the decomposition is performed on an optimal solution θ\theta^* (for instance, limtθt\lim_{t \rightarrow \infty} \theta_t, or its early stopping version), for example, in order to understand how the bias and variance change with the complexity of the function implementing θ\theta, the size of this function, etc. This has helped explain phenomena such as model-wise double descent. On the other hand, it can also be interesting to visualize how B(θt)B(\theta_t) and V(θt)V(\theta_t) evolve with tt (which can help explain phenomena like epoch-wise double descent): that’s what we’ll be doing in this blog post.

Hackmd version : https://hackmd.io/@6LQ4mvRtS4Sc3LHkNEvDXQ/HJl86oh__2

Notations

  • X\mathcal{X} : domain set (input space)
  • Y\mathcal{Y} : label set (output space)
  • H\mathcal{H} : hypothesis class (class of possible models we can learn)

Definititions and Preliminaries

Definitition 1 (Loss Function) The loss function (t,y)\ell(t, y) is defined as a function that takes two labels and produces a value between 00 and some constant M[0,]M \in [0, \infty], and measures the cost of predicting yy when the true value is tt.

 ⁣:Y×Y[0,M](t,y)(t,y)\begin{align*} \ell \colon \mathcal{Y} \times \mathcal{Y} &\to [0, M] \\ (t, y) &\mapsto \ell(t, y) \end{align*}

Examples include square loss (t,y)=(ty)2\ell(t, y) = (t−y)^2, absolute loss (t,y)=ty\ell(t, y) = \|t−y\|, and zero-one loss (t,y)=I[ty]\ell(t, y) = \mathbb{I}[t\ne y]

Definitition 2 (Training set) Set SS of S\|S\| values zi=(xi,ti)X×Yz_i = (x_i, t_i) \in \mathcal{X} \times \mathcal{Y}, where xiXx_i \in \mathcal{X} represents a feature vector and ti=ti(x)Yt_i = t_i(x) \in \mathcal{Y} the label of the ithi^{th} sample. ziz_i are assumed to be i.i.d. and sampled from an unknown data distribution D\mathcal{D}.

S={z1,,zn}S = \{z_1, \cdots , z_n\}

Since the training set size is an important parameter of a learning problem, we will assume in the following that all dataset have the same size nn.

Definitition 3 (Optimal prediction)

In our setup we don’t have a joint distribution, but just xx being random and t=t(x)t = t(x) being a deterministic or random function of xx, so that p(x,t)=p(x)p(tx)p(x, t) = p(x)p(t\|x). The optimal prediction for an example xx is given by y(x)=argminyEtp(tx)[(t,y)]y^* (x) = \arg\min_y \mathbb{E}_{t \sim p(t\|x)} [\ell(t, y)]

In deterministic case, there exist tYt \in \mathcal{Y} such that p(tx)=1p(t\|x)=1 : y(x)=argminy(t,y)=ty^* (x) = \arg\min_y \ell(t, y) = t.

In the nondeterministic case :

  • Using square loss, we have :
y(x)=argminyEtp(tx)[(ty)2]=argminyEtp(tx)[t22yt+y2]=argminy y22Etp(tx)[t]y+Etp(tx)[t2]=Etp(tx)[t]\begin{split} y^{*}(x) &= \arg\min_{y} \mathbb{E}_{t \sim p(t|x)}[(t - y)^2] \\ &= \arg\min_{y} \mathbb{E}_{t \sim p(t|x)}[t^2 - 2yt + y^2] \\ &= \arg\min_{y} \ y^2 - 2 \mathbb{E}_{t \sim p(t|x)} [t] y + \mathbb{E}_{t \sim p(t|x)} [t^2] \\ &= \mathbb{E}_{t \sim p(t|x)} [t] \end{split}

That is the mean of p(tx)p(t\|x). For example, if for some xx, we have the example (x,1)(x, 1) with probability px[0,1]p_x \in [0, 1] and the example (x,0)(x, 0) with probability 1px1-p_x, that is :

\begin{array}{ll} 1 & \mbox{with probability } p_x \\ 0 & \mbox{with probability } 1-p_x \end{array} \right.$$ Then, using square loss, we have $y^*(x) = p_x$. * Using the absolute loss, we have

\begin{split} y^*(x) &= \arg\min_{y} \mathbb{E}{t \sim p(t|x)}[|y - t|] \ &= \arg\min{y} \ 2 \int_{-\infty}^{y} F_{t|x}(t)dt - y + \mathbb{E}{t \sim p(t|x)} [t] + 2\lim{t \rightarrow - \infty} t F_{t|x}(t) \ &= F_{t|x}^{-1}(1/2) \end{split}

with $F_{t\|x}$ the cumulative distribution function of $p(t\|x)$. So $y^*(x)$ is the median of $p(t\|x)$. The last line follows from the fact that $y \mapsto 2 F_{t\|x}(y) - 1$ is the derivative of the function $$y \mapsto 2 \int_{-\infty}^{y} F_{t|x}(t)dt - y + \mathbb{E}_{t \sim p(t|x)} [t] +2\lim_{t \rightarrow - \infty} t F_{t|x}(t)$$, which thus reaches its minimum when $F_{t\|x}(y) = 1/2$, that is when $y = F_{t\|x}^{-1}(1/2)$. The second line follows from

\begin{equation} \begin{split} \mathbb{E}{t \sim p(t|x)}[|y - t|] &= \mathbb{E}{t \sim p(t|x)}\Big[(y - t) \mathbb{I}[y\ge t] - (y - t) \mathbb{I}[y<t] \Big] \ &= \mathbb{E}{t \sim p(t|x)}\Big[(y - t) \mathbb{I}[y\ge t] - (y - t) (1-\mathbb{I}[y\ge t]) \Big] \ &= \mathbb{E}{t \sim p(t|x)}\Big[(y - t) (2\mathbb{I}[y\ge t] - 1) \Big] \ &= \mathbb{E}{t \sim p(t|x)}\Big[2y \mathbb{I}[y\ge t] - y - 2 t \mathbb{I}[y\ge t] + t \Big] \ &= 2y\mathbb{E}{t \sim p(t|x)}\big[\mathbb{I}[y\ge t] \big] - y\mathbb{E}{t \sim p(t|x)}[1] - 2\mathbb{E}{t \sim p(t|x)}\big[t\mathbb{I}[y\ge t] \big] + \mathbb{E}{t \sim p(t|x)}[t] \ &= 2yF{t|x}(y) - y - 2 \int_{[-\infty, y]} t \ dF_{t|x}(t) + \mathbb{E}{t \sim p(t|x)}[t] \ &= 2yF{t|x}(y) - y - 2 \Big( [t F_{t|x}(t)]{-\infty}^{y} - \int{[-\infty, y]} F_{t|x} (t)dt \Big) + \mathbb{E}{t \sim p(t|x)}[t] \ &= 2yF{t|x}(y) - y - 2 \Big( y F_{t|x}(y) - \lim_{t \rightarrow - \infty} t F_{t|x}(t) - \int_{-\infty}^y F_{t|x}(t)dt \Big) + \mathbb{E}{t \sim p(t|x)}[t] \ &= 2yF{t|x}(y) - y - 2yF_{t|x}(y) + 2\lim_{t \rightarrow - \infty} t F_{t|x}(t) + 2\int_{-\infty}^y F_{t|x}(t)dt + \mathbb{E}{t \sim p(t|x)}[t] \ &= 2\int{-\infty}^y F_{t|x}(t)dt - y + \mathbb{E}{t \sim p(t|x)}[t] + 2\lim{t \rightarrow - \infty} t F_{t|x}(t) \end{split} \end{equation}

With the illustrative example use above of the square loss, we have $F_{t\|x}(t) = (1-p_x)\mathbb{I}[0\le t<1] + \mathbb{I}[1\le t]$, so $y^*(x) = \mathbb{I}[1-p_x < 1/2] + \lambda \mathbb{I}[1-p_x = 1/2] = \mathbb{I}[p_x > 1/2] + \lambda \mathbb{I}[p_x = 1/2]$ $\forall \lambda \in [0, 1]$. * Using zero-one loss, we have :

\begin{split} y^*(x) &= \arg\min_{y} \mathbb{E}{t \sim p(t|x)}[\mathbb{I}[t\ne y]] \ &= \arg\min{y} \mathbb{E}{t \sim p(t|x)}[1-\mathbb{I}[t=y]] \ &= \arg\min{y} 1 - \mathbb{E}{t \sim p(t|x)}[\mathbb{I}[t=y]] \ &= \arg\max{y} \mathbb{E}{t \sim p(t|x)}[\mathbb{I}[t=y]] \ &= \arg\max{t} p(t|x) = \arg\max_{t} p(t, x) \end{split}$$

That is the most frequent prediction. With the illustrative example use above for square loss, we have

y(x)=argminy pxI[y1]+(1px)I[y0]=I[px1px]=I[px1/2]\begin{equation*} \begin{split} y^*(x) &= \arg\min_y \ p_x\mathbb{I}[y\ne 1] + (1-p_x)\mathbb{I}[y\ne 0] \\ &= \mathbb{I}[p_x\ge 1-p_x] \\ &= \mathbb{I}[p_x\ge 1/2] \end{split} \end{equation*}

Definitition 4 (Algorithm) A learning algorithm A\mathcal{A} is defined as the following mapping A:(X×Y)nH\mathcal{A} : ( \mathcal{X} \times \mathcal{Y})^n \mapsto \mathcal{H}. The learning algorithm takes a dataset S(X×Y)nS \in ( \mathcal{X} \times \mathcal{Y})^n of nn samples and return a model h=A(S)Hh = \mathcal{A}(S) \in \mathcal{H}.

The optimal model is the model for which f(x)=y(x)f(x) = y^*(x) for every xx. For example, in the case of zero-one loss, the optimal model is the Bayes classifier, and its loss is called the Bayes rate. In the example use above for zero-one loss, the Bayes classifier is f(x)=I[P[t=1x]1/2]f(x)=\mathbb{I}[ \mathbb{P}[t=1\|x] \ge 1/2].

Definitition 5 (True Risk) : Given hHh \in \mathcal{H}

R[h]=E(x,t)D[(t,h(x))]R[h] = \mathbb{E}_{(x,t) \sim \mathcal{D}}[ \ell(t, h(x))]

Definitition 6 (Empirical Risk) : For hHh \in \mathcal{H} and a dataset S={(x1,t1),,(xn,tn)}S = \{(x_1, t_1), \cdots , (x_n, t_n)\}

R^S[h]=1ni=1n(ti,h(xi))\hat{R}_S[h] =\frac{1}{n} \sum_{i=1}^n \ell(t_i, h(x_i))

The essential task of supervised learning is to maximize the performance on all of the possible data via the adjustment of hh on a particularly drawn sample set SS. We won’t be using these risk expressions primarily in this monograph, as is customary in statistical learning theory, but we have recalled them anyway to highlight the dependence of h=A(S)h = \mathcal{A}(S) on SS.

Definitition 7 (Expected loss for each input feature vector)

Since the the same learner A\mathcal{A} produces in general different models hh for different training sets SS, (t,h(x))\ell(t, h(x)) is a function of the training set SS through h=A(S)h = \mathcal{A}(S). This dependency can be removed by averaging over training sets.

Let DnD_n be a set of training sets of size nn, y^n(x)\hat{y}_n(x) the predictions produced for example xx by applying the learner to each training set in DnD_n, and Yn(x)={A(S)(x)  SDn}Y_n(x) = \{ \mathcal{A}(S)(x) \ \| \ S \in D_n\} be the multiset of these predictions (a specific prediction y^n(x)\hat{y}_n(x) will appear more than once in Yn(x)Y_n(x) if it is produced by more than one training set in DnD_n).

The the quantity of interest is the expected loss

= \mathbb{E}_{D_n, \ t \sim p(t|x)}[\ell(t, \hat{y}_n(x))] = \mathbb{E}_{y \sim Y_n(x), \ t \sim p(t|x)}[\ell(t, y)]$$ Our objetive is to bias-variance decompose the expected loss $E_n(x)$ into three terms: **bias**, **variance** and **noise (irreducible error)**. A standard such decomposition exists for squared loss, and a number of different ones have been proposed for zero-one loss. **Definitition 8 (Main prediction)** : The main prediction for a loss function $\ell$ and a set of training sets $D_n$ is $$y^{\ell, D_n} (x) = \arg\min_{y'} \mathbb{E}_{D_n}[\ell(\hat{y}_n(x), y')] = \arg\min_{y'} \mathbb{E}_{y \sim Y_n(x)}[\ell(y, y')]$$ In words, the main prediction is the value whose average loss relative to all the predictions in $Y_n(x)$ is minimum (i.e., it is the prediction that “differs least” from all the predictions in $Y_n(x)$ according to $\ell$). It is a measure of the “central tendency” of a learner. The main prediction is not necessarily a member of $Y_n(x)$. **Theorem 1** Under squared loss the main prediction is the mean of the predictions in $Y_n(x)$, under absolute loss it is the median of $Y_n(x)$ since, and under zero-one loss it is the mode of $Y_n(x)$ (i.e. the most frequent prediction). *Proof* * Under squared loss the main prediction is the mean of the predictions in $Y_n(x)$ since

\begin{split} y^{\ell, D_n}(x) &= \arg\min_{y’} \mathbb{E}{y \sim Y_n(x)}[(y - y’)^2] \ &= \arg\min{y’} \ {y’}^2 - 2 \mathbb{E}{y \sim Y_n(x)} [y] y’ + \mathbb{E}{y \sim Y_n(x)} [y^2] \ &= \mathbb{E}_{y \sim Y_n(x)}[y] \end{split}

* Under absolute loss it is the median of $Y_n(x)$ since

\begin{split} y^{\ell, D_n}(x) &= \arg\min_{y’} \mathbb{E}{y \sim Y_n(x)}[|y’ - y|] \ &= \arg\min{y’} \ 2 \int_{-\infty}^{y’} F_{Y_n(x)}(y)dy - y’ + \mathbb{E}{y \sim Y_n(x)} [y] + 2\lim{y \rightarrow - \infty} y F_{Y_n(x)}(y) \ &= F_{Y_n(x)}^{-1}(1/2) \end{split}

with $F_{Y_n(x)}$ the cumulative distribution function of $y_n(x) \sim Y_n(x)$. The derivation to get this result is similar to what we did in the **Definitition 3** above for the absolute error. * Under zero-one loss it is the mode of $Y_n(x)$ (i.e., the most frequent prediction) since

\begin{split} y^{\ell, D_n}(x) &= \arg\min_{y’} \mathbb{E}{y \sim Y_n(x)}[\mathbb{I}[y\ne y’]] \ &= \arg\min{y’} \mathbb{E}{y \sim Y_n(x)}[1-\mathbb{I}[y=y’]] \ &= \arg\min{y’} 1 - \mathbb{E}{y \sim Y_n(x)}[\mathbb{I}[y=y’]] \ &= \arg\max{y’} \mathbb{E}{y \sim Y_n(x)}[\mathbb{I}[y=y’]] \ &= \arg\max{y’} f_{Y_n(x)} (y’) \end{split}

with $f_{Y_n(x)}$ the mass function ($Y_n(x)$ finite) or the density function (is $Y_n(x)$ not finite) of $y_n(x) \sim Y_n(x)$. **Definition 9 (bias, variance, noise)** The bias, variance and noise (irreducible error) of a learner on an example $x$ are respectively $$B^2(x) = \ell( y^*(x), y^{\ell, D_n}(x)) \\ V(x) = \mathbb{E}_{D_n}[\ell( y^{\ell, D_n}(x), y_n(x))] = \mathbb{E}_{y \sim Y_n(x)}[\ell( y^{\ell, D_n}(x), y )] \\ N(x) = \mathbb{E}_{t \sim p(t|x)}[\ell(t, y^*(x))]$$ In words, the square bias is the loss incurred by the main prediction relative to the optimal prediction, the variance is the average loss incurred by predictions relative to the main prediction, and the noise is the unavoidable component of the loss, that is incurred independently of the learning algorithm. In the deterministic case, $N(x)= \ell(t(x), t(x))$ for all $x$. Bias and variance may be averaged over all examples, in which case we will refer to them as average square bias $$\mathbb{E}_{x \sim p(x)}[B^2(x)]$$ and average variance $$\mathbb{E}_{x \sim p(x)}[V(x)]$$. The average noise is $$\mathbb{E}_{x \sim p(x)}[N(x)] = \mathbb{E}_{(x,t) \sim p(x,t)}[\ell(t, y^*(x))]$$ **Theorem 2** For square loss $\ell(t, y) = (t−y)^2$ $$V(x) = \mathbb{E}_{y \sim Y_n(x)}[y^2] - ( y^{\ell, D_n}(x))^2 \text{ and } N(x) = \mathbb{E}_{t \sim p(t|x)}[t^2] - (y^*(x))^2$$ *Proof*

\begin{split} V(x) &= \mathbb{E}{y \sim Y_n(x)}[( y^{\ell, D_n}(x) - y)^2] \&= \mathbb{E}{y \sim Y_n(x)}[( y^{\ell, D_n}(x))^2] - 2 y^{\ell, D_n}(x) \mathbb{E}{y \sim Y_n(x)}[y] + \mathbb{E}{y \sim Y_n(x)}[y^2] \&= ( y^{\ell, D_n}(x))^2 - 2 ( y^{\ell, D_n}(x))^2 + \mathbb{E}{y \sim Y_n(x)}[y^2] \&= \mathbb{E}{y \sim Y_n(x)}[y^2] - ( y^{\ell, D_n}(x))^2 \end{split}

\begin{split} N(x) &= \mathbb{E}{t \sim p(t|x)}[(t - y^*(x))^2] \&= \mathbb{E}{t \sim p(t|x)}[t^2] - 2 y^(x) \mathbb{E}{t \sim p(t|x)}[t] + \mathbb{E}{t \sim p(t|x)}[(y^(x))^2] \&= \mathbb{E}{t \sim p(t|x)}[t^2] - 2 (y^(x))^2 + (y^(x))^2 \&= \mathbb{E}{t \sim p(t|x)}[t^2] - (y^*(x))^2 \end{split}

# Bias-variance decomposition For a given loss functions $\ell$, we are looking for two constants $c_1(x, \ell)$ and $c_2(x, \ell)$ such that $$E_n(x) = \ B^2(x) + c_1(x, \ell) \ V(x) + c_2(x, \ell) \ N(x)$$ **Theorem 2 [1]** For square loss $\ell(t, y) = (t−y)^2$, $c_1(x, \ell)=c_2(x, \ell)=1$ *Proof*

\begin{split} E_n(x) & = \mathbb{E}{y \sim Y_n(x), \ t \sim p(t|x)}[(t−y)^2] \ & = \mathbb{E}{y \sim Y_n(x), \ t \sim p(t|x)}[(t−y^(x) + y^(x) - y)^2] \ & = \mathbb{E}{t \sim p(t|x)}[(t−y^*(x))^2] + 2(\mathbb{E}{t \sim p(t|x)}[t]−y^(x))(y^(x) - \mathbb{E}{y \sim Y_n(x)}[y]) + \mathbb{E}{y \sim Y_n(x)}[(y^(x) - y)^2] \ & = N(x) + 2\times0\times(y^(x) - \mathbb{E}{y \sim Y_n(x)}[y]) + \mathbb{E}{y \sim Y_n(x)}[(y^(x) - y^{\ell, D_n}(x) + y^{\ell, D_n}(x) - y)^2] \ & = N(x) + (y^(x) - y^{\ell, D_n}(x))^2 + 2(y^(x) - y^{\ell, D_n}(x))(y^{\ell, D_n}(x) - \mathbb{E}{y \sim Y_n(x)}[y]) + \mathbb{E}{y \sim Y_n(x)}[(y^{\ell, D_n}(x) - y)^2] \ & = N(x) + B^2(x) + 2(y^(x) - y^{\ell, D_n}(x))\times 0 + V(x) \ & = B^2(x)+V(x)+N(x) \end{split}

Let $\mathbb{P}_{D_n}(x) = \mathbb{P}[y^*(x) \in Y_n(x)]$ be the probability over training sets in $D_n$ that the learner predicts the optimal class for $x$. **Theorem 3 [1]** For zero-one loss $\ell(t, y) = \mathbb{I}[t\ne y]$ in two class problems, $c_1 (x, \ell) = 2\mathbb{P}_{D_n}(x) − 1$ and $c_2(x, \ell) = 2 \mathbb{I}[y^{\ell, D_n}(x) = y^*(x)] - 1$ *Proof* > See [1] See [1] for zero-one loss $\ell(t, y) = \mathbb{I}[t\ne y]$ in multiclass problems, and for the absolute loss $\ell(t, y) = \|t−y\|$. # Application : teacher-student setup See the [hackmd version](https://hackmd.io/@6LQ4mvRtS4Sc3LHkNEvDXQ/HJl86oh__2). I'll copy it, format it in markdow and put it here as soon as I can (and check that the derivation I've made there are correct, by the way. I did them in a rush). # References [1] Pedro Domingos. A unified bias-variance decomposition for zero-one and squared loss. In Proc. of the 17th National Conf. on Artificial Intelligence, pages 564–569, Austin, TX, July 2000. [2] IFT 6085 - Lecture 9, Stability, Generalization and the Applications of Stability