Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Pages

This website is still under development; please refer to my CV for more information about me.

Posts

Predicting Grokking Long Before it Happens: A look into the loss landscape of models which grok

14 minute read

Published:

A common practice in deep learning is to stop training a model as soon as a sign of overfitting is observed, or when the model’s generalization has not improved over a long training period (early stopping). The limits of this practice are now well known, since (i) a model’s performance can improve, deteriorate, and then improve again during training (epoch-wise double descent), and (ii) a model can start to generalize many steps after severe overfitting (grokking). Epoch-wise double descent and grokking open the way to new studies of the structure of the minimum found by Stochastic Gradient Descent (SGD) and of how networks behave in the neighbourhood of SGD training convergence. These phenomena also lead us to rethink what we know about the relationship between model size, data size, initialization, hyperparameters, and the generalization of neural networks. Beyond rethinking this relationship, there is a need for measures that are cheaper and easier to obtain yet strongly correlated with generalization, since phenomena such as multiple descents can occur at model sizes that are difficult to experiment with, and grokking often requires models to be trained for a very large number of epochs, which makes it difficult to construct a phase diagram of generalization covering all the hyperparameters. With these issues in mind, we’ve been looking at grokking recently [1]. This blog post summarizes some of the observations we’ve made.
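As a rough illustration of the practice in question, here is a minimal, PyTorch-style sketch of early stopping with a patience counter; the `train_step`, `eval_loss` and `state_dict` interfaces are assumptions made for the example, not code from the post.

```python
# Illustrative early-stopping loop (hypothetical helpers, PyTorch-style state_dict).
# Training stops once the validation loss has not improved for `patience`
# consecutive epochs -- exactly the heuristic that epoch-wise double descent
# and grokking call into question.
import math

def train_with_early_stopping(model, train_step, eval_loss, max_epochs=1000, patience=20):
    best_loss, best_state, epochs_without_improvement = math.inf, None, 0
    for epoch in range(max_epochs):
        train_step(model)            # one pass over the training data
        val_loss = eval_loss(model)  # loss on a held-out validation set
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = model.state_dict()   # keep the best checkpoint so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break   # stop: no improvement for `patience` epochs in a row
    return best_state, best_loss
```

Under epoch-wise double descent or grokking, the run this loop cuts short could still have recovered, or even gone on to generalize, had it been allowed to continue.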

Epoch-wise bias-variance decomposition

14 minute read

Published:

Let’s suppose we’re training a model parameterized by $\theta$, and let’s denote by $\theta_t$ the value of $\theta$ at step $t$ of the optimization algorithm of our choice. In machine learning, it is often helpful to be able to decompose the error $E(\theta)$ as $B^2(\theta)+V(\theta)+N(\theta)$, where $B$ represents the bias, $V$ the variance, and $N$ the noise (irreducible error). In most cases, the decomposition is performed on an optimal solution $\theta^*$ (for instance, $\lim_{t \rightarrow \infty} \theta_t$, or its early-stopping version), for example to understand how the bias and variance change with the complexity or size of the function parameterized by $\theta$. This has helped explain phenomena such as model-wise double descent. On the other hand, it can also be interesting to visualize how $B(\theta_t)$ and $V(\theta_t)$ evolve with $t$ (which can help explain phenomena like epoch-wise double descent): that’s what we’ll be doing in this blog post.
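To make the setup concrete, here is a minimal sketch of how $B^2(\theta_t)$ and $V(\theta_t)$ can be estimated along the training trajectory, assuming a squared loss; `make_model`, `train_one_step` and `predict` are hypothetical helpers, not the code used in the post.

```python
# Estimate B^2(theta_t) and V(theta_t) at every step t by training K models
# on K independent resamples of the training set (squared-loss decomposition).
import numpy as np

def bias_variance_curves(make_model, train_one_step, predict,
                         resampled_train_sets, x_test, y_test, num_steps):
    models = [make_model() for _ in resampled_train_sets]
    bias2_curve, var_curve = [], []
    for t in range(num_steps):
        # advance each model by one optimization step on its own resample
        for model, data in zip(models, resampled_train_sets):
            train_one_step(model, data)
        # predictions of all K models on the same test inputs: shape (K, n_test)
        preds = np.stack([predict(model, x_test) for model in models])
        mean_pred = preds.mean(axis=0)  # approximates the expected prediction over training sets
        # squared bias (absorbs the noise term if y_test is itself noisy)
        bias2_curve.append(np.mean((mean_pred - y_test) ** 2))
        # variance: spread of the K models around their mean prediction
        var_curve.append(np.mean(preds.var(axis=0)))
    return np.array(bias2_curve), np.array(var_curve)
```

Averaging the K models’ predictions approximates the expectation over training sets, so the gap between that average and the targets estimates the squared bias at step $t$, while the spread across models estimates the variance; plotting both curves against $t$ gives the epoch-wise decomposition.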

Portfolio

Publications

Talks

Teaching

Group and home tutoring courses

Yaounde, Cameroon, 2016, 2017

During my engineering training, I tutored college students in mathematics, physics, and chemistry, both privately at home and in groups.

Preparatory classes

Yaounde, Cameroon, 2017, 2018

During my training as an engineer, I prepared many students in mathematics and physical sciences (MSP for short, in the French system) for the entrance exams of the Grandes Ecoles in Cameroon.