Generating Random Variables and Stochastic Processes, Generative Flow Networks (GFlowNets)
Note: It's better to read all the updates below before clicking on any link.
Practical tutorial
Here is the practical tutorial (theory & code) I wrote in Winter 2022 about GFlowNets [1], MCMC, Metropolis-Hastings, Gibbs sampling, Metropolis-adjusted Langevin, Inverse Transform Sampling, the Acceptance-Rejection Method, and Importance Sampling. I received a lot of positive feedback on this tutorial, which has been the starting point for many in their learning of GFlowNets.
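As a small taste of the methods covered there, here is a minimal sketch of Inverse Transform Sampling (the example is mine, not taken from the tutorial): if $U \sim \text{Uniform}(0,1)$ and $F$ is the target CDF, then $F^{-1}(U)$ follows the target distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

# Inverse Transform Sampling, illustrated on the Exponential(rate) distribution:
# its CDF is F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -log(1 - u) / rate.
def sample_exponential(rate, n):
    u = rng.uniform(0.0, 1.0, n)       # U ~ Uniform(0, 1)
    return -np.log(1.0 - u) / rate     # F^{-1}(U) ~ Exponential(rate)

samples = sample_exponential(rate=2.0, n=100_000)
```

The sample mean should be close to the true mean $1/\text{rate} = 0.5$, which is an easy sanity check for any sampler built this way.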
More resources
To go in depth with GFlowNets: the GFlowNet Foundations paper [2] or the Trajectory Balance paper [3] (a very pedagogical paper).
For Variational Bayes, I recommend the paper "A practical tutorial on Variational Bayes" [4].
See also MCMC and Bayesian Modeling, 2017, Martin Haugh, Columbia University
Update: I met Pierre L'Écuyer
In Fall 2022, wanting to refresh my level in probability and statistics, I took "IFT6561: Stochastic Simulation", taught at the Université de Montréal by the eminent Pierre L'Écuyer. This course is clearly a masterclass: very theoretical and very practical at the same time. Pierre L'Écuyer is the second-best teacher I have known in my life so far. I came very close to switching to another field, since he was planning to take me on as a student, but unfortunately I was already being supervised. His book, "Stochastic Simulation and Monte Carlo Methods", is not yet public, but if you ask him for access he will send it to you. Here are the book's headlines, captured from my reading plan (click on each image to zoom in; I've noticed that zooming only works locally, so just open the image in a new tab).
Note: I mention this section because I was supposed to have added sections on Gibbs sampling, Metropolis-adjusted Langevin, and Importance Sampling to my tutorial by now, based on Pierre's book. I will find the time to do it so that the tutorial can be complete.
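In the meantime, here is a minimal sketch of Importance Sampling (my own toy example, not from the book or the tutorial): to estimate $\mathbb{E}_p[f(X)]$ we sample from a proposal $q$ and reweight each sample by $p(x)/q(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian density, used for both the target p = N(0, 1) and the
# proposal q = N(0, 2^2). The choice of f(x) = x^2 means the true
# answer is Var_p(X) = 1, which makes the estimate easy to check.
def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

n = 200_000
proposal_sigma = 2.0
x = rng.normal(0.0, proposal_sigma, n)                       # draws from q
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, proposal_sigma)  # p(x)/q(x)
estimate = np.mean(w * x**2)                                 # ≈ E_{N(0,1)}[X^2] = 1
```

A proposal with heavier tails than the target (here $\sigma = 2 > 1$) keeps the importance weights bounded, which is what keeps the estimator's variance finite.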
A common practice in deep learning is to stop training a model as soon as a sign of overfitting is observed, or when the model's generalization has not improved over a long training period (early stopping). The limits of this practice are now well known, since (i) a model's performance can improve, deteriorate, and then improve again during training (epoch-wise double descent), and (ii) a model can start to generalize many steps after severe overfitting (grokking). Epoch-wise double descent and grokking open the way to new studies of the structure of the minima found by Stochastic Gradient Descent (SGD), and of how networks behave in the neighbourhood of SGD training convergence. These phenomena also lead us to rethink what we know about the relationship between model size, data size, initialization, hyperparameters, and the generalization of neural networks. Beyond rethinking this relationship, we need measures that are easy and cheap to obtain and that are strongly correlated with generalization: phenomena such as multiple descents can occur at model sizes that are difficult to experiment with, and grokking often requires training for a very large number of epochs, making it difficult to construct a phase diagram of generalization covering all the hyperparameters. With these issues in mind, we have been looking at grokking recently [1]. This blog post summarizes some of the observations we have made.
Let's suppose we're training a model parameterized by $\theta$, and let's denote by $\theta_t$ the parameter $\theta$ at step $t$ given by the optimization algorithm of our choice. In machine learning, it is often helpful to decompose the error $E(\theta)$ as $B^2(\theta)+V(\theta)+N(\theta)$, where $B$ represents the bias, $V$ the variance, and $N$ the noise (irreducible error). In most cases, the decomposition is performed on an optimal solution $\theta^*$ (for instance, $\lim_{t \rightarrow \infty} \theta_t$, or its early-stopping version), for example in order to understand how the bias and variance change with the complexity of the function parameterized by $\theta$, the size of this function, etc. This has helped explain phenomena such as model-wise double descent. On the other hand, it can also be interesting to visualize how $B(\theta_t)$ and $V(\theta_t)$ evolve with $t$ (which can help explain phenomena like epoch-wise double descent): that's what we'll be doing in this blog post.
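To make the step-wise decomposition concrete, here is a minimal Monte Carlo sketch on a toy polynomial regression trained by gradient descent (all names and settings are illustrative, not from [1]): at each step $t$, we average predictions over models trained on independently resampled datasets, so $B^2(\theta_t)$ is the squared gap between the mean prediction and the ground truth, and $V(\theta_t)$ is the spread of the predictions around their mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    """Ground-truth function; its noisy samples form each training set."""
    return np.sin(2 * np.pi * x)

def make_dataset(n=40, noise=0.3):
    x = rng.uniform(0.0, 1.0, n)
    return x, f_true(x) + rng.normal(0.0, noise, n)

def features(x, degree=5):
    """Polynomial feature map: [1, x, x^2, ..., x^degree]."""
    return np.stack([x**k for k in range(degree + 1)], axis=1)

x_grid = np.linspace(0.0, 1.0, 50)  # test points where bias/variance are measured

def train_trajectory(x, y, steps=300, lr=0.1):
    """Full-batch gradient descent on MSE; record test predictions at every step t."""
    X, X_test = features(x), features(x_grid)
    theta = np.zeros(X.shape[1])
    preds = []
    for _ in range(steps):
        theta -= lr * 2 * X.T @ (X @ theta - y) / len(y)
        preds.append(X_test @ theta)
    return np.array(preds)  # shape (steps, len(x_grid))

# Monte Carlo estimate of B^2(theta_t) and V(theta_t): 20 models, each trained
# on its own resampled dataset, compared at every training step t.
all_preds = np.array([train_trajectory(*make_dataset()) for _ in range(20)])
mean_pred = all_preds.mean(axis=0)                        # E[f_{theta_t}(x)] over datasets
bias2 = ((mean_pred - f_true(x_grid)) ** 2).mean(axis=1)  # B^2 at each step t
variance = all_preds.var(axis=0).mean(axis=1)             # V at each step t
```

Plotting `bias2` and `variance` against $t$ shows the classic picture this decomposition is meant to reveal: starting from a shared initialization, the bias term falls as training progresses while the variance term grows as each model fits its own dataset's noise.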