Posts by Tags

Acceptance-Rejection Method

Bag of words

Deep Learning

Grokking Beyond the Euclidean Norm of Model Parameters

32 minute read

Published:

Grokking refers to delayed generalization that follows overfitting when optimizing artificial neural networks with gradient-based methods. We show that the dynamics of grokking go beyond the $\ell_2$ norm: if there exists a model with a property $P$ (e.g., sparse or low-rank weights) that fits the data, then gradient descent with a small (explicit or implicit) regularization promoting $P$ (e.g., $\ell_1$ or nuclear norm regularization) also results in grokking, provided the number of training samples is large enough. Moreover, the $\ell_2$ norm of the parameters is no longer guaranteed to decrease with generalization when $\ell_2$ is not the property being sought.
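
To make this concrete, here is a minimal, self-contained sketch (not the post's actual setup; the sparse linear target, the sample sizes, and the $\ell_1$ coefficient `lam` are illustrative assumptions) of gradient descent with a small explicit $\ell_1$ penalty on a problem where a sparse model fits the data:

```python
import numpy as np

# Toy setting (illustrative, not the post's experiments): a sparse linear target
# fits the data, and we run gradient descent on the squared loss plus a small
# explicit l1 penalty.  The quantities to watch are the train/test losses and
# ||w||_1 (rather than ||w||_2).
rng = np.random.default_rng(0)
d, n_train, n_test = 100, 80, 500
w_star = np.zeros(d)
w_star[:5] = 1.0                                   # sparse ground truth (assumption)
X_tr = rng.normal(size=(n_train, d)); y_tr = X_tr @ w_star
X_te = rng.normal(size=(n_test, d));  y_te = X_te @ w_star

w = rng.normal(size=d)                             # generic dense initialization
lr, lam = 1e-2, 1e-3                               # small explicit l1 coefficient
for step in range(50_000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / n_train    # gradient of the squared loss
    w -= lr * (grad + lam * np.sign(w))            # GD step + subgradient of lam * ||w||_1
    if step % 10_000 == 0:
        tr = np.mean((X_tr @ w - y_tr) ** 2)
        te = np.mean((X_te @ w - y_te) ** 2)
        print(f"step {step:6d}  train {tr:.2e}  test {te:.2e}  ||w||_1 {np.abs(w).sum():.2f}")
```

In this kind of setting, the interesting quantities are the gap between when the training loss becomes small and when the test loss follows, and the evolution of $\|w\|_1$ rather than $\|w\|_2$ as the norm that tracks generalization.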

Delayed Generalization

Grokking Beyond the Euclidean Norm of Model Parameters

GFlowNets

Gibbs sampling

GloVe

Gradient Descent

Grokking Beyond the Euclidean Norm of Model Parameters

Grokking

Grokking Beyond the Euclidean Norm of Model Parameters

Implicit Regularization

Grokking Beyond the Euclidean Norm of Model Parameters

Importance Sampling

Inverse Transform Sampling

Low-Rank

Grokking Beyond the Euclidean Norm of Model Parameters

MCMC

Metropolis-Hastings

Metropolis-adjusted Langevin

NLP

Overparameterization

Grokking Beyond the Euclidean Norm of Model Parameters

Regularization

Grokking Beyond the Euclidean Norm of Model Parameters

Sparsity

Grokking Beyond the Euclidean Norm of Model Parameters

TF-IDF

Word2Vec

bias-variance tradeoff

Epoch-wise bias-variance decomposition

14 minute read

Published:

Let’s suppose we’re training a model parameterized by $\theta$, and let $\theta_t$ denote the value of $\theta$ at step $t$ of the optimization algorithm of our choice. In machine learning, it is often helpful to decompose the error $E(\theta)$ as $B^2(\theta)+V(\theta)+N(\theta)$, where $B$ is the bias, $V$ the variance, and $N$ the noise (irreducible error). In most cases, the decomposition is performed at an optimal solution $\theta^*$ (for instance, $\lim_{t \rightarrow \infty} \theta_t$, or its early-stopping version), for example to understand how the bias and variance change with the complexity or size of the model parameterized by $\theta$. This has helped explain phenomena such as model-wise double descent. It can also be interesting to visualize how $B(\theta_t)$ and $V(\theta_t)$ evolve with $t$, which can help explain phenomena like epoch-wise double descent: that’s what we’ll do in this blog post.
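
As a concrete illustration, here is a minimal Monte Carlo sketch (assuming a linear model, squared loss, and Gaussian label noise; the dimensions, noise level, and learning rate are illustrative) that estimates $B^2(\theta_t)$ and $V(\theta_t)$ at every step $t$ by training $K$ models on independent training sets and comparing their predictions on a fixed test set:

```python
import numpy as np

# Epoch-wise bias-variance estimate for a toy linear model (illustrative setup):
#   B^2(theta_t) = E_x[(E_D[f_t(x)] - f*(x))^2]    (squared bias at step t)
#   V(theta_t)   = E_x[Var_D(f_t(x))]              (variance at step t)
# where D ranges over independent training sets and f_t is the model after t steps.
rng = np.random.default_rng(0)
d, n, K, T = 20, 30, 50, 200
w_star = rng.normal(size=d)                        # true linear target (assumption)
X_te = rng.normal(size=(1000, d))
f_star = X_te @ w_star                             # noiseless test targets

preds = np.zeros((K, T, len(X_te)))
for k in range(K):                                 # K independent training sets D
    X = rng.normal(size=(n, d))
    y = X @ w_star + 0.5 * rng.normal(size=n)      # noisy labels
    w = np.zeros(d)
    for t in range(T):                             # plain gradient descent
        w -= 0.05 * X.T @ (X @ w - y) / n
        preds[k, t] = X_te @ w                     # predictions of f_t on the test set

mean_pred = preds.mean(axis=0)                     # E_D[f_t(x)], shape (T, 1000)
bias2 = ((mean_pred - f_star) ** 2).mean(axis=1)   # B^2(theta_t)
var = preds.var(axis=0).mean(axis=1)               # V(theta_t)
print(bias2[::50])
print(var[::50])
```

Their sum (plus the noise term) is the expected test error, which is what makes this epoch-wise view useful for phenomena like epoch-wise double descent.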

deep learning

Epoch-wise bias-variance decomposition

loss landscape

statistical learning

Epoch-wise bias-variance decomposition
