Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is also available for digesting.

Page Not Found

Page not found. Your pixels are in another canvas.

This website is still under development. Please refer to my CV for more information about me.

Jupyter notebook markdown generator

Posts

Grokking Beyond the Euclidean Norm of Model Parameters

less than 1 minute read

Published: July 06, 2025

Grokking refers to delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. We show that the dynamics of grokking go beyond the $\ell_2$ norm, that is: if there exists a model with a property $P$ (e.g., sparse or low-rank weights) that fits the data, then GD with a small (explicit or implicit) regularization of $P$ (e.g., $\ell_1$ or nuclear norm regularization) will also result in grokking, provided the number of training samples is large enough. Moreover, the $\ell_2$ norm of the parameters is no longer guaranteed to decrease with generalization when it is not the property sought.

Epoch-wise bias-variance decomposition

14 minute read

Published: May 01, 2023

Let’s suppose we’re training a model parameterized by $\theta$, and let’s denote by $\theta_t$ the parameter $\theta$ at step $t$ given by the optimization algorithm of our choice. In machine learning, it is often helpful to be able to decompose the error $E(\theta)$ as $B^2(\theta)+V(\theta)+N(\theta)$, where $B$ represents the bias, $V$ the variance, and $N$ the noise (irreducible error). In most cases, the decomposition is performed on an optimal solution $\theta^*$ (for instance, $\lim_{t \rightarrow \infty} \theta_t$, or its early stopping version), for example, in order to understand how the bias and variance change with the complexity of the function implementing $\theta$, the size of this function, etc. This has helped explain phenomena such as model-wise double descent. On the other hand, it can also be interesting to visualize how $B(\theta_t)$ and $V(\theta_t)$ evolve with $t$ (which can help explain phenomena like epoch-wise double descent): that’s what we’ll be doing in this blog post.
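As a rough illustration of the epoch-wise view described above, here is a minimal sketch that estimates $B^2(\theta_t)$ and $V(\theta_t)$ at every gradient step. Everything in it is an assumption made for illustration and is not taken from the post: the ground truth $f(x) = 2x$, the scalar linear model, the Gaussian label noise with standard deviation `sigma`, and the function name `epochwise_bias_variance`. Bias and variance are estimated by training several models on independently drawn training sets and comparing their predictions on a fixed test grid.

```python
import random


def epochwise_bias_variance(n_models=20, n_train=10, n_steps=50,
                            lr=0.1, sigma=0.5, seed=0):
    """Track bias^2 and variance of a scalar linear model at each GD step.

    Hypothetical setup: y = 2x + Gaussian noise, model x -> w * x.
    """
    rng = random.Random(seed)
    true_w = 2.0                                  # assumed ground truth f(x) = 2x
    x_test = [i / 10 for i in range(-10, 11)]     # fixed test grid on [-1, 1]
    f_test = [true_w * x for x in x_test]

    # One independent noisy training set per model.
    datasets = []
    for _ in range(n_models):
        xs = [rng.uniform(-1, 1) for _ in range(n_train)]
        ys = [true_w * x + rng.gauss(0, sigma) for x in xs]
        datasets.append((xs, ys))

    weights = [0.0] * n_models                    # theta_0 for every model
    history = []
    for t in range(n_steps):
        # One full-batch gradient step on the mean squared error, per model.
        for m, (xs, ys) in enumerate(datasets):
            grad = sum(2 * (weights[m] * x - y) * x
                       for x, y in zip(xs, ys)) / n_train
            weights[m] -= lr * grad

        bias2 = var = mse = 0.0
        for x, f in zip(x_test, f_test):
            preds = [w * x for w in weights]
            mean_pred = sum(preds) / n_models
            bias2 += (mean_pred - f) ** 2
            var += sum((p - mean_pred) ** 2 for p in preds) / n_models
            mse += sum((p - f) ** 2 for p in preds) / n_models

        k = len(x_test)
        history.append({
            "step": t + 1,
            "bias2": bias2 / k,         # B^2(theta_t)
            "variance": var / k,        # V(theta_t)
            "noise": sigma ** 2,        # N, the irreducible error
            "mse_to_f": mse / k,        # error against the noiseless target
        })
    return history


if __name__ == "__main__":
    for row in epochwise_bias_variance()[::10]:
        print(row)
```

In this sketch, `mse_to_f` is measured against the noiseless target, so it equals `bias2 + variance` up to floating-point error; the expected error against noisy labels would add the `noise` term.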

Visualization of the loss landscape and optimization path of a neural network

less than 1 minute read

Published: May 01, 2022

Generating Random Variables and Stochastic Processes, Generative Flow Networks (GFlowNets)

2 minute read

Published: April 14, 2022

Word embeddings

6 minute read

Published: August 07, 2020

portfolio

Word embeddings

Published: August 07, 2020

Pre-train and Fine-tune a Language Model with 🤗 Transformers

Published: March 01, 2022

Intrinsic Dimension Estimation

Published: April 29, 2022

publications

On the use of linguistic similarities to improve Neural Machine Translation for African Languages

Pascal Jr. Tikeng Notsawo, Brice Nanda, James Assiene, 5th Black in AI Workshop @ NeurIPS, 2021.

Stochastic Average Gradient : A Simple Empirical Investigation

Pascal Junior Tikeng Notsawo, IFT6512, Stochastic programming, Université de Montréal, 2023.

Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization for Heterogeneous Representational Coarseness

Dianbo Liu, Alex Lamb, Xu Ji, Pascal Jr. Tikeng Notsawo, Mike Mozer, Yoshua Bengio, Kenji Kawaguchi, Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023.

Predicting Grokking Long Before it Happens: A look into the loss landscape of models which grok

Pascal Jr. Tikeng Notsawo, Hattie Zhou, Mohammad Pezeshki, Irina Rish, Guillaume Dumas, ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023.

Lost in Translation: The Algorithmic Gap Between LMs and the Brain

Tommaso Tosato, Pascal Jr. Tikeng Notsawo, Saskia Helbling, Irina Rish, Guillaume Dumas, Workshop on Large Language Models and Cognition, ICML, 2024.

Grokking Beyond the Euclidean Norm of Model Parameters

Pascal Jr. Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau, Forty-Second International Conference on Machine Learning (ICML), 2025.

talks

On the use of linguistic similarities to improve Neural Machine Translation for African Languages

Published: December 15, 2022

Predicting Grokking Long Before it Happens: A look into the loss landscape of models which grok

Published: July 28, 2023

Rethinking Generalization in Deep Learning: Double descent and Grokking phenomena

Published: September 14, 2023

teaching

Group and home rehearsal courses

Yaounde, Cameroon, 2016, 2017

During my engineering training, I tutored college students in mathematics, physics, and chemistry, both privately at home and in groups.

Preparatory classes

Yaounde, Cameroon, 2017, 2018

During my training as an engineer, I prepared many students in mathematics and physical sciences (MSP for short, in the French system) for the entrance exams of the Grandes Ecoles in Cameroon.

Pascal Jr. Tikeng Notsawo