Grokking: Generalisation Long After Memorisation
Neural networks sometimes memorise their training set perfectly and then, thousands of steps later, suddenly learn to generalise. Understanding why requires not just tracking accuracy, but peering inside the geometry of learning itself.
In 2022, Alethea Power and colleagues at OpenAI published a short paper with a striking observation: when small transformers were trained on simple modular-arithmetic tasks, they would overfit almost immediately — achieving near-zero training loss while generalisation remained no better than chance. So far, unremarkable. But if training continued far beyond the point of apparent convergence, the models would abruptly transition to near-perfect generalisation. They called the phenomenon grokking, after the Robert Heinlein word for deep, intuitive understanding.
The finding reframed a question that practitioners mostly thought was settled: once a model has memorised its training data, is there anything left to learn? The answer, apparently, is sometimes yes — and the transition can happen long after any conventional early-stopping criterion would have halted training.
This post surveys the experimental setup, the mechanistic explanations from Nanda et al., the role of regularisation — and then turns to my own ongoing research, which frames the grokking transition through the lens of the empirical Neural Tangent Kernel.
The phenomenon
Modular arithmetic setup
Power et al. chose a deliberately simple family of tasks: binary operations over finite groups. The canonical example is modular addition — given two integers $a, b \in \{0, \dots, p-1\}$, predict $c = (a + b) \bmod p$. With $p = 113$, the full dataset has $p^2 = 12{,}769$ examples. The model is a small one-layer transformer, trained on a random 30% split with weight decay and prolonged optimisation.
The task is genuinely finite and closed — there is no distribution shift, no noise, no ambiguity. Once the training set is memorised, perfect validation accuracy is achievable in principle; the question is only whether the model learns the underlying rule or the surface patterns.
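To make the setup concrete, here is a minimal sketch (not the authors' code) of the dataset construction: the full $p^2$ grid of pairs, labelled with $(a+b) \bmod p$, split 30/70 at random.

```python
import numpy as np

# Minimal sketch of the modular-addition dataset (not the original code):
# every pair (a, b) in Z_p x Z_p, labelled with (a + b) mod p,
# randomly split into a 30% training set and a 70% validation set.
p = 113
rng = np.random.default_rng(0)

pairs = np.array([(a, b) for a in range(p) for b in range(p)])  # full p^2 grid
labels = (pairs[:, 0] + pairs[:, 1]) % p

perm = rng.permutation(p * p)
n_train = int(0.3 * p * p)
train_idx, val_idx = perm[:n_train], perm[n_train:]
```

Because the grid is finite and closed, the validation set is drawn from exactly the same distribution as the training set; there is nothing to generalise to except unseen pairs of the same form.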
Power et al. found that the same architecture exhibits qualitatively different learning dynamics depending on how long training ran. A model stopped at step ~1 000 shows 100% training accuracy and chance-level validation accuracy — pure memorisation. The same model continued to step ~60 000 shows 100% training accuracy and ~100% validation accuracy. The transition is sharp: validation accuracy jumps from chance to near-perfect in a span that is small relative to the total training budget.
Chart 1 — Training vs. validation accuracy (p = 113, modular addition)
Delayed generalisation
The surprise is not that the model eventually generalises — it is the delay. Under standard intuitions, once the training loss has converged, gradient updates carry no new information. Yet something continues to happen in weight space that makes generalisation possible long after the training signal appears exhausted.
This separates grokking from ordinary slow learning. It is not that generalisation and memorisation proceed at similar speeds — it is that they proceed at radically different speeds, with memorisation winning by orders of magnitude. The network does not inch toward generalisation while memorising; it memorises completely, then grokks.
The phenomenon also highlights a practical danger. If we evaluate a model only when its training loss has converged, we may mistake memorisation for the end of learning. Early stopping on training loss — or even on validation loss plateaux — could terminate training before the generalising solution has been found.
Mechanistic explanations
Several groups have proposed accounts of why grokking happens. The most influential draw on mechanistic interpretability — the project of reverse-engineering what computations a neural network has actually implemented.
The Fourier circuit
Neel Nanda and colleagues performed a detailed mechanistic analysis of transformers trained on modular addition (arXiv:2301.05217). Their central finding: grokked models implement a specific algorithm using Fourier features. The network learns to represent each input token as a combination of sinusoidal functions at a sparse set of "key frequencies" $\omega_k = 2\pi k / p$, uses the attention mechanism and MLP to compute $\cos\big(\omega(a+b)\big)$ and $\sin\big(\omega(a+b)\big)$, and reads off the answer from those values via a linear map. Formally, the learned logit for class $c$ is approximately

$$\mathrm{logit}(c) \;\approx\; \sum_{\omega \in \Omega} \alpha_\omega \cos\big(\omega(a+b-c)\big)$$

for some learned coefficients $\alpha_\omega > 0$. This is a genuinely correct implementation of modular addition — not a heuristic that happens to work on the training set. Behaviourally, the circuit does not emerge gradually; the transition corresponds to a qualitative shift from a memorisation solution (relying on a different, less structured set of features) to the Fourier solution.
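A quick numerical check of why this formula works. This is a sketch with made-up key frequencies and unit coefficients (the actual learned values differ per run): each cosine term peaks exactly when $c \equiv a + b \pmod p$, so the argmax over classes recovers modular addition.

```python
import numpy as np

# Sketch of the grokked logit formula with hypothetical key frequencies and
# unit coefficients (actual learned values vary per run). Each cosine term is
# maximised exactly when c = (a + b) mod p, so the argmax is the right answer.
p = 113
key_ks = [14, 35, 41, 52, 53]                   # hypothetical key indices k
omegas = [2 * np.pi * k / p for k in key_ks]    # frequencies omega_k = 2*pi*k/p

def logits(a, b):
    c = np.arange(p)
    return sum(np.cos(w * (a + b - c)) for w in omegas)

preds = np.array([[np.argmax(logits(a, b)) for b in range(p)] for a in range(0, p, 16)])
truth = np.array([[(a + b) % p for b in range(p)] for a in range(0, p, 16)])
```

Since $p$ is prime and each $k \not\equiv 0 \pmod p$, the term $\cos\big(\omega_k(a+b-c)\big)$ equals 1 only at $c \equiv a+b$, so the maximum is strict regardless of the (positive) coefficients.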
The chart below illustrates this shift. Before grokking, all Fourier frequencies in the input embedding have similar, small amplitudes. After grokking, a sparse set of key frequencies dominates.
Chart 2 — Fourier frequency spectrum of input embeddings before and after grokking (p = 113)
Representation learning view
A complementary perspective focuses on the evolution of internal representations. Grokking can be understood as a process in which the network first learns to distinguish training examples by rote — effectively building a lookup table in weight space — and then reorganises those representations into a structured, compositional form that supports extrapolation. One signature is a decrease in the effective dimensionality of the representation. Memorisation produces high-dimensional, irregular embeddings; the generalising Fourier circuit lives in a much lower-dimensional subspace.
The role of weight decay
One of the most reproducible findings in the grokking literature is that weight decay (L2 regularisation) is nearly always required to observe the phenomenon. Without it, models memorise and stay memorised. With it, grokking becomes reliable, and the strength of weight decay controls the timing: stronger regularisation produces faster grokking.
This suggests a norm-based account. The memorisation solution and the generalisation solution both achieve zero training loss, but they differ in the norm of their weights. The memorisation solution — essentially a large lookup table — requires many large weights to distinguish training examples individually. The generalisation solution — the Fourier circuit — implements a compact algorithm expressible with smaller weights. With total loss

$$\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{train}}(\theta) \;+\; \lambda \lVert \theta \rVert_2^2,$$

once $\mathcal{L}_{\mathrm{train}}$ reaches its minimum, gradient updates are driven entirely by the weight-norm term. The network continues to move through weight space in directions that reduce $\lVert \theta \rVert_2^2$ while staying near the zero-training-loss manifold. If the generalisation solution lies in a lower-norm region of that manifold, the optimiser will eventually find it.
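The norm competition can be illustrated in the simplest possible setting: an underdetermined linear model, where the zero-training-loss set is an affine subspace. This is a toy analogy, not the transformer's actual landscape — many interpolating weight vectors exist, they differ in norm, and the weight-decay term selects among them.

```python
import numpy as np

# Toy analogy for norm competition (underdetermined linear model, not the
# transformer's actual landscape): every w with Xw = y has zero training loss,
# but interpolators differ in norm, and weight decay drives the optimiser
# along the zero-loss affine subspace toward the minimum-norm one.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))                 # 5 "training examples", 20 weights
y = rng.normal(size=5)

w_min = X.T @ np.linalg.solve(X @ X.T, y)    # minimum-norm interpolator
N = np.eye(20) - X.T @ np.linalg.solve(X @ X.T, X)   # projector onto null(X)
w_big = w_min + 5.0 * (N @ rng.normal(size=20))      # another zero-loss solution
```

Both `w_min` and `w_big` fit the training data exactly; they differ only in the null-space component, which is precisely what the weight-norm term penalises.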
Mechanistic progress measures
A practical challenge raised by grokking is that the usual indicators of learning give no warning that generalisation is imminent. Training loss is flat. Gradient norm appears stable. Validation loss gives no signal until the transition begins.
Nanda et al.'s key contribution was to identify metrics that do give an early warning. Two stand out. First, the Fourier component norm: the L2 norm of the dominant Fourier frequencies in the embedding. This grows smoothly throughout training, long before validation accuracy improves. Second, the excluded loss: how well the Fourier circuit alone (with all other components zeroed out) can predict training outputs. Both metrics rise steadily during the long plateau — the grokking transition corresponds to the point where they approach saturation.
Chart 3 — Mechanistic progress measures as leading indicators of grokking
These progress measures tell us that the Fourier circuit forms gradually and continuously, even though the generalisation jump appears sudden. The network is quietly building its algorithm throughout the delay phase — the accuracy jump is merely the point at which the circuit becomes strong enough to dominate over the residual memorisation component.
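As a sketch of how such a progress measure might be computed (this is my reading of the metric, not Nanda et al.'s exact code): take the discrete Fourier transform of the embedding matrix along the token dimension and track the norm carried by the dominant frequencies.

```python
import numpy as np

# Sketch of a Fourier-component progress measure (my reading of the metric,
# not the original code): DFT the embedding matrix over the p token slots and
# track how much norm the dominant frequencies carry.
def fourier_component_norm(W_E, n_key=5):
    F = np.fft.rfft(W_E, axis=0)               # DFT over the token dimension
    freq_norms = np.linalg.norm(F, axis=1)     # norm carried by each frequency
    key = np.argsort(freq_norms)[-n_key:]      # dominant frequencies
    return np.sqrt((freq_norms[key] ** 2).sum()), freq_norms

p, width = 113, 32
rng = np.random.default_rng(1)
W_random = rng.normal(size=(p, width))         # memorisation-like embedding

# A grokked-style embedding: columns built from a few sinusoids in the tokens.
t = np.arange(p)
W_fourier = np.column_stack([np.cos(2 * np.pi * 14 * t / p),
                             np.sin(2 * np.pi * 14 * t / p),
                             np.cos(2 * np.pi * 35 * t / p),
                             np.sin(2 * np.pi * 35 * t / p)])

key_rand, norms_rand = fourier_component_norm(W_random)
key_four, norms_four = fourier_component_norm(W_fourier)
frac_rand = key_rand ** 2 / (norms_rand ** 2).sum()
frac_four = key_four ** 2 / (norms_four ** 2).sum()
```

For the sinusoidal embedding essentially all spectral energy sits in the key frequencies; for the random one it is spread across all of them — the two regimes Chart 2 contrasts.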
eNTK dynamics at the grokking transition
The mechanistic picture — memorisation versus Fourier circuit, norm-driven competition, continuous progress measures — describes what happens inside the network. A separate question is what happens to the geometry of learning itself: how the model's local sensitivity to its inputs evolves through the transition. This is where the empirical Neural Tangent Kernel comes in.
My current research (Nicholson, 2026 — ICML workshop) studies the eNTK as a quantitative interface between kernel theory and mechanistic interpretability. The central claim is that grokking is not just a jump in accuracy — it is a geometric reorganisation of how the network processes inputs, and that reorganisation should be visible and measurable through the eNTK.
The empirical neural tangent kernel
Let $f(x; \theta) \in \mathbb{R}^C$ denote the logits of a neural network with parameters $\theta \in \mathbb{R}^P$ and $C$ output classes. The empirical NTK at training time $t$ is

$$\Theta_t(x, x') \;=\; J_t(x)\, J_t(x')^\top \;\in\; \mathbb{R}^{C \times C},$$

where $J_t(x) = \partial f(x; \theta_t) / \partial \theta \in \mathbb{R}^{C \times P}$ is the Jacobian of the logits with respect to the parameters. For practical analysis we use the scalarized version

$$K_t(x, x') \;=\; \operatorname{tr} \Theta_t(x, x') \;=\; \sum_{c=1}^{C} \nabla_\theta f_c(x; \theta_t)^\top \nabla_\theta f_c(x'; \theta_t). \tag{1}$$

Under gradient flow and squared loss, model outputs evolve as

$$\frac{\mathrm{d}}{\mathrm{d}t} f(x; \theta_t) \;=\; -\sum_{i=1}^{n} \Theta_t(x, x_i)\, \big(f(x_i; \theta_t) - y_i\big),$$

so the time dependence of $\Theta_t$ directly controls learning dynamics. In the classical infinite-width NTK regime, the kernel stays fixed at initialisation: $\Theta_t \approx \Theta_0$ throughout training (Jacot et al., 2018). Grokking — where the model transitions from memorisation to a structured algorithm — is a paradigmatic non-lazy event. The eNTK should move.
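The scalarized kernel can be sanity-checked without any autodiff library. For a linear model $f(x; W) = Wx$ the Jacobian is available in closed form, and the trace formula reduces to the analytic answer $K(x, x') = C\,\langle x, x' \rangle$ — a minimal sketch:

```python
import numpy as np

# Minimal sketch of the scalarized eNTK, K(x, x') = sum_c grad f_c(x) . grad f_c(x'),
# for a linear model f(x; W) = W x whose Jacobian is closed-form:
# d f_c / d W_{c',j} = [c == c'] * x_j.  Analytically, K(x, x') = C * <x, x'>.
rng = np.random.default_rng(0)
C, d, n = 3, 5, 4                              # classes, input dim, probe size
X = rng.normal(size=(n, d))

def jacobian(x):
    J = np.zeros((C, C * d))                   # one row per logit
    for c in range(C):
        J[c, c * d:(c + 1) * d] = x            # grad of f_c lives in row c of W
    return J

Js = [jacobian(x) for x in X]
K = np.array([[np.trace(Ji @ Jj.T) for Jj in Js] for Ji in Js])
```

For nonlinear networks the same trace-of-Jacobian-products formula applies, with the Jacobians supplied by automatic differentiation.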
To make kernel comparisons numerically stable across training, each kernel matrix is centered and trace-normalised:

$$\bar K_t \;=\; \frac{H K_t H}{\operatorname{tr}(H K_t H) + \epsilon},$$

where $H = I_n - \tfrac{1}{n} \mathbf{1}\mathbf{1}^\top$ is the centering matrix and $\epsilon$ is a small numerical stabiliser.
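A direct implementation of this normalisation (a sketch; the value of $\epsilon$ here is arbitrary):

```python
import numpy as np

# Sketch of the centering + trace-normalisation step: K -> HKH / (tr(HKH) + eps).
def normalise_kernel(K, eps=1e-12):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix H = I - 11^T/n
    Kc = H @ K @ H
    return Kc / (np.trace(Kc) + eps)

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
K_bar = normalise_kernel(A @ A.T)              # any PSD kernel works
```

After this step every kernel in the trajectory has unit trace and zero row means, so Frobenius distances between kernels at different epochs compare shape rather than scale.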
The symmetry hypothesis
The deepest claim of my research is not merely that the eNTK moves, but that it moves in a structured, symmetry-respecting direction — one dictated by the algebra of the task.
For modular addition, the correct label depends only on the sum coordinate $s = (a + b) \bmod p$. This means the true solution is invariant under the family of sum-preserving transformations

$$T_\delta : (a, b) \mapsto (a + \delta,\; b - \delta) \bmod p, \qquad \delta \in \mathbb{Z}_p.$$

A model that has genuinely learned modular addition should be indifferent to these shifts — moving the $a$ and $b$ components by equal and opposite amounts should not change its output. I formalise this as: a kernel $K$ on $\mathbb{Z}_p \times \mathbb{Z}_p$ is addition-stationary if it depends only on the difference of sum coordinates,

$$K\big((a, b), (a', b')\big) \;=\; \kappa\big((a + b) - (a' + b') \bmod p\big)$$

for some function $\kappa : \mathbb{Z}_p \to \mathbb{R}$.

An addition-stationary kernel has a beautiful spectral structure. Its eigenfunctions on the full $p^2$-point grid are the complex characters of the sum variable,

$$\chi_\omega(a, b) \;=\; e^{2\pi i \omega (a + b)/p}, \qquad \omega \in \{0, 1, \dots, p - 1\},$$

and its rank on the full grid is at most $p$ — because there are only $p$ distinct sum characters. In other words, the idealised kernel for modular addition is low-dimensional, Fourier-aligned, and sum-invariant. My hypothesis is that the eNTK should approach this structure as grokking proceeds.
This prediction connects directly to the mechanistic picture: the Fourier circuit discovered by Nanda et al. implements exactly the computation that would produce an addition-stationary kernel. If the network is building that circuit, its tangent geometry should be acquiring the same symmetry.
Spectral structure of an addition-stationary kernel
The key theoretical result underpinning the symmetry hypothesis is the following proposition, which makes the predicted spectral geometry precise. Proposition 1. Let $K$ be an addition-stationary kernel on $\mathbb{Z}_p \times \mathbb{Z}_p$. Then:

(i) $K$ is invariant under simultaneous action of $T_\delta$ on both arguments: $K\big(T_\delta(a, b),\, T_\delta(a', b')\big) = K\big((a, b), (a', b')\big)$ for all $\delta \in \mathbb{Z}_p$.

(ii) The complex characters $\chi_\omega(a, b) = e^{2\pi i \omega (a + b)/p}$, $\omega \in \mathbb{Z}_p$, are eigenfunctions of $K$.

(iii) The kernel matrix on the full $p^2$-point grid has rank at most $p$.

The proof of part (i) is immediate: since $(a + \delta) + (b - \delta) = a + b$, the sum coordinate is unchanged by $T_\delta$, so a kernel depending only on sum coordinates cannot detect the shift. For part (ii), writing out the action of $K$ on $\chi_\omega$:

$$(K \chi_\omega)(a, b) \;=\; \sum_{a', b' \in \mathbb{Z}_p} \kappa(s - s')\, e^{2\pi i \omega s'/p}, \qquad s = (a + b) \bmod p,\; s' = (a' + b') \bmod p.$$

Grouping by the value $s' = (a' + b') \bmod p$, the sum over the $p$ elements of each level set contributes a factor of $p$, leaving

$$(K \chi_\omega)(a, b) \;=\; p \sum_{s' = 0}^{p - 1} \kappa(s - s')\, e^{2\pi i \omega s'/p}.$$

This is the circular convolution of $\kappa$ with a Fourier character, which equals $\lambda_\omega \chi_\omega(a, b)$ for eigenvalue $\lambda_\omega = p\, \hat\kappa(\omega)$, where $\hat\kappa(\omega) = \sum_{u \in \mathbb{Z}_p} \kappa(u)\, e^{-2\pi i \omega u/p}$ is the discrete Fourier transform of $\kappa$. Part (iii) follows because the image of $K$ lies in the span of the $p$ distinct sum characters.
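The proposition can be checked numerically for a small prime. The sketch below builds a random addition-stationary kernel on the full grid and verifies the rank bound and the eigenvalue formula $\lambda_\omega = p\,\hat\kappa(\omega)$:

```python
import numpy as np

# Numerical check of the proposition for a small prime: build an
# addition-stationary kernel K = kappa((s - s') mod p) from a random kappa,
# then verify rank(K) <= p and that the sum character chi_omega is an
# eigenvector with eigenvalue p * kappa_hat(omega).
p = 11
rng = np.random.default_rng(0)
kappa = rng.normal(size=p)

grid = np.array([(a, b) for a in range(p) for b in range(p)])
s = grid.sum(axis=1) % p                       # sum coordinate of each grid point
K = kappa[(s[:, None] - s[None, :]) % p]       # p^2 x p^2 kernel matrix

rank = np.linalg.matrix_rank(K)

omega = 3                                      # any frequency works
chi = np.exp(2j * np.pi * omega * s / p)       # sum character chi_omega on the grid
u = np.arange(p)
lam = p * (kappa * np.exp(-2j * np.pi * omega * u / p)).sum()  # p * kappa_hat(omega)
```

The rank bound holds because $K$ has at most $p$ distinct rows, one per value of the sum coordinate.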
The practical consequence: if grokking corresponds to discovering the modular-addition algorithm, the eNTK should converge toward a rank-$p$, Fourier-aligned, sum-invariant geometry. We can also directly quantify the distance from this ideal class via the addition-stationary approximation error:

$$\varepsilon_{\mathrm{add}}(K_t) \;=\; \frac{\lVert \bar K_t - \Pi_{\mathrm{add}} \bar K_t \rVert_F}{\lVert \bar K_t \rVert_F},$$

where $\Pi_{\mathrm{add}}$ is the Frobenius-orthogonal projection onto the addition-stationary kernel class, obtained by averaging kernel entries over all pairs sharing the same value of $(s - s') \bmod p$. When the full $p^2$-point grid is available, $\varepsilon_{\mathrm{add}}$ directly measures how far the eNTK lies from the idealised class in Proposition 1.
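The projection and the error are direct to compute (a sketch, assuming the full grid is available): average entries over sum-difference classes, then take the relative residual.

```python
import numpy as np

# Sketch of the addition-stationary projection (full grid assumed): average
# kernel entries over all pairs sharing the same (s - s') mod p, then measure
# the relative Frobenius residual eps_add.
def eps_add(K, s, p):
    d = (s[:, None] - s[None, :]) % p
    means = np.array([K[d == u].mean() for u in range(p)])
    P = means[d]                               # projected addition-stationary kernel
    return np.linalg.norm(K - P) / np.linalg.norm(K)

p = 11
grid = np.array([(a, b) for a in range(p) for b in range(p)])
s = grid.sum(axis=1) % p

rng = np.random.default_rng(0)
K_stat = rng.normal(size=p)[(s[:, None] - s[None, :]) % p]   # exactly in the class
K_rand = rng.normal(size=(p * p, p * p))                      # far from the class
```

Averaging over each sum-difference class is the Frobenius-orthogonal projection because the classes partition the matrix entries.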
Kernel metrics
I operationalise the symmetry hypothesis through a suite of time-resolved metrics, computed from the centered kernel on a fixed probe set throughout training.
Kernel drift and velocity. How far has the tangent geometry moved from initialisation, and how fast is it moving now?

$$D(t) \;=\; \lVert \bar K_t - \bar K_0 \rVert_F, \qquad V(t) \;=\; \lVert \bar K_t - \bar K_{t-1} \rVert_F.$$

Peaks in $V(t)$ provide a kernel-based change-point estimate $t_K$. A core question is whether $t_K$ precedes, coincides with, or follows the grokking onset $t_2$.
Spectral concentration. Does the kernel collapse onto a lower-dimensional algorithmic subspace? Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$ be the eigenvalues of $\bar K_t$, and define the normalised spectral weights $q_i = \lambda_i / \sum_j \lambda_j$. The entropy-based effective rank is

$$\operatorname{erank}(\bar K_t) \;=\; \exp\Big(-\sum_i q_i \log q_i\Big).$$

If grokking corresponds to discovering a low-dimensional algorithmic basis, $\operatorname{erank}(\bar K_t)$ should decrease near or before the transition. Whether it decreases or remains stable is an empirical question — if the transition requires a broader representation, it might even rise. A less assumption-heavy complement is the top-$k$ spectral mass

$$M_k(t) \;=\; \sum_{i=1}^{k} q_i,$$

which asks what fraction of total kernel variance is carried by the leading $k$ eigendirections. The hypothesis predicts that $M_p(t)$ should approach 1 as the kernel converges toward its rank-$p$ addition-stationary limit.
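Both spectral metrics are a few lines given the eigenvalues; a sketch with two extreme cases (a rank-1 kernel and the identity) that pin down the expected values:

```python
import numpy as np

# Sketch of the spectral metrics: entropy-based effective rank
# erank = exp(-sum q_i log q_i) and top-k spectral mass M_k = sum_{i<=k} q_i.
def spectral_metrics(K, k):
    lam = np.clip(np.linalg.eigvalsh(K), 0.0, None)   # PSD: clip fp noise
    q = lam / lam.sum()
    q_pos = q[q > 0]
    erank = float(np.exp(-(q_pos * np.log(q_pos)).sum()))
    mass_k = float(np.sort(q)[::-1][:k].sum())
    return erank, mass_k

v = np.arange(1.0, 7.0)
erank_r1, mass_r1 = spectral_metrics(np.outer(v, v), k=1)   # rank-1 kernel
erank_id, mass_id = spectral_metrics(np.eye(6), k=1)        # maximally spread
```

A rank-1 kernel has effective rank 1 and full mass in its top direction; the identity spreads mass evenly, giving effective rank $n$ and top-1 mass $1/n$.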
Kernel-target alignment. Does the kernel geometry become more aligned with the task labels? Let $Y \in \mathbb{R}^{n \times p}$ be the centered one-hot label matrix on the probe set. Then

$$A(t) \;=\; \frac{\big\langle \bar K_t,\, Y Y^\top \big\rangle_F}{\lVert \bar K_t \rVert_F \, \lVert Y Y^\top \rVert_F}.$$
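A sketch of the computation, a centered-kernel-alignment-style Frobenius cosine that is 1 exactly when the kernel equals the label Gram matrix (up to scale):

```python
import numpy as np

# Sketch of kernel-target alignment: Frobenius cosine between the kernel and
# the Gram matrix of centered one-hot labels (a CKA-style score in [-1, 1]).
def alignment(K, labels, n_classes):
    Y = np.eye(n_classes)[labels]
    Yc = Y - Y.mean(axis=0)                    # center the one-hot columns
    G = Yc @ Yc.T
    return float((K * G).sum() / (np.linalg.norm(K) * np.linalg.norm(G)))

labels = np.array([0, 0, 1, 1, 2, 2])
Yc = np.eye(3)[labels] - np.eye(3)[labels].mean(axis=0)
G = Yc @ Yc.T
a_perfect = alignment(G, labels, 3)            # kernel = label Gram -> 1

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
a_rand = alignment(A + A.T, labels, 3)         # generic kernel -> |a| < 1
```

By Cauchy-Schwarz the score lies in $[-1, 1]$, with 1 meaning the kernel treats two examples as similar exactly when they share a label.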
Sum-invariance residual. Does the kernel become invariant to sum-preserving transformations? For each $\delta \in \mathbb{Z}_p$, let $P_\delta$ be the permutation matrix induced by $T_\delta$ on the probe set. Then

$$R(t) \;=\; \frac{1}{p} \sum_{\delta = 0}^{p - 1} \frac{\lVert P_\delta \bar K_t P_\delta^\top - \bar K_t \rVert_F}{\lVert \bar K_t \rVert_F}.$$

If the kernel depends primarily on the sum coordinate, $R(t)$ should decrease sharply through the transition.
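The residual can be sketched as follows, with each $T_\delta$ realised as a permutation of the full grid; a kernel depending only on the sum coordinates has residual exactly zero.

```python
import numpy as np

# Sketch of the sum-invariance residual: realise each shift
# T_delta: (a, b) -> (a + delta, b - delta) as a grid permutation and average
# the relative change ||P K P^T - K||_F / ||K||_F over delta.
p = 7
grid = [(a, b) for a in range(p) for b in range(p)]
index = {x: i for i, x in enumerate(grid)}

def sum_invariance_residual(K):
    total = 0.0
    for delta in range(p):
        perm = [index[((a + delta) % p, (b - delta) % p)] for a, b in grid]
        Kp = K[np.ix_(perm, perm)]             # conjugation by the permutation
        total += np.linalg.norm(Kp - K) / np.linalg.norm(K)
    return total / p

s = np.array([(a + b) % p for a, b in grid])
K_sum = np.cos(2 * np.pi * (s[:, None] - s[None, :]) / p)   # sum-dependent only
rng = np.random.default_rng(0)
A = rng.normal(size=(p * p, p * p))
K_rand = A + A.T                                            # no such symmetry
```

Since $T_\delta$ leaves every sum coordinate unchanged, conjugating a sum-dependent kernel by the induced permutation reproduces it entry for entry.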
Fourier-subspace alignment. Let $\mathcal{F} \subset \mathbb{R}^n$ be the real span of the sum characters $\chi^{\cos}_\omega(a,b) = \cos\!\left(\tfrac{2\pi\omega(a+b)}{p}\right)$ and $\chi^{\sin}_\omega(a,b) = \sin\!\left(\tfrac{2\pi\omega(a+b)}{p}\right)$, with $\Pi_{\mathcal{F}}$ the orthogonal projector onto this space. Let $U_k$ span the top-$k$ eigenspace of $\bar K_t$. Then

$$F_k(t) \;=\; \frac{\lVert \Pi_{\mathcal{F}} U_k \rVert_F^2}{k}.$$

This basis-invariant score asks whether the dominant kernel eigenspace becomes concentrated on the explicit algorithmic basis of the task — the same Fourier basis identified by Nanda et al. in the learned circuit.
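A sketch of the alignment score, with the projector built by QR-orthonormalising the real sum characters (constant plus cosine/sine pairs):

```python
import numpy as np

# Sketch of the Fourier-subspace alignment F_k = ||Pi_F U_k||_F^2 / k.
# Pi_F projects onto the real span of the sum characters; U_k holds the
# top-k eigenvectors of the kernel.
p = 7
grid = np.array([(a, b) for a in range(p) for b in range(p)])
s = grid.sum(axis=1) % p

cols = [np.ones(p * p)]
for w in range(1, (p - 1) // 2 + 1):
    cols.append(np.cos(2 * np.pi * w * s / p))
    cols.append(np.sin(2 * np.pi * w * s / p))
B, _ = np.linalg.qr(np.column_stack(cols))     # orthonormal basis of the span
Pi_F = B @ B.T                                 # rank-p orthogonal projector

def fourier_alignment(K, k):
    lam, U = np.linalg.eigh(K)
    Uk = U[:, np.argsort(lam)[::-1][:k]]       # top-k eigenvectors
    return float(np.linalg.norm(Pi_F @ Uk) ** 2 / k)

K_stat = np.cos(2 * np.pi * (s[:, None] - s[None, :]) / p)  # rank-2, in-span
F_stat = fourier_alignment(K_stat, 2)
rng = np.random.default_rng(0)
A = rng.normal(size=(p * p, p * p))
F_rand = fourier_alignment(A + A.T, 2)         # generic kernel, low score
```

For a sum-stationary kernel the top eigenspace lies entirely inside $\mathcal{F}$, giving a score of 1; for a generic kernel the score hovers near $\dim(\mathcal{F})/n$.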
Chart 4 — eNTK drift and instantaneous velocity through the grokking transition
Experimental setup
The experiments are conducted at two scales. The primary full-trace setting uses a small prime modulus, for which exact full-grid eNTK computation is feasible every epoch. The canonical setting uses $p = 113$, matching the common grokking literature; there, the exact eNTK is computed on a fixed balanced probe set every epoch and on the full grid at selected checkpoints around the transition.
Two complementary architectures are trained. The primary mechanistic model is a one-layer transformer following the setup of Nanda et al. (2023): a short token sequence encoding $a$, $b$, and a separator token feeds into a single attention layer with an MLP multiplier of 4, followed by a $p$-way classifier head. The control model is a compact MLP with learned embeddings for $a$ and $b$, concatenated and passed through two hidden layers before the output head.
Both models are trained with AdamW (with separate learning rates for the transformer and the MLP), weight decay, cross-entropy loss, and a 30% training fraction. Each configuration is repeated over multiple random seeds. The eNTK is computed in PyTorch using forward-over-reverse automatic differentiation on the scalarized kernel of Eq. (1), with the probe set fixed at the start of training to ensure comparability across epochs.
Training phases and change points
Training is partitioned into three phases, defined by observable accuracy thresholds:
Memorisation phase: epochs $[0, t_1)$, where $t_1$ is the first epoch at which training accuracy exceeds 99%. The model fits the training set but generalises no better than chance.

Delay phase: epochs $[t_1, t_2)$, where train accuracy is high but test accuracy remains low. Both the Fourier circuit (Nanda et al.) and — per my hypothesis — the eNTK geometry are evolving during this silent period.

Generalisation phase: from $t_2$ onward, where $t_2$ is the first epoch at which test accuracy exceeds 95% for 10 consecutive epochs.
Independently, the eNTK defines a kernel-based change point

$$t_K \;=\; \arg\max_t V(t),$$

or more robustly, the maximiser of a smoothed version of $V(t)$ (moving-average window of 9 epochs). The central timing question is whether $t_K$ systematically precedes, coincides with, or follows $t_2$. If it precedes, the eNTK provides a continuous early-warning signal — a kernel-theoretic analogue of the mechanistic progress measures.
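A sketch of the change-point extraction from a velocity trace (the window width of 9 follows the text; the synthetic trace is purely illustrative):

```python
import numpy as np

# Sketch of the kernel change-point estimate t_K: argmax of the velocity
# trace V(t) after moving-average smoothing (window of 9 epochs, per the text).
def change_point(V, window=9):
    V_smooth = np.convolve(V, np.ones(window) / window, mode="same")
    return int(np.argmax(V_smooth))

# Illustrative synthetic trace: low-level noise plus a burst near epoch 300.
rng = np.random.default_rng(0)
t = np.arange(1000)
V = 0.01 * rng.random(1000) + 0.5 * np.exp(-((t - 300) / 20.0) ** 2)
t_K = change_point(V)
```

Smoothing before the argmax keeps the estimate from locking onto single-epoch noise spikes in an otherwise flat trace.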
Mechanistic correspondence analysis
To connect eNTK dynamics to mechanistic circuit formation, each kernel metric is compared against:
(1) the Fourier-restricted and Fourier-excluded losses from Nanda et al. (2023), obtained by ablating non-key and key Fourier components of the logits respectively; (2) the Fourier energy in the logits or internal activations; and (3) transition markers such as $t_2$ and $t_K$.
Both contemporaneous correlations and lead-lag statistics are computed. The statistical protocol uses bootstrap confidence intervals (2 000 resamples) across seeds, Spearman rank correlations between eNTK metrics and mechanistic progress measures, and permutation tests for alignment of change points. Concretely, for metric trajectories $u_t$ and $v_t$ the Spearman correlation is estimated over the full training trajectory:

$$\rho(u, v) \;=\; \operatorname{corr}\big(\operatorname{rank}(u_t),\, \operatorname{rank}(v_t)\big).$$

If an eNTK metric increases before test accuracy jumps, it constitutes a continuous kernel-based progress measure. Lead-lag analysis between $t_K$, $t_2$, and the mechanistic progress measures — averaged across seeds and reported with bootstrap confidence intervals — is the primary statistical target of the ongoing experiments.
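A rank-based sketch of the correlation estimate (assuming no ties, for simplicity):

```python
import numpy as np

# Sketch of the Spearman rank correlation between two metric trajectories,
# computed directly from ranks (assumes no ties, for simplicity).
def spearman(u, v):
    ru = np.argsort(np.argsort(u)).astype(float)   # rank of each entry
    rv = np.argsort(np.argsort(v)).astype(float)
    ru -= ru.mean()
    rv -= rv.mean()
    return float((ru * rv).sum() / np.sqrt((ru * ru).sum() * (rv * rv).sum()))

x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
rho_mono = spearman(x, x ** 3)     # any monotone transform preserves ranks
rho_anti = spearman(x, -x)         # order reversal flips the sign
```

Rank correlation is the right tool here because the metrics live on very different scales and only their monotone co-movement through training matters.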
What the kernel reveals
The metrics above are designed to test a layered hypothesis. Drift and velocity tell us that the eNTK is not static — the model is in a rich feature-learning regime, not the lazy NTK regime. Spectral concentration and effective rank tell us that the new geometry is lower-dimensional — the kernel is collapsing onto an algorithmic subspace. Label alignment tells us that this subspace is task-relevant. And Fourier-subspace alignment, combined with the sum-invariance residual, tells us that the subspace is specifically the one predicted by the mechanistic Fourier circuit hypothesis.
Chart 5 — Fourier-subspace alignment and sum-invariance residual
If the symmetry hypothesis is correct, then by the time the model groks, the top eigenvectors of $\bar K_t$ should approximately coincide with the sum characters $\chi_\omega$, and kernel entries between examples in the same $T_\delta$-orbit should be nearly equal. The eNTK would have acquired the symmetry of the task.
A further question is timing: do eNTK metrics provide an early warning of grokking, changing before the accuracy jump, or do they lag it? If a metric such as $F_k(t)$ rises before test accuracy, it constitutes a continuous kernel-based progress measure analogous to Nanda et al.'s Fourier component norm — derived not from the circuit weights but from the geometry of the training dynamics. Lead-lag analysis between $t_K$, $t_2$, and the mechanistic progress measures is a central target of the ongoing experiments.
The layerwise decomposition

$$K_t(x, x') \;=\; \sum_{\ell} K_t^{(\ell)}(x, x'), \qquad K_t^{(\ell)}(x, x') \;=\; \sum_{c=1}^{C} \nabla_{\theta_\ell} f_c(x; \theta_t)^\top \nabla_{\theta_\ell} f_c(x'; \theta_t),$$

obtained by restricting the gradients in Eq. (1) to each layer's parameter block $\theta_\ell$, allows us to localise where in the network the algorithmic structure first appears — whether it is front-loaded in the embedding layer, assembled in the attention heads, or concentrated in the output projection.
Broader lessons
Grokking is interesting partly because it is a toy phenomenon — it occurs reliably in small transformers on finite, noise-free tasks. Whether it occurs in large-scale training runs with noisy, high-dimensional data is less clear. But the conceptual lessons are worth taking seriously.
The most direct lesson is about early stopping. The standard practice of stopping when validation loss has plateaued may be too conservative. A model that has memorised its training data is not at a dead end — it may still be traversing a low-loss manifold in weight space toward a lower-norm, better-generalising solution.
A second lesson is about what it means to "leave the kernel regime." Rather than merely saying that grokking is associated with a transition from lazy to rich training, the eNTK perspective allows us to say how the kernel changes: it becomes lower-dimensional, more label-aligned, more invariant to the task symmetry, and more localised to interpretable layers. This is a richer characterisation than drift alone.
A third lesson concerns the relationship between kernel theory and mechanistic interpretability — two research programmes that are usually treated separately. If the eNTK eigenspaces converge to the Fourier characters exactly when the Fourier circuit forms, then the eNTK is literally reading out the same algorithmic structure as circuit analysis — from a completely different angle. That would make it a practical bridge between the two approaches.
Open questions
Does grokking occur in large-scale training? The clearest evidence comes from small models on finite tasks. Whether the same phase transitions appear in transformer pre-training or fine-tuning on large corpora is unclear. Some researchers have argued that the long tail of pre-training runs — where loss curves plateau but downstream performance continues to improve — may be a form of large-scale grokking. But the evidence is indirect.
Do eNTK metrics provide early warning? The mechanistic progress measures from Nanda et al. precede the accuracy jump. Whether kernel-based metrics like and do the same — and whether they correlate with the mechanistic measures — is the central open question of my ongoing research.
Is the symmetry hypothesis architecture-dependent? If the transformer and an MLP on the same task show different eNTK signatures, the relationship between algorithm formation and kernel dynamics may itself depend on architecture.
What is the relationship to double descent? Double descent describes a non-monotonicity in test error as model capacity grows. Grokking describes a similar non-monotonicity in time. Whether these phenomena share a common mechanism — and whether the eNTK provides a unified lens on both — remains to be seen.
Can we accelerate grokking using kernel diagnostics? If reliably precedes , one could imagine interventions that accelerate the kernel reorganisation directly — targeted regularisation, learning rate schedules, or data ordering — as a principled method for inducing grokking.
References
Nicholson, M. (2026). Empirical Neural Tangent Kernel Dynamics at the Grokking Transition: A Bridge Between Kernel Geometry and Mechanistic Interpretability. ICML Workshop Paper. University of Bath.
Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (2022). Grokking: Generalisation beyond overfitting on small algorithmic datasets. ICLR 2022 Workshop. arXiv:2201.02177.
Nanda, N., Chan, L., Lieberum, T., Smith, J., & Steinhardt, J. (2023). Progress measures for grokking via mechanistic interpretability. ICLR 2023. arXiv:2301.05217.
Kumar, T., Bordelon, B., Gershman, S. J., & Pehlevan, C. (2024). Grokking as the transition from lazy to rich training dynamics. ICLR 2024. arXiv:2310.06110.
Mohamadi, M. A., Li, Z., Wu, L., & Sutherland, D. J. (2024). Why do you grok? A theoretical analysis of grokking modular addition. arXiv:2407.12332.
Lin, J. (2025). Feature identification via the empirical NTK. arXiv:2510.00468.
Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalisation in neural networks. NeurIPS 2018. arXiv:1806.07572.
Gromov, A. (2023). Grokking modular arithmetic. arXiv:2301.02679.
Chizat, L., Oyallon, E., & Bach, F. (2019). On lazy training in differentiable programming. NeurIPS 2019. arXiv:1812.07956.