Sharpness Aware Surrogate Training for Spiking Neural Networks
Spiking neural networks are trained with smooth approximations but deployed with discontinuous spikes. SAST closes that gap by seeking flat loss minima. Here is the full theory, the intuition behind it, and a concrete worked example.
Spiking neural networks
The LIF neuron
The workhorse model is the Leaky Integrate-and-Fire (LIF) neuron. At each discrete time step a neuron accumulates charge into its membrane potential $u_t$, leaks a fraction of it away, and fires a binary spike $s_t$ whenever that potential crosses a threshold $\vartheta$. After firing the potential resets.
Concretely, for a layer with weight matrix $W$ and input spikes $x_t$ from the layer below:

$$u_t = \beta \, u_{t-1} \, (1 - s_{t-1}) + W x_t, \qquad s_t = H(u_t - \vartheta)$$

where $\beta \in (0, 1)$ is the membrane decay constant and $H$ is the Heaviside step function. The term $(1 - s_{t-1})$ implements the reset: once the neuron fires, the potential is zeroed before it integrates new input.
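These dynamics are short enough to simulate directly. A minimal NumPy sketch of a single LIF layer (the decay, threshold, and weight values below are illustrative, not taken from the paper):

```python
import numpy as np

def heaviside(u):
    # Hard spike: 1 where the potential has crossed threshold, else 0.
    return (u >= 0.0).astype(float)

def lif_forward(W, x_seq, beta=0.9, theta=1.0):
    """Run one LIF layer over a (T, n_in) spike train; return (T, n_out) spikes."""
    u = np.zeros(W.shape[0])
    s = np.zeros(W.shape[0])
    out = []
    for x in x_seq:
        # Leak, zero out neurons that fired last step, integrate new input.
        u = beta * u * (1.0 - s) + W @ x
        s = heaviside(u - theta)
        out.append(s)
    return np.stack(out)

# One neuron, constant sub-threshold drive: it charges up, fires, resets.
spikes = lif_forward(np.array([[0.6]]), np.ones((3, 1)))
print(spikes.ravel())  # -> [0. 1. 0.]
```

The reset term is visible in the trace: after the spike at step 1, the potential restarts from zero rather than continuing to accumulate.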
Surrogate gradients
Training requires backpropagating through $H$, which has zero gradient almost everywhere and an undefined gradient at zero. The standard fix is to replace $H$ with a smooth surrogate $\sigma$ during the backward pass while keeping the forward spike binary. A common choice is the arctangent surrogate:

$$\sigma_\alpha(u) = \frac{1}{2} + \frac{1}{\pi} \arctan(\alpha u)$$
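In code, the surrogate and its derivative look like this (this exact parameterisation and the width parameter `alpha` are one common choice, not necessarily the paper's):

```python
import numpy as np

def arctan_surrogate(u, alpha=2.0):
    # Smooth stand-in for the Heaviside step: 0.5 at the threshold,
    # approaching 0 and 1 away from it.
    return 0.5 + np.arctan(alpha * u) / np.pi

def arctan_surrogate_grad(u, alpha=2.0):
    # Analytic derivative: nonzero everywhere, peaked at u == 0.
    return (alpha / np.pi) / (1.0 + (alpha * u) ** 2)
```

Unlike $H$, the derivative is finite and nonzero everywhere, so gradient signal flows even through neurons that did not fire.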
SAST takes a different route: rather than keeping the forward pass hard and only smoothing the backward pass, it trains a surrogate-forward network where the spike function itself is replaced by $\sigma$ everywhere. This means the gradient is exact for the model being trained — no straight-through approximation anywhere in the computation graph.
The transfer gap
After surrogate training the learned weights are evaluated under two regimes. In surrogate-forward mode the smooth $\sigma$ is used at inference — this is what training optimised for. In hard-spike mode the smooth activations are replaced with $H$ — this is what a real neuromorphic deployment requires.
The transfer gap is simply:

$$\Delta = \mathrm{Acc}_{\text{surrogate}} - \mathrm{Acc}_{\text{hard}}$$

For a standard baseline on N-MNIST, $\Delta \approx 30$ percentage points. SAST reduces this to $\approx 2.5$ pp. Understanding why requires thinking about the shape of the loss landscape.
Why flat minima help
A concrete example
Consider a single hidden neuron whose membrane potential at some time step sits just below threshold — say $u = 0.99$ with $\vartheta = 1$ (illustrative values).
Under surrogate training: $\sigma(u - \vartheta) = \sigma(-0.01) \approx 0.5$. The neuron contributes a near-half activation — the network can learn to rely on this precisely calibrated fractional signal.
Under hard-spike evaluation: $H(u - \vartheta) = H(-0.01) = 0$. The neuron is completely silent. If the downstream layers depended on that ~0.5 activation, their inputs change dramatically.
The problem scales: a network with many neurons hovering near threshold accumulates these mismatches across layers and time steps. A small shift in the input distribution can flip dozens of spike decisions simultaneously.
The solution is to make the network stop relying on neurons that sit on the fence. If every neuron whose surrogate output is near 0.5 were nudged away from threshold, so that $|u - \vartheta|$ is large, then replacing $\sigma$ with $H$ would barely change anything, because both functions agree far from zero.
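That agreement is easy to check numerically. Below, the disagreement $|\sigma - H|$ is large only for a fence-sitting margin (the arctan form and its width of 10 are assumed for illustration):

```python
import numpy as np

def heaviside(u):
    return (u >= 0.0).astype(float)

def surrogate(u, alpha=10.0):
    # Arctan surrogate; the width alpha = 10 is an illustrative choice.
    return 0.5 + np.arctan(alpha * u) / np.pi

# Margins u - theta: one fence-sitter, two neurons far from threshold.
margins = np.array([-0.01, -1.0, 1.0])
disagreement = np.abs(surrogate(margins) - heaviside(margins))
print(disagreement.round(3))  # large for the fence-sitter, tiny otherwise
```

The fence-sitter disagrees by almost 0.5; the two confident neurons disagree by only a few percent.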
SAST achieves this indirectly, without explicitly penalising fence-sitter neurons. It does so by seeking flat regions of the parameter space.
The loss-landscape view
Imagine the surrogate training loss as a mountain range and the parameters $w$ as a location in that landscape. Standard gradient descent finds a valley — a local minimum — but says nothing about how steep the valley walls are. A narrow, sharp valley means that tiny perturbations to $w$ send the loss shooting upward. A wide, flat valley means the loss stays low across an entire neighbourhood.
Switching from surrogate to hard spikes is exactly such a perturbation. Every neuron near threshold gets its activation discretised — a jump that cannot be predicted from the gradient alone. If the network sits in a flat valley, this perturbation stays within the low-loss region. If it sits in a sharp valley, it likely escapes it.
Sharpness-aware minimization finds flat valleys by construction. That is the geometric reason SAST reduces the transfer gap.
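A toy 1-D picture of that geometry (the landscape and the size of the jolt below are invented purely for illustration):

```python
import numpy as np

def loss(w):
    # Two minima of equal depth: a sharp valley at w = -2, a flat one at w = +2.
    return np.minimum(50.0 * (w + 2.0) ** 2, 0.5 * (w - 2.0) ** 2)

# The surrogate -> hard switch acts like a fixed parameter-space jolt.
jolt = 0.3
rise_sharp = loss(-2.0 + jolt) - loss(-2.0)
rise_flat = loss(2.0 + jolt) - loss(2.0)
print(rise_sharp, rise_flat)  # the sharp valley pays a far larger penalty
```

Both minima have identical loss, so an optimiser that only looks at the loss value cannot tell them apart; only a sharpness-aware criterion prefers the flat one.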
The SAST method
SAM recap
Standard training minimises the empirical risk $L(w)$. Sharpness-Aware Minimization (Foret et al., 2021) instead minimises the worst-case loss in a ball of radius $\rho$ around $w$:

$$\min_w \, \max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon)$$

The inner maximisation is intractable in general but admits a first-order approximation. Expanding $L$ around $w$:

$$\epsilon^*(w) \approx \rho \, \frac{\nabla L(w)}{\|\nabla L(w)\|_2}$$

The outer update then follows the gradient at the perturbed point $w + \epsilon^*$, not at $w$ itself. This pushes $w$ toward regions where the loss is simultaneously low and locally flat.
The SAST update
SAST applies SAM to the surrogate-forward loss — the cross-entropy of the SNN when evaluated with $\sigma$ everywhere. Each training step proceeds in two phases:
Phase 1 — compute the perturbation. Sample minibatch $B$, compute the surrogate gradient $g = \nabla L_B(w)$, and form:

$$\epsilon = \rho \, \frac{g}{\|g\|_2 + \delta}$$

The small constant $\delta$ prevents division by zero.

Phase 2 — update at the perturbed point. Sample an independent minibatch $B'$ and update:

$$w \leftarrow w - \eta \, \nabla L_{B'}(w + \epsilon)$$
The independence of $B$ and $B'$ is not just an implementation detail — it is the condition required by the convergence theorem (Theorem 3 below). In code:
```
for each step:
    g = grad surrogate_loss(w; B)               # phase 1 batch B
    eps = rho * g / (norm(g) + delta)           # SAM perturbation
    w -= lr * grad surrogate_loss(w + eps; B')  # phase 2 batch B'
```

Theory
The theoretical results are scoped to the smooth auxiliary model — the surrogate-forward SNN that SAST actually optimises. They do not claim to govern the hard-spike network directly; that connection is empirical.
State stability
The first result establishes that the surrogate state sequence is well-behaved. Under the assumption that the spectral norm of each weight matrix is bounded,

$$\|W^{(l)}\|_2 \le \lambda \quad \text{for every layer } l,$$

the surrogate state map is a contraction in the $\ell_2$ norm across time steps. Concretely, for two input sequences $x_{1:T}$ and $\tilde{x}_{1:T}$ that differ only in one time step $k$:

$$\|u_t - \tilde{u}_t\|_2 \le C \, \gamma^{\,t-k} \quad \text{for all } t \ge k,$$

where $\gamma < 1$ is a contraction factor determined by the decay $\beta$, the spectral bound $\lambda$, and the surrogate's Lipschitz constant, and $C$ absorbs the size of the one-step perturbation. This fading-memory property means early perturbations wash out over time — a necessary precondition for the Lipschitz and smoothness results that follow.
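The fading-memory claim is easy to probe numerically on a one-neuron surrogate-forward model (all constants below are illustrative, chosen so that the contraction condition plausibly holds):

```python
import numpy as np

def surrogate(u, alpha=2.0):
    return 0.5 + np.arctan(alpha * u) / np.pi

def surrogate_states(x_seq, w=0.3, beta=0.5, theta=1.0):
    # Surrogate-forward recursion for one neuron: the reset gate is the
    # smooth sigma rather than a hard spike, so the state map is differentiable.
    u, states = 0.0, []
    for x in x_seq:
        u = beta * u * (1.0 - surrogate(u - theta)) + w * x
        states.append(u)
    return np.array(states)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=20).astype(float)
x_pert = x.copy()
x_pert[2] = 1.0 - x_pert[2]   # flip a single input spike at t = 2

gap = np.abs(surrogate_states(x) - surrogate_states(x_pert))
print(gap[2], gap[-1])        # the initial gap of 0.3 decays towards zero
```

The gap starts at exactly $w \cdot |\Delta x| = 0.3$ at the flipped step and shrinks geometrically afterwards, as the contraction result predicts.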
Lipschitz bound and smoothness
Under the same spectral-norm bounds and a uniform bound $|\sigma'| \le B_\sigma$ on the surrogate derivative, the network output is Lipschitz in its input:

$$\|f(x) - f(\tilde{x})\|_2 \le L_x \, \|x - \tilde{x}\|_2,$$

with a constant $L_x$ determined by $\lambda$, $B_\sigma$, the depth, and the time horizon. The empirical surrogate loss $L(w)$ is then $\beta_s$-smooth in the parameters — meaning its Hessian is bounded:

$$\|\nabla^2 L(w)\|_2 \le \beta_s.$$
Smoothness is the key property SAM theory requires. Without it, the first-order perturbation $\epsilon^*$ would not reliably approximate the true worst-case direction. Because the surrogate-forward model is smooth by construction (every activation is differentiable), this holds cleanly. It would not hold for the hard-spike model.
Convergence
Given $\beta_s$-smoothness and independent minibatches, SAST satisfies a standard nonconvex convergence bound. After $K$ steps with learning rate $\eta \propto 1/\sqrt{K}$:

$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big[\|\nabla L(w_k)\|_2^2\big] = O\!\left(\frac{1}{\sqrt{K}}\right)$$
This guarantees convergence to a first-order stationary point of the surrogate loss at the standard rate for nonconvex stochastic optimisation. The independent-minibatch requirement is what keeps the gradient estimate unbiased at the perturbed point — if the same batch were used for both phases, the estimate would be biased and the bound would not hold.
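The two-phase update is easy to sanity-check on a toy stochastic quadratic, where the minimiser is known (everything below is a made-up test problem, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
targets = rng.normal(loc=3.0, scale=0.5, size=(1000, 2))

def grad_loss(w, batch):
    # Gradient of L(w) = 0.5 * ||w - E[target]||^2, estimated on a minibatch.
    return w - batch.mean(axis=0)

w = np.zeros(2)
lr, rho, delta = 0.1, 0.05, 1e-12

for _ in range(500):
    b1 = targets[rng.integers(0, 1000, 32)]      # phase 1 batch B
    b2 = targets[rng.integers(0, 1000, 32)]      # independent phase 2 batch B'
    g = grad_loss(w, b1)
    eps = rho * g / (np.linalg.norm(g) + delta)  # SAM perturbation
    w = w - lr * grad_loss(w + eps, b2)          # descend at the perturbed point

print(np.linalg.norm(w - targets.mean(axis=0)))  # small: near the minimiser
```

Drawing `b1` and `b2` separately is the independence condition in action: each step's update gradient is an unbiased estimate at the perturbed point.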
The paper also includes a local mechanism proposition: under a bound on the per-sample Jacobian, the input gradient norm is controlled by the parameter gradient norm:

$$\|\nabla_x \ell(w; x)\|_2 \le C_J \, \|\nabla_w \ell(w; x)\|_2,$$

where $C_J$ depends on the Jacobian bound.
Smaller parameter gradients (as SAM encourages) therefore imply a smaller bound on $\|\nabla_x \ell\|_2$, which in turn means smaller input sensitivity. This is the local theoretical connection between sharpness and the transfer gap — but it is presented as a proposition under local conditions, not a universal theorem.
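For intuition, on a linear model with squared loss the two gradient norms are exactly proportional, so shrinking one shrinks the other in lockstep (a toy illustration of the proposition's flavour, not the paper's bound):

```python
import numpy as np

rng = np.random.default_rng(0)
w, x, y = rng.normal(size=3), rng.normal(size=3), 1.0

# Per-sample loss l(w; x) = 0.5 * (w @ x - y)**2
residual = w @ x - y
grad_w = residual * x          # gradient with respect to parameters
grad_x = residual * w          # gradient with respect to the input

# The two norms differ only by the fixed factor ||w|| / ||x||.
ratio = np.linalg.norm(grad_x) / np.linalg.norm(grad_w)
print(ratio, np.linalg.norm(w) / np.linalg.norm(x))
```

In a deep nonlinear network the relationship only holds locally and up to constants, which is why the paper states it as a proposition rather than a theorem.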
Results
The empirical story is about the transfer gap, not raw surrogate accuracy. Surrogate accuracy changes modestly; hard-spike accuracy changes dramatically.
Transfer results summary
| Dataset | $\rho$ | Surrogate acc. (baseline) | Surrogate acc. (SAST) | Hard-spike acc. (baseline) | Hard-spike acc. (SAST) | Transfer gap (baseline → SAST) |
|---|---|---|---|---|---|---|
| N-MNIST | 0.30 | 96.1% | 97.2% | 65.7% | 94.7% | 30.3 pp → 2.5 pp |
| DVS Gesture | 0.40 | 75.0% | 76.9% | 31.8% | 63.3% | 43.2 pp → 13.6 pp |
On N-MNIST, surrogate accuracy improves from 96.1% to 97.2% — a 1.1 pp gain. But hard-spike accuracy improves from 65.7% to 94.7% — a 29 pp gain. The transfer gap collapses from 30.3 pp to 2.5 pp. On DVS Gesture the pattern repeats: the surrogate gain is modest (1.9 pp) while the hard-spike gain is large (31.5 pp).
The overhead is real. Each SAST step requires two forward-backward passes — one to compute $\epsilon$ and one to compute the update gradient — giving a roughly 2× per-epoch cost. Whether this is acceptable depends on the deployment context.
Limitations
Compute-matched baselines. A 2× compute budget could alternatively be spent on longer standard training or larger batch sizes. Whether SAST still wins under a matched compute budget is an open question.
Post-hoc threshold calibration. Some of the hard-spike gap may be recoverable by simply re-tuning the firing threshold $\vartheta$ after surrogate training. This is a cheap alternative that has not been isolated as a control.
Scope of benchmarks. Results are reported on N-MNIST and DVS Gesture — both relatively small neuromorphic datasets. Whether the transfer-gap reduction generalises to larger, noisier, or temporally longer tasks is unknown.
Theory scope. The convergence and smoothness guarantees apply to the surrogate-forward auxiliary model. The connection between flat surrogate minima and hard-spike robustness is empirically demonstrated and locally motivated, but not proved in full generality.
References
Nicholson, M. (2026). Sharpness Aware Surrogate Training for Spiking Neural Networks. arXiv:2603.18039. arxiv.org/abs/2603.18039
Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2021). Sharpness-Aware Minimization for Efficiently Improving Generalization. ICLR 2021. arXiv:2010.01412.
Neftci, E. O., Mostafa, H., & Zenke, F. (2019). Surrogate Gradient Learning in Spiking Neural Networks. IEEE Signal Processing Magazine, 36(6), 51–63.
Mahowald, M., & Douglas, R. (1991). A silicon neuron. Nature, 354, 515–518.
Zenke, F., & Ganguli, S. (2018). SuperSpike: Supervised Learning in Multilayer Spiking Neural Networks. Neural Computation, 30(6), 1514–1541.