Algorithms
Scopes
- global: one threshold for the whole model.
- per_layer: threshold per module (group of parameters).
- per_param: threshold per parameter tensor.
All methods scale gradients by min(1, T / (||g|| + eps)), where T is the threshold.
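The scaling rule and the three scopes can be illustrated with a minimal sketch. The `clip_grads` helper, the `"module.param"` naming scheme, and the `eps=1e-6` value are assumptions made for this example, not the library's API.

```python
import math

def clip_scale(grad_norm, threshold, eps=1e-6):
    """Scale factor min(1, T / (||g|| + eps))."""
    return min(1.0, threshold / (grad_norm + eps))

def scope_key(name, scope):
    """Map a parameter name like "fc1.weight" to its clipping group."""
    if scope == "global":
        return "model"                    # one group for everything
    if scope == "per_layer":
        return name.rsplit(".", 1)[0]     # group by owning module
    if scope == "per_param":
        return name                       # one group per tensor
    raise ValueError(f"unknown scope: {scope}")

def clip_grads(grads, threshold, scope="global", eps=1e-6):
    """grads: dict of name -> flat list of gradient values (hypothetical layout)."""
    # Accumulate the squared L2 norm of each group.
    sq_norms = {}
    for name, g in grads.items():
        k = scope_key(name, scope)
        sq_norms[k] = sq_norms.get(k, 0.0) + sum(v * v for v in g)
    # Rescale every tensor by its group's scale factor.
    clipped = {}
    for name, g in grads.items():
        s = clip_scale(math.sqrt(sq_norms[scope_key(name, scope)]), threshold, eps)
        clipped[name] = [v * s for v in g]
    return clipped
```

With a fixed threshold, a narrower scope clips large tensors individually and leaves tensors whose norm is already below T untouched.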
AutoClip
- Default mode, mode="auto" (hyperparameter-free):
  - Streaming median via the P² estimator and variance via Welford's algorithm.
  - Threshold: T = median + 3 * std.
  - No decay/percentile/window knobs; warmup/history gates still apply.
  - Based on AutoClip by Seetharaman et al. (MLSP 2020), arXiv:2007.14469.
- Custom percentile mode, mode="percentile":
  - Choose history as either an EMA quantile estimator (history="ema") or a rolling window (history="window").
  - Parameters: percentile (default 95.0), ema_decay (0.99), window_size (1024).
  - Warmup/history gates: warmup_steps=100, min_history=50.
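The auto-mode threshold can be sketched as follows. This is not the library's implementation: the class names are hypothetical, the median is computed from a stored history as a simple stand-in for the streaming P² estimator, and the use of the population variance is an assumption.

```python
import math
import statistics

class Welford:
    """Running mean and variance in one pass (Welford's algorithm)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation (assumption; the source does not say).
        return math.sqrt(self.m2 / self.n) if self.n else 0.0

class AutoThresholdSketch:
    """T = median + 3 * std over observed gradient norms."""
    def __init__(self, k=3.0):
        self.k = k
        self.stats = Welford()
        self.history = []   # stand-in for P2: keeps all norms, O(n) memory

    def observe(self, grad_norm):
        self.stats.update(grad_norm)
        self.history.append(grad_norm)

    def threshold(self):
        return statistics.median(self.history) + self.k * self.stats.std
```

A real P² estimator tracks five markers and needs constant memory; the stored history here only keeps the example short.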
When to use / strengths:
- Auto mode: best default when you don't want to tune; robust across tasks and datasets, particularly when gradient norms are heavy‑tailed or non‑stationary. Conservative clipping reduces the chance of over‑clipping early in training.
- Percentile mode: choose an explicit tail risk (e.g., p95/p99). Prefer history="ema" for low‑memory, fast adaptation on non‑stationary streams; prefer history="window" for stronger outlier robustness in stationary regimes.
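The memory and adaptation trade-off between the two history options can be seen in a rough sketch. Both classes are hypothetical: the window version uses a nearest-rank percentile, and the EMA version uses one common stochastic-approximation update, which may differ from what the library actually does.

```python
import math
from collections import deque

class WindowPercentile:
    """Exact percentile over the last window_size norms (O(window) memory)."""
    def __init__(self, percentile=95.0, window_size=1024):
        self.p = percentile / 100.0
        self.buf = deque(maxlen=window_size)   # oldest values fall out

    def update(self, x):
        self.buf.append(x)

    def value(self):
        data = sorted(self.buf)
        rank = max(1, math.ceil(self.p * len(data)))   # nearest-rank rule
        return data[rank - 1]

class EmaQuantile:
    """O(1)-memory quantile tracker; step size is tied to ema_decay (assumption)."""
    def __init__(self, percentile=95.0, ema_decay=0.99):
        self.p = percentile / 100.0
        self.decay = ema_decay
        self.q = None      # current quantile estimate
        self.scale = 0.0   # EMA of |x - q|, sets the step magnitude

    def update(self, x):
        if self.q is None:
            self.q = x
            return
        dev = abs(x - self.q)
        if self.scale == 0.0:
            self.scale = dev
        else:
            self.scale = self.decay * self.scale + (1 - self.decay) * dev
        eta = (1 - self.decay) * self.scale
        # Move up by eta*p on exceedance, down by eta*(1-p) otherwise;
        # the two moves balance when a fraction p of samples lies below q.
        self.q += eta * self.p if x > self.q else -eta * (1 - self.p)

    def value(self):
        return self.q
```

The window tracker forgets a value only when it leaves the buffer, while the EMA tracker drifts continuously toward the recent distribution.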
Examples:
# Hyperparameter-free auto mode (recommended default)
clipper = AutoClip()
# Custom percentile mode with EMA quantile
clipper = AutoClip(mode="percentile", percentile=95.0, history="ema", ema_decay=0.99)
# Custom percentile mode with rolling window
clipper = AutoClip(mode="percentile", percentile=95.0, history="window", window_size=1024)
Recommended default: AutoClip() (auto mode).
AGC (NFNets-style)
- Threshold depends on weight norm: T = clipping * (||w|| + eps).
- Scale: min(1, T / (||g|| + eps)) per group.
- Parameters: clipping=0.01 (default), exclude_bias_bn=True, scope per_layer by default.
- Exclusions: skip parameters with dimensionality ≤ 1 when exclude_bias_bn is enabled.
Recommended default: AGC(clipping=0.01, exclude_bias_bn=True, scope="per_layer").
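The AGC rule for a single group can be sketched as below. The `agc_clip` function, the dict layout of each parameter, and the `eps=1e-3` value are assumptions for illustration only.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def agc_clip(params, clipping=0.01, exclude_bias_bn=True, eps=1e-3):
    """Clip one group (e.g. one layer) in place and return the scale used.

    params: list of dicts with "shape", "weight", and "grad" entries,
    where weight and grad are flat lists (hypothetical layout).
    """
    # Biases and normalization scales have dimensionality <= 1; skip them if asked.
    selected = [p for p in params
                if not (exclude_bias_bn and len(p["shape"]) <= 1)]
    w_norm = l2_norm([v for p in selected for v in p["weight"]])
    g_norm = l2_norm([v for p in selected for v in p["grad"]])
    threshold = clipping * (w_norm + eps)          # T = clipping * (||w|| + eps)
    scale = min(1.0, threshold / (g_norm + eps))   # min(1, T / (||g|| + eps))
    for p in selected:
        p["grad"] = [v * scale for v in p["grad"]]
    return scale
```

Because the threshold comes from the current weights, no history is kept and the result depends only on the values passed in at that step.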
When to use / strengths:
- Ideal when gradient magnitudes naturally scale with parameter magnitudes (e.g., NFNets and many CNNs). No history or warmup is required, so behavior is stable from step 1 and deterministic across processes.
- Per‑layer scope pairs well with the ratio‑based rule, offering scale‑invariant clipping that preserves signal for well‑scaled layers while taming outliers.
- Use exclude_bias_bn=True to avoid over‑regularizing biases and affine scale parameters.
Z-Score clipping (EMA mean/std)
- Tracks EMA of mean and variance of norms per scope.
- Threshold: T = m + zmax * std, where m is the EMA mean; clip when ||g|| > T.
- Parameters: zmax=3.0, ema_decay=0.99.
Recommended default: ZScoreClip(zmax=3.0, ema_decay=0.99).
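A minimal sketch of the z-score rule for one scope follows. The class name, the choice to update the statistics with the raw (unclipped) norm, and the `eps` value are assumptions, not confirmed library behavior.

```python
import math

class ZScoreSketch:
    """EMA mean/variance of gradient norms with threshold T = m + zmax * std."""
    def __init__(self, zmax=3.0, ema_decay=0.99, eps=1e-6):
        self.zmax = zmax
        self.decay = ema_decay
        self.eps = eps
        self.mean = None   # EMA mean m
        self.var = 0.0     # EMA variance

    def threshold(self):
        return max(self.mean + self.zmax * math.sqrt(self.var), self.eps)

    def step(self, grad_norm):
        """Return the gradient scale for this step, then update the EMAs."""
        if self.mean is None:
            self.mean = grad_norm   # first observation seeds the mean
            return 1.0
        t = self.threshold()
        scale = min(1.0, t / (grad_norm + self.eps)) if grad_norm > t else 1.0
        # Incremental EMA update of mean and variance.
        alpha = 1.0 - self.decay
        delta = grad_norm - self.mean
        self.mean += alpha * delta
        self.var = self.decay * (self.var + alpha * delta * delta)
        return scale
```

In this sketch a perfectly flat history gives zero variance, so any increase is clipped; the warmup and history gates exist to avoid acting on such immature statistics.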
When to use / strengths:
- Good when gradient norms are roughly unimodal and you primarily want to suppress rare spikes; zmax gives an intuitive, unitless control on aggressiveness.
- EMA adapts smoothly to drifting scales, making it suitable for long runs and curriculum schedules without manual retuning.
zmax: what it means and how to set it
- zmax is the tolerance in standard deviations above the EMA mean. The clip threshold is m + zmax * std, so zmax = 3.0 means “allow up to ~3 standard deviations above the recent average before clipping.”
- Higher zmax → fewer clips (more tolerant). Lower zmax → more clips (more aggressive).
- Practical starting points:
  - Start with zmax=3.0 (default) for most tasks.
  - If you see frequent large spikes or instability, try zmax=2.0–2.5.
  - If training seems over‑clipped (signal too damped) or gradients are heavy‑tailed, try zmax=3.5–4.0.
- Simple tuning loop: if clipping triggers on a large fraction of steps for stable runs, increase zmax; if spikes routinely exceed the threshold and hurt stability, decrease zmax.
- Note on ema_decay: larger values (e.g., 0.99) adapt more slowly but are smoother; if your gradient scale shifts quickly, consider a slightly smaller decay (e.g., 0.95–0.98) so the mean/std track faster.
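Two small helpers can make the tuning advice concrete. Both are hypothetical and exist only for this illustration: one estimates how many recent steps dominate an EMA, and the other measures how often clipping triggered in a logged run.

```python
def ema_horizon(ema_decay):
    """Approximate number of recent steps that dominate an EMA: 1 / (1 - decay)."""
    return 1.0 / (1.0 - ema_decay)

def clip_fraction(grad_norms, thresholds):
    """Fraction of steps on which the gradient norm exceeded its threshold."""
    pairs = list(zip(grad_norms, thresholds))
    if not pairs:
        return 0.0
    return sum(1 for g, t in pairs if g > t) / len(pairs)
```

A decay of 0.99 averages over roughly 100 steps and 0.95 over roughly 20, which is why the smaller value tracks scale shifts faster. A high clip fraction on an otherwise stable run is the signal to raise zmax.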
Safety and stability
- Warmup/min-history gates prevent premature clipping.
- NaN/Inf guards drop non-finite observations when enabled (guard_nans=True).
- Thresholds are lower-bounded by eps.
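The three safeguards can be combined in a short sketch. The `GatedThreshold` class is hypothetical, and details such as whether a dropped non-finite step still counts toward warmup are assumptions.

```python
import math

class GatedThreshold:
    """Warmup/min-history gates, NaN/Inf guard, and eps floor for a threshold."""
    def __init__(self, warmup_steps=100, min_history=50, guard_nans=True, eps=1e-6):
        self.warmup_steps = warmup_steps
        self.min_history = min_history
        self.guard_nans = guard_nans
        self.eps = eps
        self.steps = 0      # all steps seen, including dropped ones (assumption)
        self.recorded = 0   # observations actually added to the history

    def observe(self, grad_norm):
        """Return True if the norm was recorded, False if it was dropped."""
        self.steps += 1
        if self.guard_nans and not math.isfinite(grad_norm):
            return False    # keep NaN/Inf out of the statistics
        self.recorded += 1
        return True

    def effective_threshold(self, raw_threshold):
        """None while the gates are closed (no clipping), else max(T, eps)."""
        if self.steps < self.warmup_steps or self.recorded < self.min_history:
            return None
        return max(raw_threshold, self.eps)
```

Returning None during warmup lets the caller skip clipping entirely rather than clip against a threshold built from too few samples.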