Skip to content

Algorithms

Scopes

  • global: one threshold for the whole model.
  • per_layer: threshold per module (group of parameters).
  • per_param: threshold per parameter tensor.

All methods scale gradients by min(1, T / (||g|| + eps)), where T is the threshold.

AutoClip

  • Default mode mode="auto" (hyperparameter-free):
  • Streaming median via P² estimator and variance via Welford's algorithm.
  • Threshold: T = median + 3 * std.
  • No decay/percentile/window knobs; warmup/history gates still apply.
  • Based on AutoClip by Seetharaman et al. (MLSP 2020) arXiv:2007.14469.
  • Custom percentile mode mode="percentile":
  • Choose history as either EMA quantile estimator (history="ema") or rolling window (history="window").
  • Parameters: percentile (default 95.0), ema_decay (0.99), window_size (1024).
  • Warmup/history gates: warmup_steps=100, min_history=50.

When to use / strengths:

  • Auto mode: best default when you don't want to tune; robust across tasks and datasets, particularly when gradient norms are heavy‑tailed or non‑stationary. Conservative clipping reduces the chance of over‑clipping early in training.
  • Percentile mode: choose an explicit tail risk (e.g., p95/p99). Prefer history="ema" for low‑memory, fast adaptation on non‑stationary streams; prefer history="window" for stronger outlier robustness on stationary regimes.

Examples:

# Hyperparameter-free auto mode (recommended default)
clipper = AutoClip()

# Custom percentile mode with EMA quantile
clipper = AutoClip(mode="percentile", percentile=95.0, history="ema", ema_decay=0.99)

# Custom percentile mode with rolling window
clipper = AutoClip(mode="percentile", percentile=95.0, history="window", window_size=1024)

Recommended default: AutoClip() (auto mode).

AGC (NFNets-style)

  • Threshold depends on weight norm: T = clipping * (||w|| + eps).
  • Scale: min(1, T / (||g|| + eps)) per group.
  • Parameters: clipping=0.01 (default), exclude_bias_bn=True, scope per_layer by default.
  • Exclusions: skip parameters with dimensionality ≤ 1 when exclude_bias_bn is enabled.

Recommended default: AGC(clipping=0.01, exclude_bias_bn=True, scope="per_layer").

When to use / strengths:

  • Ideal when gradient magnitudes naturally scale with parameter magnitudes (e.g., NFNets and many CNNs). No history or warmup is required, so behavior is stable from step 1 and deterministic across processes.
  • Per‑layer scope pairs well with the ratio‑based rule, offering scale‑invariant clipping that preserves signal for well‑scaled layers while taming outliers.
  • Use exclude_bias_bn=True to avoid over‑regularizing biases and affine scale parameters.

Z-Score clipping (EMA mean/std)

  • Tracks EMA of mean and variance of norms per scope.
  • Threshold: T = m + zmax * std, clip when ||g|| > T.
  • Parameters: zmax=3.0, ema_decay=0.99.

Recommended default: ZScoreClip(zmax=3.0, ema_decay=0.99).

When to use / strengths:

  • Good when gradient norms are roughly unimodal and you primarily want to suppress rare spikes; zmax gives an intuitive, unitless control on aggressiveness.
  • EMA adapts smoothly to drifting scales, making it suitable for long runs and curriculum schedules without manual retuning.

zmax: what it means and how to set it

  • zmax is the tolerance in standard deviations above the EMA mean. The clip threshold is m + zmax * std, so zmax = 3.0 means “allow up to ~3 standard deviations above the recent average before clipping.”
  • Higher zmax → fewer clips (more tolerant). Lower zmax → more clips (more aggressive).
  • Practical starting points:
  • Start with zmax=3.0 (default) for most tasks.
  • If you see frequent large spikes or instability, try zmax=2.0–2.5.
  • If training seems over‑clipped (signal too damped) or gradients are heavy‑tailed, try zmax=3.5–4.0.
  • Simple tuning loop: if clipping triggers on a large fraction of steps for stable runs, increase zmax; if spikes routinely exceed the threshold and hurt stability, decrease zmax.
  • Note on ema_decay: larger values (e.g., 0.99) adapt more slowly but are smoother; if your gradient scale shifts quickly, consider a slightly smaller decay (e.g., 0.95–0.98) so the mean/std track faster.

Safety and stability

  • Warmup/min-history gates prevent premature clipping.
  • NaN/Inf guards drop non-finite observations when enabled (guard_nans=True).
  • Thresholds are lower-bounded by eps.