API Reference
smartclip: Adaptive gradient clipping algorithms for deep learning frameworks.
This package provides a framework-agnostic core with optional thin integrations for PyTorch, TensorFlow/Keras, and JAX/Flax. Public APIs are typed and designed for fast import times and production use.
AutoClip
Bases: ClipperBase
Adaptive clipping of gradients.
Modes:
- "auto" (default): hyperparameter-free threshold using a P² median estimate (p=0.5) and Welford variance: T = median + 3 * std.
- "percentile": targets a percentile of recent gradient norms using either an EMA quantile estimator (history="ema") or a rolling window (history="window").
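The "auto" rule can be illustrated with a small, framework-free sketch. It is not the library's implementation: the exact median from the standard library stands in for the streaming P² estimate, the population variance is an assumption, and the names `RunningVariance`, `auto_threshold`, and the `eps` default are hypothetical.

```python
import math
import statistics
from typing import Sequence


class RunningVariance:
    """Welford's online algorithm for the mean and variance of a stream."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Population standard deviation (assumption); 0.0 until two values are seen.
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / self.count)


def auto_threshold(norms: Sequence[float], eps: float = 1e-8) -> float:
    """Illustrative T = median + 3 * std, lower-bounded by eps."""
    if not norms:
        return eps
    rv = RunningVariance()
    for g in norms:
        rv.update(g)
    # The exact median replaces the streaming P² estimate in this sketch.
    return max(eps, statistics.median(norms) + 3.0 * rv.std)
```

A streaming estimator such as P² avoids storing the full history, which is why the sketch above is only suitable for illustrating the formula, not for long training runs.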
observe(value, key=None)
Observe a gradient norm for a grouping key.
Backends should call this once per measured norm (global/layer/param)
before applying clipping. Values that are non-finite are ignored when
guard_nans is True.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | float | Gradient norm (L2 norm of gradients for this group). | required |
| key | Optional[Key] | Grouping key tuple. Examples: ("global",) for global scope, ("layer", "conv1") for per-layer scope, ("param", "0") for per-parameter scope. Defaults to ("global",) if None. | None |
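The two behaviours described for observe, defaulting the key to ("global",) and skipping non-finite values when guard_nans is True, can be sketched as follows. `NormRecorder` and its `history` attribute are hypothetical names used only for illustration; the real clippers keep running statistics rather than raw lists.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, ...]
GLOBAL_KEY: Key = ("global",)


class NormRecorder:
    """Toy recorder that stores observed norms per grouping key."""

    def __init__(self, guard_nans: bool = True) -> None:
        self.guard_nans = guard_nans
        self.history: Dict[Key, List[float]] = defaultdict(list)

    def observe(self, value: float, key: Optional[Key] = None) -> None:
        k = GLOBAL_KEY if key is None else key
        if self.guard_nans and not math.isfinite(value):
            return  # skip NaN and +/-inf so they cannot distort the statistics
        self.history[k].append(float(value))


rec = NormRecorder()
rec.observe(1.5)                          # stored under ("global",)
rec.observe(float("nan"))                 # ignored because guard_nans is True
rec.observe(0.7, key=("layer", "conv1"))  # stored under its own key
```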
threshold(key=None)
Return current threshold for a key (default: global).
This does not enforce warmup/min-history gates. Callers should check
can_clip() to decide whether clipping should be applied.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | Optional[Key] | Grouping key tuple (e.g., ("global",), ("layer", "conv1")). Defaults to ("global",) if None. | None |
Returns:
| Type | Description |
|---|---|
| float | Current percentile threshold, lower-bounded by eps. |
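For the "percentile" mode with a rolling window, a threshold of this kind could be computed roughly as below. This is a sketch under stated assumptions: the class name `WindowPercentile`, the `percentile` and `window` parameters, and the linear interpolation between ranks are all hypothetical choices, not the library's documented behaviour.

```python
from collections import deque
from typing import Deque


class WindowPercentile:
    """Percentile of the most recent `window` gradient norms."""

    def __init__(self, percentile: float = 95.0, window: int = 100,
                 eps: float = 1e-8) -> None:
        if not 0.0 <= percentile <= 100.0:
            raise ValueError("percentile must be in [0, 100]")
        self.percentile = percentile
        self.eps = eps
        # deque(maxlen=...) drops the oldest value automatically.
        self.values: Deque[float] = deque(maxlen=window)

    def observe(self, value: float) -> None:
        self.values.append(value)

    def threshold(self) -> float:
        if not self.values:
            return self.eps
        ordered = sorted(self.values)
        # Linear interpolation between the two closest ranks.
        pos = (len(ordered) - 1) * self.percentile / 100.0
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        frac = pos - lo
        value = ordered[lo] + frac * (ordered[hi] - ordered[lo])
        return max(self.eps, value)
```

Sorting the window on every call costs O(n log n); an EMA quantile estimator trades that cost for an approximate answer with constant memory.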
threshold_any()
Convenience method for callers that do not track keys.
Returns the global threshold if available, otherwise the threshold for the single key if exactly one exists, otherwise eps.
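The fallback order described above can be written out as a small standalone function. The `thresholds` dictionary is a hypothetical stand-in for whatever per-key state a clipper keeps internally, and the `eps` default is an assumption.

```python
from typing import Dict, Tuple

Key = Tuple[str, ...]


def threshold_any(thresholds: Dict[Key, float], eps: float = 1e-8) -> float:
    """Global threshold if present, else the only key's threshold, else eps."""
    global_key: Key = ("global",)
    if global_key in thresholds:
        return max(eps, thresholds[global_key])
    if len(thresholds) == 1:
        return max(eps, next(iter(thresholds.values())))
    # No global key and zero or several keys: there is no unambiguous choice.
    return eps
```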
AGC
Bases: ClipperBase
Adaptive Gradient Clipping (NFNets-style).
Scales gradients based on the ratio between the gradient norm and the parameter (weight)
norm of each group. Given gradient norm g and weight norm w, the target
maximum gradient norm is T = clipping * (w + eps), and the applied scale is:
scale = min(1.0, T / (g + eps))
When exclude_bias_bn=True, a simple framework-agnostic heuristic skips
parameters with dimensionality <= 1 (bias vectors and affine scale
parameters such as BatchNorm/LayerNorm gammas).
observe(grad_norm, weight_norm, key=None)
Record one AGC observation for warmup/min-history gating.
Backends should call this once per group measurement prior to applying scaling.
Non-finite values are ignored when guard_nans is True.
scale(grad_norm, weight_norm)
Compute scale factor in [0, 1] for given gradient and weight norms.
scale = min(1, target_norm(weight_norm) / (grad_norm + eps))
Non-finite inputs return 1.0 (no scaling) when guard_nans is True.
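The scale formula can be checked numerically with a short standalone sketch. The default values for `clipping` and `eps` below are illustrative assumptions, and `agc_scale` is a hypothetical free function rather than the method itself.

```python
import math


def target_norm(weight_norm: float, clipping: float = 0.01,
                eps: float = 1e-3) -> float:
    """Allowed gradient norm: clipping * (weight_norm + eps), at least eps."""
    return max(eps, clipping * (weight_norm + eps))


def agc_scale(grad_norm: float, weight_norm: float, clipping: float = 0.01,
              eps: float = 1e-3, guard_nans: bool = True) -> float:
    """Scale factor min(1, target_norm / (grad_norm + eps))."""
    if guard_nans and not (math.isfinite(grad_norm) and math.isfinite(weight_norm)):
        return 1.0  # leave gradients untouched for NaN/inf inputs
    return min(1.0, target_norm(weight_norm, clipping, eps) / (grad_norm + eps))


# A gradient that is large relative to its weights is scaled down ...
large = agc_scale(grad_norm=1.0, weight_norm=10.0)
# ... while a small gradient passes through unchanged.
small = agc_scale(grad_norm=0.05, weight_norm=10.0)
```

Because the target grows with the weight norm, layers with large weights tolerate larger gradients before any scaling is applied.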
should_exclude_param(param)
Return True if a parameter should be excluded from clipping.
Heuristic: when exclude_bias_bn is enabled, exclude parameters whose data
has dimensionality <= 1 (bias vectors and affine scales). If shape cannot
be determined, do not exclude.
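A minimal sketch of such a dimensionality heuristic is shown below. It assumes the parameter exposes either a `data` attribute or is itself array-like, and that the array reports `ndim` or `shape`; these attribute names and the `FakeArray` helper are assumptions for illustration only.

```python
from typing import Any


def should_exclude_param(param: Any, exclude_bias_bn: bool = True) -> bool:
    """Exclude parameters with dimensionality <= 1 when the flag is enabled."""
    if not exclude_bias_bn:
        return False
    data = getattr(param, "data", param)
    ndim = getattr(data, "ndim", None)
    if ndim is None:
        shape = getattr(data, "shape", None)
        if shape is None:
            return False  # shape cannot be determined: do not exclude
        try:
            ndim = len(shape)
        except TypeError:
            return False
    return ndim <= 1


class FakeArray:
    """Hypothetical stand-in for a framework tensor with only a shape."""

    def __init__(self, shape: tuple) -> None:
        self.shape = shape


bias = FakeArray((64,))             # 1-D: excluded
conv_weight = FakeArray((64, 3, 3, 3))  # 4-D: clipped as usual
```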
target_norm(weight_norm)
Return the allowed gradient norm for a given weight norm.
Computes clipping * (weight_norm + eps) and lower-bounds the result by eps.
ZScoreClip
Bases: ClipperBase
Z-score based adaptive clipping using EMA mean/variance.
Tracks exponentially-weighted moving averages of the observed gradient norm
(m) and squared norm (m2) per grouping key. The standard deviation is
computed as sqrt(max(0, m2 - m^2)). For a new observation with norm g,
the Z-score is z = (g - m) / (std + eps) and clipping is recommended when
z > zmax. Backends typically implement clipping by scaling gradients by
min(1, T / (g + eps)) where the threshold T = m + zmax * std.
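The formulas above can be traced with a single-key sketch. The class name `EmaZScore`, the smoothing parameter `alpha`, the default values, and the choice to initialise the averages from the first observation are assumptions; only the formulas for std, z, and T come from the description.

```python
import math
from typing import Optional, Tuple


class EmaZScore:
    """EMA estimates of the mean (m) and squared norm (m2) for one key."""

    def __init__(self, zmax: float = 3.0, alpha: float = 0.05,
                 eps: float = 1e-8) -> None:
        self.zmax = zmax
        self.alpha = alpha
        self.eps = eps
        self.m: Optional[float] = None
        self.m2: Optional[float] = None

    def observe(self, g: float) -> None:
        if self.m is None or self.m2 is None:
            self.m, self.m2 = g, g * g  # initialise from the first value
            return
        a = self.alpha
        self.m = (1.0 - a) * self.m + a * g
        self.m2 = (1.0 - a) * self.m2 + a * g * g

    def stats(self) -> Tuple[float, float]:
        if self.m is None or self.m2 is None:
            return (0.0, 0.0)
        # max(0, ...) guards against small negative values from rounding.
        return (self.m, math.sqrt(max(0.0, self.m2 - self.m ** 2)))

    def zscore(self, g: float) -> float:
        m, std = self.stats()
        return (g - m) / (std + self.eps)

    def threshold(self) -> float:
        m, std = self.stats()
        return max(self.eps, m + self.zmax * std)

    def scale(self, g: float) -> float:
        return min(1.0, self.threshold() / (g + self.eps))
```

With a steady stream of norms near 1.0, a spike of 10.0 yields a large z-score and a scale of about 0.1, while norms below the threshold keep a scale of 1.0.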
observe(value, key=None)
Observe a gradient norm for a grouping key.
Backends should call this once per measured norm (global/layer/param)
before applying clipping. Values that are non-finite are ignored when
guard_nans is True.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | float | Gradient norm (L2 norm of gradients for this group). | required |
| key | Optional[Key] | Grouping key tuple. Examples: ("global",) for global scope, ("layer", "conv1") for per-layer scope, ("param", "0") for per-parameter scope. Defaults to ("global",) if None. | None |
stats(key=None)
Return the current (mean, std) estimates for a key.
If the key has not been observed, or estimates are uninitialized,
returns (0.0, 0.0).
threshold(key=None)
Return current z-score threshold m + zmax * std for a key.
This does not enforce warmup/min-history gates. Callers should check
can_clip() to decide whether clipping should be applied.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | Optional[Key] | Grouping key tuple (e.g., ("global",), ("layer", "conv1")). Defaults to ("global",) if None. | None |
Returns:
| Type | Description |
|---|---|
| float | Current threshold, lower-bounded by eps. |
threshold_any()
Convenience method for callers that do not track keys.
Returns the global threshold if available, otherwise the threshold for the single key if exactly one exists, otherwise eps.
apply(model, clipper, on_metrics=None)
Apply adaptive clipping to model parameters.
Delegates to the active backend determined from the model instance.
step(model, optimizer, clipper, on_metrics=None)
Clip gradients on the model and then call optimizer.step().
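The "clip, then step" order can be illustrated without any framework. The sketch below uses plain lists as parameters and gradients and a fixed-rate SGD update in place of a real optimizer; `clip_then_step` and all of its arguments are hypothetical and do not reflect how the backends are implemented.

```python
import math
from typing import List


def clip_then_step(params: List[float], grads: List[float], threshold: float,
                   lr: float = 0.1, eps: float = 1e-8) -> float:
    """Scale grads to the threshold in place, then apply a plain SGD update."""
    norm = math.sqrt(sum(g * g for g in grads))  # global L2 norm
    scale = min(1.0, threshold / (norm + eps))
    for i in range(len(grads)):
        grads[i] *= scale             # clipping happens first ...
        params[i] -= lr * grads[i]    # ... so the update sees clipped gradients
    return scale


params = [0.0, 0.0]
grads = [3.0, 4.0]  # L2 norm is 5.0
applied = clip_then_step(params, grads, threshold=1.0)
```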
clip_context(model, optimizer=None, clipper=None, on_metrics=None)
Context manager that clips before each optimizer step for the active backend.
Defaults to AutoClip() when clipper is None.
Core
ClipperBase
Base class for adaptive gradient clippers.
This class manages configuration, numeric stability constants, and minimal state serialization. Subclasses implement algorithm-specific logic.
ParamLike
Bases: Protocol
A minimal protocol representing a trainable parameter.
The parameter stores its data (tensor/array) and an optional gradient.
TensorLike
Bases: Protocol
A minimal protocol representing a tensor/array from any framework.
Intentionally small to avoid importing optional frameworks at type-check time.
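Structural protocols of this kind can be declared with `typing.Protocol`, so that any framework's tensors and parameters match without being imported. The member names below (`shape`, `data`, `grad`) are assumptions chosen for illustration; the actual protocol members are not listed in this reference.

```python
from typing import Any, Optional, Protocol, Tuple, runtime_checkable


@runtime_checkable
class TensorLike(Protocol):
    """Anything that reports a shape counts as tensor-like in this sketch."""

    @property
    def shape(self) -> Tuple[int, ...]: ...


@runtime_checkable
class ParamLike(Protocol):
    """A parameter holding its data and an optional gradient."""

    data: Any
    grad: Optional[Any]


class ToyTensor:
    def __init__(self, shape: Tuple[int, ...]) -> None:
        self.shape = shape


class ToyParam:
    def __init__(self, data: Any, grad: Optional[Any] = None) -> None:
        self.data = data
        self.grad = grad


# Neither toy class inherits from the protocols; matching is purely structural.
weight = ToyParam(ToyTensor((2, 3)))
```

Because the check is structural, no optional framework has to be imported for type checking or for `isinstance` tests.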