Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
182 changes: 182 additions & 0 deletions .github/skills/debug-gradient-flow/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,182 @@
---
name: debug-gradient-flow
description: Diagnose gradient flow issues in training, especially for compiled models (torch.compile/make_fx). Systematically isolates which loss components (energy, force, virial) contribute gradients to which parameters, and identifies where the gradient chain breaks.
license: LGPL-3.0-or-later
metadata:
author: deepmd-kit
version: '1.0'
---

# Debugging Gradient Flow in Training

Use this method when a loss component (force, virial, energy) does not decrease during training, or when compiled model training diverges from uncompiled training.

## When to use

- A loss term (e.g. `rmse_f`, `rmse_v`) stays flat or NaN during training
- Compiled training (`enable_compile=True`) behaves differently from uncompiled
- After adding a new loss component or model output
- After changes to `make_fx` tracing, `torch.compile`, or `autograd.grad` code paths

## Method: Per-component gradient isolation

The core technique: **zero out all loss terms except one**, run `loss.backward()`, and count which model parameters receive non-zero gradients. Compare across uncompiled and compiled paths to pinpoint where gradients are lost.

### Step 1: Write a gradient probe script

Create a script that constructs a trainer, injects labels if needed, and reports per-parameter gradient status:

```python
def check_grad(trainer, label_overrides=None):
trainer.wrapper.train()
trainer.optimizer.zero_grad(set_to_none=True)
inp, lab = trainer.get_data(is_train=True)
lr = trainer.scheduler.get_last_lr()[0]

# Override labels to isolate a single loss component
if label_overrides:
lab.update(label_overrides)

_, loss, more_loss = trainer.wrapper(**inp, cur_lr=lr, label=lab)
loss.backward()

status = {}
for name, p in trainer.wrapper.named_parameters():
if p.requires_grad:
has_grad = p.grad is not None and p.grad.abs().sum() > 0
status[name] = has_grad
return status
```

### Step 2: Run for each loss component in isolation

Test each loss component separately by zeroing out the others:

```python
scenarios = {
"energy only": {"find_force": 0.0, "find_virial": 0.0},
"force only": {"find_energy": 0.0, "find_virial": 0.0},
"virial only": {
"find_energy": 0.0,
"find_force": 0.0,
"virial": torch.randn(nframes, 9, ...), # inject if data lacks virial
"find_virial": 1.0,
},
"all losses": {
"virial": torch.randn(nframes, 9, ...),
"find_virial": 1.0,
},
}
```

If training data lacks virial labels, inject synthetic ones — the numerical values don't matter, only gradient flow matters.

### Step 3: Compare compiled vs uncompiled

Run each scenario for both compiled and uncompiled trainers. Present results as a table:

```
Uncompiled Compiled
energy only: 22/22 22/22
force only: 20/22 16/22 <-- problem
virial only: 20/22 16/22 <-- problem
all losses: 22/22 22/22 <-- OK in practice
```
Comment thread
iProzd marked this conversation as resolved.

Key interpretations:

- **Same count, both paths**: gradient flow is correct
- **Compiled < Uncompiled**: `make_fx` or `torch.compile` breaks some gradient paths
- **0 grads in compiled**: catastrophic failure (e.g. wrong `create_graph`, wrong backend)
- **"all losses" is OK but isolated isn't**: the missing grads are covered by other loss terms; may be acceptable

### Step 4: Identify affected parameters

When compiled has fewer grads, print the per-parameter diff:

```python
print(f"{'Parameter':<60} {'Uncompiled':>10} {'Compiled':>10}")
for name in sorted(status_uncompiled):
uc = "GRAD" if status_uncompiled[name] else "-"
cc = "GRAD" if status_compiled[name] else "-"
marker = " <-- DIFF" if uc != cc else ""
print(f"{name:<60} {uc:>10} {cc:>10}{marker}")
```

This tells you exactly which layers lose gradients and helps locate the broken link in the computation graph.

### Step 5: Bisect the cause

If compiled has fewer grads, test these layers in order:

| Layer | What to try | What it tests |
| ------------------------------------------------ | ------------------------------------------------------- | ------------------------------------------------------ |
| `make_fx` only (no `torch.compile`) | Replace `torch.compile(traced, ...)` with just `traced` | Is `make_fx` the problem or `torch.compile`? |
| Different `torch.compile` backends | Try `eager`, `aot_eager`, `inductor` | Which backend breaks gradients? |
| `model.train()` vs `model.eval()` during tracing | Toggle training mode before `make_fx` | Does `create_graph=self.training` get the wrong value? |
| `coord.requires_grad_(True)` placement | Check if coord has grad before entering compiled graph | Is the autograd entry point correct? |

```python
# Test make_fx only (no torch.compile)
traced = make_fx(fn)(ext_coord, ext_atype, nlist, mapping, fparam, aparam)
# Use traced directly instead of torch.compile(traced)

# Test different backends
for backend in ["eager", "aot_eager", "inductor"]:
compiled = torch.compile(traced, backend=backend, dynamic=False)
# ... run gradient check
```

## Common root causes

### 1. `create_graph=False` during tracing

**Symptom**: force/virial loss doesn't decrease; 0 params get grad from force/virial loss.

**Cause**: `model.eval()` before `make_fx` tracing makes `create_graph=self.training` evaluate to `False`. The `autograd.grad` that computes force is traced without graph creation, so the force tensor is detached from model parameters.

**Fix**: `model.train()` before `make_fx` tracing.

**Location**: `_trace_and_compile` in `deepmd/pt_expt/train/training.py`

### 2. `torch.compile` inductor backend kills second-order gradients

**Symptom**: force/virial loss doesn't decrease; 0 params get grad with inductor, but `eager`/`aot_eager` work fine.

**Cause**: The inductor backend's graph lowering doesn't support backward through `make_fx`-decomposed `autograd.grad` ops.

**Fix**: Default to `aot_eager` backend.

### 3. Ghost force contributions discarded

**Symptom**: force values differ between compiled and uncompiled models.

**Cause**: Using `extended_force[:, :nloc, :]` (slice) instead of scatter-summing ghost atom contributions back to local atoms via `mapping`.

**Fix**: `torch.zeros(...).scatter_add_(1, mapping_idx, extended_force[:, :actual_nall, :])`

### 4. Virial RMSE normalization mismatch

**Symptom**: `rmse_v` values differ between backends by a factor of `natoms`.

**Cause**: dpmodel `rmse_v = sqrt(l2_virial_loss)` missing `* atom_norm` normalization that other backends apply.

**Fix**: `rmse_v = sqrt(l2_virial_loss) * atom_norm`

## Verification

After fixing, always verify:

1. **Gradient count matches**: uncompiled and compiled should have the same number of params with grad for each isolated loss component
1. **Numerical consistency**: compiled model energy/force/virial should match uncompiled to float precision (`atol=1e-10, rtol=1e-10`)
1. **Loss decreases**: run a few training steps and verify `rmse_f` / `rmse_v` actually decrease
1. **Regression test**: add a test that catches the bug by reverting the fix and confirming the test fails

```bash
# Run compiled consistency test
python -m pytest source/tests/pt_expt/test_training.py::TestCompiledConsistency -v
Comment thread
iProzd marked this conversation as resolved.
# Run loss consistency test
python -m pytest source/tests/consistent/loss/test_ener.py -v
# Run full training smoke test
python -m pytest source/tests/pt_expt/test_training.py -v
```
Loading