Fix #380:issue 380 dlio mpi rank#396
Conversation
…nt mpirun launch test_dlio_mpi.py selected its endpoint via int(os.environ['OMPI_COMM_WORLD_RANK']), which raises KeyError whenever the script is not launched by OpenMPI mpirun (plain python, MPICH, srun). Switch to the portable comm.Get_rank() (matching test_mpi_basic.py) and read the OMPI var only for diagnostics. Update tests/README.md to launch with mpirun and document --oversubscribe for under-provisioned hosts. Add regression test.
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
|
Please review this PR @FileSystemGuy @russfellows @dslik |
russfellows
left a comment
There was a problem hiding this comment.
PR #396 — Fix #380: portable MPI rank in test_dlio_mpi.py + correct DLIO launch commands
Author: idevasena | +340 / -15 | 3 files | Status: blocked (needs reviews)
Summary: Fixes an integration test that crashed with KeyError: 'OMPI_COMM_WORLD_RANK' when run outside OpenMPI, and corrects misleading docs and command examples.
Assessment: ✅ LGTM — straightforward and correct. - Root cause analysis is accurate: os.environ['OMPI_COMM_WORLD_RANK'] is OpenMPI-specific; comm.Get_rank() is the portable way. - The os.environ.get(..., 'not set') fallback for display is the right pattern. - The size-1 guard with helpful message is good UX. - New regression test test_issue_380_dlio_mpi_rank.py doesn't require a live MPI runtime — correct approach. - Docs fix in tests/README.md (add mpirun -np 8, add --oversubscribe note) is correct and needed. - One minor concern: test_dlio_mpi.py still has a hardcoded python path at line sys.path.insert(0, '/home/eval/Documents/Code/s3dlio/python'). That's a pre-existing issue, not introduced here, but worth noting.
Fix #380: portable MPI rank in test_dlio_mpi.py + correct DLIO launch commands
Fixes #380
Summary
The integration test
tests/integration/test_dlio_mpi.py, run as documented, crashed before doing anything useful, and the follow-ondlio_benchmarkcommand it printed was itself invalid. This PR fixes both the test and the documentedcommands, and adds a regression test.
Root causes
1. KeyError on non-OpenMPI launchers. The test selected its endpoint with
int(os.environ['OMPI_COMM_WORLD_RANK']). That env var is OpenMPI-specific and is unset under plainpython, MPICH/mpiexec, or Slurmsrun, so the test died with a bareKeyError: 'OMPI_COMM_WORLD_RANK'. The siblingtest_mpi_basic.pyalready does this correctly viacomm.Get_rank()and a safeos.environ.get(...).2. Docs launched an MPI program without MPI.
tests/README.mdtold users to runpython tests/integration/test_dlio_mpi.py, which yields a single rank (not a meaningful multi-endpoint test) and triggers the KeyError above. It also gave no guidance for the "not enough slots available" error that OpenMPI raises when-npexceeds the host core count.3. Invalid
dlio_benchmarkinvocation. Both the test's printed "Next steps" and the trailing comment inconfigs/dlio/workload/multi_endpoint_mpi.yamlinstructeddlio_benchmark --config multi_endpoint_mpi.yaml.dlio_benchmarkis a Hydra app and rejects--configas an ambiguous abbreviation of--config-path/--config-name/--config-dir. The correct form selects the workload via a Hydra override.Changes
tests/integration/test_dlio_mpi.py— select the endpoint fromrank = comm.Get_rank()instead of the OpenMPI env var (read now only for display, viaos.environ.get(..., 'not set')); add asize == 1guard that prints the correctmpiruninvocation; correct the printed "Next steps" command todlio_benchmark workload=multi_endpoint_mpi --config-dir=configs/dlio.tests/README.md— launch the test withmpirun -np 8 ...and document--oversubscribefor hosts with fewer cores than-np.configs/dlio/workload/multi_endpoint_mpi.yaml— fix the how-to-run comment to the valid Hydraworkload=... --config-dir=...form, with a note on why--config <file>fails.tests/integration/test_issue_380_dlio_mpi_rank.py(new) — regression tests that reproduce the originalKeyError, pin rank-based round-robin endpoint selection, and guard (via source checks) that neither the hard env-var subscriptnor the ambiguous
dlio_benchmark --configpattern can return. No live MPI runtime or mpi4py required.The corrected command form matches the repo's own harness in
mlpstorage_py/benchmarks/dlio.pyand the siblingconfigs/dlio/workload/llama3_8b_checkpoint.yaml.Validation
pytest tests/integration/test_issue_380_dlio_mpi_rank.py -v— 15 passed.python tests/integration/test_dlio_mpi.py(no launcher) — exits 0 with a launcher hint instead of crashing.mpirun -np 8 python tests/integration/test_dlio_mpi.py— ranks 0–7 map round-robin to endpoints 0–3; rank N selects endpoint[N % 4].Tests