Hi,
I'm trying to run extra-large systems (more than 5,120 residues) and, as advised in #31, I tried to use unified memory.
I get this kind of error during inference (details in out.txt):
2025-05-27 11:26:29.427895: W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below -20.61GiB (-22125470549 bytes) by rematerialization; only reduced to 28.35GiB (30443270400 bytes), down from 28.35GiB (30443270400 bytes) originally
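For anyone reading the numbers in that warning, here is a small sketch that pulls the three byte counts out of the message and converts them to GiB. The helper name `parse_remat_warning` is made up for illustration and is not part of XLA or AlphaFold 3; the regex simply assumes the message keeps the `(<n> bytes)` pattern shown above.

```python
import re

GIB = 1024 ** 3  # bytes per GiB

# Message text copied from the warning above (timestamp and source file dropped).
WARNING = (
    "Can't reduce memory use below -20.61GiB (-22125470549 bytes) by "
    "rematerialization; only reduced to 28.35GiB (30443270400 bytes), "
    "down from 28.35GiB (30443270400 bytes) originally"
)


def parse_remat_warning(line):
    """Hypothetical helper: return (limit, reduced, original) in bytes."""
    values = [int(v) for v in re.findall(r"\((-?\d+) bytes\)", line)]
    if len(values) != 3:
        raise ValueError("unexpected warning format")
    return tuple(values)


limit, reduced, original = parse_remat_warning(WARNING)
saved = original - reduced      # bytes actually freed by rematerialization
over_budget = reduced - limit   # distance between the result and the target

print(f"target limit : {limit / GIB:.2f} GiB")
print(f"after remat  : {reduced / GIB:.2f} GiB")
print(f"freed        : {saved / GIB:.2f} GiB")
print(f"over target  : {over_budget / GIB:.2f} GiB")
```

Two things stand out: the target limit is negative, and rematerialization freed nothing (the "reduced to" and "originally" values are identical).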
I have tried disabling rematerialization and am waiting for the job to finish. These are the specifics of the node where the job is running, which is one of the nodes I usually use (sorry for the bad formatting):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 PCIe On | 00000000:17:00.0 Off | 0 |
| N/A 68C P0 309W / 310W | 60332MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 533459 C /alphafold3_venv/bin/python 60320MiB |
+-----------------------------------------------------------------------------------------+
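As a rough sketch of what these figures imply for unified memory, the arithmetic below takes the used and total values from the nvidia-smi output above and the `XLA_CLIENT_MEM_FRACTION="3.2"` value from the command in this post. It assumes the fraction is applied to the total device memory, so anything beyond the physical GPU memory would have to be served from host RAM; that interpretation is an assumption and worth double-checking against the JAX and AlphaFold 3 documentation.

```python
MIB_PER_GIB = 1024

# Values copied from the nvidia-smi output above.
used_mib = 60332
total_mib = 81559

# Value of XLA_CLIENT_MEM_FRACTION in the command from this post.
mem_fraction = 3.2

free_mib = total_mib - used_mib

# ASSUMPTION: the fraction scales total device memory; the excess over the
# physical GPU memory would be backed by host RAM via unified memory.
pool_mib = mem_fraction * total_mib
host_spill_mib = pool_mib - total_mib

print(f"free on GPU now     : {free_mib / MIB_PER_GIB:.2f} GiB")
print(f"addressable pool    : {pool_mib / MIB_PER_GIB:.2f} GiB")
print(f"backed by host RAM  : {host_spill_mib / MIB_PER_GIB:.2f} GiB")
```

If that reading is right, the node needs enough free host RAM to cover the last figure before the full pool can actually be used.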
I am also running tests on an A100 at the moment to see whether it's a compatibility issue, as mentioned in #341.
My current flags for my usual jobs are:
singularity exec \
--nv \
--bind $AF3_CODE_DIR:$AF3_CODE_DIR \
--env XLA_FLAGS="--xla_disable_hlo_passes=custom-kernel-fusion-rewriter" \
--env XLA_PYTHON_CLIENT_PREALLOCATE="false" \
--env TF_FORCE_UNIFIED_MEMORY="true" \
--env XLA_CLIENT_MEM_FRACTION="3.2" \
--env JAX_PLATFORMS="cuda" \
--bind $AF3_INPUT_DIR:/root/af_input \
--bind $AF3_OUTPUT_DIR:/root/af_output \
--bind $AF3_MODEL_PARAMETERS_DIR:/root/models \
--bind $AF3_DATABASES_DIR:/root/public_databases \
$AF3_IMAGE \
python $AF3_CODE_DIR/run_alphafold.py \
--json_path=/root/af_input/Nitrosopulimus_6mer.json \
--model_dir=/root/models \
--db_dir=/root/public_databases \
--output_dir=/root/af_output \
--flash_attention_implementation=xla
I am trying to debug as recommended here: #341 (comment)
However any additional insight would be appreciated, as I am very much a beginner in comp chem and way out of my depth.
Thanks in advance!