
Unified memory for extra large systems - rematerialization problems #432

@GorgetteGorg

Description


Hi,

I'm trying to compute extra-large systems (more than 5,120 residues) and, as advised in #31, I tried to use unified memory.

I get this kind of error during inference (details in out.txt):

2025-05-27 11:26:29.427895: W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below -20.61GiB (-22125470549 bytes) by rematerialization; only reduced to 28.35GiB (30443270400 bytes), down from 28.35GiB (30443270400 bytes) originally
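As a sanity check, the byte counts in the warning do match the GiB figures (1 GiB = 2^30 bytes). If I understand the message correctly, the negative target presumably means the memory limit minus the un-rematerializable live set is already below zero, i.e. the program cannot fit on the device no matter how much rematerialization is applied:

```python
# Sanity-check the byte counts from the XLA warning against the GiB figures.
def to_gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30

# "Can't reduce memory use below -20.61GiB (-22125470549 bytes)"
assert round(to_gib(-22125470549), 2) == -20.61
# "only reduced to 28.35GiB (30443270400 bytes)"
assert round(to_gib(30443270400), 2) == 28.35
```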

I have tried disabling rematerialization and am waiting for that job to finish. These are the specifics of the node where the job is running; it is one of the nodes I usually use (sorry for the bad formatting):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               On  |   00000000:17:00.0 Off |                    0 |
| N/A   68C    P0            309W /  310W |   60332MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    533459      C   /alphafold3_venv/bin/python                 60320MiB |
+-----------------------------------------------------------------------------------------+

I am also running tests on an A100 at the moment to see whether it is a compatibility issue, as mentioned in #341.

My flags at the moment for my usual jobs are:

singularity exec \
     --nv \
     --bind $AF3_CODE_DIR:$AF3_CODE_DIR \
     --env XLA_FLAGS="--xla_disable_hlo_passes=custom-kernel-fusion-rewriter" \
     --env XLA_PYTHON_CLIENT_PREALLOCATE="false" \
     --env TF_FORCE_UNIFIED_MEMORY="true" \
     --env XLA_CLIENT_MEM_FRACTION="3.2" \
     --env JAX_PLATFORMS="cuda" \
     --bind $AF3_INPUT_DIR:/root/af_input \
     --bind $AF3_OUTPUT_DIR:/root/af_output \
     --bind $AF3_MODEL_PARAMETERS_DIR:/root/models \
     --bind $AF3_DATABASES_DIR:/root/public_databases \
     $AF3_IMAGE \
     python $AF3_CODE_DIR/run_alphafold.py \
     --json_path=/root/af_input/Nitrosopulimus_6mer.json \
     --model_dir=/root/models \
     --db_dir=/root/public_databases \
     --output_dir=/root/af_output \
     --flash_attention_implementation=xla
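For the rematerialization-disabled run mentioned above, what I changed was the XLA_FLAGS line, appending a pass name to the list already being disabled. Note the pass name `rematerialization` is my assumption, inferred from the hlo_rematerialization.cc path in the warning; I have not confirmed it is the registered name of the pass:

```shell
# Variant of the XLA_FLAGS line from the command above. "rematerialization"
# is an ASSUMED pass name (guessed from hlo_rematerialization.cc in the log);
# everything else in the invocation stays unchanged.
export XLA_FLAGS="--xla_disable_hlo_passes=custom-kernel-fusion-rewriter,rematerialization"
```

If I understand XLA_CLIENT_MEM_FRACTION correctly, setting it to 3.2 on an 80 GiB card should let XLA address roughly 3.2 × 80 ≈ 256 GiB through unified memory, spilling to host RAM as needed.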

I am also working through the debugging steps recommended in #341 (comment).

However, any additional insight would be appreciated, as I am very much a beginner in computational chemistry and way out of my depth.

Thanks in advance!
