
Unified memory for extra large systems - rematerialization problems #432

@GorgetteGorg

Description


Hi,

I'm trying to compute extra-large systems (more than 5,120 residues) and, as advised in #31, I tried to use unified memory.

I get this kind of error during inference (details in out.txt):

2025-05-27 11:26:29.427895: W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below -20.61GiB (-22125470549 bytes) by rematerialization; only reduced to 28.35GiB (30443270400 bytes), down from 28.35GiB (30443270400 bytes) originally
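As a sanity check, the byte counts in the warning do match the GiB figures (1 GiB = 2^30 bytes). If I understand the message correctly, the negative target presumably means the memory limit minus the un-rematerializable live set is already below zero, i.e. the program cannot fit on the device no matter how much rematerialization is applied:

```python
# Sanity-check the byte counts from the XLA warning against the GiB figures.
def to_gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30

# "Can't reduce memory use below -20.61GiB (-22125470549 bytes)"
assert round(to_gib(-22125470549), 2) == -20.61
# "only reduced to 28.35GiB (30443270400 bytes)"
assert round(to_gib(30443270400), 2) == 28.35
```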

I have tried disabling rematerialization and am waiting for that job to finish. These are the specifics of the node where the job is running; it is one of the nodes I usually use (sorry for the bad formatting):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 PCIe               On  |   00000000:17:00.0 Off |                    0 |
| N/A   68C    P0            309W /  310W |   60332MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    533459      C   /alphafold3_venv/bin/python                 60320MiB |
+-----------------------------------------------------------------------------------------+

I am also running tests on an A100 at the moment to see whether it is a compatibility issue, as mentioned in #341.

My flags at the moment for my usual jobs are:

singularity exec \
     --nv \
     --bind $AF3_CODE_DIR:$AF3_CODE_DIR \
     --env XLA_FLAGS="--xla_disable_hlo_passes=custom-kernel-fusion-rewriter" \
     --env XLA_PYTHON_CLIENT_PREALLOCATE="false" \
     --env TF_FORCE_UNIFIED_MEMORY="true" \
     --env XLA_CLIENT_MEM_FRACTION="3.2" \
     --env JAX_PLATFORMS="cuda" \
     --bind $AF3_INPUT_DIR:/root/af_input \
     --bind $AF3_OUTPUT_DIR:/root/af_output \
     --bind $AF3_MODEL_PARAMETERS_DIR:/root/models \
     --bind $AF3_DATABASES_DIR:/root/public_databases \
     $AF3_IMAGE \
     python $AF3_CODE_DIR/run_alphafold.py \
     --json_path=/root/af_input/Nitrosopulimus_6mer.json \
     --model_dir=/root/models \
     --db_dir=/root/public_databases \
     --output_dir=/root/af_output \
     --flash_attention_implementation=xla
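For the rematerialization-disabled run mentioned above, what I changed was the XLA_FLAGS line, appending a pass name to the list already being disabled. Note the pass name `rematerialization` is my assumption, inferred from the hlo_rematerialization.cc path in the warning; I have not confirmed it is the registered name of the pass:

```shell
# Variant of the XLA_FLAGS line from the command above. "rematerialization"
# is an ASSUMED pass name (guessed from hlo_rematerialization.cc in the log);
# everything else in the invocation stays unchanged.
export XLA_FLAGS="--xla_disable_hlo_passes=custom-kernel-fusion-rewriter,rematerialization"
```

If I understand XLA_CLIENT_MEM_FRACTION correctly, setting it to 3.2 on an 80 GiB card should let XLA address roughly 3.2 × 80 ≈ 256 GiB through unified memory, spilling to host RAM as needed.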

I am also working through the debugging steps recommended in #341 (comment).

However, any additional insight would be appreciated, as I am very much a beginner in computational chemistry and way out of my depth.

Thanks in advance!
