ControlNet resume from wrong step with gradient_accumulation_steps #3467

@exhyy

Describe the bug

initial_global_step is calculated by the following code:

# diffusers/examples/controlnet/train_controlnet.py line 980-982
global_step = int(path.split("-")[1])

initial_global_step = global_step * args.gradient_accumulation_steps

With args.gradient_accumulation_steps > 1, initial_global_step differs from the step at which training was interrupted.
For example, if I resume from checkpoint-10 with gradient_accumulation_steps=4, the progress bar starts at step 40 instead of 10, while the total number of steps stays correct.
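
The mismatch can be reproduced in isolation. The sketch below is not the training script: `parse_checkpoint_step`, `reported_initial_step`, and `expected_initial_step` are hypothetical helpers written for illustration, and it assumes the number in the checkpoint directory name already counts optimizer updates, as the behavior described above suggests.

```python
def parse_checkpoint_step(path: str) -> int:
    # Same parsing as the quoted snippet: "checkpoint-10" -> 10.
    return int(path.split("-")[1])


def reported_initial_step(path: str, gradient_accumulation_steps: int) -> int:
    # Mirrors the quoted line: the parsed step is multiplied by the
    # accumulation factor before being used as the starting step.
    return parse_checkpoint_step(path) * gradient_accumulation_steps


def expected_initial_step(path: str) -> int:
    # Assumption: the checkpoint number is already in optimizer-update units,
    # so the progress bar should start at the parsed value unchanged.
    return parse_checkpoint_step(path)


if __name__ == "__main__":
    print(reported_initial_step("checkpoint-10", 4))  # value the bar starts at
    print(expected_initial_step("checkpoint-10"))     # value it should start at
```

With gradient_accumulation_steps=1 both helpers return the same value, which is why the problem only shows up when accumulation is enabled.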

Reproduction

Run the following script, which is modified from https://huggingface.co/docs/diffusers/training/controlnet, in examples/controlnet. Interrupt the training after step 10, then run the script again to resume from checkpoint-10.

export MODEL_DIR="runwayml/stable-diffusion-v1-5"
export OUTPUT_DIR="output/sd15_control"

accelerate launch train_controlnet.py \
 --pretrained_model_name_or_path=$MODEL_DIR \
 --output_dir=$OUTPUT_DIR \
 --dataset_name=fusing/fill50k \
 --resolution=512 \
 --learning_rate=1e-5 \
 --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
 --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
 --train_batch_size=1 \
 --gradient_accumulation_steps=4 \
 --gradient_checkpointing \
 --use_8bit_adam \
 --enable_xformers_memory_efficient_attention \
 --set_grads_to_none \
 --checkpointing_steps 10 \
 --resume_from_checkpoint latest

Logs

Logs when resuming:

/root/miniconda3/envs/control/lib/python3.9/site-packages/accelerate/accelerator.py:258: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
  warnings.warn(
05/17/2023 12:58:05 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: no

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'prediction_type', 'dynamic_thresholding_ratio', 'thresholding', 'sample_max_value', 'variance_type'} was not found in config. Values will be initialized to default values.
{'scaling_factor'} was not found in config. Values will be initialized to default values.
{'projection_class_embeddings_input_dim', 'use_linear_projection', 'conv_out_kernel', 'time_embedding_act_fn', 'time_embedding_dim', 'num_class_embeds', 'mid_block_only_cross_attention', 'timestep_post_act', 'mid_block_type', 'cross_attention_norm', 'resnet_out_scale_factor', 'time_embedding_type', 'conv_in_kernel', 'dual_cross_attention', 'upcast_attention', 'resnet_skip_time_act', 'resnet_time_scale_shift', 'addition_embed_type_num_heads', 'encoder_hid_dim', 'addition_embed_type', 'time_cond_proj_dim', 'class_embeddings_concat', 'class_embed_type', 'only_cross_attention'} was not found in config. Values will be initialized to default values.
05/17/2023 12:58:09 - INFO - __main__ - Initializing controlnet weights from unet

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /root/miniconda3/envs/control/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/root/miniconda3/envs/control/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/root/miniconda3/envs/control/lib/libcudart.so.11.0'), PosixPath('/root/miniconda3/envs/control/lib/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /root/miniconda3/envs/control/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /root/miniconda3/envs/control/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
05/17/2023 12:58:20 - WARNING - datasets.builder - No config specified, defaulting to: fill50k/default
05/17/2023 12:58:20 - WARNING - datasets.builder - Found cached dataset fill50k (/root/.cache/huggingface/datasets/fusing___fill50k/default/0.0.2/f23b778406682a796a540934e7163495e1b8a88fefc76ca08f7e5a79ddcd668b)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 179.53it/s]
05/17/2023 12:58:23 - INFO - __main__ - ***** Running training *****
05/17/2023 12:58:23 - INFO - __main__ -   Num examples = 50000
05/17/2023 12:58:23 - INFO - __main__ -   Num batches each epoch = 50000
05/17/2023 12:58:23 - INFO - __main__ -   Num Epochs = 1
05/17/2023 12:58:23 - INFO - __main__ -   Instantaneous batch size per device = 1
05/17/2023 12:58:23 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
05/17/2023 12:58:23 - INFO - __main__ -   Gradient Accumulation steps = 4
05/17/2023 12:58:23 - INFO - __main__ -   Total optimization steps = 12500
Resuming from checkpoint checkpoint-10
05/17/2023 12:58:23 - INFO - accelerate.accelerator - Loading states from output/sd15_control/checkpoint-10
05/17/2023 12:58:26 - INFO - accelerate.checkpointing - All model weights loaded successfully
05/17/2023 12:58:27 - INFO - accelerate.checkpointing - All optimizer states loaded successfully
05/17/2023 12:58:27 - INFO - accelerate.checkpointing - All scheduler states loaded successfully
05/17/2023 12:58:27 - INFO - accelerate.checkpointing - All random states loaded successfully
05/17/2023 12:58:27 - INFO - accelerate.accelerator - Loading in 0 custom states
Steps:   0%|▎                                                                                                  | 40/12500 [00:00<?, ?it/s]
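
The numbers in the log above can be cross-checked with a few lines of arithmetic. This is a sketch using only values printed in the log and the reproduction command; the variable names are illustrative and not taken from the training script, and the ceiling division is an assumption about how the total is derived.

```python
import math

# Values copied from the log and the reproduction command above.
num_batches_per_epoch = 50000
gradient_accumulation_steps = 4
num_epochs = 1
checkpoint_step = 10  # from "Resuming from checkpoint checkpoint-10"

# One optimizer update happens every `gradient_accumulation_steps` batches.
updates_per_epoch = math.ceil(num_batches_per_epoch / gradient_accumulation_steps)
total_optimization_steps = updates_per_epoch * num_epochs

# The progress bar total is in optimizer updates, so its starting value
# should be in the same unit as the checkpoint number.
bar_start_observed = checkpoint_step * gradient_accumulation_steps
bar_start_expected = checkpoint_step

print(total_optimization_steps, bar_start_observed, bar_start_expected)
```

The total matches the "Total optimization steps = 12500" line, while the observed start of 40 is the checkpoint step scaled by the accumulation factor.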

System Info

  • diffusers version: 0.17.0.dev0
  • Platform: Linux-4.4.0-116-generic-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • PyTorch version (GPU?): 1.12.1 (True)
  • Huggingface_hub version: 0.14.1
  • Transformers version: 4.29.1
  • Accelerate version: 0.19.0
  • xFormers version: 0.0.19
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Labels: bug (Something isn't working)