Describe the bug
The initial_global_step is calculated by the following code:
# diffusers/examples/controlnet/train_controlnet.py line 980-982
global_step = int(path.split("-")[1])
initial_global_step = global_step * args.gradient_accumulation_steps
With args.gradient_accumulation_steps > 1, initial_global_step differs from the step at which training was interrupted.
For example, if I resume from checkpoint-10 with gradient_accumulation_steps=4, the progress bar starts at step 40 instead of 10, while the total step count stays correct.
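The arithmetic can be reproduced in isolation. The snippet below is a minimal sketch, not the training script itself: the helper names `parse_checkpoint_step` and `progress_bar_start` and the `scale` flag are hypothetical and exist only for illustration. It assumes the number in the checkpoint directory name already counts optimization steps, which is consistent with the total of 12500 reported in the logs (50000 batches / 4 accumulation steps).

```python
import math


def parse_checkpoint_step(path: str) -> int:
    """Return the step encoded in a directory name such as 'checkpoint-10'."""
    return int(path.split("-")[1])


def progress_bar_start(path: str, gradient_accumulation_steps: int, scale: bool) -> int:
    """Hypothetical helper: compute the initial progress bar position.

    scale=True mirrors the calculation quoted above (multiply by the
    accumulation factor); scale=False uses the checkpoint step unchanged.
    """
    global_step = parse_checkpoint_step(path)
    if scale:
        return global_step * gradient_accumulation_steps
    return global_step


# Values taken from the reproduction: checkpoint-10, accumulation factor 4.
reported = progress_bar_start("checkpoint-10", 4, scale=True)
expected = progress_bar_start("checkpoint-10", 4, scale=False)

# Total optimization steps for one epoch of 50000 batches, as in the logs.
total_steps = math.ceil(50000 / 4)

print(f"reported start: {reported}/{total_steps}")
print(f"expected start: {expected}/{total_steps}")
```

Under that assumption, using the checkpoint step unchanged would put the progress bar at 10 of 12500 on resume. Whether this is the right fix for the script depends on how the rest of the resume logic uses initial_global_step.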
Reproduction
Run the following script, adapted from https://huggingface.co/docs/diffusers/training/controlnet, in examples/controlnet. Interrupt training after step 10, then run the script again to resume from checkpoint-10.
export MODEL_DIR="runwayml/stable-diffusion-v1-5"
export OUTPUT_DIR="output/sd15_control"
accelerate launch train_controlnet.py \
--pretrained_model_name_or_path=$MODEL_DIR \
--output_dir=$OUTPUT_DIR \
--dataset_name=fusing/fill50k \
--resolution=512 \
--learning_rate=1e-5 \
--validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
--validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--use_8bit_adam \
--enable_xformers_memory_efficient_attention \
--set_grads_to_none \
--checkpointing_steps 10 \
--resume_from_checkpoint latest
Logs
Logs when resuming:
/root/miniconda3/envs/control/lib/python3.9/site-packages/accelerate/accelerator.py:258: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
warnings.warn(
05/17/2023 12:58:05 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: no
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'prediction_type', 'dynamic_thresholding_ratio', 'thresholding', 'sample_max_value', 'variance_type'} was not found in config. Values will be initialized to default values.
{'scaling_factor'} was not found in config. Values will be initialized to default values.
{'projection_class_embeddings_input_dim', 'use_linear_projection', 'conv_out_kernel', 'time_embedding_act_fn', 'time_embedding_dim', 'num_class_embeds', 'mid_block_only_cross_attention', 'timestep_post_act', 'mid_block_type', 'cross_attention_norm', 'resnet_out_scale_factor', 'time_embedding_type', 'conv_in_kernel', 'dual_cross_attention', 'upcast_attention', 'resnet_skip_time_act', 'resnet_time_scale_shift', 'addition_embed_type_num_heads', 'encoder_hid_dim', 'addition_embed_type', 'time_cond_proj_dim', 'class_embeddings_concat', 'class_embed_type', 'only_cross_attention'} was not found in config. Values will be initialized to default values.
05/17/2023 12:58:09 - INFO - __main__ - Initializing controlnet weights from unet
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /root/miniconda3/envs/control/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/root/miniconda3/envs/control/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/root/miniconda3/envs/control/lib/libcudart.so.11.0'), PosixPath('/root/miniconda3/envs/control/lib/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /root/miniconda3/envs/control/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /root/miniconda3/envs/control/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
05/17/2023 12:58:20 - WARNING - datasets.builder - No config specified, defaulting to: fill50k/default
05/17/2023 12:58:20 - WARNING - datasets.builder - Found cached dataset fill50k (/root/.cache/huggingface/datasets/fusing___fill50k/default/0.0.2/f23b778406682a796a540934e7163495e1b8a88fefc76ca08f7e5a79ddcd668b)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 179.53it/s]
05/17/2023 12:58:23 - INFO - __main__ - ***** Running training *****
05/17/2023 12:58:23 - INFO - __main__ - Num examples = 50000
05/17/2023 12:58:23 - INFO - __main__ - Num batches each epoch = 50000
05/17/2023 12:58:23 - INFO - __main__ - Num Epochs = 1
05/17/2023 12:58:23 - INFO - __main__ - Instantaneous batch size per device = 1
05/17/2023 12:58:23 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
05/17/2023 12:58:23 - INFO - __main__ - Gradient Accumulation steps = 4
05/17/2023 12:58:23 - INFO - __main__ - Total optimization steps = 12500
Resuming from checkpoint checkpoint-10
05/17/2023 12:58:23 - INFO - accelerate.accelerator - Loading states from output/sd15_control/checkpoint-10
05/17/2023 12:58:26 - INFO - accelerate.checkpointing - All model weights loaded successfully
05/17/2023 12:58:27 - INFO - accelerate.checkpointing - All optimizer states loaded successfully
05/17/2023 12:58:27 - INFO - accelerate.checkpointing - All scheduler states loaded successfully
05/17/2023 12:58:27 - INFO - accelerate.checkpointing - All random states loaded successfully
05/17/2023 12:58:27 - INFO - accelerate.accelerator - Loading in 0 custom states
Steps: 0%|▎ | 40/12500 [00:00<?, ?it/s]
System Info
diffusers version: 0.17.0.dev0
- Platform: Linux-4.4.0-116-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- PyTorch version (GPU?): 1.12.1 (True)
- Huggingface_hub version: 0.14.1
- Transformers version: 4.29.1
- Accelerate version: 0.19.0
- xFormers version: 0.0.19
- Using GPU in script?:
- Using distributed or parallel set-up in script?: