Evaluation error when fine-tuning Llama 3 with eval_dataset specified and predict_with_generate enabled #5292

@Excuses123

Reminder

  • I have read the README and searched the existing issues.

System Info

The training parameters are as follows:

```yaml
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: identity,alpaca_en_demo
eval_dataset: identity
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/full/sft
report_to: tensorboard
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
do_eval: true
predict_with_generate: true
#val_size: 0.1
per_device_eval_batch_size: 1
#eval_strategy: steps
#eval_steps: 500
```
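
For context on what this configuration changes: with `predict_with_generate: true`, `Seq2SeqTrainer` evaluates by calling `model.generate()` on each batch instead of only computing the teacher-forced loss. A minimal, self-contained sketch of that code path using a tiny stand-in model (the model name, dataset, and generation kwargs here are illustrative, not from the report):

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

# Tiny stand-in model so the sketch runs on CPU; the report uses Llama-3-8B.
name = "sshleifer/tiny-gpt2"
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(name)

enc = tok(["Who are you?"])
eval_ds = Dataset.from_dict({"input_ids": enc["input_ids"],
                             "attention_mask": enc["attention_mask"],
                             "labels": enc["input_ids"]})

args = Seq2SeqTrainingArguments(
    output_dir="/tmp/sft-demo",
    per_device_eval_batch_size=1,
    predict_with_generate=True,  # evaluation now calls model.generate()
)
trainer = Seq2SeqTrainer(model=model, args=args, eval_dataset=eval_ds)

# Mirrors workflow.py:107 in the traceback below: extra generation
# kwargs are forwarded through Seq2SeqTrainer.evaluate() to generate().
print(trainer.evaluate(metric_key_prefix="eval", max_new_tokens=8))
```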

Reproduction

The error message is as follows:

```
***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-28 08:26:53,453 >> Num examples = 91
[INFO|trainer.py:3824] 2024-08-28 08:26:53,453 >> Batch size = 1
[rank2]: Traceback (most recent call last):
[rank2]: Traceback (most recent call last):
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/train.py", line 28, in <module>
[rank2]: main()
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/train.py", line 19, in main
[rank2]: run_exp()
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank2]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 107, in run_sft
[rank2]: metrics = trainer.evaluate(metric_key_prefix="eval", **gen_kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank2]: return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer.py", line 3666, in evaluate
[rank2]: output = eval_loop(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer.py", line 3857, in evaluation_loop
[rank2]: losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
[rank2]: File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 99, in prediction_step
[rank2]: loss, generated_tokens, _ = super().prediction_step( # ignore the returned labels (may be truncated)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 310, in prediction_step
[rank2]: generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]: return func(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 1989, in generate
[rank2]: result = self._sample(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2932, in _sample
[rank2]: outputs = self(**model_inputs, return_dict=True)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1141, in forward
[rank2]: outputs = self.model(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 944, in forward
[rank2]: layer_outputs = decoder_layer(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 677, in forward
[rank2]: hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]: return self._call_impl(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]: result = forward_call(*args, **kwargs)
[rank2]: File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 603, in forward
[rank2]: attn_output = torch.nn.functional.scaled_dot_product_attention(
[rank2]: RuntimeError: The expanded size of the tensor (32) must match the existing size (31) at non-singleton dimension 3. Target sizes: [1, 32, 1, 32]. Tensor sizes: [1, 1, 1, 31]
```
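
The final `RuntimeError` is a broadcast failure inside `torch.nn.functional.scaled_dot_product_attention`: the 4-D attention mask must broadcast to `[batch, heads, query_len, key_len]`, but here the mask covers 31 key positions while the KV cache holds 32, i.e. the mask is one step shorter than the cache. A standalone sketch with synthetic tensors (shapes copied from the error message, head dimension chosen arbitrarily) that triggers the same class of mismatch:

```python
import torch
import torch.nn.functional as F

# Shapes modeled on the error: 32 heads, one new query token,
# a KV cache of 32 positions, but a mask covering only 31 of them.
q = torch.randn(1, 32, 1, 64)    # [batch, heads, q_len=1, head_dim]
k = torch.randn(1, 32, 32, 64)   # [batch, heads, kv_len=32, head_dim]
v = torch.randn(1, 32, 32, 64)
bad_mask = torch.ones(1, 1, 1, 31, dtype=torch.bool)  # last dim 31 != 32

# Fails: the mask cannot be broadcast to [1, 32, 1, 32], producing a
# size-mismatch RuntimeError like the one in the traceback above.
F.scaled_dot_product_attention(q, k, v, attn_mask=bad_mask)
```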

Expected behavior

Run command:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 FORCE_TORCHRUN=1 torchrun --nnodes 1 --node_rank 0 --nproc_per_node 4 src/train.py examples/demo/llama3_full_sft_ds3.yaml
```
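
For reference, the `src/train.py` entry point invoked here is a thin wrapper around `run_exp()`; a simplified sketch reconstructed from the frames in the traceback (not the verbatim file):

```python
from llamafactory.train.tuner import run_exp


def main():
    # run_exp() parses the YAML arguments and, for stage: sft, dispatches
    # to run_sft(), which calls trainer.evaluate() when do_eval is set
    # (tuner.py:50 and workflow.py:107 in the traceback).
    run_exp()


if __name__ == "__main__":
    main()
```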

Others

No response

Labels

solved: This problem has been already solved
