### Reminder

### System Info
#### The training parameters are as follows
```yaml
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: identity,alpaca_en_demo
eval_dataset: identity
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/full/sft
report_to: tensorboard
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
do_eval: true
predict_with_generate: true
# val_size: 0.1
per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500
```
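Since the crash below happens inside `trainer.evaluate()`, the relevant knob in this config is `predict_with_generate: true`: it makes the Seq2SeqTrainer run `model.generate()` on every eval batch instead of a plain loss-only forward pass, which is exactly the code path the traceback enters. A minimal sketch of the equivalent Hugging Face arguments (standard transformers API; values mirror the config above):

```python
from transformers import Seq2SeqTrainingArguments

# With predict_with_generate=True, Seq2SeqTrainer.prediction_step() routes
# each eval batch through model.generate() (see trainer_seq2seq.py in the
# traceback below) rather than a single forward pass that only computes loss.
args = Seq2SeqTrainingArguments(
    output_dir="saves/llama3-8b/full/sft",
    per_device_eval_batch_size=1,
    predict_with_generate=True,  # evaluation decodes tokens autoregressively
    bf16=True,
)
```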
### Reproduction
The error output is as follows:
```
***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-28 08:26:53,453 >> Num examples = 91
[INFO|trainer.py:3824] 2024-08-28 08:26:53,453 >> Batch size = 1
[rank2]: Traceback (most recent call last):
[rank2]:   File "/tf/notebooks/lujie/LLaMA-Factory/src/train.py", line 28, in <module>
[rank2]:     main()
[rank2]:   File "/tf/notebooks/lujie/LLaMA-Factory/src/train.py", line 19, in main
[rank2]:     run_exp()
[rank2]:   File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank2]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank2]:   File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 107, in run_sft
[rank2]:     metrics = trainer.evaluate(metric_key_prefix="eval", **gen_kwargs)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank2]:     return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer.py", line 3666, in evaluate
[rank2]:     output = eval_loop(
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer.py", line 3857, in evaluation_loop
[rank2]:     losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
[rank2]:   File "/tf/notebooks/lujie/LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 99, in prediction_step
[rank2]:     loss, generated_tokens, _ = super().prediction_step(  # ignore the returned labels (may be truncated)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 310, in prediction_step
[rank2]:     generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 1989, in generate
[rank2]:     result = self._sample(
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2932, in _sample
[rank2]:     outputs = self(**model_inputs, return_dict=True)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]:     result = forward_call(*args, **kwargs)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1141, in forward
[rank2]:     outputs = self.model(
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]:     result = forward_call(*args, **kwargs)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 944, in forward
[rank2]:     layer_outputs = decoder_layer(
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]:     result = forward_call(*args, **kwargs)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 677, in forward
[rank2]:     hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank2]:     return self._call_impl(*args, **kwargs)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank2]:     result = forward_call(*args, **kwargs)
[rank2]:   File "/root/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 603, in forward
[rank2]:     attn_output = torch.nn.functional.scaled_dot_product_attention(
[rank2]: RuntimeError: The expanded size of the tensor (32) must match the existing size (31) at non-singleton dimension 3. Target sizes: [1, 32, 1, 32]. Tensor sizes: [1, 1, 1, 31]
```
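For what it's worth, the shape mismatch in the final frame can be reproduced in isolation. The error says the attention scores have shape `[1, 32, 1, 32]` (batch 1, 32 heads, one new query token, a 32-entry KV cache) while the 4-D attention mask covers only 31 key positions. A minimal sketch, assuming nothing beyond PyTorch (head_dim 128 as in Llama-3-8B):

```python
import torch
import torch.nn.functional as F

# Shapes mirror the error above: batch=1, 32 heads, 1 new query token,
# head_dim=128, and a KV cache holding 32 tokens...
q = torch.randn(1, 32, 1, 128)
k = torch.randn(1, 32, 32, 128)
v = torch.randn(1, 32, 32, 128)
# ...but an attention mask built for only 31 key positions.
mask = torch.zeros(1, 1, 1, 31, dtype=torch.bool)

# Fails with a shape-mismatch RuntimeError like the one in the traceback,
# because [1, 1, 1, 31] cannot be expanded to the score shape [1, 32, 1, 32].
F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

In other words, by the time generation reaches this step, the cached keys/values are one position longer than the mask that was prepared for them.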
### Expected behavior
The launch command:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 FORCE_TORCHRUN=1 torchrun --nnodes 1 --node_rank 0 --nproc_per_node 4 src/train.py examples/demo/llama3_full_sft_ds3.yaml
```
### Others
No response