Conversation

@blacksheep-Aristotle (Contributor) commented Nov 12, 2024

PR types

New features

PR changes

Models

Description

Add single-card (non-distributed) network definitions for Llama, Qwen, and GPT.
auto_trainer now supports the intermediate-level auto-parallel API, and the run scripts support the intermediate-level API as well.

Verification doc: https://ku.baidu-int.com/knowledge/HFVrC7hq1Q/pKzJfZczuc/ESWJRriQZ-/sfEV74J-hHGXIR
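
For orientation, a minimal sketch of the flow this PR enables, assuming an intermediate-API entry point along the lines of paddle.distributed.parallelize and a single-card network class named LlamaForCausalLMNet; the names and signatures below are illustrative, not verified:

import paddle
import paddle.distributed as dist
from paddlenlp.transformers import LlamaConfig, LlamaForCausalLMNet  # class name assumed

# Build the single-card network from modeling_network.py; no manual TP/PP fields needed.
config = LlamaConfig()
model = LlamaForCausalLMNet(config)

# Each network exposes its sharding hints; auto_trainer merges and applies them.
auto_dist_config = model.auto_dist_config()

optimizer = paddle.optimizer.AdamW(parameters=model.parameters())

# Assumed entry point: the intermediate API re-marks the weights as distributed
# tensors and inserts the required communication ops automatically at run time.
model, optimizer = dist.parallelize(model, optimizer, config=auto_dist_config)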

@paddle-bot (bot) commented Nov 12, 2024

Thanks for your contribution!

    level = "os_g"
elif ShardingOption.FULL_SHARD in self.args.sharding:
    level = "p_g_os"
model, self.optimizer = sharded_data_parallel(model, self.optimizer, level)
Contributor:

Could this be done by constructing a dp_config and passing it to parallelize, so that a separate sharded_data_parallel call is no longer needed?

Contributor:

Done, updated.

@codecov (bot) commented Nov 15, 2024

Codecov Report

Attention: Patch coverage is 16.87500% with 1197 lines in your changes missing coverage. Please review.

Project coverage is 52.33%. Comparing base (3374e7f) to head (67bb667).
Report is 287 commits behind head on develop.

Files with missing lines Patch % Lines
paddlenlp/transformers/gpt/modeling_network.py 18.52% 387 Missing ⚠️
paddlenlp/transformers/llama/modeling_network.py 16.93% 368 Missing ⚠️
paddlenlp/transformers/qwen/modeling_network.py 17.50% 330 Missing ⚠️
paddlenlp/transformers/model_utils.py 3.03% 64 Missing ⚠️
paddlenlp/trainer/auto_trainer.py 0.00% 35 Missing ⚠️
paddlenlp/transformers/gpt/modeling_auto.py 15.38% 11 Missing ⚠️
paddlenlp/transformers/llama/modeling_auto.py 50.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9412      +/-   ##
===========================================
- Coverage    53.17%   52.33%   -0.85%     
===========================================
  Files          718      721       +3     
  Lines       114694   113772     -922     
===========================================
- Hits         60990    59540    -1450     
- Misses       53704    54232     +528     


return logits


loss_cnt = 0
Member:

What are these for?

Contributor (Author):

Leftover code; it has been removed.


import numpy as np
import paddle
import paddle.distributed as dist
Member:

What is the longer-term goal here, a full migration to auto_trainer? If so, shouldn't all trainer logic be rewritten in one place rather than staying coupled to the old trainer?

Contributor:

As previously discussed, auto parallel is integrated into auto_trainer, while the existing trainer keeps the original manual-parallel logic. The two are not coupled; they only share common infrastructure.

Member:

The concern is exactly that shared infrastructure: if a shared API changes, this path can break as well. From a developer-experience perspective, the auto-parallel test monitoring needs to detect and locate such problems promptly.

f"{prefix}lm_head.weight": ColWiseParallel(),
}
},
"pp_config": {"split_spec": f"{prefix}llama.layers"},
Contributor:

Please provide examples of how to split layers outside of `layers`, as well as empty layers.

Contributor:

A related question: how are shared-weight parameters expected to be marked?

Contributor (Author):

The framework automatically identifies shared-weight parameters and handles them specially; users can mark them like any ordinary parameter.

_keys_to_ignore_on_load_unexpected = [r"self_attn.rotary_emb.inv_freq"]

@classmethod
def _get_name_mappings(cls, config: LlamaConfig) -> list[StateDictNameMapping]:
Contributor:

Since auto_dist_config already describes the SP/TP sharding, consider dropping the name_mapping configuration and letting the auto-parallel intermediate API perform the sharding.

Contributor (Author):

Done.

level = 2
if ShardingOption.FULL_SHARD in sharding:
    level = 3
final_config["dp_config"] = {"level": level}
@wawltor (Contributor, Nov 22, 2024):

LoRA, DPO, and KTO training also need to be considered:

  1. LoRA training defines custom tensor-parallel LoRA layers; how would those be configured here?
  2. KTO and DPO involve two models, one that updates its parameters and one that does not. How should the distributed strategy be configured for the two models?

Contributor:

As previously discussed, the post-pretraining workflows will be verified and supported incrementally.

warnings.warn(
f"enable_parallel_cross_entropy, the vocab_size should be splited: {prediction_scores.shape[-1]}, {self.config.vocab_size}"
)
self.loss_func = paddle.nn.CrossEntropyLoss(reduction="none", ignore_index=self.ignore_index)
Contributor:

Large-model training uses some custom PyLayer operators. Can such PyLayer operators be supported in the non-TP network definition? They are also coupled with tensor parallelism; how should they be developed here?

class FusedHeadAndCrossEntropy(PyLayer):

Contributor:

PyLayer support in auto parallel is being worked on separately; @From00 can share the current progress.

level = 3
final_config["dp_config"] = {"level": level}

return final_config
Contributor:

What is the status of unified checkpoint support in the intermediate API? Can it adapt to changes in the distributed strategy?

Contributor:

Converting a checkpoint to single-card weights is already supported; conversion from single-card weights to a unified checkpoint is in progress.

@DesmonDay (Contributor, Dec 27, 2024):

  1. Unified checkpoint can now switch between essentially arbitrary distributed strategies;
  2. Model weights saved as a unified checkpoint can be used directly in inference frameworks or in other training workflows.

Can the format saved by auto parallel support both of these? Also, it seems a checkpoint must first be converted to single-card weights and then to a unified checkpoint, which looks overly complicated. Could saving directly in the unified checkpoint format be supported?

)


class LlamaPretrainingCriterion3DNet(paddle.nn.Layer):
Contributor:

Does the intermediate API design cover the criterion, for example ParallelCrossEntropy?

Contributor (Author):

Yes, it is supported; add the replace_with_parallel_cross_entropy option to tensor_parallel_config.
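
For illustration, a hedged sketch of enabling this through the trainer arguments; the comma-separated option-string convention for tensor_parallel_config follows the existing trainer arguments, and whether the auto trainer consumes this exact field is assumed from the reply above:

from paddlenlp.trainer import TrainingArguments

# Sketch only: enable the parallel cross-entropy replacement alongside TP.
args = TrainingArguments(
    output_dir="./checkpoints",
    tensor_parallel_degree=2,
    tensor_parallel_config="replace_with_parallel_cross_entropy",  # option name taken from the reply above
)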



class LlamaPretrainingCriterion3DNet(paddle.nn.Layer):
"""
Contributor:

The intermediate API design mainly covers DP and TP; does the PP scenario require special handling?


f"{prefix}llama.layers.*.mlp.up_proj": ColWiseParallel(),
f"{prefix}llama.layers.*.mlp.gate_up_fused_proj": ColWiseParallel(),
f"{prefix}llama.layers.*.mlp.down_proj": RowWiseParallel(),
f"{prefix}lm_head.weight": ColWiseParallel(),
Contributor:

After ColWiseParallel(), is the original linear converted into ColumnParallelLinear or ColumnSequenceParallelLinear, or into some other new linear type?

@blacksheep-Aristotle (Contributor Author, Dec 20, 2024):

ColWiseParallel re-marks the weights of the specified linear using the base auto-parallel APIs; at runtime, auto parallel infers the distributed states and inserts the communication ops automatically. The linear's type is not changed.
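
To make this concrete, a hedged sketch of a tensor-parallel parallelize plan in the style quoted in this thread; the ColWiseParallel/RowWiseParallel markers and the pp_config/split_spec keys appear in the diff, while the "mp_config" key name is an assumption:

import paddle.distributed as dist

def llama_tp_plan(prefix=""):
    # Each entry maps a (wildcarded) sublayer or weight name to a sharding marker.
    # The layers themselves stay ordinary nn.Linear; auto parallel infers the
    # distributed states and inserts communication ops when the model runs.
    return {
        "mp_config": {  # key name assumed
            "parallelize_plan": {
                f"{prefix}llama.layers.*.mlp.up_proj": dist.ColWiseParallel(),
                f"{prefix}llama.layers.*.mlp.gate_up_fused_proj": dist.ColWiseParallel(),
                f"{prefix}llama.layers.*.mlp.down_proj": dist.RowWiseParallel(),
                f"{prefix}lm_head.weight": dist.ColWiseParallel(),
            }
        },
        "pp_config": {"split_spec": f"{prefix}llama.layers"},
    }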

config.use_recompute = training_args.recompute
config.tensor_parallel_degree = training_args.tensor_parallel_degree
config.tensor_parallel_rank = training_args.tensor_parallel_rank
config.sharding_parallel_degree = training_args.sharding_parallel_degree
Contributor:

Looking at the code, why is sharding_degree set to 1 by default where the Topology is created around line 400?

Contributor (Author):

auto_parallel's sharding is not orthogonal to dp, mp, and pp.

Contributor (Author):

dp_degree already includes sharding_degree, so setting sharding_degree to 1 is sufficient.

@FeixLiu force-pushed the single_network branch 2 times, most recently from f1f4e46 to 612237d on November 27, 2024 05:04
@FeixLiu force-pushed the single_network branch 2 times, most recently from 5e24d14 to f35407b on December 9, 2024 05:58
@blacksheep-Aristotle force-pushed the single_network branch 2 times, most recently from 1720943 to 2ebb3dc on December 19, 2024 05:34
normalized_shape=normalized_shape, epsilon=epsilon, weight_attr=weight_attr, bias_attr=bias_attr
)
self.config = config
self.ipp = ipp
Contributor:

What is ipp, and why does it now need to be passed in separately?

Contributor (Author):

The base-API network definition needs ipp to indicate which pipeline stage the layer belongs to.

Contributor:

Can we avoid passing this parameter? Having every layer accept it is cumbersome. Can't this be done automatically?

Contributor:

+1

flash_attention = None

__all__ = [
"LlamaForCausalLM3DNet",
Contributor:

Why is it called 3DNet? Why add the special 3D prefix?

@blacksheep-Aristotle (Contributor Author, Dec 24, 2024):

The base-API network definition uses the 3D prefix to indicate that the network supports PP/DP/TP 3D hybrid parallelism, so it was kept here.

Contributor:

Is the single-card network different from the 3D one? If they are the same, can the 3D prefix simply be dropped?

Contributor (Author):

Done.



class LlamaMLPNet(nn.Layer):
def __init__(self, config, ipp: Optional[int] = None):
Contributor:

What does ipp mean and what does it represent? Is it required?

Contributor (Author):

Leftover code; it has been removed. Done.

# output = (logits,) + outputs[1:]
# return (loss,) + output if loss is not None else output

# return CausalLMOutputWithCrossAttentions(
Contributor:

Are these not supported? Can model.generate be supported for generation?

@blacksheep-Aristotle (Contributor Author, Dec 24, 2024):

Dynamic-to-static conversion requires the loss function to be separated from the model, so this code is commented out.
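
For context, a minimal illustrative sketch of a standalone criterion layer in the spirit of the LlamaPretrainingCriterion*Net class in this diff; the body below is illustrative, not the PR's actual implementation. Keeping the loss out of the model's forward lets dynamic-to-static trace the model and the loss separately.

import paddle

class PretrainingCriterionSketch(paddle.nn.Layer):
    # Illustrative only; mirrors the idea of the criterion classes in this diff.
    def __init__(self, ignore_index=-100):
        super().__init__()
        self.ignore_index = ignore_index
        self.loss_func = paddle.nn.CrossEntropyLoss(reduction="none", ignore_index=ignore_index)

    def forward(self, prediction_scores, masked_lm_labels):
        # Token-level loss, averaged over positions that are not ignored.
        masked_lm_loss = self.loss_func(
            prediction_scores.astype("float32"), masked_lm_labels.unsqueeze(-1)
        )
        mask = (masked_lm_labels != self.ignore_index).astype("float32").unsqueeze(-1)
        return (masked_lm_loss * mask).sum() / mask.sum()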

# )

def auto_dist_config(self, prefix=""):
if prefix != "":
Contributor:

How should these be configured? Is there documentation describing them?


)

def merge_auto_dist_configs(self, configs):
"""
Contributor:

Does this function have to live in the model base class? Could it go in auto_trainer instead?

Contributor (Author):

Because of scenarios like the following: model A contains model B and model C, and A, B, and C each have their own distributed configuration. Placing this in the model base class makes it easy to locate each model's own configuration; auto_trainer then merges them into the final distributed configuration for model A.
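
A hedged sketch of that composition scenario; the class names here are hypothetical, and the collection loop mirrors the named_sublayers/auto_dist_config pattern quoted elsewhere in this review:

import paddle.nn as nn

class SubModelB(nn.Layer):
    def auto_dist_config(self, prefix=""):
        # Each sub-model declares its own sharding hints relative to its prefix.
        return {"pp_config": {"split_spec": f"{prefix}layers"}}

class SubModelC(nn.Layer):
    def auto_dist_config(self, prefix=""):
        return {"pp_config": {"split_spec": f"{prefix}layers"}}

class ModelA(nn.Layer):
    def __init__(self):
        super().__init__()
        self.b = SubModelB()
        self.c = SubModelC()

    def collect_auto_dist_configs(self):
        # Walk every sublayer (including self) and gather the per-model configs;
        # auto_trainer would then merge them into the final config for model A.
        configs = []
        for name, layer in self.named_sublayers(include_self=True):
            if hasattr(layer, "auto_dist_config"):
                configs.append(layer.auto_dist_config(prefix=f"{name}." if name else ""))
        return configs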


return final_config

def _generate_auto_dist_config(self, auto_dist_degree):
Contributor:

Same as above; see whether it can be moved into auto_trainer.

Contributor (Author):

Same as above.

attention_mask = paddle.where(attention_mask, zero, neg_inf)
attention_mask = dist.shard_tensor(attention_mask, get_mesh(), [dist.Replicate(), dist.Replicate()])
hidden_states = self.drop(hidden_states)
hidden_states = dist.reshard(hidden_states, get_mesh(), [dist.Shard(0), dist.Replicate()])
Contributor:

modeling_3D_auto.py
modeling_auto.py

How about unifying these? Different models currently use different naming conventions.

Contributor (Author):

Done.

normalized_shape=normalized_shape, epsilon=epsilon, weight_attr=weight_attr, bias_attr=bias_attr
)
self.config = config
self.ipp = ipp
Contributor:

Can we avoid passing this parameter? Having every layer accept it is cumbersome. Can't this be done automatically?

"pp_config": None,
}
for name, layer in self.named_sublayers(include_self=True):
if hasattr(layer, "auto_dist_config"):
Contributor:

Does every layer need to set this attribute explicitly?

@blacksheep-Aristotle (Contributor Author, Dec 25, 2024):

Question 1: in the base-API network this cannot be avoided for now, because every layer needs to know its position in the pipeline stages.
Question 2: no. It only handles the case where model A contains model B and model C; B and C are then sublayers of A and each has its own auto_dist_config.

@jeff41404 (Contributor, Dec 26, 2024):

To add my understanding of "Can we avoid passing this parameter? Having every layer accept it is cumbersome. Can't this be done automatically?": networks built with the auto-parallel base API (modeling_auto.py) need the ipp argument, as before, and these networks will be phased out once the auto-parallel intermediate API matures. Networks built with the intermediate API (modeling_network.py), i.e. the single-card networks, do not need ipp, as the code shows, and are the recommended approach going forward.

mem=-1
echo "result: loss=$loss ips=$ips mem=$mem loss_md5=$loss_md5"
loss_base=10.59486389 # output of dropout is different after supporting spmd
loss_base=10.55848312 # output of dropout is different after supporting spmd
Contributor:

What does this affect, and why was it changed?

Contributor (Author):

The GPT weight initialization changed.

flash_attention = None

__all__ = [
"LlamaForCausalLM3DNet",
Contributor:

Is the single-card network different from the 3D one? If they are the same, can the 3D prefix simply be dropped?

"""
Merged all auto dist configs into one config.
"""
assert isinstance(configs, (dict, list))
Member:

For asserts like this, please add an error message explaining the failure reason.

Contributor (Author):

Done.

final_config["sp_config"] = config["sp_config"]
else:
for k, v in config["sp_config"]["parallelize_plan"].items():
assert k not in final_config["sp_config"]["parallelize_plan"].keys()
Member:

Same as above.

Contributor (Author):

Done.

"sp_config": None,
"pp_config": None,
}
for config in configs:
Member:

Please add some comments throughout to make this easier to understand.

Contributor (Author):

Done.

Comment on lines 82 to 83
assert model is not None
assert isinstance(model, PretrainedModel)
Member:

Please add an error message to these asserts describing the failure reason.

Contributor (Author):

Done.

Comment on lines +293 to +298
# # up
# a1 = self.w1(hidden_states)
# # gate
# a2 = self.w2(hidden_states)
# intermediate_parallel = a1 * F.silu(a2)
# down
Member:

If this code is unused, it can be removed.

Contributor (Author):

Done.

# export PYTHONPATH=../../../:$PYTHONPATH

python -u -m paddle.distributed.launch \
--gpus "4,5,6,7" \
Contributor:

Would it be better to change this to 0,1,2,3?

Contributor (Author):

Done.

@@ -0,0 +1,113 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
Contributor:

Could the required Paddle version be stated explicitly, for example in the README?

Contributor (Author):

Done.

f"{prefix}lm_head.weight": dist.ColWiseParallel(),
}
},
"pp_config": {"split_spec": f"{prefix}llama.layers", "global_spec": "llama.global_layer"},
Contributor:

Is the shared-weight pipeline scheme supported now?

Contributor (Author):

Yes, it is supported.

if prefix != "":
assert prefix.endswith(".")
config = {
"sp_config": {
Contributor:

A question about this configuration: sequence parallel does depend on tensor parallel, but most of this config duplicates the tp config. Could the duplication be reduced?

Contributor (Author):

This can be optimized in a follow-up.

@lugimzzz (Contributor) commented Dec 27, 2024

  1. For DPO/KTO pipeline scenarios, the original criterion must be replaced with the custom DPOCriterion and KTOCriterion. Can that be supported?
  2. The DPOCriterion pipeline passes logits out through the shared infohub variable. Does semi-auto parallel affect this? https://github.com/lugimzzz/PaddleNLP/blob/develop/paddlenlp/trl/dpo_criterion.py#L299C1-L304C24 https://github.com/lugimzzz/PaddleNLP/blob/develop/paddlenlp/trl/dpo_trainer.py#L342C1-L349C31
  3. So what used to be nn.Linear is still nn.Linear? Then how do I tell whether a given linear is RowParallelLinear or ColumnParallelLinear? For LoRA I used to replace ColumnParallelLinear with ColumnParallelLoRALinear; how should I do that now? https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/peft/lora/lora_model.py#L509C1-L527C14

AutoTokenizer,
CosineAnnealingWithWarmupDecay,
GPTConfig,
GPTForCausalLMAuto,
Contributor:

Could a single unified run_pretrain_auto.py be considered? Maintaining one script per model is costly.

Contributor (Author):

Since changing the launch scripts touches many CI/CE scripts, the plan is to submit a follow-up PR after this one is merged to unify them.

def _wrap_for_auto(self, model, train_dataloader):
logger.info("Wrapping model for auto paralle")
logger.info(f"Wrapping model for auto parallel using intermediate api {self.args.use_intermediate_api} ")
dist_loader = self._wrap_for_dist_loader(train_dataloader)
@DesmonDay (Contributor, Dec 27, 2024):

How different is this dist_loader from PaddleNLP's current distributed_dataloader? The functionality looks quite different.

Contributor (Author):

Not very different; it just wraps the dataloader. Support for the remaining dataloader features is still being developed. Coming soon.

@blacksheep-Aristotle (Contributor Author), replying to @lugimzzz's questions above:

  1. SFT/DPO/PPO support is still under development and verification; this PR only covers the pretraining scenario.
  2. Same as above.
  3. Yes. With the base API, the distributed state of LoRA weights can be set by inspecting how the base weights are sharded. With the intermediate API, the layer's auto_dist_config can be modified directly.
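
As an illustration of point 3, a hedged sketch of extending a layer's auto_dist_config so LoRA sub-weights follow the sharding of the linear they attach to; the lora_A/lora_B names and the "mp_config" key are assumptions for illustration, not the actual peft layout:

import paddle.distributed as dist

def add_lora_entries(auto_dist_config, prefix=""):
    # Hypothetical sketch: register extra plan entries for LoRA sub-weights.
    plan = auto_dist_config["mp_config"]["parallelize_plan"]  # structure assumed
    # Column-wise base linear: lora_B produces the column-sharded output dim.
    plan[f"{prefix}llama.layers.*.mlp.up_proj.lora_B"] = dist.ColWiseParallel()
    # Row-wise base linear: lora_A consumes the row-sharded input dim.
    plan[f"{prefix}llama.layers.*.mlp.down_proj.lora_A"] = dist.RowWiseParallel()
    return auto_dist_config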

@blacksheep-Aristotle (Contributor Author), quoting the same questions:

For example, the current approach for adapting LoRA: (screenshot of the existing LoRA code)

@DesmonDay (Contributor):

  1. Unified checkpoint can now switch between essentially arbitrary distributed strategies;
  2. Model weights saved as a unified checkpoint can be used directly in inference frameworks or in other training workflows.

Can the format saved by auto parallel support both of these? Also, it seems a checkpoint must first be converted to single-card weights and then to a unified checkpoint, which looks overly complicated. Could saving directly in the unified checkpoint format be supported?

@blacksheep-Aristotle (Contributor Author), replying to @DesmonDay:

This is still under development.

@lugimzzz (Contributor) commented Jan 2, 2025, replying to the answers above:

For DPO/KTO, the main point is whether the criterion can be supported more flexibly, rather than being required to live inside the network definition.

@wawltor (Contributor) left a review:

LGTM

@wawltor merged commit 575896c into PaddlePaddle:develop on Jan 3, 2025 (9 of 12 checks passed).
