base_model [BlockSequenceConfig]:
  type: lm (LM -> LMLike -> BlockSequenceConfig)
  embeddings [EmbeddingsConfig]:
    type: lm_embeddings (LmEmbeddingsConfig -> EmbeddingsConfig -> BlockSequenceConfig)
    ...
  transformer [BlockSequenceConfig]:
    type: repeat
    ...
  head:
    type: lm_head (LmHeadConfig -> HeadConfig -> BlockSequenceConfig)
🎯 Goal (What & Why)
We want to rework the configurations for models to help us going forward. Main goals are:
🚀 Execution Plan
Blocks:
Proposal:
* `BlockConfig`, allowing easy swap between all sorts of layers.
* `TransformerBlock` with type `transformer`. For now, this will be the default and only type of block. A transformer block will be defined through three fully dynamic sub-layers: mixer, mlp and normalization.
* `hidden_size`, `full_precision_residual` and other variables that need to remain consistent between blocks will not be part of the block config, and will instead be configured in block sequences (see below) and be passed as arguments to block instantiation.

Open questions:
* Is `TransformerBlock`/`transformer` appropriate, or do we want a more inclusive name?
* How should non-standard layers, ex. embeddings, head, relate to that interface?
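The block proposal above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the registry, `from_dict`, `get_block`, and the sub-layer type names (`attention`, `mlp`, `layer_norm`) are hypothetical, and the sub-layers are plain dicts where the proposal would use dynamic configs of their own. It shows two ideas from the text: the block class is selected through a `type` field with `transformer` as the default, and shared values such as `hidden_size` are passed in at instantiation instead of being stored in the block config.

```python
from dataclasses import dataclass, field

# Hypothetical registry mapping a `type` string to a block config class.
_BLOCK_TYPES: dict = {}


@dataclass
class BlockConfig:
    """Base class; concrete block configs register under a `type` name."""

    @classmethod
    def register(cls, name: str):
        def decorator(subclass):
            _BLOCK_TYPES[name] = subclass
            return subclass

        return decorator

    @classmethod
    def from_dict(cls, config: dict) -> "BlockConfig":
        config = dict(config)  # Copy so the caller's dict is not mutated.
        # `transformer` is the default (and for now only) block type.
        block_type = config.pop("type", "transformer")
        return _BLOCK_TYPES[block_type](**config)


@BlockConfig.register("transformer")
@dataclass
class TransformerBlockConfig(BlockConfig):
    # Sub-layers are plain dicts here (assumption, kept simple for the sketch).
    mixer: dict = field(default_factory=lambda: {"type": "attention"})
    mlp: dict = field(default_factory=lambda: {"type": "mlp"})
    normalization: dict = field(default_factory=lambda: {"type": "layer_norm"})

    def get_block(self, hidden_size: int, full_precision_residual: bool = False) -> dict:
        # Stand-in for real layer construction: returns a description of the block.
        # Shared values arrive as arguments rather than living in the block config.
        return {
            "mixer": self.mixer,
            "mlp": self.mlp,
            "normalization": self.normalization,
            "hidden_size": hidden_size,
            "full_precision_residual": full_precision_residual,
        }
```

With this shape, a config dict that omits `type` resolves to the transformer block, and the same block config can be instantiated at any `hidden_size` chosen by its parent.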
Block sequences:
Proposal:
* `Sequential` layers, so we will introduce matching dynamic construct `BlockSequenceConfig` on the config side.
* `RepeatBlockConfig` as an implementation of the block sequence interface, repeating a pattern of blocks for `num_layers`, with constant parameters like `hidden_size`.

Open questions:
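To illustrate the `RepeatBlockConfig` proposal in this section, here is a minimal sketch. The field name `pattern`, the method `get_blocks`, and the stand-in `BlockConfig` are hypothetical, and it assumes `num_layers` counts the total number of blocks rather than the number of pattern repetitions. The point is that `hidden_size` is set once on the sequence and handed to every block.

```python
from dataclasses import dataclass, field


@dataclass
class BlockConfig:
    # Stand-in for a dynamic block config; only the block type name is kept.
    type: str = "transformer"

    def get_block(self, hidden_size: int) -> dict:
        # Stand-in for real layer construction.
        return {"type": self.type, "hidden_size": hidden_size}


@dataclass
class RepeatBlockConfig:
    # Pattern of block configs, cycled until `num_layers` blocks exist.
    pattern: list = field(default_factory=lambda: [BlockConfig()])
    # Assumed to be the total number of blocks in the sequence.
    num_layers: int = 12
    # Constant across all blocks, so it is configured on the sequence.
    hidden_size: int = 1024

    def get_blocks(self) -> list:
        return [
            self.pattern[i % len(self.pattern)].get_block(self.hidden_size)
            for i in range(self.num_layers)
        ]
```

For example, a two-block pattern with `num_layers=5` yields five blocks alternating between the two types, all built with the same `hidden_size`.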
Linear:
Proposal:
* `weight_initialization`, `bias_initialization`, `bias`, `lr_scale`, `apply_peft` (see sections below for details).
* `default` non-init field of `LinearConfig` and `LinearWeightConfig`, in the parent config's `_validate`.

Open questions:
* `add_linear_biases` or `init_method_std` to set all linear layers at once? This would make things less verbose but could be more difficult to understand, and could make things less self-contained (ex. linear configuration may appear to depend not only on `LinearConfig`, but also on arbitrary parameters in the parent config). Possible compromises:
  * `_from_dict`, so that the actual config ends up with explicit `LinearConfig` fields, i.e. all traces of the shortcut are gone when printing or saving the config. (Ex. `add_linear_biases` replaced with explicit `bias = False` in all `LinearConfig`s.)
* `lr_scale` (MoE) and `apply_peft` (key and value) we can rely on, we can handle initialization by working with global tensors, and we shouldn't need non-constant `bias`.

Other layers and weights
Proposal:
* `NormalizationConfig`, with similar parameters to `LinearConfig` but dynamic and without custom default (a fixed one works fine, and a custom default doesn't work well with a dynamic class).
* `WeightConfig`.

Open questions:
* `WeightConfig` the only way to configure parameters, using composition for linear, normalization, etc. This would bring consistency at the cost of convenience (more verbose configs).

Initialization
Proposal:
* `InitializationConfig` allowing for any conceivable initialization.

Open questions:
LR scales
Proposal:
Peft
Proposal:
Open questions:
* `LinearConfig.apply_peft` is the only configuration parameter, which is enough for simple LoRA. Do we need more? Ex. other types of PEFT, application to `LinearWeightConfig` or other layers.
* `BlockConfig`)?