Proposal: new model configuration mechanism #356

# 🎯 **Goal (What & Why)**

We want to rework model configurations so they are easier to extend going forward. The main goals are:
* Support block-modular models (#242)
* Replace and generalize repetitive, ad-hoc parameters such as lr scales and initialization parameters, with standardized and automated configuration parameters.
* Make more components dynamic to help experimenting with new model configurations, ex. mixers, mlps, or even entire blocks or sequences of blocks.
* Help with the integration of more complex models, ex. multi-modal (vision) models.
* Generalize PEFT/LoRA beyond transformers.

# 🚀 **Execution Plan**

#### Blocks:
Proposal:
* Standard blocks and their configs will be replaced by a block interface with a dynamic `BlockConfig`, making it easy to swap between all sorts of layers.
* Standard transformer and transformer-like blocks that differ only by their mixer, ex. Transformer and SSM blocks, will be unified into `TransformerBlock` with type `transformer`. For now, this will be the default and only type of block. A transformer block will be defined through three fully dynamic sub-layers: mixer, mlp and normalization.
* `hidden_size`, `full_precision_residual` and other variables that need to remain consistent between blocks will not be part of the block config, and will instead be configured in block sequences (see below) and be passed as arguments to block instantiation.
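
A minimal sketch of how the dynamic block config could look, assuming a simple registry keyed by `type`. Everything besides `BlockConfig`-style naming, `transformer`, `hidden_size` and `full_precision_residual` is hypothetical (`BLOCK_TYPES`, `register_block`, `get_block`, and the sub-layer type names `attention`, `mamba`, `layer_norm`), and sub-layer configs are plain dicts here only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical registry mapping a `type` name to a block config class.
BLOCK_TYPES: dict[str, type] = {}


def register_block(name: str):
    """Register a block config class under a dynamic type name."""
    def wrap(cls):
        BLOCK_TYPES[name] = cls
        return cls
    return wrap


@register_block("transformer")
@dataclass
class TransformerBlockConfig:
    # Three dynamic sub-layers; each would itself be a dynamic config.
    mixer: dict[str, Any] = field(default_factory=lambda: {"type": "attention"})
    mlp: dict[str, Any] = field(default_factory=lambda: {"type": "mlp"})
    normalization: dict[str, Any] = field(default_factory=lambda: {"type": "layer_norm"})

    def get_block(self, hidden_size: int, full_precision_residual: bool = False) -> dict[str, Any]:
        # `hidden_size` and friends are NOT config fields: the enclosing
        # block sequence passes them in at instantiation time.
        return {
            "hidden_size": hidden_size,
            "full_precision_residual": full_precision_residual,
            "mixer": self.mixer,
            "mlp": self.mlp,
            "normalization": self.normalization,
        }


def block_config_from_dict(config: dict[str, Any]) -> Any:
    """Pick the config class from `type`, defaulting to `transformer`."""
    config = dict(config)
    type_name = config.pop("type", "transformer")
    return BLOCK_TYPES[type_name](**config)


ssm_block = block_config_from_dict({"type": "transformer", "mixer": {"type": "mamba"}})
block = ssm_block.get_block(hidden_size=1024)
```

With this shape, a Transformer block and an SSM block differ only in the `mixer` entry, and nothing in the block config can disagree with its neighbours about `hidden_size`.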

Open questions:
* Should we allow for a different config for the two normalization sub-layers?
* Is `TransformerBlock` /  `transformer` appropriate, or do we want a more inclusive name?
* How should non-standard layers, ex. embeddings, head, relate to that interface?

#### Block sequences:
Proposal: 
* We will introduce a concept of "block sequence", representing a logical sequence of blocks. We already have `Sequential` layers, so we will introduce a matching dynamic construct, `BlockSequenceConfig`, on the config side.
* We will introduce `RepeatBlockConfig` as an implementation of the block sequence interface, repeating a pattern of blocks for `num_layers`, with constant parameters like `hidden_size`.
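
A rough sketch of the repeat behaviour, under the assumption that `num_layers` counts the total number of blocks and the pattern is cycled to fill them (the proposal does not settle this). The `pattern` field and the `expand` method are hypothetical names, and block configs are plain dicts for brevity.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RepeatBlockConfig:
    # Hypothetical field: the block configs to cycle through.
    pattern: list[dict[str, Any]] = field(default_factory=lambda: [{"type": "transformer"}])
    num_layers: int = 12
    # Constant across the whole sequence, so it lives here and not in the blocks.
    hidden_size: int = 1024

    def expand(self) -> list[dict[str, Any]]:
        """Return one entry per layer, cycling through the pattern."""
        return [
            {
                **self.pattern[index % len(self.pattern)],
                "index": index,
                "hidden_size": self.hidden_size,
            }
            for index in range(self.num_layers)
        ]


# Example: a hybrid model alternating two kinds of mixer.
hybrid = RepeatBlockConfig(
    pattern=[{"mixer": "attention"}, {"mixer": "ssm"}],
    num_layers=4,
    hidden_size=8,
)
layers = hybrid.expand()
```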

Open questions:
* Names are WIP.
* Allow for nested block sequences, ex. a block in a sequence that is itself a sequence? (Useful for lm and vision use cases below)
* Should the language model itself be a block sequence? Ex. we could have a fully generalizable structure (ex. extendable to vision) of the form:
  ```
  base_model [BlockSequenceConfig]:
    type: lm (LM -> LMLike ->  BlockSequenceConfig)
    embeddings [EmbeddingsConfig]:
      type: lm_embeddings (LmEmbeddingsConfig -> EmbeddingsConfig -> BlockSequenceConfig)
      ...
    transformer [BlockSequenceConfig]:
      type: repeat
      ...
    head:
      type: lm_head (LmHeadConfig -> HeadConfig -> BlockSequenceConfig)
  ```
* How should we handle vision? Could be something like
  ```
  base_model [BlockSequenceConfig]:
    type: multi_modal
    models:
      - vision:
          type: vision (Vision -> LMLike -> BlockSequenceConfig)
          ...
      - lm:
          type: lm (LM -> LMLike -> BlockSequenceConfig)
          ...
  ```

#### Linear:
Proposal:
* Linear layers will have their own config. Parameters so far are `weight_initialization`, `bias_initialization`, `bias`, `lr_scale`, `apply_peft`. (see sections below for details)
* Defaults will be customizable by the parent config, since having a fixed default doesn't make sense. Custom defaults will be set through the `default` non-init field of `LinearConfig` and `LinearWeightConfig`, in the parent config's `_validate`.
* For layers that are the concatenation of logically distinct layers (ex. key_value, gate_and_up, moe mlp weights, ssm inner projection), there will be a separate configuration for each sub-layer.
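
One possible reading of the customizable-default mechanism, sketched with plain dataclasses. The `resolve` helper, the `AttentionConfig` parent, its `query`/`dense` fields and the default values chosen are all assumptions for illustration; only `LinearConfig`, the `default` non-init field, `_validate` and the parameter names come from the proposal.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class LinearConfig:
    # `None` means "not set by the user, fall back to the parent's default".
    bias: Optional[bool] = None
    lr_scale: Optional[float] = None
    apply_peft: Optional[bool] = None
    # Non-init field: cannot be passed to the constructor, only set by the parent.
    default: Optional["LinearConfig"] = field(default=None, init=False, repr=False)

    def resolve(self, name: str) -> Any:
        """Return the explicit value if set, otherwise the parent-provided default."""
        value = getattr(self, name)
        if value is None and self.default is not None:
            return getattr(self.default, name)
        return value


@dataclass
class AttentionConfig:
    # Hypothetical parent config with two linear sub-layers.
    query: LinearConfig = field(default_factory=LinearConfig)
    dense: LinearConfig = field(default_factory=LinearConfig)

    def _validate(self) -> None:
        # The parent decides what "default" means for each of its layers.
        self.query.default = LinearConfig(bias=True, lr_scale=1.0, apply_peft=True)
        self.dense.default = LinearConfig(bias=True, lr_scale=1.0, apply_peft=False)


attention = AttentionConfig(query=LinearConfig(bias=False))
attention._validate()
```

Because `default` is excluded from `__init__` and `repr`, it does not show up as a user-facing parameter, which is roughly what "non-init field" suggests.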

Open questions:
* Should we make linear layers dynamic?
* Should we have a separate config class for layers that shouldn't have a bias option? (ex. MoE router)
* Is there a better way to achieve customizable defaults?
* Do we want to allow for convenience parameters to limit repetitions? Ex. keep `add_linear_biases` or `init_method_std` to set all linear layers at once? This would make things less verbose but could be more difficult to understand, and could make things less self-contained (ex. linear configuration may appear to depend not only on `LinearConfig`, but also on arbitrary parameters in parent config.) Possible compromises: 
  * Have "init-only" shortcut fields that are converted explicitly in `_from_dict`, so that the actual config ends up with explicit `LinearConfig` fields, i.e. all traces of the shortcut are gone when printing or saving the config. (Ex. `add_linear_biases` replaced with explicit `bias = False` in all `LinearConfig`s.)
  * Rely on Hydra.
* Concatenation of logically distinct layers should be manageable, but some details are still TBD. We already have examples we can rely on for `lr_scale` (MoE) and `apply_peft` (key and value). We can handle initialization by working with global tensors, and we shouldn't need non-constant `bias`.
* For MoE, configuring each expert separately could be tedious, so maybe we want to make it optional.
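
The "init-only shortcut" compromise discussed above could be sketched as a pre-processing step on the raw config dict, as it might run inside `_from_dict`. The `expand_shortcuts` function and the `LINEAR_FIELDS` list are hypothetical, and letting an explicit per-layer `bias` win over the shortcut is an assumption, not something the proposal specifies.

```python
import copy
from typing import Any

# Hypothetical names of the linear sub-configs in some parent config.
LINEAR_FIELDS = ("query", "key", "value", "dense")


def expand_shortcuts(config: dict[str, Any]) -> dict[str, Any]:
    """Replace the `add_linear_biases` shortcut with explicit per-layer `bias`."""
    config = copy.deepcopy(config)  # Never mutate the caller's dict.
    if "add_linear_biases" in config:
        bias = config.pop("add_linear_biases")  # The shortcut leaves no trace.
        for name in LINEAR_FIELDS:
            # Assumption: an explicit per-layer setting takes precedence.
            config.setdefault(name, {}).setdefault("bias", bias)
    return config


expanded = expand_shortcuts({"add_linear_biases": False, "key": {"bias": True}})
```

After expansion, printing or saving the config shows only the explicit `bias` entries, so the linear configuration stays self-contained.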

#### Other layers and weights
Proposal:
* Normalization weights and biases will be managed through `NormalizationConfig`, with similar parameters to `LinearConfig` but dynamic and without custom default (a fixed one works fine, and custom default doesn't work well with dynamic class).
* Other parameters (embeddings, lm output, isolated parameters) will be configured in a generic `WeightConfig`.

Open questions:
* We could make `WeightConfig` the only way to configure parameters, using composition for linear, normalization, etc. This would bring consistency at the cost of convenience (more verbose configs).
* Should we make more custom layer configurations, ex. embeddings, lm output, conv1d, etc? Embedding and LM output are technically linear(-like) layers, so they could use a linear config instead (without bias). As for the layer itself, making it a separate construct could be difficult for technical and legacy reasons (introducing a submodule changes weight names, so it isn't backward compatible).

#### Initialization
Proposal:
* All parameters will be associated with a fully dynamic `InitializationConfig` allowing for any conceivable initialization.
* For linear and custom parameters, the default will be set in the parent layer, while normalization will use fixed defaults (fill with ones/zeros).
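
To make "fully dynamic" concrete, here is a small sketch where the initialization is selected by a `type` key. The two types shown (`fill`, `normal`), their fields, the `init` method and the registry are hypothetical, and the example produces plain Python lists with the standard library so that it stays self-contained; a real implementation would fill tensors.

```python
import random
from dataclasses import dataclass
from typing import Any


@dataclass
class FillInitializationConfig:
    value: float = 0.0

    def init(self, size: int, rng: random.Random) -> list[float]:
        # Constant fill, ex. ones for normalization weights, zeros for biases.
        return [self.value] * size


@dataclass
class NormalInitializationConfig:
    mean: float = 0.0
    std: float = 0.02

    def init(self, size: int, rng: random.Random) -> list[float]:
        return [rng.gauss(self.mean, self.std) for _ in range(size)]


# Hypothetical registry of initialization types.
INITIALIZATION_TYPES = {
    "fill": FillInitializationConfig,
    "normal": NormalInitializationConfig,
}


def initialization_from_dict(config: dict[str, Any]) -> Any:
    """Build the initialization config selected by the `type` key."""
    config = dict(config)
    return INITIALIZATION_TYPES[config.pop("type")](**config)


rng = random.Random(0)
norm_weight = initialization_from_dict({"type": "fill", "value": 1.0}).init(4, rng)
linear_weight = initialization_from_dict({"type": "normal", "std": 0.01}).init(4, rng)
```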

Open questions:
* How do we generalize default initializations? (Ex. SSMs have many isolated weights not fitting in the above categories).

#### LR scales
Proposal:
* All parameters will be associated with their own customizable learning rate scale. They may be shared in some cases (ex. linear/normalization weight and bias).
* Layers and blocks may also define customizable lr scales, ex. to allow freezing an entire block. When more than one lr scale applies to a given parameter, the effect is multiplicative.
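
The multiplicative rule can be sketched in a few lines. The `combine_lr_scales` helper is a hypothetical name, and treating an unset scale (`None`) as 1 is an assumption.

```python
from typing import Optional


def combine_lr_scales(*scales: Optional[float]) -> float:
    """Multiply every lr scale that applies to a parameter; unset scales count as 1."""
    result = 1.0
    for scale in scales:
        if scale is not None:
            result *= scale
    return result


# Block scale 0.5, layer scale unset, parameter scale 0.1.
effective = combine_lr_scales(0.5, None, 0.1)

# A block-level scale of 0 freezes every parameter in the block,
# regardless of the layer or parameter scales below it.
frozen = combine_lr_scales(0.0, 2.0, 1.0)
```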

#### Peft
Proposal:
* Instead of specializing the Peft config to each model, we will use one single model-agnostic Peft config, and configure Peft behavior in individual layers.

Open questions:
* So far `LinearConfig.apply_peft` is the only configuration parameter, which is enough for simple lora. Do we need more? Ex. other types of peft, application to `LinearWeightConfig` or other layers.
* Where should the peft config live? Do we want it at the top-level of the model config, or deeper into the config (ex. in `BlockConfig`)?
