Merged
Changes from 1 commit (29 commits total)
Commits:

- d2fbcb2 docs: gguf spec first pass (philpax, Jun 25, 2023)
- 23eda2e docs(gguf): update with review comments (philpax, Jun 26, 2023)
- b303293 docs(gguf): update with review comments (philpax, Jun 27, 2023)
- 2bcd348 docs(gguf): quant version optional for unquant (philpax, Jun 28, 2023)
- 576e306 docs(gguf): normalize naming, add whisper (philpax, Jul 9, 2023)
- 44af6b8 Merge branch 'master' of https://github.com/ggerganov/ggml into gguf-… (philpax, Jul 9, 2023)
- 24260bf docs(gguf): more review updates (philpax, Jul 23, 2023)
- 0133f2e docs(gguf): add norm eps and added_tokens (philpax, Jul 25, 2023)
- e9988f7 docs(gguf): move padding (philpax, Jul 26, 2023)
- f4c4d6a docs(gguf): remove migration tool (philpax, Jul 27, 2023)
- 39da254 docs(gguf): make offset base explicit (philpax, Jul 27, 2023)
- a6d1cc1 docs(gguf): fix replace oops (philpax, Jul 27, 2023)
- 1d134ec docs(gguf): alignment metadata+tensor name len max (philpax, Aug 6, 2023)
- 2a90bbf docs(gguf): clarification, fixes, tensor names (philpax, Aug 14, 2023)
- 3d4507e docs(gguf): clarify license (philpax, Aug 15, 2023)
- 39d6377 docs(gguf): minor tweaks (philpax, Aug 15, 2023)
- e36b4ca docs(gguf): data layout, GQA eq, no ft, LE GGUF (philpax, Aug 20, 2023)
- d5cfb55 docs(gguf): fix magic order (philpax, Aug 20, 2023)
- aa8d0ba docs(gguf): match impl (philpax, Aug 20, 2023)
- f3e7632 docs(gguf): specify fallback alignment (philpax, Aug 20, 2023)
- 2fe03e5 docs(gguf): remove TensorInfo::n_elements (philpax, Aug 20, 2023)
- 2b65fba docs(gguf): filetype, rope base/linear scale (philpax, Aug 24, 2023)
- b021b25 docs(gguf): v2 - uint64 all the things (philpax, Aug 26, 2023)
- 2da80c1 docs(gguf): tweak extensibility wording (philpax, Aug 28, 2023)
- 574b408 docs(gguf): fix spec discrepancies (philpax, Sep 9, 2023)
- 4ea9317 Merge branch 'master' into gguf-spec (philpax, Oct 31, 2023)
- 78faa7b docs(gguf): v3 + other fixes (philpax, Oct 31, 2023)
- 0da010d fix(editorconfig): use 2-space tabs for markdown (philpax, Oct 31, 2023)
- ad95988 docs(gguf): clarify big-endian (philpax, Oct 31, 2023)
docs(gguf): add norm eps and added_tokens
philpax committed Jul 25, 2023 (commit 0133f2e5f908b7bfd2454ff3ba17dc00c9f0ffaf)
docs/gguf.md: 63 changes (38 additions, 25 deletions)
@@ -28,7 +28,7 @@ Fields, including arrays, are written sequentially without alignment unless otherwise specified.

```diff
 enum ggml_type {
-    GGML_TYPE_F32 = 0,
+    GGML_TYPE_float32 = 0,
     GGML_TYPE_F16 = 1,
     GGML_TYPE_Q4_0 = 2,
     GGML_TYPE_Q4_1 = 3,
```

**Contributor:** What about BF16?

**Contributor (@klosax, Aug 21, 2023):** Currently, if you want the highest quality, you have to double the tensor sizes by using F32. My guess is that BF16 is not natively supported by many platform architectures yet. I would also like to see support for BF16 in ggml. I wonder if BF16 emulation really is slower than F32, since it is in fact a truncated version of F32. @ggerganov?

**Member:** No plans for adding BF16 support; it would be too big a change for what I think is too small a benefit.

@@ -208,7 +208,7 @@ If a particular community key is widely used, it may be promoted to a standardized key.
- `bloom`
- `falcon`
- `rwkv`
- **`general.quantization_version: uint32`**: version of quantization scheme. Not required if the model is not quantized (i.e. no tensors are quantized). If any tensors are quantized, this _must_ be present.
**Contributor:** What is a quantization scheme, and why is this needed in the model file? The tensor quantization type is set by `gguf_tensor_info_t { ggml_type type; }`.

**Contributor:** I think this was more informative. Look e.g. at the k-quants: they use a mix of different quantizations for the model.

**Contributor:** We already have a KV for file type, a.k.a. quantization description: `general.file_type`.

**Contributor:** It is used in ggml/examples to encode the quant version in the `ftype` param of the model file, but is never used for anything other than informative purposes in main.cpp.

I think this could be removed from the spec, since that information should be in `general.file_type`, like "mostly Q8_0"? Any executor could find out the full quantization scheme by looking at the tensor types.

**Contributor:** Yes, breaking changes to existing ggml types. What I meant was that we don't need a quantization version if we mark the changed types as not supported and make new types instead, increasing the ggml type enum.

**Member:** The `GGML_QNT_VERSION` number (i.e. `general.quantization_version: uint32`) is used to mark breaking changes in the quantization formats/algorithms. User code can check it to determine whether a format is supported.

> increasing the ggml type enum.

I prefer to reuse the existing enum values. We don't worry about backwards compatibility before ggml v1.0 is released.

**Contributor:** The main problem with reusing the enum and the quantization format names is that most users distribute model files with names like `llama-65b-q4_0.bin`, so users don't know before downloading and testing them whether they work or not. This is confusing and frustrating to end users. This is also discussed in PR ggml-org/llama.cpp#2434.

**Contributor Author (@philpax, Aug 6, 2023):** I included this as that's what the current version of GGML does, but (as mentioned in llama.cpp#2434) I would strongly prefer the breaking quantization changes to be treated as separate types, as klosax suggests. It's more descriptive, it means you only need one source of format version instead of two, and it's less error-prone.

If that's a possibility, I'd be a fan of removing the `quantization_version` value entirely.

> I prefer to reuse the existing enum values. We don't worry about backwards compatibility before ggml v1.0 is released.

Dropping support for the older quantization formats is fine by me, but clearly disambiguating between older and newer formats makes things clearer for users and developers alike (i.e. instead of having to specify which format and quantization version I'm using, I can refer to just one thing that describes them both and won't go out of sync).

**Contributor:** Maintaining backwards compatibility in user code is currently a rather complex task. In kobold.cpp they have resorted to including several versions of ggml.c to deal with the complexity.
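The check the Member describes can be sketched as follows. This is illustrative only: the key name comes from the spec, but the supported-version constant and the plain metadata dictionary are hypothetical stand-ins for an executor's own state (real executors would compare against ggml's `GGML_QNT_VERSION`):

```python
# Hypothetical version this executor supports; stands in for GGML_QNT_VERSION.
SUPPORTED_QNT_VERSION = 2

def check_quant_version(metadata: dict) -> None:
    """Reject files whose quantization scheme this executor cannot decode.

    general.quantization_version is only required when at least one tensor
    is quantized, so a missing key means there is nothing to check.
    """
    version = metadata.get("general.quantization_version")
    if version is not None and version != SUPPORTED_QNT_VERSION:
        raise ValueError(f"unsupported quantization version: {version}")
```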


#### General metadata

@@ -230,26 +230,28 @@ Information about where this model came from. This is useful for tracking the provenance of the model.

In the following, `[llm]` is used to fill in for the name of a specific LLM architecture. They will be used in each architecture's section.

- `[llm].context_length: uint32`: Also known as `n_ctx`. Length of the context (in tokens) that the model was trained on. For most architectures, this is the hard limit on the length of the input. Architectures like RWKV that are not reliant on transformer-style attention may be able to handle larger inputs, but this is not guaranteed.
- `[llm].embedding_length: uint32`: Also known as `n_embd`. Embedding layer size.
- `[llm].layer_count: uint32`: Also known as `n_layers`. The number of attention+feedforward layers (i.e. the bulk of the LLM). Does not include the input or embedding layers.
- `[llm].feedforward_length: uint32`: Also known as `n_ff`. The length of the feedforward layer.
- `[llm].use_parallel_residual: bool`: Whether or not the parallel residual logic should be used.
- `[llm].tensor_data_layout: string`: When a model is converted to GGUF, tensors may be rearranged to improve performance. This key describes the layout of the tensor data. This is not required; if not present, it is assumed to be `reference`.
- `reference`: tensors are laid out in the same order as the original model
- further options can be found for each architecture in their respective sections

#### Attention

- `[llm].attention.head_count: uint32`: Also known as `n_head`. Number of attention heads.
- `[llm].attention.head_count_kv: uint32`: The number of heads per group used in Grouped-Query-Attention. If not present, the model does not use GQA.
- `[llm].attention.max_alibi_bias: float32`: The maximum bias to use for ALiBI.
- `[llm].attention.clamp_kqv: float32`: Value (`C`) to clamp the values of the `Q`, `K`, and `V` tensors between (`[-C, C]`).
- `[llm].attention.layer_norm_epsilon: float32`: Layer normalization epsilon.
- `[llm].attention.layer_norm_rms_epsilon: float32`: Layer RMS normalization epsilon.

#### RoPE

- `[llm].rope.dimension_count: uint32`: The number of rotary dimensions for RoPE.
- `[llm].rope.scale: float32`: A scale factor for RoPE to adjust the context length.
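To illustrate how these keys might be consumed (a sketch only; the conventional base of 10000 is assumed here rather than read from a key):

```python
def rope_angle(pos: int, pair_index: int, dimension_count: int,
               scale: float = 1.0, base: float = 10000.0) -> float:
    """Rotation angle applied to dimension pair `pair_index` at position `pos`.

    Linear context scaling multiplies the position by `scale`; scale < 1
    stretches the effective context beyond the trained length.
    """
    freq = base ** (-2.0 * pair_index / dimension_count)
    return pos * scale * freq
```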

#### Models

@@ -263,6 +265,7 @@ The following sections describe the metadata for each model architecture.
- `llama.feedforward_length`
- `llama.rope.dimension_count`
- `llama.attention.head_count`
- `llama.attention.layer_norm_rms_epsilon`

###### Optional

@@ -285,6 +288,7 @@ The following sections describe the metadata for each model architecture.
- `mpt.attention.head_count`
- `mpt.attention.alibi_bias_max`
- `mpt.attention.clip_kqv`
- `mpt.attention.layer_norm_epsilon`

##### GPT-NeoX

@@ -294,6 +298,7 @@ The following sections describe the metadata for each model architecture.
- `gptneox.use_parallel_residual`
- `gptneox.rope.dimension_count`
- `gptneox.attention.head_count`
- `gptneox.attention.layer_norm_epsilon`

###### Optional

@@ -306,6 +311,7 @@ The following sections describe the metadata for each model architecture.
- `gptj.layer_count`
- `gptj.rope.dimension_count`
- `gptj.attention.head_count`
- `gptj.attention.layer_norm_epsilon`

###### Optional

@@ -317,6 +323,7 @@ The following sections describe the metadata for each model architecture.
- `gpt2.embedding_length`
- `gpt2.layer_count`
- `gpt2.attention.head_count`
- `gpt2.attention.layer_norm_epsilon`

##### BLOOM

@@ -325,6 +332,7 @@ The following sections describe the metadata for each model architecture.
- `bloom.layer_count`
- `bloom.feedforward_length`
- `bloom.attention.head_count`
- `bloom.attention.layer_norm_epsilon`

##### Falcon

@@ -334,6 +342,7 @@ The following sections describe the metadata for each model architecture.
- `falcon.attention.head_count`
- `falcon.attention.head_count_kv`
- `falcon.attention.use_norm`
- `falcon.attention.layer_norm_epsilon`

###### Optional

@@ -365,11 +374,11 @@ The following sections describe the metadata for each model architecture.

The vocabulary size is the same as the number of rows in the `head` matrix.

- `rwkv.architecture_version: uint32`: The only allowed value currently is 4. Version 5 is expected to appear some time in the future.
- `rwkv.context_length: uint32`: Length of the context used during training or fine-tuning. RWKV is able to handle larger context than this limit, but the output quality may suffer.
- `rwkv.layer_count: uint32`
- `rwkv.embedding_length: uint32`
- `rwkv.feedforward_length: uint32`

##### Whisper

@@ -380,7 +389,7 @@ This is because they are both transformer models.
- `whisper.encoder.context_length`
- `whisper.encoder.embedding_length`
- `whisper.encoder.layer_count`
- `whisper.encoder.mels_count: uint32`
- `whisper.encoder.attention.head_count`

- `whisper.decoder.context_length`
@@ -417,13 +426,17 @@ It is not guaranteed to be standardized across models, and may change in the future.
- `gpt2`: GPT-2 / GPT-NeoX style BPE (tokens extracted from HF `tokenizer.json`)
- `rwkv`: RWKV tokenizer
- `tokenizer.ggml.tokens: array[string]`: A list of tokens indexed by the token ID used by the model.
- `tokenizer.ggml.scores: array[float32]`: If present, the score/probability of each token. If not present, all tokens are assumed to have equal probability. Must be the same length as `tokens`.
- `tokenizer.ggml.merges: array[string]`: If present, the merges of the tokenizer. If not present, the tokens are assumed to be atomic.
- `tokenizer.ggml.added_tokens: array[string]`: If present, tokens that were added after training.

##### Special tokens

- `tokenizer.ggml.bos_token_id: uint32`: Beginning of sequence marker
- `tokenizer.ggml.eos_token_id: uint32`: End of sequence marker
- `tokenizer.ggml.unknown_token_id: uint32`: Unknown token
- `tokenizer.ggml.separator_token_id: uint32`: Separator token
- `tokenizer.ggml.padding_token_id: uint32`: Padding token

#### Hugging Face

@@ -472,7 +485,7 @@ These formats share the same fundamental structure:
- metadata about the model, such as the number of layers, the number of heads, etc.
- a `ftype` that describes the type of the majority of the tensors,
- for GGML files, the quantization version is encoded in the `ftype` divided by 1000
- an embedded vocabulary, which is a list of strings with length prepended. The GGMF/GGJT formats embed a float32 score next to the strings.
- finally, a list of tensors with their length-prepended name, type, and (aligned, in the case of GGJT) tensor data
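The vocabulary layout and the `ftype` encoding described above can be sketched as follows. This is illustrative only: it assumes little-endian byte order (as used by the legacy formats) and handles just a single vocabulary entry, not a whole file:

```python
import struct

def read_vocab_entry(buf: bytes, offset: int, has_score: bool):
    """Read one length-prepended vocabulary string.

    GGMF/GGJT additionally store a float32 score after the string bytes;
    set has_score=True for those formats.
    """
    (length,) = struct.unpack_from("<I", buf, offset)  # uint32 string length
    offset += 4
    token = buf[offset:offset + length]
    offset += length
    score = None
    if has_score:
        (score,) = struct.unpack_from("<f", buf, offset)  # float32 score
        offset += 4
    return token, score, offset

def split_ftype(ftype: int) -> tuple[int, int]:
    """For GGML files, the quantization version is encoded in the thousands
    of ftype; the remainder is the base file type."""
    return ftype // 1000, ftype % 1000
```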

Notably, this structure does not identify what model architecture the model belongs to, nor does it offer any flexibility for changing the structure of the hyperparameters. This means that the only way to add new hyperparameters is to add them to the end of the list, which is a breaking change for existing models.