Conversation
Co-authored-by: avirajBevli <bevliaviraj70@gmail.com>
📝 Walkthrough

Added comprehensive support for the Falcon-OCR model architecture across the Python conversion script, the GGUF schema, the C++ core inference code, and the vision projection components. Includes model loading, tensor handling, a custom RoPE implementation, and multimodal vision integration.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User/Application
    participant Loader as Model Loader
    participant Arch as Architecture<br/>(llama-arch)
    participant Inference as Inference Engine<br/>(falcon-ocr.cpp)
    participant Vision as Vision Projector<br/>(CLIP)
    participant GGML as GGML Graph

    User->>Loader: Load GGUF file
    Loader->>Arch: Check LLM_ARCH_FALCON_OCR
    Loader->>Loader: Initialize parameters<br/>(RMS eps, layer count)
    Loader->>Vision: Load vision config<br/>(image pixels, warmup)
    Vision->>Vision: Load mmproj weights<br/>(img_projector, tok_embd)
    Loader->>Inference: Create graph builder
    User->>Vision: Process image
    Vision->>GGML: Build conv2d graph<br/>(patch→embedding)
    Vision->>Vision: Concatenate BOI tokens
    GGML-->>Vision: Patch embeddings + BOI
    User->>Inference: Encode prompt + image
    Inference->>GGML: Embed tokens + image
    Inference->>GGML: Iterate layers:<br/>RMSNorm → QKV proj →<br/>RoPE (2D hybrid) →<br/>Attention → FFN
    Inference->>GGML: RoPE split: 1D normal<br/>+ 2D golden-axis
    GGML->>GGML: Compute with<br/>FP32 FFN accum
    Inference->>GGML: Output norm +<br/>logits projection
    GGML-->>Inference: Inference result
    Inference-->>User: Token logits
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ❌ 3 failed checks (2 warnings, 1 inconclusive)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/mtmd/mtmd.cpp (1)
1295-1303: ⚠️ Potential issue | 🔴 Critical — Make the Falcon-OCR `n_boi` path reachable.

Because Falcon-OCR sets `use_mrope_pos = true` at Lines 785-789, this early return always wins and the new `n_boi` branch never executes. The chunk still reports the generic M-RoPE span instead of the BOI-aware one.

🐛 Minimal fix:

```diff
 llama_pos mtmd_image_tokens_get_n_pos(const mtmd_image_tokens * image_tokens) {
+    if (image_tokens->n_boi > 0) {
+        // Falcon-OCR uses extra BOI temporal positions before the image grid.
+        return image_tokens->n_boi + 1;
+    }
     if (image_tokens->use_mrope_pos) {
         // for M-RoPE, temporal dimension = max(t,h,w)
         // t is omitted as we don't support video input
         return std::max(image_tokens->nx, image_tokens->ny);
     }
-    if (image_tokens->n_boi > 0) {
-        // for falcon-ocr, temporal dimension = n_boi + 1 (for all image tokens)
-        return image_tokens->n_boi + 1;
-    }
     return image_tokens->n_tokens();
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/mtmd/mtmd.cpp` around lines 1295 - 1303, The current branch order causes image_tokens->use_mrope_pos to short-circuit and never reach the Falcon-OCR BOI path; change the logic in the temporal-dimension calculation so that image_tokens->n_boi > 0 is checked before use_mrope_pos (or add a combined condition) so the BOI-aware return (return image_tokens->n_boi + 1) executes for Falcon-OCR; update the block containing image_tokens->use_mrope_pos and image_tokens->n_boi so BOI takes precedence (or only use M-RoPE when n_boi == 0).
🧹 Nitpick comments (3)
src/llama-graph.cpp (1)
1204-1207: Consider centralizing architecture-specific precision policy.

At Line 1204, the inline architecture allowlist keeps growing. Similar precision-gating checks exist elsewhere in this file; extracting a shared helper would reduce drift risk when adding future architectures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/llama-graph.cpp` around lines 1204 - 1207, Extract the repeated architecture-specific precision gating into a single helper (e.g., bool arch_requires_f32_prec(LLM_ARCH arch) or Precision get_precision_for_arch(LLM_ARCH arch)) and use it wherever the file currently inlines checks like the if (arch == LLM_ARCH_GLM4 || ...) before calling ggml_mul_mat_set_prec(cur, GGML_PREC_F32);; replace the ad-hoc allowlist with calls to this helper in functions that set accumulator precision (references: ggml_mul_mat_set_prec usage and the surrounding architecture check), update all similar checks in the file to call the helper, and keep the helper as the single place to add future architectures.

tools/mtmd/clip.cpp (2)
1443-1449: Consider setting `image_resize_algo` if Falcon-OCR requires a specific resize method.

Similar OCR projectors like `PADDLEOCR` explicitly set `hparams.image_resize_algo`. If the default resize algorithm is intentional for Falcon-OCR, this is fine. Otherwise, you may want to specify it here for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/mtmd/clip.cpp` around lines 1443 - 1449, The PROJECTOR_TYPE_FALCON_OCR branch sets n_merge, image_min_pixels, image_max_pixels, and warmup tokens but doesn't set hparams.image_resize_algo; update the Falcon-OCR case (the switch branch for PROJECTOR_TYPE_FALCON_OCR) to explicitly assign hparams.image_resize_algo to the desired resize algorithm (matching what PADDLEOCR uses if appropriate) so the projector uses a known resizing method rather than the implicit default.
2354-2358: Indentation is inconsistent with other cases in this switch statement.

The `case PROJECTOR_TYPE_FALCON_OCR:` block has extra indentation compared to adjacent cases like `PROJECTOR_TYPE_LFM2A` at Line 2308.

🔧 Suggested fix for indentation:

```diff
-        case PROJECTOR_TYPE_FALCON_OCR:
-            {
-                model.mm_0_w = get_tensor(string_format(TN_LLAVA_PROJ, 0, "weight"));
-                model.mm_img_begin = get_tensor(TN_TOK_IMG_BEGIN);
-            } break;
+    case PROJECTOR_TYPE_FALCON_OCR:
+        {
+            model.mm_0_w = get_tensor(string_format(TN_LLAVA_PROJ, 0, "weight"));
+            model.mm_img_begin = get_tensor(TN_TOK_IMG_BEGIN);
+        } break;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/mtmd/clip.cpp` around lines 2354 - 2358, The indentation of the switch case for PROJECTOR_TYPE_FALCON_OCR is inconsistent with adjacent cases; align the case label and its brace/contents to match the other cases (e.g., PROJECTOR_TYPE_LFM2A) by removing the extra leading spaces so the "case PROJECTOR_TYPE_FALCON_OCR:" line and the block that sets model.mm_0_w = get_tensor(string_format(TN_LLAVA_PROJ, 0, \"weight\")) and model.mm_img_begin = get_tensor(TN_TOK_IMG_BEGIN) use the same indentation level and brace placement as the other case blocks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@convert_hf_to_gguf.py`:
- Around line 2649-2655: The rope_dimension_count is currently halved causing a
mismatch with the loader; update the call to
gguf_writer.add_rope_dimension_count to use the full rotary width (use
hparams["head_dim"] instead of hparams["head_dim"] // 2) so the exported
dimension matches how freqs_cis_golden/rope_freqs are emitted and consumed
(check symbols add_rope_dimension_count, add_rope_freq_base, head_dim,
freqs_cis_golden, rope_freqs) and ensure any related emission logic that
computes columns uses n_heads * (head_dim/2) consistently.
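The width arithmetic behind this fix can be stated as a tiny sketch (the numbers below are illustrative, not the actual Falcon-OCR hyperparameters; only the `head_dim` vs `head_dim / 2` relationship mirrors the prompt):

```cpp
#include <cassert>
#include <cstdint>

// rope_dimension_count must carry the full rotary width (head_dim), while a
// per-head frequency table such as freqs_cis_golden stores one entry per
// rotated pair, i.e. head_dim / 2 columns per head.
int64_t rope_dimension_count(int64_t head_dim) {
    return head_dim;                  // full width, not head_dim / 2
}

int64_t freq_table_cols(int64_t n_heads, int64_t head_dim) {
    return n_heads * (head_dim / 2);  // one frequency per rotated pair
}
```

With hypothetical values `head_dim = 128`, `n_heads = 32`, the exported dimension stays 128 while the frequency table has 32 × 64 columns.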
- Around line 2762-2768: The code builds prefix token IDs by tokenizing a
concatenated string (prefix_str) which is fragile; replace this by calling
tokenizer.convert_tokens_to_ids on the individual special-token strings (e.g.
["<|image_cls|>", "<|image_reg_1|>", ...]) and validate each returned id is not
None or the tokenizer's unknown id before indexing data_torch; update the
variable ids and keep the rest (prefix_embd = data_torch[ids].contiguous(),
logging, and yield with
self.format_tensor_name(gguf.MODEL_TENSOR.V_TOK_IMG_BEGIN, suffix="")) intact so
the embedding extraction uses explicit token-to-id mapping rather than
tokenizer.encode on a concatenated string.
In `@src/llama-model.cpp`:
- Around line 7993-8005: The Falcon-OCR build creates layer.wqkv with a
nonstandard width computed as n_embd_qkv (see symbol n_embd_qkv used when
creating LLM_TENSOR_ATTN_QKV), but llama_meta_device_get_split_state() still
assumes pattern_qkv_weight rows equal n_embd + 2 * n_embd_gqa; add a
LLM_ARCH_FALCON_OCR branch inside llama_meta_device_get_split_state() that
recognizes pattern_qkv_weight for Falcon-OCR and computes the expected width
from the same formula (use the n_embd_qkv/related fields: n_embd_head_k, n_head,
n_embd_k_gqa, n_embd_v_gqa and n_head_kv_ratio) so the assertion and
row-split/meta-load offsets use n_embd_qkv instead of the generic n_embd +
2*n_embd_gqa.
- Line 8007: The attn_sinks tensor is being created without the "weight" suffix,
so Falcon-OCR's sink loader will miss it; update the creation in the layer
initialization (the call to create_tensor for layer.attn_sinks that uses
tn(LLM_TENSOR_ATTN_SINKS, i)) to use the "weight" suffixed name (i.e. the same
naming pattern used elsewhere such as the sink-loading path at the earlier sink
load), so that layer.attn_sinks maps to the tensor named "attn_sinks.weight" and
will be found by the existing sink-loading logic.
In `@src/models/falcon-ocr.cpp`:
- Around line 139-147: The packed Q/K/V views (Qcur, Kcur, Vcur using
ggml_view_3d) are using per-section sizes instead of cumulative byte offsets so
Kcur and Vcur point to the wrong memory; fix by passing cumulative row-byte
offsets to ggml_view_3d: set Qcur offset to 0, Kcur offset to
ggml_row_size(cur->type, n_embd_q) (i.e. byte size after Q), and Vcur offset to
ggml_row_size(cur->type, n_embd_q + n_embd_k) (byte size after Q+K). Update the
calls that construct Qcur, Kcur, Vcur (and references to cur, ggml_view_3d,
ggml_row_size, n_embd_q, n_embd_k, n_embd_v) accordingly.
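The cumulative-offset rule this prompt describes can be sanity-checked in isolation. A minimal sketch with plain byte math, where a float row size stands in for `ggml_row_size` (the `n_embd_q/k/v` names mirror the review; everything else is illustrative):

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for ggml_row_size(type, n): bytes occupied by n contiguous f32 values.
static size_t row_size_f32(size_t n) { return n * sizeof(float); }

// Byte offsets of the Q/K/V sections inside one packed QKV row.
// Each view must start where the previous section ends (cumulative),
// not at the byte size of its own section.
struct qkv_offsets {
    size_t q, k, v;
};

static qkv_offsets packed_qkv_offsets(size_t n_embd_q, size_t n_embd_k) {
    qkv_offsets off;
    off.q = 0;                                   // Q starts at the beginning
    off.k = row_size_f32(n_embd_q);              // K starts right after Q
    off.v = row_size_f32(n_embd_q + n_embd_k);   // V starts right after Q + K
    return off;
}
```

Passing these three offsets to the respective `ggml_view_3d` calls is what the fix amounts to; the bug was using per-section sizes in place of the running total.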
In `@tools/mtmd/models/falcon-ocr.cpp`:
- Around line 15-18: The conv result stored in cur uses model.mm_0_w but never
applies model.mm_0_b, so add the projector bias after ggml_conv_2d: create a
ggml_tensor for the bias (reference model.mm_0_b), reshape/broadcast it to match
cur's spatial/channel shape and then ggml_add(cur, bias_tensor) (or equivalent)
so that the output of ggml_conv_2d (cur) is summed with the properly shaped
mm_0_b; locate proj_w, ggml_conv_2d, model.mm_0_w, model.mm_0_b, inp_raw and cur
to implement this bias-add immediately after the conv.
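The bias-add described above amounts to broadcasting a per-channel vector over the conv output. A layout-level sketch, independent of ggml (the channel-major layout is an assumption for illustration; inside the graph, `ggml_add` with a broadcastable bias tensor performs the equivalent):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Broadcast a per-channel bias over a conv output stored channel-major:
// out[c * spatial + s] holds channel c at spatial index s.
static void add_channel_bias(std::vector<float> & out,
                             const std::vector<float> & bias,
                             size_t n_channels, size_t spatial) {
    for (size_t c = 0; c < n_channels; ++c) {
        for (size_t s = 0; s < spatial; ++s) {
            out[c * spatial + s] += bias[c];  // same bias for every spatial position
        }
    }
}
```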
ℹ️ Review info
- ⚙️ Configuration used: defaults
- Review profile: CHILL
- Plan: Pro
- Run ID: 2fbfa96f-f0c7-4c1c-b37b-b7ceed3f5a71
📒 Files selected for processing (19)
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- src/CMakeLists.txt
- src/llama-arch.cpp
- src/llama-arch.h
- src/llama-context.cpp
- src/llama-graph.cpp
- src/llama-kv-cache.cpp
- src/llama-model.cpp
- src/models/falcon-ocr.cpp
- src/models/models.h
- tools/mtmd/CMakeLists.txt
- tools/mtmd/clip-impl.h
- tools/mtmd/clip.cpp
- tools/mtmd/clip.h
- tools/mtmd/models/falcon-ocr.cpp
- tools/mtmd/models/models.h
- tools/mtmd/mtmd.cpp
```cpp
cb(Kcur, "Kcur_normed", il);

// repeat K and V to match shape of Q (required by rope_falcon)
Kcur = ggml_repeat(ctx0, Kcur, Qcur);
```
Please note that this is not the format expected during GQA KV head expansion. This is doing repeat in a block repeat fashion. We want to repeat heads in an interleaved fashion as such:
```cpp
// GQA expansion: interleave K/V heads to match Q head count.
// Project from 3D -> 4D to enable head repetition within group
// (e.g. for n_rep=2 => [h0,h0,h1,h1,...] not [h0,h1,..,h0,h1,...]).
{
    Kcur = ggml_cont(ctx0, Kcur);
    Vcur = ggml_cont(ctx0, Vcur);
    Kcur = ggml_reshape_4d(ctx0, Kcur, n_embd_head, 1, n_head_kv, n_tokens);
    Vcur = ggml_reshape_4d(ctx0, Vcur, n_embd_head, 1, n_head_kv, n_tokens);
    ggml_tensor * K_tgt = ggml_new_tensor_4d(ctx0, Kcur->type,
            n_embd_head, n_head_kv_ratio, n_head_kv, n_tokens);
    ggml_tensor * V_tgt = ggml_new_tensor_4d(ctx0, Vcur->type,
            n_embd_head, n_head_kv_ratio, n_head_kv, n_tokens);
    Kcur = ggml_repeat(ctx0, Kcur, K_tgt);
    Vcur = ggml_repeat(ctx0, Vcur, V_tgt);
    Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head, n_tokens);
    Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head, n_tokens);
}
```
Please refer to https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/model.py#L73C1-L83C6
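The difference between the two expansion orders can be checked with plain head indices, independent of ggml. A sketch (the `n_head_kv` and `n_rep` values in the test are arbitrary):

```cpp
#include <cassert>
#include <vector>

// Which KV head serves each Q head under the two expansion schemes.
// Block repeat (what a naive repeat over the head axis produces):
//   the whole KV head list is tiled, e.g. [h0, h1, h0, h1].
std::vector<int> block_repeat(int n_head_kv, int n_rep) {
    std::vector<int> out;
    for (int r = 0; r < n_rep; ++r)
        for (int h = 0; h < n_head_kv; ++h)
            out.push_back(h);
    return out;
}

// Interleaved repeat (what GQA requires): each KV head is duplicated
// for its contiguous group of Q heads, e.g. [h0, h0, h1, h1].
std::vector<int> interleaved_repeat(int n_head_kv, int n_rep) {
    std::vector<int> out;
    for (int h = 0; h < n_head_kv; ++h)
        for (int r = 0; r < n_rep; ++r)
            out.push_back(h);
    return out;
}
```

The 3D→4D reshape before `ggml_repeat` in the snippet above is what turns the block pattern into the interleaved one.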
```cpp
    } break;
case PROJECTOR_TYPE_FALCON_OCR:
    {
        image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);
```
Please note that this preprocessing function is not quite correct. Please refer to ggml-org#21045 (comment).

We need to use a custom preprocess function:

```cpp
image_preproc = std::make_unique<mtmd_image_preprocessor_falcon_ocr>(ctx_v);
```

And in mtmd-image.cpp, we can use the custom preprocess function:
```cpp
//
// mtmd_image_preprocessor_falcon_ocr
//
bool mtmd_image_preprocessor_falcon_ocr::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {
    const int   max_dim = hparams.image_size;
    const int   min_dim = (int)std::sqrt((float)hparams.image_min_pixels);
    const int   ps      = hparams.patch_size;
    const float aspect  = (float)img.nx / (float)img.ny;

    int new_w = img.nx;
    int new_h = img.ny;

    if (min_dim <= new_w && new_w <= max_dim &&
        min_dim <= new_h && new_h <= max_dim) {
        // fits, no resize needed
    } else {
        const bool is_vertical = new_w < new_h;
        if (new_w < min_dim || new_h < min_dim) {
            if (is_vertical) {
                new_w = min_dim;
                new_h = (int)(new_w / aspect);
            } else {
                new_h = min_dim;
                new_w = (int)(new_h * aspect);
            }
        } else {
            if (is_vertical) {
                new_w = max_dim;
                new_h = (int)(new_w / aspect);
            } else {
                new_h = max_dim;
                new_w = (int)(new_h * aspect);
            }
        }
        if (new_w > max_dim) {
            new_w = max_dim;
            new_h = (int)(new_w / aspect);
        }
        if (new_h > max_dim) {
            new_h = max_dim;
            new_w = (int)(new_h * aspect);
        }
    }

    clip_image_u8 intermediate;
    const clip_image_size intermediate_size = {new_w, new_h};
    img_tool::resize(img, intermediate, intermediate_size, RESIZE_ALGO_BICUBIC, false);

    GGML_ASSERT(hparams.image_min_pixels > 0 && hparams.image_max_pixels > 0);
    const clip_image_size final_size = img_tool::calc_size_preserved_ratio(
        intermediate_size, ps,
        hparams.image_min_pixels, hparams.image_max_pixels);

    clip_image_u8 resized;
    img_tool::resize(intermediate, resized, final_size, RESIZE_ALGO_BICUBIC, false);

    clip_image_f32_ptr img_f32(clip_image_f32_init());
    img_u8_to_f32(resized, *img_f32, hparams.image_mean, hparams.image_std);
    output.entries.push_back(std::move(img_f32));
    return true;
}
```
```cpp
if (n_boi > 0) {
    // falcon-ocr style
    if (i < n_boi) {
        pos.t = i;
```
Please note that this should be pos.t = 0
(related to ggml-org#21045 (comment))
```cpp
pos.y = 0;
} else {
    size_t idx = i - n_boi;
    pos.t = n_boi; // all image tokens share the same temporal pos
```
Please note that this should be pos.t = 0
(related to ggml-org#21045 (comment))
```cpp
if (image_tokens->n_boi > 0) {
    // for falcon-ocr, temporal dimension = n_boi + 1 (for all image tokens)
    return image_tokens->n_boi + 1;
}
```
Please note this seems incorrect
Please refer to https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/data.py#L330-L357
```cpp
} else {
    size_t idx = i - n_boi;
    pos.t = n_boi; // all image tokens share the same temporal pos
    pos.x = idx % image_tokens->nx;
```
Please note that for the falcon ocr model, this is not the correct calculation for pos_h, pos_w. It uses normalized coordinates. Please refer to https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/data.py#L330-L357

This is one of the reasons why it is hard to use the existing mrope infra for falcon ocr. The other reasons are:

- In `mtmd_helper.cpp`, `set_position_mrope_2d()` adds an offset of pos_0 to pos_h, pos_w for image tokens, which is not the case for falcon ocr.
- `mtmd_decoder_pos` has integer fields, but falcon ocr expects normalized values of pos_h, pos_w, so it is difficult to use `mtmd_decoder_pos` to represent falcon ocr positions. Please refer to https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/data.py#L330-L357
- In `llama-graph.cpp`, `llm_graph_input_pos::set_input` sets pos_h = pos_w = pos_t for text tokens, but for falcon ocr, pos_h = pos_w = 0 for text tokens.
Please note that I verified the correctness of this branch by incorporating the comments I mentioned in this PR, i.e. I can run correct inference. But since I am not able to use the existing mrope infra for falcon ocr without some modification to handle the issues mentioned above, I did not raise a PR with my changes (including the custom rope code), because that might not be a clean implementation.

Please let me know if you want me to raise a PR anyway for you to review, or if you have any thoughts or suggestions.
Looking forward to your inputs @ngxson !
---
Thanks for having a look. To answer your comments:

> In mtmd_helper.cpp, set_position_mrope_2d() adds offset of pos_0 to pos_h, pos_w for image tokens. Which is not the case for falcon ocr.

Because the media chunk is the only thing being processed, pos_0 will always be 0, making it aligned with the pytorch version. (Note: I now acknowledge that pos_0 must stay as-is, I will update that.)

> mtmd_decoder_pos has integer fields. But falcon ocr expects normalized values of pos_h, pos_w. So it is difficult to use mtmd_decoder_pos to represent falcon_ocr positions. Please refer to https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/data.py#L330-L357

Hmm ok, so please confirm if I understand correctly here: because the positional indexes are floats instead of ints (confirmed by the pytorch code that you mentioned), you have to use the SPATIAL_3D_POS_SCALE, then cancel it out inside the text model's RoPE, right? In that case, we can simply copy the code that I quoted in this comment: ggml-org#21045 (comment) and bring it into get_decoder_pos. Does that sound good to you?

> In llama-graph.cpp, llm_graph_input_pos::set_input sets pos_h = pos_w = pos_t for text tokens. But for falcon ocr, pos_h = pos_w = 0 for text tokens.

Yes, but get_decoder_pos will simply return 0 for all text tokens; I will update that part.
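The scale-then-cancel idea discussed here can be sketched without any llama.cpp internals. Note the scale value below is only an assumption for illustration; what matters is the round-trip pattern of packing a normalized float position into an integer position field and undoing it before the rotation:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative scale factor — the real SPATIAL_3D_POS_SCALE in the codebase
// may differ; 16384 is an assumption for this sketch.
constexpr int32_t SPATIAL_3D_POS_SCALE = 16384;

// Encode a normalized float position in [0, 1] into an integer position field.
int32_t encode_pos(float normalized) {
    return (int32_t) std::lround(normalized * SPATIAL_3D_POS_SCALE);
}

// Cancel the scale inside the text model's RoPE before applying rotations.
float decode_pos(int32_t scaled) {
    return (float) scaled / (float) SPATIAL_3D_POS_SCALE;
}
```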
---

One more question, following up on your comment ggml-org#21045 (comment): what is the spatial position of the first generated token? I.e. the first token right after the image, that will be generated by the LLM.