
(WIP) falcon-ocr, for discussion #100

Draft

ngxson wants to merge 5 commits into master from xsn/falcon-ocr

Conversation

@ngxson (Owner) commented Apr 14, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for Falcon-OCR model architecture with full inference capability.
    • Enabled Falcon-OCR model conversion from Hugging Face format to GGUF.
    • Added multimodal vision projector support for Falcon-OCR.
  • Improvements

    • Optimized matrix-multiplication precision and KV-cache handling for Falcon-OCR.
    • Enhanced positional embedding calculations for multimodal tasks.

@ngxson marked this pull request as draft April 14, 2026 23:11
@coderabbitai bot commented Apr 14, 2026

📝 Walkthrough

Added comprehensive support for the Falcon-OCR model architecture across Python conversion, GGUF schema, C++ core inference, and vision projection components. Includes model loading, tensor handling, custom RoPE implementation, and multimodal vision integration.

Changes

• Python HuggingFace Conversion (convert_hf_to_gguf.py)
  Added FalconOCRModel and FalconOCRMmprojModel classes to handle conversion of Falcon-OCR text and vision-projection models from HF format, with custom tensor transformations (RoPE frequency scaling, feed-forward deinterleaving, weight scaling, image projector conversion).
• GGUF Schema & Constants (gguf-py/gguf/constants.py, gguf-py/gguf/tensor_mapping.py)
  Extended the MODEL_ARCH enum and MODEL_ARCH_NAMES mapping with FALCON_OCR; registered the tensor schema (TOKEN_EMBD, ATTN_QKV, ATTN_SINKS, FFN layers, etc.); added the VisionProjectorType.FALCON_OCR constant and Falcon-OCR-specific tensor name mappings (attention wqkv, sinks).
• Architecture Enumerations (src/llama-arch.h, src/llama-arch.cpp)
  Added the LLM_ARCH_FALCON_OCR enum entry and registered the string name "falcon-ocr" for architecture identification (see the sketch after this table).
• Core Model Integration (src/llama-model.cpp, src/llama-context.cpp, src/llama-graph.cpp, src/llama-kv-cache.cpp)
  Added Falcon-OCR handling: parameter loading (RMS epsilon, layer count detection), tensor setup, graph-building dispatch, special-case graph node sizing, FP32 FFN accumulation, and disabled K-shift support.
• Falcon-OCR Inference Engine (src/models/falcon-ocr.cpp, src/models/models.h)
  Implemented the llm_build_falcon_ocr graph builder with custom RoPE helpers (rope_golden_axis, rope_falcon) that apply split-head 2D+1D hybrid rotations; handles token embedding, RMSNorm, LoRA-augmented attention, gated FFN, and output projection.
• Vision Projection Pipeline (tools/mtmd/clip.h, tools/mtmd/clip.cpp, tools/mtmd/clip-impl.h, tools/mtmd/models/models.h, tools/mtmd/models/falcon-ocr.cpp)
  Added the PROJECTOR_TYPE_FALCON_OCR enum and name mapping; implemented the vision loader (image token range, warmup tokens, mmproj weight loading); added the clip_get_n_boi() helper; implemented clip_graph_falcon_ocr to perform a 2D convolution projection of image patches with begin-of-image token concatenation.
• Multimodal Integration (tools/mtmd/mtmd.cpp)
  Extended mtmd_image_tokens with an n_boi field for the begin-of-image token count; updated token counting, positional encoding (temporal/spatial for the Falcon-OCR scheme), and decoder feature gating (non-causal and M-RoPE modes for Falcon-OCR).
• Build Configuration (src/CMakeLists.txt, tools/mtmd/CMakeLists.txt)
  Added models/falcon-ocr.cpp to both the llama and mtmd library source lists.
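
For orientation, here is a minimal sketch of the two registration points in the Architecture Enumerations row above; the surrounding entries and the exact map shape follow llama.cpp's existing conventions and are illustrative, not copied from the diff.

#include <map>

// src/llama-arch.h: sketch of the new enum entry
enum llm_arch {
    // ... existing architectures ...
    LLM_ARCH_FALCON_OCR,
    LLM_ARCH_UNKNOWN,
};

// src/llama-arch.cpp: sketch of the string name used to identify the arch in GGUF metadata
static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
    // ... existing mappings ...
    { LLM_ARCH_FALCON_OCR, "falcon-ocr" },
    { LLM_ARCH_UNKNOWN,    "(unknown)"  },
};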

Sequence Diagram

sequenceDiagram
    participant User as User/Application
    participant Loader as Model Loader
    participant Arch as Architecture<br/>(llama-arch)
    participant Inference as Inference Engine<br/>(falcon-ocr.cpp)
    participant Vision as Vision Projector<br/>(CLIP)
    participant GGML as GGML Graph
    
    User->>Loader: Load GGUF file
    Loader->>Arch: Check LLM_ARCH_FALCON_OCR
    Loader->>Loader: Initialize parameters<br/>(RMS eps, layer count)
    Loader->>Vision: Load vision config<br/>(image pixels, warmup)
    Vision->>Vision: Load mmproj weights<br/>(img_projector, tok_embd)
    Loader->>Inference: Create graph builder
    
    User->>Vision: Process image
    Vision->>GGML: Build conv2d graph<br/>(patch→embedding)
    Vision->>Vision: Concatenate BOI tokens
    GGML-->>Vision: Patch embeddings + BOI
    
    User->>Inference: Encode prompt + image
    Inference->>GGML: Embed tokens + image
    Inference->>GGML: Iterate layers:<br/>RMSNorm → QKV proj →<br/>RoPE (2D hybrid) →<br/>Attention → FFN
    Inference->>GGML: RoPE split: 1D normal<br/>+ 2D golden-axis
    GGML->>GGML: Compute with<br/>FP32 FFN accum
    Inference->>GGML: Output norm +<br/>logits projection
    GGML-->>Inference: Inference result
    Inference-->>User: Token logits

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A falcon soars with ocr sight,
New RoPE winds twist left and right,
Two-D golden paths converge with care,
Vision projectors float on air,
Image tokens dance and multiply—
Multimodal magic reaches sky!

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

• Description check: ⚠️ Warning
  Explanation: The pull request description is entirely missing; no content was provided in the description field, and the required sections from the template are absent.
  Resolution: Add a complete PR description following the template: provide an Overview section explaining what this PR does and why, include any additional information or related discussions, and complete the Requirements checklist.
• Docstring Coverage: ⚠️ Warning
  Explanation: Docstring coverage is 8.89%, below the required threshold of 80.00%.
  Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
• Title check: ❓ Inconclusive
  Explanation: The title is vague and generic, using "(WIP)" and "for discussion" without clearly describing the actual changes being made to the codebase.
  Resolution: Replace it with a specific, descriptive title that clearly indicates the main change (e.g., "Add Falcon-OCR model support" or "Implement Falcon-OCR architecture and conversion").



@coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/mtmd/mtmd.cpp (1)

1295-1303: ⚠️ Potential issue | 🔴 Critical

Make the Falcon-OCR n_boi path reachable.

Because Falcon-OCR sets use_mrope_pos = true at Lines 785-789, this early return always wins and the new n_boi branch never executes. The chunk still reports the generic M-RoPE span instead of the BOI-aware one.

🐛 Minimal fix
 llama_pos mtmd_image_tokens_get_n_pos(const mtmd_image_tokens * image_tokens) {
+    if (image_tokens->n_boi > 0) {
+        // Falcon-OCR uses extra BOI temporal positions before the image grid.
+        return image_tokens->n_boi + 1;
+    }
     if (image_tokens->use_mrope_pos) {
         // for M-RoPE, temporal dimension = max(t,h,w)
         // t is omitted as we don't support video input
         return std::max(image_tokens->nx, image_tokens->ny);
     }
-    if (image_tokens->n_boi > 0) {
-        // for falcon-ocr, temporal dimension = n_boi + 1 (for all image tokens)
-        return image_tokens->n_boi + 1;
-    }
     return image_tokens->n_tokens();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/mtmd/mtmd.cpp` around lines 1295 - 1303, The current branch order
causes image_tokens->use_mrope_pos to short-circuit and never reach the
Falcon-OCR BOI path; change the logic in the temporal-dimension calculation so
that image_tokens->n_boi > 0 is checked before use_mrope_pos (or add a combined
condition) so the BOI-aware return (return image_tokens->n_boi + 1) executes for
Falcon-OCR; update the block containing image_tokens->use_mrope_pos and
image_tokens->n_boi so BOI takes precedence (or only use M-RoPE when n_boi ==
0).
🧹 Nitpick comments (3)
src/llama-graph.cpp (1)

1204-1207: Consider centralizing architecture-specific precision policy.

At Line 1204, the inline architecture allowlist keeps growing. Similar precision-gating checks exist elsewhere in this file; extracting a shared helper would reduce drift risk when adding future architectures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llama-graph.cpp` around lines 1204 - 1207, Extract the repeated
architecture-specific precision gating into a single helper (e.g., bool
arch_requires_f32_prec(LLM_ARCH arch) or Precision
get_precision_for_arch(LLM_ARCH arch)) and use it wherever the file currently
inlines checks like the if (arch == LLM_ARCH_GLM4 || ...) before calling
ggml_mul_mat_set_prec(cur, GGML_PREC_F32);; replace the ad-hoc allowlist with
calls to this helper in functions that set accumulator precision (references:
ggml_mul_mat_set_prec usage and the surrounding architecture check), update all
similar checks in the file to call the helper, and keep the helper as the single
place to add future architectures.
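
A minimal sketch of what such a helper could look like, assuming the existing inline check gates on the model architecture; the function name and the exact architecture list here are illustrative, not taken from the PR.

// Hypothetical helper: one place that decides which architectures need
// F32 accumulation for this mul_mat.
static bool arch_requires_f32_prec(llm_arch arch) {
    switch (arch) {
        case LLM_ARCH_GLM4:        // gated inline today
        case LLM_ARCH_FALCON_OCR:  // added by this PR
            return true;
        default:
            return false;
    }
}

// call site, replacing the inline allowlist:
// if (arch_requires_f32_prec(arch)) {
//     ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
// }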
tools/mtmd/clip.cpp (2)

1443-1449: Consider setting image_resize_algo if Falcon-OCR requires a specific resize method.

Similar OCR projectors like PADDLEOCR explicitly set hparams.image_resize_algo. If the default resize algorithm is intentional for Falcon-OCR, this is fine. Otherwise, you may want to specify it here for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/mtmd/clip.cpp` around lines 1443 - 1449, The PROJECTOR_TYPE_FALCON_OCR
branch sets n_merge, image_min_pixels, image_max_pixels, and warmup tokens but
doesn't set hparams.image_resize_algo; update the Falcon-OCR case (the switch
branch for PROJECTOR_TYPE_FALCON_OCR) to explicitly assign
hparams.image_resize_algo to the desired resize algorithm (matching what
PADDLEOCR uses if appropriate) so the projector uses a known resizing method
rather than the implicit default.

2354-2358: Indentation is inconsistent with other cases in this switch statement.

The case PROJECTOR_TYPE_FALCON_OCR: block has extra indentation compared to adjacent cases like PROJECTOR_TYPE_LFM2A at Line 2308.

🔧 Suggested fix for indentation
-                case PROJECTOR_TYPE_FALCON_OCR:
-                    {
-                        model.mm_0_w       = get_tensor(string_format(TN_LLAVA_PROJ, 0, "weight"));
-                        model.mm_img_begin = get_tensor(TN_TOK_IMG_BEGIN);
-                    } break;
+            case PROJECTOR_TYPE_FALCON_OCR:
+                {
+                    model.mm_0_w       = get_tensor(string_format(TN_LLAVA_PROJ, 0, "weight"));
+                    model.mm_img_begin = get_tensor(TN_TOK_IMG_BEGIN);
+                } break;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/mtmd/clip.cpp` around lines 2354 - 2358, The indentation of the switch
case for PROJECTOR_TYPE_FALCON_OCR is inconsistent with adjacent cases; align
the case label and its brace/contents to match the other cases (e.g.,
PROJECTOR_TYPE_LFM2A) by removing the extra leading spaces so the "case
PROJECTOR_TYPE_FALCON_OCR:" line and the block that sets model.mm_0_w =
get_tensor(string_format(TN_LLAVA_PROJ, 0, "weight")) and model.mm_img_begin =
get_tensor(TN_TOK_IMG_BEGIN) use the same indentation level and brace placement
as the other case blocks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@convert_hf_to_gguf.py`:
- Around line 2649-2655: The rope_dimension_count is currently halved causing a
mismatch with the loader; update the call to
gguf_writer.add_rope_dimension_count to use the full rotary width (use
hparams["head_dim"] instead of hparams["head_dim"] // 2) so the exported
dimension matches how freqs_cis_golden/rope_freqs are emitted and consumed
(check symbols add_rope_dimension_count, add_rope_freq_base, head_dim,
freqs_cis_golden, rope_freqs) and ensure any related emission logic that
computes columns uses n_heads * (head_dim/2) consistently.
- Around line 2762-2768: The code builds prefix token IDs by tokenizing a
concatenated string (prefix_str) which is fragile; replace this by calling
tokenizer.convert_tokens_to_ids on the individual special-token strings (e.g.
["<|image_cls|>", "<|image_reg_1|>", ...]) and validate each returned id is not
None or the tokenizer's unknown id before indexing data_torch; update the
variable ids and keep the rest (prefix_embd = data_torch[ids].contiguous(),
logging, and yield with
self.format_tensor_name(gguf.MODEL_TENSOR.V_TOK_IMG_BEGIN, suffix="")) intact so
the embedding extraction uses explicit token-to-id mapping rather than
tokenizer.encode on a concatenated string.

In `@src/llama-model.cpp`:
- Around line 7993-8005: The Falcon-OCR build creates layer.wqkv with a
nonstandard width computed as n_embd_qkv (see symbol n_embd_qkv used when
creating LLM_TENSOR_ATTN_QKV), but llama_meta_device_get_split_state() still
assumes pattern_qkv_weight rows equal n_embd + 2 * n_embd_gqa; add a
LLM_ARCH_FALCON_OCR branch inside llama_meta_device_get_split_state() that
recognizes pattern_qkv_weight for Falcon-OCR and computes the expected width
from the same formula (use the n_embd_qkv/related fields: n_embd_head_k, n_head,
n_embd_k_gqa, n_embd_v_gqa and n_head_kv_ratio) so the assertion and
row-split/meta-load offsets use n_embd_qkv instead of the generic n_embd +
2*n_embd_gqa.
- Line 8007: The attn_sinks tensor is being created without the "weight" suffix,
so Falcon-OCR's sink loader will miss it; update the creation in the layer
initialization (the call to create_tensor for layer.attn_sinks that uses
tn(LLM_TENSOR_ATTN_SINKS, i)) to use the "weight" suffixed name (i.e. the same
naming pattern used elsewhere such as the sink-loading path at the earlier sink
load), so that layer.attn_sinks maps to the tensor named "attn_sinks.weight" and
will be found by the existing sink-loading logic.

In `@src/models/falcon-ocr.cpp`:
- Around line 139-147: The packed Q/K/V views (Qcur, Kcur, Vcur using
ggml_view_3d) are using per-section sizes instead of cumulative byte offsets so
Kcur and Vcur point to the wrong memory; fix by passing cumulative row-byte
offsets to ggml_view_3d: set Qcur offset to 0, Kcur offset to
ggml_row_size(cur->type, n_embd_q) (i.e. byte size after Q), and Vcur offset to
ggml_row_size(cur->type, n_embd_q + n_embd_k) (byte size after Q+K). Update the
calls that construct Qcur, Kcur, Vcur (and references to cur, ggml_view_3d,
ggml_row_size, n_embd_q, n_embd_k, n_embd_v) accordingly (see the sketch after this list).

In `@tools/mtmd/models/falcon-ocr.cpp`:
- Around line 15-18: The conv result stored in cur uses model.mm_0_w but never
applies model.mm_0_b, so add the projector bias after ggml_conv_2d: create a
ggml_tensor for the bias (reference model.mm_0_b), reshape/broadcast it to match
cur's spatial/channel shape and then ggml_add(cur, bias_tensor) (or equivalent)
so that the output of ggml_conv_2d (cur) is summed with the properly shaped
mm_0_b; locate proj_w, ggml_conv_2d, model.mm_0_w, model.mm_0_b, inp_raw and cur
to implement this bias-add immediately after the conv (also sketched after this list).
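
Two of the findings above lend themselves to short sketches. First, the cumulative-offset fix for the packed QKV views in src/models/falcon-ocr.cpp; the stride arguments and variable names here are assumptions based on the finding and on how similar views are built elsewhere in llama.cpp, not the actual diff:

// Hypothetical corrected views over the packed QKV tensor `cur`
// (rows laid out as [Q | K | V]); only the byte offsets change:
ggml_tensor * Qcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head,    n_tokens,
        ggml_row_size(cur->type, n_embd_head), cur->nb[1],
        0);                                             // Q starts at offset 0
ggml_tensor * Kcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head_kv, n_tokens,
        ggml_row_size(cur->type, n_embd_head), cur->nb[1],
        ggml_row_size(cur->type, n_embd_q));            // skip the Q bytes
ggml_tensor * Vcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head_kv, n_tokens,
        ggml_row_size(cur->type, n_embd_head), cur->nb[1],
        ggml_row_size(cur->type, n_embd_q + n_embd_k)); // skip the Q+K bytes

Second, the missing bias-add in tools/mtmd/models/falcon-ocr.cpp; the conv parameters (s0 through d1) are placeholders for whatever the surrounding code passes:

// ggml_conv_2d output is [W, H, OC, N], so the per-output-channel bias is
// reshaped to [1, 1, OC] to broadcast over the spatial dims:
cur = ggml_conv_2d(ctx0, model.mm_0_w, inp_raw, s0, s1, p0, p1, d0, d1);
ggml_tensor * bias = ggml_reshape_3d(ctx0, model.mm_0_b, 1, 1, model.mm_0_b->ne[0]);
cur = ggml_add(ctx0, cur, bias);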


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2fbfa96f-f0c7-4c1c-b37b-b7ceed3f5a71

📥 Commits

Reviewing files that changed from the base of the PR and between 707c0b7 and e85f231.

📒 Files selected for processing (19)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • src/CMakeLists.txt
  • src/llama-arch.cpp
  • src/llama-arch.h
  • src/llama-context.cpp
  • src/llama-graph.cpp
  • src/llama-kv-cache.cpp
  • src/llama-model.cpp
  • src/models/falcon-ocr.cpp
  • src/models/models.h
  • tools/mtmd/CMakeLists.txt
  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/clip.h
  • tools/mtmd/models/falcon-ocr.cpp
  • tools/mtmd/models/models.h
  • tools/mtmd/mtmd.cpp

Comment thread src/models/falcon-ocr.cpp
cb(Kcur, "Kcur_normed", il);

// repeat K and V to match shape of Q (required by rope_falcon)
Kcur = ggml_repeat(ctx0, Kcur, Qcur);

Please note that this is not the layout expected during GQA KV head expansion: this performs the repeat in a block fashion, whereas we want to repeat heads in an interleaved fashion, as such:

// GQA expansion: interleave K/V heads to match Q head count.
// Project from 3D -> 4D to enable head repetition within group
// (e.g. for n_rep=2 => [h0,h0,h1,h1,...] not [h0,h1,...,h0,h1,...]).
{
    Kcur = ggml_cont(ctx0, Kcur);
    Vcur = ggml_cont(ctx0, Vcur);
    Kcur = ggml_reshape_4d(ctx0, Kcur, n_embd_head, 1, n_head_kv, n_tokens);
    Vcur = ggml_reshape_4d(ctx0, Vcur, n_embd_head, 1, n_head_kv, n_tokens);

    ggml_tensor * K_tgt = ggml_new_tensor_4d(ctx0, Kcur->type,
            n_embd_head, n_head_kv_ratio, n_head_kv, n_tokens);
    ggml_tensor * V_tgt = ggml_new_tensor_4d(ctx0, Vcur->type,
            n_embd_head, n_head_kv_ratio, n_head_kv, n_tokens);

    Kcur = ggml_repeat(ctx0, Kcur, K_tgt);
    Vcur = ggml_repeat(ctx0, Vcur, V_tgt);

    Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head, n_tokens);
    Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head, n_tokens);
}

Please refer to https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/model.py#L73C1-L83C6

Comment thread tools/mtmd/mtmd.cpp
    } break;
case PROJECTOR_TYPE_FALCON_OCR:
    {
        image_preproc = std::make_unique<mtmd_image_preprocessor_dyn_size>(ctx_v);

Please note that this preprocessing function is not quite correct. Please refer to ggml-org#21045 (comment)

We need to use a custom preprocess function:

image_preproc = std::make_unique<mtmd_image_preprocessor_falcon_ocr>(ctx_v);

And in mtmd-image.cpp, implement that preprocess function:

//
// mtmd_image_preprocessor_falcon_ocr
//

bool mtmd_image_preprocessor_falcon_ocr::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {
    const int max_dim = hparams.image_size;
    const int min_dim = (int)std::sqrt((float)hparams.image_min_pixels);
    const int ps      = hparams.patch_size;

    const float aspect = (float)img.nx / (float)img.ny;
    int new_w = img.nx;
    int new_h = img.ny;

    if (min_dim <= new_w && new_w <= max_dim &&
        min_dim <= new_h && new_h <= max_dim) {
        // fits, no resize needed
    } else {
        const bool is_vertical = new_w < new_h;
        if (new_w < min_dim || new_h < min_dim) {
            if (is_vertical) {
                new_w = min_dim;
                new_h = (int)(new_w / aspect);
            } else {
                new_h = min_dim;
                new_w = (int)(new_h * aspect);
            }
        } else {
            if (is_vertical) {
                new_w = max_dim;
                new_h = (int)(new_w / aspect);
            } else {
                new_h = max_dim;
                new_w = (int)(new_h * aspect);
            }
        }
        if (new_w > max_dim) {
            new_w = max_dim;
            new_h = (int)(new_w / aspect);
        }
        if (new_h > max_dim) {
            new_h = max_dim;
            new_w = (int)(new_h * aspect);
        }
    }
    clip_image_u8 intermediate;
    const clip_image_size intermediate_size = {new_w, new_h};
    img_tool::resize(img, intermediate, intermediate_size, RESIZE_ALGO_BICUBIC, false);

    GGML_ASSERT(hparams.image_min_pixels > 0 && hparams.image_max_pixels > 0);
    const clip_image_size final_size = img_tool::calc_size_preserved_ratio(
        intermediate_size, ps,
        hparams.image_min_pixels, hparams.image_max_pixels);

    clip_image_u8 resized;
    img_tool::resize(intermediate, resized, final_size, RESIZE_ALGO_BICUBIC, false);

    clip_image_f32_ptr img_f32(clip_image_f32_init());
    img_u8_to_f32(resized, *img_f32, hparams.image_mean, hparams.image_std);
    output.entries.push_back(std::move(img_f32));
    return true;
}

Comment thread tools/mtmd/mtmd.cpp
if (n_boi > 0) {
    // falcon-ocr style
    if (i < n_boi) {
        pos.t = i;

Please note that this should be pos.t = 0

(related to ggml-org#21045 (comment))

Comment thread tools/mtmd/mtmd.cpp
        pos.y = 0;
    } else {
        size_t idx = i - n_boi;
        pos.t = n_boi; // all image tokens share the same temporal pos

Please note that this should be pos.t = 0

(related to ggml-org#21045 (comment))


Comment thread tools/mtmd/mtmd.cpp
} else {
    size_t idx = i - n_boi;
    pos.t = n_boi; // all image tokens share the same temporal pos
    pos.x = idx % image_tokens->nx;

@avirajBevli commented Apr 17, 2026


Please note that for the Falcon-OCR model, this is not the correct calculation for pos_h, pos_w. The model uses normalized coordinates.

Please refer to https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/data.py#L330-L357

This is one of the reasons why it is hard to reuse the existing M-RoPE infra for Falcon-OCR. The other reasons are:

  1. In mtmd_helper.cpp, set_position_mrope_2d() adds an offset of pos_0 to pos_h, pos_w for image tokens, which is not the case for Falcon-OCR.

  2. mtmd_decoder_pos has integer fields, but Falcon-OCR expects normalized values of pos_h, pos_w, so it is difficult to use mtmd_decoder_pos to represent Falcon-OCR positions. Please refer to https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/data.py#L330-L357

  3. In llama-graph.cpp, llm_graph_input_pos::set_input sets pos_h = pos_w = pos_t for text tokens, but for Falcon-OCR, pos_h = pos_w = 0 for text tokens.

Please note that I verified the correctness of this branch by incorporating the comments I mentioned in this PR, i.e. I can run correct inference.

But since I am not able to use the existing M-RoPE infra for Falcon-OCR without some modification to handle the issues above, I did not raise a PR with my changes (including the custom RoPE code), because that might not be a clean implementation.

Please let me know if you want me to raise a PR anyway for you to review, or if you have any thoughts or suggestions.

Looking forward to your input, @ngxson!

@ngxson (Owner, Author) commented Apr 17, 2026


Thanks for having a look. To answer your comments:

> In mtmd_helper.cpp, set_position_mrope_2d() adds an offset of pos_0 to pos_h, pos_w for image tokens, which is not the case for Falcon-OCR.

Because the media chunk is the only thing being processed, pos_0 will always be 0, making it aligned with the PyTorch version. (Note: I now acknowledge that pos_0 must stay as-is; I will update that.)

> mtmd_decoder_pos has integer fields, but Falcon-OCR expects normalized values of pos_h, pos_w, so it is difficult to use mtmd_decoder_pos to represent Falcon-OCR positions. Please refer to https://github.com/tiiuae/Falcon-Perception/blob/main/falcon_perception/data.py#L330-L357

Hmm, ok, so please confirm that I understand correctly here: because the positional indexes are floats instead of ints (confirmed by the PyTorch code that you mentioned), you have to use SPATIAL_3D_POS_SCALE, then cancel it out inside the text model's RoPE, right?

In that case, we can simply copy the code that I quoted in this comment: ggml-org#21045 (comment) and bring it into get_decoder_pos. Does that sound good to you?

> In llama-graph.cpp, llm_graph_input_pos::set_input sets pos_h = pos_w = pos_t for text tokens, but for Falcon-OCR, pos_h = pos_w = 0 for text tokens.

Yes, but get_decoder_pos will simply return 0 for all text tokens; I will update that part.
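
For readers following along, here is a minimal sketch of the scale-and-cancel idea discussed above; SPATIAL_3D_POS_SCALE is named in the thread, but its value and these helper names are assumptions, not code from the branch:

#include <cstdint>

// Pack a normalized (0..1) spatial coordinate into an integer position field by
// pre-scaling, then cancel the scale where the text model's RoPE consumes it.
static constexpr int32_t SPATIAL_3D_POS_SCALE = 1000; // value assumed

// mtmd side: normalized coordinate -> integer position field
static int32_t pack_norm_pos(float norm) {
    return (int32_t)(norm * SPATIAL_3D_POS_SCALE);
}

// text-model RoPE side: integer position field -> normalized coordinate
static float unpack_norm_pos(int32_t pos) {
    return (float)pos / (float)SPATIAL_3D_POS_SCALE;
}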

@ngxson (Owner, Author) commented

One more question, following up on your comment ggml-org#21045 (comment):

What is the spatial position of the first generated token, i.e. the first token right after the image, that will be generated by the LLM?
