Skip to content

fix: remap predicate field index to projected read schema in InternalReadContext#338

Draft
duanyyyyyyy wants to merge 1 commit into
alibaba:mainfrom
duanyyyyyyy:duanyan/remap_predicate_field_index
Draft

fix: remap predicate field index to projected read schema in InternalReadContext#338
duanyyyyyyy wants to merge 1 commit into
alibaba:mainfrom
duanyyyyyyy:duanyan/remap_predicate_field_index

Conversation

@duanyyyyyyy
Copy link
Copy Markdown
Contributor

Purpose

InternalReadContext::Create performs a strict field-id check that compares
the predicate's FieldIndex against the field's position in read_schema.
Predicates are constructed upstream (Java SDK / FE) against the latest table
schema
, so when the query projects a subset of columns the leaf's
FieldIndex no longer matches the column's position in the projected
read_schema and TableRead::Create fails:

  Paimon TableRead::Create error: Invalid: field obs_index has field
  idx 0 in input schema, mismatch field idx 1 in predicate

The same mismatch is also unsafe downstream: LeafPredicateImpl::Test uses
field_index_ directly to index into the arrow array / InternalRow built
from read_schema, so even with the validation relaxed the predicate would
silently read the wrong column.

This is the C++ analogue of paimon Java's AbstractDataTableRead.executeFilter
calling PredicateProjectionConverter.fromProjection before applying
row-level filters; we do it once at context construction so every downstream
reader sees a predicate already aligned with read_schema.

Walks the predicate tree once at InternalReadContext::Create and rebuilds
each LeafPredicate with its field_index resolved from read_schema by
field name. CompoundPredicate is reconstructed only when any descendant
actually changed, otherwise the original shared_ptr is reused. The
remapped predicate is stored on InternalReadContext; GetPredicate()
returns it (the original FE predicate stays accessible via the wrapped
ReadContext if ever needed). CreateWithSchema remaps against its
new_read_schema for the same reason — the previously-remapped predicate
is aligned with the original read schema, not the new one.

Reproduces on a DE table with btree global index, query pattern:

  SELECT clip_id, obs_index, ...
  FROM t
  WHERE collected_date = ... AND clip_id = ... AND obs_index BETWEEN 0 AND 10;                                                                                                            

Tests

InternalReadContext gains four cases (in internal_read_context_test.cpp):

  • TestPredicateFieldIdxRemappedWhenProjected — projected read_schema
    has the predicate's field at a different position; leaf's FieldIndex
    is rewritten to the projected position.
  • TestPredicateUnchangedWhenAligned — when already aligned, the
    returned shared_ptr is the input pointer (zero-copy reuse).
  • TestCompoundPredicateRemap — every leaf inside a CompoundPredicate
    is recursively rewritten.
  • TestPredicateOnFieldMissingFromReadSchema — a leaf naming a field
    not in read_schema surfaces as Status::Invalid.

API and Format

No public-include / storage-format / protocol change. The private
constructor of InternalReadContext gains a remapped_predicate
parameter; GetPredicate()'s return value now reflects the
read-schema-aligned indices rather than forwarding to the wrapped
ReadContext. Behavior change is correctness-only (previously failing
paths now succeed; previously succeeding paths return identical results
since the remap is a no-op when already aligned).

Documentation

No new user-facing feature; no documentation changes.

Generative AI tooling

Generated-by: Claude Code (claude-opus-4-7)

@duanyyyyyyy duanyyyyyyy force-pushed the duanyan/remap_predicate_field_index branch 2 times, most recently from e5c7e75 to a6e29b7 Compare June 4, 2026 06:04
@zjw1111 zjw1111 requested a review from Copilot June 4, 2026 07:31
return read_context_->GetPredicate();
return remapped_predicate_;
}
bool EnablePredicateFilter() const {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class comment in predicate_builder.h clearly states that field_index is “the index of the field in the read schema (0-based)”. If we need to change its behavior, could you please also review the comments carefully and update them all accordingly?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the current change based on the assumption that the field_idx provided in the user predicate is not meaningful, since it will be rebuilt in InternalReadContext anyway?

bool any_changed = false;
for (const auto& child : compound->Children()) {
PAIMON_ASSIGN_OR_RAISE(auto remapped, RemapPredicateFieldIndex(read_schema, child));
if (remapped != child) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In production code, could we please make the type explicit in PAIMON_ASSIGN_OR_RAISE to improve readability?

EXPECT_EQ(leaf->FieldName(), "f3");
// f3 is the first column in the projected read schema.
EXPECT_EQ(leaf->FieldIndex(), 0);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please prefer ASSERT_* over EXPECT_* here.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

TableRead::Create(std::move(read_context)),
"field f3 has field idx 0 in input schema, mismatch field idx 2 in predicate");
}
{
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add an integration test, or alternatively modify field_idx so it no longer matches the read schema, to verify end-to-end correctness? Thanks!

@lxy-9602 lxy-9602 changed the title Remap predicate field index to projected read schema in Intern… fix: remap predicate field index to projected read schema in InternalReadContext Jun 4, 2026
@duanyyyyyyy duanyyyyyyy force-pushed the duanyan/remap_predicate_field_index branch from a6e29b7 to a849a59 Compare June 4, 2026 08:06
if (new_index == leaf->FieldIndex()) {
return predicate;
}
return std::static_pointer_cast<Predicate>(leaf->NewLeafPredicate(new_index));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why std::static_pointer_cast here

return std::static_pointer_cast<Predicate>(
compound->NewCompoundPredicate(remapped_children));
}
return predicate;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if neither LeafPredicateImpl or CompoundPredicateImpl, should return Invalid state

@duanyyyyyyy duanyyyyyyy force-pushed the duanyan/remap_predicate_field_index branch from 21edd0f to 2e2b1b3 Compare June 4, 2026 12:31
…alReadContext

Predicates are constructed against the latest table schema, so each
LeafPredicate carries a field index that points to the column's position
in the full table schema. When the query projects a subset of columns
the read_schema built inside InternalReadContext::Create lays those
columns out at different positions; the leaf's field index no longer
matches the column it names.

The strict field-id validation that runs immediately afterwards then
fails:

    Paimon TableRead::Create error: Invalid: field obs_index has field
    idx 0 in input schema, mismatch field idx 1 in predicate

The downstream LeafPredicateImpl::Test paths use field_index_ directly
to index into arrow arrays / internal rows built from read_schema, so
even if validation were relaxed, leaving the mismatch in place would
silently read the wrong column.

Walk the predicate tree once at context construction and rebuild each
LeafPredicate with the field index resolved from the read_schema by
field name. CompoundPredicate is reconstructed only when any descendant
actually changed, otherwise the original shared_ptr is reused.
GetPredicate() now returns the remapped predicate; the original (with
latest-schema indices) is still available via the wrapped ReadContext if
ever needed.

CreateWithSchema lays the context over a different read_schema (e.g.
the minimal column set for COUNT(*)), so it also remaps from the
original FE predicate against the new read_schema rather than copying
the already-remapped one whose indices are aligned with the original
read_schema.

Repro:

    -- DE table with btree global index on obs_index
    CREATE TABLE t (
      clip_id STRING, obs_index INT, time_offset_ms BIGINT, collected_date DATE
    ) PARTITIONED BY (collected_date, clip_id)
    TBLPROPERTIES (
      'data-evolution.enabled' = 'true',
      'global-index.btree.index-column' = 'obs_index',
      'bucket' = '-1'
    );
    -- after INSERT + CALL paimon.sys.create_global_index(...)
    SELECT clip_id, obs_index, time_offset_ms FROM t
    WHERE collected_date = '2026-05-26' AND clip_id = 'clip_a'
      AND obs_index BETWEEN 0 AND 10;
    -- before: TableRead::Create error (field idx mismatch)
    -- after:  returns matching rows

Coverage: InternalReadContext gains four cases —
TestPredicateFieldIdxRemappedWhenProjected,
TestPredicateUnchangedWhenAligned,
TestCompoundPredicateRemap,
TestPredicateOnFieldMissingFromReadSchema —
covering the projected/aligned/compound/missing-field paths through
the remap.

Signed-off-by: duanyyyyyyy <yan.duan9759@gmail.com>
@duanyyyyyyy duanyyyyyyy force-pushed the duanyan/remap_predicate_field_index branch from 2e2b1b3 to cf44831 Compare June 4, 2026 12:55
@duanyyyyyyy duanyyyyyyy marked this pull request as draft June 5, 2026 02:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants