fix: remap predicate field index to projected read schema in InternalReadContext #338
duanyyyyyyy wants to merge 1 commit into
e5c7e75 to
a6e29b7
Compare
| return read_context_->GetPredicate(); | ||
| return remapped_predicate_; | ||
| } | ||
| bool EnablePredicateFilter() const { |
The class comment in predicate_builder.h clearly states that field_index is “the index of the field in the read schema (0-based)”. If we need to change its behavior, could you please also review the comments carefully and update them all accordingly?
Is the current change based on the assumption that the field_idx provided in the user predicate is not meaningful, since it will be rebuilt in InternalReadContext anyway?
| bool any_changed = false; | ||
| for (const auto& child : compound->Children()) { | ||
| PAIMON_ASSIGN_OR_RAISE(auto remapped, RemapPredicateFieldIndex(read_schema, child)); | ||
| if (remapped != child) { |
In production code, could we please make the type explicit in PAIMON_ASSIGN_OR_RAISE to improve readability?
| EXPECT_EQ(leaf->FieldName(), "f3"); | ||
| // f3 is the first column in the projected read schema. | ||
| EXPECT_EQ(leaf->FieldIndex(), 0); | ||
| } |
Please prefer ASSERT_* over EXPECT_* here.
| TableRead::Create(std::move(read_context)), | ||
| "field f3 has field idx 0 in input schema, mismatch field idx 2 in predicate"); | ||
| } | ||
| { |
Could you please add an integration test, or alternatively modify field_idx so it no longer matches the read schema, to verify end-to-end correctness? Thanks!
a6e29b7 to
a849a59
Compare
| if (new_index == leaf->FieldIndex()) { | ||
| return predicate; | ||
| } | ||
| return std::static_pointer_cast<Predicate>(leaf->NewLeafPredicate(new_index)); |
Why is std::static_pointer_cast needed here?
| return std::static_pointer_cast<Predicate>( | ||
| compound->NewCompoundPredicate(remapped_children)); | ||
| } | ||
| return predicate; |
If the predicate is neither a LeafPredicateImpl nor a CompoundPredicateImpl, this should return an Invalid status instead of returning the predicate unchanged.
21edd0f to
2e2b1b3
Compare
…alReadContext
Predicates are constructed against the latest table schema, so each
LeafPredicate carries a field index that points to the column's position
in the full table schema. When the query projects a subset of columns
the read_schema built inside InternalReadContext::Create lays those
columns out at different positions; the leaf's field index no longer
matches the column it names.
The strict field-index validation that runs immediately afterwards then
fails:
Paimon TableRead::Create error: Invalid: field obs_index has field
idx 0 in input schema, mismatch field idx 1 in predicate
The downstream LeafPredicateImpl::Test paths use field_index_ directly
to index into arrow arrays / internal rows built from read_schema, so
even if validation were relaxed, leaving the mismatch in place would
silently read the wrong column.
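The silent wrong-column read can be illustrated with a minimal sketch. Row, EqualsLeaf and IndexOf below are hypothetical stand-ins invented for this example, not the library's classes; the sketch only assumes, as described above, that a leaf indexes into a row laid out according to read_schema.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in: a row whose values follow the read schema's column order.
struct Row {
    std::vector<int> values;
};

// Hypothetical stand-in for a leaf predicate: "field == literal".
struct EqualsLeaf {
    std::string field_name;
    std::size_t field_index;  // position used to index into the row
    int literal;

    // Uses field_index directly, as the text above describes for the real Test paths.
    bool Test(const Row& row) const { return row.values.at(field_index) == literal; }
};

// Resolves a column's position in a schema by name; returns -1 when absent.
int IndexOf(const std::vector<std::string>& schema, const std::string& name) {
    for (std::size_t i = 0; i < schema.size(); ++i) {
        if (schema[i] == name) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

With a table schema of {a, b, c} and a projection of {b, c}, a leaf on b that still carries index 1 is evaluated against column c; resolving the index from the projected schema by name gives 0 and the expected result.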
Walk the predicate tree once at context construction and rebuild each
LeafPredicate with the field index resolved from the read_schema by
field name. CompoundPredicate is reconstructed only when any descendant
actually changed, otherwise the original shared_ptr is reused.
GetPredicate() now returns the remapped predicate; the original (with
latest-schema indices) is still available via the wrapped ReadContext if
ever needed.
CreateWithSchema lays the context over a different read_schema (e.g.
the minimal column set for COUNT(*)), so it also remaps from the
original FE predicate against the new read_schema rather than copying
the already-remapped one whose indices are aligned with the original
read_schema.
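The remap described above could look roughly like the following sketch. It is not the code from this change: Predicate, Leaf, Compound and Remap are simplified, hypothetical types and names, and std::optional stands in for the library's status/result type. It shows the three behaviours named in the message: resolve by field name, reuse the original shared_ptr when nothing changed, and fail when a field is missing or the node type is unknown.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for the predicate classes.
struct Predicate {
    virtual ~Predicate() = default;
};
using PredicatePtr = std::shared_ptr<const Predicate>;
using Schema = std::vector<std::string>;  // column names in read order

struct Leaf : Predicate {
    Leaf(std::string n, int i) : name(std::move(n)), index(i) {}
    std::string name;
    int index;
};

struct Compound : Predicate {
    explicit Compound(std::vector<PredicatePtr> c) : children(std::move(c)) {}
    std::vector<PredicatePtr> children;
};

// Returns the predicate aligned with read_schema, or std::nullopt when a leaf
// names a field absent from read_schema or the node type is unknown (or null).
std::optional<PredicatePtr> Remap(const Schema& read_schema, const PredicatePtr& predicate) {
    if (auto leaf = std::dynamic_pointer_cast<const Leaf>(predicate)) {
        for (std::size_t i = 0; i < read_schema.size(); ++i) {
            if (read_schema[i] != leaf->name) {
                continue;
            }
            const int new_index = static_cast<int>(i);
            if (new_index == leaf->index) {
                return predicate;  // already aligned: reuse the input pointer
            }
            return PredicatePtr(std::make_shared<Leaf>(leaf->name, new_index));
        }
        return std::nullopt;  // field not present in read_schema
    }
    if (auto compound = std::dynamic_pointer_cast<const Compound>(predicate)) {
        std::vector<PredicatePtr> children;
        children.reserve(compound->children.size());
        bool any_changed = false;
        for (const PredicatePtr& child : compound->children) {
            std::optional<PredicatePtr> remapped = Remap(read_schema, child);
            if (!remapped.has_value()) {
                return std::nullopt;  // propagate the failure
            }
            any_changed = any_changed || (*remapped != child);
            children.push_back(*remapped);
        }
        if (!any_changed) {
            return predicate;  // no descendant changed: reuse the input pointer
        }
        return PredicatePtr(std::make_shared<Compound>(std::move(children)));
    }
    return std::nullopt;  // neither leaf nor compound
}
```

Because the lookup is by name, calling Remap on the original predicate with a different read schema (the CreateWithSchema case) yields indices for that schema directly, without depending on indices computed for an earlier one.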
Repro:
-- DE table with btree global index on obs_index
CREATE TABLE t (
clip_id STRING, obs_index INT, time_offset_ms BIGINT, collected_date DATE
) PARTITIONED BY (collected_date, clip_id)
TBLPROPERTIES (
'data-evolution.enabled' = 'true',
'global-index.btree.index-column' = 'obs_index',
'bucket' = '-1'
);
-- after INSERT + CALL paimon.sys.create_global_index(...)
SELECT clip_id, obs_index, time_offset_ms FROM t
WHERE collected_date = '2026-05-26' AND clip_id = 'clip_a'
AND obs_index BETWEEN 0 AND 10;
-- before: TableRead::Create error (field idx mismatch)
-- after: returns matching rows
Coverage: InternalReadContext gains four cases —
TestPredicateFieldIdxRemappedWhenProjected,
TestPredicateUnchangedWhenAligned,
TestCompoundPredicateRemap,
TestPredicateOnFieldMissingFromReadSchema —
covering the projected/aligned/compound/missing-field paths through
the remap.
Signed-off-by: duanyyyyyyy <yan.duan9759@gmail.com>
2e2b1b3 to
cf44831
Compare
Purpose
InternalReadContext::Create performs a strict field-index check that compares the predicate's FieldIndex against the field's position in read_schema. Predicates are constructed upstream (Java SDK / FE) against the latest table schema, so when the query projects a subset of columns the leaf's FieldIndex no longer matches the column's position in the projected read_schema and TableRead::Create fails:
Paimon TableRead::Create error: Invalid: field obs_index has field idx 0 in input schema, mismatch field idx 1 in predicate
The same mismatch is also unsafe downstream: LeafPredicateImpl::Test uses field_index_ directly to index into the arrow array / InternalRow built from read_schema, so even with the validation relaxed the predicate would silently read the wrong column.
This is the C++ analogue of paimon Java's AbstractDataTableRead.executeFilter calling PredicateProjectionConverter.fromProjection before applying row-level filters; we do it once at context construction so every downstream reader sees a predicate already aligned with read_schema.
This change walks the predicate tree once at InternalReadContext::Create and rebuilds each LeafPredicate with its field_index resolved from read_schema by field name. CompoundPredicate is reconstructed only when any descendant actually changed, otherwise the original shared_ptr is reused. The remapped predicate is stored on InternalReadContext; GetPredicate() returns it (the original FE predicate stays accessible via the wrapped ReadContext if ever needed). CreateWithSchema remaps against its new_read_schema for the same reason: the previously-remapped predicate is aligned with the original read schema, not the new one.
Reproduces on a DE table with a btree global index; the query pattern is the one given under "Repro" in the commit message above.
Tests
InternalReadContext gains four cases (in internal_read_context_test.cpp):
TestPredicateFieldIdxRemappedWhenProjected: the projected read_schema has the predicate's field at a different position; the leaf's FieldIndex is rewritten to the projected position.
TestPredicateUnchangedWhenAligned: when already aligned, the returned shared_ptr is the input pointer (zero-copy reuse).
TestCompoundPredicateRemap: every leaf inside a CompoundPredicate is recursively rewritten.
TestPredicateOnFieldMissingFromReadSchema: a leaf naming a field not in read_schema surfaces as Status::Invalid.
API and Format
No public-include / storage-format / protocol change. The private constructor of InternalReadContext gains a remapped_predicate parameter; GetPredicate()'s return value now reflects the read-schema-aligned indices rather than forwarding to the wrapped ReadContext. Behavior change is correctness-only (previously failing paths now succeed; previously succeeding paths return identical results since the remap is a no-op when already aligned).
Documentation
No new user-facing feature; no documentation changes.
Generative AI tooling
Generated-by: Claude Code (claude-opus-4-7)