[fix](load) only load source scanners update load counters#63781
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR fixes load-quality counter accounting in BE scanners by ensuring only load scanners update RuntimeState’s load filtered/unselected row counters, preventing query-side predicate filtering from polluting INSERT/DELETE load statistics (especially when enable_profile is enabled).
Changes:
- Gate
RuntimeState::update_num_rows_load_filtered/unselected()updates behind_is_loadinScanner::_collect_profile_before_close(). - Clarify via comments that query predicate filtering must not contribute to load-quality counters.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6a52d13 to
3b09c76
Compare
|
/review |
There was a problem hiding this comment.
Review summary: no blocking issues found.
Critical checkpoint conclusions:
- Goal/test: The PR restricts load quality counter updates to scanners that are actually part of load processing. This matches the described DELETE/subquery failure mode, and the added regression enables profile and exercises DELETE with a TVF subquery predicate.
- Scope/focus: The production change is small and focused; it preserves scan profile counter updates while excluding only load-quality counters for non-load scanners.
- Concurrency/lifecycle: No new shared mutable state, threads, locks, lifecycle ownership, or static initialization concerns were introduced. The existing atomic RuntimeState counters remain used only for statistics/accounting.
- Compatibility/configuration: No protocol, storage format, persisted metadata, or configuration changes are introduced.
- Parallel paths: The central Scanner::_collect_profile_before_close path covers OlapScanner/FileScanner subclasses that delegate to it; load scanners still update load filtered/unselected counters through the existing _is_load classification.
- Error handling/data correctness: No ignored Status or visibility-version/delete-bitmap changes. The change prevents query-side predicate filtering from corrupting load success/filtered row accounting.
- Performance/observability: The added branch is trivial and not on a hot per-row path; existing scan profile counters remain collected for observability.
- Test coverage: Regression coverage was added for the profile-enabled DELETE + TVF subquery case. I did not run the regression suite in this runner.
User focus: No additional user-provided review focus was specified.
|
run buildall |
TPC-H: Total hot run time: 31896 ms |
TPC-DS: Total hot run time: 173476 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
1568a28 to
989d36e
Compare
989d36e to
4e2557f
Compare
For DELETE/UPDATE/INSERT INTO ... SELECT executed through the insert
path, rows filtered by query scan predicates (including runtime
filters) were added to RuntimeState load counters. When all scanned
rows are filtered, num_rows_load_success() (total - filtered -
unselected) goes negative and FE fails the insert_max_filter_ratio
check with errors like:
Insert has too many filtered data 0/-2 insert_max_filter_ratio is 1.000000
Fix by gating the counter update in Scanner::_collect_profile_before_close()
on a new virtual _should_update_load_counters():
- Base Scanner only reports when _is_load (classic stream/broker/routine
load scanners with src tuple desc).
- FileScanner additionally reports for FILE_STREAM scans, because TVF
based loads (http_stream, group commit) plan the load source as a tvf
query scan without src tuple desc, but their WHERE clause filtered
rows must still be reported as NumberUnselectedRows. FILE_STREAM is
only reachable from such load entries, never from normal queries.
4e2557f to
b4f208d
Compare
|
/review |
|
run buildall |
There was a problem hiding this comment.
I found one regression-test convention issue that should be fixed before merging.
Critical checkpoint conclusions:
- Goal/test proof: The scanner counter change addresses the negative load-success accounting for INSERT/DELETE/UPDATE scan predicates, and the new regression covers the profile-enabled INSERT plus no-op DELETE/UPDATE path. Existing http_stream/group_commit tests cover the FILE_STREAM unselected-row branch.
- Scope: The BE change is small and focused around close-time scanner counter accounting.
- Concurrency/lifecycle: Scanner close can run from scanner worker threads, but the changed RuntimeState counter updates are atomic and the new predicate reads scanner-local scan metadata; I did not find a new locking or lifecycle issue.
- Compatibility/config/storage/transactions: No new config, storage format, FE-BE protocol field, or persistence path is introduced.
- Parallel paths: FileScanner keeps the FILE_STREAM load-source exception while the base Scanner blocks ordinary query scanners from polluting load counters.
- Test coverage/results: Functional coverage is present, but the new regression drops tables at the end, which violates the Doris regression-test standard to preserve state for debugging.
- Observability/performance: No additional observability is needed for this counter gate, and the added check is constant-time at scanner close.
User focus: no additional user-provided review focus was supplied.
|
run buildall |
TPC-H: Total hot run time: 29794 ms |
TPC-DS: Total hot run time: 171572 ms |
TPC-H: Total hot run time: 29485 ms |
TPC-DS: Total hot run time: 168341 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
I found two regression-artifact issues that should be fixed before merging.
Checkpoint conclusions:
- Goal/test: the C++ change addresses the intended load-counter pollution path, and the new suite covers INSERT SELECT, profile-enabled INSERT, DELETE, and UPDATE no-op cases. The test artifacts need cleanup to satisfy repository standards.
- Scope/focus: the load-counter gate itself is small and focused. No additional user focus points were provided.
- Concurrency/lifecycle: no new concurrency path or lifecycle ownership issue is introduced by this PR's reviewed changes; the existing RuntimeState counters remain atomic and are still updated during scanner close.
- Config/compatibility/persistence: no new config, thrift field, storage format, or edit-log compatibility concern found.
- Parallel paths: base Scanner now suppresses non-load scanners, while FileScanner preserves FILE_STREAM TVF load reporting; I did not find another scanner path that requires the same special case.
- Test standards/results: the new regression suite should hardcode simple table names instead of using variables, and the generated .out currently fails git diff --check due to a trailing blank line.
- Observability/performance: no additional observability or performance issue found for this narrow counter-gating change.
Review only; I did not run the regression suite in this runner.
|
PR approved by at least one committer and no changes requested. |
### What problem does this PR solve? Issue Number: None Related PR: apache#63781, apache#64671 Problem Summary: File scanner v2 did not carry the same fixes as the existing file scanner path. Predicate rows filtered inside v2 file readers were still reported through scanner load counters unless the scanner was a real load source, and Hive TEXTFILE empty physical lines were still skipped unless read_csv_empty_line_as_null was enabled. This change gates v2 load counter reporting with the same FILE_STREAM exception used by FileScanner and adds a delimited text hook so Hive Text v2 treats empty physical lines as records while CSV keeps the old default behavior. ### Release note Fix file scanner v2 load counter reporting and Hive TEXTFILE empty-line handling. ### Check List (For Author) - Test: Unit Test / Manual test - Added TextV2ReaderTest coverage for Hive TEXTFILE empty line records, single-column empty string fields, and COUNT pushdown. - Ran git diff --check. - Ran clang-format v16 through build-support/run_clang_format.py for changed files. - Attempted ./run-be-ut.sh --run --filter='TextV2ReaderTest.*:FileScannerV2Test.*', but the local run was blocked because the script needed to update/download datasketches-cpp and network access was unavailable; no BE UT binary was already built. - Attempted clang-tidy with the available compile_commands.json, but it pointed at a stale /mnt/disk3/gabriel path; the project clang-tidy wrapper also requires bash 4+ while only system bash is available. - Behavior changed: Yes. File scanner v2 now matches v1 load counter gating and Hive TEXTFILE empty-line semantics. - Does this need documentation: No
### What problem does this PR solve? Issue Number: None Related PR: #63781, #64671 Problem Summary: File scanner v2 did not carry the same fixes as the existing file scanner path. Predicate rows filtered inside v2 file readers were still reported through scanner load counters unless the scanner was a real load source, and Hive TEXTFILE empty physical lines were still skipped unless read_csv_empty_line_as_null was enabled. This change gates v2 load counter reporting with the same FILE_STREAM exception used by FileScanner and adds a delimited text hook so Hive Text v2 treats empty physical lines as records while CSV keeps the old default behavior. ### Release note Fix file scanner v2 load counter reporting and Hive TEXTFILE empty-line handling. ### Check List (For Author) - Test: Unit Test / Manual test - Added TextV2ReaderTest coverage for Hive TEXTFILE empty line records, single-column empty string fields, and COUNT pushdown. - Ran git diff --check. - Ran clang-format v16 through build-support/run_clang_format.py for changed files. - Attempted ./run-be-ut.sh --run --filter='TextV2ReaderTest.*:FileScannerV2Test.*', but the local run was blocked because the script needed to update/download datasketches-cpp and network access was unavailable; no BE UT binary was already built. - Attempted clang-tidy with the available compile_commands.json, but it pointed at a stale /mnt/disk3/gabriel path; the project clang-tidy wrapper also requires bash 4+ while only system bash is available. - Behavior changed: Yes. File scanner v2 now matches v1 load counter gating and Hive TEXTFILE empty-line semantics. - Does this need documentation: No ### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
What
Scanner::_collect_profile_before_close()on a new virtual_should_update_load_counters():Scannerreports only when_is_load(classic stream/broker/routine load scanners with src tuple desc).FileScanneradditionally reports forFILE_STREAMscans: TVF based loads (http_stream, group commit) plan the load source as a tvf query scan without src tuple desc (_is_loadis false), but their WHERE clause filtered rows must still be reported asNumberUnselectedRows/ counted intoNumberTotalRows.Why
For DELETE/UPDATE/INSERT INTO ... SELECT executed through the insert path, rows filtered by query scan predicates (including runtime filters) were added to the RuntimeState load counters. When all scanned rows are filtered,
num_rows_load_success()(total - filtered - unselected) goes negative, BE reports a negativedpp.norm.ALL, and FE fails theinsert_max_filter_ratiocheck with errors like:This only triggers with
enable_insert_strict=falseandinsert_max_filter_ratio > 0(the strict branch only checksfilteredRows > 0). The intermittency in the field comes from runtime filter arrival timing: rows are only counted when the RF arrives in time to filter inside the scanner.This was historically gated by
if (!enable_profile && !_is_load) return;(already buggy withenable_profile=true), and became unconditional after #57314 removed the early return.Compared to the previous revision of this PR (thrift
skip_query_scan_load_countersoption set by FE for DELETE/UPDATE):AbstractInsertExecutorsetsquery_type=LOADfor all insert-path commands, so the query-type based gate still let OlapScanner predicate-filtered rows pollute the counters there.FILE_STREAMis a precise discriminator: only thehttp_stream/group_commitTVFs use it, and they require a backend id on the ConnectContext (only present for stream-load style HTTP requests), so they can never appear in a normal query/DELETE/UPDATE.FileScanner::_counter.num_rows_filteredis only accumulated in_convert_to_output_block(load-only path), so forhttp_streamtheNumberFilteredRowsreported to clients comes from sink validation and is unaffected by this gate.Test
regression-test/suites/load_p0/insert/test_scan_filtered_rows_not_pollute_load_counter.groovy: uses value-column predicates on an AGGREGATE KEY table (cannot be pushed down to storage, evaluated by scanner conjuncts) to deterministically reproduce; before this fix the INSERT-SELECT/DELETE/UPDATE statements fail with0/-10.test_group_commit_http_streamsemantics preserved:insert into ... select from http_stream(...) where ...still reportsNumberUnselectedRows/NumberTotalRows(the previous_is_load-only attempt broke this withexpected: <6> but was: <5>).Issue: CIR-20393