### What
- Gate the load counter update in
  `Scanner::_collect_profile_before_close()` on a new virtual
  `_should_update_load_counters()`:
  - Base `Scanner` reports only when `_is_load` (classic
    stream/broker/routine load scanners with src tuple desc).
  - `FileScanner` additionally reports for `FILE_STREAM` scans: TVF-based
    loads (`http_stream`, group commit) plan the load source as a TVF query
    scan without src tuple desc (`_is_load` is false), but the rows filtered
    by their WHERE clause must still be reported as `NumberUnselectedRows`
    and counted into `NumberTotalRows`.
- Add a deterministic regression case covering INSERT-SELECT / DELETE /
  UPDATE whose scans filter out all rows.
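
The gate described above can be sketched roughly as follows. This is a simplified illustration, not the actual patch: `LoadCounters`, the `FileType` enum, `collect_before_close()` and `reported_unselected()` are hypothetical stand-ins, and only the `_is_load` / `FILE_STREAM` decision is modelled.

```cpp
// Hypothetical, simplified stand-ins for the real classes; only the gating
// decision described in this PR is modelled.
enum class FileType { FILE_LOCAL, FILE_STREAM };

struct LoadCounters {
    long unselected_rows = 0;
};

class Scanner {
public:
    explicit Scanner(bool is_load) : _is_load(is_load) {}
    virtual ~Scanner() = default;

    // Publishes predicate-filtered rows to the load counters only when the
    // virtual gate allows it.
    void collect_before_close(long rows_unselected, LoadCounters& counters) const {
        if (!_should_update_load_counters()) {
            return;
        }
        counters.unselected_rows += rows_unselected;
    }

protected:
    // Base rule: only classic load scanners report.
    virtual bool _should_update_load_counters() const { return _is_load; }

    bool _is_load;
};

class FileScanner : public Scanner {
public:
    FileScanner(bool is_load, FileType type) : Scanner(is_load), _file_type(type) {}

protected:
    // FILE_STREAM scans report even though _is_load is false for them.
    bool _should_update_load_counters() const override {
        return Scanner::_should_update_load_counters() ||
               _file_type == FileType::FILE_STREAM;
    }

private:
    FileType _file_type;
};

// Example helper: how many unselected rows end up in the counters.
inline long reported_unselected(const Scanner& scanner, long rows_unselected) {
    LoadCounters counters;
    scanner.collect_before_close(rows_unselected, counters);
    return counters.unselected_rows;
}
```

In this sketch a plain query scan (`Scanner(false)`) leaves the counters untouched, while a `FileScanner` constructed with `FileType::FILE_STREAM` still reports its unselected rows.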
### Why
For DELETE/UPDATE/INSERT INTO ... SELECT executed through the insert
path, rows filtered by query scan predicates (including runtime filters)
were added to the RuntimeState load counters. When all scanned rows are
filtered, `num_rows_load_success()` (total - filtered - unselected) goes
negative, BE reports a negative `dpp.norm.ALL`, and FE fails the
`insert_max_filter_ratio` check with errors like:
```
Insert has too many filtered data 0/-2 insert_max_filter_ratio is 1.000000
```
This only triggers with `enable_insert_strict=false` and
`insert_max_filter_ratio > 0` (the strict branch only checks
`filteredRows > 0`). The intermittency in the field comes from runtime
filter arrival timing: rows are only counted when the RF arrives in time
to filter inside the scanner.
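
To make the arithmetic concrete, here is a minimal sketch of the subtraction quoted above. `LoadCounterSnapshot` is a hypothetical stand-in for the counters on RuntimeState, and the values are assumptions chosen to reproduce the `-2` from the error message, not numbers taken from a real run.

```cpp
#include <cstdint>

// Hypothetical stand-in for the load counters kept on RuntimeState.
struct LoadCounterSnapshot {
    int64_t total;
    int64_t filtered;
    int64_t unselected;
};

// Same formula as quoted above: total - filtered - unselected.
constexpr int64_t num_rows_load_success(const LoadCounterSnapshot& c) {
    return c.total - c.filtered - c.unselected;
}

// Assumed values: two rows dropped by a scan predicate were booked as
// unselected, while nothing was counted as total.
constexpr LoadCounterSnapshot polluted{0, 0, 2};

// With the gate in place the query scan contributes nothing.
constexpr LoadCounterSnapshot gated{0, 0, 0};

static_assert(num_rows_load_success(polluted) == -2, "success count goes negative");
static_assert(num_rows_load_success(gated) == 0, "counters stay consistent");
```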
The counter update was historically gated by
`if (!enable_profile && !_is_load) return;` (already buggy with
`enable_profile=true`, because a non-load scan then skipped the early
return), and became unconditional after #57314 removed the early return.
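
The three stages of the gate can be compared side by side. The helper names below are hypothetical and model only whether the load counter update is reached for the base `Scanner`; profile collection is left out.

```cpp
// Old behaviour: `if (!enable_profile && !_is_load) return;` ran before the
// counter update, so the update was reached whenever either flag was set.
constexpr bool old_gate_reaches_update(bool enable_profile, bool is_load) {
    return !(!enable_profile && !is_load);
}

// After the early return was removed, the update always ran.
constexpr bool unconditional_reaches_update(bool /*enable_profile*/, bool /*is_load*/) {
    return true;
}

// Base Scanner rule in this PR: the decision depends on is_load only.
constexpr bool new_gate_reaches_update(bool /*enable_profile*/, bool is_load) {
    return is_load;
}

// A non-load query scan with profiling enabled already leaked into the
// counters under the old gate; the new gate keeps it out.
static_assert(!old_gate_reaches_update(false, false), "old gate blocks plain query scans");
static_assert(old_gate_reaches_update(true, false), "old gate leaks when profiling is on");
static_assert(unconditional_reaches_update(false, false), "no gate at all");
static_assert(!new_gate_reaches_update(true, false), "new gate ignores enable_profile");
static_assert(new_gate_reaches_update(false, true), "load scans still report");
```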
Compared to the previous revision of this PR (thrift
`skip_query_scan_load_counters` option set by FE for DELETE/UPDATE):
- No thrift / FE changes needed.
- Also fixes plain INSERT INTO ... SELECT: `AbstractInsertExecutor` sets
`query_type=LOAD` for all insert-path commands, so the query-type based
gate still let OlapScanner predicate-filtered rows pollute the counters
there.
- `FILE_STREAM` is a precise discriminator: only the `http_stream` /
`group_commit` TVFs use it, and they require a backend id on the
ConnectContext (only present for stream-load style HTTP requests), so
they can never appear in a normal query/DELETE/UPDATE.

`FileScanner::_counter.num_rows_filtered` is only accumulated in
`_convert_to_output_block` (load-only path), so for `http_stream` the
`NumberFilteredRows` reported to clients comes from sink validation and
is unaffected by this gate.
### Test
- New regression case
`regression-test/suites/load_p0/insert/test_scan_filtered_rows_not_pollute_load_counter.groovy`:
uses value-column predicates on an AGGREGATE KEY table (cannot be pushed
down to storage, evaluated by scanner conjuncts) to deterministically
reproduce; before this fix the INSERT-SELECT/DELETE/UPDATE statements
fail with `0/-10`.
- `test_group_commit_http_stream` semantics preserved: `insert into ...
select from http_stream(...) where ...` still reports
`NumberUnselectedRows`/`NumberTotalRows` (the previous `_is_load`-only
attempt broke this with `expected: <6> but was: <5>`).