[WIP] tests: Columnar CI pipeline#10868
Conversation
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
|
@JaySon-Huang I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR introduces Docker-based testing infrastructure for next-gen columnar tests and improves test execution reliability. It adds shared utilities for managing test components, new configuration files, Compose definitions for full service orchestration, environment variable management, and test failure tracking with timeout support. ChangesDocker Testing Infrastructure for Next-Gen Columnar
Test Execution and Reliability Improvements
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/retest |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/docker/next-gen-columnar-config/pd.toml`:
- Around line 15-20: Move the config key max-replicas out of the [dashboard]
table and place it under a [replication] table so PD's parser recognizes it;
specifically remove max-replicas = 1 from the [dashboard] block and add a
[replication] block containing max-replicas = 1 (use the same key name
"max-replicas" and create/locate the [replication] section in the same TOML
file).
In `@tests/docker/next-gen-columnar-yaml/cluster.yaml`:
- Around line 32-33: The docker-compose volume mounts still point at
"./next-gen-config/..." instead of the new "./next-gen-columnar-config/..."
directory, so replace each occurrence of the source paths like
"./next-gen-config/pd.toml" (and any other "./next-gen-config/..." mounts at the
noted spots) with "./next-gen-columnar-config/..." so the services use the new
columnar configs; ensure you update all instances mentioned (the mounts at the
shown diff and the other occurrences referenced) so pd.toml and related config
files are mounted from next-gen-columnar-config rather than the old directory.
In `@tests/docker/next-gen-columnar-yaml/disagg_tiflash.yaml`:
- Around line 24-25: The TiFlash CN service is still mounting legacy config
files "./next-gen-config/tiflash_cn.toml" and
"./next-gen-config/tiflash_cn_proxy.toml"; update those mount entries to
reference the new directory "./next-gen-columnar-config/tiflash_cn.toml" and
"./next-gen-columnar-config/tiflash_cn_proxy.toml" so TiFlash CN uses the
intended next-gen columnar configs (modify the two mount strings in
disagg_tiflash.yaml).
In `@tests/docker/util.sh`:
- Around line 175-176: The override_dir default points to the wrong location;
update the default value of the local variable override_dir (and the other
occurrences around the same block) from "../docker/next-gen-yaml/override" to
the correct "tests/docker/override-yaml" so compose will reference the PR's
override files; also verify any code paths that build compose file references
using LOCAL_*_BIN_DIR or EXPOSE_TIDB_PORT still concatenate against override_dir
and adjust those occurrences (around the 220-223 group) to use the new path
variable.
In `@tests/fullstack-test-next-gen/run.sh`:
- Around line 36-65: Define an on-exit cleanup function and register it with
trap so the compose teardown always runs even if a command fails: create an
on_exit() that does set +e; ${COMPOSE} "${COMPOSE_FILES[@]}" down;
clean_data_log and then trap 'on_exit' EXIT near the top of the script (before
calling ${COMPOSE} ... up and before wait_next_gen_env / invoking tests).
Reference symbols: on_exit, trap, ${COMPOSE}, "${COMPOSE_FILES[@]}",
wait_next_gen_env, clean_data_log, and ENV_ARGS to ensure teardown executes for
all failure paths.
In `@tests/run-test.py`:
- Around line 71-84: run_with_timeout can leave os.popen subprocesses running
when _timeout_handler triggers because exec_func may be interrupted during
p.readlines() before p.close(); fix by ensuring subprocess cleanup on timeout:
modify run_with_timeout to catch the timeout exception raised by
_timeout_handler (or detect the alarm in the except/finally) and invoke a
cleanup step that closes any open popen handles (e.g., accept an optional
on_timeout_cleanup callable in kwargs and call it, or document/implement that
exec_func returns or exposes the popen as p so run_with_timeout can call
p.close()/p.kill()); update references to run_with_timeout, _timeout_handler,
os.popen, exec_func, and p.close() accordingly so the spawned shell is always
closed when the alarm fires.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 53c2c064-7700-4cf6-a13f-745001a8dea9
📒 Files selected for processing (25)
tests/.gitignoretests/docker/_env.shtests/docker/compose.shtests/docker/next-gen-columnar-config/pd.tomltests/docker/next-gen-columnar-config/tidb.tomltests/docker/next-gen-columnar-config/tiflash_cn.tomltests/docker/next-gen-columnar-config/tiflash_cn_proxy.tomltests/docker/next-gen-columnar-config/tikv-worker.tomltests/docker/next-gen-columnar-config/tikv.tomltests/docker/next-gen-columnar-yaml/cluster.yamltests/docker/next-gen-columnar-yaml/disagg_tiflash.yamltests/docker/next-gen-config/tidb.tomltests/docker/next-gen-yaml/cluster.yamltests/docker/next-gen-yaml/disagg_tiflash.yamltests/docker/override-yaml/expose_tidb.yamltests/docker/override-yaml/local_pd.yamltests/docker/override-yaml/local_tidb.yamltests/docker/override-yaml/local_tikv.yamltests/docker/util.shtests/fullstack-test-next-gen-columnar/_env.shtests/fullstack-test-next-gen-columnar/compose.shtests/fullstack-test-next-gen/compose.shtests/fullstack-test-next-gen/run.shtests/run-test.pytests/run-test.sh
💤 Files with no reviewable changes (1)
- tests/docker/next-gen-yaml/disagg_tiflash.yaml
| local override_dir="${2:-../docker/next-gen-yaml/override}" | ||
| local validate_binaries="${3:-true}" |
There was a problem hiding this comment.
Fix override compose path mismatch.
The override directory/path points to ../docker/next-gen-yaml/override, but this PR’s override files are under tests/docker/override-yaml. With LOCAL_*_BIN_DIR or EXPOSE_TIDB_PORT, compose will reference non-existent files.
💡 Proposed fix
function append_local_binary_overrides() {
local -n compose_files=$1
- local override_dir="${2:-../docker/next-gen-yaml/override}"
+ local override_dir="${2:-../docker/override-yaml}"
local validate_binaries="${3:-true}"
@@
function setup_next_gen_compose_files() {
@@
- append_local_binary_overrides "${compose_files_name}" "../docker/next-gen-yaml/override" "${validate_binaries}"
+ append_local_binary_overrides "${compose_files_name}" "../docker/override-yaml" "${validate_binaries}"
if [[ -n "${EXPOSE_TIDB_PORT:-}" ]]; then
- compose_files+=(-f "../docker/next-gen-yaml/override/expose_tidb.yaml")
+ compose_files+=(-f "../docker/override-yaml/expose_tidb.yaml")
echo "Exposing tidb0 on host port ${EXPOSE_TIDB_PORT}"
fi
}Also applies to: 220-223
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/docker/util.sh` around lines 175 - 176, The override_dir default points
to the wrong location; update the default value of the local variable
override_dir (and the other occurrences around the same block) from
"../docker/next-gen-yaml/override" to the correct "tests/docker/override-yaml"
so compose will reference the PR's override files; also verify any code paths
that build compose file references using LOCAL_*_BIN_DIR or EXPOSE_TIDB_PORT
still concatenate against override_dir and adjust those occurrences (around the
220-223 group) to use the new path variable.
| def run_with_timeout(seconds, func, *args, **kwargs): | ||
| if seconds <= 0: | ||
| return func(*args, **kwargs) | ||
| if not hasattr(signal, 'SIGALRM'): | ||
| return func(*args, **kwargs) | ||
|
|
||
| old_handler = signal.signal(signal.SIGALRM, _timeout_handler) | ||
| signal.alarm(seconds) | ||
| try: | ||
| return func(*args, **kwargs) | ||
| finally: | ||
| signal.alarm(0) | ||
| signal.signal(signal.SIGALRM, old_handler) | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that timeout is SIGALRM-based and command execution is via os.popen.
# Expected: signal.alarm usage in run_with_timeout and os.popen usage in exec_func.
rg -n -C3 'def run_with_timeout|signal\.alarm|def exec_func|os\.popen|p\.close\(' tests/run-test.pyRepository: pingcap/tiflash
Length of output: 779
Ensure os.popen subprocesses are cleaned up on SIGALRM timeouts
run_with_timeout() uses signal.alarm() (SIGALRM) with a handler that raises, which can interrupt exec_func() while it’s blocked in p.readlines(). Since exec_func() only reaches p.close() after readlines(), the timeout path can skip p.close() and leave the spawned shell command running/resource-leaking after the test aborts.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/run-test.py` around lines 71 - 84, run_with_timeout can leave os.popen
subprocesses running when _timeout_handler triggers because exec_func may be
interrupted during p.readlines() before p.close(); fix by ensuring subprocess
cleanup on timeout: modify run_with_timeout to catch the timeout exception
raised by _timeout_handler (or detect the alarm in the except/finally) and
invoke a cleanup step that closes any open popen handles (e.g., accept an
optional on_timeout_cleanup callable in kwargs and call it, or
document/implement that exec_func returns or exposes the popen as p so
run_with_timeout can call p.close()/p.kill()); update references to
run_with_timeout, _timeout_handler, os.popen, exec_func, and p.close()
accordingly so the spawned shell is always closed when the alarm fires.
|
/test ? |
|
@JaySon-Huang: The following commands are available to trigger required jobs: The following commands are available to trigger optional jobs: Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test all |
|
/test ? |
|
@JaySon-Huang: The following commands are available to trigger required jobs: The following commands are available to trigger optional jobs: Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@JaySon-Huang: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/test ? |
|
@JaySon-Huang: The following commands are available to trigger required jobs: The following commands are available to trigger optional jobs: Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
tests/docker/util.sh (1)
220-223:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix remaining override path mismatch in compose wiring.
setup_next_gen_compose_filesstill uses../docker/next-gen-yaml/override, while overrides are under../docker/override-yaml. This breaks local-binary andEXPOSE_TIDB_PORToverride resolution.Proposed fix
- append_local_binary_overrides "${compose_files_name}" "../docker/next-gen-yaml/override" "${validate_binaries}" + append_local_binary_overrides "${compose_files_name}" "../docker/override-yaml" "${validate_binaries}" if [[ -n "${EXPOSE_TIDB_PORT:-}" ]]; then - compose_files+=(-f "../docker/next-gen-yaml/override/expose_tidb.yaml") + compose_files+=(-f "../docker/override-yaml/expose_tidb.yaml") echo "Exposing tidb0 on host port ${EXPOSE_TIDB_PORT}" fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/docker/util.sh` around lines 220 - 223, In setup_next_gen_compose_files, the override path is incorrect: update calls that use "../docker/next-gen-yaml/override" (including the append_local_binary_overrides invocation and the compose_files+=(-f "../docker/next-gen-yaml/override/expose_tidb.yaml") line) to use the correct "../docker/override-yaml" directory so local-binary overrides and EXPOSE_TIDB_PORT resolution load the correct override files (ensure both the append_local_binary_overrides argument and the expose_tidb.yaml compose_files entry are changed).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@tests/docker/util.sh`:
- Around line 220-223: In setup_next_gen_compose_files, the override path is
incorrect: update calls that use "../docker/next-gen-yaml/override" (including
the append_local_binary_overrides invocation and the compose_files+=(-f
"../docker/next-gen-yaml/override/expose_tidb.yaml") line) to use the correct
"../docker/override-yaml" directory so local-binary overrides and
EXPOSE_TIDB_PORT resolution load the correct override files (ensure both the
append_local_binary_overrides argument and the expose_tidb.yaml compose_files
entry are changed).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 8186eda4-8b5a-4b3c-8770-6cee2bdde2cf
📒 Files selected for processing (11)
tests/docker/compose.shtests/docker/next-gen-columnar-config/pd.tomltests/docker/next-gen-columnar-config/tidb.tomltests/docker/next-gen-columnar-config/tiflash_cn.tomltests/docker/next-gen-columnar-config/tiflash_cn_proxy.tomltests/docker/next-gen-columnar-config/tikv-worker.tomltests/docker/next-gen-columnar-config/tikv.tomltests/docker/next-gen-columnar-yaml/cluster.yamltests/docker/next-gen-columnar-yaml/disagg_tiflash.yamltests/docker/next-gen-config/pd.tomltests/docker/util.sh
✅ Files skipped from review due to trivial changes (5)
- tests/docker/next-gen-config/pd.toml
- tests/docker/next-gen-columnar-yaml/disagg_tiflash.yaml
- tests/docker/next-gen-columnar-config/tikv-worker.toml
- tests/docker/next-gen-columnar-config/tiflash_cn.toml
- tests/docker/next-gen-columnar-config/pd.toml
Prepare for next-gen-columnar testing env setup Add wait/prepare helpers that skip tiflash-wn0, wire them into the columnar test directory, and symlink next-gen-columnar-config. Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
Pin kvengine and related crates to the i64::MAX columnar read fix, and document how to maintain the dependency.
Embed kvengine git rev from Cargo.lock in version output and log it after logger init.
What problem does this PR solve?
Issue Number: ref #10844
Problem Summary:
What is changed and how it works?
Local verification
Check List
Tests
See the "Local verification" above
Side effects
Documentation
Release note
Summary by CodeRabbit
New Features
Improvements