-
Notifications
You must be signed in to change notification settings - Fork 7
test: add brainlayer run_tests.sh orchestrator #256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,72 +1,197 @@ | ||
| # Bugbot Review: PR feat/p2b-fixture-corpus | ||
| # Bugbot Review: PR feat/p4b-run-tests-orchestrator | ||
|
|
||
| **Date**: 2026-04-27 | ||
| **Reviewer**: @bugbot | ||
| **Status**: ⚠️ ISSUES FOUND | ||
|
|
||
| --- | ||
|
|
||
| ## Summary | ||
| This PR adds a comprehensive test fixture for FTS5 ranking regression detection. The hermetic test fix addresses a **critical network dependency issue** found by @codex. | ||
|
|
||
| This PR adds `scripts/run_tests.sh` as a cross-language test orchestrator. The implementation is mostly solid, but I've identified **3 bugs** and **2 potential issues** that should be addressed. | ||
|
|
||
| --- | ||
|
|
||
| ## 🐛 Critical Bugs Fixed | ||
| ## 🐛 Critical Issues | ||
|
|
||
| ### 1. **pytest `-x` flag conflicts with "never short-circuit" design goal** | ||
|
|
||
| **Location**: `scripts/run_tests.sh:46` | ||
|
|
||
| ```46:46:scripts/run_tests.sh | ||
| run_step "pytest unit suite" run_pytest "$TEST_ROOT/" -v --tb=short -m "not integration" -x | ||
| ``` | ||
|
|
||
| **Problem**: The `-x` flag makes pytest exit on first failure, contradicting the PR description's promise to "never short-circuit on the first failing command." | ||
|
|
||
| ### 1. **Network Dependency in FTS Test (CRITICAL)** | ||
| **Severity: HIGH** ⚠️ **Found by @codex** | ||
| **Location**: `tests/stale_index_query.test.ts` | ||
| **Impact**: If the first test fails, pytest will stop immediately, and the remaining test phases (MCP tool registration, bun tests) will never run. The exit code aggregation works correctly, but you're not running all tests. | ||
|
|
||
| **Issue**: The test originally used `uvx --from sqlite-utils` which requires network access to PyPI on every run, breaking in offline/CI environments. | ||
| **Evidence**: The PR description states: | ||
| > aggregate failures with bitwise OR and never short-circuit on the first failing command | ||
|
|
||
| **Fix**: ✅ **FIXED** - Refactored to use native Bun SQLite (`db.query()`) for FTS assertions. The FTS ranking test is now fully hermetic. The embedding drift test still legitimately requires `uv run python3` for the embedding model. | ||
| But pytest with `-x` means "stop on first failure." | ||
|
|
||
| **Fix**: Remove the `-x` flag from line 46. | ||
|
|
||
| --- | ||
|
|
||
| ### 2. **Missing Dependency Check** | ||
| **Severity: MEDIUM** | ||
| **Location**: `scripts/generate-fixtures.sh` | ||
| ### 2. **Missing file reference in line 49** | ||
|
|
||
| **Location**: `scripts/run_tests.sh:49` | ||
|
|
||
| ```47:49:scripts/run_tests.sh | ||
| run_step \ | ||
| "pytest MCP tool registration" \ | ||
| run_pytest "$TEST_ROOT/test_think_recall_integration.py::TestMCPToolCount" -v --tb=short | ||
| ``` | ||
|
|
||
| **Problem**: The script hardcodes a reference to `test_think_recall_integration.py::TestMCPToolCount`, but this file exists and the test class exists. However, there's no validation that this file/test actually exists before attempting to run it. | ||
|
|
||
| **Issue**: Script assumes `uvx` is available without checking. | ||
| **Impact**: If someone renames or removes this test, the script will fail with a confusing error (pytest collection error) rather than a clear message. | ||
|
|
||
| **Risk Level**: Medium - This is a fragile dependency. The file exists now, but the script should be more defensive. | ||
|
|
||
| **Recommendation**: Either: | ||
| - Add a comment explaining why this specific test is important, OR | ||
| - Add a check to see if the file exists before running it, OR | ||
| - Make this configurable via an environment variable | ||
|
|
||
| --- | ||
|
|
||
| **Fix**: ✅ **FIXED** - Added preflight check with helpful error message. | ||
| ### 3. **TypeScript test depends on `uv` and `uvx` which may not be installed** | ||
|
|
||
| **Location**: `tests/stale_index_query.test.ts:100, 115` | ||
|
|
||
| ```100:109:tests/stale_index_query.test.ts | ||
| "uvx", | ||
| "--from", | ||
| "sqlite-utils", | ||
| "sqlite-utils", | ||
| "query", | ||
| sqlitePath, | ||
| `SELECT chunk_id FROM chunks_fts WHERE chunks_fts MATCH '${fixture.query.match}' ORDER BY bm25(chunks_fts), chunk_id`, | ||
| ], | ||
| repoRoot, | ||
| ); | ||
| ``` | ||
|
|
||
| ```113:126:tests/stale_index_query.test.ts | ||
| const liveEmbeddingJson = runCommand( | ||
| [ | ||
| "uv", | ||
| "run", | ||
| "python3", | ||
| "-c", | ||
| [ | ||
| "import json", | ||
| "from brainlayer.embeddings import get_embedding_model", | ||
| `print(json.dumps(get_embedding_model().embed_query(${JSON.stringify(fixture.sample_text.text)})))`, | ||
| ].join("; "), | ||
| ], | ||
| repoRoot, | ||
| ); | ||
| ``` | ||
|
|
||
| **Problem**: The TypeScript test file calls `uvx` and `uv run python3` directly, but: | ||
| 1. The orchestrator script respects `BRAINLAYER_USE_UV` env var (can be set to 0) | ||
| 2. `uv` may not be installed in the environment | ||
| 3. The test will fail with a confusing error if `uv` is missing | ||
|
|
||
| **Impact**: The bun test suite will fail in environments without `uv`, even though the orchestrator script has a fallback for pytest. | ||
|
|
||
| **Fix**: The TypeScript test should either: | ||
| - Check for `uv` availability and skip if missing | ||
| - Use the same fallback logic as the bash script | ||
| - Document this requirement clearly | ||
|
|
||
| --- | ||
|
|
||
| ### 3. **Missing `.gitattributes`** | ||
| **Severity: LOW** | ||
| ## ⚠️ Potential Issues | ||
|
|
||
| **Fix**: ✅ **FIXED** - Added entry to mark fixtures as linguist-generated for better GitHub stats. | ||
| ### 4. **Race condition potential with process substitution** | ||
|
|
||
| **Location**: `scripts/run_tests.sh:52-54` | ||
|
|
||
| ```52:54:scripts/run_tests.sh | ||
| while IFS= read -r test_file; do | ||
| bun_tests+=("$test_file") | ||
| done < <(collect_bun_tests) | ||
| ``` | ||
|
|
||
| **Analysis**: Process substitution with `< <()` is generally safe in bash, but can be subtle in some edge cases. This code is correct, but consider using a simpler pattern: | ||
|
|
||
| ```bash | ||
| bun_tests=() | ||
| if [ ! -d "$TEST_ROOT" ]; then | ||
| : # no tests | ||
| else | ||
| mapfile -t bun_tests < <(find "$TEST_ROOT" -type f -name "*.test.ts" | sort) | ||
| fi | ||
| ``` | ||
|
|
||
| **Risk Level**: Low - Current code works, but the suggested refactor is cleaner. | ||
|
|
||
| --- | ||
|
|
||
| ## ⚠️ Design Notes | ||
| ### 5. **No validation of `$ROOT_DIR` resolution** | ||
|
|
||
| **Location**: `scripts/run_tests.sh:5` | ||
|
|
||
| ```5:5:scripts/run_tests.sh | ||
| ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" | ||
| ``` | ||
|
|
||
| **Analysis**: If `cd` fails (e.g., permissions issue), `pwd` will output the current directory instead of failing. This could lead to tests running in the wrong location. | ||
|
|
||
| ### Cosine Similarity Threshold | ||
| The fixture uses 0.999 threshold. Consider relaxing to 0.995 if you see false positives across different CPU architectures (x86 vs ARM) or BLAS implementations. | ||
| **Risk Level**: Very Low - This is an edge case and the script would likely fail fast anyway. | ||
|
|
||
| ### Test Timeout | ||
| Current 120s timeout may be tight on slow CI. Consider making it configurable via env var if needed. | ||
| **Recommendation**: Add `set -e` at the top or check the result: | ||
| ```bash | ||
| ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" || exit 1 | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## ✅ What's Excellent | ||
| ## ✅ What Works Well | ||
|
|
||
| 1. **Hermetic FTS Test**: Core ranking assertions now require zero network access | ||
| 2. **Fixture Provenance**: README and embedded metadata make regeneration transparent | ||
| 3. **Dual Language Coverage**: Bun + pytest ensure the fixture works across ecosystems | ||
| 4. **Control Document**: `orchard-ml-004` is a proper negative control | ||
| 1. **Exit code aggregation**: The bitwise OR logic is correct and well-tested | ||
| 2. **Test coverage**: The contract tests properly validate the core behavior | ||
| 3. **Conditional bun execution**: Properly skips bun tests when none exist | ||
| 4. **Environment variable support**: Good use of `BRAINLAYER_TEST_ROOT` and `BRAINLAYER_USE_UV` | ||
| 5. **Clear output**: The step labels and PASS/FAIL messages are helpful | ||
|
|
||
| --- | ||
|
|
||
| ## Test Results | ||
| ## 📋 Test Results | ||
|
|
||
| - ✅ **pytest**: `tests/test_stale_index_fixture.py` passes | ||
| - ✅ **Bun FTS test**: Now hermetic (no network required) | ||
| - ⚠️ **Bun embedding test**: Requires `uv` (acceptable - needs embedding model) | ||
| I ran the following validations: | ||
|
|
||
| - ✅ Bash syntax check: `bash -n scripts/run_tests.sh` → PASS | ||
| - ✅ Python linting: `ruff check tests/test_run_tests_script.py` → PASS | ||
| - ✅ Contract tests: `pytest tests/test_run_tests_script.py -v` → 2 PASSED | ||
|
|
||
| --- | ||
|
|
||
| ## 🔧 Recommended Fixes | ||
|
|
||
| ### High Priority | ||
| 1. **Remove the `-x` flag** from line 46 to match the PR's design goal | ||
| 2. **Add documentation** or validation for the hardcoded `test_think_recall_integration.py` reference | ||
|
|
||
| ### Medium Priority | ||
| 3. **Add `uv` dependency checking** to `stale_index_query.test.ts` or document the requirement | ||
|
|
||
| ### Low Priority | ||
| 4. Consider adding `set -e` or explicit error handling for `ROOT_DIR` resolution | ||
|
|
||
| --- | ||
|
|
||
| ## Verdict | ||
| ## 🎯 Verdict | ||
|
|
||
| The orchestrator script is well-designed and the contract tests demonstrate good engineering practices. However, the **`-x` flag is a clear bug** that contradicts the stated goal of running all test phases regardless of failures. | ||
|
|
||
| **✅ READY TO MERGE** - Critical network dependency fixed, test infrastructure is solid. | ||
| The PR should not be merged until issue #1 is fixed. | ||
|
|
||
| --- | ||
|
|
||
| **Reviewed by**: Bugbot (Claude Agent) + @codex | ||
| **Date**: 2026-04-27 | ||
| **Commits**: ea23dcf (+ bugbot fixes) | ||
| **Signed**: Bugbot 🤖 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,76 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| set -u -o pipefail | ||
|
|
||
| ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" | ||
| TEST_ROOT="${BRAINLAYER_TEST_ROOT:-$ROOT_DIR/tests}" | ||
| BRAINLAYER_USE_UV="${BRAINLAYER_USE_UV:-1}" | ||
| exit_status=0 | ||
|
|
||
| run_step() { | ||
| local label="$1" | ||
| shift | ||
|
|
||
| echo "==> $label" | ||
| "$@" | ||
| local rc=$? | ||
| exit_status=$(( exit_status | rc )) | ||
|
|
||
| if [ "$rc" -eq 0 ]; then | ||
| echo "PASS: $label" | ||
| else | ||
| echo "FAIL ($rc): $label" | ||
| fi | ||
|
|
||
| echo | ||
| } | ||
|
|
||
| collect_bun_tests() { | ||
| if [ ! -d "$TEST_ROOT" ]; then | ||
| return 0 | ||
| fi | ||
|
|
||
| find "$TEST_ROOT" -type f -name "*.test.ts" | sort | ||
| } | ||
|
|
||
| run_pytest() { | ||
| if [ "$BRAINLAYER_USE_UV" = "1" ] && command -v uv >/dev/null 2>&1; then | ||
| uv run pytest "$@" | ||
| else | ||
| pytest "$@" | ||
| fi | ||
| } | ||
|
|
||
| cd "$ROOT_DIR" | ||
|
|
||
| run_step "pytest unit suite" run_pytest "$TEST_ROOT/" -v --tb=short -m "not integration" | ||
| run_step \ | ||
| "pytest MCP tool registration" \ | ||
| run_pytest "$TEST_ROOT/test_think_recall_integration.py::TestMCPToolCount" -v --tb=short | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The MCP leg is built from Useful? React with 👍 / 👎. |
||
|
|
||
| bun_tests=() | ||
| while IFS= read -r test_file; do | ||
| bun_tests+=("$test_file") | ||
| done < <(collect_bun_tests) | ||
|
|
||
| if [ "${#bun_tests[@]}" -gt 0 ]; then | ||
| if command -v bun >/dev/null 2>&1; then | ||
| run_step "bun test suite" bun test "${bun_tests[@]}" | ||
| else | ||
| echo "FAIL (1): bun not found but TypeScript tests exist under $TEST_ROOT" | ||
| echo | ||
| exit_status=$(( exit_status | 1 )) | ||
| fi | ||
| else | ||
| echo "==> bun test suite" | ||
| echo "SKIP: no .test.ts files found under $TEST_ROOT" | ||
| echo | ||
| fi | ||
|
|
||
| if [ "$exit_status" -ne 0 ]; then | ||
| echo "BrainLayer test gate failed." | ||
| else | ||
| echo "BrainLayer test gate passed." | ||
| fi | ||
|
|
||
| exit "$exit_status" | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,89 @@ | ||
| """Contract tests for scripts/run_tests.sh.""" | ||
|
|
||
| import os | ||
| import stat | ||
| import subprocess | ||
| from pathlib import Path | ||
|
|
||
| SCRIPT_PATH = Path(__file__).resolve().parent.parent / "scripts" / "run_tests.sh" | ||
|
|
||
|
|
||
| def _write_executable(path: Path, contents: str) -> None: | ||
| path.write_text(contents) | ||
| path.chmod(path.stat().st_mode | stat.S_IEXEC) | ||
|
|
||
|
|
||
| def _make_stub_bin(tmp_path: Path, *, pytest_exit: int, bun_exit: int | None) -> tuple[Path, Path]: | ||
| bin_dir = tmp_path / "bin" | ||
| bin_dir.mkdir() | ||
|
|
||
| pytest_log = tmp_path / "pytest.log" | ||
| bun_log = tmp_path / "bun.log" | ||
|
|
||
| _write_executable( | ||
| bin_dir / "pytest", | ||
| "\n".join( | ||
| [ | ||
| "#!/usr/bin/env bash", | ||
| 'echo "$*" >> "$PYTEST_LOG"', | ||
| f"exit {pytest_exit}", | ||
| "", | ||
| ] | ||
| ), | ||
| ) | ||
|
|
||
| if bun_exit is not None: | ||
| _write_executable( | ||
| bin_dir / "bun", | ||
| "\n".join( | ||
| [ | ||
| "#!/usr/bin/env bash", | ||
| 'echo "$*" >> "$BUN_LOG"', | ||
| f"exit {bun_exit}", | ||
| "", | ||
| ] | ||
| ), | ||
| ) | ||
|
|
||
| return pytest_log, bun_log | ||
|
|
||
|
|
||
| def test_run_tests_aggregates_exit_codes_and_keeps_running(tmp_path: Path) -> None: | ||
| test_root = tmp_path / "tests" | ||
| test_root.mkdir() | ||
| (test_root / "fixture.test.ts").write_text("test placeholder\n") | ||
|
|
||
| pytest_log, bun_log = _make_stub_bin(tmp_path, pytest_exit=2, bun_exit=4) | ||
|
|
||
| env = os.environ.copy() | ||
| env["PATH"] = f"{tmp_path / 'bin'}:{env['PATH']}" | ||
| env["BRAINLAYER_TEST_ROOT"] = str(test_root) | ||
| env["BRAINLAYER_USE_UV"] = "0" | ||
| env["PYTEST_LOG"] = str(pytest_log) | ||
| env["BUN_LOG"] = str(bun_log) | ||
|
|
||
| result = subprocess.run(["bash", str(SCRIPT_PATH)], capture_output=True, text=True, env=env) | ||
|
|
||
| assert result.returncode == 6 | ||
| assert pytest_log.read_text().strip() | ||
| assert bun_log.read_text().strip() | ||
|
|
||
|
|
||
| def test_run_tests_skips_bun_when_no_typescript_tests_exist(tmp_path: Path) -> None: | ||
| test_root = tmp_path / "tests" | ||
| test_root.mkdir() | ||
|
|
||
| pytest_log, bun_log = _make_stub_bin(tmp_path, pytest_exit=0, bun_exit=0) | ||
|
|
||
| env = os.environ.copy() | ||
| env["PATH"] = f"{tmp_path / 'bin'}:{env['PATH']}" | ||
| env["BRAINLAYER_TEST_ROOT"] = str(test_root) | ||
| env["BRAINLAYER_USE_UV"] = "0" | ||
| env["PYTEST_LOG"] = str(pytest_log) | ||
| env["BUN_LOG"] = str(bun_log) | ||
|
|
||
| result = subprocess.run(["bash", str(SCRIPT_PATH)], capture_output=True, text=True, env=env) | ||
|
|
||
| assert result.returncode == 0 | ||
| assert pytest_log.read_text().strip() | ||
| assert not bun_log.exists() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When
uvis installed, this branch always executesuv run pytest, which switches test execution to uv’s project environment and dependency sync path instead of the caller’s active Python environment.uv run --helpdocuments a default PyPI index and that--activeis needed to prefer the active venv, so in offline/pre-provisioned setups this can fail before running any tests (I reproduced this behavior: uv created.venvthen failed fetching dependencies). BecauseBRAINLAYER_USE_UVdefaults to1, the script is brittle by default on machines that happen to have uv installed.Useful? React with 👍 / 👎.