feat(detection): replace NL-literal anchors with protocol tokens per #106#109
Conversation
…106 Phase 9 r2 consensus META_JUDGE_DONE:consensus:structural:no DetectionInvariantV1; use stable protocol tokens, explicit anchors, and source-regression guard against localized prose detectors Changes: - 48 NL-literal assertions replaced across 4 files - Detection anchors now use: marker tokens (SOLVER_DONE etc), sentinel ⟦AI:AUTO-LOOP⟧, emoji prefixes, heading-id matching - No DetectionInvariantV1 registry introduced (rejected per consensus) - Source-regression guard added: forbidden NL patterns in newly added test files Verification: 349/349 tests pass Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🤖 quality review: self-doc block missingTL;DR
详细说明
建议在模块常量附近补一段清晰说明,例如 Old 是 source-regression tests 用 localized prose literals 当 detection anchor,New 是 detection should key on protocol tokens / sentinel / heading ids, and this test prevents new localized prose detectors。补上后我这边的 quality verdict 可以翻到 approve。 📎 完整 codex 原始输出(存档备查)---
pr: 109
role: quality
verdict: reject
---
## Verdict
Reject: the code is small and focused, but the new source-regression guard is missing the mandated Old/New refactor self-doc for this detection refactor.
## Evidence
- skills/codex-refactor-loop/scripts/test_no_new_prose_detection.py:1: the new file starts with only a module docstring and then the token list; there is no `Refactor (iterN/cluster-XXX):` Old/New block explaining the NL-literal-anchor to protocol-token/sentinel/heading-id change. That makes the new detector list opaque to a non-audit reader and violates the cluster self-documentation rule.
## What would change your verdict (only if comment or reject)
Add a clear refactor self-doc block near the module docstring or `FORBIDDEN_NEW_TEST_TOKENS`, with Old explaining the localized prose literal detector pattern and New explaining the protocol-token/sentinel/heading-id contract this guard enforces.⟦AI:AUTO-LOOP⟧ |
🤖 tests reviewer: 拒绝,测试断言被弱化TL;DR
详细说明这次变更目标是把自然语言检测锚点迁到 protocol token / sentinel / heading-id,这个方向本身不在 tests review 里否定。但当前 diff 里有几处测试把业务语义测丢了:例如旧测试要求 triage prompt 明确“不写 这属于断言弱化。source-regression test 的价值是防止控制器边界退化,不是统计词出现次数。修复方式不需要退回中文锚点,可以改成 protocol-token + denial semantics: forbidden action 只能出现在 deny / no-alias / controller-owner 语义下,而不是任意 section placement。 📎 完整 codex 原始输出(存档备查)pr: 109
|
🤖 架构复审通过: NL 检测已收敛到协议 tokenTL;DR
详细说明我按三点核对:实际三点 diff、issue106 r2 consensus/implement 授权、CLAUDE.md/AGENTS.md 约束。PR 只改 关键证据是新的 📎 完整 codex 原始输出(存档备查)---
pr: 109
role: architect
verdict: approve
---
## Verdict
Approve: no architecture compliance blocker found; the PR keeps NL-prose detection out of machine checks while preserving host-neutral policy and existing controller boundaries.
## Evidence
- `skills/codex-refactor-loop/scripts/test_no_new_prose_detection.py:15` adds a source-regression guard against reintroducing localized prose detector tokens in changed `test_*.py` files. This supports CLAUDE.md: `**变更必须可验证**:行为约束必须落到机械验证手段(behavior test / source-regression test / 段落 lint);仅靠"agent 应该记得"承载的约束视为未落地。`
- `skills/codex-refactor-loop/scripts/test_skill_entrypoint_contract.py:70` through `:85` swaps human prose literals for sentinel and `REFERENCE.md#...` anchors; `:134` through `:151` keeps heavy templates out of `SKILL.md` by asserting reference anchors plus emoji heading placement. This matches CLAUDE.md: `**内核最小化**:SKILL.md 只承载稳定不变量与触发条件;可机械化、可参考化的部分下沉到 `scripts/` / `prompts/` / `REFERENCE.md`。`
- `skills/codex-refactor-loop/scripts/test_ensure_project_rules_fixed_points.py:2883` through `:2920` keeps the exact six `HOST_*` host-language policy variables and rejects aliases; `:2922` through `:2957` continues guarding against old C#/.NET/protobuf prompt defaults outside the refactor comment. This preserves CLAUDE.md: `**事实源唯一**:同一约束禁止在多处平行声明。版本号 → `.version-bump.json`;host 运行时事实 → `host.env`;skill 行为 → 该 skill 的 SKILL.md 与 `scripts/test_*.py`。`
- `skills/codex-refactor-loop/scripts/test_ensure_project_rules_fixed_points.py:2845` through `:2865` still constrains non-controller prompt mentions of `git commit`, `git push`, `git checkout`, PR creation/merge, and issue close to explicit ban/post/marker sections. This preserves CLAUDE.md: `**强类型边界,窄扩展点**:任何 controller-runtime 例外必须 narrow allowlist + no lifecycle authority by default;授权来源必须 durable artifact + 仓库级文档双重锚定。`
- `skills/codex-refactor-loop/scripts/test_marker_only_prompts_gh_ban.py:22` through `:38` still requires marker-only prompts to contain lifecycle-ban tokens, and `:79` through `:94` verifies lifecycle snippets appear only inside the ban/redline/marker-allowlist areas. No new lifecycle actor, split actor, store abstraction, external repo reference, schema/protocol default, compat shim, or parallel pathway appears in the PR diff.
- Scope is honest against the issue-106 implement authorization: the diff is limited to `skills/codex-refactor-loop/scripts/test_ensure_project_rules_fixed_points.py`, `test_marker_only_prompts_gh_ban.py`, new `test_no_new_prose_detection.py`, and `test_skill_entrypoint_contract.py`, all under the declared `skills/codex-refactor-loop/scripts/*.py` scan/refactor scope.
REVIEW_DONE:109:architect:approve
⟦AI:AUTO-LOOP⟧⟦AI:AUTO-LOOP⟧ |
…ject Tests reviewer r1 reject: source-regression tests 从"禁止某行为"被弱化为"出现某 token 即可"。 Changes: - Restore behavior-level assertions (token must be in correct structural context, not just present) - Add FORBIDDEN_NEW_TEST_TOKENS guard: scan new test files for re-introduced NL detection tokens - 3 files edited, 349 tests pass tests reject comment: gh pr view 109 --comments | latest tests review Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🤖 Quality review: approveTL;DR
详细说明本轮只看 readability / simplicity / naming / dead code / scope。PR 把脆弱的自然语言字面断言改成更窄的 protocol token、section scope、heading/link anchor 检查,整体符合 cluster 意图。 我也打开了 4 个 touched files 的完整内容。新增 helper 的职责命名清楚, 本地验证说明:我没有把当前 worktree 切到 PR 分支。尝试跑测试时命中了当前 📎 完整 codex 原始输出(存档备查)---
pr: 109
role: quality
verdict: approve
---
## Verdict
Approve: the refactor is focused, readable, and replaces brittle NL-literal checks with clearer protocol-token/section-scoped assertions without adding avoidable abstraction.
## Evidence
- `skills/codex-refactor-loop/scripts/test_no_new_prose_detection.py:15`: self-doc explains the old brittle prose-detector pattern and the new protocol-token principle clearly enough for a non-audit reader.
- `skills/codex-refactor-loop/scripts/test_no_new_prose_detection.py:36`: `changed_test_files()` is small and directly named for its purpose; no dead helper surface or generic Manager/Handler naming.
- `skills/codex-refactor-loop/scripts/test_marker_only_prompts_gh_ban.py:14`: `DENIAL_OR_CONTROLLER_OWNER_RE` is a domain-specific name for the reusable context check, not a vague helper abstraction.
- `skills/codex-refactor-loop/scripts/test_marker_only_prompts_gh_ban.py:84`: lifecycle-token validation stays in one test method with straightforward loops and no excessive branching.
- `skills/codex-refactor-loop/scripts/test_skill_entrypoint_contract.py:21`: the markdown heading and section helpers reduce repeated parsing without introducing a standalone abstraction layer.
- `skills/codex-refactor-loop/scripts/test_ensure_project_rules_fixed_points.py:61`: shared `assert_denial_or_controller_owner_context()` removes repeated regex assertion boilerplate and keeps failure messages specific to the token and line.
- `skills/codex-refactor-loop/scripts/test_ensure_project_rules_fixed_points.py:1493`: changed prompt-render checks now assert token presence plus redline context instead of pinning one localized sentence, which keeps the test intent readable.
- `skills/codex-refactor-loop/scripts/test_ensure_project_rules_fixed_points.py:2690`: daemon startup checks are scoped to the relevant section and unsafe examples, avoiding unrelated prose pinning.
## What would change your verdict (only if comment or reject)
N/A⟦AI:AUTO-LOOP⟧ |
🤖 Architect review: approveTL;DR
详细说明我按 关键点: 📎 完整 codex 原始输出(存档备查)pr: 109
|
🤖 tests review: reject, source-regression guard can no-opTL;DR
详细说明主要问题在 这会让 #106 的 source-regression guard 变成可选项。当前 📎 完整 codex 原始输出(存档备查)---
pr: 109
role: tests
verdict: reject
---
## Verdict
Reject: the new no-prose-detection regression test can silently no-op when its git baseline lookup fails, so the added rule is not reliably enforced in CI.
## Evidence
- `skills/codex-refactor-loop/scripts/test_no_new_prose_detection.py:36`-`52`: `changed_test_files()` shells out to `git diff --name-only origin/auto-refact-dev ...`, but if that command fails it returns `[]`; `test_new_or_modified_tests_do_not_add_prose_detection_tokens` then passes with `offenders == []`. This is not a business assertion, it converts missing baseline evidence into success.
- `.github/workflows/consensus-rnd-ci.yml:27` runs `actions/checkout@v4` with default depth/ref behavior and no `fetch-depth: 0` or explicit fetch of `origin/auto-refact-dev`; the new test therefore depends on a remote-tracking ref that is not established by the test job contract. The guard added for issue #106 can pass without scanning any modified test files.
- I did not find added sleeps/delays, `[Skip]`, manual traits, or mock-only pseudo-coverage in the current diff. The other modified tests generally replace localized prose literals with protocol/section/denial-context assertions and still check concrete source-regression outcomes.
## What would change your verdict (only if comment or reject)
Make `test_no_new_prose_detection.py` fail closed when the baseline diff cannot be computed, or make CI fetch the required base ref before running the suite. The test should also assert that at least one changed test file was scanned for this PR shape, so a broken git query cannot become a green no-op.
REVIEW_DONE:109:tests:reject
⟦AI:AUTO-LOOP⟧⟦AI:AUTO-LOOP⟧ |
…tests r2 reject Tests r2 reject: test_no_new_prose_detection.py changed_test_files() returned [] when git diff failed → offenders == [] always passed → guard was effectively noop. Changes: - changed_test_files() now fail-closed: raises if git diff errors or returns empty list (under expected non-empty PR) - Added assertion: PR must scan ≥1 test file (otherwise guard is buggy) - CI workflow: fetch origin auto-refact-dev base ref before running tests so git diff has a valid base Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🤖 tests reviewer — approveTL;DR
详细说明我按三点看了测试质量:有没有真实业务/契约结果断言,有没有靠 关键点:新的 📎 完整 codex 原始输出(存档备查)---
pr: 109
role: tests
verdict: approve
---
## Verdict
Test coverage and quality are adequate for this diff.
## Evidence
- Reviewed diff `origin/auto-refact-dev...origin/refactor/issue106-detection-no-nl-literals`; touched files are CI wiring plus tests only, with no production code requiring separate behavior coverage.
- `.github/workflows/consensus-rnd-ci.yml:28` sets `fetch-depth: 0` and line 32 fetches `origin auto-refact-dev`, which gives the new source-regression test its required `origin/auto-refact-dev` comparison ref in CI.
- `skills/codex-refactor-loop/scripts/test_no_new_prose_detection.py:36` computes changed `test_*.py` files, lines 51-56 fail closed on git command errors, lines 65-70 prevent the guard-only change from scanning nothing, and lines 71-82 assert the concrete no-regression outcome: changed tests must not contain the listed localized prose detector tokens.
- `skills/codex-refactor-loop/scripts/test_marker_only_prompts_gh_ban.py:84` strengthens lifecycle-token coverage by checking all forbidden token mentions across the prompt body, not just presence in the ban section, and rejects affirmative lifecycle mentions outside denial/controller-owner context.
- `skills/codex-refactor-loop/scripts/test_skill_entrypoint_contract.py:91` still checks ordered Phase 0 obligations and now also asserts the five concrete restart-helper-managed daemon names at lines 117-125, so replacing the old prose anchor did not drop daemon coverage.
- No new sleep/delay test pacing, `[Skip]`/manual category escape, weakened assertion pattern, or mock-only pseudo-coverage was introduced in the PR diff.
- Verification in isolated worktree `.worktrees/pr109-tests`: focused tests passed, 12 tests in 0.058s; full `python3 -m unittest discover -s skills/codex-refactor-loop/scripts -p 'test_*.py'` passed, 349 tests in 41.773s; `git diff --check origin/auto-refact-dev...HEAD` passed.
REVIEW_DONE:109:tests:approve
⟦AI:AUTO-LOOP⟧⟦AI:AUTO-LOOP⟧ |
TL;DR
Phase 9 #106 r2 consensus → 48 NL-literal detection anchors → protocol tokens / sentinel / emoji / heading-id。No registry introduced。349/349 tests pass。4 files,新增
test_no_new_prose_detection.pysource-regression guard。Closes #106
授权:
.refactor-loop/logs/meta-judge-issue106-r2.log→META_JUDGE_DONE:consensus:structural:no DetectionInvariantV1⟦AI:AUTO-LOOP⟧