feat(scripts): semantic config-drift detector + weekly CI#1
Conversation
Adds scripts/drift_check.py which audits a watchlist of shared config files across all framework repos and reports semantic drift (JSON key-order, YAML flow style, .gitignore line order, XML attribute order all collapse to one equivalence class). Watchlist + repo list live in scripts/drift-policy.yaml (16 repos, 33 paths today; extend as repos and files are added). .github/workflows/drift-check.yml runs the detector every Monday 06:00 UTC and opens (or updates) a config-drift labelled issue with the report when drift is found. Verified locally: 8 of 33 paths fully clean. The 21 drifted paths include legitimate per-repo variation (anti-self-bump renovate.json rules, .coderabbit.yaml policies) which still surface for human review. Notable: renovate.json had 16 byte-unique versions but only 5 semantic clusters — 12 of 16 collapse to identical content after JSON normalisation.
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds a semantic configuration drift detector CLI with per-file-type normalizers and clustering, a policy YAML listing repos/paths to watch, README documentation, and a GitHub Actions workflow that runs the detector, uploads artifacts, and opens/updates a ChangesConfiguration Drift Detection System
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
✨ Simplify code
Comment |
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 88 |
| Duplication | 0 |
AI Reviewer: first review requested successfully. AI can make mistakes. Always validate suggestions.
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Pull Request Overview
While this PR introduces a valuable 'semantic' drift detector, there are several critical issues that should prevent merging in its current state. Most notably, the XML normalization logic is vulnerable to XML External Entity (XXE) attacks and incorrectly handles character encoding by manually decoding bytes before parsing.
Additionally, the CI workflow incorrectly flags any script failure (such as authentication or syntax errors) as a 'drift detected' event, which will lead to false positives. The core logic also fails to account for 'partial presence' as a form of drift, potentially missing cases where configuration files are deleted from a subset of repositories. Finally, the script introduces high logic complexity without any unit tests for the various file normalizers, posing a significant maintenance risk.
About this PR
- The script's heavy reliance on the 'gh' CLI and authenticated environments makes local development and debugging difficult. Furthermore, the error handling in 'gh_api' depends on string-matching '404' within stderr, which is fragile and may break across different environments or CLI versions. Consider using a more robust method for status code verification.
Test suggestions
- Missing: JSON normalizer ignores key ordering and whitespace differences
- Missing: YAML normalizer collapses flow style into block style and sorts keys
- Missing: XML normalizer sorts attributes and ignores insignificant whitespace
- Missing: INI normalizer handles section/key sorting and comment stripping
- Missing: Line-based normalizer (.gitignore) dedupes and sorts entries
- Missing: Clustering logic correctly identifies the majority (canonical) set vs drift variants
- Missing: CI workflow correctly identifies and updates existing drift issues instead of creating duplicates
- Missing: Verify XML parser is protected against XXE (XML External Entity) attacks
Prompt proposal for missing tests
Consider implementing these tests if applicable:
1. Missing: JSON normalizer ignores key ordering and whitespace differences
2. Missing: YAML normalizer collapses flow style into block style and sorts keys
3. Missing: XML normalizer sorts attributes and ignores insignificant whitespace
4. Missing: INI normalizer handles section/key sorting and comment stripping
5. Missing: Line-based normalizer (.gitignore) dedupes and sorts entries
6. Missing: Clustering logic correctly identifies the majority (canonical) set vs drift variants
7. Missing: CI workflow correctly identifies and updates existing drift issues instead of creating duplicates
8. Missing: Verify XML parser is protected against XXE (XML External Entity) attacks
TIP Improve review quality by adding custom instructions
TIP How was this review? Give us feedback
| python scripts/drift_check.py \ | ||
| --policy "${{ inputs.policy || 'scripts/drift-policy.yaml' }}" \ | ||
| --output drift-report.md \ | ||
| --manifest drift-manifest.json || echo "DRIFT_DETECTED=1" >> $GITHUB_ENV |
There was a problem hiding this comment.
🔴 HIGH RISK
The workflow incorrectly sets DRIFT_DETECTED=1 for any non-zero exit code. Since the script uses exit code 1 for drift and 2+ for system/config errors, this conflates actual drift with infrastructure failures (e.g., GITHUB_TOKEN expiration). Update the 'Run drift detector' step to only set the flag when the exit code is exactly 1.
|
|
||
| def norm_xml(data: bytes) -> str: | ||
| """Parse XML, normalise whitespace and attribute order, dump.""" | ||
| root = ET.fromstring(data.decode("utf-8")) |
There was a problem hiding this comment.
🔴 HIGH RISK
The XML normalization logic has several critical issues:
- Security: Using
xml.etree.ElementTreemakes the script vulnerable to XML External Entity (XXE) attacks. Usedefusedxmlto safely parse untrusted configuration files. - Encoding: Do not manually decode
datato UTF-8. Passing raw bytes to the parser allows it to correctly detect and handle encoding declarations (e.g., UTF-16) specified in the XML prolog.
Recommended fix: Use defusedxml.ElementTree.fromstring(data).
| "absent_in": [r for r in repos if r not in repo_versions and r not in repo_errors], | ||
| "clusters": clusters, | ||
| "errors": repo_errors, | ||
| "drift": len(clusters) > 1, |
There was a problem hiding this comment.
🟡 MEDIUM RISK
A file is currently considered 'Clean' if all repositories that contain it share the same content. However, if a file exists in only a subset of repositories (e.g., 1 out of 16), this 'partial presence' should also be flagged as drift in a consistency tool. Consider updating the logic: "drift": len(clusters) > 1 or (0 < len(repo_versions) < len(repos)).
| @@ -0,0 +1,386 @@ | |||
| #!/usr/bin/env python3 | |||
There was a problem hiding this comment.
🟡 MEDIUM RISK
Suggestion: This script introduces high logic complexity (83) for semantic normalization across JSON, YAML, XML, and INI, yet has 0% test coverage. Given the risk of regressions in drift detection logic, comprehensive unit tests for the NORMALISERS are required.
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/drift-check.yml:
- Around line 31-42: Replace the blanket "|| echo DRIFT_DETECTED=1" with
explicit exit-code handling: run python scripts/drift_check.py (the detector),
capture its exit code, and if the code is exactly 1 write "DRIFT_DETECTED=1" to
$GITHUB_ENV; if the code is 2 re-exit with code 2 so the job fails (propagate
failures), otherwise exit/continue normally for code 0. Target the Run drift
detector step and the invocation of drift_check.py, and ensure you reference
DRIFT_DETECTED and $GITHUB_ENV when setting the environment variable.
In `@README.md`:
- Line 171: Update the README entry for the `raw` normalizer to accurately
reflect `norm_raw`'s behavior: state that `norm_raw` uses Python's `.strip()` to
remove both leading and trailing whitespace (not just trailing), and adjust the
table cell text from "Trailing whitespace" to something like "Removes leading
and trailing whitespace" so docs match the actual `norm_raw` implementation.
In `@scripts/drift_check.py`:
- Around line 244-249: In check_path(), instead of returning an error dict when
norm_func is falsy (unknown normaliser_name), raise a specific exception (e.g.,
ValueError or custom ConfigError) that includes the path and normaliser_name;
then update main() to catch that exception and call sys.exit(2) for
configuration/authentication errors, ensuring other drift logic and exit codes
remain unchanged and references to check_path(), main(), norm_func, and
normaliser_name are used to locate the changes.
- Around line 262-268: The early return when repo_versions is empty should
include the same keys as the normal return: add "errors": repo_errors (or [] if
None) and "drift": [] alongside "path", "normalizer"/normaliser_name,
"present_in": 0, and "clusters" so parse-only failures aren't dropped; update
the block that builds the result for the repo in the function handling
repo_versions to return these fields (reference repo_versions, repo_errors,
normaliser_name, path). Also change the downstream reporting filter that builds
missing files (the list comprehension using findings and f.get("present_in", 0)
== 0) to distinguish truly missing files from parse-failed files by checking the
"errors" field (e.g., treat a file as missing only if present_in == 0 and not
f.get("errors")), so errored-but-present paths are reported separately.
- Around line 51-54: The subprocess.run call that invokes "gh api" (the res =
subprocess.run([...]) call in scripts/drift_check.py) needs a timeout to avoid
indefinite hangs; add a timeout argument (e.g., timeout=30 or a defined
GH_API_TIMEOUT constant) to the subprocess.run invocation and wrap the call in a
try/except catching subprocess.TimeoutExpired, and on timeout exit the process
with exit code 2 (configuration/tool error), optionally logging the timeout and
command context before exiting.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: a4b500e5-f009-46bd-a251-fcde26d14ad5
📒 Files selected for processing (4)
.github/workflows/drift-check.ymlREADME.mdscripts/drift-policy.yamlscripts/drift_check.py
📜 Review details
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
| - name: Run drift detector | ||
| id: drift | ||
| env: | ||
| GH_TOKEN: ${{ secrets.DRIFT_CHECK_TOKEN || secrets.GITHUB_TOKEN }} | ||
| run: | | ||
| # The script reads gh CLI which is preinstalled on ubuntu-latest. | ||
| # GH_TOKEN above gives it auth. | ||
| python scripts/drift_check.py \ | ||
| --policy "${{ inputs.policy || 'scripts/drift-policy.yaml' }}" \ | ||
| --output drift-report.md \ | ||
| --manifest drift-manifest.json || echo "DRIFT_DETECTED=1" >> $GITHUB_ENV | ||
| echo "----- report start -----" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Locate drift_check.py
find . -name "drift_check.py" -o -name "*drift*" -type f | head -20Repository: ANcpLua/github-settings-automation
Length of output: 167
🏁 Script executed:
# Look for the script
git ls-files | grep -i driftRepository: ANcpLua/github-settings-automation
Length of output: 161
🏁 Script executed:
cat -n scripts/drift_check.pyRepository: ANcpLua/github-settings-automation
Length of output: 15777
Differentiate detector exit code 1 (drift) from 2 (failure).
The || echo "DRIFT_DETECTED=1" handler maps every non-zero exit to drift, including auth/config/script failures. The detector explicitly returns exit code 2 for configuration or authentication errors (documented in the script's docstring), but the workflow conflates this with drift detection. This causes real breakage to be silently misclassified as drift and allows the workflow to continue instead of failing.
Capture the exit code and only set DRIFT_DETECTED=1 when the exit code is exactly 1; propagate exit code 2 to fail the workflow.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/drift-check.yml around lines 31 - 42, Replace the blanket
"|| echo DRIFT_DETECTED=1" with explicit exit-code handling: run python
scripts/drift_check.py (the detector), capture its exit code, and if the code is
exactly 1 write "DRIFT_DETECTED=1" to $GITHUB_ENV; if the code is 2 re-exit with
code 2 so the job fails (propagate failures), otherwise exit/continue normally
for code 0. Target the Run drift detector step and the invocation of
drift_check.py, and ensure you reference DRIFT_DETECTED and $GITHUB_ENV when
setting the environment variable.
| | `*.xml`, `*.props`, `*.targets`, `nuget.config`, `Directory.Build.props` | `xml` | Insignificant whitespace, attribute order | | ||
| | `.editorconfig`, `.gitmodules`, `.npmrc`, `.globalconfig` | `ini` | Comments, section ordering, key ordering | | ||
| | `.gitignore`, `.dockerignore`, `.gitattributes`, `.markdownlintignore` | `lines` | Comments, blank lines, duplicate lines, ordering | | ||
| | `LICENSE`, `build.sh`, `build.cmd` | `raw` | Trailing whitespace | |
There was a problem hiding this comment.
Align raw normalizer docs with actual behavior.
norm_raw uses .strip(), so it drops leading and trailing whitespace, not just trailing whitespace. Document this precisely to avoid report misinterpretation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` at line 171, Update the README entry for the `raw` normalizer to
accurately reflect `norm_raw`'s behavior: state that `norm_raw` uses Python's
`.strip()` to remove both leading and trailing whitespace (not just trailing),
and adjust the table cell text from "Trailing whitespace" to something like
"Removes leading and trailing whitespace" so docs match the actual `norm_raw`
implementation.
| res = subprocess.run( | ||
| ["gh", "api", endpoint], | ||
| capture_output=True, text=True | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
fd -t f "drift_check.py" | head -5Repository: ANcpLua/github-settings-automation
Length of output: 101
🏁 Script executed:
wc -l scripts/drift_check.pyRepository: ANcpLua/github-settings-automation
Length of output: 105
🏁 Script executed:
sed -n '45,65p' scripts/drift_check.pyRepository: ANcpLua/github-settings-automation
Length of output: 883
🏁 Script executed:
# Check the entire gh_api function and exit code patterns
sed -n '40,100p' scripts/drift_check.pyRepository: ANcpLua/github-settings-automation
Length of output: 2241
🏁 Script executed:
# Check exit code usage in main function
rg "sys.exit\(" scripts/drift_check.py -B 2 -A 1Repository: ANcpLua/github-settings-automation
Length of output: 896
🏁 Script executed:
# Check for other subprocess.run calls in the file
rg "subprocess.run" scripts/drift_check.py -B 2 -A 2Repository: ANcpLua/github-settings-automation
Length of output: 299
🏁 Script executed:
# Double-check for all subprocess usage
rg "subprocess\." scripts/drift_check.pyRepository: ANcpLua/github-settings-automation
Length of output: 104
🏁 Script executed:
# Verify the exit code pattern is consistent
grep -n "sys.exit" scripts/drift_check.pyRepository: ANcpLua/github-settings-automation
Length of output: 237
🏁 Script executed:
# Check imports to confirm subprocess is imported
head -20 scripts/drift_check.pyRepository: ANcpLua/github-settings-automation
Length of output: 774
Add timeout to gh api calls to prevent workflow stalls.
The subprocess.run() call lacks a timeout, allowing indefinite hangs that stall scheduled CI. Timeout should exit with code 2 (configuration/tool error).
Proposed fix
def gh_api(endpoint: str) -> dict | None:
"""Call `gh api`, return parsed JSON. None on 404. Exits on other errors."""
- res = subprocess.run(
- ["gh", "api", endpoint],
- capture_output=True, text=True
- )
+ try:
+ res = subprocess.run(
+ ["gh", "api", endpoint],
+ capture_output=True,
+ text=True,
+ timeout=30,
+ )
+ except subprocess.TimeoutExpired:
+ print(f"gh api {endpoint} timed out", file=sys.stderr)
+ sys.exit(2)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/drift_check.py` around lines 51 - 54, The subprocess.run call that
invokes "gh api" (the res = subprocess.run([...]) call in
scripts/drift_check.py) needs a timeout to avoid indefinite hangs; add a timeout
argument (e.g., timeout=30 or a defined GH_API_TIMEOUT constant) to the
subprocess.run invocation and wrap the call in a try/except catching
subprocess.TimeoutExpired, and on timeout exit the process with exit code 2
(configuration/tool error), optionally logging the timeout and command context
before exiting.
| if not norm_func: | ||
| return { | ||
| "path": path, | ||
| "normalizer": normaliser_name, | ||
| "error": f"unknown normaliser: {normaliser_name}", | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cd scripts && wc -l drift_check.py && head -50 drift_check.pyRepository: ANcpLua/github-settings-automation
Length of output: 1583
🏁 Script executed:
cd scripts && sed -n '240,255p' drift_check.py && echo "---" && sed -n '370,385p' drift_check.pyRepository: ANcpLua/github-settings-automation
Length of output: 1240
🏁 Script executed:
cd scripts && grep -n "def main" drift_check.py && grep -n "sys.exit" drift_check.pyRepository: ANcpLua/github-settings-automation
Length of output: 261
🏁 Script executed:
cd scripts && cat -n drift_check.py | grep -A 30 "def main"Repository: ANcpLua/github-settings-automation
Length of output: 1540
🏁 Script executed:
cd scripts && sed -n '360,390p' drift_check.pyRepository: ANcpLua/github-settings-automation
Length of output: 1039
🏁 Script executed:
cd scripts && grep -B 5 -A 5 "unknown normaliser" drift_check.pyRepository: ANcpLua/github-settings-automation
Length of output: 408
🏁 Script executed:
cd scripts && sed -n '1,50p' drift_check.py | grep -A 10 "Exit codes"Repository: ANcpLua/github-settings-automation
Length of output: 351
Unknown normalizer must exit with code 2 for configuration error, not code 0/1 based on drift.
The documented exit-code contract specifies code 2 for "configuration or authentication error". Returning an error dict allows the program to exit based solely on drift_count, ignoring configuration failures. Raise the exception in check_path() and catch it in main() with sys.exit(2).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/drift_check.py` around lines 244 - 249, In check_path(), instead of
returning an error dict when norm_func is falsy (unknown normaliser_name), raise
a specific exception (e.g., ValueError or custom ConfigError) that includes the
path and normaliser_name; then update main() to catch that exception and call
sys.exit(2) for configuration/authentication errors, ensuring other drift logic
and exit codes remain unchanged and references to check_path(), main(),
norm_func, and normaliser_name are used to locate the changes.
| if not repo_versions: | ||
| return { | ||
| "path": path, | ||
| "normalizer": normaliser_name, | ||
| "present_in": 0, | ||
| "clusters": [], | ||
| } |
There was a problem hiding this comment.
❓ Verification inconclusive
Script executed:
wc -l scripts/drift_check.pyRepository: ANcpLua/github-settings-automation
Repository: ANcpLua/github-settings-automation
Exit code: 0
stdout:
386 scripts/drift_check.py
Script executed:
head -n 330 scripts/drift_check.py | tail -n 80Repository: ANcpLua/github-settings-automation
Repository: ANcpLua/github-settings-automation
Exit code: 0
stdout:
repo_versions: dict[str, str] = {}
repo_errors: dict[str, str] = {}
for repo in repos:
data = fetch_file(repo, path)
if data is None:
continue
try:
repo_versions[repo] = norm_func(data)
except Exception as e:
repo_errors[repo] = f"{type(e).__name__}: {e}"
if not repo_versions:
return {
"path": path,
"normalizer": normaliser_name,
"present_in": 0,
"clusters": [],
}
clusters = cluster_versions(repo_versions)
return {
"path": path,
"normalizer": normaliser_name,
"present_in": len(repo_versions),
"absent_in": [r for r in repos if r not in repo_versions and r not in repo_errors],
"clusters": clusters,
"errors": repo_errors,
"drift": len(clusters) > 1,
}
# ---------------------------------------------------------------------------
# Output renderers
# ---------------------------------------------------------------------------
def write_markdown(findings: list[dict], path: str) -> int:
"""Write markdown report. Return drift count."""
lines: list[str] = ["# Config Drift Report\n"]
total = len(findings)
drift = [f for f in findings if f.get("drift")]
clean = [f for f in findings if not f.get("drift") and f.get("present_in", 0) > 0]
missing = [f for f in findings if f.get("present_in", 0) == 0]
lines.append(f"- Paths checked: **{total}**")
lines.append(f"- Drifted: **{len(drift)}**")
lines.append(f"- Clean (semantic 1-cluster): **{len(clean)}**")
lines.append(f"- Not present in any repo: **{len(missing)}**\n")
if drift:
lines.append("## Drifted paths\n")
for f in drift:
lines.append(f"### `{f['path']}` (normaliser: `{f['normalizer']}`)")
lines.append(f"_{f['present_in']} repos, {len(f['clusters'])} semantic clusters_\n")
for i, c in enumerate(f["clusters"]):
marker = "**canonical** (majority)" if i == 0 else f"drift #{i}"
short_repos = ", ".join(r.split("/")[-1] for r in c["repos"])
lines.append(f"- {marker} `{c['hash']}` ({c['size']}× ): {short_repos}")
if f.get("errors"):
lines.append("\n_Normalisation errors:_")
for repo, err in f["errors"].items():
lines.append(f" - `{repo.split('/')[-1]}`: {err}")
lines.append("")
if clean:
lines.append("## Clean paths\n")
lines.append("| Path | Repos | Hash |")
lines.append("|---|---|---|")
for f in clean:
h = f["clusters"][0]["hash"] if f["clusters"] else "-"
lines.append(f"| `{f['path']}` | {f['present_in']} | `{h}` |")
lines.append("")
if missing:
lines.append("## Watched but absent everywhere\n")
for f in missing:
lines.append(f"- `{f['path']}`")
lines.append("")
Path(path).write_text("\n".join(lines) + "\n")
return len(drift)
Script executed:
cat -n scripts/drift_check.py | sed -n '250,330p'Repository: ANcpLua/github-settings-automation
Repository: ANcpLua/github-settings-automation
Exit code: 0
stdout:
250
251 repo_versions: dict[str, str] = {}
252 repo_errors: dict[str, str] = {}
253 for repo in repos:
254 data = fetch_file(repo, path)
255 if data is None:
256 continue
257 try:
258 repo_versions[repo] = norm_func(data)
259 except Exception as e:
260 repo_errors[repo] = f"{type(e).__name__}: {e}"
261
262 if not repo_versions:
263 return {
264 "path": path,
265 "normalizer": normaliser_name,
266 "present_in": 0,
267 "clusters": [],
268 }
269
270 clusters = cluster_versions(repo_versions)
271 return {
272 "path": path,
273 "normalizer": normaliser_name,
274 "present_in": len(repo_versions),
275 "absent_in": [r for r in repos if r not in repo_versions and r not in repo_errors],
276 "clusters": clusters,
277 "errors": repo_errors,
278 "drift": len(clusters) > 1,
279 }
280
281
282 # ---------------------------------------------------------------------------
283 # Output renderers
284 # ---------------------------------------------------------------------------
285
286 def write_markdown(findings: list[dict], path: str) -> int:
287 """Write markdown report. Return drift count."""
288 lines: list[str] = ["# Config Drift Report\n"]
289 total = len(findings)
290 drift = [f for f in findings if f.get("drift")]
291 clean = [f for f in findings if not f.get("drift") and f.get("present_in", 0) > 0]
292 missing = [f for f in findings if f.get("present_in", 0) == 0]
293
294 lines.append(f"- Paths checked: **{total}**")
295 lines.append(f"- Drifted: **{len(drift)}**")
296 lines.append(f"- Clean (semantic 1-cluster): **{len(clean)}**")
297 lines.append(f"- Not present in any repo: **{len(missing)}**\n")
298
299 if drift:
300 lines.append("## Drifted paths\n")
301 for f in drift:
302 lines.append(f"### `{f['path']}` (normaliser: `{f['normalizer']}`)")
303 lines.append(f"_{f['present_in']} repos, {len(f['clusters'])} semantic clusters_\n")
304 for i, c in enumerate(f["clusters"]):
305 marker = "**canonical** (majority)" if i == 0 else f"drift #{i}"
306 short_repos = ", ".join(r.split("/")[-1] for r in c["repos"])
307 lines.append(f"- {marker} `{c['hash']}` ({c['size']}× ): {short_repos}")
308 if f.get("errors"):
309 lines.append("\n_Normalisation errors:_")
310 for repo, err in f["errors"].items():
311 lines.append(f" - `{repo.split('/')[-1]}`: {err}")
312 lines.append("")
313
314 if clean:
315 lines.append("## Clean paths\n")
316 lines.append("| Path | Repos | Hash |")
317 lines.append("|---|---|---|")
318 for f in clean:
319 h = f["clusters"][0]["hash"] if f["clusters"] else "-"
320 lines.append(f"| `{f['path']}` | {f['present_in']} | `{h}` |")
321 lines.append("")
322
323 if missing:
324 lines.append("## Watched but absent everywhere\n")
325 for f in missing:
326 lines.append(f"- `{f['path']}`")
327 lines.append("")
328
329 Path(path).write_text("\n".join(lines) + "\n")
330 return len(drift)
Include errors and drift in early return; separately report paths with parse-only failures.
When files exist but all normalizations fail, repo_errors is discarded at the early return (lines 262–268) and the path gets reported as "absent everywhere" in the markdown. This masks parse problems that drift detection should surface.
The normal return path (lines 271–279) correctly includes both fields; the early return must match. Additionally, the reporting logic (line 292) conflates truly missing paths with errored ones—missing = [f for f in findings if f.get("present_in", 0) == 0] catches both, so parse-only failures disappear from view.
Proposed fix
if not repo_versions:
return {
"path": path,
"normalizer": normaliser_name,
"present_in": 0,
"clusters": [],
+ "errors": repo_errors,
+ "drift": False,
}- missing = [f for f in findings if f.get("present_in", 0) == 0]
+ missing = [f for f in findings if f.get("present_in", 0) == 0 and not f.get("errors")]
+ errored = [f for f in findings if f.get("errors")]
@@
+ if errored:
+ lines.append("## Paths with normalization errors\n")
+ for f in errored:
+ lines.append(f"### `{f['path']}` (normaliser: `{f['normalizer']}`)")
+ for repo, err in f["errors"].items():
+ lines.append(f"- `{repo.split('/')[-1]}`: {err}")
+ lines.append("")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if not repo_versions: | |
| return { | |
| "path": path, | |
| "normalizer": normaliser_name, | |
| "present_in": 0, | |
| "clusters": [], | |
| } | |
| if not repo_versions: | |
| return { | |
| "path": path, | |
| "normalizer": normaliser_name, | |
| "present_in": 0, | |
| "clusters": [], | |
| "errors": repo_errors, | |
| "drift": False, | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/drift_check.py` around lines 262 - 268, The early return when
repo_versions is empty should include the same keys as the normal return: add
"errors": repo_errors (or [] if None) and "drift": [] alongside "path",
"normalizer"/normaliser_name, "present_in": 0, and "clusters" so parse-only
failures aren't dropped; update the block that builds the result for the repo in
the function handling repo_versions to return these fields (reference
repo_versions, repo_errors, normaliser_name, path). Also change the downstream
reporting filter that builds missing files (the list comprehension using
findings and f.get("present_in", 0) == 0) to distinguish truly missing files
from parse-failed files by checking the "errors" field (e.g., treat a file as
missing only if present_in == 0 and not f.get("errors")), so errored-but-present
paths are reported separately.
Git's gitignore and gitattributes semantics give the LAST matching pattern precedence within one level. Sorting + set-dedupe (the previous "lines" normaliser) destroyed that order, so two semantically different files could hash to the same cluster and silently pass. This commit: - Adds norm_lines_set (was norm_lines_sorted) — sort + dedupe, valid only when the consumer treats the file as an unordered set. - Adds norm_lines_ordered — preserves order, collapses only adjacent duplicates. Required for .gitignore and .gitattributes. - Keeps "lines" as an alias that resolves to ordered (fail-safe default for any future entry). - Switches .gitignore, .gitattributes, .dockerignore, .markdownlintignore to lines_ordered in DEFAULT_NORMALISERS. Verified locally: .gitattributes now collapses to 1 cluster across all 5 repos that have it (semantically identical with comment / whitespace noise removed); .gitignore stays at 16/16 (genuine content drift); .dockerignore stays at 2/2 (genuine content drift). Credit: caught by parallel agent reviewing PR #1.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/drift_check.py`:
- Around line 297-300: When norm_func(data) raises, record the error as now but
also treat it as a fatal detection failure: in the except block for the
repo_versions assignment add a failure marker (e.g., set
repo_versions[repo]=None and increment drift_count or set a has_error flag) and
ensure the final exit path checks repo_errors/has_error and calls sys.exit(1) so
normalization failures produce a non-zero exit; apply the same change to the
analogous try/except at the block around lines 413-422 (same handling for
norm_func exceptions).
- Around line 74-77: The current check treats empty-file content ("") as missing
because it uses "if not content:"; change the logic to only treat missing
content when result.get("content") is None. In the block using
result.get("content") (the local variable content) and returning
base64.b64decode(content), replace the falsy check with an explicit None check
so that an empty string decodes to b"" instead of returning None.
- Around line 397-411: Validate that the loaded policy is a mapping and contains
non-empty "repos" and "watch" lists before dereferencing: after policy =
yaml.safe_load(...), check isinstance(policy, dict) and set repos =
policy.get("repos") or []; watch = policy.get("watch") or []; if not
isinstance(policy, dict) or not repos or not watch then print a clear error to
stderr and sys.exit(2). Also, when iterating watch before calling
check_path(repos, entry), ensure each entry is either a string or a mapping with
a "path" key (reject/massage other types with a helpful error and exit 2); keep
the existing args.quiet logging behavior unchanged.
- Around line 172-188: The current ini normalizer function norm_ini in
scripts/drift_check.py flattens sections and sorts keys, which breaks
.editorconfig's order-dependent override semantics; update the normalization
mapping so files named ".editorconfig" use the raw normalizer (preserving
original line order) instead of norm_ini, leaving norm_ini unchanged for true
INI-like files (e.g., .gitmodules, .npmrc); locate the mapping that assigns
normalizers to file patterns and replace the association for ".editorconfig" to
use the raw/plain normalizer so later-section overrides remain intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 52da6590-0196-43fc-b4f9-9eb7edd20554
📒 Files selected for processing (1)
scripts/drift_check.py
📜 Review details
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
| content = result.get("content") | ||
| if not content: | ||
| return None | ||
| return base64.b64decode(content) |
There was a problem hiding this comment.
Don't classify zero-byte files as missing.
An existing empty file comes back with content == "", so if not content: drops it as absent. That will hide drift for legitimately empty watched files.
Proposed fix
content = result.get("content")
- if not content:
+ if content is None:
return None
return base64.b64decode(content)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/drift_check.py` around lines 74 - 77, The current check treats
empty-file content ("") as missing because it uses "if not content:"; change the
logic to only treat missing content when result.get("content") is None. In the
block using result.get("content") (the local variable content) and returning
base64.b64decode(content), replace the falsy check with an explicit None check
so that an empty string decodes to b"" instead of returning None.
| def norm_ini(data: bytes) -> str: | ||
| """INI-like files (.editorconfig, .gitmodules, .npmrc, .globalconfig): | ||
| parse sections + key=value pairs, sort, dump.""" | ||
| text = data.decode("utf-8", errors="replace") | ||
| entries: dict[str, str] = {} | ||
| section = "" | ||
| for line in text.splitlines(): | ||
| s = line.strip() | ||
| if not s or s.startswith("#") or s.startswith(";"): | ||
| continue | ||
| if s.startswith("[") and s.endswith("]"): | ||
| section = s[1:-1].strip() | ||
| continue | ||
| if "=" in s: | ||
| k, v = s.split("=", 1) | ||
| entries[f"{section}/{k.strip()}"] = v.strip() | ||
| return "\n".join(f"{k}={v}" for k, v in sorted(entries.items())) |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In the EditorConfig specification, when multiple sections in one .editorconfig file match the same file, do properties from later sections override earlier sections?
💡 Result:
Yes, in the EditorConfig specification, properties from later sections override properties from earlier sections when multiple sections within the same.editorconfig file match the same file [1][2]. According to the official EditorConfig specification, files are read from top to bottom, and the most recent pairs found take precedence [1][3]. Consequently, if a file matches multiple sections within a single.editorconfig file, the properties defined in the section that appears later in the file will override those defined in earlier sections [1][2]. When multiple.editorconfig files are involved, the search process reads files from the directory of the target file up to the root directory [1][4]. Because the rules from closer EditorConfig files are read last, properties in closer files take precedence over those in files located further up the directory tree [1][5].
Citations:
- 1: https://spec.editorconfig.org/
- 2: https://spec.editorconfig.org/index.html
- 3: https://gsp.com/cgi-bin/man.cgi?section=5&topic=editorconfig-format
- 4: http://docs.editorconfig.org/en/master/editorconfig-format.html
- 5: https://manpages.debian.org/testing/editorconfig/editorconfig-format.5.en.html
🏁 Script executed:
# Find the DEFAULT_NORMALISERS mapping and verify the .editorconfig line
rg "DEFAULT_NORMALISERS|\.editorconfig.*ini" scripts/drift_check.py -A 5 -B 5Repository: ANcpLua/github-settings-automation
Length of output: 948
.editorconfig mapping to ini normalizer creates false semantic equivalences.
.editorconfig files are order-dependent per the official spec: later matching sections override earlier ones. norm_ini() parses sections and keys into a sorted dict, discarding the original order. Two files with different override precedence will produce identical normalized output, creating false positives in drift detection.
Example: [*]→indent=space followed by [*.py]→indent=tab produces the same normalized form as the reversed order, despite opposite final behavior for Python files.
Switch .editorconfig to raw normalizer to preserve order-dependent semantics.
Proposed fix
- ".editorconfig": "ini",
+ ".editorconfig": "raw",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/drift_check.py` around lines 172 - 188, The current ini normalizer
function norm_ini in scripts/drift_check.py flattens sections and sorts keys,
which breaks .editorconfig's order-dependent override semantics; update the
normalization mapping so files named ".editorconfig" use the raw normalizer
(preserving original line order) instead of norm_ini, leaving norm_ini unchanged
for true INI-like files (e.g., .gitmodules, .npmrc); locate the mapping that
assigns normalizers to file patterns and replace the association for
".editorconfig" to use the raw/plain normalizer so later-section overrides
remain intact.
| try: | ||
| repo_versions[repo] = norm_func(data) | ||
| except Exception as e: | ||
| repo_errors[repo] = f"{type(e).__name__}: {e}" |
There was a problem hiding this comment.
Normalization failures never affect the final exit code.
repo_errors is collected, but the process still exits solely from drift_count. A malformed watched file should fail the detector, not quietly become a 0/1 result with incomplete clustering.
Proposed fix
drift_count = write_markdown(findings, args.output)
write_manifest(findings, args.manifest)
+ had_errors = any(f.get("error") or f.get("errors") for f in findings)
@@
- sys.exit(1 if drift_count > 0 else 0)
+ if had_errors:
+ sys.exit(2)
+ sys.exit(1 if drift_count > 0 else 0)Also applies to: 413-422
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/drift_check.py` around lines 297 - 300, When norm_func(data) raises,
record the error as now but also treat it as a fatal detection failure: in the
except block for the repo_versions assignment add a failure marker (e.g., set
repo_versions[repo]=None and increment drift_count or set a has_error flag) and
ensure the final exit path checks repo_errors/has_error and calls sys.exit(1) so
normalization failures produce a non-zero exit; apply the same change to the
analogous try/except at the block around lines 413-422 (same handling for
norm_func exceptions).
| policy = yaml.safe_load(policy_path.read_text()) | ||
|
|
||
| repos = policy.get("repos") or [] | ||
| watch = policy.get("watch") or [] | ||
| if not repos or not watch: | ||
| print("policy must define non-empty 'repos' and 'watch' lists", file=sys.stderr) | ||
| sys.exit(2) | ||
|
|
||
| findings: list[dict] = [] | ||
| for entry in watch: | ||
| if isinstance(entry, str): | ||
| entry = {"path": entry} | ||
| if not args.quiet: | ||
| print(f"checking {entry['path']} across {len(repos)} repos...", file=sys.stderr) | ||
| findings.append(check_path(repos, entry)) |
There was a problem hiding this comment.
Validate the policy structure before dereferencing it.
A malformed policy, a non-mapping top level, or a watch entry without path will traceback here and exit 1. This is config input; fail fast with a clear message and exit 2 instead of indexing straight into it.
Proposed fix
- policy = yaml.safe_load(policy_path.read_text())
+ try:
+ policy = yaml.safe_load(policy_path.read_text())
+ except Exception as e:
+ print(f"could not parse policy file {args.policy}: {e}", file=sys.stderr)
+ sys.exit(2)
+ if not isinstance(policy, dict):
+ print("policy must be a mapping with 'repos' and 'watch' keys", file=sys.stderr)
+ sys.exit(2)
@@
for entry in watch:
if isinstance(entry, str):
entry = {"path": entry}
+ if not isinstance(entry, dict) or not entry.get("path"):
+ print("each watch entry must be a string or mapping with a non-empty 'path'", file=sys.stderr)
+ sys.exit(2)
if not args.quiet:
print(f"checking {entry['path']} across {len(repos)} repos...", file=sys.stderr)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/drift_check.py` around lines 397 - 411, Validate that the loaded
policy is a mapping and contains non-empty "repos" and "watch" lists before
dereferencing: after policy = yaml.safe_load(...), check isinstance(policy,
dict) and set repos = policy.get("repos") or []; watch = policy.get("watch") or
[]; if not isinstance(policy, dict) or not repos or not watch then print a clear
error to stderr and sys.exit(2). Also, when iterating watch before calling
check_path(repos, entry), ensure each entry is either a string or a mapping with
a "path" key (reject/massage other types with a helpful error and exit 2); keep
the existing args.quiet logging behavior unchanged.
|
@claude[agent] Verify each finding against current code. Fix only still-valid issues, skip the In |
|
@codex[agent] Verify each finding against current code. Fix only still-valid issues, skip the In |
Three small upgrades from review feedback on PR #1: 1. Added branch_protection() + branch_rulesets() + is_default_branch_writable() helpers. Lets a future --remediate path detect 'classic protection OR active ruleset on default branch' BEFORE attempting writes, so we choose PUT-direct vs branch-then-PR upfront. Read-only audit path is unchanged. 2. Documented why fetch_file uses Contents API + base64-decode (not raw.githubusercontent.com): CDN cache lags merges ~5min and lies about post-merge state. Already implemented; now the docstring says why. 3. Hardened policy YAML loading per Codex CR review: - Reject non-mapping top-level YAML (was implicit AttributeError later). - Reject 'repos' or 'watch' not being lists. - Normalise watch entries upfront — accept str or {path: str, ...}, reject anything else with a clear pointer to the offending index. No behavioural change for well-formed policy files; bad ones now fail fast with a useful message instead of a downstream traceback.
Adds templates/triage-bot.yml — exact copy of the validated workflow from ANcpLua/triage-bot-playground PR #1 + #2 (2026-05-18) with a documentation header explaining decision rules, prerequisites, and the gap it closes. Decision rules (v1, deterministic): body contains [triage:auto-resolve] → dismiss body matches /^(nit:|nitpick:|praise:)/i → dismiss author is Copilot, body < 200 chars → dismiss anything else → needs-human (default) The gap it closes: GitHub native auto-merge fires the moment CI goes green, even if Copilot / CodeRabbit later post review comments. With branch protection or a ruleset enforcing required_conversation_resolution=true, any unresolved thread keeps the PR mergeStateStatus=BLOCKED until explicitly resolved. The bot auto-resolves the non-blocking threads (so they don't accumulate as noise) while leaving real concerns to block the merge until a human acts. This is the structural fix that makes the pr-heal.yml cooldown (commit 22d0b28) belt-and-suspenders rather than the primary line of defence — the cooldown buys time for reviewers, the native gate deterministically waits for explicit resolution. enforce-repo-settings.yml extended (both enforce-user and enforce-org jobs): seed-if-missing pattern. Skip-if-exists because adopting the workflow requires a per-repo TRIAGE_PAT secret and the required_conversation_resolution gate on the default branch; this sweep can put the file in place but can't safely flip either. Gotcha (in the workflow's docstring): GITHUB_TOKEN cannot resolveReviewThread on threads it didn't author — needs a classic PAT or GitHub App installation token with pull_requests: write.
Semantic drift detection for shared configs across the framework repos. Parses each file per type (JSON/YAML/XML/INI/line-based) and clusters by equivalence — whitespace/key-order/flow-style noise collapses, only real content drift surfaces.
Watchlist defaults to 16 repos × 33 paths. Weekly cron Monday 06:00 UTC; opens/updates a
config-driftissue on drift.Local test result: 8 paths fully clean, 21 drifted — including legitimate variation (
renovate.jsonanti-self-bump per repo,.coderabbit.yamlper-repo policies).renovate.jsonwas 16 byte-unique → only 5 semantic clusters (12 of 16 identical after normalisation). Exactly the noise reduction we wanted.