ci: add doc-validation hooks (markdownlint, typos, placeholder linter, lychee) by andreahlert · Pull Request #18 · apache/magpie

andreahlert · 2026-04-30T14:24:44Z

Adds doc validation hooks to pre-commit and a lychee link-check workflow. Catches the kind of bugs review currently has to find by eye.

The hooks:

markdownlint-cli2 with a tight config (MD051 for broken anchors, MD053 for dangling link refs). Style rules off so the diff stays small and existing prose isn't churned.
typos with project terms allowlisted in .typos.toml (CNA, Vulnogram, ponymail, mis-, Nd, pre-empted).
tools/dev/check-placeholders.sh, a small bash linter that refuses hardcoded apache/airflow / Apache Airflow inside .claude/skills/ and tools/*.md. PR #1 (docs: tighten Airflow references to placeholders across framework files) already had to scrub these once. Cheaper to prevent the regression than redo the cleanup.
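The real linter is a bash script whose contents are not shown on this page. As a rough illustration of the rule it enforces, here is a minimal Python sketch. The function names, the case-sensitive match, and the reading of tools/*.md as "top-level files only" are all assumptions.

```python
import re
from pathlib import PurePosixPath

# The two hardcoded spellings the PR description says must be refused.
# Case-sensitive matching is an assumption.
FORBIDDEN = re.compile(r"apache/airflow|Apache Airflow")


def in_scope(path: str) -> bool:
    """True for anything under .claude/skills/ and for tools/*.md.

    Treating tools/*.md as non-recursive is an assumption.
    """
    parts = PurePosixPath(path).parts
    if parts[:2] == (".claude", "skills"):
        return True
    return len(parts) == 2 and parts[0] == "tools" and parts[1].endswith(".md")


def find_hardcoded(path: str, text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each offending line in an in-scope file."""
    if not in_scope(path):
        return []
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(text.splitlines(), start=1)
        if FORBIDDEN.search(line)
    ]
```

A pre-commit hook built on this would exit non-zero whenever `find_hardcoded` returns anything for a staged file.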

Lychee runs in a separate workflow on PRs and on a daily cron. Marked continue-on-error: true for now because the tree has 24 pre-existing broken refs to files that haven't landed yet (config/, projects/airflow/, the issue-template YAML). When the baseline hits zero we flip it to a hard gate.
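The workflow itself relies on continue-on-error: true. The snippet below only illustrates the baseline idea behind the planned flip to a hard gate: fail on new breakages while the old ones are tolerated, and fail on anything once gating is on. The function names and the idea of keeping the baseline as a set of strings are hypothetical and not part of the PR.

```python
def new_breakages(baseline: set[str], current: set[str]) -> set[str]:
    """Broken references that are not part of the known baseline."""
    return current - baseline


def should_fail(baseline: set[str], current: set[str], hard_gate: bool) -> bool:
    """Decide whether a run should fail.

    Informational mode: only breakages outside the baseline count.
    Hard-gate mode: any breakage at all counts.
    """
    if hard_gate:
        return bool(current)
    return bool(new_breakages(baseline, current))
```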

Wiring this up surfaced five real broken anchors I fixed along the way:

AGENTS.md link to #point-reporters-to-the-security-model-dont-re-explain-it was missing the "project's" prefix the actual heading uses.
Two anchors in tools/ponymail/operations.md (#get-email, #get-thread) pointed at headings that are actually ## Get an email and ## Get a thread.
Headings with → in projects/_template/scope-labels.md and tools/github/issue-template.md produce URL-encoded slugs that GitHub doesn't resolve. Renamed the → to "to" and re-ran doctoc.
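To see why the ponymail and AGENTS.md anchors were broken, it helps to derive the anchor from the heading text. The function below is a simplified approximation of GitHub-style heading slugs for plain ASCII headings, not GitHub's actual implementation. `approx_slug` is a hypothetical name, and the AGENTS.md heading text is inferred from the anchor quoted above.

```python
import re


def approx_slug(heading: str) -> str:
    """Approximate a heading anchor: lowercase, drop punctuation, spaces to hyphens."""
    text = heading.strip().lower()
    # Keep only word characters, hyphens, and spaces.
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")


print(approx_slug("Get an email"))   # get-an-email
print(approx_slug("Get a thread"))   # get-a-thread
print(approx_slug("Point reporters to the project's security model, don't re-explain it"))
# point-reporters-to-the-projects-security-model-dont-re-explain-it
```

Under this approximation, a link to #get-email cannot match a heading that reads "Get an email", which is the kind of mismatch MD051 reports.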

Files

.markdownlint.json                 (new)
.lychee.toml                       (new)
.typos.toml                        (new)
tools/dev/check-placeholders.sh    (new, executable)
.github/workflows/link-check.yml   (new, informational)
.pre-commit-config.yaml            (+3 hooks)
CONTRIBUTING.md                    (documents the hooks)
AGENTS.md, projects/_template/scope-labels.md,
tools/github/issue-template.md, tools/ponymail/operations.md
                                   (anchor fixes)

Local run

prek run --all-files

Or one at a time:

npx markdownlint-cli2 "**/*.md"
typos
tools/dev/check-placeholders.sh
lychee --config .lychee.toml .

Test plan

npx markdownlint-cli2 "**/*.md" clean.
typos clean.
tools/dev/check-placeholders.sh clean.
lychee --offline . reports the 24 pre-existing breakages, no new ones from this PR.
prek run --all-files green.

Out of scope

@potiuk, thinking about picking up a couple more fronts after this one if it lands well. Nothing huge, just smoothing edges:

MD040 (fenced code language tags). 62 untagged fences in the tree, mechanical fix but it would balloon the diff, better as its own PR.
MD038 (no-space-in-code). Most hits are intentional literal markdown samples like ` # `, ` ### `, ` - `. Re-enabling means escaping those consistently, again its own PR.
Skill frontmatter schema validator. Useful eventually, but only 8 skills today and the shape rarely changes.
Lychee as a blocking gate. Needs the 24 pre-existing breakages cleaned first.

Happy to do any or none of those, whatever fits the direction you have in mind.

Was generative AI tooling used to co-author this PR?

Yes

…, lychee)

Adds a minimal doc-validation layer to pre-commit + a lychee link-check workflow. Catches the bug classes review currently has to find by eye:

- markdownlint-cli2 with a tight config (MD051 broken anchors, MD053 dangling link refs); style rules off so the diff stays small
- typos with project-term allowlist in .typos.toml (CNA, Vulnogram, ponymail, mis-, Nd, pre-empted)
- tools/dev/check-placeholders.sh refuses hardcoded apache/airflow / Apache Airflow inside .claude/skills/ and tools/*.md (PR #1 already had to scrub these once)
- lychee runs in a separate workflow on PR + daily cron; informational only today (continue-on-error: true) because the existing tree has 24 pre-existing broken refs to files that have not landed yet (config/, projects/airflow/, the issue-template YAML); flips to a hard gate once the baseline reaches zero

Wiring this up surfaced five real broken anchors I fixed along the way:

- AGENTS.md missing "the project's" prefix in #point-reporters-to-the-security-model-dont-re-explain-it
- tools/ponymail/operations.md anchors #get-email and #get-thread pointed at headings that are actually "Get an email" / "Get a thread"
- projects/_template/scope-labels.md and tools/github/issue-template.md carried headings with a literal "→" that GitHub URL-encodes into unresolvable slugs; renamed to "to" and re-ran doctoc

Signed-off-by: André Ahlert <andre@aex.partners>

…on SHA

PR apache#18's first CI run failed three checks; this fixup commit addresses all three:

- prek / markdownlint: markdownlint-cli2 v0.22.1 requires Node ≥ 20 (its string-width dep uses the regex `/v` flag), but `default_language_version.node` was pinned to 18.6.0. Bumped to 22.11.0 (current active LTS).
- asf-allowlist-check: `lycheeverse/lychee-action@82202e5e…` (v2.6.1) is not on the ASF infrastructure-actions allowlist. Re-pinned to the allowlisted v2.8.0 SHA `8646ba30535128ac92d33dfc9133794bfdd9b411`. Comment in the workflow now explains the allowlist requirement so future bumps go through the same check.
- zizmor (ref-version-mismatch): the v2.6.1-comment + actual-SHA combination from the original pin was flagged because the SHA pointed to v2.4.1, not v2.6.1. The new v2.8.0 SHA correctly maps to its tag, so the warning disappears with the same change.

prek + zizmor verified clean locally.

Generated-by: Claude Code (Opus 4.7)
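The ref-version-mismatch finding comes down to one comparison: does the SHA in a `uses:` line belong to the tag named in its trailing comment? The sketch below illustrates that comparison only and is not how zizmor is implemented. The `known_tags` lookup table is an assumption (a real check would resolve tags against the action's repository), and the all-zero SHA in the usage lines is a made-up placeholder.

```python
import re

# Matches e.g. "uses: owner/action@<40-hex sha> # v1.2.3".
USES = re.compile(
    r"uses:\s*(?P<action>[\w.-]+/[\w./-]+)@(?P<sha>[0-9a-f]{40})\s*#\s*(?P<tag>v[\w.]+)"
)


def pin_mismatches(workflow_text: str, known_tags: dict[tuple[str, str], str]) -> list[str]:
    """Report pins whose SHA differs from the SHA recorded for the commented tag.

    known_tags maps (action, tag) to the commit SHA that tag points at.
    Pins whose (action, tag) pair is unknown are skipped, not reported.
    """
    problems = []
    for line in workflow_text.splitlines():
        m = USES.search(line)
        if not m:
            continue
        expected = known_tags.get((m["action"], m["tag"]))
        if expected is not None and expected != m["sha"]:
            problems.append(f"{m['action']}: comment says {m['tag']} but SHA is {m['sha']}")
    return problems


tags = {("lycheeverse/lychee-action", "v2.8.0"): "8646ba30535128ac92d33dfc9133794bfdd9b411"}
good = "- uses: lycheeverse/lychee-action@8646ba30535128ac92d33dfc9133794bfdd9b411 # v2.8.0"
bad = "- uses: lycheeverse/lychee-action@" + "0" * 40 + " # v2.8.0"  # placeholder SHA
print(pin_mismatches(good, tags))       # []
print(len(pin_mismatches(bad, tags)))   # 1
```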

potiuk · 2026-05-01T13:42:58Z

Fixed all failures. Merging. Yep. Improvements welcome @andreahlert :)

…sedes #30) (#44)

* docs: enable markdownlint MD040 and tag all fenced code blocks

Follow-up to #18. Flips MD040 from `false` to `true` and tags the 64 previously untagged fences across the tree.

Most fences ended up `text` (MCP call sketches, URL examples, dir trees, plain output, commit trailers). 3 got `html` for HTML-comment idempotency markers and the <details> envelope, 3 `markdown` for the AI-disclosure block and rollup body samples, 1 `yaml` for the subagent return block in sync-security-issue.

One nested-fence case in allocate-cve/SKILL.md needed the outer fence promoted from 3 to 4 backticks so the inner 3-backtick block renders as an actual nested code block instead of breaking the outer one.

Signed-off-by: André Ahlert <andre@aex.partners>

* docs: extend MD040 tagging to skills added since #30

#30 was opened against an earlier tree state; the pr-management skill family (lifted in #33, renamed to type-what-action in #35) added 9 new skill supporting files with 20 untagged fences that fail markdownlint MD040 once it's enabled.

This commit applies the same tagging convention #30 established for the security family + tools to the new pr-management files:

pr-management-triage/fetch-and-batch.md        1 fence
pr-management-triage/comment-templates.md      2 fences → markdown
pr-management-triage/interaction-loop.md       4 fences → text (UI mockups)
pr-management-triage/workflow-approval.md      2 fences → text (UI mockups)
pr-management-stats/fetch.md                   3 fences → text (search queries)
pr-management-stats/render.md                  3 fences → text (output samples)
pr-management-stats/classify.md                3 fences → text (pseudocode)
pr-management-code-review/review-flow.md       1 fence → text (CLI mockup)
pr-management-code-review/prerequisites.md     1 fence → text (HTTP error)

Most fences ended up `text` (the same catch-all #30's commit message used for "MCP call sketches, URL examples, dir trees, plain output"). Two `markdown` fences in `pr-management-triage/comment-templates.md` because the content is a markdown link / list item example that GitHub should render as markdown.

prek run --all-files clean. MD040 reports zero violations across the tree.

Generated-by: Claude Code (Claude Opus 4.7)

---------

Signed-off-by: André Ahlert <andre@aex.partners>
Co-authored-by: André Ahlert <andre@aex.partners>
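The nested-fence case above is easy to reproduce with a small scanner. The following is a simplified sketch of an MD040-style check and not markdownlint's implementation. It assumes a closing fence must use the same character as the opener, be at least as long, and carry no info string. `untagged_fences` is a hypothetical name.

```python
import re

# Opening or closing fence: up to 3 spaces, then 3+ backticks or tildes.
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def untagged_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that have no language tag."""
    hits = []
    open_fence = None  # (fence character, length) while inside a block
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        m = FENCE.match(line)
        if not m:
            continue
        marker, info = m.group(1), m.group(2).strip()
        if open_fence is None:
            open_fence = (marker[0], len(marker))
            if not info:
                hits.append(lineno)
        elif marker[0] == open_fence[0] and len(marker) >= open_fence[1] and not info:
            open_fence = None  # closing fence
    return hits


F3, F4 = "`" * 3, "`" * 4
broken = "\n".join([F3 + "markdown", F3, "inner", F3, F3])
fixed = "\n".join([F4 + "markdown", F3, "inner", F3, F4])
print(untagged_fences(broken))  # [4]
print(untagged_fences(fixed))   # []
```

With a 3-backtick outer fence, the first inner fence closes the outer block and the next one opens a new, untagged block. Promoting the outer fence to 4 backticks keeps the inner fences as literal content.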

* feat(validator): add branch-name confidentiality check (#18, SOFT advisory)

Adds check #17 to skill-and-tool-validator: scans git checkout -b and git switch -c examples inside fenced code blocks (across skills/ and docs/) and flags any concrete branch name that contains an embargo-breaking term: CVE IDs (CVE-YYYY-NNNNN), security, vulnerability/vuln, or advisory. Pre-disclosure public branch names must not reveal embargo context; neutral descriptive slugs are the safe alternative.

Lines explicitly marked as bad examples (**bad**, bad:) are exempt, and placeholder branch names (<fix-slug>, $VAR) are silently skipped. The check is SOFT-advisory only (never blocks the run).

14 unit tests cover CVE IDs, security framing, vuln/advisory terms, placeholder exemptions, neutral names, and bad-example exemptions. The full codebase currently produces zero new violations.

Generated-by: Claude (Opus 4.7)

* fix for tool directories

* change regular expression
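The validator's source is not shown here, so the following is only a sketch of the rules the commit message describes, with hypothetical names and regexes. For brevity it scans every line it is given and does not restrict itself to fenced code blocks.

```python
import re

# Branch-creating commands named in the commit message.
BRANCH_CMD = re.compile(r"git\s+(?:checkout\s+-b|switch\s+-c)\s+(\S+)")
# Terms treated as embargo-breaking; exact patterns are assumptions.
SENSITIVE = re.compile(r"cve-\d{4}-\d{4,}|security|vulnerability|vuln|advisory", re.IGNORECASE)
# Lines marked as deliberate bad examples are exempt.
BAD_EXAMPLE = re.compile(r"\*\*bad\*\*|\bbad:", re.IGNORECASE)


def flag_branch_names(lines: list[str]) -> list[tuple[int, str]]:
    """Return (line number, branch name) for branch names that leak embargo context."""
    findings = []
    for lineno, line in enumerate(lines, start=1):
        if BAD_EXAMPLE.search(line):
            continue
        m = BRANCH_CMD.search(line)
        if not m:
            continue
        name = m.group(1)
        if name.startswith(("<", "$")):
            continue  # placeholder such as <fix-slug> or $VAR
        if SENSITIVE.search(name):
            findings.append((lineno, name))
    return findings


sample = [
    "git checkout -b fix-CVE-2026-12345",
    "git switch -c <fix-slug>",
    "git switch -c $BRANCH",
    "git checkout -b tidy-header-parsing",
    "bad: git checkout -b security-fix",
]
print(flag_branch_names(sample))  # [(1, 'fix-CVE-2026-12345')]
```

Because the check is advisory, a caller would print the findings as warnings and still exit successfully.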

…OFT advisory)

Reads the Axis 1 (skill) and Axis 2 (tool) capability vocabulary tables from docs/labels-and-capabilities.md and verifies every taxonomy entry appears in at least one mapping-table row; entries marked *(reserved)* or *(future)* are exempt. Cross-checks the SKILL_CAPABILITIES / TOOL_CAPABILITIES code constants against the parsed vocabulary.

The reserved/future marker accepts an elaborated parenthetical (e.g. *(future work)*, *(reserved for #999)*), not only the exact forms.

Co-authored-by: Justin McLean <justin@classsoftware.com>
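A minimal sketch of the coverage rule and the relaxed reserved/future marker, assuming the vocabulary table has already been parsed into a name-to-description mapping. The function name, the regex, and the sample capability names are illustrative and not taken from the validator.

```python
import re

# Accepts *(reserved)*, *(future)*, and elaborated forms such as *(future work)*.
EXEMPT = re.compile(r"\*\((?:reserved|future)\b[^)]*\)\*")


def unmapped_entries(vocabulary: dict[str, str], mapped: set[str]) -> list[str]:
    """Vocabulary entries that appear in no mapping-table row and are not exempt."""
    return sorted(
        name
        for name, description in vocabulary.items()
        if name not in mapped and not EXEMPT.search(description)
    )


vocab = {
    "capability:alpha": "Does alpha",
    "capability:beta": "Planned *(future work)*",
    "capability:gamma": "Does gamma",
}
print(unmapped_entries(vocab, {"capability:alpha"}))  # ['capability:gamma']
```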

…SOFT advisory)

contract:mail-source and contract:mail-archive adapter READMEs must declare that fetched mail content is external data (not instructions) and mention the prompt-injection risk in embedded mail content. Both are SOFT advisories.

Co-authored-by: Justin McLean <justin@classsoftware.com>

potiuk force-pushed the feat/doc-validation-ci branch from d8540ff to 53fa19e May 1, 2026 13:12

potiuk approved these changes May 1, 2026

potiuk merged commit 603258b into apache:main May 1, 2026
8 checks passed

andreahlert deleted the feat/doc-validation-ci branch May 1, 2026 13:47

andreahlert mentioned this pull request May 2, 2026

docs: enable markdownlint MD040 and tag all fenced code blocks #30

Closed

4 tasks

andreahlert added the mode:platform Substrate / infra — not a mode (sandbox, CI, validators) label May 7, 2026

andreahlert mentioned this pull request May 15, 2026

docs(principles): add operational principles document #147

Merged

potiuk merged 2 commits into apache:main from andreahlert:feat/doc-validation-ci
