
ci: add doc-validation hooks (markdownlint, typos, placeholder linter, lychee) #18

Merged. potiuk merged 2 commits into apache:main from andreahlert:feat/doc-validation-ci on May 1, 2026.

@andreahlert andreahlert commented Apr 30, 2026

Collaborator

Adds doc validation hooks to pre-commit and a lychee link-check workflow. Catches the kind of bugs review currently has to find by eye.

The hooks:

  • markdownlint-cli2 with a tight config (MD051 for broken anchors, MD053 for dangling link refs). Style rules off so the diff stays small and existing prose isn't churned.
  • typos with project terms allowlisted in .typos.toml (CNA, Vulnogram, ponymail, mis-, Nd, pre-empted).
  • tools/dev/check-placeholders.sh, a small bash linter that refuses hardcoded apache/airflow / Apache Airflow inside .claude/skills/ and tools/*.md. PR #1 (docs: tighten Airflow references to placeholders across framework files) already had to scrub these once. Cheaper to prevent the regression than redo the cleanup.
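
For readers who want the shape of the placeholder check without opening the script: a minimal sketch in Python, not the bash linter from this PR. The two forbidden spellings and the scanned paths come from the description above. The function names, the `*.md` filter inside `.claude/skills/`, and the output format are assumptions made for illustration.

```python
import re
from pathlib import Path

# The two spellings the PR text says the linter refuses.
FORBIDDEN = re.compile(r"apache/airflow|Apache Airflow")


def find_hardcoded(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that contain a hardcoded project name."""
    return [
        (n, line)
        for n, line in enumerate(text.splitlines(), start=1)
        if FORBIDDEN.search(line)
    ]


def scan(root: Path) -> list[str]:
    """Scan .claude/skills/ (recursively, *.md only: an assumption) and tools/*.md."""
    targets: list[Path] = []
    skills = root / ".claude" / "skills"
    tools = root / "tools"
    if skills.is_dir():
        targets += list(skills.rglob("*.md"))
    if tools.is_dir():
        targets += list(tools.glob("*.md"))
    hits = []
    for path in sorted(targets):
        for n, line in find_hardcoded(path.read_text(encoding="utf-8")):
            hits.append(f"{path.relative_to(root)}:{n}: {line.strip()}")
    return hits


if __name__ == "__main__":
    problems = scan(Path("."))
    print("\n".join(problems))
    raise SystemExit(1 if problems else 0)
```

Exiting non-zero on any hit is what lets a pre-commit hook block the commit. The real script may differ in what it matches and how it reports.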

Lychee runs in a separate workflow on PRs and on a daily cron. Marked continue-on-error: true for now because the tree has 24 pre-existing broken refs to files that haven't landed yet (config/, projects/airflow/, the issue-template YAML). When the baseline hits zero we flip it to a hard gate.

Wiring this up surfaced five real broken anchors I fixed along the way:

  • AGENTS.md link to #point-reporters-to-the-security-model-dont-re-explain-it was missing the "project's" prefix the actual heading uses.
  • Two anchors in tools/ponymail/operations.md (#get-email, #get-thread) pointed at headings that are actually ## Get an email and ## Get a thread.
  • Headings with a literal "→" in projects/_template/scope-labels.md and tools/github/issue-template.md produce URL-encoded slugs that GitHub doesn't resolve. Renamed "→" to "to" and re-ran doctoc.
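
To make the ponymail case concrete, here is a rough sketch of the comparison an anchor check performs. `approx_slug` is a simplified approximation of GitHub's heading-to-anchor rule (lowercase, drop punctuation, spaces to hyphens). It is not GitHub's or markdownlint's actual implementation, and it ignores duplicate-heading suffixes. Both function names are hypothetical.

```python
import re


def approx_slug(heading: str) -> str:
    """Approximate a heading anchor: lowercase, drop punctuation, hyphenate spaces."""
    text = heading.strip().lower()
    # Keep word characters, hyphens and spaces; everything else is dropped.
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")


def broken_anchors(headings: list[str], links: list[str]) -> list[str]:
    """Return the in-page links whose fragment matches no heading slug."""
    slugs = {approx_slug(h) for h in headings}
    return [link for link in links if link.lstrip("#") not in slugs]


headings = ["Get an email", "Get a thread"]
links = ["#get-email", "#get-thread", "#get-an-email"]
print(broken_anchors(headings, links))
```

With the two headings from tools/ponymail/operations.md, the old anchors `#get-email` and `#get-thread` are reported, while `#get-an-email` resolves.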

Files

.markdownlint.json                 (new)
.lychee.toml                       (new)
.typos.toml                        (new)
tools/dev/check-placeholders.sh    (new, executable)
.github/workflows/link-check.yml   (new, informational)
.pre-commit-config.yaml            (+3 hooks)
CONTRIBUTING.md                    (documents the hooks)
AGENTS.md, projects/_template/scope-labels.md,
tools/github/issue-template.md, tools/ponymail/operations.md
                                   (anchor fixes)

Local run

prek run --all-files

Or one at a time:

npx markdownlint-cli2 "**/*.md"
typos
tools/dev/check-placeholders.sh
lychee --config .lychee.toml .

Test plan

  • npx markdownlint-cli2 "**/*.md" clean.
  • typos clean.
  • tools/dev/check-placeholders.sh clean.
  • lychee --offline . reports the 24 pre-existing breakages, no new ones from this PR.
  • prek run --all-files green.

Out of scope

@potiuk, thinking about picking up a couple more fronts after this one if it lands well. Nothing huge, just smoothing edges:

  • MD040 (fenced code language tags). 62 untagged fences in the tree, mechanical fix but it would balloon the diff, better as its own PR.
  • MD038 (no-space-in-code). Most hits are intentional literal markdown samples like ` # `, ` ### `, ` - `. Re-enabling means escaping those consistently, again its own PR.
  • Skill frontmatter schema validator. Useful eventually, but only 8 skills today and the shape rarely changes.
  • Lychee as a blocking gate. Needs the 24 pre-existing breakages cleaned first.

Happy to do any or none of those, whatever fits the direction you have in mind.


Was generative AI tooling used to co-author this PR?
  • Yes

…, lychee)

Adds a minimal doc-validation layer to pre-commit + a lychee link-check
workflow. Catches the bug classes review currently has to find by eye:

- markdownlint-cli2 with a tight config (MD051 broken anchors, MD053
  dangling link refs); style rules off so the diff stays small
- typos with project-term allowlist in .typos.toml (CNA, Vulnogram,
  ponymail, mis-, Nd, pre-empted)
- tools/dev/check-placeholders.sh refuses hardcoded apache/airflow /
  Apache Airflow inside .claude/skills/ and tools/*.md (PR #1 already
  had to scrub these once)
- lychee runs in a separate workflow on PR + daily cron; informational
  only today (continue-on-error: true) because the existing tree has
  24 pre-existing broken refs to files that have not landed yet
  (config/, projects/airflow/, the issue-template YAML); flips to a
  hard gate once the baseline reaches zero

Wiring this up surfaced five real broken anchors I fixed along the
way:

- AGENTS.md missing "the project's" prefix in
  #point-reporters-to-the-security-model-dont-re-explain-it
- tools/ponymail/operations.md anchors #get-email and #get-thread
  pointed at headings that are actually "Get an email" / "Get a thread"
- projects/_template/scope-labels.md and tools/github/issue-template.md
  carried headings with a literal "→" that GitHub URL-encodes into
  unresolvable slugs; renamed to "to" and re-ran doctoc

Signed-off-by: André Ahlert <andre@aex.partners>
@potiuk potiuk force-pushed the feat/doc-validation-ci branch from d8540ff to 53fa19e Compare May 1, 2026 13:12
…on SHA

PR apache#18's first CI run failed three checks; this fixup commit addresses
all three:

- prek / markdownlint — markdownlint-cli2 v0.22.1 requires Node ≥ 20
  (its string-width dep uses the regex `/v` flag), but
  `default_language_version.node` was pinned to 18.6.0. Bumped to
  22.11.0 (current active LTS).

- asf-allowlist-check — `lycheeverse/lychee-action@82202e5e…`
  (v2.6.1) is not on the ASF infrastructure-actions allowlist.
  Re-pinned to the allowlisted v2.8.0 SHA
  `8646ba30535128ac92d33dfc9133794bfdd9b411`. Comment in the workflow
  now explains the allowlist requirement so future bumps go through
  the same check.

- zizmor (ref-version-mismatch) — the v2.6.1-comment + actual-SHA
  combination from the original pin was flagged because the SHA
  pointed to v2.4.1, not v2.6.1. The new v2.8.0 SHA correctly maps
  to its tag, so the warning disappears with the same change.

prek + zizmor verified clean locally.

Generated-by: Claude Code (Opus 4.7)
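
The ref-version-mismatch finding boils down to comparing the pinned SHA against the tag named in the trailing comment. A small sketch of that comparison follows, assuming the tag-to-SHA mapping is supplied by hand (zizmor resolves it itself; this is not its implementation). The function name and the `known_tags` parameter are hypothetical. The only real data is the v2.8.0 SHA quoted in the commit message.

```python
import re

# Matches e.g. "uses: owner/repo@<40-hex-sha> # v1.2.3"
USES = re.compile(
    r"uses:\s*(?P<action>[\w./-]+)@(?P<sha>[0-9a-f]{40})\s*#\s*(?P<tag>v[\w.]+)"
)


def ref_mismatch(line: str, known_tags: dict[str, str]) -> bool:
    """True when the version comment does not correspond to the pinned SHA.

    A tag missing from known_tags also counts as a mismatch in this sketch.
    """
    m = USES.search(line)
    if m is None:
        return False  # not a SHA-pinned line with a version comment
    return known_tags.get(m["tag"]) != m["sha"]


# SHA for v2.8.0 as stated in the commit message above.
KNOWN = {"v2.8.0": "8646ba30535128ac92d33dfc9133794bfdd9b411"}

good = "uses: lycheeverse/lychee-action@8646ba30535128ac92d33dfc9133794bfdd9b411 # v2.8.0"
stale = "uses: lycheeverse/lychee-action@8646ba30535128ac92d33dfc9133794bfdd9b411 # v2.6.1"
print(ref_mismatch(good, KNOWN), ref_mismatch(stale, KNOWN))
```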
@potiuk

potiuk commented May 1, 2026

Member

Fixed all failures. Merging. Yep. Improvements welcome @andreahlert :)

@potiuk potiuk merged commit 603258b into apache:main May 1, 2026
8 checks passed
@andreahlert andreahlert deleted the feat/doc-validation-ci branch May 1, 2026 13:47
potiuk added a commit that referenced this pull request May 4, 2026
…sedes #30) (#44)

* docs: enable markdownlint MD040 and tag all fenced code blocks

Follow-up to #18. Flips MD040 from `false` to `true` and tags the 64
previously untagged fences across the tree.

Most fences ended up `text` (MCP call sketches, URL examples, dir
trees, plain output, commit trailers). 3 got `html` for HTML-comment
idempotency markers and the <details> envelope, 3 `markdown` for the
AI-disclosure block and rollup body samples, 1 `yaml` for the
subagent return block in sync-security-issue.

One nested-fence case in allocate-cve/SKILL.md needed the outer
fence promoted from 3 to 4 backticks so the inner 3-backtick block
renders as an actual nested code block instead of breaking the outer
one.

Signed-off-by: André Ahlert <andre@aex.partners>
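
The nested-fence fix follows from the CommonMark rule that a fence is closed only by a fence of the same character that is at least as long as the opener. So the outer fence has to be longer than any fence inside it. A minimal sketch of picking that length (the helper name `outer_fence` is hypothetical, and only backtick fences are considered):

```python
import re


def outer_fence(inner: str) -> str:
    """Return a backtick fence long enough to wrap `inner` without being closed by it."""
    # Only runs of 3+ backticks at the start of a line (up to 3 spaces of indent)
    # can act as fences.
    runs = re.findall(r"^ {0,3}(`{3,})", inner, flags=re.MULTILINE)
    longest = max((len(run) for run in runs), default=0)
    return "`" * max(3, longest + 1)


tick3 = "`" * 3
inner = f"{tick3}text\nhello\n{tick3}"
print(len(outer_fence(inner)))  # 4: the inner block uses 3 backticks
```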

* docs: extend MD040 tagging to skills added since #30

#30 was opened against an earlier tree state; the pr-management
skill family (lifted in #33, renamed to type-what-action in
#35) added 9 new skill supporting files with 20 untagged
fences that fail markdownlint MD040 once it's enabled.

This commit applies the same tagging convention #30 established
for the security family + tools to the new pr-management files:

  pr-management-triage/fetch-and-batch.md     1 fence
  pr-management-triage/comment-templates.md   2 fences  → markdown
  pr-management-triage/interaction-loop.md    4 fences  → text (UI mockups)
  pr-management-triage/workflow-approval.md   2 fences  → text (UI mockups)
  pr-management-stats/fetch.md                3 fences  → text (search queries)
  pr-management-stats/render.md               3 fences  → text (output samples)
  pr-management-stats/classify.md             3 fences  → text (pseudocode)
  pr-management-code-review/review-flow.md    1 fence   → text (CLI mockup)
  pr-management-code-review/prerequisites.md  1 fence   → text (HTTP error)

Most fences ended up `text` (the same catch-all #30's commit
message used for "MCP call sketches, URL examples, dir trees,
plain output"). Two `markdown` fences in
`pr-management-triage/comment-templates.md` because the
content is a markdown link / list item example that GitHub
should render as markdown.

prek run --all-files clean. MD040 reports zero violations
across the tree.

Generated-by: Claude Code (Claude Opus 4.7)
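
For a rough idea of what MD040 counts, the sketch below lists opening fences that carry no language tag. It is a simplified approximation, not markdownlint's implementation: it tracks open and close state so closing fences are not miscounted, and it ignores indented code blocks and fences inside block quotes or lists. The function name is hypothetical.

```python
import re

FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def untagged_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences with no language tag."""
    hits: list[int] = []
    open_fence = None  # marker string of the currently open fence, if any
    for n, line in enumerate(markdown.splitlines(), start=1):
        m = FENCE.match(line)
        if m is None:
            continue
        marker, info = m.group(1), m.group(2).strip()
        if open_fence is None:
            open_fence = marker
            if not info:
                hits.append(n)
        elif (
            marker[0] == open_fence[0]
            and len(marker) >= len(open_fence)
            and not info
        ):
            open_fence = None  # closing fence: same character, at least as long
    return hits


f3 = "`" * 3
doc = "\n".join(["intro", f3, "plain", f3, f3 + "text", "tagged", f3])
print(untagged_fences(doc))  # [2]
```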

---------

Signed-off-by: André Ahlert <andre@aex.partners>
Co-authored-by: André Ahlert <andre@aex.partners>
@andreahlert andreahlert added the mode:platform Substrate / infra — not a mode (sandbox, CI, validators) label May 7, 2026
potiuk pushed a commit that referenced this pull request Jul 3, 2026
* feat(validator): add branch-name confidentiality check (#18, SOFT advisory)

Adds check #17 to skill-and-tool-validator: scans git checkout -b and
git switch -c examples inside fenced code blocks (across skills/ and docs/)
and flags any concrete branch name that contains an embargo-breaking term
— CVE IDs (CVE-YYYY-NNNNN), security, vulnerability/vuln, or advisory.

Pre-disclosure public branch names must not reveal embargo context;
neutral descriptive slugs are the safe alternative.  Lines explicitly
marked as bad examples (**bad**, bad:) are exempt, and placeholder
branch names (<fix-slug>, $VAR) are silently skipped.

The check is SOFT-advisory only (never blocks the run).  14 unit tests
cover CVE IDs, security framing, vuln/advisory terms, placeholder
exemptions, neutral names, and bad-example exemptions.  The full
codebase currently produces zero new violations.

Generated-by: Claude (Opus 4.7)
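
The rules in this commit message can be sketched roughly as follows. This is an illustration of the described behaviour, not the validator's code. The regexes and the function name are assumptions, and the input is taken to be lines already extracted from fenced code blocks. The CVE ID in the example is a dummy value.

```python
import re

BRANCH_CMD = re.compile(r"git\s+(?:checkout\s+-b|switch\s+-c)\s+(\S+)")
# "vuln" also covers "vulnerability".
SENSITIVE = re.compile(r"CVE-\d{4}-\d{4,}|security|vuln|advisory", re.IGNORECASE)
BAD_EXAMPLE = re.compile(r"\*\*bad\*\*|\bbad:", re.IGNORECASE)


def flag_branch_names(lines: list[str]) -> list[str]:
    """Return branch names that leak embargo context (advisory only, never fatal)."""
    flagged = []
    for line in lines:
        if BAD_EXAMPLE.search(line):
            continue  # explicitly marked as a bad example: exempt
        m = BRANCH_CMD.search(line)
        if m is None:
            continue
        name = m.group(1)
        if name.startswith(("<", "$")):
            continue  # placeholder such as <fix-slug> or $VAR: skipped
        if SENSITIVE.search(name):
            flagged.append(name)
    return flagged


sample = [
    "git checkout -b fix-CVE-0000-00000",
    "git switch -c tidy-parser-errors",
    "git checkout -b <fix-slug>",
    "git switch -c $VAR",
    "bad: git checkout -b security-hotfix",
]
print(flag_branch_names(sample))
```

Only the first line is reported. The neutral slug, both placeholders, and the line marked as a bad example pass.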

* fix for tool directories

* change regular expression
potiuk pushed a commit to justinmclean/airflow-steward that referenced this pull request Jul 3, 2026
…OFT advisory)

Reads the Axis 1 (skill) and Axis 2 (tool) capability vocabulary tables
from docs/labels-and-capabilities.md and verifies every taxonomy entry
appears in at least one mapping-table row; entries marked *(reserved)*
or *(future)* are exempt. Cross-checks the SKILL_CAPABILITIES /
TOOL_CAPABILITIES code constants against the parsed vocabulary.

The reserved/future marker accepts an elaborated parenthetical (e.g.
*(future work)*, *(reserved for #999)*), not only the exact forms.

Co-authored-by: Justin McLean <justin@classsoftware.com>
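
A minimal sketch of the exemption rule described here, assuming each vocabulary entry is a string that may carry the marker inline. The regex, function names, and entry names are hypothetical, and the real check parses markdown tables rather than plain lists.

```python
import re

# Accepts *(reserved)*, *(future)* and elaborated forms such as
# *(future work)* or *(reserved for #999)*.
EXEMPT = re.compile(r"\*\((?:reserved|future)\b[^)]*\)\*", re.IGNORECASE)


def is_exempt(entry: str) -> bool:
    """True when the entry carries a reserved/future marker."""
    return EXEMPT.search(entry) is not None


def unmapped(vocabulary: list[str], mapped: set[str]) -> list[str]:
    """Return vocabulary entries that are neither exempt nor present in any mapping row."""
    return [e for e in vocabulary if not is_exempt(e) and e not in mapped]


vocab = ["cap-a", "cap-b *(future work)*", "cap-c *(reserved for #999)*", "cap-d"]
print(unmapped(vocab, {"cap-a"}))  # ['cap-d']
```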
potiuk pushed a commit to justinmclean/airflow-steward that referenced this pull request Jul 3, 2026
…SOFT advisory)

contract:mail-source and contract:mail-archive adapter READMEs must
declare that fetched mail content is external data (not instructions)
and mention the prompt-injection risk in embedded mail content. Both
are SOFT advisories.

Co-authored-by: Justin McLean <justin@classsoftware.com>
