From 146f6e1d77f4ae7450101d1a16ad6e24ba03fd55 Mon Sep 17 00:00:00 2001 From: kjgbot Date: Sun, 10 May 2026 10:08:08 +0200 Subject: [PATCH] docs(agents): add source-text-analysis rule (use parsers, not regex) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two recent fixes converged on the same architectural answer for text inspection on Ricky source-code hot paths: - #86 (`auto-fix-loop.ts:hasRickyWorkflowAliasImport`) — substring match for `from 'node:fs'` was fooled by the literal text inside a HEREDOC embedded in a `.step({ command: ... })` body. Fixed by walking `ts.createSourceFile` ImportDeclaration nodes. - #88 (`spec-intake/markdown-target-files.ts:extractTargetFilesFromMarkdown`) — regex `PATH_PATTERN` matched paths inside fenced code blocks and prose noise. Fixed by walking `mdast-util-from-markdown` inlineCode nodes. Both fixes also independently rejected LLM-based detection for the same reasons: non-determinism breaks the offline eval suite, latency multiplies on every retry, and model output is a prompt-injection surface for paths that drive shell commands. Capture this as a durable rule in AGENTS.md so the next person tempted to grep-and-substitute on TS source has the prior art. Adds a new top-level "Ricky Source Code Conventions" section above the existing workflow-authoring conventions, since it applies to Ricky's own source rather than to the workflows Ricky generates. CLAUDE.md auto-follows via the existing symlink — no separate Claude rule needed (consistent with AGENTS.md:206). Co-Authored-By: Claude Opus 4.7 (1M context) --- AGENTS.md | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/AGENTS.md b/AGENTS.md index c7ae292d..5487ccc6 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -170,6 +170,42 @@ Your trajectory helps others understand: Future agents can query past trajectories to learn from your decisions. +# Ricky Source Code Conventions + +Rules that apply to Ricky's own source under `src/`, distinct from the workflow-authoring rules below. + +## Source-Text Analysis: Use Grammar-Aware Parsers, Not Regex + +When Ricky source code needs to inspect text whose grammar Ricky already understands — TypeScript or JavaScript artifacts, Markdown specs, JSON manifests, shell command bodies — use a parser for that grammar. Do not use substring matches, line-anchored regex, or comment-stripping heuristics. Do not call an LLM. + +**Why parsers, not regex.** The bugs we keep shipping in this area share one shape: a substring or regex that looked right against the file's "normal" code matches the same characters inside a string literal, a comment, a fenced code block, a HEREDOC, or a multi-line declaration. The injected fix then either no-ops (because we falsely detected the pattern as already present) or duplicates a declaration (because we falsely detected it as missing). Two recent examples in the same week converged on this conclusion: + +- `src/local/auto-fix-loop.ts:hasRickyWorkflowAliasImport` — `injectWorkflowEnvLoader`'s substring check for `from 'node:fs'` matched the literal text inside a `node --input-type=module` HEREDOC embedded in a `.step({ command: ... })` body, then skipped adding the real `import * as rickyWorkflowFs` alias. The fix walks `ts.createSourceFile` `ImportDeclaration` nodes; string-literal contents are structurally inert. +- `src/product/spec-intake/markdown-target-files.ts:extractTargetFilesFromMarkdown` — the previous `PATH_PATTERN` regex over markdown spec text matched paths inside fenced code blocks and prose noise like `base/head`. The fix walks `mdast-util-from-markdown` `inlineCode` nodes; fenced blocks become `code` nodes and are excluded by construction. + +**Why not an LLM.** For hot-path pure functions: + +- Non-determinism breaks the offline eval suite and makes regressions hard to bisect. +- Latency multiplies on every retry and every `--no-run` / `--run` invocation. +- Spawning a model adds an external-dependency surface to a deterministic transformation. +- For paths that later drive shell commands, model output is a prompt-injection surface. + +LLMs are the right tool for judgment ("is this code refactor good?"). They are the wrong tool for "does this file have an import statement matching this pattern." + +**Concrete tools available.** Both are already in `dependencies`: + +- TypeScript / JavaScript: `import ts from 'typescript'`; `ts.createSourceFile(name, content, ts.ScriptTarget.Latest, false, ts.ScriptKind.TS)` then walk `sourceFile.statements`. +- Markdown: `import { fromMarkdown } from 'mdast-util-from-markdown'`; walk the resulting mdast tree (`inlineCode`, `code`, `heading`, `list`, etc.). + +**When you must touch source text, prefer in this order:** + +1. AST walk for the file's grammar (TS, mdast, JSON.parse, etc.). +2. A small purpose-built tokenizer that strips inert regions (string literals, comments, code fences) before checking the cleaned content. Use this only when the parser dependency is genuinely too heavy for the call site. +3. Regex on a region you have already proven inert (e.g., a single import-statement node's `getText()` output). +4. Substring matching on the raw file. **Default to no.** If you reach for this, the bug class above is already loaded and aimed at you. + +The detection rule is the strict one. Insertion code that anchors against a known marker (e.g., "append after the `import { workflow }` line") may stay regex-based when the marker itself is unambiguous, but if you find yourself stripping comments or guessing string-literal boundaries, switch to AST. + # Ricky Workflow Conventions Every agent working in this repo must follow these rules when authoring, reviewing, or modifying Ricky workflows.