github · pelikhan · Jun 21, 2026 · Jun 21, 2026 · Jun 21, 2026 · Jun 21, 2026
diff --git a/docs/adr/40662-targeted-text-scans-before-yaml-unmarshal.md b/docs/adr/40662-targeted-text-scans-before-yaml-unmarshal.md
@@ -0,0 +1,35 @@
+# ADR-40662: Use Targeted Text Scans to Gate yaml.Unmarshal in Template-Injection Validation
+
+**Date**: 2026-06-21
+**Status**: Accepted
+
+## Context
+
+`BenchmarkCompileComplexWorkflow` regressed ~320% (from ~3ms/op to ~12.7ms/op). CPU profiling traced the cost to `validateTemplateInjection` Path B — the code path taken whenever `skipValidation=true`, which is the default in `NewCompiler()`. That path used `hasAnyExpressionInRunContent`, a pre-check that matched **any** `${{ }}` expression in a `run:` block, including compiler-owned references such as `${{ runner.temp }}`. Because compiled workflows almost always contain such safe expressions, the pre-check effectively always returned true and unconditionally triggered a full `yaml.Unmarshal` of the ~1200-line compiled YAML on every compilation, even when no injection or regression violation was possible.
+
+## Decision
+
+We will replace the broad any-expression pre-check in `validateTemplateInjection` Path B with two targeted text scans that map one-to-one onto the two downstream sub-validators, and only call `yaml.Unmarshal` when at least one scan signals a potential issue. `needsUnsafeCheck` uses `hasExpressionInRunContent(UnsafeContextPattern)` to detect user-controlled contexts (template-injection risk), and `needsDisallowedCheck` uses the new `hasNonAllowedExpressionInRunContent` to detect expressions outside the compiler-owned allow-list (`allowedRunScriptExpressionRegex`). Both scans share the same raw-YAML `run:` walker, including flow-style and quoted-key support, so they stay aligned with each other and with the parsed-YAML validators. For well-formed workflows both scans return false and the YAML re-parse is skipped entirely.
+
+## Alternatives Considered
+
+### Alternative 1: Reuse the schema-validation parsed tree in Path B
+We could enable schema validation (or share its parsed `map[string]any`) so Path B never re-parses. This was rejected because `skipValidation=true` is the intentional default fast path; forcing a parse there would reintroduce the very `yaml.Unmarshal` cost this change eliminates and change compiler semantics for callers that deliberately skip schema checks.
+
+### Alternative 2: Narrow `hasAnyExpressionInRunContent` to ignore allowed expressions only
+We could keep a single combined pre-check that skips allow-listed expressions but still gate both validators together. This was rejected because the two validators have genuinely different trigger conditions (unsafe-context vs. non-allowed-expression); collapsing them would run one validator unnecessarily whenever the other's condition is met, and splitting the scans keeps each sub-validator's cost proportional to its own risk signal.
+
+## Consequences
+
+### Positive
+- `BenchmarkCompileComplexWorkflow` drops from ~6.8ms/op to ~1.64ms/op and allocations from 98,282 to 10,894 per op.
+- `yaml.Unmarshal` is skipped entirely in the common case (well-formed compiled workflows with only compiler-owned expressions).
+- The two scans now correspond directly to the two sub-validators, making the trigger logic easier to reason about.
+
+### Negative
+- The scan logic is more complex: the shared raw-YAML `run:` walker must stay consistent with the YAML parser's view of the same structure, including block scalars, flow-style mappings, and quoted keys.
+- Two regexes (`UnsafeContextPattern`, `allowedRunScriptExpressionRegex`) plus block detection now encode security-relevant gating; a bug that wrongly returns false would silently skip a security validator.
+
+### Neutral
+- Behavior is unchanged when a violation is actually present — the YAML is still parsed and the same validators run.
+- `allowedRunScriptExpressionRegex` becomes part of the performance-critical fast path; extending the compiler's allow-list now also affects which expressions can skip the parse.
diff --git a/pkg/workflow/compiler.go b/pkg/workflow/compiler.go
@@ -145,11 +145,9 @@ func (c *Compiler) generateAndValidateYAML(workflowData *WorkflowData, markdownP
 	// result between the two validators to avoid redundant yaml.Unmarshal calls.
 	//
 	// Performance note: when schema validation is enabled (needsSchemaCheck=true) the
-	// YAML is parsed regardless.  hasUnsafeExpressionInRunContent performs an expensive
-	// text scan (regex + strings.Split + full line walk) that would be redundant in that
-	// path; we skip it and reuse the pre-parsed result for template injection instead.
-	// The text scan is only used when schema validation is disabled (skipValidation=true),
-	// where it avoids an otherwise unnecessary yaml.Unmarshal call.
+	// YAML is parsed regardless.  The text scan in validateTemplateInjection is only
+	// used when schema validation is disabled (skipValidation=true), where targeted
+	// fast-path checks avoid an unnecessary yaml.Unmarshal.
 	needsSchemaCheck := !c.skipValidation
 
 	var parsedWorkflow map[string]any
@@ -169,8 +167,8 @@ func (c *Compiler) generateAndValidateYAML(workflowData *WorkflowData, markdownP
 	// parsedWorkflow != nil means the YAML was already parsed for schema validation;
 	// validateTemplateInjection reuses the pre-parsed tree (inspects only run: block values)
 	// rather than re-scanning the full YAML string.  When parsedWorkflow is nil (schema
-	// validation disabled), the lightweight hasUnsafeExpressionInRunContent text scan is
-	// used first to avoid an unnecessary yaml.Unmarshal.
+	// validation disabled), the validator uses targeted text scans to avoid an unnecessary
+	// yaml.Unmarshal when all run-block expressions are in the compiler-owned allowed list.
 	if err := c.validateTemplateInjection(yamlContent, lockFile, markdownPath, parsedWorkflow); err != nil {
 		return "", nil, nil, err
 	}
@@ -304,11 +302,20 @@ func (c *Compiler) writeWorkflowOutput(lockFile, yamlContent string, markdownPat
 // this function reuses it directly by walking the run: block values in the pre-parsed
 // tree, which is faster than re-scanning the full YAML string with a regex.
 //
-// When parsedWorkflow is nil (schema validation disabled via skipValidation), the
-// function first uses the lightweight hasAnyExpressionInRunContent text scan
-// to avoid an unnecessary yaml.Unmarshal call. When the scan detects expressions
-// in run blocks, the YAML is parsed with github.com/goccy/go-yaml for consistency
-// with the parsed-workflow validators.
+// When parsedWorkflow is nil (schema validation disabled via skipValidation), this
+// function uses two targeted text scans instead of a broad any-expression check:
+//
+//  1. hasExpressionInRunContent(UnsafeContextPattern) — detects user-controlled
+//     contexts (github.event.*, steps.*.outputs.*, inputs.*) that represent genuine
+//     template-injection risks.
+//
+//  2. hasNonAllowedExpressionInRunContent — detects compiler-regression cases where
+//     a ${{ ... }} expression that is NOT in the compiler-owned allow-list slipped
+//     into a run: block.
+//
+// For well-formed compiled workflows (the common case) both scans return false, so
+// no yaml.Unmarshal is needed at all.  A yaml.Unmarshal is only performed when at
+// least one scan detects a potential issue.
 func (c *Compiler) validateTemplateInjection(yamlContent, lockFile, markdownPath string, parsedWorkflow map[string]any) error {
 	var templateErr error
 
@@ -323,10 +330,20 @@ func (c *Compiler) validateTemplateInjection(yamlContent, lockFile, markdownPath
 		}
 	} else {
 		// Path B: schema validation is disabled (parsedWorkflow is nil).
-		// Use the text scan to cheaply determine whether any expressions (safe or
-		// unsafe) appear inside a run: block before paying the cost of a full
-		// yaml.Unmarshal for layered security + regression validation.
-		if hasAnyExpressionInRunContent(yamlContent) {
+		//
+		// Use two targeted text scans with a shared run-block walker so we can skip the
+		// expensive yaml.Unmarshal when all run-block expressions are compiler-owned safe
+		// references (e.g. ${{ runner.temp }}, ${{ env.FOO }}).
+		//
+		// needsUnsafeCheck:    true when user-controlled contexts appear in run: blocks
+		//                      → triggers validateNoTemplateInjectionFromParsed
+		// needsDisallowedCheck: true when any expression outside the compiler allow-list
+		//                      appears in run: blocks
+		//                      → triggers validateNoGitHubExpressionsInRunScriptsFromParsed
+		needsUnsafeCheck := hasExpressionInRunContent(yamlContent, UnsafeContextPattern)
+		needsDisallowedCheck := hasNonAllowedExpressionInRunContent(yamlContent)
+
+		if needsUnsafeCheck || needsDisallowedCheck {
 			workflowLog.Print("Validating for template injection vulnerabilities")
 			var reparsed map[string]any
 			if err := yaml.Unmarshal([]byte(yamlContent), &reparsed); err != nil {
@@ -335,8 +352,12 @@ func (c *Compiler) validateTemplateInjection(yamlContent, lockFile, markdownPath
 				reparsed = nil
 			}
 			if reparsed != nil {
-				templateErr = validateNoTemplateInjectionFromParsed(reparsed)
-				if templateErr == nil {
+				// validateNoTemplateInjectionFromParsed only catches user-controlled contexts,
+				// so it is intentionally skipped when the UnsafeContextPattern scan found none.
+				if needsUnsafeCheck {
+					templateErr = validateNoTemplateInjectionFromParsed(reparsed)
+				}
+				if templateErr == nil && needsDisallowedCheck {
 					templateErr = validateNoGitHubExpressionsInRunScriptsFromParsed(reparsed)
 				}
 			}

diff --git a/pkg/workflow/compiler_template_injection_both_paths_test.go b/pkg/workflow/compiler_template_injection_both_paths_test.go
@@ -70,6 +70,17 @@ jobs:
         run: echo "${{ job.services['redis'].ports['6379'] }}"
 `
 
+	flowStyleRegressionYAML := `
+name: flow-style-regression
+on: workflow_dispatch
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - { run: "echo ${{ github.actor }}" }
+      - run: node ${{ runner.temp }}/actions/foo.cjs
+`
+
 	heredocExpressionYAML := `
 name: heredoc-expression
 on: workflow_dispatch
@@ -159,6 +170,13 @@ jobs:
 		assert.NoError(t, err, "compiler-owned job.services expression should be allowed")
 	})
 
+	t.Run("Path B - schema disabled - flow-style regression detected alongside allowed expression", func(t *testing.T) {
+		err := compiler.validateTemplateInjection(flowStyleRegressionYAML, lockFile, markdownPath, nil)
+		require.Error(t, err, "flow-style run expressions should still be detected in the fallback path")
+		assert.Contains(t, err.Error(), "compiler regression detected")
+		assert.Contains(t, err.Error(), "github.actor")
+	})
+
 	t.Run("Path A - schema enabled - heredoc expression passes", func(t *testing.T) {
 		err := compiler.validateTemplateInjection(heredocExpressionYAML, lockFile, markdownPath, parseYAML(t, heredocExpressionYAML))
 		assert.NoError(t, err, "expressions inside heredoc content should not be flagged")

diff --git a/pkg/workflow/template_injection_validation.go b/pkg/workflow/template_injection_validation.go
@@ -59,80 +59,103 @@ var (
 	// allowedRunScriptExpressionRegex matches trusted compiler-owned expressions that are
 	// intentionally rendered in generated run scripts and are not user-controlled.
 	allowedRunScriptExpressionRegex = regexp.MustCompile(`^\$\{\{\s*(env\.[^}]+|vars\.[^}]+|runner\.[^}]+|github\.(repository|run_id|workspace)|steps\.parse-guard-vars\.outputs\.(approval_labels|blocked_users|trusted_users)|job\.services\[[^]]+\]\.ports\[[^]]+\])\s*\}\}$`)
+	runKeyPattern                   = regexp.MustCompile(`(?:^|[\s{,])(?:run|["']run["']):`)
 )
 
-// hasAnyExpressionInRunContent performs a fast line-by-line text scan to determine
-// whether any GitHub Actions expression (${{ ... }}) appears inside a YAML run: block.
-// Used by the compiler regression guardrail to detect expressions that should have
-// been rewritten to env variables.
-func hasAnyExpressionInRunContent(yamlContent string) bool {
-	return hasExpressionInRunContent(yamlContent, InlineExpressionPattern)
-}
-
-func hasExpressionInRunContent(yamlContent string, expressionRegex *regexp.Regexp) bool {
-	// Fast-path: no matching expressions anywhere → definitely no violation.
-	if !expressionRegex.MatchString(yamlContent) {
-		return false
+func findRunValue(keyPart string) (string, bool) {
+	loc := runKeyPattern.FindStringIndex(keyPart)
+	if loc == nil {
+		return "", false
 	}
+	return strings.TrimSpace(keyPart[loc[1]:]), true
+}
 
-	// Matching expressions exist somewhere; scan for any that appear inside a run: block
-	// without doing a full YAML parse.
-	// Use SplitSeq to iterate over lines lazily, avoiding the up-front allocation of the
-	// full []string slice that strings.Split would create for large YAML content.
+// walkRunBlockLines scans raw YAML text and visits each inline run value or line inside a
+// multiline run block. It recognizes plain run: keys as well as quoted and flow-style forms
+// so Path B stays aligned with the parsed-YAML validators.
+func walkRunBlockLines(yamlContent string, visit func(line string) bool) bool {
 	inRunBlock := false
 	runBlockIndent := 0
 
 	for line := range strings.SplitSeq(yamlContent, "\n") {
-		// Compute indentation first; skip blank and all-whitespace lines in one step.
 		trimmed := strings.TrimLeft(line, " \t")
 		if trimmed == "" {
-			// Blank / all-whitespace lines are allowed inside block scalars.
 			continue
 		}
 		indent := len(line) - len(trimmed)
 
 		if inRunBlock {
-			// A non-blank line at the same or lesser indentation ends the block.
 			if indent <= runBlockIndent {
 				inRunBlock = false
 				// Fall through: check whether this line starts a new run: block.
 			} else {
-				// Inside run block content — check for matching expressions.
-				if expressionRegex.MatchString(line) {
+				if visit(line) {
 					return true
 				}
 				continue
 			}
 		}
 
-		// Outside a run block: look for a run: key.
-		// Handle both "run: ..." (map key) and "- run: ..." (inline sequence item).
 		keyPart := trimmed
 		if strings.HasPrefix(keyPart, "-") {
 			keyPart = strings.TrimSpace(keyPart[1:])
 		}
-		if !strings.HasPrefix(keyPart, "run:") {
+
+		rest, ok := findRunValue(keyPart)
+		if !ok {
 			continue
 		}
-		rest := strings.TrimSpace(keyPart[4:]) // text after "run:"
 
-		if rest == "" {
-			// Empty run: value is unusual; treat conservatively as if block content follows.
-			inRunBlock = true
-			runBlockIndent = indent
-		} else if rest[0] == '|' || rest[0] == '>' {
-			// Literal or folded block scalar — content is on subsequent lines.
+		if rest == "" || rest[0] == '|' || rest[0] == '>' {
 			inRunBlock = true
 			runBlockIndent = indent
-		} else {
-			// Inline run value, e.g. run: echo "hello ${{ github.event.foo }}".
-			if expressionRegex.MatchString(rest) {
+			continue
+		}
+
+		if visit(rest) {
+			return true
+		}
+	}
+
+	return false
+}
+
+// hasNonAllowedExpressionInRunContent performs a fast line-by-line text scan to
+// determine whether any GitHub Actions expression (${{ ... }}) that is NOT in the
+// compiler-owned allow-list appears inside a YAML run: block.
+//
+// Unlike hasAnyExpressionInRunContent, this function skips expressions that match
+// allowedRunScriptExpressionRegex (e.g. ${{ runner.temp }}, ${{ env.FOO }}). It is
+// used as a fast pre-check for the regression guardrail inside validateTemplateInjection
+// (Path B) to avoid a yaml.Unmarshal when every run-block expression is compiler-owned.
+func hasNonAllowedExpressionInRunContent(yamlContent string) bool {
+	// Fast-path: no inline expressions anywhere → nothing to check.
+	if !InlineExpressionPattern.MatchString(yamlContent) {
+		return false
+	}
+
+	return walkRunBlockLines(yamlContent, func(line string) bool {
+		if !InlineExpressionPattern.MatchString(line) {
+			return false
+		}
+		for _, expr := range InlineExpressionPattern.FindAllString(line, -1) {
+			if !allowedRunScriptExpressionRegex.MatchString(expr) {
 				return true
 			}
 		}
+		return false
+	})
+}
+
+func hasExpressionInRunContent(yamlContent string, expressionRegex *regexp.Regexp) bool {
+	// Fast-path: no matching expressions anywhere → definitely no violation.
+	if !expressionRegex.MatchString(yamlContent) {
+		return false
 	}
 
-	return false
+	// Matching expressions exist somewhere; scan for any that appear inside a run: block
+	// without doing a full YAML parse.
+	return walkRunBlockLines(yamlContent, expressionRegex.MatchString)
 }
 
 // validateNoTemplateInjectionFromParsed checks a pre-parsed workflow map for template