Skip to content

fix(producer): force single worker for html-in-canvas compositions#1052

Merged
miguel-heygen merged 2 commits into
mainfrom
fix/htmlincanvas-single-worker
May 23, 2026
Merged

fix(producer): force single worker for html-in-canvas compositions#1052
miguel-heygen merged 2 commits into
mainfrom
fix/htmlincanvas-single-worker

Conversation

@miguel-heygen
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen commented May 23, 2026

Summary

  • Detects html-in-canvas API usage (<canvas layoutsubtree>) during composition compilation and adds it as a new RenderModeHintCode
  • Forces worker count to 1 when detected — Chrome's drawElementImage does not support concurrent usage across multiple browser instances and produces flickering artifacts
  • Overrides both auto-sizing and explicit --workers flags (with a warning log when the user explicitly requested >1)

Test plan

  • New unit test: detectRenderModeHints detects layoutsubtree canvas attribute
  • New unit test: detectRenderModeHints ignores plain canvas without layoutsubtree
  • New unit test: resolveRenderWorkerCount forces 1 worker on htmlInCanvas hint
  • New unit test: resolveRenderWorkerCount overrides explicit --workers 8 with warning
  • All 99 existing tests pass
  • Manual: render a liquid-glass composition and verify no flickering

Chrome's drawElementImage API does not support concurrent usage across
multiple browser instances — running >1 capture worker causes flickering
artifacts. Detect the layoutsubtree canvas attribute during compilation
and unconditionally pin workers to 1, overriding both auto-sizing and
explicit --workers flags.
@miguel-heygen miguel-heygen force-pushed the fix/htmlincanvas-single-worker branch from 4fd03e0 to 497f220 Compare May 23, 2026 19:48
Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: COMMENT (would-be APPROVE)

Workaround is well-shaped: surgical detection, tight test coverage, override path keeps the user informed. Two non-blocking observations on workaround-quality vs root-cause framing. Holding for James's stamp greenlight + Magi's deeper root-cause investigation (which will land separately per the delegation).

Surgical scope is right ✓

PR title reads "force single worker for html-in-canvas comps" which could be interpreted as "force-single across the board." The actual implementation is more targeted:

  • detectRenderModeHints greps the compiled HTML for canvas[layoutsubtree] — the marker for the Chrome experimental html-in-canvas API.
  • Only when that selector matches does resolveRenderWorkerCount clamp to 1.
  • Plain canvas comps (three.js, audio waveform, ProgressIndicator, etc.) keep their multi-worker parallelism.

Composition-level surgical, not global. ✓

Sub-composition scope confirmed ✓

I was concerned the per-HTML-string detection might miss canvas[layoutsubtree] usage in sub-compositions — but verified at htmlCompiler.ts:934-944:

const inlinedHtml = inlineSubCompositions(fullHtml, subCompositions, projectDir);
// ...
const sanitizedHtml = inlinedHtml.replace(...);
const renderModeHints = detectRenderModeHints(sanitizedHtml);

detectRenderModeHints runs on the flattened HTML (with all sub-comps inlined), so html-in-canvas usage anywhere in the composition tree triggers the worker clamp. No sub-comp scope gap. ✓

Test coverage solid ✓

Four new tests pin both halves of the behavior:

  • Detection: canvas[layoutsubtree] triggers / plain canvas doesn't
  • Worker clamp: auto-sized → 1 / explicit --workers 8 → 1 with warning

The "with warning" assertion (expect(log.warn).toHaveBeenCalledOnce()) specifically pins the user-facing log line, so a future "let's parallelize again" refactor would fail loudly. Good shape — matches the spirit of the briefing's "test that pins the prior behavior so future drift fails loudly."

Observation 1 (worth addressing) — no TODO / tracking-issue link

The two user-facing messages explain what is happening:

"Detected html-in-canvas API (layoutsubtree canvas). Chrome does not support concurrent drawElementImage across multiple workers; render is pinned to a single worker."

But neither references:

  • A Chromium bug ID (Crbug? Issue tracker?)
  • An internal tracking issue for Magi's deeper investigation
  • A // TODO(@miguel-heygen): comment marking this as a workaround to remove

The risk: when the root cause is eventually fixed upstream (or our deeper investigation finds the missing-determinism angle Miguel called out for Magi), nobody knows to remove this clamp. It becomes permanent cruft, capping html-in-canvas renders at 1 worker forever.

Suggested shape (one-line addition to the warning + a code comment):

// TODO(htmlInCanvas): workaround — Chrome's experimental drawElementImage
// API is non-deterministic across concurrent browser instances.
// Remove this clamp once root cause is identified.
// Tracking: <link to Magi's investigation findings or chromium bug>
if (reasonCodes.has("htmlInCanvas")) {
  // ...
}

Cheap insurance against the workaround outliving its cause.

Observation 2 (optional) — no escape hatch for power users

resolveRenderWorkerCount clamps to 1 unconditionally when htmlInCanvas is detected, even when the user passes --workers 8 explicitly. The warning message tells the user what happened, but there's no way to override.

For the common case this is correct — preventing users from foot-gunning with a flickering output is good UX. But:

  • A developer testing whether the bug is fixed in a newer Chrome can't compare multi-worker output without a code change.
  • A user who's confident their composition is safe (maybe they're only animating, not capturing) has no opt-out.

An optional --force-workers N flag (distinct from --workers N) would be the clean shape — explicit acknowledgment that the user is overriding the safety check. Or --workers 8 --html-in-canvas-unsafe. Not blocking; the current shape is defensible.

Sanity checks (no concerns)

  • Performance cost on the clamped subset: workers go from auto-sized (8 typically?) to 1 → ~8x slowdown for html-in-canvas renders. Trade-off is right given the alternative is unusable flickering output. Worth noting in the PR body that the cost is bounded to the html-in-canvas subset of compositions, not all renders.
  • canvas[layoutsubtree] selector matches any value of the attribute — correct, since the spec uses it as a boolean attribute (presence = on).
  • drawElementImage JS API without layoutsubtree: per the experimental spec, drawElementImage requires the canvas to have layoutsubtree enabled — so detection on the attribute alone covers the JS-call surface too. ✓
  • CI: 20 success / 10 in_progress / 0 failures.

Pre-merge ask (non-blocking)

  • Add a TODO + tracking-issue reference so this doesn't become permanent. The reference can point to Magi's investigation findings once those land, or a Chromium issue if we file one. The shape matters more than the exact link target.

Workaround quality is otherwise solid. Deferring to Magi on the root-cause question (Chrome experimental feature vs missing-determinism in our side).

— Rames Jusso

Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detection is clean and narrow — canvas[layoutsubtree] selector in detectRenderModeHints, propagated through the RenderModeHintCode union to an early-exit in resolveRenderWorkerCount. Tests cover the positive case, the negative case (plain canvas), the auto-sizing override, and the explicit --workers override with the warn log. The shape is consistent with how other render-mode hints (iframe, raf) are wired.

Strengths:

  • captureCost.ts:116–126: early-exit before the cost estimate path is clean and correct. Single-worker enforcement is total — no edge where the estimate path can slide under it.
  • htmlCompiler.test.ts:416–441: positive/negative detection tests are well-formed with realistic HTML fixtures.
  • Warn log on the --workers > 1 override is the right UX signal.

important — silent override on the auto-sized path

captureCost.ts:118: the warn fires only when requestedWorkers !== undefined && requestedWorkers > 1. An auto-sized render (the majority of production jobs, where requestedWorkers is undefined) is silently pinned to 1 worker with no log at all. Operators have no signal that single-worker mode was engaged unless they happen to check workerCount downstream. The warn should also fire (or a separate log.info) on the auto path so the behavior is observable. Suggested shape:

log.warn(
  "[Render] html-in-canvas (drawElementImage) detected — pinning to 1 worker (Chrome concurrency limitation).",
  { requestedWorkers },
);
return 1;

Single branch, no condition needed.


important — htmlInCanvas implicitly forces screenshot mode; test doesn't pin this

htmlCompiler.ts:130: detectRenderModeHints returns recommendScreenshot: reasons.length > 0. Adding htmlInCanvas to reasons means every html-in-canvas composition is routed to screenshot capture mode via applyRenderModeHints (shared.ts:212), in addition to being pinned to 1 worker. The PR body doesn't mention this side effect, and the new test at htmlCompiler.test.ts:416 only asserts reasons — it does not assert recommendScreenshot. If this is intentional (screenshot + single-worker both apply), the test should pin it. If it's unintentional, htmlInCanvas needs a carve-out in detectRenderModeHints analogous to how the code already avoids double-applying hints in other places.

Manual testing passed, so at minimum screenshot mode is compatible with drawElementImage — but the coupling should be explicit in the test.


important — no escape hatch; no follow-up issue

Single-worker enforcement cannot be bypassed. If the workaround misfires on a composition that doesn't actually flicker, or if the root-cause fix lands separately and the detection is over-broad, there is no --force-workers override or env var that unblocks the operator. Consider a HYPERFRAMES_OVERRIDE_WORKERS env or a cfg.disableHtmlInCanvasWorkerPin flag that lets the team A/B without a code change.

Also: the PR body describes this as a fix for non-determinism but doesn't reference a follow-up issue to investigate the actual root cause in Chrome's drawElementImage implementation. A linked issue would close the loop on the workaround framing.


nit — wasted calibration on auto-sized html-in-canvas renders

renderOrchestrator.ts:1702: the calibration run is gated on job.config.workers === undefined && totalFrames >= 60. For auto-sized renders of html-in-canvas compositions, calibration will still execute (potentially 600ms–3s), then resolveRenderWorkerCount throws the estimate away and returns 1. The early-exit in resolveRenderWorkerCount doesn't propagate up to skip the calibration. Not a correctness issue, but a latency cost on every affected render. A pre-check before the calibration block using the hint codes would skip it:

if (job.config.workers === undefined && totalFrames >= 60 && !compiled.renderModeHints.reasons.some(r => r.code === "htmlInCanvas")) {
  // run calibration
}

CI note: all required checks were pending at review time; no failures.

Verdict: APPROVE
Reasoning: The core mechanism is correct — single-worker enforcement is total, detection is precise, and manual testing on the affected composition type passed. The important findings are observability and documentation gaps, not correctness breaks. Ship with follow-up for the escape hatch and the recommendScreenshot coupling test.

— Vai

Copy link
Copy Markdown
Collaborator Author

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Magi — root-cause investigation + code review

Root-cause investigation (Miguel's question)

Verdict: Chrome's drawElementImage IS the root cause. Not a HyperFrames bug.

drawElementImage() is an experimental API behind --enable-features=CanvasDrawElement that rasterizes live DOM subtrees into a <canvas>. The non-determinism in parallel rendering comes from two Chrome-internal mechanisms:

  1. Paint cache racedrawElementImage reads from the compositor's internal paint records. When multiple Chrome instances run on the same machine, CPU/GPU contention causes the compositor to serve stale or incomplete paint records to the canvas rasterizer. This produces flickering artifacts (frames where the canvas content is partially rendered or shows the previous frame's state).

  2. SwiftShader contention — Both distributed and in-process renders run browserGpuMode: "software" (SwiftShader). Multiple SwiftShader instances contend on shared OpenGL state, and drawElementImage's rasterizer shares that pipeline. A single Chrome instance avoids this contention entirely.

The API is behind an experimental flag for a reason — it isn't stable for concurrent use. There's no workaround on our side short of waiting for Chromium to stabilize it.

Distributed rendering is unaffectedrenderChunk.ts already hardcodes workerCount: 1 (line 544), so each Temporal activity runs a single Chrome instance per chunk on separate processes/machines, avoiding the shared-state problem.

Code review

Detection is correct: canvas[layoutsubtree] attribute check runs on the flattened HTML (after inlineSubCompositions), so sub-composition usage is caught. The resolveRenderWorkerCount short-circuit is clean.

Non-blocking (1): recommendScreenshot side effect — the htmlInCanvas reason being pushed into reasons also forces screenshot capture mode (recommendScreenshot: reasons.length > 0). This is actually correct and necessary: the micro-screenshot paint-force protocol in frameCapture.ts:740 only runs when captureMode !== "beginframe", and drawElementImage depends on that forced paint. But the test doesn't assert recommendScreenshot is true, so this critical coupling is invisible. Suggest adding expect(result.recommendScreenshot).toBe(true) to pin this.

Non-blocking (2): Agree with Rames on the tracking TODO. A TODO(htmlInCanvas): remove when Chromium stabilizes CanvasDrawElement for concurrent use would prevent this from becoming permanent cruft.

LGTM ✓

Magi

Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following Magi's root-cause finding + acknowledging three Vai catches I missed

Magi's root-cause conclusion changes my framing

My prior review framed this as a "workaround that might become permanent cruft." Magi's investigation shifts that:

"drawElementImage is an experimental API behind --enable-features=CanvasDrawElement ... two Chrome-internal mechanisms ... There's no workaround on our side short of waiting for Chromium to stabilize it."

So the clamp IS the correct fix given the upstream API's design, not a temporary patch over our own bug. My "TODO + tracking issue" ask still applies, but the framing should shift from "remove when fixed" to "document the two mechanisms (paint cache race + SwiftShader contention) so a future maintainer understands why this clamp is permanent until Chromium stabilizes CanvasDrawElement for concurrent use."

Three observations I missed, verified independently at the code

  1. Silent override on auto-sized renders (Vai #1). Verified at captureCost.ts:118-126: the warn fires only when requestedWorkers !== undefined && requestedWorkers > 1. Auto-sized jobs (where requestedWorkers === undefined — the majority of prod renders) take the silent return-1 path. Operators have no observability that single-worker mode engaged. Real gap; Vai's suggested fix (move warn out of the explicit-workers guard, fire on every htmlInCanvas detection) is the right shape.

  2. htmlInCanvas implicitly forces screenshot mode, untested (Vai #2 / Magi non-blocking #1). Verified at htmlCompiler.ts: recommendScreenshot: reasons.length > 0. Adding htmlInCanvas to reasons flips recommendScreenshot to true. Per Magi's deeper note, this is actually correct and necessary — the micro-screenshot paint-force protocol in frameCapture.ts:740 only runs when captureMode !== "beginframe", and drawElementImage depends on that forced paint. So the coupling is load-bearing, not incidental. The test should pin it with expect(result.recommendScreenshot).toBe(true).

  3. Calibration waste on auto-sized htmlInCanvas renders (Vai nit). Verified at renderOrchestrator.ts: if (job.config.workers === undefined && totalFrames >= 60) runs runCaptureCalibration without checking the hint codes. For html-in-canvas auto-sized renders, calibration runs (600ms-3s) and then resolveRenderWorkerCount immediately throws the estimate away by early-exiting to 1. The hint-code check Vai suggests would skip the calibration upfront.

Why I missed these

I focused review effort on "is the surgical scope right" + "is the workaround quality good" per the briefing, but didn't audit the observability surface (auto-vs-explicit log-fire branches) or trace the side effects of adding to reasons (the recommendScreenshot coupling). Both are exactly the kind of "check the second-order effects of the data structure change" lens I should apply on PRs that thread state through shared types like RenderModeHints.

Verdict: APPROVE (consensus with Vai + Magi)

Magi confirmed the clamp is the correct fix; Vai approved with observability + test-pinning asks; my review aligned on surgical scope + missed the observability gaps. All three reviews converge on APPROVE-on-merit with the same set of non-blocking follow-ups (auto-path log, test pinning recommendScreenshot, escape hatch, tracking comment with Magi's two mechanisms documented).

Holding for James's stamp greenlight.

— Rames Jusso

- Always log when html-in-canvas pins to 1 worker, not just on
  explicit --workers override (Vance, James)
- Assert recommendScreenshot is true for htmlInCanvas detection —
  pins the load-bearing coupling with the screenshot paint-force
  protocol (Vance, Magi)
- Skip capture calibration for htmlInCanvas auto-sized renders to
  avoid wasting 600ms–3s on an estimate that gets thrown away (Vance)
- Add TODO documenting Chrome's two root-cause mechanisms: paint
  cache race and SwiftShader contention (James, Magi)
@miguel-heygen miguel-heygen merged commit 0845dc9 into main May 23, 2026
41 checks passed
@miguel-heygen miguel-heygen deleted the fix/htmlincanvas-single-worker branch May 23, 2026 20:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants