fix(producer): force single worker for html-in-canvas compositions#1052
Conversation
Chrome's drawElementImage API does not support concurrent usage across multiple browser instances — running >1 capture worker causes flickering artifacts. Detect the layoutsubtree canvas attribute during compilation and unconditionally pin workers to 1, overriding both auto-sizing and explicit --workers flags.
4fd03e0 to
497f220
Compare
jrusso1020
left a comment
There was a problem hiding this comment.
Verdict: COMMENT (would-be APPROVE)
Workaround is well-shaped: surgical detection, tight test coverage, override path keeps the user informed. Two non-blocking observations on workaround-quality vs root-cause framing. Holding for James's stamp greenlight + Magi's deeper root-cause investigation (which will land separately per the delegation).
Surgical scope is right ✓
PR title reads "force single worker for html-in-canvas comps" which could be interpreted as "force-single across the board." The actual implementation is more targeted:
detectRenderModeHintsgreps the compiled HTML forcanvas[layoutsubtree]— the marker for the Chrome experimental html-in-canvas API.- Only when that selector matches does
resolveRenderWorkerCountclamp to 1. - Plain canvas comps (three.js, audio waveform, ProgressIndicator, etc.) keep their multi-worker parallelism.
Composition-level surgical, not global. ✓
Sub-composition scope confirmed ✓
I was concerned the per-HTML-string detection might miss canvas[layoutsubtree] usage in sub-compositions — but verified at htmlCompiler.ts:934-944:
const inlinedHtml = inlineSubCompositions(fullHtml, subCompositions, projectDir);
// ...
const sanitizedHtml = inlinedHtml.replace(...);
const renderModeHints = detectRenderModeHints(sanitizedHtml);detectRenderModeHints runs on the flattened HTML (with all sub-comps inlined), so html-in-canvas usage anywhere in the composition tree triggers the worker clamp. No sub-comp scope gap. ✓
Test coverage solid ✓
Four new tests pin both halves of the behavior:
- Detection:
canvas[layoutsubtree]triggers / plain canvas doesn't - Worker clamp: auto-sized → 1 / explicit
--workers 8→ 1 with warning
The "with warning" assertion (expect(log.warn).toHaveBeenCalledOnce()) specifically pins the user-facing log line, so a future "let's parallelize again" refactor would fail loudly. Good shape — matches the spirit of the briefing's "test that pins the prior behavior so future drift fails loudly."
Observation 1 (worth addressing) — no TODO / tracking-issue link
The two user-facing messages explain what is happening:
"Detected html-in-canvas API (layoutsubtree canvas). Chrome does not support concurrent drawElementImage across multiple workers; render is pinned to a single worker."
But neither references:
- A Chromium bug ID (Crbug? Issue tracker?)
- An internal tracking issue for Magi's deeper investigation
- A
// TODO(@miguel-heygen):comment marking this as a workaround to remove
The risk: when the root cause is eventually fixed upstream (or our deeper investigation finds the missing-determinism angle Miguel called out for Magi), nobody knows to remove this clamp. It becomes permanent cruft, capping html-in-canvas renders at 1 worker forever.
Suggested shape (one-line addition to the warning + a code comment):
// TODO(htmlInCanvas): workaround — Chrome's experimental drawElementImage
// API is non-deterministic across concurrent browser instances.
// Remove this clamp once root cause is identified.
// Tracking: <link to Magi's investigation findings or chromium bug>
if (reasonCodes.has("htmlInCanvas")) {
// ...
}Cheap insurance against the workaround outliving its cause.
Observation 2 (optional) — no escape hatch for power users
resolveRenderWorkerCount clamps to 1 unconditionally when htmlInCanvas is detected, even when the user passes --workers 8 explicitly. The warning message tells the user what happened, but there's no way to override.
For the common case this is correct — preventing users from foot-gunning with a flickering output is good UX. But:
- A developer testing whether the bug is fixed in a newer Chrome can't compare multi-worker output without a code change.
- A user who's confident their composition is safe (maybe they're only animating, not capturing) has no opt-out.
An optional --force-workers N flag (distinct from --workers N) would be the clean shape — explicit acknowledgment that the user is overriding the safety check. Or --workers 8 --html-in-canvas-unsafe. Not blocking; the current shape is defensible.
Sanity checks (no concerns)
- Performance cost on the clamped subset: workers go from auto-sized (8 typically?) to 1 → ~8x slowdown for html-in-canvas renders. Trade-off is right given the alternative is unusable flickering output. Worth noting in the PR body that the cost is bounded to the html-in-canvas subset of compositions, not all renders.
canvas[layoutsubtree]selector matches any value of the attribute — correct, since the spec uses it as a boolean attribute (presence = on).drawElementImageJS API withoutlayoutsubtree: per the experimental spec,drawElementImagerequires the canvas to havelayoutsubtreeenabled — so detection on the attribute alone covers the JS-call surface too. ✓- CI: 20 success / 10 in_progress / 0 failures.
Pre-merge ask (non-blocking)
- Add a TODO + tracking-issue reference so this doesn't become permanent. The reference can point to Magi's investigation findings once those land, or a Chromium issue if we file one. The shape matters more than the exact link target.
Workaround quality is otherwise solid. Deferring to Magi on the root-cause question (Chrome experimental feature vs missing-determinism in our side).
— Rames Jusso
vanceingalls
left a comment
There was a problem hiding this comment.
Detection is clean and narrow — canvas[layoutsubtree] selector in detectRenderModeHints, propagated through the RenderModeHintCode union to an early-exit in resolveRenderWorkerCount. Tests cover the positive case, the negative case (plain canvas), the auto-sizing override, and the explicit --workers override with the warn log. The shape is consistent with how other render-mode hints (iframe, raf) are wired.
Strengths:
captureCost.ts:116–126: early-exit before the cost estimate path is clean and correct. Single-worker enforcement is total — no edge where the estimate path can slide under it.htmlCompiler.test.ts:416–441: positive/negative detection tests are well-formed with realistic HTML fixtures.- Warn log on the
--workers > 1override is the right UX signal.
important — silent override on the auto-sized path
captureCost.ts:118: the warn fires only when requestedWorkers !== undefined && requestedWorkers > 1. An auto-sized render (the majority of production jobs, where requestedWorkers is undefined) is silently pinned to 1 worker with no log at all. Operators have no signal that single-worker mode was engaged unless they happen to check workerCount downstream. The warn should also fire (or a separate log.info) on the auto path so the behavior is observable. Suggested shape:
log.warn(
"[Render] html-in-canvas (drawElementImage) detected — pinning to 1 worker (Chrome concurrency limitation).",
{ requestedWorkers },
);
return 1;Single branch, no condition needed.
important — htmlInCanvas implicitly forces screenshot mode; test doesn't pin this
htmlCompiler.ts:130: detectRenderModeHints returns recommendScreenshot: reasons.length > 0. Adding htmlInCanvas to reasons means every html-in-canvas composition is routed to screenshot capture mode via applyRenderModeHints (shared.ts:212), in addition to being pinned to 1 worker. The PR body doesn't mention this side effect, and the new test at htmlCompiler.test.ts:416 only asserts reasons — it does not assert recommendScreenshot. If this is intentional (screenshot + single-worker both apply), the test should pin it. If it's unintentional, htmlInCanvas needs a carve-out in detectRenderModeHints analogous to how the code already avoids double-applying hints in other places.
Manual testing passed, so at minimum screenshot mode is compatible with drawElementImage — but the coupling should be explicit in the test.
important — no escape hatch; no follow-up issue
Single-worker enforcement cannot be bypassed. If the workaround misfires on a composition that doesn't actually flicker, or if the root-cause fix lands separately and the detection is over-broad, there is no --force-workers override or env var that unblocks the operator. Consider a HYPERFRAMES_OVERRIDE_WORKERS env or a cfg.disableHtmlInCanvasWorkerPin flag that lets the team A/B without a code change.
Also: the PR body describes this as a fix for non-determinism but doesn't reference a follow-up issue to investigate the actual root cause in Chrome's drawElementImage implementation. A linked issue would close the loop on the workaround framing.
nit — wasted calibration on auto-sized html-in-canvas renders
renderOrchestrator.ts:1702: the calibration run is gated on job.config.workers === undefined && totalFrames >= 60. For auto-sized renders of html-in-canvas compositions, calibration will still execute (potentially 600ms–3s), then resolveRenderWorkerCount throws the estimate away and returns 1. The early-exit in resolveRenderWorkerCount doesn't propagate up to skip the calibration. Not a correctness issue, but a latency cost on every affected render. A pre-check before the calibration block using the hint codes would skip it:
if (job.config.workers === undefined && totalFrames >= 60 && !compiled.renderModeHints.reasons.some(r => r.code === "htmlInCanvas")) {
// run calibration
}CI note: all required checks were pending at review time; no failures.
Verdict: APPROVE
Reasoning: The core mechanism is correct — single-worker enforcement is total, detection is precise, and manual testing on the affected composition type passed. The important findings are observability and documentation gaps, not correctness breaks. Ship with follow-up for the escape hatch and the recommendScreenshot coupling test.
— Vai
miguel-heygen
left a comment
There was a problem hiding this comment.
Review: Magi — root-cause investigation + code review
Root-cause investigation (Miguel's question)
Verdict: Chrome's drawElementImage IS the root cause. Not a HyperFrames bug.
drawElementImage() is an experimental API behind --enable-features=CanvasDrawElement that rasterizes live DOM subtrees into a <canvas>. The non-determinism in parallel rendering comes from two Chrome-internal mechanisms:
-
Paint cache race —
drawElementImagereads from the compositor's internal paint records. When multiple Chrome instances run on the same machine, CPU/GPU contention causes the compositor to serve stale or incomplete paint records to the canvas rasterizer. This produces flickering artifacts (frames where the canvas content is partially rendered or shows the previous frame's state). -
SwiftShader contention — Both distributed and in-process renders run
browserGpuMode: "software"(SwiftShader). Multiple SwiftShader instances contend on shared OpenGL state, anddrawElementImage's rasterizer shares that pipeline. A single Chrome instance avoids this contention entirely.
The API is behind an experimental flag for a reason — it isn't stable for concurrent use. There's no workaround on our side short of waiting for Chromium to stabilize it.
Distributed rendering is unaffected — renderChunk.ts already hardcodes workerCount: 1 (line 544), so each Temporal activity runs a single Chrome instance per chunk on separate processes/machines, avoiding the shared-state problem.
Code review
Detection is correct: canvas[layoutsubtree] attribute check runs on the flattened HTML (after inlineSubCompositions), so sub-composition usage is caught. The resolveRenderWorkerCount short-circuit is clean.
Non-blocking (1): recommendScreenshot side effect — the htmlInCanvas reason being pushed into reasons also forces screenshot capture mode (recommendScreenshot: reasons.length > 0). This is actually correct and necessary: the micro-screenshot paint-force protocol in frameCapture.ts:740 only runs when captureMode !== "beginframe", and drawElementImage depends on that forced paint. But the test doesn't assert recommendScreenshot is true, so this critical coupling is invisible. Suggest adding expect(result.recommendScreenshot).toBe(true) to pin this.
Non-blocking (2): Agree with Rames on the tracking TODO. A TODO(htmlInCanvas): remove when Chromium stabilizes CanvasDrawElement for concurrent use would prevent this from becoming permanent cruft.
LGTM ✓
— Magi
jrusso1020
left a comment
There was a problem hiding this comment.
Following Magi's root-cause finding + acknowledging three Vai catches I missed
Magi's root-cause conclusion changes my framing
My prior review framed this as a "workaround that might become permanent cruft." Magi's investigation shifts that:
"
drawElementImageis an experimental API behind--enable-features=CanvasDrawElement... two Chrome-internal mechanisms ... There's no workaround on our side short of waiting for Chromium to stabilize it."
So the clamp IS the correct fix given the upstream API's design, not a temporary patch over our own bug. My "TODO + tracking issue" ask still applies, but the framing should shift from "remove when fixed" to "document the two mechanisms (paint cache race + SwiftShader contention) so a future maintainer understands why this clamp is permanent until Chromium stabilizes CanvasDrawElement for concurrent use."
Three observations I missed, verified independently at the code
-
Silent override on auto-sized renders (Vai #1). Verified at
captureCost.ts:118-126: the warn fires only whenrequestedWorkers !== undefined && requestedWorkers > 1. Auto-sized jobs (whererequestedWorkers === undefined— the majority of prod renders) take the silent return-1 path. Operators have no observability that single-worker mode engaged. Real gap; Vai's suggested fix (move warn out of the explicit-workers guard, fire on every htmlInCanvas detection) is the right shape. -
htmlInCanvasimplicitly forces screenshot mode, untested (Vai #2 / Magi non-blocking #1). Verified athtmlCompiler.ts:recommendScreenshot: reasons.length > 0. AddinghtmlInCanvasto reasons flipsrecommendScreenshotto true. Per Magi's deeper note, this is actually correct and necessary — the micro-screenshot paint-force protocol inframeCapture.ts:740only runs whencaptureMode !== "beginframe", anddrawElementImagedepends on that forced paint. So the coupling is load-bearing, not incidental. The test should pin it withexpect(result.recommendScreenshot).toBe(true). -
Calibration waste on auto-sized htmlInCanvas renders (Vai nit). Verified at
renderOrchestrator.ts:if (job.config.workers === undefined && totalFrames >= 60)runsrunCaptureCalibrationwithout checking the hint codes. For html-in-canvas auto-sized renders, calibration runs (600ms-3s) and thenresolveRenderWorkerCountimmediately throws the estimate away by early-exiting to 1. The hint-code check Vai suggests would skip the calibration upfront.
Why I missed these
I focused review effort on "is the surgical scope right" + "is the workaround quality good" per the briefing, but didn't audit the observability surface (auto-vs-explicit log-fire branches) or trace the side effects of adding to reasons (the recommendScreenshot coupling). Both are exactly the kind of "check the second-order effects of the data structure change" lens I should apply on PRs that thread state through shared types like RenderModeHints.
Verdict: APPROVE (consensus with Vai + Magi)
Magi confirmed the clamp is the correct fix; Vai approved with observability + test-pinning asks; my review aligned on surgical scope + missed the observability gaps. All three reviews converge on APPROVE-on-merit with the same set of non-blocking follow-ups (auto-path log, test pinning recommendScreenshot, escape hatch, tracking comment with Magi's two mechanisms documented).
Holding for James's stamp greenlight.
— Rames Jusso
- Always log when html-in-canvas pins to 1 worker, not just on explicit --workers override (Vance, James) - Assert recommendScreenshot is true for htmlInCanvas detection — pins the load-bearing coupling with the screenshot paint-force protocol (Vance, Magi) - Skip capture calibration for htmlInCanvas auto-sized renders to avoid wasting 600ms–3s on an estimate that gets thrown away (Vance) - Add TODO documenting Chrome's two root-cause mechanisms: paint cache race and SwiftShader contention (James, Magi)
Summary
<canvas layoutsubtree>) during composition compilation and adds it as a newRenderModeHintCodedrawElementImagedoes not support concurrent usage across multiple browser instances and produces flickering artifacts--workersflags (with a warning log when the user explicitly requested >1)Test plan
detectRenderModeHintsdetectslayoutsubtreecanvas attributedetectRenderModeHintsignores plain canvas withoutlayoutsubtreeresolveRenderWorkerCountforces 1 worker on htmlInCanvas hintresolveRenderWorkerCountoverrides explicit--workers 8with warning