feat: add Maestro replay compatibility#581
Conversation
|
6794225 to
baaee00
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6794225256
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
baaee00 to
1ea51f9
Compare
Code reviewOverviewThis reworks the Maestro replay compatibility layer: it removes the previously-mapped device/utility commands ( FindingsSecurity / decisions to make consciouslyS1 —
S2 — iOS non-hittable selector tap: bounds check is looser than its comment. ( Issues to verifyI1 — Android I2 — I3 — I4 — I5 — Nits
Test coverage & CICoverage of the new mapping/runtime surface is good: Gaps worth filling:
CI: not failing. All completed checks pass; the SummarySolid, well-factored compatibility work with strong runtime tests. The item warranting an explicit maintainer decision is Reviewed with Claude Code. Generated by Claude Code |
thymikee
left a comment
There was a problem hiding this comment.
Took a closer look at the Maestro replay compatibility surface. Overall the structure looks reasonable — the parse-time → runtime-command split for tapOn/scrollUntilVisible/runFlow.when/pressEnter is the right factoring, and the test coverage on replay-flow.test.ts and session-replay-vars.test.ts is substantial.
Spot-checked the two Codex P1s from earlier:
- scrollUntilVisible direction inversion: fixed by introducing the separate
readScrollUntilVisibleDirection(left a note suggesting a docblock so the asymmetry withreadScrollPositionalsFromDirectionSwipeis obvious to future readers). - runFlow.when swallowing errors: narrowed, but the heuristic in
isMaestroWhenConditionMissstill classifies mostCOMMAND_FAILEDoutcomes fromisas "condition missed" → silently skip subflow. Worth a small refactor on theishandler side to make pass/fail/error distinct.
Other observations I'd like to see addressed before merge:
vm.runInNewContextis not a security boundary and itstimeoutdoesn't bound async work. Both should be documented in code +replay-e2e.mdso the trust model is explicit (this is internal-flows-only, right?).pressEnterfetch-failed recovery matches by substring and treats a successful follow-up snapshot as "Enter worked". Worth tightening both the match condition and the response shape so upstreamispredicates can't read the recovered state as success.MAX_REPEAT_EXPANSIONS = 100is fine to ship but parse-time expansion will be painful to back out of if real flows want larger repeats. A runtime-command form (likerunFlowWhen) would scale better.http.postspawns a Node child per call — fine for one-off auth setup, will be slow if anyone actually loops on it.
Nothing in here looks like a hard block — most of the concerns are around tightening trust boundaries and getting the is-handler error taxonomy right so runFlow.when doesn't drift back toward "errors look like misses".
Generated by Claude Code
| timeout: RUN_SCRIPT_TIMEOUT_MS, | ||
| }); | ||
| } catch (error) { | ||
| throw new AppError( |
There was a problem hiding this comment.
A few notes on vm.runInNewContext here that I think deserve being called out in code and/or in replay-e2e.md:
-
vmis not a sandbox. The Node docs are explicit: "The node:vm module is not a security mechanism. Do not use it to run untrusted code." Script code can reach back into the host process via prototype chains, getter side effects, etc. Given that Maestro replay flows usually come from files in the repo, this is probably an acceptable trust model — but it should be documented, because someone might reasonably assume "we sandbox runScript". A one-line comment near the call site and a sentence inreplay-e2e.mdwould prevent future surprise. -
The
timeoutoption only bounds synchronous execution. As soon as the script does anything async (Promise,http.postreturns, callback timers, etc.), the 30 s timeout doesn't help. In practice this means a long-running fetch in user code won't be aborted by thevm.runInNewContexttimeout — only by the child-process timeout insiderunHttpRequestSync. Worth noting in a comment so readers don't assume "everything is bounded by 30 s". -
Output keys with dots end up as env keys with dots (
context.env[\output.${key}`] = ...). If a script doesoutput["bar.baz"] = 1, the resulting keyoutput.bar.bazis ambiguous w.r.t. Maestro's nested variable resolution if it ever lands. Probably fine for now, but consider rejecting.` in output keys to keep the namespace clean.
Generated by Claude Code
| ok: true, | ||
| data: { skipped: true, condition: condition.mode, selector: condition.selector }, | ||
| }; | ||
| } |
There was a problem hiding this comment.
This heuristic is the narrowed version of the Codex bot's earlier P1 (runFlow.when swallowed runtime errors as condition misses). It's better than the original — the android_foreground_surface carve-out is good — but the underlying conflation is still there: any COMMAND_FAILED from the is handler (e.g. snapshot transport failure, runner disconnect, selector parse error, timeout) is interpreted as "the condition did not match" and the subflow is silently skipped.
The cleanest fix would be for the is handler to use distinct error codes (or include details.predicate = 'pass' | 'fail' | 'error') so this check can pivot on code === 'PREDICATE_FAILED' (or details.predicate === 'fail') instead of code === 'COMMAND_FAILED'. That makes the behavior intentional and exhaustive instead of "everything looks like a miss unless we know otherwise".
Short of that refactor, I'd at least make the carve-out an allowlist of "things that mean the condition was evaluated and was false" rather than a denylist of "things that don't count as a miss" — denylists drift over time as new failure modes get added to the is path.
Generated by Claude Code
| predicate: mode === 'visible' ? 'visible' : 'hidden', | ||
| selector, | ||
| }; | ||
| } |
There was a problem hiding this comment.
Two concerns with the "Enter caused transport reset, snapshot succeeded → treat as recovered" path:
-
message.includes('fetch failed')matches anywhere in the string, including substrings of unrelated errors that happen to mention fetch failure. If you also start surfacing other error messages that contain "fetch failed" (e.g. an HTTP plugin or a future replay backend), this will accidentally extend the recovery path to them. Matching onerror.codeplus a specific transport reason string would be more durable. -
A successful follow-up snapshot doesn't prove that the original
type "\n"actually submitted the form — only that the runner is alive again. The warning in the response acknowledges this, but the caller sees{ ok: true, recovered: true, ... }and will probably proceed as if Enter worked. For replay specifically that's often fine (the next step will fail loudly if it didn't), but forrunFlow.whenevaluation upstream, anok: truehere is indistinguishable from a real success. Consider whether the recovered case should be propagated as a distinct outcome (e.g.{ ok: true, data: { recovered: true } }should never feed back intoispredicates as "pass").
Generated by Claude Code
| @@ -136,10 +181,39 @@ export function convertSwipe(value: unknown): SessionAction { | |||
| ); | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
Note for future readers: readScrollUntilVisibleDirection passes the direction through unchanged (up → up), while readScrollPositionalsFromDirectionSwipe above inverts it (up → down). Both behaviors are correct given Maestro semantics — swipe.direction describes finger motion, scrollUntilVisible.direction describes scroll/content motion — but the asymmetry inside one file is genuinely confusing.
A short docblock on each function ("Maestro swipe direction describes finger motion, so we invert to get scroll direction" / "Maestro scrollUntilVisible.direction already describes scroll direction") would save the next reader a few minutes of "wait, why are these different?".
This also looks like the fix for the Codex bot's resolved P1 about scroll direction inversion in scrollUntilVisible — the previous code reused readScrollPositionalsFromDirectionSwipe here, which inverted incorrectly. The new function is the right shape.
Generated by Claude Code
| const MAX_REPEAT_EXPANSIONS = 100; | ||
|
|
||
| type ConvertCommandList = ( | ||
| commands: MaestroCommand[], |
There was a problem hiding this comment.
MAX_REPEAT_EXPANSIONS = 100 is a fairly aggressive cap given Maestro's typical use of repeat (e.g. scrolling N times to find an item, or polling). 100 is enough for most cases but I wouldn't be surprised to see real-world flows hit this — particularly repeat: { times: ${ITEM_COUNT}, commands: [tap, swipe up] } patterns for long lists.
A couple of options:
- Bump to 1000 (still bounds the eager expansion to a sensible memory ceiling).
- Or, instead of eagerly expanding
timescopies of the action list at parse time, emit a single runtime command (runtime: repeat) and have the runtime loop, the same wayrunFlow.whenalready does. That removes the cap entirely.
Not blocking, but worth a thought before this lands — bumping the cap later is fine, but the parse-time-expansion pattern can be painful to back out of once flows depend on it.
Generated by Claude Code
| }), | ||
| timeoutMs: RUN_SCRIPT_TIMEOUT_MS, | ||
| allowFailure: true, | ||
| }); |
There was a problem hiding this comment.
Spawning a fresh Node child process for every http.post is heavy — startup cost is ~50–150 ms per call even on a fast machine, before the request is even sent. For replays that do auth/setup HTTP calls in a script, this is OK; for anything that polls or batches it'll dominate runtime.
Also: this passes the URL and headers/body to a child via JSON stdin. That's fine, but the HTTP_REQUEST_SCRIPT text is hardcoded and uses global fetch, so it implicitly requires Node ≥ 18. A defensive engines constraint or explicit error if typeof fetch === 'undefined' would help users on older Node versions.
Lower priority — fine for v1 of the Maestro replay shim, but if http.post ends up used heavily, an in-process synchronous wrapper around fetch (e.g. via Atomics.wait on a worker, or child_process.execFileSync with the script written to disk once and reused) would pay back quickly.
Generated by Claude Code
b4f7170 to
ed3e970
Compare
Summary
Adds a focused Maestro replay compatibility path for existing YAML flows, including command mapping, variable/runScript handling, flow-control shims, and replay-only runtime helpers for scroll/tap behavior. Also adds the iOS non-hittable selector tap backdoor needed for hidden RN E2E controls and documents the supported replay subset.
Updates #558.
Touched 34 files. Scope is replay compatibility plus the narrow iOS runner tap fallback required by that workflow. Device utility commands such as permissions, mock location, airplane mode, orientation, and recording are intentionally left unsupported for now.
Validation
Verified with
pnpm format, focused replay/help tests,pnpm build, andgit diff --check.pnpm check:unitoutside the sandbox passed all but one unrelated Android .aab install test that timed out in the full run; rerunning that exact test passed. Earlier validation on this branch also passedpnpm build:xcuitestfor iOS and macOS. Manual Bluesky replay comparison produced Agent Device and Maestro recordings during validation; Maestro still fails the selected Bluesky flow earlier than Agent Device because its precondition does not reach the target screen.