
Explore TCP-like command lifecycle protocol for iOS runner #656

@thymikee

Description

What to explore

Investigate a more TCP-like command protocol between the daemon and the iOS XCTest runner, so that failures of mutating commands can be classified more precisely than today's preflight/send split allows.

Today the daemon runs an uptime preflight before most mutating commands to distinguish "runner was dead before command send" from "transport failed after command send". This favors reliability, but it adds a round trip per command. It also still leaves uncertainty when the actual command send fails. The runner may not have received the command, may have partially executed it, may have completed it while the response was lost, or may have crashed mid-command.
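The preflight/send split described above can be sketched roughly as follows. This is a minimal illustration that assumes the daemon has an uptime probe and a send call, and that both raise `ConnectionError` on transport failure. The names `send_with_preflight`, `RunnerDeadBeforeSend` and `PostSendFailure` are hypothetical and are not the daemon's actual API.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class RunnerDeadBeforeSend(Exception):
    """Hypothetical: preflight failed, so nothing was sent and a retry is safe."""


class PostSendFailure(Exception):
    """Hypothetical: transport failed after send; the outcome is unknown."""


def send_with_preflight(uptime: Callable[[], float], send: Callable[[], T]) -> T:
    # Extra round trip: confirm the runner is alive before sending anything.
    try:
        uptime()
    except ConnectionError as exc:
        raise RunnerDeadBeforeSend("runner unreachable before send") from exc
    # Once the command is on the wire, a failure is ambiguous. The runner may not
    # have received it, may have partially run it, may have completed it with the
    # response lost, or may have crashed mid-command. A mutating command is
    # therefore never replayed from here.
    try:
        return send()
    except ConnectionError as exc:
        raise PostSendFailure("outcome unknown; not retrying") from exc
```

A passing preflight only shows that the runner was alive a moment earlier. It says nothing about what happened to the command itself, and that gap is what the lifecycle model is meant to close.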

Explore whether command ids and runner-side command lifecycle acknowledgements can improve this without weakening the no-replay guarantee for mutating commands.

Acceptance criteria

  • Document the current failure modes around mutating iOS runner commands, including why post-send failures are not retried today.
  • Prototype or design a command lifecycle model with at least accepted, started, completed, and failed/unknown states.
  • Decide whether lifecycle state should be exposed through uptime, a dedicated status(commandId) command, or another runner endpoint.
  • Define daemon retry behavior for not accepted, accepted, started, completed, and unknown/crashed states.
  • Evaluate whether the model can safely remove or reduce eager uptime preflights for non-tap mutating commands.
  • Include perf expectations: the round trips likely to be saved versus per-command costs, which XCTest dominates.
  • Include reliability risks and limits, especially runner crashes after UI mutation but before status is recorded.
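The lifecycle states and daemon retry behavior in the criteria above could be tabulated roughly as follows. This is a sketch under stated assumptions, not a decided design. The state and decision names are hypothetical. The sketch assumes the runner journal also stores each command's outcome, so that a completed command's result can be reported without resending. It also assumes that a live runner reporting accepted or started means the command is still in flight.

```python
from enum import Enum


class CommandState(Enum):
    NOT_ACCEPTED = "not_accepted"  # runner can prove it never took the command
    ACCEPTED = "accepted"          # parsed and queued; no UI work yet
    STARTED = "started"            # XCTest interaction may have begun
    COMPLETED = "completed"        # finished; only the response may have been lost
    FAILED = "failed"              # runner reported a failure
    UNKNOWN = "unknown"            # runner crashed or restarted; journal lost


class RetryDecision(Enum):
    RETRY = "retry"                    # safe to resend the command
    WAIT_AND_POLL = "wait_and_poll"    # still in flight on a live runner; query again
    REPORT_RESULT = "report_result"    # surface the journaled outcome; no resend
    FAIL_NO_REPLAY = "fail_no_replay"  # outcome unknown; surface an error


def retry_decision(state: CommandState) -> RetryDecision:
    """Hypothetical retry policy for a mutating command after a send failure."""
    if state is CommandState.NOT_ACCEPTED:
        return RetryDecision.RETRY
    if state in (CommandState.COMPLETED, CommandState.FAILED):
        return RetryDecision.REPORT_RESULT
    if state in (CommandState.ACCEPTED, CommandState.STARTED):
        return RetryDecision.WAIT_AND_POLL
    # UNKNOWN covers the crash-after-UI-mutation risk: never replay.
    return RetryDecision.FAIL_NO_REPLAY
```

Only `NOT_ACCEPTED` maps to a resend. Every other state either waits, reports, or fails, which preserves the no-replay guarantee for mutating commands.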

Notes

The likely direction is:

  • daemon assigns a stable commandId to each runner command
  • runner keeps a small in-memory journal of the current and recent command ids
  • daemon can reconnect and query command status before deciding whether a retry is safe
  • retry is allowed only when the runner can prove the command was not accepted
  • started/completed/unknown mutating commands remain non-retryable by default

This should be treated as protocol exploration, not a quick perf optimization. Reliability should continue to trump latency.

Blocked by

None. This can start immediately.
