Cbones #26

Open
amniskin wants to merge 64 commits into refactor from cbones

amniskin commented May 9, 2026

No description provided.

amniskin and others added 30 commits March 21, 2026 09:18
…sion

This is a significant internal rework laying foundation for future public APIs:

**Native DB Layer**
- Add C extension (_db) using LMDB for storage and msgpack for serialization
- Replace Python-based ops with optimized C implementations

**Codec System**
- Introduce codec registry with priority ordering for IndexOps literals
- Register built-in NodeCodec in API bootstrap; normalize Node->Ref through put_literal
- Simplify Dag.put to literal insertion only; route imports through Dag.load/put_import
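
A minimal sketch of what a priority-ordered codec registry could look like. The names here (`CodecRegistry`, `register`, `can_encode`) and the "lower number is tried first" rule are assumptions for illustration only; the commit message does not show the actual registry API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(order=True)
class _Entry:
    # Only `priority` takes part in ordering; the rest is payload.
    priority: int
    name: str = field(compare=False)
    can_encode: Callable[[Any], bool] = field(compare=False)
    encode: Callable[[Any], Any] = field(compare=False)


class CodecRegistry:
    """Hypothetical registry: a lower priority number is tried first."""

    def __init__(self) -> None:
        self._entries: list[_Entry] = []

    def register(self, name, can_encode, encode, priority=100) -> None:
        self._entries.append(_Entry(priority, name, can_encode, encode))
        # list.sort is stable, so equal priorities keep registration order.
        self._entries.sort()

    def put_literal(self, value):
        # Hand the value to the first codec that claims it.
        for entry in self._entries:
            if entry.can_encode(value):
                return entry.name, entry.encode(value)
        raise TypeError(f"no codec accepts {type(value).__name__}")


registry = CodecRegistry()
registry.register("fallback", lambda v: True, repr, priority=1000)
registry.register(
    "node",
    lambda v: isinstance(v, dict) and "node" in v,
    lambda v: {"ref": v["node"]},  # stand-in for Node -> Ref normalization
    priority=10,
)
```

With this ordering, the "node" codec wins over the catch-all even though it was registered later.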

**Ops Layer Rewrite**
- Implement core ops in C: base_ops, cache, commit, dag, gc, head, index, node, remote
- Python ops layer now delegates to C extension for performance-critical paths

**Execution Contracts**
- Finalize adapter/execution/error model with proper type hierarchy
- Add contrib runtime APIs for docker and script executors
- Isolate funk test execution in temporary working directories

**Documentation**
- Reorganize docs into PRD/concept/architecture structure with authority sections
- Add docs mapping guidance for agents
- Document scalar URI node wrapping and S3 tar exclusion/symlink rules

**Type Safety**
- Add typing stub for native Cython _db extension
- Align stubs/annotations with runtime behavior; simplify entry-point loading
- Update CI workflow to use 'uv' for faster dependency management and running tests/lints.
- Fix Dockerfile in examples/dkr-ctx to use python 3.13 and remove invalid extras.
- Fix Dockerfile in dml-util/tests/assets/dkr-context to correctly copy parent directory for local daggerml installation.
* Initial plan

* Add cross-platform CI and C sanitizer (ASAN+UBSAN) checks

Co-authored-by: amniskin <10365753+amniskin@users.noreply.github.com>
Agent-Logs-Url: https://github.com/daggerml/python-lib/sessions/34e1e5f0-fd70-4089-8c24-504b9ace60a4

* Fix macOS CI failures: PYTHONMALLOC for ASAN and docker test isolation

Co-authored-by: amniskin <10365753+amniskin@users.noreply.github.com>
Agent-Logs-Url: https://github.com/daggerml/python-lib/sessions/88826cf4-6130-4529-8545-92944bb0e129

* Fix CI: remove macOS sanitize (unreliable), restrict macOS tests to Python 3.13 only

Co-authored-by: amniskin <10365753+amniskin@users.noreply.github.com>
Agent-Logs-Url: https://github.com/daggerml/python-lib/sessions/68c40a26-42ac-4b83-9dc6-0de508888929

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: amniskin <10365753+amniskin@users.noreply.github.com>
docs: Make contrib/testing.md conform to spec-schema.md

docs: Make specs conform to spec-schema.md

wip

add stateless ssh contrib executor

contrib: simplify executor lifecycle and state record

Remove kill method from all executor classes; gc now handles cancellation
and cleanup, deleting the state record.

Remove owner_executor, owner_instance, lease_expires_ts, updated_ts from
StateRecord. Heartbeat staleness via is_stale() is the single source of
truth. heartbeat_ts is updated on every state mutation.

Simplify DockerExecutor poll/gc: poll reads nested state directly,
removed container-inspection dead code (_finish, _fail, _record_result,
_cleanup, _container_status, _container_exit_code, _container_logs).

Simplify BatchExecutor poll: store batch job state in executor metadata
(batch_status) instead of the outer record status. The sub-adapter
manages the outer record status. Gracefully handle AWS errors.

Remove LEASE_SECONDS from script/supervisor.
Executor interface no longer requires resolve_runnable or poll.
Only start and gc are mandatory for back-end executors.

Introduces CfnExecutor for CloudFormation stack lifecycle (create,
update, poll terminal statuses, surface failure reasons) and a
convenience cfn() function. ExecutorBase is now a minimal base
class with no default method stubs.
Replace the legacy inline ref-manifest flow with per-dag manifest publication and
lookup so remote sync reads, publishes, and garbage-collects refs through direct
dag-targeted manifests.

- add per-dag manifest helper primitives and reader resolution paths in remote ops
- switch publication to write refs with per-dag targets and validate direct dag references only
- update GC marking to follow dag refs through the new manifest structure
- remove the old inline manifest fallback and align cache/index behavior with the new flow
- expand remote ops coverage and adjust executor/runtime tests for the updated contract
- fold the design and rollout details into the remote docs and task specs
Keep the CLI suite focused on parser and dispatch behavior while relying on ops tests for command semantics. This cuts duplicate end-to-end coverage and speeds up the test matrix without reducing ownership checks at the CLI boundary.
Move runnable DAG cache publication out of the open index transaction and wait for S3 object visibility when writing remote descriptors, CAS objects, and refs. This avoids success paths racing remote cache lookup before newly published objects are observable.
Unify remote materialization around manifest-pointer loads and raw cache keys so execution, cache publication, and pull all follow one contract. Remove legacy dump and commit-dump paths now that adapters publish argv and results through the remote manifest protocol.
Remove all old symbols (StateBase, LocalState, DynamoState, StateRecord,
state_from_comms, lock_from_comms, is_stale). Replace with ExecutionState
class implementing advisory locking and state machine transitions
(pending -> running -> succeeded/failed). 26 moto-backed tests.
…factor executor infrastructure

Redesign the execution state system: replace LocalState/DynamoState/StateBase
with a single DynamoDB-backed ExecutionState implementing a proper state machine
with advisory locking and atomic claim_running() for duplicate-launch prevention.

- ExecutionState: pending->running->succeeded/failed->done state machine
- ExecutorBase.handle(): orchestrates start/poll/cleanup dispatch
- start_fn: rewritten as orchestrator (cache check, upsert, dispatch, publish, mark_done)
- All executors (script, docker, cfn, ssh, batch, lambda) rewritten for new model
- Wrapper executors (ssh, batch, docker) use child execution identities
- Supervisor: durable terminal writes, version-2-only payloads
- Named caches (remote_cache/cache_name/DML_CACHE_KEY) removed entirely
- Batch executor: propagate DML_DYNAMODB_TABLE into container env
- Executor registry: validate lifecycle callables even when resolve_runnable present
- Remove dead _proc_exists from script executor
- All docs updated for new DynamoDB execution model
- 646 tests passing
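
The state machine and atomic claim described above can be sketched in memory as follows. This is an illustration under stated assumptions, not the project's `ExecutionState`: a `threading.Lock` stands in for the atomicity that the DynamoDB-backed store is described as providing, and every name except `claim_running` is hypothetical.

```python
import threading

# Transitions as listed in the commit message:
# pending -> running -> succeeded/failed -> done
TRANSITIONS = {
    "pending": {"running"},
    "running": {"succeeded", "failed"},
    "succeeded": {"done"},
    "failed": {"done"},
    "done": set(),
}


class InvalidTransition(Exception):
    pass


class InMemoryExecutionState:
    """Hypothetical in-memory stand-in for a DynamoDB-backed state table."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status: dict[str, str] = {}

    def upsert(self, exec_id: str) -> None:
        # Create the record as pending if it does not exist yet.
        with self._lock:
            self._status.setdefault(exec_id, "pending")

    def status(self, exec_id: str) -> str:
        with self._lock:
            return self._status[exec_id]

    def claim_running(self, exec_id: str) -> bool:
        # Check-and-set under one lock: only one caller can win the claim,
        # which is what prevents duplicate launches.
        with self._lock:
            if self._status.get(exec_id) != "pending":
                return False
            self._status[exec_id] = "running"
            return True

    def transition(self, exec_id: str, new_status: str) -> None:
        with self._lock:
            current = self._status[exec_id]
            if new_status not in TRANSITIONS[current]:
                raise InvalidTransition(f"{current} -> {new_status}")
            self._status[exec_id] = new_status
```

A second `claim_running` call for the same execution returns `False`, so the caller knows another worker already launched it.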
…nterface, add is_node_like

The contrib executor infrastructure had two problems:

- ExecutorState was a standalone DynamoDB-backed class in contrib that mixed
  execution lifecycle concerns with state storage, making executors hard to
  reason about and test. Replace it with a new internal ExecutionState
  (exec_state.py) backed purely by DynamoDB, with a cleaner start/poll/cleanup
  split across all executors.

- Executor kwargs validation used isinstance checks against DelayedActionCodec
  (an internal codec wrapper) instead of the actual user-facing Delayed* types.
  Add is_node_like(x) to contrib/api.py as a shared predicate for
  Node | DelayedRef | DelayedLoad | DelayedRunnable, and update
  SshExecutor._validate_kw to use it.

Executors (ssh, docker, batch, cfn, lambda, script) are all updated to the new
interface. Examples and docs updated to match.
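
A rough sketch of the shared predicate described above. The four classes are empty stand-ins defined only so the example runs on its own; the real types live in the library, and `validate_kw` is a hypothetical helper rather than the actual `SshExecutor._validate_kw`.

```python
# Empty stand-ins for the user-facing types named in the commit message.
class Node: ...
class DelayedRef: ...
class DelayedLoad: ...
class DelayedRunnable: ...


NODE_LIKE = (Node, DelayedRef, DelayedLoad, DelayedRunnable)


def is_node_like(x) -> bool:
    """True for Node | DelayedRef | DelayedLoad | DelayedRunnable."""
    return isinstance(x, NODE_LIKE)


def validate_kw(kw: dict) -> None:
    # Hypothetical validation: reject any kwarg whose value is not node-like.
    bad = sorted(k for k, v in kw.items() if not is_node_like(v))
    if bad:
        raise TypeError(f"expected node-like values for: {', '.join(bad)}")
```

Checking against the user-facing types keeps the validation independent of any internal codec wrapper.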
Fire-and-monitor executors (docker, batch) previously passed sub-adapter
payloads via local tmpdir mounts or ad-hoc S3Store paths, storing the
resulting URIs in executor state so poll() could locate them later.  This
was fragile: tmpdir paths tied docker to the host filesystem, the batch
executor stored input_uri/output_uri in job state that were never needed
again, and neither pattern had a stable, discoverable S3 namespace.

Introduce AdapterIO — a lightweight surrogate for stdin/stdout when direct
piping is not possible.  Paths are derived deterministically from
(cache_key, exec_id, name) under the existing fn-exec S3 namespace:

  {fn-exec-prefix}/io/{cache_key}/{exec_id}/{name}/input.json
  {fn-exec-prefix}/io/{cache_key}/{exec_id}/{name}/output.json

Because paths are derived, poll() can reconstruct AdapterIO from the same
three values without reading them from state.  This eliminates workdir,
output_path, input_uri, and output_uri from executor state entirely.
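
The path derivation above can be sketched as a small value object. The layout is the one given in the commit message; the class name `AdapterIOPaths`, its fields, and the example prefix are assumptions for illustration and are not the project's `AdapterIO` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterIOPaths:
    """Hypothetical helper: derives I/O locations from three values."""

    prefix: str      # the fn-exec S3 prefix (assumed to be a URI string)
    cache_key: str
    exec_id: str
    name: str

    def _base(self) -> str:
        root = self.prefix.rstrip("/")
        return f"{root}/io/{self.cache_key}/{self.exec_id}/{self.name}"

    @property
    def input_uri(self) -> str:
        return f"{self._base()}/input.json"

    @property
    def output_uri(self) -> str:
        return f"{self._base()}/output.json"


# start() and poll() can each build the same object independently,
# so nothing about the paths has to be stored in executor state.
at_start = AdapterIOPaths("s3://example-bucket/fn-exec/", "ck1", "ex1", "docker")
at_poll = AdapterIOPaths("s3://example-bucket/fn-exec/", "ck1", "ex1", "docker")
```

Because both objects are built from the same three values, they resolve to identical URIs.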

Changes:
- exec_state.py: add AdapterIO class and ExecutionState.adapter_io() factory
- adapters.py: add S3 write branch to AdapterBase._write_output() so the
  sub-adapter can write its result directly to an S3 URI (parallel to the
  existing S3 read branch in _read_input)
- docker.py: start() writes payload via AdapterIO.write_input() and passes
  S3 URIs to the container; _prepare_image tmpdir is now ephemeral (created
  and removed within start()); poll() calls io.read_output() instead of
  reading a local file; state shape reduced to {container_id, cleanup_image}
- batch.py: start() writes payload via AdapterIO.write_input(); removes
  S3Store.cd("jobs") and the stored input_uri/output_uri; poll() reconstructs
  AdapterIO("lambda:batch") and calls io.read_output(); state shape reduced
  to {job_id, job_definition}
- tests updated throughout; batch tests no longer need _FakeStore
- docs updated: executor-state.md documents AdapterIO API and usage pattern;
  executor-catalog.md updated for docker and batch to reflect simplified state
Introduce git-like remote project workflows including clone, fetch, pull, push, merge, revert, and DAG checkout, along with project-local/global config support, remote project refs, and the supporting docs, specs, and tests.

Tighten the runtime contract for remote-aware components by requiring explicit remote configuration where remote-backed behavior is used, removing unsupported optional remote-root paths, updating local-only helpers to use local-only primitives, and verifying the result with ruff, pyright, and the full test suite.
Sync the finalized OpenSpec deltas into canonical specs and archive the completed change directories so the active change list stays current.
Still need to implement checkout commit functionality and review defaults
Implement git-like checkout behavior with attached/detached modes and compose clone as fetch+checkout. Align CLI, commit resolution, specs, and archived change artifacts around revision-based terminology and behavior.
Move fetch/pull/push/checkout/merge/revert/clone orchestration out of CLI handlers into DmlOps so project commands stay thin and easier to validate. Add regression tests for delegation and workflow parity, and group git-like commands in top-level help.
amniskin added 25 commits April 29, 2026 19:12
Move init orchestration into DmlOps so CLI remains a thin adapter, adding recovery/bootstrap validation and tests. Sync and archive the refine-dmlops-init-config-resolution OpenSpec change with updated canonical specs.
Consolidate project bootstrap on init by deleting clone/post-clone surfaces across CLI, internal config, specs, and docs, then archive the completed change.
Route DmlOps.init filesystem bootstrap through init_project_layout and remove duplicated local config/gitignore write paths so init behavior stays consistent across recovery and fresh setup flows.
Make init accept URI-only identity while enforcing name/project-uri exclusivity, with clearer config errors for unresolved user-derived ownership. Also allow dag checkout to infer user from config when omitted and update integration/examples and OpenSpec artifacts to reflect the new workflow.
… DmlOps

Move DAG checkout and project remote client orchestration out of CLI handlers so command modules remain transport-only and easier to maintain. Keep behavior stable by updating delegation-focused CLI tests and adding ops-level coverage for extracted workflows.
Restructure tests around contract-first boundaries, add migration docs, and mark integration coverage consistently for fast-path selection. Also fix docker build integration fixture path so S3Store.tar resolves the build context.
Shift CLI/internal workflows to branch names and opaque index IDs so pointer persistence stays in HeadOps, then archive the completed OpenSpec change with synced specs.
…and optimistic publication

Changes:
- HeadOps pointer methods are now file-I/O-only except for create_branch bootstrap.
- IndexOps mutation paths derive commits in LMDB, close transaction, then publish via HeadOps CAS.
- IndexOps retries publication on DmlPointerConflictError using the conflict's current_commit.
- Only HeadOps.create_branch remains transaction-aware (bootstrap only).
- CommitOps and RemoteOps updated to publish pointers after LMDB transaction commit.
- Added regression tests for publication ordering and conflict retry behavior.

This eliminates the corruption window where refs could point to uncommitted commits by ensuring
all pointer updates happen after LMDB transaction durability. Pointer publication now uses
compare-and-swap with automatic retry on stale pointers.

Closes the gap between immutable commit creation in LMDB and mutable ref publication on the
filesystem by making IndexOps own commit derivation and HeadOps own file-backed ref publication.
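
The publish-after-commit flow with compare-and-swap retry can be sketched as below. This is an in-memory illustration under stated assumptions: `PointerStore`, `PointerConflictError`, and `publish_with_retry` are hypothetical names, the dict stands in for file-backed refs, and `derive_commit` stands in for the step that makes a commit durable before its pointer is published.

```python
import threading


class PointerConflictError(Exception):
    """Raised when the pointer moved; carries the commit it now holds."""

    def __init__(self, current_commit):
        super().__init__(f"pointer moved to {current_commit!r}")
        self.current_commit = current_commit


class PointerStore:
    """In-memory stand-in for file-backed branch pointers."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._refs: dict[str, str] = {}

    def get(self, branch):
        with self._lock:
            return self._refs.get(branch)

    def compare_and_swap(self, branch, expected, new) -> None:
        # Update only if the pointer still holds the value the caller saw.
        with self._lock:
            current = self._refs.get(branch)
            if current != expected:
                raise PointerConflictError(current)
            self._refs[branch] = new


def publish_with_retry(store, branch, derive_commit, max_attempts=5):
    parent = store.get(branch)
    for _ in range(max_attempts):
        # The commit is derived (and assumed durable) before publication.
        new_commit = derive_commit(parent)
        try:
            store.compare_and_swap(branch, parent, new_commit)
            return new_commit
        except PointerConflictError as err:
            # Rebase on whatever the pointer currently holds and try again.
            parent = err.current_commit
    raise RuntimeError(f"could not publish {branch!r} after {max_attempts} attempts")
```

Since the pointer is written only after the commit exists, a failed swap leaves an unreferenced commit behind rather than a ref pointing at a missing one.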
Allow head and DAG deletion operations to fall back to the attached HEAD branch so temporary runtimes keep working without a branch override.
Unify remote execution tracking, dependency edges, and cache administration around execution ids so adapters, planners, and cache refs all operate on the same runtime model. Sync the corresponding OpenSpec specs and archive the completed change so the implementation, docs, and experimental workflow stay in sync.
Preserve nested adapter launch state during adapter CLI polling so supervisor-backed
script executions keep their own result paths while still refreshing status metadata.
Also restore remote-root propagation for Docker and CFN contrib executors so remote
backed example and CloudFormation flows can run under the execution-id runtime model.
moto[server] version 5.2.0 catches If-None-Match errors and returns HTTP 500 instead of the expected 412.
GitHub Actions jobs are much slower with sanitizer flags, so we don't run with them.
node deprecation warnings.
Copilot AI review requested due to automatic review settings May 9, 2026 03:15

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

amniskin added 3 commits May 9, 2026 07:27
Use project-home and remote-uri across the CLI, sync the related specs, and archive the completed change so the public surface matches shared config terminology.
Route repository workflows through the new Dml surface, centralize selector resolution, and align specs, docs, examples, and tests with the git-like CLI redesign.

3 participants