…sion

This is a significant internal rework laying foundation for future public APIs:

**Native DB Layer**
- Add C extension (_db) using LMDB for storage and msgpack for serialization
- Replace Python-based ops with optimized C implementations

**Codec System**
- Introduce codec registry with priority ordering for IndexOps literals
- Register built-in NodeCodec in API bootstrap; normalize Node->Ref through put_literal
- Simplify Dag.put to literal insertion only; route imports through Dag.load/put_import

**Ops Layer Rewrite**
- Implement core ops in C: base_ops, cache, commit, dag, gc, head, index, node, remote
- Python ops layer now delegates to C extension for performance-critical paths

**Execution Contracts**
- Finalize adapter/execution/error model with proper type hierarchy
- Add contrib runtime APIs for docker and script executors
- Isolate funk test execution in temporary working directories

**Documentation**
- Reorganize docs into PRD/concept/architecture structure with authority sections
- Add docs mapping guidance for agents
- Document scalar URI node wrapping and S3 tar exclusion/symlink rules

**Type Safety**
- Add typing stub for native Cython _db extension
- Align stubs/annotations with runtime behavior; simplify entry-point loading
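A codec registry with priority ordering can be pictured roughly as follows. This is a minimal sketch, not the project's implementation: `CodecRegistry`, `CodecEntry`, and the "highest priority wins" rule are all assumptions made for illustration, and the real registry lives behind IndexOps and the C extension.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class CodecEntry:
    """One registered codec (hypothetical shape, for illustration only)."""
    name: str
    matches: Callable[[Any], bool]
    encode: Callable[[Any], Any]
    priority: int = 0


class CodecRegistry:
    """Keeps codecs sorted so the highest-priority match is tried first."""

    def __init__(self) -> None:
        self._entries: List[CodecEntry] = []

    def register(self, name: str, matches: Callable[[Any], bool],
                 encode: Callable[[Any], Any], priority: int = 0) -> None:
        self._entries.append(CodecEntry(name, matches, encode, priority))
        # Assumption: larger numbers win. list.sort is stable, so codecs
        # with equal priority keep their registration order.
        self._entries.sort(key=lambda entry: -entry.priority)

    def encode(self, value: Any) -> Tuple[Optional[str], Any]:
        """Return (codec name, encoded value), or (None, value) if no codec matches."""
        for entry in self._entries:
            if entry.matches(value):
                return entry.name, entry.encode(value)
        return None, value
```

Ordering matters whenever two codecs can claim the same value. In Python, `bool` is a subclass of `int`, so a bool codec has to outrank an int codec to be reached at all.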
- Update CI workflow to use 'uv' for faster dependency management and running tests/lints.
- Fix Dockerfile in examples/dkr-ctx to use python 3.13 and remove invalid extras.
- Fix Dockerfile in dml-util/tests/assets/dkr-context to correctly copy parent directory for local daggerml installation.
* Initial plan

* Add cross-platform CI and C sanitizer (ASAN+UBSAN) checks

Co-authored-by: amniskin <10365753+amniskin@users.noreply.github.com>
Agent-Logs-Url: https://github.com/daggerml/python-lib/sessions/34e1e5f0-fd70-4089-8c24-504b9ace60a4

* Fix macOS CI failures: PYTHONMALLOC for ASAN and docker test isolation

Co-authored-by: amniskin <10365753+amniskin@users.noreply.github.com>
Agent-Logs-Url: https://github.com/daggerml/python-lib/sessions/88826cf4-6130-4529-8545-92944bb0e129

* Fix CI: remove macOS sanitize (unreliable), restrict macOS tests to Python 3.13 only

Co-authored-by: amniskin <10365753+amniskin@users.noreply.github.com>
Agent-Logs-Url: https://github.com/daggerml/python-lib/sessions/68c40a26-42ac-4b83-9dc6-0de508888929

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: amniskin <10365753+amniskin@users.noreply.github.com>
docs: Make contrib/testing.md conform to spec-schema.md
docs: Make specs conform to spec-schema.md
wip add stateless ssh contrib executor
contrib: simplify executor lifecycle and state record

Remove kill method from all executor classes; gc now handles cancellation and cleanup, deleting the state record.

Remove owner_executor, owner_instance, lease_expires_ts, updated_ts from StateRecord. Heartbeat staleness via is_stale() is the single source of truth. heartbeat_ts is updated on every state mutation.

Simplify DockerExecutor poll/gc: poll reads nested state directly, removed container-inspection dead code (_finish, _fail, _record_result, _cleanup, _container_status, _container_exit_code, _container_logs).

Simplify BatchExecutor poll: store batch job state in executor metadata (batch_status) instead of the outer record status. The sub-adapter manages the outer record status. Gracefully handle AWS errors.

Remove LEASE_SECONDS from script/supervisor.
Executor interface no longer requires resolve_runnable or poll. Only start and gc are mandatory for back-end executors. Introduces CfnExecutor for CloudFormation stack lifecycle (create, update, poll terminal statuses, surface failure reasons) and a convenience cfn() function. ExecutorBase is now a minimal base class with no default method stubs.
Replace the legacy inline ref-manifest flow with per-dag manifest publication and lookup so remote sync reads, publishes, and garbage-collects refs through direct dag-targeted manifests.

- add per-dag manifest helper primitives and reader resolution paths in remote ops
- switch publication to write refs with per-dag targets and validate direct dag references only
- update GC marking to follow dag refs through the new manifest structure
- remove the old inline manifest fallback and align cache/index behavior with the new flow
- expand remote ops coverage and adjust executor/runtime tests for the updated contract
- fold the design and rollout details into the remote docs and task specs
Keep the CLI suite focused on parser and dispatch behavior while relying on ops tests for command semantics. This cuts duplicate end-to-end coverage and speeds up the test matrix without reducing ownership checks at the CLI boundary.
Move runnable DAG cache publication out of the open index transaction and wait for S3 object visibility when writing remote descriptors, CAS objects, and refs. This avoids success paths racing remote cache lookup before newly published objects are observable.
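Waiting for object visibility after a write can be sketched as a bounded poll with backoff. The helper below is a hypothetical illustration, not the code from this change: `wait_until_visible` and its parameters are invented names, and `exists` stands in for whatever check the remote layer performs (for example a HEAD request against the object).

```python
import time
from typing import Callable


def wait_until_visible(
    exists: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `exists` until it returns True or `timeout` seconds elapse.

    Returns True once the object is observable, False on timeout.
    `sleep` and `clock` are injectable so tests can run without real delays.
    """
    deadline = clock() + timeout
    delay = interval
    while True:
        if exists():
            return True
        if clock() >= deadline:
            return False
        sleep(delay)
        # Exponential backoff, capped at one second between probes.
        delay = min(delay * 2, 1.0)
```

A publisher would call this after each remote write and only report success once it returns True, so a later cache lookup cannot observe a missing object.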
Unify remote materialization around manifest-pointer loads and raw cache keys so execution, cache publication, and pull all follow one contract. Remove legacy dump and commit-dump paths now that adapters publish argv and results through the remote manifest protocol.
Remove all old symbols (StateBase, LocalState, DynamoState, StateRecord, state_from_comms, lock_from_comms, is_stale). Replace with ExecutionState class implementing advisory locking and state machine transitions (pending -> running -> succeeded/failed). 26 moto-backed tests.
…factor executor infrastructure

Redesign the execution state system: replace LocalState/DynamoState/StateBase with a single DynamoDB-backed ExecutionState implementing a proper state machine with advisory locking and atomic claim_running() for duplicate-launch prevention.

- ExecutionState: pending->running->succeeded/failed->done state machine
- ExecutorBase.handle(): orchestrates start/poll/cleanup dispatch
- start_fn: rewritten as orchestrator (cache check, upsert, dispatch, publish, mark_done)
- All executors (script, docker, cfn, ssh, batch, lambda) rewritten for new model
- Wrapper executors (ssh, batch, docker) use child execution identities
- Supervisor: durable terminal writes, version-2-only payloads
- Named caches (remote_cache/cache_name/DML_CACHE_KEY) removed entirely
- Batch executor: propagate DML_DYNAMODB_TABLE into container env
- Executor registry: validate lifecycle callables even when resolve_runnable present
- Remove dead _proc_exists from script executor
- All docs updated for new DynamoDB execution model
- 646 tests passing
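The state machine and the atomic claim described above can be illustrated with an in-memory stand-in. This is only a sketch under stated assumptions: `InMemoryExecutionState`, `InvalidTransition`, and the `TRANSITIONS` table are hypothetical, a `threading.Lock` replaces the conditional write a DynamoDB-backed store would presumably use, and reading "succeeded/failed->done" as "both terminal results may move to done" is an interpretation of the commit message.

```python
import threading
from typing import Dict, Set

# Allowed transitions, as read from the commit message (assumption).
TRANSITIONS: Dict[str, Set[str]] = {
    "pending": {"running"},
    "running": {"succeeded", "failed"},
    "succeeded": {"done"},
    "failed": {"done"},
    "done": set(),
}


class InvalidTransition(Exception):
    """Raised when a status change is not allowed by TRANSITIONS."""


class InMemoryExecutionState:
    """In-memory stand-in for a durable execution record (hypothetical)."""

    def __init__(self) -> None:
        self.status = "pending"
        self._lock = threading.Lock()

    def claim_running(self) -> bool:
        """Atomically move pending -> running.

        Exactly one caller gets True; every later caller gets False,
        which is what prevents a duplicate launch.
        """
        with self._lock:
            if self.status != "pending":
                return False
            self.status = "running"
            return True

    def transition(self, new_status: str) -> None:
        """Apply a status change, rejecting anything outside TRANSITIONS."""
        with self._lock:
            if new_status not in TRANSITIONS[self.status]:
                raise InvalidTransition(f"{self.status} -> {new_status}")
            self.status = new_status
```

The important property is that the check and the write in `claim_running` happen as one step; with a remote store, the same effect needs a conditional update rather than a read followed by a write.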
…nterface, add is_node_like

The contrib executor infrastructure had two problems:

- ExecutorState was a standalone DynamoDB-backed class in contrib that mixed execution lifecycle concerns with state storage, making executors hard to reason about and test. Replace it with a new internal ExecutionState (exec_state.py) backed purely by DynamoDB, with a cleaner start/poll/cleanup split across all executors.
- Executor kwargs validation used isinstance checks against DelayedActionCodec (an internal codec wrapper) instead of the actual user-facing Delayed* types. Add is_node_like(x) to contrib/api.py as a shared predicate for Node | DelayedRef | DelayedLoad | DelayedRunnable, and update SshExecutor._validate_kw to use it.

Executors (ssh, docker, batch, cfn, lambda, script) are all updated to the new interface. Examples and docs updated to match.
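A shared predicate of this kind is typically a thin `isinstance` check over the user-facing types. In the sketch below the four classes are empty placeholders standing in for the real Node and Delayed* types, and `split_kwargs` is a hypothetical helper; how SshExecutor._validate_kw actually uses the predicate is not stated here.

```python
from typing import Any, Dict, Tuple


# Placeholder classes standing in for the real types (assumption).
class Node: ...
class DelayedRef: ...
class DelayedLoad: ...
class DelayedRunnable: ...


NODE_LIKE_TYPES = (Node, DelayedRef, DelayedLoad, DelayedRunnable)


def is_node_like(x: Any) -> bool:
    """True when x is a Node or one of the user-facing Delayed* types."""
    return isinstance(x, NODE_LIKE_TYPES)


def split_kwargs(kw: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: partition kwargs into node-like and plain values."""
    node_like = {k: v for k, v in kw.items() if is_node_like(v)}
    plain = {k: v for k, v in kw.items() if not is_node_like(v)}
    return node_like, plain
```

Checking against the public types rather than an internal wrapper keeps validation aligned with what callers actually pass in.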
Fire-and-monitor executors (docker, batch) previously passed sub-adapter
payloads via local tmpdir mounts or ad-hoc S3Store paths, storing the
resulting URIs in executor state so poll() could locate them later. This
was fragile: tmpdir paths tied docker to the host filesystem, the batch
executor stored input_uri/output_uri in job state that were never needed
again, and neither pattern had a stable, discoverable S3 namespace.
Introduce AdapterIO — a lightweight surrogate for stdin/stdout when direct
piping is not possible. Paths are derived deterministically from
(cache_key, exec_id, name) under the existing fn-exec S3 namespace:
{fn-exec-prefix}/io/{cache_key}/{exec_id}/{name}/input.json
{fn-exec-prefix}/io/{cache_key}/{exec_id}/{name}/output.json
Because paths are derived, poll() can reconstruct AdapterIO from the same
three values without reading them from state. This eliminates workdir,
output_path, input_uri, and output_uri from executor state entirely.
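The path derivation can be sketched as a small value object. `AdapterIOPaths` is a hypothetical name used only for this illustration, and the real AdapterIO class also reads and writes the payloads; the point here is just that the same three values always yield the same two URIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterIOPaths:
    """Derives input/output locations from (cache_key, exec_id, name).

    Hypothetical sketch: `prefix` plays the role of the fn-exec prefix.
    """
    prefix: str
    cache_key: str
    exec_id: str
    name: str

    def _base(self) -> str:
        # Layout from the commit message:
        # {fn-exec-prefix}/io/{cache_key}/{exec_id}/{name}/
        root = self.prefix.rstrip("/")
        return f"{root}/io/{self.cache_key}/{self.exec_id}/{self.name}"

    @property
    def input_uri(self) -> str:
        return f"{self._base()}/input.json"

    @property
    def output_uri(self) -> str:
        return f"{self._base()}/output.json"
```

Because nothing is stored, start() and poll() can each build the object independently and still agree on where the payload and the result live.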
Changes:
- exec_state.py: add AdapterIO class and ExecutionState.adapter_io() factory
- adapters.py: add S3 write branch to AdapterBase._write_output() so the
sub-adapter can write its result directly to an S3 URI (parallel to the
existing S3 read branch in _read_input)
- docker.py: start() writes payload via AdapterIO.write_input() and passes
S3 URIs to the container; _prepare_image tmpdir is now ephemeral (created
and removed within start()); poll() calls io.read_output() instead of
reading a local file; state shape reduced to {container_id, cleanup_image}
- batch.py: start() writes payload via AdapterIO.write_input(); removes
S3Store.cd("jobs") and the stored input_uri/output_uri; poll() reconstructs
AdapterIO("lambda:batch") and calls io.read_output(); state shape reduced
to {job_id, job_definition}
- tests updated throughout; batch tests no longer need _FakeStore
- docs updated: executor-state.md documents AdapterIO API and usage pattern;
executor-catalog.md updated for docker and batch to reflect simplified state
Introduce git-like remote project workflows including clone, fetch, pull, push, merge, revert, and DAG checkout, along with project-local/global config support, remote project refs, and the supporting docs, specs, and tests. Tighten the runtime contract for remote-aware components by requiring explicit remote configuration where remote-backed behavior is used, removing unsupported optional remote-root paths, updating local-only helpers to use local-only primitives, and verifying the result with ruff, pyright, and the full test suite.
Sync the finalized OpenSpec deltas into canonical specs and archive the completed change directories so the active change list stays current.
Still need to implement checkout commit functionality and review defaults
Implement git-like checkout behavior with attached/detached modes and compose clone as fetch+checkout. Align CLI, commit resolution, specs, and archived change artifacts around revision-based terminology and behavior.
Move fetch/pull/push/checkout/merge/revert/clone orchestration out of CLI handlers into DmlOps so project commands stay thin and easier to validate. Add regression tests for delegation and workflow parity, and group git-like commands in top-level help.
Move init orchestration into DmlOps so CLI remains a thin adapter, adding recovery/bootstrap validation and tests. Sync and archive the refine-dmlops-init-config-resolution OpenSpec change with updated canonical specs.
Consolidate project bootstrap on init by deleting clone/post-clone surfaces across CLI, internal config, specs, and docs, then archive the completed change.
Route DmlOps.init filesystem bootstrap through init_project_layout and remove duplicated local config/gitignore write paths so init behavior stays consistent across recovery and fresh setup flows.
Make init accept URI-only identity while enforcing name/project-uri exclusivity, with clearer config errors for unresolved user-derived ownership. Also allow dag checkout to infer user from config when omitted and update integration/examples and OpenSpec artifacts to reflect the new workflow.
… DmlOps Move DAG checkout and project remote client orchestration out of CLI handlers so command modules remain transport-only and easier to maintain. Keep behavior stable by updating delegation-focused CLI tests and adding ops-level coverage for extracted workflows.
Restructure tests around contract-first boundaries, add migration docs, and mark integration coverage consistently for fast-path selection. Also fix docker build integration fixture path so S3Store.tar resolves the build context.
Shift CLI/internal workflows to branch names and opaque index IDs so pointer persistence stays in HeadOps, then archive the completed OpenSpec change with synced specs.
…and optimistic publication

Changes:
- HeadOps pointer methods are now file-I/O-only except for create_branch bootstrap.
- IndexOps mutation paths derive commits in LMDB, close transaction, then publish via HeadOps CAS.
- IndexOps retries publication on DmlPointerConflictError using the conflict's current_commit.
- Only HeadOps.create_branch remains transaction-aware (bootstrap only).
- CommitOps and RemoteOps updated to publish pointers after LMDB transaction commit.
- Added regression tests for publication ordering and conflict retry behavior.

This eliminates the corruption window where refs could point to uncommitted commits by ensuring all pointer updates happen after LMDB transaction durability. Pointer publication now uses compare-and-swap with automatic retry on stale pointers.

Closes the gap between immutable commit creation in LMDB and mutable ref publication on the filesystem by making IndexOps own commit derivation and HeadOps own file-backed ref publication.
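The publish-with-retry pattern can be sketched with an in-memory pointer store. Everything here is illustrative: `PointerStore`, `PointerConflict`, and `publish` are hypothetical names, the store is a dict rather than files, and `derive` stands in for the step that builds a durable commit on top of a given parent before the pointer is touched.

```python
from typing import Callable, Dict, Optional


class PointerConflict(Exception):
    """Raised when the pointer moved; carries the commit it now points at."""

    def __init__(self, current_commit: Optional[str]) -> None:
        super().__init__(f"pointer is at {current_commit!r}")
        self.current_commit = current_commit


class PointerStore:
    """In-memory stand-in for file-backed branch pointers (hypothetical)."""

    def __init__(self) -> None:
        self.refs: Dict[str, Optional[str]] = {}

    def compare_and_swap(self, branch: str, expected: Optional[str], new: str) -> None:
        current = self.refs.get(branch)
        if current != expected:
            raise PointerConflict(current)
        self.refs[branch] = new


def publish(store: PointerStore, branch: str, expected: Optional[str],
            derive: Callable[[Optional[str]], str], max_attempts: int = 5) -> str:
    """Derive a commit on `expected`, then publish it with CAS.

    On conflict, retry from the commit reported by the conflict error.
    """
    for _ in range(max_attempts):
        new_commit = derive(expected)  # commit is created before publication
        try:
            store.compare_and_swap(branch, expected, new_commit)
            return new_commit
        except PointerConflict as exc:
            expected = exc.current_commit
    raise RuntimeError(f"could not publish {branch!r} after {max_attempts} attempts")
```

Since the pointer only moves after the commit exists, a reader following the branch never lands on a commit that was not written.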
Allow head and DAG deletion operations to fall back to the attached HEAD branch so temporary runtimes keep working without a branch override.
Unify remote execution tracking, dependency edges, and cache administration around execution ids so adapters, planners, and cache refs all operate on the same runtime model. Sync the corresponding OpenSpec specs and archive the completed change so the implementation, docs, and experimental workflow stay in sync.
Preserve nested adapter launch state during adapter CLI polling so supervisor-backed script executions keep their own result paths while still refreshing status metadata. Also restore remote-root propagation for Docker and CFN contrib executors so remote backed example and CloudFormation flows can run under the execution-id runtime model.
moto[server] version 5.2.0 catches if-none-match errors and raises 500 instead of expected 412.
GitHub Actions runs are much slower with sanitizer flags, so we don't run the sanitizer checks there.
node deprecation warnings.
Use project-home and remote-uri across the CLI, sync the related specs, and archive the completed change so the public surface matches shared config terminology.
Route repository workflows through the new Dml surface, centralize selector resolution, and align specs, docs, examples, and tests with the git-like CLI redesign.