Agent deploy + observability overhaul: per-agent virtual keys, full trace, pre-built base image by jaylfc · Pull Request #225 · jaylfc/tinyagentos

jaylfc · 2026-04-18T14:55:45Z

TL;DR

End-to-end agent deploy + chat with per-agent LiteLLM virtual keys backed by Postgres, full observability (message_in/llm_call/tool_call/tool_result/message_out all bucketed under the agent slug), provider-agnostic model discovery (any cloud provider added via the providers app auto-populates LiteLLM's model_list and the agent-creation dropdown), and pre-built openclaw LXC base images (deploy time 90s → 33s).

Squash-merge to keep master history clean: 112 commits' worth of iteration through five distinct latent bugs and four feature deliverables.

What shipped

LiteLLM proxy: ownership, auth, persistence

Stop adopting foreign LiteLLM processes on :4000. taOS terminates anything on its port and spawns its own. Adoption was masking master-key mismatch and silent config drift.
Master key + DATABASE_URL propagate from data/.litellm_db_url into the LiteLLM subprocess env.
Prisma client auto-generates on boot; LiteLLM owns the schema migration via its shipped prisma/migrations/ (no more db push corruption loop).
Health probe uses /health/readiness (was hitting 401 on /health with master_key set).
SIGHUP reload removed — single-worker uvicorn dies on SIGHUP. Reload now uses a clean restart path.
api_key_secret references resolve into the subprocess env instead of being passed as os.environ/<name> literal strings.
Non-200 responses on /key/generate log the status + body instead of silently returning None.

Provider pipeline: provider-agnostic + LiteLLM-authoritative

/api/providers/models?refresh=bool passthrough to LiteLLM /v1/models with TTL cache. Frontend AgentsApp's create-agent dialog now reads from this endpoint — LiteLLM is the single source of truth for what models can be assigned.
POST /api/providers auto-discovers models from {url}/models for any CLOUD_BACKEND_TYPES provider. PATCH re-probes when routing-affecting fields change.
Provider catalog with canonical base URLs per type (kilocode, openai, anthropic, openrouter), so an api-key-only entry still lands in model_list correctly.
generate_litellm_config warns when a backend is dropped for missing url/models, instead of silently producing an incomplete config.
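A minimal sketch of the TTL cache behind a `?refresh=bool` passthrough like the one described above. The class name, the 60-second default, and the injected `fetch`/`clock` callables are illustrative assumptions, not the PR's actual code.

```python
import time
from typing import Callable, List, Optional


class ModelListCache:
    """Cache a fetched model list for ttl_seconds; refresh=True bypasses it."""

    def __init__(
        self,
        fetch: Callable[[], List[str]],
        ttl_seconds: float = 60.0,  # assumed default, not taken from the PR
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._models: Optional[List[str]] = None
        self._fetched_at = 0.0

    def get(self, refresh: bool = False) -> List[str]:
        now = self._clock()
        stale = self._models is None or (now - self._fetched_at) >= self._ttl
        if refresh or stale:
            # Hypothetical fetch: in taOS this would call LiteLLM /v1/models.
            self._models = self._fetch()
            self._fetched_at = now
        return list(self._models)
```

Injecting the clock keeps the expiry logic testable without sleeping.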

Trace observability: both sides of every conversation

message_in events captured at BridgeSessionRegistry.enqueue_user_message — every user message reaching an agent now lands in trace with content, channel_id, message_id.
llm_call events via the LiteLLM CustomLogger (sibling shim file written to the config dir so get_instance_fn can resolve the dotted path).
Slug extraction reads kwargs.litellm_params.metadata.user_api_key_metadata.agent (the path LiteLLM v1.83.4 actually surfaces) — events bucket under the right agent slug, no more _unknown_.
tool_call + tool_result events via the openclaw fork's bridge (jaylfc/openclaw#taos-fork commits ef84a93 + 9bab2e3) — uses getAgentRunContext(runId).sessionKey to correlate events when isControlUiVisible=false (non-webchat channels).
TAOS_LOCAL_TOKEN forwarded to the LiteLLM subprocess so callback POSTs to /api/trace actually authenticate.

Agent UX

Emoji per agent: picker in the create-agent flow (default = framework's icon, override via inline input + 12-item quick-pick grid). Stored on the agent record. Displayed in agent rows, message hub, and the AgentStatusWidget. ARIA-correct.
Tolerant DELETE for orphan agents (failed deploys that left only a config row): skip incus snapshot / incus stop when the container doesn't exist, hard-delete when no chat/trace history, tombstone-archive otherwise.

Pre-built LXC base image

.github/workflows/build-agent-images.yml: matrix arm64 + x64 builds a Debian 13 incus image with Node 22 + openclaw preinstalled, publishes to the rolling-images Release tag.
Deployer auto-imports the image on first deploy (background lifespan task) and uses it via incus launch taos-openclaw-base when present. Falls back to the per-deploy build path when absent.
install.sh branches on $TAOS_BASE_IMAGE_PRESENT to skip apt/npm/tarball steps on the fast path.
Result on Pi 5 Plus: deploy time 60–90s → 33s (a 45–63% reduction, depending on the baseline). Remaining bottleneck is the dir-pool full-FS-copy during incus launch; see perf(install): use btrfs/ZFS storage pool for incus to unlock CoW container clones (deploy ≤5s) #224.

Closes

#67 openclaw integration: built-in litellm provider, LITELLM_API_KEY, model echo
#46 Pi E2E openclaw deploy (issue title: Image catalog: Intel / NVIDIA / AMD hardware tiers in all manifests)
#221 orphan DELETE (issue title: bug: DELETE /api/agents/{name} fails when container was never created)
#222 fork tool emission (issue title: feat(openclaw fork): emit tool_call and tool_result bridge events so trace captures intermediate tool steps)
#220 pre-built base image (issue title: feat: pre-built openclaw LXC base image (GitHub Actions, per-arch), drops deploy time 90s → 10s); PASS criterion of ≤15s blocked on #224 storage pool

Queued follow-ups (filed during this work)

#223 perf: shrink published image (target ≤200 MB) + make ensure_image_present non-blocking on first boot
#224 perf: switch incus storage pool to btrfs/ZFS for sub-second CoW container clones (unlocks the original ≤15s deploy target)

Test coverage

1700+ tests in the full fast suite green.
Major focus suites all 100%:
- test_llm_proxy (~38 tests)
- test_litellm_callback
- test_litellm_migrate (9 new)
- test_routes_providers (passthrough + cache + PATCH refresh)
- test_routes_agents (deploy emoji + orphan delete)
- test_bridge_session (message_in)
- test_deployer (base image + key minting)
- test_agent_image (10 new)
Pre-existing unrelated failures (test_hardware.py arm64 check on macOS) untouched.

Migration / one-time steps for existing installs

Postgres: install postgresql if not present; create role + DB and write the URL to data/.litellm_db_url (mode 600). Setup script ergonomics are queued for a follow-up.
Prisma client: pip install prisma runs automatically via the new pyproject.toml dep; prisma generate runs on first taOS boot via litellm_migrate.py.
Pre-built image: imports automatically on first boot. ~30–60s download + import (one-time).
Existing kilocode/cloud provider entries without url or models are repaired by the scripts/repair_providers.py migration, which probed the upstream /models endpoint and populated 330 kilocode models.

Pi state at merge

Clean. Only the always-protected mary / naira / stanley containers remain.

Adapters imported uvicorn at module top, so anything that imported them for structural checks (tests, health-endpoint probes) would crash with ModuleNotFoundError when uvicorn wasn't installed. uvicorn.run is only needed when an adapter is run as a standalone process — move the import into the __main__ guard. Clears 19 pre-existing test failures across test_new_adapters.py and test_channel_hub_new.py.

Required by the new agent archive lifecycle: the delete path stops the container then renames it to a dated `taos-archived-{slug}-{ts}` bucket so a later restore can rename it back. Implemented for both LXC (incus rename) and Docker (docker rename).

…install

Broken before: `pip3 install openclaw` ran unconditionally, its failure was logged as a warning and the deploy continued, and the container came up missing the deps most agent frameworks need.

Now:
- apt install includes nodejs, npm, build-essential, python3-dev, ca-certificates, gnupg, wget, with DEBIAN_FRONTEND=noninteractive and --no-install-recommends (timeout 15m for slow arm64 apt).
- Framework install dispatches on the manifest's install.method: pip uses manifest.install.package, script pushes + runs manifest.manifest_dir / install.script. Missing script files, unsupported methods, and non-zero exits all raise RuntimeError so the outer try/except rolls back the container and the agent shows status=failed instead of misleadingly 'running'.
- TAOS_MODEL env var is injected so the in-container runtime knows which model to send to LiteLLM.

…ation, hot reload

generate_litellm_config now:
- Registers openrouter (openrouter/ prefix, native LiteLLM support) and kilocode (openai-compatible, explicit api_base) in the backend type maps.
- Expands each cloud backend's declared models into their own model_list entries keyed on the real model id, so agents can request a specific model. The 'default' alias is still appended as a fallback.

routes/providers.py: add/patch/delete now call proxy.reload_config instead of the stale proxy.write_config, so the running LiteLLM subprocess actually picks up config changes.
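The expansion of declared models into per-model `model_list` entries might look roughly like the sketch below. The backend dict shape, the `openai/` prefix for kilocode, and having the `default` alias point at the first entry are assumptions made for illustration. Only the `openrouter/` prefix, the explicit `api_base` for kilocode, the `os.environ/<name>` key marker, and the warning on missing url/models come from the PR text.

```python
import logging
from typing import Any, Dict, List

log = logging.getLogger("litellm_config_sketch")

# Assumption: kilocode is routed through LiteLLM's generic "openai/" prefix.
MODEL_PREFIX = {"openrouter": "openrouter/", "kilocode": "openai/"}
NEEDS_API_BASE = {"kilocode"}


def build_model_list(backends: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Expand each backend's declared models into LiteLLM model_list entries."""
    entries: List[Dict[str, Any]] = []
    for backend in backends:
        models = backend.get("models") or []
        if not backend.get("url") or not models:
            # Surface the drop in logs instead of silently skipping it.
            log.warning("dropping backend %r: missing url or models", backend.get("name"))
            continue
        prefix = MODEL_PREFIX.get(backend.get("type", ""), "")
        for model_id in models:
            params: Dict[str, Any] = {"model": f"{prefix}{model_id}"}
            if backend.get("type") in NEEDS_API_BASE:
                params["api_base"] = backend["url"]
            if backend.get("api_key_secret"):
                params["api_key"] = f"os.environ/{backend['api_key_secret']}"
            entries.append({"model_name": model_id, "litellm_params": params})
    if entries:
        # Assumed fallback: 'default' aliases the first routable model.
        entries.append({"model_name": "default", "litellm_params": dict(entries[0]["litellm_params"])})
    return entries
```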

The manifest declares method: script -> scripts/install.sh, which didn't exist. The deployer had no way to install openclaw, so the agent came up with no runtime and the chat path had nothing to hit.

The new script, run once inside a fresh Debian bookworm LXC:
- Creates /opt/openclaw with a pinned venv (fastapi, uvicorn, httpx, openai).
- Writes a minimal FastAPI runtime at /opt/openclaw/server.py that listens on 0.0.0.0:8100, accepts POST /message {text, from, thread_id?} and forwards to LiteLLM using the injected OPENAI_BASE_URL, OPENAI_API_KEY, and TAOS_MODEL env vars.
- Installs a systemd unit so the runtime survives restarts.
- Polls /health up to 20s and fails the install if the server didn't come up.

No memory, no tools, no persistence: the host owns all of that. This is the minimum for the end-to-end chat pipeline to land messages on an agent and get a reply back.

…fecycle

Several related changes to the agents API and config model that together make agent creation survive the full round trip:
- Every agent gets a stable 12-char uuid (agent['id']), backfilled for existing config entries by normalize_agent.
- body.model and body.framework land on the agent row at create time; llm_key lands after the background deploy succeeds.
- A 1:1 DM channel is auto-created on successful deploy and its id persisted as chat_channel_id so the Messages app sees the agent immediately.
- extra_config to deploy_agent now always includes the app registry so the manifest-aware framework install can resolve.

Delete is now archive, not destroy. DELETE /api/agents/{name}: stops the container, renames it to taos-archived-{slug}-{ts}, moves workspace/memory dirs under data_dir/archive/{slug}-{ts}/, revokes the LiteLLM key, flags the DM channel archived, and moves the config entry from config.agents to config.archived_agents.

New endpoints:
- GET /api/agents/archived -> list archive entries
- POST /api/agents/archived/{id}/restore -> reverses the archive
- DELETE /api/agents/archived/{id} -> true permanent purge

Restore handles slug collisions (if a new agent has taken the original name) by suffixing -2, -3, etc. Purge is what the old hard-delete used to do: destroy container, rm -rf archive dir, delete chat channel, drop the archived entry.

This also fixes 'can't re-create a deleted agent with the same name' -- the old delete path left the LXC container around; the new archive path renames it out of the way.
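The slug-collision rule on restore (suffix -2, -3, and so on) can be sketched as below. The function name and the set-of-taken-slugs argument are hypothetical; the real restore path may look the names up differently.

```python
from typing import Set


def resolve_restored_slug(original: str, taken: Set[str]) -> str:
    """Return the original slug if free, else the first free '<slug>-N' with N >= 2."""
    if original not in taken:
        return original
    n = 2
    while f"{original}-{n}" in taken:
        n += 1
    return f"{original}-{n}"
```

For example, restoring `mary` while a live `mary` exists yields `mary-2`; if `mary-2` is also taken, the result is `mary-3`.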

User messages in a DM channel now reach the agent's FastAPI runtime on port 8100 inside its LXC container; the reply is persisted as an agent-authored message in the same channel and broadcast over the chat hub so both the webapp and the PWA update in real time.

Wiring:
- AgentChatRouter (new, tinyagentos/agent_chat_router.py): fire-and-forget dispatch(message, channel). Skips non-user messages, looks up each non-user channel member as an agent, skips agents that aren't running (posts a short system reply instead), and POSTs to http://{agent.host}:8100/message with {text, from, thread_id}. Response content is written back via chat_messages.send_message. All errors caught -- broken agents don't crash the chat path.
- routes/chat.py: one-line dispatch call at the end of the HTTP post_message path and the WebSocket 'message' branch, so both entry points route identically.
- app.py: router instantiated in the lifespan after chat_hub.

No subscription plumbing, no retries -- the router is a direct adapter between two owned stores. Timeouts and connect errors become visible agent replies so the user sees what went wrong.

Adds a collapsible 'Archived' panel below the live agents list in AgentsApp. Shows each archived entry's display name, model, and relative archive time; per-row Restore and Delete Permanently buttons call the new backend endpoints with confirmations.
- parseArchiveTimestamp / relativeTimeFromTs helpers convert the YYYYMMDDTHHMMSS format the backend writes.
- ArchivedAgentsPanel is inlined (matches AgentRow / DeployWizard living in the same file) and self-hides when there are no archived entries.
- handleDelete's confirm copy now mentions archiving so users know it's recoverable.
- fetchArchived is called alongside fetchAgents at every existing refresh point.

Unit tests for the new helpers under desktop/src/apps/__tests__/AgentsApp.archived.test.tsx.
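The frontend helpers themselves are TypeScript, but the YYYYMMDDTHHMMSS format they parse can be illustrated from the backend side with a small Python sketch. The function names are hypothetical, and treating the timestamp as UTC is an assumption the PR text does not confirm.

```python
from datetime import datetime, timezone

ARCHIVE_TS_FORMAT = "%Y%m%dT%H%M%S"  # YYYYMMDDTHHMMSS


def archive_container_name(slug: str, now: datetime) -> str:
    """Build the taos-archived-{slug}-{ts} name used for archived containers."""
    return f"taos-archived-{slug}-{now.strftime(ARCHIVE_TS_FORMAT)}"


def parse_archive_timestamp(ts: str) -> datetime:
    """Parse a YYYYMMDDTHHMMSS string; UTC is assumed here."""
    return datetime.strptime(ts, ARCHIVE_TS_FORMAT).replace(tzinfo=timezone.utc)
```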

Incus refuses to rename a running container, so every archive call now has a hard dependency on the container being stopped first. The new stop-force path sends --force (LXC) or kill (Docker) so archive can guarantee the container is down before it attempts the rename. The add_proxy_device method is added to the abstract base, LXC backend, and Docker stub so the deployer can attach incus proxy devices for host-side port forwarding when setting up an agent home.

…0.0.1 Agents now get a dedicated home directory mounted at /root inside the container so their runtime state (env file, model config, logs) persists across container recreation. Proxy devices are attached via the new add_proxy_device backend method so the host-side LiteLLM process can reach the in-container agent port. The taos_host default is hardened to 127.0.0.1 so freshly deployed agents always resolve back to the host loopback rather than relying on a potentially incorrect network variable.

…keys Previously is_running only checked the subprocess handle, which is None for processes the deployer did not start itself (adopted instances). The method now also checks the _adopted flag so that a pre-existing LiteLLM process is correctly reported as running and a fresh API key is minted rather than the deployer trying to start a second instance. The companion reload_config path also skips process management when adopted.

The install script previously wrote configuration to a path that was wiped on container recreation. Now the env file is written to /root/.openclaw/env which sits inside the persistent agent-home mount, so credentials and model config survive container restarts and upgrades without reinstalling. The script also accepts values from environment variables so the deployer can inject them at provision time.

A small module that locates and rewrites the env file inside the agent-home directory on the host without entering the container. This is used by the restore path so a freshly issued LiteLLM key and updated endpoint can be injected into the persistent /root/.openclaw/env without having to reinstall the framework, which would risk breaking the agent's installed state.

…ewrite env on restore Archive now force-stops the container before rename — incus refuses to rename a running instance, and silently leaving it running produced orphan containers. If the rename itself fails, the config entry is left in the live list rather than being moved to archive, keeping the system consistent. The agent home directory now travels with workspace and memory into the archive bucket so the full /root is preserved. On restore, the new host-side env rewrite helper updates /root/.openclaw/env with the freshly issued LiteLLM key and endpoint rather than reinstalling, which avoids breaking the installed framework.

Adds a stable local token that is bootstrapped once at startup and written to a known path with mode 0600. Any request carrying it in an Authorization: Bearer header is granted full access without a session cookie, allowing automated agents and local scripts to call the API without going through the browser-based login flow. The middleware sits before the session check so it has no impact on normal browser sessions.
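A rough sketch of the two pieces described here: bootstrapping a token file with mode 0600 and checking a Bearer header against it. Function names, the token length, and the use of a constant-time comparison are assumptions for illustration; the middleware wiring is omitted.

```python
import hmac
import os
import secrets
from pathlib import Path
from typing import Optional


def bootstrap_local_token(path: Path) -> str:
    """Create the token file once with mode 0600; reuse it on later boots."""
    if path.exists():
        return path.read_text().strip()
    token = secrets.token_urlsafe(32)
    # O_EXCL avoids clobbering a file created concurrently; 0o600 = owner-only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(token)
    return token


def is_local_token_request(authorization: Optional[str], token: str) -> bool:
    """True when the Authorization header carries the expected Bearer token."""
    if not authorization or not authorization.startswith("Bearer "):
        return False
    presented = authorization[len("Bearer "):]
    return hmac.compare_digest(presented.encode(), token.encode())
```

A request that fails this check would simply fall through to the normal session check.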

Documents the agent-home directory layout and mount strategy so it is clear what lives inside the container, what is persisted on the host, and how the env-rewrite helper fits into the restore flow.

notify_task_complete was never called from the image generation route, leaving the RKNN SD server running indefinitely after requests completed. The _legacy_generate path (used when the resource scheduler is absent) now wraps the backend HTTP call in try/finally so notify_task_complete fires on both success and error paths. Chat and embedding traffic routes through LiteLLM and does not hit this endpoint; that keep-alive path is handled separately via a LiteLLM callback. Tests added for both success and failure paths.
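The try/finally shape of the fix can be sketched as follows. The wrapper name and the callable-based signature are hypothetical; in the PR the backend HTTP call and notify_task_complete live inside the _legacy_generate path.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def generate_with_notify(
    call_backend: Callable[[], Awaitable[Any]],
    notify_task_complete: Callable[[], None],
) -> Any:
    """Run the backend call and always signal completion, even on error."""
    try:
        return await call_backend()
    finally:
        # Fires on success and on exception, so the keep-alive can wind down.
        notify_task_complete()
```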

Image-gen backend types were silently skipped in loaded_models, causing the Activity widget to show "Loaded Models (0)" even when the RKNN SD server was active. Two new branches added to the probe loop:
- rknn-sd: GET {url}/v1/models (rknn_sd_server.py speaks OpenAI-compat), emits one entry per model with purpose=image-generation.
- sd-cpp: GET {url}/sdapi/v1/options, reads sd_model_checkpoint for the active checkpoint name, falls back to "unknown" if absent.

Both branches follow the existing ConnectError/Timeout/HTTPError swallow pattern. Tests cover success, missing-checkpoint fallback, and offline (connection refused) for both backend types.

Captures go per-agent in the bind-mounted home folder so archive, restore, backup, and cross-worker migration all work via the existing "move the home folder" rule.

Each agent's .taos/trace/ directory holds one SQLite bucket per UTC hour (YYYY-MM-DDTHH.db). Bucket routing is driven by the event's created_at, not wall-clock at write time -- a 14:59:59.999 event routed at 15:00:00.001 lands in the T14 file, so rollover never drops events.

Zero-loss: every write lands in the SQLite bucket or is appended to a sibling YYYY-MM-DDTHH.jsonl. Nothing is ever silently dropped. The librarian merges both sources at read time.

The envelope is v1 and stable: v, id, trace_id, parent_id, created_at, agent_name, kind, channel_id, thread_id, backend_name, model, duration_ms, tokens_in, tokens_out, cost_usd, error, payload. Kinds are enumerated (message_in/out, llm_call, tool_call/result, reasoning, error, lifecycle); each has a documented payload shape so consumers parse without guessing. trace_id + parent_id enable cross-event linkage for reconstructing a full turn end-to-end.

POST /api/trace writes; GET /api/agents/{name}/trace reads with filter + limit. POST /api/lifecycle/notify lets the LiteLLM callback reset the keep-alive timer for whichever backend served a request.
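Routing by the event's own created_at, rather than by the clock at write time, comes down to deriving the bucket name from the event timestamp. A minimal sketch, assuming created_at is available as a timezone-aware datetime (the stored representation is not specified in the PR):

```python
from datetime import datetime, timezone


def bucket_filename(created_at: datetime) -> str:
    """Map an event timestamp to its hourly UTC bucket, e.g. 2026-04-18T14.db."""
    if created_at.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    return created_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H") + ".db"
```

An event stamped 14:59:59.999 UTC maps to the T14 bucket no matter when the write actually happens.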

Registered in generated litellm_config.yaml under general_settings.custom_callbacks. Runs inside the LiteLLM subprocess with no access to taOS's Python state, so it authenticates to taOS via the local token file on disk and posts over HTTP to /api/trace and /api/lifecycle/notify. Agent name is derived from the virtual key alias ("taos-<slug>") that the deployer sets when minting per-agent keys. This is how the per-agent trace store knows which bucket to route to for a given completion. Failure-mode is swallow-and-log: a broken callback must never fail a real LLM request. A litellm-not-installed environment gets a no-op stub so tests pass without the dep.

…ontainers app.py: instantiate the registry on the data_dir, include the trace router, close all connections on shutdown. deployer.py: inject TAOS_LOCAL_TOKEN (read from data_dir/.auth_local_token at deploy time) and TAOS_TRACE_URL into the container env. Any in-container runtime that wants to post traces (or that we later replace with real openclaw and tap via gateway events) has the credential and endpoint ready.

…+ archive/trace Switches env-snippet from host.docker.internal to 127.0.0.1 and explains incus proxy devices. Drops Docker-only qualifier from workspace/memory status table (LXC now has parity). Adds Per-agent trace capture, Agent archive/restore, and Programmatic access (local token) sections. Extends Related and adds a Related code list pointing at the new modules.

Adds a section distinguishing user memory (long-lived user context) from per-agent trace capture (event log inside agent-home). Explains how the taOSmd librarian bridges both layers and links to the trace design in framework-agnostic-runtime.md.

Step-by-step procedures for archiving a live agent, listing archives, restoring with slug collision handling and LiteLLM key rotation, and permanent purge. Covers failure modes including container rename failure, archive dir collision, and restore container conflicts.

…ttribution Covers the three-endpoint trace API surface with curl examples, query filter parameters, envelope field table, kind/payload reference, direct SQLite access pattern, cost attribution recipe, and librarian consumption pattern. Links to trace_store.py, routes/trace.py, and litellm_callback.py.

… semantics DELETE /api/agents/{name} now archives rather than hard-deletes. Updates the endpoint table to show the archive path and the new purge endpoint.

Primary reference for the real openclaw integration: gateway protocol breakdown, install + runtime, config schema, extension model, known limitations, 35-row capability map, and a 4-phase MVP-to-full roadmap. MVP path is the bridge adapter from the 2026-04-11 framework-integration-bridge-design spec, not the operator-client (raw v3 WS from taOS). The operator-client is kept as a documented fallback only, because it couples taOS to openclaw's gateway protocol version and any upstream bump can break the fleet; the bridge isolates coupling to a single ~200 LoC patch inside our jaylfc/openclaw fork. Review-gate refinements baked into Step 1: feature-flag the patch entry (so an unset TAOS_BRIDGE_URL gives upstream-identical builds), version-stamp the bootstrap, single coupling discipline, channels.kind "external" + provider "taos" (upstreamable), 400 LoC patch ceiling, automated persistence-audit as the trust anchor, parallel upstream PRs, LiteLLM key rotation caveat (safe-on-restart today, reload RPC later). Fixes the Debian-bookworm Node 18 gap (install Node 22.14+ via NodeSource before npm install) and the stale 500MB manifest disk size (real openclaw is 1-2GB on disk). Appendix B lists 12 docs.openclaw.ai pages that 404'd at research time; a follow-up pass using gh api on the repo docs/ tree fills those gaps.

Resolved 2 of 7 Appendix A open questions from primary source on github.com/openclaw/openclaw. Struck through 9 of the 12 404'd docs.openclaw.ai URLs in Appendix B where the repo had a mirror. Added  comments so future readers know which claims are primary-sourced. MVP impact: startup health-check loop updated from ss fallback to `openclaw health --timeout` (Q5 resolved); gateway.bind: "lan" confirmed as the correct key for container external binding (Q1 resolved).

Trace files older than 2h are chmod'd 0o400 during eviction so the librarian's source-of-truth for historic agent activity is tamper-proof on-disk. Rare late-arriving events (clock skew, deferred processing) route to a sibling {bucket}.late.jsonl which stays writable -- zero-loss guarantee preserved even for the extreme edge. list() merges .db + .jsonl + .late.jsonl with dedup by event id (primary wins). Sealing runs opportunistically inside _evict_old_buckets; no background task.
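The merge-with-dedup behaviour of list() can be sketched as below, assuming each source has already been loaded into a list of envelope dicts. The function name and the final sort by created_at are assumptions; the PR only states that events are deduplicated by id and that the primary (.db) copy wins.

```python
from typing import Any, Dict, Iterable, List

Event = Dict[str, Any]


def merge_events(primary: Iterable[Event], *fallbacks: Iterable[Event]) -> List[Event]:
    """Merge sources in priority order; the first copy of each event id wins."""
    merged: Dict[str, Event] = {}
    for source in (primary, *fallbacks):
        for event in source:
            merged.setdefault(event["id"], event)
    # Assumed ordering: oldest first by created_at.
    return sorted(merged.values(), key=lambda e: e["created_at"])
```

Passing the .db rows first, then the .jsonl and .late.jsonl rows, gives the "primary wins" behaviour.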

…og + model discovery

- LLMProxy accepts database_url; app reads data/.litellm_db_url at boot and exports it as DATABASE_URL into the litellm subprocess so /key/generate can mint per-agent virtual keys.
- Add Provider fills canonical URL from PROVIDER_URL_DEFAULTS and probes {url}/models to populate the model list when empty; this is generic across openai, anthropic, openrouter, kilocode (no per-type branching on the probe). Falls back to per-type seed list (kilocode → kilo-auto/free) when the probe returns nothing so the entry still registers at least one routable model.
- Deployer scopes the minted virtual key to the agent's primary + fallback models (models=[req.model, *fallback_models]) instead of defaulting to the unrestricted "default" alias.
- Deployer fails loudly when a DB is configured but /key/generate still returns None: hiding that class of failure is what shipped the broken kilocode path in the first place.
- generate_litellm_config now WARNs when a cloud-type backend is missing url or models, so silent drops surface in logs instead of showing up as a broken agent much later.
- scripts/repair_providers.py repairs legacy config.yaml entries that pre-date the autofill/discovery logic.
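A sketch of the key-scoping and fail-loud behaviour described above. The function names are hypothetical. The payload fields (`key_alias`, `models`, `metadata`) are ones LiteLLM's /key/generate accepts, and the `taos-<slug>` alias and `metadata.agent` value follow what other commits in this PR describe, but the exact request the deployer sends is an assumption.

```python
from typing import Any, Dict, List, Optional


def build_key_request(slug: str, model: str, fallback_models: List[str]) -> Dict[str, Any]:
    """Payload for POST /key/generate, scoped to the agent's own models."""
    return {
        "key_alias": f"taos-{slug}",
        "models": [model, *fallback_models],
        "metadata": {"agent": slug},
    }


def require_minted_key(key: Optional[str], database_configured: bool) -> Optional[str]:
    """Fail loudly when a DB is configured but no key came back."""
    if database_configured and key is None:
        raise RuntimeError(
            "LiteLLM /key/generate returned no key although a database is configured"
        )
    return key
```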

Generated LiteLLM configs use os.environ/<name> markers to reference provider api keys, but nothing was actually exporting those names into the subprocess env. Cloud providers therefore hit the litellm OpenAIException "api_key client option must be set" even with a correctly-configured backend list. LLMProxy.start/reload_config now accept a secrets={name: value} map. app.py resolves each backend.api_key_secret from the secrets store at boot and again on catalog-change reload; routes/providers.py does the same on add/patch/delete so newly-added or rotated keys take effect without a full app restart.
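The gap described here is that `os.environ/<name>` markers in the config only work if `<name>` really exists in the subprocess environment. A minimal sketch of building that environment and checking for unresolved markers; both function names are hypothetical and the regex-based check is an illustration, not something the PR describes.

```python
import re
from typing import Dict, List, Mapping, Optional

_ENV_REF = re.compile(r"os\.environ/([A-Za-z_][A-Za-z0-9_]*)")


def build_proxy_env(
    base_env: Mapping[str, str],
    secrets: Mapping[str, str],
    database_url: Optional[str] = None,
    local_token: Optional[str] = None,
) -> Dict[str, str]:
    """Environment for the LiteLLM subprocess: base env plus resolved secrets."""
    env = dict(base_env)
    env.update(secrets)
    if database_url:
        env["DATABASE_URL"] = database_url
    if local_token:
        env["TAOS_LOCAL_TOKEN"] = local_token
    return env


def unresolved_refs(config_text: str, env: Mapping[str, str]) -> List[str]:
    """Names referenced as os.environ/<name> in the config but absent from env."""
    return sorted({name for name in _ENV_REF.findall(config_text) if name not in env})
```

Running `unresolved_refs` before starting the proxy would flag a missing key before the first request fails.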

…nfigured LiteLLM's /key/generate requires a Postgres-backed Prisma schema, but LiteLLM does not run migrations itself. Fresh installs had to manually run `pip install prisma && prisma generate && prisma db push` before virtual keys worked. New tinyagentos/litellm_migrate.py locates the bundled schema at litellm/proxy/schema.prisma, probes for LiteLLM_VerificationToken in the configured DB, and shells out to the venv's prisma CLI only when the table is missing. Idempotent — safe on every boot. Called from the lifespan hook before LLMProxy.start() so LiteLLM sees a ready schema. Added prisma>=0.11.0 to the proxy optional dependency group so the CLI lands in the venv on fresh installs.

…g shim for get_instance_fn

… cache

…solves under systemd

…un migration

…gration Running prisma db push from our bootstrap created tables without seeding _prisma_migrations, so LiteLLM's own prisma migrate deploy at startup tried to apply migration #1 against an already-populated schema and looped on "type JobStatus already exists", leaving the proxy unhealthy. Our helper's only job now is to make prisma.client importable so LiteLLM can run its shipped migrations itself. Drop the db push and the psql/psycopg probe; keep the systemd PATH fix for prisma generate.

LiteLLM's proxy_cli shells out ``subprocess.run(["prisma"])`` during startup to detect whether Prisma is runnable. Under systemd the service's default PATH doesn't include our venv's bin/, so the lookup raises FileNotFoundError and LiteLLM prints "prisma package not found" and skips DB setup entirely, leaving virtual-key issuance broken even though the package IS installed in the venv. Prepend the venv bin that already hosts the litellm binary so the child process resolves ``prisma`` (and ``prisma-client-py`` for generate). Also bump the startup wait from 30s to 120s: LiteLLM on a fresh Pi DB runs ``prisma migrate deploy`` before opening its HTTP port, which takes 45-60s on ARM.
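The PATH fix amounts to putting the directory of the litellm binary first in the child's PATH. A small sketch; the function name is hypothetical and the de-duplication of an existing entry is an added nicety, not something the PR mentions.

```python
import os
from pathlib import Path
from typing import Dict, Mapping


def env_with_venv_bin(base_env: Mapping[str, str], litellm_binary: str) -> Dict[str, str]:
    """Copy base_env with the litellm binary's directory first on PATH."""
    env = dict(base_env)
    venv_bin = str(Path(litellm_binary).parent)
    # Drop empty entries and any existing copy so the venv bin appears once, first.
    rest = [p for p in env.get("PATH", "").split(os.pathsep) if p and p != venv_bin]
    env["PATH"] = os.pathsep.join([venv_bin, *rest])
    return env
```

The resulting dict would be passed as `env=` when spawning the LiteLLM subprocess.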

stderr=DEVNULL silently swallowed proxy startup failures (prisma migration errors, config parse errors, model-router failures), turning "why is the proxy unhealthy?" into a 30-minute debugging hunt. Write stderr to a file next to litellm_config.yaml so operators can read it without attaching strace.

Two separate bugs kept LiteLLM from ever settling on the Pi.

1. Startup polling hit ``/health``, which gates on the master key and returns 401 for an unauthenticated client. LiteLLM was healthy within ~50s but ``start()`` kept polling until the 120s timeout, logged "failed to start within 120s", and returned False even though the subprocess was fine. ``/health/readiness`` is the public endpoint.
2. ``reload_config`` sent SIGHUP to trigger a config reload. LiteLLM runs as single-worker uvicorn (no ``--workers``), which does not register a SIGHUP handler, so the default action (terminate) fires. Every ``/api/providers/models?refresh=true`` was silently killing the proxy, then ``_fetch_litellm_models`` got connection-refused and returned []. Drop SIGHUP entirely; the existing stop+start path was already the fallback.

Also switch the foreign-process probe to ``/health/readiness`` for the same 401 reason.
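A startup wait that polls the unauthenticated readiness endpoint could look like the sketch below. The function names, the one-second interval, and the injectable probe/sleep/clock parameters are assumptions for illustration; the real start() uses its own HTTP client and process checks.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def _http_probe(url: str) -> bool:
    """True when the URL answers 200; any connection or HTTP error counts as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    base_url: str,
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
    probe: Optional[Callable[[str], bool]] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll {base_url}/health/readiness until it succeeds or the timeout expires."""
    probe = probe or _http_probe
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe(f"{base_url}/health/readiness"):
            return True
        sleep(interval_s)
    return False
```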

The TaosLiteLLMCallback running inside the LiteLLM subprocess POSTs llm_call events back to the taOS bridge at ``/api/trace``, which requires the local auth token. The callback's token-discovery logic checks ``TAOS_LOCAL_TOKEN`` env first, then ``/data/.auth_local_token`` and ``~/.taos/.auth_local_token``. Under systemd the real token lives at ``{data_dir}/.auth_local_token`` — none of the candidate paths — so every callback fired a POST without Authorization and taOS responded 401, leaving trace rows with no ``llm_call`` events despite LiteLLM actually processing requests. Read the token in app.py and forward it via the new ``local_token`` constructor kwarg on LLMProxy, which exports it into the subprocess env.

LiteLLM 1.83.4 surfaces the agent slug in litellm_params.metadata under user_api_key_metadata.agent (matching what LLMProxy.create_agent_key writes when minting the virtual key). The previous extraction read metadata.key_alias which is no longer populated on success events, so every llm_call trace was bucketed under the _unknown_ sentinel slug.

Walks four sources in priority order:
1. user_api_key_metadata.agent
2. user_api_key_auth_metadata.agent
3. user_api_key_alias (strips the taos- prefix)
4. key_alias (legacy, kept for older LiteLLM builds)
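The four-source walk can be sketched as a plain function over the callback's kwargs. The function name is hypothetical. That all four fields sit under litellm_params.metadata, and that the legacy key_alias also has its taos- prefix stripped, are assumptions; the commit message only confirms the location of the first source and the prefix stripping for the third.

```python
from typing import Any, Dict

UNKNOWN_SLUG = "_unknown_"
ALIAS_PREFIX = "taos-"


def _strip_prefix(alias: str) -> str:
    return alias[len(ALIAS_PREFIX):] if alias.startswith(ALIAS_PREFIX) else alias


def extract_agent_slug(kwargs: Dict[str, Any]) -> str:
    """Resolve the agent slug from LiteLLM callback kwargs, in priority order."""
    metadata = (kwargs.get("litellm_params") or {}).get("metadata") or {}
    # Sources 1 and 2: explicit agent field written at key-minting time.
    for field in ("user_api_key_metadata", "user_api_key_auth_metadata"):
        agent = (metadata.get(field) or {}).get("agent")
        if agent:
            return agent
    # Sources 3 and 4: fall back to the key alias ("taos-<slug>").
    for field in ("user_api_key_alias", "key_alias"):
        alias = metadata.get(field)
        if alias:
            return _strip_prefix(alias)
    return UNKNOWN_SLUG
```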

enqueue_user_message now writes a message_in trace event under the agent's slug, following the ENVELOPE_V1_SCHEMA message_in shape ({from, text}) with extra informational fields (message_id, author_type, delivery). Guards against orphan _unknown_ or empty-slug entries. Fails soft: trace write errors are logged, never raised.

…ilt PWA)

…hen container absent (#221) Failed deploys leave behind a config row with no LXC container, which caused DELETE /api/agents/{name} to error on snapshot_create. Probe container_exists first; for orphans, skip stop/snapshot, revoke any LiteLLM key, and either hard-delete the row (no history) or record a tombstone (chat/trace present so purge is available from Archived). Adds container_exists helper to tinyagentos.containers; four new tests cover the orphan hard-delete, orphan tombstone, skipped-snapshot assertion, and purge of a snapshotless tombstone.
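The delete decision described in this commit reduces to two booleans: whether the container exists and whether any chat/trace history is present. A sketch under that reading; the enum and function names are hypothetical, and LiteLLM key revocation (which happens in every branch) is left out.

```python
from enum import Enum


class DeleteAction(Enum):
    ARCHIVE = "archive"          # container exists: stop, snapshot, archive as usual
    HARD_DELETE = "hard_delete"  # orphan with no history: drop the config row
    TOMBSTONE = "tombstone"      # orphan with history: keep an entry so purge is available


def plan_delete(container_exists: bool, has_history: bool) -> DeleteAction:
    """Pick the delete path for an agent row."""
    if container_exists:
        return DeleteAction.ARCHIVE
    return DeleteAction.TOMBSTONE if has_history else DeleteAction.HARD_DELETE
```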

Adds a GitHub Actions workflow that builds per-arch Debian 13 LXC base images with Node 22, openclaw, and recycle-bin scaffolding already installed. Published as assets on the 'rolling-images' Release tag. The deployer now checks for the 'taos-openclaw-base' image alias before launching; when present it uses the cached image and sets TAOS_BASE_IMAGE_PRESENT=1 so install.sh skips the apt-get + npm steps. Without the image the deployer falls back transparently to images:debian/bookworm and install.sh does the full install. tinyagentos.agent_image exposes is_image_present and ensure_image_present helpers; the latter runs as a background task on app startup to bootstrap the image on first boot. Closes #220

… stdin The previous impl passed curl.stdout (a Python StreamReader) as stdin= to the incus subprocess, which asyncio cannot forward as an OS-level FD. Curl would read the first ~90KB then block on a pipe nobody was draining. Using an explicit os.pipe() pair with the read end handed to incus and the write end to curl gives us a real kernel pipe and the import completes.

Incus 6.x rejects '-' as stdin for image import and rejects bare HTTPS URLs (expects an incus image server). Download to a temp file then pass its path. Also fix image list query: positional <alias> arg (--filter=alias=... is only valid for container list).

…acker The design doc still referenced models.providers.taos (a custom provider that was abandoned mid-implementation in favour of openclaw's built-in litellm provider type). Updated the bootstrap example, the integration tracker table, and the openclaw.json shape to match what actually ships. The channels-side "provider: taos" identifier is unchanged; that's the channel-kind name, separate from the LLM provider.

kilo-code-bot · 2026-04-18T14:58:34Z

Code Review Summary

Status: 6 Issues Found | Recommendation: Address before merge

Overview

Severity	Count
CRITICAL	1
WARNING	5
SUGGESTION	0

Issue Details

CRITICAL

File	Line	Issue
`app-catalog/agents/openclaw/scripts/install.sh`	764	SUID/SGID bit not cleared on rm wrapper. `/usr/local/bin/rm` is created with mode 755. Since this script shadows the system `rm` binary, any process running as root that executes `rm` will invoke this wrapper. If this wrapper ever had vulnerabilities, it would be trivial to escalate privileges across the container. Also note that `trash-put` is invoked with whatever user called `rm`, without privilege dropping.

WARNING

File	Line	Issue
`.github/workflows/build-agent-images.yml`	52	Unsafe iptables default policy change: `sudo iptables -P FORWARD ACCEPT` changes the system-wide default policy for forwarded traffic to ACCEPT. This disables host network firewalling for all forwarded traffic including other containers and bridge interfaces. All container traffic will be implicitly allowed without any filtering.
`.github/workflows/build-agent-images.yml`	55	Dynamic interface masquerade rule: The rule uses `$(ip route get 1.1.1.1
`desktop/src/apps/AgentsApp.tsx`	1488	Missing error handling for archive fetch: `fetchArchived()` silently catches all exceptions without setting any error state. If this API endpoint returns 4xx/5xx or network fails, no user feedback is shown; the UI will just never display archived agents.
`desktop/src/apps/AgentsApp.tsx`	1120	Unchecked Promise.allSettled values: When fetching disk states, only rejected promises are skipped. Fulfilled promises with `null` values will be added to the result map as `undefined`. Accessing properties on these will throw null reference errors when rendered in `AgentRow`.
`desktop/src/apps/MessagesApp.tsx`	2267	Cross-app event handler missing cleanup guard: The `taos:open-messages` event listener is added in a useEffect, but the cleanup function does not check whether the component is still mounted before calling state setters. If the component unmounts while the admin prompt fetch is in flight, it will trigger state updates on an unmounted component.

Files Reviewed (8 files)

.github/workflows/build-agent-images.yml - 2 issues
.gitignore
app-catalog/_common/scripts/recycle-bin-install.sh
app-catalog/agents/agent-zero/manifest.yaml
app-catalog/agents/openclaw/manifest.yaml
app-catalog/agents/openclaw/scripts/install.sh - 1 issue
desktop/src/apps/AgentsApp.tsx - 2 issues
desktop/src/apps/MessagesApp.tsx - 1 issue


Reviewed by seed-2-0-pro-260328 · 170,139 tokens

CRITICAL:
- install.sh: explicit chmod a-s on /usr/local/bin/rm wrapper. Mode 755 doesn't have SUID set, but a future edit accidentally bumping to 4755 on a root-shadow rm would be a textbook escalation primitive. Belt-and-suspenders against that.

WARNING:
- build-agent-images.yml: stop flipping FORWARD chain default policy to ACCEPT (broadly disables runner forward filtering); rely on the explicit ACCEPT rules for incusbr0 already present. Capture outbound interface once before the masquerade rule so a default-route flip mid-job can't redirect the rule to a different NIC.
- AgentsApp.fetchArchived: log non-OK responses, content-type mismatches, and exception bodies to console.warn instead of swallowing silently; an Archived list left empty by a failed fetch is now distinguishable from "no archived agents" in DevTools.
- AgentsApp.fetchDiskStates: add explicit nullness + structural checks on Promise.allSettled fulfilled values. Existing code worked because null is falsy, but the defensive shape check makes future regressions loud instead of crashing AgentRow on render.
- MessagesApp open-messages handler: add cancelled flag to the useEffect cleanup so admin-prompt fetches resolving after unmount don't trigger setState on an unmounted component.

Frontend bundle rebuilt. No behavioural changes for any happy path.

jaylfc added 30 commits April 16, 2026 23:53

build: rebuild desktop bundle with Archived section and agent wiring

ea7c81b

docs: per-agent home mount in framework-agnostic-runtime

4d9e34e

Documents the agent-home directory layout and mount strategy so it is clear what lives inside the container, what is persisted on the host, and how the env-rewrite helper fits into the restore flow.

docs(design): update plan-agent-deployer API table to reflect archive…

fd055ba

… semantics DELETE /api/agents/{name} now archives rather than hard-deletes. Updates the endpoint table to show the archive path and the new purge endpoint.

jaylfc added 23 commits April 17, 2026 23:03

fix(litellm_callback): wire callbacks under litellm_settings + siblin…

dd383b7

…g shim for get_instance_fn

feat(providers): /api/providers/models passthrough with refresh + ttl…

b0563f8

… cache

feat(agents): agent-creation model picker reads from LiteLLM passthrough

7ae624a

fix(litellm_migrate): prepend venv bin to PATH so prisma-client-py re…

d1b6d76

…solves under systemd

fix(litellm_migrate): psql probe fallback so boot doesn't wrongly rer…

c749ae3

…un migration

feat(agents): persist optional emoji on agent record + deploy API

3c092e0

feat(agents): emoji picker in create flow + display in agent UI (rebu…

d01a819

…ilt PWA)

ci(agents): fix bridge forwarding + NAT for incus in GHA runner

f0442d5

jaylfc merged commit 93e7474 into master Apr 18, 2026
7 checks passed

jaylfc deleted the fix/agent-creation-flow branch April 18, 2026 15:07

This was referenced Apr 18, 2026

fix: address 6 Kilo Code Review findings on #225 #226

Merged

bug: DELETE /api/agents/{name} fails when container was never created #221

Closed
