Agent deploy + observability overhaul: per-agent virtual keys, full trace, pre-built base image #225
Merged
Adapters imported uvicorn at module top, so anything that imported them for structural checks (tests, health-endpoint probes) would crash with ModuleNotFoundError when uvicorn wasn't installed. uvicorn.run is only needed when an adapter is run as a standalone process — move the import into the __main__ guard. Clears 19 pre-existing test failures across test_new_adapters.py and test_channel_hub_new.py.
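A minimal sketch of the deferred-import pattern described above. The module name `example_adapter`, the `health` helper, and the host/port values are illustrative assumptions, not the actual adapter code; a real adapter would call `main()` from its `__main__` guard.

```python
"""Hypothetical adapter module: importable even when uvicorn is not installed."""

HOST = "0.0.0.0"  # assumed bind address
PORT = 8100       # assumed port


def health() -> dict:
    """Structural check that tests and health probes can call after a plain import."""
    return {"status": "ok", "adapter": "example"}


def main() -> None:
    """Standalone entry point; the only place that needs uvicorn."""
    # Deferred import: a missing uvicorn only matters when the adapter is
    # actually run as a process, never when it is merely imported.
    import uvicorn

    uvicorn.run("example_adapter:app", host=HOST, port=PORT)
```

Importing the module and calling `health()` touches nothing from uvicorn, which is the property the structural tests rely on.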
Required by the new agent archive lifecycle: the delete path stops
the container then renames it to a dated `taos-archived-{slug}-{ts}`
bucket so a later restore can rename it back. Implemented for both
LXC (incus rename) and Docker (docker rename).
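A rough sketch of how the archive name and the per-backend rename command could be built. The function names are hypothetical; the UTC `YYYYMMDDTHHMMSS` timestamp is an assumption based on the format the archive UI parses.

```python
from datetime import datetime, timezone


def archived_name(slug: str, when: datetime) -> str:
    """Build the dated archive name, e.g. taos-archived-demo-20260418T120005."""
    ts = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S")
    return f"taos-archived-{slug}-{ts}"


def rename_command(backend: str, old: str, new: str) -> list[str]:
    """Return the argv that renames a stopped container on the given backend."""
    if backend == "lxc":
        return ["incus", "rename", old, new]
    if backend == "docker":
        return ["docker", "rename", old, new]
    raise ValueError(f"unknown container backend: {backend!r}")
```

A restore would run the same command with the two names swapped.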
…install
Broken before: `pip3 install openclaw` ran unconditionally, its failure was logged as a warning and the deploy continued, and the container came up missing the deps most agent frameworks need.
Now:
- apt install includes nodejs, npm, build-essential, python3-dev, ca-certificates, gnupg, wget, with DEBIAN_FRONTEND=noninteractive and --no-install-recommends (timeout 15m for slow arm64 apt).
- Framework install dispatches on the manifest's install.method: pip uses manifest.install.package, script pushes + runs manifest.manifest_dir / install.script. Missing script files, unsupported methods, and non-zero exits all raise RuntimeError so the outer try/except rolls back the container and the agent shows status=failed instead of misleadingly 'running'.
- TAOS_MODEL env var is injected so the in-container runtime knows which model to send to LiteLLM.
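The install.method dispatch can be sketched as below. This is a simplified illustration: `plan_framework_install` is a hypothetical name, the manifest is modelled as a plain dict, and the sketch only returns the command to run rather than pushing the script into the container.

```python
from pathlib import Path


def plan_framework_install(manifest: dict, manifest_dir: Path) -> list[str]:
    """Pick the install command from the manifest, failing loudly on bad input."""
    install = manifest.get("install", {})
    method = install.get("method")
    if method == "pip":
        return ["pip3", "install", install["package"]]
    if method == "script":
        script = manifest_dir / install["script"]
        if not script.is_file():
            # A missing script is a hard failure so the caller can roll back.
            raise RuntimeError(f"install script not found: {script}")
        return ["bash", str(script)]
    raise RuntimeError(f"unsupported install method: {method!r}")
```

Raising instead of logging is what lets the outer try/except mark the agent as failed.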
…ation, hot reload
generate_litellm_config now:
- Registers openrouter (openrouter/ prefix, native LiteLLM support) and kilocode (openai-compatible, explicit api_base) in the backend type maps.
- Expands each cloud backend's declared models into their own model_list entries keyed on the real model id, so agents can request a specific model. The 'default' alias is still appended as a fallback.
routes/providers.py: add/patch/delete now call proxy.reload_config instead of the stale proxy.write_config, so the running LiteLLM subprocess actually picks up config changes.
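A simplified sketch of the per-model expansion, assuming backends are plain dicts with `type`, `url`, `models`, and `api_key_secret` fields. `expand_model_list` is a hypothetical name, and pointing the 'default' alias at the first expanded entry is an assumption made for illustration only.

```python
def expand_model_list(backends: list[dict]) -> list[dict]:
    """Turn each declared model into its own LiteLLM model_list entry."""
    entries: list[dict] = []
    for backend in backends:
        for model_id in backend.get("models", []):
            if backend["type"] == "openrouter":
                # Native LiteLLM provider: selected by the model prefix.
                params = {"model": f"openrouter/{model_id}"}
            else:
                # OpenAI-compatible provider: needs an explicit api_base.
                params = {"model": f"openai/{model_id}", "api_base": backend["url"]}
            params["api_key"] = f"os.environ/{backend['api_key_secret']}"
            entries.append({"model_name": model_id, "litellm_params": params})
    if entries:
        # Assumption: the 'default' alias mirrors the first real entry.
        entries.append({"model_name": "default",
                        "litellm_params": dict(entries[0]["litellm_params"])})
    return entries
```

Keying `model_name` on the real model id is what lets an agent request a specific model instead of the shared alias.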
The manifest declares method: script -> scripts/install.sh, which
didn't exist. The deployer had no way to install openclaw, so the
agent came up with no runtime and the chat path had nothing to hit.
The new script, run once inside a fresh Debian bookworm LXC:
- Creates /opt/openclaw with a pinned venv (fastapi, uvicorn,
httpx, openai).
- Writes a minimal FastAPI runtime at /opt/openclaw/server.py that
listens on 0.0.0.0:8100, accepts POST /message {text, from,
thread_id?} and forwards to LiteLLM using the injected
OPENAI_BASE_URL, OPENAI_API_KEY, and TAOS_MODEL env vars.
- Installs a systemd unit so the runtime survives restarts.
- Polls /health up to 20s and fails the install if the server
didn't come up.
No memory, no tools, no persistence — the host owns all of that.
This is the minimum for the end-to-end chat pipeline to land
messages on an agent and get a reply back.
…fecycle
Several related changes to the agents API and config model that
together make agent creation survive the full round trip:
- Every agent gets a stable 12-char uuid (agent['id']), backfilled
for existing config entries by normalize_agent.
- body.model and body.framework land on the agent row at create
time; llm_key lands after the background deploy succeeds.
- A 1:1 DM channel is auto-created on successful deploy and its
id persisted as chat_channel_id so the Messages app sees the
agent immediately.
- extra_config to deploy_agent now always includes the app
registry so the manifest-aware framework install can resolve.
Delete is now archive, not destroy. DELETE /api/agents/{name}:
stops the container, renames it to taos-archived-{slug}-{ts},
moves workspace/memory dirs under data_dir/archive/{slug}-{ts}/,
revokes the LiteLLM key, flags the DM channel archived, and moves
the config entry from config.agents to config.archived_agents.
New endpoints:
GET /api/agents/archived -> list archive entries
POST /api/agents/archived/{id}/restore -> reverses the archive
DELETE /api/agents/archived/{id} -> true permanent purge
Restore handles slug collisions (if a new agent has taken the
original name) by suffixing -2, -3, etc. Purge is what the old
hard-delete used to do: destroy container, rm -rf archive dir,
delete chat channel, drop the archived entry.
This also fixes 'can't re-create a deleted agent with the same
name' -- the old delete path left the LXC container around; the
new archive path renames it out of the way.
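The slug-collision rule on restore could look roughly like this. `resolve_restore_slug` is a hypothetical helper; the real restore path also has to rename the container and move directories.

```python
def resolve_restore_slug(original: str, live_slugs: set[str]) -> str:
    """Return the original slug if free, else the first free -2, -3, ... suffix."""
    if original not in live_slugs:
        return original
    n = 2
    while f"{original}-{n}" in live_slugs:
        n += 1
    return f"{original}-{n}"
```

For example, restoring `bob` while a new `bob` and `bob-2` are live yields `bob-3`.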
User messages in a DM channel now reach the agent's FastAPI runtime
on port 8100 inside its LXC container; the reply is persisted as
an agent-authored message in the same channel and broadcast over
the chat hub so both the webapp and the PWA update in real time.
Wiring:
- AgentChatRouter (new, tinyagentos/agent_chat_router.py):
fire-and-forget dispatch(message, channel). Skips non-user
messages, looks up each non-user channel member as an agent,
skips agents that aren't running (posts a short system reply
instead), and POSTs to http://{agent.host}:8100/message with
{text, from, thread_id}. Response content is written back via
chat_messages.send_message. All errors caught -- broken agents
don't crash the chat path.
- routes/chat.py: one-line dispatch call at the end of the HTTP
post_message path and the WebSocket 'message' branch, so both
entry points route identically.
- app.py: router instantiated in the lifespan after chat_hub.
No subscription plumbing, no retries -- the router is a direct
adapter between two owned stores. Timeouts and connect errors
become visible agent replies so the user sees what went wrong.
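A condensed sketch of the dispatch flow, under several assumptions: messages, channels, and agents are plain dicts, the HTTP call is injected as `post`, the runtime's reply is read from a `content` field, and replies are returned instead of being written through chat_messages.send_message.

```python
import asyncio
from typing import Awaitable, Callable

PostFn = Callable[[str, dict], Awaitable[dict]]


async def dispatch(message: dict, channel: dict, agents: dict, post: PostFn) -> list[dict]:
    """Route one user message to every agent member; never raise."""
    replies: list[dict] = []
    if message.get("author_type") != "user":
        return replies  # agent and system messages are not re-dispatched
    for member in channel["members"]:
        agent = agents.get(member)
        if agent is None:
            continue  # human member, nothing to call
        if agent.get("status") != "running":
            replies.append({"author": "system", "text": f"{member} is not running"})
            continue
        url = f"http://{agent['host']}:8100/message"
        body = {"text": message["text"], "from": message["author"],
                "thread_id": message.get("thread_id")}
        try:
            response = await post(url, body)
            replies.append({"author": member, "text": response["content"]})
        except Exception as exc:  # timeouts and connect errors become visible replies
            replies.append({"author": member, "text": f"[error] {exc}"})
    return replies
```

Catching every exception inside the loop is what keeps one broken agent from affecting the rest of the channel.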
Adds a collapsible 'Archived' panel below the live agents list in AgentsApp. Shows each archived entry's display name, model, and relative archive time; per-row Restore and Delete Permanently buttons call the new backend endpoints with confirmations.
- parseArchiveTimestamp / relativeTimeFromTs helpers convert the YYYYMMDDTHHMMSS format the backend writes.
- ArchivedAgentsPanel is inlined (matches AgentRow / DeployWizard living in the same file) and self-hides when there are no archived entries.
- handleDelete's confirm copy now mentions archiving so users know it's recoverable.
- fetchArchived is called alongside fetchAgents at every existing refresh point.
Unit tests for the new helpers under desktop/src/apps/__tests__/AgentsApp.archived.test.tsx.
Incus refuses to rename a running container, so every archive call now has a hard dependency on the container being stopped first. The new stop-force path sends --force (LXC) or kill (Docker) so archive can guarantee the container is down before it attempts the rename. The add_proxy_device method is added to the abstract base, LXC backend, and Docker stub so the deployer can attach incus proxy devices for host-side port forwarding when setting up an agent home.
…0.0.1
Agents now get a dedicated home directory mounted at /root inside the container so their runtime state (env file, model config, logs) persists across container recreation. Proxy devices are attached via the new add_proxy_device backend method so the host-side LiteLLM process can reach the in-container agent port. The taos_host default is hardened to 127.0.0.1 so freshly deployed agents always resolve back to the host loopback rather than relying on a potentially incorrect network variable.
…keys
Previously is_running only checked the subprocess handle, which is None for processes the deployer did not start itself (adopted instances). The method now also checks the _adopted flag so that a pre-existing LiteLLM process is correctly reported as running and a fresh API key is minted rather than the deployer trying to start a second instance. The companion reload_config path also skips process management when adopted.
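The adopted-process check can be illustrated with a small stand-in class. `ProxyHandle` is hypothetical and only models the two pieces of state the commit mentions.

```python
class ProxyHandle:
    """Stand-in for the proxy wrapper: tracks an owned process or an adopted one."""

    def __init__(self, process=None, adopted: bool = False):
        self._process = process    # subprocess.Popen-like handle, or None
        self._adopted = adopted    # True when a pre-existing proxy was found

    def is_running(self) -> bool:
        if self._adopted:
            # No handle exists for a process we did not start ourselves.
            return True
        return self._process is not None and self._process.poll() is None
```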
The install script previously wrote configuration to a path that was wiped on container recreation. Now the env file is written to /root/.openclaw/env which sits inside the persistent agent-home mount, so credentials and model config survive container restarts and upgrades without reinstalling. The script also accepts values from environment variables so the deployer can inject them at provision time.
A small module that locates and rewrites the env file inside the agent-home directory on the host without entering the container. This is used by the restore path so a freshly issued LiteLLM key and updated endpoint can be injected into the persistent /root/.openclaw/env without having to reinstall the framework, which would risk breaking the agent's installed state.
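A minimal sketch of a host-side env rewrite, assuming the env file is plain `KEY=VALUE` lines. `rewrite_env_file` is a hypothetical name, and the caller is expected to pass the host path that backs /root/.openclaw/env.

```python
from pathlib import Path


def rewrite_env_file(env_path: Path, updates: dict[str, str]) -> None:
    """Replace matching KEY=VALUE lines in place and append any new keys."""
    lines = env_path.read_text().splitlines() if env_path.exists() else []
    remaining = dict(updates)
    out: list[str] = []
    for line in lines:
        key, sep, _ = line.partition("=")
        key = key.strip()
        if sep and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)  # untouched lines keep their original form
    out.extend(f"{key}={value}" for key, value in remaining.items())
    env_path.parent.mkdir(parents=True, exist_ok=True)
    env_path.write_text("\n".join(out) + "\n")
```

Because only the env file changes, the installed framework inside the container is left alone.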
…ewrite env on restore
Archive now force-stops the container before rename: incus refuses to rename a running instance, and silently leaving it running produced orphan containers. If the rename itself fails, the config entry is left in the live list rather than being moved to archive, keeping the system consistent. The agent home directory now travels with workspace and memory into the archive bucket so the full /root is preserved. On restore, the new host-side env rewrite helper updates /root/.openclaw/env with the freshly issued LiteLLM key and endpoint rather than reinstalling, which avoids breaking the installed framework.
Adds a stable local token that is bootstrapped once at startup and written to a known path with mode 0600. Any request carrying it in an Authorization: Bearer header is granted full access without a session cookie, allowing automated agents and local scripts to call the API without going through the browser-based login flow. The middleware sits before the session check so it has no impact on normal browser sessions.
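One way the bootstrap and the bearer check could be written. Both function names are hypothetical, the 32-byte token length is an assumption, and header names are assumed to be lower-cased before the check.

```python
import hmac
import os
import secrets
from pathlib import Path


def bootstrap_local_token(path: Path) -> str:
    """Return the existing token, or create one readable only by the owner."""
    if path.exists():
        return path.read_text().strip()
    token = secrets.token_urlsafe(32)
    # O_EXCL plus mode 0o600: the file is never briefly world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(token)
    return token


def is_local_token_request(headers: dict[str, str], token: str) -> bool:
    """True when the Authorization header carries the local bearer token."""
    scheme, _, presented = headers.get("authorization", "").partition(" ")
    return scheme.lower() == "bearer" and hmac.compare_digest(
        presented.encode(), token.encode()
    )
```

`hmac.compare_digest` keeps the comparison constant-time, which is the usual choice for secret comparison.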
Documents the agent-home directory layout and mount strategy so it is clear what lives inside the container, what is persisted on the host, and how the env-rewrite helper fits into the restore flow.
notify_task_complete was never called from the image generation route, leaving the RKNN SD server running indefinitely after requests completed. The _legacy_generate path (used when the resource scheduler is absent) now wraps the backend HTTP call in try/finally so notify_task_complete fires on both success and error paths. Chat and embedding traffic routes through LiteLLM and does not hit this endpoint; that keep-alive path is handled separately via a LiteLLM callback. Tests added for both success and failure paths.
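The try/finally shape of the fix, reduced to its essentials. `generate_image` and its injected callables are stand-ins for the real _legacy_generate path and scheduler hook.

```python
from typing import Callable


def generate_image(call_backend: Callable[[str], str],
                   notify_task_complete: Callable[[], None],
                   prompt: str) -> str:
    """Call the backend and always signal completion, even when it raises."""
    try:
        return call_backend(prompt)
    finally:
        # Runs on success and on error, so the keep-alive timer can expire.
        notify_task_complete()
```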
Image-gen backend types were silently skipped in loaded_models, causing
the Activity widget to show "Loaded Models (0)" even when the RKNN SD
server was active. Two new branches added to the probe loop:
- rknn-sd: GET {url}/v1/models (rknn_sd_server.py speaks OpenAI-compat),
emits one entry per model with purpose=image-generation.
- sd-cpp: GET {url}/sdapi/v1/options, reads sd_model_checkpoint for the
active checkpoint name, falls back to "unknown" if absent.
Both branches follow the existing ConnectError/Timeout/HTTPError swallow
pattern. Tests cover success, missing-checkpoint fallback, and offline
(connection refused) for both backend types.
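The response handling for the two new branches might look like this once the HTTP call has returned JSON. The function names and the returned entry shape are assumptions, as is tagging the sd-cpp entry with the same purpose as rknn-sd.

```python
def parse_rknn_sd_models(payload: dict) -> list[dict]:
    """One entry per model from an OpenAI-compatible /v1/models response."""
    return [{"name": model["id"], "purpose": "image-generation"}
            for model in payload.get("data", [])]


def parse_sd_cpp_checkpoint(payload: dict) -> dict:
    """Active checkpoint from /sdapi/v1/options, or "unknown" when absent."""
    name = payload.get("sd_model_checkpoint") or "unknown"
    return {"name": name, "purpose": "image-generation"}
```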
Captures go per-agent in the bind-mounted home folder so archive,
restore, backup, and cross-worker migration all work via the existing
"move the home folder" rule. Each agent's .taos/trace/ directory holds
one SQLite bucket per UTC hour (YYYY-MM-DDTHH.db). Bucket routing is
driven by the event's created_at, not wall-clock at write time -- a
14:59:59.999 event routed at 15:00:00.001 lands in the T14 file, so
rollover never drops events.
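Bucket routing by event time can be sketched in a few lines. `bucket_for` is a hypothetical name, and the sketch assumes created_at is a timezone-aware datetime.

```python
from datetime import datetime, timezone


def bucket_for(created_at: datetime) -> str:
    """Name of the hourly SQLite bucket for an event, keyed on its own timestamp."""
    return created_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H") + ".db"
```

An event stamped 14:59:59.999 UTC maps to the `T14` bucket no matter when the write actually happens.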
Zero-loss: every write lands in the SQLite bucket or is appended to a sibling
YYYY-MM-DDTHH.jsonl. Nothing is ever silently dropped. The librarian
merges both sources at read time.
The envelope is v1 and stable: v, id, trace_id, parent_id, created_at,
agent_name, kind, channel_id, thread_id, backend_name, model,
duration_ms, tokens_in, tokens_out, cost_usd, error, payload. Kinds are
enumerated (message_in/out, llm_call, tool_call/result, reasoning,
error, lifecycle); each has a documented payload shape so consumers
parse without guessing. trace_id + parent_id enable cross-event linkage
for reconstructing a full turn end-to-end.
POST /api/trace writes; GET /api/agents/{name}/trace reads with filter
+ limit. POST /api/lifecycle/notify lets the LiteLLM callback reset the
keep-alive timer for whichever backend served a request.
Registered in generated litellm_config.yaml under
general_settings.custom_callbacks. Runs inside the LiteLLM subprocess
with no access to taOS's Python state, so it authenticates to taOS via
the local token file on disk and posts over HTTP to /api/trace and
/api/lifecycle/notify.
Agent name is derived from the virtual key alias ("taos-<slug>") that
the deployer sets when minting per-agent keys. This is how the per-
agent trace store knows which bucket to route to for a given completion.
Failure-mode is swallow-and-log: a broken callback must never fail a
real LLM request. A litellm-not-installed environment gets a no-op stub
so tests pass without the dep.
…ontainers
app.py: instantiate the registry on the data_dir, include the trace router, close all connections on shutdown.
deployer.py: inject TAOS_LOCAL_TOKEN (read from data_dir/.auth_local_token at deploy time) and TAOS_TRACE_URL into the container env. Any in-container runtime that wants to post traces (or that we later replace with real openclaw and tap via gateway events) has the credential and endpoint ready.
…+ archive/trace
Switches env-snippet from host.docker.internal to 127.0.0.1 and explains incus proxy devices. Drops Docker-only qualifier from workspace/memory status table (LXC now has parity). Adds Per-agent trace capture, Agent archive/restore, and Programmatic access (local token) sections. Extends Related and adds a Related code list pointing at the new modules.
Adds a section distinguishing user memory (long-lived user context) from per-agent trace capture (event log inside agent-home). Explains how the taOSmd librarian bridges both layers and links to the trace design in framework-agnostic-runtime.md.
Step-by-step procedures for archiving a live agent, listing archives, restoring with slug collision handling and LiteLLM key rotation, and permanent purge. Covers failure modes including container rename failure, archive dir collision, and restore container conflicts.
…ttribution
Covers the three-endpoint trace API surface with curl examples, query filter parameters, envelope field table, kind/payload reference, direct SQLite access pattern, cost attribution recipe, and librarian consumption pattern. Links to trace_store.py, routes/trace.py, and litellm_callback.py.
… semantics
DELETE /api/agents/{name} now archives rather than hard-deletes. Updates the
endpoint table to show the archive path and the new purge endpoint.
Primary reference for the real openclaw integration: gateway protocol breakdown, install + runtime, config schema, extension model, known limitations, 35-row capability map, and a 4-phase MVP-to-full roadmap.
MVP path is the bridge adapter from the 2026-04-11 framework-integration-bridge-design spec, not the operator-client (raw v3 WS from taOS). The operator-client is kept as a documented fallback only, because it couples taOS to openclaw's gateway protocol version and any upstream bump can break the fleet; the bridge isolates coupling to a single ~200 LoC patch inside our jaylfc/openclaw fork.
Review-gate refinements baked into Step 1: feature-flag the patch entry (so an unset TAOS_BRIDGE_URL gives upstream-identical builds), version-stamp the bootstrap, single coupling discipline, channels.kind "external" + provider "taos" (upstreamable), 400 LoC patch ceiling, automated persistence-audit as the trust anchor, parallel upstream PRs, LiteLLM key rotation caveat (safe-on-restart today, reload RPC later).
Fixes the Debian-bookworm Node 18 gap (install Node 22.14+ via NodeSource before npm install) and the stale 500MB manifest disk size (real openclaw is 1-2GB on disk).
Appendix B lists 12 docs.openclaw.ai pages that 404'd at research time; a follow-up pass using gh api on the repo docs/ tree fills those gaps.
Resolved 2 of 7 Appendix A open questions from primary source on github.com/openclaw/openclaw. Struck through 9 of the 12 404'd docs.openclaw.ai URLs in Appendix B where the repo had a mirror. Added <!-- source: ... --> comments so future readers know which claims are primary-sourced. MVP impact: startup health-check loop updated from ss fallback to `openclaw health --timeout` (Q5 resolved); gateway.bind: "lan" confirmed as the correct key for container external binding (Q1 resolved).
Trace files older than 2h are chmod'd 0o400 during eviction so the
librarian's source-of-truth for historic agent activity is tamper-resistant
on-disk. Rare late-arriving events (clock skew, deferred processing)
route to a sibling {bucket}.late.jsonl which stays writable -- zero-
loss guarantee preserved even for the extreme edge. list() merges
.db + .jsonl + .late.jsonl with dedup by event id (primary wins).
Sealing runs opportunistically inside _evict_old_buckets; no
background task.
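The read-time merge with "primary wins" dedup can be sketched as follows, assuming each source has already been loaded into a list of event dicts. `merge_events` is a hypothetical name, and ordering the result by created_at is an assumption.

```python
def merge_events(primary: list[dict], fallback: list[dict], late: list[dict]) -> list[dict]:
    """Merge .db, .jsonl and .late.jsonl events, keeping the first copy of each id."""
    merged: dict[str, dict] = {}
    # Sources are visited in priority order, so a .db row beats any duplicate.
    for source in (primary, fallback, late):
        for event in source:
            merged.setdefault(event["id"], event)
    return sorted(merged.values(), key=lambda event: event["created_at"])
```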
…og + model discovery
- LLMProxy accepts database_url; app reads data/.litellm_db_url at boot
and exports it as DATABASE_URL into the litellm subprocess so
/key/generate can mint per-agent virtual keys.
- Add Provider fills canonical URL from PROVIDER_URL_DEFAULTS and probes
{url}/models to populate the model list when empty — generic across
openai, anthropic, openrouter, kilocode (no per-type branching on the
probe). Falls back to per-type seed list (kilocode → kilo-auto/free)
when the probe returns nothing so the entry still registers at least
one routable model.
- Deployer scopes the minted virtual key to the agent's primary + fallback
models (models=[req.model, *fallback_models]) instead of defaulting to
the unrestricted "default" alias.
- Deployer fails loudly when a DB is configured but /key/generate still
returns None — hiding that class of failure is what shipped the
broken kilocode path in the first place.
- generate_litellm_config now WARNs when a cloud-type backend is missing
url or models, so silent drops surface in logs instead of showing up
as a broken agent much later.
- scripts/repair_providers.py repairs legacy config.yaml entries that
pre-date the autofill/discovery logic.
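A sketch of the scoped key request and the fail-loud check. Both helper names are hypothetical; the `key_alias`, `models`, and `metadata` fields are LiteLLM /key/generate request fields, and the `taos-<slug>` alias and `agent` metadata key follow the conventions described elsewhere in this PR.

```python
from typing import Optional


def key_generate_payload(slug: str, model: str, fallback_models: list[str]) -> dict:
    """Request body that scopes a virtual key to the agent's own models."""
    return {
        "key_alias": f"taos-{slug}",
        "models": [model, *fallback_models],
        "metadata": {"agent": slug},
    }


def require_key(key: Optional[str], db_configured: bool) -> Optional[str]:
    """Fail loudly when a DB is configured but no key came back."""
    if key is None and db_configured:
        raise RuntimeError("LiteLLM /key/generate returned no key despite a configured DB")
    return key
```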
Generated LiteLLM configs use os.environ/<name> markers to reference
provider api keys, but nothing was actually exporting those names
into the subprocess env. Cloud providers therefore hit the litellm
OpenAIException "api_key client option must be set" even with a
correctly-configured backend list.
LLMProxy.start/reload_config now accept a secrets={name: value} map.
app.py resolves each backend.api_key_secret from the secrets store at
boot and again on catalog-change reload; routes/providers.py does the
same on add/patch/delete so newly-added or rotated keys take effect
without a full app restart.
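Resolving the `os.environ/<name>` markers comes down to exporting each named secret into the child environment. A minimal sketch, with `build_proxy_env` as a hypothetical name and the secrets store modelled as a dict:

```python
import os
from typing import Optional


def build_proxy_env(backends: list[dict], secrets: dict[str, str],
                    base_env: Optional[dict[str, str]] = None) -> dict[str, str]:
    """Environment for the LiteLLM subprocess with provider keys exported."""
    env = dict(os.environ if base_env is None else base_env)
    for backend in backends:
        name = backend.get("api_key_secret")
        if not name:
            continue  # local backends carry no secret reference
        value = secrets.get(name)
        if value is not None:
            env[name] = value  # matches the os.environ/<name> marker in the config
    return env
```

The resulting dict would be passed as `env=` when the subprocess is started or restarted.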
…nfigured
LiteLLM's /key/generate requires a Postgres-backed Prisma schema, but LiteLLM does not run migrations itself. Fresh installs had to manually run `pip install prisma && prisma generate && prisma db push` before virtual keys worked. New tinyagentos/litellm_migrate.py locates the bundled schema at litellm/proxy/schema.prisma, probes for LiteLLM_VerificationToken in the configured DB, and shells out to the venv's prisma CLI only when the table is missing. Idempotent, so safe on every boot. Called from the lifespan hook before LLMProxy.start() so LiteLLM sees a ready schema. Added prisma>=0.11.0 to the proxy optional dependency group so the CLI lands in the venv on fresh installs.
…g shim for get_instance_fn
…solves under systemd
…gration
Running prisma db push from our bootstrap created tables without seeding _prisma_migrations, so LiteLLM's own prisma migrate deploy at startup tried to apply migration #1 against an already-populated schema and looped on "type JobStatus already exists", leaving the proxy unhealthy. Our helper's only job now is to make prisma.client importable so LiteLLM can run its shipped migrations itself. Drop the db push and the psql/psycopg probe; keep the systemd PATH fix for prisma generate.
LiteLLM's proxy_cli shells out ``subprocess.run(["prisma"])`` during startup to detect whether Prisma is runnable. Under systemd the service's default PATH doesn't include our venv's bin/, so the lookup raises FileNotFoundError and LiteLLM prints "prisma package not found" and skips DB setup entirely, leaving virtual-key issuance broken even though the package IS installed in the venv. Prepend the venv bin that already hosts the litellm binary so the child process resolves ``prisma`` (and ``prisma-client-py`` for generate). Also bump the startup wait from 30s to 120s: LiteLLM on a fresh Pi DB runs ``prisma migrate deploy`` before opening its HTTP port, which takes 45-60s on ARM.
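The PATH fix amounts to building the child environment with the venv bin directory first. A small sketch, with `env_with_venv_bin` as a hypothetical helper name:

```python
import os
from pathlib import Path


def env_with_venv_bin(base_env: dict[str, str], venv_bin: Path) -> dict[str, str]:
    """Copy of base_env whose PATH starts with the venv's bin directory."""
    env = dict(base_env)
    current = env.get("PATH", "")
    env["PATH"] = f"{venv_bin}{os.pathsep}{current}" if current else str(venv_bin)
    return env
```

In practice `venv_bin` would be the directory that holds the litellm executable, for example `Path(sys.executable).parent` when taOS itself runs from that venv.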
stderr=DEVNULL silently swallowed proxy startup failures (prisma migration errors, config parse errors, model-router failures), turning "why is the proxy unhealthy?" into a 30-minute debugging hunt. Write stderr to a file next to litellm_config.yaml so operators can read it without attaching strace.
Two separate bugs kept LiteLLM from ever settling on the Pi.
1. Startup polling hit ``/health``, which gates on the master key and returns 401 for an unauthenticated client. LiteLLM was healthy within ~50s but ``start()`` kept polling until the 120s timeout, logged "failed to start within 120s", and returned False even though the subprocess was fine. ``/health/readiness`` is the public endpoint.
2. ``reload_config`` sent SIGHUP to trigger a config reload. LiteLLM runs as single-worker uvicorn (no ``--workers``), which does not register a SIGHUP handler, so the default action (terminate) fires. Every ``/api/providers/models?refresh=true`` was silently killing the proxy, then ``_fetch_litellm_models`` got connection-refused and returned []. Drop SIGHUP entirely; the existing stop+start path was already the fallback.
Also switch the foreign-process probe to ``/health/readiness`` for the same 401 reason.
The TaosLiteLLMCallback running inside the LiteLLM subprocess POSTs
llm_call events back to the taOS bridge at ``/api/trace``, which
requires the local auth token. The callback's token-discovery logic
checks ``TAOS_LOCAL_TOKEN`` env first, then ``/data/.auth_local_token``
and ``~/.taos/.auth_local_token``. Under systemd the real token lives
at ``{data_dir}/.auth_local_token`` — none of the candidate paths — so
every callback fired a POST without Authorization and taOS responded
401, leaving trace rows with no ``llm_call`` events despite LiteLLM
actually processing requests.
Read the token in app.py and forward it via the new ``local_token``
constructor kwarg on LLMProxy, which exports it into the subprocess env.
LiteLLM 1.83.4 surfaces the agent slug in litellm_params.metadata under user_api_key_metadata.agent (matching what LLMProxy.create_agent_key writes when minting the virtual key). The previous extraction read metadata.key_alias, which is no longer populated on success events, so every llm_call trace was bucketed under the _unknown_ sentinel slug. The extraction now walks four sources in priority order:
1. user_api_key_metadata.agent
2. user_api_key_auth_metadata.agent
3. user_api_key_alias (strips the taos- prefix)
4. key_alias (legacy, kept for older LiteLLM builds)
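The four-source lookup could be sketched as below. `agent_slug_from_metadata` is a hypothetical name, the input is the litellm_params.metadata dict, and stripping the taos- prefix from the legacy key_alias as well is an assumption.

```python
def agent_slug_from_metadata(metadata: dict) -> str:
    """Resolve the agent slug from LiteLLM callback metadata, in priority order."""
    for key in ("user_api_key_metadata", "user_api_key_auth_metadata"):
        agent = (metadata.get(key) or {}).get("agent")
        if agent:
            return agent
    for key in ("user_api_key_alias", "key_alias"):
        alias = metadata.get(key)
        if alias:
            return alias.removeprefix("taos-")
    return "_unknown_"  # sentinel bucket when nothing identifies the agent
```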
enqueue_user_message now writes a message_in trace event under the
agent's slug, following the ENVELOPE_V1_SCHEMA message_in shape
({from, text}) with extra informational fields (message_id,
author_type, delivery).
Guards against orphan _unknown_ or empty-slug entries.
Fails soft: trace write errors are logged, never raised.
…hen container absent (#221)
Failed deploys leave behind a config row with no LXC container, which caused DELETE /api/agents/{name} to error on snapshot_create. Probe container_exists first; for orphans, skip stop/snapshot, revoke any LiteLLM key, and either hard-delete the row (no history) or record a tombstone (chat/trace present so purge is available from Archived). Adds container_exists helper to tinyagentos.containers; four new tests cover the orphan hard-delete, orphan tombstone, skipped-snapshot assertion, and purge of a snapshotless tombstone.
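The orphan decision can be summarised as a small planning function. The step names are illustrative labels rather than real function names, and the container-present branch is a simplification of the normal archive flow.

```python
def plan_delete(container_exists: bool, has_history: bool) -> list[str]:
    """Ordered steps for DELETE /api/agents/{name}, given what actually exists."""
    if container_exists:
        return ["stop", "snapshot", "revoke_key", "archive"]
    # Orphan row from a failed deploy: nothing to stop or snapshot.
    steps = ["revoke_key"]
    steps.append("tombstone" if has_history else "hard_delete")
    return steps
```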
Adds a GitHub Actions workflow that builds per-arch Debian 13 LXC base images with Node 22, openclaw, and recycle-bin scaffolding already installed. Published as assets on the 'rolling-images' Release tag. The deployer now checks for the 'taos-openclaw-base' image alias before launching; when present it uses the cached image and sets TAOS_BASE_IMAGE_PRESENT=1 so install.sh skips the apt-get + npm steps. Without the image the deployer falls back transparently to images:debian/bookworm and install.sh does the full install. tinyagentos.agent_image exposes is_image_present and ensure_image_present helpers; the latter runs as a background task on app startup to bootstrap the image on first boot. Closes #220
… stdin
The previous impl passed curl.stdout (a Python StreamReader) as stdin= to the incus subprocess, which asyncio cannot forward as an OS-level FD. Curl would read the first ~90KB then block on a pipe nobody was draining. Using an explicit os.pipe() pair with the read end handed to incus and the write end to curl gives us a real kernel pipe and the import completes.
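A generic sketch of the explicit-pipe technique, not the deployer's actual code: `pipe_through` is a hypothetical helper that connects any producer to any consumer with a kernel pipe. It assumes a POSIX host.

```python
import asyncio
import os


async def pipe_through(producer_argv: list[str], consumer_argv: list[str]) -> bytes:
    """Run producer | consumer over a real kernel pipe and return consumer stdout."""
    read_fd, write_fd = os.pipe()
    producer = await asyncio.create_subprocess_exec(*producer_argv, stdout=write_fd)
    # The parent must drop its copy of the write end, or the consumer never sees EOF.
    os.close(write_fd)
    consumer = await asyncio.create_subprocess_exec(
        *consumer_argv, stdin=read_fd, stdout=asyncio.subprocess.PIPE
    )
    os.close(read_fd)
    output, _ = await consumer.communicate()
    await producer.wait()
    return output
```

Because both children hold plain file descriptors, the kernel moves the data and nothing in the Python process has to drain a StreamReader.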
Incus 6.x rejects '-' as stdin for image import and rejects bare HTTPS URLs (expects an incus image server). Download to a temp file then pass its path. Also fix image list query: positional <alias> arg (--filter=alias=... is only valid for container list).
…acker
The design doc still referenced models.providers.taos (a custom provider that was abandoned mid-implementation in favour of openclaw's built-in litellm provider type). Updated the bootstrap example, the integration tracker table, and the openclaw.json shape to match what actually ships. The channels-side "provider: taos" identifier is unchanged; that's the channel-kind name, separate from the LLM provider.
Code Review Summary
Status: 6 Issues Found | Recommendation: Address before merge
Reviewed by seed-2-0-pro-260328 · 170,139 tokens
jaylfc added a commit that referenced this pull request Apr 18, 2026
CRITICAL:
- install.sh: explicit chmod a-s on /usr/local/bin/rm wrapper. Mode 755 doesn't have SUID set, but a future edit accidentally bumping to 4755 on a root-shadow rm would be a textbook escalation primitive. Belt-and-suspenders against that.
WARNING:
- build-agent-images.yml: stop flipping FORWARD chain default policy to ACCEPT (broadly disables runner forward filtering); rely on the explicit ACCEPT rules for incusbr0 already present. Capture outbound interface once before the masquerade rule so a default-route flip mid-job can't redirect the rule to a different NIC.
- AgentsApp.fetchArchived: log non-OK responses, content-type mismatches, and exception bodies to console.warn instead of swallowing silently; an empty Archived list caused by a failed fetch is now distinguishable from "no archived agents" in DevTools.
- AgentsApp.fetchDiskStates: add explicit nullness + structural checks on Promise.allSettled fulfilled values. Existing code worked because null is falsy, but the defensive shape check makes future regressions loud instead of crashing AgentRow on render.
- MessagesApp open-messages handler: add cancelled flag to the useEffect cleanup so admin-prompt fetches resolving after unmount don't trigger setState on an unmounted component.
Frontend bundle rebuilt. No behavioural changes for any happy path.
TL;DR
End-to-end agent deploy + chat with per-agent LiteLLM virtual keys backed by Postgres, full observability (`message_in` / `llm_call` / `tool_call` / `tool_result` / `message_out`, all bucketed under the agent slug), provider-agnostic model discovery (any cloud provider added via the providers app auto-populates LiteLLM's `model_list` and the agent-creation dropdown), and pre-built openclaw LXC base images (deploy time 90s → 33s).
Squash-merge to keep master history clean: 112 commits worth of iteration through five distinct latent bugs and four feature deliverables.
What shipped
LiteLLM proxy: ownership, auth, persistence
- `data/.litellm_db_url` into the LiteLLM subprocess env.
- `prisma/migrations/` (no more `db push` corruption loop).
- `/health/readiness` (was hitting 401 on `/health` with master_key set).
- `api_key_secret` references resolve into the subprocess env instead of being passed as `os.environ/<name>` literal strings.
- `/key/generate` log the status + body instead of silently returning None.
Provider pipeline: provider-agnostic + LiteLLM-authoritative
- `/api/providers/models?refresh=bool` passthrough to LiteLLM `/v1/models` with TTL cache. Frontend AgentsApp's create-agent dialog now reads from this endpoint; LiteLLM is the single source of truth for what models can be assigned.
- `/api/providers` auto-discovers models from `{url}/models` for any `CLOUD_BACKEND_TYPES` provider. PATCH re-probes when routing-affecting fields change.
- `model_list` correctly.
- `generate_litellm_config` warns when a backend is dropped for missing url/models, instead of silently producing an incomplete config.
Trace observability: both sides of every conversation
- `message_in` events captured at `BridgeSessionRegistry.enqueue_user_message`: every user message reaching an agent now lands in trace with `content`, `channel_id`, `message_id`.
- `llm_call` events via the LiteLLM CustomLogger (sibling shim file written to the config dir so `get_instance_fn` can resolve the dotted path).
- `kwargs.litellm_params.metadata.user_api_key_metadata.agent` (the path LiteLLM v1.83.4 actually surfaces): events bucket under the right agent slug, no more `_unknown_`.
- `tool_call` + `tool_result` events via the openclaw fork's bridge (jaylfc/openclaw#taos-fork commits ef84a93 + 9bab2e3): uses `getAgentRunContext(runId).sessionKey` to correlate events when `isControlUiVisible=false` (non-webchat channels).
- `TAOS_LOCAL_TOKEN` forwarded to the LiteLLM subprocess so callback POSTs to `/api/trace` actually authenticate.
Agent UX
- `incus snapshot` / `incus stop` when the container doesn't exist, hard-delete when no chat/trace history, tombstone-archive otherwise.
Pre-built LXC base image
- `.github/workflows/build-agent-images.yml`: matrix arm64 + x64 builds a Debian 13 incus image with Node 22 + openclaw preinstalled, publishes to the `rolling-images` Release tag.
- `incus launch taos-openclaw-base` when present. Falls back to the per-deploy build path when absent.
- `install.sh` branches on `$TAOS_BASE_IMAGE_PRESENT` to skip apt/npm/tarball steps on the fast path.
- `incus launch`; see perf(install): use btrfs/ZFS storage pool for incus to unlock CoW container clones (deploy ≤5s) #224.
Closes
Queued follow-ups (filed during this work)
- `ensure_image_present` non-blocking on first boot
Test coverage
- `test_llm_proxy` (~38 tests)
- `test_litellm_callback`
- `test_litellm_migrate` (9 new)
- `test_routes_providers` (passthrough + cache + PATCH refresh)
- `test_routes_agents` (deploy emoji + orphan delete)
- `test_bridge_session` (message_in)
- `test_deployer` (base image + key minting)
- `test_agent_image` (10 new)
- `test_hardware.py` arm64 check on macOS) untouched.
Migration / one-time steps for existing installs
- `postgresql` if not present; create role + DB and write the URL to `data/.litellm_db_url` (mode 600). Setup script ergonomics are queued for a follow-up.
- `pip install prisma` runs automatically via the new `pyproject.toml` dep; `prisma generate` runs on first taOS boot via `litellm_migrate.py`.
- `url` or `models` are repaired by the `scripts/repair_providers.py` migration, which probed the upstream `/models` endpoint and populated 330 kilocode models.
Pi state at merge
Clean. Only the always-protected `mary`/`naira`/`stanley` containers remain.