
Agent deploy + observability overhaul: per-agent virtual keys, full trace, pre-built base image #225

Merged: jaylfc merged 112 commits into master from fix/agent-creation-flow on Apr 18, 2026

jaylfc commented Apr 18, 2026

TL;DR

End-to-end agent deploy + chat with per-agent LiteLLM virtual keys backed by Postgres, full observability (message_in/llm_call/tool_call/tool_result/message_out all bucketed under the agent slug), provider-agnostic model discovery (any cloud provider added via the providers app auto-populates LiteLLM's model_list and the agent-creation dropdown), and pre-built openclaw LXC base images (deploy time 90s → 33s).

Squash-merge to keep master history clean: 112 commits' worth of iteration through five distinct latent bugs and four feature deliverables.

What shipped

LiteLLM proxy: ownership, auth, persistence

  • Stop adopting foreign LiteLLM processes on :4000. taOS terminates anything on its port and spawns its own. Adoption was masking master-key mismatch and silent config drift.
  • Master key + DATABASE_URL propagate from data/.litellm_db_url into the LiteLLM subprocess env.
  • Prisma client auto-generates on boot; LiteLLM owns the schema migration via its shipped prisma/migrations/ (no more db push corruption loop).
  • Health probe uses /health/readiness (was hitting 401 on /health with master_key set).
  • SIGHUP reload removed — single-worker uvicorn dies on SIGHUP. Reload now uses a clean restart path.
  • api_key_secret references resolve into the subprocess env instead of being passed as os.environ/<name> literal strings.
  • Non-200 responses on /key/generate log the status + body instead of silently returning None.

Provider pipeline: provider-agnostic + LiteLLM-authoritative

  • /api/providers/models?refresh=bool passthrough to LiteLLM /v1/models with TTL cache. Frontend AgentsApp's create-agent dialog now reads from this endpoint — LiteLLM is the single source of truth for what models can be assigned.
  • POST /api/providers auto-discovers models from {url}/models for any CLOUD_BACKEND_TYPES provider. PATCH re-probes when routing-affecting fields change.
  • Provider catalog with canonical base URLs per type (kilocode, openai, anthropic, openrouter), so an api-key-only entry still lands in model_list correctly.
  • generate_litellm_config warns when a backend is dropped for missing url/models, instead of silently producing an incomplete config.
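
The passthrough-with-TTL-cache behaviour behind /api/providers/models?refresh=bool could look roughly like the sketch below. The class name `ModelListCache`, the injected `fetch` callable, and the 30s default TTL are all assumptions for illustration, not the names or values used in the PR.

```python
import time
from typing import Callable, List, Optional


class ModelListCache:
    """Cache the result of a model-list fetch for ttl_seconds (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[], List[str]],
        ttl_seconds: float = 30.0,  # assumed default, not taken from the PR
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[List[str]] = None
        self._expires_at = 0.0

    def get(self, refresh: bool = False) -> List[str]:
        """Return the cached list, refetching when forced, empty, or expired."""
        now = self._clock()
        if refresh or self._value is None or now >= self._expires_at:
            self._value = list(self._fetch())
            self._expires_at = now + self._ttl
        # Return a copy so callers cannot mutate the cached list.
        return list(self._value)
```

In this shape, `refresh=true` on the route maps to `get(refresh=True)`, and the `fetch` callable is where a real implementation would call LiteLLM's /v1/models.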

Trace observability: both sides of every conversation

  • message_in events captured at BridgeSessionRegistry.enqueue_user_message — every user message reaching an agent now lands in trace with content, channel_id, message_id.
  • llm_call events via the LiteLLM CustomLogger (sibling shim file written to the config dir so get_instance_fn can resolve the dotted path).
  • Slug extraction reads kwargs.litellm_params.metadata.user_api_key_metadata.agent (the path LiteLLM v1.83.4 actually surfaces) — events bucket under the right agent slug, no more _unknown_.
  • tool_call + tool_result events via the openclaw fork's bridge (jaylfc/openclaw#taos-fork commits ef84a93 + 9bab2e3) — uses getAgentRunContext(runId).sessionKey to correlate events when isControlUiVisible=false (non-webchat channels).
  • TAOS_LOCAL_TOKEN forwarded to the LiteLLM subprocess so callback POSTs to /api/trace actually authenticate.

Agent UX

  • Emoji per agent: picker in the create-agent flow (default = framework's icon, override via inline input + 12-item quick-pick grid). Stored on the agent record. Displayed in agent rows, message hub, and the AgentStatusWidget. ARIA-correct.
  • Tolerant DELETE for orphan agents (failed deploys that left only a config row): skip incus snapshot / incus stop when the container doesn't exist, hard-delete when no chat/trace history, tombstone-archive otherwise.

Pre-built LXC base image

  • .github/workflows/build-agent-images.yml: matrix arm64 + x64 builds a Debian 13 incus image with Node 22 + openclaw preinstalled, publishes to the rolling-images Release tag.
  • Deployer auto-imports the image on first deploy (background lifespan task) and uses it via incus launch taos-openclaw-base when present. Falls back to the per-deploy build path when absent.
  • install.sh branches on $TAOS_BASE_IMAGE_PRESENT to skip apt/npm/tarball steps on the fast path.
  • Result on Pi 5 Plus: deploy time 60–90s → 33s (~63% reduction from the 90s upper bound). Remaining bottleneck is the dir-pool full-FS-copy during incus launch; see #224, "perf(install): use btrfs/ZFS storage pool for incus to unlock CoW container clones (deploy ≤5s)".

Closes

Queued follow-ups (filed during this work)

Test coverage

  • 1700+ tests in the full fast suite green.
  • Major focus suites all 100%:
    • test_llm_proxy (~38 tests)
    • test_litellm_callback
    • test_litellm_migrate (9 new)
    • test_routes_providers (passthrough + cache + PATCH refresh)
    • test_routes_agents (deploy emoji + orphan delete)
    • test_bridge_session (message_in)
    • test_deployer (base image + key minting)
    • test_agent_image (10 new)
  • Pre-existing unrelated failures (test_hardware.py arm64 check on macOS) untouched.

Migration / one-time steps for existing installs

  1. Postgres: install postgresql if not present; create role + DB and write the URL to data/.litellm_db_url (mode 600). Setup script ergonomics are queued for a follow-up.
  2. Prisma client: pip install prisma runs automatically via the new pyproject.toml dep; prisma generate runs on first taOS boot via litellm_migrate.py.
  3. Pre-built image: imports automatically on first boot. ~30–60s download + import (one-time).
  4. Existing kilocode/cloud provider entries without url or models are repaired by the scripts/repair_providers.py migration, which probed the upstream /models endpoint and populated 330 kilocode models.

Pi state at merge

Clean. Only the always-protected mary / naira / stanley containers remain.

jaylfc added 30 commits April 16, 2026 23:53
Adapters imported uvicorn at module top, so anything that imported
them for structural checks (tests, health-endpoint probes) would
crash with ModuleNotFoundError when uvicorn wasn't installed.
uvicorn.run is only needed when an adapter is run as a standalone
process — move the import into the __main__ guard.

Clears 19 pre-existing test failures across test_new_adapters.py
and test_channel_hub_new.py.
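
A minimal sketch of the deferred-import pattern this commit describes, assuming a hypothetical adapter that exposes a bare ASGI callable named `app` and a `run_standalone` entry point (neither name comes from the PR):

```python
async def app(scope, receive, send):
    """Tiny ASGI app: importable for structural checks without uvicorn."""
    if scope["type"] != "http":
        return
    await send(
        {
            "type": "http.response.start",
            "status": 200,
            "headers": [(b"content-type", b"text/plain")],
        }
    )
    await send({"type": "http.response.body", "body": b"ok"})


def run_standalone(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Serve the adapter; intended to be called only from the __main__ guard."""
    # Deferred import: uvicorn is needed only when running as a process,
    # so importing this module never raises ModuleNotFoundError.
    import uvicorn

    uvicorn.run(app, host=host, port=port)
```

Tests and health probes can then import the module and inspect `app` even in an environment where uvicorn is not installed.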
Required by the new agent archive lifecycle: the delete path stops
the container then renames it to a dated `taos-archived-{slug}-{ts}`
bucket so a later restore can rename it back. Implemented for both
LXC (incus rename) and Docker (docker rename).
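
The archive naming could be sketched as below. The `taos-archived-{slug}-{ts}` pattern is from the commit; the assumption here is that `{ts}` uses the YYYYMMDDTHHMMSS format mentioned elsewhere in this PR, and the helper names are hypothetical.

```python
from datetime import datetime
from typing import Tuple

ARCHIVE_PREFIX = "taos-archived-"
ARCHIVE_TS_FORMAT = "%Y%m%dT%H%M%S"  # assumed to match YYYYMMDDTHHMMSS


def archive_name(slug: str, when: datetime) -> str:
    """Build the container name used while an agent is archived."""
    return f"{ARCHIVE_PREFIX}{slug}-{when.strftime(ARCHIVE_TS_FORMAT)}"


def parse_archive_name(name: str) -> Tuple[str, datetime]:
    """Recover (slug, timestamp) so a restore can rename the container back."""
    if not name.startswith(ARCHIVE_PREFIX):
        raise ValueError(f"not an archive name: {name!r}")
    # The timestamp contains no hyphen, so splitting on the last one
    # keeps hyphenated slugs such as "my-agent" intact.
    slug, _, ts = name[len(ARCHIVE_PREFIX):].rpartition("-")
    return slug, datetime.strptime(ts, ARCHIVE_TS_FORMAT)
```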
…install

Broken before: `pip3 install openclaw` ran unconditionally, its failure
was logged as a warning and the deploy continued, and the container
came up missing the deps most agent frameworks need.

Now:
- apt install includes nodejs, npm, build-essential, python3-dev,
  ca-certificates, gnupg, wget, with DEBIAN_FRONTEND=noninteractive
  and --no-install-recommends (timeout 15m for slow arm64 apt).
- Framework install dispatches on the manifest's install.method:
  pip uses manifest.install.package, script pushes + runs
  manifest.manifest_dir / install.script. Missing script files,
  unsupported methods, and non-zero exits all raise RuntimeError
  so the outer try/except rolls back the container and the agent
  shows status=failed instead of misleadingly 'running'.
- TAOS_MODEL env var is injected so the in-container runtime knows
  which model to send to LiteLLM.
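
The dispatch on install.method might be shaped like this sketch. It is simplified: the real deployer pushes the script into the container, whereas here the command is only built and handed to an injected `run` callable. `install_command` and `run_install` are hypothetical names.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List


def install_command(install: Dict[str, Any], manifest_dir: Path) -> List[str]:
    """Translate a manifest's install section into a command line."""
    method = install.get("method")
    if method == "pip":
        package = install.get("package")
        if not package:
            raise RuntimeError("pip install requires install.package")
        return ["pip3", "install", package]
    if method == "script":
        script = manifest_dir / install.get("script", "")
        if not script.is_file():
            raise RuntimeError(f"install script not found: {script}")
        return ["bash", str(script)]
    raise RuntimeError(f"unsupported install method: {method!r}")


def run_install(
    install: Dict[str, Any],
    manifest_dir: Path,
    run: Callable[[List[str]], int],
) -> None:
    """Run the install and raise on any non-zero exit code."""
    cmd = install_command(install, manifest_dir)
    code = run(cmd)
    if code != 0:
        # Raising here is what lets an outer try/except roll the container back.
        raise RuntimeError(f"install failed with exit code {code}: {' '.join(cmd)}")
```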
…ation, hot reload

generate_litellm_config now:
- Registers openrouter (openrouter/ prefix, native LiteLLM support)
  and kilocode (openai-compatible, explicit api_base) in the
  backend type maps.
- Expands each cloud backend's declared models into their own
  model_list entries keyed on the real model id, so agents can
  request a specific model. The 'default' alias is still appended
  as a fallback.

routes/providers.py: add/patch/delete now call proxy.reload_config
instead of the stale proxy.write_config, so the running LiteLLM
subprocess actually picks up config changes.
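
A rough sketch of the model_list expansion, using LiteLLM's `model_name` / `litellm_params` config shape. The backend dict keys (`name`, `url`, `models`, `prefix`, `api_key_secret`) and the choice to point the 'default' alias at the first expanded model are assumptions for illustration. The sketch also skips backends with no url or models and logs a warning.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger("litellm_config_sketch")


def build_model_list(backends: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Expand each backend's declared models into LiteLLM model_list entries."""
    model_list: List[Dict[str, Any]] = []
    for backend in backends:
        url = backend.get("url")
        models = backend.get("models") or []
        if not url or not models:
            logger.warning(
                "dropping backend %r: missing url or models", backend.get("name")
            )
            continue
        # "openai/" marks an OpenAI-compatible endpoint; "openrouter/" would
        # select LiteLLM's native OpenRouter support.
        prefix = backend.get("prefix", "openai/")
        shared: Dict[str, Any] = {"api_base": url}
        secret = backend.get("api_key_secret")
        if secret:
            # LiteLLM resolves this marker from the process environment.
            shared["api_key"] = f"os.environ/{secret}"
        for model_id in models:
            model_list.append(
                {
                    "model_name": model_id,
                    "litellm_params": {"model": f"{prefix}{model_id}", **shared},
                }
            )
    if model_list:
        # Assumed fallback target: the first expanded model.
        fallback = dict(model_list[0]["litellm_params"])
        model_list.append({"model_name": "default", "litellm_params": fallback})
    return model_list
```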
The manifest declares method: script -> scripts/install.sh, which
didn't exist. The deployer had no way to install openclaw, so the
agent came up with no runtime and the chat path had nothing to hit.

The new script, run once inside a fresh Debian bookworm LXC:
- Creates /opt/openclaw with a pinned venv (fastapi, uvicorn,
  httpx, openai).
- Writes a minimal FastAPI runtime at /opt/openclaw/server.py that
  listens on 0.0.0.0:8100, accepts POST /message {text, from,
  thread_id?} and forwards to LiteLLM using the injected
  OPENAI_BASE_URL, OPENAI_API_KEY, and TAOS_MODEL env vars.
- Installs a systemd unit so the runtime survives restarts.
- Polls /health up to 20s and fails the install if the server
  didn't come up.

No memory, no tools, no persistence — the host owns all of that.
This is the minimum for the end-to-end chat pipeline to land
messages on an agent and get a reply back.
…fecycle

Several related changes to the agents API and config model that
together make agent creation survive the full round trip:

- Every agent gets a stable 12-char uuid (agent['id']), backfilled
  for existing config entries by normalize_agent.
- body.model and body.framework land on the agent row at create
  time; llm_key lands after the background deploy succeeds.
- A 1:1 DM channel is auto-created on successful deploy and its
  id persisted as chat_channel_id so the Messages app sees the
  agent immediately.
- extra_config to deploy_agent now always includes the app
  registry so the manifest-aware framework install can resolve.

Delete is now archive, not destroy. DELETE /api/agents/{name}:
stops the container, renames it to taos-archived-{slug}-{ts},
moves workspace/memory dirs under data_dir/archive/{slug}-{ts}/,
revokes the LiteLLM key, flags the DM channel archived, and moves
the config entry from config.agents to config.archived_agents.

New endpoints:
  GET  /api/agents/archived           -> list archive entries
  POST /api/agents/archived/{id}/restore -> reverses the archive
  DELETE /api/agents/archived/{id}    -> true permanent purge

Restore handles slug collisions (if a new agent has taken the
original name) by suffixing -2, -3, etc. Purge is what the old
hard-delete used to do: destroy container, rm -rf archive dir,
delete chat channel, drop the archived entry.

This also fixes 'can't re-create a deleted agent with the same
name' -- the old delete path left the LXC container around; the
new archive path renames it out of the way.
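
The slug-collision handling on restore can be sketched as follows; `resolve_restore_slug` is a hypothetical helper name, and `taken` stands for the slugs of the currently live agents.

```python
from typing import Iterable


def resolve_restore_slug(original: str, taken: Iterable[str]) -> str:
    """Return the original slug, or the first free -2, -3, ... variant."""
    taken_set = set(taken)
    if original not in taken_set:
        return original
    suffix = 2
    while f"{original}-{suffix}" in taken_set:
        suffix += 1
    return f"{original}-{suffix}"
```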
User messages in a DM channel now reach the agent's FastAPI runtime
on port 8100 inside its LXC container; the reply is persisted as
an agent-authored message in the same channel and broadcast over
the chat hub so both the webapp and the PWA update in real time.

Wiring:
- AgentChatRouter (new, tinyagentos/agent_chat_router.py):
  fire-and-forget dispatch(message, channel). Skips non-user
  messages, looks up each non-user channel member as an agent,
  skips agents that aren't running (posts a short system reply
  instead), and POSTs to http://{agent.host}:8100/message with
  {text, from, thread_id}. Response content is written back via
  chat_messages.send_message. All errors caught -- broken agents
  don't crash the chat path.
- routes/chat.py: one-line dispatch call at the end of the HTTP
  post_message path and the WebSocket 'message' branch, so both
  entry points route identically.
- app.py: router instantiated in the lifespan after chat_hub.

No subscription plumbing, no retries -- the router is a direct
adapter between two owned stores. Timeouts and connect errors
become visible agent replies so the user sees what went wrong.
Adds a collapsible 'Archived' panel below the live agents list in
AgentsApp. Shows each archived entry's display name, model, and
relative archive time; per-row Restore and Delete Permanently
buttons call the new backend endpoints with confirmations.

- parseArchiveTimestamp / relativeTimeFromTs helpers convert the
  YYYYMMDDTHHMMSS format the backend writes.
- ArchivedAgentsPanel is inlined (matches AgentRow / DeployWizard
  living in the same file) and self-hides when there are no
  archived entries.
- handleDelete's confirm copy now mentions archiving so users
  know it's recoverable.
- fetchArchived is called alongside fetchAgents at every existing
  refresh point.

Unit tests for the new helpers under
desktop/src/apps/__tests__/AgentsApp.archived.test.tsx.
Incus refuses to rename a running container, so every archive call now
has a hard dependency on the container being stopped first. The new
stop-force path sends --force (LXC) or kill (Docker) so archive can
guarantee the container is down before it attempts the rename. The
add_proxy_device method is added to the abstract base, LXC backend, and
Docker stub so the deployer can attach incus proxy devices for host-side
port forwarding when setting up an agent home.
…0.0.1

Agents now get a dedicated home directory mounted at /root inside the
container so their runtime state (env file, model config, logs) persists
across container recreation. Proxy devices are attached via the new
add_proxy_device backend method so the host-side LiteLLM process can
reach the in-container agent port. The taos_host default is hardened to
127.0.0.1 so freshly deployed agents always resolve back to the host
loopback rather than relying on a potentially incorrect network variable.
…keys

Previously is_running only checked the subprocess handle, which is None
for processes the deployer did not start itself (adopted instances). The
method now also checks the _adopted flag so that a pre-existing LiteLLM
process is correctly reported as running and a fresh API key is minted
rather than the deployer trying to start a second instance. The companion
reload_config path also skips process management when adopted.
The install script previously wrote configuration to a path that was
wiped on container recreation. Now the env file is written to
/root/.openclaw/env which sits inside the persistent agent-home mount,
so credentials and model config survive container restarts and upgrades
without reinstalling. The script also accepts values from environment
variables so the deployer can inject them at provision time.
A small module that locates and rewrites the env file inside the
agent-home directory on the host without entering the container. This
is used by the restore path so a freshly issued LiteLLM key and updated
endpoint can be injected into the persistent /root/.openclaw/env without
having to reinstall the framework, which would risk breaking the agent's
installed state.
…ewrite env on restore

Archive now force-stops the container before rename — incus refuses to
rename a running instance, and silently leaving it running produced
orphan containers. If the rename itself fails, the config entry is
left in the live list rather than being moved to archive, keeping the
system consistent. The agent home directory now travels with workspace
and memory into the archive bucket so the full /root is preserved.
On restore, the new host-side env rewrite helper updates
/root/.openclaw/env with the freshly issued LiteLLM key and endpoint
rather than reinstalling, which avoids breaking the installed framework.
Adds a stable local token that is bootstrapped once at startup and
written to a known path with mode 0600. Any request carrying it in an
Authorization: Bearer header is granted full access without a session
cookie, allowing automated agents and local scripts to call the API
without going through the browser-based login flow. The middleware sits
before the session check so it has no impact on normal browser sessions.
Documents the agent-home directory layout and mount strategy so it is
clear what lives inside the container, what is persisted on the host,
and how the env-rewrite helper fits into the restore flow.
notify_task_complete was never called from the image generation route,
leaving the RKNN SD server running indefinitely after requests completed.
The _legacy_generate path (used when the resource scheduler is absent)
now wraps the backend HTTP call in try/finally so notify_task_complete
fires on both success and error paths. Chat and embedding traffic routes
through LiteLLM and does not hit this endpoint; that keep-alive path is
handled separately via a LiteLLM callback.

Tests added for both success and failure paths.
Image-gen backend types were silently skipped in loaded_models, causing
the Activity widget to show "Loaded Models (0)" even when the RKNN SD
server was active. Two new branches added to the probe loop:

- rknn-sd: GET {url}/v1/models (rknn_sd_server.py speaks OpenAI-compat),
  emits one entry per model with purpose=image-generation.
- sd-cpp: GET {url}/sdapi/v1/options, reads sd_model_checkpoint for the
  active checkpoint name, falls back to "unknown" if absent.

Both branches follow the existing ConnectError/Timeout/HTTPError swallow
pattern. Tests cover success, missing-checkpoint fallback, and offline
(connection refused) for both backend types.
Captures go per-agent in the bind-mounted home folder so archive,
restore, backup, and cross-worker migration all work via the existing
"move the home folder" rule. Each agent's .taos/trace/ directory holds
one SQLite bucket per UTC hour (YYYY-MM-DDTHH.db). Bucket routing is
driven by the event's created_at, not wall-clock at write time -- a
14:59:59.999 event routed at 15:00:00.001 lands in the T14 file, so
rollover never drops events.
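
A minimal sketch of routing by the event's created_at rather than the wall clock. The function name is hypothetical, and the dates used with it are illustrative only.

```python
from datetime import datetime, timezone


def bucket_filename(created_at: datetime) -> str:
    """Map an event timestamp to its hourly UTC bucket (YYYY-MM-DDTHH.db)."""
    if created_at.tzinfo is None:
        # Refuse naive datetimes: astimezone() would silently assume local time.
        raise ValueError("created_at must be timezone-aware")
    utc = created_at.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H") + ".db"
```

Because the bucket depends only on the event's own timestamp, an event stamped 14:59:59.999 still resolves to the T14 file even if the write happens after the hour has rolled over.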

Zero-loss: every write lands in the SQLite or is appended to a sibling
YYYY-MM-DDTHH.jsonl. Nothing is ever silently dropped. The librarian
merges both sources at read time.

The envelope is v1 and stable: v, id, trace_id, parent_id, created_at,
agent_name, kind, channel_id, thread_id, backend_name, model,
duration_ms, tokens_in, tokens_out, cost_usd, error, payload. Kinds are
enumerated (message_in/out, llm_call, tool_call/result, reasoning,
error, lifecycle); each has a documented payload shape so consumers
parse without guessing. trace_id + parent_id enable cross-event linkage
for reconstructing a full turn end-to-end.

POST /api/trace writes; GET /api/agents/{name}/trace reads with filter
+ limit. POST /api/lifecycle/notify lets the LiteLLM callback reset the
keep-alive timer for whichever backend served a request.
Registered in generated litellm_config.yaml under
general_settings.custom_callbacks. Runs inside the LiteLLM subprocess
with no access to taOS's Python state, so it authenticates to taOS via
the local token file on disk and posts over HTTP to /api/trace and
/api/lifecycle/notify.

Agent name is derived from the virtual key alias ("taos-<slug>") that
the deployer sets when minting per-agent keys. This is how the per-
agent trace store knows which bucket to route to for a given completion.

Failure-mode is swallow-and-log: a broken callback must never fail a
real LLM request. A litellm-not-installed environment gets a no-op stub
so tests pass without the dep.
…ontainers

app.py: instantiate the registry on the data_dir, include the trace
router, close all connections on shutdown.

deployer.py: inject TAOS_LOCAL_TOKEN (read from data_dir/.auth_local_token
at deploy time) and TAOS_TRACE_URL into the container env. Any in-
container runtime that wants to post traces (or that we later replace
with real openclaw and tap via gateway events) has the credential and
endpoint ready.
…+ archive/trace

Switches env-snippet from host.docker.internal to 127.0.0.1 and explains
incus proxy devices. Drops Docker-only qualifier from workspace/memory status
table (LXC now has parity). Adds Per-agent trace capture, Agent archive/restore,
and Programmatic access (local token) sections. Extends Related and adds a
Related code list pointing at the new modules.
Adds a section distinguishing user memory (long-lived user context) from
per-agent trace capture (event log inside agent-home). Explains how the
taOSmd librarian bridges both layers and links to the trace design in
framework-agnostic-runtime.md.
Step-by-step procedures for archiving a live agent, listing archives,
restoring with slug collision handling and LiteLLM key rotation, and
permanent purge. Covers failure modes including container rename failure,
archive dir collision, and restore container conflicts.
…ttribution

Covers the three-endpoint trace API surface with curl examples, query
filter parameters, envelope field table, kind/payload reference, direct
SQLite access pattern, cost attribution recipe, and librarian consumption
pattern. Links to trace_store.py, routes/trace.py, and litellm_callback.py.
… semantics

DELETE /api/agents/{name} now archives rather than hard-deletes. Updates the
endpoint table to show the archive path and the new purge endpoint.
Primary reference for the real openclaw integration: gateway protocol
breakdown, install + runtime, config schema, extension model, known
limitations, 35-row capability map, and a 4-phase MVP-to-full roadmap.

MVP path is the bridge adapter from the 2026-04-11 framework-integration
-bridge-design spec, not the operator-client (raw v3 WS from taOS). The
operator-client is kept as a documented fallback only, because it
couples taOS to openclaw's gateway protocol version and any upstream
bump can break the fleet; the bridge isolates coupling to a single
~200 LoC patch inside our jaylfc/openclaw fork.

Review-gate refinements baked into Step 1: feature-flag the patch
entry (so an unset TAOS_BRIDGE_URL gives upstream-identical builds),
version-stamp the bootstrap, single coupling discipline, channels.kind
"external" + provider "taos" (upstreamable), 400 LoC patch ceiling,
automated persistence-audit as the trust anchor, parallel upstream PRs,
LiteLLM key rotation caveat (safe-on-restart today, reload RPC later).

Fixes the Debian-bookworm Node 18 gap (install Node 22.14+ via
NodeSource before npm install) and the stale 500MB manifest disk size
(real openclaw is 1-2GB on disk).

Appendix B lists 12 docs.openclaw.ai pages that 404'd at research
time; a follow-up pass using gh api on the repo docs/ tree fills those
gaps.
Resolved 2 of 7 Appendix A open questions from primary source on
github.com/openclaw/openclaw. Struck through 9 of the 12 404'd
docs.openclaw.ai URLs in Appendix B where the repo had a mirror. Added
<!-- source: ... --> comments so future readers know which claims are
primary-sourced.

MVP impact: startup health-check loop updated from ss fallback to
`openclaw health --timeout` (Q5 resolved); gateway.bind: "lan" confirmed
as the correct key for container external binding (Q1 resolved).
Trace files older than 2h are chmod'd 0o400 during eviction so the
librarian's source-of-truth for historic agent activity is tamper-proof
on-disk. Rare late-arriving events (clock skew, deferred processing)
route to a sibling {bucket}.late.jsonl which stays writable -- zero-
loss guarantee preserved even for the extreme edge. list() merges
.db + .jsonl + .late.jsonl with dedup by event id (primary wins).

Sealing runs opportunistically inside _evict_old_buckets; no
background task.
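
The merge-with-dedup step in list() might look like the sketch below. Events are plain dicts with `id` and `created_at` keys as in the envelope; sorting the result by `created_at` is an assumption, as is the function name.

```python
from typing import Any, Dict, Iterable, List

Event = Dict[str, Any]


def merge_events(
    primary: Iterable[Event],
    sidecar: Iterable[Event],
    late: Iterable[Event],
) -> List[Event]:
    """Merge .db, .jsonl and .late.jsonl events, deduplicating by event id."""
    merged: Dict[str, Event] = {}
    # Sources are visited in priority order; setdefault keeps the first
    # copy seen, so a primary (.db) row wins over any duplicate.
    for source in (primary, sidecar, late):
        for event in source:
            merged.setdefault(event["id"], event)
    return sorted(merged.values(), key=lambda e: e["created_at"])
```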
jaylfc added 23 commits April 17, 2026 23:03
…og + model discovery

- LLMProxy accepts database_url; app reads data/.litellm_db_url at boot
  and exports it as DATABASE_URL into the litellm subprocess so
  /key/generate can mint per-agent virtual keys.
- Add Provider fills canonical URL from PROVIDER_URL_DEFAULTS and probes
  {url}/models to populate the model list when empty — generic across
  openai, anthropic, openrouter, kilocode (no per-type branching on the
  probe). Falls back to per-type seed list (kilocode → kilo-auto/free)
  when the probe returns nothing so the entry still registers at least
  one routable model.
- Deployer scopes the minted virtual key to the agent's primary + fallback
  models (models=[req.model, *fallback_models]) instead of defaulting to
  the unrestricted "default" alias.
- Deployer fails loudly when a DB is configured but /key/generate still
  returns None — hiding that class of failure is what shipped the
  broken kilocode path in the first place.
- generate_litellm_config now WARNs when a cloud-type backend is missing
  url or models, so silent drops surface in logs instead of showing up
  as a broken agent much later.
- scripts/repair_providers.py repairs legacy config.yaml entries that
  pre-date the autofill/discovery logic.
Generated LiteLLM configs use os.environ/<name> markers to reference
provider api keys, but nothing was actually exporting those names
into the subprocess env. Cloud providers therefore hit the litellm
OpenAIException "api_key client option must be set" even with a
correctly-configured backend list.

LLMProxy.start/reload_config now accept a secrets={name: value} map.
app.py resolves each backend.api_key_secret from the secrets store at
boot and again on catalog-change reload; routes/providers.py does the
same on add/patch/delete so newly-added or rotated keys take effect
without a full app restart.
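
As a sketch of the fix, the env passed to the LiteLLM subprocess can be built from the secrets map, and the generated config can be scanned for `os.environ/<name>` markers that have no matching secret. Both helper names are hypothetical, and the config shape assumed is LiteLLM's `model_list` with `litellm_params.api_key`.

```python
import os
from typing import Any, Dict, Mapping, Optional, Set

ENV_MARKER = "os.environ/"


def referenced_env_names(config: Mapping[str, Any]) -> Set[str]:
    """Collect the env var names a LiteLLM config refers to via os.environ/<name>."""
    names: Set[str] = set()
    for entry in config.get("model_list", []):
        api_key = entry.get("litellm_params", {}).get("api_key")
        if isinstance(api_key, str) and api_key.startswith(ENV_MARKER):
            names.add(api_key[len(ENV_MARKER):])
    return names


def build_proxy_env(
    secrets: Mapping[str, str],
    database_url: Optional[str] = None,
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return the environment for the proxy subprocess with secrets exported."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(secrets)
    if database_url:
        env["DATABASE_URL"] = database_url
    return env
```

Comparing `referenced_env_names(config)` against the keys of `secrets` before starting the proxy is one way to surface a missing key early instead of as an "api_key client option must be set" error at request time.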
…nfigured

LiteLLM's /key/generate requires a Postgres-backed Prisma schema, but
LiteLLM does not run migrations itself. Fresh installs had to manually
run `pip install prisma && prisma generate && prisma db push` before
virtual keys worked.

New tinyagentos/litellm_migrate.py locates the bundled schema at
litellm/proxy/schema.prisma, probes for LiteLLM_VerificationToken in the
configured DB, and shells out to the venv's prisma CLI only when the
table is missing. Idempotent — safe on every boot. Called from the
lifespan hook before LLMProxy.start() so LiteLLM sees a ready schema.

Added prisma>=0.11.0 to the proxy optional dependency group so the CLI
lands in the venv on fresh installs.
…gration

Running prisma db push from our bootstrap created tables without seeding
_prisma_migrations, so LiteLLM's own prisma migrate deploy at startup
tried to apply migration #1 against an already-populated schema and
looped on "type JobStatus already exists", leaving the proxy unhealthy.

Our helper's only job now is to make prisma.client importable so
LiteLLM can run its shipped migrations itself. Drop the db push and
the psql/psycopg probe; keep the systemd PATH fix for prisma generate.
LiteLLM's proxy_cli shells out ``subprocess.run(["prisma"])`` during
startup to detect whether Prisma is runnable. Under systemd the service's
default PATH doesn't include our venv's bin/, so the lookup raises
FileNotFoundError and LiteLLM prints "prisma package not found" and skips
DB setup entirely — leaving virtual-key issuance broken even though the
package IS installed in the venv.

Prepend the venv bin that already hosts the litellm binary so the child
process resolves ``prisma`` (and ``prisma-client-py`` for generate).

Also bump the startup wait from 30s to 120s: LiteLLM on a fresh Pi DB
runs ``prisma migrate deploy`` before opening its HTTP port, which takes
45-60s on ARM.
stderr=DEVNULL silently swallowed proxy startup failures (prisma
migration errors, config parse errors, model-router failures),
turning "why is the proxy unhealthy?" into a 30-minute debugging
hunt. Write stderr to a file next to litellm_config.yaml so
operators can read it without attaching strace.
Two separate bugs kept LiteLLM from ever settling on the Pi.

1. Startup polling hit ``/health``, which gates on the master key and
   returns 401 for an unauthenticated client. LiteLLM was healthy within
   ~50s but ``start()`` kept polling until the 120s timeout, logged
   "failed to start within 120s", and returned False even though the
   subprocess was fine. ``/health/readiness`` is the public endpoint.

2. ``reload_config`` sent SIGHUP to trigger a config reload. LiteLLM
   runs as single-worker uvicorn (no ``--workers``), which does not
   register a SIGHUP handler, so the default action — terminate — fires.
   Every ``/api/providers/models?refresh=true`` was silently killing
   the proxy, then ``_fetch_litellm_models`` got connection-refused and
   returned []. Drop SIGHUP entirely; the existing stop+start path was
   already the fallback.

Also switch the foreign-process probe to ``/health/readiness`` for the
same 401 reason.
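
A sketch of the startup poll against the unauthenticated readiness endpoint. The function names and the injectable `clock`/`sleep` parameters are assumptions made so the loop can be exercised without a running proxy; only the /health/readiness path and the 120s wait come from this PR.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def readiness_status(base_url: str) -> int:
    """Return the HTTP status of GET {base_url}/health/readiness."""
    try:
        with urllib.request.urlopen(f"{base_url}/health/readiness", timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses (such as a 401) arrive as exceptions.
        return exc.code


def wait_until_ready(
    probe: Callable[[], int],
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns 200 or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe() == 200:
                return True
        except OSError:
            # Connection refused while the proxy is still booting.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass something like `lambda: readiness_status("http://127.0.0.1:4000")` as the probe.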
The TaosLiteLLMCallback running inside the LiteLLM subprocess POSTs
llm_call events back to the taOS bridge at ``/api/trace``, which
requires the local auth token. The callback's token-discovery logic
checks ``TAOS_LOCAL_TOKEN`` env first, then ``/data/.auth_local_token``
and ``~/.taos/.auth_local_token``. Under systemd the real token lives
at ``{data_dir}/.auth_local_token`` — none of the candidate paths — so
every callback fired a POST without Authorization and taOS responded
401, leaving trace rows with no ``llm_call`` events despite LiteLLM
actually processing requests.

Read the token in app.py and forward it via the new ``local_token``
constructor kwarg on LLMProxy, which exports it into the subprocess env.
LiteLLM 1.83.4 surfaces the agent slug in litellm_params.metadata under
user_api_key_metadata.agent (matching what LLMProxy.create_agent_key
writes when minting the virtual key). The previous extraction read
metadata.key_alias which is no longer populated on success events, so
every llm_call trace was bucketed under the _unknown_ sentinel slug.

Walks four sources in priority order:
  1. user_api_key_metadata.agent
  2. user_api_key_auth_metadata.agent
  3. user_api_key_alias (strips the taos- prefix)
  4. key_alias (legacy, kept for older LiteLLM builds)
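
The priority walk could be sketched like this. It assumes all four sources live under kwargs["litellm_params"]["metadata"] and that the legacy key_alias carries the same taos- prefix; the function name is hypothetical.

```python
from typing import Any, Mapping

ALIAS_PREFIX = "taos-"
UNKNOWN_SLUG = "_unknown_"


def extract_agent_slug(kwargs: Mapping[str, Any]) -> str:
    """Resolve the agent slug from LiteLLM callback kwargs, in priority order."""
    metadata = (kwargs.get("litellm_params") or {}).get("metadata") or {}

    # 1 and 2: metadata written when the virtual key was minted.
    for key in ("user_api_key_metadata", "user_api_key_auth_metadata"):
        agent = (metadata.get(key) or {}).get("agent")
        if agent:
            return str(agent)

    # 3 and 4: fall back to the key alias, stripping the taos- prefix.
    for key in ("user_api_key_alias", "key_alias"):
        alias = metadata.get(key)
        if alias:
            alias = str(alias)
            if alias.startswith(ALIAS_PREFIX):
                return alias[len(ALIAS_PREFIX):]
            return alias

    return UNKNOWN_SLUG
```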
enqueue_user_message now writes a message_in trace event under the
agent's slug, following the ENVELOPE_V1_SCHEMA message_in shape
({from, text}) with extra informational fields (message_id,
author_type, delivery).

Guards against orphan _unknown_ or empty-slug entries.
Fails soft: trace write errors are logged, never raised.
…hen container absent (#221)

Failed deploys leave behind a config row with no LXC container, which
caused DELETE /api/agents/{name} to error on snapshot_create. Probe
container_exists first; for orphans, skip stop/snapshot, revoke any
LiteLLM key, and either hard-delete the row (no history) or record a
tombstone (chat/trace present so purge is available from Archived).

Adds container_exists helper to tinyagentos.containers; four new tests
cover the orphan hard-delete, orphan tombstone, skipped-snapshot
assertion, and purge of a snapshotless tombstone.
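
The branch taken by the tolerant DELETE can be summarised as a small decision function. The enum and function names below are hypothetical; the three outcomes follow the description above.

```python
from enum import Enum


class DeleteAction(Enum):
    ARCHIVE = "archive"          # container exists: stop, snapshot, archive
    HARD_DELETE = "hard_delete"  # orphan with no history: drop the config row
    TOMBSTONE = "tombstone"      # orphan with history: keep a purgeable record


def plan_delete(container_exists: bool, has_history: bool) -> DeleteAction:
    """Pick the delete path for an agent row."""
    if container_exists:
        return DeleteAction.ARCHIVE
    # Orphan: no container, so stop and snapshot are skipped entirely.
    if has_history:
        return DeleteAction.TOMBSTONE
    return DeleteAction.HARD_DELETE
```

In every branch the LiteLLM key would still be revoked; only the container and config-row handling differ.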
Adds a GitHub Actions workflow that builds per-arch Debian 13 LXC
base images with Node 22, openclaw, and recycle-bin scaffolding
already installed. Published as assets on the 'rolling-images'
Release tag.

The deployer now checks for the 'taos-openclaw-base' image alias
before launching; when present it uses the cached image and sets
TAOS_BASE_IMAGE_PRESENT=1 so install.sh skips the apt-get + npm
steps. Without the image the deployer falls back transparently
to images:debian/bookworm and install.sh does the full install.

tinyagentos.agent_image exposes is_image_present and
ensure_image_present helpers; the latter runs as a background task
on app startup to bootstrap the image on first boot.

Closes #220
… stdin

The previous impl passed curl.stdout (a Python StreamReader) as
stdin= to the incus subprocess, which asyncio cannot forward as
an OS-level FD. Curl would read the first ~90KB then block on a
pipe nobody was draining. Using an explicit os.pipe() pair with
the read end handed to incus and the write end to curl gives us a
real kernel pipe and the import completes.
Incus 6.x rejects '-' as stdin for image import and rejects bare
HTTPS URLs (expects an incus image server). Download to a temp
file then pass its path. Also fix image list query: positional
<alias> arg (--filter=alias=... is only valid for container list).
…acker

The design doc still referenced models.providers.taos (a custom provider
that was abandoned mid-implementation in favour of openclaw's built-in
litellm provider type). Updated the bootstrap example, the integration
tracker table, and the openclaw.json shape to match what actually ships.
The channels-side "provider: taos" identifier is unchanged; that's the
channel-kind name, separate from the LLM provider.

kilo-code-bot Bot commented Apr 18, 2026

Code Review Summary

Status: 6 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
|----------|-------|
| CRITICAL | 1 |
| WARNING | 5 |
| SUGGESTION | 0 |

Issue Details

CRITICAL

| File | Line | Issue |
|------|------|-------|
| app-catalog/agents/openclaw/scripts/install.sh | 764 | SUID/SGID bit not cleared on rm wrapper. /usr/local/bin/rm is created with mode 755. Since this script shadows the system rm binary, any process running as root that executes rm will invoke this wrapper. If this wrapper ever had vulnerabilities, it would be trivial to escalate privileges across the container. Also note that trash-put is invoked with whatever user called rm, without privilege dropping. |

WARNING

| File | Line | Issue |
|------|------|-------|
| .github/workflows/build-agent-images.yml | 52 | Unsafe iptables default policy change: sudo iptables -P FORWARD ACCEPT changes the system-wide default policy for forwarded traffic to ACCEPT. This disables host network firewalling for all forwarded traffic including other containers and bridge interfaces. All container traffic will be implicitly allowed without any filtering. |
| .github/workflows/build-agent-images.yml | 55 | Dynamic interface masquerade rule: The rule uses `$(ip route get 1.1.1.1 |
| desktop/src/apps/AgentsApp.tsx | 1488 | Missing error handling for archive fetch: fetchArchived() silently catches all exceptions without setting any error state. If this API endpoint returns 4xx/5xx or the network fails, no user feedback is shown; the UI will just never display archived agents. |
| desktop/src/apps/AgentsApp.tsx | 1120 | Unchecked Promise.allSettled values: When fetching disk states, only rejected promises are skipped. Fulfilled promises with null values will be added to the result map as undefined. Accessing properties on these will throw null reference errors when rendered in AgentRow. |
| desktop/src/apps/MessagesApp.tsx | 2267 | Cross-app event handler missing cleanup guard: The taos:open-messages event listener is added in a useEffect but the cleanup function does not check if the component is still mounted before calling state setters. If the component unmounts while the admin prompt fetch is in flight, it will trigger state updates on an unmounted component. |

Files Reviewed (8 files)
  • .github/workflows/build-agent-images.yml - 2 issues
  • .gitignore
  • app-catalog/_common/scripts/recycle-bin-install.sh
  • app-catalog/agents/agent-zero/manifest.yaml
  • app-catalog/agents/openclaw/manifest.yaml
  • app-catalog/agents/openclaw/scripts/install.sh - 1 issue
  • desktop/src/apps/AgentsApp.tsx - 2 issues
  • desktop/src/apps/MessagesApp.tsx - 1 issue


Reviewed by seed-2-0-pro-260328 · 170,139 tokens

@jaylfc jaylfc merged commit 93e7474 into master Apr 18, 2026
7 checks passed
@jaylfc jaylfc deleted the fix/agent-creation-flow branch April 18, 2026 15:07
jaylfc added a commit that referenced this pull request Apr 18, 2026
CRITICAL:
- install.sh: explicit chmod a-s on /usr/local/bin/rm wrapper. Mode 755
  doesn't have SUID set, but a future edit accidentally bumping to 4755
  on a root-shadow rm would be a textbook escalation primitive. Belt-
  and-suspenders against that.

WARNING:
- build-agent-images.yml: stop flipping FORWARD chain default policy to
  ACCEPT (broadly disables runner forward filtering); rely on the
  explicit ACCEPT rules for incusbr0 already present. Capture outbound
  interface once before the masquerade rule so a default-route flip
  mid-job can't redirect the rule to a different NIC.
- AgentsApp.fetchArchived: log non-OK responses, content-type mismatches,
  and exception bodies to console.warn instead of swallowing them
  silently, so a failed fetch is now distinguishable from "no archived
  agents" in DevTools.
- AgentsApp.fetchDiskStates: add explicit nullness + structural checks
  on Promise.allSettled fulfilled values. Existing code worked because
  null is falsy, but the defensive shape check makes future regressions
  loud instead of crashing AgentRow on render.
- MessagesApp open-messages handler: add cancelled flag to the useEffect
  cleanup so admin-prompt fetches resolving after unmount don't trigger
  setState on an unmounted component.

Frontend bundle rebuilt. No behavioural changes for any happy path.
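
The CRITICAL item above, clearing setuid/setgid on the rm wrapper, is done with `chmod a-s` in install.sh. For illustration only, the same bit manipulation can be sketched in Python; `strip_setid` and `clear_setid` are hypothetical helpers, not part of the project.

```python
import os
import stat


def strip_setid(mode: int) -> int:
    """Return mode with the setuid and setgid bits cleared."""
    return mode & ~(stat.S_ISUID | stat.S_ISGID)


def clear_setid(path: str) -> int:
    """Clear setuid/setgid on path, like `chmod a-s`; return the new permission bits."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    cleaned = strip_setid(current)
    if cleaned != current:
        # Only touch the file when a special bit was actually set.
        os.chmod(path, cleaned)
    return cleaned
```

On a file that is already mode 755 this is a no-op, which matches the commit's framing of the change as a guard against a future accidental 4755.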