fix: security hardening — bind, validation, atomic writes, deploy rollback #1

Closed
paralizeer wants to merge 1 commit into jaylfc:master from paralizeer:fix/security-hardening

@paralizeer

Summary

Security and reliability improvements based on a thorough code review. All changes are backwards-compatible and all 127 existing tests pass.

  • Default bind 127.0.0.1 — server, install.sh, and config.yaml all defaulted to 0.0.0.0, exposing the unauthenticated dashboard to the entire network. Changed to localhost-only; users who need network access can explicitly set host: 0.0.0.0 in config
  • Agent name validation — added ^[a-z0-9][a-z0-9-]{0,62}$ regex check on create and deploy endpoints, preventing malformed names from reaching incus CLI
  • Framework allowlist — deploy endpoint now rejects unknown frameworks instead of pip install-ing arbitrary user input inside containers
  • Atomic config writes — save_config() now writes to .yaml.tmp then renames, preventing corruption if the process crashes mid-write
  • Deploy rollback — if any step fails after container creation, the container is automatically destroyed instead of being left half-configured
  • QMD serve hardening — added NoNewPrivileges=true to the systemd unit template
  • Removed hardcoded IPs — Tailscale IPs and LAN addresses removed from committed data/config.yaml
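
The agent name check above can be sketched in a few lines. This is a minimal illustration, not the code from the PR: the pattern is the one quoted in the summary, while the constant and function names are hypothetical.

```python
import re

# Pattern quoted from the PR summary: 1 to 63 characters, lowercase
# alphanumerics and hyphens, and the first character may not be a hyphen.
AGENT_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}$")


def is_valid_agent_name(name: str) -> bool:
    """Return True when name matches the pattern (hypothetical helper)."""
    # fullmatch requires the whole string to match, so a trailing newline
    # cannot slip past the `$` anchor the way it can with re.match.
    return AGENT_NAME_RE.fullmatch(name) is not None
```

An endpoint would call a helper like this before the name reaches any CLI invocation and answer 400 when it returns False.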

Test plan

  • All 127 existing tests pass (pytest tests/ -v)
  • Manual test: deploy an agent with an invalid name → should get 400
  • Manual test: deploy with framework: "malicious-package" → should get 400
  • Manual test: verify server only listens on localhost after fresh install

🤖 Generated with Claude Code

…es, deploy rollback

- Default server bind: 0.0.0.0 → 127.0.0.1 (config.py, install.sh, data/config.yaml)
- Agent name validation: regex ^[a-z0-9][a-z0-9-]{0,62}$ on create/deploy
- Framework allowlist: deploy rejects unknown frameworks instead of pip-installing arbitrary input
- Atomic config writes: write to .tmp then rename to prevent corruption on crash
- Deploy rollback: destroys container on any failure after creation
- QMD serve inside containers: added NoNewPrivileges=true to systemd unit
- Removed hardcoded Tailscale/LAN IPs from committed config.yaml

All 127 existing tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jaylfc
Owner

jaylfc commented Apr 5, 2026

Thanks for the thorough review and solid PR @paralizeer. I've cherry-picked the security fixes I wanted and committed them in 824f8cf.

Merged (with credit):

  • Atomic config writes — write to .yaml.tmp then rename. Clean and prevents corruption.
  • Agent name validation — the regex check is important since names go into incus CLI commands.
  • Deploy rollback — try/except with container cleanup on failure. No more orphaned containers.
  • NoNewPrivileges=true on the qmd serve systemd unit inside containers.
  • QMD serve binds 0.0.0.0 inside container — correct, the host needs to reach it.
  • Removed hardcoded IPs — good catch, these should never have been committed. I've also moved config.yaml to config.yaml.example and gitignored the actual config along with hardware.json and installed.json.
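
For readers unfamiliar with the write-then-rename pattern mentioned under atomic config writes, here is a minimal sketch. It is not the project's save_config(): the function name is hypothetical, and it uses a uniquely named temp file from tempfile.mkstemp rather than the fixed .yaml.tmp suffix the PR describes.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text so readers see either the old file or the new one."""
    path = Path(path)
    # The temp file lives in the same directory so the final rename stays on
    # one filesystem, which is what makes os.replace atomic on POSIX.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, path)
    except BaseException:
        # The original file is untouched; remove the partial temp file.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A crash before os.replace leaves the previous config intact; a crash after it leaves the complete new config.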

Not merged (with reasoning):

  • Default bind 127.0.0.1 — TinyAgentOS is designed to run on LAN devices (SBCs, home servers) where users access the dashboard from another machine on the network. Binding to localhost would break this for most users out of the box. Since there's no auth yet (trusted LAN assumption), the UX tradeoff favours accessibility. Auth is on the roadmap and will address this properly.
  • Hardcoded framework allowlist — instead I validate against the app catalog registry dynamically. This means new frameworks added to the catalog are automatically allowed without code changes.
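
The difference between a hardcoded allowlist and catalog-driven validation can be sketched as follows. The exception class, the function, and the shape of the catalog (an iterable of framework ids) are assumptions made for illustration; the real registry API is not shown in this thread.

```python
from typing import Iterable


class UnknownFrameworkError(ValueError):
    """Raised when a deploy request names a framework the catalog lacks."""


def validate_framework(framework: str, catalog_ids: Iterable[str]) -> str:
    """Accept framework only if the catalog currently lists it."""
    # The allowed set is rebuilt from the catalog on every call, so adding
    # an entry to the catalog needs no code change here.
    allowed = set(catalog_ids)
    if framework not in allowed:
        raise UnknownFrameworkError(f"unknown framework: {framework!r}")
    return framework
```

A deploy endpoint would translate UnknownFrameworkError into a 400 response.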

Closing this PR since the relevant changes have been applied. Appreciate the contribution — the atomic writes and deploy rollback in particular are the kind of reliability fixes that matter on embedded hardware where crashes happen.

@jaylfc jaylfc closed this Apr 5, 2026
jaylfc added a commit that referenced this pull request Apr 11, 2026
Cherry-picked from @paralizeer's PR #1:
- Atomic config writes (write to .tmp then rename)
- Agent name validation (alphanumeric + hyphens, 1-63 chars)
- Deploy rollback on failure (destroys container on error)
- NoNewPrivileges=true on qmd serve systemd unit
- QMD serve binds 0.0.0.0 inside container (host needs access)

Modified from PR:
- Keep 0.0.0.0 as default bind (this is a LAN device, not a public server)
- Dynamic framework validation from catalog registry instead of hardcoded allowlist

Additional fixes:
- Move config.yaml to config.yaml.example (template)
- Gitignore data/config.yaml, hardware.json, installed.json
- Auto-copy example config on first run
- Remove hardcoded Tailscale IPs from deployer defaults
- Remove committed agent configs and backend URLs
jaylfc added a commit that referenced this pull request Apr 18, 2026
…gration

Running prisma db push from our bootstrap created tables without seeding
_prisma_migrations, so LiteLLM's own prisma migrate deploy at startup
tried to apply migration #1 against an already-populated schema and
looped on "type JobStatus already exists", leaving the proxy unhealthy.

Our helper's only job now is to make prisma.client importable so
LiteLLM can run its shipped migrations itself. Drop the db push and
the psql/psycopg probe; keep the systemd PATH fix for prisma generate.
jaylfc added a commit that referenced this pull request Apr 18, 2026
…race, pre-built base image (#225)

* refactor(adapters): defer uvicorn imports so modules load without it

Adapters imported uvicorn at module top, so anything that imported
them for structural checks (tests, health-endpoint probes) would
crash with ModuleNotFoundError when uvicorn wasn't installed.
uvicorn.run is only needed when an adapter is run as a standalone
process — move the import into the __main__ guard.

Clears 19 pre-existing test failures across test_new_adapters.py
and test_channel_hub_new.py.

* feat(containers): add rename_container to the backend abstraction

Required by the new agent archive lifecycle: the delete path stops
the container then renames it to a dated `taos-archived-{slug}-{ts}`
bucket so a later restore can rename it back. Implemented for both
LXC (incus rename) and Docker (docker rename).

* feat(deployer): expanded container deps and manifest-aware framework install

Broken before: `pip3 install openclaw` ran unconditionally, its failure
was logged as a warning and the deploy continued, and the container
came up missing the deps most agent frameworks need.

Now:
- apt install includes nodejs, npm, build-essential, python3-dev,
  ca-certificates, gnupg, wget, with DEBIAN_FRONTEND=noninteractive
  and --no-install-recommends (timeout 15m for slow arm64 apt).
- Framework install dispatches on the manifest's install.method:
  pip uses manifest.install.package, script pushes + runs
  manifest.manifest_dir / install.script. Missing script files,
  unsupported methods, and non-zero exits all raise RuntimeError
  so the outer try/except rolls back the container and the agent
  shows status=failed instead of misleadingly 'running'.
- TAOS_MODEL env var is injected so the in-container runtime knows
  which model to send to LiteLLM.
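
A rough sketch of the dispatch-and-rollback behaviour described above, under stated assumptions: the function names, the callback-style container operations, and the exact command shapes are hypothetical stand-ins, not the deployer's real interface.

```python
from typing import Callable, List, Mapping


def plan_framework_install(install: Mapping[str, str]) -> List[str]:
    """Turn a manifest install section into a command to run in the container."""
    method = install.get("method")
    if method == "pip":
        package = install.get("package")
        if not package:
            raise RuntimeError("pip install method needs install.package")
        return ["pip3", "install", package]
    if method == "script":
        script = install.get("script")
        if not script:
            raise RuntimeError("script install method needs install.script")
        return ["bash", script]  # assumed runner; the real one may differ
    raise RuntimeError(f"unsupported install method: {method!r}")


def deploy_with_rollback(
    install: Mapping[str, str],
    create: Callable[[], None],
    destroy: Callable[[], None],
    run: Callable[[List[str]], None],
) -> str:
    """Create the container, install the framework, roll back on any error."""
    create()
    try:
        run(plan_framework_install(install))
    except Exception:
        destroy()  # never leave a half-configured container behind
        return "failed"
    return "running"
```

Because every failure mode raises, the status can only be 'running' when the install actually completed.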

* feat(litellm): kilocode + openrouter support, per-agent model registration, hot reload

generate_litellm_config now:
- Registers openrouter (openrouter/ prefix, native LiteLLM support)
  and kilocode (openai-compatible, explicit api_base) in the
  backend type maps.
- Expands each cloud backend's declared models into their own
  model_list entries keyed on the real model id, so agents can
  request a specific model. The 'default' alias is still appended
  as a fallback.

routes/providers.py: add/patch/delete now call proxy.reload_config
instead of the stale proxy.write_config, so the running LiteLLM
subprocess actually picks up config changes.

* feat(openclaw): install.sh and in-container agent runtime

The manifest declares method: script -> scripts/install.sh, which
didn't exist. The deployer had no way to install openclaw, so the
agent came up with no runtime and the chat path had nothing to hit.

The new script, run once inside a fresh Debian bookworm LXC:
- Creates /opt/openclaw with a pinned venv (fastapi, uvicorn,
  httpx, openai).
- Writes a minimal FastAPI runtime at /opt/openclaw/server.py that
  listens on 0.0.0.0:8100, accepts POST /message {text, from,
  thread_id?} and forwards to LiteLLM using the injected
  OPENAI_BASE_URL, OPENAI_API_KEY, and TAOS_MODEL env vars.
- Installs a systemd unit so the runtime survives restarts.
- Polls /health up to 20s and fails the install if the server
  didn't come up.

No memory, no tools, no persistence — the host owns all of that.
This is the minimum for the end-to-end chat pipeline to land
messages on an agent and get a reply back.

* feat(agents): uuid identity, persisted fields, DM channel, archive lifecycle

Several related changes to the agents API and config model that
together make agent creation survive the full round trip:

- Every agent gets a stable 12-char uuid (agent['id']), backfilled
  for existing config entries by normalize_agent.
- body.model and body.framework land on the agent row at create
  time; llm_key lands after the background deploy succeeds.
- A 1:1 DM channel is auto-created on successful deploy and its
  id persisted as chat_channel_id so the Messages app sees the
  agent immediately.
- extra_config to deploy_agent now always includes the app
  registry so the manifest-aware framework install can resolve.

Delete is now archive, not destroy. DELETE /api/agents/{name}:
stops the container, renames it to taos-archived-{slug}-{ts},
moves workspace/memory dirs under data_dir/archive/{slug}-{ts}/,
revokes the LiteLLM key, flags the DM channel archived, and moves
the config entry from config.agents to config.archived_agents.

New endpoints:
  GET  /api/agents/archived           -> list archive entries
  POST /api/agents/archived/{id}/restore -> reverses the archive
  DELETE /api/agents/archived/{id}    -> true permanent purge

Restore handles slug collisions (if a new agent has taken the
original name) by suffixing -2, -3, etc. Purge is what the old
hard-delete used to do: destroy container, rm -rf archive dir,
delete chat channel, drop the archived entry.
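
The slug collision rule on restore can be illustrated like this. It is a sketch of the -2, -3 suffixing described above; the function name and the use of a set of taken names are assumptions.

```python
from typing import Set


def resolve_restore_slug(original: str, taken: Set[str]) -> str:
    """Return original if free, else the first free original-N with N >= 2."""
    if original not in taken:
        return original
    n = 2
    while f"{original}-{n}" in taken:
        n += 1
    return f"{original}-{n}"
```

For example, restoring "helper" while a live agent named "helper" exists would yield "helper-2".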

This also fixes 'can't re-create a deleted agent with the same
name' -- the old delete path left the LXC container around; the
new archive path renames it out of the way.

* feat(chat): route user DM messages to in-container agent runtime

User messages in a DM channel now reach the agent's FastAPI runtime
on port 8100 inside its LXC container; the reply is persisted as
an agent-authored message in the same channel and broadcast over
the chat hub so both the webapp and the PWA update in real time.

Wiring:
- AgentChatRouter (new, tinyagentos/agent_chat_router.py):
  fire-and-forget dispatch(message, channel). Skips non-user
  messages, looks up each non-user channel member as an agent,
  skips agents that aren't running (posts a short system reply
  instead), and POSTs to http://{agent.host}:8100/message with
  {text, from, thread_id}. Response content is written back via
  chat_messages.send_message. All errors caught -- broken agents
  don't crash the chat path.
- routes/chat.py: one-line dispatch call at the end of the HTTP
  post_message path and the WebSocket 'message' branch, so both
  entry points route identically.
- app.py: router instantiated in the lifespan after chat_hub.

No subscription plumbing, no retries -- the router is a direct
adapter between two owned stores. Timeouts and connect errors
become visible agent replies so the user sees what went wrong.

* feat(agents-ui): Archived section with Restore and Delete Permanently

Adds a collapsible 'Archived' panel below the live agents list in
AgentsApp. Shows each archived entry's display name, model, and
relative archive time; per-row Restore and Delete Permanently
buttons call the new backend endpoints with confirmations.

- parseArchiveTimestamp / relativeTimeFromTs helpers convert the
  YYYYMMDDTHHMMSS format the backend writes.
- ArchivedAgentsPanel is inlined (matches AgentRow / DeployWizard
  living in the same file) and self-hides when there are no
  archived entries.
- handleDelete's confirm copy now mentions archiving so users
  know it's recoverable.
- fetchArchived is called alongside fetchAgents at every existing
  refresh point.

Unit tests for the new helpers under
desktop/src/apps/__tests__/AgentsApp.archived.test.tsx.

* build: rebuild desktop bundle with Archived section and agent wiring

* feat(containers): stop --force flag + add_proxy_device across backends

Incus refuses to rename a running container, so every archive call now
has a hard dependency on the container being stopped first. The new
stop-force path sends --force (LXC) or kill (Docker) so archive can
guarantee the container is down before it attempts the rename. The
add_proxy_device method is added to the abstract base, LXC backend, and
Docker stub so the deployer can attach incus proxy devices for host-side
port forwarding when setting up an agent home.

* feat(deployer): agent-home mount, incus proxy devices, taos_host=127.0.0.1

Agents now get a dedicated home directory mounted at /root inside the
container so their runtime state (env file, model config, logs) persists
across container recreation. Proxy devices are attached via the new
add_proxy_device backend method so the host-side LiteLLM process can
reach the in-container agent port. The taos_host default is hardened to
127.0.0.1 so freshly deployed agents always resolve back to the host
loopback rather than relying on a potentially incorrect network variable.

* fix(litellm): is_running honours adopted instances so deployer mints keys

Previously is_running only checked the subprocess handle, which is None
for processes the deployer did not start itself (adopted instances). The
method now also checks the _adopted flag so that a pre-existing LiteLLM
process is correctly reported as running and a fresh API key is minted
rather than the deployer trying to start a second instance. The companion
reload_config path also skips process management when adopted.

* feat(openclaw): write env to /root/.openclaw/env under mounted home

The install script previously wrote configuration to a path that was
wiped on container recreation. Now the env file is written to
/root/.openclaw/env which sits inside the persistent agent-home mount,
so credentials and model config survive container restarts and upgrades
without reinstalling. The script also accepts values from environment
variables so the deployer can inject them at provision time.

* feat(agent-env): host-side helper to rewrite the agent env file in place

A small module that locates and rewrites the env file inside the
agent-home directory on the host without entering the container. This
is used by the restore path so a freshly issued LiteLLM key and updated
endpoint can be injected into the persistent /root/.openclaw/env without
having to reinstall the framework, which would risk breaking the agent's
installed state.

* feat(archive): force-stop, abort on rename failure, carry home dir, rewrite env on restore

Archive now force-stops the container before rename — incus refuses to
rename a running instance, and silently leaving it running produced
orphan containers. If the rename itself fails, the config entry is
left in the live list rather than being moved to archive, keeping the
system consistent. The agent home directory now travels with workspace
and memory into the archive bucket so the full /root is preserved.
On restore, the new host-side env rewrite helper updates
/root/.openclaw/env with the freshly issued LiteLLM key and endpoint
rather than reinstalling, which avoids breaking the installed framework.

* feat(auth): local-token Bearer auth for programmatic access

Adds a stable local token that is bootstrapped once at startup and
written to a known path with mode 0600. Any request carrying it in an
Authorization: Bearer header is granted full access without a session
cookie, allowing automated agents and local scripts to call the API
without going through the browser-based login flow. The middleware sits
before the session check so it has no impact on normal browser sessions.
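
As a rough illustration of the local-token flow, the sketch below bootstraps a token file with mode 0600 and checks an Authorization header against it. The function names are hypothetical, POSIX file permissions are assumed, and the real middleware wiring is not shown.

```python
import hmac
import os
import secrets
from pathlib import Path
from typing import Optional


def bootstrap_local_token(path: Path) -> str:
    """Return the token stored at path, creating it with mode 0600 if absent."""
    try:
        # O_EXCL makes creation race-free; the mode is applied at creation.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return Path(path).read_text(encoding="utf-8").strip()
    token = secrets.token_urlsafe(32)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(token)
    return token


def is_authorized(authorization_header: Optional[str], token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' header value."""
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.strip().encode(), token.encode())
```

A middleware placed before the session check would call is_authorized first and fall through to the cookie path when it returns False.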

* docs: per-agent home mount in framework-agnostic-runtime

Documents the agent-home directory layout and mount strategy so it is
clear what lives inside the container, what is persisted on the host,
and how the env-rewrite helper fits into the restore flow.

* fix(lifecycle): arm keep-alive timer on image generation

notify_task_complete was never called from the image generation route,
leaving the RKNN SD server running indefinitely after requests completed.
The _legacy_generate path (used when the resource scheduler is absent)
now wraps the backend HTTP call in try/finally so notify_task_complete
fires on both success and error paths. Chat and embedding traffic routes
through LiteLLM and does not hit this endpoint; that keep-alive path is
handled separately via a LiteLLM callback.

Tests added for both success and failure paths.

* feat(models): /api/models/loaded probes rknn-sd + sd-cpp backends

Image-gen backend types were silently skipped in loaded_models, causing
the Activity widget to show "Loaded Models (0)" even when the RKNN SD
server was active. Two new branches added to the probe loop:

- rknn-sd: GET {url}/v1/models (rknn_sd_server.py speaks OpenAI-compat),
  emits one entry per model with purpose=image-generation.
- sd-cpp: GET {url}/sdapi/v1/options, reads sd_model_checkpoint for the
  active checkpoint name, falls back to "unknown" if absent.

Both branches follow the existing ConnectError/Timeout/HTTPError swallow
pattern. Tests cover success, missing-checkpoint fallback, and offline
(connection refused) for both backend types.

* feat(trace): per-agent hourly-bucketed trace store for the librarian

Captures go per-agent in the bind-mounted home folder so archive,
restore, backup, and cross-worker migration all work via the existing
"move the home folder" rule. Each agent's .taos/trace/ directory holds
one SQLite bucket per UTC hour (YYYY-MM-DDTHH.db). Bucket routing is
driven by the event's created_at, not wall-clock at write time -- a
14:59:59.999 event routed at 15:00:00.001 lands in the T14 file, so
rollover never drops events.
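
The bucket routing rule is easy to get wrong, so here is a small sketch. It assumes created_at is a Unix timestamp in seconds, which the commit message does not state, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def bucket_filename(created_at: float) -> str:
    """Map an event's created_at (assumed Unix seconds) to its hourly bucket."""
    # Route on the event's own timestamp, never on the clock at write time.
    ts = datetime.fromtimestamp(created_at, tz=timezone.utc)
    return ts.strftime("%Y-%m-%dT%H") + ".db"
```

An event stamped 14:59:59.999 UTC maps to the T14 file no matter when the write happens, which is the rollover property described above.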

Zero-loss: every write lands in the SQLite or is appended to a sibling
YYYY-MM-DDTHH.jsonl. Nothing is ever silently dropped. The librarian
merges both sources at read time.

The envelope is v1 and stable: v, id, trace_id, parent_id, created_at,
agent_name, kind, channel_id, thread_id, backend_name, model,
duration_ms, tokens_in, tokens_out, cost_usd, error, payload. Kinds are
enumerated (message_in/out, llm_call, tool_call/result, reasoning,
error, lifecycle); each has a documented payload shape so consumers
parse without guessing. trace_id + parent_id enable cross-event linkage
for reconstructing a full turn end-to-end.

POST /api/trace writes; GET /api/agents/{name}/trace reads with filter
+ limit. POST /api/lifecycle/notify lets the LiteLLM callback reset the
keep-alive timer for whichever backend served a request.

* feat(litellm): CustomLogger callback posts llm_call traces + keep-alive

Registered in generated litellm_config.yaml under
general_settings.custom_callbacks. Runs inside the LiteLLM subprocess
with no access to taOS's Python state, so it authenticates to taOS via
the local token file on disk and posts over HTTP to /api/trace and
/api/lifecycle/notify.

Agent name is derived from the virtual key alias ("taos-<slug>") that
the deployer sets when minting per-agent keys. This is how the per-
agent trace store knows which bucket to route to for a given completion.
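
Deriving the agent name from the key alias might look like the sketch below. Only the "taos-<slug>" convention comes from the commit message; the function name and the choice to return None for non-matching aliases are assumptions.

```python
from typing import Optional

KEY_ALIAS_PREFIX = "taos-"


def agent_slug_from_alias(key_alias: Optional[str]) -> Optional[str]:
    """Return the agent slug for a 'taos-<slug>' alias, else None."""
    if not key_alias or not key_alias.startswith(KEY_ALIAS_PREFIX):
        return None
    slug = key_alias[len(KEY_ALIAS_PREFIX):]
    return slug or None  # a bare "taos-" carries no agent name
```

Returning None rather than raising fits the swallow-and-log failure mode: a request made with an unrelated key is simply not traced to any agent.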

Failure-mode is swallow-and-log: a broken callback must never fail a
real LLM request. A litellm-not-installed environment gets a no-op stub
so tests pass without the dep.

* feat(trace): wire TraceStoreRegistry in lifespan + pass auth env to containers

app.py: instantiate the registry on the data_dir, include the trace
router, close all connections on shutdown.

deployer.py: inject TAOS_LOCAL_TOKEN (read from data_dir/.auth_local_token
at deploy time) and TAOS_TRACE_URL into the container env. Any in-
container runtime that wants to post traces (or that we later replace
with real openclaw and tap via gateway events) has the credential and
endpoint ready.

* docs(design): update framework-agnostic-runtime for proxy networking + archive/trace

Switches env-snippet from host.docker.internal to 127.0.0.1 and explains
incus proxy devices. Drops Docker-only qualifier from workspace/memory status
table (LXC now has parity). Adds Per-agent trace capture, Agent archive/restore,
and Programmatic access (local token) sections. Extends Related and adds a
Related code list pointing at the new modules.

* docs(design): cross-reference per-agent trace layer from user-memory

Adds a section distinguishing user memory (long-lived user context) from
per-agent trace capture (event log inside agent-home). Explains how the
taOSmd librarian bridges both layers and links to the trace design in
framework-agnostic-runtime.md.

* docs(runbook): add agent archive, restore, and purge runbook

Step-by-step procedures for archiving a live agent, listing archives,
restoring with slug collision handling and LiteLLM key rotation, and
permanent purge. Covers failure modes including container rename failure,
archive dir collision, and restore container conflicts.

* docs(runbook): add trace querying runbook for API, SQLite, and cost attribution

Covers the three-endpoint trace API surface with curl examples, query
filter parameters, envelope field table, kind/payload reference, direct
SQLite access pattern, cost attribution recipe, and librarian consumption
pattern. Links to trace_store.py, routes/trace.py, and litellm_callback.py.

* docs(design): update plan-agent-deployer API table to reflect archive semantics

DELETE /api/agents/{name} now archives rather than hard-deletes. Updates the
endpoint table to show the archive path and the new purge endpoint.

* docs(design): openclaw-integration.md, bridge adapter as MVP

Primary reference for the real openclaw integration: gateway protocol
breakdown, install + runtime, config schema, extension model, known
limitations, 35-row capability map, and a 4-phase MVP-to-full roadmap.

MVP path is the bridge adapter from the 2026-04-11 framework-integration
-bridge-design spec, not the operator-client (raw v3 WS from taOS). The
operator-client is kept as a documented fallback only, because it
couples taOS to openclaw's gateway protocol version and any upstream
bump can break the fleet; the bridge isolates coupling to a single
~200 LoC patch inside our jaylfc/openclaw fork.

Review-gate refinements baked into Step 1: feature-flag the patch
entry (so an unset TAOS_BRIDGE_URL gives upstream-identical builds),
version-stamp the bootstrap, single coupling discipline, channels.kind
"external" + provider "taos" (upstreamable), 400 LoC patch ceiling,
automated persistence-audit as the trust anchor, parallel upstream PRs,
LiteLLM key rotation caveat (safe-on-restart today, reload RPC later).

Fixes the Debian-bookworm Node 18 gap (install Node 22.14+ via
NodeSource before npm install) and the stale 500MB manifest disk size
(real openclaw is 1-2GB on disk).

Appendix B lists 12 docs.openclaw.ai pages that 404'd at research
time; a follow-up pass using gh api on the repo docs/ tree fills those
gaps.

* docs(design): fill openclaw-integration gaps via gh api on the repo

Resolved 2 of 7 Appendix A open questions from primary source on
github.com/openclaw/openclaw. Struck through 9 of the 12 404'd
docs.openclaw.ai URLs in Appendix B where the repo had a mirror. Added
<!-- source: ... --> comments so future readers know which claims are
primary-sourced.

MVP impact: startup health-check loop updated from ss fallback to
`openclaw health --timeout` (Q5 resolved); gateway.bind: "lan" confirmed
as the correct key for container external binding (Q1 resolved).

* feat(trace): seal historic bucket files read-only after 2h rollover

Trace files older than 2h are chmod'd 0o400 during eviction so the
librarian's source-of-truth for historic agent activity is tamper-proof
on-disk. Rare late-arriving events (clock skew, deferred processing)
route to a sibling {bucket}.late.jsonl which stays writable -- zero-
loss guarantee preserved even for the extreme edge. list() merges
.db + .jsonl + .late.jsonl with dedup by event id (primary wins).

Sealing runs opportunistically inside _evict_old_buckets; no
background task.
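
The merge-with-dedup rule for list() can be sketched as follows. Events are modelled as plain dicts with id and created_at keys; the function name and the final sort by created_at are assumptions, not taken from the commit.

```python
from typing import Dict, Iterable, List


def merge_events(
    primary: Iterable[dict], fallback: Iterable[dict], late: Iterable[dict]
) -> List[dict]:
    """Merge .db, .jsonl and .late.jsonl events, deduplicated by event id."""
    merged: Dict[str, dict] = {}
    # Sources are visited in priority order; setdefault keeps the first
    # copy seen, so the primary (.db) version wins on duplicate ids.
    for source in (primary, fallback, late):
        for event in source:
            merged.setdefault(event["id"], event)
    return sorted(merged.values(), key=lambda e: e["created_at"])
```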

* feat(openclaw): bridge endpoints — bootstrap, SSE events, reply ingestion

Three endpoints the openclaw fork patch (src/taos-bridge.ts) calls:

  GET  /api/openclaw/bootstrap           config snapshot at startup
  GET  /api/openclaw/sessions/{a}/events SSE stream of user messages
  POST /api/openclaw/sessions/{a}/reply  deltas, final, tool events, errors

BridgeSessionRegistry holds one queue per agent; chat router enqueues
user messages, openclaw subscribes and reads them, replies flow back
through /reply which writes to the per-agent trace store and broadcasts
via the chat hub (message_delta for streaming, edit_message+state
complete for final).
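
A stripped-down picture of the one-queue-per-agent registry, using asyncio.Queue. Only the class name and enqueue_user_message appear in the commit messages; next_message and the dict-shaped message are hypothetical, and the SSE streaming layer is left out.

```python
import asyncio
from typing import Dict


class BridgeSessionRegistry:
    """Holds one FIFO queue of pending user messages per agent."""

    def __init__(self) -> None:
        self._queues: Dict[str, asyncio.Queue] = {}

    def _queue_for(self, agent: str) -> asyncio.Queue:
        # Queues are created lazily the first time an agent is seen.
        if agent not in self._queues:
            self._queues[agent] = asyncio.Queue()
        return self._queues[agent]

    def enqueue_user_message(self, agent: str, message: dict) -> None:
        """Called by the chat router when a user posts in the agent's DM."""
        self._queue_for(agent).put_nowait(message)

    async def next_message(self, agent: str) -> dict:
        """Awaited by the events stream; waits until a message arrives."""
        return await self._queue_for(agent).get()
```

Keeping the queues keyed by agent name means a slow or disconnected agent only delays its own messages.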

Bearer local-token auth on all three endpoints.

* feat(openclaw): install.sh uses real openclaw from jaylfc fork + Node 22

Replace Python FastAPI stub with real openclaw npm install from the
jaylfc/openclaw fork (taos-fork branch). Installs Node 22.x via NodeSource
since Debian bookworm ships Node 18. Bumps manifest disk_mb to 2000 to
accommodate the Node runtime. Pinned to upstream main SHA be7a415eb096.

* feat(deployer): write openclaw.json + .openclaw/env into agent-home at deploy

Before create_container runs, deployer writes the openclaw gateway config
and the bridge env file into the host-side agent-home directory. The
agent-home bind-mount carries them into /root/.openclaw/ inside the
container, where the systemd unit created by install.sh picks them up.

session_id == req.name (slug) for MVP; bridge endpoints already key on
agent name so no separate UUID is needed at this stage.

* refactor(chat): agent_chat_router enqueues to bridge session, not HTTP POST

Obsolete: raw HTTP POST to container :8100/message was the Python-stub
integration. Real openclaw talks to taOS via the bridge adapter: taOS
owns the SSE stream openclaw's fork patch subscribes to, and replies
come back through POST /api/openclaw/sessions/{agent}/reply (which
broadcasts to the chat hub and writes traces).

So the router shrinks to: on a user message in an agent's DM channel,
call registry.enqueue_user_message(slug, msg). That's it. All reply
plumbing lives in routes/openclaw.py.

* fix(deployer): use bind=instance for incus proxy devices to avoid host port conflict

When litellm is already running on 127.0.0.1:4000 on the host, adding
an incus proxy device with the default bind_mode tries to re-bind that
port on the host and fails with EADDRINUSE.  Setting bind=instance
makes incus bind the listen address inside the container instead, so
host services that already own the port are not disturbed.

* fix(install): chown npm cache before global install to fix EACCES in fresh containers

Debian's apt-installed nodejs leaves /root/.npm with mixed ownership,
causing npm install -g to fail with errno -13.  Fixing ownership before
the install is the documented npm fix for this condition.

* fix(install): remove stale npm cache before global install

Debian's apt npm (v8) creates /root/.npm with state that blocks the
newer npm shipped with Node 22.  rm -rf before the install is more
reliable than chown since the issue is the cache format, not just ownership.

* fix(install): pre-create npm cache dir and use --unsafe-perm for root installs

The Debian npm post-install creates /root/.npm with problematic ownership.
rm + mkdir ensures a clean dir; --unsafe-perm suppresses the root-cache
check in older npm versions that remain in the system PATH during install.

* fix(deployer): remap container root to host process uid via raw.idmap

Agent-home directories are owned by the taOS process user (uid 1000).
Without a UID mapping, incus containers run as an offset uid (100000+)
that cannot write to those host dirs, causing npm and other tools to
fail with Permission Denied when creating files under /root.

Setting raw.idmap 'both <host_uid> 0' before attaching mounts maps
container root to the host process owner so bind-mounted dirs are
writable.  Requires one stop/start cycle after setting the idmap.
Revert the earlier npm --unsafe-perm workaround; it was masking this.

* fix(install): use HTTPS URL for npm install instead of github: shorthand

The github: prefix makes npm resolve to git+ssh://... which requires
GitHub SSH keys that fresh containers do not have, causing hangs.
Using git+https:// avoids the SSH key requirement.

* fix(install): use tarball URL instead of git+https to avoid SSH fallback

npm's git+https:// handling still falls back to SSH for github.com repos.
Using the tarball URL (https://github.com/.../tarball/<branch>) downloads
a plain HTTPS tarball and avoids git transport entirely.

* feat(host-firewall): install systemd one-shot to ACCEPT incusbr0 through docker DROP

Docker installed on the same host as incus sets iptables FORWARD
policy to DROP and adds its own ACCEPT rules only for docker bridges.
Incus-created containers (taOS agents) fall through to the default
DROP for TCP sessions Docker's chains don't claim -- causing symptoms
like github.com unreachable from inside containers while npmjs.org
(cached via Cloudflare) works.

The Docker-blessed fix is to insert ACCEPT rules into the DOCKER-USER
chain. A one-shot systemd unit does this at boot, idempotently. The
fix is reversible (ExecStop removes the rules). install.sh on the Pi
drops the scripts into /opt/tinyagentos/scripts/ and enables the unit
before tinyagentos.service comes up, so containers have working
networking on first boot.

* feat(host-firewall): path unit, timer, subnet probe, connectivity check

Hardens the existing host-firewall oneshot against the realistic install
matrix with Docker in the picture:

  - tinyagentos-host-firewall.path: fires the oneshot whenever
    /var/run/docker.pid appears, so a user who apt-installs Docker
    after taOS gets their rules reapplied without rebooting.
  - tinyagentos-host-firewall.timer: 5-minute re-assertion
    (belt-and-braces for any Docker chain churn we didn't expect).
  - scripts/incus-bridge-probe.sh: detects incusbr0 IPv4 collisions
    with pre-existing bridges and reassigns to a free RFC1918 /24
    at install time; idempotent no-op on clean hosts.
  - install.sh: calls the probe after incus init, then runs a
    throwaway ephemeral container that curls github.com and
    registry.npmjs.org as a post-install connectivity smoke test.
    Warns but does not block on failure so users can diagnose and
    retry.
  - host-firewall-up.sh gains a --check mode (exit 1 when rules are
    missing, used by the timer for visibility) and skips cleanly on
    hosts without incus.
  - host-firewall.service now uses ConditionPathExistsGlob so
    iptables-nft-only systems skip instead of failing.
  - detect_runtime() in containers/backend.py logs the selection
    + alternatives every call and makes the LXC-preferred policy
    explicit in its docstring.
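
The subnet collision check in the probe can be illustrated in Python with the standard ipaddress module. This is not a port of scripts/incus-bridge-probe.sh, which is a shell script whose contents are not shown here; the function name and the candidate list are assumptions.

```python
import ipaddress
from typing import Iterable, Optional


def pick_free_subnet(
    in_use: Iterable[str], candidates: Iterable[str]
) -> Optional[str]:
    """Return the first candidate /24 that overlaps no network already in use."""
    # strict=False lets interface-style addresses such as 172.17.0.1/16
    # be read as the network they belong to.
    used = [ipaddress.ip_network(n, strict=False) for n in in_use]
    for cand in candidates:
        net = ipaddress.ip_network(cand)
        if not any(net.overlaps(u) for u in used):
            return str(net)
    return None
```

On a clean host the bridge's current subnet would already be free, which corresponds to the idempotent no-op case.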

Policy: LXC is the preferred agent runtime; Docker coexists for the
app store's containerised services but never takes precedence over
LXC for agents.

* docs(coexistence): LXC / Docker coexistence policy + runbook

Companion to the host-firewall systemd unit / path / timer.
Documents:

  - Why LXC is the preferred agent runtime and Docker coexists
    without handicapping it.
  - Clash matrix (iptables FORWARD, subnet, chain re-ordering,
    first-boot race, host port, runtime install ordering, cgroups).
  - Install scenarios -- what happens end-to-end for fresh Debian,
    Docker-before-taOS, taOS-then-Docker, Docker restart.
  - Operational runbook: diagnosing a silent network failure,
    adding / removing Docker from a running taOS host, smoke
    testing coexistence.
  - Runtime selection policy: detect_runtime() prefers LXC,
    Docker path exists only as fallback.

framework-agnostic-runtime.md's existing Host firewall subsection
links to this doc for the full story.

* chore(gitignore): exclude AI-assistant artefacts + new superpowers plans/specs

Keeps the repo looking fully human-authored per project policy. CLAUDE.md,
GEMINI.md, AGENTS.md, OPENHANDS.md, .claude/, .aider*, .continue/,
.cursorrules, .cursor/, .copilot-instructions.md, .windsurf/,
.playwright-mcp/, and new docs/superpowers/plans/ + specs/ now stay local.

Existing tracked files under docs/superpowers/ remain tracked (changing
that would rewrite history); only new additions are ignored.

* fix(host-firewall): correct systemd condition syntax + remove path-unit cycle

- tinyagentos-host-firewall.service used a single-line space-separated
  ConditionPathExistsGlob which systemd evaluated to no-match, silently
  skipping the unit. Split into two ConditionPathExists= directives with
  the "|" OR-prefix so the unit runs when iptables exists at either path.

- tinyagentos-host-firewall.path declared
  After=tinyagentos-host-firewall.service alongside the implicit Wants=
  from [Path].Unit=, producing an
  ordering cycle that systemd rejected. Removed the After= — the path
  unit activates the service on path-change; systemd handles ordering.

* feat(messages): archived channels section + dead-agent grey-out

- Add ArchivedChannel type extensions (settings with archived_at, archived_agent_id)
- Fetch /api/chat/channels?archived=true alongside live channels on init
- Fetch /api/agents and /api/agents/archived for author resolution
- Collapsible "Archived" section in sidebar (desktop + mobile) with per-row
  Restore (RotateCcw) and Delete Permanently (Trash2) hover actions
- Restore calls POST /api/agents/archived/{id}/restore; disabled with tooltip
  when archived agent entry is missing
- Delete calls DELETE /api/chat/channels/{id} with confirmation
- Opening archived channel shows full message history
- Archived banner above composer; input + send disabled for archived channels
- resolveAuthorDisplayState() pure helper: maps author_id to active/archived/removed
- Greyed author names (opacity 0.55, strikethrough) with tooltip for dead agents;
  message body remains fully readable at reduced opacity, with no body strikethrough
- Accessibility: aria-expanded/aria-controls on collapsible, aria-label on all buttons

* build: rebuild desktop bundle with archived-chats UI

* feat(chat): archived filter on /api/chat/channels + ensure_message helper

channel_store.list_channels() gains an `archived: bool | None` param —
None = no filter (existing callers unchanged), True = only channels with
settings.archived truthy, False = only channels where it's falsy.
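
The tri-state semantics of the `archived` parameter can be sketched as below. This is a minimal illustration, not the actual channel_store code: the in-memory channel dicts and the `filter_channels` helper are assumptions made for the example.

```python
from typing import Optional


def filter_channels(channels: list[dict], archived: Optional[bool] = None) -> list[dict]:
    """Filter channel dicts by the truthiness of settings["archived"].

    archived=None  -> no filter, every channel is returned
    archived=True  -> only channels whose settings.archived is truthy
    archived=False -> only channels whose settings.archived is falsy or absent
    """
    if archived is None:
        return list(channels)
    return [
        ch for ch in channels
        if bool((ch.get("settings") or {}).get("archived")) == archived
    ]


# Hypothetical sample data: one archived channel, two live ones.
channels = [
    {"id": "c1", "settings": {"archived": True}},
    {"id": "c2", "settings": {}},
    {"id": "c3"},
]
live_ids = [c["id"] for c in filter_channels(channels, archived=False)]
```

Keeping `None` as the default is what lets existing callers stay unchanged.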

MessagesStore.ensure_message(msg) idempotently re-inserts a message by
id (INSERT OR IGNORE). Used by the restore path when re-importing a
chat-export.jsonl; callers can retry safely.
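
A rough sketch of the idempotent insert, assuming a SQLite table keyed on the message id. The column list here is hypothetical and much smaller than a real message row; only the INSERT OR IGNORE technique is taken from the description above.

```python
import sqlite3


def ensure_message(conn: sqlite3.Connection, msg: dict) -> bool:
    """Insert msg unless a row with the same id already exists.

    Returns True when a row was inserted, False when it was ignored.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO chat_messages (id, channel_id, body) VALUES (?, ?, ?)",
        (msg["id"], msg["channel_id"], msg["body"]),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
# Hypothetical schema: the primary key on id is what makes the retry safe.
conn.execute("CREATE TABLE chat_messages (id TEXT PRIMARY KEY, channel_id TEXT, body TEXT)")
msg = {"id": "m1", "channel_id": "c1", "body": "hello"}
first = ensure_message(conn, msg)
second = ensure_message(conn, msg)  # retry of the same message is a no-op
row_count = conn.execute("SELECT COUNT(*) FROM chat_messages").fetchone()[0]
```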

routes/chat.py::list_channels forwards the new query param through.

* feat(archive): export chat to agent-home + reimport on restore + purge channels

At _archive_agent_fully: iterate DM channels the agent is a member of,
dump every message to {agent-home}/{slug}/.taos/chat-export.jsonl (one
envelope per line, 0o600). Channel settings flagged with
{archived: true, archived_at, archived_agent_id, archived_agent_slug}.
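
The export step might look roughly like this. It is a sketch under assumptions: the `export_chat` helper and the message fields are invented for illustration, and only the one-envelope-per-line layout and the 0o600 mode come from the description above.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def export_chat(messages: list[dict], export_path: Path) -> None:
    """Write one JSON envelope per line to a file readable only by its owner."""
    export_path.parent.mkdir(parents=True, exist_ok=True)
    fd = os.open(export_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        for msg in messages:
            fh.write(json.dumps(msg, ensure_ascii=False) + "\n")
    # os.open applies the mode only on creation, so enforce it for existing files too.
    os.chmod(export_path, 0o600)


# Demonstration in a throwaway directory standing in for {agent-home}/{slug}.
tmp = Path(tempfile.mkdtemp())
path = tmp / "agent-home" / "demo" / ".taos" / "chat-export.jsonl"
export_chat([{"id": "m1", "text": "hi"}, {"id": "m2", "text": "bye"}], path)
lines = path.read_text(encoding="utf-8").splitlines()
mode = stat.S_IMODE(path.stat().st_mode)
```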

At restore_archived_agent: if a chat-export exists, stream it back into
chat_messages via ensure_message() (idempotent by id); unflag every
channel where archived_agent_id matches this archive.

At purge_archived_agent: delete every channel's messages + the channels
themselves for channels flagged with this archive_id, then rm -rf the
archive bucket as before. Irreversible.

Chat history now travels with the agent's home folder — archive /
backup / restore / cross-worker migration all carry it automatically.

* docs(design): accept architecture pivot v2 — 10 decisions resolved

Turning the §10 open questions into §10 resolved decisions:
  1. btrfs storage pool (portable across the mixed-arch cluster via
     each host's Linux layer).
  2. archive.target configurable, default pool:
  3. chat history in both tarball + global DB (already shipped).
  4. snapshot preferred, rsync fallback.
  5. Garage primary S3 NAS + optional FUSE POSIX mount per agent.
  6. Recycle-bin Layer 1 + Layer 3 only (skip libtrash LD_PRELOAD
     in Phase 1).
  7. Forgejo (not Gitea).
  8. taos-archive-<ts> snapshot prefix; auto-snapshots untouched.
  9. Garage sled metadata backend (cross-arch portable).
  10. 40 GiB default per-agent quota; per-agent override.

Status flipped from Proposal to Accepted. Phase 1 (disk quota +
recycle bin) is the next code to land.

* feat(disk-quota): host-side monitor + resize API + threshold notifications

Adds DiskQuotaMonitor with btrfs/incus/df sampling priority chain,
threshold-transition notifications (ok/warn/hard), hard-threshold agent
pausing, and live quota resize via incus. HTTP surface at
GET /api/agents/{name}/disk, POST /api/agents/{name}/quota,
POST /api/disk-quota/scan. NotificationStore gains disk_quota event type.
31 new tests cover edges, transitions, pause behaviour, and routes.
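
Notifying on threshold transitions rather than on every sample can be sketched as follows. The 80% and 95% cut-offs and the `TransitionTracker` class are assumptions for illustration; the actual thresholds used by DiskQuotaMonitor are not stated here.

```python
def classify(used_bytes: int, quota_bytes: int, warn: float = 0.80, hard: float = 0.95) -> str:
    """Map a usage sample to ok / warn / hard (cut-offs are assumed values)."""
    ratio = used_bytes / quota_bytes
    if ratio >= hard:
        return "hard"
    if ratio >= warn:
        return "warn"
    return "ok"


class TransitionTracker:
    """Remember each agent's last state and report only changes."""

    def __init__(self) -> None:
        self._last: dict[str, str] = {}

    def observe(self, agent: str, state: str):
        """Return (previous, current) on a state change, otherwise None."""
        previous = self._last.get(agent, "ok")
        self._last[agent] = state
        return (previous, state) if previous != state else None


GIB = 1024 ** 3
tracker = TransitionTracker()
samples = [10, 33, 34, 39, 12]  # GiB used against a 40 GiB quota
events = [tracker.observe("demo", classify(s * GIB, 40 * GIB)) for s in samples]
```

Only three of the five samples produce an event, so a periodic scan does not repeat the same warning every cycle.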

* feat(disk-quota): systemd timer + install.sh wiring

Adds tinyagentos-disk-quota.service (oneshot calling disk-quota-scan.sh)
and tinyagentos-disk-quota.timer (every 5 min, starts 2 min after boot).
install.sh installs the script to /opt/tinyagentos/scripts/ and enables
both units after the existing host-firewall block.

* docs(recycle-bin): runbook for the soft-delete system

Covers how the per-container recycle bin works, trash-cli ops,
escape hatches, what is not covered (Layer 2/3), and admin ops.

* feat(admin-prompts): library of structured admin tasks + HTTP endpoint

tinyagentos/admin_prompts/*.md — initial library:
  - disk-audit: agent audits own disk, proposes [DELETE]/[MOVE-TO-NAS]
    /[KEEP] per item, user confirms
  - memory-audit: RSS + cache inspection with proposed cleanups
  - health-report: read-only status + error-log summary
  - weekly-summary: trace-driven self-report, tokens + cost + highlights

Each prompt forces the agent to check current date first, list
proposed actions before executing, and require user confirmation for
every destructive step. Reminders point at /usr/local/bin/rm
(recycle-bin soft-delete) over /usr/bin/rm.

GET /api/admin-prompts enumerates, GET /api/admin-prompts/<name>
returns the body so the Messages composer can prefill (Phase 1.B UI
lands separately).

* feat(fs-snapshot): Snapper backstop on btrfs pools — Layer 3 recycle-bin

Detects incus storage pool driver on install; if btrfs, installs and
configures Snapper with config name taos-containers, hourly x24 + daily x7
retention. ZFS and dir backends skip gracefully with a clear message.
Wired into install.sh after the host-firewall block; non-fatal if the
optional backstop fails. Includes probe script for operator diagnostics
and a bash structural test.

* docs(fs-snapshot): runbook for Layer 3 recycle-bin backstop

Covers what Layer 3 does versus Layers 1/2, how to verify snapper is
running, listing snapshots, file restoration from btrfs snapshot paths,
disabling the timers, and storage cost expectations.

* feat(recycle): list/restore/purge API routes for container recycle bins

GET  /api/agents/{name}/recycle — list an agent's /var/recycle-bin/
GET  /api/recycle                — aggregated view across all agents
POST /api/agents/{name}/recycle/restore  — restore one item
DELETE /api/agents/{name}/recycle/{id}   — permanent purge

Items are exposed with a base64url id derived from the original path
so the frontend doesn't need to hold a mapping table. All container
interactions go via the existing exec_in_container abstraction, so
the same code works for LXC and Docker backends transparently.
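
One way to derive such an id is sketched below. Stripping the `=` padding is an assumption, and both helper names are hypothetical; the point is only that the id round-trips to the original path without any server-side mapping.

```python
import base64


def path_to_id(original_path: str) -> str:
    """Encode a path as URL-safe base64 with the padding removed."""
    raw = original_path.encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def id_to_path(item_id: str) -> str:
    """Restore the padding and decode the id back to the original path."""
    padded = item_id + "=" * (-len(item_id) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


item_id = path_to_id("/root/work/notes.txt")
restored = id_to_path(item_id)
```

Because the URL-safe alphabet avoids `/` and `+`, the id can sit directly in the `{id}` segment of the DELETE route.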

Offline containers return status=container_offline with an empty list
instead of a 5xx — the UI can render "agent is stopped" without
coupling to agent state.

* feat(frameworks): two-tier beta/alpha verification status; openclaw first

Consolidates framework verification statuses from four (tested/beta/experimental/broken)
to two tiers plus broken (beta/alpha/broken). openclaw is the only beta entry; all others become alpha.
Updates adapter registry, all 15 agent manifests, and framework route tests with regression
guard against the retired "experimental" status.

* feat(agents-ui): order openclaw first; Beta + Alpha labels in framework picker

Sorts the framework list so openclaw appears at the top. Updates the Framework
interface union type to beta|alpha|broken. Replaces the "beta"/"experimental"
pills with "Beta" (amber) and "Alpha · Testing" (neutral). The show-alpha toggle
and deselect-on-hide logic are wired to the new alpha status.

* build: rebuild desktop bundle with framework picker reorder

* feat(ui): disk-quota card + admin-prompt prefill + recycle-bin browser

Phase 1.B: per-agent disk quota pill (warn/hard) in AgentRow; notification
cards above agent list with Expand +10 GB and Audit with agent actions.
Cross-app navigation via taos:open-messages CustomEvent to MessagesApp.

Phase 1.C-frontend: MessagesApp listens for taos:open-messages, selects
channel, fetches GET /api/admin-prompts/{name}, stuffs body into composer,
shows dismissible prefill banner above input area.

Phase 1.E.2: Recycle Bin location in FilesApp sidebar; fetches GET /api/recycle,
grouped by agent, per-item Restore (POST) and Delete Permanently (DELETE) with
confirm dialogs; empty state and container-offline notice.

* build: rebuild desktop bundle with Phase 1 frontend

Rebuilt after disk-quota card, admin-prompt composer prefill,
and recycle-bin browser features.

* feat(containers): set_root_quota + root_size_gib on create_container

Add set_root_quota(name, size_gib) to the module-level API (__init__.py),
LXCBackend, and DockerBackend. Add root_size_gib param to create_container
in all three; quota is applied after launch, before mounts/env.

Docker overlay2-without-pquota returns success with a soft note rather
than a hard failure. Abstract base class updated with the new abstract
method signatures.

Tests cover success, overlay2 soft path, genuine failure, and pass-through
from create_container for both LXC (module-level) and Docker backends.

* refactor(deployer): snapshot-model -- single trace mount, no workspace/memory/home bind mounts

Replace the three host-side bind mounts (workspace, memory, home) with a
single trace mount: {data_dir}/trace/{slug}/ -> /root/.taos/trace/.

Remove _write_openclaw_bootstrap; install.sh now writes
/root/.openclaw/openclaw.json + .openclaw/env inside the container via
env vars injected at create_container time (TAOS_BRIDGE_URL, OPENAI_BASE_URL,
OPENAI_API_KEY, TAOS_LOCAL_TOKEN, TAOS_AGENT_NAME).

Add root_size_gib=40 to DeployRequest (default per arch pivot v2 S10.10)
and pass it through to create_container.

Update test_deployer.py: remove old three-mount + openclaw host-write
assertions, add test_one_trace_bind_mount, test_no_workspace_memory_home_mount,
test_root_quota_passed_through_default, test_root_quota_custom_value_honoured,
test_trace_dir_created_on_host, test_bridge_url_injected_into_env.

agent_env.py is NOT deleted: routes/agents.py (Phase 2.B) still imports
update_agent_env_file for the restore path. Phase 2.B will clean it up.

* feat(openclaw): install.sh writes /root/.openclaw config + env inside container

Add section 2a between npm install and the recycle-bin block. Uses env vars
injected by the deployer (TAOS_AGENT_NAME, TAOS_MODEL, OPENAI_BASE_URL,
OPENAI_API_KEY, TAOS_BRIDGE_URL, TAOS_LOCAL_TOKEN) to write:

  /root/.openclaw/openclaw.json  (mode 600) — gateway + LiteLLM provider config
  /root/.openclaw/env            (mode 600) — EnvironmentFile for systemd unit

Both files live inside the container rootfs and travel with snapshot archives.
Safe defaults via := fallback for dev/test environments where not all vars
are set.

Remove the old section 3 that only ensured the .openclaw dir existed (no
longer needed; section 2a creates it). Renumber comments for sections 3 and 4.

* feat(containers): snapshot_create + snapshot_restore + snapshot_list + set_env

* refactor(trace): store path moved to {data_dir}/trace/{slug} — bind-mount target

_agent_trace_dir now returns data_dir/trace/slug instead of
data_dir/agent-home/slug/.taos/trace. Aligns with the Phase 2.A
deployer bind-mount (data_dir/trace/{slug}/ → /root/.taos/trace/).
Updates module docstring and framework-agnostic-runtime.md to reflect
the new path and the rationale for separating trace from home-folder.

* feat(migrate): script to move legacy agent-home/*/.taos/trace bucket files

scripts/migrate-trace-paths.sh walks data_dir/agent-home/*/
for .taos/trace directories and moves .db/.jsonl files to
data_dir/trace/{slug}/. Idempotent: skips if source absent, no-clobber
merge if both old and new paths exist. install.sh runs it as a
non-fatal step on every install/upgrade after disk-quota setup.
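
The migration logic is shown here in Python for illustration only; the shipped script is a shell script and may differ in detail. The `migrate_trace` function is a hypothetical stand-in that follows the rules described above: skip when the source is absent, never overwrite an existing target file.

```python
import shutil
import tempfile
from pathlib import Path


def migrate_trace(data_dir: Path) -> list[Path]:
    """Move legacy .db/.jsonl trace files to data_dir/trace/{slug}/ without clobbering."""
    moved: list[Path] = []
    agent_home = data_dir / "agent-home"
    if not agent_home.is_dir():
        return moved
    for home in sorted(p for p in agent_home.iterdir() if p.is_dir()):
        legacy = home / ".taos" / "trace"
        if not legacy.is_dir():
            continue  # source absent: nothing to migrate for this agent
        target = data_dir / "trace" / home.name
        target.mkdir(parents=True, exist_ok=True)
        for src in sorted(legacy.iterdir()):
            if src.suffix not in {".db", ".jsonl"}:
                continue
            dest = target / src.name
            if dest.exists():
                continue  # no-clobber: keep the file already at the new path
            shutil.move(str(src), str(dest))
            moved.append(dest)
    return moved


# Demonstration against a throwaway data_dir.
root = Path(tempfile.mkdtemp())
legacy_dir = root / "agent-home" / "demo" / ".taos" / "trace"
legacy_dir.mkdir(parents=True)
(legacy_dir / "events.jsonl").write_text("{}\n", encoding="utf-8")
first_run = migrate_trace(root)
second_run = migrate_trace(root)  # nothing left to move
```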

* refactor(agents): archive/restore/purge use incus snapshot primitives

* feat(config): archive.target configurable (pool | path | s3)

* refactor(env): delete obsolete tinyagentos/agent_env.py -- functionality moved into install.sh + incus set_env

* docs(design): framework-agnostic-runtime thesis evolution — containers hold their own state

Rewrite framework-agnostic-runtime.md to reflect the Phase 2.A–2.C
post-pivot reality: containers hold their own state, hosts hold the
federation. The three bind mounts (workspace/memory/home) are gone;
the single trace bind mount remains. The "Per-agent home" section is
replaced by "Three bind mounts removed" and "agent_env.py removed"
migration notes. The rule application checklist gains a sixth question
on archive atomicity. The "Why the pivot" section summarises the
reasoning. The audit table and Related code section are updated to
match what shipped.

Add a "Post-landing status" banner to architecture-pivot-v2.md marking
the decision record complete with commit ranges for Phases 1 and 2.A–2.C.

Add a note to lxc-docker-coexistence.md that Docker's lack of incus
snapshot primitives means graceful fallback on that backend.

* docs(runbook): rewrite archive/restore/purge for snapshot primitives

Full rewrite of agent-archive-restore.md: archive flow now documents
incus snapshot create, chat export to host-owned path, and snapshot_name
in config. Restore flow documents incus snapshot restore, set_env for
new LLM key, systemctl restart openclaw. Purge documents incus delete
--force destroying snapshots atomically.

Adds Quick reference table, archive.target options table, and legacy
migration section for pre-Phase-2 entries that have no snapshot_name
(identify via jq filter, purge via existing DELETE endpoint, or contact
dev for manual re-snapshot). Troubleshooting covers snapshot-not-found,
rename collision, env-rewrite failure, and openclaw restart failure.

* fix(containers): env setter key=value form; root-quota via override for profile-inherited devices

Two related deploy blockers:

1. `incus config set <name> environment.<key> <value>` parses a value
   starting with `-` as a CLI flag. Local auth tokens from
   `secrets.token_urlsafe(32)` use the URL-safe base64 alphabet, which
   includes `-`, so deploys failed with `unknown shorthand flag: 'X'`
   whenever a token began with a dash. Switch to the single-argument
   `environment.<key>=<value>` form, which is the documented canonical
   syntax and parses unambiguously.

2. `incus config device set <name> root size=...` rejects root
   devices inherited from a profile with
   `Device from profile(s) cannot be modified for individual instance.`
   Switch to `incus config device override` which creates a per-instance
   copy of the device if it doesn't exist, then falls back to `set` if
   an override is already present.

Tests cover: dash-prefixed token value, root quota on profile-inherited
device.
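
The env-setter change in item 1 can be illustrated by building the argument vector only, without invoking incus. The `env_set_argv` helper, the container name, and the token value are hypothetical.

```python
def env_set_argv(container: str, key: str, value: str) -> list[str]:
    """Build the argv for setting one container env var in key=value form."""
    return ["incus", "config", "set", container, f"environment.{key}={value}"]


# A token that happens to begin with a dash, as token_urlsafe() can produce.
argv = env_set_argv("agent-demo", "TAOS_LOCAL_TOKEN", "-Xabc123")
dash_prefixed = [arg for arg in argv if arg.startswith("-")]
```

With the value folded into the same argument as the key, no element of the argv begins with `-`, so nothing is left for the flag parser to misread.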

* fix(openclaw): npm install via github: shorthand so prepare lifecycle builds in container

We were using the tarball URL form which skips npm's prepare lifecycle.
That pushed the build burden onto the fork branch (committing prebuilt
dist/), which kept landing as incomplete and crashing openclaw.mjs:178.

Switch to `npm install -g github:jaylfc/openclaw#taos-fork`. This
triggers prepare → pnpm build:docker in the destination (container),
which is npm's standard mechanism for git-sourced packages that need
building. corepack-activate pnpm before install.

Adds 2-3 minutes to first deploy on arm64 Pi for the build step.
Reliable; removes the brittle committed-artefact dance.

* fix(openclaw): ensure git installed before npm github: shorthand install

npm install -g github: requires git to clone the repo. Fresh Debian
bookworm containers don't have git; add a guard to install it if
absent before the npm install step.

* fix(openclaw): install prebuilt tarball from GitHub Releases (no per-deploy build)

Per-deploy builds are not beginner-friendly: they are slow on arm64,
fragile because of the pnpm workspace context, and depend on the Pi
having a full build toolchain present and configured. Switch to
downloading prebuilt tarballs published by the fork's CI workflow as
Release assets.

Architecture detection picks arm64 vs x64. URL is the GitHub-stable
'releases/latest/download/<asset>' redirect so we always grab the
freshest build with no version bumps in this repo.
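
The detection and URL construction are sketched here in Python for illustration; install.sh itself is a shell script. The asset file name is hypothetical, and the repository slug is assumed from the fork named in an earlier commit message.

```python
import platform

REPO = "jaylfc/openclaw"  # assumed: the fork referenced by the earlier github: install


def release_arch(machine: str) -> str:
    """Map a machine string (uname -m style) to the release architecture label."""
    m = machine.lower()
    if m in {"aarch64", "arm64"}:
        return "arm64"
    if m in {"x86_64", "amd64"}:
        return "x64"
    raise ValueError(f"unsupported architecture: {machine}")


def asset_url(machine: str) -> str:
    """Build the releases/latest/download URL for this architecture."""
    asset = f"openclaw-linux-{release_arch(machine)}.tgz"  # hypothetical asset name
    return f"https://github.com/{REPO}/releases/latest/download/{asset}"


def current_asset_url() -> str:
    """URL for the host this code runs on."""
    return asset_url(platform.machine())


pi_url = asset_url("aarch64")
```

Raising on an unknown architecture matches the stated failure mode: stop with a clear error instead of falling back to a build.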

Failure mode: if download fails, install.sh exits non-zero with a
clear error — there is intentionally no build fallback. The fix when
GitHub is unreachable is to fix connectivity, not to silently start
building.

* fix(openclaw): add --ignore-scripts to npm install -g from prebuilt tarball

Tarball already has dist/ built by CI. Running prepare at install time
tries to spawn git (for hook config) then falls back to pnpm build:docker;
both fail in a fresh container that has no git/pnpm. The bin entry
(openclaw.mjs) is wired directly — prepare output is not needed.

* fix(openclaw): re-add git prerequisite for libsignal transitive dep

libsignal (@whiskeysockets/baileys dep) has a git+https URL so npm
needs git at install time even when installing a prebuilt tarball.
No build happens — this is purely npm fetching a dependency.

* fix(openclaw): update openclaw.json config to match v2026.4 schema

- providers must be a record keyed by provider id, not an array
- gateway.mode: local must be set or gateway refuses to start
- models field must be an array (empty ok), not default_model string

* fix(openclaw): defer service start until llm_key written to config

openclaw gateway calls the bootstrap endpoint on startup which requires
llm_key in the agent config. Previously install.sh started the service
immediately, causing HTTP 409 crash-loop before the deployer had written
the key. Now:
- install.sh enables the unit but defers start (no --now)
- _background_deploy() starts openclaw.service after writing llm_key
  and saving config, so bootstrap succeeds on first attempt

* fix(openclaw): allow null llm_key in bootstrap, fix providers schema in response

- bootstrap now returns 200 when llm_key is null (no LiteLLM proxy case)
  instead of HTTP 409 which caused the gateway to crash-loop
- bootstrap response uses providers-as-record format to match openclaw
  v2026.4 config schema (same fix as install.sh)

* fix(openclaw): bootstrap uses built-in litellm provider + LITELLM_API_KEY

- Return models.providers.litellm (not taos) with correct openclaw schema
- models[] built from agent.model + fallback_models, each as {id,name,contextWindow,maxTokens,input,reasoning}
- agents.defaults.model.primary = "litellm/<agent.model>"
- Drop default_model field; use ${LITELLM_API_KEY} substitution instead of raw key
- Revert e356a98 empty-string fallback: null llm_key returns 409 with clear message
- Update bootstrap shape assertions + add fallback-models length test + null-key 409 test

* fix(deployer): inject LITELLM_API_KEY + TAOS_MODEL/TAOS_FALLBACK_MODELS env

- Add LITELLM_API_KEY env var for openclaw's litellm provider (same value as per-agent virtual key)
- Keep OPENAI_API_KEY set to same value as compat shim for smolagents and other frameworks
- TAOS_MODEL always set (empty string when unconfigured, not omitted)
- Add TAOS_FALLBACK_MODELS env var (comma-separated) so install.sh can build models[] at install time
- Add fallback_models field to DeployRequest dataclass
- Set LITELLM_API_KEY="" fallback when no proxy configured, matching OPENAI_API_KEY pattern

* fix(openclaw): install.sh writes openclaw.json with litellm provider schema

- Provider name changed from taos to litellm; baseUrl points to 127.0.0.1:4000 (no /v1 suffix)
- apiKey uses ${LITELLM_API_KEY} env-var substitution for openclaw's runtime resolution
- models[] array built at install time from TAOS_MODEL + comma-separated TAOS_FALLBACK_MODELS
- agents.defaults.model.primary set to "litellm/<TAOS_MODEL>" prefix
- env file extended with LITELLM_API_KEY and TAOS_FALLBACK_MODELS entries

* fix(deployer): correct DeployRequest field order (fallback_models after data_dir)

* fix(llm_proxy): own port 4000, drop adoption, log /key/generate failures

Adopting a pre-existing LiteLLM on :4000 silently reuses whatever config
and master key the foreign process booted with, so UI-added providers
never take effect and /key/generate rejects the Bearer sk-taos-master
with 401 (silent return None). Agents then deploy with llm_key=null and
openclaw crashes with LITELLM_API_KEY missing.

start() now terminates any foreign PID on the port (SIGTERM, 5s grace,
SIGKILL) before spawning its own process. is_running() reports ownership
only. reload_config() drops its adopted branch. Admin calls that can
return non-200 now log status + body so master-key mismatches surface
in logs instead of being swallowed.
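
The SIGTERM, grace period, SIGKILL sequence might be sketched like this. It is not the LLMProxy implementation: the function names and the injectable `kill`, `alive` and `sleep` parameters are assumptions, added so the logic can be exercised without signalling a real process.

```python
import os
import signal
import time


def pid_alive(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def terminate_pid(pid, grace=5.0, poll=0.1, kill=os.kill, alive=pid_alive, sleep=time.sleep):
    """Send SIGTERM, wait up to `grace` seconds, then SIGKILL. Returns the signals sent."""
    sent = []
    try:
        kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return sent  # already gone
    sent.append(signal.SIGTERM)
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not alive(pid):
            return sent
        sleep(poll)
    if alive(pid):
        try:
            kill(pid, signal.SIGKILL)
            sent.append(signal.SIGKILL)
        except ProcessLookupError:
            pass  # exited between the last probe and the kill
    return sent


# Stand-in callables: one process that exits on SIGTERM, one that ignores it.
polite = terminate_pid(1234, grace=0.5, kill=lambda p, s: None, alive=lambda p: False)
stubborn = terminate_pid(1234, grace=0.05, poll=0.01, kill=lambda p, s: None, alive=lambda p: True)
```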

* fix(llm_proxy): emit master_key + deployer fallback to shared key when no DB

LiteLLM's /key/generate requires Postgres; SQLite is unsupported. On the
default single-user Pi deployment there's no DB, so LiteLLM runs in
routing-only mode and cannot issue per-agent virtual keys. Previously
create_agent_key returned None in that mode and the deployer set
LITELLM_API_KEY="", which crashed the openclaw gateway on boot with
"LITELLM_API_KEY is missing or empty".

Routing-only mode is now the supported default path:
- general_settings.master_key added to the yaml config
- LITELLM_MASTER_KEY exported into the subprocess env
- Single source of truth TAOS_LITELLM_MASTER_KEY constant
- Deployer falls back to the master key when virtual-key issue fails,
  so the container always gets a usable auth token

Users who configure Postgres later still get per-agent virtual keys
through the same /key/generate path.

* fix(providers): postgres-backed virtual keys + generic provider catalog + model discovery

- LLMProxy accepts database_url; app reads data/.litellm_db_url at boot
  and exports it as DATABASE_URL into the litellm subprocess so
  /key/generate can mint per-agent virtual keys.
- Add Provider fills canonical URL from PROVIDER_URL_DEFAULTS and probes
  {url}/models to populate the model list when empty — generic across
  openai, anthropic, openrouter, kilocode (no per-type branching on the
  probe). Falls back to per-type seed list (kilocode → kilo-auto/free)
  when the probe returns nothing so the entry still registers at least
  one routable model.
- Deployer scopes the minted virtual key to the agent's primary + fallback
  models (models=[req.model, *fallback_models]) instead of defaulting to
  the unrestricted "default" alias.
- Deployer fails loudly when a DB is configured but /key/generate still
  returns None — hiding that class of failure is what shipped the
  broken kilocode path in the first place.
- generate_litellm_config now WARNs when a cloud-type backend is missing
  url or models, so silent drops surface in logs instead of showing up
  as a broken agent much later.
- scripts/repair_providers.py repairs legacy config.yaml entries that
  pre-date the autofill/discovery logic.

* fix(llm_proxy): resolve api_key_secret values into subprocess env

Generated LiteLLM configs use os.environ/<name> markers to reference
provider api keys, but nothing was actually exporting those names
into the subprocess env. Cloud providers therefore hit the litellm
OpenAIException "api_key client option must be set" even with a
correctly-configured backend list.

LLMProxy.start/reload_config now accept a secrets={name: value} map.
app.py resolves each backend.api_key_secret from the secrets store at
boot and again on catalog-change reload; routes/providers.py does the
same on add/patch/delete so newly-added or rotated keys take effect
without a full app restart.

* feat(litellm_migrate): auto-apply Prisma schema on boot when DB is configured

LiteLLM's /key/generate requires a Postgres-backed Prisma schema, but
LiteLLM does not run migrations itself. Fresh installs had to manually
run `pip install prisma && prisma generate && prisma db push` before
virtual keys worked.

New tinyagentos/litellm_migrate.py locates the bundled schema at
litellm/proxy/schema.prisma, probes for LiteLLM_VerificationToken in the
configured DB, and shells out to the venv's prisma CLI only when the
table is missing. Idempotent — safe on every boot. Called from the
lifespan hook before LLMProxy.start() so LiteLLM sees a ready schema.

Added prisma>=0.11.0 to the proxy optional dependency group so the CLI
lands in the venv on fresh installs.

* fix(litellm_callback): wire callbacks under litellm_settings + sibling shim for get_instance_fn

* feat(providers): /api/providers/models passthrough with refresh + ttl cache

* feat(agents): agent-creation model picker reads from LiteLLM passthrough

* fix(litellm_migrate): prepend venv bin to PATH so prisma-client-py resolves under systemd

* fix(litellm_migrate): psql probe fallback so boot doesn't wrongly rerun migration

* fix(litellm_migrate): only run prisma generate, let LiteLLM own DB migration

Running prisma db push from our bootstrap created tables without seeding
_prisma_migrations, so LiteLLM's own prisma migrate deploy at startup
tried to apply migration #1 against an already-populated schema and
looped on "type JobStatus already exists", leaving the proxy unhealthy.

Our helper's only job now is to make prisma.client importable so
LiteLLM can run its shipped migrations itself. Drop the db push and
the psql/psycopg probe; keep the systemd PATH fix for prisma generate.

* fix(llm_proxy): prepend venv bin to PATH and widen startup wait

LiteLLM's proxy_cli shells out ``subprocess.run(["prisma"])`` during
startup to detect whether Prisma is runnable. Under systemd the service's
default PATH doesn't include our venv's bin/, so the lookup raises
FileNotFoundError and LiteLLM prints "prisma package not found" and skips
DB setup entirely — leaving virtual-key issuance broken even though the
package IS installed in the venv.

Prepend the venv bin that already hosts the litellm binary so the child
process resolves ``prisma`` (and ``prisma-client-py`` for generate).

Also bump the startup wait from 30s to 120s: LiteLLM on a fresh Pi DB
runs ``prisma migrate deploy`` before opening its HTTP port, which takes
45-60s on ARM.
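
Prepending the venv bin directory to the child's PATH can be sketched as below. The helper name is hypothetical, and deriving the directory from `sys.executable` is an assumption about how the venv is located.

```python
import os
import sys
from pathlib import Path


def env_with_venv_bin(base_env: dict, venv_bin: Path) -> dict:
    """Return a copy of base_env with venv_bin placed first on PATH."""
    env = dict(base_env)
    current = env.get("PATH", "")
    env["PATH"] = f"{venv_bin}{os.pathsep}{current}" if current else str(venv_bin)
    return env


# The directory holding the interpreter also holds the venv's console scripts.
venv_bin = Path(sys.executable).parent
child_env = env_with_venv_bin({"PATH": "/usr/bin"}, venv_bin)
```

The resulting dict would be passed as `env=` when spawning the subprocess, so the parent's own environment is left untouched.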

* fix(llm_proxy): capture LiteLLM stderr to sibling log file

stderr=DEVNULL silently swallowed proxy startup failures (prisma
migration errors, config parse errors, model-router failures),
turning "why is the proxy unhealthy?" into a 30-minute debugging
hunt. Write stderr to a file next to litellm_config.yaml so
operators can read it without attaching strace.

* fix(llm_proxy): poll health/readiness and drop SIGHUP reload

Two separate bugs kept LiteLLM from ever settling on the Pi.

1. Startup polling hit ``/health``, which gates on the master key and
   returns 401 for an unauthenticated client. LiteLLM was healthy within
   ~50s but ``start()`` kept polling until the 120s timeout, logged
   "failed to start within 120s", and returned False even though the
   subprocess was fine. ``/health/readiness`` is the public endpoint.

2. ``reload_config`` sent SIGHUP to trigger a config reload. LiteLLM
   runs as single-worker uvicorn (no ``--workers``), which does not
   register a SIGHUP handler, so the default action — terminate — fires.
   Every ``/api/providers/models?refresh=true`` was silently killing
   the proxy, then ``_fetch_litellm_models`` got connection-refused and
   returned []. Drop SIGHUP entirely; the existing stop+start path was
   already the fallback.

Also switch the foreign-process probe to ``/health/readiness`` for the
same 401 reason.
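
The startup polling fix in item 1 can be sketched as follows. The helper names are assumptions; `wait_until_ready` takes the probe as a callable so the loop can be exercised without a running proxy.

```python
import time
import urllib.error
import urllib.request


def http_ready(url: str, timeout: float = 2.0) -> bool:
    """True only for HTTP 200. A 401 raises HTTPError and therefore counts as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, timeout=120.0, interval=1.0, sleep=time.sleep) -> bool:
    """Call probe() until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Real use would be along the lines of:
#   wait_until_ready(lambda: http_ready("http://127.0.0.1:4000/health/readiness"))
answers = iter([False, False, True])
became_ready = wait_until_ready(lambda: next(answers), timeout=5.0, interval=0.0)
never_ready = wait_until_ready(lambda: False, timeout=0.05, interval=0.01)
```

Because any non-200 response counts as not ready, polling an endpoint that answers 401 can only end at the timeout, which is the failure described above.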

* fix(llm_proxy): forward TAOS_LOCAL_TOKEN to LiteLLM subprocess

The TaosLiteLLMCallback running inside the LiteLLM subprocess POSTs
llm_call events back to the taOS bridge at ``/api/trace``, which
requires the local auth token. The callback's token-discovery logic
checks ``TAOS_LOCAL_TOKEN`` env first, then ``/data/.auth_local_token``
and ``~/.taos/.auth_local_token``. Under systemd the real token lives
at ``{data_dir}/.auth_local_token`` — none of the candidate paths — so
every callback fired a POST without Authorization and taOS responded
401, leaving trace rows with no ``llm_call`` events despite LiteLLM
actually processing requests.

Read the token in app.py and forward it via the new ``local_token``
constructor kwarg on LLMProxy, which exports it into the subprocess env.

* fix(litellm_callback): extract agent slug from user_api_key_metadata

LiteLLM 1.83.4 surfaces the agent slug in litellm_params.metadata under
user_api_key_metadata.agent (matching what LLMProxy.create_agent_key
writes when minting the virtual key). The previous extraction read
metadata.key_alias which is no longer populated on success events, so
every llm_call trace was bucketed under the _unknown_ sentinel slug.

Walks four sources in priority order:
  1. user_api_key_metadata.agent
  2. user_api_key_auth_metadata.agent
  3. user_api_key_alias (strips the taos- prefix)
  4. key_alias (legacy, kept for older LiteLLM builds)
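
The priority walk above might look roughly like this. It is a sketch: the function name is hypothetical, and returning the legacy key_alias unmodified is an assumption, since the message does not say whether that value carries a prefix.

```python
UNKNOWN_SLUG = "_unknown_"


def extract_agent_slug(metadata: dict) -> str:
    """Resolve the agent slug from LiteLLM metadata, trying sources in priority order."""
    # Sources 1 and 2: nested metadata dicts carrying an "agent" field.
    for source in ("user_api_key_metadata", "user_api_key_auth_metadata"):
        nested = metadata.get(source) or {}
        if isinstance(nested, dict) and nested.get("agent"):
            return str(nested["agent"])
    # Source 3: the key alias, with the taos- prefix stripped.
    alias = metadata.get("user_api_key_alias")
    if alias:
        return alias[len("taos-"):] if alias.startswith("taos-") else alias
    # Source 4: legacy field used by older LiteLLM builds (returned as-is, assumed).
    legacy = metadata.get("key_alias")
    if legacy:
        return str(legacy)
    return UNKNOWN_SLUG


slug = extract_agent_slug({
    "user_api_key_metadata": {"agent": "alice"},
    "key_alias": "stale-value",
})
```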

* feat(trace): record message_in events so transcript captures both sides

enqueue_user_message now writes a message_in trace event under the
agent's slug, following the ENVELOPE_V1_SCHEMA message_in shape
({from, text}) with extra informational fields (message_id,
author_type, delivery).

Guards against orphan _unknown_ or empty-slug entries.
Fails soft: trace write errors are logged, never raised.

* feat(agents): persist optional emoji on agent record + deploy API

* feat(agents): emoji picker in create flow + display in agent UI (rebuilt PWA)

* fix(agents): tolerant DELETE for orphan agents — skip snapshot/stop when container absent (#221)

Failed deploys leave behind a config row with no LXC container, which
caused DELETE /api/agents/{name} to error on snapshot_create. Probe
container_exists first; for orphans, skip stop/snapshot, revoke any
LiteLLM key, and either hard-delete the row (no history) or record a
tombstone (chat/trace present so purge is available from Archived).

Adds container_exists helper to tinyagentos.containers; four new tests
cover the orphan hard-delete, orphan tombstone, skipped-snapshot
assertion, and purge of a snapshotless tombstone.

* feat(agents): pre-built openclaw LXC base image for fast deploys

Adds a GitHub Actions workflow that builds per-arch Debian 13 LXC
base images with Node 22, openclaw, and recycle-bin scaffolding
already installed. Published as assets on the 'rolling-images'
Release tag.

The deployer now checks for the 'taos-openclaw-base' image alias
before launching; when present it uses the cached image and sets
TAOS_BASE_IMAGE_PRESENT=1 so install.sh skips the apt-get + npm
steps. Without the image the deployer falls back transparently
to images:debian/bookworm and install.sh does the full install.
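
The image selection described above might look roughly like this. `choose_launch_image` is a hypothetical name; the alias, the fallback image, and the environment variable are taken from the text, while the return shape is an assumption.

```python
from typing import Dict, Tuple

BASE_ALIAS = "taos-openclaw-base"
FALLBACK_IMAGE = "images:debian/bookworm"


def choose_launch_image(base_image_present: bool) -> Tuple[str, Dict[str, str]]:
    """Pick the launch image and the extra env passed to install.sh."""
    if base_image_present:
        # Cached image: tell install.sh to skip the apt-get + npm steps.
        return BASE_ALIAS, {"TAOS_BASE_IMAGE_PRESENT": "1"}
    # No cached image: stock Debian, install.sh does the full install.
    return FALLBACK_IMAGE, {}
```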

tinyagentos.agent_image exposes is_image_present and
ensure_image_present helpers; the latter runs as a background task
on app startup to bootstrap the image on first boot.

Closes #220

* ci(agents): fix bridge forwarding + NAT for incus in GHA runner

* fix(agent_image): use os.pipe() so curl stdout actually reaches incus stdin

The previous implementation passed curl.stdout (a Python StreamReader)
as stdin= to the incus subprocess, which asyncio cannot forward as an
OS-level FD. Curl would write the first ~90KB then block on a pipe
nobody was draining. Using an explicit os.pipe() pair with the read end
handed to incus and the write end to curl gives us a real kernel pipe
and the import completes.
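
A generic sketch of the os.pipe() technique, with the two commands passed in as argv lists so it can be tried without curl or incus. `pipe_between` is a hypothetical helper, not the code from this commit.

```python
import asyncio
import os
from typing import List, Tuple


async def pipe_between(producer_argv: List[str],
                       consumer_argv: List[str]) -> Tuple[int, int]:
    """Connect producer stdout to consumer stdin through a kernel pipe."""
    read_fd, write_fd = os.pipe()
    try:
        # The producer (curl in the commit) writes into the pipe.
        producer = await asyncio.create_subprocess_exec(
            *producer_argv, stdout=write_fd)
    except BaseException:
        os.close(read_fd)
        os.close(write_fd)
        raise
    # The parent must drop its copy of the write end, otherwise the
    # consumer never sees EOF.
    os.close(write_fd)
    try:
        # The consumer (incus in the commit) reads from the pipe.
        consumer = await asyncio.create_subprocess_exec(
            *consumer_argv, stdin=read_fd)
    except BaseException:
        producer.kill()
        await producer.wait()
        raise
    finally:
        os.close(read_fd)
    producer_rc, consumer_rc = await asyncio.gather(
        producer.wait(), consumer.wait())
    return producer_rc, consumer_rc
```

Both children hold raw file descriptors, so the data moves between them without passing through the Python event loop.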

* fix(agent_image): use temp file + positional alias for incus 6.x

Incus 6.x rejects '-' as stdin for image import and rejects bare
HTTPS URLs (it expects an incus image server). Download to a temp
file, then pass its path. Also fix the image list query to pass the
alias as a positional argument (--filter=alias=... is only valid for
container list).
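
A minimal sketch of the temp-file approach. It assumes `incus image import <path> --alias <alias>` is the command being run; `import_image_via_tempfile` is a hypothetical name, and the `run` parameter exists only so the command runner can be swapped out.

```python
import os
import subprocess
import tempfile
import urllib.request
from typing import Callable


def import_image_via_tempfile(url: str, alias: str,
                              run: Callable[..., object] = subprocess.run) -> None:
    """Download an image to a temp file, then import it by path."""
    fd, path = tempfile.mkstemp()
    os.close(fd)  # only the path is needed; urlretrieve reopens the file
    try:
        urllib.request.urlretrieve(url, path)
        # Pass a real file path instead of '-' or a bare HTTPS URL.
        run(["incus", "image", "import", path, "--alias", alias], check=True)
    finally:
        # Remove the download whether or not the import succeeded.
        os.unlink(path)
```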

* docs(openclaw): rename provider to built-in litellm in integration tracker

The design doc still referenced models.providers.taos (a custom provider
that was abandoned mid-implementation in favour of openclaw's built-in
litellm provider type). Updated the bootstrap example, the integration
tracker table, and the openclaw.json shape to match what actually ships.
The channels-side "provider: taos" identifier is unchanged; that's the
channel-kind name, separate from the LLM provider.