fix: security hardening — bind, validation, atomic writes, deploy rollback#1
Closed
paralizeer wants to merge 1 commit into
- Default server bind: 0.0.0.0 → 127.0.0.1 (config.py, install.sh, data/config.yaml)
- Agent name validation: regex ^[a-z0-9][a-z0-9-]{0,62}$ on create/deploy
- Framework allowlist: deploy rejects unknown frameworks instead of pip-installing arbitrary input
- Atomic config writes: write to .tmp then rename to prevent corruption on crash
- Deploy rollback: destroys container on any failure after creation
- QMD serve inside containers: added NoNewPrivileges=true to systemd unit
- Removed hardcoded Tailscale/LAN IPs from committed config.yaml
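The name check and the atomic write listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the regex is the one quoted in the description, while the function names `is_valid_agent_name` and `atomic_write_text` and the extra `fsync` call are assumptions made for the example.

```python
import os
import re
from pathlib import Path

# Pattern quoted in the PR description: 1-63 chars, lowercase alphanumerics
# and hyphens, not starting with a hyphen.
AGENT_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}$")


def is_valid_agent_name(name: str) -> bool:
    """Return True when the name matches the pattern exactly (hypothetical helper)."""
    return AGENT_NAME_RE.fullmatch(name) is not None


def atomic_write_text(path: Path, text: str) -> None:
    """Write to a sibling .tmp file, then rename it over the target.

    The rename is atomic on POSIX when both paths are on the same filesystem,
    so a crash leaves either the old file or the new one, never a partial file.
    The fsync is an addition for this sketch; the PR only mentions .tmp + rename.
    """
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(text)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)
```

Using `fullmatch` rather than `match` means a name with a trailing newline is rejected even though the pattern ends in `$`.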
All 127 existing tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Owner
Thanks for the thorough review and solid PR @paralizeer. I've cherry-picked the security fixes I wanted and committed them in Merged (with credit):
Not merged (with reasoning):
Closing this PR since the relevant changes have been applied. Appreciate the contribution; the atomic writes and deploy rollback in particular are the kind of reliability fixes that matter on embedded hardware where crashes happen.
jaylfc added a commit that referenced this pull request Apr 11, 2026
Cherry-picked from @paralizeer's PR #1:
- Atomic config writes (write to .tmp then rename)
- Agent name validation (alphanumeric + hyphens, 1-63 chars)
- Deploy rollback on failure (destroys container on error)
- NoNewPrivileges=true on qmd serve systemd unit
- QMD serve binds 0.0.0.0 inside container (host needs access)

Modified from PR:
- Keep 0.0.0.0 as default bind (this is a LAN device, not a public server)
- Dynamic framework validation from catalog registry instead of hardcoded allowlist

Additional fixes:
- Move config.yaml to config.yaml.example (template)
- Gitignore data/config.yaml, hardware.json, installed.json
- Auto-copy example config on first run
- Remove hardcoded Tailscale IPs from deployer defaults
- Remove committed agent configs and backend URLs
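The "deploy rollback on failure" item can be pictured with a small sketch. Everything named here (`RecordingBackend`, `deploy_with_rollback`, the `steps` callables) is hypothetical and exists only to show the shape of the pattern: once the container exists, any later exception destroys it before the error propagates.

```python
from typing import Callable, Iterable


class RecordingBackend:
    """Hypothetical stand-in for a container backend; it only records calls."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def create_container(self, name: str) -> None:
        self.calls.append(("create", name))

    def destroy_container(self, name: str) -> None:
        self.calls.append(("destroy", name))


def deploy_with_rollback(
    backend: RecordingBackend,
    name: str,
    steps: Iterable[Callable[[str], None]],
) -> None:
    """Create the container, run the post-creation steps, roll back on any failure."""
    backend.create_container(name)
    try:
        for step in steps:
            step(name)
    except Exception:
        # Rollback: never leave a half-provisioned container behind.
        backend.destroy_container(name)
        raise
```

Re-raising after the rollback keeps the failure visible to the caller, which can then mark the agent as failed instead of running.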
This was referenced Apr 12, 2026
jaylfc added a commit that referenced this pull request Apr 18, 2026
…gration

Running prisma db push from our bootstrap created tables without seeding _prisma_migrations, so LiteLLM's own prisma migrate deploy at startup tried to apply migration #1 against an already-populated schema and looped on "type JobStatus already exists", leaving the proxy unhealthy.

Our helper's only job now is to make prisma.client importable so LiteLLM can run its shipped migrations itself. Drop the db push and the psql/psycopg probe; keep the systemd PATH fix for prisma generate.
jaylfc added a commit that referenced this pull request Apr 18, 2026
…race, pre-built base image (#225)

* refactor(adapters): defer uvicorn imports so modules load without it

Adapters imported uvicorn at module top, so anything that imported them for structural checks (tests, health-endpoint probes) would crash with ModuleNotFoundError when uvicorn wasn't installed. uvicorn.run is only needed when an adapter is run as a standalone process; move the import into the __main__ guard. Clears 19 pre-existing test failures across test_new_adapters.py and test_channel_hub_new.py.

* feat(containers): add rename_container to the backend abstraction

Required by the new agent archive lifecycle: the delete path stops the container then renames it to a dated `taos-archived-{slug}-{ts}` bucket so a later restore can rename it back. Implemented for both LXC (incus rename) and Docker (docker rename).

* feat(deployer): expanded container deps and manifest-aware framework install

Broken before: `pip3 install openclaw` ran unconditionally, its failure was logged as a warning and the deploy continued, and the container came up missing the deps most agent frameworks need.

Now:
- apt install includes nodejs, npm, build-essential, python3-dev, ca-certificates, gnupg, wget, with DEBIAN_FRONTEND=noninteractive and --no-install-recommends (timeout 15m for slow arm64 apt).
- Framework install dispatches on the manifest's install.method: pip uses manifest.install.package, script pushes + runs manifest.manifest_dir / install.script. Missing script files, unsupported methods, and non-zero exits all raise RuntimeError so the outer try/except rolls back the container and the agent shows status=failed instead of misleadingly 'running'.
- TAOS_MODEL env var is injected so the in-container runtime knows which model to send to LiteLLM.
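The manifest-aware dispatch described in the deployer entry above might look roughly like this. It is a sketch under assumptions: the manifest is modelled as a plain dict, `build_install_command` is a hypothetical name, and the script case only resolves a local path rather than pushing the file into a container as the commit describes.

```python
from pathlib import Path


def build_install_command(manifest: dict, manifest_dir: Path) -> list[str]:
    """Pick an install command from manifest['install']['method'].

    Raises RuntimeError for a missing package, a missing script file, or an
    unsupported method, so a caller's try/except can roll the deploy back.
    """
    install = manifest.get("install", {})
    method = install.get("method")

    if method == "pip":
        package = install.get("package")
        if not package:
            raise RuntimeError("pip install method needs install.package")
        return ["pip3", "install", package]

    if method == "script":
        script = manifest_dir / install.get("script", "")
        if not script.is_file():
            raise RuntimeError(f"install script not found: {script}")
        return ["bash", str(script)]

    raise RuntimeError(f"unsupported install method: {method!r}")
```

Raising instead of logging a warning is the point of the change: a failed framework install becomes a failed deploy rather than a container that looks healthy but has no runtime.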
* feat(litellm): kilocode + openrouter support, per-agent model registration, hot reload

generate_litellm_config now:
- Registers openrouter (openrouter/ prefix, native LiteLLM support) and kilocode (openai-compatible, explicit api_base) in the backend type maps.
- Expands each cloud backend's declared models into their own model_list entries keyed on the real model id, so agents can request a specific model. The 'default' alias is still appended as a fallback.

routes/providers.py: add/patch/delete now call proxy.reload_config instead of the stale proxy.write_config, so the running LiteLLM subprocess actually picks up config changes.

* feat(openclaw): install.sh and in-container agent runtime

The manifest declares method: script -> scripts/install.sh, which didn't exist. The deployer had no way to install openclaw, so the agent came up with no runtime and the chat path had nothing to hit.

The new script, run once inside a fresh Debian bookworm LXC:
- Creates /opt/openclaw with a pinned venv (fastapi, uvicorn, httpx, openai).
- Writes a minimal FastAPI runtime at /opt/openclaw/server.py that listens on 0.0.0.0:8100, accepts POST /message {text, from, thread_id?} and forwards to LiteLLM using the injected OPENAI_BASE_URL, OPENAI_API_KEY, and TAOS_MODEL env vars.
- Installs a systemd unit so the runtime survives restarts.
- Polls /health up to 20s and fails the install if the server didn't come up.

No memory, no tools, no persistence: the host owns all of that. This is the minimum for the end-to-end chat pipeline to land messages on an agent and get a reply back.

* feat(agents): uuid identity, persisted fields, DM channel, archive lifecycle

Several related changes to the agents API and config model that together make agent creation survive the full round trip:
- Every agent gets a stable 12-char uuid (agent['id']), backfilled for existing config entries by normalize_agent.
- body.model and body.framework land on the agent row at create time; llm_key lands after the background deploy succeeds.
- A 1:1 DM channel is auto-created on successful deploy and its id persisted as chat_channel_id so the Messages app sees the agent immediately.
- extra_config to deploy_agent now always includes the app registry so the manifest-aware framework install can resolve.

Delete is now archive, not destroy. DELETE /api/agents/{name}: stops the container, renames it to taos-archived-{slug}-{ts}, moves workspace/memory dirs under data_dir/archive/{slug}-{ts}/, revokes the LiteLLM key, flags the DM channel archived, and moves the config entry from config.agents to config.archived_agents.

New endpoints:
GET /api/agents/archived -> list archive entries
POST /api/agents/archived/{id}/restore -> reverses the archive
DELETE /api/agents/archived/{id} -> true permanent purge

Restore handles slug collisions (if a new agent has taken the original name) by suffixing -2, -3, etc. Purge is what the old hard-delete used to do: destroy container, rm -rf archive dir, delete chat channel, drop the archived entry.

This also fixes 'can't re-create a deleted agent with the same name' -- the old delete path left the LXC container around; the new archive path renames it out of the way.

* feat(chat): route user DM messages to in-container agent runtime

User messages in a DM channel now reach the agent's FastAPI runtime on port 8100 inside its LXC container; the reply is persisted as an agent-authored message in the same channel and broadcast over the chat hub so both the webapp and the PWA update in real time.

Wiring:
- AgentChatRouter (new, tinyagentos/agent_chat_router.py): fire-and-forget dispatch(message, channel). Skips non-user messages, looks up each non-user channel member as an agent, skips agents that aren't running (posts a short system reply instead), and POSTs to http://{agent.host}:8100/message with {text, from, thread_id}.
  Response content is written back via chat_messages.send_message. All errors caught -- broken agents don't crash the chat path.
- routes/chat.py: one-line dispatch call at the end of the HTTP post_message path and the WebSocket 'message' branch, so both entry points route identically.
- app.py: router instantiated in the lifespan after chat_hub.

No subscription plumbing, no retries -- the router is a direct adapter between two owned stores. Timeouts and connect errors become visible agent replies so the user sees what went wrong.

* feat(agents-ui): Archived section with Restore and Delete Permanently

Adds a collapsible 'Archived' panel below the live agents list in AgentsApp. Shows each archived entry's display name, model, and relative archive time; per-row Restore and Delete Permanently buttons call the new backend endpoints with confirmations.
- parseArchiveTimestamp / relativeTimeFromTs helpers convert the YYYYMMDDTHHMMSS format the backend writes.
- ArchivedAgentsPanel is inlined (matches AgentRow / DeployWizard living in the same file) and self-hides when there are no archived entries.
- handleDelete's confirm copy now mentions archiving so users know it's recoverable.
- fetchArchived is called alongside fetchAgents at every existing refresh point.

Unit tests for the new helpers under desktop/src/apps/__tests__/AgentsApp.archived.test.tsx.

* build: rebuild desktop bundle with Archived section and agent wiring

* feat(containers): stop --force flag + add_proxy_device across backends

Incus refuses to rename a running container, so every archive call now has a hard dependency on the container being stopped first. The new stop-force path sends --force (LXC) or kill (Docker) so archive can guarantee the container is down before it attempts the rename. The add_proxy_device method is added to the abstract base, LXC backend, and Docker stub so the deployer can attach incus proxy devices for host-side port forwarding when setting up an agent home.
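The archive naming and restore rules that run through these commits (the `taos-archived-{slug}-{ts}` container name, the YYYYMMDDTHHMMSS timestamp, and the -2, -3 suffix on slug collisions) can be sketched in a few lines. The function names are hypothetical, and the timestamp is treated as timezone-naive because the commits do not say which zone the backend uses.

```python
from datetime import datetime

# Timestamp layout the commits say the backend writes: YYYYMMDDTHHMMSS.
TS_FORMAT = "%Y%m%dT%H%M%S"


def archived_container_name(slug: str, when: datetime) -> str:
    """Build the dated archive name: taos-archived-{slug}-{ts}."""
    return f"taos-archived-{slug}-{when.strftime(TS_FORMAT)}"


def parse_archive_timestamp(ts: str) -> datetime:
    """Parse a YYYYMMDDTHHMMSS string back into a (naive) datetime."""
    return datetime.strptime(ts, TS_FORMAT)


def resolve_restore_slug(original: str, taken: set[str]) -> str:
    """Return the original slug if it is free, else the first free -2, -3, ... suffix."""
    if original not in taken:
        return original
    n = 2
    while f"{original}-{n}" in taken:
        n += 1
    return f"{original}-{n}"
```

Because the timestamp is fixed-width and ordered from year down to second, archive names for the same slug also sort chronologically as plain strings.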
* feat(deployer): agent-home mount, incus proxy devices, taos_host=127.0.0.1

Agents now get a dedicated home directory mounted at /root inside the container so their runtime state (env file, model config, logs) persists across container recreation. Proxy devices are attached via the new add_proxy_device backend method so the host-side LiteLLM process can reach the in-container agent port. The taos_host default is hardened to 127.0.0.1 so freshly deployed agents always resolve back to the host loopback rather than relying on a potentially incorrect network variable.

* fix(litellm): is_running honours adopted instances so deployer mints keys

Previously is_running only checked the subprocess handle, which is None for processes the deployer did not start itself (adopted instances). The method now also checks the _adopted flag so that a pre-existing LiteLLM process is correctly reported as running and a fresh API key is minted rather than the deployer trying to start a second instance. The companion reload_config path also skips process management when adopted.

* feat(openclaw): write env to /root/.openclaw/env under mounted home

The install script previously wrote configuration to a path that was wiped on container recreation. Now the env file is written to /root/.openclaw/env which sits inside the persistent agent-home mount, so credentials and model config survive container restarts and upgrades without reinstalling. The script also accepts values from environment variables so the deployer can inject them at provision time.

* feat(agent-env): host-side helper to rewrite the agent env file in place

A small module that locates and rewrites the env file inside the agent-home directory on the host without entering the container. This is used by the restore path so a freshly issued LiteLLM key and updated endpoint can be injected into the persistent /root/.openclaw/env without having to reinstall the framework, which would risk breaking the agent's installed state.
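A host-side env rewrite of the kind the agent-env entry describes could be sketched as below. This is not the module from the commit: `rewrite_env_file` is a hypothetical name, the file is assumed to hold plain KEY=VALUE lines, and the 0600 mode plus the .tmp-then-rename step are choices made for this example.

```python
import os
from pathlib import Path


def rewrite_env_file(env_path: Path, updates: dict[str, str]) -> None:
    """Replace or append KEY=VALUE entries, leaving other lines untouched."""
    lines = env_path.read_text().splitlines() if env_path.exists() else []
    seen: set[str] = set()
    out: list[str] = []

    for line in lines:
        key, sep, _ = line.partition("=")
        key = key.strip()
        if sep and not line.lstrip().startswith("#") and key in updates:
            out.append(f"{key}={updates[key]}")  # overwrite the existing value
            seen.add(key)
        else:
            out.append(line)  # comments and unrelated keys pass through

    for key, value in updates.items():
        if key not in seen:
            out.append(f"{key}={value}")  # keys not present yet go at the end

    tmp = env_path.with_name(env_path.name + ".tmp")
    tmp.write_text("\n".join(out) + "\n")
    os.chmod(tmp, 0o600)  # the file holds an API key, so keep it owner-only
    os.replace(tmp, env_path)
```

On restore, a caller would pass only the values that changed, for example a new `OPENAI_API_KEY`, and everything else in the file stays as the install script wrote it.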
* feat(archive): force-stop, abort on rename failure, carry home dir, rewrite env on restore

Archive now force-stops the container before rename: incus refuses to rename a running instance, and silently leaving it running produced orphan containers. If the rename itself fails, the config entry is left in the live list rather than being moved to archive, keeping the system consistent. The agent home directory now travels with workspace and memory into the archive bucket so the full /root is preserved. On restore, the new host-side env rewrite helper updates /root/.openclaw/env with the freshly issued LiteLLM key and endpoint rather than reinstalling, which avoids breaking the installed framework.

* feat(auth): local-token Bearer auth for programmatic access

Adds a stable local token that is bootstrapped once at startup and written to a known path with mode 0600. Any request carrying it in an Authorization: Bearer header is granted full access without a session cookie, allowing automated agents and local scripts to call the API without going through the browser-based login flow. The middleware sits before the session check so it has no impact on normal browser sessions.

* docs: per-agent home mount in framework-agnostic-runtime

Documents the agent-home directory layout and mount strategy so it is clear what lives inside the container, what is persisted on the host, and how the env-rewrite helper fits into the restore flow.

* fix(lifecycle): arm keep-alive timer on image generation

notify_task_complete was never called from the image generation route, leaving the RKNN SD server running indefinitely after requests completed. The _legacy_generate path (used when the resource scheduler is absent) now wraps the backend HTTP call in try/finally so notify_task_complete fires on both success and error paths. Chat and embedding traffic routes through LiteLLM and does not hit this endpoint; that keep-alive path is handled separately via a LiteLLM callback.
Tests added for both success and failure paths.

* feat(models): /api/models/loaded probes rknn-sd + sd-cpp backends

Image-gen backend types were silently skipped in loaded_models, causing the Activity widget to show "Loaded Models (0)" even when the RKNN SD server was active. Two new branches added to the probe loop:
- rknn-sd: GET {url}/v1/models (rknn_sd_server.py speaks OpenAI-compat), emits one entry per model with purpose=image-generation.
- sd-cpp: GET {url}/sdapi/v1/options, reads sd_model_checkpoint for the active checkpoint name, falls back to "unknown" if absent.

Both branches follow the existing ConnectError/Timeout/HTTPError swallow pattern. Tests cover success, missing-checkpoint fallback, and offline (connection refused) for both backend types.

* feat(trace): per-agent hourly-bucketed trace store for the librarian

Captures go per-agent in the bind-mounted home folder so archive, restore, backup, and cross-worker migration all work via the existing "move the home folder" rule. Each agent's .taos/trace/ directory holds one SQLite bucket per UTC hour (YYYY-MM-DDTHH.db). Bucket routing is driven by the event's created_at, not wall-clock at write time -- a 14:59:59.999 event routed at 15:00:00.001 lands in the T14 file, so rollover never drops events.

Zero-loss: every write lands in the SQLite or is appended to a sibling YYYY-MM-DDTHH.jsonl. Nothing is ever silently dropped. The librarian merges both sources at read time.

The envelope is v1 and stable: v, id, trace_id, parent_id, created_at, agent_name, kind, channel_id, thread_id, backend_name, model, duration_ms, tokens_in, tokens_out, cost_usd, error, payload. Kinds are enumerated (message_in/out, llm_call, tool_call/result, reasoning, error, lifecycle); each has a documented payload shape so consumers parse without guessing. trace_id + parent_id enable cross-event linkage for reconstructing a full turn end-to-end.

POST /api/trace writes; GET /api/agents/{name}/trace reads with filter + limit.
POST /api/lifecycle/notify lets the LiteLLM callback reset the keep-alive timer for whichever backend served a request.

* feat(litellm): CustomLogger callback posts llm_call traces + keep-alive

Registered in generated litellm_config.yaml under general_settings.custom_callbacks. Runs inside the LiteLLM subprocess with no access to taOS's Python state, so it authenticates to taOS via the local token file on disk and posts over HTTP to /api/trace and /api/lifecycle/notify.

Agent name is derived from the virtual key alias ("taos-<slug>") that the deployer sets when minting per-agent keys. This is how the per-agent trace store knows which bucket to route to for a given completion.

Failure-mode is swallow-and-log: a broken callback must never fail a real LLM request. A litellm-not-installed environment gets a no-op stub so tests pass without the dep.

* feat(trace): wire TraceStoreRegistry in lifespan + pass auth env to containers

app.py: instantiate the registry on the data_dir, include the trace router, close all connections on shutdown.

deployer.py: inject TAOS_LOCAL_TOKEN (read from data_dir/.auth_local_token at deploy time) and TAOS_TRACE_URL into the container env. Any in-container runtime that wants to post traces (or that we later replace with real openclaw and tap via gateway events) has the credential and endpoint ready.

* docs(design): update framework-agnostic-runtime for proxy networking + archive/trace

Switches env-snippet from host.docker.internal to 127.0.0.1 and explains incus proxy devices. Drops Docker-only qualifier from workspace/memory status table (LXC now has parity). Adds Per-agent trace capture, Agent archive/restore, and Programmatic access (local token) sections. Extends Related and adds a Related code list pointing at the new modules.

* docs(design): cross-reference per-agent trace layer from user-memory

Adds a section distinguishing user memory (long-lived user context) from per-agent trace capture (event log inside agent-home).
Explains how the taOSmd librarian bridges both layers and links to the trace design in framework-agnostic-runtime.md.

* docs(runbook): add agent archive, restore, and purge runbook

Step-by-step procedures for archiving a live agent, listing archives, restoring with slug collision handling and LiteLLM key rotation, and permanent purge. Covers failure modes including container rename failure, archive dir collision, and restore container conflicts.

* docs(runbook): add trace querying runbook for API, SQLite, and cost attribution

Covers the three-endpoint trace API surface with curl examples, query filter parameters, envelope field table, kind/payload reference, direct SQLite access pattern, cost attribution recipe, and librarian consumption pattern. Links to trace_store.py, routes/trace.py, and litellm_callback.py.

* docs(design): update plan-agent-deployer API table to reflect archive semantics

DELETE /api/agents/{name} now archives rather than hard-deletes. Updates the endpoint table to show the archive path and the new purge endpoint.

* docs(design): openclaw-integration.md, bridge adapter as MVP

Primary reference for the real openclaw integration: gateway protocol breakdown, install + runtime, config schema, extension model, known limitations, 35-row capability map, and a 4-phase MVP-to-full roadmap.

MVP path is the bridge adapter from the 2026-04-11 framework-integration-bridge-design spec, not the operator-client (raw v3 WS from taOS). The operator-client is kept as a documented fallback only, because it couples taOS to openclaw's gateway protocol version and any upstream bump can break the fleet; the bridge isolates coupling to a single ~200 LoC patch inside our jaylfc/openclaw fork.
Review-gate refinements baked into Step 1: feature-flag the patch entry (so an unset TAOS_BRIDGE_URL gives upstream-identical builds), version-stamp the bootstrap, single coupling discipline, channels.kind "external" + provider "taos" (upstreamable), 400 LoC patch ceiling, automated persistence-audit as the trust anchor, parallel upstream PRs, LiteLLM key rotation caveat (safe-on-restart today, reload RPC later).

Fixes the Debian-bookworm Node 18 gap (install Node 22.14+ via NodeSource before npm install) and the stale 500MB manifest disk size (real openclaw is 1-2GB on disk).

Appendix B lists 12 docs.openclaw.ai pages that 404'd at research time; a follow-up pass using gh api on the repo docs/ tree fills those gaps.

* docs(design): fill openclaw-integration gaps via gh api on the repo

Resolved 2 of 7 Appendix A open questions from primary source on github.com/openclaw/openclaw. Struck through 9 of the 12 404'd docs.openclaw.ai URLs in Appendix B where the repo had a mirror. Added <!-- source: ... --> comments so future readers know which claims are primary-sourced.

MVP impact: startup health-check loop updated from ss fallback to `openclaw health --timeout` (Q5 resolved); gateway.bind: "lan" confirmed as the correct key for container external binding (Q1 resolved).

* feat(trace): seal historic bucket files read-only after 2h rollover

Trace files older than 2h are chmod'd 0o400 during eviction so the librarian's source-of-truth for historic agent activity is tamper-proof on-disk. Rare late-arriving events (clock skew, deferred processing) route to a sibling {bucket}.late.jsonl which stays writable -- zero-loss guarantee preserved even for the extreme edge.

list() merges .db + .jsonl + .late.jsonl with dedup by event id (primary wins). Sealing runs opportunistically inside _evict_old_buckets; no background task.
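The hourly bucket routing and the sealing rule from the trace commits can be illustrated with a short sketch. The function names are hypothetical, and one detail is an assumption: the commit says files "older than 2h" are sealed, and this example measures that from the end of the bucket's hour.

```python
import os
from datetime import datetime, timedelta, timezone
from pathlib import Path

BUCKET_FORMAT = "%Y-%m-%dT%H"  # one SQLite file per UTC hour: YYYY-MM-DDTHH.db


def bucket_name(created_at: datetime) -> str:
    """Route by the event's own timestamp (must be timezone-aware), not the write time."""
    return created_at.astimezone(timezone.utc).strftime(BUCKET_FORMAT) + ".db"


def should_seal(bucket: str, now: datetime, grace: timedelta = timedelta(hours=2)) -> bool:
    """True once `grace` has passed since the bucket's hour ended (assumed reading)."""
    start = datetime.strptime(bucket.removesuffix(".db"), BUCKET_FORMAT)
    hour_end = start.replace(tzinfo=timezone.utc) + timedelta(hours=1)
    return now - hour_end > grace


def seal(path: Path) -> None:
    """Make a historic bucket read-only for its owner, as the commit describes (0o400)."""
    os.chmod(path, 0o400)
```

With this routing, the 14:59:59.999 event from the commit message maps to `...T14.db` no matter when it is written, which is why a write that straddles the hour boundary cannot land in the wrong bucket.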
* feat(openclaw): bridge endpoints: bootstrap, SSE events, reply ingestion

Three endpoints the openclaw fork patch (src/taos-bridge.ts) calls:
GET /api/openclaw/bootstrap -> config snapshot at startup
GET /api/openclaw/sessions/{a}/events -> SSE stream of user messages
POST /api/openclaw/sessions/{a}/reply -> deltas, final, tool events, errors

BridgeSessionRegistry holds one queue per agent; chat router enqueues user messages, openclaw subscribes and reads them, replies flow back through /reply which writes to the per-agent trace store and broadcasts via the chat hub (message_delta for streaming, edit_message+state complete for final).

Bearer local-token auth on all three endpoints.

* feat(openclaw): install.sh uses real openclaw from jaylfc fork + Node 22

Replace Python FastAPI stub with real openclaw npm install from the jaylfc/openclaw fork (taos-fork branch). Installs Node 22.x via NodeSource since Debian bookworm ships Node 18. Bumps manifest disk_mb to 2000 to accommodate the Node runtime. Pinned to upstream main SHA be7a415eb096.

* feat(deployer): write openclaw.json + .openclaw/env into agent-home at deploy

Before create_container runs, deployer writes the openclaw gateway config and the bridge env file into the host-side agent-home directory. The agent-home bind-mount carries them into /root/.openclaw/ inside the container, where the systemd unit created by install.sh picks them up.

session_id == req.name (slug) for MVP; bridge endpoints already key on agent name so no separate UUID is needed at this stage.

* refactor(chat): agent_chat_router enqueues to bridge session, not HTTP POST

Obsolete: raw HTTP POST to container :8100/message was the Python-stub integration. Real openclaw talks to taOS via the bridge adapter: taOS owns the SSE stream openclaw's fork patch subscribes to, and replies come back through POST /api/openclaw/sessions/{agent}/reply (which broadcasts to the chat hub and writes traces).
So the router shrinks to: on a user message in an agent's DM channel, call registry.enqueue_user_message(slug, msg). That's it. All reply plumbing lives in routes/openclaw.py.

* fix(deployer): use bind=instance for incus proxy devices to avoid host port conflict

When litellm is already running on 127.0.0.1:4000 on the host, adding an incus proxy device with the default bind_mode tries to re-bind that port on the host and fails with EADDRINUSE. Setting bind=instance makes incus bind the listen address inside the container instead, so host services that already own the port are not disturbed.

* fix(install): chown npm cache before global install to fix EACCES in fresh containers

Debian's apt-installed nodejs leaves /root/.npm with mixed ownership, causing npm install -g to fail with errno -13. Fixing ownership before the install is the documented npm fix for this condition.

* fix(install): remove stale npm cache before global install

Debian's apt npm (v8) creates /root/.npm with state that blocks the newer npm shipped with Node 22. rm -rf before the install is more reliable than chown since the issue is the cache format, not just ownership.

* fix(install): pre-create npm cache dir and use --unsafe-perm for root installs

The Debian npm post-install creates /root/.npm with problematic ownership. rm + mkdir ensures a clean dir; --unsafe-perm suppresses the root-cache check in older npm versions that remain in the system PATH during install.

* fix(deployer): remap container root to host process uid via raw.idmap

Agent-home directories are owned by the taOS process user (uid 1000). Without a UID mapping, incus containers run as an offset uid (100000+) that cannot write to those host dirs, causing npm and other tools to fail with Permission Denied when creating files under /root. Setting raw.idmap 'both <host_uid> 0' before attaching mounts maps container root to the host process owner so bind-mounted dirs are writable.
Requires one stop/start cycle after setting the idmap. Revert the earlier npm --unsafe-perm workaround; it was masking this.

* fix(install): use HTTPS URL for npm install instead of github: shorthand

The github: prefix makes npm resolve to git+ssh://... which requires GitHub SSH keys that fresh containers do not have, causing hangs. Using git+https:// avoids the SSH key requirement.

* fix(install): use tarball URL instead of git+https to avoid SSH fallback

npm's git+https:// handling still falls back to SSH for github.com repos. Using the tarball URL (https://github.com/.../tarball/<branch>) downloads a plain HTTPS tarball and avoids git transport entirely.

* feat(host-firewall): install systemd one-shot to ACCEPT incusbr0 through docker DROP

Docker installed on the same host as incus sets iptables FORWARD policy to DROP and adds its own ACCEPT rules only for docker bridges. Incus-created containers (taOS agents) fall through to the default DROP for TCP sessions Docker's chains don't claim -- causing symptoms like github.com unreachable from inside containers while npmjs.org (cached via Cloudflare) works.

The Docker-blessed fix is to insert ACCEPT rules into the DOCKER-USER chain. A one-shot systemd unit does this at boot, idempotently. The fix is reversible (ExecStop removes the rules).

install.sh on the Pi drops the scripts into /opt/tinyagentos/scripts/ and enables the unit before tinyagentos.service comes up, so containers have working networking on first boot.

* feat(host-firewall): path unit, timer, subnet probe, connectivity check

Hardens the existing host-firewall oneshot against the realistic install matrix with Docker in the picture:
- tinyagentos-host-firewall.path: fires the oneshot whenever /var/run/docker.pid appears, so a user who apt-installs Docker after taOS gets their rules reapplied without rebooting.
- tinyagentos-host-firewall.timer: 5-minute re-assertion (belt-and-braces for any Docker chain churn we didn't expect).
- scripts/incus-bridge-probe.sh: detects incusbr0 IPv4 collisions with pre-existing bridges and reassigns to a free RFC1918 /24 at install time; idempotent no-op on clean hosts.
- install.sh: calls the probe after incus init, then runs a throwaway ephemeral container that curls github.com and registry.npmjs.org as a post-install connectivity smoke test. Warns but does not block on failure so users can diagnose and retry.
- host-firewall-up.sh gains a --check mode (exit 1 when rules are missing, used by the timer for visibility) and skips cleanly on hosts without incus.
- host-firewall.service now uses ConditionPathExistsGlob so iptables-nft-only systems skip instead of failing.
- detect_runtime() in containers/backend.py logs the selection + alternatives every call and makes the LXC-preferred policy explicit in its docstring.

Policy: LXC is the preferred agent runtime; Docker coexists for the app store's containerised services but never takes precedence over LXC for agents.

* docs(coexistence): LXC / Docker coexistence policy + runbook

Companion to the host-firewall systemd unit / path / timer. Documents:
- Why LXC is the preferred agent runtime and Docker coexists without handicapping it.
- Clash matrix (iptables FORWARD, subnet, chain re-ordering, first-boot race, host port, runtime install ordering, cgroups).
- Install scenarios -- what happens end-to-end for fresh Debian, Docker-before-taOS, taOS-then-Docker, Docker restart.
- Operational runbook: diagnosing a silent network failure, adding / removing Docker from a running taOS host, smoke testing coexistence.
- Runtime selection policy: detect_runtime() prefers LXC, Docker path exists only as fallback.

framework-agnostic-runtime.md's existing Host firewall subsection links to this doc for the full story.

* chore(gitignore): exclude AI-assistant artefacts + new superpowers plans/specs

Keeps the repo looking fully human-authored per project policy.
CLAUDE.md, GEMINI.md, AGENTS.md, OPENHANDS.md, .claude/, .aider*, .continue/, .cursorrules, .cursor/, .copilot-instructions.md, .windsurf/, .playwright-mcp/, and new docs/superpowers/plans/ + specs/ now stay local. Existing tracked files under docs/superpowers/ remain tracked (changing that would rewrite history); only new additions are ignored. * fix(host-firewall): correct systemd condition syntax + remove path-unit cycle - tinyagentos-host-firewall.service used a single-line space-separated ConditionPathExistsGlob which systemd evaluated to no-match, silently skipping the unit. Split into two ConditionPathExists= directives with the "|" OR-prefix so the unit runs when iptables exists at either path. - tinyagentos-host-firewall.path declared After=tinyagentos-host-firewall .service alongside the implicit Wants= from [Path].Unit=, producing an ordering cycle that systemd rejected. Removed the After= — the path unit activates the service on path-change; systemd handles ordering. * feat(messages): archived channels section + dead-agent grey-out - Add ArchivedChannel type extensions (settings with archived_at, archived_agent_id) - Fetch /api/chat/channels?archived=true alongside live channels on init - Fetch /api/agents and /api/agents/archived for author resolution - Collapsible "Archived" section in sidebar (desktop + mobile) with per-row Restore (RotateCcw) and Delete Permanently (Trash2) hover actions - Restore calls POST /api/agents/archived/{id}/restore; disabled with tooltip when archived agent entry is missing - Delete calls DELETE /api/chat/channels/{id} with confirmation - Opening archived channel shows full message history - Archived banner above composer; input + send disabled for archived channels - resolveAuthorDisplayState() pure helper: maps author_id to active/archived/removed - Greyed author names (opacity 0.55, strikethrough) with tooltip for dead agents message body remains fully readable at reduced opacity; no body strikethrough - Accessibility: 
aria-expanded/aria-controls on collapsible, aria-label on all buttons * build: rebuild desktop bundle with archived-chats UI * feat(chat): archived filter on /api/chat/channels + ensure_message helper channel_store.list_channels() gains an `archived: bool | None` param — None = no filter (existing callers unchanged), True = only channels with settings.archived truthy, False = only channels where it's falsy. MessagesStore.ensure_message(msg) idempotently re-inserts a message by id (INSERT OR IGNORE). Used by the restore path when re-importing a chat-export.jsonl; callers can retry safely. routes/chat.py::list_channels forwards the new query param through. * feat(archive): export chat to agent-home + reimport on restore + purge channels At _archive_agent_fully: iterate DM channels the agent is a member of, dump every message to {agent-home}/{slug}/.taos/chat-export.jsonl (one envelope per line, 0o600). Channel settings flagged with {archived: true, archived_at, archived_agent_id, archived_agent_slug}. At restore_archived_agent: if a chat-export exists, stream it back into chat_messages via ensure_message() (idempotent by id); unflag every channel where archived_agent_id matches this archive. At purge_archived_agent: delete every channel's messages + the channels themselves for channels flagged with this archive_id, then rm -rf the archive bucket as before. Irreversible. Chat history now travels with the agent's home folder — archive / backup / restore / cross-worker migration all carry it automatically. * docs(design): accept architecture pivot v2 — 10 decisions resolved Turning the §10 open questions into §10 resolved decisions: 1. btrfs storage pool (portable across the mixed-arch cluster via each host's Linux layer). 2. archive.target configurable, default pool: 3. chat history in both tarball + global DB (already shipped). 4. snapshot preferred, rsync fallback. 5. Garage primary S3 NAS + optional FUSE POSIX mount per agent. 6. 
Recycle-bin Layer 1 + Layer 3 only (skip libtrash LD_PRELOAD in Phase 1).
7. Forgejo (not Gitea).
8. taos-archive-<ts> snapshot prefix; auto-snapshots untouched.
9. Garage sled metadata backend (cross-arch portable).
10. 40 GiB default per-agent quota; per-agent override.

Status flipped from Proposal to Accepted. Phase 1 (disk quota + recycle bin) is the next code to land.

* feat(disk-quota): host-side monitor + resize API + threshold notifications

Adds DiskQuotaMonitor with btrfs/incus/df sampling priority chain, threshold-transition notifications (ok/warn/hard), hard-threshold agent pausing, and live quota resize via incus. HTTP surface at GET /api/agents/{name}/disk, POST /api/agents/{name}/quota, POST /api/disk-quota/scan. NotificationStore gains disk_quota event type. 31 new tests cover edges, transitions, pause behaviour, and routes.

* feat(disk-quota): systemd timer + install.sh wiring

Adds tinyagentos-disk-quota.service (oneshot calling disk-quota-scan.sh) and tinyagentos-disk-quota.timer (every 5 min, starts 2 min after boot). install.sh installs the script to /opt/tinyagentos/scripts/ and enables both units after the existing host-firewall block.

* docs(recycle-bin): runbook for the soft-delete system

Covers how the per-container recycle bin works, trash-cli ops, escape hatches, what is not covered (Layer 2/3), and admin ops.

* feat(admin-prompts): library of structured admin tasks + HTTP endpoint

tinyagentos/admin_prompts/*.md, initial library:

- disk-audit: agent audits own disk, proposes [DELETE]/[MOVE-TO-NAS]/[KEEP] per item, user confirms
- memory-audit: RSS + cache inspection with proposed cleanups
- health-report: read-only status + error-log summary
- weekly-summary: trace-driven self-report, tokens + cost + highlights

Each prompt forces the agent to check current date first, list proposed actions before executing, and require user confirmation for every destructive step.
Reminders point at /usr/local/bin/rm (recycle-bin soft-delete) over /usr/bin/rm.

GET /api/admin-prompts enumerates, GET /api/admin-prompts/<name> returns the body so the Messages composer can prefill (Phase 1.B UI lands separately).

* feat(fs-snapshot): Snapper backstop on btrfs pools (Layer 3 recycle-bin)

Detects incus storage pool driver on install; if btrfs, installs and configures Snapper with config name taos-containers, hourly x24 + daily x7 retention. ZFS and dir backends skip gracefully with a clear message. Wired into install.sh after the host-firewall block; non-fatal if the optional backstop fails. Includes probe script for operator diagnostics and a bash structural test.

* docs(fs-snapshot): runbook for Layer 3 recycle-bin backstop

Covers what Layer 3 does versus Layers 1/2, how to verify snapper is running, listing snapshots, file restoration from btrfs snapshot paths, disabling the timers, and storage cost expectations.

* feat(recycle): list/restore/purge API routes for container recycle bins

GET /api/agents/{name}/recycle: list an agent's /var/recycle-bin/
GET /api/recycle: aggregated view across all agents
POST /api/agents/{name}/recycle/restore: restore one item
DELETE /api/agents/{name}/recycle/{id}: permanent purge

Items are exposed with a base64url id derived from the original path so the frontend doesn't need to hold a mapping table. All container interactions go via the existing exec_in_container abstraction, so the same code works for LXC and Docker backends transparently. Offline containers return status=container_offline with an empty list instead of a 5xx; the UI can render "agent is stopped" without coupling to agent state.

* feat(frameworks): two-tier beta/alpha verification status; openclaw first

Consolidates framework verification statuses from four tiers (tested/beta/experimental/broken) to two (beta/alpha/broken). openclaw is the only beta entry; all others become alpha.
Updates adapter registry, all 15 agent manifests, and framework route tests with regression guard against the retired "experimental" status.

* feat(agents-ui): order openclaw first; Beta + Alpha labels in framework picker

Sorts the framework list so openclaw appears at the top. Updates the Framework interface union type to beta|alpha|broken. Replaces the "beta"/"experimental" pills with "Beta" (amber) and "Alpha · Testing" (neutral). The show-alpha toggle and deselect-on-hide logic are wired to the new alpha status.

* build: rebuild desktop bundle with framework picker reorder

* feat(ui): disk-quota card + admin-prompt prefill + recycle-bin browser

Phase 1.B: per-agent disk quota pill (warn/hard) in AgentRow; notification cards above agent list with Expand +10 GB and Audit with agent actions. Cross-app navigation via taos:open-messages CustomEvent to MessagesApp.

Phase 1.C-frontend: MessagesApp listens for taos:open-messages, selects channel, fetches GET /api/admin-prompts/{name}, stuffs body into composer, shows dismissible prefill banner above input area.

Phase 1.E.2: Recycle Bin location in FilesApp sidebar; fetches GET /api/recycle, grouped by agent, per-item Restore (POST) and Delete Permanently (DELETE) with confirm dialogs; empty state and container-offline notice.

* build: rebuild desktop bundle with Phase 1 frontend

Rebuilt after disk-quota card, admin-prompt composer prefill, and recycle-bin browser features.

* feat(containers): set_root_quota + root_size_gib on create_container

Add set_root_quota(name, size_gib) to the module-level API (__init__.py), LXCBackend, and DockerBackend. Add root_size_gib param to create_container in all three; quota is applied after launch, before mounts/env. Docker overlay2-without-pquota returns success with a soft note rather than a hard failure. Abstract base class updated with the new abstract method signatures.
Tests cover success, overlay2 soft path, genuine failure, and pass-through from create_container for both LXC (module-level) and Docker backends.

* refactor(deployer): snapshot-model -- single trace mount, no workspace/memory/home bind mounts

Replace the three host-side bind mounts (workspace, memory, home) with a single trace mount: {data_dir}/trace/{slug}/ -> /root/.taos/trace/.

Remove _write_openclaw_bootstrap; install.sh now writes /root/.openclaw/openclaw.json + .openclaw/env inside the container via env vars injected at create_container time (TAOS_BRIDGE_URL, OPENAI_BASE_URL, OPENAI_API_KEY, TAOS_LOCAL_TOKEN, TAOS_AGENT_NAME).

Add root_size_gib=40 to DeployRequest (default per arch pivot v2 S10.10) and pass it through to create_container.

Update test_deployer.py: remove old three-mount + openclaw host-write assertions, add test_one_trace_bind_mount, test_no_workspace_memory_home_mount, test_root_quota_passed_through_default, test_root_quota_custom_value_honoured, test_trace_dir_created_on_host, test_bridge_url_injected_into_env.

agent_env.py is NOT deleted: routes/agents.py (Phase 2.B) still imports update_agent_env_file for the restore path. Phase 2.B will clean it up.

* feat(openclaw): install.sh writes /root/.openclaw config + env inside container

Add section 2a between npm install and the recycle-bin block. Uses env vars injected by the deployer (TAOS_AGENT_NAME, TAOS_MODEL, OPENAI_BASE_URL, OPENAI_API_KEY, TAOS_BRIDGE_URL, TAOS_LOCAL_TOKEN) to write:

/root/.openclaw/openclaw.json (mode 600): gateway + LiteLLM provider config
/root/.openclaw/env (mode 600): EnvironmentFile for systemd unit

Both files live inside the container rootfs and travel with snapshot archives. Safe defaults via := fallback for dev/test environments where not all vars are set.

Remove the old section 3 that only ensured the .openclaw dir existed (no longer needed; section 2a creates it). Renumber comments for sections 3 and 4.
* feat(containers): snapshot_create + snapshot_restore + snapshot_list + set_env

* refactor(trace): store path moved to {data_dir}/trace/{slug} (bind-mount target)

_agent_trace_dir now returns data_dir/trace/slug instead of data_dir/agent-home/slug/.taos/trace. Aligns with the Phase 2.A deployer bind-mount (data_dir/trace/{slug}/ → /root/.taos/trace/). Updates module docstring and framework-agnostic-runtime.md to reflect the new path and the rationale for separating trace from home-folder.

* feat(migrate): script to move legacy agent-home/*/.taos/trace bucket files

scripts/migrate-trace-paths.sh walks data_dir/agent-home/*/ for .taos/trace directories and moves .db/.jsonl files to data_dir/trace/{slug}/. Idempotent: skips if source absent, no-clobber merge if both old and new paths exist. install.sh runs it as a non-fatal step on every install/upgrade after disk-quota setup.

* refactor(agents): archive/restore/purge use incus snapshot primitives

* feat(config): archive.target configurable (pool | path | s3)

* refactor(env): delete obsolete tinyagentos/agent_env.py -- functionality moved into install.sh + incus set_env

* docs(design): framework-agnostic-runtime thesis evolution: containers hold their own state

Rewrite framework-agnostic-runtime.md to reflect the Phase 2.A–2.C post-pivot reality: containers hold their own state, hosts hold the federation. The three bind mounts (workspace/memory/home) are gone; the single trace bind mount remains. The "Per-agent home" section is replaced by "Three bind mounts removed" and "agent_env.py removed" migration notes. The rule application checklist gains a sixth question on archive atomicity. The "Why the pivot" section summarises the reasoning. The audit table and Related code section are updated to match what shipped.

Add a "Post-landing status" banner to architecture-pivot-v2.md marking the decision record complete with commit ranges for Phases 1 and 2.A–2.C.
Add a note to lxc-docker-coexistence.md that Docker's lack of incus snapshot primitives means graceful fallback on that backend.

* docs(runbook): rewrite archive/restore/purge for snapshot primitives

Full rewrite of agent-archive-restore.md: archive flow now documents incus snapshot create, chat export to host-owned path, and snapshot_name in config. Restore flow documents incus snapshot restore, set_env for new LLM key, systemctl restart openclaw. Purge documents incus delete --force destroying snapshots atomically.

Adds Quick reference table, archive.target options table, and legacy migration section for pre-Phase-2 entries that have no snapshot_name (identify via jq filter, purge via existing DELETE endpoint, or contact dev for manual re-snapshot). Troubleshooting covers snapshot-not-found, rename collision, env-rewrite failure, and openclaw restart failure.

* fix(containers): env setter key=value form; root-quota via override for profile-inherited devices

Two related deploy blockers:

1. `incus config set <name> environment.<key> <value>` parses a value starting with `-` as a CLI flag. Local auth tokens from `secrets.token_urlsafe(32)` can legitimately start with `-` or `_`, so deploys failed with `unknown shorthand flag: 'X'` whenever the token landed on that character. Switch to the single-argument `environment.<key>=<value>` form, which is the documented canonical syntax and parses unambiguously.
2. `incus config device set <name> root size=...` rejects root devices inherited from a profile with `Device from profile(s) cannot be modified for individual instance.` Switch to `incus config device override` which creates a per-instance copy of the device if it doesn't exist, then falls back to `set` if an override is already present.

Tests cover: dash-prefixed token value, root quota on profile-inherited device.
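To make the first blocker in the env-setter fix concrete, here is a minimal Python sketch of building the argv in the single-argument `environment.<key>=<value>` form. The helper name `incus_set_env_argv` is hypothetical and is not taken from the repository; only the command shape comes from the commit message above.

```python
from typing import List


def incus_set_env_argv(instance: str, key: str, value: str) -> List[str]:
    """Build argv for setting one environment variable on an instance.

    Hypothetical helper: the value is fused into a single
    `environment.<key>=<value>` token, so a value that begins with "-"
    never appears as its own argv entry and cannot be parsed as a flag.
    """
    return ["incus", "config", "set", instance, f"environment.{key}={value}"]


if __name__ == "__main__":
    # A token that starts with "-" stays inside the key=value token.
    print(incus_set_env_argv("agent-1", "TAOS_LOCAL_TOKEN", "-Xabc123"))
```

Passing the list to `subprocess.run` without `shell=True` keeps the token as one argument end to end.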
* fix(openclaw): npm install via github: shorthand so prepare lifecycle builds in container

We were using the tarball URL form which skips npm's prepare lifecycle. That pushed the build burden onto the fork branch (committing prebuilt dist/), which kept landing as incomplete and crashing openclaw.mjs:178.

Switch to `npm install -g github:jaylfc/openclaw#taos-fork`. This triggers prepare → pnpm build:docker in the destination (container), which is npm's standard mechanism for git-sourced packages that need building. corepack-activate pnpm before install.

Adds 2-3 minutes to first deploy on arm64 Pi for the build step. Reliable; removes the brittle committed-artefact dance.

* fix(openclaw): ensure git installed before npm github: shorthand install

npm install -g github: requires git to clone the repo. Fresh Debian bookworm containers don't have git; add a guard to install it if absent before the npm install step.

* fix(openclaw): install prebuilt tarball from GitHub Releases (no per-deploy build)

Per-deploy builds are not beginner-friendly (slow on arm64, fragile because of pnpm workspace context, depends on Pi having full build toolchain present and configured). Switch to downloading prebuilt tarballs published by the fork's CI workflow as Release assets.

Architecture detection picks arm64 vs x64. URL is the GitHub-stable 'releases/latest/download/<asset>' redirect so we always grab the freshest build with no version bumps in this repo.

Failure mode: if download fails, install.sh exits non-zero with a clear error; there is intentionally no build fallback. The fix when GitHub is unreachable is to fix connectivity, not to silently start building.

* fix(openclaw): add --ignore-scripts to npm install-g from prebuilt tarball

Tarball already has dist/ built by CI. Running prepare at install time tries to spawn git (for hook config) then falls back to pnpm build:docker; both fail in a fresh container that has no git/pnpm.
The bin entry (openclaw.mjs) is wired directly; prepare output is not needed.

* fix(openclaw): re-add git prerequisite for libsignal transitive dep

libsignal (@whiskeysockets/baileys dep) has a git+https URL so npm needs git at install time even when installing a prebuilt tarball. No build happens; this is purely npm fetching a dependency.

* fix(openclaw): update openclaw.json config to match v2026.4 schema

- providers must be a record keyed by provider id, not an array
- gateway.mode: local must be set or gateway refuses to start
- models field must be an array (empty ok), not default_model string

* fix(openclaw): defer service start until llm_key written to config

openclaw gateway calls the bootstrap endpoint on startup which requires llm_key in the agent config. Previously install.sh started the service immediately, causing HTTP 409 crash-loop before the deployer had written the key. Now:

- install.sh enables the unit but defers start (no --now)
- _background_deploy() starts openclaw.service after writing llm_key and saving config, so bootstrap succeeds on first attempt

* fix(openclaw): allow null llm_key in bootstrap, fix providers schema in response

- bootstrap now returns 200 when llm_key is null (no LiteLLM proxy case) instead of HTTP 409 which caused the gateway to crash-loop
- bootstrap response uses providers-as-record format to match openclaw v2026.4 config schema (same fix as install.sh)

* fix(openclaw): bootstrap uses built-in litellm provider + LITELLM_API_KEY

- Return models.providers.litellm (not taos) with correct openclaw schema
- models[] built from agent.model + fallback_models, each as {id,name,contextWindow,maxTokens,input,reasoning}
- agents.defaults.model.primary = "litellm/<agent.model>"
- Drop default_model field; use \${LITELLM_API_KEY} substitution instead of raw key
- Revert e356a98 empty-string fallback: null llm_key returns 409 with clear message
- Update bootstrap shape assertions + add fallback-models length test + null-key 409
test

* fix(deployer): inject LITELLM_API_KEY + TAOS_MODEL/TAOS_FALLBACK_MODELS env

- Add LITELLM_API_KEY env var for openclaw's litellm provider (same value as per-agent virtual key)
- Keep OPENAI_API_KEY set to same value as compat shim for smolagents and other frameworks
- TAOS_MODEL always set (empty string when unconfigured, not omitted)
- Add TAOS_FALLBACK_MODELS env var (comma-separated) so install.sh can build models[] at install time
- Add fallback_models field to DeployRequest dataclass
- Set LITELLM_API_KEY="" fallback when no proxy configured, matching OPENAI_API_KEY pattern

* fix(openclaw): install.sh writes openclaw.json with litellm provider schema

- Provider name changed from taos to litellm; baseUrl points to 127.0.0.1:4000 (no /v1 suffix)
- apiKey uses \${LITELLM_API_KEY} env-var substitution for openclaw's runtime resolution
- models[] array built at install time from TAOS_MODEL + comma-separated TAOS_FALLBACK_MODELS
- agents.defaults.model.primary set to "litellm/<TAOS_MODEL>" prefix
- env file extended with LITELLM_API_KEY and TAOS_FALLBACK_MODELS entries

* fix(deployer): correct DeployRequest field order (fallback_models after data_dir)

* fix(llm_proxy): own port 4000, drop adoption, log /key/generate failures

Adopting a pre-existing LiteLLM on :4000 silently reuses whatever config and master key the foreign process booted with, so UI-added providers never take effect and /key/generate rejects the Bearer sk-taos-master with 401 (silent return None). Agents then deploy with llm_key=null and openclaw crashes with LITELLM_API_KEY missing.

start() now terminates any foreign PID on the port (SIGTERM, 5s grace, SIGKILL) before spawning its own process. is_running() reports ownership only. reload_config() drops its adopted branch. Admin calls that can return non-200 now log status + body so master-key mismatches surface in logs instead of being swallowed.
* fix(llm_proxy): emit master_key + deployer fallback to shared key when no DB

LiteLLM's /key/generate requires Postgres; SQLite is unsupported. On the default single-user Pi deployment there's no DB, so LiteLLM runs in routing-only mode and cannot issue per-agent virtual keys. Previously create_agent_key returned None in that mode and the deployer set LITELLM_API_KEY="", which crashed the openclaw gateway on boot with "LITELLM_API_KEY is missing or empty".

Routing-only mode is now the supported default path:

- general_settings.master_key added to the yaml config
- LITELLM_MASTER_KEY exported into the subprocess env
- Single source of truth TAOS_LITELLM_MASTER_KEY constant
- Deployer falls back to the master key when virtual-key issue fails, so the container always gets a usable auth token

Users who configure Postgres later still get per-agent virtual keys through the same /key/generate path.

* fix(providers): postgres-backed virtual keys + generic provider catalog + model discovery

- LLMProxy accepts database_url; app reads data/.litellm_db_url at boot and exports it as DATABASE_URL into the litellm subprocess so /key/generate can mint per-agent virtual keys.
- Add Provider fills canonical URL from PROVIDER_URL_DEFAULTS and probes {url}/models to populate the model list when empty; generic across openai, anthropic, openrouter, kilocode (no per-type branching on the probe). Falls back to per-type seed list (kilocode → kilo-auto/free) when the probe returns nothing so the entry still registers at least one routable model.
- Deployer scopes the minted virtual key to the agent's primary + fallback models (models=[req.model, *fallback_models]) instead of defaulting to the unrestricted "default" alias.
- Deployer fails loudly when a DB is configured but /key/generate still returns None; hiding that class of failure is what shipped the broken kilocode path in the first place.
- generate_litellm_config now WARNs when a cloud-type backend is missing url or models, so silent drops surface in logs instead of showing up as a broken agent much later.
- scripts/repair_providers.py repairs legacy config.yaml entries that pre-date the autofill/discovery logic.

* fix(llm_proxy): resolve api_key_secret values into subprocess env

Generated LiteLLM configs use os.environ/<name> markers to reference provider api keys, but nothing was actually exporting those names into the subprocess env. Cloud providers therefore hit the litellm OpenAIException "api_key client option must be set" even with a correctly-configured backend list.

LLMProxy.start/reload_config now accept a secrets={name: value} map. app.py resolves each backend.api_key_secret from the secrets store at boot and again on catalog-change reload; routes/providers.py does the same on add/patch/delete so newly-added or rotated keys take effect without a full app restart.

* feat(litellm_migrate): auto-apply Prisma schema on boot when DB is configured

LiteLLM's /key/generate requires a Postgres-backed Prisma schema, but LiteLLM does not run migrations itself. Fresh installs had to manually run `pip install prisma && prisma generate && prisma db push` before virtual keys worked.

New tinyagentos/litellm_migrate.py locates the bundled schema at litellm/proxy/schema.prisma, probes for LiteLLM_VerificationToken in the configured DB, and shells out to the venv's prisma CLI only when the table is missing. Idempotent: safe on every boot. Called from the lifespan hook before LLMProxy.start() so LiteLLM sees a ready schema.

Added prisma>=0.11.0 to the proxy optional dependency group so the CLI lands in the venv on fresh installs.
* fix(litellm_callback): wire callbacks under litellm_settings + sibling shim for get_instance_fn

* feat(providers): /api/providers/models passthrough with refresh + ttl cache

* feat(agents): agent-creation model picker reads from LiteLLM passthrough

* fix(litellm_migrate): prepend venv bin to PATH so prisma-client-py resolves under systemd

* fix(litellm_migrate): psql probe fallback so boot doesn't wrongly rerun migration

* fix(litellm_migrate): only run prisma generate, let LiteLLM own DB migration

Running prisma db push from our bootstrap created tables without seeding _prisma_migrations, so LiteLLM's own prisma migrate deploy at startup tried to apply migration #1 against an already-populated schema and looped on "type JobStatus already exists", leaving the proxy unhealthy.

Our helper's only job now is to make prisma.client importable so LiteLLM can run its shipped migrations itself. Drop the db push and the psql/psycopg probe; keep the systemd PATH fix for prisma generate.

* fix(llm_proxy): prepend venv bin to PATH and widen startup wait

LiteLLM's proxy_cli shells out ``subprocess.run(["prisma"])`` during startup to detect whether Prisma is runnable. Under systemd the service's default PATH doesn't include our venv's bin/, so the lookup raises FileNotFoundError and LiteLLM prints "prisma package not found" and skips DB setup entirely, leaving virtual-key issuance broken even though the package IS installed in the venv.

Prepend the venv bin that already hosts the litellm binary so the child process resolves ``prisma`` (and ``prisma-client-py`` for generate).

Also bump the startup wait from 30s to 120s: LiteLLM on a fresh Pi DB runs ``prisma migrate deploy`` before opening its HTTP port, which takes 45-60s on ARM.

* fix(llm_proxy): capture LiteLLM stderr to sibling log file

stderr=DEVNULL silently swallowed proxy startup failures (prisma migration errors, config parse errors, model-router failures), turning "why is the proxy unhealthy?"
into a 30-minute debugging hunt. Write stderr to a file next to litellm_config.yaml so operators can read it without attaching strace.

* fix(llm_proxy): poll health/readiness and drop SIGHUP reload

Two separate bugs kept LiteLLM from ever settling on the Pi.

1. Startup polling hit ``/health``, which gates on the master key and returns 401 for an unauthenticated client. LiteLLM was healthy within ~50s but ``start()`` kept polling until the 120s timeout, logged "failed to start within 120s", and returned False even though the subprocess was fine. ``/health/readiness`` is the public endpoint.
2. ``reload_config`` sent SIGHUP to trigger a config reload. LiteLLM runs as single-worker uvicorn (no ``--workers``), which does not register a SIGHUP handler, so the default action (terminate) fires. Every ``/api/providers/models?refresh=true`` was silently killing the proxy, then ``_fetch_litellm_models`` got connection-refused and returned []. Drop SIGHUP entirely; the existing stop+start path was already the fallback.

Also switch the foreign-process probe to ``/health/readiness`` for the same 401 reason.

* fix(llm_proxy): forward TAOS_LOCAL_TOKEN to LiteLLM subprocess

The TaosLiteLLMCallback running inside the LiteLLM subprocess POSTs llm_call events back to the taOS bridge at ``/api/trace``, which requires the local auth token. The callback's token-discovery logic checks ``TAOS_LOCAL_TOKEN`` env first, then ``/data/.auth_local_token`` and ``~/.taos/.auth_local_token``. Under systemd the real token lives at ``{data_dir}/.auth_local_token`` (none of the candidate paths), so every callback fired a POST without Authorization and taOS responded 401, leaving trace rows with no ``llm_call`` events despite LiteLLM actually processing requests.

Read the token in app.py and forward it via the new ``local_token`` constructor kwarg on LLMProxy, which exports it into the subprocess env.
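The readiness-polling fix above can be illustrated with a small Python sketch. It assumes the proxy listens on 127.0.0.1:4000 (the port named in the log) and exposes `/health/readiness` without authentication; `http_ready` and `wait_until_ready` are hypothetical names, not the functions in `LLMProxy`.

```python
import time
import urllib.request
from typing import Callable


def http_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True only when the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError (for example a 401 from an endpoint that
        # requires a key) are OSError subclasses, as are refused connections.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    deadline_s: float = 120.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it succeeds or the deadline passes."""
    end = clock() + deadline_s
    while True:
        if probe():
            return True
        if clock() >= end:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    url = "http://127.0.0.1:4000/health/readiness"
    print(wait_until_ready(lambda: http_ready(url), deadline_s=5.0))
```

Pointing the probe at an authenticated endpoint makes every attempt look like a failure, which matches the symptom described in the commit: a healthy process that is reported as never having started.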
* fix(litellm_callback): extract agent slug from user_api_key_metadata

LiteLLM 1.83.4 surfaces the agent slug in litellm_params.metadata under user_api_key_metadata.agent (matching what LLMProxy.create_agent_key writes when minting the virtual key). The previous extraction read metadata.key_alias which is no longer populated on success events, so every llm_call trace was bucketed under the _unknown_ sentinel slug.

Walks four sources in priority order:

1. user_api_key_metadata.agent
2. user_api_key_auth_metadata.agent
3. user_api_key_alias (strips the taos- prefix)
4. key_alias (legacy, kept for older LiteLLM builds)

* feat(trace): record message_in events so transcript captures both sides

enqueue_user_message now writes a message_in trace event under the agent's slug, following the ENVELOPE_V1_SCHEMA message_in shape ({from, text}) with extra informational fields (message_id, author_type, delivery). Guards against orphan _unknown_ or empty-slug entries. Fails soft: trace write errors are logged, never raised.

* feat(agents): persist optional emoji on agent record + deploy API

* feat(agents): emoji picker in create flow + display in agent UI (rebuilt PWA)

* fix(agents): tolerant DELETE for orphan agents: skip snapshot/stop when container absent (#221)

Failed deploys leave behind a config row with no LXC container, which caused DELETE /api/agents/{name} to error on snapshot_create. Probe container_exists first; for orphans, skip stop/snapshot, revoke any LiteLLM key, and either hard-delete the row (no history) or record a tombstone (chat/trace present so purge is available from Archived).

Adds container_exists helper to tinyagentos.containers; four new tests cover the orphan hard-delete, orphan tombstone, skipped-snapshot assertion, and purge of a snapshotless tombstone.
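The four-source slug lookup from the litellm_callback fix above can be sketched as a pure Python function. This is an illustration under assumptions: the function name `extract_agent_slug` is hypothetical, the metadata is treated as a plain mapping, and the legacy `key_alias` is returned unmodified because the log does not say whether its prefix is stripped.

```python
from typing import Any, Mapping

UNKNOWN_SLUG = "_unknown_"  # sentinel slug named in the commit message
ALIAS_PREFIX = "taos-"


def extract_agent_slug(metadata: Mapping[str, Any]) -> str:
    """Walk the four metadata sources in priority order."""
    # Sources 1 and 2: nested dicts that carry an "agent" field.
    for container_key in ("user_api_key_metadata", "user_api_key_auth_metadata"):
        nested = metadata.get(container_key)
        if isinstance(nested, Mapping):
            agent = nested.get("agent")
            if isinstance(agent, str) and agent:
                return agent

    # Source 3: key alias with the taos- prefix stripped.
    alias = metadata.get("user_api_key_alias")
    if isinstance(alias, str) and alias:
        slug = alias[len(ALIAS_PREFIX):] if alias.startswith(ALIAS_PREFIX) else alias
        if slug:
            return slug

    # Source 4: legacy field (assumption: used as-is).
    legacy = metadata.get("key_alias")
    if isinstance(legacy, str) and legacy:
        return legacy

    return UNKNOWN_SLUG


if __name__ == "__main__":
    print(extract_agent_slug({"user_api_key_alias": "taos-research-bot"}))
```

Keeping the lookup free of I/O makes the priority order easy to cover with unit tests when a LiteLLM upgrade moves the field again.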
* feat(agents): pre-built openclaw LXC base image for fast deploys

Adds a GitHub Actions workflow that builds per-arch Debian 13 LXC base images with Node 22, openclaw, and recycle-bin scaffolding already installed. Published as assets on the 'rolling-images' Release tag.

The deployer now checks for the 'taos-openclaw-base' image alias before launching; when present it uses the cached image and sets TAOS_BASE_IMAGE_PRESENT=1 so install.sh skips the apt-get + npm steps. Without the image the deployer falls back transparently to images:debian/bookworm and install.sh does the full install.

tinyagentos.agent_image exposes is_image_present and ensure_image_present helpers; the latter runs as a background task on app startup to bootstrap the image on first boot.

Closes #220

* ci(agents): fix bridge forwarding + NAT for incus in GHA runner

* fix(agent_image): use os.pipe() so curl stdout actually reaches incus stdin

The previous impl passed curl.stdout (a Python StreamReader) as stdin= to the incus subprocess, which asyncio cannot forward as an OS-level FD. Curl would read the first ~90KB then block on a pipe nobody was draining. Using an explicit os.pipe() pair with the read end handed to incus and the write end to curl gives us a real kernel pipe and the import completes.

* fix(agent_image): use temp file + positional alias for incus 6.x

Incus 6.x rejects '-' as stdin for image import and rejects bare HTTPS URLs (expects an incus image server). Download to a temp file then pass its path. Also fix image list query: positional <alias> arg (--filter=alias=... is only valid for container list).

* docs(openclaw): rename provider to built-in litellm in integration tracker

The design doc still referenced models.providers.taos (a custom provider that was abandoned mid-implementation in favour of openclaw's built-in litellm provider type). Updated the bootstrap example, the integration tracker table, and the openclaw.json shape to match what actually ships.
The channels-side "provider: taos" identifier is unchanged; that's the channel-kind name, separate from the LLM provider.
Summary
Security and reliability improvements based on a thorough code review. All changes are backwards-compatible and all 127 existing tests pass.
- Default bind 127.0.0.1: server, install.sh, and config.yaml all defaulted to 0.0.0.0, exposing the unauthenticated dashboard to the entire network. Changed to localhost-only; users who need network access can explicitly set host: 0.0.0.0 in config
- Agent name validation: ^[a-z0-9][a-z0-9-]{0,62}$ regex check on create and deploy endpoints, preventing malformed names from reaching incus CLI
- Framework allowlist: deploy rejects unknown frameworks instead of pip install-ing arbitrary user input inside containers
- Atomic config writes: save_config() now writes to .yaml.tmp then renames, preventing corruption if the process crashes mid-write
- QMD serve inside containers: added NoNewPrivileges=true to the systemd unit template
- Removed hardcoded Tailscale/LAN IPs from data/config.yaml

Test plan
- All 127 existing tests pass (pytest tests/ -v)
- Deploy with framework: "malicious-package" → should get 400

🤖 Generated with Claude Code
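As an illustration of two items from the summary, the following Python sketch shows the name check and the write-then-rename pattern. It is not the PR's code: `is_valid_agent_name` and `atomic_write_text` are hypothetical names, and the `fsync` call is an addition the PR text does not mention. Only the regex and the `.tmp`-then-rename idea come from the description.

```python
import os
import re

# Pattern quoted in the PR summary: lowercase alphanumerics and hyphens,
# starting with an alphanumeric, 1 to 63 characters in total.
AGENT_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}$")


def is_valid_agent_name(name: str) -> bool:
    """Return True when the name matches the documented pattern."""
    # fullmatch is used because `$` alone also matches just before a
    # trailing newline, which would let "name\n" through with match().
    return AGENT_NAME_RE.fullmatch(name) is not None


def atomic_write_text(path: str, text: str) -> None:
    """Write to <path>.tmp, flush it to disk, then rename over the target."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write(text)
        fh.flush()
        os.fsync(fh.fileno())  # assumption: durability step added for the sketch
    # os.replace overwrites the destination in one step on the same filesystem.
    os.replace(tmp_path, path)


if __name__ == "__main__":
    print(is_valid_agent_name("agent-1"), is_valid_agent_name("-bad"))
```

A reader of the config file sees either the old content or the new content, never a half-written file, because the rename is the only step that touches the target path.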