v5.0.0: Server architecture decomposition for enterprise deployment #57

Merged
Number531 merged 18 commits into main from worktree-server-refactor on Mar 19, 2026
Conversation

@Number531 (Owner)

Summary

  • Decompose 2,229-line claude-sdk-server.js monolith into 7 focused modules (883-line shell + 6 handler modules)
  • Add 78 automated tests across 5 new test suites
  • Port Docker fixes from v4.12.5–v4.12.8 (tini, stderr callbacks, firstMessageTimer, DEBUG_CLAUDE_AGENT_SDK, writable HOME)
  • Add graceful shutdown with 30-second stream draining
  • Reorder auth middleware before body parsing
  • Add defense-in-depth error wrapping at delegation boundary

Why

The monolithic server was the primary source of GCE deployment instability. A single 1,070-line handler with 8 levels of nesting and 30+ closure-coupled variables made debugging, testing, and safe modification impossible. This refactoring preserves identical behavior while enabling modular development, independent testing, and enterprise containerized deployment.

Modules Created

| Module | Lines | Responsibility |
|---|---|---|
| clientRegistry.js | 190 | 32 API client singletons + MCP server lazy init |
| streamContext.js | 201 | SessionContext class: SSE lifecycle, backpressure |
| agentStreamHandler.js | 632 | Agent SDK multi-turn orchestration |
| p0Orchestrator.js | 170 | P0 document processing phase |
| legacyStreamHandler.js | 287 | Legacy single-turn path |
| researchHandler.js | 277 | /api/research endpoint |

Verification

  • 78/78 automated tests passing
  • 50+ verification agents confirmed zero functional regressions
  • Live end-to-end session completed via frontend dashboard
  • Docker build context verified — all import chains resolve
  • Dockerfile synced with main (tini, DEBUG env, writable HOME)

Test plan

  • npm test — 78/78 passing
  • Live server startup on port 3099 — all endpoints responding
  • Full research session via frontend dashboard — subagents, tools, hooks, SSE all working
  • Graceful shutdown test (SIGTERM → stream drain → DB pool close)
  • Docker build + deploy to GCE staging
  • Full memorandum generation on containerized instance

🤖 Generated with Claude Code

Number531 and others added 18 commits March 17, 2026 23:28
…ons (Phase 1)

Move all 32 API client imports and 5 lazy-initialized singleton functions
(getClients, getSdkTools, getAgentSdkMcpServer, getDomainMcpServers,
createRateLimiters) into dedicated clientRegistry.js module.

Server: 2229 → 2062 lines (-167). Zero behavior change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ycle (Phase 2)

Create SessionContext class replacing 30 closure-captured variables with
explicit passable state. Includes send() with backpressure, heartbeat,
session timeout, idempotent end(), and disconnect handlers.

10 unit tests covering all SSE lifecycle methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
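
To make the send() backpressure and idempotent end() behavior concrete, here is a minimal sketch. It is not the code in streamContext.js: the constructor shape, the `onEnd` option, and the SSE frame layout are assumptions, and heartbeat and session timeout are left out.

```javascript
// Hypothetical sketch; the real SessionContext in streamContext.js may differ.
class SessionContext {
  constructor(res, { onEnd } = {}) {
    this.res = res;       // an http.ServerResponse-like object
    this.onEnd = onEnd;   // assumed cleanup hook
    this.ended = false;
  }

  // Write one SSE frame. If write() reports a full buffer, wait for 'drain'
  // before resolving so callers that await send() slow down with the client.
  async send(event, data) {
    if (this.ended) return false;
    const frame = `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
    const ok = this.res.write(frame);
    if (!ok) {
      await new Promise((resolve) => this.res.once("drain", resolve));
    }
    return true;
  }

  // Idempotent: only the first call closes the response and fires onEnd.
  end() {
    if (this.ended) return;
    this.ended = true;
    this.res.end();
    if (this.onEnd) this.onEnd(this);
  }
}
```

A production version would also need to stop waiting for 'drain' when the client disconnects; the sketch omits that for brevity.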
Move POST /api/research handler into dedicated module with factory
pattern. 34 deps passed via createResearchHandler(deps). Not yet wired
into server — will be connected in Phase 7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
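
The factory pattern mentioned here can be sketched as follows. The real createResearchHandler takes 34 dependencies; `runResearch` and `logger` below are hypothetical stand-ins, as are the response shapes.

```javascript
// Hypothetical sketch of a handler factory: dependencies are injected once,
// and the returned function is what gets mounted on the route.
function createResearchHandler(deps) {
  const { runResearch, logger } = deps; // assumed dependency names

  return async function researchHandler(req, res) {
    try {
      const result = await runResearch(req.body.query);
      res.status(200).json({ result });
    } catch (err) {
      logger.error("research failed", err);
      res.status(500).json({ error: "internal_error" });
    }
  };
}
```

Because the handler closes over `deps` rather than module-level state, a test can pass fakes for every dependency without starting the server.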
…hase 3)

Move document-processing-analyst agentQuery loop, message forwarding,
and state file verification into dedicated module. Non-fatal: errors
caught internally, reported via ctx.send(), never thrown.

8 unit tests covering all skip/success/failure/disconnect paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
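
A rough illustration of the "non-fatal" contract described above: the phase reports failures through ctx.send() and resolves normally instead of rejecting. The function name, the `runP0` parameter, and the event names are assumptions.

```javascript
// Hypothetical sketch; the real p0Orchestrator.js drives an agentQuery loop.
async function runP0Phase(ctx, runP0) {
  try {
    const summary = await runP0();
    ctx.send("p0_complete", summary);
    return { ok: true, summary };
  } catch (err) {
    // Report and continue: the caller never sees a rejection from this phase.
    ctx.send("p0_error", { message: err.message });
    return { ok: false, error: err.message };
  }
}
```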
…hase 5)

Move legacy Messages API streaming handler (USE_AGENT_SDK=false) into
dedicated module. All closure refs replaced with ctx/deps parameters.
Not yet wired into server — will be connected in Phase 7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hase 4)

Move entire Agent SDK multi-turn orchestration (agentQuery loops,
truncation detection, auto-continuation, P0 delegation, document
conversion, DB persistence) into dedicated 613-line module.

Exports: handleAgentStream, AUTO_CONTINUATION_CONFIG, detectTruncation,
detectCompletion, TRUNCATION_PATTERNS, COMPLETION_PATTERNS.

21 unit tests for pure detection functions and config defaults.
Not yet wired into server — will be connected in Phase 7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
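
Pure detection functions like these are easy to unit test because they depend only on their input. The sketch below shows the general shape; the regular expressions are invented for illustration and are not the real TRUNCATION_PATTERNS or COMPLETION_PATTERNS, which this PR page does not show.

```javascript
// Illustrative patterns only (assumed, not taken from agentStreamHandler.js).
const TRUNCATION_PATTERNS = [/\.\.\.$/, /\[continued\]$/i];
const COMPLETION_PATTERNS = [/^## Conclusion/m, /\bEND OF MEMORANDUM\b/];

// True when the output appears to stop mid-thought.
function detectTruncation(text) {
  const tail = text.trimEnd();
  return TRUNCATION_PATTERNS.some((re) => re.test(tail));
}

// True when the output contains an explicit completion marker.
function detectCompletion(text) {
  return COMPLETION_PATTERNS.some((re) => re.test(text));
}
```

The patterns deliberately avoid the `g` flag, since a global regex keeps `lastIndex` state between `test()` calls and would make the functions impure.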
…Phase 7)

Replace inline /api/stream (1075 lines) and /api/research (190 lines)
handlers with thin delegators to extracted modules. Add serverDeps
object, activeContexts Set, and enhanced gracefulShutdown with 30s
stream drain.

Server shell now contains only: imports, env validation, DB schema init,
middleware, Anthropic client config, route delegation, health/catalog
endpoints, and shutdown logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
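
One way the activeContexts Set and the 30-second drain could fit together is sketched below. Only the names `activeContexts` and `gracefulShutdown` come from the commit message; the options object and polling approach are assumptions, and the DB pool close mentioned in the test plan is omitted.

```javascript
// Hypothetical sketch of stream draining on shutdown.
const activeContexts = new Set();

async function gracefulShutdown({ server, drainMs = 30000, pollMs = 100 }) {
  server.close(); // stop accepting new connections

  // Give in-flight streams up to drainMs to finish on their own.
  const deadline = Date.now() + drainMs;
  while (activeContexts.size > 0 && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }

  // Force-end whatever is still open after the drain window.
  for (const ctx of activeContexts) ctx.end();
  activeContexts.clear();
}
```

In a real server this would typically be registered with `process.on("SIGTERM", ...)`, with each stream context removing itself from the Set when it ends.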
Fix 2 string-match assertions that broke when requestContext.enterWith
and createFreshToolUsage moved to agentStreamHandler.js.

Add server-refactor-regression.test.js: 17 tests verifying module
exports, server shell structure, delegation patterns, graceful shutdown,
and line count reduction.

74/74 tests passing across all 5 refactoring test suites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move cookieAuthMiddleware + authRouter BEFORE express.json() in the
middleware stack. Unauthenticated requests are now rejected without
paying the body parsing cost. /api/auth routes are bypassed by
cookieAuthMiddleware (BYPASS_PATHS).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
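
The effect of the reorder can be shown without Express by walking a middleware stack by hand. Everything below is a stand-in: `jsonBodyParser` represents express.json(), `runStack` mimics in-order middleware dispatch, and the BYPASS_PATHS value and cookie shape are assumed.

```javascript
// Hypothetical stand-ins; only the ordering is the point of this sketch.
const BYPASS_PATHS = ["/api/auth"]; // assumed value
let bodiesParsed = 0;

function cookieAuthMiddleware(req, res, next) {
  if (BYPASS_PATHS.some((p) => req.path.startsWith(p))) return next();
  if (!req.cookies || !req.cookies.session) {
    res.statusCode = 401;
    return; // chain stops here: later middleware never runs
  }
  next();
}

function jsonBodyParser(req, res, next) {
  bodiesParsed++; // stands in for the cost of parsing the body
  req.body = JSON.parse(req.rawBody || "{}");
  next();
}

// Minimal runner that calls each middleware in registration order.
function runStack(stack, req, res) {
  let i = 0;
  const next = () => {
    const mw = stack[i++];
    if (mw) mw(req, res, next);
  };
  next();
}
```

With auth registered first, a request without a session cookie is rejected before `jsonBodyParser` is ever invoked, so `bodiesParsed` stays at zero for it.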
- Remove orphaned resetSubagentToolUsage import (moved to modules)
- Wrap handleAgentStream/handleLegacyStream calls in try-catch for
  defense-in-depth against handler error escapes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match agentStreamHandler pattern: set span = null after endSpan() to
prevent the defense-in-depth wrapper from attempting a second close
if an error escapes to the outer catch block.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
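
The interaction between the defense-in-depth wrapper and the `span = null` convention can be sketched like this. The `tracer` object, `delegate`, and the `persist` step are hypothetical; only the idea of nulling the span after endSpan() comes from the commit message.

```javascript
// Hypothetical sketch: the span is closed exactly once no matter where the
// error is thrown.
function delegate(tracer, handler, persist) {
  let span = tracer.startSpan();
  try {
    handler();
    tracer.endSpan(span);
    span = null; // already closed; the catch below must not close it again
    persist();   // work that can still fail after the span is closed
    return { ok: true };
  } catch (err) {
    if (span) tracer.endSpan(span); // only closes a span that is still open
    return { ok: false, error: err.message };
  }
}
```

Without the `span = null` assignment, an error thrown by `persist()` would reach the catch block with a stale reference and end the same span a second time.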
Prevent any error in the onEnd callback (activeStreamCount decrement,
activeContexts cleanup) from propagating and potentially crashing
the cleanup path. Callback errors are now absorbed and logged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
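
A minimal sketch of absorbing callback errors on the cleanup path, assuming a helper named `safeOnEnd` (the name and the injectable `log` parameter are not from the PR):

```javascript
// Hypothetical helper: run the onEnd callback, but never let it throw into
// the code that is tearing the stream down.
function safeOnEnd(onEnd, ctx, log = console.error) {
  try {
    if (onEnd) onEnd(ctx);
  } catch (err) {
    log("onEnd callback failed:", err.message);
  }
}
```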
Add 4 behavioral tests addressing test coverage gaps:
- P0: verify agentQuery called with correct system prompt + doc count
- P0: verify MCP servers passed correctly in non-scoped mode
- StreamContext: verify res.end() actually called on end()
- StreamContext: verify res.write() receives correct SSE data format

78/78 tests now passing (was 74 structural-only).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Main branch received 67d6220 after our worktree branched. Ports:
- stderr callback on both P0 and main agentQuery calls (captures
  subprocess errors instead of silently discarding them)
- 120s firstMessageTimer with SSE error event for silent hangs
- Timer cleanup in catch block

Ref: #52

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
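
The first-message timer can be illustrated as follows. This is a sketch under assumptions: `messages` stands in for the async iterable returned by agentQuery, the function name and event payloads are invented, and cleanup is done in a `finally` block, which also covers the error path the commit handles in its catch block.

```javascript
// Hypothetical sketch: emit an SSE error if the agent produces no message
// within timeoutMs, and always clear the timer when the loop exits.
async function forwardWithFirstMessageTimer(messages, ctx, timeoutMs = 120000) {
  let firstMessageTimer = setTimeout(() => {
    ctx.send("error", { message: `No message from agent after ${timeoutMs}ms` });
  }, timeoutMs);

  const clearFirstMessageTimer = () => {
    clearTimeout(firstMessageTimer);
    firstMessageTimer = null;
  };

  try {
    for await (const msg of messages) {
      if (firstMessageTimer) clearFirstMessageTimer(); // first message arrived
      ctx.send("message", msg);
    }
  } finally {
    clearFirstMessageTimer(); // runs on success, on error, and on empty streams
  }
}
```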
Add `debug: !!process.env.AGENT_SDK_DEBUG` to the P0 document
processing agentQuery options, matching main branch behavior.
Enables diagnostic logging when AGENT_SDK_DEBUG=1 is set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port 3 critical Docker changes from main (commits 14de481, e577129, fc8fa6e):
- Install tini for proper subprocess signal forwarding (Agent SDK cli.js)
- Create /home/app/.claude with writable HOME (SDK CLI subprocess needs it)
- Set DEBUG_CLAUDE_AGENT_SDK=1 to restore SubagentStart/Stop hooks in Docker
- Add ENTRYPOINT ["tini", "--"] before CMD

Without these, Agent SDK hooks fail silently in containerized deployments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ployment

Major release: decompose 2,229-line monolith into 7 focused modules.
Server shell reduced to 883 lines. 78 automated tests. Enterprise
container deployment ready with tini, subprocess stderr capture,
first-message timeout, and Docker hook reliability.

See CHANGELOG.md for complete details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve merge conflict in claude-sdk-server.js — keep refactored
modular shell (all monolithic code now lives in extracted modules).
Brings in docs/docker-hooks-issue.md from main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Number531 Number531 merged commit 53292fb into main Mar 19, 2026
Number531 added a commit that referenced this pull request Jun 2, 2026
The 8 KG_* edge-wave flags are absent on main, so they activate in production for the first time on this merge — meaning Wave 4's own rollout policy (higher FP risk; 'leave commented out for the first 7 days after deploy, enable only after manual spot-check') had not been satisfied (the soak never started). Comment KG_CONTRADICTION_EDGES out in flags.env per that policy; the other 7 KG waves ship ON (deterministic/additive/isolated). feature-flags.md #57 + CHANGELOG updated. Enable Wave 4 after a 7-day soak + manual CONTRADICTS spot-check on the first post-merge production sessions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>