Deep dive into Cortex's master-worker architecture and internal systems.
Cortex is a multi-agent AI system that automates repository management using a distributed master-worker pattern. This document explains how it works under the hood.
Masters are domain specialists that route and coordinate work:
- Coordinator Master: Routes tasks to appropriate specialists using MoE pattern matching
- Development Master: Handles features, bug fixes, refactoring
- Security Master: CVE detection, vulnerability scanning, remediation
- Inventory Master: Repository cataloging, documentation
- CI/CD Master: Build automation, deployment, testing
Workers are ephemeral agents that execute specific tasks:
- Implementation Worker: Code implementation
- Fix Worker: Bug fixes
- Test Worker: Test creation and execution
- Scan Worker: Security scanning
- Security Fix Worker: CVE remediation
- Documentation Worker: Documentation generation
- Analysis Worker: Code analysis
All inter-agent communication happens through JSON files in coordination/:
coordination/
├── task-queue.json # Pending/in-progress/completed tasks
├── worker-pool.json # Active workers and their status
├── token-budget.json # Token usage tracking
├── repository-inventory.json # Managed repositories
├── dashboard-events.jsonl # Event stream for dashboard
└── governance/
└── overrides.jsonl # Governance enforcement audit trail
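The exact schemas of these files are internal to Cortex; as a rough illustration, a `task-queue.json` entry might look like the following (field names are hypothetical):

```json
{
  "tasks": [
    {
      "id": "task-001",
      "description": "Fix null-pointer bug in auth module",
      "status": "pending",
      "created_at": "2025-01-15T10:00:00Z"
    }
  ]
}
```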
Why file-based?
- Simple: No message broker needed
- Observable: Can inspect state with `cat` and `jq`
- Debuggable: Full history in version control
- Resilient: Survives restarts
The Coordinator uses MoE pattern matching to route tasks:
- Keyword Analysis: Extract domain keywords from task description
- Confidence Scoring: Calculate confidence for each master (0.0-1.0)
- Sparse Activation: Only activate masters above threshold
- Strategy Selection:
- Single expert (confidence >= 0.70)
- Multi-expert parallel (multiple above 0.25)
- Single expert low confidence (best available)
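The strategy selection above can be sketched in a few lines. The thresholds (0.70 and 0.25) come from this document; the keyword tables and the scoring formula are illustrative, not Cortex's actual routing data:

```python
# Sketch of MoE keyword routing: score each master by keyword overlap,
# then pick a strategy based on the confidence thresholds above.
KEYWORDS = {
    "development": {"bug", "fix", "feature", "refactor"},
    "security": {"cve", "vulnerability", "scan"},
    "cicd": {"build", "deploy", "test", "pipeline"},
}

def score(task: str) -> dict:
    """Toy confidence score in [0, 1] per master (0.35 per keyword hit)."""
    words = set(task.lower().split())
    return {m: min(1.0, 0.35 * len(words & kw)) for m, kw in KEYWORDS.items()}

def route(task: str):
    scores = score(task)
    best = max(scores, key=scores.get)
    above = [m for m, s in scores.items() if s >= 0.25]  # sparse activation
    if scores[best] >= 0.70:
        return ("single_expert", [best])
    if len(above) > 1:
        return ("multi_expert_parallel", above)
    return ("single_expert_low_confidence", [best])
```

A task like "fix the login bug" hits two development keywords and clears the single-expert threshold, while an ambiguous task activates multiple masters in parallel.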
Routing Methods:
- Keyword (87.5% accuracy): Pattern matching on task description
- Semantic (94.5% accuracy): Embedding-based similarity
- PyTorch Neural (optional): Trained model predictions
9 Autonomous Daemons:
- Coordinator Daemon: Routes tasks continuously
- Worker Daemon: Spawns and manages workers
- PM Daemon: Process management
- Heartbeat Monitor: Detects stuck workers
- Zombie Cleanup: Kills hung processes
- Worker Restart: Auto-restarts failed workers
- Failure Pattern Detection: Identifies systemic issues
- Auto-Fix: Automatic remediation
- Dashboard Server: Real-time metrics
Health Monitoring:
    # Check daemon status
    ./scripts/daemon-control.sh status

    # View health metrics
    curl http://localhost:3000/api/health

Task lifecycle:
1. User/API → task-queue.json
2. Coordinator reads task-queue.json
3. MoE router determines best master
4. Master spawns worker(s)
5. Worker executes task
6. Worker updates worker-pool.json
7. Worker writes result
8. Coordinator marks task complete
9. Dashboard shows progress
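Steps 2 and 8 above can be sketched as two small file operations. File names match the document; the task schema is assumed for illustration:

```python
import json
from pathlib import Path

# Coordinator side of the lifecycle: claim the next pending task (step 2),
# then mark it complete once the worker reports back (step 8).
def claim_next_task(queue_path: Path):
    queue = json.loads(queue_path.read_text())
    for task in queue["tasks"]:
        if task["status"] == "pending":
            task["status"] = "in_progress"
            queue_path.write_text(json.dumps(queue, indent=2))
            return task
    return None  # nothing pending

def complete_task(queue_path: Path, task_id: str):
    queue = json.loads(queue_path.read_text())
    for task in queue["tasks"]:
        if task["id"] == task_id:
            task["status"] = "completed"
    queue_path.write_text(json.dumps(queue, indent=2))
```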
Token budget flow:
1. Task starts → Estimate tokens needed
2. Check token-budget.json
3. If budget available → Allow task
4. Execute task → Track actual usage
5. Update token-budget.json
6. Daily reset at midnight
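The budget gate in steps 2–3 amounts to a simple projection check. The 95% hard limit appears in the governance flow below; the shape of `token-budget.json` is assumed:

```python
# Sketch of the token budget gate: allow a task only if its estimated
# usage keeps the day's total under the hard limit.
HARD_LIMIT = 0.95  # block at 95% of the daily budget

def can_run(budget: dict, estimated_tokens: int) -> bool:
    projected = budget["used_today"] + estimated_tokens
    return projected <= HARD_LIMIT * budget["daily_limit"]

def record_usage(budget: dict, actual_tokens: int) -> dict:
    budget["used_today"] += actual_tokens  # step 5: track actual usage
    return budget
```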
Governance enforcement flow:
1. Task submitted
2. Pre-flight validation:
- Check token budget (hard limit at 95%)
- Detect dangerous operations
- Verify critical task approval
- Check master-specific rules
3. If violations → Block and log to overrides.jsonl
4. If approved → Allow worker spawn
5. Audit trail maintained
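The pre-flight checks above might be combined like this. The check names, dangerous-pattern list, and the record shape written to `overrides.jsonl` are illustrative assumptions:

```python
import json
import time

DANGEROUS = ("rm -rf", "git push --force", "DROP TABLE")  # illustrative patterns

# Sketch of pre-flight validation: collect violations, log blocks to the
# audit trail, and only allow a worker spawn when the list is empty.
def preflight(task: dict, budget_ratio: float, log_path=None) -> bool:
    violations = []
    if budget_ratio >= 0.95:
        violations.append("token_budget_hard_limit")
    if any(p in task.get("command", "") for p in DANGEROUS):
        violations.append("dangerous_operation")
    if task.get("critical") and not task.get("approved"):
        violations.append("missing_critical_approval")
    if violations and log_path:
        with open(log_path, "a") as f:  # append-only audit trail (JSONL)
            f.write(json.dumps({"task_id": task["id"],
                                "violations": violations,
                                "ts": time.time()}) + "\n")
    return not violations
```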
Uses sentence transformers for embedding-based routing:
Task description → Embedding model →
Cosine similarity with master embeddings →
Highest similarity = Best master
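The similarity step can be shown in pure Python. In Cortex the vectors come from a sentence-transformer model; the toy 2-d vectors here just illustrate the math:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def semantic_route(task_vec, master_vecs, threshold=0.6):
    """Return the most similar master, or None below the confidence threshold."""
    best, sim = max(((m, cosine(task_vec, v)) for m, v in master_vecs.items()),
                    key=lambda pair: pair[1])
    return best if sim >= threshold else None
```

The default threshold of 0.6 mirrors the `SEMANTIC_CONFIDENCE_THRESHOLD` setting below.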
Configuration:
    SEMANTIC_ROUTING_ENABLED=true
    SEMANTIC_CONFIDENCE_THRESHOLD=0.6

Trained neural network for task-to-master prediction:
Task features → PyTorch model →
Softmax output → Master probabilities
Status: Requires training data from routing decisions
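The softmax step above turns raw model scores into a probability per master. A minimal illustration (the logit values are made up, and this stands in for what the PyTorch model would output):

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# e.g. scores for (development, security, cicd)
probs = softmax([2.0, 0.5, 0.1])
```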
Retrieval Augmented Generation for context-aware decisions:
Task → Query vector DB →
Retrieve relevant code/docs →
Augment LLM prompt → Better decisions
Components:
- FAISS vector store
- Sentence-transformers (all-MiniLM-L6-v2)
- Code chunking and indexing
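The retrieval step can be illustrated without FAISS: FAISS does this at scale with approximate nearest-neighbor search, while a linear scan over toy vectors shows the same idea. Chunk texts, vectors, and the prompt template are made up:

```python
import math

def _cos(a, b):
    """Cosine similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vec, index, k=2):
    """Return the k chunk texts most similar to the query vector.

    index: list of (chunk_text, vector) pairs, standing in for the FAISS store.
    """
    ranked = sorted(index, key=lambda item: _cos(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def augment_prompt(task, chunks):
    """Prepend retrieved context to the task, as in the RAG flow above."""
    return "Context:\n" + "\n".join(chunks) + "\n\nTask: " + task
```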
Every routing decision is logged with full reasoning:
    {
      "task_id": "task-001",
      "decision": {
        "expert": "development",
        "confidence": 0.89,
        "keywords_matched": ["bug", "fix"],
        "reasoning_trail": [
          "Step 1: Keyword Analysis",
          "Step 2: Found 2 development keywords",
          "Step 3: Confidence scoring",
          "Step 4: Routed to development"
        ]
      }
    }

API: `GET /api/decisions/:taskId/explain`
A/B testing framework validates ML features:
    # Run validation
    ./llm-mesh/validation/ml-validator.sh

    # View results
    curl http://localhost:3000/api/ml-validation

Validates:
- Semantic routing vs keyword routing accuracy
- RAG effectiveness (when instrumented)
- PyTorch prediction accuracy (when enabled)
Decomposition Savings: 60-80% token reduction
- Instead of one large prompt, break into smaller worker tasks
- Workers only see relevant context
- Parallel execution reduces total time
94% success rate (from production metrics)
- Self-healing recovers from transient failures
- Retry logic with exponential backoff
- Zombie cleanup prevents hung processes
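The retry logic mentioned above follows the standard exponential-backoff shape. The defaults here (3 attempts, 1s base delay) are illustrative, not Cortex's actual settings:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The injectable `sleep` parameter keeps the sketch testable without real delays.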
Current Scale: ~20 repositories
- File-based coordination works well
- No performance issues at this scale
Future Scale (100+ repos):
- May need message queue (RabbitMQ, Redis)
- May need worker pooling
- May need distributed coordination
Dashboard requires API key:
    curl -H "x-api-key: your-key" http://localhost:3000/api/tasks

- Token budget limits prevent runaway costs
- Dangerous operation detection blocks risky commands
- Critical task approval requires human sign-off
- Full audit trail in overrides.jsonl
API keys are stored in `.env` (gitignored):

    ANTHROPIC_API_KEY=sk-ant-...

Never commit API keys to version control.
To add a new master:
- Create a master script: `coordination/masters/your-master/main.sh`
- Add routing patterns to `routing-patterns.json`
- Update the MoE router to include the new master
- Create worker types as needed
To add a new worker type:
- Create a worker template: `scripts/lib/worker-templates/your-worker.sh`
- Add spawn logic to `spawn-worker.sh`
- Define worker capabilities
- Test with a sample task
To add an experimental ML feature:
- Enable it in `.env`: `YOUR_FEATURE_ENABLED=true`
- Implement the feature logic
- Add it to the ML validation config
- Run A/B tests to validate
- If validation passes, keep the feature enabled
- If validation fails, disable it and simplify
Workers becoming zombies:
- Check `coordination/worker-pool.json`
- Review worker logs in `agents/logs/workers/`
- Run zombie cleanup: `./scripts/lib/zombie-cleanup.sh`
Routing decisions incorrect:
- View decision reasoning: `GET /api/decisions/:taskId/explain`
- Check routing patterns: `coordination/masters/coordinator/knowledge-base/routing-patterns.json`
- Run ML validation to compare methods
Token budget exceeded:
- Check usage: `cat coordination/token-budget.json | jq`
- Review governance blocks: `cat coordination/governance/overrides.jsonl | jq`
- Adjust the daily limit if needed
    # System health
    ./scripts/status-check.sh

    # Worker status
    ./scripts/worker-status.sh

    # Live monitoring
    ./scripts/system-live.sh

    # Daemon logs
    ./scripts/daemon-control.sh logs

    # Debug helper
    ./scripts/debug-helper.sh

Pros:
- Simplicity: No DB to manage
- Observability: Can inspect with standard tools
- Version control: Full history in git
- Portability: Works anywhere with filesystem
Cons:
- Concurrent writes need locking
- Not suitable for 1000+ req/sec
- No complex queries
Verdict: Right choice for current scale (<100 repos)
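The "concurrent writes need locking" con above is commonly addressed with an advisory file lock around each read-modify-write. This is a sketch using POSIX `fcntl` (Unix-only); Cortex's actual locking strategy is not specified here:

```python
import fcntl
import json

def update_json(path, mutate):
    """Atomically read-modify-write a JSON file under an exclusive lock.

    mutate: callable that edits the loaded dict in place.
    """
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # block until we hold the lock
        data = json.load(f)
        mutate(data)
        f.seek(0)
        f.truncate()
        json.dump(data, f, indent=2)
        fcntl.flock(f, fcntl.LOCK_UN)   # release (also dropped on close)
```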
- Bash: System integration, process management, file operations
- Node.js: Dashboard, API server, real-time updates
- Python: ML features (PyTorch, transformers, FAISS)
Each language used for its strengths.
Traditional automation (CI/CD, scripts):
- Requires exact instructions
- Breaks on edge cases
- No learning or adaptation
LLM-based automation (Cortex):
- Handles ambiguity
- Adapts to new situations
- Learns from history
- Natural language interface
Trade-off: Higher cost (API tokens) vs higher capability