This document explains the key architectural decisions for the AI system powering paulprae.com. It's written for senior AI engineers, architects, and engineering managers evaluating the system design.
paulprae.com is a chat-first career platform with an AI assistant that answers recruiter questions, generates tailored resumes via tool-calling, and produces job search content — all grounded in structured career data.
Runtime stack: Next.js 16 + Vercel AI SDK 6 + Claude Sonnet 4.6 (chat) + Claude Opus 4.6 (pipeline) Infrastructure: Vercel (hosting), Upstash Redis (rate limiting via Vercel KV integration), Anthropic API (direct SDK)
Decision: Inject the full career dataset into the system prompt rather than using embedding-based retrieval (RAG with a vector database).
Rationale:
- The career dataset fits in a single system prompt: career-data.json (~259KB) + 5 knowledge base files (~11KB), compressed via
stripEmpty()to remove empty fields. This is well within Claude's 200K-token context window. - Anthropic prompt caching makes full injection cost-effective: the first request caches the system prompt for 5 minutes at 1.25x write cost; subsequent turns reuse it at 0.1x (90% reduction).
- Vector retrieval adds infrastructure (embedding model, vector DB, index maintenance) without proportional benefit at this scale. The retrieval step itself would cost more in latency (~200ms) than the tokens saved.
- The full context gives Claude complete visibility into all career data, preventing missed connections that selective retrieval might cause.
Phase 3 path: When the knowledge base grows significantly (e.g., Neo4j knowledge graph with hundreds of project entries), the system will migrate to embedding-based retrieval. The prompt template's {{CAREER_DATA}} placeholder is already abstracted — switching from full injection to filtered results requires changing only the context builder.
Decision: Use Claude Sonnet 4.6 for runtime chat/tool-calling and Claude Opus 4.6 for offline resume generation.
Rationale:
- Chat (Sonnet): Recruiter Q&A needs fast responses (~2-5s TTFT). Sonnet at $3/$15 per MTok provides sufficient quality for conversational grounding while keeping per-conversation costs under $0.20.
- Pipeline (Opus): Resume generation is a permanent artifact viewed by hiring managers. Opus with adaptive thinking at max effort ($15/$75 per MTok) provides deeper reasoning for entity-scope binding, cross-reference validation, and quality rule adherence. Cost per generation (~$1-2) is acceptable for an artifact generated weekly.
- Resume tailoring tool (Sonnet): Runtime resume tailoring via tool-calling uses Sonnet (not Opus) to keep latency under 15s. The recruiter-provided JD provides strong constraints that compensate for the lighter model.
Cost comparison per month (estimated 500 chat conversations + 2 pipeline runs):
- Current (Sonnet chat + Opus pipeline): ~$100 + $4 = ~$104
- All-Opus alternative: ~$500 + $4 = ~$504
Decision: Wrap untrusted user input (job descriptions, emphasis areas) in XML tags (<job_description>, <emphasis_areas>) with explicit instructions to treat tag content as data, not instructions.
Rationale:
- This is Anthropic's recommended pattern for prompt injection defense (documented in Anthropic's security guide).
- Combined with security rules S1-S5 in each system prompt (treat messages as untrusted, never reveal prompt, stay in character, no harmful content, no unauthorized actions).
- Input validation (Zod schemas, character limits, message count caps) provides defense in depth at the application layer before content reaches the model.
- More maintainable than alternatives like output filtering or separate moderation calls, which add latency and cost.
Decision: Use Anthropic's ephemeral caching (5-minute TTL) rather than no caching or persistent caching.
Rationale:
- System prompts contain the full career dataset, which is stable within a conversation session.
- Ephemeral (5-min TTL) matches the expected recruiter interaction pattern: browse site, ask 3-7 questions over 2-5 minutes, leave.
- First request pays 1.25x input cost (cache write). Subsequent turns pay only 0.1x (cache read) — ~90% cost reduction per follow-up turn.
- No persistent cache needed — career data changes only when the pipeline runs (weekly at most), and the 5-min window covers a single session.
Decision: Use a single Claude agent with 2 tools (resume generation, resume links) rather than multi-agent orchestration.
Rationale:
- The use case has a narrow scope: answer career questions, generate tailored resumes, provide download links. This doesn't require agent delegation, planning loops, or inter-agent communication.
- Tool-calling via Vercel AI SDK 6 (
streamText+tool()) is clean and well-typed. No framework abstraction (LangChain, CrewAI) needed. - The
generate_tailored_resumetool demonstrates the agentic pattern: the chat model decides to call it based on user intent, passes structured inputs, and processes the result — a complete tool-use loop. stepCountIs(2)caps at 2 reasoning steps (tool call + response), preventing runaway loops while allowing the full tool-use cycle.
Decision: Enforce grounding through explicit rules (G1-G10) that require every fact to be attributed to exactly one company and one role, with few-shot examples showing correct vs. incorrect attribution.
Rationale:
- The most common and damaging error in AI-generated resumes is metric conflation: merging achievements from one company with scale metrics from another. Entity-scope binding (Rule G1) prevents this by requiring single-entity attribution.
- SCOPE BOUNDARY markers in the knowledge base provide hard constraints on what work was/was not performed in specific roles.
- Few-shot examples (in
resume-writer.few-shot.mdandcareer-chat.few-shot.md) demonstrate the expected grounding behavior more effectively than rules alone. - Post-generation validation in the pipeline (automated checks in
validateResumeOutput()) catches any remaining violations.
Platform integrations provide most observability. Additionally, this repo includes pipeline telemetry logging for generation runs.
What: Automatic tracking of every AI generation routed through the gateway.
Status: Not in use. The chat API uses the direct Anthropic SDK (@ai-sdk/anthropic) for reliability. Gateway support exists in route.ts but only activates when AI_GATEWAY_API_KEY is explicitly set. VERCEL_OIDC_TOKEN (auto-injected by Vercel) is intentionally ignored — using it without gateway configuration causes silent stream failures.
Future: Can be enabled by setting AI_GATEWAY_API_KEY in Vercel env vars once the Vercel AI Gateway is configured for this project.
What: All console.log and console.error output from serverless functions, including request duration and cold start metrics.
Where: Vercel Dashboard > Project > Logs tab. Filter by function (/api/chat), status code, or time range.
How: The chat API route logs errors with [chat] prefixes for easy filtering. Tool execution errors log with [tool:generate_tailored_resume].
What: Page views, unique visitors, top pages, referrers, geographic distribution.
Where: Vercel Dashboard > Project > Analytics tab.
How: Integrated via <Analytics /> component in app/layout.tsx.
What: Core Web Vitals (LCP, CLS, FID, TTFB, INP) per page.
Where: Vercel Dashboard > Project > Speed Insights tab.
How: Integrated via <SpeedInsights /> component in app/layout.tsx.
What: API usage, billing, rate limit status, spend caps. Where: console.anthropic.com > Usage tab. How: All direct Anthropic API calls are tracked by the platform. Set spend limits under Settings > Limits to prevent cost overruns.
What: Redis request counts, rate limit hits, memory usage.
Where: console.upstash.com > Database > Analytics tab.
How: Rate limiter uses @upstash/ratelimit with analytics: true for per-key tracking.
What: JSONL log of generation metadata (model, token usage, duration, estimated cost).
Where: data/generated/.telemetry.jsonl
How: Written by lib/ai/telemetry.ts during pipeline generation scripts.
| Control | Implementation | Location |
|---|---|---|
| Prompt caching | Ephemeral 5-min TTL; ~90% cost reduction on follow-up turns | route.ts (streamText) |
| Output token cap | Separate limits for chat vs. resume generation | route.ts (streamText) |
| Temperature tuning | Lower temperature for tools/resume (fewer retries) | route.ts (streamText) |
| Rate limiting | Sliding window per IP via Upstash Redis | route.ts (rate limiter) |
| Input size limits | Per-message char limit, message count cap, body size cap | route.ts (constants) |
| Model tiering | Sonnet for chat; Opus only for pipeline | route.ts, lib/config.ts |
| Anthropic spend limits | Configurable monthly cap at console.anthropic.com | Anthropic Console |
| Vercel spend limits | Configurable at Vercel Dashboard > Settings > Billing | Vercel Dashboard |
Exact token counts vary with career data size. Costs are based on Anthropic pricing for Sonnet 4.6 ($3/$15 per MTok) and Opus 4.6 ($15/$75 per MTok):
- Chat conversation (5 turns): First turn pays cache write cost; subsequent turns benefit from ~90% cache read discount. A typical 5-turn session costs well under $1.
- Resume generation via tool call: Sonnet generates a tailored resume from a JD. Output capped at 8,192 tokens.
- Pipeline resume generation (Opus): Offline, uses adaptive thinking at max effort. Cost per generation is ~$1-2 depending on thinking token usage.