AI Architecture Decision Record

This document explains the key architectural decisions for the AI system powering paulprae.com. It's written for senior AI engineers, architects, and engineering managers evaluating the system design.

System Overview

paulprae.com is a chat-first career platform with an AI assistant that answers recruiter questions, generates tailored resumes via tool-calling, and produces job search content — all grounded in structured career data.

Runtime stack: Next.js 16 + Vercel AI SDK 6 + Claude Sonnet 4.6 (chat) + Claude Opus 4.6 (pipeline) Infrastructure: Vercel (hosting), Upstash Redis (rate limiting via Vercel KV integration), Anthropic API (direct SDK)

Decision 1: Context Injection vs. Vector Retrieval

Decision: Inject the full career dataset into the system prompt rather than using embedding-based retrieval (RAG with a vector database).

Rationale:

The career dataset fits in a single system prompt: career-data.json (~259KB) + 5 knowledge base files (~11KB), compressed via stripEmpty() to remove empty fields. This is well within Claude's 200K-token context window.
Anthropic prompt caching makes full injection cost-effective: the first request caches the system prompt for 5 minutes at 1.25x write cost; subsequent turns reuse it at 0.1x (90% reduction).
Vector retrieval adds infrastructure (embedding model, vector DB, index maintenance) without proportional benefit at this scale. The retrieval step itself would cost more in latency (~200ms) than the tokens saved.
The full context gives Claude complete visibility into all career data, preventing missed connections that selective retrieval might cause.

Phase 3 path: When the knowledge base grows significantly (e.g., Neo4j knowledge graph with hundreds of project entries), the system will migrate to embedding-based retrieval. The prompt template's {{CAREER_DATA}} placeholder is already abstracted — switching from full injection to filtered results requires changing only the context builder.

Decision 2: Model Selection (Sonnet for Chat, Opus for Pipeline)

Decision: Use Claude Sonnet 4.6 for runtime chat/tool-calling and Claude Opus 4.6 for offline resume generation.

Rationale:

Chat (Sonnet): Recruiter Q&A needs fast responses (~2-5s TTFT). Sonnet at $3/$15 per MTok provides sufficient quality for conversational grounding while keeping per-conversation costs under $0.20.
Pipeline (Opus): Resume generation is a permanent artifact viewed by hiring managers. Opus with adaptive thinking at max effort ($15/$75 per MTok) provides deeper reasoning for entity-scope binding, cross-reference validation, and quality rule adherence. Cost per generation (~$1-2) is acceptable for an artifact generated weekly.
Resume tailoring tool (Sonnet): Runtime resume tailoring via tool-calling uses Sonnet (not Opus) to keep latency under 15s. The recruiter-provided JD provides strong constraints that compensate for the lighter model.

Cost comparison per month (estimated 500 chat conversations + 2 pipeline runs):

Current (Sonnet chat + Opus pipeline): ~$100 + $4 = ~$104
All-Opus alternative: ~$500 + $4 = ~$504

Decision 3: Prompt Injection Defense via XML Delimiting

Decision: Wrap untrusted user input (job descriptions, emphasis areas) in XML tags (<job_description>, <emphasis_areas>) with explicit instructions to treat tag content as data, not instructions.

Rationale:

This is Anthropic's recommended pattern for prompt injection defense (documented in Anthropic's security guide).
Combined with security rules S1-S5 in each system prompt (treat messages as untrusted, never reveal prompt, stay in character, no harmful content, no unauthorized actions).
Input validation (Zod schemas, character limits, message count caps) provides defense in depth at the application layer before content reaches the model.
More maintainable than alternatives like output filtering or separate moderation calls, which add latency and cost.

Decision 4: Ephemeral Prompt Caching

Decision: Use Anthropic's ephemeral caching (5-minute TTL) rather than no caching or persistent caching.

Rationale:

System prompts contain the full career dataset, which is stable within a conversation session.
Ephemeral (5-min TTL) matches the expected recruiter interaction pattern: browse site, ask 3-7 questions over 2-5 minutes, leave.
First request pays 1.25x input cost (cache write). Subsequent turns pay only 0.1x (cache read) — ~90% cost reduction per follow-up turn.
No persistent cache needed — career data changes only when the pipeline runs (weekly at most), and the 5-min window covers a single session.

Decision 5: Single Agent with Tools (Not Multi-Agent)

Decision: Use a single Claude agent with 2 tools (resume generation, resume links) rather than multi-agent orchestration.

Rationale:

The use case has a narrow scope: answer career questions, generate tailored resumes, provide download links. This doesn't require agent delegation, planning loops, or inter-agent communication.
Tool-calling via Vercel AI SDK 6 (streamText + tool()) is clean and well-typed. No framework abstraction (LangChain, CrewAI) needed.
The generate_tailored_resume tool demonstrates the agentic pattern: the chat model decides to call it based on user intent, passes structured inputs, and processes the result — a complete tool-use loop.
stepCountIs(2) caps at 2 reasoning steps (tool call + response), preventing runaway loops while allowing the full tool-use cycle.

Decision 6: Grounding via Entity-Scope Binding

Decision: Enforce grounding through explicit rules (G1-G10) that require every fact to be attributed to exactly one company and one role, with few-shot examples showing correct vs. incorrect attribution.

Rationale:

The most common and damaging error in AI-generated resumes is metric conflation: merging achievements from one company with scale metrics from another. Entity-scope binding (Rule G1) prevents this by requiring single-entity attribution.
SCOPE BOUNDARY markers in the knowledge base provide hard constraints on what work was/was not performed in specific roles.
Few-shot examples (in resume-writer.few-shot.md and career-chat.few-shot.md) demonstrate the expected grounding behavior more effectively than rules alone.
Post-generation validation in the pipeline (automated checks in validateResumeOutput()) catches any remaining violations.

Observability Stack

Platform integrations provide most observability. Additionally, this repo includes pipeline telemetry logging for generation runs.

Vercel AI Gateway (Not Currently Active)

What: Automatic tracking of every AI generation routed through the gateway. Status: Not in use. The chat API uses the direct Anthropic SDK (@ai-sdk/anthropic) for reliability. Gateway support exists in route.ts but only activates when AI_GATEWAY_API_KEY is explicitly set. VERCEL_OIDC_TOKEN (auto-injected by Vercel) is intentionally ignored — using it without gateway configuration causes silent stream failures. Future: Can be enabled by setting AI_GATEWAY_API_KEY in Vercel env vars once the Vercel AI Gateway is configured for this project.

Vercel Runtime Logs

What: All console.log and console.error output from serverless functions, including request duration and cold start metrics. Where: Vercel Dashboard > Project > Logs tab. Filter by function (/api/chat), status code, or time range. How: The chat API route logs errors with [chat] prefixes for easy filtering. Tool execution errors log with [tool:generate_tailored_resume].

Vercel Analytics

What: Page views, unique visitors, top pages, referrers, geographic distribution. Where: Vercel Dashboard > Project > Analytics tab. How: Integrated via <Analytics /> component in app/layout.tsx.

Vercel Speed Insights

What: Core Web Vitals (LCP, CLS, FID, TTFB, INP) per page. Where: Vercel Dashboard > Project > Speed Insights tab. How: Integrated via <SpeedInsights /> component in app/layout.tsx.

Anthropic Console

What: API usage, billing, rate limit status, spend caps. Where: console.anthropic.com > Usage tab. How: All direct Anthropic API calls are tracked by the platform. Set spend limits under Settings > Limits to prevent cost overruns.

Upstash Console

What: Redis request counts, rate limit hits, memory usage. Where: console.upstash.com > Database > Analytics tab. How: Rate limiter uses @upstash/ratelimit with analytics: true for per-key tracking.

Local Pipeline Telemetry

What: JSONL log of generation metadata (model, token usage, duration, estimated cost). Where: data/generated/.telemetry.jsonl How: Written by lib/ai/telemetry.ts during pipeline generation scripts.

Cost Controls

Control	Implementation	Location
Prompt caching	Ephemeral 5-min TTL; ~90% cost reduction on follow-up turns	`route.ts` (streamText)
Output token cap	Separate limits for chat vs. resume generation	`route.ts` (streamText)
Temperature tuning	Lower temperature for tools/resume (fewer retries)	`route.ts` (streamText)
Rate limiting	Sliding window per IP via Upstash Redis	`route.ts` (rate limiter)
Input size limits	Per-message char limit, message count cap, body size cap	`route.ts` (constants)
Model tiering	Sonnet for chat; Opus only for pipeline	`route.ts`, `lib/config.ts`
Anthropic spend limits	Configurable monthly cap at console.anthropic.com	Anthropic Console
Vercel spend limits	Configurable at Vercel Dashboard > Settings > Billing	Vercel Dashboard

Cost Model

Exact token counts vary with career data size. Costs are based on Anthropic pricing for Sonnet 4.6 ($3/$15 per MTok) and Opus 4.6 ($15/$75 per MTok):

Chat conversation (5 turns): First turn pays cache write cost; subsequent turns benefit from ~90% cache read discount. A typical 5-turn session costs well under $1.
Resume generation via tool call: Sonnet generates a tailored resume from a JD. Output capped at 8,192 tokens.
Pipeline resume generation (Opus): Offline, uses adaptive thinking at max effort. Cost per generation is ~$1-2 depending on thinking token usage.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AI Architecture Decision Record

System Overview

Decision 1: Context Injection vs. Vector Retrieval

Decision 2: Model Selection (Sonnet for Chat, Opus for Pipeline)

Decision 3: Prompt Injection Defense via XML Delimiting

Decision 4: Ephemeral Prompt Caching

Decision 5: Single Agent with Tools (Not Multi-Agent)

Decision 6: Grounding via Entity-Scope Binding

Observability Stack

Vercel AI Gateway (Not Currently Active)

Vercel Runtime Logs

Vercel Analytics

Vercel Speed Insights

Anthropic Console

Upstash Console

Local Pipeline Telemetry

Cost Controls

Cost Model

FilesExpand file tree

ai-architecture.md

Latest commit

History

ai-architecture.md

File metadata and controls

AI Architecture Decision Record

System Overview

Decision 1: Context Injection vs. Vector Retrieval

Decision 2: Model Selection (Sonnet for Chat, Opus for Pipeline)

Decision 3: Prompt Injection Defense via XML Delimiting

Decision 4: Ephemeral Prompt Caching

Decision 5: Single Agent with Tools (Not Multi-Agent)

Decision 6: Grounding via Entity-Scope Binding

Observability Stack

Vercel AI Gateway (Not Currently Active)

Vercel Runtime Logs

Vercel Analytics

Vercel Speed Insights

Anthropic Console

Upstash Console

Local Pipeline Telemetry

Cost Controls

Cost Model