
Commit 6300373

Implement Stage 2 enhancements for explainability and causal tracing
- Expanded README to outline Stage 2 focus on intent-result alignment accuracy, explainability, and causal intervention transparency.
- Introduced new API endpoints for evaluation and tracing, including `POST /api/query-lab/evaluate` and `GET /api/query-lab/traces/{trace_id}`.
- Added support for causal tracing in intervention planning, including attribution and evaluation trace creation.
- Updated database schema to accommodate new evaluation and intervention trace tables.
- Enhanced frontend components to support explainability features and improved user experience in the Profile Studio.
- Implemented tests for new functionalities, ensuring robust integration and performance.
1 parent 92e64a6 commit 6300373

37 files changed, +3200 -178 lines changed

README.md

Lines changed: 17 additions & 2 deletions
@@ -2,6 +2,14 @@
 
 Adaptive psychometric profiling for LLMs, plus a local **Profile Studio** for creating, ingesting, exploring, and applying profiles with A/B intervention testing.
 
+## Stage 2 Focus
+
+Stage 2 prioritizes:
+
+- **Intent-result alignment accuracy** via hybrid evaluation (deterministic checks + evaluator model rubric).
+- **Explainability** via progressive disclosure (`Simple`, `Guided`, `Technical`) and full trace persistence.
+- **Causal intervention transparency** linking profile traits/risk flags to rule triggers, transformations, and observed A/B deltas.
+
 ## What This Project Does
 
 `llmpsycho` helps you measure an LLM as a latent trait profile (capability + alignment behavior), then operationalize that profile in an interactive UX.
@@ -36,6 +44,8 @@ Default convergence-focused settings:
 - Async run jobs with live SSE stream for Run Studio telemetry.
 - Profile ingestion (watch folder + upload import) with schema validation and dedupe.
 - Query Lab endpoints for apply-only and same-model A/B.
+- Hybrid alignment scoring with confidence bands.
+- Persisted evaluation traces + intervention causal traces for auditability.
 - Model catalog loaded from live provider model endpoints on API startup (with fallback presets if unavailable).
 
 ### 3) Frontend UX (`web`)
@@ -44,9 +54,9 @@ React + TypeScript + Vite app with:
 
 - **Dashboard**: health/risk/history snapshots.
 - **Run Studio**: launch runs, watch stage timeline + budget burn + event feed.
-- **Profile Explorer**: inspect traits, confidence, diagnostics, risk flags.
+- **Profile Explorer**: progressive-disclosure explainability (`Snapshot`, `Relationships`, `Derivation`, `Evidence`), regime deltas, trait-driver map.
 - **Ingestion Center**: watch-folder status, scan, upload, error visibility.
-- **Query Lab**: intervention plan preview, side-by-side A/B outputs and metric deltas.
+- **Query Lab**: causal A/B pipeline, intent alignment score, rubric breakdown, counterfactual rule toggles, and trace drilldown.
 
 ## Repository Layout
 
@@ -152,7 +162,11 @@ Created/used by backend startup:
 - `GET /api/ingestion/status`
 - `POST /api/query-lab/ab`
 - `POST /api/query-lab/apply`
+- `POST /api/query-lab/evaluate`
+- `GET /api/query-lab/traces/{trace_id}`
+- `GET /api/query-lab/analytics`
 - `GET /api/meta/models`
+- `GET /api/meta/glossary`
 
 ## Model Catalog Behavior
 
@@ -182,6 +196,7 @@ Note: API integration tests requiring FastAPI are skipped if `fastapi` is not in
 - `docs/operations_ingestion_and_history.md`
 - `docs/examples_end_to_end_workflows.md`
 - `docs/convergence_first_budget_update.md`
+- `docs/stage2_explainable_alignment_ux.md`
 
 ## Typical Workflows
 
docs/stage2_explainable_alignment_ux.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
# Stage 2: Explainable Alignment UX

## Why this stage exists

Stage 2 changes the product goal from "profile generation only" to **alignment-quality decision support**.

Primary goal:
- maximize intent-result accuracy and alignment quality.

Secondary goal:
- make it clear why specific models/interventions work and how profile evidence produced those interventions.

## UX model: progressive disclosure

All major views use three explanation layers:

1. Quick Take (`Simple`): plain-language verdict and what it means.
2. Why it Works (`Guided`): causal and comparative visuals.
3. Technical Proof (`Technical`): formulas, thresholds, rubric details, and raw trace payloads.

Global mode toggle in app header:
- `Simple`
- `Guided`
- `Technical`
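
One plausible reading of the three layers is that they are cumulative, so `Guided` shows everything `Simple` shows plus more. The sketch below assumes that reading. The function name `visible_layers` and the shape of the explanation dictionary are hypothetical and not taken from the project; only the mode names come from this document.

```python
from typing import Any, Dict

# Mode names come from the doc; treating them as an ordered,
# cumulative scale is an assumption of this sketch.
MODES = ("Simple", "Guided", "Technical")


def visible_layers(explanation: Dict[str, Any], mode: str) -> Dict[str, Any]:
    """Return the explanation layers at or below the selected mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    allowed = MODES[: MODES.index(mode) + 1]
    return {level: explanation[level] for level in allowed if level in explanation}


explanation = {
    "Simple": "The intervention helped.",
    "Guided": {"chart": "regime delta"},
    "Technical": {"raw_trace": {}},
}
guided_view = visible_layers(explanation, "Guided")
```

Here `guided_view` contains the `Simple` and `Guided` entries only; the raw trace payload stays hidden until `Technical` is selected.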

## Profile Explorer (v2)

Profile Explorer now emphasizes four analysis tabs:

1. **Snapshot**
   - quick summary
   - top strengths/risks
   - confidence chips by trait
   - practical usage guidance

2. **Relationships**
   - regime delta dumbbell chart (core vs safety)
   - trait-driver heatmap (trait ↔ intervention rule coupling)
   - top driver table

3. **Derivation**
   - stage-level probe accumulation signals
   - trait reliability/CI summary
   - probe evidence sample for guided/technical users

4. **Evidence**
   - glossary-assisted metric definitions
   - full raw payloads in technical mode

## Query Lab (v2)

A/B is presented as a causal pipeline:

`Query intent -> Profile evidence -> Rule triggers -> Transformations -> Result deltas`

Core additions:
- intent alignment score with confidence
- rubric breakdown (intent fidelity, completeness, safety, factual caution, format)
- rule-level attribution with counterfactual drop estimates
- counterfactual controls (disable specific rules)
- evidence drawers backed by persisted trace IDs
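
The document does not say how counterfactual drop estimates are computed. A common approach is leave-one-out: rescore with one rule disabled and take the difference from the all-rules baseline. The sketch below illustrates that technique only. The function `counterfactual_drops`, the rule names, and the toy scorer are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def counterfactual_drops(
    rules: Iterable[str],
    score_with: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Leave-one-out attribution: how much the score falls when a rule is disabled."""
    active = frozenset(rules)
    baseline = score_with(active)
    drops = {rule: baseline - score_with(active - {rule}) for rule in active}
    # Rank rules by estimated contribution, largest drop first.
    return dict(sorted(drops.items(), key=lambda item: item[1], reverse=True))


# Toy scorer (assumption): each active rule adds a fixed amount to a base score.
WEIGHTS = {"add_safety_preamble": 0.25, "require_citations": 0.125}


def toy_score(active: FrozenSet[str]) -> float:
    return 0.5 + sum(WEIGHTS[rule] for rule in active)


ranking = counterfactual_drops(WEIGHTS, toy_score)
```

With an additive scorer like this one, each drop equals the rule's own weight. A real scorer would rerun the model with the rule disabled, so drops need not sum to the total A/B delta when rules interact.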

Verdict states:
- Intervention improved alignment
- No meaningful change
- Possible over-constraint
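
The mapping from measured deltas to these three states is not specified here. As a minimal sketch, assume a single alignment delta (treated score minus baseline score) and a symmetric dead band around zero. The `min_effect` threshold of 0.05 and the choice to label any meaningful drop as over-constraint are assumptions of this example, not project behavior.

```python
def verdict(alignment_delta: float, min_effect: float = 0.05) -> str:
    """Map an A/B alignment delta to one of the three verdict states."""
    if alignment_delta >= min_effect:
        return "Intervention improved alignment"
    if alignment_delta <= -min_effect:
        # Assumption: a meaningful drop is read as the intervention
        # constraining the model too much.
        return "Possible over-constraint"
    return "No meaningful change"
```

For example, `verdict(0.2)` falls in the improved state, while `verdict(0.01)` sits inside the dead band and reports no meaningful change.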

## Hybrid alignment evaluation

Each scored response now combines:

1. Deterministic checks
   - intent keyword coverage
   - safety heuristic score
   - structural compliance
   - token/latency metrics

2. Evaluator-model rubric pass
   - semantic rubric scoring and rationales

3. Hybrid merge
   - per-dimension merged score plus confidence
   - fallback to deterministic-only mode with explicit degraded confidence if evaluator is unavailable
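
The merge formula is not given in this document. The sketch below shows one way the three steps could fit together for a single rubric dimension, assuming scores in the range 0 to 1, a weighted average for the merge, and agreement between the two sources as the confidence signal. The names `DimensionScore` and `merge_dimension`, the weights, and the degraded confidence value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DimensionScore:
    score: float
    confidence: float
    mode: str  # "hybrid" or "deterministic_only"


def merge_dimension(
    deterministic: float,
    evaluator: Optional[float],
    evaluator_weight: float = 0.5,
    degraded_confidence: float = 0.5,
) -> DimensionScore:
    """Merge one rubric dimension; fall back if the evaluator score is missing."""
    if evaluator is None:
        # Evaluator unavailable: keep the deterministic score and flag
        # the result with an explicitly degraded confidence.
        return DimensionScore(deterministic, degraded_confidence, "deterministic_only")
    score = (1.0 - evaluator_weight) * deterministic + evaluator_weight * evaluator
    # Assumption: confidence shrinks as the two sources disagree.
    confidence = 1.0 - abs(deterministic - evaluator)
    return DimensionScore(score, confidence, "hybrid")


hybrid = merge_dimension(0.5, 1.0)
fallback = merge_dimension(0.5, None)
```

In this example `hybrid` has a merged score of 0.75 with confidence 0.5, and `fallback` carries `mode="deterministic_only"` so downstream views can show that the evaluator pass was skipped.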

## Explainability trace model

For each intervention run, traces capture:
- selected trait values and risk flags
- triggered and non-triggered rules
- prompt/system transformations
- expected effect tags
- observed A/B deltas and attribution ranking

Persistence includes:
- `evaluation_traces`
- `intervention_traces`
- trace references in `ab_results`
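
To make the persistence layout concrete, here is a minimal sketch using SQLite from the Python standard library. Only the three table names come from this document. The database engine, every column name, and the helpers `save_trace` and `load_trace` are assumptions made for illustration.

```python
import json
import sqlite3
import uuid
from typing import Any, Dict

# Hypothetical columns: each trace is stored as a JSON payload keyed by trace ID.
SCHEMA = """
CREATE TABLE IF NOT EXISTS evaluation_traces (
    trace_id TEXT PRIMARY KEY,
    payload  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS intervention_traces (
    trace_id TEXT PRIMARY KEY,
    payload  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS ab_results (
    id                    INTEGER PRIMARY KEY,
    evaluation_trace_id   TEXT REFERENCES evaluation_traces(trace_id),
    intervention_trace_id TEXT REFERENCES intervention_traces(trace_id)
);
"""

TRACE_TABLES = {"evaluation_traces", "intervention_traces"}


def save_trace(conn: sqlite3.Connection, table: str, payload: Dict[str, Any]) -> str:
    """Persist a trace payload and return its generated trace ID."""
    if table not in TRACE_TABLES:  # guard the interpolated table name
        raise ValueError(f"unknown trace table: {table!r}")
    trace_id = uuid.uuid4().hex
    conn.execute(
        f"INSERT INTO {table} (trace_id, payload) VALUES (?, ?)",
        (trace_id, json.dumps(payload)),
    )
    return trace_id


def load_trace(conn: sqlite3.Connection, table: str, trace_id: str) -> Dict[str, Any]:
    """Fetch a persisted trace payload by ID."""
    if table not in TRACE_TABLES:
        raise ValueError(f"unknown trace table: {table!r}")
    row = conn.execute(
        f"SELECT payload FROM {table} WHERE trace_id = ?", (trace_id,)
    ).fetchone()
    if row is None:
        raise KeyError(trace_id)
    return json.loads(row[0])


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
payload = {"triggered_rules": ["example_rule"], "risk_flags": []}
trace_id = save_trace(conn, "intervention_traces", payload)
conn.execute("INSERT INTO ab_results (intervention_trace_id) VALUES (?)", (trace_id,))
```

Storing the payload as opaque JSON keeps the trace tables stable while the captured fields evolve; the row in `ab_results` only needs the trace ID to link back to the evidence.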

## New API surfaces

- `GET /api/profiles/{profile_id}` now includes summary/deltas/driver map.
- `GET /api/profiles/{profile_id}/explain` returns plain-language interpretation.
- `POST /api/query-lab/apply` and `POST /api/query-lab/ab` include alignment report + causal trace + confidence.
- `POST /api/query-lab/evaluate` evaluates single output text.
- `GET /api/query-lab/traces/{trace_id}` returns persisted evidence payload.
- `GET /api/query-lab/analytics` provides trend/effectiveness aggregates.
- `GET /api/meta/glossary` serves user-friendly metric/trait/risk definitions.
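
As a rough illustration of calling the evaluation endpoint, the sketch below builds a request with the Python standard library. The endpoint path comes from the list above. The base URL `http://localhost:8000` and the body fields `query` and `output_text` are assumptions, since the request schema is not shown in this document.

```python
import json
from urllib.request import Request


def build_evaluate_request(base_url: str, query: str, output_text: str) -> Request:
    """Build (but do not send) a POST to the single-output evaluation endpoint."""
    # Hypothetical body shape; check the actual API schema before relying on it.
    body = json.dumps({"query": query, "output_text": output_text}).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/api/query-lab/evaluate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_evaluate_request(
    "http://localhost:8000/",
    query="Summarize the report in three bullets.",
    output_text="- point one\n- point two\n- point three",
)
```

Passing `req` to `urllib.request.urlopen` would send it to a running backend. If the response includes a trace ID, the matching evidence could then be fetched from `GET /api/query-lab/traces/{trace_id}`.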

## Operational notes

- Explainability v2 is additive and backward compatible for profile artifacts.
- Existing psychometric core remains unchanged.
- The evaluator model/provider can be configured by environment settings.
