Draft
130 commits
2d2bbc4
test(junior): Tighten integration test boundaries
dcramer Jun 4, 2026
74d0ef4
test(junior): Split Slack turn behavior suites
dcramer Jun 4, 2026
695ca25
test(junior): Split subscribed Slack behavior tests
dcramer Jun 4, 2026
47a6a1e
test(junior): Split Slack image behavior suites
dcramer Jun 4, 2026
106e3e5
test(junior): Split heartbeat integration contracts
dcramer Jun 4, 2026
7eb6f0e
test(junior): Split conversation work component suites
dcramer Jun 4, 2026
ac60e7e
test(junior): Split plugin package registry tests
dcramer Jun 4, 2026
d077cdc
test(junior): Split sandbox egress proxy suites
dcramer Jun 4, 2026
0377f37
docs(testing): Record testing architecture review
dcramer Jun 4, 2026
57ed063
test(junior): Extract lazy sandbox test contracts
dcramer Jun 4, 2026
97042b7
test(junior): Extract sandbox executor fixture
dcramer Jun 4, 2026
d0a3c4a
test(junior): Split sandbox executor snapshots
dcramer Jun 4, 2026
47dd29b
test(junior): Split sandbox executor bash tests
dcramer Jun 4, 2026
960d26d
test(junior): Split sandbox executor tool tests
dcramer Jun 5, 2026
7da6ce9
test(junior): Extract respond runtime fixture
dcramer Jun 5, 2026
fa0bc7a
test(junior): Extract MCP respond harness
dcramer Jun 5, 2026
6738769
test(junior): Split MCP respond scenarios
dcramer Jun 5, 2026
6306560
test(junior): Split CLI check suites
dcramer Jun 5, 2026
5a4e41a
test(junior): Split subscribed routing suites
dcramer Jun 5, 2026
3216ffa
test(junior): Split turn session record suites
dcramer Jun 5, 2026
a04ab94
test(junior): Split Slack schedule tool suites
dcramer Jun 5, 2026
16d3ce6
test(junior): Split MCP OAuth callback suites
dcramer Jun 5, 2026
ee99299
test(junior): Split MCP auth runtime suites
dcramer Jun 5, 2026
7e6850d
test(junior): Split OAuth callback Slack suites
dcramer Jun 5, 2026
452f1fe
test(junior): Move timeout resume runner tests
dcramer Jun 5, 2026
fbda1c9
test(junior): Split runtime dependency snapshot suites
dcramer Jun 5, 2026
d3c0508
test(junior): Split Slack turn resume suites
dcramer Jun 5, 2026
003730a
test(junior): Rework OAuth callback route tests
dcramer Jun 5, 2026
e0a971e
test(junior): Rework MCP OAuth callback route tests
dcramer Jun 5, 2026
b9d0335
test(junior): Split OAuth resume Slack suites
dcramer Jun 5, 2026
8c7f373
test(junior): Move respond runtime orchestration tests
dcramer Jun 5, 2026
463bc69
test(junior): Move lazy sandbox respond coverage
dcramer Jun 5, 2026
6a75fd2
test(junior): Move respond startup errors
dcramer Jun 5, 2026
6dabac6
test(junior): Remove respond runtime mock fixture
dcramer Jun 5, 2026
4b3342f
test(junior): Move MCP respond tests to component ports
dcramer Jun 5, 2026
ab652a4
test(junior): Move sandbox executor coverage to component
dcramer Jun 5, 2026
cac91e5
test(junior): Group Slack resume integration suites
dcramer Jun 5, 2026
72b92be
test(junior): Organize Slack tool integration suites
dcramer Jun 5, 2026
6782640
test(junior): Organize OAuth callback integration suites
dcramer Jun 5, 2026
0a0603d
test(junior): Split MCP OAuth resume lock coverage
dcramer Jun 5, 2026
ecd3b09
docs(testing): Record cleanup completion
dcramer Jun 5, 2026
263448c
test(junior): Split Slack message content suites
dcramer Jun 5, 2026
d4db766
test(junior): Use App Home builder deps
dcramer Jun 5, 2026
d98b60e
test(junior): Use plugin auth orchestration deps
dcramer Jun 5, 2026
ac09878
test(junior): Use MCP auth orchestration deps
dcramer Jun 5, 2026
29fe192
test(junior): Drop turn-session log assertion
dcramer Jun 5, 2026
eb3ecc8
test(junior): Dedupe tool error handler coverage
dcramer Jun 5, 2026
a5df0a0
test(junior): Use real tool error handling in agent tools
dcramer Jun 5, 2026
d5f37a3
test(junior): Move Slack emoji rules to unit coverage
dcramer Jun 5, 2026
4455612
test(junior): Trim duplicate reaction alias coverage
dcramer Jun 5, 2026
07c0eff
test(junior): Use snapshot warmup CLI deps
dcramer Jun 5, 2026
2a4a054
test(junior): Move snapshot tests to component layer
dcramer Jun 5, 2026
3c55118
test(junior): Trim duplicate sandbox data path case
dcramer Jun 5, 2026
dacb3cf
test(junior): Use turn session record services
dcramer Jun 5, 2026
66f90a0
test(junior): Use capability factory deps
dcramer Jun 5, 2026
65382d3
test(junior): Use real plugin package discovery
dcramer Jun 5, 2026
28a04ed
test(junior): Use real skill plugin discovery
dcramer Jun 5, 2026
8ac425f
test(junior): Use snapshot resolver services
dcramer Jun 5, 2026
d290847
test(junior): Use config defaults services
dcramer Jun 5, 2026
8bdad9a
test(junior): Use sandbox executor services
dcramer Jun 5, 2026
bd76c30
test(junior): Use sandbox egress services
dcramer Jun 5, 2026
0566649
test(junior): Use respond MCP services
dcramer Jun 5, 2026
809be0b
test(junior): Move Slack resume tests to component
dcramer Jun 5, 2026
03e4404
test(junior): Use MCP OAuth services
dcramer Jun 5, 2026
e012ab6
test(junior): Use web fetch services
dcramer Jun 5, 2026
96cbf4d
test(junior): Use image generation deps
dcramer Jun 5, 2026
5a0bb0d
test(junior): Use tool error services
dcramer Jun 5, 2026
24f090d
test(junior): Inject OAuth callback handlers
dcramer Jun 5, 2026
4004abf
test(junior): Use MCP client factory
dcramer Jun 5, 2026
53f34b8
test(junior): Use Slack outbound boundary
dcramer Jun 5, 2026
66e6c0c
test(junior): Organize unit test tree
dcramer Jun 5, 2026
ff0269a
test(evals): Inject harness runtime factory
dcramer Jun 5, 2026
ba6ab25
test(junior): Organize root unit tests
dcramer Jun 5, 2026
8bd7e6c
test(junior): Move traced stream test under pi
dcramer Jun 5, 2026
83bde2d
docs(testing): Remove review diary
dcramer Jun 5, 2026
5c40960
test(junior): Trim subscribed classifier cases
dcramer Jun 5, 2026
a2e3d53
test(junior): Dedupe agent auth tool cases
dcramer Jun 5, 2026
ca4ece5
test(junior): Thin duplicated test scaffolding
dcramer Jun 5, 2026
dc150c5
test(junior): Move plugin set checks to owner tests
dcramer Jun 5, 2026
2cea2f7
test(junior): Trim duplicated status and sandbox cases
dcramer Jun 5, 2026
6373d77
test(junior): Tighten shared test fixtures
dcramer Jun 5, 2026
f0ef132
test(junior): Share turn session message fixtures
dcramer Jun 5, 2026
c2e9791
test(junior): Tighten turn result status fixtures
dcramer Jun 5, 2026
fcc85ce
test(junior): Centralize skill test lifecycle fixtures
dcramer Jun 5, 2026
304a927
test(junior): Assert Slack state instead of prompt prose
dcramer Jun 5, 2026
257d6d7
test(junior): Assert Slack state over prompt probes
dcramer Jun 5, 2026
30be0cf
test(junior): Thin thinking level router tests
dcramer Jun 5, 2026
bb1571f
test(junior): Inject sandbox adapter services
dcramer Jun 5, 2026
60d2116
test(junior-evals): Score thinking level routing
dcramer Jun 5, 2026
eddc605
docs(evals): Capture generation fixture boundaries
dcramer Jun 5, 2026
45c9a31
fix(junior-evals): Align harness with eval types
dcramer Jun 5, 2026
83b5035
test(junior): Thin duplicate Slack timing coverage
dcramer Jun 5, 2026
0309eb0
test(junior): Dedupe auth pause assertions
dcramer Jun 5, 2026
5828c7f
test(junior): Drop MCP auth call counters
dcramer Jun 5, 2026
8d8f478
test(junior): Harden testing boundary seams
dcramer Jun 5, 2026
38d64d7
test(junior): Add shared Vitest fixtures
dcramer Jun 5, 2026
2ff7d6c
docs(testing): Tighten mock and telemetry policy
dcramer Jun 5, 2026
718e6b6
test(junior): Remove feature-level telemetry assertions
dcramer Jun 5, 2026
943d002
test(junior): Harden test boundary cleanup
dcramer Jun 5, 2026
76a61df
test(junior): Share direct tool test fixtures
dcramer Jun 5, 2026
7b25e7e
test(junior): Add shared test clock helpers
dcramer Jun 5, 2026
93bb141
test(junior): Centralize fake clock setup
dcramer Jun 5, 2026
5470141
test(junior): Freeze schedule tool fixture clock
dcramer Jun 5, 2026
f0036e1
test(junior): Use deterministic fixture expiries
dcramer Jun 5, 2026
8f34a4d
test(junior): Type plugin auth token store fixture
dcramer Jun 5, 2026
52c3099
test(junior): Type agent tool test fixtures
dcramer Jun 5, 2026
96815de
test(junior): Use real thread context messages
dcramer Jun 5, 2026
c31db71
test(evals): Cover low thinking routing
dcramer Jun 5, 2026
5351589
test(junior): Merge load skill tool tests
dcramer Jun 5, 2026
153e20f
test(junior): Tighten MCP call tool fixtures
dcramer Jun 5, 2026
7ed9d05
test(junior): Tighten web search unit fixtures
dcramer Jun 5, 2026
01857f2
test(junior): Type image generation fixtures
dcramer Jun 5, 2026
df7bed8
test(junior): Reapply cleanup after rebase
dcramer Jun 6, 2026
3e213f6
ref(test): Remove trivial DI from testing seams
dcramer Jun 6, 2026
7d7c854
test(runtime): Trim brittle component test seams
dcramer Jun 6, 2026
2a0620c
test(junior): Prune low-signal behavior checks
dcramer Jun 6, 2026
9955676
test(junior): Finish test-suite cleanup pass
dcramer Jun 6, 2026
a62c15c
ref(test): Flatten Slack runtime test adapters
dcramer Jun 6, 2026
8c8ea68
ref(test): Remove test-only dependency seams
dcramer Jun 6, 2026
6ebd068
test(evals): Cover unavailable image analysis
dcramer Jun 6, 2026
486e7bc
fix(evals): Use runtime adapter overrides
dcramer Jun 6, 2026
fce7970
test(junior): Reconcile runtime fixtures after rebase
dcramer Jun 8, 2026
808336f
test(junior): Fix heartbeat coverage run expectations
dcramer Jun 8, 2026
1390b44
test(junior): Move ingress coverage to integration tests
dcramer Jun 8, 2026
3459432
test(junior): Reconcile testing cleanup after rebase
dcramer Jun 12, 2026
b96460f
fix(evals): Align chat peer dependencies
dcramer Jun 13, 2026
b66f317
ref(test): Tighten test fixture boundaries
dcramer Jun 13, 2026
32c5983
ci: Restore frozen install and coverage timeouts
dcramer Jun 13, 2026
cf672d7
test(junior): Centralize ordinary Vitest timeouts
dcramer Jun 13, 2026
3d69c31
test(junior): Align MCP auth fixture after rebase
dcramer Jun 17, 2026
2 changes: 1 addition & 1 deletion package.json
@@ -27,7 +27,7 @@
"test:watch": "pnpm --filter @sentry/junior test:watch",
"evals": "pnpm --filter @sentry/junior-evals evals",
"evals:record": "pnpm --filter @sentry/junior-evals evals:record",
-"typecheck": "pnpm --filter @sentry/junior-plugin-api typecheck && pnpm --filter @sentry/junior-scheduler typecheck && pnpm --filter @sentry/junior typecheck && pnpm --filter @sentry/junior-dashboard typecheck && pnpm --filter @sentry/junior-testing typecheck && pnpm --filter @sentry/junior-example typecheck",
+"typecheck": "pnpm --filter @sentry/junior-plugin-api typecheck && pnpm --filter @sentry/junior-scheduler typecheck && pnpm --filter @sentry/junior typecheck && pnpm --filter @sentry/junior-evals typecheck && pnpm --filter @sentry/junior-dashboard typecheck && pnpm --filter @sentry/junior-testing typecheck && pnpm --filter @sentry/junior-example typecheck",
"skills:check": "pnpm --filter @sentry/junior skills:check",
"test:ci": "pnpm --filter @sentry/junior build && pnpm --filter @sentry/junior-dashboard build && pnpm --filter @sentry/junior test:coverage && pnpm --filter @sentry/junior-dashboard test:coverage"
},
33 changes: 20 additions & 13 deletions packages/junior-evals/README.md
@@ -23,7 +23,7 @@ Quick mapping:
- `evals/*`: Integration-style coverage for conversation-level agent behavior and quality scoring through the runtime harness.
- `tests/unit/*` (or non-integration tests): isolated logic/invariant tests.

-This separation is enforced by `pnpm --filter @sentry/junior run test:slack-boundary`.
+This separation is enforced by `pnpm --filter @sentry/junior run test:boundaries`.

## What Is In Scope

@@ -59,22 +59,28 @@ For each `it()` case inside a `describeEval()` suite:
2. Create a fresh runtime instance for the case via the chat composition root; do not mutate the production singleton runtime.
3. Route message events through real ingress + queue-worker behavior, with only the external queue transport replaced by an in-memory harness shim.
4. Return observed artifacts as JSON for LLM judgment, including structured `assistant_posts` with text plus actual attached-file metadata, and Slack-visible metadata.
+The output also includes compact `turn_diagnostics` so evals can assert user-facing runtime metadata such as selected thinking level without scraping logs.
The helper pretty-prints this JSON so failure output stays readable in local runs and CI.
5. `vitest-evals` scores the output against `criteria` (A–E → 1.0–0.0).

Harness override knobs (in `EvalOverrides`):

-- `auto_complete_mcp_oauth`: after our app genuinely starts an MCP OAuth flow for the listed providers, the harness immediately completes the fake provider callback.
-- `auto_complete_oauth`: after our app genuinely starts a generic OAuth flow for the listed providers, the harness immediately completes the fake provider callback.
-- `credential_providers`: seed normal provider credentials for the listed providers. GitHub uses dummy GitHub App env vars plus an intercepted installation-token exchange; Sentry uses the normal OAuth token store.
-- `fail_reply_call`: force a non-retryable reply failure on a specific call.
-- `mock_image_generation`: stub the image-generation HTTP response with a valid image payload while still exercising the real attachment path.
-- `plugin_dirs`: load plugin fixtures from eval-local directories without adding workspace packages.
-- `reply_texts`: override returned reply text per call.
-- `reply_timeout_ms`: lower or set the per-reply harness timeout for a specific scenario. It cannot exceed 30 seconds.
-- `subscribed_decisions`: controls the subscribed-message reply gate in the harness. If you use it, do not claim that reply-selection behavior is being validated by the eval itself.
-
-These knobs work by overriding services on the eval-local runtime instance. They must not reintroduce mutable global runtime behavior seams.
+- `auth.autoCompleteMcpOAuth`: after our app genuinely starts an MCP OAuth flow for the listed providers, the harness immediately completes the fake provider callback.
+- `auth.autoCompleteOAuth`: after our app genuinely starts a generic OAuth flow for the listed providers, the harness immediately completes the fake provider callback.
+- `auth.credentialProviders`: seed normal provider credentials for the listed providers. GitHub uses dummy GitHub App env vars plus an intercepted installation-token exchange; Sentry uses the normal OAuth token store.
+- `plugins.pluginDirs`: load plugin fixtures from eval-local directories without adding workspace packages.
+- `plugins.pluginPackages`: load named workspace plugin packages for plugin-specific behavior evals.
+- `plugins.skillDirs`: load skill fixture directories into the real reply-generation path.
+- `replyGeneration.cannedResults`: return structured reply results for downstream delivery or resilience scenarios.
+- `replyGeneration.cannedTexts`: return reply text per successful call for downstream delivery scenarios.
+- `replyGeneration.failCall`: force a non-retryable reply failure on a specific call.
+- `replyGeneration.mockImageGeneration`: stub the image-generation HTTP response with a valid image payload while still exercising the real attachment path.
+- `replyGeneration.timeoutMs`: lower or set the per-reply harness timeout for a specific scenario. It cannot exceed 30 seconds.
+- `replyGeneration.unsetGatewayCredentials`: remove gateway credentials for the duration of real reply generation when the scenario explicitly covers missing credential behavior.
+- `subscribedReplyDecisions`: controls the subscribed-message reply gate in the harness. If you use it, do not claim that reply-selection behavior is being validated by the eval itself.
+
+These knobs configure role-named scenario adapters on the eval-local runtime instance. They must not reintroduce mutable global runtime behavior seams or nested production service override bags.
+`replyGeneration.cannedTexts` and `replyGeneration.cannedResults` bypass real reply generation, so use them only for downstream delivery behavior, not prompt, model-routing, or thinking-level coverage.

Tool replay:

@@ -106,7 +112,7 @@ Evals require real Vercel Sandbox access. If sandbox bootstrap fails, the eval f

- Add core cases under `evals/core/*.eval.ts` and plugin-specific cases under `evals/<plugin>/` using `describeEval()` with `slackEvals`.
- Use event builders (`mention`, `threadMessage`, `threadStart`) from `evals/helpers.ts`.
-- Use `auto_complete_mcp_oauth` or `auto_complete_oauth` when the harness should instantly complete the fake provider callback after our app has genuinely initiated auth.
+- Use `auth.autoCompleteMcpOAuth` or `auth.autoCompleteOAuth` when the harness should instantly complete the fake provider callback after our app has genuinely initiated auth.
- For multi-turn, pass the same `thread` override so events land in one thread.
- Keep each case focused on one primary behavior.
- Encode all expectations in `criteria`; do not add deterministic inline assertions.
@@ -127,6 +133,7 @@ Do not do these in eval files:

- Do not import `@/chat/slack/*` directly.
- Do not use MSW Slack helpers (`queueSlackApiResponse`, `getCapturedSlackApiCalls`, `queueSlackApiError`, `queueSlackRateLimit`).
+- Do not import raw Slack capture wrappers. Use eval artifact helpers that expose Slack-visible posts, reactions, canvases, or files instead.
- Do not validate raw Slack Web API request payload shapes from evals.
- Do not validate implementation internals (exact tool names, sandbox IDs, or other non-user-visible details) unless the scenario explicitly evaluates those surfaces.
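
Reviewer sketch (not part of the diff): the renamed override knobs can be pictured as one nested object grouped by role. The interface below is a hypothetical shape inferred only from the knob names in the README hunk above. The value types (`string[]`, `number`, `boolean`, `unknown[]`) and the `checkOverrides` helper are assumptions for illustration, and the real `EvalOverrides` type in the harness may differ. The two rules it checks come from the README text: `replyGeneration.timeoutMs` cannot exceed 30 seconds, and canned replies bypass real reply generation.

```typescript
// Hypothetical shape inferred from the README knob names; not the harness's actual type.
interface EvalOverridesSketch {
  auth?: {
    autoCompleteMcpOAuth?: string[]; // assumed: list of provider names
    autoCompleteOAuth?: string[]; // assumed: list of provider names
    credentialProviders?: string[]; // assumed: list of provider names
  };
  plugins?: {
    pluginDirs?: string[];
    pluginPackages?: string[];
    skillDirs?: string[];
  };
  replyGeneration?: {
    cannedResults?: unknown[]; // assumed: structure not shown in the README
    cannedTexts?: string[];
    failCall?: number; // assumed: index of the call to fail
    mockImageGeneration?: boolean;
    timeoutMs?: number;
    unsetGatewayCredentials?: boolean;
  };
  subscribedReplyDecisions?: unknown[]; // assumed: structure not shown in the README
}

// README rule: the per-reply harness timeout cannot exceed 30 seconds.
const MAX_REPLY_TIMEOUT_MS = 30_000;

// Hypothetical helper: throws on an out-of-range timeout and returns
// reminders about what a scenario can no longer claim to validate.
function checkOverrides(overrides: EvalOverridesSketch): string[] {
  const notes: string[] = [];
  const reply = overrides.replyGeneration;

  if (reply?.timeoutMs !== undefined && reply.timeoutMs > MAX_REPLY_TIMEOUT_MS) {
    throw new RangeError(
      `replyGeneration.timeoutMs must be <= ${MAX_REPLY_TIMEOUT_MS}, got ${reply.timeoutMs}`,
    );
  }

  const usesCannedReplies =
    (reply?.cannedTexts?.length ?? 0) > 0 || (reply?.cannedResults?.length ?? 0) > 0;
  if (usesCannedReplies) {
    notes.push(
      "Canned replies bypass real reply generation: cover downstream delivery only.",
    );
  }

  if (overrides.subscribedReplyDecisions !== undefined) {
    notes.push(
      "subscribedReplyDecisions controls the reply gate: do not claim reply-selection coverage.",
    );
  }

  return notes;
}

// Example: a delivery-focused scenario with a canned reply and a lowered timeout.
const notes = checkOverrides({
  auth: { autoCompleteOAuth: ["sentry"] },
  replyGeneration: { cannedTexts: ["Done."], timeoutMs: 10_000 },
});
console.log(notes);
```

Grouping the knobs under `auth`, `plugins`, and `replyGeneration` mirrors the README's "role-named scenario adapters" wording; a flat bag of snake_case keys, as in the removed lines, gives no such hint about which runtime role a knob touches.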
