Prototype Starlark script middleware for vMCP#4714

Draft
jerm-dro wants to merge 1 commit into main from jerm-dro/script-middleware-prototype

Contributor

@jerm-dro jerm-dro commented Apr 9, 2026

Note

This is a prototype / proof-of-concept. It validates the Starlark execution model described in RFC THV-0060 without committing to the full session initialization scope. Not intended for merge as-is — this is a starting point for discussion and iteration.

Summary

  • Agents today must make sequential tool calls with model inference between each one. For workflows that touch multiple services (e.g., incident triage across PagerDuty, Datadog, Slack, Jira, GitHub, Confluence), this means 10+ round-trips and significant token spend just to gather context.
  • This PR adds an execute_tool_script virtual tool to vMCP that accepts a Starlark script. The script can call any authorized MCP tool as a function, use loops and conditionals to cross-reference results, fan out calls with parallel(), and return a single aggregated result — all server-side in one tool call.
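The fan-out behaviour described for parallel() can be pictured with a minimal Go sketch. It is not the code in pkg/script/bridge.go: the `toolCall` type and this `parallel` function are hypothetical stand-ins, and the sketch assumes results come back in input order, with the first error by index reported once every call has finished.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// toolCall is a hypothetical stand-in for one bridged MCP tool invocation.
type toolCall func() (string, error)

// parallel runs every call in its own goroutine and returns the results in
// input order. If any call failed, the error with the lowest index is
// returned after all calls have completed.
func parallel(calls []toolCall) ([]string, error) {
	results := make([]string, len(calls))
	errs := make([]error, len(calls))
	var wg sync.WaitGroup
	for i, call := range calls {
		wg.Add(1)
		go func(i int, call toolCall) {
			defer wg.Done()
			// Each goroutine writes only to its own slot, so no lock is needed.
			results[i], errs[i] = call()
		}(i, call)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return results, nil
}

func main() {
	services := []string{"pagerduty", "datadog", "slack"}
	calls := make([]toolCall, len(services))
	for i, svc := range services {
		svc := svc // capture the loop variable for the closure
		calls[i] = func() (string, error) { return svc + ":ok", nil }
	}
	results, err := parallel(calls)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(strings.Join(results, ", "))
}
```

Writing into pre-sized slices by index keeps the aggregated result deterministic even though the calls finish in arbitrary order.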

What's included

  • pkg/script/ — Starlark execution engine, MCP tool bridge with type conversion, parallel() builtin for concurrent fan-out, HTTP middleware for request interception and tools/list injection
  • vMCP integration — wired above authz in the middleware chain so scripts only see/call authorized tools
  • Unit + acceptance tests — 24 tests covering engine, bridge, result parsing, middleware, and a full-stack integration test with the motivating use case
  • K8s e2e test — Ginkgo test deploying yardstick + VirtualMCPServer and executing scripts through the real proxy
  • Demo environment — Kind cluster manifests with 8 enterprise dummy MCP servers (Slack, Jira, GitHub, PagerDuty, Datadog, Confluence, Google Drive, Linear) and /incident-triage skill for interactive demos

Type of change

  • New feature

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Deployed to a local Kind cluster with 8 dummy MCP servers. Connected via thv run as a remote MCP server. Verified execute_tool_script appears in tools/list with dynamic description, executed scripts with loops over degraded services, parallel() fan-out, and string parsing. Compared /incident-triage (scripted) vs /incident-triage-lame (sequential) side-by-side.

Changes

| File | Change |
|------|--------|
| pkg/script/engine.go | Starlark execution engine: wraps scripts for top-level return, step limits, print capture |
| pkg/script/bridge.go | Tool bridge: converts MCP tools to Starlark callables, type conversion, parallel() builtin, result parsing with SDK wrapper unwrapping |
| pkg/script/middleware.go | HTTP middleware: intercepts execute_tool_script, injects into tools/list with dynamic description, innerToolCaller for backend dispatch |
| pkg/script/*_test.go | 24 unit + acceptance tests |
| pkg/vmcp/server/server.go | ScriptMiddleware config field, applied above authz in Handler() |
| cmd/vmcp/app/commands.go | Wire script middleware into vMCP server config |
| test/e2e/.../virtualmcp_script_test.go | K8s acceptance test with yardstick backend |
| demo/script-middleware/ | Kind cluster deploy script + 8 dummy MCP server manifests |
| .claude/skills/incident-triage/ | Skill that steers agents toward execute_tool_script with parallel() |
| .claude/skills/incident-triage-lame/ | Comparison skill: same task, sequential tool calls only |
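Starlark does not permit `return` outside a function, so "wraps scripts for top-level return" presumably means the engine places the user's script inside a function body before executing it. A minimal sketch of that idea follows; the `wrapScript` helper and the `script_main` / `result` names are hypothetical and may differ from what engine.go actually does.

```go
package main

import (
	"fmt"
	"strings"
)

// wrapScript indents the user's script into a function body so that a
// top-level `return` becomes legal, then calls the function and binds its
// value to a module-level name. The names script_main and result are
// assumptions made for this sketch.
func wrapScript(src string) string {
	var b strings.Builder
	b.WriteString("def script_main():\n")
	for _, line := range strings.Split(src, "\n") {
		if strings.TrimSpace(line) == "" {
			b.WriteString("\n") // keep blank lines without trailing spaces
			continue
		}
		b.WriteString("    " + line + "\n")
	}
	// A trailing return keeps the function body non-empty even when the
	// script is empty; it is unreachable if the script already returned.
	b.WriteString("    return None\n")
	b.WriteString("result = script_main()\n")
	return b.String()
}

func main() {
	fmt.Print(wrapScript("x = 1\nreturn x"))
}
```

Line-by-line indentation is the simplest approach, but it also shifts the contents of multi-line string literals, so a production version would likely need something more careful.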

Special notes for reviewers

This is a prototype. Known limitations and things to resolve before any production path:

  • Always-on: no config toggle to enable/disable the script middleware
  • httptest.NewRecorder used in production code for inner tool calls (works fine, but unconventional)
  • No per-script step limit configuration (hardcoded 100K default)
  • parallel() creates a goroutine per callable with no concurrency cap
  • No timeout on individual tool calls within a script
  • SSE transport not tested (JSON-only for now)
  • The {"result": value} structured content unwrapping is specific to mcp-go SDK behavior

Generated with Claude Code

Prototype a "tool script middleware" that lets agents write Starlark scripts
to orchestrate multiple MCP tool calls in a single atomic operation. This
validates the Starlark execution model from RFC THV-0060 without committing
to the full session initialization scope.

Key components:
- Starlark execution engine with step limits and script wrapping for
  top-level return support
- Tool bridge converting MCP tools into callable Starlark functions with
  type conversion between Go/JSON and Starlark values
- parallel() builtin for concurrent fan-out of tool calls
- HTTP middleware intercepting execute_tool_script and injecting it into
  tools/list with dynamic descriptions
- Wired into vMCP server above authz so scripts only see authorized tools

Includes unit tests, in-process acceptance tests, a k8s e2e test, demo
manifests for a Kind cluster with 8 enterprise dummy MCP servers, and
/incident-triage skills for interactive demos.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Apr 9, 2026
Contributor

@github-actions github-actions bot left a comment

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

