Add observability for script execution #4743

@jerm-dro

User Story

As a cluster operator, I want logging and metrics for script execution so that I can monitor and diagnose issues in production.

Background

The code mode feature (STORY-001) adds server-side Starlark script execution to vMCP via the execute_tool_script virtual tool. Operators need visibility into this new execution path to monitor health, diagnose failures, and understand resource consumption. Without observability, script execution is a black box that operators cannot troubleshoot or capacity-plan around.

The vMCP codebase already has a well-established telemetry pattern using OpenTelemetry (see pkg/vmcp/server/telemetry.go and pkg/telemetry/middleware.go). This story follows those existing patterns to add structured logging and OTel metrics/traces for the script execution engine.

Scope

In Scope

  • Structured logging for script lifecycle events (start, completion, errors, timeouts, limit violations)
  • OTel metrics for script execution (count, duration, error rate, tool calls per script, parallel fan-out)
  • OTel tracing spans for script execution and inner tool calls
  • Follow existing vMCP telemetry patterns (telemetryBackendClient decorator pattern, MCPHistogramBuckets, instrumentationName conventions)

Out of Scope

  • Custom dashboards or alerting rules (operators bring their own observability stack)
  • Metrics for adoption tracking (covered in STORY-003)
  • Changes to the telemetry pipeline or collector configuration

Acceptance Criteria

  • unit: Structured logs emitted for script start, completion, and errors — log entries include script hash, session ID, execution duration, tool call count, and error details; script content is NOT logged
  • unit: OTel counter toolhive_vmcp_script_executions increments with status attribute (success/error/timeout/step_limit) on each script execution
  • unit: OTel histogram toolhive_vmcp_script_duration records script execution duration in seconds
  • unit: OTel histogram toolhive_vmcp_script_tool_calls records the number of inner tool calls per script execution
  • unit: OTel counter toolhive_vmcp_script_parallel_goroutines increments by the number of goroutines spawned per parallel() call
  • unit: A parent trace span is created for execute_tool_script, with child spans for each inner tool call, including attributes script.tool_count, script.parallel_used, script.step_count
  • unit: Metrics and traces use the existing instrumentationName constant and MeterProvider/TracerProvider from the vMCP server

Technical Notes

Existing Patterns to Follow

The telemetryBackendClient in pkg/vmcp/server/telemetry.go demonstrates the project's telemetry decorator pattern:

  • Metrics are created via meter.Int64Counter(), meter.Float64Histogram() with descriptive names prefixed by toolhive_vmcp_
  • Histograms use telemetry.MCPHistogramBuckets for bucket boundaries
  • The record() method pattern creates a span, records start metrics, and returns a deferred cleanup function
  • Attributes follow both the backward-compatible (tool_name) and OTel semantic-convention (gen_ai.tool.name) naming

Implementation Approach

  1. Metrics/tracing injection: Accept metric.MeterProvider and trace.TracerProvider in the script engine or middleware constructor. Do not create global meters.
  2. Decorator or inline: Either wrap the script engine with a telemetry decorator (preferred, matches existing pattern) or instrument the engine directly. The decorator approach keeps telemetry concerns separated from execution logic.
  3. Logging: Use slog.With() to attach structured fields. The middleware already has access to the request context for correlation.
  4. Inner tool call spans: Each tool call dispatched from within a script should create a child span under the script execution span. The existing telemetryBackendClient.CallTool may already handle this if inner calls flow through the instrumented backend client.

Key Files

  • pkg/script/engine.go -- Starlark execution engine (instrument here or wrap)
  • pkg/script/bridge.go -- Tool bridge and parallel() (goroutine counting)
  • pkg/script/middleware.go -- HTTP middleware entry point (logging, top-level span)
  • pkg/vmcp/server/telemetry.go -- Existing telemetry patterns to follow
  • pkg/telemetry/middleware.go -- MCPHistogramBuckets and shared telemetry utilities

Dependencies

  • Depends on STORY-001 (the script execution engine must exist before it can be instrumented)
  • Uses go.opentelemetry.io/otel/metric and go.opentelemetry.io/otel/trace (already in go.mod)

Labels

code-mode, enhancement, go, telemetry, vmcp
