
Change runner to connect to otel collector on pre-job #781

Merged: yhaliaw merged 22 commits into main from feat/vm-metrics, Apr 24, 2026

Change runner to connect to otel collector on pre-job#781
yhaliaw merged 22 commits intomainfrom
feat/vm-metrics

Conversation

Collaborator

@yhaliaw yhaliaw commented Apr 14, 2026

Applicable spec:

Overview

This pull request adds support for configuring and enabling OpenTelemetry (OTel) metrics collection for GitHub Runner Manager deployments. It introduces a new configuration option for specifying an OTel collector endpoint, ensures this configuration is validated and propagated through the system, and updates the runner setup to generate and apply the necessary OTel configuration at runtime. Comprehensive tests are included to validate the new logic.

OpenTelemetry Collector Integration:

  • Added a new otel-collector-endpoint configuration option to charmcraft.yaml and integrated it throughout the charm state and service configuration, allowing users to specify an OTel collector endpoint in the format host:port.
  • Introduced the OtelCollectorConfig model to validate and store the OTel collector configuration, with parsing and validation logic that accepts only valid host:port values.
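As a rough illustration of the host:port validation such a model needs, here is a minimal self-contained sketch; the class and method names mirror the PR's OtelCollectorConfig, but the actual model's API and error types may differ:

```python
from dataclasses import dataclass


@dataclass
class OtelCollectorConfig:
    """Hypothetical stand-in for the charm's OTel collector config model."""

    host: str
    port: int

    @classmethod
    def from_endpoint(cls, endpoint: str) -> "OtelCollectorConfig":
        """Parse a host:port string, rejecting anything else."""
        host, sep, port_str = endpoint.rpartition(":")
        if not sep or not host:
            raise ValueError(f"invalid endpoint {endpoint!r}, expected host:port")
        try:
            port = int(port_str)
        except ValueError:
            raise ValueError(f"invalid port in endpoint {endpoint!r}") from None
        if not 0 < port < 65536:
            raise ValueError(f"port out of range in endpoint {endpoint!r}")
        return cls(host=host, port=port)
```

A config value such as `"collector.internal:4317"` parses into host and port, while values missing either part raise `ValueError`, which the charm can surface as a blocked status.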

Runner Setup and Cloud-Init Updates:

  • Modified the runner pre-job template to generate and apply an OTel configuration file when the endpoint is set, and to enable and start the OTel collector service on the runner VM.
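The rendered pre-job step might look roughly like the following sketch; the config path, variable names, and snap service name are assumptions for illustration, not the actual pre-job.j2 output:

```shell
# Hypothetical sketch of the pre-job OTel step. The real template renders
# the endpoint from charm config and targets the collector snap's own
# config path; /tmp is used here only so the sketch is runnable anywhere.
OTEL_ENDPOINT="collector.example.internal:4317"
CONFIG_PATH="${CONFIG_PATH:-/tmp/otelcol-config.yaml}"

if [ -n "$OTEL_ENDPOINT" ]; then
    cat > "$CONFIG_PATH" <<EOF
exporters:
  otlp:
    endpoint: ${OTEL_ENDPOINT}
EOF
    # On the runner VM the template would then enable and start the service,
    # e.g.: sudo snap start --enable opentelemetry-collector
fi
```

Guarding on the endpoint being non-empty keeps the step a no-op for deployments that never set otel-collector-endpoint.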

Testing and Validation:

  • Added unit tests to verify correct parsing, validation, and error handling for the OTel collector endpoint, covering both valid and invalid formats.
  • Updated and extended test fixtures and test cases to cover the new configuration option and its propagation through the system.

Rationale

These changes allow users to easily enable and configure OpenTelemetry metrics collection for their self-hosted runners, improving observability and integration with monitoring systems.

Checklist

  • The charm style guide was applied.
  • The contributing guide was applied.
  • The changes are compliant with ISD054 - Managing Charm Complexity
  • The documentation for charmhub is updated.
  • The PR is tagged with appropriate label (urgent, trivial, complex).
  • The changelog is updated with changes that affect the users of the charm.
  • The application version number is updated in github-runner-manager/pyproject.toml.

@yhaliaw yhaliaw marked this pull request as ready for review April 22, 2026 00:21
Member

@yanksyoon yanksyoon left a comment


LGTM, thanks!

Contributor

Copilot AI left a comment


Pull request overview

Adds an otel-collector-endpoint charm config to enable runner-side OpenTelemetry host-metrics collection by generating an OTel Collector config during the runner pre-job phase and propagating this setting through charm state into runner VM provisioning.

Changes:

  • Introduces the otel-collector-endpoint config and threads it through CharmState → ApplicationConfiguration → runner cloud-init templating.
  • Generates an OTel Collector YAML config in the runner pre-job script and attempts to enable/start the opentelemetry-collector snap.
  • Adds unit/integration tests for parsing/propagation and verifies the config file is written on the runner.
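To make the data flow concrete, a collector config of the kind the pre-job script writes could look like the sketch below. This is a hypothetical fragment: the scraper set, config path, and label values are assumptions; only the github_job/github_runner label keys and the host:port endpoint come from this PR and the dashboard commits referencing it.

```yaml
# Hypothetical OTel Collector config sketch, not the charm's actual template.
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      load:
      memory:
      disk:
      filesystem:
      network:
processors:
  resource:
    attributes:
      - key: github_job
        value: "${GITHUB_JOB}"      # assumed source of the label value
        action: upsert
      - key: github_runner
        value: "${HOSTNAME}"        # assumed source of the label value
        action: upsert
exporters:
  otlp:
    endpoint: "collector.example.internal:4317"  # from otel-collector-endpoint
service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [resource]
      exporters: [otlp]
```

The resource processor is what attaches the github_job and github_runner attributes that the downstream Grafana dashboards filter on.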

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Summary per file:

  • charmcraft.yaml: Adds the new charm config option for the OTel collector endpoint.
  • src/charm_state.py: Parses/validates the endpoint and persists it in stored charm state.
  • src/factories.py: Propagates the OTel config into the application/service configuration model.
  • github-runner-manager/src/github_runner_manager/configuration/base.py: Adds OtelCollectorConfig and wires it into SupportServiceConfig.
  • github-runner-manager/src/github_runner_manager/openstack_cloud/openstack_runner_manager.py: Passes the endpoint into the Jinja context for pre-job generation.
  • github-runner-manager/src/github_runner_manager/templates/pre-job.j2: Writes the OTel Collector config and enables/starts the collector service.
  • tests/unit/conftest.py: Extends the fixture to include otel_collector_config.
  • tests/unit/test_factories.py: Unit test that the OTel config is propagated into the service config.
  • tests/unit/test_charm_state.py: Unit tests for the endpoint parsing/validation helper.
  • tests/integration/test_charm_runner.py: Integration test verifying the config file content is written on the runner VM.

Review comment threads (several marked outdated): src/charm_state.py, github-runner-manager/src/github_runner_manager/configuration/base.py, charmcraft.yaml, github-runner-manager/src/github_runner_manager/templates/pre-job.j2, and github-runner-manager/src/github_runner_manager/templates/env.j2.
@yhaliaw yhaliaw enabled auto-merge (squash) April 24, 2026 02:30
@yhaliaw yhaliaw merged commit 93b7c71 into main Apr 24, 2026
114 of 127 checks passed
@yhaliaw yhaliaw deleted the feat/vm-metrics branch April 24, 2026 07:03
cbartz added a commit to canonical/github-runner-operators that referenced this pull request Apr 24, 2026
Replace github_job_id with github_job and instance with github_runner
to match the actual attribute labels set by the pre-job OTel config
(see canonical/github-runner-operator#781).

Add github_repository and github_workflow template variables so the
dashboard can be filtered the same way as the existing PS6 hostmetrics
dashboard.
cbartz added a commit to canonical/github-runner-operators that referenced this pull request Apr 28, 2026
* feat(observability): add runner VM hostmetrics Grafana dashboard

Adds a read-only Grafana dashboard (editable: false) for runner VM
host-level metrics to be served via cos-configuration-k8s using the
grafana-dashboard relation, which provisions it as an immutable
filesystem dashboard in Grafana.

The dashboard covers:
- CPU utilisation by state and load averages
- Memory usage by state
- Disk I/O throughput and operations
- Filesystem usage % by mount point
- Network traffic, errors and drops

Template variables:
- github_job_id: filter by GitHub Actions workflow run job ID
- instance: filter by runner hostname

Metric names follow the OpenTelemetry hostmetrics receiver prometheus
convention (e.g. system_cpu_time_seconds_total). The github_job_id
label is expected to be set as a resource attribute by the otelcol
pipeline collecting metrics from the runner VMs.

Related: ISD-5152

* docs: document observability layout and rename dashboard directory

Rename grafana_dashboards/ to runner_grafana_dashboards/ to make the
purpose explicit at the repo root level (runner VM host metrics, not
charm workload metrics).

Update README with:
- Repository layout overview
- Observability section explaining the cos-configuration-k8s delivery
  mechanism and the immutability guarantee
- Table of conventions for where dashboards live and what
  grafana_dashboards_path value to use in Terraform

* fix: align dashboard labels with OTel config from github-runner-operator

Replace github_job_id with github_job and instance with github_runner
to match the actual attribute labels set by the pre-job OTel config
(see canonical/github-runner-operator#781).

Add github_repository and github_workflow template variables so the
dashboard can be filtered the same way as the existing PS6 hostmetrics
dashboard.

* refactor(observability): mirror upstream OTel hostmetrics dashboard layout

Restructure the runner VM hostmetrics dashboard to follow the upstream
OpenTelemetry hostmetrics dashboard (Grafana gnetId 24638): Overview row
of CPU/Memory/Root FS gauges plus Load/Cores/Total Memory stats, then
CPU, Memory, Disk I/O, Filesystem and Network sections with read/write
and rx/tx split axes.

Make every templating variable support "All" via includeAll, multi-select
and allValue ".*", and switch all label matchers to =~ so regex
interpolation works.

* fix(observability): correct multi-runner aggregations in hostmetrics dashboard

When the runner variable resolves to multiple series (multi-select or
"All"), several panels previously produced misleading values:

- CPU Cores stat / System Load "cores" reference: count(count by (cpu) ...)
  collapses cpu indexes across runners, returning the max-cores-on-any-host
  rather than fleet total. Group by github_runner so cpu indexes stay
  distinct, then expose total cores in the stat panel and per-runner cores
  on the load panel (so the reference aligns with the averaged load lines).
- System Load 1m/5m/15m: bare metric returns one series per runner with
  identical legends ("1m"/"5m"/"15m"), making the chart unreadable. Wrap
  in avg() to get one fleet-average line per period.
- Disk Busy %: sum by (device) of fractional busy time can exceed 1 with
  multiple runners and gets silently clamped by max:1. Switch to
  avg by (device) so the value stays a meaningful 0-1 fleet average.

Also soften the README guidance on editable: false. cos-configuration-k8s
provisions dashboards from the filesystem, which makes them read-only in
Grafana regardless of the flag, so the explicit "must" requirement was
contradicted by existing dashboards in charms/planner-operator/.

* docs(observability): clarify hostmetrics dashboard variable usage

Expand the dashboard description to spell out the expected usage of the
github_runner variable: scope it to a flavor regex (e.g. flavor-x-.*)
when comparing fleets, or pick a single runner for per-host inspection.

Aggregating by device/mountpoint without grouping by github_runner is
intentional — it produces meaningful fleet totals/averages when the
matched runners share device semantics — but assumes operators don't
mix heterogeneous flavors under "All".

* fix(observability): align Root FS gauge and System Load cores override

- Root FS gauge: restrict the denominator to state=~"used|free" to match
  the Filesystem Utilization bargauge and df semantics. Without this,
  reserved blocks (e.g. ext4's 5% root reservation) inflate the
  denominator and the gauge reads artificially low.
- System Load cores override: the field matcher still pointed at the
  old "cores" legend after the per-runner rename, so the red dashed
  styling never applied. Update the matcher to "cores (per runner)".

* refactor(observability): switch hostmetrics dashboard to single-select

Drop multi-select on the GitHub-context variables (kept includeAll +
allValue: ".*" so picking "All" still widens the scope as a regex).
Single-select matches the upstream OpenTelemetry hostmetrics design and
makes per-host attribution work — multi-runner aggregations under
sum by (device) collapsed identically-named devices across hosts and
hid which runner was responsible for any given spike.

With single-select assured, simplify the dense per-device/per-mountpoint
panels back to bare metrics (drop sum by device on disk I/O throughput,
disk IOPS, disk busy %, memory usage, filesystem usage, network
throughput/packets/errors). Revert the multi-runner-defensive variants
of CPU Cores, System Load 1m/5m/15m and the cores reference series.

Aggregations are kept where they are inherent to the metric: overview
gauges (CPU/Memory/Root FS), Memory Utilization (sum/sum ratio),
Filesystem Utilization (sum by mountpoint ratio) and TCP Connections
(sum by state).

Drop the cross-host-aggregation note from the dashboard description
since the design no longer relies on it.

5 participants