
Change runner to connect to otel collector on pre-job #781

Merged: yhaliaw merged 22 commits into main from feat/vm-metrics, Apr 24, 2026

Change runner to connect to otel collector on pre-job#781
yhaliaw merged 22 commits intomainfrom
feat/vm-metrics

Conversation

Collaborator

@yhaliaw yhaliaw commented Apr 14, 2026

Applicable spec:

Overview

This pull request adds support for configuring and enabling OpenTelemetry (OTel) metrics collection for GitHub Runner Manager deployments. It introduces a new configuration option for specifying an OTel collector endpoint, ensures this configuration is validated and propagated through the system, and updates the runner setup to generate and apply the necessary OTel configuration at runtime. Comprehensive tests are included to validate the new logic.

OpenTelemetry Collector Integration:

  • Added a new otel-collector-endpoint configuration option to charmcraft.yaml and integrated it throughout the charm state and service configuration, allowing users to specify an OTel collector endpoint in the format host:port.
  • Introduced the OtelCollectorConfig model to validate and store the OTel collector configuration, with parsing and validation logic that accepts only valid host:port values.
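As a rough illustration of the host:port validation such a model needs, here is a minimal self-contained sketch; the class and method names mirror the PR's OtelCollectorConfig, but the actual model's API and error types may differ:

```python
from dataclasses import dataclass


@dataclass
class OtelCollectorConfig:
    """Hypothetical stand-in for the charm's OTel collector config model."""

    host: str
    port: int

    @classmethod
    def from_endpoint(cls, endpoint: str) -> "OtelCollectorConfig":
        """Parse a host:port string, rejecting anything else."""
        host, sep, port_str = endpoint.rpartition(":")
        if not sep or not host:
            raise ValueError(f"invalid endpoint {endpoint!r}, expected host:port")
        try:
            port = int(port_str)
        except ValueError:
            raise ValueError(f"invalid port in endpoint {endpoint!r}") from None
        if not 0 < port < 65536:
            raise ValueError(f"port out of range in endpoint {endpoint!r}")
        return cls(host=host, port=port)
```

A config value such as `"collector.internal:4317"` parses into host and port, while values missing either part raise `ValueError`, which the charm can surface as a blocked status.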

Runner Setup and Cloud-Init Updates:

  • Modified the runner pre-job template to generate and apply an OTel configuration file when the endpoint is set, and to enable and start the OTel collector service on the runner VM.
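The rendered pre-job step might look roughly like the following sketch; the config path, variable names, and snap service name are assumptions for illustration, not the actual pre-job.j2 output:

```shell
# Hypothetical sketch of the pre-job OTel step. The real template renders
# the endpoint from charm config and targets the collector snap's own
# config path; /tmp is used here only so the sketch is runnable anywhere.
OTEL_ENDPOINT="collector.example.internal:4317"
CONFIG_PATH="${CONFIG_PATH:-/tmp/otelcol-config.yaml}"

if [ -n "$OTEL_ENDPOINT" ]; then
    cat > "$CONFIG_PATH" <<EOF
exporters:
  otlp:
    endpoint: ${OTEL_ENDPOINT}
EOF
    # On the runner VM the template would then enable and start the service,
    # e.g.: sudo snap start --enable opentelemetry-collector
fi
```

Guarding on the endpoint being non-empty keeps the step a no-op for deployments that never set otel-collector-endpoint.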

Testing and Validation:

  • Added unit tests to verify correct parsing, validation, and error handling for the OTel collector endpoint, covering both valid and invalid formats.
  • Updated and extended test fixtures and test cases to cover the new configuration option and its propagation through the system.

Rationale

These changes allow users to easily enable and configure OpenTelemetry metrics collection for their self-hosted runners, improving observability and integration with monitoring systems.

Checklist

  • The charm style guide was applied.
  • The contributing guide was applied.
  • The changes are compliant with ISD054 - Managing Charm Complexity
  • The documentation for charmhub is updated.
  • The PR is tagged with appropriate label (urgent, trivial, complex).
  • The changelog is updated with changes that affect the users of the charm.
  • The application version number is updated in github-runner-manager/pyproject.toml.

@yhaliaw yhaliaw marked this pull request as ready for review April 22, 2026 00:21
Member

@yanksyoon yanksyoon left a comment


LGTM, thanks!

Contributor

Copilot AI left a comment


Pull request overview

Adds an otel-collector-endpoint charm config to enable runner-side OpenTelemetry host-metrics collection by generating an OTel Collector config during the runner pre-job phase and propagating this setting through charm state into runner VM provisioning.

Changes:

  • Introduces the otel-collector-endpoint config and threads it through CharmState → ApplicationConfiguration → runner cloud-init templating.
  • Generates an OTel Collector YAML config in the runner pre-job script and attempts to enable/start the opentelemetry-collector snap.
  • Adds unit/integration tests for parsing/propagation and verifies the config file is written on the runner.
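To make the data flow concrete, a collector config of the kind the pre-job script writes could look like the sketch below. This is a hypothetical fragment: the scraper set, config path, and label values are assumptions; only the github_job/github_runner label keys and the host:port endpoint come from this PR and the dashboard commits referencing it.

```yaml
# Hypothetical OTel Collector config sketch, not the charm's actual template.
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      load:
      memory:
      disk:
      filesystem:
      network:
processors:
  resource:
    attributes:
      - key: github_job
        value: "${GITHUB_JOB}"      # assumed source of the label value
        action: upsert
      - key: github_runner
        value: "${HOSTNAME}"        # assumed source of the label value
        action: upsert
exporters:
  otlp:
    endpoint: "collector.example.internal:4317"  # from otel-collector-endpoint
service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [resource]
      exporters: [otlp]
```

The resource processor is what attaches the github_job and github_runner attributes that the downstream Grafana dashboards filter on.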

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Summary per file:

  • charmcraft.yaml: Adds the new charm config option for the OTel collector endpoint.
  • src/charm_state.py: Parses/validates the endpoint and persists it in stored charm state.
  • src/factories.py: Propagates the OTel config into the application/service configuration model.
  • github-runner-manager/src/github_runner_manager/configuration/base.py: Adds OtelCollectorConfig and wires it into SupportServiceConfig.
  • github-runner-manager/src/github_runner_manager/openstack_cloud/openstack_runner_manager.py: Passes the endpoint into the Jinja context for pre-job generation.
  • github-runner-manager/src/github_runner_manager/templates/pre-job.j2: Writes the OTel Collector config and enables/starts the collector service.
  • tests/unit/conftest.py: Extends the fixture to include otel_collector_config.
  • tests/unit/test_factories.py: Unit test that the OTel config is propagated into the service config.
  • tests/unit/test_charm_state.py: Unit tests for the endpoint parsing/validation helper.
  • tests/integration/test_charm_runner.py: Integration test verifying the config file content is written on the runner VM.

Review comment threads (several marked outdated): src/charm_state.py, github-runner-manager/src/github_runner_manager/configuration/base.py, charmcraft.yaml, github-runner-manager/src/github_runner_manager/templates/pre-job.j2, and github-runner-manager/src/github_runner_manager/templates/env.j2.
@yhaliaw yhaliaw enabled auto-merge (squash) April 24, 2026 02:30
@yhaliaw yhaliaw merged commit 93b7c71 into main Apr 24, 2026
114 of 127 checks passed
@yhaliaw yhaliaw deleted the feat/vm-metrics branch April 24, 2026 07:03
cbartz added a commit to canonical/github-runner-operators that referenced this pull request Apr 24, 2026
Replace github_job_id with github_job and instance with github_runner
to match the actual attribute labels set by the pre-job OTel config
(see canonical/github-runner-operator#781).

Add github_repository and github_workflow template variables so the
dashboard can be filtered the same way as the existing PS6 hostmetrics
dashboard.
cbartz added a commit to canonical/github-runner-operators that referenced this pull request Apr 28, 2026
* feat(observability): add runner VM hostmetrics Grafana dashboard

Adds a read-only Grafana dashboard (editable: false) for runner VM
host-level metrics to be served via cos-configuration-k8s using the
grafana-dashboard relation, which provisions it as an immutable
filesystem dashboard in Grafana.

The dashboard covers:
- CPU utilisation by state and load averages
- Memory usage by state
- Disk I/O throughput and operations
- Filesystem usage % by mount point
- Network traffic, errors and drops

Template variables:
- github_job_id: filter by GitHub Actions workflow run job ID
- instance: filter by runner hostname

Metric names follow the OpenTelemetry hostmetrics receiver prometheus
convention (e.g. system_cpu_time_seconds_total). The github_job_id
label is expected to be set as a resource attribute by the otelcol
pipeline collecting metrics from the runner VMs.

Related: ISD-5152

* docs: document observability layout and rename dashboard directory

Rename grafana_dashboards/ to runner_grafana_dashboards/ to make the
purpose explicit at the repo root level (runner VM host metrics, not
charm workload metrics).

Update README with:
- Repository layout overview
- Observability section explaining the cos-configuration-k8s delivery
  mechanism and the immutability guarantee
- Table of conventions for where dashboards live and what
  grafana_dashboards_path value to use in Terraform

* fix: align dashboard labels with OTel config from github-runner-operator

Replace github_job_id with github_job and instance with github_runner
to match the actual attribute labels set by the pre-job OTel config
(see canonical/github-runner-operator#781).

Add github_repository and github_workflow template variables so the
dashboard can be filtered the same way as the existing PS6 hostmetrics
dashboard.

* refactor(observability): mirror upstream OTel hostmetrics dashboard layout

Restructure the runner VM hostmetrics dashboard to follow the upstream
OpenTelemetry hostmetrics dashboard (Grafana gnetId 24638): Overview row
of CPU/Memory/Root FS gauges plus Load/Cores/Total Memory stats, then
CPU, Memory, Disk I/O, Filesystem and Network sections with read/write
and rx/tx split axes.

Make every templating variable support "All" via includeAll, multi-select
and allValue ".*", and switch all label matchers to =~ so regex
interpolation works.

* fix(observability): correct multi-runner aggregations in hostmetrics dashboard

When the runner variable resolves to multiple series (multi-select or
"All"), several panels previously produced misleading values:

- CPU Cores stat / System Load "cores" reference: count(count by (cpu) ...)
  collapses cpu indexes across runners, returning the max-cores-on-any-host
  rather than fleet total. Group by github_runner so cpu indexes stay
  distinct, then expose total cores in the stat panel and per-runner cores
  on the load panel (so the reference aligns with the averaged load lines).
- System Load 1m/5m/15m: bare metric returns one series per runner with
  identical legends ("1m"/"5m"/"15m"), making the chart unreadable. Wrap
  in avg() to get one fleet-average line per period.
- Disk Busy %: sum by (device) of fractional busy time can exceed 1 with
  multiple runners and gets silently clamped by max:1. Switch to
  avg by (device) so the value stays a meaningful 0-1 fleet average.

Also soften the README guidance on editable: false. cos-configuration-k8s
provisions dashboards from the filesystem, which makes them read-only in
Grafana regardless of the flag, so the explicit "must" requirement was
contradicted by existing dashboards in charms/planner-operator/.

* docs(observability): clarify hostmetrics dashboard variable usage

Expand the dashboard description to spell out the expected usage of the
github_runner variable: scope it to a flavor regex (e.g. flavor-x-.*)
when comparing fleets, or pick a single runner for per-host inspection.

Aggregating by device/mountpoint without grouping by github_runner is
intentional — it produces meaningful fleet totals/averages when the
matched runners share device semantics — but assumes operators don't
mix heterogeneous flavors under "All".

* fix(observability): align Root FS gauge and System Load cores override

- Root FS gauge: restrict the denominator to state=~"used|free" to match
  the Filesystem Utilization bargauge and df semantics. Without this,
  reserved blocks (e.g. ext4's 5% root reservation) inflate the
  denominator and the gauge reads artificially low.
- System Load cores override: the field matcher still pointed at the
  old "cores" legend after the per-runner rename, so the red dashed
  styling never applied. Update the matcher to "cores (per runner)".

* refactor(observability): switch hostmetrics dashboard to single-select

Drop multi-select on the GitHub-context variables (kept includeAll +
allValue: ".*" so picking "All" still widens the scope as a regex).
Single-select matches the upstream OpenTelemetry hostmetrics design and
makes per-host attribution work — multi-runner aggregations under
sum by (device) collapsed identically-named devices across hosts and
hid which runner was responsible for any given spike.

With single-select assured, simplify the dense per-device/per-mountpoint
panels back to bare metrics (drop sum by device on disk I/O throughput,
disk IOPS, disk busy %, memory usage, filesystem usage, network
throughput/packets/errors). Revert the multi-runner-defensive variants
of CPU Cores, System Load 1m/5m/15m and the cores reference series.

Aggregations are kept where they are inherent to the metric: overview
gauges (CPU/Memory/Root FS), Memory Utilization (sum/sum ratio),
Filesystem Utilization (sum by mountpoint ratio) and TCP Connections
(sum by state).

Drop the cross-host-aggregation note from the dashboard description
since the design no longer relies on it.

5 participants