
Commit 3e5ee2b

Authored by Copilot and yurishkuro
fix(ci): Correct metrics snapshot comparison noise (#8246)
This commit fixes two independent sources of CI noise in E2E metrics snapshot comparisons:

1. **Cassandra false positives** — `generate_diff` appended `Metrics excluded from A/B: N` metadata even when `unified_diff` returned empty (identical non-excluded metrics). A varying 5xx error count across runs made the diff file non-empty, triggering phantom "changes" with zero actual metric differences.
2. **Baseline provenance question** — no documentation explained whether the comparison baseline was guaranteed to come from `main`, raising valid concerns about whether a PR could accidentally compare against another PR's snapshot.

## Changes

### `scripts/e2e/compare_metrics.py`

- Guard the exclusion-count metadata behind a non-empty diff check. Exclusion counts are supplementary context, not metric differences; they should never make a diff file non-empty on their own.

  ```python
  actual_diff = '\n'.join(unified_diff(metrics1, metrics2, lineterm='', n=0))
  if not actual_diff:
      return ''  # suppress metadata-only output
  # ... append exclusion summary alongside the real diff
  ```

### `scripts/e2e/compare_metrics_test.py`

- Reverts a test that asserted new-metric-only diffs should be silently swallowed (the prior, incorrect fix). Both new metrics in the current snapshot and missing metrics relative to the baseline now produce a visible diff — silencing either direction risks hiding genuine flapping.

### `.github/actions/verify-metrics-snapshot/action.yaml`

- Adds a top-level `Baseline-from-main guarantee` comment block documenting the invariant:
  - Cache **save** is gated by `github.ref_name == 'main'` — only main-branch runs ever write under the `{artifact_key}_{run_id}` prefix.
  - Cache **restore** uses prefix matching against `{artifact_key}`; since only main ever saves under that prefix, PRs always receive the most recent main-branch snapshot and can never pick up another PR's snapshot.
  - The exact `key:` in the restore step intentionally never matches (main saves with a `_run_id` suffix), forcing the prefix fallback by design.
- Documents the no-baseline bootstrap case for newly added test configurations.
- Replaces stale and misleading step-level comments with accurate explanations of each step's role.

---

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yurishkuro <3523016+yurishkuro@users.noreply.github.com>
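The guard described above can be illustrated end to end with a small standalone sketch. The function name, inputs, and the pre-normalised metric strings below are illustrative assumptions, not the actual `compare_metrics.py` code, but the control flow mirrors the fix: exclusion counts are appended only when the metric diff itself is non-empty.

```python
from difflib import unified_diff

def generate_diff_sketch(metrics1, metrics2, excluded1, excluded2):
    """Sketch of the fixed generate_diff: metrics are pre-normalised strings."""
    diff = list(unified_diff(metrics1, metrics2, lineterm='', n=0))
    if not diff:
        # Identical non-excluded metrics: suppress metadata-only output so a
        # varying 5xx count alone can never make the diff file non-empty.
        return ''
    summary = ''
    if excluded1 + excluded2 > 0:
        summary = (f'\n# Metrics excluded from A: {excluded1}'
                   f'\n# Metrics excluded from B: {excluded2}')
    return '\n'.join(diff) + summary

# Same metrics, different exclusion counts -> '' (the Cassandra fix)
assert generate_diff_sketch(['m_a{}'], ['m_a{}'], 0, 3) == ''
# A real difference still carries the exclusion summary
assert '# Metrics excluded from A: 1' in generate_diff_sketch(['m_a{}'], ['m_b{}'], 1, 0)
```

The key design point is that `unified_diff` returns a generator; materialising it with `list()` is what makes the emptiness check possible before any output is written.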
1 parent 8a4c314 commit 3e5ee2b

File tree: 5 files changed (+214 / -7 lines)


.github/actions/verify-metrics-snapshot/action.yaml

Lines changed: 35 additions & 4 deletions
```diff
@@ -1,6 +1,32 @@
 # Copyright (c) 2023 The Jaeger Authors.
 # SPDX-License-Identifier: Apache-2.0
 
+# Baseline-from-main guarantee
+# ─────────────────────────────
+# This action maintains a strict invariant: the snapshot used as the comparison
+# baseline for any PR is ALWAYS taken from the most recent successful run on the
+# default branch (main), never from another PR.
+#
+# How the guarantee is achieved:
+#
+# 1. Saving: the "Cache metrics snapshot" step is guarded by
+#    `if: github.ref_name == 'main'`. Only main-branch runs ever write
+#    a cache entry. The key format is `{artifact_key}_{github.run_id}`,
+#    e.g. `metrics_snapshot_elasticsearch_9.x_e2e_12345678`.
+#
+# 2. Restoring: PR runs use `restore-keys: {artifact_key}` (prefix matching).
+#    Because ONLY main-branch runs ever saved under that prefix, the
+#    prefix match always resolves to the most recently cached main-branch
+#    snapshot. A PR can never accidentally pick up a snapshot from
+#    another PR's run.
+#
+# 3. No-baseline case: when a new test configuration is introduced (e.g., a new
+#    storage backend or version added to the matrix), no main-branch
+#    cache exists yet. In that case `cache-matched-key` is empty, the
+#    compare step is skipped, and an empty diff artifact is uploaded.
+#    The baseline is established automatically by the first successful
+#    main-branch run after the new configuration is added.
+
 name: 'Verify Metric Snapshot and Upload Metrics'
 description: 'Upload or cache the metrics data after verification'
 inputs:
@@ -20,23 +46,28 @@ runs:
         path: ./.metrics/${{ inputs.snapshot }}.txt
         retention-days: 7
 
-    # The github cache restore successfully restores when cache saved has same key and same path.
-    # Hence to restore release metric with name relese_{metric_name} , the name must be changed to the same.
+    # Rename before caching so that main-branch and PR restores use distinct
+    # file names and cannot accidentally overwrite each other on disk.
     - name: Change file name before caching
       if: github.ref_name == 'main'
       shell: bash
       run: |
         mv ./.metrics/${{ inputs.snapshot }}.txt ./.metrics/baseline_${{ inputs.snapshot }}.txt
 
+    # Save the baseline ONLY on main-branch runs (see guarantee note above).
+    # The run_id suffix makes each entry unique; prefix matching in the restore
+    # step always retrieves the most recently saved entry.
     - name: Cache metrics snapshot on main branch for longer retention
       if: github.ref_name == 'main'
       uses: actions/cache/save@1bd1e32a3bdc45362d1e726936510720a7c30a57
       with:
         path: ./.metrics/baseline_${{ inputs.snapshot }}.txt
         key: ${{ inputs.artifact_key }}_${{ github.run_id }}
 
-    # Use restore keys to match prefix and fetch the latest cache
-    # Here , restore keys is an ordered list of prefixes that need to be matched
+    # Restore the baseline on non-main branches (PRs, feature branches).
+    # The exact `key` intentionally never matches (main saves with a run_id
+    # suffix), so the restore always falls through to the `restore-keys`
+    # prefix search, which returns the most-recently-saved main-branch entry.
     - name: Download the cached tagged metrics
       id: download-release-snapshot
       if: github.ref_name != 'main'
```
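The save/restore interplay documented in the comment block can be modeled in a few lines. The names below are illustrative, and the selection rule (exact key first, then newest entry matching the prefix) is an assumption based on the documented actions/cache behaviour, not a reimplementation of it:

```python
def restore(cache_entries, exact_key, restore_prefix):
    """cache_entries: list of (key, created_at) pairs, in any order."""
    by_newest = sorted(cache_entries, key=lambda e: e[1], reverse=True)
    for key, _ in by_newest:
        if key == exact_key:             # exact match checked first; never hits
            return key                   # here because main saves with a run_id suffix
    for key, _ in by_newest:
        if key.startswith(restore_prefix):  # restore-keys prefix fallback
            return key                      # newest main-branch snapshot wins
    return None                          # bootstrap case: no baseline exists yet

# Only main-branch runs ever saved under this prefix, so a PR restore
# always resolves to the most recent main entry.
entries = [('metrics_snapshot_cassandra_100', 1), ('metrics_snapshot_cassandra_200', 2)]
assert restore(entries, 'metrics_snapshot_cassandra',
               'metrics_snapshot_cassandra') == 'metrics_snapshot_cassandra_200'
# New test configuration: empty cache, compare step is skipped.
assert restore([], 'metrics_snapshot_cassandra', 'metrics_snapshot_cassandra') is None
```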

.github/workflows/ci-lint-checks.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -126,6 +126,11 @@ jobs:
         run: |
           SHUNIT2=.tools/shunit2 bash scripts/utils/run-tests.sh
 
+      - name: Run Python unit tests for e2e scripts
+        run: |
+          pip install prometheus-client
+          python3 -m unittest discover -s scripts/e2e -p '*_test.py' -v
+
   binary-size-check:
     runs-on: ubuntu-latest
     steps:
```

scripts/e2e/compare_metrics.py

Lines changed: 36 additions & 2 deletions
```diff
@@ -112,6 +112,32 @@ def parse_metrics(content):
 
 
 def generate_diff(file1_content, file2_content):
+    """Compare two Prometheus metrics snapshots and return a unified diff of metric names.
+
+    The input files are raw Prometheus text exposition format, scraped directly from
+    the Jaeger /metrics endpoint by e2e_integration.go (scrapeMetrics), e.g.:
+        # HELP http_requests_total The total number of HTTP requests.
+        # TYPE http_requests_total counter
+        http_requests_total{method="post",code="200"} 1027 1395066363000
+        http_requests_total{method="post",code="400"} 3 1395066363000
+
+    parse_metrics() is where metric values and timestamps are dropped, retaining only
+    the metric name and its normalised label set as a string like:
+        http_requests_total{code="200",method="post"}
+    Certain labels (e.g. service_instance_id) are dropped and entire samples
+    (e.g. HTTP 5xx responses) are excluded to reduce run-to-run noise.
+    This exclusion happens here at analysis time, not at snapshot capture time;
+    the snapshot files always contain the full raw scrape output.
+
+    The diff is performed on these sorted, value-free metric strings. If the two
+    snapshots produce the same set of strings the diff is empty and this function
+    returns ''. When there are differences, the return value is a unified diff
+    followed by optional comment lines reporting how many metrics were excluded, e.g.:
+        # Metrics excluded from A: 3
+        # Metrics excluded from B: 5
+    These comment lines (prefixed with `# `) are appended only when the diff is
+    non-empty; they are informational context, not metric differences themselves.
+    """
     if isinstance(file1_content, list):
         file1_content = ''.join(file1_content)
     if isinstance(file2_content, list):
@@ -120,12 +146,20 @@ def generate_diff(file1_content, file2_content):
     metrics1,excluded_metrics_count1 = parse_metrics(file1_content)
     metrics2,excluded_metrics_count2 = parse_metrics(file2_content)
 
-    diff = unified_diff(metrics1, metrics2,lineterm='',n=0)
+    diff = list(unified_diff(metrics1, metrics2, lineterm='', n=0))
+
+    # Exclusion counts are informational context appended to the diff output.
+    # They must not be written when the diff itself is empty: two snapshots with
+    # identical non-excluded metrics but different numbers of excluded samples
+    # would otherwise produce a non-empty output with no actionable differences.
+    if len(diff) == 0:
+        return ''
+
     total_excluded = excluded_metrics_count1 + excluded_metrics_count2
 
     exclusion_lines = ''
     if total_excluded > 0:
-        exclusion_lines = f'\nMetrics excluded from A: {excluded_metrics_count1}\nMetrics excluded from B: {excluded_metrics_count2}'
+        exclusion_lines = f'\n# Metrics excluded from A: {excluded_metrics_count1}\n# Metrics excluded from B: {excluded_metrics_count2}'
 
     return '\n'.join(diff) + exclusion_lines
 
```
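For readers reconstructing the pipeline, here is a rough sketch of the normalisation the new docstring describes. The regex, label names, and exclusion rule are inferred from the docstring above, not copied from the actual `parse_metrics` implementation, so treat this as an approximation of its behaviour:

```python
import re

# Labels assumed to be dropped for noise reduction (per the docstring).
DROPPED_LABELS = {'service_instance_id'}

def normalise(sample_line):
    """Reduce one raw sample line to a value-free 'name{sorted_labels}' string.

    Returns None when the sample is excluded entirely (e.g. a 5xx response)
    or when the line is not a labelled sample.
    """
    m = re.match(r'([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}', sample_line)
    if not m:
        return None
    name, raw_labels = m.groups()
    # Naive split: good enough for a sketch, breaks on commas inside values.
    labels = dict(pair.split('=', 1) for pair in raw_labels.split(','))
    if labels.get('http_response_status_code', '""').strip('"').startswith('5'):
        return None  # excluded: 5xx counts vary run to run
    kept = {k: v for k, v in labels.items() if k not in DROPPED_LABELS}
    return name + '{' + ','.join(f'{k}={kept[k]}' for k in sorted(kept)) + '}'

line = 'http_requests_total{method="post",code="200"} 1027 1395066363000'
print(normalise(line))  # -> http_requests_total{code="200",method="post"}
```

Note how the value (`1027`) and timestamp are discarded and the labels are emitted in sorted order, which is what makes the later `unified_diff` stable across scrapes.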
scripts/e2e/compare_metrics_test.py

Lines changed: 137 additions & 0 deletions
```diff
@@ -0,0 +1,137 @@
+# Copyright (c) 2025 The Jaeger Authors.
+# SPDX-License-Identifier: Apache-2.0
+
+import unittest
+from compare_metrics import generate_diff, parse_metrics
+
+# Minimal Prometheus text-format snippets used across tests.
+_METRIC_A = '''\
+# HELP counter_a A counter metric
+# TYPE counter_a counter
+counter_a_total{job="a"} 1
+'''
+
+_METRIC_B = '''\
+# HELP counter_b Another counter metric
+# TYPE counter_b counter
+counter_b_total{job="b"} 1
+'''
+
+_METRIC_EXCLUDED_5XX = '''\
+# HELP http_requests HTTP request counter
+# TYPE http_requests counter
+http_requests_total{http_response_status_code="500"} 1
+'''
+
+_METRIC_A_AND_EXCLUDED = _METRIC_A + _METRIC_EXCLUDED_5XX
+
+
+class TestGenerateDiff(unittest.TestCase):
+    """Tests for generate_diff() covering the comparison rules:
+
+    1. Exclusion-count-only diffs (Cassandra noise issue):
+       When the two snapshots contain the same non-excluded metrics but differ
+       only in how many metrics were excluded (e.g. different numbers of 5xx
+       responses captured), the diff must be empty — no false-positive report.
+       Exclusion-count metadata is only meaningful alongside an actual diff.
+
+    2. Real differences are always reported (both directions):
+       Both missing metrics (in baseline but absent from current snapshot) and
+       new metrics (in current snapshot but absent from baseline) are flagged.
+       This ensures regressions and unexpected metric churn are visible, so the
+       root cause can be identified and fixed rather than silently swallowed.
+    """
+
+    def test_identical_snapshots_returns_empty(self):
+        """Identical snapshots produce no diff."""
+        result = generate_diff(_METRIC_A, _METRIC_A)
+        self.assertEqual(result, '')
+
+    def test_empty_snapshots_returns_empty(self):
+        """Two empty snapshots produce no diff."""
+        result = generate_diff('', '')
+        self.assertEqual(result, '')
+
+    def test_regression_detected(self):
+        """Metric present in baseline but absent from current snapshot → diff is non-empty."""
+        # current=A only, baseline=A+B → B is missing from current (regression)
+        result = generate_diff(_METRIC_A, _METRIC_A + _METRIC_B)
+        self.assertNotEqual(result, '', 'Expected a non-empty diff for a regression')
+        # The diff must contain a '+' line for the missing metric (counter_b)
+        self.assertIn('+counter_b', result)
+
+    def test_new_metric_in_current_snapshot_produces_diff(self):
+        """Metric present in current snapshot but absent from baseline → diff is non-empty.
+
+        Both directions of metric change are reported so the root cause can be
+        identified (e.g. stale baseline, newly added metric, or genuine flapping).
+        Silently ignoring new metrics would mask intermittent behaviour where a
+        metric alternates between appearing and disappearing across runs.
+        """
+        # current=A+B, baseline=A only → B is new in current
+        result = generate_diff(_METRIC_A + _METRIC_B, _METRIC_A)
+        self.assertNotEqual(result, '', 'New metrics in current snapshot should produce a diff')
+        # '-' line = in current but not in baseline
+        self.assertIn('-counter_b', result)
+
+    def test_exclusion_count_difference_does_not_produce_diff(self):
+        """Snapshots that differ only in excluded-metric counts produce no diff.
+
+        When both snapshots have identical non-excluded metrics but differ in how many
+        samples were excluded (e.g. a transient error occurred in one run but not the
+        other), the exclusion-count lines are informational metadata and must not make
+        the diff non-empty on their own.
+        """
+        # current has metric_a + one 5xx (excluded), baseline has metric_a + zero 5xx
+        result = generate_diff(_METRIC_A_AND_EXCLUDED, _METRIC_A)
+        self.assertEqual(
+            result,
+            '',
+            'Exclusion-count differences alone must not produce a non-empty diff',
+        )
+
+    def test_mixed_regression_and_new_metric_returns_diff(self):
+        """When there is both a regression AND a new metric, the diff is non-empty."""
+        # current=B only, baseline=A only → A is missing (regression), B is new
+        result = generate_diff(_METRIC_B, _METRIC_A)
+        self.assertNotEqual(result, '')
+        self.assertIn('+counter_a', result)
+        # The new metric should still appear in the raw diff output for visibility
+        self.assertIn('-counter_b', result)
+
+    def test_regression_with_exclusions_includes_exclusion_summary(self):
+        """When there is a regression and excluded metrics, the output includes counts."""
+        # current=excluded only, baseline=A+excluded → A is missing (regression)
+        result = generate_diff(_METRIC_EXCLUDED_5XX, _METRIC_A + _METRIC_EXCLUDED_5XX)
+        self.assertNotEqual(result, '')
+        self.assertIn('# Metrics excluded from A:', result)
+        self.assertIn('# Metrics excluded from B:', result)
+
+    def test_no_exclusions_means_no_exclusion_summary(self):
+        """When there are no excluded metrics, the exclusion summary is omitted."""
+        result = generate_diff(_METRIC_A, _METRIC_A + _METRIC_B)
+        self.assertNotIn('Metrics excluded from', result)
+
+
+class TestParseMetrics(unittest.TestCase):
+    """Smoke tests for parse_metrics() to verify label exclusion."""
+
+    def test_excluded_labels_are_dropped(self):
+        content = '''\
+# HELP my_counter A counter
+# TYPE my_counter counter
+my_counter_total{service_instance_id="abc",job="x"} 1
+'''
+        metrics, _ = parse_metrics(content)
+        self.assertTrue(any('my_counter' in m for m in metrics))
+        # service_instance_id must have been removed
+        self.assertFalse(any('service_instance_id' in m for m in metrics))
+
+    def test_5xx_metrics_are_excluded(self):
+        metrics, count = parse_metrics(_METRIC_EXCLUDED_5XX)
+        self.assertEqual(metrics, [], 'Expected 5xx metric to be excluded')
+        self.assertEqual(count, 1, 'Expected exclusion count of 1')
+
+
+if __name__ == '__main__':
+    unittest.main()
```
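The `+`/`-` direction the tests assert (`-counter_b` for a metric that is new in the current snapshot, `+counter_b` for one missing from it) follows directly from how `difflib.unified_diff` orders its arguments. This standalone check, with fixtures simplified to bare pre-normalised strings, confirms the convention:

```python
# With arguments (current, baseline), '-' marks lines present only in the
# first argument (current) and '+' marks lines present only in the second
# (baseline). This is plain difflib behaviour, not project-specific code.
from difflib import unified_diff

current = ['counter_a{}', 'counter_b{}']   # counter_b is new in current
baseline = ['counter_a{}']
diff = list(unified_diff(current, baseline, lineterm='', n=0))
print(diff)
```

Because `generate_diff` is called as `generate_diff(current, baseline)`, a new metric therefore surfaces as a `-` line, which is exactly what `test_new_metric_in_current_snapshot_produces_diff` asserts.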

scripts/e2e/metrics_summary.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -28,7 +28,7 @@ def parse_diff_file(diff_path):
         original_line = line.rstrip('\n')
         stripped = original_line.strip()
 
-        if stripped.startswith('Metrics excluded from A: ') or stripped.startswith('Metrics excluded from B: '):
+        if stripped.startswith('# Metrics excluded from A: ') or stripped.startswith('# Metrics excluded from B: '):
             count_str = stripped.split(': ')[1]
             exclusion_count += int(count_str)
             continue
```
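The consumer side shown in this hunk can be sketched as a string-based variant. `tally_exclusions` is a hypothetical inlined helper for illustration; the real `parse_diff_file` reads from a file path and does more than this:

```python
def tally_exclusions(diff_text):
    """Separate '# '-prefixed exclusion-count lines from real diff content."""
    exclusion_count = 0
    diff_lines = []
    for line in diff_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('# Metrics excluded from A: ') or \
           stripped.startswith('# Metrics excluded from B: '):
            # Metadata line: add its count to the total and skip it.
            exclusion_count += int(stripped.split(': ')[1])
            continue
        diff_lines.append(line)
    return exclusion_count, diff_lines

count, lines = tally_exclusions(
    '+counter_b{}\n# Metrics excluded from A: 3\n# Metrics excluded from B: 5')
print(count)  # -> 8
```

Keeping the summary's prefix check in sync with the `# `-prefixed lines now emitted by `generate_diff` is the whole point of this one-line change.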
