
Commit 3e5ee2b

Authored by Copilot and yurishkuro
fix(ci): Correct metrics snapshot comparison noise (#8246)
This commit fixes two independent sources of CI noise in E2E metrics snapshot comparisons:

1. **Cassandra false positives** — `generate_diff` appended `Metrics excluded from A/B: N` metadata even when `unified_diff` returned empty (identical non-excluded metrics). A varying 5xx error count across runs made the diff file non-empty, triggering phantom "changes" with zero actual metric differences.
2. **Baseline provenance question** — no documentation explained whether the comparison baseline was guaranteed to come from `main`, raising valid concerns about whether a PR could accidentally compare against another PR's snapshot.

## Changes

### `scripts/e2e/compare_metrics.py`

- Guard the exclusion-count metadata behind a non-empty diff check. Exclusion counts are supplementary context, not metric differences; they should never make a diff file non-empty on their own.

  ```python
  actual_diff = '\n'.join(unified_diff(metrics1, metrics2, lineterm='', n=0))
  if not actual_diff:
      return ''  # suppress metadata-only output
  # ... append exclusion summary alongside the real diff
  ```

### `scripts/e2e/compare_metrics_test.py`

- Reverts a test that asserted new-metric-only diffs should be silently swallowed (the prior, incorrect fix). Both new metrics in the current snapshot and missing metrics relative to the baseline now produce a visible diff — silencing either direction risks hiding genuine flapping.

### `.github/actions/verify-metrics-snapshot/action.yaml`

- Adds a top-level `Baseline-from-main guarantee` comment block documenting the invariant:
  - Cache **save** is gated by `github.ref_name == 'main'` — only main-branch runs ever write under the `{artifact_key}_{run_id}` prefix.
  - Cache **restore** uses prefix matching against `{artifact_key}`; since only main ever saves under that prefix, PRs always receive the most recent main-branch snapshot and can never pick up another PR's snapshot.
  - The exact `key:` in the restore step intentionally never matches (main saves with a `_run_id` suffix), forcing the prefix fallback by design.
- Documents the no-baseline bootstrap case for newly added test configurations.
- Replaces stale and misleading step-level comments with accurate explanations of each step's role.

---

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yurishkuro <3523016+yurishkuro@users.noreply.github.com>
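The guard described above can be illustrated end to end with a small standalone sketch. The function name, inputs, and the pre-normalised metric strings below are illustrative assumptions, not the actual `compare_metrics.py` code, but the control flow mirrors the fix: exclusion counts are appended only when the metric diff itself is non-empty.

```python
from difflib import unified_diff

def generate_diff_sketch(metrics1, metrics2, excluded1, excluded2):
    """Sketch of the fixed generate_diff: metrics are pre-normalised strings."""
    diff = list(unified_diff(metrics1, metrics2, lineterm='', n=0))
    if not diff:
        # Identical non-excluded metrics: suppress metadata-only output so a
        # varying 5xx count alone can never make the diff file non-empty.
        return ''
    summary = ''
    if excluded1 + excluded2 > 0:
        summary = (f'\n# Metrics excluded from A: {excluded1}'
                   f'\n# Metrics excluded from B: {excluded2}')
    return '\n'.join(diff) + summary

# Same metrics, different exclusion counts -> '' (the Cassandra fix)
assert generate_diff_sketch(['m_a{}'], ['m_a{}'], 0, 3) == ''
# A real difference still carries the exclusion summary
assert '# Metrics excluded from A: 1' in generate_diff_sketch(['m_a{}'], ['m_b{}'], 1, 0)
```

The key design point is that `unified_diff` returns a generator; materialising it with `list()` is what makes the emptiness check possible before any output is written.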
1 parent 8a4c314 commit 3e5ee2b

File tree: 5 files changed (+214 / -7 lines)


.github/actions/verify-metrics-snapshot/action.yaml

Lines changed: 35 additions & 4 deletions
```diff
@@ -1,6 +1,32 @@
 # Copyright (c) 2023 The Jaeger Authors.
 # SPDX-License-Identifier: Apache-2.0
 
+# Baseline-from-main guarantee
+# ─────────────────────────────
+# This action maintains a strict invariant: the snapshot used as the comparison
+# baseline for any PR is ALWAYS taken from the most recent successful run on the
+# default branch (main), never from another PR.
+#
+# How the guarantee is achieved:
+#
+# 1. Saving: the "Cache metrics snapshot" step is guarded by
+#    `if: github.ref_name == 'main'`. Only main-branch runs ever write
+#    a cache entry. The key format is `{artifact_key}_{github.run_id}`,
+#    e.g. `metrics_snapshot_elasticsearch_9.x_e2e_12345678`.
+#
+# 2. Restoring: PR runs use `restore-keys: {artifact_key}` (prefix matching).
+#    Because ONLY main-branch runs ever saved under that prefix, the
+#    prefix match always resolves to the most recently cached main-branch
+#    snapshot. A PR can never accidentally pick up a snapshot from
+#    another PR's run.
+#
+# 3. No-baseline case: when a new test configuration is introduced (e.g., a new
+#    storage backend or version added to the matrix), no main-branch
+#    cache exists yet. In that case `cache-matched-key` is empty, the
+#    compare step is skipped, and an empty diff artifact is uploaded.
+#    The baseline is established automatically by the first successful
+#    main-branch run after the new configuration is added.
+
 name: 'Verify Metric Snapshot and Upload Metrics'
 description: 'Upload or cache the metrics data after verification'
 inputs:
@@ -20,23 +46,28 @@ runs:
         path: ./.metrics/${{ inputs.snapshot }}.txt
         retention-days: 7
 
-    # The github cache restore successfully restores when cache saved has same key and same path.
-    # Hence to restore release metric with name relese_{metric_name} , the name must be changed to the same.
+    # Rename before caching so that main-branch and PR restores use distinct
+    # file names and cannot accidentally overwrite each other on disk.
     - name: Change file name before caching
       if: github.ref_name == 'main'
       shell: bash
       run: |
         mv ./.metrics/${{ inputs.snapshot }}.txt ./.metrics/baseline_${{ inputs.snapshot }}.txt
 
+    # Save the baseline ONLY on main-branch runs (see guarantee note above).
+    # The run_id suffix makes each entry unique; prefix matching in the restore
+    # step always retrieves the most recently saved entry.
     - name: Cache metrics snapshot on main branch for longer retention
       if: github.ref_name == 'main'
       uses: actions/cache/save@1bd1e32a3bdc45362d1e726936510720a7c30a57
       with:
         path: ./.metrics/baseline_${{ inputs.snapshot }}.txt
         key: ${{ inputs.artifact_key }}_${{ github.run_id }}
 
-    # Use restore keys to match prefix and fetch the latest cache
-    # Here , restore keys is an ordered list of prefixes that need to be matched
+    # Restore the baseline on non-main branches (PRs, feature branches).
+    # The exact `key` intentionally never matches (main saves with a run_id
+    # suffix), so the restore always falls through to the `restore-keys`
+    # prefix search, which returns the most-recently-saved main-branch entry.
     - name: Download the cached tagged metrics
       id: download-release-snapshot
       if: github.ref_name != 'main'
```
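The save/restore interplay documented in the comment block can be modeled in a few lines. The names below are illustrative, and the selection rule (exact key first, then newest entry matching the prefix) is an assumption based on the documented actions/cache behaviour, not a reimplementation of it:

```python
def restore(cache_entries, exact_key, restore_prefix):
    """cache_entries: list of (key, created_at) pairs, in any order."""
    by_newest = sorted(cache_entries, key=lambda e: e[1], reverse=True)
    for key, _ in by_newest:
        if key == exact_key:             # exact match checked first; never hits
            return key                   # here because main saves with a run_id suffix
    for key, _ in by_newest:
        if key.startswith(restore_prefix):  # restore-keys prefix fallback
            return key                      # newest main-branch snapshot wins
    return None                          # bootstrap case: no baseline exists yet

# Only main-branch runs ever saved under this prefix, so a PR restore
# always resolves to the most recent main entry.
entries = [('metrics_snapshot_cassandra_100', 1), ('metrics_snapshot_cassandra_200', 2)]
assert restore(entries, 'metrics_snapshot_cassandra',
               'metrics_snapshot_cassandra') == 'metrics_snapshot_cassandra_200'
# New test configuration: empty cache, compare step is skipped.
assert restore([], 'metrics_snapshot_cassandra', 'metrics_snapshot_cassandra') is None
```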

.github/workflows/ci-lint-checks.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -126,6 +126,11 @@ jobs:
         run: |
           SHUNIT2=.tools/shunit2 bash scripts/utils/run-tests.sh
 
+      - name: Run Python unit tests for e2e scripts
+        run: |
+          pip install prometheus-client
+          python3 -m unittest discover -s scripts/e2e -p '*_test.py' -v
+
   binary-size-check:
     runs-on: ubuntu-latest
     steps:
```

scripts/e2e/compare_metrics.py

Lines changed: 36 additions & 2 deletions
```diff
@@ -112,6 +112,32 @@ def parse_metrics(content):
 
 
 def generate_diff(file1_content, file2_content):
+    """Compare two Prometheus metrics snapshots and return a unified diff of metric names.
+
+    The input files are raw Prometheus text exposition format, scraped directly from
+    the Jaeger /metrics endpoint by e2e_integration.go (scrapeMetrics), e.g.:
+        # HELP http_requests_total The total number of HTTP requests.
+        # TYPE http_requests_total counter
+        http_requests_total{method="post",code="200"} 1027 1395066363000
+        http_requests_total{method="post",code="400"} 3 1395066363000
+
+    parse_metrics() is where metric values and timestamps are dropped, retaining only
+    the metric name and its normalised label set as a string like:
+        http_requests_total{code="200",method="post"}
+    Certain labels (e.g. service_instance_id) are dropped and entire samples
+    (e.g. HTTP 5xx responses) are excluded to reduce run-to-run noise.
+    This exclusion happens here at analysis time, not at snapshot capture time;
+    the snapshot files always contain the full raw scrape output.
+
+    The diff is performed on these sorted, value-free metric strings. If the two
+    snapshots produce the same set of strings the diff is empty and this function
+    returns ''. When there are differences, the return value is a unified diff
+    followed by optional comment lines reporting how many metrics were excluded, e.g.:
+        # Metrics excluded from A: 3
+        # Metrics excluded from B: 5
+    These comment lines (prefixed with `# `) are appended only when the diff is
+    non-empty; they are informational context, not metric differences themselves.
+    """
     if isinstance(file1_content, list):
         file1_content = ''.join(file1_content)
     if isinstance(file2_content, list):
@@ -120,12 +146,20 @@ def generate_diff(file1_content, file2_content):
     metrics1,excluded_metrics_count1 = parse_metrics(file1_content)
     metrics2,excluded_metrics_count2 = parse_metrics(file2_content)
 
-    diff = unified_diff(metrics1, metrics2,lineterm='',n=0)
+    diff = list(unified_diff(metrics1, metrics2, lineterm='', n=0))
+
+    # Exclusion counts are informational context appended to the diff output.
+    # They must not be written when the diff itself is empty: two snapshots with
+    # identical non-excluded metrics but different numbers of excluded samples
+    # would otherwise produce a non-empty output with no actionable differences.
+    if len(diff) == 0:
+        return ''
+
     total_excluded = excluded_metrics_count1 + excluded_metrics_count2
 
     exclusion_lines = ''
     if total_excluded > 0:
-        exclusion_lines = f'\nMetrics excluded from A: {excluded_metrics_count1}\nMetrics excluded from B: {excluded_metrics_count2}'
+        exclusion_lines = f'\n# Metrics excluded from A: {excluded_metrics_count1}\n# Metrics excluded from B: {excluded_metrics_count2}'
 
     return '\n'.join(diff) + exclusion_lines
 
```
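For readers reconstructing the pipeline, here is a rough sketch of the normalisation the new docstring describes. The regex, label names, and exclusion rule are inferred from the docstring above, not copied from the actual `parse_metrics` implementation, so treat this as an approximation of its behaviour:

```python
import re

# Labels assumed to be dropped for noise reduction (per the docstring).
DROPPED_LABELS = {'service_instance_id'}

def normalise(sample_line):
    """Reduce one raw sample line to a value-free 'name{sorted_labels}' string.

    Returns None when the sample is excluded entirely (e.g. a 5xx response)
    or when the line is not a labelled sample.
    """
    m = re.match(r'([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}', sample_line)
    if not m:
        return None
    name, raw_labels = m.groups()
    # Naive split: good enough for a sketch, breaks on commas inside values.
    labels = dict(pair.split('=', 1) for pair in raw_labels.split(','))
    if labels.get('http_response_status_code', '""').strip('"').startswith('5'):
        return None  # excluded: 5xx counts vary run to run
    kept = {k: v for k, v in labels.items() if k not in DROPPED_LABELS}
    return name + '{' + ','.join(f'{k}={kept[k]}' for k in sorted(kept)) + '}'

line = 'http_requests_total{method="post",code="200"} 1027 1395066363000'
print(normalise(line))  # -> http_requests_total{code="200",method="post"}
```

Note how the value (`1027`) and timestamp are discarded and the labels are emitted in sorted order, which is what makes the later `unified_diff` stable across scrapes.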
scripts/e2e/compare_metrics_test.py

Lines changed: 137 additions & 0 deletions
```diff
@@ -0,0 +1,137 @@
+# Copyright (c) 2025 The Jaeger Authors.
+# SPDX-License-Identifier: Apache-2.0
+
+import unittest
+from compare_metrics import generate_diff, parse_metrics
+
+# Minimal Prometheus text-format snippets used across tests.
+_METRIC_A = '''\
+# HELP counter_a A counter metric
+# TYPE counter_a counter
+counter_a_total{job="a"} 1
+'''
+
+_METRIC_B = '''\
+# HELP counter_b Another counter metric
+# TYPE counter_b counter
+counter_b_total{job="b"} 1
+'''
+
+_METRIC_EXCLUDED_5XX = '''\
+# HELP http_requests HTTP request counter
+# TYPE http_requests counter
+http_requests_total{http_response_status_code="500"} 1
+'''
+
+_METRIC_A_AND_EXCLUDED = _METRIC_A + _METRIC_EXCLUDED_5XX
+
+
+class TestGenerateDiff(unittest.TestCase):
+    """Tests for generate_diff() covering the comparison rules:
+
+    1. Exclusion-count-only diffs (Cassandra noise issue):
+       When the two snapshots contain the same non-excluded metrics but differ
+       only in how many metrics were excluded (e.g. different numbers of 5xx
+       responses captured), the diff must be empty — no false-positive report.
+       Exclusion-count metadata is only meaningful alongside an actual diff.
+
+    2. Real differences are always reported (both directions):
+       Both missing metrics (in baseline but absent from current snapshot) and
+       new metrics (in current snapshot but absent from baseline) are flagged.
+       This ensures regressions and unexpected metric churn are visible, so the
+       root cause can be identified and fixed rather than silently swallowed.
+    """
+
+    def test_identical_snapshots_returns_empty(self):
+        """Identical snapshots produce no diff."""
+        result = generate_diff(_METRIC_A, _METRIC_A)
+        self.assertEqual(result, '')
+
+    def test_empty_snapshots_returns_empty(self):
+        """Two empty snapshots produce no diff."""
+        result = generate_diff('', '')
+        self.assertEqual(result, '')
+
+    def test_regression_detected(self):
+        """Metric present in baseline but absent from current snapshot → diff is non-empty."""
+        # current=A only, baseline=A+B → B is missing from current (regression)
+        result = generate_diff(_METRIC_A, _METRIC_A + _METRIC_B)
+        self.assertNotEqual(result, '', 'Expected a non-empty diff for a regression')
+        # The diff must contain a '+' line for the missing metric (counter_b)
+        self.assertIn('+counter_b', result)
+
+    def test_new_metric_in_current_snapshot_produces_diff(self):
+        """Metric present in current snapshot but absent from baseline → diff is non-empty.
+
+        Both directions of metric change are reported so the root cause can be
+        identified (e.g. stale baseline, newly added metric, or genuine flapping).
+        Silently ignoring new metrics would mask intermittent behaviour where a
+        metric alternates between appearing and disappearing across runs.
+        """
+        # current=A+B, baseline=A only → B is new in current
+        result = generate_diff(_METRIC_A + _METRIC_B, _METRIC_A)
+        self.assertNotEqual(result, '', 'New metrics in current snapshot should produce a diff')
+        # '-' line = in current but not in baseline
+        self.assertIn('-counter_b', result)
+
+    def test_exclusion_count_difference_does_not_produce_diff(self):
+        """Snapshots that differ only in excluded-metric counts produce no diff.
+
+        When both snapshots have identical non-excluded metrics but differ in how many
+        samples were excluded (e.g. a transient error occurred in one run but not the
+        other), the exclusion-count lines are informational metadata and must not make
+        the diff non-empty on their own.
+        """
+        # current has metric_a + one 5xx (excluded), baseline has metric_a + zero 5xx
+        result = generate_diff(_METRIC_A_AND_EXCLUDED, _METRIC_A)
+        self.assertEqual(
+            result,
+            '',
+            'Exclusion-count differences alone must not produce a non-empty diff',
+        )
+
+    def test_mixed_regression_and_new_metric_returns_diff(self):
+        """When there is both a regression AND a new metric, the diff is non-empty."""
+        # current=B only, baseline=A only → A is missing (regression), B is new
+        result = generate_diff(_METRIC_B, _METRIC_A)
+        self.assertNotEqual(result, '')
+        self.assertIn('+counter_a', result)
+        # The new metric should still appear in the raw diff output for visibility
+        self.assertIn('-counter_b', result)
+
+    def test_regression_with_exclusions_includes_exclusion_summary(self):
+        """When there is a regression and excluded metrics, the output includes counts."""
+        # current=excluded only, baseline=A+excluded → A is missing (regression)
+        result = generate_diff(_METRIC_EXCLUDED_5XX, _METRIC_A + _METRIC_EXCLUDED_5XX)
+        self.assertNotEqual(result, '')
+        self.assertIn('# Metrics excluded from A:', result)
+        self.assertIn('# Metrics excluded from B:', result)
+
+    def test_no_exclusions_means_no_exclusion_summary(self):
+        """When there are no excluded metrics, the exclusion summary is omitted."""
+        result = generate_diff(_METRIC_A, _METRIC_A + _METRIC_B)
+        self.assertNotIn('Metrics excluded from', result)
+
+
+class TestParseMetrics(unittest.TestCase):
+    """Smoke tests for parse_metrics() to verify label exclusion."""
+
+    def test_excluded_labels_are_dropped(self):
+        content = '''\
+# HELP my_counter A counter
+# TYPE my_counter counter
+my_counter_total{service_instance_id="abc",job="x"} 1
+'''
+        metrics, _ = parse_metrics(content)
+        self.assertTrue(any('my_counter' in m for m in metrics))
+        # service_instance_id must have been removed
+        self.assertFalse(any('service_instance_id' in m for m in metrics))
+
+    def test_5xx_metrics_are_excluded(self):
+        metrics, count = parse_metrics(_METRIC_EXCLUDED_5XX)
+        self.assertEqual(metrics, [], 'Expected 5xx metric to be excluded')
+        self.assertEqual(count, 1, 'Expected exclusion count of 1')
+
+
+if __name__ == '__main__':
+    unittest.main()
```
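The `+`/`-` direction the tests assert (`-counter_b` for a metric that is new in the current snapshot, `+counter_b` for one missing from it) follows directly from how `difflib.unified_diff` orders its arguments. This standalone check, with fixtures simplified to bare pre-normalised strings, confirms the convention:

```python
# With arguments (current, baseline), '-' marks lines present only in the
# first argument (current) and '+' marks lines present only in the second
# (baseline). This is plain difflib behaviour, not project-specific code.
from difflib import unified_diff

current = ['counter_a{}', 'counter_b{}']   # counter_b is new in current
baseline = ['counter_a{}']
diff = list(unified_diff(current, baseline, lineterm='', n=0))
print(diff)
```

Because `generate_diff` is called as `generate_diff(current, baseline)`, a new metric therefore surfaces as a `-` line, which is exactly what `test_new_metric_in_current_snapshot_produces_diff` asserts.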

scripts/e2e/metrics_summary.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -28,7 +28,7 @@ def parse_diff_file(diff_path):
         original_line = line.rstrip('\n')
         stripped = original_line.strip()
 
-        if stripped.startswith('Metrics excluded from A: ') or stripped.startswith('Metrics excluded from B: '):
+        if stripped.startswith('# Metrics excluded from A: ') or stripped.startswith('# Metrics excluded from B: '):
             count_str = stripped.split(': ')[1]
             exclusion_count += int(count_str)
             continue
```
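The consumer side shown in this hunk can be sketched as a string-based variant. `tally_exclusions` is a hypothetical inlined helper for illustration; the real `parse_diff_file` reads from a file path and does more than this:

```python
def tally_exclusions(diff_text):
    """Separate '# '-prefixed exclusion-count lines from real diff content."""
    exclusion_count = 0
    diff_lines = []
    for line in diff_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('# Metrics excluded from A: ') or \
           stripped.startswith('# Metrics excluded from B: '):
            # Metadata line: add its count to the total and skip it.
            exclusion_count += int(stripped.split(': ')[1])
            continue
        diff_lines.append(line)
    return exclusion_count, diff_lines

count, lines = tally_exclusions(
    '+counter_b{}\n# Metrics excluded from A: 3\n# Metrics excluded from B: 5')
print(count)  # -> 8
```

Keeping the summary's prefix check in sync with the `# `-prefixed lines now emitted by `generate_diff` is the whole point of this one-line change.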
