fix(proof): add missing export_ab_proof.py script by Gradata · Pull Request #48 · Gradata/gradata

Gradata · 2026-04-15T00:58:08Z

Recovers the ablation-proof export script that test_proof.py references but that was left out of PR #44. This fixes test_export_script_handles_empty_run_dir, which was failing on several other PRs (#41, #36, #34).

Summary

Adds cloud/scripts/export_ab_proof.py — aggregates blind-judge JSONL judgments from .tmp/rule-ablation-v2/judgments/*.jsonl into cloud/data/proof_results.json consumed by GET /api/v1/public/proof.
Computes per-dimension means, 95% CIs (normal approx), deltas vs baseline, and per-model breakdowns matching the schema in cloud/app/routes/proof.py and ABProofPanel.tsx.
Empty / missing run dir => load_judgments returns {} and the export writes an honest available: false payload (no fabrication).
CLI: --run-dir, --out, --source, --dry-run.
Updates .gitignore to allowlist cloud/scripts/ (the broad scripts/ ignore was masking it, which is what caused the script to be omitted from PR #44).
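
The mean, 95% CI, and delta computation listed above can be sketched roughly as follows. This is a minimal illustration rather than the script's actual code: the helper names `ci95_sketch` and `delta_pp_sketch` are hypothetical, and it assumes scores lie on a 0 to 1 scale so that a difference times 100 reads as percentage points.

```python
import math
from statistics import mean, stdev


def ci95_sketch(scores):
    """Normal-approximation 95% CI for the mean of `scores` (hypothetical helper)."""
    n = len(scores)
    if n == 0:
        # Nothing to estimate; return a degenerate interval.
        return (0.0, 0.0)
    m = mean(scores)
    if n < 2:
        # Sample standard deviation is undefined for a single observation.
        return (m, m)
    half_width = 1.96 * stdev(scores) / math.sqrt(n)
    return (m - half_width, m + half_width)


def delta_pp_sketch(arm_mean, baseline_mean):
    """Delta vs baseline in percentage points, ASSUMING a 0-1 score scale."""
    return (arm_mean - baseline_mean) * 100.0
```

With few judgments per arm the normal approximation is optimistic; a t-based interval would be wider, but the PR description names the normal approximation, so the sketch follows that.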

Test plan

pytest cloud/tests/test_proof.py::test_export_script_handles_empty_run_dir passes
Full cloud/tests/test_proof.py suite passes (6/6)

Generated with Gradata

coderabbitai · 2026-04-15T00:58:14Z

📝 Walkthrough

Added CLI script cloud/scripts/export_ab_proof.py to aggregate blind-judge JSONL judgments from .tmp/rule-ablation-v2/judgments/*.jsonl into cloud/data/proof_results.json for the public proof API/UI
Public functions provided by the script: load_judgments(), aggregate(), build_payload(), parse_args(), main()
Computes per-dimension baseline/with_rules/with_full means, selects best arm mean, computes 95% CI (normal approximation), delta in percentage points versus baseline, and per-model dimension breakdowns matching the proof API schema
Skips records with unknown conditions (only accepts base, rules, full) so trial/subject counts and reported arms stay consistent
Gracefully handles missing/empty/malformed run directories/files: load_judgments() returns {} and export emits available: false (no fabricated data); malformed JSON lines are skipped with warnings
CLI options: --run-dir, --out, --source, --dry-run (dry-run prints JSON to stdout; --out writes file, creating parent dirs as needed)
Updated .gitignore to un-ignore cloud/scripts/ (added !cloud/scripts/) so the scripts directory is tracked
Tests: cloud/tests/test_proof.py test_export_script_handles_empty_run_dir passes; full test file (6/6) exercises endpoint and export behavior
No breaking changes or security fixes; no new long-lived public API beyond the added export script and its functions
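
The loading behaviour described in this list (missing or empty directory returns {}, malformed lines are skipped with a warning) might look something like the sketch below. The function name `load_judgments_sketch` and the choice to key the result by file stem are assumptions for illustration; the walkthrough only states that `load_judgments()` returns {} when there is nothing to load.

```python
import json
import sys
from pathlib import Path


def load_judgments_sketch(run_dir):
    """Load *.jsonl judgment files from run_dir; return {} if nothing usable is found."""
    run_path = Path(run_dir)
    if not run_path.is_dir():
        return {}
    judgments = {}
    for path in sorted(run_path.glob("*.jsonl")):
        records = []
        with path.open(encoding="utf-8") as fh:
            for lineno, line in enumerate(fh, start=1):
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    # Skip the bad line but leave a trace on stderr.
                    print(f"warning: {path.name}:{lineno}: malformed JSON, skipped",
                          file=sys.stderr)
                    continue
                if isinstance(rec, dict):
                    records.append(rec)
        if records:
            # Keying by file stem is an assumption of this sketch.
            judgments[path.stem] = records
    return judgments
```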

Walkthrough

Added a .gitignore negation to include cloud/scripts/ and introduced a new CLI cloud/scripts/export_ab_proof.py that loads JSONL judgment files, aggregates scores by condition/model/dimension, computes means, deltas and 95% CIs, and writes a standardized proof_results.json payload (or prints in dry-run).
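
For orientation, the CLI surface and the dry-run versus write-out split could be wired up along these lines. Only the four flag names come from the PR; the defaults, the meaning of `--source` as a free-form label, and the helper names `parse_args_sketch` and `emit_sketch` are assumptions made for this sketch.

```python
import argparse
import json
from pathlib import Path


def parse_args_sketch(argv=None):
    """Parse the four flags named in the PR; the defaults here are assumed."""
    parser = argparse.ArgumentParser(description="Export A/B proof results (sketch)")
    parser.add_argument("--run-dir", default=".tmp/rule-ablation-v2/judgments")
    parser.add_argument("--out", default="cloud/data/proof_results.json")
    parser.add_argument("--source", default=None)  # assumed: free-form source label
    parser.add_argument("--dry-run", action="store_true")
    return parser.parse_args(argv)


def emit_sketch(payload, args):
    """Print JSON on --dry-run, otherwise write it to --out, creating parent dirs."""
    text = json.dumps(payload, indent=2)
    if args.dry_run:
        print(text)
        return None
    out_path = Path(args.out)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text + "\n", encoding="utf-8")
    return out_path
```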

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Gitignore Exception `/.gitignore` | Added `!cloud/scripts/` negation to ensure `cloud/scripts/` is tracked despite broader ignore patterns. |
| Proof Export Script `cloud/scripts/export_ab_proof.py` | New CLI utility: scans a `--run-dir` for `*.jsonl` judgment files, skips malformed lines with warnings, aggregates scores by `(condition, dimension)` and `(model, condition, dimension)`, computes baseline/arm means, selects best arm per dimension, computes 95% normal-approx CIs and delta_pp vs baseline, builds `per_model` summaries, and emits JSON to `--out` or stdout (`--dry-run`). Handles missing/empty directories and creates output parent directories. |

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant CLI as export_ab_proof.py
    participant FS as File System
    participant Out as Output JSON

    User->>CLI: run with --run-dir, --out/--dry-run, --source
    CLI->>FS: scan for *.jsonl under run-dir
    FS-->>CLI: list of files (or empty)
    loop per file
        CLI->>FS: read file
        FS-->>CLI: lines (JSONL)
        CLI->>CLI: parse lines (skip malformed, warn)
        CLI->>CLI: accumulate records by condition/model/dimension
    end
    CLI->>CLI: compute means, select best arm, compute 95% CI, delta_pp, per_model
    CLI->>Out: render standardized proof_results JSON
    alt dry-run
        Out-->>User: print JSON to stdout
    else write-out
        CLI->>FS: create parent dirs, write file
        FS-->>User: file written
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

feat(cloud): honest A/B proof — /public/proof endpoint + ablation export #44: Adds/modifies cloud/scripts/export_ab_proof.py to generate cloud/data/proof_results.json (same script and functionality).

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding the missing export_ab_proof.py script to fix the proof functionality. |
| Description check | ✅ Passed | The description provides a comprehensive overview of the changes, explaining what the script does, why it was needed, and includes test confirmation. |


greptile-apps

Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

cloudflare-workers-and-pages · 2026-04-15T00:58:21Z

Deploying gradata-dashboard with Cloudflare Pages

Latest commit: `8a84abd`
Status: ✅ Deploy successful!
Preview URL: https://99012bac.gradata-dashboard.pages.dev
Branch Preview URL: https://fix-missing-export-ab-proof.gradata-dashboard.pages.dev


Gradata · 2026-04-15T01:55:01Z

@coderabbitai review

coderabbitai · 2026-04-15T01:55:06Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Gradata · 2026-04-15T02:55:36Z

@coderabbitai review — CR Pro active, please re-verify.

coderabbitai · 2026-04-15T02:55:47Z

@Gradata Sure! Triggering a fresh review of PR #48 now.

✅ Actions performed

Review triggered.


coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cloud/scripts/export_ab_proof.py`:
- Around line 124-133: The loop that builds by_cond_dim and by_model_cond_dim
accepts any condition but downstream logic only supports the conditions "base",
"rules", and "full"; update the iteration over flat (the block using model,
condition, dimension, score) to skip records whose condition is not in the
allowed set (e.g., allowed_conditions = {"base","rules","full"}) and only append
scores to by_cond_dim and by_model_cond_dim for those allowed conditions; also
propagate this filter to the other places that compute/zero dimensions and
trials (the code that zeros dimensions and the code that increments/aggregates
trials) so trials counts and zeroed-dimension logic only consider records with
supported conditions.
- Around line 172-181: The code picks best_pool = rules or full which prefers
rules even when full has a higher mean, causing with_best_mean and delta_pp to
be wrong; change the logic to compute the mean for rules and full separately
(use _mean safely when a pool exists), compare those means and set best_pool or
directly set with_best_mean to the higher mean (falling back to baseline_mean
when neither pool exists), then compute delta_pp from that selected
with_best_mean; update references in best_pool, with_best_mean, and delta_pp
calculations accordingly.
- Around line 149-163: best_mean, ci and n_with are currently computed from
mixed pools (ci_pool = rules or full or base and n_with = len(rules)+len(full))
which can detach CI and sample size from the arm that produced best_mean; update
the logic to pick the winning arm first (compare best_mean to with_rules_mean,
with_full_mean and baseline_mean) and set ci_pool and n_with from that chosen
arm: e.g., if best_mean == with_rules_mean -> ci_pool = rules, n_with =
len(rules); if best_mean == with_full_mean -> ci_pool = full, n_with =
len(full); if baseline is best -> ci_pool = base and n_with = 0 (or len(base) if
you must report baseline size), then call _ci95(ci_pool) and use those
ci_low/ci_high and n_with when appending the dim_payload entry so CI and n_with
match the selected best_mean.
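
The third finding (pick the winning arm first, then take the CI pool and sample size from that same arm) can be illustrated with a small sketch. The function name `select_best_arm_sketch` and the returned dict shape are hypothetical; the tie-break order (rules, then full, then base) and reporting `n_with = 0` when baseline wins are choices made here, following one of the options the review comment mentions.

```python
from statistics import mean


def select_best_arm_sketch(base, rules, full):
    """Pick the non-empty arm with the highest mean and tie CI pool and n_with to it."""
    candidates = [
        (name, pool)
        for name, pool in (("rules", rules), ("full", full), ("base", base))
        if pool
    ]
    if not candidates:
        return None
    # max() keeps the first maximal item, so ties favour rules, then full, then base.
    name, pool = max(candidates, key=lambda item: mean(item[1]))
    return {
        "arm": name,
        "best_mean": mean(pool),
        "ci_pool": pool,  # the CI should be computed from this pool only
        "n_with": 0 if name == "base" else len(pool),
    }
```

Because `ci_pool` and `n_with` are read off the selected arm, they cannot drift from `best_mean` the way a mixed rules+full pool could.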


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ef41abb7-4453-42fa-ad99-9df1d965424e

📥 Commits

Reviewing files that changed from the base of the PR and between d44468b and 49a9714.

📒 Files selected for processing (2)

.gitignore
cloud/scripts/export_ab_proof.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Cloudflare Pages

🔇 Additional comments (1)

.gitignore (1)

151-151: Allowlist override looks correct.

Line 151 correctly re-includes cloud/scripts/ so the exporter script can be tracked even with scripts/ ignored.

…-proof-script

…arm selection

- Skip records whose condition is not in {base, rules, full} so trials, subjects, and dimension zeroing stay consistent with the three reported arms.
- dim_payload: pick the arm with the highest mean and lock ci_pool + n_with to that arm so reported CI and sample size match best_mean instead of drifting from a mixed rules+full pool.
- per_model: pick max(rules_mean, full_mean) for with_best_mean instead of truthy-OR'ing pools, which silently preferred rules even when full was higher.

CR review: #48 (review)


Gradata · 2026-04-15T04:57:09Z

Round-2 CR addressed + rebased on main:

Unknown conditions filtered — records outside {base, rules, full} are now dropped in the aggregation loop, so trials, subjects, and dimension zeroing stay consistent with the three reported arms.
dim_payload best-arm selection — pick arm with highest mean first, then lock ci_pool and n_with to that same arm (rules/full/base) so CI and sample size match best_mean instead of drifting from a mixed rules+full pool.
per_model best_pool — replaced rules or full truthy-OR with max(rules_mean, full_mean) so with_best_mean doesn't silently under-report when full outperforms rules.
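
The truthy-OR pitfall in the per_model item is easy to reproduce in isolation. In the sketch below the function names are hypothetical and the pools are plain lists of scores; the "before" version mirrors the `rules or full` pattern quoted in the review, and the "after" version mirrors the `max(rules_mean, full_mean)` fix.

```python
from statistics import mean


def with_best_mean_before(rules, full, baseline_mean):
    """`rules or full` returns rules whenever it is non-empty, regardless of means."""
    best_pool = rules or full
    return mean(best_pool) if best_pool else baseline_mean


def with_best_mean_after(rules, full, baseline_mean):
    """Compare the arm means directly; fall back to baseline when both arms are empty."""
    means = [mean(pool) for pool in (rules, full) if pool]
    return max(means) if means else baseline_mean


rules_scores, full_scores = [0.25], [0.75]
print(with_best_mean_before(rules_scores, full_scores, 0.1))  # under-reports: picks rules
print(with_best_mean_after(rules_scores, full_scores, 0.1))   # picks the higher arm, full
```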

pytest cloud/tests/ -x: 121 passed. Rebased on origin/main (clean merge).

Gradata · 2026-04-15T04:57:20Z

@coderabbitai review

coderabbitai · 2026-04-15T04:57:25Z

✅ Actions performed

Review triggered.


coderabbitai

♻️ Duplicate comments (1)

cloud/scripts/export_ab_proof.py (1)

124-148: ⚠️ Potential issue | 🟠 Major

Return unavailable when no valid judgments survive filtering.

If input files exist but every record is malformed/unsupported, the payload currently reports available: true with empty metrics. This is a misleading success state for /api/v1/public/proof.

💡 Proposed fix

     for rec in flat:
         model = rec.get("model") or rec.get("subject")
         condition = rec.get("condition") or rec.get("variant")
         dimension = rec.get("dimension")
         score = rec.get("score")
         if not (model and condition and dimension) or not isinstance(score, (int, float)):
             continue
@@
         if rec.get("task_id"):
             task_ids.add(rec["task_id"])
         trials += 1

+    if trials == 0:
+        return {
+            "available": False,
+            "source": None,
+            "subjects": [],
+            "judge": None,
+            "trials": 0,
+            "dimensions": [],
+            "per_model": [],
+            "updated_at": datetime.now(timezone.utc).isoformat(),
+            "reason": "no valid judgments found",
+        }
+
     dimensions_seen = sorted({d for (_c, d) in by_cond_dim})

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@cloud/scripts/export_ab_proof.py` around lines 124 - 148, The loop collects
valid judgments into by_cond_dim/by_model_cond_dim and tracks trials; if all
input records were filtered out the code still reports available: true. After
the aggregation loop (using variables trials, by_cond_dim, dimensions_seen),
check if trials == 0 (or dimensions_seen is empty) and set the export payload's
available flag to False (or return an appropriate "no data" response) so the
/api/v1/public/proof payload does not claim availability when no valid judgments
survive filtering.
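
A self-contained way to exercise this finding is sketched below. It reuses the record fields shown in the proposed fix (`model`/`subject`, `condition`/`variant`, `dimension`, `score`) and the allowed condition set from the earlier review round; the function name `aggregate_sketch` and the trimmed-down payload shape are assumptions, not the script's real schema.

```python
from collections import defaultdict

ALLOWED_CONDITIONS = {"base", "rules", "full"}


def aggregate_sketch(flat):
    """Aggregate valid judgments; report available: False when none survive filtering."""
    by_cond_dim = defaultdict(list)
    trials = 0
    for rec in flat:
        model = rec.get("model") or rec.get("subject")
        condition = rec.get("condition") or rec.get("variant")
        dimension = rec.get("dimension")
        score = rec.get("score")
        if not (model and condition and dimension) or not isinstance(score, (int, float)):
            continue
        if condition not in ALLOWED_CONDITIONS:
            continue  # unknown arms must not inflate the trial count
        by_cond_dim[(condition, dimension)].append(float(score))
        trials += 1

    if trials == 0:
        # Files existed but nothing usable was in them: do not claim availability.
        return {"available": False, "trials": 0, "dimensions": [],
                "reason": "no valid judgments found"}

    dimensions_seen = sorted({d for (_c, d) in by_cond_dim})
    return {"available": True, "trials": trials, "dimensions": dimensions_seen}
```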

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@cloud/scripts/export_ab_proof.py`:
- Around line 124-148: The loop collects valid judgments into
by_cond_dim/by_model_cond_dim and tracks trials; if all input records were
filtered out the code still reports available: true. After the aggregation loop
(using variables trials, by_cond_dim, dimensions_seen), check if trials == 0 (or
dimensions_seen is empty) and set the export payload's available flag to False
(or return an appropriate "no data" response) so the /api/v1/public/proof
payload does not claim availability when no valid judgments survive filtering.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 74acedb8-835e-4df0-aba8-385ab4ae02be

📥 Commits

Reviewing files that changed from the base of the PR and between 49a9714 and 8a84abd.

📒 Files selected for processing (2)

.gitignore
cloud/scripts/export_ab_proof.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: test (3.12)
GitHub Check: Python 3.12
GitHub Check: Cloudflare Pages

🔇 Additional comments (2)

cloud/scripts/export_ab_proof.py (1)

157-181: Best-arm selection and CI/sample-size coupling are now consistent.

The updated winner selection and per-model with_best_mean logic correctly avoid mixed-pool reporting artifacts.

Also applies to: 194-203

.gitignore (1)

157-158: Current .gitignore patterns already correctly unignore nested files.

Verification shows cloud/scripts/export_ab_proof.py is not ignored under the current patterns. The unignore pattern !cloud/scripts/ automatically applies to all descendants; no additional pattern is needed. Remove the suggested fix.

> Likely an incorrect or invalid review comment.

fix(proof): add missing export_ab_proof.py script (forgotten in PR #44)

49a9714

greptile-apps Bot reviewed Apr 15, 2026


coderabbitai Bot added the bug Something isn't working label Apr 15, 2026

coderabbitai Bot requested changes Apr 15, 2026



Gradata added 2 commits April 14, 2026 21:54

Merge remote-tracking branch 'origin/main' into fix/missing-export-ab…

bdda6e2

…-proof-script

greptile-apps Bot reviewed Apr 15, 2026


coderabbitai Bot reviewed Apr 15, 2026


coderabbitai Bot approved these changes Apr 15, 2026


Gradata merged commit 13d9231 into main Apr 15, 2026
11 checks passed

Gradata deleted the fix/missing-export-ab-proof-script branch April 15, 2026 07:55
