fix(proof): add missing export_ab_proof.py script #48

Merged
Gradata merged 3 commits into main from fix/missing-export-ab-proof-script
Apr 15, 2026

Conversation

@Gradata

@Gradata Gradata commented Apr 15, 2026

Owner

Recovers the ablation-proof export script that was referenced by test_proof.py but forgotten in PR #44. Fixes test_export_script_handles_empty_run_dir across several other PRs (#41, #36, #34).

Summary

  • Adds cloud/scripts/export_ab_proof.py — aggregates blind-judge JSONL judgments from .tmp/rule-ablation-v2/judgments/*.jsonl into cloud/data/proof_results.json consumed by GET /api/v1/public/proof.
  • Computes per-dimension means, 95% CIs (normal approx), deltas vs baseline, and per-model breakdowns matching the schema in cloud/app/routes/proof.py and ABProofPanel.tsx.
  • Empty / missing run dir => load_judgments returns {} and the export writes an honest available: false payload (no fabrication).
  • CLI: --run-dir, --out, --source, --dry-run.
  • Updates .gitignore to allowlist cloud/scripts/ (the broad scripts/ ignore was masking it, which was the original cause of the omission in PR #44).
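
A minimal sketch of the statistics named in the summary (mean, 95% CI by normal approximation, delta vs baseline). This is not the script's code: the helper names, the use of the sample standard deviation, the zero-width interval for fewer than two samples, and the assumption that scores lie on a 0 to 1 scale (so a percentage-point delta is the difference times 100) are all illustrative assumptions.

```python
import math
import statistics


def mean(xs):
    """Arithmetic mean; 0.0 for an empty pool (assumed convention)."""
    return sum(xs) / len(xs) if xs else 0.0


def ci95(xs):
    """95% CI for the mean, normal approximation: mean +/- 1.96 * s / sqrt(n)."""
    n = len(xs)
    m = mean(xs)
    if n < 2:
        # Not enough samples to estimate spread; report a zero-width interval.
        return (m, m)
    half_width = 1.96 * statistics.stdev(xs) / math.sqrt(n)
    return (m - half_width, m + half_width)


def delta_pp(arm_mean, baseline_mean):
    """Difference vs baseline in percentage points (assumes a 0 to 1 scale)."""
    return (arm_mean - baseline_mean) * 100.0
```

With small samples the normal approximation understates uncertainty, so a t-based interval may be a better fit when a pool has only a handful of judgments.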

Test plan

  • pytest cloud/tests/test_proof.py::test_export_script_handles_empty_run_dir passes
  • Full cloud/tests/test_proof.py suite passes (6/6)

Generated with Gradata

@coderabbitai

coderabbitai Bot commented Apr 15, 2026

📝 Walkthrough
  • Added CLI script cloud/scripts/export_ab_proof.py to aggregate blind-judge JSONL judgments from .tmp/rule-ablation-v2/judgments/*.jsonl into cloud/data/proof_results.json for the public proof API/UI
  • Public functions provided by the script: load_judgments(), aggregate(), build_payload(), parse_args(), main()
  • Computes per-dimension baseline/with_rules/with_full means, selects best arm mean, computes 95% CI (normal approximation), delta in percentage points versus baseline, and per-model dimension breakdowns matching the proof API schema
  • Skips records with unknown conditions (only accepts base, rules, full) so trial/subject counts and reported arms stay consistent
  • Gracefully handles missing/empty/malformed run directories/files: load_judgments() returns {} and export emits available: false (no fabricated data); malformed JSON lines are skipped with warnings
  • CLI options: --run-dir, --out, --source, --dry-run (dry-run prints JSON to stdout; --out writes file, creating parent dirs as needed)
  • Updated .gitignore to un-ignore cloud/scripts/ (added !cloud/scripts/) so the scripts directory is tracked
  • Tests: cloud/tests/test_proof.py test_export_script_handles_empty_run_dir passes; full test file (6/6) exercises endpoint and export behavior
  • No breaking changes or security fixes; no new long-lived public API beyond the added export script and its functions
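
The loading behaviour described in this list (missing or empty run directory yields {}, malformed lines are skipped with a warning) could look roughly like the sketch below. Only the function name load_judgments and the {} return for empty input come from the PR; the return shape (file stem mapped to a list of records) and the warning format are assumptions.

```python
import json
import sys
from pathlib import Path


def load_judgments(run_dir):
    """Return {file stem: [records]}; {} when the directory is missing or empty."""
    root = Path(run_dir)
    if not root.is_dir():
        return {}
    judgments = {}
    for path in sorted(root.glob("*.jsonl")):
        records = []
        with path.open(encoding="utf-8") as fh:
            for lineno, line in enumerate(fh, start=1):
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    # Skip the bad line but keep going; warn on stderr.
                    print(f"warning: {path.name}:{lineno}: malformed JSON, skipped",
                          file=sys.stderr)
                    continue
                if isinstance(rec, dict):
                    records.append(rec)
        if records:
            judgments[path.stem] = records
    return judgments
```

Returning an empty dict rather than raising lets the caller decide to emit an available: false payload instead of failing the export.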

Walkthrough

Added a .gitignore negation to include cloud/scripts/ and introduced a new CLI cloud/scripts/export_ab_proof.py that loads JSONL judgment files, aggregates scores by condition/model/dimension, computes means, deltas and 95% CIs, and writes a standardized proof_results.json payload (or prints in dry-run).
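
As a rough illustration of the aggregation step, the sketch below groups scores by (condition, dimension) and by (model, condition, dimension), dropping records whose condition is not one of the three reported arms. The record field names follow the snippet quoted later in this review; the function name aggregate_scores and its return shape are hypothetical.

```python
from collections import defaultdict

ALLOWED_CONDITIONS = {"base", "rules", "full"}


def aggregate_scores(flat):
    """Group numeric scores by arm and dimension, skipping unusable records."""
    by_cond_dim = defaultdict(list)
    by_model_cond_dim = defaultdict(list)
    trials = 0
    for rec in flat:
        model = rec.get("model") or rec.get("subject")
        condition = rec.get("condition") or rec.get("variant")
        dimension = rec.get("dimension")
        score = rec.get("score")
        if not (model and condition and dimension):
            continue  # incomplete record
        if not isinstance(score, (int, float)):
            continue  # non-numeric score
        if condition not in ALLOWED_CONDITIONS:
            continue  # unknown arm: keep counts consistent with reported arms
        by_cond_dim[(condition, dimension)].append(float(score))
        by_model_cond_dim[(model, condition, dimension)].append(float(score))
        trials += 1
    return dict(by_cond_dim), dict(by_model_cond_dim), trials
```

Counting trials only after every filter has passed keeps the trial total aligned with the scores that actually feed the means.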

Changes

| Cohort / File(s) | Summary |
|---|---|
| Gitignore Exception: /.gitignore | Added !cloud/scripts/ negation to ensure cloud/scripts/ is tracked despite broader ignore patterns. |
| Proof Export Script: cloud/scripts/export_ab_proof.py | New CLI utility: scans a --run-dir for *.jsonl judgment files, skips malformed lines with warnings, aggregates scores by (condition, dimension) and (model, condition, dimension), computes baseline/arm means, selects best arm per dimension, computes 95% normal-approx CIs and delta_pp vs baseline, builds per_model summaries, and emits JSON to --out or stdout (--dry-run). Handles missing/empty directories and creates output parent directories. |

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant CLI as export_ab_proof.py
    participant FS as File System
    participant Out as Output JSON

    User->>CLI: run with --run-dir, --out/--dry-run, --source
    CLI->>FS: scan for *.jsonl under run-dir
    FS-->>CLI: list of files (or empty)
    loop per file
        CLI->>FS: read file
        FS-->>CLI: lines (JSONL)
        CLI->>CLI: parse lines (skip malformed, warn)
        CLI->>CLI: accumulate records by condition/model/dimension
    end
    CLI->>CLI: compute means, select best arm, compute 95% CI, delta_pp, per_model
    CLI->>Out: render standardized proof_results JSON
    alt dry-run
        Out-->>User: print JSON to stdout
    else write-out
        CLI->>FS: create parent dirs, write file
        FS-->>User: file written
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding the missing export_ab_proof.py script to fix the proof functionality. |
| Description check | ✅ Passed | The description provides a comprehensive overview of the changes, explaining what the script does, why it was needed, and includes test confirmation. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps greptile-apps Bot left a comment

Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Apr 15, 2026

Deploying gradata-dashboard with Cloudflare Pages

Latest commit: 8a84abd
Status: ✅  Deploy successful!
Preview URL: https://99012bac.gradata-dashboard.pages.dev
Branch Preview URL: https://fix-missing-export-ab-proof.gradata-dashboard.pages.dev

View logs

@Gradata

Gradata commented Apr 15, 2026

Owner Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 15, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@Gradata

Gradata commented Apr 15, 2026

Owner Author

@coderabbitai review — CR Pro active, please re-verify.

@coderabbitai

coderabbitai Bot commented Apr 15, 2026

@Gradata Sure! Triggering a fresh review of PR #48 now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot added the bug Something isn't working label Apr 15, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cloud/scripts/export_ab_proof.py`:
- Around line 124-133: The loop that builds by_cond_dim and by_model_cond_dim
accepts any condition but downstream logic only supports the conditions "base",
"rules", and "full"; update the iteration over flat (the block using model,
condition, dimension, score) to skip records whose condition is not in the
allowed set (e.g., allowed_conditions = {"base","rules","full"}) and only append
scores to by_cond_dim and by_model_cond_dim for those allowed conditions; also
propagate this filter to the other places that compute/zero dimensions and
trials (the code that zeros dimensions and the code that increments/aggregates
trials) so trials counts and zeroed-dimension logic only consider records with
supported conditions.
- Around line 172-181: The code picks best_pool = rules or full which prefers
rules even when full has a higher mean, causing with_best_mean and delta_pp to
be wrong; change the logic to compute the mean for rules and full separately
(use _mean safely when a pool exists), compare those means and set best_pool or
directly set with_best_mean to the higher mean (falling back to baseline_mean
when neither pool exists), then compute delta_pp from that selected
with_best_mean; update references in best_pool, with_best_mean, and delta_pp
calculations accordingly.
- Around line 149-163: best_mean, ci and n_with are currently computed from
mixed pools (ci_pool = rules or full or base and n_with = len(rules)+len(full))
which can detach CI and sample size from the arm that produced best_mean; update
the logic to pick the winning arm first (compare best_mean to with_rules_mean,
with_full_mean and baseline_mean) and set ci_pool and n_with from that chosen
arm: e.g., if best_mean == with_rules_mean -> ci_pool = rules, n_with =
len(rules); if best_mean == with_full_mean -> ci_pool = full, n_with =
len(full); if baseline is best -> ci_pool = base and n_with = 0 (or len(base) if
you must report baseline size), then call _ci95(ci_pool) and use those
ci_low/ci_high and n_with when appending the dim_payload entry so CI and n_with
match the selected best_mean.
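
The coupling these comments ask for (choose the winning arm first, then take the CI and sample size from that same arm) can be sketched as below. The names _mean, _ci95, ci_pool and n_with appear in the review; the function summarize_dimension, its return shape, and the tie-breaking by dictionary order are assumptions made for illustration. Reporting n_with = 0 when baseline wins follows the first option the review suggests.

```python
import math
import statistics


def _mean(xs):
    return sum(xs) / len(xs)


def _ci95(xs):
    """Normal-approximation 95% CI; zero-width when fewer than two samples."""
    m = _mean(xs)
    if len(xs) < 2:
        return (m, m)
    half = 1.96 * statistics.stdev(xs) / math.sqrt(len(xs))
    return (m - half, m + half)


def summarize_dimension(base, rules, full):
    """Pick the arm with the highest mean, then derive CI and n from that arm only."""
    arms = {"rules": rules, "full": full, "base": base}
    # Only arms that actually have scores can win.
    means = {name: _mean(pool) for name, pool in arms.items() if pool}
    if not means:
        return None
    winner = max(means, key=means.get)  # ties resolve by dict order (arbitrary)
    ci_pool = arms[winner]
    ci_low, ci_high = _ci95(ci_pool)
    return {
        "best_arm": winner,
        "best_mean": means[winner],
        "ci_low": ci_low,
        "ci_high": ci_high,
        "n_with": 0 if winner == "base" else len(ci_pool),
    }
```

Because every reported number is read from the one selected pool, the interval always brackets best_mean and n_with never mixes samples from two arms.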

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ef41abb7-4453-42fa-ad99-9df1d965424e

📥 Commits

Reviewing files that changed from the base of the PR and between d44468b and 49a9714.

📒 Files selected for processing (2)
  • .gitignore
  • cloud/scripts/export_ab_proof.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Cloudflare Pages
🔇 Additional comments (1)
.gitignore (1)

151-151: Allowlist override looks correct.

Line 151 correctly re-includes cloud/scripts/ so the exporter script can be tracked even with scripts/ ignored.

Comment thread cloud/scripts/export_ab_proof.py
Comment thread cloud/scripts/export_ab_proof.py Outdated
Comment thread cloud/scripts/export_ab_proof.py Outdated
Gradata added 2 commits April 14, 2026 21:54
…arm selection

- Skip records whose condition is not in {base, rules, full} so trials, subjects,
  and dimension zeroing stay consistent with the three reported arms.
- dim_payload: pick the arm with the highest mean and lock ci_pool + n_with to
  that arm so reported CI and sample size match best_mean instead of drifting
  from a mixed rules+full pool.
- per_model: pick max(rules_mean, full_mean) for with_best_mean instead of
  truthy-OR'ing pools, which silently preferred rules even when full was higher.

CR review: #48 (review)

@greptile-apps greptile-apps Bot left a comment

Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

@Gradata

Gradata commented Apr 15, 2026

Owner Author

Round-2 CR addressed + rebased on main:

  • Unknown conditions filtered — records outside {base, rules, full} are now dropped in the aggregation loop, so trials, subjects, and dimension zeroing stay consistent with the three reported arms.
  • dim_payload best-arm selection — pick arm with highest mean first, then lock ci_pool and n_with to that same arm (rules/full/base) so CI and sample size match best_mean instead of drifting from a mixed rules+full pool.
  • per_model best_pool — replaced rules or full truthy-OR with max(rules_mean, full_mean) so with_best_mean doesn't silently under-report when full outperforms rules.
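
The truthy-OR pitfall in the last bullet is easy to reproduce in isolation. In the sketch below the function names and sample scores are made up for illustration; only the `rules or full` pattern and the max-of-means replacement come from the PR discussion.

```python
def mean(xs):
    return sum(xs) / len(xs)


def with_best_mean_or(rules, full, baseline_mean):
    """Flagged pattern: `rules or full` picks rules whenever it is non-empty."""
    best_pool = rules or full
    return mean(best_pool) if best_pool else baseline_mean


def with_best_mean_max(rules, full, baseline_mean):
    """Compare the arm means directly; fall back to baseline if both are empty."""
    means = [mean(pool) for pool in (rules, full) if pool]
    return max(means) if means else baseline_mean


rules = [0.5, 0.5]
full = [0.75, 0.75]
print(with_best_mean_or(rules, full, 0.25))   # 0.5  (full is ignored)
print(with_best_mean_max(rules, full, 0.25))  # 0.75
```

A non-empty list is always truthy in Python, so `rules or full` never looks at the values inside either pool.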

pytest cloud/tests/ -x: 121 passed. Rebased on origin/main (clean merge).

@Gradata

Gradata commented Apr 15, 2026

Owner Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 15, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
cloud/scripts/export_ab_proof.py (1)

124-148: ⚠️ Potential issue | 🟠 Major

Return unavailable when no valid judgments survive filtering.

If input files exist but every record is malformed/unsupported, the payload currently reports available: true with empty metrics. This is a misleading success state for /api/v1/public/proof.

💡 Proposed fix
     for rec in flat:
         model = rec.get("model") or rec.get("subject")
         condition = rec.get("condition") or rec.get("variant")
         dimension = rec.get("dimension")
         score = rec.get("score")
         if not (model and condition and dimension) or not isinstance(score, (int, float)):
             continue
@@
         if rec.get("task_id"):
             task_ids.add(rec["task_id"])
         trials += 1

+    if trials == 0:
+        return {
+            "available": False,
+            "source": None,
+            "subjects": [],
+            "judge": None,
+            "trials": 0,
+            "dimensions": [],
+            "per_model": [],
+            "updated_at": datetime.now(timezone.utc).isoformat(),
+            "reason": "no valid judgments found",
+        }
+
     dimensions_seen = sorted({d for (_c, d) in by_cond_dim})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cloud/scripts/export_ab_proof.py` around lines 124 - 148, The loop collects
valid judgments into by_cond_dim/by_model_cond_dim and tracks trials; if all
input records were filtered out the code still reports available: true. After
the aggregation loop (using variables trials, by_cond_dim, dimensions_seen),
check if trials == 0 (or dimensions_seen is empty) and set the export payload's
available flag to False (or return an appropriate "no data" response) so the
/api/v1/public/proof payload does not claim availability when no valid judgments
survive filtering.
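
One way to exercise the suggested guard on its own is sketched below. The payload keys mirror the proposed fix above; the function names unavailable_payload and finalize_payload and the trimmed-down success payload are hypothetical and do not reflect the script's real build_payload signature.

```python
from datetime import datetime, timezone


def unavailable_payload(reason):
    """Payload for the no-data case, with the keys used in the proposed fix."""
    return {
        "available": False,
        "source": None,
        "subjects": [],
        "judge": None,
        "trials": 0,
        "dimensions": [],
        "per_model": [],
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }


def finalize_payload(trials, dimensions, source):
    """Refuse to claim availability when no valid judgment survived filtering."""
    if trials == 0 or not dimensions:
        return unavailable_payload("no valid judgments found")
    # Simplified success payload; the real schema has more fields.
    return {
        "available": True,
        "source": source,
        "trials": trials,
        "dimensions": dimensions,
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
```

Checking both the trial count and the dimension list covers the case where files exist but every record was malformed or used an unsupported condition.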

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 74acedb8-835e-4df0-aba8-385ab4ae02be

📥 Commits

Reviewing files that changed from the base of the PR and between 49a9714 and 8a84abd.

📒 Files selected for processing (2)
  • .gitignore
  • cloud/scripts/export_ab_proof.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (3.12)
  • GitHub Check: Python 3.12
  • GitHub Check: Cloudflare Pages
🔇 Additional comments (2)
cloud/scripts/export_ab_proof.py (1)

157-181: Best-arm selection and CI/sample-size coupling are now consistent.

The updated winner selection and per-model with_best_mean logic correctly avoid mixed-pool reporting artifacts.

Also applies to: 194-203

.gitignore (1)

157-158: Current .gitignore patterns already correctly unignore nested files.

Verification shows cloud/scripts/export_ab_proof.py is not ignored under the current patterns. The unignore pattern !cloud/scripts/ automatically applies to all descendants; no additional pattern is needed. Remove the suggested fix.

> Likely an incorrect or invalid review comment.

@Gradata Gradata merged commit 13d9231 into main Apr 15, 2026
11 checks passed
@Gradata Gradata deleted the fix/missing-export-ab-proof-script branch April 15, 2026 07:55

Labels

bug Something isn't working

1 participant