fix(proof): add missing export_ab_proof.py script#48
Conversation
📝 Walkthrough
WalkthroughAdded a Changes
Sequence Diagram(s)sequenceDiagram
participant User as User
participant CLI as export_ab_proof.py
participant FS as File System
participant Out as Output JSON
User->>CLI: run with --run-dir, --out/--dry-run, --source
CLI->>FS: scan for *.jsonl under run-dir
FS-->>CLI: list of files (or empty)
loop per file
CLI->>FS: read file
FS-->>CLI: lines (JSONL)
CLI->>CLI: parse lines (skip malformed, warn)
CLI->>CLI: accumulate records by condition/model/dimension
end
CLI->>CLI: compute means, select best arm, compute 95% CI, delta_pp, per_model
CLI->>Out: render standardized proof_results JSON
alt dry-run
Out-->>User: print JSON to stdout
else write-out
CLI->>FS: create parent dirs, write file
FS-->>User: file written
end
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.
Deploying gradata-dashboard with
|
| Latest commit: |
8a84abd
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://99012bac.gradata-dashboard.pages.dev |
| Branch Preview URL: | https://fix-missing-export-ab-proof.gradata-dashboard.pages.dev |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
@coderabbitai review — CR Pro active, please re-verify. |
|
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cloud/scripts/export_ab_proof.py`:
- Around line 124-133: The loop that builds by_cond_dim and by_model_cond_dim
accepts any condition but downstream logic only supports the conditions "base",
"rules", and "full"; update the iteration over flat (the block using model,
condition, dimension, score) to skip records whose condition is not in the
allowed set (e.g., allowed_conditions = {"base","rules","full"}) and only append
scores to by_cond_dim and by_model_cond_dim for those allowed conditions; also
propagate this filter to the other places that compute/zero dimensions and
trials (the code that zeros dimensions and the code that increments/aggregates
trials) so trials counts and zeroed-dimension logic only consider records with
supported conditions.
- Around line 172-181: The code picks best_pool = rules or full which prefers
rules even when full has a higher mean, causing with_best_mean and delta_pp to
be wrong; change the logic to compute the mean for rules and full separately
(use _mean safely when a pool exists), compare those means and set best_pool or
directly set with_best_mean to the higher mean (falling back to baseline_mean
when neither pool exists), then compute delta_pp from that selected
with_best_mean; update references in best_pool, with_best_mean, and delta_pp
calculations accordingly.
- Around line 149-163: best_mean, ci and n_with are currently computed from
mixed pools (ci_pool = rules or full or base and n_with = len(rules)+len(full))
which can detach CI and sample size from the arm that produced best_mean; update
the logic to pick the winning arm first (compare best_mean to with_rules_mean,
with_full_mean and baseline_mean) and set ci_pool and n_with from that chosen
arm: e.g., if best_mean == with_rules_mean -> ci_pool = rules, n_with =
len(rules); if best_mean == with_full_mean -> ci_pool = full, n_with =
len(full); if baseline is best -> ci_pool = base and n_with = 0 (or len(base) if
you must report baseline size), then call _ci95(ci_pool) and use those
ci_low/ci_high and n_with when appending the dim_payload entry so CI and n_with
match the selected best_mean.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: ef41abb7-4453-42fa-ad99-9df1d965424e
📒 Files selected for processing (2)
.gitignorecloud/scripts/export_ab_proof.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cloudflare Pages
🔇 Additional comments (1)
.gitignore (1)
151-151: Allowlist override looks correct.Line 151 correctly re-includes
cloud/scripts/so the exporter script can be tracked even withscripts/ignored.
…arm selection
- Skip records whose condition is not in {base, rules, full} so trials, subjects,
and dimension zeroing stay consistent with the three reported arms.
- dim_payload: pick the arm with the highest mean and lock ci_pool + n_with to
that arm so reported CI and sample size match best_mean instead of drifting
from a mixed rules+full pool.
- per_model: pick max(rules_mean, full_mean) for with_best_mean instead of
truthy-OR'ing pools, which silently preferred rules even when full was higher.
CR review: #48 (review)
There was a problem hiding this comment.
Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.
|
Round-2 CR addressed + rebased on main:
|
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
♻️ Duplicate comments (1)
cloud/scripts/export_ab_proof.py (1)
124-148:⚠️ Potential issue | 🟠 MajorReturn unavailable when no valid judgments survive filtering.
If input files exist but every record is malformed/unsupported, the payload currently reports
available: truewith empty metrics. This is a misleading success state for/api/v1/public/proof.💡 Proposed fix
for rec in flat: model = rec.get("model") or rec.get("subject") condition = rec.get("condition") or rec.get("variant") dimension = rec.get("dimension") score = rec.get("score") if not (model and condition and dimension) or not isinstance(score, (int, float)): continue @@ if rec.get("task_id"): task_ids.add(rec["task_id"]) trials += 1 + if trials == 0: + return { + "available": False, + "source": None, + "subjects": [], + "judge": None, + "trials": 0, + "dimensions": [], + "per_model": [], + "updated_at": datetime.now(timezone.utc).isoformat(), + "reason": "no valid judgments found", + } + dimensions_seen = sorted({d for (_c, d) in by_cond_dim})🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cloud/scripts/export_ab_proof.py` around lines 124 - 148, The loop collects valid judgments into by_cond_dim/by_model_cond_dim and tracks trials; if all input records were filtered out the code still reports available: true. After the aggregation loop (using variables trials, by_cond_dim, dimensions_seen), check if trials == 0 (or dimensions_seen is empty) and set the export payload's available flag to False (or return an appropriate "no data" response) so the /api/v1/public/proof payload does not claim availability when no valid judgments survive filtering.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@cloud/scripts/export_ab_proof.py`:
- Around line 124-148: The loop collects valid judgments into
by_cond_dim/by_model_cond_dim and tracks trials; if all input records were
filtered out the code still reports available: true. After the aggregation loop
(using variables trials, by_cond_dim, dimensions_seen), check if trials == 0 (or
dimensions_seen is empty) and set the export payload's available flag to False
(or return an appropriate "no data" response) so the /api/v1/public/proof
payload does not claim availability when no valid judgments survive filtering.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 74acedb8-835e-4df0-aba8-385ab4ae02be
📒 Files selected for processing (2)
.gitignorecloud/scripts/export_ab_proof.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (3.12)
- GitHub Check: Python 3.12
- GitHub Check: Cloudflare Pages
🔇 Additional comments (2)
cloud/scripts/export_ab_proof.py (1)
157-181: Best-arm selection and CI/sample-size coupling are now consistent.The updated winner selection and per-model
with_best_meanlogic correctly avoid mixed-pool reporting artifacts.Also applies to: 194-203
.gitignore (1)
157-158: Current.gitignorepatterns already correctly unignore nested files.Verification shows
cloud/scripts/export_ab_proof.pyis not ignored under the current patterns. The unignore pattern!cloud/scripts/automatically applies to all descendants; no additional pattern is needed. Remove the suggested fix.> Likely an incorrect or invalid review comment.
Recovers the ablation-proof export script that was referenced by test_proof.py but forgotten in PR #44. Fixes test_export_script_handles_empty_run_dir across several other PRs (#41, #36, #34).
Summary
Test plan
Generated with Gradata