Add average AGI, unusualness, and Kish ESS to quality report and write a per-area CSV for every area on every run by donboyd5 · Pull Request #509 · PSLmodels/tax-microdata-benchmarking

donboyd5 · 2026-04-25T10:47:27Z

Summary

Adds three per-area metrics to tmd/areas/quality_report.py and writes a per-area CSV for every area on every run.

The PER-AREA DETAIL table gains three columns at the right:

AvgAGI — weighted AGI per return, in $K (sum(w * c00100) / sum(w)).
Unusual% — a single number per area summarizing how far the area's targets sit from a straight population-proportional allocation of the corresponding national totals. For each target row in <area>_targets.csv, the national total is computed by applying the target's recipe (varname, count, scope, agi range, fstatus) to the full TMD with s006 weighting; we then take the mean of |area_target - pop_share * national| / |pop_share * national| across rows. The XTOT population row is excluded because pop_share * national_pop = area_pop by construction. National totals are cached across areas, so this is cheap. An area that looked exactly like the nation would score 0%; for the current Congress 118 CD weights, NY-13 ≈ 45% (closest to typical) and NY-12 ≈ 256% (farthest).
ESS — Kish effective sample size, (sum w)^2 / sum(w^2), computed on the area weight vector. Lower ESS means the optimizer had to push weights further from population-proportional.
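The three formulas above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code in quality_report.py: the function names are hypothetical, AGI is assumed to be in dollars (hence the division by 1,000), and the national totals are assumed to be already computed and passed in, with the XTOT row already dropped.

```python
def avg_agi_thousands(weights, agi):
    """Weighted AGI per return in $K: sum(w * agi) / sum(w) / 1000.

    Assumes agi is in dollars (hypothetical helper, not the PR's code).
    """
    total_w = sum(weights)
    return sum(w * a for w, a in zip(weights, agi)) / total_w / 1000.0


def kish_ess(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def unusualness_pct(area_targets, national_totals, pop_share):
    """Mean relative deviation from a population-proportional allocation.

    area_targets and national_totals are parallel sequences, one entry per
    target row; the caller is assumed to have excluded the XTOT row.
    """
    deviations = []
    for target, national in zip(area_targets, national_totals):
        expected = pop_share * national
        deviations.append(abs(target - expected) / abs(expected))
    return 100.0 * sum(deviations) / len(deviations)
```

With equal weights the ESS equals the number of records, and it falls toward 1 as the weight concentrates on a few records, which is why lower ESS signals weights pushed further from population-proportional.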

The detail table still shows all areas for state scope and the top-20-by-violation subset for CDs/counties. To support further analysis of all areas (not just the displayed subset), every run also writes a CSV at <weight_dir>/quality_report_per_area.csv with one row per area and the full set of columns including the new metrics. The CSV path is printed at the bottom of the report.
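A minimal sketch of the CSV step, assuming each area's metrics are held in a dict. The function name and the column names in the usage note are hypothetical; only the file name quality_report_per_area.csv comes from this PR.

```python
import csv
from pathlib import Path


def write_per_area_csv(weight_dir, rows):
    """Write one CSV row per area and return the path that was written.

    rows is a non-empty list of dicts sharing the same keys
    (hypothetical layout, not necessarily the PR's column set).
    """
    path = Path(weight_dir) / "quality_report_per_area.csv"
    fieldnames = list(rows[0].keys())
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

A caller might pass rows such as `{"area": "ny12", "avg_agi": 250.0, "unusual_pct": 256.0, "ess": 1200.0}` (illustrative values and keys) and print the returned path at the end of the report.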

For Congress 118 (n=436 solved CDs), the new metrics correlate as expected: unusualness vs. |avg_agi/median - 1| Pearson r ≈ 0.70; unusualness vs. ESS r ≈ −0.57; |avg_agi/median - 1| vs. ESS r ≈ −0.58. More-unusual areas tend to have more atypical average AGI and lower effective sample size, which matches intuition.
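The correlations above could be reproduced from the per-area CSV along these lines. This is a sketch under an assumption: "median" is read here as the median of average AGI across areas, which the text does not state explicitly. Both function names are hypothetical.

```python
import math
import statistics


def agi_atypicality(avg_agi):
    """|avg_agi / median - 1| per area, with the median taken across areas
    (an assumed reading of the PR text)."""
    med = statistics.median(avg_agi)
    return [abs(a / med - 1.0) for a in avg_agi]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Given per-area lists of unusualness, average AGI, and ESS, `pearson_r(unusual, agi_atypicality(avg_agi))` would correspond to the first coefficient reported above.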

No change to weight-solving, target construction, or any other pipeline output — this PR only affects what quality_report.py reads, prints, and writes.

Test plan

python -m tmd.areas.quality_report --scope states runs cleanly, table shows the three new columns, and tmd/areas/weights/states/quality_report_per_area.csv is written with 51 area rows.
python -m tmd.areas.quality_report --scope cds --congress 118 runs cleanly, and tmd/areas/weights/cds_118/quality_report_per_area.csv is written with 436 area rows.
python -m tmd.areas.quality_report --scope cds --congress 119 runs cleanly, and tmd/areas/weights/cds_119/quality_report_per_area.csv is written with 436 area rows.
python -m tmd.areas.quality_report --scope NY12,NY13 --congress 118 shows AvgAGI, Unusual%, and ESS for both CDs.
make format && make lint both succeed.
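The row-count checks in the test plan could be scripted with a small helper like the one below. It is a hypothetical convenience, not part of the PR, and assumes the CSV has a single header line followed by one row per area.

```python
import csv


def count_area_rows(csv_path):
    """Count data rows in a per-area CSV, excluding the header line."""
    with open(csv_path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))
```

For example, `count_area_rows("tmd/areas/weights/states/quality_report_per_area.csv")` would be expected to return 51 after the states run described above.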

Per-area quality_report now reports three additional metrics for each area, in both the per-area detail table and a new CSV that always contains every area:

- AvgAGI: weighted AGI per return, in $K
- Unusual%: mean across the area's targets of |area_target - pop_share * national_total| / |pop_share * national_total| where the national total is computed by applying the target's recipe (varname, count, scope, agi range, fstatus) to the full TMD with s006 weighting. National totals are cached across areas. The XTOT population row is excluded because pop_share * national_pop = area_pop by construction.
- ESS: Kish effective sample size, (sum w)^2 / sum(w^2), computed on the area weight vector.

The detail table previously showed all areas (states) or only the top 20 by violations / weight distortion (CDs, counties). It still does, but the report now also writes a per-area CSV at <weight_dir>/quality_report_per_area.csv with all areas and the new metrics, and prints the path at the bottom of the report so the data is available for further analysis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

donboyd5 merged commit af9dfb7 into master Apr 25, 2026
1 check passed

donboyd5 deleted the improve-quality-report branch April 25, 2026 10:47

This was referenced Apr 25, 2026

Add effective sample size (Kish ESS) and measure of unusualness to the area quality report #505

Closed

Project Overview: Update TMD national and area data #381

Closed

donboyd5 merged 1 commit into master from improve-quality-report
