Add average AGI, unusualness, and Kish ESS to quality report and write a per-area CSV for every area on every run#509

Merged
donboyd5 merged 1 commit into master from improve-quality-report on Apr 25, 2026

donboyd5 (Collaborator) commented Apr 25, 2026

Summary

Adds three per-area metrics to tmd/areas/quality_report.py and writes a per-area CSV for every area on every run.

The PER-AREA DETAIL table gains three columns at the right:

  • AvgAGI — weighted AGI per return, in $K (sum(w * c00100) / sum(w)).
  • Unusual% — a single number per area summarizing how far the area's targets sit from a straight population-proportional allocation of the corresponding national totals. For each target row in <area>_targets.csv, the national total is computed by applying the target's recipe (varname, count, scope, agi range, fstatus) to the full TMD with s006 weighting; we then take the mean of |area_target - pop_share * national| / |pop_share * national| across rows. The XTOT population row is excluded because pop_share * national_pop = area_pop by construction. National totals are cached across areas, so this is cheap. An area that looked exactly like the nation would score 0%; for the current Congress 118 CD weights, NY-13 ≈ 45% (closest to typical) and NY-12 ≈ 256% (farthest).
  • ESS — Kish effective sample size, (sum w)^2 / sum(w^2), computed on the area weight vector. Lower ESS means the optimizer had to push weights further from population-proportional.
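The three formulas above can be sketched in plain Python. This is a minimal illustration, not the code in quality_report.py: the function names are hypothetical, the AGI inputs are assumed to be in dollars (hence the division by 1000 to report $K), and the target lists are assumed to have the XTOT row already removed.

```python
from typing import Sequence


def avg_agi_thousands(weights: Sequence[float], agi: Sequence[float]) -> float:
    """Weighted AGI per return, sum(w * agi) / sum(w), reported in $K.

    Assumes agi is in dollars (an assumption of this sketch).
    """
    total_w = sum(weights)
    return sum(w * a for w, a in zip(weights, agi)) / total_w / 1000.0


def kish_ess(weights: Sequence[float]) -> float:
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total_w = sum(weights)
    return total_w * total_w / sum(w * w for w in weights)


def unusual_pct(
    area_targets: Sequence[float],
    national_totals: Sequence[float],
    pop_share: float,
) -> float:
    """Mean of |target - pop_share * national| / |pop_share * national|, in percent.

    Assumes the XTOT population row has already been dropped by the caller.
    """
    deviations = []
    for target, national in zip(area_targets, national_totals):
        expected = pop_share * national
        deviations.append(abs(target - expected) / abs(expected))
    return 100.0 * sum(deviations) / len(deviations)
```

Equal weights give an ESS equal to the number of records, and an area whose targets all equal pop_share times the national total scores 0% on unusualness, which matches the descriptions above.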

The detail table still shows all areas for state scope and the top-20-by-violation subset for CDs/counties. To support further analysis of all areas (not just the displayed subset), every run also writes a CSV at <weight_dir>/quality_report_per_area.csv with one row per area and the full set of columns including the new metrics. The CSV path is printed at the bottom of the report.
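As a rough sketch of the kind of follow-up analysis the CSV enables, the snippet below loads per-area rows and ranks them by one metric. The column names (`area`, `unusual_pct`) are assumptions made for illustration; the actual header written by quality_report.py may differ. An in-memory sample stands in for the real file, using the two approximate Unusual% values quoted in this PR.

```python
import csv
import io


def load_per_area(fileobj):
    """Read a per-area CSV into a list of dicts, one per area."""
    return list(csv.DictReader(fileobj))


def rank_by(rows, column, descending=True):
    """Sort rows by a numeric column (column names are hypothetical here)."""
    return sorted(rows, key=lambda r: float(r[column]), reverse=descending)


# Stand-in for open("<weight_dir>/quality_report_per_area.csv", newline="").
# Header names are assumed; values are the approximate figures from this PR.
sample = io.StringIO("area,unusual_pct\nNY13,45\nNY12,256\n")
rows = load_per_area(sample)
most_unusual = rank_by(rows, "unusual_pct")[0]["area"]
print(most_unusual)  # prints NY12
```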

For Congress 118 (n=436 solved CDs), the new metrics correlate as expected: unusualness vs. |avg_agi/median - 1| Pearson r ≈ 0.70; unusualness vs. ESS r ≈ −0.57; |avg_agi/median - 1| vs. ESS r ≈ −0.58. More-unusual areas tend to have more atypical average AGI and lower effective sample size, which matches intuition.
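One way to reproduce correlations like these from per-area values is sketched below. The helper names are hypothetical, and the sketch assumes that "median" in |avg_agi/median - 1| means the median of average AGI across areas, which the description does not state explicitly.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def agi_deviation(avg_agis):
    """|avg_agi / median - 1| per area.

    Assumes the median is taken across areas (an assumption of this sketch).
    """
    med = statistics.median(avg_agis)
    return [abs(a / med - 1.0) for a in avg_agis]
```

With per-area lists of unusualness, average AGI, and ESS in hand, `pearson_r(unusual, agi_deviation(avg_agi))` and `pearson_r(unusual, ess)` would give the first two pairings reported above.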

No change to weight-solving, target construction, or any other pipeline output — this PR only affects what quality_report.py reads, prints, and writes.

Test plan

  • python -m tmd.areas.quality_report --scope states runs cleanly, table shows the three new columns, and tmd/areas/weights/states/quality_report_per_area.csv is written with 51 area rows.
  • python -m tmd.areas.quality_report --scope cds --congress 118 runs cleanly, and tmd/areas/weights/cds_118/quality_report_per_area.csv is written with 436 area rows.
  • python -m tmd.areas.quality_report --scope cds --congress 119 runs cleanly, and tmd/areas/weights/cds_119/quality_report_per_area.csv is written with 436 area rows.
  • python -m tmd.areas.quality_report --scope NY12,NY13 --congress 118 shows AvgAGI, Unusual%, and ESS for both CDs.
  • make format && make lint both succeed.

Per-area quality_report now reports three additional metrics for each
area, in both the per-area detail table and a new CSV that always
contains every area:

- AvgAGI: weighted AGI per return, in $K
- Unusual%: mean across the area's targets of
    |area_target - pop_share * national_total|
    / |pop_share * national_total|
  where the national total is computed by applying the target's recipe
  (varname, count, scope, agi range, fstatus) to the full TMD with s006
  weighting. National totals are cached across areas. The XTOT
  population row is excluded because pop_share * national_pop = area_pop
  by construction.
- ESS: Kish effective sample size, (sum w)^2 / sum(w^2), computed on the
  area weight vector.

The detail table previously showed all areas (states) or only the top
20 by violations / weight distortion (CDs, counties). It still does,
but the report now also writes a per-area CSV at
<weight_dir>/quality_report_per_area.csv with all areas and the new
metrics, and prints the path at the bottom of the report so the data
is available for further analysis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@donboyd5 donboyd5 merged commit af9dfb7 into master Apr 25, 2026
1 check passed
@donboyd5 donboyd5 deleted the improve-quality-report branch April 25, 2026 10:47