docs(restartContext): update with 5k/5 default, 30% outliers, and leverage-outlier plan

- Record new cross-platform results (Mac vs Linux) and observation that response-only outliers do not slow runtime
- Add action plan: leverage-outlier generator + refit counters + multi-target cost check
- Keep PR target; align benchmarks and docs with 5k/5 default
miranov25 committed Oct 22, 2025
commit bb51bc03ff637c29dc5789baa90270e7e62d8830
133 changes: 133 additions & 0 deletions UTILS/dfextensions/restartContext_groupby_regression.md
@@ -0,0 +1,133 @@
Short answer: **yes — a small update to `restartContext_groupby_regression.md` now will help** keep everyone aligned. We learned that response-only outliers don’t reproduce the slowdown; we set a new 5k/5 default; and we added a 30% scenario. That’s enough to warrant a “delta” refresh so Claude/others don’t chase the wrong thing.

Here’s a **drop-in replacement** you can paste over the current file (keeps the same structure, updated content). If you prefer to append, you can paste the “What changed” + “Action plan” sections at the top.

---

# Restart Context: GroupBy Regression Benchmarking & Documentation

**Date:** October 22, 2025
**Project:** dfextensions (ALICE O2 Physics)
**Focus:** `groupby_regression.py` — benchmarking & performance docs
**Target:** PR by Oct 25, 2025
**Collaborators:** GPT (Primary Coder), Claude (Reviewer), User (Approval)

---

## Executive Summary (updated)

Benchmarks on synthetic data show that **response-only outliers (shift in y)** do **not** slow down the OLS/robust path; runtime remains essentially unchanged even at **30% contamination**. Both Mac and Linux show similar **scaling** (Linux ≈2–2.5× slower wall time per 1k groups due to platform factors).
The **real-data 25× slowdown** is therefore likely due to **sigmaCut-triggered robust re-fits driven by leverage outliers in X** and/or multi-target fits (e.g., `dX,dY,dZ`) compounding the cost.

**New default benchmark:** **5,000 groups × 5 rows/group** (fast, representative).
**New scenarios:** include **30% outliers (5σ)** to demonstrate stability of response-only contamination.

---

## What changed since last update

* **Benchmark defaults:** `--rows-per-group 5 --groups 5000` adopted for docs & CI-friendly runs.
* **Scenarios:** Added **30% outliers (5σ)** in serial + parallel.
* **Findings:**

  * Mac (per 1k groups): serial ~**1.69 s**, parallel(10) ~**0.50 s**.
  * Linux (per 1k groups): serial ~**4.14 s**, parallel(10) ~**0.98 s**.
  * 5–30% response outliers: **no runtime increase** vs clean.
* **Conclusion:** Synthetic setup doesn’t trigger the **re-fit loop**; real data likely has **leverage** characteristics or different fitter path.
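
As a quick sanity check, the speed-up and platform ratios implied by the timings quoted above can be recomputed directly. This is only arithmetic on the four numbers in the findings; the dictionary layout and helper names are illustrative and not part of the benchmark script.

```python
# Per-1k-group wall times quoted in the findings above (seconds).
times = {
    "Mac": {"serial": 1.69, "parallel": 0.50},
    "Linux": {"serial": 4.14, "parallel": 0.98},
}

def speedup(platform):
    """Serial time divided by parallel(10) time for one platform."""
    t = times[platform]
    return t["serial"] / t["parallel"]

def platform_ratio(mode):
    """Linux time divided by Mac time for the given mode."""
    return times["Linux"][mode] / times["Mac"][mode]

for platform in times:
    print(f"{platform}: parallel(10) speed-up = {speedup(platform):.2f}x")
for mode in ("serial", "parallel"):
    print(f"Linux/Mac ({mode}) = {platform_ratio(mode):.2f}x")
```

The parallel speed-up comes out at roughly 3.4× on Mac and 4.2× on Linux, and the Linux/Mac ratio at roughly 2.4× (serial) and 2.0× (parallel).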

---

## Problem Statement (refined)

Observed **~25× slowdown** on real datasets when using `sigmaCut` robust filtering. Root cause is presumed **iterative re-fitting per group** when the mask updates (MAD-based) repeatedly exclude many points — common under **leverage outliers in X** or mixed contamination (X & y). Multi-target fitting (e.g., 3 targets) likely multiplies cost.
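
To make the presumed mechanism concrete, here is a minimal sketch of an iterative MAD-based sigma-cut fit that counts its own re-fits. It is not the code in `groupby_regression.py`: the function name `robust_fit_with_counter`, its signature, the `max_refits` default, and the stopping rules are assumptions chosen for illustration.

```python
import numpy as np

def robust_fit_with_counter(X, y, sigma_cut=5.0, max_refits=10):
    """Hypothetical sketch: OLS with iterative MAD-based rejection.

    Returns (coef, mask, n_refits). coef[0] is the intercept.
    """
    A = np.column_stack([np.ones(len(y)), X])      # design matrix with intercept
    mask = np.ones(len(y), dtype=bool)             # rows currently kept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # initial fit on all rows
    n_refits = 0
    for _ in range(max_refits):
        resid = y - A @ coef
        med = np.median(resid[mask])
        mad = np.median(np.abs(resid[mask] - med))
        sigma = 1.4826 * mad                       # MAD -> Gaussian sigma estimate
        if sigma == 0:
            break
        new_mask = np.abs(resid - med) <= sigma_cut * sigma
        # Stop when the mask is stable or too few rows would remain.
        if new_mask.sum() < A.shape[1] + 1 or np.array_equal(new_mask, mask):
            break
        mask = new_mask
        coef, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        n_refits += 1                              # one extra fit per mask update
    return coef, mask, n_refits
```

Each mask update costs one additional least-squares solve, so a group whose mask keeps changing pays for several fits instead of one. Returning `n_refits` per group is one way to test whether that is what happens on real data.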

---

## Cross-Platform Note

Linux runs are **~2–2.5×** slower in absolute time than Mac. With 10 workers, the parallel speed-up per 1k groups is **~3.4×** on Mac (1.69 s → 0.50 s) and **~4.2×** on Linux (4.14 s → 0.98 s). Differences are attributed to CPU/BLAS/spawn model (Apptainer), not algorithmic changes.

---

## Action Plan (next 48h)

1. **Add leverage-outlier generator** to benchmarks

   * API: `create_data_with_outliers(..., mode="response|leverage|both", x_mag=8.0)`
   * Goal: Reproduce sigmaCut re-fit slow path (target 10–25×).
2. **Instrument the fitter**

   * Add counters in `process_group_robust()`:

     * `n_refits`, `mask_fraction`, and per-group timings.
   * Emit aggregated stats in `dfGB` (or a side JSON) for diagnostics.
3. **Multi-target cost check**

   * Run with `fit_columns=['dX']`, then `['dX','dY','dZ']` to quantify multiplicative cost.
4. **Config toggles for mitigation** (document in perf section)

   * `sigmaCut=100` (disable re-fit) as a “fast path” when upstream filtering is trusted.
   * Optional `max_refits` (cap iterations), log a warning when hit.
   * Consider `fitter='huber'` fast-path if available.
5. **Finalize docs**

   * Keep 5k/5 as **doc default**; show Mac+Linux tables.
   * Add a **“Stress Test (Leverage)”** table once generator is merged.
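
One possible shape for the generator in step 1 is sketched below. Only the name `create_data_with_outliers` and the `mode` and `x_mag` parameters come from the plan above. Every other parameter, the linear truth model `y = 1 + 2*x`, and the output column names are assumptions made for this example. Leverage outliers are modelled here as rows whose `x` is displaced after `y` has been generated, so they sit far out in X and off the true line.

```python
import numpy as np
import pandas as pd

def create_data_with_outliers(n_groups=5000, rows_per_group=5, outlier_frac=0.3,
                              mode="response", y_mag=5.0, x_mag=8.0,
                              noise=1.0, seed=0):
    """Hypothetical sketch: y = 1 + 2*x + noise with flagged outlier rows."""
    if mode not in ("response", "leverage", "both"):
        raise ValueError(f"unknown mode: {mode!r}")
    rng = np.random.default_rng(seed)
    n = n_groups * rows_per_group
    group = np.repeat(np.arange(n_groups), rows_per_group)
    x = rng.normal(0.0, 1.0, n)                      # clean predictor, unit sigma
    y = 1.0 + 2.0 * x + rng.normal(0.0, noise, n)    # clean response
    is_outlier = rng.random(n) < outlier_frac
    sign = rng.choice([-1.0, 1.0], size=n)           # drawn in every mode so seeds align
    if mode in ("response", "both"):
        # Response outliers: shift y by y_mag noise sigmas, x untouched.
        y = np.where(is_outlier, y + sign * y_mag * noise, y)
    if mode in ("leverage", "both"):
        # Leverage outliers: displace x by x_mag (in units of the x sigma), y untouched.
        x = np.where(is_outlier, x + sign * x_mag, x)
    return pd.DataFrame({"group": group, "x": x, "y": y, "is_outlier": is_outlier})
```

Because the random draws happen in the same order for every mode, runs with the same `seed` flag the same rows as outliers. That makes a response-only run and a leverage run directly comparable.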

---

## Deliverables Checklist

* [x] Single-file benchmark with 5k/5 default & 30% outlier scenarios
* [x] Performance section in `groupby_regression.md` (Mac/Linux tables)
* [ ] **Leverage-outlier generator** (+ scenarios)
* [ ] Fitter instrumentation (refit counters, timings)
* [ ] Performance tests (CI thresholds for clean vs stress)
* [ ] `BENCHMARKS.md` with full runs & environment capture

---

## Current Commands

**Default quick run (docs/CI):**

```bash
python3 bench_groupby_regression.py \
--rows-per-group 5 --groups 5000 \
--n-jobs 10 --sigmaCut 5 --fitter ols \
--out bench_out --emit-csv
```

**Stress test placeholder (to be added):**

```bash
python3 bench_groupby_regression.py \
--rows-per-group 5 --groups 5000 \
--n-jobs 10 --sigmaCut 5 --fitter ols \
--mode leverage --x-mag 8.0 \
--out bench_out_stress --emit-csv
```

---

## Risks & Open Questions

* What outlier **structure** in real data triggers the re-fit? (X leverage? heteroscedasticity? group size variance?)
* Is the slowdown proportional to **targets × refits × groups**?
* Do container spawn/backends (forkserver/spawn) amplify overhead for very small groups?
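
The proportionality question can be written as a simple cost model. This is a hypothesis to test against the refit counters, not a measured result; the symbols are introduced here for illustration only.

```latex
T_{\text{total}} \;\approx\; G \cdot K \cdot \left(1 + \bar{r}\right) \cdot t_{\text{fit}}
```

Here $G$ is the number of groups, $K$ the number of targets, $\bar{r}$ the mean number of re-fits per group and target, and $t_{\text{fit}}$ the cost of one least-squares fit. If the model holds, runtime at fixed $G$ should scale linearly in $K$ and in $1 + \bar{r}$, which the multi-target check and the `n_refits` counter can each verify separately.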

---

**Last updated:** Oct 22, 2025 (this revision)

---

### Commit message

```
docs(restartContext): update with 5k/5 default, 30% outliers, and leverage-outlier plan

- Record new cross-platform results (Mac vs Linux) and observation that response-only outliers do not slow runtime
- Add action plan: leverage-outlier generator + refit counters + multi-target cost check
- Keep PR target; align benchmarks and docs with 5k/5 default
```