Search Stability Lab

This repository implements a reproducible experiment codebase for studying search stability under finite context. It operationalizes the theory paper with two layers:

Layer A: a controlled simulator with simulator-exact latent adequacy and exact failure-channel attribution.
Layer B: a lightweight task harness with deterministic family proxies and bundled mechanistic probe tasks.

The repository is designed for scientific use. It tests scoped hypotheses about controller laws under finite-context constraints. It does not prove the theory universally, does not bundle copyrighted benchmark assets, and does not fabricate empirical results. The strongest current cross-layer result is the substitution-first story (C3 > C0). Layer B also shows directional reset-probe evidence against C0 and limited outcome-sensitive compression evidence on a tiny authored suite. Some local CPU results remain confounded by malformed structured output.

Theoretical basis

This repository is an implementation-oriented companion to the following preprint, which defines the theory and terminology used here:

Takahashi, K. (2026). Search Stability under Finite Context: A Minimal Theory of Adequacy Preservation, Compression, and Reset in Long-Running Agents. Zenodo. https://doi.org/10.5281/zenodo.18905242

For experiment design and interpretation, see docs/THEORY_VALIDATION_GUIDE.md and docs/SCIENTIFIC_CHECKLIST.md. Further notes:

docs/FINAL_AUDIT.md: final experiment-readiness audit
docs/EXPERIMENT_REDESIGN.md: post-pilot redesign rationale
docs/CURRENT_STATUS.md: current conservative status page
docs/SCIENTIFIC_SCOPE.md: repository scientific scope note
docs/RESULTS_INTERPRETATION.md: results interpretation note
docs/LAYER_B_STATUS.md, docs/LAYER_B_PROXY_MODEL.md, docs/LAYER_B_PROBE_MATRIX.md: Layer B-specific scope and probe design notes
docs/LAYER_B_PROBE_REPORT_2026-03-09.md: completed Layer B probe-block report

What this repo tests

Whether controller-law changes affect adequate-family survival under finite active-context budgets.
Whether substitution-first, reserve-aware, compression-cautious, and reset-aware policies differ under controlled conditions.
Whether small real-task harnesses show directionally similar controller effects when only proxies are available.

Who this repo is for

AI engineers who need a lightweight, theory-aligned harness for long-running agent evaluation
AI agents that need explicit rules for what may be claimed from simulator versus real-task evidence
Researchers who want a falsifiable, CPU-feasible layer before investing in larger benchmark runs

Why this theory is useful

The repository is meant to make long-running agent failures more legible. Instead of treating every miss as generic model weakness, it helps test whether failure came from:

loss of adequate-family reserve before verification
harmful compression aliasing
avoidable retirement of a still-useful route family
stale-legacy continuation when reset would have been rational
a mixture of ecological and raw-model failure

What this repo does not test

It is not a leaderboard implementation.
It does not claim benchmark completeness for SWE-Bench Lite.
It does not validate the theory outside the implemented simulator conditions and the chosen fixed task slice.

Setup

Python 3.11+ is required.

python -m pip install -r requirements.txt

Environment variables:

GEMINI_API_KEY for Gemini runs
LOCAL_LLM_ENDPOINT and LOCAL_LLM_MODEL when using a local CPU model server

The repository ships .env.example only. Do not commit .env. Local scripts load .env automatically when present and never write its values into configs or logs. If a real secret was ever tracked earlier, see docs/SECURITY_RESPONSE.md.
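
The .env behavior described above can be sketched as follows. This is a minimal illustration, not the repository's loader: the function name load_env_file, its parsing rules, and the choice to return only key names are all assumptions.

```python
import os
from pathlib import Path


def load_env_file(path=".env", environ=None):
    """Load KEY=VALUE pairs from a .env file into the environment.

    Hypothetical helper for illustration. It returns only the key names,
    so callers can report which variables were set without exposing values.
    """
    environ = os.environ if environ is None else environ
    env_path = Path(path)
    if not env_path.is_file():
        return []  # no .env present: nothing to load
    loaded = []
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Variables already set in the real environment take precedence.
        if key and key not in environ:
            environ[key] = value
            loaded.append(key)
    return loaded
```

Returning key names rather than values is one way to keep secrets out of configs and logs while still recording that a variable such as GEMINI_API_KEY was found.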

Quickstart

Simulator smoke test:

python scripts/run_simulator.py --config configs/experiments/pilot.yaml --max-conditions 1 --max-episodes 1

Real-task harness smoke test:

python scripts/run_real_tasks.py --config configs/experiments/real_tasks.yaml

Bundled non-mock Layer B run on the frozen micro-task slice:

python scripts/run_real_tasks.py --config configs/experiments/real_tasks_gemini_nonmock_frozen.yaml

Expanded bundled non-mock Layer B run:

python scripts/validate_task_assets.py --manifest tasks/manifests/frozen_task_slice_v3.yaml
python scripts/run_real_tasks.py --config configs/experiments/real_tasks_gemini_nonmock_expanded.yaml

Compression-focused Layer B probe block:

python scripts/run_real_tasks.py --config configs/experiments/real_tasks_gemini_compression_probe.yaml

Reset-focused Layer B probe block:

python scripts/run_real_tasks.py --config configs/experiments/real_tasks_gemini_reset_probe.yaml

Local CPU model smoke test:

python scripts/run_real_tasks.py --config configs/experiments/real_tasks_local_cpu.yaml --max-tasks 1 --controllers C0

Gemini smoke test:

python scripts/run_real_tasks.py --config configs/experiments/real_tasks_gemini.yaml --max-tasks 1 --controllers C0

Analyze generated logs:

python scripts/analyze_results.py --input-dir artifacts --output-dir artifacts/analysis

Before zipping or publishing:

python scripts/release_clean.py --dry-run
python scripts/check_public_safety.py --mode release

Read the run summary and the current status page (type is the Windows command; use cat on macOS or Linux):

type result_summary.md
type docs/CURRENT_STATUS.md

Validate a design before main runs:

python scripts/check_experiment_design.py --layer layer_a --config configs/experiments/pilot_gemini.yaml
python scripts/check_experiment_design.py --layer layer_b --config configs/experiments/real_tasks_gemini.yaml

Two-layer design

Layer A is the theory-faithful core. It includes:

finite family sets and route instances
a nonempty hidden adequate-family set
delayed strong verification
hard-cap budgets
staleness, inertia, overlap burden, legacy contamination
lossy compression with alias logging
reset, branch, continue, retire, substitute, and tool actions
exact recoverability and deterministic exact failure attribution

Layer B is a small-scope harness. It includes:

a task manifest format for a fixed task slice
deterministic family proxy construction
an explicit proxy-construction layer separating task-authored, derived, and runtime proxies
mock mode for offline validation
a bundled public-safe frozen micro-task slice for non-mock runs
explicit substitution, compression, and reset probe suites
an expanded bundled eight-task frozen slice for stronger Layer B contrasts
graceful behavior when external task assets are absent
proxy-only instrumentation aligned to the theory

Unified model adapters

The repository exposes a single adapter interface:

plan_step(context)
compress_state(context)
diagnose_failure(trace)
choose_continue_branch_reset(context)

Supported backbones:

Gemini via config plus GEMINI_API_KEY
a local CPU endpoint via Ollama-style or OpenAI-compatible HTTP APIs
a deterministic mock adapter for offline tests and smoke checks
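
To make the interface concrete, here is a minimal sketch of the four methods as a Python Protocol, with a toy deterministic adapter. Only the four method names come from this README. The argument and return types, the class names ModelAdapter and ToyMockAdapter, and the toy decision rules are assumptions for illustration, not the repository's actual classes.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class ModelAdapter(Protocol):
    """The four adapter methods named in this README.

    Argument and return types are assumptions made for illustration.
    """

    def plan_step(self, context: Dict[str, Any]) -> Dict[str, Any]: ...
    def compress_state(self, context: Dict[str, Any]) -> Dict[str, Any]: ...
    def diagnose_failure(self, trace: List[Dict[str, Any]]) -> Dict[str, Any]: ...
    def choose_continue_branch_reset(self, context: Dict[str, Any]) -> str: ...


class ToyMockAdapter:
    """Deterministic stand-in (hypothetical; not the repository's mock adapter)."""

    def plan_step(self, context):
        return {"action": "continue", "step": context.get("step", 0) + 1}

    def compress_state(self, context):
        # Keep only the two most recent notes: a deliberately lossy toy rule.
        return {**context, "notes": context.get("notes", [])[-2:]}

    def diagnose_failure(self, trace):
        return {"n_events": len(trace), "label": "unknown"}

    def choose_continue_branch_reset(self, context):
        # Fixed threshold, so identical inputs always give identical choices.
        return "reset" if context.get("contamination", 0.0) > 0.5 else "continue"


adapter = ToyMockAdapter()
print(isinstance(adapter, ModelAdapter))
print(adapter.choose_continue_branch_reset({"contamination": 0.9}))
```

A structural Protocol lets Gemini, local-endpoint, and mock adapters share one interface without a common base class, which is one plausible way to keep the backbone fixed while swapping controller laws.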

Exact vs proxy note

Layer A logs exact simulator quantities. Layer B logs only proxies and task outcomes. The code and analysis pipeline keep these separate in field names, tables, figure titles, and documentation.

Primary hypotheses

H1: Success falls sharply once budget drops below a recoverability-supporting threshold.
H2: Within-family substitution preserves success better than greedy deletion.
H3: Lossy compression can create decision-relevant aliasing.
H4: Compression harm grows when strong verification is delayed.
H5: Reset-aware control can dominate stale continuation when contamination is high enough.
H6: These effects are attributable to controller law under a fixed backbone, not only to backbone changes.
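
The intuition behind H2 can be illustrated with a toy example. This is a sketch under strong assumptions and is not the repository's C0 or C3 controller: the Route type, both eviction functions, the scores, and the choice of family B as the adequate family are all invented for illustration. The point is only that, under a hard cap, evicting by score alone can remove the last member of a family, while a family-aware rule evicts redundant members first.

```python
from collections import Counter
from typing import List, NamedTuple, Set


class Route(NamedTuple):
    name: str
    family: str
    score: float  # the controller's current (possibly misleading) estimate


def evict_greedy(active: List[Route], cap: int) -> List[Route]:
    """Keep only the top-scoring routes until the hard cap is met."""
    return sorted(active, key=lambda r: r.score, reverse=True)[:cap]


def evict_family_aware(active: List[Route], cap: int) -> List[Route]:
    """Prefer evicting routes whose family keeps another active member."""
    kept = list(active)
    while len(kept) > cap:
        counts = Counter(r.family for r in kept)
        redundant = [r for r in kept if counts[r.family] > 1]
        pool = redundant or kept  # fall back to all routes if no family is redundant
        kept.remove(min(pool, key=lambda r: r.score))
    return kept


def surviving_families(active: List[Route]) -> Set[str]:
    return {r.family for r in active}


routes = [
    Route("a1", "A", 0.9),
    Route("a2", "A", 0.8),
    Route("a3", "A", 0.7),
    Route("b1", "B", 0.3),  # family B is the hidden adequate one in this toy
]
print(sorted(surviving_families(evict_greedy(routes, cap=2))))        # ['A']
print(sorted(surviving_families(evict_family_aware(routes, cap=2))))  # ['A', 'B']
```

In this toy, the low-scoring but adequate family B survives only under the family-aware rule. Whether a comparable effect appears in real runs is exactly what the experiments are meant to test.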

Reproducibility note

Every run is driven by YAML config plus explicit seeds. Logs record:

layer
controller
backbone model ID
prompt version
code revision
run and episode identifiers
structured trajectory events
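
A record carrying the fields listed above might be serialized as one JSON line per episode. The sketch below is illustrative only: the class name RunLogRecord, the exact field names, and the example values are assumptions, and the repository's actual schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class RunLogRecord:
    """One JSON-serializable log record; all field names are assumptions."""

    layer: str
    controller: str
    backbone_model_id: str
    prompt_version: str
    code_revision: str
    run_id: str
    episode_id: str
    events: List[Dict[str, Any]] = field(default_factory=list)

    def to_json_line(self) -> str:
        # sort_keys keeps the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True)


record = RunLogRecord(
    layer="layer_a",
    controller="C0",
    backbone_model_id="mock",
    prompt_version="v1",
    code_revision="unknown",
    run_id="run-0001",
    episode_id="ep-0001",
    events=[{"t": 0, "action": "continue"}],
)
print(record.to_json_line())
```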

Each run directory also includes a run_manifest.json with the resolved experiment config, model config, controller IDs, theory hypotheses, and a scientific-guardrail report.

If the repository is not inside a Git checkout, code_revision is logged as unknown.
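
The fallback to unknown can be sketched like this. The function name get_code_revision and the five-second timeout are assumptions, and the repository's own implementation may differ. The sketch relies only on the standard git rev-parse HEAD command.

```python
import subprocess


def get_code_revision(repo_dir: str = ".") -> str:
    """Return the current Git commit hash, or "unknown" if unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
            timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        # Covers: git not installed, not a Git checkout, timeout, bad directory.
        return "unknown"
    return result.stdout.strip() or "unknown"


print(get_code_revision())
```

Catching the failure instead of raising keeps runs from an unpacked zip usable, at the cost of a weaker provenance record for those runs.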

Directory map

configs/: model, controller, and experiment YAML
prompts/: versioned prompt templates
schemas/: JSON Schemas for structured model outputs
simulator/: Layer A generator, engine, scenarios, attribution
controllers/: controller laws C0 through C6
models/: Gemini, local endpoint, and mock adapters
tasks/: Layer B harness, proxy rule, and example manifest
logging/: logging notes and field conventions
analysis/: aggregation, statistics, and figure generation
scripts/: runnable entry points
docs/: build plan, implementation notes, and metric registry
result_summary.md: top-level execution and interpretation summary
tests/: unit tests, smoke checks, and offline pipeline tests

Running a pilot simulator experiment

python scripts/check_experiment_design.py --layer layer_a --config configs/experiments/pilot.yaml
python scripts/run_simulator.py --config configs/experiments/pilot.yaml
python scripts/analyze_results.py --input-dir artifacts/pilot

Backbone-specific pilot configs are also provided:

configs/experiments/pilot_local_cpu.yaml
configs/experiments/pilot_gemini.yaml
configs/experiments/layer_a_identification_pilot_gemini.yaml
configs/experiments/layer_a_h1_budget_threshold_gemini.yaml
configs/experiments/layer_a_h2_substitution_gemini.yaml
configs/experiments/layer_a_h3_h4_compression_gemini.yaml
configs/experiments/layer_a_h5_reset_gemini.yaml

This will generate logs and summary outputs from actual runs only. No figure is generated unless matching logs exist.

How to read results correctly

Use Layer A to make exact mechanism claims about reserve loss, compression aliasing, retirement, and reset behavior.
Use Layer B to ask whether controller-law effects remain directionally visible under a fixed small task slice.
Treat smoke runs as pipeline checks, not evidence for the theory.
Treat mock-mode Layer B runs as instrumentation validation, not external-validity evidence.
Report null or weak effects explicitly when they occur.

Plugging in a real task slice later

Prepare a manifest following tasks/README.md.
Point configs/experiments/real_tasks.yaml at that manifest.
Keep backbone, prompt, turn budget, and tool permissions fixed within each comparison block.
Run the harness in non-mock mode only when the task assets are available and frozen.
Run scripts/check_experiment_design.py before main runs and keep the generated run_manifest.json.

Bundled non-mock assets are already provided for the shipped micro-task slice under tasks/assets/. The larger frozen slice is tasks/manifests/frozen_task_slice_v3.yaml.

Safety and privacy

No secrets are hardcoded.
Default configs use placeholders or mock backbones.
No local absolute paths are shipped.
The repository does not bundle external benchmark data.
Run python scripts/release_clean.py and python scripts/check_public_safety.py --mode release before packaging a public zip.

Limitations

Layer B now ships a bundled frozen task slice, but it is still a micro-task set rather than a benchmark-scale evaluation.
The simulator is lightweight and theorem-aligned, not benchmark-realistic.
Small smoke runs may produce unstable estimates or degenerate regressions; the analysis code preserves those outputs rather than hiding them.
The paper discusses richer posterior-robust and theorem-local audit quantities than this lightweight implementation currently exposes online; those are documented as scoped simplifications rather than omitted silently.

Citation

If you use this repository, cite the underlying theory preprint:

Takahashi, K. (2026). Search Stability under Finite Context: A Minimal Theory of Adequacy Preservation, Compression, and Reset in Long-Running Agents. Zenodo. https://doi.org/10.5281/zenodo.18905242
