Building a coding agent from scratch, with an eval framework, memory, multi-agent orchestration, and local model inference on H100s.
```
agent_loop/   # ReAct loop, tool calling, model-agnostic LLM client
evals/        # Eval harness, trajectory assessment, observability tracer
context/      # Context management and file retrieval
memory/       # In-context, external, and episodic memory
multiagent/   # Orchestrator/subagent pattern
local_model/  # vLLM deployment on H100, model comparison
```
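For orientation, here is a minimal sketch of the loop that `agent_loop/` implements: the model proposes either a tool call or a final answer, the tool runs, and the observation is fed back in. All names here (`Reply`, `run_agent`, the message shapes) are illustrative, not the repo's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Reply:
    content: str
    tool_name: Optional[str] = None   # None means the model gave a final answer
    tool_args: Optional[dict] = None

def run_agent(call_llm: Callable[[list], Reply],
              tools: dict[str, Callable[..., str]],
              task: str,
              max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)               # model proposes an action or an answer
        if reply.tool_name is None:
            return reply.content                 # final answer: exit the loop
        observation = tools[reply.tool_name](**(reply.tool_args or {}))
        messages.append({"role": "assistant", "content": reply.content})
        messages.append({"role": "tool", "content": observation})  # feed the result back
    return "max steps reached without a final answer"
```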
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Copy `.env.example` to `.env` and add your API key:

```
OPENAI_API_KEY=your_key_here
```
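If the key doesn't seem to be picked up, a quick sanity check (this assumes the client loads `.env` via python-dotenv, which is a guess about this repo's setup):

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current directory into os.environ
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is missing from .env"
```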
Run the agent:

```bash
python3 agent_loop/main.py
```

Run the evals:

```bash
python3 -m evals.run --all
python3 -m evals.run --compare gpt-4o-mini gpt-4o --runs 1 --quiet
```
Build the benchmark image (pinned runtime/deps):

```bash
python3 -m evals.run --build-image --docker-image coding-agent-evals:latest
```

Run a fast docker smoke suite:

```bash
python3 -m evals.run --docker-smoke --runner docker --docker-image coding-agent-evals:latest --quiet
```
Run the full docker rebaseline matrix:

```bash
python3 -m evals.run --compare gpt-4o-mini gpt-4o-mini+tools gpt-4o gpt-5.2-2025-12-11 \
    --runs 3 --output results_docker.json --quiet --runner docker --docker-image coding-agent-evals:latest
```
Capture a live JSONL benchmark log (task results + tool-gen attempts):

```bash
python3 -m evals.run --compare gpt-4o-mini+tools --runs 1 --quiet \
    --runner docker --docker-image coding-agent-evals:latest \
    --benchmark-log benchmark_live.jsonl
```
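Each line of the log is one JSON record; a quick way to tally it afterwards (the field names `type` and `passed` are assumptions about the schema, not documented):

```python
import json
from collections import Counter

counts = Counter()
with open("benchmark_live.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # "type" and "passed" are guesses at the field names — adjust to the actual schema.
        counts[(record.get("type"), record.get("passed"))] += 1
print(counts.most_common())
```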
Key runner flags (the docker path is sketched after the list):

```
--runner host|docker
--docker-image <name:tag>
--build-image
--docker-smoke
```
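Roughly what `--runner docker` does per task, as a sketch under assumptions: the helper name, mount layout, and timeout here are illustrative, not the harness's actual code.

```python
import subprocess

def run_task_in_docker(task_dir: str, cmd: list[str],
                       image: str = "coding-agent-evals:latest") -> subprocess.CompletedProcess:
    # Each task runs in a throwaway container: --rm discards it afterwards so
    # runs can't contaminate each other; the task directory is mounted at /task.
    return subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{task_dir}:/task", "-w", "/task",
         image, *cmd],
        capture_output=True, text=True, timeout=300,
    )
```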
Tool generation uses a no-leakage default: hidden verifier output is not passed into tool-generation prompts. To opt into the old behavior for debugging:

```bash
python3 -m evals.run ... --allow-verifier-feedback
python3 -m tool_gen.run ... --allow-verifier-feedback
```
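As a sketch of what that gate means in practice (a hypothetical function, not the repo's actual implementation):

```python
from typing import Optional

def build_tool_gen_prompt(task: str,
                          verifier_output: Optional[str],
                          allow_verifier_feedback: bool = False) -> str:
    prompt = f"Write a tool that helps solve this task:\n{task}\n"
    if allow_verifier_feedback and verifier_output:
        # Debug-only path (--allow-verifier-feedback): leaks hidden verifier
        # output into the prompt, which can overfit generated tools to the tests.
        prompt += f"\nHidden verifier feedback from the last attempt:\n{verifier_output}\n"
    return prompt
```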