Build Repair Agent: initial evaluation harness#5762

Open
evgenyrp wants to merge 20 commits into mozilla:master from
evgenyrp:build_repair_agent

@evgenyrp evgenyrp commented Mar 3, 2026

Here's an example of running it on a dataset of 85 examples. An outage at Anthropic caused some of the failures, but it works overall.

The baseline agent is very simple; most of the complexity is about integration with Weave and building Firefox.

What works so far:

  • Creating a dataset of single-commit build failures (each with a single-commit fix)
  • Running evaluation from a local machine under docker-compose
  • Creating and cleaning worktrees for each example
  • Basic sandboxing
  • Tracing agent's steps in Weave
  • Running multiple trials per example
  • Separate stages for analysis and fixing (allows deploying analysis only)
  • Building Firefox locally inside Docker to verify the fix
  • Basic scorers
  • Cost tracking
  • Checking for data contamination (confirming that the LLM's training cutoff is earlier than the example's date)
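
To illustrate the last item, here is a minimal sketch of a date-based contamination check. It is not the code in this PR: `MODEL_CUTOFFS`, the model name `example-model`, its cutoff date, and the `fix_date` field are all hypothetical names and values chosen for illustration.

```python
from datetime import date

# Hypothetical training cutoffs; real values would come from the model vendor.
MODEL_CUTOFFS = {
    "example-model": date(2025, 3, 1),
}


def is_contaminated(model: str, example_date: date) -> bool:
    """Return True if the example could have been part of the training data."""
    cutoff = MODEL_CUTOFFS[model]
    # An example dated on or before the cutoff may have been seen in training.
    return example_date <= cutoff


def filter_clean(examples: list[dict], model: str) -> list[dict]:
    """Keep only examples whose (hypothetical) fix_date is after the cutoff."""
    return [ex for ex in examples if not is_contaminated(model, ex["fix_date"])]


examples = [
    {"id": "a", "fix_date": date(2025, 1, 15)},  # before cutoff: dropped
    {"id": "b", "fix_date": date(2025, 6, 2)},   # after cutoff: kept
]
clean_ids = [ex["id"] for ex in filter_clean(examples, "example-model")]
```

Treating an example dated exactly on the cutoff as contaminated is the conservative choice, since published cutoff dates are usually approximate.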

Out of scope for this PR:

  • Reusing a pool of worktrees to avoid running ./mach bootstrap on every example (I have the code, but it didn't work well, so I kept the simple approach)
  • Pushing to TRY to verify a specific build: the main issue is propagating credentials to Docker (an alternative of using ./mach load-task also didn't work for me)
  • LLM-as-a-judge scorers
  • Better integration with Weave (hacky implementation of tracing, verify metrics when all scorers work, verify error handling)
  • Production deployment-related work
  • Improving the agent itself (reading through the Claude Agent SDK docs and applying best practices, making sure the Firefox MCP and skills work, maybe doing another iteration if the build is still failing, etc.)
  • Improving sandboxing

@evgenyrp evgenyrp requested a review from suhaibmujahid March 3, 2026 01:10