docs: add real-world benchmark leaderboard for 10 open-source agents #11

Merged
himmi-01 merged 1 commit into Corbell-AI:main from himmi-01:docs/benchmark-results-leaderboard
May 23, 2026

@himmi-01
Contributor

  • Results from benchmarking GPT Researcher, OpenResearcher, Browser Agent, Browser-Use Couchbase, Open Deep Research, deep-research, Local Docs AI, Index Browser, Goose, and OnCell Support Agent
  • Benchmarks: HotpotQA, TruthfulQA, MMLU (3 scenarios each)
  • Chaos profiles: client_prompt_injection + client_schema_mutation
  • Eval judge: Claude Sonnet 4.5 via AWS Bedrock
  • Shows baseline score, chaos score, production reliability, and chaos drop
  • Key findings: average 23-point drop under chaos; 3 agents improved under chaos; top agent scores 2.6x higher than the bottom agent despite the same LLM backend
  • Section placed prominently after At a Glance, before Dashboard
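
The PR does not spell out how the leaderboard columns are derived, so the following is only a minimal sketch of how a chaos drop and the summary figures could be computed. It assumes the chaos drop is the baseline score minus the chaos score, and that the top-to-bottom ratio compares baseline scores. The `AgentResult` and `summarize` names and the sample numbers are hypothetical placeholders, not data from this benchmark.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AgentResult:
    name: str
    baseline: float  # score without chaos injection
    chaos: float     # score with chaos profiles enabled

    @property
    def chaos_drop(self) -> float:
        # Assumed definition: positive means the agent got worse under chaos.
        return self.baseline - self.chaos


def summarize(results: list[AgentResult]) -> dict[str, float]:
    drops = [r.chaos_drop for r in results]
    baselines = [r.baseline for r in results]
    return {
        "avg_drop": mean(drops),
        # A negative drop means the agent scored higher under chaos.
        "improved": sum(1 for d in drops if d < 0),
        # Assumption: the ratio compares baseline scores (must be non-zero).
        "top_to_bottom": max(baselines) / min(baselines),
    }


# Placeholder numbers for illustration only, not results from this PR.
demo = [
    AgentResult("agent-a", baseline=80, chaos=60),
    AgentResult("agent-b", baseline=50, chaos=55),
    AgentResult("agent-c", baseline=40, chaos=10),
]
print(summarize(demo))
```

If the leaderboard defines these columns differently (for example, a ratio over chaos scores, or a weighted production reliability metric), the two commented assumptions are the lines to adjust.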

@himmi-01 himmi-01 merged commit 16f4505 into Corbell-AI:main May 23, 2026
1 check passed