docs: add real-world benchmark leaderboard for 10 open-source agents by himmi-01 · Pull Request #11 · Corbell-AI/evalmonkey

himmi-01 · 2026-05-23T19:43:31Z

- Results from benchmarking GPT Researcher, OpenResearcher, Browser Agent, Browser-Use Couchbase, Open Deep Research, deep-research, Local Docs AI, Index Browser, Goose, and OnCell Support Agent
- Benchmarks: HotpotQA, TruthfulQA, MMLU (3 scenarios each)
- Chaos profiles: client_prompt_injection + client_schema_mutation
- Eval judge: Claude Sonnet 4.5 via AWS Bedrock
- Shows baseline score, chaos score, production reliability, and chaos drop
- Key findings: average 23pt drop under chaos; 3 agents improved under chaos; top agent scores 2.6x higher than bottom despite same LLM backend
- Section placed prominently after At a Glance, before Dashboard
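
For readers who want to reproduce the summary figures, here is a minimal sketch of how the chaos drop and the key findings could be derived from per-agent scores. It assumes that chaos drop is simply baseline score minus chaos score and that the top-to-bottom ratio compares baseline scores; neither definition is confirmed by this PR. `AgentResult`, `summarize`, and the agent names and scores are hypothetical placeholders, not evalmonkey APIs or leaderboard data. Production reliability is left out because its formula is not stated here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class AgentResult:
    """Hypothetical per-agent record; scores are on a 0-100 scale."""
    name: str
    baseline: float  # score without any chaos profile applied
    chaos: float     # score with the chaos profiles applied

    @property
    def chaos_drop(self) -> float:
        # Assumption: drop = baseline - chaos. A negative value means
        # the agent scored higher under chaos than at baseline.
        return self.baseline - self.chaos


def summarize(results: list[AgentResult]) -> dict:
    """Compute the three headline figures named in the key findings."""
    ranked = sorted(results, key=lambda r: r.baseline, reverse=True)
    bottom = ranked[-1].baseline
    ratio: Optional[float] = ranked[0].baseline / bottom if bottom else None
    return {
        "average_drop": mean(r.chaos_drop for r in results),
        "improved_under_chaos": [r.name for r in results if r.chaos_drop < 0],
        "top_to_bottom_ratio": ratio,
    }


# Placeholder numbers for illustration only, not the benchmark results.
sample = [
    AgentResult("agent-a", baseline=80.0, chaos=50.0),
    AgentResult("agent-b", baseline=60.0, chaos=70.0),
    AgentResult("agent-c", baseline=40.0, chaos=30.0),
]
summary = summarize(sample)
print(summary)
```

With real leaderboard rows in place of `sample`, the same function would yield the average drop, the list of agents that improved under chaos, and the top-to-bottom ratio, provided the assumed definitions match the ones used in the README.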


himmi-01 merged commit 16f4505 into Corbell-AI:main May 23, 2026
1 check passed

himmi-01 merged 1 commit into Corbell-AI:main from himmi-01:docs/benchmark-results-leaderboard