The Source of Truth for LLM Benchmarks. Compare top models like DeepSeek V3, Claude 3.5 Sonnet, and GPT-4o across trusted evaluation sets.
- Global Leaderboard: Sortable, filterable index with tier filtering (Verified vs Discovered)
- Interactive Comparison: "Versus Mode" with Radar Charts and Delta tables
- Deep Specs: Context window, pricing (input/output/cache/reasoning), max output tokens
- Verified Scores: Distinguishes between third-party, provider, community, and estimated results
- Data Freshness: Training cutoff dates and score age indicators
- Tier System: Verified (curated) vs Discovered (auto-imported) model classification
- Capability Filtering: Filter by reasoning, vision, tools, audio, code specialization
- Family System: Model family grouping (Llama, GPT, Claude, Gemini, etc.)
- API Access: REST API with rate limiting and OpenAPI 3.0 spec
- Data Validation: Built-in scripts to prevent broken IDs and out-of-range scores
- Scalable Architecture: Supports 10,000+ models with on-demand data loading
- Framework: Next.js 16 (App Router)
- Styling: Tailwind CSS v4 + Shadcn UI
- Data: Hybrid architecture (manifest + full data)
- State Management: SWR for on-demand data fetching
- Charts: Recharts
- Deployment: Cloudflare Workers via OpenNext (static + edge delivery)
Hybrid Data Loading:
- Registry Manifest (~50KB): Lightweight model list for discovery
- Full Model Data (~870KB): Complete specs and benchmarks (used where needed)
- Score Files (<1KB each): On-demand score loading
Key Hooks:
- `useRegistry()` - Fetch model lists with tier filtering
- `useModelScores()` - Load scores on demand
- `useRegistryFamilies()` - Get unique model families
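A minimal sketch of how the hooks fit the hybrid layout (the file paths and manifest shape below are assumptions for illustration; the real contracts are in docs/SCALABLE_ARCHITECTURE.md):

```typescript
// Assumed manifest entry shape (illustrative only, not the real type).
interface ManifestEntry {
  id: string;
  name: string;
  tier: "verified" | "discovered";
  family: string;
}

// URL builders of the kind the hooks would hand to SWR as fetch keys.
// The /data/... paths are assumptions, not the registry's documented layout.
export const manifestUrl = (): string => "/data/registry-manifest.json";
export const scoresUrl = (modelId: string): string =>
  `/data/scores/${encodeURIComponent(modelId)}.json`;

// e.g. useModelScores("gpt-4o") would lazily fetch scoresUrl("gpt-4o"),
// keeping the initial page load at the ~50KB manifest instead of ~870KB.
```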
See docs/SCALABLE_ARCHITECTURE.md for complete documentation.
- Install dependencies: `bun install`
- Import models.dev metadata, then generate the registry manifest: `bun run import:models-dev`, then `bun run generate:manifest`
- Run the development server: `bun dev`
- Import models.dev metadata: `bun run import:models-dev`
- Generate registry manifest: `bun run generate:manifest`
- Validate registry integrity: `bun run validate:data`
- Run strict validation (CI parity): `bun run validate:data:strict`
- Generate a category and benchmark coverage report: `bun run report:coverage`
- Run tests: `bun run test`
This project deploys with OpenNext to Cloudflare Workers (not Cloudflare Pages).
- One-time auth: `bunx wrangler login`
- Build and preview locally in the Workers runtime: `bun run preview`
- Deploy to production: `bun run deploy`
Use Workers Builds (not Pages) for fully automated deploys on every push.
- Project type: Workers
- Worker name: `llm-registry` (must match `wrangler.jsonc`)
- Root directory: `/`
- Build command: leave empty (or `true`)
- Deploy command: `bun run deploy`
This gives a single automated pipeline step per commit: build + deploy.
- Global and category views use normalized benchmark scores (0-100).
- Lower-is-better metrics are inverted so higher normalized score always means better performance.
- Category averages are computed over available scores for that category.
- Compare view defaults to strict shared-benchmark analysis for fair model-vs-model deltas.
- Exploratory compare mode allows partial overlap; missing values stay explicit as `N/A`.
- Capability profile (radar) shows all available domains in scope and never treats missing data as zero.
- Leaderboard supports Coverage-Assisted mode by default; use `coverageMode=strict` (Observed Only) to rank using measured scores only.
- Full methodology page: `/about`
- Ongoing SEO operations checklist: `SEO_CHECKLIST.md`
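The scoring rules above can be sketched as follows (function names and the per-benchmark min/max bounds are illustrative assumptions, not the registry's actual implementation):

```typescript
// Normalize a raw benchmark value to 0-100; lower-is-better metrics are
// inverted so a higher normalized score always means better performance.
export function normalizeScore(
  value: number,
  min: number,
  max: number,
  lowerIsBetter = false,
): number {
  const clamped = Math.min(Math.max(value, min), max);
  const scaled = ((clamped - min) / (max - min)) * 100;
  return lowerIsBetter ? 100 - scaled : scaled;
}

// Category average over available scores only: missing entries are skipped,
// never treated as zero.
export function categoryAverage(scores: Array<number | null>): number | null {
  const present = scores.filter((s): s is number => s !== null);
  if (present.length === 0) return null;
  return present.reduce((sum, s) => sum + s, 0) / present.length;
}
```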
src/
├── app/ # Next.js App Router pages
├── components/
│ ├── dashboard/ # Leaderboard, compare, and data viz components
│ └── ui/ # Shadcn UI components
├── data/
│ ├── models.ts # Model definitions and scores
│ ├── benchmarks.ts # Benchmark taxonomy and metadata
│ ├── aa-overrides.ts # Artificial Analysis data imports
│ ├── sources.ts # Data source registry
│ └── changelog.ts # Version history
├── lib/
│ ├── registry-data.ts # Data processing and queries
│ └── leaderboard-query.ts # Leaderboard filtering logic
├── types/ # TypeScript type definitions
└── scripts/ # Data validation and import scripts
The registry provides a REST API for programmatic access:
- `GET /api/v1/models` - List all models
- `GET /api/v1/models/[id]` - Get specific model details
- `GET /api/v1/benchmarks` - List all benchmarks
- `GET /api/v1/leaderboard` - Get leaderboard data with filtering
- `GET /api/v1/export?format=json|csv` - Export data for research workflows
Full API documentation available at /api-docs
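A small helper for building leaderboard queries might look like this (the query-parameter names are assumptions inferred from the filters described above; check /api-docs for the real ones):

```typescript
// Hypothetical filter set for /api/v1/leaderboard (parameter names assumed).
interface LeaderboardQuery {
  category?: string;
  tier?: "verified" | "discovered";
  family?: string;
  coverageMode?: "strict";
}

// Build a leaderboard URL from optional filters, omitting unset ones.
export function leaderboardUrl(base: string, query: LeaderboardQuery = {}): string {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(query)) {
    if (value !== undefined) params.set(key, String(value));
  }
  const qs = params.toString();
  return `${base}/api/v1/leaderboard${qs ? `?${qs}` : ""}`;
}
```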
- Open `src/data/models.ts`
- Add a new object to the `models` array following the `Model` interface
- Add scores for existing benchmarks
- Include provenance metadata (source, verification level, as-of date)
- Run validation: `bun run validate:data:strict`
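A hypothetical entry of the kind the steps above describe (every field name here is illustrative; the authoritative shape is the `Model` interface in `src/types/`):

```typescript
// Illustrative only — not the project's actual Model interface.
const exampleModel = {
  id: "example-lab/example-model-1", // stable, unique id
  name: "Example Model 1",
  family: "Example",
  tier: "verified" as const,
  scores: [
    {
      benchmarkId: "mmlu",        // must reference an existing benchmark id
      value: 82.1,
      // Provenance metadata, as required above (values are hypothetical):
      source: "provider-report",  // a source id registered in sources.ts
      verification: "provider",   // third-party | provider | community | estimated
      asOf: "2026-03-01",
    },
  ],
};
```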
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License - see LICENSE for details.
Benchmark data includes contributions from:
- Artificial Analysis (https://artificialanalysis.ai/) - Imported under current policy with explicit attribution
- Provider-reported scores from model publishers
- Third-party evaluation results
All imported data includes provenance tracking with source IDs, verification levels, and as-of dates.
`bun run build:cf`

This will:
- Generate registry manifest (1,581 models)
- Copy score files to public directory
- Build Next.js application
- Output static files to `dist/`
# Manual deployment
bun run build:cf
npx wrangler pages deploy dist/
# Or connect GitHub repo for auto-deploy

GitHub Actions automatically:
- Imports latest models.dev data (every Monday 2 AM UTC)
- Detects changes
- Creates pull request for review
See .github/workflows/update-models-dev.yml
| Component | Size | Notes |
|---|---|---|
| Client Bundle | ~150KB | React + app code |
| Registry Manifest | ~50KB | 1,581 models |
| Score Files | <1KB | Per model |
| Page | Initial Load | Data Fetch | Total |
|---|---|---|---|
| Leaderboard | <1s | <200ms | <1.2s |
| Model Detail | <1s | <50ms | <1.05s |
| Explore | <1s | N/A | <1s |
- Current Models: 1,581
- Max Supported: 50,000+
- Build Time: ~3 seconds
- API Response: <20ms (edge cached)
REST API available at /api/v1/:
- `GET /api/v1/models` - List all models
- `GET /api/v1/models/[id]` - Model details
- `GET /api/v1/benchmarks` - List benchmarks
- `GET /api/v1/scores` - Query scores
- `GET /api/v1/leaderboards/[category]` - Category rankings
- `GET /api/v1/export` - Export data (JSON/CSV)
Rate Limiting: 100 requests/minute per IP (via Cloudflare WAF)
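Clients that exceed the 100 req/min ceiling will be rejected by the WAF (typically with HTTP 429); a common client-side pattern is capped exponential backoff. The schedule below is a sketch, not documented API behavior:

```typescript
// Delay before retry `attempt` (0-based): 1s, 2s, 4s, ... capped at 60s.
// Base and cap are arbitrary illustrative choices.
export function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```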
Full documentation: /api-docs
- Scalable Architecture - Data loading patterns
- Cloudflare WAF Setup - Rate limiting
- Migration Guide - Upgrading to v0.7.0
- API Documentation - REST API reference
Current: v0.7.0 (2026-03-01)
Recent Changes:
- Tier system (Verified/Discovered models)
- On-demand data loading with hooks
- Automated models.dev import
- 1,581 models with rich metadata
- Advanced filtering (family, capability, provider)
- Score files for on-demand loading
See Changelog for complete history.
MIT License - see LICENSE file for details.
Data Sources:
- models.dev - Model metadata (MIT License)
- Artificial Analysis - Score overrides
- Manual curation - Benchmark scores
Technologies:
- Next.js 16
- TypeScript 5
- React 19
- Tailwind CSS v4
- Shadcn UI
- SWR
- Cloudflare Workers
Built with ❤️ for the AI community.