WayneBuckhanan/LLM-Compare

LLM Compare

A platform for testing, iterating on, and comparing LLM outputs across different models to help you find the best prompts and models for your specific use cases.


Built for the Cerebras / Cline / GLM-4.7 Hackathon

This started with the VueFlowFast template project that I previously built as the foundation for AI-assisted coding projects like this one. The template provides a generic CRUDL interface for data management. In this case, we stuck with the in-browser Pinia store, but the same interface can be used with a serverless backend (AWS and Cloudflare versions exist). Given the nature of this project, it makes sense to keep the data in-browser and connect to LLM providers directly from the browser.

The DESIGN_DOC.md was developed in an interactive LLM chat session and then refined with GLM-4.7 on Cerebras in Cline. The initial design assumed an AI Gateway on the backend, but with AI SDK we can now connect to providers directly from the browser while still maintaining a consistent interface. This also fits the extreme-prototyping approach that VueFlowFast enables: start with fast development in the browser, then move the same logic from the browser to a serverless function as needed. The interface to the generic CRUDL or the AI SDK calls is abstracted in the browser, so the UI and frontend business logic can be developed without worrying about which backend we are using (browser store, DynamoDB, D1, PostgreSQL, etc.).
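To illustrate the backend-swapping idea, here is a minimal sketch of that kind of CRUDL abstraction. The names (`CrudlStore`, `InMemoryStore`) are illustrative, not VueFlowFast's actual API; the point is that a DynamoDB, D1, or PostgreSQL version would implement the same interface behind a serverless function.

```typescript
// Hypothetical sketch of a backend-agnostic CRUDL interface.
interface CrudlStore<T extends { id: string }> {
  create(item: T): Promise<T>;
  read(id: string): Promise<T | undefined>;
  update(id: string, patch: Partial<T>): Promise<T | undefined>;
  remove(id: string): Promise<boolean>;
  list(): Promise<T[]>;
}

// In-browser implementation; server-backed versions would implement
// the same interface, leaving the UI code unchanged.
class InMemoryStore<T extends { id: string }> implements CrudlStore<T> {
  private items = new Map<string, T>();
  async create(item: T) { this.items.set(item.id, item); return item; }
  async read(id: string) { return this.items.get(id); }
  async update(id: string, patch: Partial<T>) {
    const existing = this.items.get(id);
    if (!existing) return undefined;
    const next = { ...existing, ...patch };
    this.items.set(id, next);
    return next;
  }
  async remove(id: string) { return this.items.delete(id); }
  async list() { return [...this.items.values()]; }
}
```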

For this project, the design doc captured the initial design idea (an early version of the README.md before it was moved to DESIGN_DOC.md), augmented with AI SDK. The "context7" MCP server was added to Cline and used to update the design doc with up-to-date AI SDK v6 specifics. The design was then implemented with a single-shot prompt in Cline, "Implement the features in @/DESIGN_DOC.md", and double-checked with "Confirm which tasks are completed in the @/DESIGN_DOC.md and implement any missing features" (no new features were added by this prompt).

A few issues with the generated code needed to be addressed; they fell into three categories.

  1. The expected issues are quirks from combining Pug templates with Tailwind CSS utility classes. Every LLM has struggled with the Tailwind-specific characters that Pug chokes on (like the ':', '[', ']', and '/' used in Tailwind modifiers and arbitrary values), so I skim manually for stray .md:col-span-3 and dark:bg-gray-950 classes and move them into a (class="") block on the Pug tag.

  2. The file-based routing setup in VueFlowFast allows nesting templates by having a Vue file named the same as a directory. GLM-4.7 did not account for this initially when generating the experiments.vue and experiments/[id].vue files. This was fixed by renaming the directory to experiment/ and updating the references to specific experiments in `experiments.vue`.

  3. The only unpredictable issue (the above were not surprising) was a modal visibility variable that was defined and toggled but never tied to an actual modal component. A single Cline prompt handled that cleanly by referencing an existing add/edit modal instead of the undefined one.
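For the Pug/Tailwind quirk in item 1, the workaround looks like this (class names here are illustrative): Pug's shorthand class syntax cannot contain the special characters Tailwind uses, so those classes move into an explicit attribute.

```pug
//- Pug chokes on ':', '[', ']', and '/' in shorthand .class syntax,
//- so Tailwind modifier classes go in an explicit class attribute:
.p-4.rounded-lg(class="md:col-span-3 dark:bg-gray-950 hover:bg-gray-100")
  p.text-sm Experiment details
```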

A few additional features were added after the initial implementation to make the app more useful. Adding the "Test" button to the model management page was a simple UI addition plus a few lines of code that exercise the model with a simple prompt. Duplicating prompts was another simple prompt to Cline, as was adding an orientation screen (after moving the CRUDL demo page off the root path).

Comparison with Other Models

I have implemented this same design doc with multiple LLM providers. Most models struggled with the combination of Pug templates and Tailwind CSS classes. Several open-source models just didn't play well with the VueFlowFast starter project overall (the perils of falling in love with Vue SFCs in a React-heavy world).

The main difference between this GLM-4.7 implementation and the previous winner (a Claude Sonnet implementation) was that Sonnet split each of the possible screens into separate routes, while GLM-4.7 grouped many of them as tabs in the individual experiment view. Initially, I thought the groupings were a poorer choice, but as I have used it, I'm finding it more intuitive to have all the experiment details in one place.

Personal Thoughts

Overall, I'm much more impressed with the GLM-4.7 model than I was with some other open-source models. (E.g., Kimi K2 couldn't work with my Vue project without writing React code.) GLM-4.7 is a nice improvement over GLM-4.5 and slightly smoother than GLM-4.6 for my uses.

This is the first time I have used Cline, though I have used forks of this codebase before. For this style of project, it is a good choice. Most of my LLM-assisted coding is currently done in more of a "pair programming" style, but Cline is a solid choice for a more "delegated to a junior dev" style of agentic coding.

As usual, Cerebras is providing ridiculously fast responses and has spoiled me for other providers. Once they have worked out the remaining kinks on the rate limits for the GLM-4.7 model, it will be spectacular for this more agentic style of LLM assisted coding. And in the meantime, I'll keep pairing with it as my LLM driver while I navigate at a higher level of abstraction and read each line to confirm it.


🎯 What is this?

LLM Compare is a tool designed for developers, writers, and AI enthusiasts who want to systematically test how different LLM models perform with their prompts. Instead of guessing which model works best, you can run experiments, compare outputs side-by-side, and use data to make informed decisions.

Who is this for?

People who are familiar with using LLMs but want to:

  • Systematically test different prompt variations
  • Compare outputs from multiple models (OpenAI, Anthropic, Google, Cerebras, etc.)
  • Find the best model for their specific writing tasks
  • Track their prompt engineering progress over time

πŸš€ Getting Started

Prerequisites

  • Node.js and npm installed
  • API keys for your preferred LLM providers

Installation

# Install dependencies
npm install

# Start the development server
npm run dev

Quick Start (3 Steps)

  1. Add Your Models - Navigate to the Models page and add the LLMs you want to test (OpenAI, Anthropic, Cerebras, etc.)
  2. Create an Experiment - Define a testing objective with a specific goal
  3. Run & Compare - Run prompts against multiple models and compare outputs side-by-side

Check out the welcome page when you start the app for a guided walkthrough!

πŸ—οΈ Architecture

This project is built on top of the VueFlowFast template project, leveraging its generic CRUDL interface as the foundation for data management.

Tech Stack

  • Frontend: Vue 3 with Composition API
  • Unplugin Helpers: Auto-imported components and file-based routing
  • Styling: Pug HTML templates and Tailwind CSS with PrimeVue UI components
  • Data Storage: Browser-based Pinia store with persistence to localStorage
  • LLM Integration: AI SDK v6 (direct browser-to-provider communication)

Key Features

1. Experiment Management

Define experiments to group related prompt tests. Each experiment can have:

  • A clear goal and description
  • A designated "control" prompt for baseline comparison
  • Multiple prompt variations
  • Trackable progress over time

2. Dynamic Prompt Builder

Create structured prompts with custom sections:

  • Define your own section types (e.g., "System Prompt", "Background", "Voice", "Outline")
  • Reuse section types across experiments
  • Build a library of prompt components
  • System remembers your frequently used sections
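A structured prompt like this can be modeled as an ordered list of user-named sections that get joined into the final prompt text. This is a hypothetical sketch; the field names are illustrative, not the app's actual data model.

```typescript
// Hypothetical shape of a structured prompt: ordered, user-named sections.
interface PromptSection {
  type: string;    // e.g. "System Prompt", "Background", "Voice", "Outline"
  content: string;
}

// Assemble the final prompt text by joining labeled, non-empty sections in order.
function assemblePrompt(sections: PromptSection[]): string {
  return sections
    .filter(s => s.content.trim().length > 0)
    .map(s => `## ${s.type}\n${s.content.trim()}`)
    .join("\n\n");
}
```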

3. Model Management

Configure and test multiple LLM providers:

  • Supported Providers: OpenAI, Anthropic, Google/Gemini, Cerebras, OpenRouter, and custom endpoints
  • Test models before using them in experiments
  • Store API keys locally in your browser
  • Track model performance across experiments

4. Run Experiments

Execute prompts against multiple models:

  • Select a prompt and choose which models to test
  • Run complete responses or stream in real-time
  • All outputs are automatically saved with full metadata
  • Easy error handling and user feedback

5. Side-by-Side Comparison

Compare outputs from different models:

  • View two outputs side-by-side in a clean interface
  • Mark your preference with a single click
  • The "winner becomes control" - your preferred output becomes the new baseline
  • Skip comparisons to move to the next pair
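The "winner becomes control" rule reduces to a small state update; this sketch uses illustrative type and field names, not the app's actual ones.

```typescript
// Minimal sketch of the "winner becomes control" comparison rule.
interface Experiment {
  controlPromptId: string;
}

// When a comparison is decided, the preferred prompt becomes the new
// baseline; skipping a comparison leaves the current control in place.
function recordComparison(
  experiment: Experiment,
  winnerPromptId: string,
  skipped = false
): Experiment {
  if (skipped) return experiment;
  return { ...experiment, controlPromptId: winnerPromptId };
}
```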

6. Strike System

Quickly eliminate underperforming models:

  • "Three strikes, you're out" system
  • Record reasons for strikes (e.g., "Poor tone", "Off-topic", "Hallucination")
  • Automatic model elimination after threshold is reached
  • Track strike counts and reasons for each model
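The three-strikes rule is straightforward to express as a pure update; field names here are illustrative rather than the app's actual schema.

```typescript
// Sketch of the "three strikes, you're out" rule.
interface ModelStrikes {
  strikes: string[];   // recorded reasons, e.g. "Poor tone", "Hallucination"
  eliminated: boolean;
}

const STRIKE_LIMIT = 3;

// Record a strike with its reason; eliminate the model at the threshold.
function addStrike(model: ModelStrikes, reason: string): ModelStrikes {
  const strikes = [...model.strikes, reason];
  return { strikes, eliminated: strikes.length >= STRIKE_LIMIT };
}
```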

7. Progress Tracking

Monitor your experiment progress:

  • See active vs. eliminated models
  • View strike counts and reasons
  • Track total runs and prompts tested
  • Understand why models were eliminated

πŸ“Š How It Works

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Craft     β”‚ -> β”‚   Select     β”‚ -> β”‚   Run &     β”‚ -> β”‚  Compare &   β”‚
β”‚  Prompts    β”‚    β”‚   Models     β”‚    β”‚  Capture    β”‚    β”‚  Iterate     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     β”‚                   β”‚                   β”‚                   β”‚
     β–Ό                   β–Ό                   β–Ό                   β–Ό
Custom sections    Choose from your    Auto-save all      Pick winners,
with reusable      configured models   outputs with       update controls,
section names                         full metadata       eliminate poor
                                                          performers

πŸ—‚οΈ Data Model

The platform uses a generic CRUDL (Create, Read, Update, Delete, List) interface with hierarchical relationships:

  • Experiments - Top-level containers for testing objectives
  • Prompts - Structured prompt configurations within experiments
  • LLM Models - Configured model instances with API keys
  • Experiment Runs - Captured outputs linking experiments, prompts, and models
  • Prompt Section Types - Global library of reusable section names

All data is stored locally in your browser during the prototype phase.
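The hierarchy above can be sketched as TypeScript interfaces; the exact field names in the app may differ, so treat these shapes as illustrative.

```typescript
// Illustrative shapes for the five record types and their relationships.
interface Experiment {
  id: string;
  goal: string;
  controlPromptId?: string;       // the current baseline prompt
}
interface Prompt {
  id: string;
  experimentId: string;           // prompts live inside an experiment
  sections: { type: string; content: string }[];
}
interface LlmModel {
  id: string;
  provider: string;               // e.g. "openai", "anthropic", "cerebras"
  modelName: string;
  apiKey: string;                 // stored locally in the browser
}
interface ExperimentRun {
  id: string;
  experimentId: string;           // a run links an experiment,
  promptId: string;               // a prompt,
  modelId: string;                // and a model to a captured output
  output: string;
  createdAt: string;
}
interface PromptSectionType {
  id: string;
  name: string;                   // global library of reusable section names
}
```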

πŸ§ͺ AI SDK Integration

The platform uses AI SDK v6 for direct browser-to-provider communication.

Supported Providers

  • OpenAI: GPT-5, GPT-4o, etc.
  • Anthropic: Claude Opus 4.1, Sonnet 4.5, etc.
  • Google: Gemini 3 Pro, etc.
  • Cerebras: Open source models (Llama, Qwen, Z.ai) with fast inference
  • OpenRouter: Access to hundreds of models
  • Custom: Any OpenAI-compatible API
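In the app these calls go through AI SDK, but the underlying request to any OpenAI-compatible endpoint (the "Custom" option above) has a well-known shape. This hypothetical helper builds such a request without sending it; the function name and return shape are illustrative, not part of the app or of AI SDK.

```typescript
// Build (but don't send) a chat-completions request for an
// OpenAI-compatible endpoint.
function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  prompt: string
): { url: string; method: "POST"; headers: Record<string, string>; body: string } {
  return {
    url: `${baseUrl.replace(/\/$/, "")}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
      stream: false, // set true to stream tokens, as the app's run view can
    }),
  };
}
```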

πŸ”’ Privacy & Security

  • API Keys: Stored locally in your browser during the prototype phase
  • Data: All experiments, prompts, and results are stored locally
  • Communication: Direct browser-to-provider communication (no intermediate server)

Happy experimenting! πŸ§ͺ✨

About

Test different prompts across LLM providers head-to-head
