WayneBuckhanan/LLM-Compare

LLM Compare

A platform for testing, iterating on, and comparing LLM outputs across different models to help you find the best prompts and models for your specific use cases.


Built for the Cerebras / Cline / GLM-4.7 Hackathon

This started with the VueFlowFast template project that I previously built as the foundation for AI-assisted coding projects like this one. The template provides a generic CRUDL interface for data management. In this case, we stuck with the in-browser Pinia store, but the same interface can be used with a serverless backend (AWS and Cloudflare versions exist). Given the nature of this project, it makes sense to keep the data in-browser and connect to LLM providers directly from the browser.

The DESIGN_DOC.md was developed in an interactive LLM chat session and then refined with GLM-4.7 on Cerebras in Cline. The initial design assumed an AI Gateway on the backend, but with AI SDK we can now connect to providers directly from the browser while still maintaining a consistent interface. This also fits the extreme-prototyping approach that VueFlowFast enables: start with fast development in the browser, then move the same logic from the browser to a serverless function as needed. The interface to the generic CRUDL or the AI SDK calls is abstracted in the browser, so the UI and frontend business logic can be developed without worrying about which backend we are using (browser store, DynamoDB, D1, PostgreSQL, etc.).
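To illustrate the backend-swapping idea, here is a minimal sketch of that kind of CRUDL abstraction. The names (`CrudlStore`, `InMemoryStore`) are illustrative, not VueFlowFast's actual API; the point is that a DynamoDB, D1, or PostgreSQL version would implement the same interface behind a serverless function.

```typescript
// Hypothetical sketch of a backend-agnostic CRUDL interface.
interface CrudlStore<T extends { id: string }> {
  create(item: T): Promise<T>;
  read(id: string): Promise<T | undefined>;
  update(id: string, patch: Partial<T>): Promise<T | undefined>;
  remove(id: string): Promise<boolean>;
  list(): Promise<T[]>;
}

// In-browser implementation; server-backed versions would implement
// the same interface, leaving the UI code unchanged.
class InMemoryStore<T extends { id: string }> implements CrudlStore<T> {
  private items = new Map<string, T>();
  async create(item: T) { this.items.set(item.id, item); return item; }
  async read(id: string) { return this.items.get(id); }
  async update(id: string, patch: Partial<T>) {
    const existing = this.items.get(id);
    if (!existing) return undefined;
    const next = { ...existing, ...patch };
    this.items.set(id, next);
    return next;
  }
  async remove(id: string) { return this.items.delete(id); }
  async list() { return [...this.items.values()]; }
}
```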

For this project, the design doc captured the initial design idea (an early version of the README.md before it was moved to DESIGN_DOC.md), augmented with AI SDK. The "context7" MCP server was added to Cline and used to update the design doc with up-to-date AI SDK v6 specifics. The design was then implemented with a single-shot prompt in Cline, "Implement the features in @/DESIGN_DOC.md", and double-checked with "Confirm which tasks are completed in the @/DESIGN_DOC.md and implement any missing features" (no new features were added by this prompt).

A few issues with the generated code needed to be addressed; they fell into three categories.

  1. The expected issues are quirks from combining Pug templates with Tailwind CSS utility classes. Every LLM has struggled with the Tailwind-specific characters that Pug chokes on (like the ':', '[', ']', and '/' used in Tailwind modifiers and arbitrary values), so I skim manually for stray .md:col-span-3 and dark:bg-gray-950 classes and move them into a (class="") block on the Pug tag.

  2. The file-based routing setup in VueFlowFast allows nesting templates by having a Vue file named the same as a directory. GLM-4.7 did not account for this initially when generating the experiments.vue and experiments/[id].vue files. This was fixed by renaming the directory to experiment/ and updating the references to specific experiments in `experiments.vue`.

  3. The only unpredictable issue (the above were not surprising) was a modal visibility variable that was defined and toggled but never tied to an actual modal component. A single Cline prompt handled that cleanly by referencing an existing add/edit modal instead of the undefined one.
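For the Pug/Tailwind quirk in item 1, the workaround looks like this (class names here are illustrative): Pug's shorthand class syntax cannot contain the special characters Tailwind uses, so those classes move into an explicit attribute.

```pug
//- Pug chokes on ':', '[', ']', and '/' in shorthand .class syntax,
//- so Tailwind modifier classes go in an explicit class attribute:
.p-4.rounded-lg(class="md:col-span-3 dark:bg-gray-950 hover:bg-gray-100")
  p.text-sm Experiment details
```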

A few additional features were added after the initial implementation to make the app more useful. Adding the "Test" button to the model management page was a simple UI addition plus a few lines of code that exercise the model with a simple prompt. Duplicating prompts was another simple prompt to Cline, as was adding an orientation screen (after moving the CRUDL demo page off the root path).

Comparison with Other Models

I have implemented this same design doc with multiple LLM providers. Most models struggled with the combination of Pug templates and Tailwind CSS classes. Several open-source models just didn't play well with the VueFlowFast starter project overall (the perils of falling in love with Vue SFCs in a React-heavy world).

The main difference between this GLM-4.7 implementation and the previous winner (a Claude Sonnet implementation) was that Sonnet split each of the possible screens into separate routes, while GLM-4.7 grouped many of them as tabs in the individual experiment view. Initially, I thought the groupings were a poorer choice, but as I have used it, I'm finding it more intuitive to have all the experiment details in one place.

Personal Thoughts

Overall, I'm much more impressed with the GLM-4.7 model than I was with some other open-source models. (E.g., Kimi K2 couldn't work with my Vue project without writing React code.) GLM-4.7 is a nice improvement over GLM-4.5 and slightly smoother than GLM-4.6 for my uses.

This is the first time I have used Cline, though I have used forks of this codebase before. For this style of project, it is a good choice. Most of my LLM-assisted coding is currently done in more of a "pair programming" style, but Cline is a solid choice for a more "delegated to a junior dev" style of agentic coding.

As usual, Cerebras is providing ridiculously fast responses and has spoiled me for other providers. Once they have worked out the remaining kinks on the rate limits for the GLM-4.7 model, it will be spectacular for this more agentic style of LLM assisted coding. And in the meantime, I'll keep pairing with it as my LLM driver while I navigate at a higher level of abstraction and read each line to confirm it.


🎯 What is this?

LLM Compare is a tool designed for developers, writers, and AI enthusiasts who want to systematically test how different LLM models perform with their prompts. Instead of guessing which model works best, you can run experiments, compare outputs side-by-side, and use data to make informed decisions.

Who is this for?

People who are familiar with using LLMs but want to:

  • Systematically test different prompt variations
  • Compare outputs from multiple models (OpenAI, Anthropic, Google, Cerebras, etc.)
  • Find the best model for their specific writing tasks
  • Track their prompt engineering progress over time

πŸš€ Getting Started

Prerequisites

  • Node.js and npm installed
  • API keys for your preferred LLM providers

Installation

# Install dependencies
npm install

# Start the development server
npm run dev

Quick Start (3 Steps)

  1. Add Your Models - Navigate to the Models page and add the LLMs you want to test (OpenAI, Anthropic, Cerebras, etc.)
  2. Create an Experiment - Define a testing objective with a specific goal
  3. Run & Compare - Run prompts against multiple models and compare outputs side-by-side

Check out the welcome page when you start the app for a guided walkthrough!

πŸ—οΈ Architecture

This project is built on top of the VueFlowFast template project, leveraging its generic CRUDL interface as the foundation for data management.

Tech Stack

  • Frontend: Vue 3 with Composition API
  • Unplugin Helpers: Auto-imported components and file-based routing
  • Styling: Pug HTML templates and Tailwind CSS with PrimeVue UI components
  • Data Storage: Browser-based Pinia store with persistence to localStorage
  • LLM Integration: AI SDK v6 (direct browser-to-provider communication)

Key Features

1. Experiment Management

Define experiments to group related prompt tests. Each experiment can have:

  • A clear goal and description
  • A designated "control" prompt for baseline comparison
  • Multiple prompt variations
  • Trackable progress over time

2. Dynamic Prompt Builder

Create structured prompts with custom sections:

  • Define your own section types (e.g., "System Prompt", "Background", "Voice", "Outline")
  • Reuse section types across experiments
  • Build a library of prompt components
  • System remembers your frequently used sections
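A structured prompt like this can be modeled as an ordered list of user-named sections that get joined into the final prompt text. This is a hypothetical sketch; the field names are illustrative, not the app's actual data model.

```typescript
// Hypothetical shape of a structured prompt: ordered, user-named sections.
interface PromptSection {
  type: string;    // e.g. "System Prompt", "Background", "Voice", "Outline"
  content: string;
}

// Assemble the final prompt text by joining labeled, non-empty sections in order.
function assemblePrompt(sections: PromptSection[]): string {
  return sections
    .filter(s => s.content.trim().length > 0)
    .map(s => `## ${s.type}\n${s.content.trim()}`)
    .join("\n\n");
}
```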

3. Model Management

Configure and test multiple LLM providers:

  • Supported Providers: OpenAI, Anthropic, Google/Gemini, Cerebras, OpenRouter, and custom endpoints
  • Test models before using them in experiments
  • Store API keys locally in your browser
  • Track model performance across experiments

4. Run Experiments

Execute prompts against multiple models:

  • Select a prompt and choose which models to test
  • Run complete responses or stream in real-time
  • All outputs are automatically saved with full metadata
  • Easy error handling and user feedback

5. Side-by-Side Comparison

Compare outputs from different models:

  • View two outputs side-by-side in a clean interface
  • Mark your preference with a single click
  • The "winner becomes control" - your preferred output becomes the new baseline
  • Skip comparisons to move to the next pair
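The "winner becomes control" rule reduces to a small state update; this sketch uses illustrative type and field names, not the app's actual ones.

```typescript
// Minimal sketch of the "winner becomes control" comparison rule.
interface Experiment {
  controlPromptId: string;
}

// When a comparison is decided, the preferred prompt becomes the new
// baseline; skipping a comparison leaves the current control in place.
function recordComparison(
  experiment: Experiment,
  winnerPromptId: string,
  skipped = false
): Experiment {
  if (skipped) return experiment;
  return { ...experiment, controlPromptId: winnerPromptId };
}
```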

6. Strike System

Quickly eliminate underperforming models:

  • "Three strikes, you're out" system
  • Record reasons for strikes (e.g., "Poor tone", "Off-topic", "Hallucination")
  • Automatic model elimination after threshold is reached
  • Track strike counts and reasons for each model
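The three-strikes rule is straightforward to express as a pure update; field names here are illustrative rather than the app's actual schema.

```typescript
// Sketch of the "three strikes, you're out" rule.
interface ModelStrikes {
  strikes: string[];   // recorded reasons, e.g. "Poor tone", "Hallucination"
  eliminated: boolean;
}

const STRIKE_LIMIT = 3;

// Record a strike with its reason; eliminate the model at the threshold.
function addStrike(model: ModelStrikes, reason: string): ModelStrikes {
  const strikes = [...model.strikes, reason];
  return { strikes, eliminated: strikes.length >= STRIKE_LIMIT };
}
```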

7. Progress Tracking

Monitor your experiment progress:

  • See active vs. eliminated models
  • View strike counts and reasons
  • Track total runs and prompts tested
  • Understand why models were eliminated

πŸ“Š How It Works

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Craft     β”‚ -> β”‚   Select     β”‚ -> β”‚   Run &     β”‚ -> β”‚  Compare &   β”‚
β”‚  Prompts    β”‚    β”‚   Models     β”‚    β”‚  Capture    β”‚    β”‚  Iterate     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     β”‚                   β”‚                   β”‚                   β”‚
     β–Ό                   β–Ό                   β–Ό                   β–Ό
Custom sections    Choose from your    Auto-save all      Pick winners,
with reusable      configured models   outputs with       update controls,
section names                         full metadata       eliminate poor
                                                          performers

πŸ—‚οΈ Data Model

The platform uses a generic CRUDL (Create, Read, Update, Delete, List) interface with hierarchical relationships:

  • Experiments - Top-level containers for testing objectives
  • Prompts - Structured prompt configurations within experiments
  • LLM Models - Configured model instances with API keys
  • Experiment Runs - Captured outputs linking experiments, prompts, and models
  • Prompt Section Types - Global library of reusable section names

All data is stored locally in your browser during the prototype phase.
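The hierarchy above can be sketched as TypeScript interfaces; the exact field names in the app may differ, so treat these shapes as illustrative.

```typescript
// Illustrative shapes for the five record types and their relationships.
interface Experiment {
  id: string;
  goal: string;
  controlPromptId?: string;       // the current baseline prompt
}
interface Prompt {
  id: string;
  experimentId: string;           // prompts live inside an experiment
  sections: { type: string; content: string }[];
}
interface LlmModel {
  id: string;
  provider: string;               // e.g. "openai", "anthropic", "cerebras"
  modelName: string;
  apiKey: string;                 // stored locally in the browser
}
interface ExperimentRun {
  id: string;
  experimentId: string;           // a run links an experiment,
  promptId: string;               // a prompt,
  modelId: string;                // and a model to a captured output
  output: string;
  createdAt: string;
}
interface PromptSectionType {
  id: string;
  name: string;                   // global library of reusable section names
}
```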

πŸ§ͺ AI SDK Integration

The platform uses AI SDK v6 for direct browser-to-provider communication.

Supported Providers

  • OpenAI: GPT-5, GPT-4o, etc.
  • Anthropic: Claude Opus 4.1, Sonnet 4.5, etc.
  • Google: Gemini 3 Pro, etc.
  • Cerebras: Open source models (Llama, Qwen, Z.ai) with fast inference
  • OpenRouter: Access to hundreds of models
  • Custom: Any OpenAI-compatible API
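In the app these calls go through AI SDK, but the underlying request to any OpenAI-compatible endpoint (the "Custom" option above) has a well-known shape. This hypothetical helper builds such a request without sending it; the function name and return shape are illustrative, not part of the app or of AI SDK.

```typescript
// Build (but don't send) a chat-completions request for an
// OpenAI-compatible endpoint.
function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  prompt: string
): { url: string; method: "POST"; headers: Record<string, string>; body: string } {
  return {
    url: `${baseUrl.replace(/\/$/, "")}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
      stream: false, // set true to stream tokens, as the app's run view can
    }),
  };
}
```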

πŸ”’ Privacy & Security

  • API Keys: Stored locally in your browser during the prototype phase
  • Data: All experiments, prompts, and results are stored locally
  • Communication: Direct browser-to-provider communication (no intermediate server)

Happy experimenting! πŸ§ͺ✨

About

Test different prompts across LLM providers head-to-head
