A platform for testing, iterating on, and comparing LLM outputs across different models to help you find the best prompts and models for your specific use cases.
This started with the VueFlowFast template project I previously built as the foundation for AI-assisted coding projects like this one. The template provides a generic CRUDL interface for data management. In this case, we stuck with the in-browser Pinia store, but the same interface works consistently with a serverless backend (AWS and Cloudflare versions exist). Given the nature of this project, it makes sense to keep the data in-browser and connect directly to LLM providers from the browser.
The DESIGN_DOC.md was developed in an interactive LLM chat session and then refined with GLM-4.7 on Cerebras in Cline. The initial design assumed an AI Gateway on the backend, but with AI SDK we can now connect directly to providers from the browser while still maintaining a consistent interface. This also fits the extreme prototyping approach that VueFlowFast enables: start with fast development in the browser, then move the same logic from the browser to a serverless function as needed. The interface to the generic CRUDL or the AI SDK calls gets abstracted in the browser, so the UI and frontend business logic can be developed without worrying about which backend we are using (browser store, DynamoDB, D1, PostgreSQL, etc.).
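That backend abstraction can be sketched as a minimal TypeScript interface. This is illustrative only, not the actual VueFlowFast API; all names here are mine:

```typescript
// Hypothetical sketch of a generic CRUDL abstraction: the UI codes against
// this interface, and the backing store (in-browser Pinia, DynamoDB, D1,
// PostgreSQL, ...) is swapped behind it without touching UI code.
interface CrudlStore<T extends { id: string }> {
  create(item: Omit<T, "id">): T;
  read(id: string): T | undefined;
  update(id: string, patch: Partial<T>): T | undefined;
  remove(id: string): boolean;
  list(): T[];
}

// In-browser stand-in, analogous to the Pinia store used while prototyping.
function makeMemoryStore<T extends { id: string }>(): CrudlStore<T> {
  const items = new Map<string, T>();
  let nextId = 1;
  return {
    create(item) {
      // Assign a fresh id; the cast is safe because T's only required
      // structural constraint is the string id we just supplied.
      const full = { ...item, id: String(nextId++) } as unknown as T;
      items.set(full.id, full);
      return full;
    },
    read: (id) => items.get(id),
    update(id, patch) {
      const existing = items.get(id);
      if (!existing) return undefined;
      const merged = Object.assign({}, existing, patch);
      items.set(id, merged);
      return merged;
    },
    remove: (id) => items.delete(id),
    list: () => [...items.values()],
  };
}

// Example: the experiments UI only ever sees CrudlStore<Experiment>.
type Experiment = { id: string; name: string };
const experiments = makeMemoryStore<Experiment>();
const e = experiments.create({ name: "Blog intro tone" });
```

Swapping in a serverless backend then means providing another object that satisfies the same interface, leaving the UI untouched.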
For this project, the design doc captured the initial design idea (an early version of the README.md before being moved to DESIGN_DOC.md) augmented with AI SDK. The "context7" MCP server was added to Cline and used to update the design doc with up-to-date AI SDK v6 specifics. The design was then implemented with a single-shot prompt in Cline of "Implement the features in @/DESIGN_DOC.md" and double-checked with "Confirm which tasks are completed in the @/DESIGN_DOC.md and implement any missing features" (no new features were added by this second prompt).
A few issues with the generated code needed to be addressed, and they fell into a few categories:

- The expected issues were quirks from combining Pug templates (for clean HTML) with Tailwind CSS utility classes. Every LLM has struggled to avoid the Tailwind-specific characters that Pug chokes on (like `:`, `[]`, and `/` in modifiers and arbitrary values), so I skim manually for stray classes like `md:col-span-3` and `dark:bg-gray-950` and move them into a quoted `class=""` attribute on the Pug tag.
- The file-based routing setup in VueFlowFast allows nesting templates by naming a Vue file the same as a directory. GLM-4.7 did not account for this when it generated the `experiments.vue` and `experiments/[id].vue` files. The fix was renaming the directory to `experiment/` and updating the references to specific experiments in `experiments.vue`.
- The only unpredictable issue (the ones above were not surprising) was a modal visibility variable that was defined and toggled but never tied to an actual modal component. A single Cline prompt handled that cleanly by referencing the existing add/edit modal instead of the undefined one.
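The Pug/Tailwind quirk looks like this (a minimal sketch; the class names are examples from the fixes above):

```pug
//- Pug's inline class shorthand chokes on ':', '[', and '/':
//-   div.md:col-span-3.dark:bg-gray-950   <- parse trouble
//- Moving the offending classes into a quoted class attribute fixes it:
div.p-4(class="md:col-span-3 dark:bg-gray-950")
```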
A few additional features were added after the initial implementation to make the app more useful. Adding the "Test" button to the model management page was a simple addition to the UI and a few lines of code to test the model with a simple prompt. Duplicating prompts was another simple prompt to Cline. Adding an orientation screen (after moving the CRUDL demo page off the root path) was another single prompt to Cline.
I have implemented this same design doc with multiple LLM providers. Most models struggled with the combination of Pug templates and Tailwind CSS classes. Several open source models just didn't play well with the VueFlowFast starter project overall (the perils of falling in love with Vue SFCs in a React heavy world).
The main difference between this GLM-4.7 implementation and the previous winner (a Claude Sonnet implementation) was that Sonnet split each of the possible screens into separate routes, while GLM-4.7 grouped many of them as tabs in the individual experiment view. Initially, I thought the grouping was the poorer choice, but as I have used it, I'm finding it more intuitive to have all the experiment details in one place.
Overall, I'm much more impressed with GLM-4.7 than I was with some other open source models. (E.g., Kimi K2 couldn't work with my Vue project without writing React code.) GLM-4.7 is a nice improvement over GLM-4.5 and slightly smoother than GLM-4.6 for my uses.
This is the first time I am using Cline, but I have used forks of this codebase before. For this style of project, it is a good choice. Most of my LLM-assisted coding is currently done in more of a "pair programming" style, but Cline is a solid choice for a more "delegated to a junior dev" style of agentic coding.
As usual, Cerebras is providing ridiculously fast responses and has spoiled me for other providers. Once they have worked out the remaining kinks on the rate limits for the GLM-4.7 model, it will be spectacular for this more agentic style of LLM assisted coding. And in the meantime, I'll keep pairing with it as my LLM driver while I navigate at a higher level of abstraction and read each line to confirm it.
LLM Compare is a tool designed for developers, writers, and AI enthusiasts who want to systematically test how different LLM models perform with their prompts. Instead of guessing which model works best, you can run experiments, compare outputs side-by-side, and use data to make informed decisions.
People who are familiar with using LLMs but want to:
- Systematically test different prompt variations
- Compare outputs from multiple models (OpenAI, Anthropic, Google, Cerebras, etc.)
- Find the best model for their specific writing tasks
- Track their prompt engineering progress over time
- Node.js and npm installed
- API keys for your preferred LLM providers
```sh
# Install dependencies
npm install

# Start the development server
npm run dev
```

- Add Your Models - Navigate to the Models page and add the LLMs you want to test (OpenAI, Anthropic, Cerebras, etc.)
- Create an Experiment - Define a testing objective with a specific goal
- Run & Compare - Run prompts against multiple models and compare outputs side-by-side
Check out the welcome page when you start the app for a guided walkthrough!
This project is built on top of the VueFlowFast template project, leveraging its generic CRUDL interface as the foundation for data management.
- Frontend: Vue 3 with Composition API
- Unplugin Helpers: Auto-imported components and file-based routing
- Styling: Pug HTML templates and Tailwind CSS with PrimeVue UI components
- Data Storage: Browser-based Pinia store with persistence to localStorage
- LLM Integration: AI SDK v6 (direct browser-to-provider communication)
Define experiments to group related prompt tests. Each experiment can have:
- A clear goal and description
- A designated "control" prompt for baseline comparison
- Multiple prompt variations
- Trackable progress over time
Create structured prompts with custom sections:
- Define your own section types (e.g., "System Prompt", "Background", "Voice", "Outline")
- Reuse section types across experiments
- Build a library of prompt components
- System remembers your frequently used sections
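Assembling the final prompt from named sections could look like the sketch below. This is an illustration of the idea, not the app's actual implementation; the section separator and heading format are my assumptions:

```typescript
// Hypothetical sketch: a structured prompt is an ordered list of named
// sections, joined into the final text sent to the model.
type PromptSection = { type: string; content: string };

function assemblePrompt(sections: PromptSection[]): string {
  return sections
    .filter((s) => s.content.trim().length > 0) // drop empty sections
    .map((s) => `## ${s.type}\n${s.content.trim()}`)
    .join("\n\n");
}

const prompt = assemblePrompt([
  { type: "Background", content: "You write release notes." },
  { type: "Voice", content: "Concise and friendly." },
  { type: "Outline", content: "" }, // empty, so it is dropped
]);
```

Because section types are just strings in a global library, the same "Voice" or "Background" section can be reused across experiments.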
Configure and test multiple LLM providers:
- Supported Providers: OpenAI, Anthropic, Google/Gemini, Cerebras, OpenRouter, and custom endpoints
- Test models before using them in experiments
- Store API keys locally in your browser
- Track model performance across experiments
Execute prompts against multiple models:
- Select a prompt and choose which models to test
- Run complete responses or stream in real-time
- All outputs are automatically saved with full metadata
- Easy error handling and user feedback
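Streaming in the UI reduces to appending chunks as they arrive. Here is a provider-agnostic sketch; real provider streams are async iterables (AI SDK exposes a similar stream of text chunks), but a synchronous generator keeps the sketch runnable anywhere:

```typescript
// fakeStream stands in for a provider's real token stream.
function* fakeStream(chunks: string[]): Generator<string> {
  for (const c of chunks) yield c;
}

// Accumulate the stream, notifying the UI with each partial result.
function collectStream(
  stream: Iterable<string>,
  onChunk: (partial: string) => void,
): string {
  let text = "";
  for (const chunk of stream) {
    text += chunk;
    onChunk(text); // e.g. update a reactive ref so the UI re-renders
  }
  return text; // final output, saved with the run's metadata
}

const partials: string[] = [];
const finalText = collectStream(fakeStream(["Hel", "lo, ", "world"]), (p) =>
  partials.push(p),
);
```

The same accumulate-and-save shape covers both modes: "run complete" is just consuming the whole stream before rendering, while streaming renders each partial.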
Compare outputs from different models:
- View two outputs side-by-side in a clean interface
- Mark your preference with a single click
- The "winner becomes control" - your preferred output becomes the new baseline
- Skip comparisons to move to the next pair
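The "winner becomes control" rule is simple state. A minimal sketch, with names that are mine rather than the app's:

```typescript
// The control output is the current baseline; whichever output the user
// prefers becomes the new control for subsequent comparisons.
type Output = { id: string; modelId: string; text: string };

type ComparisonState = { control: Output };

function pickWinner(
  state: ComparisonState,
  challenger: Output,
  preferChallenger: boolean,
): ComparisonState {
  // A "skip" simply means pickWinner is never called for that pair.
  return preferChallenger ? { control: challenger } : state;
}

const a: Output = { id: "1", modelId: "model-a", text: "Draft A" };
const b: Output = { id: "2", modelId: "model-b", text: "Draft B" };
let state: ComparisonState = { control: a };
state = pickWinner(state, b, true); // user prefers the challenger
```

Over many comparisons this converges on the strongest output, tournament-style, without ever asking the user to rank everything at once.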
Quickly eliminate underperforming models:
- "Three strikes, you're out" system
- Record reasons for strikes (e.g., "Poor tone", "Off-topic", "Hallucination")
- Automatic model elimination after threshold is reached
- Track strike counts and reasons for each model
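The elimination rule can be sketched in a few lines; field names here are illustrative:

```typescript
// "Three strikes, you're out": a model is eliminated once its strike
// count reaches the threshold, and each strike keeps its reason.
type ModelRecord = {
  name: string;
  strikes: { reason: string }[];
  eliminated: boolean;
};

const STRIKE_LIMIT = 3;

function recordStrike(model: ModelRecord, reason: string): ModelRecord {
  const strikes = [...model.strikes, { reason }];
  return { ...model, strikes, eliminated: strikes.length >= STRIKE_LIMIT };
}

let m: ModelRecord = { name: "some-model", strikes: [], eliminated: false };
m = recordStrike(m, "Poor tone");
m = recordStrike(m, "Off-topic");
m = recordStrike(m, "Hallucination"); // third strike eliminates the model
```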
Monitor your experiment progress:
- See active vs. eliminated models
- View strike counts and reasons
- Track total runs and prompts tested
- Understand why models were eliminated
```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│    Craft     │ -> │    Select    │ -> │    Run &     │ -> │  Compare &   │
│   Prompts    │    │    Models    │    │   Capture    │    │   Iterate    │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │                   │
       ▼                   ▼                   ▼                   ▼
Custom sections    Choose from your    Auto-save all       Pick winners,
with reusable      configured models   outputs with        update controls,
section names                          full metadata       eliminate poor
                                                           performers
```
The platform uses a generic CRUDL (Create, Read, Update, Delete, List) interface with hierarchical relationships:
- Experiments - Top-level containers for testing objectives
- Prompts - Structured prompt configurations within experiments
- LLM Models - Configured model instances with API keys
- Experiment Runs - Captured outputs linking experiments, prompts, and models
- Prompt Section Types - Global library of reusable section names
All data is stored locally in your browser during the prototype phase.
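Local persistence reduces to serializing the store on change. A sketch against a `Storage`-like interface (a Map stands in for the browser's `localStorage` so the snippet runs anywhere; in the app this role is played by the Pinia store's persistence):

```typescript
// Minimal persistence sketch over anything with localStorage's shape.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Map-backed stand-in so the example runs outside a browser.
function makeMemoryStorage(): StorageLike {
  const data = new Map<string, string>();
  return {
    getItem: (k) => data.get(k) ?? null,
    setItem: (k, v) => void data.set(k, v),
  };
}

function save<T>(storage: StorageLike, key: string, state: T): void {
  storage.setItem(key, JSON.stringify(state));
}

function load<T>(storage: StorageLike, key: string, fallback: T): T {
  const raw = storage.getItem(key);
  return raw === null ? fallback : (JSON.parse(raw) as T);
}

const storage = makeMemoryStorage();
save(storage, "experiments", [{ id: "1", name: "Tone test" }]);
const restored = load(storage, "experiments", [] as { id: string; name: string }[]);
```

Because the real `localStorage` satisfies the same interface, the browser version is a one-line swap.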
The platform uses AI SDK v6 for direct browser-to-provider communication.
- OpenAI: GPT-5, GPT-4o, etc.
- Anthropic: Claude Opus 4.1, Sonnet 4.5, etc.
- Google: Gemini 3 Pro, etc.
- Cerebras: Open source models (Llama, Qwen, Z.ai) with fast inference
- OpenRouter: Access to hundreds of models
- Custom: Any OpenAI-compatible API
- API Keys: Stored locally in your browser during the prototype phase
- Data: All experiments, prompts, and results are stored locally
- Communication: Direct browser-to-provider communication (no intermediate server)
Happy experimenting! 🧪✨