Merged
10 changes: 10 additions & 0 deletions benchmark/CLAUDE.skill.md
@@ -0,0 +1,10 @@
This repository has .graph.* files next to source files containing code relationship data from Supermodel.

The naming convention: for src/Foo.py the graph file is src/Foo.graph.py (insert .graph before the extension). Each graph file has up to three sections:
- [deps] — what this file imports and what imports it
- [calls] — function call relationships with file paths and line numbers
- [impact] — blast radius: risk level, affected domains, direct/transitive dependents

**Read the .graph file before the source file.** It shows the full dependency and call picture in far fewer tokens. Construct the path directly — don't ls the directory to discover it.

Before grepping to understand how code connects, check the relevant .graph files. They already answer most structural navigation questions: what calls what, what imports what, and what breaks if you change something. When you grep for a function name, .graph files appear in results showing every caller and callee — use this to navigate instead of searching for each one individually.
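
The naming convention described above (insert `.graph` before the extension) can be sketched as a small path helper. This is only an illustration: `graphPath` is a hypothetical name, not something this PR adds, and the handling of files without an extension or of dotfiles is an assumption, since the prompt does not specify it.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// graphPath is a hypothetical helper (not part of this PR). It inserts
// ".graph" before the final extension, so "src/Foo.py" becomes
// "src/Foo.graph.py". Assumption: a path with no extension simply gets
// ".graph" appended; the prompt does not say what should happen there.
func graphPath(src string) string {
	ext := filepath.Ext(src) // ".py" for "src/Foo.py", "" if there is no dot
	base := strings.TrimSuffix(src, ext)
	return base + ".graph" + ext
}

func main() {
	fmt.Println(graphPath("src/Foo.py"))
	fmt.Println(graphPath("cmd/skill.go"))
}
```

Building the path this way matches the prompt's advice to construct the graph file path directly rather than listing the directory to discover it.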
Binary file modified benchmark/results/benchmark_results.zip
Binary file not shown.
36 changes: 19 additions & 17 deletions benchmark/results/blog-post-draft.md
@@ -1,6 +1,6 @@
# 40% cheaper. 4× faster. Same correct answer.
# 60% cheaper. 4× faster. Same correct answer.

We ran a test: give Claude Code the same task twice — once by itself, once with Supermodel. Both had to make 8 failing tests pass in a 270k-line codebase. Both used the same model. Same starting point.
We ran a test: give Claude Code the same task four ways — naked, with a hand-crafted prompt, with our auto-generated prompt, and with a different shard format. All had to make 8 failing tests pass in a 270k-line codebase. Same model. Same starting point.

Here's what happened.

@@ -29,24 +29,24 @@ No plugins. No special AI tools. Just better context up front.

## Results

| | Naked Claude | + Supermodel |
|---------------------|-------------|--------------|
| **Cost** | $0.2212 | $0.1329 |
| **Turns** | 13 | 7 |
| **Duration** | 95.9s | 24.1s |
| **Cache reads** | 235,456 tok | 90,479 tok |
| **Tests passed** | ✓ YES | ✓ YES |
| Tool calls | Bash ×8, Read ×2, Write ×2 | Bash ×2, Read ×2, Glob ×1, Write ×1 |
| | Naked Claude | + Supermodel (crafted) | + Supermodel (auto) | Three-file shards |
|---------------------|-------------|------------------------|---------------------|-------------------|
| **Cost** | $0.30 | $0.12 | $0.15 | $0.25 |
| **Turns** | 20 | 9 | 11 | 16 |
| **Duration** | 122s | 29s | 42s | 73s |
| **Tests passed** | ✓ YES | ✓ YES | ✓ YES | ✓ YES |

**40% cheaper. 6 fewer turns. 72 seconds faster.**
**60% cheaper. 4× faster. 55% fewer turns.**

Both got the right answer. The only difference was how much digging each one had to do first.
All four got the right answer. The only difference was how much digging each one had to do first.

"Crafted" is a hand-written CLAUDE.md with Django-specific hints. "Auto" is what `supermodel skill` generates — a generic prompt that works on any repo. The auto prompt captured 83% of the crafted prompt's savings with zero manual effort.
Comment on lines +32 to +43
⚠️ Potential issue | 🟠 Major

Benchmark numbers are now internally inconsistent with later narrative sections.

After this table was updated, the body still reports the older values: Line 49 shows "13 turns, $0.22", Line 70 shows "7 turns, $0.13", and Line 103 says "Net result: 40% cheaper". The post now contradicts itself, which weakens its credibility.

Please align the downstream prose with the updated table before publishing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/results/blog-post-draft.md` around lines 32 - 43, The downstream
prose still contains old benchmark numbers that conflict with the updated table;
update any occurrences of the outdated stats (for example the strings "13 turns,
$0.22", "7 turns, $0.13", and "Net result: 40% cheaper") so they match the table
values (Naked Claude $0.30/20 turns/122s; + Supermodel (crafted) $0.12/9
turns/29s; + Supermodel (auto) $0.15/11 turns/42s; Three-file shards $0.25/16
turns/73s) and replace the summary line with the new aggregate claim "60%
cheaper. 4× faster. 55% fewer turns."; search the file for any other numeric
mentions of cost/turns/duration and reconcile them to these table values and
recalculated percentages so all prose is consistent with the table.
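
One way to reconcile the prose with the table is to recompute every percentage from the table values instead of editing numbers by hand. The sketch below does that for the four runs. The `run` type and the `pctLess` and `savingsShare` helpers are hypothetical names introduced for illustration, and rounding to whole percentages is an assumption about how the headline figures were derived.

```go
package main

import (
	"fmt"
	"math"
)

// run holds one column of the results table (values copied from the table).
type run struct {
	name  string
	cost  float64 // USD
	turns float64
	secs  float64
}

// pctLess reports how much smaller v is than base, as a rounded whole percent.
func pctLess(base, v float64) int {
	return int(math.Round((base - v) / base * 100))
}

// savingsShare reports what share of a's cost savings b also achieves,
// both measured against the same baseline, as a rounded whole percent.
func savingsShare(base, a, b float64) int {
	return int(math.Round((base - b) / (base - a) * 100))
}

func main() {
	naked := run{"naked", 0.30, 20, 122}
	others := []run{
		{"crafted", 0.12, 9, 29},
		{"auto", 0.15, 11, 42},
		{"three-file", 0.25, 16, 73},
	}
	for _, r := range others {
		fmt.Printf("%s: %d%% cheaper, %d%% fewer turns, %d%% less time (%.1fx faster)\n",
			r.name,
			pctLess(naked.cost, r.cost),
			pctLess(naked.turns, r.turns),
			pctLess(naked.secs, r.secs),
			naked.secs/r.secs)
	}
	// Share of the crafted prompt's cost savings that the auto prompt captures.
	fmt.Printf("auto captures %d%% of crafted savings\n",
		savingsShare(naked.cost, others[0].cost, others[1].cost))
}
```

For the crafted run this gives 60% cheaper, 55% fewer turns, and 76% less time, and for the auto run 50%, 45%, and 66%, which agrees with the updated headline and summary lines.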


---

## What actually happened

### Without Supermodel (13 turns, $0.22)
### Without Supermodel (20 turns, $0.30)

Claude read the tests, then spent 6 turns poking around to figure out how the codebase worked:

@@ -67,7 +67,7 @@ Bash: run tests → all pass

Six commands just to answer basic questions: *How does Django wire things together? Where do signals go? What version is this?* Then it wrote the code.

### With Supermodel (7 turns, $0.13)
### With Supermodel — auto prompt (11 turns, $0.15)

```
Bash: run tests → see 8 errors
@@ -82,7 +82,7 @@ No digging. The summary files had already answered the structural questions. Cla

Here's what Claude said to itself before writing, in each run:

**Without Supermodel** (after 6 exploration turns):
**Without Supermodel** (after 7+ exploration turns):
> "Now I understand the structure. I need to implement `EmailChangeRecord` in models.py and wire up signals to track email changes. I'll create an AppConfig to properly connect signals."

**With Supermodel** (before touching anything):
@@ -100,7 +100,7 @@ There are two ways to spend tokens: reading files to learn things, and writing f

The naked run read 235k tokens — mostly source files it combed through to understand the codebase. The Supermodel run read only 90k. That 145k gap is where most of the savings came from.

Here's the twist: the Supermodel run actually *wrote* more tokens (23k vs 19k), because it loaded the summary files into memory upfront. So it spent a little more on the cheap thing. But way less on the expensive thing. Net result: 40% cheaper.
Here's the twist: the Supermodel run actually *wrote* more tokens (23k vs 19k), because it loaded the summary files into memory upfront. So it spent a little more on the cheap thing. But way less on the expensive thing. Net result: 50% cheaper ($0.30 → $0.15 with the auto prompt; 60% with the hand-crafted one).

The summary files are built once. When the AI starts working, the answers are already there. It never has to go looking.

@@ -118,7 +118,9 @@ That's real exploratory work. The summary files answered all of it before Claude

The savings didn't come from a cheaper model or a smaller prompt. They came from not making the AI rediscover things the codebase already knows about itself.

On a 270k-line repo with a hard task, one analysis pass meant 6 fewer turns and 72 fewer seconds — every single time. For tasks you run over and over — reviews, debugging, new features — that adds up fast.
On a 270k-line repo with a hard task, one analysis pass meant 9 fewer turns and 80 fewer seconds with the auto prompt — or 11 fewer turns and 93 fewer seconds with a hand-crafted one. And `supermodel skill` generates the CLAUDE.md for you — no hand-tuning required, 50% cheaper than naked.

For tasks you run over and over — reviews, debugging, new features — that adds up fast.

Run the analysis once. Save on every task after.

94 changes: 94 additions & 0 deletions benchmark/results/skill-v2.txt

Large diffs are not rendered by default.

18 changes: 8 additions & 10 deletions benchmark/results/summary.md
@@ -7,17 +7,15 @@

## Results

| | naked | supermodel |
|--------------------|--------------|--------------|
| Cost | $0.2212 | $0.1329 |
| Turns | 13 | 7 |
| Duration | 95.9s | 24.1s |
| Cache tokens read | 235,456 | 90,479 |
| Cache tokens built | 18,681 | 23,281 |
| All tests passed | YES | YES |
| Tool calls | {'Bash': 8, 'Read': 2, 'Write': 2} | {'Bash': 2, 'Read': 2, 'Glob': 1, 'Write': 1} |
| | naked | supermodel (crafted) | skill (generic) | three-file |
|--------------------|--------------|----------------------|-----------------|--------------|
| Cost | $0.30 | $0.12 | $0.15 | $0.25 |
| Turns | 20 | 9 | 11 | 16 |
| Duration | 122s | 29s | 42s | 73s |
| All tests passed | YES | YES | YES | YES |

**supermodel: $0.0883 (39.9%) cheaper, 6 fewer turns, 72s faster**
**supermodel (crafted prompt): 60% cheaper, 76% faster, 55% fewer turns vs naked**
**skill (generic prompt): 50% cheaper, 66% faster, 45% fewer turns vs naked**

## How supermodel helped
The graph files gave Claude the architecture upfront. The supermodel run went straight
37 changes: 37 additions & 0 deletions cmd/skill.go
@@ -0,0 +1,37 @@
package cmd

import (
	"fmt"

	"github.com/spf13/cobra"
)
const skillPrompt = `This repository has .graph.* files next to source files containing code relationship data from Supermodel.

The naming convention: for src/Foo.py the graph file is src/Foo.graph.py (insert .graph before the extension). Each graph file has up to three sections:
- [deps] — what this file imports and what imports it
- [calls] — function call relationships with file paths and line numbers
- [impact] — blast radius: risk level, affected domains, direct/transitive dependents

**Read the .graph file before the source file.** It shows the full dependency and call picture in far fewer tokens. Construct the path directly — don't ls the directory to discover it.

Before grepping to understand how code connects, check the relevant .graph files. They already answer most structural navigation questions: what calls what, what imports what, and what breaks if you change something. When you grep for a function name, .graph files appear in results showing every caller and callee — use this to navigate instead of searching for each one individually.`

func init() {
	c := &cobra.Command{
		Use:   "skill",
		Short: "Print agent awareness prompt for graph files",
		Long: `Prints a prompt that teaches AI coding agents how to use Supermodel's
graph files. Pipe into your agent's instructions:

supermodel skill >> CLAUDE.md
supermodel skill >> AGENTS.md
supermodel skill >> .cursorrules`,
		Args: cobra.NoArgs,
		Run: func(cmd *cobra.Command, args []string) {
			fmt.Println(skillPrompt)
		},
	}

	rootCmd.AddCommand(c)
}
32 changes: 32 additions & 0 deletions cmd/skill_test.go
@@ -0,0 +1,32 @@
package cmd

import (
	"strings"
	"testing"
)

func TestSkillPrompt_ContainsKeyElements(t *testing.T) {
	required := []struct {
		substr string
		reason string
	}{
		{".graph.", "must reference graph file extension"},
		{"[deps]", "must document deps section"},
		{"[calls]", "must document calls section"},
		{"[impact]", "must document impact section"},
		{".graph.py", "must show naming convention with concrete example"},
		{"before the source file", "must instruct read-order (graph first)"},
	}

	for _, r := range required {
		if !strings.Contains(skillPrompt, r.substr) {
			t.Errorf("skill prompt missing %q — %s", r.substr, r.reason)
		}
	}
}

func TestSkillPrompt_NotEmpty(t *testing.T) {
	if len(strings.TrimSpace(skillPrompt)) < 100 {
		t.Error("skill prompt is suspiciously short")
	}
}
5 changes: 4 additions & 1 deletion internal/find/zip_test.go
@@ -174,7 +174,10 @@ func TestCreateZip_CleanGitRepo(t *testing.T) {
// TestCreateZip_CreateTempError covers L48-50: createZip returns an error when
// os.CreateTemp fails due to an invalid TMPDIR.
func TestCreateZip_CreateTempError(t *testing.T) {
t.Setenv("TMPDIR", filepath.Join(t.TempDir(), "nonexistent-tmp"))
badTmp := filepath.Join(t.TempDir(), "nonexistent-tmp")
t.Setenv("TMPDIR", badTmp)
t.Setenv("TMP", badTmp)
t.Setenv("TEMP", badTmp)
_, err := createZip(t.TempDir())
if err == nil {
t.Error("createZip should fail when os.CreateTemp fails")