Ascii Maze Benchmark

This is a benchmark for testing how capable different LLMs are at solving ascii mazes. Here is an example 4x4 maze:

START
 v
# #######
#       #
# ##### #
# #     #
# #######
#     # #
##### # #
#       #
####### #
       ^
   FINISH

Here is the solution:

#.#######
#.      #
#.##### #
#.#     #
#.#######
#.....# #
#####.# #
#    ...#
#######.#

The benchmark randomly generates mazes from a seed, and evaluates LLMs ability to solve the maze.

Some LLMs tend to struggle with perfectly formatting the output for some reason, so we report scores at varying string distances to the correct response.

We evaluate all models using the OpenRouter API, to keep it simple. If it's not on open router, the benchmark will not be run.

Usage

Setup

Copy .env.example to .env and add your OpenRouter API key:
```
cp .env.example .env
```
Edit the .env file and replace your_api_key_here with your actual OpenRouter API key.

Generate Example Mazes

To generate and solve an example maze:

uv run ascii-maze-benchmark generate-example WIDTH HEIGHT [--seed SEED]

Example:

uv run ascii-maze-benchmark generate-example 5 5 --seed 42

Run Benchmarks

To run benchmarks against a specific model:

uv run ascii-maze-benchmark run-benchmark MODEL_ID [OPTIONS]

Options:

--maze-sizes TEXT: Comma-separated list of maze sizes to test (format: WIDTHxHEIGHT)
--mazes-per-size INTEGER: Number of different mazes to generate per size
--seed INTEGER: Random seed for reproducible maze generation
--cache-dir TEXT: Directory to cache API responses (uses platform-specific directory if not specified)

Example:

uv run ascii-maze-benchmark run-benchmark anthropic/claude-3-haiku-20240307 --maze-sizes 3x3,4x4 --mazes-per-size 2

Development Tips

API responses are cached in a platform-specific directory:
- Linux: ~/.cache/ascii-maze-benchmark/api_responses
- macOS: ~/Library/Caches/ascii-maze-benchmark/api_responses
- Windows: C:\Users\<username>\AppData\Local\ascii-maze-benchmark\Cache\api_responses
Test the benchmarking code on a cheap model on OpenRouter first, to save costs.
Use the .env file to manage OpenRouter credentials.
Use uv for package management and running commands.
There is a src/ascii_maze_benchmark/generate_maze_script.py file you can use as a reference for maze generation logic.

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
.github/workflows		.github/workflows
src/ascii_maze_benchmark		src/ascii_maze_benchmark
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
benchmark_command()		benchmark_command()
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Ascii Maze Benchmark

Usage

Setup

Generate Example Mazes

Run Benchmarks

Development Tips

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Ascii Maze Benchmark

Usage

Setup

Generate Example Mazes

Run Benchmarks

Development Tips

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages