
# Prompt Technique Benchmark Suite

A rigorous comparison of core prompt engineering strategies on a real-world classification task, with automated scoring and Langfuse observability.

## What It Does

Benchmarks 4 prompt techniques on customer support intent classification across 10 labeled test cases:

| Technique | Description |
| --- | --- |
| `zero_shot` | Direct instruction, no examples |
| `few_shot` | 4 labeled examples in context |
| `cot` | Step-by-step reasoning before answering |
| `self_consistency` | 3 samples, majority vote |

Metrics collected per run: accuracy, latency, token usage — logged to Langfuse for traceability.
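Of the four techniques, self-consistency is the only one with extra control flow. A minimal sketch of the sample-and-vote step (the `classify` callable and the stub answers below are hypothetical stand-ins for the real model call, which samples at temperature 0.7):

```python
from collections import Counter

def self_consistency(classify, prompt: str, n: int = 3) -> str:
    """Sample the model n times and return the majority-vote label."""
    votes = [classify(prompt) for _ in range(n)]
    label, _ = Counter(votes).most_common(1)[0]
    return label

# Stubbed model that answers inconsistently across samples:
answers = iter(["billing", "refund", "billing"])
result = self_consistency(lambda p: next(answers), "Classify: 'I was charged twice'")
# Majority vote settles on "billing" despite the one disagreeing sample.
```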

## Project Structure

```text
prompt-bench/
├── benchmark.py          # Main runner — executes all techniques
├── analyze_results.py    # Generates markdown findings report
├── findings.md           # Analysis output (generated after run)
├── requirements.txt
├── .env.example
├── evals/
│   └── scorer.py         # Exact match + LLM-as-judge scoring logic
├── utils/
│   └── langfuse_logger.py # Observability — traces, scores, metadata
└── results/
    └── run_YYYYMMDD_HHMMSS.json  # Per-run output (auto-generated)
```

## Setup

```bash
# 1. Clone and install
git clone https://github.com/Codegrammer999/prompt-bench.git
cd prompt-bench
pip install -r requirements.txt

# 2. Configure environment
cp .env.example .env
# Edit .env with your API keys

# 3. Run benchmark
python benchmark.py

# 4. Generate findings report
python analyze_results.py
```

## Configuration

Edit `.env` to switch models (LiteLLM format):

```bash
BENCHMARK_MODEL=gemini/gemini-2.5-flash   # default
BENCHMARK_MODEL=gpt-4o-mini               # OpenAI
BENCHMARK_MODEL=ollama/phi3.5             # local
```
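A sketch of how the runner could pick up this setting (the variable handling here is an assumption about `benchmark.py`, not a copy of it; LiteLLM routes on the `provider/model` prefix, which is why one env var is enough to swap providers):

```python
import os

# Fall back to the documented default when BENCHMARK_MODEL is unset.
model = os.getenv("BENCHMARK_MODEL", "gemini/gemini-2.5-flash")

# LiteLLM treats bare model names (no "provider/" prefix) as OpenAI models.
provider = model.split("/", 1)[0] if "/" in model else "openai"
```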

## ⚠️ Rate Limiting (Free Tier)

The Gemini free tier allows only 5 requests per minute. The benchmark handles this automatically:

- A 15-second delay is added between every API call
- If a 429 error still occurs, the runner auto-retries up to 3 times with increasing backoff (60s → 120s → 180s)
- The full benchmark (~40 calls) takes approximately 10 minutes on the free tier

To remove the delay if you have a paid API key:

```bash
# In your .env
RATE_LIMIT_DELAY=0
```
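The retry behavior described above can be sketched as follows (a minimal illustration, not the actual runner code; `call` stands in for the real LiteLLM completion call, and a 429 is represented by any exception mentioning it):

```python
import time

def call_with_backoff(call, max_retries: int = 3, base_delay: int = 60):
    """Retry rate-limited calls with increasing waits: 60s, 120s, 180s."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that isn't a rate limit, or when retries are spent.
            if "429" not in str(exc) or attempt == max_retries:
                raise
            time.sleep(base_delay * (attempt + 1))  # 60 → 120 → 180
```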

## Observability

Langfuse integration logs every run with:

- Per-technique traces and generations
- Accuracy scores per test case
- Token usage and latency metadata

Set `LANGFUSE_SECRET_KEY` and `LANGFUSE_PUBLIC_KEY` in `.env` to enable.
Runs gracefully without Langfuse if the keys are not set.
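The graceful fallback can be as simple as checking for the keys before creating a client. A hypothetical sketch of what `utils/langfuse_logger.py` might do (callers skip logging when this returns `None`):

```python
import os

def get_langfuse_client():
    """Return a Langfuse client if both keys are configured, else None."""
    if os.getenv("LANGFUSE_SECRET_KEY") and os.getenv("LANGFUSE_PUBLIC_KEY"):
        from langfuse import Langfuse  # imported lazily so the dependency stays optional
        return Langfuse()
    return None
```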

## Results

After running, check `results/run_*.json` for raw data or run `analyze_results.py` for a formatted report.


## Key Design Decisions

- LiteLLM as the model abstraction layer: swap providers with one env var change
- Exact match scoring with label normalization: deterministic and auditable
- LLM-as-judge prompt included in `scorer.py` for ambiguous cases
- Self-consistency uses `temperature=0.7` + majority vote to reduce variance
- Langfuse traces link every prediction back to its prompt, model, and score
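Exact match with label normalization is the kind of scorer that fits in a few lines. A sketch of the idea (the function names are illustrative; the real logic lives in `evals/scorer.py`):

```python
def normalize(label: str) -> str:
    """Lowercase, trim, and drop trailing punctuation so 'Billing.' matches 'billing'."""
    return label.strip().lower().rstrip(".!")

def exact_match(prediction: str, expected: str) -> float:
    """Deterministic 1.0/0.0 score after normalizing both labels."""
    return 1.0 if normalize(prediction) == normalize(expected) else 0.0
```

Because the score is deterministic, every result in the run JSON can be re-derived and audited after the fact.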

## Extending This Project

- Add more test cases to `TEST_CASES` in `benchmark.py`
- Swap the task: change prompts and labels for summarization, extraction, etc.
- Add a new technique: implement a prompt function and add it to the techniques list
- Enable LLM-as-judge scoring for softer evaluation
- Add cost tracking using LiteLLM's `response_cost` field
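Adding a technique amounts to writing one prompt-building function and registering it. A hypothetical sketch (the `TECHNIQUES` name and the `role_prompt` technique are illustrative; `benchmark.py` may organize this differently):

```python
def role_prompt(query: str) -> str:
    """A new technique: frame the model as a support triage specialist."""
    return f"You are a support triage specialist. Classify the intent of: {query}"

TECHNIQUES = {
    "zero_shot": lambda q: f"Classify the intent of: {q}",
    "role_prompt": role_prompt,  # ← the new entry; the runner picks it up from here
}
```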

## Tech Stack

- LiteLLM: model abstraction
- Langfuse: observability & tracing
- Python 3.11+