A rigorous comparison of core prompt engineering strategies on a real-world classification task, with automated scoring and Langfuse observability.
Benchmarks 4 prompt techniques on customer support intent classification across 10 labeled test cases:
| Technique | Description |
|---|---|
| `zero_shot` | Direct instruction, no examples |
| `few_shot` | 4 labeled examples in context |
| `cot` | Step-by-step reasoning before answering |
| `self_consistency` | 3 samples, majority vote |
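The self-consistency row above can be sketched as a majority vote over repeated samples. This is a minimal illustration, not the repo's implementation: `sample_fn` is a hypothetical stand-in for one LLM call at nonzero temperature.

```python
from collections import Counter

def self_consistency_vote(sample_fn, n_samples=3):
    """Draw n labels and return the majority answer.

    `sample_fn` stands in for a single LLM classification call;
    the real benchmark's prompt and model wiring are not shown here.
    """
    labels = [sample_fn() for _ in range(n_samples)]
    # most_common(1) returns [(label, count)] for the top label
    return Counter(labels).most_common(1)[0][0]

# Example with canned samples (no API call)
samples = iter(["billing", "refund", "billing"])
print(self_consistency_vote(lambda: next(samples)))  # billing
```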
Metrics collected per run: accuracy, latency, token usage — logged to Langfuse for traceability.
```
prompt-bench/
├── benchmark.py          # Main runner — executes all techniques
├── analyze_results.py    # Generates markdown findings report
├── findings.md           # Analysis output (generated after run)
├── requirements.txt
├── .env.example
├── evals/
│   └── scorer.py         # Exact match + LLM-as-judge scoring logic
├── utils/
│   └── langfuse_logger.py  # Observability — traces, scores, metadata
└── results/
    └── run_YYYYMMDD_HHMMSS.json  # Per-run output (auto-generated)
```
```bash
# 1. Clone and install
git clone https://github.com/Codegrammer999/prompt-bench.git
cd prompt-bench
pip install -r requirements.txt

# 2. Configure environment
cp .env.example .env
# Edit .env with your API keys

# 3. Run benchmark
python benchmark.py

# 4. Generate findings report
python analyze_results.py
```

Edit `.env` to switch models (LiteLLM format):
```bash
BENCHMARK_MODEL=gemini/gemini-2.5-flash   # default
BENCHMARK_MODEL=gpt-4o-mini               # OpenAI
BENCHMARK_MODEL=ollama/phi3.5             # local
```

The Gemini free tier allows only 5 requests per minute. The benchmark handles this automatically:
- A 15-second delay is added between every API call
- If a `429` error still occurs, the runner auto-retries up to 3 times with backoff (60s → 120s → 180s)
- The full benchmark (~40 calls) takes approximately 10 minutes on the free tier
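The retry behavior described above can be sketched roughly as follows. This is an illustration only, with hypothetical names: `fn` stands in for one LiteLLM completion call, the exception handling is deliberately broad, and the waits are parameterized so they can be shortened.

```python
import time

def call_with_retry(fn, max_retries=3, delays=(60, 120, 180)):
    """Retry `fn` on failure, waiting longer between each attempt.

    In the real runner only rate-limit (429) errors would be retried;
    here any exception triggers a retry, for brevity.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted — surface the error
            time.sleep(delays[min(attempt, len(delays) - 1)])

# Example: a call that fails twice, then succeeds (zero delays for demo)
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("429: rate limited")
    return "ok"

print(call_with_retry(flaky, delays=(0, 0, 0)))  # ok
```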
To remove the delay if you have a paid API key:
```bash
# In your .env
RATE_LIMIT_DELAY=0
```

Langfuse integration logs every run with:
- Per-technique traces and generations
- Accuracy scores per test case
- Token usage and latency metadata
Set `LANGFUSE_SECRET_KEY` and `LANGFUSE_PUBLIC_KEY` in `.env` to enable.
The benchmark runs gracefully without Langfuse if the keys are not set.
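The graceful-fallback behavior can be sketched with a no-op logger pattern. This is a generic illustration, not the repo's `utils/langfuse_logger.py`: `real_factory` and the `log` method are hypothetical names, and the real Langfuse client construction is omitted.

```python
import os

class NoopLogger:
    """No-op stand-in used when Langfuse keys are absent."""
    def log(self, **kwargs):
        pass  # silently drop traces, scores, and metadata

def get_logger(real_factory=None):
    """Return a real logger only when both Langfuse keys are set.

    `real_factory` is a hypothetical callable that would build the
    actual Langfuse-backed logger; otherwise fall back to a no-op.
    """
    keys_present = os.getenv("LANGFUSE_SECRET_KEY") and os.getenv("LANGFUSE_PUBLIC_KEY")
    if keys_present and real_factory is not None:
        return real_factory()
    return NoopLogger()
```

With this shape, benchmark code calls `logger.log(...)` unconditionally and never needs to branch on whether observability is enabled.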
After running, check `results/run_*.json` for raw data or run `analyze_results.py` for a formatted report.
- LiteLLM as the model abstraction layer — swap providers with one env var change
- Exact match scoring with label normalization — deterministic and auditable
- LLM-as-judge prompt included in `scorer.py` for ambiguous cases
- Self-consistency uses temperature=0.7 + majority vote to reduce variance
- Langfuse traces link every prediction back to its prompt, model, and score
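Exact match with label normalization, as described above, can be sketched like this. The exact normalization rules in the repo's `scorer.py` are not shown in this README, so the ones below (lowercasing, trimming, whitespace collapsing) are assumptions.

```python
def normalize_label(label: str) -> str:
    """Lowercase, trim, and collapse internal whitespace so that
    '  Billing ' and 'billing' count as the same intent."""
    return " ".join(label.lower().split())

def exact_match(predicted: str, expected: str) -> bool:
    """Deterministic, auditable score: 1:1 label comparison after normalization."""
    return normalize_label(predicted) == normalize_label(expected)

print(exact_match("  Billing", "billing"))  # True
print(exact_match("refund", "billing"))     # False
```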
- Add more test cases to `TEST_CASES` in `benchmark.py`
- Swap the task — change prompts + labels for summarization, extraction, etc.
- Add a new technique — implement a prompt function and add it to the `techniques` list
- Enable LLM-as-judge scoring for softer evaluation
- Add cost tracking using LiteLLM's `response_cost` field
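Cost tracking could be sketched as a simple accumulator over per-call costs. This is an illustration with stand-in data: how the cost is extracted from a real LiteLLM response (e.g. via its cost field or `litellm.completion_cost`) is left to the caller, and the numbers below are fabricated placeholders.

```python
def total_run_cost(responses, get_cost):
    """Sum per-call USD cost across one benchmark run.

    `get_cost` is a hypothetical extractor that pulls the cost
    out of a single response object.
    """
    return sum(get_cost(r) for r in responses)

# Stand-in responses carrying a hypothetical `response_cost` value
fake_responses = [{"response_cost": 0.0003}, {"response_cost": 0.0005}]
print(round(total_run_cost(fake_responses, lambda r: r["response_cost"]), 4))  # 0.0008
```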
