Codegrammer999/prompt-bench
Prompt Technique Benchmark Suite

A rigorous comparison of core prompt engineering strategies on a real-world classification task, with automated scoring and Langfuse observability.

What It Does

Benchmarks 4 prompt techniques on customer support intent classification across 10 labeled test cases:

Technique          Description
zero_shot          Direct instruction, no examples
few_shot           4 labeled examples in context
cot                Step-by-step reasoning before answering
self_consistency   3 samples, majority vote

Metrics collected per run: accuracy, latency, token usage — logged to Langfuse for traceability.
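In code, the first three techniques reduce to different prompt builders; self_consistency reuses one of them and samples it 3 times (see the majority-vote note under Key Design Decisions). A minimal sketch — the function names, example messages, and prompt wording here are illustrative, not the actual benchmark.py code:

```python
# Hypothetical prompt builders for the benchmarked techniques.
# Labels and examples are made up for illustration.
FEW_SHOT_EXAMPLES = [
    ("I was charged twice this month", "billing"),
    ("How do I reset my password?", "account"),
    ("The app crashes on startup", "technical"),
    ("Do you offer an enterprise plan?", "sales"),
]

def zero_shot(query: str) -> str:
    # Direct instruction, no examples.
    return f"Classify the customer message into an intent label.\nMessage: {query}\nIntent:"

def few_shot(query: str) -> str:
    # 4 labeled examples in context before the query.
    shots = "\n".join(f"Message: {q}\nIntent: {a}" for q, a in FEW_SHOT_EXAMPLES)
    return (f"Classify the customer message into an intent label.\n"
            f"{shots}\nMessage: {query}\nIntent:")

def cot(query: str) -> str:
    # Ask for step-by-step reasoning before the final label.
    return (f"Classify the customer message into an intent label.\n"
            f"Message: {query}\n"
            f"Think step by step, then give the final intent on the last line.")
```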

Project Structure

prompt-bench/
├── benchmark.py          # Main runner — executes all techniques
├── analyze_results.py    # Generates markdown findings report
├── findings.md           # Analysis output (generated after run)
├── requirements.txt
├── .env.example
├── evals/
│   └── scorer.py         # Exact match + LLM-as-judge scoring logic
├── utils/
│   └── langfuse_logger.py # Observability — traces, scores, metadata
└── results/
    └── run_YYYYMMDD_HHMMSS.json  # Per-run output (auto-generated)

Setup

# 1. Clone and install
git clone https://github.com/Codegrammer999/prompt-bench.git
cd prompt-bench
pip install -r requirements.txt

# 2. Configure environment
cp .env.example .env
# Edit .env with your API keys

# 3. Run benchmark
python benchmark.py

# 4. Generate findings report
python analyze_results.py

Configuration

Edit .env to switch models (LiteLLM format):

BENCHMARK_MODEL=gemini/gemini-2.5-flash   # default
BENCHMARK_MODEL=gpt-4o-mini               # OpenAI
BENCHMARK_MODEL=ollama/phi3.5             # local
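LiteLLM routes on the `provider/model` prefix; a bare name like `gpt-4o-mini` goes to OpenAI by default. A small illustrative helper (not part of the repo) makes the convention concrete:

```python
def split_model(spec: str) -> tuple[str, str]:
    """Split a LiteLLM model string into (provider, model).

    A bare name with no "provider/" prefix is routed to OpenAI.
    """
    provider, sep, name = spec.partition("/")
    return (provider, name) if sep else ("openai", spec)
```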

⚠️ Rate Limiting (Free Tier)

The Gemini free tier allows only 5 requests per minute. The benchmark handles this automatically:

  • A 15-second delay is added between every API call
  • If a 429 error still occurs, the runner auto-retries up to 3 times with increasing backoff (60s → 120s → 180s)
  • The full benchmark (~40 calls) takes approximately 10 minutes on the free tier

To remove the delay if you have a paid API key:

# In your .env
RATE_LIMIT_DELAY=0
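The retry behavior described above can be sketched as follows. This is a hypothetical stand-in for the logic in benchmark.py, with a custom exception in place of the provider's real 429 error:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the provider's HTTP 429 error."""

RATE_LIMIT_DELAY = 15               # delay between calls; set to 0 on paid tiers
BACKOFF_SCHEDULE = [60, 120, 180]   # seconds to wait before each of the 3 retries

def call_with_retry(call, sleep=time.sleep):
    """Run call(), retrying up to 3 times with increasing backoff on rate limits."""
    for wait in [0] + BACKOFF_SCHEDULE:
        if wait:
            sleep(wait)             # back off before the retry
        try:
            return call()
        except RateLimitError:
            if wait == BACKOFF_SCHEDULE[-1]:
                raise               # out of retries
```

Injecting `sleep` as a parameter keeps the backoff testable without real waiting.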

Observability

Langfuse integration logs every run with:

  • Per-technique traces and generations
  • Accuracy scores per test case
  • Token usage and latency metadata

Set LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY in .env to enable.
Runs gracefully without Langfuse if keys are not set.
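The graceful-degradation behavior can be sketched like this. It is a stand-in for utils/langfuse_logger.py, not the actual implementation; the no-op fallback lets the rest of the code log unconditionally:

```python
import os

class NoOpLogger:
    """Fallback when Langfuse keys are missing: every call is a silent no-op."""
    def trace(self, **kwargs):
        return self
    def score(self, **kwargs):
        return self
    def flush(self):
        pass

def get_logger():
    # Only attempt real Langfuse logging when both keys are configured.
    if os.getenv("LANGFUSE_SECRET_KEY") and os.getenv("LANGFUSE_PUBLIC_KEY"):
        from langfuse import Langfuse   # lazy import; assumes langfuse is installed
        return Langfuse()               # the SDK reads its keys from the environment
    return NoOpLogger()
```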

Results

After running, check results/run_*.json for raw data or run analyze_results.py for a formatted report.


Key Design Decisions

  • LiteLLM as the model abstraction layer — swap providers with one env var change
  • Exact match scoring with label normalization — deterministic and auditable
  • LLM-as-judge prompt included in scorer.py for ambiguous cases
  • Self-consistency uses temperature=0.7 + majority vote to reduce variance
  • Langfuse traces link every prediction back to its prompt, model, and score
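Two of these decisions — normalized exact-match scoring and self-consistency's majority vote — can be sketched in a few lines. The normalization rules here are illustrative; the actual logic lives in evals/scorer.py and benchmark.py:

```python
from collections import Counter

def normalize(label: str) -> str:
    """Label normalization for exact-match scoring (hypothetical rules)."""
    return label.strip().strip(".").lower()

def exact_match(prediction: str, gold: str) -> bool:
    # Deterministic and auditable: no judge model involved.
    return normalize(prediction) == normalize(gold)

def majority_vote(samples: list[str]) -> str:
    """Self-consistency: most common normalized answer across the 3 samples."""
    return Counter(normalize(s) for s in samples).most_common(1)[0][0]
```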

Extending This Project

  • Add more test cases to TEST_CASES in benchmark.py
  • Swap the task — change prompts + labels for summarization, extraction, etc.
  • Add a new technique — implement a prompt function and add it to the techniques list
  • Enable LLM-as-judge scoring for softer evaluation
  • Add cost tracking using LiteLLM's response_cost field
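Adding a technique boils down to one prompt function plus a registry entry. The registry shape below is a guess at how benchmark.py organizes its techniques list; the new `role_prompt` function is a made-up example:

```python
def role_prompt(query: str) -> str:
    # Hypothetical new technique: persona/role prompting.
    return ("You are a senior support triage agent. "
            f"Classify this message into an intent label.\nMessage: {query}\nIntent:")

TECHNIQUES = {
    "zero_shot": ...,              # existing entries elided
    "role_prompt": role_prompt,    # new technique: just register it here
}
```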

Tech Stack

  • LiteLLM — model abstraction
  • Langfuse — observability & tracing
  • Python 3.11+

About

This is a benchmark suite comparing zero-shot, few-shot, Chain-of-Thought, and self-consistency on a classification task. Each run is traced in Langfuse with accuracy scores, latency, and token usage. The generated findings.md report documents which technique wins and why.
