
# Prompt Technique Benchmark Suite

A rigorous comparison of core prompt engineering strategies on a real-world classification task, with automated scoring and Langfuse observability.

## What It Does

Benchmarks 4 prompt techniques on customer support intent classification across 10 labeled test cases:

| Technique | Description |
| --- | --- |
| `zero_shot` | Direct instruction, no examples |
| `few_shot` | 4 labeled examples in context |
| `cot` | Step-by-step reasoning before answering |
| `self_consistency` | 3 samples, majority vote |

Metrics collected per run: accuracy, latency, token usage — logged to Langfuse for traceability.
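Of the four techniques, self-consistency is the only one with extra control flow. A minimal sketch of the sample-and-vote step (the `classify` callable and the stub answers below are hypothetical stand-ins for the real model call, which samples at temperature 0.7):

```python
from collections import Counter

def self_consistency(classify, prompt: str, n: int = 3) -> str:
    """Sample the model n times and return the majority-vote label."""
    votes = [classify(prompt) for _ in range(n)]
    label, _ = Counter(votes).most_common(1)[0]
    return label

# Stubbed model that answers inconsistently across samples:
answers = iter(["billing", "refund", "billing"])
result = self_consistency(lambda p: next(answers), "Classify: 'I was charged twice'")
# Majority vote settles on "billing" despite the one disagreeing sample.
```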

## Project Structure

```text
prompt-bench/
├── benchmark.py          # Main runner — executes all techniques
├── analyze_results.py    # Generates markdown findings report
├── findings.md           # Analysis output (generated after run)
├── requirements.txt
├── .env.example
├── evals/
│   └── scorer.py         # Exact match + LLM-as-judge scoring logic
├── utils/
│   └── langfuse_logger.py # Observability — traces, scores, metadata
└── results/
    └── run_YYYYMMDD_HHMMSS.json  # Per-run output (auto-generated)
```

## Setup

```bash
# 1. Clone and install
git clone https://github.com/Codegrammer999/prompt-bench.git
cd prompt-bench
pip install -r requirements.txt

# 2. Configure environment
cp .env.example .env
# Edit .env with your API keys

# 3. Run benchmark
python benchmark.py

# 4. Generate findings report
python analyze_results.py
```

## Configuration

Edit `.env` to switch models (LiteLLM format):

```bash
BENCHMARK_MODEL=gemini/gemini-2.5-flash   # default
BENCHMARK_MODEL=gpt-4o-mini               # OpenAI
BENCHMARK_MODEL=ollama/phi3.5             # local
```
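A sketch of how the runner could pick up this setting (the variable handling here is an assumption about `benchmark.py`, not a copy of it; LiteLLM routes on the `provider/model` prefix, which is why one env var is enough to swap providers):

```python
import os

# Fall back to the documented default when BENCHMARK_MODEL is unset.
model = os.getenv("BENCHMARK_MODEL", "gemini/gemini-2.5-flash")

# LiteLLM treats bare model names (no "provider/" prefix) as OpenAI models.
provider = model.split("/", 1)[0] if "/" in model else "openai"
```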

## ⚠️ Rate Limiting (Free Tier)

The Gemini free tier allows only 5 requests per minute. The benchmark handles this automatically:

- A 15-second delay is added between every API call
- If a 429 error still occurs, the runner auto-retries up to 3 times with increasing backoff (60s → 120s → 180s)
- The full benchmark (~40 calls) takes approximately 10 minutes on the free tier

To remove the delay if you have a paid API key:

```bash
# In your .env
RATE_LIMIT_DELAY=0
```
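The retry behavior described above can be sketched as follows (a minimal illustration, not the actual runner code; `call` stands in for the real LiteLLM completion call, and a 429 is represented by any exception mentioning it):

```python
import time

def call_with_backoff(call, max_retries: int = 3, base_delay: int = 60):
    """Retry rate-limited calls with increasing waits: 60s, 120s, 180s."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that isn't a rate limit, or when retries are spent.
            if "429" not in str(exc) or attempt == max_retries:
                raise
            time.sleep(base_delay * (attempt + 1))  # 60 → 120 → 180
```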

## Observability

Langfuse integration logs every run with:

- Per-technique traces and generations
- Accuracy scores per test case
- Token usage and latency metadata

Set `LANGFUSE_SECRET_KEY` and `LANGFUSE_PUBLIC_KEY` in `.env` to enable.
Runs gracefully without Langfuse if the keys are not set.
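The graceful fallback can be as simple as checking for the keys before creating a client. A hypothetical sketch of what `utils/langfuse_logger.py` might do (callers skip logging when this returns `None`):

```python
import os

def get_langfuse_client():
    """Return a Langfuse client if both keys are configured, else None."""
    if os.getenv("LANGFUSE_SECRET_KEY") and os.getenv("LANGFUSE_PUBLIC_KEY"):
        from langfuse import Langfuse  # imported lazily so the dependency stays optional
        return Langfuse()
    return None
```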

## Results

After running, check `results/run_*.json` for raw data or run `analyze_results.py` for a formatted report.


## Key Design Decisions

- LiteLLM as the model abstraction layer: swap providers with one env var change
- Exact match scoring with label normalization: deterministic and auditable
- LLM-as-judge prompt included in `scorer.py` for ambiguous cases
- Self-consistency uses `temperature=0.7` + majority vote to reduce variance
- Langfuse traces link every prediction back to its prompt, model, and score
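Exact match with label normalization is the kind of scorer that fits in a few lines. A sketch of the idea (the function names are illustrative; the real logic lives in `evals/scorer.py`):

```python
def normalize(label: str) -> str:
    """Lowercase, trim, and drop trailing punctuation so 'Billing.' matches 'billing'."""
    return label.strip().lower().rstrip(".!")

def exact_match(prediction: str, expected: str) -> float:
    """Deterministic 1.0/0.0 score after normalizing both labels."""
    return 1.0 if normalize(prediction) == normalize(expected) else 0.0
```

Because the score is deterministic, every result in the run JSON can be re-derived and audited after the fact.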

## Extending This Project

- Add more test cases to `TEST_CASES` in `benchmark.py`
- Swap the task: change prompts and labels for summarization, extraction, etc.
- Add a new technique: implement a prompt function and add it to the techniques list
- Enable LLM-as-judge scoring for softer evaluation
- Add cost tracking using LiteLLM's `response_cost` field
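Adding a technique amounts to writing one prompt-building function and registering it. A hypothetical sketch (the `TECHNIQUES` name and the `role_prompt` technique are illustrative; `benchmark.py` may organize this differently):

```python
def role_prompt(query: str) -> str:
    """A new technique: frame the model as a support triage specialist."""
    return f"You are a support triage specialist. Classify the intent of: {query}"

TECHNIQUES = {
    "zero_shot": lambda q: f"Classify the intent of: {q}",
    "role_prompt": role_prompt,  # ← the new entry; the runner picks it up from here
}
```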

## Tech Stack

- LiteLLM: model abstraction
- Langfuse: observability & tracing
- Python 3.11+