MetivtaEval is an AI benchmarking platform for Torah Q&A and retrieval systems. It combines a private evaluation toolkit, a deployable API platform, and public leaderboard surfaces for teams that want to benchmark real systems instead of toy prompts. Under the hood it ships a Go gateway, FastAPI v2 application, Flask compatibility API, Celery worker, canonical Postgres-backed persistence, and a seeded Docker demo that proves the stack end-to-end.
This README is intentionally narrow. Every command and workflow below was verified in this repository before being documented.
MetivtaEval exists to answer a simple product question: if a team ships a Torah AI system, how do they measure whether it actually performs well enough to trust in production?
The platform supports two evaluation tracks:
- DAAT evaluation: End-to-end Q&A benchmarking for answer quality, scholarly structure, attribution quality, and source-grounded behavior.
- MTEB-style retrieval evaluation: Retrieval benchmarking for ranked passage quality using standard IR metrics such as nDCG, MAP, MRR, Recall, and Precision.
The product surface is built around two user journeys:
- Public benchmarking: Teams expose an API endpoint, submit it for evaluation, and compare against a shared leaderboard.
- Private development: Teams inspect the local dataset, run controlled validation, sync optional artifacts to LangSmith, and harden integrations before release.
- It benchmarks a running system, not a handcrafted answer file.
- It keeps a compatibility path for older integrations while exposing a modern FastAPI v2 surface.
- It supports both answer-generation and retrieval-only systems.
- It can run locally with Docker, then scale outward with optional SaaS integrations.
DAAT is the Q&A track. You provide an answer endpoint, the platform sends real questions, and the evaluation stack scores the responses for:
- answer completeness and correctness
- scholarly structure and citation behavior
- attribution quality through the DAAT score
- retrieval-adjacent hygiene such as response format and URL quality
- optional web validation when Browserless is configured
Your system contract for DAAT is a single HTTP endpoint:
POST /answer
Content-Type: application/json
{"question":"..."}

Expected response:

{"answer":"..."}

The bundled demo-answer service was validated against this exact contract in Docker.
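For teams wiring up their own system, the following is a minimal sketch of an endpoint that satisfies the DAAT contract above, using only the Python standard library. It is not the bundled demo-answer service. The names `build_answer`, `handle_answer_request`, `AnswerHandler`, and `make_server`, the port 5001 default, and the `{"error": ...}` body returned on bad input are all assumptions made for illustration; the contract itself only specifies the request and response shapes shown above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def build_answer(question: str) -> str:
    """Placeholder answer logic; a real system would call its model here."""
    return f"Placeholder answer for: {question}"


def handle_answer_request(body: bytes) -> Tuple[int, dict]:
    """Map a raw request body to an (HTTP status, JSON payload) pair."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        # The error body shape is an assumption, not part of the contract.
        return 400, {"error": "body must be valid JSON"}
    question = payload.get("question") if isinstance(payload, dict) else None
    if not isinstance(question, str) or not question.strip():
        return 400, {"error": "missing 'question' string"}
    return 200, {"answer": build_answer(question)}


class AnswerHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/answer":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_answer_request(self.rfile.read(length))
        data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port: int = 5001) -> HTTPServer:
    """Create (but do not start) a server; call .serve_forever() to run it."""
    return HTTPServer(("0.0.0.0", port), AnswerHandler)
```

Calling `make_server().serve_forever()` starts the stub. Keeping the request handling in a plain function makes the contract easy to unit test without opening a socket.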
MTEB is the retrieval track. You provide a retrieval endpoint, the platform sends benchmark queries, and the evaluator computes standard information retrieval metrics:
- nDCG@10
- MAP@100
- MRR@10
- Recall@100
- Precision@10
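As a rough illustration of what these metrics measure, here is a sketch of the textbook per-query definitions. It assumes binary relevance (a passage is either relevant or not), unique passage ids in the ranking, `k >= 1`, and average precision normalized by the total number of relevant passages. The platform's evaluator may use graded relevance or a different normalization, so treat the function names and choices below as assumptions rather than the repository's implementation. MAP and MRR are the means of the per-query values across all benchmark queries.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots that hold a relevant passage."""
    return sum(1 for pid in ranked[:k] if pid in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant passages that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for pid in ranked[:k] if pid in relevant) / len(relevant)


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant passage in the top k, else 0."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Mean of precision values taken at each relevant hit in the top k."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain DCG divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, pid in enumerate(ranked[:k], start=1)
        if pid in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

For a ranking `["p3", "p1", "p7", "p2"]` with relevant set `{"p1", "p2"}`, precision@4 is 0.5, recall@4 is 1.0, the reciprocal rank is 0.5, and average precision is 0.5.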
Your system contract for MTEB is:
POST /retrieve
Content-Type: application/json
{"query":"...","top_k":100}

Verified response shape:

{
  "results": [
    {"id":"passage_001","score":0.9234},
    {"id":"passage_002","score":0.8876}
  ]
}

The bundled reference_retrieval_api.py was started locally and validated through a live MTEB submission against the Docker stack.
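Before submitting a retrieval endpoint, it can help to check a response against the shape above locally. The sketch below is a hypothetical helper, not part of the repository. The checks that ids are unique and that no more than `top_k` results come back are assumptions about sensible behavior; the README only documents the `results`, `id`, and `score` fields.

```python
from typing import List


def validate_retrieve_response(payload: object, top_k: int) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    if not isinstance(payload, dict) or not isinstance(payload.get("results"), list):
        return ["top-level object must contain a 'results' list"]
    problems: List[str] = []
    results = payload["results"]
    if len(results) > top_k:
        # Assumption: a well-behaved endpoint honors the requested top_k.
        problems.append(f"returned {len(results)} results for top_k={top_k}")
    seen = set()
    for index, item in enumerate(results):
        if not isinstance(item, dict):
            problems.append(f"results[{index}] is not an object")
            continue
        pid, score = item.get("id"), item.get("score")
        if not isinstance(pid, str) or not pid:
            problems.append(f"results[{index}].id must be a non-empty string")
        elif pid in seen:
            # Assumption: duplicate ids would distort ranking metrics.
            problems.append(f"results[{index}].id duplicates '{pid}'")
        else:
            seen.add(pid)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            problems.append(f"results[{index}].score must be a number")
    return problems
```

An empty return value for the documented example payload indicates the response matches the shape; each string otherwise names one field to fix.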
- Expose a system endpoint that matches either the DAAT or MTEB contract.
- Register through the API, create credentials, and submit the endpoint URL.
- Let the platform call the system, score the results, and persist them.
- Review the final evaluation payload or leaderboard ranking.
There are two public leaderboard views in the product:
- DAAT leaderboard for end-to-end Q&A systems
- MTEB leaderboard for retrieval systems
The README preview images live in a dedicated folder:
- docs/assets/readme/previews/leaderboard.png
- docs/assets/readme/previews/signup.png
- docs/assets/readme/previews/api-reference.png
This is the public-facing ranking surface where teams compare live systems, benchmark iterations, and show progress over time.
This screenshot belongs to the runnable stack, not the public metivta.co documentation site. It
is the onboarding surface for evaluators and submitters when you launch the full application
locally or self-host it.
This is the operator and integrator surface. It should make the contract legible at a glance: authentication, evaluation payloads, and leaderboard endpoints.
The public metivta.co site does not issue API keys. The two supported issuance paths below belong
to the runnable stack you launch locally or self-host.
For user-facing onboarding inside the runnable stack, the Flask compatibility app serves a registration page at:
- local Docker demo: http://localhost:18080/signup
- self-hosted runtime: <your-runtime-base-url>/signup
That page submits to POST /register and immediately returns an API key. This path was verified
against the running stack.
The legacy registration payload is:
{
"email": "team@example.com",
"name": "Team Name",
"organization": "Optional Org"
}

Verified response shape:

{
"message": "Registration successful! Save your API key - it won't be shown again.",
"api_key": "mtv_...",
"key_prefix": "mtv_..."
}

For modern integrations, users register and then create scoped API keys through FastAPI v2:

- POST /api/v2/auth/register
- POST /api/v2/auth/login
- POST /api/v2/auth/api-keys

This path was also verified end-to-end in the Docker stack.
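For clients written in Python rather than shell, the three requests can be sketched with the standard library as follows. The paths and payload fields mirror the shell walkthrough later in this README. The helper names (`json_post`, `register_request`, `login_request`, `api_key_request`) are hypothetical, and the sketch only builds the requests. Nothing is sent until a request is passed to `urllib.request.urlopen`.

```python
import json
from typing import List, Optional
from urllib.request import Request


def json_post(url: str, payload: dict, token: Optional[str] = None) -> Request:
    """Build a JSON POST request, optionally with a bearer token."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps(payload).encode("utf-8")
    return Request(url, data=body, headers=headers, method="POST")


def register_request(base_url: str, email: str, password: str,
                     name: str, organization: str) -> Request:
    payload = {"email": email, "name": name,
               "password": password, "organization": organization}
    return json_post(f"{base_url.rstrip('/')}/api/v2/auth/register", payload)


def login_request(base_url: str, email: str, password: str) -> Request:
    payload = {"email": email, "password": password}
    return json_post(f"{base_url.rstrip('/')}/api/v2/auth/login", payload)


def api_key_request(base_url: str, access_token: str,
                    name: str, scopes: List[str]) -> Request:
    payload = {"name": name, "scopes": scopes}
    return json_post(f"{base_url.rstrip('/')}/api/v2/auth/api-keys",
                     payload, token=access_token)
```

In the shell walkthrough, the login response carries the token in its `access_token` field, which is then passed as the bearer token when creating the API key.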
- Auth works end-to-end: register, login, refresh, and API-key creation.
- DAAT evaluation works from the local JSON dataset without requiring a pre-created LangSmith dataset.
- MTEB evaluation works against a live retrieval endpoint and persists to the retrieval leaderboard.
- Legacy /submit?async=true compatibility still works through the worker and status polling path.
- Leaderboards render through both FastAPI JSON APIs and the Flask dashboard.
- Scalar API docs are live at /api/v2/docs and open expanded by default.
- Optional LangSmith sync, Anthropic-backed evaluators, and Browserless-backed web validation work when credentials are configured.
- Python linting, type checking, pytest, and Go race tests pass in the current repo state.
| Component | Role |
|---|---|
| Go gateway | Front door on :18000, routes traffic to the application stack |
| FastAPI v2 | Primary API for auth, evaluations, leaderboard APIs, and docs |
| Flask compatibility API | Preserves historical /submit and /status/<task_id> contracts |
| Celery worker | Executes async evaluation jobs |
| Postgres | Canonical persistence for users, API keys, evaluations, and leaderboard state |
| Redis | Celery broker and result backend |
| Local dataset loader | Loads DAAT examples directly from JSON at runtime |
| LangSmith | Optional dataset sync and tracing, not a mandatory DAAT runtime dependency |
MetivtaEval supports both managed and self-hosted deployment targets:
- Azure Container Apps (verified): production-grade hosted deployment with custom domains and managed TLS.
- Render: blueprint-based deployment path retained for teams already on Render.
- Docker Compose: self-hosted local or VM-based deployment.
For Azure-specific setup and commands, see docs/DEPLOYMENT.md and follow the Azure Container Apps section.
If you are launching only a homepage and API reference first, set:

PUBLIC_DOCS_ONLY=true

In this mode:

- / and /api/v2/docs remain public
- /api/v2/openapi.json remains public
- evaluation and auth APIs are intentionally not exposed publicly

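The docs-only rule can be pictured as a simple path allowlist. The sketch below is purely illustrative and is not how the gateway or FastAPI app implements it. The function names, the exact-match comparison, and the assumption that only the literal value `true` enables the mode are all hypothetical.

```python
import os
from typing import Mapping, Optional

# Paths the README lists as remaining public in docs-only mode.
PUBLIC_DOCS_PATHS = {"/", "/api/v2/docs", "/api/v2/openapi.json"}


def docs_only_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read PUBLIC_DOCS_ONLY; treating only 'true' as on is an assumption."""
    env = os.environ if env is None else env
    return env.get("PUBLIC_DOCS_ONLY", "").strip().lower() == "true"


def is_publicly_served(path: str, docs_only: bool) -> bool:
    """In docs-only mode, serve only the allowlisted documentation paths."""
    if not docs_only:
        return True
    # Ignore any query string before comparing against the allowlist.
    return path.split("?", 1)[0] in PUBLIC_DOCS_PATHS
```

Under this sketch, `/api/v2/auth/login` is refused while docs-only mode is on and served again once the variable is unset.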
The public metivta.co host is documentation-only. Signup, API key issuance, leaderboard writes,
and evaluation execution belong to the runtime you launch locally or self-host.
Expand quick start (requirements, local stack, and production checks)
- Docker with Compose v2
- Python 3.11+
- uv
- jq for the shell examples
- Go 1.24+ for local Go tests

make setup

For normal local development, use the unseeded stack:

make compose-dev

For the seeded end-to-end proof path, use the demo stack:

make compose-demo

Security Note: The demo stack reads METIVTA_DEMO_PASSWORD for local auth seeding and otherwise generates an ephemeral demo credential at runtime. Do not expose a live instance configured with the demo profile to the public internet.

For live fault-injection verification against Docker, use:

make compose-faults

The stack is ready when demo-seeder exits with code 0:

docker compose --profile legacy --profile demo ps --all
docker compose --profile legacy --profile demo logs demo-seeder --tail=50

make compose-demo intentionally caps the dataset to two examples for fast local verification. For a full local benchmark run with the demo services enabled and no demo cap, start the same stack without that environment override:

docker compose --profile legacy --profile demo up -d --build

You can also override the benchmark dataset or evaluator profile at launch time without editing code:

METIVTA_DATASET_NAME=My-Benchmark \
METIVTA_DATASET_LOCAL_PATH=/app/custom-dataset \
METIVTA_DATASET_FILES_QUESTIONS=questions.json \
METIVTA_EVALUATION_DAAT_EVALUATORS=hebrew_presence,url_format,response_length,daat_score \
docker compose --profile legacy --profile demo up -d --build

- Gateway: http://localhost:18000
- Scalar API docs: http://localhost:18000/api/v2/docs
- OpenAPI JSON: http://localhost:18000/api/v2/openapi.json
- Dataset info: http://localhost:18080/dataset-info
- Runtime signup page: http://localhost:18080/signup
- Flask leaderboard: http://localhost:18080/leaderboard
These checks were validated against the running Docker stack:
curl -fsS http://localhost:18000/ready | jq .
curl -fsS http://localhost:18080/dataset-info | jq .
curl -fsS http://localhost:18080/signup >/dev/null && echo runtime_signup_ok
curl -fsS http://localhost:18080/leaderboard >/dev/null && echo leaderboard_ok
curl -fsS \
-X POST http://localhost:18080/validate-endpoint \
-H 'Content-Type: application/json' \
-d '{"endpoint_url":"http://demo-answer:5001/answer"}' | jq .

What each surface is for:

| URL | Purpose |
|---|---|
| http://localhost:18000/ready | Verify DB, Redis, API, and local DAAT dataset readiness |
| http://localhost:18000/api/v2/docs | Verify the public API contract in Scalar |
| http://localhost:18080/signup | Verify the runtime API key issuance page |
| http://localhost:18080/leaderboard | Verify the browser leaderboard experience |
| http://localhost:18080/dataset-info | Verify which dataset file the stack is actually using |
| http://localhost:18080/validate-endpoint | Verify a DAAT endpoint matches the harness contract |
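When scripting against a freshly started stack, it is often convenient to wait for the readiness URL before running anything else. The following sketch polls until a 2xx status comes back. It assumes the readiness endpoint answers with a non-2xx status (or refuses the connection) while the stack is still starting, which matches how `curl -f` is used in the checks above but is not otherwise confirmed here. The function names and default timings are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET, or 0 when the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:  # 4xx/5xx still carry a status code
        return exc.code
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return 0


def wait_until_ready(
    url: str,
    attempts: int = 30,
    delay: float = 2.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until it returns a 2xx status or attempts run out."""
    for attempt in range(attempts):
        if 200 <= fetch(url) < 300:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A call such as `wait_until_ready("http://localhost:18000/ready")` returns `True` once the gateway reports ready. The `fetch` and `sleep` parameters exist so the loop can be tested without a network.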
Expand DAAT end-to-end API walkthrough
The following shell flow was executed against the running Docker stack. It creates a user, logs in, creates an API key, submits a DAAT evaluation, fetches results, and reads the leaderboard.
Important:
- Keep endpoint_url set to http://demo-answer:5001/answer while the demo stack is running.
- That hostname is resolved from inside the Docker network by the evaluator service, not by your local shell.

BASE_URL=http://localhost:18000
EMAIL="readme-$(date +%s)@example.com"
PASSWORD="${PASSWORD:-replace-with-local-demo-password}"
REGISTER_PAYLOAD=$(jq -n \
--arg email "$EMAIL" \
--arg password "$PASSWORD" \
'{email:$email,name:"README User",password:$password,organization:"README"}')
curl -fsS \
-X POST "$BASE_URL/api/v2/auth/register" \
-H 'Content-Type: application/json' \
-d "$REGISTER_PAYLOAD"
LOGIN_PAYLOAD=$(jq -n \
--arg email "$EMAIL" \
--arg password "$PASSWORD" \
'{email:$email,password:$password}')
ACCESS_TOKEN=$(curl -fsS \
-X POST "$BASE_URL/api/v2/auth/login" \
-H 'Content-Type: application/json' \
-d "$LOGIN_PAYLOAD" | jq -r '.access_token')
curl -fsS \
-X POST "$BASE_URL/api/v2/auth/api-keys" \
-H "Authorization: Bearer $ACCESS_TOKEN" \
-H 'Content-Type: application/json' \
-d '{"name":"readme-key","scopes":["eval:read","eval:write"]}'
EVAL_ID=$(curl -fsS \
-X POST "$BASE_URL/api/v2/eval/" \
-H "Authorization: Bearer $ACCESS_TOKEN" \
-H 'Content-Type: application/json' \
-d '{
"system_name":"README Demo",
"system_version":"1.0.0",
"endpoint_url":"http://demo-answer:5001/answer",
"mode":"daat",
"dataset_name":"default",
"async_mode":false
}' | jq -r '.id')
curl -fsS \
-H "Authorization: Bearer $ACCESS_TOKEN" \
"$BASE_URL/api/v2/eval/$EVAL_ID/results"
curl -fsS "$BASE_URL/api/v2/leaderboard/"

Expand MTEB retrieval walkthrough
The retrieval path was also validated end-to-end with the bundled reference retrieval API.

uv run python examples/reference_retrieval_api.py

It exposes:

- GET /health
- POST /retrieve

curl -fsS http://127.0.0.1:5001/health
curl -fsS \
-X POST http://127.0.0.1:5001/retrieve \
-H 'Content-Type: application/json' \
-d '{"query":"What does the Divrei Yoel say about Bereishis?","top_k":3}'

When the Docker stack is running, use host.docker.internal so the FastAPI container can reach the host-side retrieval service:

BASE_URL=http://localhost:18000
EMAIL="mteb-$(date +%s)@example.com"
PASSWORD="${PASSWORD:-replace-with-local-demo-password}"
REGISTER_PAYLOAD=$(jq -n \
--arg email "$EMAIL" \
--arg password "$PASSWORD" \
'{email:$email,name:"MTEB User",password:$password,organization:"README"}')
curl -fsS \
-X POST "$BASE_URL/api/v2/auth/register" \
-H 'Content-Type: application/json' \
-d "$REGISTER_PAYLOAD"
ACCESS_TOKEN=$(curl -fsS \
-X POST "$BASE_URL/api/v2/auth/login" \
-H 'Content-Type: application/json' \
-d "$(jq -n --arg email "$EMAIL" --arg password "$PASSWORD" '{email:$email,password:$password}')" \
| jq -r '.access_token')
EVAL_ID=$(curl -fsS \
-X POST "$BASE_URL/api/v2/eval/" \
-H "Authorization: Bearer $ACCESS_TOKEN" \
-H 'Content-Type: application/json' \
-d '{
"system_name":"README MTEB Demo",
"system_version":"1.0.0",
"endpoint_url":"http://host.docker.internal:5001/retrieve",
"mode":"mteb",
"dataset_name":"default",
"async_mode":false
}' | jq -r '.id')
curl -fsS \
-H "Authorization: Bearer $ACCESS_TOKEN" \
"$BASE_URL/api/v2/eval/$EVAL_ID/results"
curl -fsS "$BASE_URL/api/v2/leaderboard/?mode=mteb"

This flow was validated with:

- evaluation id a53adf9a-8e97-4cf0-832e-2e358f06eb78
- ndcg_10=0.8642322944270175
- map_100=0.75
- mrr_10=0.8666666666666666

The repository still ships the original Flask compatibility surface for older clients:
- GET /signup
- POST /register
- POST /submit
- GET /status/<task_id>

That path is preserved and validated in the Docker demo through the seeded legacy async submission
flow. New integrations should target /api/v2/*.
MetivtaEval keeps the DAAT dataset local by default. The runtime reads the configured JSON file directly, which means DAAT does not depend on a remote LangSmith dataset being pre-provisioned.
Holdback dataset: The public holdback file lives at src/metivta_eval/dataset/Q1-holdback.json. Since centralized result collection is not currently being run, use this holdback directly for manual or independent verification.

make show-questions
uv run python -m metivta_eval.scripts.show_questions --export /tmp/metivta-answers.json

If LANGSMITH_API_KEY or LANGCHAIN_API_KEY is configured, this command uploads the local dataset to the Metivta-Eval dataset in LangSmith:

make dataset-upload

Expand benchmark customization guide
If you want to turn this repository into your own evaluation harness and public leaderboard, start with BUILD_YOUR_OWN_BENCHMARK.md.
The short operator version is:
- Replace the dataset and rubric files that define the benchmark.
- Pin the evaluator profile you want in config.toml.
- Launch make compose-dev for the normal product surface or make compose-demo for the seeded proof path, then verify /ready, /signup, /leaderboard, and /api/v2/docs in your self-hosted runtime.
- Issue a key, submit real systems by URL, and review the API and browser leaderboard surfaces.
The main control points are:
| Goal | Change here |
|---|---|
| Rename the benchmark | config.toml -> [dataset].name and [dataset].version |
| Replace DAAT questions and answers | config.toml -> [dataset.files].questions |
| Publish a safe question-only template | config.toml -> [dataset.files].questions_only |
| Replace the scholarly rubric | config.toml -> [dataset.files].format_rubric |
| Replace the MTEB benchmark | config.toml -> [dataset.mteb] |
| Switch to a more deterministic DAAT harness | config.toml -> [evaluation.daat].evaluators |
| Tune DAAT scoring weights | config.toml -> [evaluation.daat.weights] |
| Disable or tune remote web validation | config.toml -> [evaluation.web_validator] |
| Change the default local target for script-based evaluation | config.toml -> [evaluation].endpoint_url |
| Change scoring logic itself | src/metivta_eval/evaluators/ |
For teams that want a deterministic baseline harness, the verified no-LLM profile is:
[evaluation.daat]
enabled = true
evaluators = ["hebrew_presence", "url_format", "response_length", "daat_score"]

The BUILD_YOUR_OWN_BENCHMARK.md guide covers:
- local Docker launch and pre-production URL checks
- where users get API keys
- which endpoint URL your users submit for DAAT or MTEB
- how to replace the dataset and rubrics
- how to tune the DAAT evaluator profile in config.toml
- how to host your own leaderboard around a custom benchmark
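To give a feel for what deterministic, no-LLM checks of this kind can look like, here is a hypothetical sketch of three of the evaluator names from the profile above plus a weighted combination. The real implementations live in src/metivta_eval/evaluators/ and may differ substantially. The regular expressions, the length thresholds, and the weighting formula are assumptions for illustration only, and daat_score is deliberately not sketched because its logic is not described in this README.

```python
import re
from typing import Dict
from urllib.parse import urlparse

HEBREW_CHARS = re.compile(r"[\u0590-\u05FF]")  # Unicode Hebrew block
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def hebrew_presence(answer: str) -> float:
    """1.0 if the answer contains any Hebrew-block character, else 0.0."""
    return 1.0 if HEBREW_CHARS.search(answer) else 0.0


def _is_well_formed(url: str) -> bool:
    try:
        parsed = urlparse(url)
    except ValueError:  # e.g. malformed bracketed host
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def url_format(answer: str) -> float:
    """Share of URLs in the answer that have an http(s) scheme and a host."""
    urls = URL_PATTERN.findall(answer)
    if not urls:
        return 0.0
    return sum(1 for url in urls if _is_well_formed(url)) / len(urls)


def response_length(answer: str, min_chars: int = 200, max_chars: int = 4000) -> float:
    """1.0 when the answer length is within assumed bounds, else 0.0."""
    return 1.0 if min_chars <= len(answer) <= max_chars else 0.0


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of evaluator scores; unweighted names count as zero."""
    total = sum(weights.get(name, 0.0) for name in scores)
    if total == 0:
        return 0.0
    return sum(scores[name] * weights.get(name, 0.0) for name in scores) / total
```

Because none of these checks call a model, the same answer always produces the same score, which is the property a deterministic baseline harness is after.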
Expand optional integration setup
These integrations are supported, but the core Docker demo does not require them to exist ahead of time:
- LANGSMITH_API_KEY or LANGCHAIN_API_KEY
  Purpose: optional dataset sync and tracing
- ANTHROPIC_API_KEY
  Purpose: LLM-backed evaluator tiers
- BROWSERLESS_TOKEN
  Purpose: remote web validation

In the verified Docker run for this repository state:
- LangSmith dataset sync succeeded for Metivta-Eval
- Anthropic-backed evaluator metrics were present in the DAAT result
- Browserless-backed web validation metrics were present in the DAAT result
- API docs: http://localhost:18000/api/v2/docs
- OpenAPI JSON: http://localhost:18000/api/v2/openapi.json
- Readiness: http://localhost:18000/ready
- Flask leaderboard: http://localhost:18080/leaderboard
These commands were executed successfully against the current repository state:
uv run ruff check .
uv run pylint api src tests
uv run mypy src
uv run pytest -q
go test -race ./...

Expand repository layout

api/
  fastapi_app/       FastAPI v2 application and routers
  workers/           Celery worker entrypoints and tasks
  server.py          Flask compatibility API
cmd/
  metivta-gateway/   Go gateway entrypoint
src/metivta_eval/
  dataset/           DAAT and MTEB datasets
  evaluators/        DAAT and MTEB evaluation logic
  persistence/       Canonical SQLAlchemy models and repository layer
  scripts/           Operational and developer tooling
deploy/              Dockerfiles and deployment manifests
docker-compose.yml   Local deployment and seeded demo stack

- Treat the README as a contract. If a command or workflow changes, verify it and then update the documentation.
- Prefer the FastAPI v2 API for new integrations.
- Preserve the Flask compatibility API unless you are intentionally removing a migration path.
- Keep DAAT runtime behavior tied to the local dataset loader unless there is a strong operational reason to reintroduce a hard remote dependency.
MIT



