Test prompts against datasets and measure quality with evals.
classify-sentiment.prompt.mdx classifies text sentiment and returns a structured result. It's paired with a dataset (sentiment-data.jsonl) containing test inputs and expected outputs.
- `classify-sentiment.prompt.mdx` — The prompt with eval configuration
- `sentiment-data.jsonl` — Test dataset with inputs and expected outputs
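As a rough sketch, the prompt's eval configuration lives in its frontmatter. Only `input_schema`, `test_settings.dataset`, and `test_settings.evals` are confirmed by this page; every other field name below is an assumption, so check your AgentMark version for the exact shape:

```yaml
# Hypothetical frontmatter sketch for classify-sentiment.prompt.mdx.
# Field names other than input_schema, test_settings.dataset, and
# test_settings.evals are assumptions.
input_schema:
  type: object
  properties:
    text:
      type: string
test_settings:
  dataset: sentiment-data.jsonl   # JSONL file of test inputs and expected outputs
  evals:
    - exact_match_json            # eval function registered in agentmark.client.ts
```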
```bash
# Run the experiment (executes the prompt against every dataset item, then runs evals)
agentmark run-experiment agentmark/classify-sentiment.prompt.mdx

# Output as JSON
agentmark run-experiment agentmark/classify-sentiment.prompt.mdx --format json

# Fail if fewer than 80% of items pass
agentmark run-experiment agentmark/classify-sentiment.prompt.mdx --threshold 80

# Skip evals and just run the prompt against the dataset
agentmark run-experiment agentmark/classify-sentiment.prompt.mdx --skip-eval
```

Each line in the JSONL file is an object with `input` and `expected_output`:
```json
{"input": {"text": "I love this product!"}, "expected_output": "{\"sentiment\": \"positive\"}"}
```

- `input` — The props passed to the prompt (matches `input_schema`)
- `expected_output` — A JSON string of the expected result (compared by `exact_match_json`)
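To make the comparison concrete, here is an illustrative sketch of an exact-match JSON check in the style the eval implies. This is not AgentMark's actual `exact_match_json` implementation; the function name and shape are assumptions for illustration only:

```typescript
// Illustrative sketch only: compare a structured model result against a
// dataset item's expected_output (a JSON string) by structural equality.
// Not AgentMark's actual exact_match_json implementation.
function exactMatchJson(actual: unknown, expectedOutput: string): boolean {
  const expected: unknown = JSON.parse(expectedOutput);

  // Serialize with object keys sorted so the comparison is insensitive
  // to key order but strict about values and structure.
  const canonical = (v: unknown): string =>
    JSON.stringify(v, (_key, value) =>
      value !== null && typeof value === "object" && !Array.isArray(value)
        ? Object.fromEntries(
            Object.entries(value as Record<string, unknown>).sort(([a], [b]) =>
              a.localeCompare(b)
            )
          )
        : value
    );

  return canonical(actual) === canonical(expected);
}

// Matches the dataset row above: the result must be exactly {"sentiment": "positive"}.
console.log(exactMatchJson({ sentiment: "positive" }, '{"sentiment": "positive"}')); // true
console.log(exactMatchJson({ sentiment: "negative" }, '{"sentiment": "positive"}')); // false
```

Because the check is an exact structural match, any extra or differing field in the output fails the item, which is why the prompt's schema should return only deterministic fields.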
- `test_settings.dataset` points to a JSONL file with test data
- `test_settings.evals` lists the evaluation functions to run (registered in `agentmark.client.ts`)
- The schema only returns `sentiment` (not `confidence`) so `exact_match_json` can compare deterministically
- `run-experiment` shows a table with pass/fail per item and the overall pass rate
- Use `--threshold` in CI/CD to gate deployments on quality
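For example, a CI step can run the experiment with `--threshold` so the job fails when quality drops below the bar. This GitHub Actions fragment is hypothetical (the step name and workflow context are assumptions); only the `run-experiment` invocation itself comes from the commands above:

```yaml
# Hypothetical CI step: fail the job when fewer than 80% of dataset items pass.
- name: Gate deployment on prompt quality
  run: agentmark run-experiment agentmark/classify-sentiment.prompt.mdx --threshold 80
```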