This directory contains two Jupyter notebooks used for evaluating and analyzing the performance of models on the Agent-X benchmark. Each notebook serves a distinct purpose in the evaluation pipeline.
The first notebook performs quantitative evaluation of model performance on the Agent-X benchmark, focusing on reasoning accuracy, tool-usage correctness, and outcome quality.
It is structured to compute and visualize:
- Goal Accuracy
- Tool Metrics for Generative Queries
- Tool Call Success/Failure
- Reasoning Step Trends
- Difficulty-based Breakdown
The notebook begins by computing goal accuracy (Gacc) for each example.
We exclude generation-based examples (GENERATIVE_IDS) when computing the global goal accuracy because they follow a different evaluation scheme.
```python
# Clear goal_accuracy for generative rows
df.at[idx, "goal_accuray"] = ""
```

After filtering, the notebook computes the average `goal_accuray` across the rest of the dataset for a reliable benchmark.
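The filter-then-average step can be sketched as follows. This is a minimal sketch: the `GENERATIVE_IDS` set and the sample scores are hypothetical, while the `goal_accuray` column name follows the notebook's own snippets.

```python
import pandas as pd

# Hypothetical example data; GENERATIVE_IDS marks generation-based queries
GENERATIVE_IDS = {3, 4}
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "goal_accuray": [1.0, 0.0, 1.0, 1.0],  # column name as used in the notebook
})

# Clear goal_accuray for generative rows so they are excluded from the average
for idx in df.index:
    if df.at[idx, "id"] in GENERATIVE_IDS:
        df.at[idx, "goal_accuray"] = None

# pandas' mean() skips missing values, so only non-generative rows count
global_gacc = df["goal_accuray"].mean()
print(global_gacc)  # 0.5
```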
Since generative queries don't have ground-truth answers, we approximate their goal success by averaging the following tool-based scores:
- `precision_score`
- `tool_accuray`
- `toolset_accuray`
This forms the Ga* (Goal Accuracy Star) metric.
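For a single generative example, Ga* amounts to the mean of the three scores. A minimal sketch, with hypothetical score values and the key spellings taken from the notebook:

```python
# Hypothetical per-example scores for one generative query
scores = {
    "precision_score": 0.8,
    "tool_accuray": 0.6,     # spelling as used in the notebook
    "toolset_accuray": 1.0,
}

# Ga* (Goal Accuracy Star): the mean of the three tool-based scores
ga_star = sum(scores.values()) / len(scores)
print(round(ga_star, 6))  # 0.8
```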
```python
subset_means = {
    "precision_score": ...,
    "tool_accuray": ...,
    "toolset_accuray": ...
}
```

We analyze how often tools were used successfully or failed across different models. This helps uncover issues like:
- Missing tool outputs
- Missing tool names
- Invalid tool calls
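One way such failures can be detected is by classifying each trace step against the allowed tool set. A minimal sketch, where the step format and failure labels are assumptions and only an excerpt of the tool set is shown:

```python
# Excerpt of the allowed tool set defined in the notebook
allowed_tools = {"Calculator", "OCR", "ObjectCounter", "WebSearch"}

def classify_tool_call(step: dict) -> str:
    """Label one reasoning step; the step format is a hypothetical sketch."""
    tool = step.get("tool")
    if not tool:
        return "missing_tool_name"
    if tool not in allowed_tools:
        return "invalid_tool_call"
    if not step.get("output"):
        return "missing_tool_output"
    return "success"

steps = [
    {"tool": "OCR", "output": "stop sign"},
    {"tool": "", "output": "42"},
    {"tool": "TimeMachine", "output": "?"},
    {"tool": "Calculator", "output": ""},
]
print([classify_tool_call(s) for s in steps])
# ['success', 'missing_tool_name', 'invalid_tool_call', 'missing_tool_output']
```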
```python
allowed_tools = {
    "Calculator", "OCR", "ObjectCounter", "SceneDescriber",
    "WebSearch", "RegionDescriber", "LocateObjectByText",
    "CodePlotter", "MathOCR", "Solver", "DrawBoundingBox",
    "OverlayText", "ImageGenerator", "ImageStylization"
}
```

For each JSON file of reasoning traces, the notebook extracts:
- Total reasoning steps
- Unique tools used
- Tool usage distribution
These statistics are saved as a `*.csv` file to compare models and enable trend plots.
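The per-file extraction can be sketched like this; the trace structure and field names here are assumptions, not the notebook's actual schema:

```python
from collections import Counter

# Hypothetical reasoning trace for one example
trace = {
    "id": 43,
    "steps": [
        {"tool": "OCR"}, {"tool": "Calculator"}, {"tool": "OCR"},
        {"tool": "Solver"}, {"tool": "OCR"},
    ],
}

tools = [step["tool"] for step in trace["steps"]]
row = {
    "id": trace["id"],
    "total_steps": len(trace["steps"]),    # total reasoning steps
    "unique_tools_used": len(set(tools)),  # distinct tools in the trace
    "tool_usage": dict(Counter(tools)),    # tool usage distribution
}
print(row)
```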
```json
{
  "id": 43,
  "total_steps": 5,
  "unique_tools_used": 3,
  ...
}
```

This part plots how reasoning depth and tool diversity affect performance.
```python
compare_models_goal_accuracy_trends([...])
```

We use a GPT-4o-generated categorization of query difficulty (easy, medium, hard) to plot how well models perform on hard vs. easy tasks.
```python
grouped = df.groupby("difficulty")["goal_accuray"].mean()
```

This notebook provides:
- A principled way to isolate evaluation of generative and non-generative queries
- Insights into tool usage effectiveness
- Trend analysis on reasoning depth and difficulty
- Exportable CSVs for further aggregation or leaderboard integration
The second notebook supports qualitative evaluation of reasoning traces generated by vision-language models on the Agent-X benchmark. It focuses on identifying specific types of reasoning failures through structured keyword-based comparisons.
The notebook provides modules to diagnose two high-level failure modes:

- Module 1 (final answers): detects errors in final answers where key visual details are missing or incorrect, using keywords extracted from ground-truth (GT) final answers and justifications.
- Module 2 (reasoning traces): evaluates whether reasoning traces reflect correct tool use, spatial awareness, and intermediate logic, using keywords extracted from GT reasoning steps.
We define ground truth (GT) keywords using spaCy (noun chunks, object names, concepts) and check if these appear in model predictions.
```python
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
keywords = [chunk.text.lower().strip() for chunk in doc.noun_chunks if 1 < len(chunk.text) < 30]
```

Module 1 extracts up to 21 keywords from GT final answers and their justifications (`data.json`) and checks if each is present in model predictions.
```json
{
  "gt_keywords": ["green helmet", "kid", "left side"],
  "pred_text": "There is a boy on the right wearing blue...",
  "matched": ["kid"],
  "unmatched": ["green helmet", "left side"],
  "match_count": 1
}
```

A prediction is considered failed if `match_count < 1`.
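A simplified version of the matching logic, using literal substring checks (the notebook's actual matching may be more lenient, since the sample report above counts "kid" as matched against "boy"):

```python
def match_keywords(gt_keywords, pred_text):
    """Literal substring matching; a simplification of the notebook's check."""
    pred = pred_text.lower()
    matched = [kw for kw in gt_keywords if kw in pred]
    return {
        "matched": matched,
        "unmatched": [kw for kw in gt_keywords if kw not in pred],
        "match_count": len(matched),
        "failed": len(matched) < 1,  # failed = no GT keyword matched
    }

report = match_keywords(
    ["green helmet", "kid", "left side"],
    "There is a kid on the right wearing blue...",
)
print(report["match_count"], report["failed"])  # 1 False
```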
Module 2 extracts up to 21 keywords from GT reasoning traces and checks if they appear in the predicted tool-use trace:

```text
task,tool,output,thought
```

This works the same as Module 1, but for reasoning traces instead of the final answer.
For each model:

```python
compute_binary_score(pred_report, "model_name")
```

This prints how many examples had zero matching keywords (complete mismatch), helping quantify how often the model entirely misses key concepts.
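The counting behind `compute_binary_score` can be sketched as follows; only the call signature appears in the notebook, so the function body and report format here are assumptions:

```python
def compute_binary_score(pred_report, model_name):
    """Count examples with zero matched keywords (complete mismatch)."""
    zero_matches = sum(1 for r in pred_report if r["match_count"] == 0)
    print(f"{model_name}: {zero_matches}/{len(pred_report)} complete mismatches")
    return zero_matches

# Hypothetical per-example keyword reports
reports = [{"match_count": 0}, {"match_count": 2}, {"match_count": 0}]
n_failed = compute_binary_score(reports, "model_name")  # 2
```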
Install dependencies before running:
```bash
pip install spacy==3.5.4 --user
python -m spacy download en_core_web_sm
```

This notebook provides:
- A principled way to quantify model hallucinations or omissions
- GT-aligned keyword extraction for final answers and reasoning traces
- Binary error signal for systematic failure detection
- Easily extendable to new categories (e.g., tool misuse, factual inconsistency)


