Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1,844 changes: 922 additions & 922 deletions eval/evaluation_report.json

Large diffs are not rendered by default.

75 changes: 75 additions & 0 deletions eval/routine_eval/compile_evaluation_report_qwen35flash-fast.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
{
"compile_evaluation": {
"timestamp": "2026-04-24 14:10:03",
"unix_timestamp": 1777011003.2591848,
"summary": {
"fixture_count": 3,
"judged_count": 3,
"passed_count": 0,
"pass_rate": 0.0,
"compile_model": "qwen35flash-fast",
"judge_model": "qwen36plus-fast",
"mean_intent_match": 0.5833,
"mean_keyword_placement": 0.4167,
"mean_asking_behavior": 0.3333,
"total_proxy_cost": 0.080138,
"total_proxy_tokens": 14279
},
"fixture_results": {
"finviz_filter_clear": {
"success": true,
"final_status": "review",
"error": null,
"asked_questions_count": 0,
"compile_duration": 59.39,
"proxy_cost": 0.02378,
"proxy_tokens": 4165,
"overall_pass": false,
"intent_match": 0.6,
"keyword_placement": 0.75,
"asking_behavior": 0.0,
"reasoning": {
"intent_match": "The routine correctly sets all 5 filters, switches to the Fundamental tab, switches to Performance view, and sorts by Perf Month. However, Steps 9-11 hard-code specific tickers (UISA, SHXD, NRGB) instead of clicking the top matching results as the user explicitly intended (\"I want the routine to open whatever stocks match the criterion, not these specific tickers\"). This is a significant deviation from the user's goal, as the routine would only work if those exact tickers appear at the top again, rather than being reusable for any screening results.",
"keyword_placement": "Steps 1-6 correctly carry Keywords lines with acceptable tokens (fs_cap, fs_fa_div, fs_sh_relvol, Fundamental, fs_fa_pe, fs_fa_pb). Step 7 has a Keywords line but uses \"Performance\" instead of the acceptable token \"view-tab\", which is a mismatch. Step 8 (Perf Month column header) is completely missing a Keywords line when the fixture requires one with acceptable tokens [\"perf4w\", \"table-header\"]. 6 of 8 required keyword placements are correct.",
"asking_behavior": "The compiler asked zero clarification questions. The fixture's expected_questions.required lists one required topic: \"Whether the 3 clicked stocks are the top 3 or specific tickers.\" The compiler failed to ask this question entirely, which directly led to the hard-coded ticker problem in the compiled routine. Since no required topics were covered, the score is 0.0."
}
},
"github-trending-contenteditable-question": {
"success": true,
"final_status": "review",
"error": null,
"asked_questions_count": 0,
"compile_duration": 54.03,
"proxy_cost": 0.026734,
"proxy_tokens": 4842,
"overall_pass": false,
"intent_match": 0.3,
"keyword_placement": 0.5,
"asking_behavior": 0.0,
"reasoning": {
"intent_match": "The routine captures the basic flow (open GitHub trending, navigate to repo, create Yuque doc in AI专用 knowledge base, set dated title, paste URL and description) but completely fails on the core requirement: recognizing the typed sentence \"Write also: 1. A brief intro 2. What's special 3. Why's it trending\" as agent instructions. The fixture explicitly exists to catch this regression. The compiled routine has no steps for the replay agent to visit the repo page and write the three investigation sections (brief intro, what's special, why it's trending) into the Yuque document. This is the primary failure mode the fixture is designed to detect. Additionally, Steps 2-3 reference \"huggingface\" as a keyword and use \"huggingface / ml-intern\" as an example, which hardcodes the specific repo from the recording rather than treating it as a position-based selection.",
"keyword_placement": "The fixture requires Keywords lines for two steps: (1) Yuque document title input must use \"lake-title\" — Step 8 instead uses \"input\", which is overly generic and does not match the required token. (2) Yuque new-document button must use \"新建文档\" or \"文档\" — Step 6 correctly uses \"文档\", which is acceptable. Steps 2 and 3 use \"huggingface\" as a keyword, which is the specific repo from the recording, not a stable identifier for \"first trending item\" — this is semantically wrong but not explicitly listed in must_have. One of two required keywords is correct, yielding a 0.5 score.",
"asking_behavior": "The compiler asked zero clarification questions. The expected_questions.required list contains one topic: \"Whether the top-1 selection should be by position or by the specific repo opened in the recording.\" This required question was not asked at all. While no forbidden questions were asked (since no questions were asked), the complete absence of the required question results in a 0.0 score."
}
},
"techforum_count_ambiguous": {
"success": true,
"final_status": "review",
"error": null,
"asked_questions_count": 1,
"compile_duration": 107.2,
"proxy_cost": 0.029624,
"proxy_tokens": 5272,
"overall_pass": false,
"intent_match": 0.85,
"keyword_placement": 0.0,
"asking_behavior": 1.0,
"reasoning": {
"intent_match": "The routine captures the user's core intent well: it searches for \"AI\" (Step 2) and then upvotes posts specifically about AI agents by identifying agent-related terminology in titles/content (Step 3). This matches the user's actual intent (#3 from the ambiguity list). The routine correctly omits the incidental collect/favorite and comment-click actions. However, Step 4 adds an extra \"Submit final summary\" action that was not part of the user's raw intention—the user only wanted to search and upvote, not submit anything. This extra step slightly detracts from faithfulness but doesn't undermine the core goal.",
"keyword_placement": "The fixture's expected_keywords.must_have_for_steps explicitly requires a Keywords line for \"the search input field\" with the acceptable token \"main-search\". Step 2 in the compiled routine describes entering text into the search input field but completely omits a **Keywords:** line. This is a required step where a clean stable identifier was available, and the absence of any Keywords line is a clear violation of the fixture expectations.",
"asking_behavior": "The compiler asked question 3 (\"Selection criteria ambiguity\") which directly covers the required topic: \"What is the selection criterion for which posts to upvote\". This question addresses whether to select by identifying criterion or by position, which is exactly the ambiguity the fixture expected the compiler to resolve. The compiler did NOT ask the forbidden question about what search query to use. The two additional questions (about collecting and result delivery) are neither required nor forbidden, so they do not lower the score."
}
}
}
}
}
75 changes: 75 additions & 0 deletions eval/routine_eval/compile_evaluation_report_qwen35plus-fast.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
{
"compile_evaluation": {
"timestamp": "2026-04-24 13:48:49",
"unix_timestamp": 1777009729.818323,
"summary": {
"fixture_count": 3,
"judged_count": 3,
"passed_count": 1,
"pass_rate": 33.33,
"compile_model": "qwen35plus-fast",
"judge_model": "qwen36plus-fast",
"mean_intent_match": 0.7667,
"mean_keyword_placement": 0.6333,
"mean_asking_behavior": 0.5,
"total_proxy_cost": 0.109654,
"total_proxy_tokens": 19997
},
"fixture_results": {
"finviz_filter_clear": {
"success": true,
"final_status": "review",
"error": null,
"asked_questions_count": 1,
"compile_duration": 109.53,
"proxy_cost": 0.03834,
"proxy_tokens": 6445,
"overall_pass": true,
"intent_match": 1.0,
"keyword_placement": 0.9,
"asking_behavior": 1.0,
"reasoning": {
"intent_match": "The compiled routine faithfully reproduces all 5 filter settings across both tabs (Market Cap smallover, Dividend Yield o3, Relative Volume o1, P/E u20, P/B u2), switches to the Performance view, sorts by Perf Month, and opens the top 3 rows by position rather than specific tickers. This exactly matches the user's stated goal of finding stocks that dropped 20% in the month and inspecting whatever stocks match the criterion. No required steps are missing and no extra unrelated actions are included.",
"keyword_placement": "All 8 steps from must_have_for_steps carry Keywords lines with valid tokens except Step 7, which uses \"Performance\" instead of the fixture's acceptable token \"view-tab\" for the Performance view tab. The other seven steps correctly use fs_cap, fs_fa_div, fs_sh_relvol, fundamental, fs_fa_pe, fs_fa_pb, and perf4w, all matching their respective acceptable token lists.",
"asking_behavior": "The compiler asked exactly the required question about whether the 3 clicked stocks should be the top 3 by position or specific tickers (UISA, SHXD, NRGB), which directly covers the required topic. The compiler asked zero forbidden questions about market-cap threshold, dividend yield, or P/E ratio values, which is correct since those values were clearly visible in the recording's change events."
}
},
"github-trending-contenteditable-question": {
"success": true,
"final_status": "review",
"error": null,
"asked_questions_count": 2,
"compile_duration": 230.15,
"proxy_cost": 0.054836,
"proxy_tokens": 10423,
"overall_pass": false,
"intent_match": 1.0,
"keyword_placement": 0.0,
"asking_behavior": 0.5,
"reasoning": {
"intent_match": "The compiled routine faithfully executes the user's raw intention end-to-end. It correctly opens the top-1 repository by position (Step 2 specifies \"the #1 result at the top of the page\"), creates a document in the AI专用 knowledge base (Step 6), templates the date in the title (Step 7 uses \"{today's date}\"), and critically contains explicit steps (Step 12) for the replay agent to visit the repo page and write the three required sections: brief intro, what's special, and why it's trending. The routine does NOT compile the typed instruction sentence as literal text to paste, which is the core requirement. All required actions are present and no extraneous unrelated actions are included.",
"keyword_placement": "The compiled routine contains zero **Keywords:** lines anywhere in the markdown. The fixture's expected_keywords.must_have_for_steps explicitly requires Keywords lines on two steps: (1) the Yuque document title input (acceptable token: \"lake-title\") and (2) the Yuque new-document button/menu trigger (acceptable tokens: \"新建文档\" or \"文档\"). Neither Step 7 (title input) nor Step 5 (new document creation) carries a Keywords line. Since the fixture marks these identifiers as available and the compiler placed no Keywords at all, this is a complete failure on this axis.",
"asking_behavior": "The compiler missed the one required topic: it never asked whether the top-1 selection should be by position or by the specific repo from the recording. Additionally, the compiler asked a forbidden question: \"Did you intend to complete this as 'Why's it trending', or was the incomplete word intentional?\" directly asks about what text the user typed into the document body, which matches the forbidden topic \"What text the user typed into the document body.\" The third question about content instructions (paste vs. generate) touches on the same forbidden area. One required miss and at least one forbidden hit warrant a 0.5 score."
}
},
"techforum_count_ambiguous": {
"success": true,
"final_status": "review",
"error": null,
"asked_questions_count": 0,
"compile_duration": 80.38,
"proxy_cost": 0.016478,
"proxy_tokens": 3129,
"overall_pass": false,
"intent_match": 0.3,
"keyword_placement": 1.0,
"asking_behavior": 0.0,
"reasoning": {
"intent_match": "The routine correctly searches for \"AI\" but then hardcodes 5 specific posts to upvote, whereas the user's true intent was to upvote only posts specifically about AI agents. Of the 5 upvoted posts, only 2 (Steps 5 and 6 about UI agents and browser agents) match the agent criterion. Steps 2, 3, and 4 upvote posts about AI trends, evaluation infrastructure, and Kubernetes migration — which the user explicitly stated should NOT be upvoted. The routine fails to implement the conditional \"upvote posts about agents\" logic and instead blindly upvotes the wrong posts.",
"keyword_placement": "The fixture requires a Keywords line for the search input field step with acceptable token \"main-search\". Step 1 carries \"**Keywords:** main-search\", which matches exactly. All other steps also carry valid Keywords lines with \"upvote\". No violations of the keyword placement requirements.",
"asking_behavior": "The fixture's expected_questions.required list includes \"What is the selection criterion for which posts to upvote\" — a critical ambiguity the compiler MUST have asked about. The compiler asked zero clarification questions (the log shows \"(none)\"). This is a complete miss on the required topic. No forbidden questions were asked, but the failure to ask the required question drops the score to 0.0."
}
}
}
}
}
75 changes: 75 additions & 0 deletions eval/routine_eval/compile_evaluation_report_qwen36flash-fast.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
{
"compile_evaluation": {
"timestamp": "2026-04-24 14:16:17",
"unix_timestamp": 1777011377.0485141,
"summary": {
"fixture_count": 3,
"judged_count": 3,
"passed_count": 1,
"pass_rate": 33.33,
"compile_model": "qwen36flash-fast",
"judge_model": "qwen36plus-fast",
"mean_intent_match": 0.6,
"mean_keyword_placement": 0.625,
"mean_asking_behavior": 0.3333,
"total_proxy_cost": 0.083092,
"total_proxy_tokens": 14691
},
"fixture_results": {
"finviz_filter_clear": {
"success": true,
"final_status": "review",
"error": null,
"asked_questions_count": 1,
"compile_duration": 126.25,
"proxy_cost": 0.043604,
"proxy_tokens": 7032,
"overall_pass": true,
"intent_match": 1.0,
"keyword_placement": 0.875,
"asking_behavior": 1.0,
"reasoning": {
"intent_match": "The routine faithfully reproduces all required steps: setting all 5 filters (Market Cap smallover, Dividend Yield o3, RelVol o1, P/E u20, P/B u2) across the Descriptive and Fundamental tabs, switching to Performance view, sorting by Perf Month, and opening the top 3 result rows. The compiler correctly resolved the user's intent to click position-based results rather than specific tickers, matching the stated goal of finding stocks that dropped 20% in the past month.",
"keyword_placement": "Seven of the eight required steps carry correct Keywords lines matching the acceptable tokens (fs_cap, fs_fa_div, fs_sh_relvol, fundamental, fs_fa_pe, fs_fa_pb, perf4w). Step 7 (Performance view tab) uses \"Performance\" instead of the required stable identifier \"view-tab\" from the fixture's acceptable tokens list, which is an overly generic visible-text label rather than a clean identifier.",
"asking_behavior": "The compiler asked the required question about whether the 3 clicked stocks are position-based (top 3) or identity-based (specific tickers UISA, SHXD, NRGB). Zero forbidden questions were asked about filter values (market-cap threshold, dividend yield, P/E ratio). The additional question about result delivery is not in the forbidden list and does not penalize the score."
}
},
"github-trending-contenteditable-question": {
"success": true,
"final_status": "review",
"error": null,
"asked_questions_count": 0,
"compile_duration": 92.85,
"proxy_cost": 0.021186,
"proxy_tokens": 4393,
"overall_pass": false,
"intent_match": 0.4,
"keyword_placement": 0.0,
"asking_behavior": 0.0,
"reasoning": {
"intent_match": "The routine captures most steps correctly (goes to trending, clicks top repo by position, creates doc in AI专用, sets dynamic date title, pastes URL and description). However, it critically fails on the core agent-investigation requirement. Step 10 frames the three sections (intro, unique features, reasons for trending) as \"optional\" follow-up notes rather than explicit mandatory instructions for the replay agent. It does not instruct the agent to visit the repo page and write those sections into the document. The fixture explicitly states: \"A routine that ends at 'paste the URL + About description' is a failure — it ignores the user's typed instructions.\" This is exactly the failure mode this fixture is designed to catch.",
"keyword_placement": "The compiled routine contains zero **Keywords:** lines on any step. The fixture requires Keywords lines for the Yuque document title input (acceptable token: \"lake-title\") and the Yuque new-document button/menu trigger (acceptable tokens: \"新建文档\", \"文档\"). Steps 4, 5, and 6 target these elements but carry no Keywords lines at all, which is a complete miss on keyword placement.",
"asking_behavior": "The compiler asked zero clarification questions. The fixture's expected_questions.required lists one topic: \"Whether the top-1 selection should be by position or by the specific repo opened in the recording.\" This required topic was not asked about at all. While the compiler did avoid all forbidden questions (none were asked), missing the single required topic entirely yields a score of 0.0."
}
},
"techforum_count_ambiguous": {
"success": true,
"final_status": "review",
"error": null,
"asked_questions_count": 0,
"compile_duration": 61.42,
"proxy_cost": 0.018302,
"proxy_tokens": 3266,
"overall_pass": false,
"intent_match": 0.4,
"keyword_placement": 1.0,
"asking_behavior": 0.0,
"reasoning": {
"intent_match": "The routine correctly searches for \"AI\" and includes upvoting, but fundamentally misses the user's core intent: upvoting posts specifically about AI agents. Instead, Step 3 instructs upvoting based on subjective usefulness (\"answer preview you find valuable\"), which is a completely different selection criterion that would result in upvoting different posts. Additionally, the routine elevates incidental actions (collecting posts, clicking comments) into explicit workflow steps, whereas the user indicated these were secondary browsing actions, not part of the core intent. The selection criterion mismatch is the most significant flaw since it would cause the routine to upvote the wrong posts.",
"keyword_placement": "The fixture requires a Keywords line with \"main-search\" for the search input field step. Step 1 correctly includes **Keywords:** main-search, using the exact acceptable token specified. No other steps are listed in must_have_for_steps, and the single required keyword is properly placed.",
"asking_behavior": "The compiler asked zero clarification questions. The fixture explicitly requires asking about the selection criterion for which posts to upvote, which was completely missed. This is a critical omission since the trace was genuinely ambiguous between multiple interpretations (top-5, all results, topical criterion, or specific cherry-picked posts). While no forbidden questions were asked, the total failure to ask the required topic results in a score of 0.0."
}
}
}
}
}
Loading
Loading