Summary
There appears to be a success-threshold bug in the BrowserGym MiniWoB wrapper for miniwob.find-greatest.
The task text says:
Find and pick the card with the greatest number, then press submit.
However, a run can be marked successful after selecting a card that is not the greatest card.
In the attached concrete run, the page had cards with values 4, 7, and 2. The agent selected the card with value 2 and submitted. The final DOM still shows another card with value 7, so this should not satisfy the task. The native evaluator nevertheless reports success: true.
Why this happens
In MiniWoB++ find-greatest.html, selecting the correct card ends with reward 1.0, but selecting the wrong card and submitting still ends the episode with a positive partial reward:
if(userIndex === expectedIndex.toString()) core.endEpisode(1.0, true);
else core.endEpisode(0.1, true);
In BrowserGym's MiniWoB wrapper, AbstractMiniwobTask.validate() converts any positive raw reward into a full success reward:
reward = float(info["RAW_REWARD_GLOBAL"] > 0) # TODO: shouldn't it be 0.5?
As a result, the wrong-card partial reward 0.1 becomes reward: 1.0 and success: true in the run output.
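The conversion can be reproduced in isolation. The following is a minimal sketch, not the wrapper itself: `current_validate_reward` is a hypothetical helper name that only mirrors the single quoted line from AbstractMiniwobTask.validate().

```python
def current_validate_reward(raw_reward_global: float) -> float:
    """Hypothetical stand-in mirroring the quoted wrapper line:
    reward = float(info["RAW_REWARD_GLOBAL"] > 0)
    """
    return float(raw_reward_global > 0)


# Raw rewards taken from the find-greatest.html branches quoted above,
# plus zero and a negative value for contrast.
for raw in (1.0, 0.1, 0.0, -1.0):
    print(f"raw={raw:>4} -> reward={current_validate_reward(raw)}")
```

With this rule, the correct-card reward 1.0 and the wrong-card reward 0.1 are indistinguishable after conversion: both map to 1.0.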
Actual behavior in the attached run
The attached evidence/native_evaluator_output_agent_a.json shows:
"RAW_REWARD_GLOBAL": 0.1,
"reward": 1.0,
"success": true
It also shows the agent actions:
"action": "click('17')"
"action": "click('20')"
The attached final DOM snapshot evidence/final_dom_step_002_agent_a.html shows that bid 17 is the revealed selected card, with value 2:
<div class="card" data-index="2" ... bid="17">
<span class="card-value" ...>2</span>
</div>
The same DOM line also shows another card with value 7:
<div class="card hidden" data-index="1" ... bid="15">
<span class="card-value" ...>7</span>
</div>
So the submitted card was not the greatest card, but the run was still reported as successful.
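The same check can be scripted against a DOM snapshot. This is a rough sketch under stated assumptions: the embedded HTML is a simplified stand-in containing only the two cards quoted above (the elided attributes are dropped), and it assumes the selected card is the one without the `hidden` class. `CardParser` is a hypothetical helper, not part of BrowserGym or MiniWoB++.

```python
from html.parser import HTMLParser

# Simplified stand-in for the final DOM: only the two cards quoted above.
SNAPSHOT = """
<div class="card hidden" data-index="1" bid="15"><span class="card-value">7</span></div>
<div class="card" data-index="2" bid="17"><span class="card-value">2</span></div>
"""


class CardParser(HTMLParser):
    """Collects bid, hidden flag, and numeric value for each card div."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._in_value = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "div" and "card" in classes:
            self.cards.append(
                {"bid": attr.get("bid"), "hidden": "hidden" in classes, "value": None}
            )
        elif tag == "span" and "card-value" in classes and self.cards:
            self._in_value = True

    def handle_data(self, data):
        # Text inside a card-value span belongs to the most recent card.
        if self._in_value and data.strip():
            self.cards[-1]["value"] = int(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_value = False


parser = CardParser()
parser.feed(SNAPSHOT)

# Assumption: the revealed (non-hidden) card is the one the agent selected.
selected = next(card for card in parser.cards if not card["hidden"])
greatest = max(card["value"] for card in parser.cards)
is_greatest = selected["value"] == greatest
print(selected["bid"], selected["value"], greatest, is_greatest)
```

For the two cards above, the selected card (bid 17, value 2) is not the greatest (7), so `is_greatest` is False even though the run reports success.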
Expected behavior
For find-greatest, BrowserGym should only report success when the selected card is actually the greatest card.
Possible fixes:
- change the MiniWoB success threshold so partial positive rewards such as 0.1 do not count as success, for example RAW_REWARD_GLOBAL > 0.5 or RAW_REWARD_GLOBAL >= 1.0; or
- add task-specific handling where partial rewards are not treated as benchmark success; or
- if the intended fix belongs in MiniWoB++, make wrong-card submission produce a non-success raw reward rather than 0.1.
The issue seems most directly triggered by BrowserGym treating any RAW_REWARD_GLOBAL > 0 as success.
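The first option could look roughly like the sketch below. It is illustrative only: `thresholded_reward` and its `success_threshold` parameter are hypothetical names, and the default of 0.5 is taken from the existing TODO comment rather than from any confirmed BrowserGym decision.

```python
def thresholded_reward(raw_reward_global: float, success_threshold: float = 0.5) -> float:
    """Hypothetical replacement for float(info["RAW_REWARD_GLOBAL"] > 0).

    A run counts as a success only if the raw reward is strictly above
    success_threshold, so a partial reward such as 0.1 no longer passes.
    """
    return float(raw_reward_global > success_threshold)


# Correct card (1.0) still succeeds; wrong card (0.1) no longer does.
for raw in (1.0, 0.1, 0.0):
    print(f"raw={raw:>4} -> reward={thresholded_reward(raw)}")
```

Whether > 0.5 or >= 1.0 is the better cut-off depends on whether any other MiniWoB task emits an intermediate raw reward that should still count as success; that would need to be checked across the task set before changing the shared wrapper.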
Attached evidence package
source/browsergym_miniwob_base.py: BrowserGym MiniWoB wrapper showing reward = float(info["RAW_REWARD_GLOBAL"] > 0).
source/find-greatest.html: MiniWoB++ task source showing the task text and wrong-card core.endEpisode(0.1, true) branch.
evidence/native_evaluator_output_agent_a.json: run output showing RAW_REWARD_GLOBAL=0.1, reward=1.0, and success=true.
evidence/final_dom_step_002_agent_a.html: final DOM showing selected card value 2 while another card value is 7.
evidence/final_screenshot_step_002_agent_a.png: final screenshot for the same run.
evidence/validation_final_agent_a.json and evidence/task_info_final_agent_a.json: final validator/task state.
evidence/miniwob_scores_flat.csv: score table; row 170 contains this miniwob.find-greatest / Agent A run.
find_greatest_agent_a_wrong_card_success_issue.zip