miniwob.find-greatest can be marked successful after selecting a non-maximum card #392

@gss10282023

Summary

There appears to be a success-threshold bug in the BrowserGym MiniWoB wrapper for miniwob.find-greatest.

The task text says:

Find and pick the card with the greatest number, then press submit.

However, a run can be marked successful after selecting a card that is not the greatest card.

In the attached concrete run, the page had cards with values 4, 7, and 2. The agent selected the card with value 2 and submitted. The final DOM still shows another card with value 7, so this should not satisfy the task. The native evaluator nevertheless reports success: true.

Why this happens

In MiniWoB++ find-greatest.html, selecting the correct card ends with reward 1.0, but selecting the wrong card and submitting still ends the episode with a positive partial reward:

if(userIndex === expectedIndex.toString()) core.endEpisode(1.0, true);
else core.endEpisode(0.1, true);

In BrowserGym's MiniWoB wrapper, AbstractMiniwobTask.validate() converts any positive raw reward into a full success reward:

reward = float(info["RAW_REWARD_GLOBAL"] > 0)  # TODO: shouldn't it be 0.5?

As a result, the wrong-card partial reward 0.1 becomes reward: 1.0 and success: true in the run output.
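The conversion can be reproduced in isolation. The following is a minimal sketch, not the wrapper itself: the function name `current_validate_reward` is hypothetical, and only the comparison `> 0` is taken from the BrowserGym line quoted above.

```python
def current_validate_reward(raw_reward_global: float) -> float:
    """Mimic the quoted line: any strictly positive raw reward becomes 1.0."""
    return float(raw_reward_global > 0)


# Raw rewards of interest: a negative value, zero, the 0.1 wrong-card
# reward from find-greatest.html, and the 1.0 correct-card reward.
for raw in (-1.0, 0.0, 0.1, 1.0):
    print(f"raw={raw:>4} -> reward={current_validate_reward(raw)}")
```

Under this rule the wrong-card reward 0.1 and the correct-card reward 1.0 are indistinguishable after conversion.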

Actual behavior in the attached run

The attached evidence/native_evaluator_output_agent_a.json shows:

"RAW_REWARD_GLOBAL": 0.1,
"reward": 1.0,
"success": true

It also shows the agent actions:

"action": "click('17')"
"action": "click('20')"

The attached final DOM snapshot evidence/final_dom_step_002_agent_a.html shows that bid 17 is the revealed selected card, with value 2:

<div class="card" data-index="2" ... bid="17">
  <span class="card-value" ...>2</span>
</div>

The same DOM snapshot also shows another card, still hidden, with value 7:

<div class="card hidden" data-index="1" ... bid="15">
  <span class="card-value" ...>7</span>
</div>

So the submitted card was not the greatest card, but the run was still reported as successful.
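This check can be scripted. The sketch below is an illustration under stated assumptions: `SNAPSHOT` is a simplified stand-in containing only the two cards quoted above, with the elided attributes dropped, and `CardParser` is a hypothetical helper, not part of BrowserGym or MiniWoB++.

```python
from html.parser import HTMLParser

# Simplified stand-in for the final DOM (assumption): only the two quoted
# cards, without the attributes that are elided in the excerpts above.
SNAPSHOT = """
<div class="card" data-index="2" bid="17"><span class="card-value">2</span></div>
<div class="card hidden" data-index="1" bid="15"><span class="card-value">7</span></div>
"""


class CardParser(HTMLParser):
    """Collect a mapping of card bid -> card value from the snapshot."""

    def __init__(self):
        super().__init__()
        self.cards = {}
        self._current_bid = None
        self._in_value = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "card" in classes:
            self._current_bid = attrs.get("bid")
        elif tag == "span" and "card-value" in classes:
            self._in_value = True

    def handle_data(self, data):
        # Only text inside a card-value span is treated as a card value.
        if self._in_value and self._current_bid is not None:
            self.cards[self._current_bid] = int(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_value = False


parser = CardParser()
parser.feed(SNAPSHOT)

selected_bid = "17"  # the card clicked by the agent, per the actions above
selected_is_greatest = parser.cards[selected_bid] == max(parser.cards.values())
print(parser.cards, selected_is_greatest)
```

Even with only these two cards visible, the selected card (value 2) is smaller than another card on the page (value 7), so it cannot be the greatest.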

Expected behavior

For find-greatest, BrowserGym should only report success when the selected card is actually the greatest card.

Possible fixes:

  • change the MiniWoB success threshold so partial positive rewards such as 0.1 do not count as success, for example RAW_REWARD_GLOBAL > 0.5 or RAW_REWARD_GLOBAL >= 1.0; or
  • add task-specific handling where partial rewards are not treated as benchmark success; or
  • if the intended fix belongs in MiniWoB++, make wrong-card submission produce a non-success raw reward rather than 0.1.

The issue seems most directly triggered by BrowserGym treating any RAW_REWARD_GLOBAL > 0 as success.
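To compare the threshold options from the first bullet, here is a hedged sketch. The names `CANDIDATE_RULES` and `apply_rule` are hypothetical, and this is not a patch against the wrapper; it only shows how each candidate rule would treat the raw rewards seen in this issue. Whether other MiniWoB tasks emit legitimate success rewards below 1.0 is not checked here and would need to be verified before choosing a threshold.

```python
# Candidate success rules (hypothetical names), keyed by the condition they apply.
CANDIDATE_RULES = {
    "raw > 0 (current)": lambda raw: raw > 0,
    "raw > 0.5": lambda raw: raw > 0.5,
    "raw >= 1.0": lambda raw: raw >= 1.0,
}


def apply_rule(rule_name: str, raw_reward_global: float) -> float:
    """Return 1.0 if the named rule counts the raw reward as success, else 0.0."""
    return float(CANDIDATE_RULES[rule_name](raw_reward_global))


# 0.1 is the wrong-card reward and 1.0 the correct-card reward in find-greatest.html.
for name in CANDIDATE_RULES:
    print(f"{name:>18}: wrong card -> {apply_rule(name, 0.1)}, "
          f"correct card -> {apply_rule(name, 1.0)}")
```

Both stricter rules separate the two outcomes for this task; they differ only for raw rewards in the range (0.5, 1.0), which find-greatest does not produce according to the quoted source.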

Attached evidence package

  • source/browsergym_miniwob_base.py: BrowserGym MiniWoB wrapper showing reward = float(info["RAW_REWARD_GLOBAL"] > 0).
  • source/find-greatest.html: MiniWoB++ task source showing the task text and wrong-card core.endEpisode(0.1, true) branch.
  • evidence/native_evaluator_output_agent_a.json: run output showing RAW_REWARD_GLOBAL=0.1, reward=1.0, and success=true.
  • evidence/final_dom_step_002_agent_a.html: final DOM showing selected card value 2 while another card value is 7.
  • evidence/final_screenshot_step_002_agent_a.png: final screenshot for the same run.
  • evidence/validation_final_agent_a.json and evidence/task_info_final_agent_a.json: final validator/task state.
  • evidence/miniwob_scores_flat.csv: score table; row 170 contains this miniwob.find-greatest / Agent A run.

find_greatest_agent_a_wrong_card_success_issue.zip
