Ghostwriter evaluation results 2024-12-21_13-57-31

There are 4 scenarios and 4 test cases with 3 attempts (48 total tests).

Test: blank_math
  claude_sonnet_latest_with_seg
    10
  gpt-4o-mini_no_seg
  gpt-4o_with_seg
    10
  claude_sonnet_latest_no_seg

Test: tic_tac_toe_1
  claude_sonnet_latest_with_seg
    Your turn! Place an O anywhere you'd like.
  gpt-4o-mini_no_seg
  gpt-4o_with_seg
  claude_sonnet_latest_no_seg

Test: x_in_box
  claude_sonnet_latest_with_seg
  gpt-4o-mini_no_seg
  gpt-4o_with_seg
  claude_sonnet_latest_no_seg

Test: x_in_boxes
  claude_sonnet_latest_with_seg
  gpt-4o-mini_no_seg
  gpt-4o_with_seg
  claude_sonnet_latest_no_seg
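
The total in the summary line (4 scenarios x 4 test cases x 3 attempts = 48) can be checked with a short sketch that enumerates the run matrix. The scenario and test names are taken from this report. The function name `build_matrix`, the nesting order, and the 1-based attempt numbering are assumptions made for illustration and may not match how the Ghostwriter harness actually schedules runs.

```python
from itertools import product

# Scenario and test-case names as they appear in the report.
scenarios = [
    "claude_sonnet_latest_with_seg",
    "gpt-4o-mini_no_seg",
    "gpt-4o_with_seg",
    "claude_sonnet_latest_no_seg",
]
test_cases = ["blank_math", "tic_tac_toe_1", "x_in_box", "x_in_boxes"]
attempts = 3


def build_matrix(scenarios, test_cases, attempts):
    """Hypothetical helper: list every (test, scenario, attempt) combination."""
    return [
        (test, scenario, attempt)
        for test, scenario, attempt in product(
            test_cases, scenarios, range(1, attempts + 1)
        )
    ]


matrix = build_matrix(scenarios, test_cases, attempts)
print(len(matrix))  # 48
```

Enumerating the combinations explicitly, rather than only multiplying the counts, also gives a list that could be used to confirm that every scenario has a result for every test.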