Commit 6dd02fc

feat(mcqa): Add custom answer extraction via template_metadata to support STEM MCQA dataset (#128)

Adds support for custom answer extraction in the MCQA resources server via the optional `template_metadata.output_regex` field. This enables handling STEM datasets with custom prompt formats that don't match the standard grading modes.

Signed-off-by: Pritam Gundecha <pgundecha@nvidia.com>
1 parent eb676a5 commit 6dd02fc

9 files changed: +489 −80 lines changed


.gitignore

Lines changed: 2 additions & 0 deletions
```diff
@@ -240,4 +240,6 @@ outputs
 
 # Environment with sensitive information like API keys
 env.yaml
+
+# Backup files
 *.backup
```

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -85,7 +85,7 @@ NeMo Gym includes a curated collection of resource servers for training and eval
 | instruction_following | Instruction Following | <a href='resources_servers/instruction_following/configs/instruction_following.yaml'>resources_servers/instruction_following/configs/instruction_following.yaml</a> | Apache 2.0 | Train, Example |
 | instruction_following | Multineedle | <a href='resources_servers/multineedle/configs/multineedle.yaml'>resources_servers/multineedle/configs/multineedle.yaml</a> | Apache 2.0 | Train, Validation, Example |
 | knowledge | Equivalence Llm Judge | <a href='resources_servers/equivalence_llm_judge/configs/equivalence_llm_judge.yaml'>resources_servers/equivalence_llm_judge/configs/equivalence_llm_judge.yaml</a> | None | Example, Example |
-| knowledge | Mcqa | <a href='resources_servers/mcqa/configs/mcqa.yaml'>resources_servers/mcqa/configs/mcqa.yaml</a> | Apache 2.0 | Train, Example |
+| knowledge | Mcqa | <a href='resources_servers/mcqa/configs/mcqa.yaml'>resources_servers/mcqa/configs/mcqa.yaml</a> | Apache 2.0 | Train, Example, Example |
 | math | Library Judge Math | <a href='resources_servers/library_judge_math/configs/bytedtsinghua_dapo17k.yaml'>resources_servers/library_judge_math/configs/bytedtsinghua_dapo17k.yaml</a> | Apache 2.0 | Train, Validation |
 | math | Library Judge Math | <a href='resources_servers/library_judge_math/configs/dapo17k.yaml'>resources_servers/library_judge_math/configs/dapo17k.yaml</a> | Apache 2.0 | Train, Validation |
 | math | Library Judge Math | <a href='resources_servers/library_judge_math/configs/library_judge_math.yaml'>resources_servers/library_judge_math/configs/library_judge_math.yaml</a> | Creative Commons Attribution 4.0 International | Train, Validation, Example |
```

resources_servers/mcqa/README.md

Lines changed: 82 additions & 48 deletions
```diff
@@ -5,17 +5,21 @@ Verifies multiple-choice QA (MCQA) model outputs.
 It consumes agent trajectories and returns a reward based on whether the assistant’s final output matches the gold answer.
 
 ### Input schema
+Required fields:
 - `responses_create_params`: OpenAI Responses create params
-  - Use only a user message with the question and options (e.g., “A: … B: …”).
-- `metadata` (dataset format):
-  - `options` (required): list of dicts mapping a single letter to the option text, e.g. `[{"A": "Option_Text"}, {"B": "..."}]`.
-  - `expected_answer` (required): the gold letter (single character). Must be one of the letters present in `metadata.options`.
-  - `prompt_type` (required): must be `"mcqa"`.
+  - Use only a user message with the question and options (e.g., "A: … B: …").
+- `options` (required): List of dicts mapping a single letter to option text, e.g. `[{"A": "Option_Text"}, {"B": "..."}]`
+- `expected_answer` (required): The gold letter (single character). Must be one of the letters present in `options`
 
-Notes
-- Letters are validated against the keys present in `metadata.options`.
-- While most datasets use A–D, any letter set is supported as long as it matches the provided options.
-- Legacy support: top-level `options` and `expected_answer` are still accepted for backward compatibility, but the dataset format above is preferred.
+Optional fields:
+- `grading_mode`: Answer extraction method (default: `"strict_single_letter_boxed"`)
+- `template_metadata`: Custom regex pattern for answer extraction (see below)
+- `uuid`: Unique identifier for the question
+- `metadata`: Optional arbitrary metadata (not used for grading)
+
+Notes:
+- Letters are validated against the keys present in `options`
+- While most datasets use A–D, any letter set is supported as long as it matches the provided options
 
 ### Grading modes
 - `strict_single_letter_boxed` (default)
```
````diff
@@ -28,11 +32,31 @@ Notes
 - `lenient_answer_colon`
   - Extracts content after `Answer:` (case-insensitive).
   - If it is a single allowed letter, use it.
-  - Otherwise, if it exactly equals (after normalization) one options text, use that letter.
+  - Otherwise, if it exactly equals (after normalization) one option's text, use that letter.
   - Example: `options = [{"A": "Circle"}, {"B": "Square"}]`. This will match `Answer: B` or `Answer: Square`.
   - Legacy from NeMo-RL
 
-### Example dataset row (dataset format)
+### Custom answer extraction (template_metadata) - OPTIONAL
+For datasets with custom prompt formats, you can optionally use `template_metadata` with a custom regex pattern.
+
+**Note:** If you don't need custom formats, see `data/example.jsonl` for standard usage with `grading_mode` only.
+
+- `template_metadata.output_regex`: Custom regex pattern to extract the answer letter
+  - **Optional field** - use only if you need custom answer formats
+  - Takes **priority** over `grading_mode` when present
+  - Case-insensitive matching (IGNORECASE flag)
+  - Uses rightmost (last) match if multiple matches exist
+  - Gracefully falls back to `grading_mode` if regex is invalid
+
+**Example formats supported:**
+- `"Option Selected: B"` → regex: `Option Selected:\s*([A-Za-z])`
+- `"Final Choice: C"` → regex: `Final Choice:\s*([A-Za-z])`
+- `"ANSWER IS D"` → regex: `ANSWER IS\s*([A-Za-z])`
+- `"Answer: B"` (plain) → regex: `Answer\s*:\s*(?!Answer)\s*([A-Za-z])`
+
+**Priority order:** `template_metadata.output_regex` (if present) → `grading_mode` (default)
+
+### Example dataset row (standard format)
 ```json
 {
   "responses_create_params":
````
````diff
@@ -41,51 +65,49 @@ Notes
     [
       {
         "role": "user",
-        "content": "You should output your final response letter inside \\boxed{} and nothing else You can first think step-by-step. Which of the following genetic tests is used to identify the presence of a specific mutation associated with cystic fibrosis?\nA: Karyotyping\nB: Polymerase Chain Reaction (PCR)\nC: Whole-genome sequencing\nD: Chromosome painting\nE: Restriction Fragment Length Polymorphism (RFLP) analysis\nF: Southern blotting\nG: Microarray analysis\nH: Fluorescence in situ hybridization (FISH)\nI: Enzyme-linked immunosorbent assay (ELISA)\nJ: Methylation-specific PCR"
+        "content": "You should output your final response letter inside \\boxed{} and nothing else You can first think step-by-step. Which of the following genetic tests is used to identify the presence of a specific mutation associated with cystic fibrosis?\nA: Karyotyping\nB: Polymerase Chain Reaction (PCR)\n..."
       }
     ]
   },
-  "options":
-  [
-    {
-      "A": "Karyotyping"
-    },
-    {
-      "B": "Polymerase Chain Reaction (PCR)"
-    },
-    {
-      "C": "Whole-genome sequencing"
-    },
-    {
-      "D": "Chromosome painting"
-    },
-    {
-      "E": "Restriction Fragment Length Polymorphism (RFLP) analysis"
-    },
-    {
-      "F": "Southern blotting"
-    },
-    {
-      "G": "Microarray analysis"
-    },
-    {
-      "H": "Fluorescence in situ hybridization (FISH)"
-    },
-    {
-      "I": "Enzyme-linked immunosorbent assay (ELISA)"
-    },
-    {
-      "J": "Methylation-specific PCR"
-    }
-  ],
+  "options": [{"A": "Karyotyping"}, {"B": "Polymerase Chain Reaction (PCR)"}, ...],
   "expected_answer": "B",
   "grading_mode": "strict_single_letter_boxed",
   "uuid": "3c26f339-4b88-54be-b72a-e9c438ca6335"
 }
 ```
 
+### Example with template_metadata (custom format)
+```json
+{
+  "responses_create_params":
+  {
+    "input":
+    [
+      {
+        "role": "user",
+        "content": "Which genetic test identifies cystic fibrosis mutations?\nA: Karyotyping\nB: PCR\n...\n\nChoose the correct option.\nConclude with \"ANSWER IS X\" on the final line."
+      }
+    ]
+  },
+  "options": [{"A": "Karyotyping"}, {"B": "PCR"}, ...],
+  "expected_answer": "B",
+  "grading_mode": "strict_single_letter_boxed",
+  "template_metadata":
+  {
+    "output_regex": "ANSWER IS\\s*([A-Za-z])\\s*",
+    "template_id": "mcqa_generated_019",
+    "prompt_type": "generated",
+    "format_type": "mcqa"
+  },
+  "uuid": "eb07c826-fed5-57f8-bee6-bb29e099069d"
+}
+```
+
+**Note:** Example files in `data/example_with_template_metadata.jsonl` use simulated `reward_profiles` for demonstration purposes.
+
 ### Example of rollouts and usage
 
+**Standard format (with `grading_mode`):**
 ```bash
 config_paths="responses_api_agents/simple_agent/configs/simple_agent.yaml,\
 responses_api_models/openai_model/configs/openai_model.yaml,\
````
````diff
@@ -107,6 +129,16 @@ ng_collect_rollouts \
   +output_jsonl_fpath=data/MCQA_filtered_decontaminated_samples_rollouts.jsonl +limit=5
 ```
 
+**With template_metadata (custom regex):**
+```bash
+# Using example file with 5 different custom prompt formats
+ng_collect_rollouts \
+  +agent_name=simple_agent \
+  +input_jsonl_fpath=resources_servers/mcqa/data/example_with_template_metadata.jsonl \
+  +output_jsonl_fpath=resources_servers/mcqa/data/example_rollouts_with_template_metadata.jsonl \
+  +limit=5
+```
+
 Rollout example
 
 ```json
````
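The commands above consume JSONL rows shaped like the two examples earlier in this README diff. A pre-flight row check is cheap — a minimal sketch using only the standard library, not a utility shipped by this commit (the row literal is abridged; `responses_create_params` and `uuid` are omitted):

```python
import json
import re

# Abridged from the template_metadata example row above.
row = {
    "options": [{"A": "Karyotyping"}, {"B": "PCR"}],
    "expected_answer": "B",
    "grading_mode": "strict_single_letter_boxed",
    "template_metadata": {"output_regex": "ANSWER IS\\s*([A-Za-z])\\s*"},
}

# The gold letter must be one of the letters present in `options`.
letters = {k.upper() for entry in row["options"] for k in entry}
assert row["expected_answer"].upper() in letters

# An invalid regex makes the server fall back to grading_mode;
# compiling it up front surfaces the problem before a rollout run.
re.compile(row["template_metadata"]["output_regex"], re.IGNORECASE)

print(json.dumps(row))  # one line of the input JSONL
```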
````diff
@@ -231,9 +263,11 @@ Rollout example
 ```
 
 ### Implementation notes
-- The server extracts the last assistant message’s text from the Responses output.
-- Letters are validated against the provided `metadata.options` keys (or legacy top-level if present).
-- For `lenient_boxed`, only boxed content is considered; it must match exactly one option’s text after normalization.
+- The server extracts the last assistant message's text from the Responses output.
+- Letters are validated against the provided `options` keys.
+- For `lenient_boxed`, only boxed content is considered; it must match exactly one option's text after normalization.
+- **template_metadata priority**: When `template_metadata.output_regex` is present, it takes priority over `grading_mode` for answer extraction.
+- **Backward compatibility**: Existing datasets without `template_metadata` continue to work using `grading_mode`.
 
 
 ## Licensing information
````
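The notes above say option text must match "after normalization", but the normalizer itself (`_normalize_for_match`) is defined outside this diff. A hypothetical sketch of what such a normalizer typically does — an assumption for illustration, not the actual implementation:

```python
import re


def normalize_for_match(s: str) -> str:
    """Hypothetical stand-in: lowercase, strip punctuation, collapse whitespace."""
    s = s.lower().strip()
    s = re.sub(r"[^\w\s]", "", s)   # drop punctuation
    return re.sub(r"\s+", " ", s)   # collapse runs of whitespace


# Under this assumption, these two spellings compare equal:
print(normalize_for_match("Polymerase   Chain Reaction (PCR)"))  # polymerase chain reaction pcr
print(normalize_for_match("polymerase chain reaction PCR"))      # polymerase chain reaction pcr
```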

resources_servers/mcqa/app.py

Lines changed: 86 additions & 31 deletions
```diff
@@ -45,6 +45,8 @@ class MCQARunRequest(BaseRunRequest):
         "lenient_boxed",
         "lenient_answer_colon",
     ] = "strict_single_letter_boxed"
+    # Template metadata with custom regex support
+    template_metadata: Optional[dict[str, Any]] = None
 
 
 class MCQAVerifyRequest(MCQARunRequest, BaseVerifyRequest):
```
```diff
@@ -151,6 +153,54 @@ def _match_option_text(text: str, options: list[dict[str, str]], allowed_letters
     return None
 
 
+def _parse_answer_with_custom_regex(
+    text: str, regex_pattern: str, allowed_letters: set[str], options: Optional[list[dict[str, str]]]
+) -> Optional[str]:
+    """Parse answer using custom regex from template_metadata.
+
+    Uses rightmost (last) match to handle reasoning before final answer.
+    Case-insensitive matching to handle capitalization variations.
+
+    When using template_metadata with custom regex, we trust the regex pattern
+    and allow extracted letters even if options metadata is incomplete.
+    """
+    try:
+        # Use IGNORECASE flag and findall to get all matches
+        matches = re.findall(regex_pattern, text, re.IGNORECASE)
+        if not matches:
+            return None
+
+        # Take the LAST match (rightmost)
+        captured = matches[-1].strip().upper()
+
+        # Try direct letter match first
+        if len(captured) == 1 and captured.isalpha():
+            # If we have options metadata, validate against it
+            if allowed_letters and captured in allowed_letters:
+                return captured
+            # If options metadata is missing/incomplete, trust the regex
+            # This handles cases where template_metadata regex is used but options are incomplete
+            elif not allowed_letters:
+                return captured
+            # If captured letter is not in allowed_letters but allowed_letters exists,
+            # it might be a data quality issue - still return it when using template_metadata
+            else:
+                # Trust the regex when using template_metadata (this function is only called for template_metadata)
+                return captured
+
+        # Try matching against option text (normalized)
+        normalized_captured = _normalize_for_match(captured)
+        for entry in options or []:
+            for k, v in entry.items():
+                if k.upper() in allowed_letters and _normalize_for_match(v) == normalized_captured:
+                    return k.upper()
+
+        return None
+    except re.error:
+        # Invalid regex pattern, return None
+        return None
+
+
 class MCQAResourcesServer(SimpleResourcesServer):
     config: MCQAResourcesServerConfig
 
```
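Because the hunk above contains the entire helper, it can be exercised standalone; the single-letter path never reaches `_normalize_for_match`, so no other module internals are needed. A sketch with made-up inputs — illustrative, not part of the commit:

```python
# Assumes _parse_answer_with_custom_regex from the hunk above is in scope.
text = "Let me reason it out... ANSWER IS c\nFinal check: ANSWER IS D"
pred = _parse_answer_with_custom_regex(
    text, r"ANSWER IS\s*([A-Za-z])", allowed_letters={"A", "B", "C", "D"}, options=None
)
print(pred)  # -> "D": rightmost match wins, upper-cased

# An invalid pattern raises re.error internally and yields None,
# so the caller falls back to grading_mode.
assert _parse_answer_with_custom_regex("ANSWER IS D", r"([", {"D"}, None) is None
```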
```diff
@@ -167,39 +217,44 @@ async def verify(self, body: MCQAVerifyRequest) -> MCQAVerifyResponse:
 
         pred: Optional[str] = None
 
-        if body.grading_mode == "strict_single_letter_boxed":
-            pred, _, _ = _parse_answer_letter_strict_boxed(text, allowed_letters)
-        elif body.grading_mode == "lenient_boxed":
-            # Try strict boxed first
-            pred, _, _ = _parse_answer_letter_strict_boxed(text, allowed_letters)
-            if pred is None:
-                # Then try to match option text inside boxed content only
-                letter_from_text = _match_option_text(text, options, allowed_letters)
-                if letter_from_text is not None:
-                    pred = letter_from_text
-        elif body.grading_mode == "lenient_answer_colon":
-            # Look for Answer: <...>
-            m = ANSWER_COLON_PATTERN.search(text)
-            if m:
-                candidate = _strip_latex_wrappers(m.group(1)).strip()
-                # Letter case
-                if len(candidate) == 1 and candidate.isalpha():
-                    letter_up = candidate.upper()
-                    if letter_up in allowed_letters:
-                        pred = letter_up
-                # Option text equality (normalized)
-                if pred is None:
-                    cand_norm = _normalize_for_match(candidate)
-                    for entry in options or []:
-                        for k, v in entry.items():
-                            k_up = k.upper()
-                            if k_up in allowed_letters and _normalize_for_match(v) == cand_norm:
-                                pred = k_up
-                                break
-                        if pred is not None:
-                            break
-        else:
-            pred = None
+        # Check for template_metadata first (highest priority)
+        if body.template_metadata and "output_regex" in body.template_metadata:
+            regex_pattern = body.template_metadata["output_regex"]
+            pred = _parse_answer_with_custom_regex(text, regex_pattern, allowed_letters, options)
+
+        # Fallback to existing grading_mode logic if template_metadata didn't work
+        if pred is None:
+            if body.grading_mode == "strict_single_letter_boxed":
+                pred, _, _ = _parse_answer_letter_strict_boxed(text, allowed_letters)
+            elif body.grading_mode == "lenient_boxed":
+                # Try strict boxed first
+                pred, _, _ = _parse_answer_letter_strict_boxed(text, allowed_letters)
+                if pred is None:
+                    # Then try to match option text inside boxed content only
+                    letter_from_text = _match_option_text(text, options, allowed_letters)
+                    if letter_from_text is not None:
+                        pred = letter_from_text
+            elif body.grading_mode == "lenient_answer_colon":
+                # Look for Answer: <...>
+                m = ANSWER_COLON_PATTERN.search(text)
+                if m:
+                    candidate = _strip_latex_wrappers(m.group(1)).strip()
+                    # Letter case
+                    if len(candidate) == 1 and candidate.isalpha():
+                        letter_up = candidate.upper()
+                        if letter_up in allowed_letters:
+                            pred = letter_up
+                    # Option text equality (normalized)
+                    if pred is None:
+                        cand_norm = _normalize_for_match(candidate)
+                        for entry in options or []:
+                            for k, v in entry.items():
+                                k_up = k.upper()
+                                if k_up in allowed_letters and _normalize_for_match(v) == cand_norm:
+                                    pred = k_up
+                                    break
+                            if pred is not None:
+                                break
 
         gold = (expected_answer or "").strip().upper()
         is_correct = (pred == gold) if (pred is not None and gold) else False
```
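The restructuring above boils down to one priority chain: custom regex first, `grading_mode` only if that yields nothing. A condensed standalone paraphrase — hypothetical names throughout, and the boxed pattern here is a stand-in for the real `strict_single_letter_boxed` parser defined elsewhere in app.py:

```python
import re
from typing import Optional


def extract_letter(text: str, template_metadata: Optional[dict],
                   allowed_letters: set[str]) -> Optional[str]:
    """Condensed paraphrase of verify(): custom regex first, then fallback."""
    # 1) template_metadata.output_regex takes priority; like the real helper,
    #    a single captured letter is trusted without an allowed-letters check.
    if template_metadata and "output_regex" in template_metadata:
        try:
            matches = re.findall(template_metadata["output_regex"], text, re.IGNORECASE)
            if matches:
                letter = matches[-1].strip().upper()
                if len(letter) == 1 and letter.isalpha():
                    return letter
        except re.error:
            pass  # invalid pattern: fall through to the grading-mode path
    # 2) Stand-in for the default strict_single_letter_boxed parser
    #    (hypothetical pattern; the real one lives elsewhere in app.py).
    boxed = re.findall(r"\\boxed\{([A-Za-z])\}", text)
    if boxed and boxed[-1].upper() in allowed_letters:
        return boxed[-1].upper()
    return None


print(extract_letter("thinking... ANSWER IS B",
                     {"output_regex": r"ANSWER IS\s*([A-Za-z])"}, {"A", "B"}))  # B
print(extract_letter(r"final: \boxed{C}", None, {"A", "B", "C"}))               # C
```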

resources_servers/mcqa/configs/mcqa.yaml

Lines changed: 3 additions & 0 deletions
```diff
@@ -25,3 +25,6 @@ mcqa_simple_agent:
     - name: example
       type: example
       jsonl_fpath: resources_servers/mcqa/data/example.jsonl
+    - name: example_with_template_metadata
+      type: example
+      jsonl_fpath: resources_servers/mcqa/data/example_with_template_metadata.jsonl
```
