NVIDIA-NeMo
diff --git a/README.md b/README.md
Lines changed: 48 additions & 0 deletions
diff --git a/nemo_gym/cli.py b/nemo_gym/cli.py
Lines changed: 6 additions & 1 deletion
diff --git a/nemo_gym/rollout_collection.py b/nemo_gym/rollout_collection.py
Lines changed: 13 additions & 8 deletions
diff --git a/nemo_gym/server_utils.py b/nemo_gym/server_utils.py
Lines changed: 85 additions & 16 deletions
diff --git a/nemo_gym/train_data_utils.py b/nemo_gym/train_data_utils.py
Lines changed: 1 addition & 1 deletion
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 5 additions & 0 deletions
diff --git a/resources_servers/comp_coding/data/train_metrics.json b/resources_servers/comp_coding/data/train_metrics.json
Lines changed: 44 additions & 0 deletions
diff --git a/resources_servers/comp_coding/data/train_metrics.jsonl b/resources_servers/comp_coding/data/train_metrics.jsonl
Lines changed: 0 additions & 27 deletions
@@ -19,6 +19,7 @@
 - [How To: ng\_dump\_config - Dump a YAML config as exactly as NeMo Gym sees it](#how-to-ng_dump_config---dump-a-yaml-config-as-exactly-as-nemo-gym-sees-it)
 - [How To: Use NeMo Gym with a non-Responses compatible API endpoint like vLLM](#how-to-use-nemo-gym-with-a-non-responses-compatible-api-endpoint-like-vllm)
 - [How To: Multi-verifier usage](#how-to-multi-verifier-usage)
+- [How To: Profile your resources server](#how-to-profile-your-resources-server)
 - [FAQ: DCO and commit signing VSCode and Git setup](#faq-dco-and-commit-signing-vscode-and-git-setup)
 - [FAQ: SFT and RL](#faq-sft-and-rl)
 - [FAQ: Error: Found files with missing copyright](#faq-error-found-files-with-missing-copyright)
@@ -772,6 +773,53 @@ ng_run "+config_paths=[$config_paths]"
 The same process goes for data preparation and downstream training framework Gym configuration, you would just add additional server configs.
 
 
+# How To: Profile your resources server
+For large-scale verifier training, it's critical that your resources server is as efficient as possible, since it may receive 16k or more concurrent requests. Gym provides tools to profile your servers and understand where their time is spent.
+
+In one terminal, start your agent, model, and resources servers with profiling enabled, using the following flags:
+- `profiling_enabled` (bool): whether profiling is enabled. It is disabled by default because it adds slight overhead that we don't want at runtime.
+- `profiling_results_dirpath` (str): the directory, relative to the Gym root directory, in which to save all server profiling results. Previous logs for the same server in that directory will be overwritten.
+```bash
+config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
+resources_servers/library_judge_math/configs/bytedtsinghua_dapo17k.yaml"
+ng_run "+config_paths=[${config_paths}]" \
+    +profiling_enabled=true \
+    +profiling_results_dirpath=results/profiling/library_judge_math
+```
+
+In another terminal, run a large number of rollouts against your servers. Use the `limit` and `num_repeats` flags to adjust the number of samples you want to run.
+```bash
+ng_collect_rollouts +agent_name=library_judge_math_simple_agent \
+    +input_jsonl_fpath=resources_servers/library_judge_math/data/dapo17k_bytedtsinghua_train.jsonl \
+    +output_jsonl_fpath=temp/library_judge_math_rollouts.jsonl \
+    +limit=1024 \
+    +num_repeats=1
+```
+
+After `ng_collect_rollouts` finishes, press Ctrl+C to quit your servers. You should see output in the terminal like this:
+```bash
+🛑 Stopping profiler for <server name>. Check <path to the server's .log file> for the metrics!
+```
+
+The log file content for a server will look something like the following:
+```
+name                                                                                                                      ncall       tsub      ttot      tavg      
+.../nemo-gym/resources_servers/library_judge_math/app.py:118 LibraryJudgeMathResourcesServer.verify                       1024        0.009755  17.98387  0.017562
+.../nemo-gym/resources_servers/library_judge_math/app.py:145 LibraryJudgeMathResourcesServer._verify_answer               1024        0.002933  17.87998  0.017461
+.../nemo-gym/resources_servers/library_judge_math/app.py:173 LibraryJudgeMathResourcesServer._verify_answer_with_library  1024        0.007851  17.87704  0.017458
+.../nemo-gym/resources_servers/library_judge_math/app.py:191 <genexpr>                                                    2339        0.001695  0.029082  0.000012
+.../nemo-gym/resources_servers/library_judge_math/app.py:163 _mute_output                                                 2048        0.007473  0.016538  0.000008
+```
+
+- `ncall`: number of calls (how many times the function/subroutine was invoked).
+  - The `LibraryJudgeMathResourcesServer.verify` function was invoked 1024 times.
+- `tsub`: time spent inside the subroutine itself, excluding calls to other functions (sometimes called "self time").
+  - The `LibraryJudgeMathResourcesServer.verify` function __itself__ accounted for only 0.009755s of time.
+- `ttot`: total time spent in the subroutine, including all the functions it called.
+  - The `LibraryJudgeMathResourcesServer.verify` function and all the functions it called, including `_verify_answer`, accounted for a total of 17.98387s.
+- `tavg`: average time per call (`ttot / ncall`).
+  - The `LibraryJudgeMathResourcesServer.verify` function took 0.017562s per call on average.
+
+
 # FAQ: DCO and commit signing VSCode and Git setup
 Here are some suggestions for easier development using the VSCode code editor.
 
 
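As a rough illustration of how the column definitions in the README section above can be used, the sketch below parses a profiling log of that shape and picks out the function with the most self time. The helper `parse_profile_log` and the `FuncStat` type are hypothetical and not part of NeMo Gym. The sketch assumes whitespace-separated columns whose last four fields are `ncall`, `tsub`, `ttot`, and `tavg`, as in the sample output; the two sample rows are copied from the README example.

```python
from typing import NamedTuple


class FuncStat(NamedTuple):
    name: str
    ncall: int
    tsub: float
    ttot: float
    tavg: float


def parse_profile_log(text: str) -> list[FuncStat]:
    """Parse rows whose last four columns are ncall, tsub, ttot, tavg."""
    stats = []
    for line in text.splitlines():
        # Function names contain spaces, so split the four numeric columns from the right.
        parts = line.rsplit(maxsplit=4)
        if len(parts) != 5 or parts[0].strip() == "name":
            continue  # skip blank lines and the header row
        name, ncall, tsub, ttot, tavg = parts
        # Assumption: a recursive call count may appear as "total/primitive"; keep the total.
        stats.append(
            FuncStat(name.strip(), int(ncall.split("/")[0]), float(tsub), float(ttot), float(tavg))
        )
    return stats


SAMPLE = """\
name                                                                      ncall  tsub      ttot      tavg
.../nemo-gym/resources_servers/library_judge_math/app.py:118 LibraryJudgeMathResourcesServer.verify  1024  0.009755  17.98387  0.017562
.../nemo-gym/resources_servers/library_judge_math/app.py:191 <genexpr>  2339  0.001695  0.029082  0.000012
"""

stats = parse_profile_log(SAMPLE)
hottest_self = max(stats, key=lambda s: s.tsub)  # most time spent in the function body itself
hottest_total = max(stats, key=lambda s: s.ttot)  # most time including callees
print(hottest_self.name, hottest_self.tsub)
```

Sorting by `tsub` points at functions whose own bodies are slow, while sorting by `ttot` points at expensive call trees.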
@@ -37,7 +37,12 @@
     GlobalConfigDictParserConfig,
     get_global_config_dict,
 )
-from nemo_gym.server_utils import HEAD_SERVER_KEY_NAME, HeadServer, ServerClient, ServerStatus
+from nemo_gym.server_utils import (
+    HEAD_SERVER_KEY_NAME,
+    HeadServer,
+    ServerClient,
+    ServerStatus,
+)
 
 
 def _setup_env_command(dir_path: Path) -> str:  # pragma: no cover
 
@@ -43,15 +43,15 @@ class RolloutCollectionConfig(BaseModel):
 
 class RolloutCollectionHelper(BaseModel):  # pragma: no cover
     async def run_from_config(self, config: RolloutCollectionConfig):
+        range_iterator = repeat(0)
+        if config.limit:
+            range_iterator = range(config.limit)
+            print(f"Limiting the number of rows to {config.limit}!")
+
         with open(config.input_jsonl_fpath) as input_dataset:
-            rows = list(map(json.loads, input_dataset))
+            rows = [row for _, row in zip(range_iterator, map(json.loads, input_dataset))]
         print(f"Found {len(rows)} rows!")
 
-        if config.limit:
-            previous_length = len(rows)
-            rows = rows[: config.limit]
-            print(f"Limiting rows from {previous_length} to {len(rows)}!")
-
         if config.num_repeats:
             previous_length = len(rows)
             rows = list(chain.from_iterable(repeat(row, config.num_repeats) for row in rows))
@@ -63,6 +63,11 @@ async def run_from_config(self, config: RolloutCollectionConfig):
 
         server_client = self.setup_server_client()
 
+        tqdm_miniters = 10
+        print(
+            f"The tqdm progress bar will only update every {tqdm_miniters} samples that finish to ensure that you are not being spammed."
+        )
+
         metrics = Counter()
         with open(config.output_jsonl_fpath, "a") as f:
 
@@ -73,7 +78,7 @@ async def _post_coroutine(row: dict) -> None:
                     f.write(json.dumps(result) + "\n")
                     metrics.update({k: v for k, v in result.items() if isinstance(v, (int, float))})
 
-            await tqdm.gather(*map(_post_coroutine, rows), desc="Collecting rollouts")
+            await tqdm.gather(*map(_post_coroutine, rows), desc="Collecting rollouts", miniters=tqdm_miniters)
 
         avg_metrics = {k: v / len(rows) for k, v in metrics.items()}
 
@@ -88,7 +93,7 @@ async def _post_subroutine(row: Dict) -> Dict:
             res = await server_client.post(server_name=row.pop("agent_ref")["name"], url_path="/run", json=row)
             return await res.json()
 
-        return await tqdm.gather(*map(_post_subroutine, examples), desc="Collecting rollouts")
+        return await tqdm.gather(*map(_post_subroutine, examples), desc="Collecting rollouts", miniters=10)
 
     def setup_server_client(self, head_server_config: Optional[BaseServerConfig] = None) -> ServerClient:
         server_client = ServerClient.load_from_global_config(head_server_config)
 
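The `limit` change above stops reading the input file once enough rows have been parsed, instead of loading every row and slicing afterwards. A minimal sketch of the same idea using `itertools.islice` is shown below; this is an alternative formulation rather than what the diff does, and `read_jsonl_rows` is a hypothetical helper name.

```python
import json
from itertools import islice
from typing import Iterable, Optional


def read_jsonl_rows(lines: Iterable[str], limit: Optional[int] = None) -> list[dict]:
    """Parse at most `limit` JSON lines; a falsy limit (None or 0) means no limit."""
    # islice(iterable, None) yields everything, mirroring the `if config.limit:` check.
    return [json.loads(line) for line in islice(lines, limit or None)]


lines = ['{"id": 0}\n', '{"id": 1}\n', '{"id": 2}\n']

first_two = read_jsonl_rows(lines, limit=2)
all_rows = read_jsonl_rows(lines)

# Lines past the limit are never consumed from the underlying iterator.
line_iter = iter(lines)
read_jsonl_rows(line_iter, limit=1)
next_unread = next(line_iter)
```

In the diff, `zip(range_iterator, ...)` has the same property because the range is the first argument: `zip` stops as soon as the range is exhausted, without pulling another line from the file.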
@@ -15,15 +15,19 @@
 import atexit
 import json
 from abc import abstractmethod
+from contextlib import asynccontextmanager
+from io import StringIO
 from logging import Filter as LoggingFilter
 from logging import LogRecord, getLogger
 from os import getenv
+from pathlib import Path
 from threading import Thread
 from typing import Literal, Optional, Tuple, Type, Union, Unpack
 from uuid import uuid4
 
 import requests
 import uvicorn
+import yappi
 from aiohttp import ClientResponse, ClientSession, ClientTimeout, DummyCookieJar, ServerDisconnectedError, TCPConnector
 from aiohttp.client import _RequestOptions
 from fastapi import FastAPI, Request, Response
@@ -32,6 +36,7 @@
 from requests.exceptions import ConnectionError
 from starlette.middleware.sessions import SessionMiddleware
 
+from nemo_gym import PARENT_DIR
 from nemo_gym.config_types import (
     BaseRunServerInstanceConfig,
     BaseServerConfig,
@@ -50,8 +55,8 @@
 
 
 class GlobalAIOHTTPAsyncClientConfig(BaseModel):
-    global_aiohttp_connector_limit: int = 1000
-    global_aiohttp_connector_limit_per_host: int = 100
+    global_aiohttp_connector_limit: int = 100 * 1024
+    global_aiohttp_connector_limit_per_host: int = 1024
 
 
 def get_global_aiohttp_client(
@@ -123,7 +128,7 @@ async def request(method: str, url: str, **kwargs: Unpack[_RequestOptions]) -> C
             await asyncio.sleep(0.5)
         except Exception as e:
             print(
-                f"""Hit an exception while making a request (try {num_tries}): {e}
+                f"""Hit an exception while making a request (try {num_tries}): {type(e)}: {e}
 Sleeping 0.5s and retrying...
 """
             )
@@ -274,6 +279,20 @@ def load_config_from_global_config(cls) -> "BaseRunServerInstanceConfig":
         return server_config
 
 
+class ProfilingMiddlewareInputConfig(BaseModel):
+    # Relative to the Gym root dir.
+    profiling_results_dirpath: Optional[str] = None
+
+
+class ProfilingMiddlewareConfig(ProfilingMiddlewareInputConfig):
+    profiling_enabled: bool = False
+
+
+class UvicornLoggingConfig(BaseModel):
+    # Default to False for regular use cases.
+    uvicorn_logging_show_200_ok: bool = False
+
+
 class SimpleServer(BaseServer):
     server_client: ServerClient
 
@@ -305,36 +324,86 @@ async def add_session_id(request: Request, call_next):  # pragma: no cover
         session_middleware_key = self.get_session_middleware_key()
         app.add_middleware(SessionMiddleware, secret_key=session_middleware_key, session_cookie=session_middleware_key)
 
+    def setup_profiling(self, app: FastAPI, profiling_config: ProfilingMiddlewareConfig) -> None:  # pragma: no cover
+        base_profile_dir = Path(PARENT_DIR) / profiling_config.profiling_results_dirpath
+        server_profile_path = (base_profile_dir / self.get_session_middleware_key()).with_suffix(".log")
+
+        base_profile_dir.mkdir(parents=True, exist_ok=True)
+
+        main_app_lifespan = app.router.lifespan_context
+
+        @asynccontextmanager
+        async def lifespan_wrapper(app):
+            yappi.set_clock_type("WALL")
+            yappi.start()
+            print(f"🔍 Enabled profiling for {self.config.name}")
+
+            async with main_app_lifespan(app) as maybe_state:
+                yield maybe_state
+
+            print(f"🛑 Stopping profiler for {self.config.name}. Check {server_profile_path} for the metrics!")
+            yappi.stop()
+
+            buffer = StringIO()
+            yappi.get_func_stats().print_all(
+                out=buffer,
+                columns={
+                    0: ("name", 200),
+                    1: ("ncall", 10),
+                    2: ("tsub", 8),
+                    3: ("ttot", 8),
+                    4: ("tavg", 8),
+                },
+            )
+
+            buffer.seek(0)
+            with open(server_profile_path, "w") as f:
+                past_header = False
+                for line in buffer:
+                    if not past_header or self.config.entrypoint in line:
+                        f.write(line)
+
+                    if line.startswith("name"):
+                        past_header = True
+
+        app.router.lifespan_context = lifespan_wrapper
+
     @classmethod
     def run_webserver(cls) -> None:  # pragma: no cover
+        global_config_dict = get_global_config_dict()
+
         server_config = cls.load_config_from_global_config()
         server_client = ServerClient(
             head_server_config=ServerClient.load_head_server_config(),
-            global_config_dict=get_global_config_dict(),
+            global_config_dict=global_config_dict,
         )
         server = cls(config=server_config, server_client=server_client)
 
         app = server.setup_webserver()
 
-        class No200Filter(LoggingFilter):
-            def filter(self, record: LogRecord) -> bool:
-                msg = record.getMessage()
-                return not msg.strip().endswith("200")
+        profiling_config = ProfilingMiddlewareConfig.model_validate(global_config_dict)
+        if profiling_config.profiling_enabled:
+            server.setup_profiling(app, profiling_config)
 
-        uvicorn_logger = getLogger("uvicorn.access")
-        uvicorn_logger.addFilter(No200Filter())
+        uvicorn_logging_cfg = UvicornLoggingConfig.model_validate(global_config_dict)
+        if not uvicorn_logging_cfg.uvicorn_logging_show_200_ok:
 
-        print(
-            "Adding a uvicorn logging filter so that the logs aren't spammed with 200 OK messages. This is to help errors pop up better and filter out noise."
-        )
+            class No200Filter(LoggingFilter):
+                def filter(self, record: LogRecord) -> bool:
+                    msg = record.getMessage()
+                    return not msg.strip().endswith("200")
+
+            uvicorn_logger = getLogger("uvicorn.access")
+            uvicorn_logger.addFilter(No200Filter())
+
+            print(
+                "Adding a uvicorn logging filter so that the logs aren't spammed with 200 OK messages. This is to help errors pop up better and filter out noise."
+            )
 
         uvicorn.run(
             app,
             host=server.config.host,
             port=server.config.port,
-            # We don't have any explicit lifespan logic, so instead of defaulting to "auto"
-            # We just turn lifespan off
-            lifespan="off",
         )
 
 
 
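`setup_profiling` works by wrapping the app's existing lifespan context in a new one that starts the profiler before startup and stops it after shutdown, which is presumably why `lifespan="off"` is dropped from `uvicorn.run`. The standard-library sketch below shows that wrapping pattern in isolation. The names `original_lifespan`, `wrap_lifespan`, and `events` are hypothetical, and the appended strings stand in for the `yappi.start()` / `yappi.stop()` calls. Unlike the diff, the sketch uses `try`/`finally` so the stop hook also runs if the wrapped lifespan raises.

```python
import asyncio
from contextlib import asynccontextmanager

events = []


@asynccontextmanager
async def original_lifespan(app):
    # Stand-in for the lifespan context the app already had.
    events.append("app startup")
    yield {"db": "connected"}
    events.append("app shutdown")


def wrap_lifespan(inner):
    """Return a lifespan that brackets `inner` with profiler start/stop hooks."""

    @asynccontextmanager
    async def wrapper(app):
        events.append("profiler start")  # stand-in for yappi.start()
        try:
            async with inner(app) as state:
                yield state  # pass the inner lifespan's state through unchanged
        finally:
            events.append("profiler stop")  # stand-in for yappi.stop() and writing the report

    return wrapper


async def main():
    lifespan = wrap_lifespan(original_lifespan)
    async with lifespan(None) as state:
        events.append(f"serving with {state}")


asyncio.run(main())
print(events)
```

The profiler hooks end up outermost, so the measured window covers the wrapped app's own startup and shutdown work as well as request handling.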
@@ -577,7 +577,7 @@ def collate_samples(
 
             parent = Path(config.output_dirpath)
             parent.mkdir(exist_ok=True)
-            metrics_fpath = parent / f"{type}_metrics.jsonl"
+            metrics_fpath = parent / f"{type}_metrics.json"
             maybe_conflicting_metrics_fpath = self._validate_aggregate_metrics(
                 aggregate_metrics_dict=aggregate_metrics_dict,
                 metrics_fpath=metrics_fpath,
 
@@ -131,6 +131,11 @@ dependencies = [
     # Updated Sun Sep 21, 2025 with aiohttp==3.12.15
     # License: Apache 2.0 https://github.com/aio-libs/aiohttp/blob/9a2f146a12e3525b43e96723ef41584bf9cf784e/LICENSE.txt
     "aiohttp",
+
+    # yappi: profiling tool
+    # Updated Mon Sep 22, 2025 with yappi==1.6.10
+    # License: MIT https://github.com/sumerc/yappi/blob/1d3f7501701e1f050b6dcd6a86fd36aec08185c7/LICENSE
+    "yappi",
 ]
 
 [dependency-groups]
 
@@ -0,0 +1,44 @@
+{
+    "name": "train",
+    "type": "train",
+    "jsonl_fpath": "resources_servers/comp_coding/data/train.jsonl",
+    "gitlab_identifier": {
+        "dataset_name": "comp_coding",
+        "version": "0.0.1",
+        "artifact_fpath": "train.jsonl"
+    },
+    "license": "Apache 2.0",
+    "Number of examples": 5000,
+    "Number of tools": {
+        "Total # non-null values": 0,
+        "Average": 0.0,
+        "Min": 0.0,
+        "Max": 0.0,
+        "Median": 0.0,
+        "Standard deviation": 0.0
+    },
+    "Json-dumped number of words (proxy for token count)": {
+        "Total # non-null values": 5000,
+        "Average": 336.1797999999992,
+        "Min": 46.0,
+        "Max": 1274.0,
+        "Median": 319.5131482834187,
+        "Standard deviation": 135.7584072571132
+    },
+    "Number of turns": {
+        "Total # non-null values": 5000,
+        "Average": 1.0,
+        "Min": 1.0,
+        "Max": 1.0,
+        "Median": 1.0,
+        "Standard deviation": 0.0
+    },
+    "Temperature": {
+        "Total # non-null values": 0,
+        "Average": 0.0,
+        "Min": 0.0,
+        "Max": 0.0,
+        "Median": 0.0,
+        "Standard deviation": 0.0
+    }
+}