diff --git a/.agents/README.md b/.agents/README.md index 0807da4e4..7b0846617 100644 --- a/.agents/README.md +++ b/.agents/README.md @@ -25,4 +25,4 @@ Tool-specific directories symlink back here so each harness resolves skills from All skills and agents in this directory are for **contributors developing DataDesigner** — not for end users building datasets. -The usage skill for building datasets with DataDesigner lives separately at [`skills/data-designer/`](../skills/data-designer/). For product documentation, see the [docs site](https://nvidia-nemo.github.io/DataDesigner/). +The usage skill for building datasets with DataDesigner lives separately at [`skills/data-designer/`](../skills/data-designer/). For product documentation, see the [docs site](https://docs.nvidia.com/nemo/datadesigner/). diff --git a/.agents/agents/docs-searcher.md b/.agents/agents/docs-searcher.md index 814254189..5014e3875 100644 --- a/.agents/agents/docs-searcher.md +++ b/.agents/agents/docs-searcher.md @@ -1,6 +1,6 @@ --- name: docs-searcher -description: Search local documentation in the docs/ folder for content related to a topic. Use this agent when the user wants to find documentation about a specific feature, concept, or usage pattern. Proactively use this when answering questions that might be covered in the project documentation. +description: Search local Fern documentation for content related to a topic. Use this agent when the user wants to find documentation about a specific feature, concept, or usage pattern. Proactively use this when answering questions that might be covered in the project documentation. tools: Glob, Grep, Read model: haiku permissionMode: bypassPermissions @@ -8,20 +8,20 @@ permissionMode: bypassPermissions # Documentation Search Agent -You are a documentation search specialist. Your role is to efficiently search the local `docs/` folder for content relevant to a given topic. +You are a documentation search specialist. 
Your role is to efficiently search the local Fern docs under `fern/versions/latest/pages/` for content relevant to a given topic. ## Instructions When given a search topic, perform the following searches: -1. **Find all documentation files** in the docs/ folder: +1. **Find all documentation files** in the Fern pages folder: ``` - Glob pattern: "docs/**/*.md" + Glob pattern: "fern/versions/latest/pages/**/*.{md,mdx}" ``` 2. **Search for topic keywords** across all markdown files: ``` - Grep pattern: "" in path: "docs/" + Grep pattern: "" in path: "fern/versions/latest/pages/" ``` - Try multiple variations of the search terms (singular/plural, related terms) - Use case-insensitive search (`-i: true`) @@ -44,12 +44,12 @@ When given a search topic, perform the following searches: ### Relevant Documentation -- **[docs/path/to/file.md](docs/path/to/file.md)** +- **[fern/versions/latest/pages/path/to/file.mdx](fern/versions/latest/pages/path/to/file.mdx)** > Brief excerpt showing relevant content... Explanation of why this is relevant to the search topic. -- **[docs/another/file.md](docs/another/file.md)** +- **[fern/versions/latest/pages/another/file.mdx](fern/versions/latest/pages/another/file.mdx)** > Another relevant excerpt... Explanation of relevance. @@ -64,7 +64,7 @@ Brief summary of what was found and any recommendations for the user. - If no relevant documentation is found, clearly state that - Keep excerpts concise but include enough context to be useful - Prioritize user guides and examples over API reference when both exist -- If the docs/ folder doesn't exist or is empty, report that clearly +- If `fern/versions/latest/pages/` doesn't exist or is empty, report that clearly ## Search Strategy diff --git a/.agents/recipes/_fix-policy.md b/.agents/recipes/_fix-policy.md index e95ef8ae7..d5973bc80 100644 --- a/.agents/recipes/_fix-policy.md +++ b/.agents/recipes/_fix-policy.md @@ -37,7 +37,7 @@ If the top-ranked candidate fails the bar, try the next. 
If none of the top | Suite | Paths the recipe MAY modify | |-------|-----------------------------| -| docs-and-references | `architecture/**`, `docs/**`, `README.md`, `CONTRIBUTING.md`, `DEVELOPMENT.md`, `STYLEGUIDE.md`, `packages/*/src/**/*.py` (docstring-only edits) | +| docs-and-references | `architecture/**`, `fern/versions/latest/pages/**`, `README.md`, `CONTRIBUTING.md`, `DEVELOPMENT.md`, `STYLEGUIDE.md`, `packages/*/src/**/*.py` (docstring-only edits) | | dependencies | `packages/*/pyproject.toml` | | structure | `packages/*/src/**/*.py` | | code-quality | `packages/*/src/**/*.py` | diff --git a/.agents/recipes/docs-and-references/recipe.md b/.agents/recipes/docs-and-references/recipe.md index 45a54b324..5d2453038 100644 --- a/.agents/recipes/docs-and-references/recipe.md +++ b/.agents/recipes/docs-and-references/recipe.md @@ -57,7 +57,7 @@ interface package is what users see first. Check links in these locations: - `README.md` - all relative links and URLs - `architecture/*.md` - cross-references to other architecture docs and code -- `docs/` - MkDocs content links, code references, cross-page links +- `fern/versions/latest/pages/` - Fern content links, code references, cross-page links - `CONTRIBUTING.md`, `DEVELOPMENT.md`, `STYLEGUIDE.md` - relative links For each link, verify the target file or anchor exists. Report broken links @@ -80,10 +80,10 @@ ls architecture/ ### 4. Docs site content accuracy -The MkDocs site under `docs/` is the primary user-facing documentation. +The Fern site under `fern/versions/latest/pages/` is the primary user-facing documentation. Review for accuracy against the current code: -**Concepts pages** (`docs/concepts/`): +**Concepts pages** (`fern/versions/latest/pages/concepts/`): - Do code examples use correct imports, class names, and method signatures? Check against actual source - e.g., verify `DataDesigner.create()`, `DataDesigner.preview()`, builder patterns match the real API. 
@@ -91,17 +91,17 @@ Review for accuracy against the current code: or renamed? - Are new features or column types missing from the docs? -**Recipes** (`docs/recipes/`): +**Recipes** (`fern/versions/latest/pages/recipes/`): - Do step-by-step instructions reference correct file paths, class names, and CLI commands? Run `grep` for class names mentioned in recipe docs and verify they resolve in the source. -**Dev notes** (`docs/devnotes/posts/`): +**Dev notes** (`fern/versions/latest/pages/devnotes/posts/`): - Dev notes describe implementation details that may have changed. Spot-check the most recent 3-5 posts for references to functions, classes, or architecture that have since been modified. -**Code reference** (`docs/code_reference/`): +**Code reference** (`fern/versions/latest/pages/code_reference/`): - Check that autodoc module paths point to modules that still exist. **Prioritize by risk of drift**: pages with the most code symbols referenced diff --git a/.agents/skills/datadesigner-docs/SKILL.md b/.agents/skills/datadesigner-docs/SKILL.md index d2a49ae1c..fe6c93e15 100644 --- a/.agents/skills/datadesigner-docs/SKILL.md +++ b/.agents/skills/datadesigner-docs/SKILL.md @@ -12,11 +12,11 @@ description: > Unified skill for adding, updating, moving, and removing pages on the NeMo Data Designer Fern docs site. -Current URL: **`datadesigner.docs.buildwithfern.com/nemo/datadesigner`** (see `instances` in [`fern/docs.yml`](../../../fern/docs.yml)). Source of truth for everything user-facing is `fern/`. +Current URL: **`docs.nvidia.com/nemo/datadesigner`** (see `instances` in [`fern/docs.yml`](../../../fern/docs.yml)). Source of truth for everything user-facing is `fern/`. ## Scope Rule -**ALL doc edits happen under `fern/`.** The legacy `docs/` directory is the original MkDocs source. `docs/notebook_source/*.py` remains canonical for notebook code, but **do not add new top-level prose pages under `docs/`**. 
Concept pages, recipes, plugins, code reference, and Dev Notes prose live under `fern/versions/latest/pages/`. +**ALL doc prose edits happen under `fern/`.** The remaining `docs/` directory is only for notebook source, generated Colab notebooks, docs scripts, and downloadable recipe scripts. `docs/notebook_source/*.py` remains canonical for notebook code. Concept pages, recipes, plugins, code reference, and Dev Notes prose live under `fern/versions/latest/pages/`. ## Versioning Model @@ -43,7 +43,7 @@ fern/ ├── fern.config.json ← organization + fern-api version pin ├── main.css ← bundled NVIDIA theme CSS ├── assets/ ← logos, favicon, recipe assets, devnote post images (shared) -├── images/ ← /images/* references from MDX (mirrors docs/images/) +├── images/ ← /images/* references from MDX ├── styles/ ← per-component CSS (notebook-viewer, authors, metrics-table, …) ├── components/ ← React/JSX MDX components │ ├── NotebookViewer.tsx ← renders converted .ipynb cells with outputs @@ -484,7 +484,7 @@ git commit -s -m "docs: " DCO sign-off (`-s`) is required by CONTRIBUTING. Use `docs:` prefix (matches recent commit history). Subject line ≤ 50 chars (hard limit 72). -When the team adds a Fern preview workflow (modeled after Gym's `fern-docs-preview-comment.yml`), PRs touching `fern/**` will get an automatic preview URL posted as a comment. Until that lands, share local dev-server screenshots in PR descriptions. +PRs touching Fern docs inputs get an automatic Fern preview URL posted as a comment. Fork PRs still run checks, but hosted preview publishing is skipped because it requires deployment secrets. ## Cutting a New Version Train @@ -511,4 +511,4 @@ Do not copy page trees by hand on `main`. The release workflow copies `latest/pa - Editing Python source under `packages/` — that's a code change, not a docs change. - Adding a notebook tutorial's *code*: edit `docs/notebook_source/.py`, not the converted `.ipynb` or the wrapper MDX. 
-- Editing dev note *prose*: edit the migrated MDX under `fern/versions/latest/pages/devnotes/posts/.mdx`. (The original `docs/devnotes/posts/.md` is no longer the source of truth — Fern is.) +- Editing dev note *prose*: edit the MDX under `fern/versions/latest/pages/devnotes/posts/.mdx`. diff --git a/.agents/skills/review-code/SKILL.md b/.agents/skills/review-code/SKILL.md index 6c6c8a572..198a0edf4 100644 --- a/.agents/skills/review-code/SKILL.md +++ b/.agents/skills/review-code/SKILL.md @@ -97,7 +97,7 @@ Read the following files at the repository root to load the project's standards **Documentation sources (load when the changeset touches matching areas):** - **`architecture/*.md`** — subsystem maps aligned with `packages/` (e.g. `engine/mcp/` ↔ `architecture/mcp.md`). Use to verify the PR does not leave recorded architecture false relative to new behavior. -- **`docs/`** — published user-facing documentation. Cross-check when public API, CLI behavior, or config surface changes would affect what readers are told. +- **`fern/versions/latest/pages/`** — published user-facing documentation. Cross-check when public API, CLI behavior, or config surface changes would affect what readers are told. Use these guidelines as the baseline for the entire review. Project-specific rules take precedence over general best practices. @@ -181,14 +181,14 @@ Re-read the changed files with a focus on **structure and design of the new/modi **Documentation alignment (same pass — scoped, not a full docs audit):** -When **code** under `packages/` changes behavior, structure, or public contracts in a way that a maintainer would reasonably describe in `architecture/` or `docs/`: +When **code** under `packages/` changes behavior, structure, or public contracts in a way that a maintainer would reasonably describe in `architecture/` or Fern docs: -1. Identify the closest **`architecture/.md`** (and any obvious `docs/` pages) for that subsystem. +1. 
Identify the closest **`architecture/.md`** (and any obvious `fern/versions/latest/pages/` pages) for that subsystem. 2. If the PR **also edits** those docs, sanity-check that the edits match the code. -3. If the PR **does not** edit docs but the change **contradicts** what `architecture/` or `docs/` currently asserts, flag it (**Warnings** if contributors rely on that text; **Suggestions** if impact is narrow). Suggest updating the same PR or an explicit follow-up issue. +3. If the PR **does not** edit docs but the change **contradicts** what `architecture/` or Fern docs currently assert, flag it (**Warnings** if contributors rely on that text; **Suggestions** if impact is narrow). Suggest updating the same PR or an explicit follow-up issue. 4. **Skip** this check for pure refactors with no observable behavior change, typo-only PRs, or changes already limited to documentation. -The local **`search-docs`** skill can help locate `docs/` pages by topic when the right file is not obvious. +The local **`search-docs`** skill can help locate Fern docs pages by topic when the right file is not obvious. 
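As a rough illustration of what locating Fern docs pages by topic can involve, the sketch below walks `fern/versions/latest/pages/` for `.md` and `.mdx` files and does a case-insensitive keyword match. It is not the `search-docs` skill or the `docs-searcher` agent itself; the helper name `find_pages` and its return shape are hypothetical.

```python
from pathlib import Path

# Pages root named in this changeset; adjust if the repository layout differs.
PAGES_ROOT = Path("fern/versions/latest/pages")


def find_pages(topic: str, root: Path = PAGES_ROOT) -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for each case-insensitive match of topic.

    Hypothetical helper for illustration only; not part of DataDesigner.
    """
    needle = topic.lower()
    hits: list[tuple[str, int, str]] = []
    if not root.is_dir():
        # Nothing to search; the caller should report the missing folder.
        return hits
    for path in sorted(root.rglob("*")):
        # Only Markdown and MDX pages are considered documentation here.
        if not path.is_file() or path.suffix not in {".md", ".mdx"}:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((path.as_posix(), lineno, line.strip()))
    return hits
```

Calling `find_pages("validator")` from the repository root would list every matching line, which is usually enough to decide which page to open.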
### Pass 3: Standards, Testing & Polish diff --git a/.agents/skills/search-docs/SKILL.md b/.agents/skills/search-docs/SKILL.md index 00989683b..16ff681ec 100644 --- a/.agents/skills/search-docs/SKILL.md +++ b/.agents/skills/search-docs/SKILL.md @@ -1,6 +1,6 @@ --- name: search-docs -description: Search local documentation in the docs/ folder for content related to a topic +description: Search local Fern documentation for content related to a topic argument-hint: metadata: internal: true @@ -8,7 +8,7 @@ metadata: # Documentation Search -Use the `docs-searcher` subagent to search local documentation for content related to: **$ARGUMENTS** +Use the `docs-searcher` subagent to search local Fern documentation for content related to: **$ARGUMENTS** Call the Task tool with: - `subagent_type: "docs-searcher"` diff --git a/.github/scripts/patch-devnotes-nav.py b/.github/scripts/patch-devnotes-nav.py deleted file mode 100644 index f35e69f69..000000000 --- a/.github/scripts/patch-devnotes-nav.py +++ /dev/null @@ -1,57 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -"""Patch the Dev Notes nav block in mkdocs.yml. - -Used by publish-devnotes.yml to splice HEAD's Dev Notes nav entries into an -older source checkout without touching the rest of the file. 
- -Usage: python patch-devnotes-nav.py -""" - -from __future__ import annotations - -import re -import sys - - -def extract_devnotes_block(text: str) -> tuple[int, int, list[str]]: - """Return (start, end, lines) for the ' - Dev Notes:' nav block.""" - lines = text.splitlines(keepends=True) - start = None - for i, line in enumerate(lines): - if re.match(r"^ - Dev Notes:", line): - start = i - break - if start is None: - raise SystemExit("Dev Notes nav section not found") - end = start + 1 - while end < len(lines): - # Stop at next top-level nav entry (2-space indent) or non-nav section - if lines[end].strip() and not lines[end].startswith(" ") and not lines[end].startswith(" #"): - break - end += 1 - return start, end, lines - - -def main() -> None: - if len(sys.argv) != 3: - raise SystemExit(f"Usage: {sys.argv[0]} ") - - head_path, target_path = sys.argv[1], sys.argv[2] - - with open(head_path) as f: - head_start, head_end, head_lines = extract_devnotes_block(f.read()) - head_block = head_lines[head_start:head_end] - - with open(target_path) as f: - old_start, old_end, old_lines = extract_devnotes_block(f.read()) - new_lines = old_lines[:old_start] + head_block + old_lines[old_end:] - - with open(target_path, "w") as f: - f.writelines(new_lines) - print(f"Patched Dev Notes nav: replaced lines {old_start + 1}-{old_end} with {len(head_block)} lines from HEAD") - - -if __name__ == "__main__": - main() diff --git a/.github/workflows/build-docs.yml b/.github/workflows/build-docs.yml deleted file mode 100644 index 0dc7ee4ce..000000000 --- a/.github/workflows/build-docs.yml +++ /dev/null @@ -1,80 +0,0 @@ -name: Build docs -on: - workflow_dispatch: - inputs: - use_cache: - description: "Use cached notebooks for unchanged sources" - type: boolean - default: true - release: - types: - - published - -permissions: {} - -jobs: - build-notebooks: - uses: ./.github/workflows/build-notebooks.yml - permissions: - actions: read - contents: write - with: - use_cache: ${{ 
github.event_name == 'workflow_dispatch' && inputs.use_cache || false }} - secrets: inherit - deploy: - needs: build-notebooks - runs-on: ubuntu-latest - permissions: - contents: write - steps: - - name: Checkout repository - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6 - - name: Install uv - uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b # v8.1.0 - with: - version: "0.9.5" - - name: Set up Python - run: uv python install 3.11 - - name: Install dependencies for docs - run: uv sync --all-packages --group docs - - name: Download artifact from previous step - uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c # v8.0.1 - with: - name: notebooks - path: docs/notebooks - - name: Find the latest existing release tag - id: get_release - run: | - if [ "${{ github.event_name }}" == "release" ]; then - LATEST_TAG="${{ github.event.release.tag_name }}" - else - echo "::notice::Running manually via workflow_dispatch. Fetching latest release tag..." - - gh auth status || echo "GitHub CLI is not authenticated, relying on GITHUB_TOKEN." - - # We use tr -d '\n' to remove the trailing newline for a clean tag string - LATEST_TAG=$(gh release view --json tagName -q .tagName 2>/dev/null) - - if [ -z "$LATEST_TAG" ]; then - echo "::error::Could not find the latest published release tag. Ensure a release exists." 
- exit 1 - fi - fi - - echo "Latest release tag found: $LATEST_TAG" - echo "LATEST_TAG=$LATEST_TAG" >> $GITHUB_ENV - env: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - - name: Extract version from release tag - run: | - # Remove the 'v' prefix and any suffix after a space - VERSION=$(echo ${{ env.LATEST_TAG }} | sed 's/^v//' | sed 's/ .*$//') - echo "::notice::Extracted version: $VERSION" - echo "VERSION=$VERSION" >> $GITHUB_ENV - - name: Setup doc deploy - run: | - git fetch origin gh-pages --depth=1 - git config --global user.name "github-actions[bot]" - git config --global user.email "41898282+github-actions[bot]@users.noreply.github.com" - - name: Build and deploy docs - run: uv run mike deploy --push --update-aliases ${{ env.VERSION }} latest diff --git a/.github/workflows/docs-preview.yml b/.github/workflows/docs-preview.yml index b9e40ef0c..a4a793e41 100644 --- a/.github/workflows/docs-preview.yml +++ b/.github/workflows/docs-preview.yml @@ -1,18 +1,24 @@ -name: Docs preview +name: Fern docs preview on: pull_request: types: [opened, synchronize, reopened] paths: - - "docs/**" - "fern/**" - - "mkdocs.yml" + - "docs/notebook_source/**" + - "docs/colab_notebooks/**" + - "packages/data-designer-config/src/data_designer/config/**" + - "packages/data-designer/src/data_designer/interface/**" + - "packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py" + - "packages/data-designer-engine/src/data_designer/engine/mcp/**" + - "packages/data-designer-engine/src/data_designer/engine/processing/processors/**" + - "packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py" - ".github/workflows/docs-preview.yml" permissions: {} concurrency: - group: docs-preview-${{ github.event.pull_request.number }} + group: fern-docs-preview-${{ github.event.pull_request.number }} cancel-in-progress: true env: @@ -41,20 +47,6 @@ jobs: - name: Install docs dependencies run: uv sync --all-packages --group docs --group notebooks - - 
name: Create preview notebooks - run: | - mkdir -p docs/notebooks - cp docs/notebook_source/_README.md docs/notebooks/README.md - cp docs/notebook_source/_pyproject.toml docs/notebooks/pyproject.toml - - for src in docs/notebook_source/*.py; do - name=$(basename "$src" .py) - .venv/bin/jupytext --to ipynb --output "docs/notebooks/${name}.ipynb" "$src" - done - - - name: Build docs - run: .venv/bin/mkdocs build - - name: Verify published Fern archive exists run: | if ! git ls-remote --exit-code --heads origin "$FERN_PUBLISHED_BRANCH" >/dev/null 2>&1; then @@ -123,16 +115,6 @@ jobs: echo "url=https://nvidia-preview-pr-$PR_NUMBER.docs.buildwithfern.com/nemo/datadesigner" >> "$GITHUB_OUTPUT" - - name: Deploy to Cloudflare Pages - if: github.event.pull_request.head.repo.full_name == github.repository - id: deploy - continue-on-error: true - uses: cloudflare/wrangler-action@ebbaa1584979971c8614a24965b4405ff95890e0 # v4.0.0 - with: - apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }} - accountId: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }} - command: pages deploy site/ --project-name=dd-docs-preview --branch=pr-${{ github.event.pull_request.number }} - - name: Find existing comment if: github.event.pull_request.head.repo.full_name == github.repository uses: peter-evans/find-comment@b30e6a3c0ed37e7c023ccd3f1db5c6c0b0c23aad # v4 @@ -151,8 +133,6 @@ jobs: edit-mode: replace body: | - **MkDocs preview:** ${{ steps.deploy.outcome == 'success' && steps.deploy.outputs.deployment-url || 'MkDocs preview failed; see workflow logs.' }} - **Fern preview:** ${{ steps.fern-preview.outcome == 'success' && steps.fern-preview.outputs.url || 'Fern preview failed or skipped; see workflow logs.' }} > Fern previews include the `docs-website` version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews. 
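To reason about which pull requests the updated `paths` filter in `docs-preview.yml` would pick up, a small local approximation can help. The sketch below checks changed file paths against the patterns listed in the new trigger using Python's `fnmatch`. GitHub applies its own glob rules to path filters, so treat this as an approximation rather than the actual matching logic; `would_trigger_preview` is a hypothetical helper, not part of the repository.

```python
from fnmatch import fnmatchcase

# Patterns copied from the updated `on.pull_request.paths` list in this diff.
PREVIEW_PATHS = [
    "fern/**",
    "docs/notebook_source/**",
    "docs/colab_notebooks/**",
    "packages/data-designer-config/src/data_designer/config/**",
    "packages/data-designer/src/data_designer/interface/**",
    "packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py",
    "packages/data-designer-engine/src/data_designer/engine/mcp/**",
    "packages/data-designer-engine/src/data_designer/engine/processing/processors/**",
    "packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py",
    ".github/workflows/docs-preview.yml",
]


def would_trigger_preview(changed_files: list[str]) -> bool:
    """Approximate the workflow's path filter: True if any file matches any pattern.

    Assumption: fnmatch-style matching is close enough for these simple patterns.
    """
    return any(
        fnmatchcase(changed, pattern)
        for changed in changed_files
        for pattern in PREVIEW_PATHS
    )
```

Under this approximation, a PR that touches only `docs/README.md` would not start the preview job, because the `docs/**` pattern was removed from the trigger.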
diff --git a/.github/workflows/publish-devnotes.yml b/.github/workflows/publish-devnotes.yml deleted file mode 100644 index 85685a5be..000000000 --- a/.github/workflows/publish-devnotes.yml +++ /dev/null @@ -1,74 +0,0 @@ -name: Publish devnotes - -on: - push: - branches: [main] - paths: - - "docs/devnotes/**" - workflow_dispatch: - -permissions: {} - -jobs: - deploy: - runs-on: ubuntu-latest - permissions: - actions: read - contents: write - steps: - - name: Checkout repository - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6 - with: - fetch-depth: 0 - - name: Get last deployed docs state - run: | - git fetch origin gh-pages --depth=1 - DEPLOY_MSG=$(git log FETCH_HEAD -1 --format="%s") - SOURCE_SHA=$(echo "$DEPLOY_MSG" | sed -n 's/^Deployed \([0-9a-f]*\) to .*/\1/p') - VERSION=$(echo "$DEPLOY_MSG" | sed -n 's/^Deployed [0-9a-f]* to \([^ ]*\) .*/\1/p') - - if [ -z "$SOURCE_SHA" ] || [ -z "$VERSION" ]; then - echo "::error::Could not parse deploy info from gh-pages. Expected: 'Deployed to ...'" - exit 1 - fi - - echo "::notice::Last deploy: commit $SOURCE_SHA for version $VERSION" - echo "SOURCE_SHA=$SOURCE_SHA" >> $GITHUB_ENV - echo "VERSION=$VERSION" >> $GITHUB_ENV - - name: Checkout docs source and overlay devnotes - run: | - git checkout ${{ env.SOURCE_SHA }} - git checkout ${{ github.sha }} -- docs/devnotes/ .github/scripts/patch-devnotes-nav.py - - # Patch the "Dev Notes" nav section from HEAD's mkdocs.yml into the - # old source's mkdocs.yml. This keeps nav entries for new devnotes - # without pulling in entries for non-devnotes pages that may not - # exist in the old source checkout. 
- git show ${{ github.sha }}:mkdocs.yml > /tmp/mkdocs-head.yml - python3 .github/scripts/patch-devnotes-nav.py /tmp/mkdocs-head.yml mkdocs.yml - - name: Install uv - uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b # v8.1.0 - with: - version: "0.9.5" - - name: Set up Python - run: uv python install 3.11 - - name: Install dependencies for docs - run: uv sync --all-packages --group docs - - name: Download notebooks from last docs build - env: - GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} - run: | - mkdir -p docs/notebooks - LAST_RUN_ID=$(gh run list --workflow build-docs.yml --status success --limit 1 --json databaseId -q '.[0].databaseId') - if [ -z "$LAST_RUN_ID" ]; then - echo "::error::No successful build-docs run found. Cannot build without notebooks." - exit 1 - fi - gh run download "$LAST_RUN_ID" --name notebooks --dir docs/notebooks - echo "::notice::Downloaded notebooks from build-docs run $LAST_RUN_ID" - - name: Setup doc deploy - run: | - git config --global user.name "github-actions[bot]" - git config --global user.email "41898282+github-actions[bot]@users.noreply.github.com" - - name: Rebuild latest docs - run: uv run mike deploy --push --update-aliases ${{ env.VERSION }} latest diff --git a/.github/workflows/publish-fern-devnotes.yml b/.github/workflows/publish-fern-devnotes.yml index d54dd86bc..89a08d711 100644 --- a/.github/workflows/publish-fern-devnotes.yml +++ b/.github/workflows/publish-fern-devnotes.yml @@ -108,11 +108,11 @@ jobs: fi } - for workflow in build-fern-docs.yml build-docs.yml; do + for workflow in build-fern-docs.yml; do download_notebooks "$workflow" --event release done - for workflow in build-fern-docs.yml build-docs.yml; do + for workflow in build-fern-docs.yml; do download_notebooks "$workflow" --event workflow_dispatch --branch main done diff --git a/.gitignore b/.gitignore index e7f8718b7..6fa1c3129 100644 --- a/.gitignore +++ b/.gitignore @@ -38,9 +38,6 @@ htmlcov/ coverage.xml .pytest_cache/ -# mkdocs documentation 
-/site - # Jupyter Notebook .ipynb_checkpoints diff --git a/Makefile b/Makefile index 962662014..39bbd850a 100644 --- a/Makefile +++ b/Makefile @@ -89,7 +89,6 @@ help: @echo " check-fern-docs - Generate local Fern artifacts and run fern check" @echo " check-fern-docs-locally - Install deps, generate Fern artifacts, and run fern check" @echo " serve-fern-docs-locally - Generate local Fern artifacts and serve Fern docs" - @echo " serve-docs-locally - Serve legacy MkDocs documentation locally" @echo " check-license-headers - Check if all files have license headers" @echo " update-license-headers - Add license headers to all files" @echo "" @@ -473,7 +472,6 @@ update-license-headers: DOCS_PYTHON_VERSION ?= 3.13 DOCS_PYTHON ?= .venv/bin/python DOCS_JUPYTEXT ?= .venv/bin/jupytext -DOCS_MKDOCS ?= .venv/bin/mkdocs DOCS_PY2FERN ?= .venv/bin/py2fern FERN_API_REFERENCE_OUTPUT ?= fern/code-reference FERN_API_REFERENCE_CONFIG_OUTPUT ?= $(FERN_API_REFERENCE_OUTPUT)/data-designer @@ -496,11 +494,6 @@ install-docs-deps: @echo "📦 Installing docs dependencies (Python $(DOCS_PYTHON_VERSION))..." uv sync --python $(DOCS_PYTHON_VERSION) --all-packages --group docs --group notebooks -serve-docs-locally: - @$(MAKE) install-docs-deps - @echo "📝 Building and serving docs (Python $(DOCS_PYTHON_VERSION))..." - $(DOCS_MKDOCS) serve --livereload - generate-fern-api-reference: @echo "📚 Generating Fern API reference with py2fern ($(DOCS_PY2FERN))..." 
@rm -rf $(FERN_API_REFERENCE_OUTPUT) @@ -762,7 +755,7 @@ clean-test-coverage: generate-colab-notebooks generate-fern-api-reference generate-fern-api-reference-native generate-fern-notebooks generate-fern-notebooks-with-outputs help \ install install-dev install-dev-notebooks install-dev-recipes install-docs-deps \ lint lint-config lint-engine lint-fix lint-fix-config lint-fix-engine lint-fix-interface lint-interface \ - perf-import perf-import-runtime prepare-fern-docs prepare-fern-release publish serve-docs-locally serve-fern-docs-locally show-versions \ + perf-import perf-import-runtime prepare-fern-docs prepare-fern-release publish serve-fern-docs-locally show-versions \ health-checks \ test test-config test-config-isolated test-e2e test-engine test-engine-isolated \ test-interface test-interface-isolated test-isolated \ diff --git a/README.md b/README.md index 6251848c9..bc0c728c0 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ [![CI](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml/badge.svg)](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) -[![Python 3.10 - 3.13](https://img.shields.io/badge/🐍_Python-3.10_|_3.11_|_3.12_|_3.13-blue.svg)](https://www.python.org/downloads/) [![NeMo Microservices](https://img.shields.io/badge/NeMo-Microservices-76b900)](https://docs.nvidia.com/nemo/microservices/latest/index.html) [![Code](https://img.shields.io/badge/Code-Documentation-8A2BE2.svg)](https://nvidia-nemo.github.io/DataDesigner/) ![Tokens](https://img.shields.io/badge/400+_Billion-Tokens_Generated-76b900.svg?logo=nvidia&logoColor=white) +[![Python 3.10 - 3.13](https://img.shields.io/badge/🐍_Python-3.10_|_3.11_|_3.12_|_3.13-blue.svg)](https://www.python.org/downloads/) [![NeMo 
Microservices](https://img.shields.io/badge/NeMo-Microservices-76b900)](https://docs.nvidia.com/nemo/microservices/latest/index.html) [![Code](https://img.shields.io/badge/Code-Documentation-8A2BE2.svg)](https://docs.nvidia.com/nemo/datadesigner/) ![Tokens](https://img.shields.io/badge/400+_Billion-Tokens_Generated-76b900.svg?logo=nvidia&logoColor=white) **Generate high-quality synthetic datasets from scratch or using your own seed data.** @@ -24,7 +24,7 @@ Data Designer helps you create synthetic datasets that go beyond simple LLM prom ### 📣 Heads-up: async engine is now the default -Data Designer now runs pipelines on a cell-level async engine that overlaps independent columns and adapts concurrency per (provider, model). On most pipelines this is faster with no config changes; on slow self-hosted endpoints, set `inference_parameters.timeout` to your real per-request latency. See [Architecture & Performance → Async Engine](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/architecture-and-performance/#async-engine) for the behaviors worth knowing about. +Data Designer now runs pipelines on a cell-level async engine that overlaps independent columns and adapts concurrency per (provider, model). On most pipelines this is faster with no config changes; on slow self-hosted endpoints, set `inference_parameters.timeout` to your real per-request latency. See [Architecture & Performance → Async Engine](https://docs.nvidia.com/nemo/datadesigner/concepts/architecture-and-performance#async-engine) for the behaviors worth knowing about. If you hit anything unexpected, fall back to the legacy sync engine for one transitional release with `DATA_DESIGNER_ASYNC_ENGINE=0`, and please [open an issue](https://github.com/NVIDIA-NeMo/DataDesigner/issues/new) so we can fix the async path. 
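A minimal sketch of the fallback, assuming the variable is read when Data Designer is imported or when a pipeline starts, so it is set before any pipeline code runs. Only the variable name and value come from the text above; the rest is illustrative.

```python
import os

# Opt back into the legacy sync engine for this process.
# Assumption: the variable is read at import or run time, so set it early.
os.environ["DATA_DESIGNER_ASYNC_ENGINE"] = "0"

# Import data_designer and build or run the pipeline as usual after this point.
```

Setting the variable in the shell that launches the script has the same effect and avoids touching the code.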
@@ -103,18 +103,18 @@ preview.display_sample_record() ### 📚 Learn more -- **[Getting Started](https://nvidia-nemo.github.io/DataDesigner/latest/)** – Install, configure, and generate your first dataset -- **[Tutorial Notebooks](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/)** – Step-by-step interactive tutorials -- **[Column Types](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/columns/)** – Explore samplers, LLM columns, validators, and more -- **[Validators](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/validators/)** – Learn how to validate generated data with Python, SQL, and remote validators -- **[Model Configuration](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/model-configs/)** – Configure custom models and providers -- **[Person Sampling](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/person_sampling/)** – Learn how to sample realistic person data with demographic attributes +- **[Getting Started](https://docs.nvidia.com/nemo/datadesigner/getting-started/welcome)** – Install, configure, and generate your first dataset +- **[Tutorial Notebooks](https://docs.nvidia.com/nemo/datadesigner/tutorials/overview)** – Step-by-step interactive tutorials +- **[Column Types](https://docs.nvidia.com/nemo/datadesigner/concepts/columns)** – Explore samplers, LLM columns, validators, and more +- **[Validators](https://docs.nvidia.com/nemo/datadesigner/concepts/validators)** – Learn how to validate generated data with Python, SQL, and remote validators +- **[Model Configuration](https://docs.nvidia.com/nemo/datadesigner/concepts/models/model-configs)** – Configure custom models and providers +- **[Person Sampling](https://docs.nvidia.com/nemo/datadesigner/concepts/person-sampling)** – Learn how to sample realistic person data with demographic attributes -### 📝 Documentation transition +### 📝 Documentation -Data Designer is gradually moving documentation from MkDocs to Fern. 
During the transition, maintainers publish both docs builds for a few releases so the Fern site can mature without losing the existing MkDocs release archive. +Data Designer documentation now lives on Fern at [docs.nvidia.com/nemo/datadesigner](https://docs.nvidia.com/nemo/datadesigner/). -Contributors should keep editing the existing docs sources under `docs/`. Tutorial notebook source lives in `docs/notebook_source/*.py`; generated notebooks and Fern artifacts are not the source of truth. +Contributors should edit docs prose under `fern/`. Tutorial notebook source remains in `docs/notebook_source/*.py`; generated notebooks and Fern artifacts are not the source of truth. The legacy MkDocs archive remains available on GitHub Pages for releases `0.5.7` and older. ### 🔧 Configure models via CLI @@ -126,7 +126,7 @@ data-designer config list # View current settings ### 🤖 Agent Skill -Data Designer has a [skill](https://nvidia-nemo.github.io/DataDesigner/latest/devnotes/data-designer-got-skills/) for coding agents. Just describe the dataset you want, and your agent handles schema design, validation, and generation. While the skill should work with other coding agents that support skills, our development and testing has focused on [Claude Code](https://code.claude.com) at this stage. +Data Designer has a [skill](https://docs.nvidia.com/nemo/datadesigner/dev-notes/data-designer-got-skills) for coding agents. Just describe the dataset you want, and your agent handles schema design, validation, and generation. While the skill should work with other coding agents that support skills, our development and testing has focused on [Claude Code](https://code.claude.com) at this stage. **Install via [skills.sh](https://skills.sh)** (be sure to select Claude Code as an additional agent): @@ -155,7 +155,7 @@ Disable with `NEMO_TELEMETRY_ENABLED=false`. 
**[More details →](#telemetry-and Aggregate model usage across synthetic data generation jobs, year-to-date 1/1/2026–5/1/2026: -![Top models used for synthetic data generation](docs/images/top-models.png) +![Top models used for synthetic data generation](fern/images/top-models.png) _Last updated on May 1, 2026_ diff --git a/VERSIONING.md b/VERSIONING.md index 32ac4a3d9..d3a46e85a 100644 --- a/VERSIONING.md +++ b/VERSIONING.md @@ -42,10 +42,12 @@ git tag v0.1.0 git push origin v0.1.0 ``` -Fern release publishing snapshots versioned docs automatically into the CI-managed `docs-website` branch, similar to how MkDocs publishes built output to `gh-pages`. Release owners do not need a dedicated pre-release docs PR. +Fern release publishing snapshots versioned docs automatically into the CI-managed `docs-website` branch. Release owners do not need a dedicated pre-release docs PR. The `docs-website` branch must already contain the historical Fern archive (`v0.6.0`, `v0.5.9`, `v0.5.8`, and `older`). The release workflow fails if those redirect targets are missing. +The legacy `gh-pages` branch is frozen for the MkDocs archive. Do not publish new docs builds there; Fern links readers to GitHub Pages only for releases `0.5.7` and older. + For the already-published `v0.6.0` release, rerun **Build Fern docs** manually with `release_tag=v0.6.0` and `source_ref=main` after the Fern fix PR merges. Future GitHub release events default `source_ref` to the release tag. ## Accessing Version in Code diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 000000000..5f09188bf --- /dev/null +++ b/docs/README.md @@ -0,0 +1,10 @@ +# Docs Support Files + +User-facing docs prose lives under `fern/`. 
+ +This directory only keeps support files still consumed by the Fern docs build: + +- `notebook_source/` - tutorial notebook source +- `colab_notebooks/` - committed Colab notebooks linked from Fern tutorial pages +- `scripts/` - notebook generation helpers +- `assets/recipes/` - downloadable recipe scripts linked from Fern pages diff --git a/docs/assets/palette-favicon.png b/docs/assets/palette-favicon.png deleted file mode 100644 index 11c795d3e..000000000 Binary files a/docs/assets/palette-favicon.png and /dev/null differ diff --git a/docs/code_reference/config/analysis.md b/docs/code_reference/config/analysis.md deleted file mode 100644 index fa59221a0..000000000 --- a/docs/code_reference/config/analysis.md +++ /dev/null @@ -1,31 +0,0 @@ -# Analysis - -Profiling result objects and report helpers returned after generation. - -## Column Statistics - -`DataDesigner.create()` and `DataDesigner.preview()` run the dataset profiler after generation. The profiler computes statistics for each configured column; side-effect columns are recorded separately in `DatasetProfilerResults.side_effect_column_names`. - -Statistics result classes store computed metrics for each column type and format those metrics for reports. - -::: data_designer.config.analysis.column_statistics - -## Column Profilers - -Column profilers are optional analysis tools that provide deeper insights into specific column types. Currently, the only column profiler available is the Judge Score Profiler. - -Profiler result classes store computed profiler output and format it for reports. - -::: data_designer.config.analysis.column_profilers - -## Dataset Profiler - -The [DatasetProfilerResults](#data_designer.config.analysis.dataset_profiler.DatasetProfilerResults) class stores profiling results for a generated dataset. 
It aggregates column-level statistics, side-effect column names, and optional profiler results, and provides methods to: - -- Compute dataset-level metrics (completion percentage, column type summary) -- Filter statistics by column type -- Generate formatted analysis reports via the `to_report()` method - -Reports can be displayed in the console or exported to HTML/SVG formats. - -::: data_designer.config.analysis.dataset_profiler diff --git a/docs/code_reference/config/column_configs.md b/docs/code_reference/config/column_configs.md deleted file mode 100644 index 4ff2e8f2f..000000000 --- a/docs/code_reference/config/column_configs.md +++ /dev/null @@ -1,18 +0,0 @@ -# Column Configurations - -Column configs declare Data Designer's built-in column types. Each configuration inherits from [SingleColumnConfig](#data_designer.config.base.SingleColumnConfig), which provides shared arguments like the column `name`, whether to `drop` the column after generation, and the `column_type`. - -For column generator implementation classes, see [column_generators](../engine/column_generators.md). - -!!! info "`column_type` is a discriminator field" - The `column_type` argument is used to identify column types when deserializing the [Data Designer Config](data_designer_config.md) from JSON/YAML. It acts as the discriminator in a [discriminated union](https://docs.pydantic.dev/latest/concepts/unions/#discriminated-unions), allowing Pydantic to automatically determine which column configuration class to instantiate. 
- -## `SingleColumnConfig` {#data_designer.config.base.SingleColumnConfig} - -::: data_designer.config.base.SingleColumnConfig - options: - show_root_toc_entry: false - -## Column configurations - -::: data_designer.config.column_configs diff --git a/docs/code_reference/config/config_builder.md b/docs/code_reference/config/config_builder.md deleted file mode 100644 index 1aad978ae..000000000 --- a/docs/code_reference/config/config_builder.md +++ /dev/null @@ -1,10 +0,0 @@ -# Data Designer's Config Builder - -Use [DataDesignerConfigBuilder](#data_designer.config.config_builder.DataDesignerConfigBuilder) to construct [DataDesignerConfig](data_designer_config.md#data_designer.config.data_designer_config.DataDesignerConfig) objects. The builder accumulates model configs, tool configs, column configs, constraints, seed settings, processors, and profilers. - -Inputs can come from scratch, a `dict`, [BuilderConfig](#data_designer.config.config_builder.BuilderConfig), a local YAML/JSON file, or an HTTP(S) YAML/JSON URL via [`from_config()`](#data_designer.config.config_builder.DataDesignerConfigBuilder.from_config). Use [`build()`](#data_designer.config.config_builder.DataDesignerConfigBuilder.build) to create a [DataDesignerConfig](data_designer_config.md#data_designer.config.data_designer_config.DataDesignerConfig), or [`write_config()`](#data_designer.config.config_builder.DataDesignerConfigBuilder.write_config) to serialize the current builder config to YAML or JSON. - -!!! info "Model config loading" - [DataDesignerConfigBuilder](#data_designer.config.config_builder.DataDesignerConfigBuilder) accepts model configs as a list of [ModelConfig](models.md#data_designer.config.models.ModelConfig) objects, a YAML/JSON config path, or `None`. When `model_configs=None`, the builder loads default model configs if Data Designer can run locally; otherwise initialization raises BuilderConfigurationError. 
Model configs define the aliases referenced by model-backed columns such as [`LLMTextColumnConfig`](column_configs.md#data_designer.config.column_configs.LLMTextColumnConfig), [`LLMCodeColumnConfig`](column_configs.md#data_designer.config.column_configs.LLMCodeColumnConfig), [`LLMStructuredColumnConfig`](column_configs.md#data_designer.config.column_configs.LLMStructuredColumnConfig), [`LLMJudgeColumnConfig`](column_configs.md#data_designer.config.column_configs.LLMJudgeColumnConfig), [`EmbeddingColumnConfig`](column_configs.md#data_designer.config.column_configs.EmbeddingColumnConfig), and [`ImageColumnConfig`](column_configs.md#data_designer.config.column_configs.ImageColumnConfig). - -::: data_designer.config.config_builder diff --git a/docs/code_reference/config/data_designer_config.md b/docs/code_reference/config/data_designer_config.md deleted file mode 100644 index d6329a9fa..000000000 --- a/docs/code_reference/config/data_designer_config.md +++ /dev/null @@ -1,7 +0,0 @@ -# Data Designer Configuration - -[DataDesignerConfig](#data_designer.config.data_designer_config.DataDesignerConfig) is the top-level configuration object passed to Data Designer. It declares the columns to generate and may include model configs, tool configs, seed settings, sampler constraints, processors, and profiler configs. - -Prefer [DataDesignerConfigBuilder](config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder) for programmatic construction. Direct [DataDesignerConfig](#data_designer.config.data_designer_config.DataDesignerConfig) instantiation is also supported. - -::: data_designer.config.data_designer_config diff --git a/docs/code_reference/config/index.md b/docs/code_reference/config/index.md deleted file mode 100644 index 1ec8b4de0..000000000 --- a/docs/code_reference/config/index.md +++ /dev/null @@ -1,7 +0,0 @@ -# Config Package - -The `data-designer-config` package provides `data_designer.config`, the configuration layer of Data Designer. 
It contains the objects used to describe dataset structure, model access, tool access, seed data, sampler parameters, validators, processors, run settings, plugin registrations, and analysis results. - -This package is the base of the dependency chain. Engine and interface code consume these config objects, but config objects do not execute generation directly. - -For programmatic configuration work, start with [config_builder](config_builder.md) and [data_designer_config](data_designer_config.md). Use the narrower pages for exact constructor fields for columns, models, MCP tools, seeds, processors, samplers, validators, or profiling results. diff --git a/docs/code_reference/config/mcp.md b/docs/code_reference/config/mcp.md deleted file mode 100644 index 49b6f5cfb..000000000 --- a/docs/code_reference/config/mcp.md +++ /dev/null @@ -1,16 +0,0 @@ -# MCP Configuration - -MCP config objects tell Data Designer which Model Context Protocol providers exist and which tools an LLM column may use. - -[MCPProvider](#data_designer.config.mcp.MCPProvider) configures remote MCP servers via SSE or Streamable HTTP transport. [LocalStdioMCPProvider](#data_designer.config.mcp.LocalStdioMCPProvider) configures local MCP servers as subprocesses via stdio transport. [ToolConfig](#data_designer.config.mcp.ToolConfig) sets which tools are available for LLM columns and how they are constrained. - -For MCP execution internals, see [Engine MCP](../engine/mcp.md). 
Related guides: - -- **[MCP Providers](../../concepts/mcp/mcp-providers.md)** - Configure local or remote MCP providers -- **[Tool Configs](../../concepts/mcp/tool-configs.md)** - Define tool permissions and limits -- **[Enabling Tools](../../concepts/mcp/enabling-tools.md)** - Use tools in LLM columns -- **[Traces](../../concepts/traces.md)** - Capture full conversation history - -## API Reference - -::: data_designer.config.mcp diff --git a/docs/code_reference/config/models.md b/docs/code_reference/config/models.md deleted file mode 100644 index e14e8cfdb..000000000 --- a/docs/code_reference/config/models.md +++ /dev/null @@ -1,12 +0,0 @@ -# Models - -[ModelProvider](#data_designer.config.models.ModelProvider) stores connection and authentication details for model providers. [ModelConfig](#data_designer.config.models.ModelConfig) stores a model alias, model identifier, provider settings, and inference parameters. [Inference Parameters](../../concepts/models/inference-parameters.md) control model behavior. Chat-completion parameters include `temperature`, `top_p`, and `max_tokens`; `temperature` and `top_p` can be fixed values or configured distributions. [ImageContext](#data_designer.config.models.ImageContext) provides image inputs to multimodal models, and [ImageInferenceParams](#data_designer.config.models.ImageInferenceParams) configures image generation models. 
- -Related guides: - -- **[Model Providers](../../concepts/models/model-providers.md)** -- **[Model Configs](../../concepts/models/model-configs.md)** -- **[Image Context](../../notebooks/4-providing-images-as-context.ipynb)** -- **[Generating Images](../../notebooks/5-generating-images.ipynb)** - -::: data_designer.config.models diff --git a/docs/code_reference/config/plugins.md b/docs/code_reference/config/plugins.md deleted file mode 100644 index 93f4533de..000000000 --- a/docs/code_reference/config/plugins.md +++ /dev/null @@ -1,17 +0,0 @@ -# Plugins - -Plugin packages register [Plugin](#data_designer.plugins.plugin.Plugin) objects through entry points in the `data_designer.plugins` group. A plugin registration ties a config class to its implementation class and declares its [PluginType](#data_designer.plugins.plugin.PluginType). - -Related pages: [Build Your Own](../../plugins/build_your_own.md), [Column Generators](../engine/column_generators.md), [Seed Readers](../engine/seed_readers.md), [Engine Processors](../engine/processors.md), and [Processor Configurations](processors.md). - -## `Plugin` {#data_designer.plugins.plugin.Plugin} - -::: data_designer.plugins.plugin.Plugin - options: - show_root_toc_entry: false - -## `PluginType` {#data_designer.plugins.plugin.PluginType} - -::: data_designer.plugins.plugin.PluginType - options: - show_root_toc_entry: false diff --git a/docs/code_reference/config/processors.md b/docs/code_reference/config/processors.md deleted file mode 100644 index a1795643b..000000000 --- a/docs/code_reference/config/processors.md +++ /dev/null @@ -1,7 +0,0 @@ -# Processor Configurations - -Processor configs request data transformations after generation. Add them to a `DataDesignerConfig` or `DataDesignerConfigBuilder`; the engine later compiles them into runtime processor implementations. - -Related pages: [engine processors](../engine/processors.md) and [Build Your Own](../../plugins/build_your_own.md). 
- -::: data_designer.config.processors diff --git a/docs/code_reference/config/run_config.md b/docs/code_reference/config/run_config.md deleted file mode 100644 index f39dbb7f3..000000000 --- a/docs/code_reference/config/run_config.md +++ /dev/null @@ -1,29 +0,0 @@ -# Run Config - -`RunConfig` controls dataset generation behavior, including early shutdown thresholds, -batch sizing, non-inference worker concurrency, and the Jinja rendering engine used by -the runtime. - -`JinjaRenderingEngine.SECURE` is the default. Set `JinjaRenderingEngine.NATIVE` -when you want Jinja2's broader built-in sandbox behavior instead of Data Designer's -hardened renderer. - -For guidance on when to use each mode, see [Security](../../concepts/security.md). - -## Usage - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -data_designer = DataDesigner() -data_designer.set_run_config(dd.RunConfig( - buffer_size=500, - max_conversation_restarts=3, - jinja_rendering_engine=dd.JinjaRenderingEngine.NATIVE, -)) -``` - -## API Reference - -::: data_designer.config.run_config diff --git a/docs/code_reference/config/sampler_params.md b/docs/code_reference/config/sampler_params.md deleted file mode 100644 index 751fc604d..000000000 --- a/docs/code_reference/config/sampler_params.md +++ /dev/null @@ -1,12 +0,0 @@ -# Sampler Parameters - -Sampler parameter classes configure Data Designer's built-in samplers. Use them in [SamplerColumnConfig](column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) to specify how sampled column values are generated. - -!!! 
tip "Displaying available samplers and their parameters" - The config builder has an `info` attribute that can be used to display the - available sampler types and their parameters: - ```python - config_builder.info.display("samplers") - ``` - -::: data_designer.config.sampler_params diff --git a/docs/code_reference/config/seeds.md b/docs/code_reference/config/seeds.md deleted file mode 100644 index a3b77ac64..000000000 --- a/docs/code_reference/config/seeds.md +++ /dev/null @@ -1,19 +0,0 @@ -# Seeds - -Seed configs declare existing data used as input during generation. A [SeedConfig](#data_designer.config.seed.SeedConfig) combines a seed source with optional row sampling and selection settings. Seed source objects declare where seed data comes from; the engine reads them through seed readers. - -Use these objects with `DataDesignerConfigBuilder.with_seed_dataset()`. Related pages: [Seed Datasets](../../concepts/seed-datasets.md) and [seed readers](../engine/seed_readers.md). - -Built-in seed sources include local files, Hugging Face paths, in-memory DataFrames, directories, file contents, and agent rollout traces. Plugin seed sources can extend the same discriminated union through the plugin system. - -## Seed Config - -::: data_designer.config.seed - -## Built-In Seed Sources - -::: data_designer.config.seed_source - -## DataFrame Seed Source - -::: data_designer.config.seed_source_dataframe diff --git a/docs/code_reference/config/validator_params.md b/docs/code_reference/config/validator_params.md deleted file mode 100644 index c69773da6..000000000 --- a/docs/code_reference/config/validator_params.md +++ /dev/null @@ -1,6 +0,0 @@ -# Validator Parameters - -`ValidationColumnConfig` selects a validator with `validator_type` and configures it with `validator_params`. -The `validator_type` field can be `code`, `local_callable`, or `remote`. 
The matching `validator_params` objects are: - -::: data_designer.config.validator_params diff --git a/docs/code_reference/engine/column_generators.md b/docs/code_reference/engine/column_generators.md deleted file mode 100644 index b2aff0ce1..000000000 --- a/docs/code_reference/engine/column_generators.md +++ /dev/null @@ -1,53 +0,0 @@ -# Column Generators - -Column generators execute column generation in the Data Designer engine. A generator receives the upstream data needed for its task, returns row or batch data with generated values added, and reports the generation strategy the scheduler should use. - -Related pages: [column_configs](../config/column_configs.md), [Build Your Own](../../plugins/build_your_own.md), [Using Models in Plugins](../../plugins/models.md), and [Custom Columns](../../concepts/custom_columns.md). - -## Configuration - -User-facing column configs inherit from [SingleColumnConfig](../config/column_configs.md#data_designer.config.base.SingleColumnConfig) and define a unique `column_type` discriminator. During compilation, the engine may group related configs into multi-column configs for generators that create sampler or seed columns together. - -## Generation strategy - -Column generator base classes return [GenerationStrategy](../config/column_configs.md#data_designer.config.column_configs.GenerationStrategy) values to tell the engine whether they run per row or over a full batch. - -## Implementation bases - -Generators that operate on a full batch can inherit from [ColumnGeneratorFullColumn](#data_designer.engine.column_generators.generators.base.ColumnGeneratorFullColumn). Row-oriented non-model generators can inherit from [ColumnGeneratorCellByCell](#data_designer.engine.column_generators.generators.base.ColumnGeneratorCellByCell). Generators that create initial rows use [FromScratchColumnGenerator](#data_designer.engine.column_generators.generators.base.FromScratchColumnGenerator). 
Model-backed plugin generators should use [ColumnGeneratorWithModelRegistry](#data_designer.engine.column_generators.generators.base.ColumnGeneratorWithModelRegistry) or [ColumnGeneratorWithModel](#data_designer.engine.column_generators.generators.base.ColumnGeneratorWithModel); see [Using Models in Plugins](../../plugins/models.md) for authoring guidance. - -### `ColumnGenerator` {#data_designer.engine.column_generators.generators.base.ColumnGenerator} - -::: data_designer.engine.column_generators.generators.base.ColumnGenerator - options: - show_root_toc_entry: false - -### `ColumnGeneratorFullColumn` {#data_designer.engine.column_generators.generators.base.ColumnGeneratorFullColumn} - -::: data_designer.engine.column_generators.generators.base.ColumnGeneratorFullColumn - options: - show_root_toc_entry: false - -### `ColumnGeneratorCellByCell` {#data_designer.engine.column_generators.generators.base.ColumnGeneratorCellByCell} - -::: data_designer.engine.column_generators.generators.base.ColumnGeneratorCellByCell - options: - show_root_toc_entry: false - -### `FromScratchColumnGenerator` {#data_designer.engine.column_generators.generators.base.FromScratchColumnGenerator} - -::: data_designer.engine.column_generators.generators.base.FromScratchColumnGenerator - options: - show_root_toc_entry: false - -### `ColumnGeneratorWithModelRegistry` {#data_designer.engine.column_generators.generators.base.ColumnGeneratorWithModelRegistry} - -::: data_designer.engine.column_generators.generators.base.ColumnGeneratorWithModelRegistry - options: - show_root_toc_entry: false - -### `ColumnGeneratorWithModel` {#data_designer.engine.column_generators.generators.base.ColumnGeneratorWithModel} - -::: data_designer.engine.column_generators.generators.base.ColumnGeneratorWithModel - options: - show_root_toc_entry: false diff --git a/docs/code_reference/engine/index.md b/docs/code_reference/engine/index.md deleted file mode 100644 index 06dfa4e6d..000000000 --- 
a/docs/code_reference/engine/index.md +++ /dev/null @@ -1,5 +0,0 @@ -# Engine Package - -The `data-designer-engine` package provides `data_designer.engine`, the runtime layer of Data Designer. It consumes `data_designer.config` objects and maps them to execution behavior through generators, seed readers, processors, registries, model access, and MCP tool execution. - -This package sits between config and interface: it depends on config, and the public interface calls into it. Use these pages for plugin implementation contracts, registry behavior, seed reader internals, processor execution, column generator bases, and MCP runtime behavior. diff --git a/docs/code_reference/engine/mcp.md b/docs/code_reference/engine/mcp.md deleted file mode 100644 index a9b333b97..000000000 --- a/docs/code_reference/engine/mcp.md +++ /dev/null @@ -1,94 +0,0 @@ -# Engine MCP - -Execution-time MCP registries, facades, session handling, schema discovery, and tool calls. - -For user-facing provider and tool config objects, see [MCP configuration](../config/mcp.md). - -## Parallel Structure - -| Model layer | MCP layer | Purpose | -|-------------|-----------|---------| -| `ModelProviderRegistry` | `MCPProviderRegistry` | Holds provider configurations. | -| `ModelRegistry` | `MCPRegistry` | Manages configs by alias and lazily creates facades. | -| `ModelFacade` | `MCPFacade` | Provides a lightweight runtime facade scoped to one config. | -| `ModelConfig.alias` | `ToolConfig.tool_alias` | Alias referenced by column configs. 
| - -## Registry - -### `MCPToolDefinition` {#data_designer.engine.mcp.registry.MCPToolDefinition} - -::: data_designer.engine.mcp.registry.MCPToolDefinition - options: - show_root_toc_entry: false - -### `MCPToolResult` {#data_designer.engine.mcp.registry.MCPToolResult} - -::: data_designer.engine.mcp.registry.MCPToolResult - options: - show_root_toc_entry: false - -### `MCPRegistry` {#data_designer.engine.mcp.registry.MCPRegistry} - -::: data_designer.engine.mcp.registry.MCPRegistry - options: - show_root_toc_entry: false - -### `create_mcp_registry` {#data_designer.engine.mcp.factory.create_mcp_registry} - -::: data_designer.engine.mcp.factory.create_mcp_registry - options: - show_root_toc_entry: false - -## Facade - -`ModelFacade.generate()` accepts a `tool_alias` parameter. When it is provided, `ModelFacade` looks up the matching `MCPFacade` from `MCPRegistry`, fetches tool schemas, passes them to the model, processes tool calls after each completion, tracks tool-call turns, and returns messages that include tool results for trace capture. - -### `MCPFacade` {#data_designer.engine.mcp.facade.MCPFacade} - -::: data_designer.engine.mcp.facade.MCPFacade - options: - show_root_toc_entry: false - -## I/O Service - -The I/O service owns a background event loop, pools MCP sessions by provider config, coalesces concurrent tool schema lookups, and executes parallel tool calls. 
- -### `MCPIOService` {#data_designer.engine.mcp.io.MCPIOService} - -::: data_designer.engine.mcp.io.MCPIOService - options: - show_root_toc_entry: false - -### Runtime Helpers - -::: data_designer.engine.mcp.io.list_tools - options: - show_root_toc_entry: false - -::: data_designer.engine.mcp.io.list_tool_names - options: - show_root_toc_entry: false - -::: data_designer.engine.mcp.io.call_tools - options: - show_root_toc_entry: false - -::: data_designer.engine.mcp.io.clear_provider_caches - options: - show_root_toc_entry: false - -::: data_designer.engine.mcp.io.clear_tools_cache - options: - show_root_toc_entry: false - -::: data_designer.engine.mcp.io.get_cache_info - options: - show_root_toc_entry: false - -::: data_designer.engine.mcp.io.clear_session_pool - options: - show_root_toc_entry: false - -::: data_designer.engine.mcp.io.get_session_pool_info - options: - show_root_toc_entry: false diff --git a/docs/code_reference/engine/processors.md b/docs/code_reference/engine/processors.md deleted file mode 100644 index e11653ead..000000000 --- a/docs/code_reference/engine/processors.md +++ /dev/null @@ -1,43 +0,0 @@ -# Engine Processor Implementations - -Runtime processor classes and processor registry helpers. - -Plugin processors inherit from [Processor](#data_designer.engine.processing.processors.base.Processor) and override one or more callback methods: `process_before_batch`, `process_after_batch`, or `process_after_generation`. - -For user-facing processor config objects, see [processor configurations](../config/processors.md). 
- -## Base Contract - -### `Processor` {#data_designer.engine.processing.processors.base.Processor} - -::: data_designer.engine.processing.processors.base.Processor - options: - show_root_toc_entry: false - -## Built-In Implementations - -### `DropColumnsProcessor` {#data_designer.engine.processing.processors.drop_columns.DropColumnsProcessor} - -::: data_designer.engine.processing.processors.drop_columns.DropColumnsProcessor - options: - show_root_toc_entry: false - -### `SchemaTransformProcessor` {#data_designer.engine.processing.processors.schema_transform.SchemaTransformProcessor} - -::: data_designer.engine.processing.processors.schema_transform.SchemaTransformProcessor - options: - show_root_toc_entry: false - -## Registry - -### `ProcessorRegistry` {#data_designer.engine.processing.processors.registry.ProcessorRegistry} - -::: data_designer.engine.processing.processors.registry.ProcessorRegistry - options: - show_root_toc_entry: false - -### `create_default_processor_registry` {#data_designer.engine.processing.processors.registry.create_default_processor_registry} - -::: data_designer.engine.processing.processors.registry.create_default_processor_registry - options: - show_root_toc_entry: false diff --git a/docs/code_reference/engine/seed_readers.md b/docs/code_reference/engine/seed_readers.md deleted file mode 100644 index 5f6294a34..000000000 --- a/docs/code_reference/engine/seed_readers.md +++ /dev/null @@ -1,101 +0,0 @@ -# Seed Readers - -Seed readers are engine-side adapters that turn a configured seed source into tabular seed rows. The engine attaches a `SeedSource` and secret resolver, asks the reader for column names and dataset size, then streams batches into generation. - -Related pages: [seeds](../config/seeds.md), [Seed Datasets](../../concepts/seed-datasets.md), and [Build Your Own](../../plugins/build_your_own.md). 
- -## Core Contracts - -### `SeedReader` {#data_designer.engine.resources.seed_reader.SeedReader} - -::: data_designer.engine.resources.seed_reader.SeedReader - options: - show_root_toc_entry: false - -### `FileSystemSeedReader` {#data_designer.engine.resources.seed_reader.FileSystemSeedReader} - -::: data_designer.engine.resources.seed_reader.FileSystemSeedReader - options: - show_root_toc_entry: false - -### `SeedReaderFileSystemContext` {#data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext} - -::: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext - options: - show_root_toc_entry: false - -### `SeedReaderBatch` {#data_designer.engine.resources.seed_reader.SeedReaderBatch} - -::: data_designer.engine.resources.seed_reader.SeedReaderBatch - options: - show_root_toc_entry: false - -### `SeedReaderBatchReader` {#data_designer.engine.resources.seed_reader.SeedReaderBatchReader} - -::: data_designer.engine.resources.seed_reader.SeedReaderBatchReader - options: - show_root_toc_entry: false - -### `PandasSeedReaderBatch` {#data_designer.engine.resources.seed_reader.PandasSeedReaderBatch} - -::: data_designer.engine.resources.seed_reader.PandasSeedReaderBatch - options: - show_root_toc_entry: false - -### `create_seed_reader_output_dataframe` {#data_designer.engine.resources.seed_reader.create_seed_reader_output_dataframe} - -::: data_designer.engine.resources.seed_reader.create_seed_reader_output_dataframe - options: - show_root_toc_entry: false - -## Built-In Readers - -### `LocalFileSeedReader` {#data_designer.engine.resources.seed_reader.LocalFileSeedReader} - -::: data_designer.engine.resources.seed_reader.LocalFileSeedReader - options: - show_root_toc_entry: false - -### `HuggingFaceSeedReader` {#data_designer.engine.resources.seed_reader.HuggingFaceSeedReader} - -::: data_designer.engine.resources.seed_reader.HuggingFaceSeedReader - options: - show_root_toc_entry: false - -### `DataFrameSeedReader` 
{#data_designer.engine.resources.seed_reader.DataFrameSeedReader} - -::: data_designer.engine.resources.seed_reader.DataFrameSeedReader - options: - show_root_toc_entry: false - -### `DirectorySeedReader` {#data_designer.engine.resources.seed_reader.DirectorySeedReader} - -::: data_designer.engine.resources.seed_reader.DirectorySeedReader - options: - show_root_toc_entry: false - -### `FileContentsSeedReader` {#data_designer.engine.resources.seed_reader.FileContentsSeedReader} - -::: data_designer.engine.resources.seed_reader.FileContentsSeedReader - options: - show_root_toc_entry: false - -### `AgentRolloutSeedReader` {#data_designer.engine.resources.seed_reader.AgentRolloutSeedReader} - -::: data_designer.engine.resources.seed_reader.AgentRolloutSeedReader - options: - show_root_toc_entry: false - -## Registry and Errors - -### `SeedReaderRegistry` {#data_designer.engine.resources.seed_reader.SeedReaderRegistry} - -::: data_designer.engine.resources.seed_reader.SeedReaderRegistry - options: - show_root_toc_entry: false - -### `SeedReaderError` {#data_designer.engine.resources.seed_reader.SeedReaderError} - -::: data_designer.engine.resources.seed_reader.SeedReaderError - options: - show_root_toc_entry: false diff --git a/docs/code_reference/index.md b/docs/code_reference/index.md deleted file mode 100644 index 5263b0ffe..000000000 --- a/docs/code_reference/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Code Reference - -Data Designer is implemented as three installable packages that share the `data_designer` namespace. The packages are layered: user-facing interface code calls the engine, and the engine consumes declarative config objects. - -| Package | Namespace | Role | -|---------|-----------|------| -| [`data-designer-config`](config/index.md) | `data_designer.config` | Configuration schemas, builder APIs, plugin registration objects, and result schemas. 
| -| [`data-designer-engine`](engine/index.md) | `data_designer.engine` | Runtime contracts and implementations for generation, seed reading, processing, and MCP tool execution. | -| [`data-designer`](interface/index.md) | `data_designer.interface` | Public entry points for previewing, creating, and inspecting generated datasets. | - -The dependency direction is `interface -> engine -> config`. Config objects describe what should happen, engine objects implement how it happens, and interface objects expose the supported public API. diff --git a/docs/code_reference/interface/data_designer.md b/docs/code_reference/interface/data_designer.md deleted file mode 100644 index 050ba6242..000000000 --- a/docs/code_reference/interface/data_designer.md +++ /dev/null @@ -1,11 +0,0 @@ -# DataDesigner Interface - -[DataDesigner](#data_designer.interface.data_designer.DataDesigner) validates configs, generates in-memory previews, creates persisted datasets, lists configured MCP tools, and exposes default model settings. - -For runtime settings passed through `set_run_config()`, see [run_config](../config/run_config.md). For persisted creation results returned by `create()`, see [results](results.md). - -## `DataDesigner` {#data_designer.interface.data_designer.DataDesigner} - -::: data_designer.interface.data_designer.DataDesigner - options: - show_root_toc_entry: false diff --git a/docs/code_reference/interface/errors.md b/docs/code_reference/interface/errors.md deleted file mode 100644 index a969cf8fe..000000000 --- a/docs/code_reference/interface/errors.md +++ /dev/null @@ -1,29 +0,0 @@ -# Interface Errors - -Interface errors represent failures surfaced at the public API boundary. DataDesignerGenerationError wraps dataset generation failures from `create()` and `preview()`, DataDesignerEarlyShutdownError identifies generation runs that terminate early without producing records, and DataDesignerProfilingError wraps profiling failures from those methods. 
These errors inherit from `data_designer.errors.DataDesignerError`, allowing callers to catch either specific interface failures or the project-wide base error type. - -The package-level `data_designer.interface` export lazily exposes [DataDesignerGenerationError](#data_designer.interface.errors.DataDesignerGenerationError), [DataDesignerEarlyShutdownError](#data_designer.interface.errors.DataDesignerEarlyShutdownError), and [DataDesignerProfilingError](#data_designer.interface.errors.DataDesignerProfilingError). [InvalidBufferValueError](#data_designer.interface.errors.InvalidBufferValueError) is defined in this module. - -## `DataDesignerGenerationError` {#data_designer.interface.errors.DataDesignerGenerationError} - -::: data_designer.interface.errors.DataDesignerGenerationError - options: - show_root_toc_entry: false - -## `DataDesignerEarlyShutdownError` {#data_designer.interface.errors.DataDesignerEarlyShutdownError} - -::: data_designer.interface.errors.DataDesignerEarlyShutdownError - options: - show_root_toc_entry: false - -## `DataDesignerProfilingError` {#data_designer.interface.errors.DataDesignerProfilingError} - -::: data_designer.interface.errors.DataDesignerProfilingError - options: - show_root_toc_entry: false - -## `InvalidBufferValueError` {#data_designer.interface.errors.InvalidBufferValueError} - -::: data_designer.interface.errors.InvalidBufferValueError - options: - show_root_toc_entry: false diff --git a/docs/code_reference/interface/index.md b/docs/code_reference/interface/index.md deleted file mode 100644 index e43caa783..000000000 --- a/docs/code_reference/interface/index.md +++ /dev/null @@ -1,7 +0,0 @@ -# Interface Package - -The `data-designer` package provides the top-level user-facing package surface. This section covers `data_designer.interface`, which contains `DataDesigner`, persisted dataset creation results, and interface-level errors. - -This package sits above engine and config. 
`DataDesigner` accepts Data Designer configs, calls the runtime layer, and returns preview or persisted creation results. - -Start with [DataDesigner](data_designer.md) for previewing, creating, and inspecting datasets from a config. Use [results](results.md) for the object returned by persisted dataset creation, and [errors](errors.md) for exceptions surfaced at the public API boundary. diff --git a/docs/code_reference/interface/results.md b/docs/code_reference/interface/results.md deleted file mode 100644 index 044ca6ccf..000000000 --- a/docs/code_reference/interface/results.md +++ /dev/null @@ -1,11 +0,0 @@ -# Dataset Creation Results - -[DatasetCreationResults](#data_designer.interface.results.DatasetCreationResults) is returned by [DataDesigner.create()](data_designer.md#data_designer.interface.data_designer.DataDesigner.create). It provides access to persisted creation artifacts, including the generated dataset, profiling analysis, processor outputs, task traces, dataset metadata, and Hugging Face Hub upload support. - -Preview generation uses the in-memory `data_designer.config.preview_results.PreviewResults` object returned by [DataDesigner.preview()](data_designer.md#data_designer.interface.data_designer.DataDesigner.preview). Persisted dataset creation uses [DatasetCreationResults](#data_designer.interface.results.DatasetCreationResults). 
- -## `DatasetCreationResults` {#data_designer.interface.results.DatasetCreationResults} - -::: data_designer.interface.results.DatasetCreationResults - options: - show_root_toc_entry: false diff --git a/docs/colab_notebooks/1-the-basics.ipynb b/docs/colab_notebooks/1-the-basics.ipynb index a55028a44..06626e64e 100644 --- a/docs/colab_notebooks/1-the-basics.ipynb +++ b/docs/colab_notebooks/1-the-basics.ipynb @@ -2,15 +2,17 @@ "cells": [ { "cell_type": "markdown", - "id": "e9bc2aab", - "metadata": {}, + "id": "98a09b05", + "metadata": { + "nemo_colab_inject": true + }, "source": [ "\"Open" ] }, { "cell_type": "markdown", - "id": "33dcb5be", + "id": "5756aaa6", "metadata": {}, "source": [ "# 🎨 Data Designer Tutorial: The Basics\n", @@ -22,7 +24,7 @@ }, { "cell_type": "markdown", - "id": "adb77b8d", + "id": "0a71d58b", "metadata": {}, "source": [ "### 📦 Import Data Designer\n", @@ -34,8 +36,10 @@ }, { "cell_type": "markdown", - "id": "170ce1ea", - "metadata": {}, + "id": "b83390bc", + "metadata": { + "nemo_colab_inject": true + }, "source": [ "### ⚡ Colab Setup\n", "\n", @@ -45,8 +49,10 @@ { "cell_type": "code", "execution_count": null, - "id": "67e478f9", - "metadata": {}, + "id": "3a02b8e7", + "metadata": { + "nemo_colab_inject": true + }, "outputs": [], "source": [ "%%capture\n", @@ -56,8 +62,10 @@ { "cell_type": "code", "execution_count": null, - "id": "533fc40d", - "metadata": {}, + "id": "d3a48618", + "metadata": { + "nemo_colab_inject": true + }, "outputs": [], "source": [ "import getpass\n", @@ -74,7 +82,7 @@ { "cell_type": "code", "execution_count": null, - "id": "9ad92889", + "id": "74c4400b", "metadata": {}, "outputs": [], "source": [ @@ -84,20 +92,20 @@ }, { "cell_type": "markdown", - "id": "0232c4c6", + "id": "3cf1b9f6", "metadata": {}, "source": [ "### ⚙️ Initialize the Data Designer interface\n", "\n", "- `DataDesigner` is the main object responsible for managing the data generation process.\n", "\n", - "- When initialized without arguments, the 
[default model providers](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) are used.\n" + "- When initialized without arguments, the [default model providers](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) are used.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "fbbd0cab", + "id": "5072cd56", "metadata": {}, "outputs": [], "source": [ @@ -106,7 +114,7 @@ }, { "cell_type": "markdown", - "id": "305f635e", + "id": "ce8bca3b", "metadata": {}, "source": [ "### 🎛️ Define model configurations\n", @@ -115,7 +123,7 @@ "\n", "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", "\n", - "- The \"model provider\" is the external service that hosts the model (see the [model config](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) docs for more details).\n", + "- The \"model provider\" is the external service that hosts the model (see the [model config](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) docs for more details).\n", "\n", "- By default, we use [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" ] @@ -123,7 +131,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d0865b58", + "id": "c16ce85c", "metadata": {}, "outputs": [], "source": [ @@ -153,7 +161,7 @@ }, { "cell_type": "markdown", - "id": "6e1624f7", + "id": "e5d87363", "metadata": {}, "source": [ "### 🏗️ Initialize the Data Designer Config Builder\n", @@ -168,7 +176,7 @@ { "cell_type": "code", "execution_count": null, - "id": "33562cda", + "id": "8c496f00", "metadata": {}, "outputs": [], "source": [ @@ -177,7 +185,7 @@ }, { "cell_type": "markdown", - "id": "d8ec3063", + "id": "317722b9", "metadata": {}, "source": [ "## 🎲 Getting started with sampler columns\n", @@ -194,7 +202,7 @@ { "cell_type": "code", "execution_count": null, - "id": "70de1b0a", + "id": 
"8069e322", "metadata": {}, "outputs": [], "source": [ @@ -203,7 +211,7 @@ }, { "cell_type": "markdown", - "id": "991a8f34", + "id": "60fc9704", "metadata": {}, "source": [ "Let's start designing our product review dataset by adding product category and subcategory columns.\n" @@ -212,7 +220,7 @@ { "cell_type": "code", "execution_count": null, - "id": "222cbbcc", + "id": "ff7ffa2e", "metadata": {}, "outputs": [], "source": [ @@ -293,7 +301,7 @@ }, { "cell_type": "markdown", - "id": "29ca2aa3", + "id": "314551bc", "metadata": {}, "source": [ "Next, let's add samplers to generate data related to the customer and their review.\n" @@ -302,7 +310,7 @@ { "cell_type": "code", "execution_count": null, - "id": "4ca9ba1c", + "id": "80740541", "metadata": {}, "outputs": [], "source": [ @@ -339,7 +347,7 @@ }, { "cell_type": "markdown", - "id": "f4d54299", + "id": "076827d2", "metadata": {}, "source": [ "## 🦜 LLM-generated columns\n", @@ -354,7 +362,7 @@ { "cell_type": "code", "execution_count": null, - "id": "361b63b1", + "id": "83bfae12", "metadata": {}, "outputs": [], "source": [ @@ -390,7 +398,7 @@ }, { "cell_type": "markdown", - "id": "49ca028a", + "id": "98e3e22d", "metadata": {}, "source": [ "### 🔁 Iteration is key – preview the dataset!\n", @@ -407,7 +415,7 @@ { "cell_type": "code", "execution_count": null, - "id": "068ea8c3", + "id": "1e8edc7b", "metadata": {}, "outputs": [], "source": [ @@ -417,7 +425,7 @@ { "cell_type": "code", "execution_count": null, - "id": "bf196a77", + "id": "0a4fd25a", "metadata": {}, "outputs": [], "source": [ @@ -428,7 +436,7 @@ { "cell_type": "code", "execution_count": null, - "id": "36ebb017", + "id": "b05b127b", "metadata": {}, "outputs": [], "source": [ @@ -438,7 +446,7 @@ }, { "cell_type": "markdown", - "id": "1dcba545", + "id": "12904d0b", "metadata": {}, "source": [ "### 📊 Analyze the generated data\n", @@ -451,7 +459,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e5164902", + "id": "cb28940a", "metadata": {}, "outputs": 
[], "source": [ @@ -461,7 +469,7 @@ }, { "cell_type": "markdown", - "id": "cc433fae", + "id": "30907730", "metadata": {}, "source": [ "### 🆙 Scale up!\n", @@ -474,7 +482,7 @@ { "cell_type": "code", "execution_count": null, - "id": "17132fe2", + "id": "4a0799b1", "metadata": {}, "outputs": [], "source": [ @@ -484,7 +492,7 @@ { "cell_type": "code", "execution_count": null, - "id": "6669442a", + "id": "b2ffa7e2", "metadata": {}, "outputs": [], "source": [ @@ -497,7 +505,7 @@ { "cell_type": "code", "execution_count": null, - "id": "ee689b41", + "id": "343534e5", "metadata": {}, "outputs": [], "source": [ @@ -509,20 +517,20 @@ }, { "cell_type": "markdown", - "id": "6965e6ac", + "id": "4771ea98", "metadata": {}, "source": [ "## ⏭️ Next Steps\n", "\n", "Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about:\n", "\n", - "- [Structured outputs, jinja expressions, and conditional generation](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/2-structured-outputs-and-jinja-expressions/)\n", + "- [Structured outputs, jinja expressions, and conditional generation](https://docs.nvidia.com/nemo/datadesigner/tutorials/structured-outputs-jinja-expressions-and-conditional-generation)\n", "\n", - "- [Seeding synthetic data generation with an external dataset](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/3-seeding-with-a-dataset/)\n", + "- [Seeding synthetic data generation with an external dataset](https://docs.nvidia.com/nemo/datadesigner/tutorials/seeding-with-an-external-dataset)\n", "\n", - "- [Providing images as context](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/4-providing-images-as-context/)\n", + "- [Providing images as context](https://docs.nvidia.com/nemo/datadesigner/tutorials/providing-images-as-context)\n", "\n", - "- [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/)\n" + "- [Generating 
images](https://docs.nvidia.com/nemo/datadesigner/tutorials/generating-images)\n" ] } ], diff --git a/docs/colab_notebooks/2-structured-outputs-and-jinja-expressions.ipynb b/docs/colab_notebooks/2-structured-outputs-and-jinja-expressions.ipynb index 77272fbb1..15b713760 100644 --- a/docs/colab_notebooks/2-structured-outputs-and-jinja-expressions.ipynb +++ b/docs/colab_notebooks/2-structured-outputs-and-jinja-expressions.ipynb @@ -2,15 +2,17 @@ "cells": [ { "cell_type": "markdown", - "id": "f4f854dd", - "metadata": {}, + "id": "88a85c19", + "metadata": { + "nemo_colab_inject": true + }, "source": [ "\"Open" ] }, { "cell_type": "markdown", - "id": "027ffdf3", + "id": "94f54d32", "metadata": {}, "source": [ "# 🎨 Data Designer Tutorial: Structured Outputs, Jinja Expressions, and Conditional Generation\n", @@ -19,12 +21,12 @@ "\n", "In this notebook, we will continue our exploration of Data Designer, demonstrating more advanced data generation using structured outputs, Jinja expressions, and conditional generation with `skip.when`.\n", "\n", - "If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series.\n" + "If this is your first time using Data Designer, we recommend starting with the [first notebook](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics) in this tutorial series.\n" ] }, { "cell_type": "markdown", - "id": "158f95c6", + "id": "0034324b", "metadata": {}, "source": [ "### 📦 Import Data Designer\n", @@ -36,8 +38,10 @@ }, { "cell_type": "markdown", - "id": "459b2f2b", - "metadata": {}, + "id": "b34f3a8e", + "metadata": { + "nemo_colab_inject": true + }, "source": [ "### ⚡ Colab Setup\n", "\n", @@ -47,8 +51,10 @@ { "cell_type": "code", "execution_count": null, - "id": "2bdb065c", - "metadata": {}, + "id": "58098daa", + "metadata": { + "nemo_colab_inject": true + }, "outputs": [], "source": [ "%%capture\n", @@ -58,8 
+64,10 @@ { "cell_type": "code", "execution_count": null, - "id": "8ccc1e8f", - "metadata": {}, + "id": "acfb4bf7", + "metadata": { + "nemo_colab_inject": true + }, "outputs": [], "source": [ "import getpass\n", @@ -76,7 +84,7 @@ { "cell_type": "code", "execution_count": null, - "id": "aeb8441e", + "id": "40d25e71", "metadata": {}, "outputs": [], "source": [ @@ -86,20 +94,20 @@ }, { "cell_type": "markdown", - "id": "df989756", + "id": "42a10ad4", "metadata": {}, "source": [ "### ⚙️ Initialize the Data Designer interface\n", "\n", "- `DataDesigner` is the main object that is used to interface with the library.\n", "\n", - "- When initialized without arguments, the [default model providers](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) are used.\n" + "- When initialized without arguments, the [default model providers](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) are used.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "2a8f113b", + "id": "fece2b5a", "metadata": {}, "outputs": [], "source": [ @@ -108,7 +116,7 @@ }, { "cell_type": "markdown", - "id": "b986772a", + "id": "caece5f7", "metadata": {}, "source": [ "### 🎛️ Define model configurations\n", @@ -117,7 +125,7 @@ "\n", "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", "\n", - "- The \"model provider\" is the external service that hosts the model (see the [model config](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) docs for more details).\n", + "- The \"model provider\" is the external service that hosts the model (see the [model config](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) docs for more details).\n", "\n", "- By default, we use [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" ] @@ -125,7 +133,7 @@ { "cell_type": "code", "execution_count": 
null, - "id": "2ce9cf8c", + "id": "4dbf6c7c", "metadata": {}, "outputs": [], "source": [ @@ -155,7 +163,7 @@ }, { "cell_type": "markdown", - "id": "6b5ab2ea", + "id": "defeb238", "metadata": {}, "source": [ "### 🏗️ Initialize the Data Designer Config Builder\n", @@ -170,7 +178,7 @@ { "cell_type": "code", "execution_count": null, - "id": "69a41c06", + "id": "20fd84df", "metadata": {}, "outputs": [], "source": [ @@ -179,7 +187,7 @@ }, { "cell_type": "markdown", - "id": "b17aca77", + "id": "d561cd7f", "metadata": {}, "source": [ "### 🧑‍🎨 Designing our data\n", @@ -206,7 +214,7 @@ { "cell_type": "code", "execution_count": null, - "id": "133df1c0", + "id": "1472db3e", "metadata": {}, "outputs": [], "source": [ @@ -234,7 +242,7 @@ }, { "cell_type": "markdown", - "id": "2535b9c0", + "id": "8da9b341", "metadata": {}, "source": [ "Next, let's design our product review dataset using a few more tricks compared to the previous notebook.\n" @@ -243,7 +251,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7d4d991d", + "id": "e4e28a13", "metadata": {}, "outputs": [], "source": [ @@ -352,7 +360,7 @@ }, { "cell_type": "markdown", - "id": "afc66880", + "id": "8dcdfcd9", "metadata": {}, "source": [ "Next, we will use more advanced Jinja expressions to create new columns.\n", @@ -369,7 +377,7 @@ { "cell_type": "code", "execution_count": null, - "id": "5d8452b2", + "id": "369ec084", "metadata": {}, "outputs": [], "source": [ @@ -422,7 +430,7 @@ }, { "cell_type": "markdown", - "id": "d7780299", + "id": "91413bfb", "metadata": {}, "source": [ "## 🚦 Conditional generation with `skip.when`\n", @@ -445,7 +453,7 @@ }, { "cell_type": "markdown", - "id": "794ac1aa", + "id": "c54104b5", "metadata": {}, "source": [ "**Pattern 1 — Expression gate.** Only generate a detailed complaint analysis when the customer gave a low rating (1 or 2 stars).\n", @@ -455,7 +463,7 @@ { "cell_type": "code", "execution_count": null, - "id": "6d96baaa", + "id": "bbf22c61", "metadata": {}, "outputs": [], 
"source": [ @@ -478,7 +486,7 @@ }, { "cell_type": "markdown", - "id": "a3598079", + "id": "c6b0295f", "metadata": {}, "source": [ "**Pattern 2 — Skip propagation.** `action_items` depends on `complaint_analysis`.\n", @@ -489,7 +497,7 @@ { "cell_type": "code", "execution_count": null, - "id": "59be7563", + "id": "806e2935", "metadata": {}, "outputs": [], "source": [ @@ -508,7 +516,7 @@ }, { "cell_type": "markdown", - "id": "44cfc2e8", + "id": "4b75b941", "metadata": {}, "source": [ "**Pattern 3 — Propagation opt-out.** `review_summary` also depends on `complaint_analysis`,\n", @@ -519,7 +527,7 @@ { "cell_type": "code", "execution_count": null, - "id": "a9cee7fe", + "id": "7e44ef07", "metadata": {}, "outputs": [], "source": [ @@ -545,7 +553,7 @@ }, { "cell_type": "markdown", - "id": "67f39d99", + "id": "fae62671", "metadata": {}, "source": [ "### 🔁 Iteration is key – preview the dataset!\n", @@ -562,7 +570,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3aa1cd01", + "id": "ed45c5a9", "metadata": {}, "outputs": [], "source": [ @@ -572,7 +580,7 @@ { "cell_type": "code", "execution_count": null, - "id": "5d78f540", + "id": "b4ef1454", "metadata": {}, "outputs": [], "source": [ @@ -585,7 +593,7 @@ { "cell_type": "code", "execution_count": null, - "id": "86011901", + "id": "fcf761ef", "metadata": {}, "outputs": [], "source": [ @@ -597,7 +605,7 @@ }, { "cell_type": "markdown", - "id": "8fa363ed", + "id": "493ff54a", "metadata": {}, "source": [ "### 📊 Analyze the generated data\n", @@ -610,7 +618,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3dede878", + "id": "5c38165d", "metadata": {}, "outputs": [], "source": [ @@ -620,7 +628,7 @@ }, { "cell_type": "markdown", - "id": "38839a98", + "id": "848f4dcd", "metadata": {}, "source": [ "### 🆙 Scale up!\n", @@ -633,7 +641,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8208f51b", + "id": "0af8dfb1", "metadata": {}, "outputs": [], "source": [ @@ -643,7 +651,7 @@ { "cell_type": "code", 
"execution_count": null, - "id": "2b07217f", + "id": "2b1469e5", "metadata": {}, "outputs": [], "source": [ @@ -656,7 +664,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7deaa6e2", + "id": "0ccc4315", "metadata": {}, "outputs": [], "source": [ @@ -668,18 +676,18 @@ }, { "cell_type": "markdown", - "id": "b4c1a576", + "id": "c6601b5a", "metadata": {}, "source": [ "## ⏭️ Next Steps\n", "\n", "Check out the following notebook to learn more about:\n", "\n", - "- [Seeding synthetic data generation with an external dataset](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/3-seeding-with-a-dataset/)\n", + "- [Seeding synthetic data generation with an external dataset](https://docs.nvidia.com/nemo/datadesigner/tutorials/seeding-with-an-external-dataset)\n", "\n", - "- [Providing images as context](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/4-providing-images-as-context/)\n", + "- [Providing images as context](https://docs.nvidia.com/nemo/datadesigner/tutorials/providing-images-as-context)\n", "\n", - "- [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/)\n" + "- [Generating images](https://docs.nvidia.com/nemo/datadesigner/tutorials/generating-images)\n" ] } ], diff --git a/docs/colab_notebooks/3-seeding-with-a-dataset.ipynb b/docs/colab_notebooks/3-seeding-with-a-dataset.ipynb index 7aab5eaa8..a8f642b64 100644 --- a/docs/colab_notebooks/3-seeding-with-a-dataset.ipynb +++ b/docs/colab_notebooks/3-seeding-with-a-dataset.ipynb @@ -2,15 +2,17 @@ "cells": [ { "cell_type": "markdown", - "id": "21e9e0eb", - "metadata": {}, + "id": "15f5cea5", + "metadata": { + "nemo_colab_inject": true + }, "source": [ "\"Open" ] }, { "cell_type": "markdown", - "id": "b185696e", + "id": "9dbe0d94", "metadata": {}, "source": [ "# 🎨 Data Designer Tutorial: Seeding Synthetic Data Generation with an External Dataset\n", @@ -19,12 +21,12 @@ "\n", "In this notebook, we will demonstrate how to seed synthetic 
data generation in Data Designer with an external dataset.\n", "\n", - "If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series.\n" + "If this is your first time using Data Designer, we recommend starting with the [first notebook](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics) in this tutorial series.\n" ] }, { "cell_type": "markdown", - "id": "692c9796", + "id": "e5f21bad", "metadata": {}, "source": [ "### 📦 Import Data Designer\n", @@ -36,8 +38,10 @@ }, { "cell_type": "markdown", - "id": "daa8cd50", - "metadata": {}, + "id": "f2c8b1cb", + "metadata": { + "nemo_colab_inject": true + }, "source": [ "### ⚡ Colab Setup\n", "\n", @@ -47,8 +51,10 @@ { "cell_type": "code", "execution_count": null, - "id": "8848bd1e", - "metadata": {}, + "id": "5073489b", + "metadata": { + "nemo_colab_inject": true + }, "outputs": [], "source": [ "%%capture\n", @@ -58,8 +64,10 @@ { "cell_type": "code", "execution_count": null, - "id": "317ce78f", - "metadata": {}, + "id": "dbdb8ad2", + "metadata": { + "nemo_colab_inject": true + }, "outputs": [], "source": [ "import getpass\n", @@ -76,7 +84,7 @@ { "cell_type": "code", "execution_count": null, - "id": "1cb2d5c8", + "id": "34abb3ea", "metadata": {}, "outputs": [], "source": [ @@ -86,20 +94,20 @@ }, { "cell_type": "markdown", - "id": "8b49428f", + "id": "2c4c3dda", "metadata": {}, "source": [ "### ⚙️ Initialize the Data Designer interface\n", "\n", "- `DataDesigner` is the main object responsible for managing the data generation process.\n", "\n", - "- When initialized without arguments, the [default model providers](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) are used.\n" + "- When initialized without arguments, the [default model providers](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) are used.\n" ] }, 
{ "cell_type": "code", "execution_count": null, - "id": "69df6d66", + "id": "8c0b6ccd", "metadata": {}, "outputs": [], "source": [ @@ -108,7 +116,7 @@ }, { "cell_type": "markdown", - "id": "50378de0", + "id": "9f51cdf6", "metadata": {}, "source": [ "### 🎛️ Define model configurations\n", @@ -117,7 +125,7 @@ "\n", "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", "\n", - "- The \"model provider\" is the external service that hosts the model (see the [model config](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) docs for more details).\n", + "- The \"model provider\" is the external service that hosts the model (see the [model config](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) docs for more details).\n", "\n", "- By default, we use [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" ] @@ -125,7 +133,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e932a29e", + "id": "32e8274b", "metadata": {}, "outputs": [], "source": [ @@ -155,7 +163,7 @@ }, { "cell_type": "markdown", - "id": "9487eecc", + "id": "042d9142", "metadata": {}, "source": [ "### 🏗️ Initialize the Data Designer Config Builder\n", @@ -170,7 +178,7 @@ { "cell_type": "code", "execution_count": null, - "id": "172f0df0", + "id": "c71c261b", "metadata": {}, "outputs": [], "source": [ @@ -179,7 +187,7 @@ }, { "cell_type": "markdown", - "id": "54700574", + "id": "f7176386", "metadata": {}, "source": [ "## 🏥 Prepare a seed dataset\n", @@ -204,7 +212,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7c1e1f69", + "id": "faca8c3c", "metadata": {}, "outputs": [], "source": [ @@ -222,7 +230,7 @@ }, { "cell_type": "markdown", - "id": "bdd24ad6", + "id": "bbf3d6cb", "metadata": {}, "source": [ "## 🎨 Designing our synthetic patient notes dataset\n", @@ -235,7 +243,7 @@ { "cell_type": "code", "execution_count": null, - "id": 
"2b33b6f6", + "id": "2c6b3bce", "metadata": {}, "outputs": [], "source": [ @@ -316,7 +324,7 @@ }, { "cell_type": "markdown", - "id": "2d23d1c3", + "id": "d00607e5", "metadata": {}, "source": [ "### 🔁 Iteration is key – preview the dataset!\n", @@ -333,7 +341,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d2e864ef", + "id": "f630d117", "metadata": {}, "outputs": [], "source": [ @@ -343,7 +351,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d948d638", + "id": "7a7cc9f6", "metadata": {}, "outputs": [], "source": [ @@ -354,7 +362,7 @@ { "cell_type": "code", "execution_count": null, - "id": "a5bb03c7", + "id": "ea6e42ec", "metadata": {}, "outputs": [], "source": [ @@ -364,7 +372,7 @@ }, { "cell_type": "markdown", - "id": "a6d81e80", + "id": "03bb76ae", "metadata": {}, "source": [ "### 📊 Analyze the generated data\n", @@ -377,7 +385,7 @@ { "cell_type": "code", "execution_count": null, - "id": "536d8500", + "id": "de633b43", "metadata": {}, "outputs": [], "source": [ @@ -387,7 +395,7 @@ }, { "cell_type": "markdown", - "id": "e93e1239", + "id": "14253b1a", "metadata": {}, "source": [ "### 🆙 Scale up!\n", @@ -400,7 +408,7 @@ { "cell_type": "code", "execution_count": null, - "id": "60a30857", + "id": "9ede3ba3", "metadata": {}, "outputs": [], "source": [ @@ -410,7 +418,7 @@ { "cell_type": "code", "execution_count": null, - "id": "b746c558", + "id": "82b4e030", "metadata": {}, "outputs": [], "source": [ @@ -423,7 +431,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e8aa5c7e", + "id": "46a085c9", "metadata": {}, "outputs": [], "source": [ @@ -435,16 +443,16 @@ }, { "cell_type": "markdown", - "id": "023fff7b", + "id": "001acc5e", "metadata": {}, "source": [ "## ⏭️ Next Steps\n", "\n", "Check out the following notebook to learn more about:\n", "\n", - "- [Providing images as context](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/4-providing-images-as-context/)\n", + "- [Providing images as 
context](https://docs.nvidia.com/nemo/datadesigner/tutorials/providing-images-as-context)\n", "\n", - "- [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/)\n" + "- [Generating images](https://docs.nvidia.com/nemo/datadesigner/tutorials/generating-images)\n" ] } ], diff --git a/docs/colab_notebooks/4-providing-images-as-context.ipynb b/docs/colab_notebooks/4-providing-images-as-context.ipynb index ba225ac0a..53bb81ec4 100644 --- a/docs/colab_notebooks/4-providing-images-as-context.ipynb +++ b/docs/colab_notebooks/4-providing-images-as-context.ipynb @@ -2,15 +2,17 @@ "cells": [ { "cell_type": "markdown", - "id": "f7d47856", - "metadata": {}, + "id": "f2d3e832", + "metadata": { + "nemo_colab_inject": true + }, "source": [ "\"Open" ] }, { "cell_type": "markdown", - "id": "e826ba2c", + "id": "846cbeba", "metadata": {}, "source": [ "# 🎨 Data Designer Tutorial: Providing Images as Context for Vision-Based Data Generation" @@ -18,7 +20,7 @@ }, { "cell_type": "markdown", - "id": "4e0854f1", + "id": "9465a71f", "metadata": {}, "source": [ "#### 📚 What you'll learn\n", @@ -28,12 +30,12 @@ "- ✨ **Visual Document Processing**: Converting images to chat-ready format for model consumption\n", "- 🔍 **Vision-Language Generation**: Using vision models to generate detailed summaries from images\n", "\n", - "If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series.\n" + "If this is your first time using Data Designer, we recommend starting with the [first notebook](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics) in this tutorial series.\n" ] }, { "cell_type": "markdown", - "id": "adc08017", + "id": "ea6cab1d", "metadata": {}, "source": [ "### 📦 Import Data Designer\n", @@ -45,8 +47,10 @@ }, { "cell_type": "markdown", - "id": "c68a6c2c", - "metadata": {}, + "id": "6d9518b6", + 
"metadata": { + "nemo_colab_inject": true + }, "source": [ "### ⚡ Colab Setup\n", "\n", @@ -56,8 +60,10 @@ { "cell_type": "code", "execution_count": null, - "id": "67bf78ce", - "metadata": {}, + "id": "e18f24a9", + "metadata": { + "nemo_colab_inject": true + }, "outputs": [], "source": [ "%%capture\n", @@ -67,8 +73,10 @@ { "cell_type": "code", "execution_count": null, - "id": "21bbf67b", - "metadata": {}, + "id": "7cb636e4", + "metadata": { + "nemo_colab_inject": true + }, "outputs": [], "source": [ "import getpass\n", @@ -85,7 +93,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d7056b4d", + "id": "53e92ed7", "metadata": {}, "outputs": [], "source": [ @@ -108,20 +116,20 @@ }, { "cell_type": "markdown", - "id": "48235c24", + "id": "84b20d99", "metadata": {}, "source": [ "### ⚙️ Initialize the Data Designer interface\n", "\n", "- `DataDesigner` is the main object responsible for managing the data generation process.\n", "\n", - "- When initialized without arguments, the [default model providers](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) are used.\n" + "- When initialized without arguments, the [default model providers](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) are used.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "768218ca", + "id": "870091c0", "metadata": {}, "outputs": [], "source": [ @@ -130,7 +138,7 @@ }, { "cell_type": "markdown", - "id": "ff4a52ed", + "id": "975f4539", "metadata": {}, "source": [ "### 🏗️ Initialize the Data Designer Config Builder\n", @@ -139,13 +147,13 @@ "\n", "- The config builder provides an intuitive interface for building this configuration.\n", "\n", - "- When initialized without arguments, the [default model configurations](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) are used.\n" + "- When initialized without arguments, the [default model 
configurations](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) are used.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "42640912", + "id": "2224a537", "metadata": {}, "outputs": [], "source": [ @@ -154,7 +162,7 @@ }, { "cell_type": "markdown", - "id": "4ecad6af", + "id": "d00ae426", "metadata": {}, "source": [ "### 🌱 Seed Dataset Creation\n", @@ -171,7 +179,7 @@ { "cell_type": "code", "execution_count": null, - "id": "bafdf91f", + "id": "d94bc473", "metadata": {}, "outputs": [], "source": [ @@ -186,7 +194,7 @@ { "cell_type": "code", "execution_count": null, - "id": "dc5c92ac", + "id": "a432d57f", "metadata": {}, "outputs": [], "source": [ @@ -231,7 +239,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d4cde737", + "id": "b4c8e1ba", "metadata": {}, "outputs": [], "source": [ @@ -249,7 +257,7 @@ { "cell_type": "code", "execution_count": null, - "id": "39848e33", + "id": "c61ea059", "metadata": {}, "outputs": [], "source": [ @@ -259,7 +267,7 @@ { "cell_type": "code", "execution_count": null, - "id": "b94581da", + "id": "203cbf48", "metadata": {}, "outputs": [], "source": [ @@ -271,7 +279,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7c561ff0", + "id": "6ee139f3", "metadata": {}, "outputs": [], "source": [ @@ -293,7 +301,7 @@ }, { "cell_type": "markdown", - "id": "99a5ad0c", + "id": "0445efa1", "metadata": {}, "source": [ "### 🔁 Iteration is key – preview the dataset!\n", @@ -310,7 +318,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d32dcf48", + "id": "3755e3e7", "metadata": {}, "outputs": [], "source": [ @@ -320,7 +328,7 @@ { "cell_type": "code", "execution_count": null, - "id": "70db2f87", + "id": "592fb941", "metadata": {}, "outputs": [], "source": [ @@ -331,7 +339,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8b65b184", + "id": "33d2699d", "metadata": {}, "outputs": [], "source": [ @@ -341,7 +349,7 @@ }, { "cell_type": "markdown", - "id": "58e3147f", + 
"id": "0193d3bf", "metadata": {}, "source": [ "### 📊 Analyze the generated data\n", @@ -354,7 +362,7 @@ { "cell_type": "code", "execution_count": null, - "id": "82b01514", + "id": "0a24414d", "metadata": {}, "outputs": [], "source": [ @@ -364,7 +372,7 @@ }, { "cell_type": "markdown", - "id": "8274677b", + "id": "2087bb8a", "metadata": {}, "source": [ "### 🔎 Visual Inspection\n", @@ -375,7 +383,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e7bd89dc", + "id": "db656f8d", "metadata": { "lines_to_next_cell": 2 }, @@ -399,7 +407,7 @@ }, { "cell_type": "markdown", - "id": "01f6d07d", + "id": "8661033a", "metadata": {}, "source": [ "### 🆙 Scale up!\n", @@ -412,7 +420,7 @@ { "cell_type": "code", "execution_count": null, - "id": "21981b68", + "id": "6604d596", "metadata": {}, "outputs": [], "source": [ @@ -422,7 +430,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7c655cea", + "id": "c459a01b", "metadata": {}, "outputs": [], "source": [ @@ -435,7 +443,7 @@ { "cell_type": "code", "execution_count": null, - "id": "291a3dfc", + "id": "ab258027", "metadata": {}, "outputs": [], "source": [ @@ -447,7 +455,7 @@ }, { "cell_type": "markdown", - "id": "af7c69cc", + "id": "4aa21dc8", "metadata": {}, "source": [ "## ⏭️ Next Steps\n", @@ -459,7 +467,7 @@ "- Combine vision-based descriptions with other column types for multi-modal workflows\n", "- Apply this pattern to other vision tasks like image captioning, OCR validation, or visual question answering\n", "\n", - "- [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/) with Data Designer\n" + "- [Generating images](https://docs.nvidia.com/nemo/datadesigner/tutorials/generating-images) with Data Designer\n" ] } ], diff --git a/docs/colab_notebooks/5-generating-images.ipynb b/docs/colab_notebooks/5-generating-images.ipynb index efecb0387..0b4f8f7c7 100644 --- a/docs/colab_notebooks/5-generating-images.ipynb +++ b/docs/colab_notebooks/5-generating-images.ipynb 
@@ -2,15 +2,17 @@ "cells": [ { "cell_type": "markdown", - "id": "52eeca6e", - "metadata": {}, + "id": "0fcaa1ff", + "metadata": { + "nemo_colab_inject": true + }, "source": [ "\"Open" ] }, { "cell_type": "markdown", - "id": "ea02d680", + "id": "15241416", "metadata": {}, "source": [ "# 🎨 Data Designer Tutorial: Generating Images\n", @@ -27,12 +29,12 @@ "\n", "> **Prerequisites**: This tutorial uses [OpenRouter](https://openrouter.ai) with the Flux 2 Pro image model. Set `OPENROUTER_API_KEY` in your environment before running.\n", "\n", - "If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series.\n" + "If this is your first time using Data Designer, we recommend starting with the [first notebook](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics) in this tutorial series.\n" ] }, { "cell_type": "markdown", - "id": "1c36e1cd", + "id": "885e2471", "metadata": {}, "source": [ "### 📦 Import Data Designer\n", @@ -43,8 +45,10 @@ }, { "cell_type": "markdown", - "id": "4933a0df", - "metadata": {}, + "id": "b9a73500", + "metadata": { + "nemo_colab_inject": true + }, "source": [ "### ⚡ Colab Setup\n", "\n", @@ -54,8 +58,10 @@ { "cell_type": "code", "execution_count": null, - "id": "abe49f1b", - "metadata": {}, + "id": "4133533b", + "metadata": { + "nemo_colab_inject": true + }, "outputs": [], "source": [ "%%capture\n", @@ -65,8 +71,10 @@ { "cell_type": "code", "execution_count": null, - "id": "f6ffa0a4", - "metadata": {}, + "id": "0dcced26", + "metadata": { + "nemo_colab_inject": true + }, "outputs": [], "source": [ "import getpass\n", @@ -83,7 +91,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f1de4914", + "id": "cbbbb901", "metadata": {}, "outputs": [], "source": [ @@ -96,7 +104,7 @@ }, { "cell_type": "markdown", - "id": "112c71f5", + "id": "3d198931", "metadata": {}, "source": [ "### ⚙️ Initialize the Data 
Designer interface\n", @@ -107,7 +115,7 @@ { "cell_type": "code", "execution_count": null, - "id": "88c82623", + "id": "da5ae375", "metadata": {}, "outputs": [], "source": [ @@ -116,7 +124,7 @@ }, { "cell_type": "markdown", - "id": "50ca5262", + "id": "a84c8ce2", "metadata": {}, "source": [ "### 🎛️ Define an image-generation model\n", @@ -128,7 +136,7 @@ { "cell_type": "code", "execution_count": null, - "id": "49fdc61e", + "id": "9ac00282", "metadata": {}, "outputs": [], "source": [ @@ -150,7 +158,7 @@ }, { "cell_type": "markdown", - "id": "6740ea52", + "id": "3f178963", "metadata": {}, "source": [ "### 🏗️ Build the config: samplers + image column\n", @@ -161,7 +169,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7b89467a", + "id": "8e4529e2", "metadata": {}, "outputs": [], "source": [ @@ -334,7 +342,7 @@ }, { "cell_type": "markdown", - "id": "ad84fd89", + "id": "b0e27f35", "metadata": {}, "source": [ "### 🔁 Preview: images as base64\n", @@ -345,7 +353,7 @@ { "cell_type": "code", "execution_count": null, - "id": "24ecd543", + "id": "82a3d4a1", "metadata": {}, "outputs": [], "source": [ @@ -355,7 +363,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7540fc51", + "id": "eef18a56", "metadata": {}, "outputs": [], "source": [ @@ -366,7 +374,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8920f6c5", + "id": "618586ec", "metadata": {}, "outputs": [], "source": [ @@ -375,7 +383,7 @@ }, { "cell_type": "markdown", - "id": "5739eee6", + "id": "f4125f38", "metadata": {}, "source": [ "### 🆙 Create: images saved to disk\n", @@ -386,7 +394,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f5326cbb", + "id": "78249f6b", "metadata": {}, "outputs": [], "source": [ @@ -396,7 +404,7 @@ { "cell_type": "code", "execution_count": null, - "id": "506d537f", + "id": "c4a3767d", "metadata": {}, "outputs": [], "source": [ @@ -407,7 +415,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8dbd4874", + "id": "b702f652", "metadata": 
{}, "outputs": [], "source": [ @@ -423,16 +431,16 @@ }, { "cell_type": "markdown", - "id": "fa0307b2", + "id": "47d3ba2a", "metadata": {}, "source": [ "## ⏭️ Next steps\n", "\n", - "- [The basics](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/): samplers and LLM text columns\n", - "- [Structured outputs and Jinja](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/2-structured-outputs-and-jinja-expressions/)\n", - "- [Seeding with a dataset](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/3-seeding-with-a-dataset/)\n", - "- [Providing images as context](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/4-providing-images-as-context/)\n", - "- [Image-to-image editing](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/6-editing-images-with-image-context/): edit existing images with seed datasets\n" + "- [The basics](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics): samplers and LLM text columns\n", + "- [Structured outputs and Jinja](https://docs.nvidia.com/nemo/datadesigner/tutorials/structured-outputs-jinja-expressions-and-conditional-generation)\n", + "- [Seeding with a dataset](https://docs.nvidia.com/nemo/datadesigner/tutorials/seeding-with-an-external-dataset)\n", + "- [Providing images as context](https://docs.nvidia.com/nemo/datadesigner/tutorials/providing-images-as-context)\n", + "- [Image-to-image editing](https://docs.nvidia.com/nemo/datadesigner/tutorials/image-to-image-editing): edit existing images with seed datasets\n" ] } ], diff --git a/docs/colab_notebooks/6-editing-images-with-image-context.ipynb b/docs/colab_notebooks/6-editing-images-with-image-context.ipynb index 8a29e17af..becd216fd 100644 --- a/docs/colab_notebooks/6-editing-images-with-image-context.ipynb +++ b/docs/colab_notebooks/6-editing-images-with-image-context.ipynb @@ -2,15 +2,17 @@ "cells": [ { "cell_type": "markdown", - "id": "7348e00d", - "metadata": {}, + "id": "c2040d1e", + "metadata": { + 
"nemo_colab_inject": true + }, "source": [ "\"Open" ] }, { "cell_type": "markdown", - "id": "c5e18f66", + "id": "6897cf62", "metadata": {}, "source": [ "# 🎨 Data Designer Tutorial: Image-to-Image Editing\n", @@ -23,16 +25,16 @@ "- 🔗 **Chaining image columns**: Use `ImageContext` to pass generated images to a follow-up editing column\n", "- 🎲 **Sampler-driven diversity**: Combine sampled accessories and settings for varied edits\n", "\n", - "This tutorial uses an **autoregressive** model (one that supports both text-to-image *and* image-to-image generation via the chat completions API). Diffusion models (DALL·E, Stable Diffusion, etc.) do not support image context—see [Tutorial 5](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/) for text-to-image generation with diffusion models.\n", + "This tutorial uses an **autoregressive** model (one that supports both text-to-image *and* image-to-image generation via the chat completions API). Diffusion models (DALL·E, Stable Diffusion, etc.) do not support image context—see [Tutorial 5](https://docs.nvidia.com/nemo/datadesigner/tutorials/generating-images) for text-to-image generation with diffusion models.\n", "\n", "> **Prerequisites**: This tutorial uses [OpenRouter](https://openrouter.ai) with the Flux 2 Pro model. 
Set `OPENROUTER_API_KEY` in your environment before running.\n", "\n", - "If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series.\n" + "If this is your first time using Data Designer, we recommend starting with the [first notebook](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics) in this tutorial series.\n" ] }, { "cell_type": "markdown", - "id": "daa7359c", + "id": "9b9f1ec7", "metadata": {}, "source": [ "### 📦 Import Data Designer\n", @@ -43,8 +45,10 @@ }, { "cell_type": "markdown", - "id": "5bb9d062", - "metadata": {}, + "id": "c9f96370", + "metadata": { + "nemo_colab_inject": true + }, "source": [ "### ⚡ Colab Setup\n", "\n", @@ -54,8 +58,10 @@ { "cell_type": "code", "execution_count": null, - "id": "b03fb17a", - "metadata": {}, + "id": "7f956805", + "metadata": { + "nemo_colab_inject": true + }, "outputs": [], "source": [ "%%capture\n", @@ -65,8 +71,10 @@ { "cell_type": "code", "execution_count": null, - "id": "e931d0de", - "metadata": {}, + "id": "f2e911ec", + "metadata": { + "nemo_colab_inject": true + }, "outputs": [], "source": [ "import getpass\n", @@ -83,7 +91,7 @@ { "cell_type": "code", "execution_count": null, - "id": "02e932f5", + "id": "d40b8756", "metadata": {}, "outputs": [], "source": [ @@ -99,7 +107,7 @@ }, { "cell_type": "markdown", - "id": "369a04c5", + "id": "24384480", "metadata": {}, "source": [ "### ⚙️ Initialize the Data Designer interface\n", @@ -110,7 +118,7 @@ { "cell_type": "code", "execution_count": null, - "id": "070aaa15", + "id": "467be1c2", "metadata": {}, "outputs": [], "source": [ @@ -119,7 +127,7 @@ }, { "cell_type": "markdown", - "id": "142952fe", + "id": "19f56aa9", "metadata": {}, "source": [ "### 🎛️ Define an image model\n", @@ -135,7 +143,7 @@ { "cell_type": "code", "execution_count": null, - "id": "2d66a7c8", + "id": "7679a8fd", "metadata": {}, "outputs": [], "source": 
[ @@ -157,7 +165,7 @@ }, { "cell_type": "markdown", - "id": "c4d0e592", + "id": "81e3d777", "metadata": {}, "source": [ "### 🏗️ Build the configuration\n", @@ -172,7 +180,7 @@ { "cell_type": "code", "execution_count": null, - "id": "51a228bb", + "id": "7463cff1", "metadata": {}, "outputs": [], "source": [ @@ -270,7 +278,7 @@ }, { "cell_type": "markdown", - "id": "dc6d84fa", + "id": "84121b5e", "metadata": {}, "source": [ "### 🔁 Preview: quick iteration\n", @@ -281,7 +289,7 @@ { "cell_type": "code", "execution_count": null, - "id": "05b58baa", + "id": "422b9358", "metadata": {}, "outputs": [], "source": [ @@ -291,7 +299,7 @@ { "cell_type": "code", "execution_count": null, - "id": "97e35ebb", + "id": "f5bde082", "metadata": {}, "outputs": [], "source": [ @@ -302,7 +310,7 @@ { "cell_type": "code", "execution_count": null, - "id": "345514ab", + "id": "26147a02", "metadata": {}, "outputs": [], "source": [ @@ -311,7 +319,7 @@ }, { "cell_type": "markdown", - "id": "15dfb8b7", + "id": "28ba2a2e", "metadata": { "lines_to_next_cell": 2 }, @@ -324,7 +332,7 @@ { "cell_type": "code", "execution_count": null, - "id": "13728788", + "id": "4b96ae4f", "metadata": {}, "outputs": [], "source": [ @@ -355,7 +363,7 @@ { "cell_type": "code", "execution_count": null, - "id": "6da35706", + "id": "564f4207", "metadata": {}, "outputs": [], "source": [ @@ -365,7 +373,7 @@ }, { "cell_type": "markdown", - "id": "59abd92b", + "id": "63b65bcd", "metadata": {}, "source": [ "### 🆙 Create at scale\n", @@ -376,7 +384,7 @@ { "cell_type": "code", "execution_count": null, - "id": "25be841b", + "id": "6ea97a86", "metadata": {}, "outputs": [], "source": [ @@ -386,7 +394,7 @@ { "cell_type": "code", "execution_count": null, - "id": "389cc5d2", + "id": "8b2a3d4e", "metadata": {}, "outputs": [], "source": [ @@ -397,7 +405,7 @@ { "cell_type": "code", "execution_count": null, - "id": "15002cbf", + "id": "d199ae65", "metadata": {}, "outputs": [], "source": [ @@ -407,7 +415,7 @@ }, { "cell_type": "markdown", - 
"id": "ba28d5ee", + "id": "57dafb98", "metadata": {}, "source": [ "## ⏭️ Next steps\n", @@ -419,9 +427,9 @@ "\n", "Related tutorials:\n", "\n", - "- [The basics](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/): samplers and LLM text columns\n", - "- [Providing images as context](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/4-providing-images-as-context/): image-to-text with VLMs\n", - "- [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/): text-to-image generation with diffusion models\n" + "- [The basics](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics): samplers and LLM text columns\n", + "- [Providing images as context](https://docs.nvidia.com/nemo/datadesigner/tutorials/providing-images-as-context): image-to-text with VLMs\n", + "- [Generating images](https://docs.nvidia.com/nemo/datadesigner/tutorials/generating-images): text-to-image generation with diffusion models\n" ] } ], diff --git a/docs/concepts/agent-rollout-ingestion.md b/docs/concepts/agent-rollout-ingestion.md deleted file mode 100644 index 268b39b4d..000000000 --- a/docs/concepts/agent-rollout-ingestion.md +++ /dev/null @@ -1,281 +0,0 @@ -# Agent Rollout Ingestion - -`AgentRolloutSeedSource` turns existing agent rollouts into a seed dataset for synthetic data workflows. It lets you operate locally on rollout artifacts you already have on disk, then normalizes them into rows you can filter, curate, and distill into training or evaluation data. - -## Quick Start - -Use `AgentRolloutSeedSource` when you want to work from existing agent traces instead of traces captured during a Data Designer generation run. - -=== "Claude Code" - - Uses `~/.claude/projects` and `*.jsonl` by default. 
- - ```python - import data_designer.config as dd - - seed_source = dd.AgentRolloutSeedSource( - format=dd.AgentRolloutFormat.CLAUDE_CODE, - ) - ``` - -=== "Codex" - - Uses `~/.codex/sessions` and `*.jsonl` by default. - - ```python - import data_designer.config as dd - - seed_source = dd.AgentRolloutSeedSource( - format=dd.AgentRolloutFormat.CODEX, - ) - ``` - -=== "Hermes Agent" - - Uses `~/.hermes/sessions` and `*.json*` by default so CLI session logs and gateway transcripts can coexist. - - ```python - import data_designer.config as dd - - seed_source = dd.AgentRolloutSeedSource( - format=dd.AgentRolloutFormat.HERMES_AGENT, - ) - ``` - -=== "Pi Coding Agent" - - Uses `~/.pi/agent/sessions` and `*.jsonl` by default. Sessions are tree-structured JSONL files; the active conversation path is resolved automatically. - - ```python - import data_designer.config as dd - - seed_source = dd.AgentRolloutSeedSource( - format=dd.AgentRolloutFormat.PI_CODING_AGENT, - ) - ``` - -=== "ATIF" - - ATIF requires an explicit `path`. See Harbor's [ATIF documentation](https://harborframework.com/docs/trajectory-format) for the format specification. - - ```python - import data_designer.config as dd - - seed_source = dd.AgentRolloutSeedSource( - format=dd.AgentRolloutFormat.ATIF, - path="/data/harbor/runs/swe-bench/job-042", - recursive=True, - file_pattern="trajectory*.json", - ) - ``` - -You can override `path` and `file_pattern` for any format when your rollout artifacts live outside the built-in defaults. - -## Normalized Field Compatibility - -All supported rollout formats map into the same seeded row schema. In the table below, `None` means the source artifact does not expose that field directly, and `derived` means Data Designer computes it from normalized `messages`. 
- -| Normalized field | ATIF | Claude Code | Codex | Hermes Agent | Pi Coding Agent | -|---|---|---|---|---|---| -| `trace_id` | `session_id` | `sessionId[:agentId]` | `session_meta.id` or file stem | CLI `session_id` or file stem; gateway file stem | Session header `id` | -| `source_kind` | `"atif"` | `"claude_code"` | `"codex"` | `"hermes_agent"` | `"pi_coding_agent"` | -| `source_path` | Parsed `.json` path | Parsed `.jsonl` trace path | Parsed `rollout-*.jsonl` path | Parsed CLI `.json` or gateway `.jsonl` path | Parsed `.jsonl` session path | -| `root_session_id` | `session_id` | `sessionId` or file stem | `trace_id` | `trace_id` | Session header `id` | -| `agent_id` | `None` | `agentId` | `None` | `None` | `None` | -| `is_sidechain` | `False` | `isSidechain` | `False` | `False` | `False` | -| `cwd` | `agent.extra.cwd` | First non-null record `cwd` | `session_meta.cwd` | `None` | Session header `cwd` | -| `project_path` | `extra.project_path` or `cwd` | `projectPath` or `cwd` | `cwd` | `None` | Session header `cwd` | -| `git_branch` | `agent.extra.git_branch` | First non-null record `gitBranch` | `session_meta.git_branch` | `None` | `None` | -| `started_at` | Earliest step timestamp | Earliest row timestamp | `session_meta.timestamp` or earliest record timestamp | CLI `session_start`; gateway `created_at` | Earliest entry timestamp | -| `ended_at` | Latest step timestamp | Latest row timestamp | Latest record timestamp | CLI `last_updated`; gateway `updated_at` | Latest entry timestamp | -| `messages` | Normalized steps | Normalized trace rows | Normalized response items | Normalized CLI or gateway rows | Normalized active-path messages | -| `source_meta` | ATIF metadata | Claude metadata | Codex metadata | Hermes metadata | Pi session metadata | -| `message_count` | `derived` | `derived` | `derived` | `derived` | `derived` | -| `tool_call_count` | `derived` | `derived` | `derived` | `derived` | `derived` | -| `final_assistant_message` | `derived` | `derived` 
| `derived` | `derived` | `derived` | - -### Notes - -- `trace_id`: Claude Code appends `agentId` when present. Hermes uses either the CLI session ID or the gateway transcript file stem. Pi uses the session header `id`. -- `is_sidechain`: ATIF, Hermes, and Pi currently normalize this to `False`. Claude Code preserves `isSidechain` directly. -- `messages`: All formats normalize into the same chat-style message schema. See [Message Traces](traces.md) for the shared block structure. Pi sessions are tree-structured; only the active conversation path (from the last entry back to root) is included. -- `source_meta`: This is where format-specific details live, such as ATIF copied-context metadata, Claude summaries, Codex response-item types, Hermes tool/session metadata, or Pi session version and branch information. - -## Example: Summarize a Random Turn - -Because the seeded fields are normalized, you can also build lightweight summarization workflows directly from imported rollouts. This example samples one random normalized message from each trace and summarizes it in a single sentence. - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -data_designer = DataDesigner() -config_builder = dd.DataDesignerConfigBuilder( - model_configs=[ - dd.ModelConfig( - alias="trace-writer", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - ) - ] -) - -config_builder.with_seed_dataset( - dd.AgentRolloutSeedSource( - format=dd.AgentRolloutFormat.CLAUDE_CODE, - ) -) - -config_builder.add_column( - dd.ExpressionColumnConfig( - name="sampled_turn", - expr="{{ messages | random }}", - ) -) - -config_builder.add_column( - dd.LLMTextColumnConfig( - name="turn_summary", - model_alias="trace-writer", - prompt="""\ -Summarize this randomly sampled rollout turn in one sentence. -The turn may come from the user, assistant, or a tool result. 
- -Trace: {{ trace_id }} -Turn: -{{ sampled_turn }} -""", - ) -) - -preview = data_designer.preview(config_builder, num_records=3) -preview.display_sample_record() -``` - -This stays fully declarative: no custom seed reader or preprocessing step is required. Because `sampled_turn` is drawn from the normalized `messages` list, the same config works across all supported rollout formats. - -## Example: Turn Tool Interactions into a Review Dataset - -You can also explode imported rollouts into a tool-interaction dataset. This example scans normalized `messages`, emits one row per tool call and matching tool response, preserves the trace context up to that response, and then uses a structured column to label the interaction as a success, failure, or unclear outcome. - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner -from pydantic import BaseModel, Field -from typing import Literal - - -@dd.custom_column_generator( - required_columns=["messages"], - side_effect_columns=["tool_call", "tool_response", "tool_name"], -) -def explode_tool_interactions(row: dict) -> list[dict]: - rows = [] - tool_calls_by_id = {} - context_messages = [] - - for message in row["messages"]: - context_messages.append(message) - - for tool_call in message.get("tool_calls") or []: - tool_call_id = tool_call.get("id") - if tool_call_id: - tool_calls_by_id[tool_call_id] = tool_call - - if message.get("role") != "tool": - continue - - tool_call = tool_calls_by_id.get( - message.get("tool_call_id"), - { - "id": message.get("tool_call_id"), - "type": "function", - "function": {"name": "unknown", "arguments": "{}"}, - }, - ) - tool_name = tool_call.get("function", {}).get("name", "unknown") - - rows.append( - { - **row, - "tool_interaction_context": list(context_messages), - "tool_call": tool_call, - "tool_response": message, - "tool_name": tool_name, - } - ) - - return rows - - -class ToolInteractionAnalysis(BaseModel): - outcome: Literal["success", 
"failure", "unclear"] = Field( - description="Whether the tool interaction appears to have succeeded, failed, or is ambiguous." - ) - summary: str = Field( - description="One or two sentences summarizing what the tool was asked to do and what the response indicates." - ) - - -data_designer = DataDesigner() -config_builder = dd.DataDesignerConfigBuilder( - model_configs=[ - dd.ModelConfig( - alias="tool-analyst", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - ) - ] -) - -config_builder.with_seed_dataset( - dd.AgentRolloutSeedSource( - format=dd.AgentRolloutFormat.CLAUDE_CODE, - ) -) - -config_builder.add_column( - dd.CustomColumnConfig( - name="tool_interaction_context", - generator_function=explode_tool_interactions, - allow_resize=True, - ) -) - -config_builder.add_column( - dd.LLMStructuredColumnConfig( - name="tool_interaction_analysis", - model_alias="tool-analyst", - output_format=ToolInteractionAnalysis, - prompt="""\ -You are analyzing one tool interaction from an imported agent rollout. - -Context up to the tool response: -{{ tool_interaction_context }} - -Tool name: {{ tool_name }} - -Tool call: -{{ tool_call }} - -Tool response: -{{ tool_response }} - -Decide whether this interaction is a success, failure, or unclear outcome. -Then summarize what the tool was asked to do and what happened. -Base your answer on the tool call arguments, the tool response, and the immediate context. -""", - ) -) - -preview = data_designer.preview(config_builder, num_records=5) -preview.display_sample_record() -``` - -This pattern is useful when you want to curate evaluator or monitoring datasets from real traces. The resize-enabled custom column turns each tool interaction into its own record, and the structured column adds a consistent outcome label plus a grounded summary. Because the logic operates on normalized `tool_calls` and `tool` messages, the same pattern transfers across supported rollout formats. 
If your traces are long, consider adding a second custom or expression column that windows the context before sending it to a model. - -## Related Guides - -- For the general seed dataset model, see [Seed Datasets](seed-datasets.md). -- For the normalized `messages` structure used in imported rollouts, see [Message Traces](traces.md). -- For an end-to-end distillation example, see [Agent Rollout Trace Distillation](../recipes/trace_ingestion/agent_rollout_distillation.md). diff --git a/docs/concepts/architecture-and-performance.md b/docs/concepts/architecture-and-performance.md deleted file mode 100644 index 31ab502aa..000000000 --- a/docs/concepts/architecture-and-performance.md +++ /dev/null @@ -1,344 +0,0 @@ -# 🏗️ Architecture & Performance - -Data Designer is an **orchestration framework** that coordinates synthetic data generation workflows. It is a **client** of LLM inference servers—it does not host models itself. - -This guide explains the architecture, execution model, and how to tune performance for your specific use case. 
- ---- - -## Separation of Concerns - -``` -┌─────────────────────────────────────┐ ┌─────────────────────────────────────┐ -│ Data Designer │ │ Inference Server(s) │ -│ (Orchestration) │ HTTP │ (LLM Hosting) │ -│ │ ─────► │ │ -│ • Dataset workflow management │ │ • Model weights and execution │ -│ • Column dependency resolution │ │ • GPU allocation and scheduling │ -│ • Batching and parallelism │ │ • Request queuing │ -│ • Retry and error handling │ │ • Token generation │ -│ • Adaptive concurrency (AIMD) │ │ • Rate limiting (optional) │ -│ • Data validation and quality │ │ │ -└─────────────────────────────────────┘ └─────────────────────────────────────┘ - ▲ ▲ - │ │ - Your workflow Your infrastructure - configuration (or cloud API) -``` - -### What Data Designer Does - -- **Orchestrates** the generation workflow across multiple columns -- **Resolves dependencies** between columns (DAG-based execution) -- **Batches** work into manageable chunks (`buffer_size`) -- **Parallelizes** LLM calls within batches (`max_parallel_requests`) -- **Adapts to rate limits** automatically via AIMD concurrency control -- **Handles errors** with retries and early shutdown logic -- **Validates** generated data against schemas and constraints - -### What Data Designer Does NOT Do - -- **Host models**: You must provide LLM endpoints -- **Manage GPUs**: Your inference server handles GPU allocation -- **Scale inference**: You must provision sufficient capacity -- **Impose rate limits**: Your server or API gateway sets rate limits (Data Designer *reacts* to them automatically) - ---- - -## Execution Model - -!!! note "Two execution engines" - The default execution path is the **async engine**, which dispatches work at the cell level and overlaps independent columns — see [Async Engine](#async-engine) below for its semantics. The legacy **sync engine** is still available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0` and is what this section describes. 
The configuration knobs documented below (`buffer_size`, `max_parallel_requests`, AIMD throttle config, error handling) apply to both engines; the differences are flagged inline. - -The sync engine processes datasets in **batches**, with **parallel** operations within each batch. - -### How It Works (sync engine) - -**Step 1: Split into batches** - -Your dataset is divided into batches of `buffer_size` records. Each batch is processed completely before moving to the next. - -**Step 2: Process columns sequentially** - -Within a batch, columns are generated one at a time following the dependency graph. The order depends on column dependencies—expression columns may come before LLM columns if the LLM columns depend on them. (The async engine relaxes this: columns whose per-cell dependencies are satisfied can run concurrently with columns earlier in the order.) - -Example workflow: - -``` -Batch 1 (100 records) -│ -├─► Column 1: category (Sampler) ──── All 100 values generated -├─► Column 2: prompt (LLM Text) ──── All 100 values generated -├─► Column 3: response (LLM Text) ──── All 100 values generated -├─► Column 4: score (Expression) ──── All 100 values computed -│ -└─► Write batch to disk - │ - ▼ -Batch 2 (100 records) - ...repeat... -``` - -**Step 3: Generate cells in parallel** - -Within each column, cells are processed **in parallel** up to the configured limit: - -| Column Type | Parallelism Control | -|-------------|---------------------| -| Sampler | `non_inference_max_parallel_workers` | -| LLM (Text, Code, Structured, Judge) | `max_parallel_requests` | -| Expression | Sequential (fast, CPU-bound) | - -### Key Concepts - -| Concept | Description | -|---------|-------------| -| **Batching** | Records are split into batches of `buffer_size`. In the sync engine, each batch completes entirely before the next begins; in the async engine, multiple row groups (the async equivalent) can be in flight concurrently. 
| -| **Sequential columns** | Sync-engine only: columns within a batch are generated one at a time, respecting the dependency graph. The async engine schedules at the cell level instead. | -| **Parallel cells** | Within a column, individual cells (records) are generated in parallel up to the configured limit. Same on both engines. | - -### Concurrency Formula - -At any moment, the number of concurrent LLM requests is: - -```python -concurrent_requests = min( - buffer_size, # Records in current batch - current_throttle_limit, # AIMD-managed limit (≤ max_parallel_requests) - remaining_cells_in_column # Cells left to generate -) -``` - -`max_parallel_requests` sets the **ceiling**. The actual limit (`current_throttle_limit`) is managed at runtime by an AIMD (Additive Increase / Multiplicative Decrease) controller that reacts to rate-limit signals from the inference server: - -- **During optional startup ramp**: when `rampup_seconds` is greater than 0, a new throttle domain starts at one concurrent request and increases linearly toward `max_parallel_requests` over that duration. -- **On the first 429 in a burst**: the limit is reduced by a configurable factor (default: 25% reduction) and a cooldown is applied. Further 429s from already in-flight requests in the same burst do not reduce the limit again — they release their permits and hold the limit steady. -- **After consecutive successes**: the limit increases by 1 (by default) until it reaches the ceiling or a stabilized rate-limit threshold. - -This means Data Designer automatically finds the right concurrency level for your server without manual tuning. - -!!! note "Engine paths" - AIMD adaptive concurrency is fully active on the default **async engine**. The legacy **sync engine** is available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0`; on that path 429s are first retried at the HTTP transport layer and AIMD only engages as a fallback. See [Async engine](#async-engine) below. 
- -**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer can send up to 32 requests in parallel. If `rampup_seconds=30`, it starts at one request and climbs linearly toward 32 over 30 seconds. If the server returns 429s, startup ramp stops, concurrency drops automatically (e.g., to 24, then 18), and normal AIMD recovery takes over once the server catches up. - ---- - -## Configuration Parameters - -### `buffer_size` (RunConfig) - -Controls how many records are processed per batch. - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -run_config = dd.RunConfig(buffer_size=2000) - -designer = DataDesigner() -designer.set_run_config(run_config) -``` - -| Value | Memory Usage | Throughput | Error Feedback | -|-------|--------------|------------|----------------| -| **Low** (100-500) | Lower | May not saturate inference | Fast | -| **Default** (1000) | Moderate | Good for most cases | Moderate | -| **High** (2000-5000) | Higher | Better for deep pipelines | Slower | - -**When to increase**: High-capacity inference server, single-model workflows, memory not constrained - -**When to decrease**: Memory-constrained environments, development/debugging, complex multi-model pipelines - ---- - -### `max_parallel_requests` (InferenceParams) - -Sets the **maximum** concurrent LLM API calls **per model**. This is the ceiling that the AIMD throttle controller can ramp up to — the actual concurrency at runtime may be lower if the server signals rate limits. - -```python -import data_designer.config as dd - -model = dd.ModelConfig( - alias="my-model", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - inference_parameters=dd.ChatCompletionInferenceParams( - max_parallel_requests=8, - ), -) -``` - -**Default**: 4 - -**When to increase**: Your inference backend has high throughput capacity, you're using a cloud API with generous rate limits, or you're running vLLM/TensorRT-LLM with multiple GPUs. 
With AIMD, setting an aggressively high value is safer than before — the system will self-correct downward if the server can't keep up. The salvage queue on the async engine (default) reclaims failed rows; on the sync engine the initial burst of 429s before AIMD stabilizes can drop rows, so start with a more conservative ceiling if you've opted into sync. - -**When to decrease**: You want to cap resource usage to a known safe level, or you want more predictable/debuggable execution. - -!!! tip "Finding the optimal value" - The right value depends on your inference stack and model. Self-hosted vLLM servers can often handle values as high as 256, 512, or even 1024 depending on your hardware. - - With AIMD, a practical approach is to set `max_parallel_requests` to the **upper bound** you're comfortable with and let the throttle controller find the sustainable level automatically. If you see frequent 429 → recovery cycles in the logs, your ceiling is above the server's true capacity but the system is handling it. If you never see any throttle activity, you may have room to increase the ceiling further. - - **Benchmark approach**: Run a small dataset (e.g., 100 records) with increasing `max_parallel_requests` values (4 → 8 → 16 → 32 → ...) and measure generation time. Stop increasing when the runtime stops decreasing—that's when your inference server is saturated. - ---- - -### `non_inference_max_parallel_workers` (RunConfig) - -Controls thread pool size for non-LLM operations (samplers, expressions, validators). - -```python -run_config = dd.RunConfig(non_inference_max_parallel_workers=8) -designer.set_run_config(run_config) -``` - -**Default**: 4 - -**When to increase**: Many CPU-bound columns (complex expressions, heavy sampling) - ---- - -### Adaptive Throttling (RunConfig) - -Data Designer uses an AIMD (Additive Increase / Multiplicative Decrease) controller to automatically adjust concurrency per model based on rate-limit feedback from the inference server. 
The defaults work well for most workloads. Override them via `ThrottleConfig` only when you understand the trade-offs. - -!!! note "Engine paths" - Adaptive throttling is fully active on the default **async engine**, where 429 responses propagate directly to the AIMD controller. On the legacy **sync engine** (`DATA_DESIGNER_ASYNC_ENGINE=0`), 429s are first retried at the HTTP transport layer; `ThrottleConfig` settings only take effect as a fallback if transport retries are exhausted. - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -run_config = dd.RunConfig( - throttle=dd.ThrottleConfig( - reduce_factor=0.75, # Multiply limit by this on a 429 (default: 0.75) - additive_increase=1, # Add this many slots after success_window successes (default: 1) - success_window=25, # Consecutive successes before increasing (default: 25) - cooldown_seconds=2.0, # Pause after a 429 when no Retry-After header (default: 2.0) - ceiling_overshoot=0.10, # Probe 10% above observed server limit (default: 0.10) - rampup_seconds=0.0, # Optional startup ramp duration; 0 disables it (default: 0.0) - ), -) - -designer = DataDesigner() -designer.set_run_config(run_config) -``` - -| Parameter | Default | Effect | -|-----------|---------|--------| -| `reduce_factor` | 0.75 | How aggressively to cut concurrency on a 429. Lower = more aggressive. | -| `additive_increase` | 1 | Slots added per recovery step. Higher = faster ramp-up, but riskier. | -| `success_window` | 25 | Consecutive successes required before each increase step. | -| `cooldown_seconds` | 2.0 | Pause duration after a 429 (used when the server doesn't send `Retry-After`). | -| `ceiling_overshoot` | 0.10 | Fraction above the observed rate-limit ceiling the controller is allowed to probe. | -| `rampup_seconds` | 0.0 | Optional startup ramp duration. When greater than 0, domains start at one concurrent request and linearly climb to the configured ceiling unless a 429 aborts the ramp. 
| - -!!! tip "How it works in practice" - When a model endpoint returns HTTP 429, the controller reduces the concurrency limit for that model and pauses briefly. After enough consecutive successes, it begins ramping back up. If the server rate-limits again, the controller records that level as a ceiling and stabilizes just below it, with a small overshoot band to detect when the server can handle more load. - - You can observe this in the logs — look for messages like `concurrency reduced from X → Y` and `concurrency increased from X → Y`. - ---- - -### Error Handling (RunConfig) - -Control retry behavior and early shutdown for failed generations. - -```python -run_config = dd.RunConfig( - max_conversation_restarts=5, # Full conversation restarts (default: 5) - max_conversation_correction_steps=0, # In-conversation corrections (default: 0) - disable_early_shutdown=False, # Enable early shutdown (default) - shutdown_error_rate=0.5, # Shut down if >50% errors - shutdown_error_window=10, # Min tasks before error monitoring -) -designer.set_run_config(run_config) -``` - -**When to adjust**: - -- **Strict schemas**: Increase `max_conversation_restarts` to 7, add `max_conversation_correction_steps=2` -- **Debugging**: Set `disable_early_shutdown=True` to see all errors -- **Simple text**: Reduce `max_conversation_restarts` to 3 - ---- - -## Async Engine - -The async engine is the default execution path. It dispatches work at the cell level rather than the column level, so independent columns overlap in time and per-(provider, model) AIMD pools tune themselves independently. See the [Async All the Way Down](../devnotes/posts/async-all-the-way-down.md) dev note for the full architecture. - -### Per-model timeouts drive every deadline - -The `inference_parameters.timeout` field on a `ModelConfig` sets the per-request HTTP timeout. The same value also drives the sync→async bridge that custom columns use when they call `model.generate()`. 
There is no separate queue-wait deadline — waits scale with provider speed and AIMD's adaptive concurrency. Slow self-hosted endpoints (e.g. large models on a single GPU) only need this one knob raised: - -```python -import data_designer.config as dd - -config_builder.add_model_config( - dd.ModelConfig( - alias="slow-model", - model="my/slow-model", - provider="my-provider", - inference_parameters=dd.ChatCompletionInferenceParams( - timeout=600, - ), - ) -) -``` - -### Run outcomes - -A run can finish with fewer records than requested when non-retryable errors drop rows. Inspect `len(result.load_dataset())` to detect. - -If the rate of non-retryable errors crosses `RunConfig.shutdown_error_rate`, generation stops early and raises `DataDesignerEarlyShutdownError` (a subclass of `DataDesignerGenerationError`). Catch it separately when a typed retry path is appropriate: - -```python -from data_designer.interface.errors import DataDesignerEarlyShutdownError - -try: - result = dd_instance.create(config_builder, num_records=1000) -except DataDesignerEarlyShutdownError: - # e.g. retry against a different model alias - ... -``` - -### Opting out - -!!! warning "Deprecated" - `DATA_DESIGNER_ASYNC_ENGINE=0` selects the legacy sync engine. This is a deprecated escape hatch for the transitional release and will be removed in a future version. The opt-out also emits a `DeprecationWarning` at run time so it shows up in your logs. - -```bash -DATA_DESIGNER_ASYNC_ENGINE=0 python my_pipeline.py -``` - ---- - -## Common Problems - -| Problem | Symptom | Solution | -|---------|---------|----------| -| **Low throughput** | Low GPU utilization | Increase `max_parallel_requests` and/or `buffer_size`. If the throttle has self-reduced due to earlier 429s (check logs for "concurrency reduced" messages), the server may need more capacity or you can wait for AIMD recovery. 
| -| **Frequent 429 → recovery cycles** | Logs show repeated concurrency drops and ramp-ups | The `max_parallel_requests` ceiling is above the server's sustained capacity. This is handled automatically, but you can lower the ceiling to reduce the sawtooth or tune `reduce_factor` / `success_window`. | -| **Long tail of slow generations** | Most records fast, few very slow | Reduce `max_conversation_restarts`, simplify schemas, improve prompts | -| **Multi-model idle periods** | One model busy, others idle | Reduce `buffer_size` for faster cycling, or consolidate models | -| **Memory errors** | OOM crashes | Reduce `buffer_size` and `max_parallel_requests` | -| **Too many errors** | Generation fails frequently | Check prompts/schemas; adjust `shutdown_error_rate` or disable early shutdown for debugging | - ---- - -## Tuning Workflow - -1. **Start with defaults** for initial development — AIMD handles rate-limit adaptation automatically -2. **Profile your workload**: How many LLM columns? How many records? What models? -3. **Identify bottleneck**: Low GPU util → increase `max_parallel_requests` (AIMD will self-correct if you overshoot). Memory issues → decrease `buffer_size`. Long tails → tune retry settings. -4. **Check throttle logs**: Look for "concurrency reduced" / "concurrency increased" messages to understand whether rate limits are the bottleneck -5. 
**Iterate**: Make one change at a time, measure impact before next change - ---- - -## Related Documentation - -- [Deployment Options](deployment-options.md): Choosing between library and microservice -- [Model Configuration](models/model-configs.md): Complete model settings reference -- [Inference Parameters](models/inference-parameters.md): Detailed parameter reference diff --git a/docs/concepts/columns.md b/docs/concepts/columns.md deleted file mode 100644 index 45b87d174..000000000 --- a/docs/concepts/columns.md +++ /dev/null @@ -1,216 +0,0 @@ -# Columns - -Columns are the fundamental building blocks in Data Designer. Each column represents a field in your dataset and defines how to generate it—whether that's sampling from a distribution, calling an LLM, or applying a transformation. - -!!! note "The Declarative Approach" - Columns are **declarative specifications**. You describe *what* you want, and the framework handles *how* to generate it—managing execution order, batching, parallelization, and resources automatically. - -## Column Types - -Data Designer provides eleven built-in column types, each optimized for different generation scenarios. - -### 🎲 Sampler Columns - -Sampler columns generate data using numerical sampling—fast, deterministic, and ideal for numerical and categorical dataset fields. They're significantly faster than LLMs and can produce data following specific distributions (Poisson for event counts, Gaussian for measurements, etc.). 
- -Available sampler types: - -- **UUID**: Unique identifiers -- **Category**: Categorical values with optional probability weights -- **Subcategory**: Hierarchical categorical data (states within countries, models within brands) -- **Uniform**: Evenly distributed numbers (integers or floats) -- **Gaussian**: Normally distributed values with configurable mean and standard deviation -- **Bernoulli**: Binary outcomes with specified success probability -- **Bernoulli Mixture**: Binary outcomes from multiple probability components -- **Binomial**: Count of successes in repeated trials -- **Poisson**: Count data and event frequencies -- **Scipy**: Access to the full scipy.stats distribution library -- **Person**: Realistic synthetic individuals with names, demographics, and attributes -- **Datetime**: Timestamps within specified ranges -- **Timedelta**: Time duration values - -!!! tip "Conditional Sampling" - Samplers support **conditional parameters** that change behavior based on other columns. Want age distributions that vary by country? Income ranges that depend on occupation? Just define conditions on existing column values. - -### 📝 LLM-Text Columns - -LLM-Text columns generate natural language text: product descriptions, customer reviews, narrative summaries, email threads, or anything requiring semantic understanding and creativity. - -Use **Jinja2 templating** in prompts to reference other columns. Data Designer automatically manages dependencies and injects the referenced column values into the prompt. - -!!! note "Generation Traces" - LLM columns can optionally capture message traces in a separate `{column_name}__trace` column. Set `with_trace` on the column config to control what's captured: `TraceType.NONE` (default, no trace), `TraceType.LAST_MESSAGE` (final assistant message only), or `TraceType.ALL_MESSAGES` (full conversation history). 
The trace includes the ordered message history for the final generation attempt (system/user/assistant/tool calls/tool results), and may include model reasoning fields when the provider exposes them. - -!!! tip "Extracting Reasoning Content" - Some models expose chain-of-thought reasoning separately from the main response via a `reasoning_content` field. To capture only this reasoning (without the full trace), set `extract_reasoning_content=True`: - - ```python - dd.LLMTextColumnConfig( - name="answer", - model_alias="reasoning-model", - prompt="Solve this problem: {{ problem }}", - extract_reasoning_content=True, # Creates answer__reasoning_content column - ) - ``` - - This creates a `{column_name}__reasoning_content` column containing the stripped reasoning content from the final assistant response, or `None` if the model didn't provide reasoning. This is independent of `with_trace`—you can use either or both. - -!!! tip "Tool Use in LLM Columns" - LLM columns can invoke external tools during generation via MCP (Model Context Protocol). Enable tools by setting `tool_alias` to reference a configured `ToolConfig`: - - ```python - dd.LLMTextColumnConfig( - name="answer", - model_alias="nvidia-text", - prompt="Search for information and answer: {{ question }}", - tool_alias="search-tools", # References a ToolConfig - with_trace=dd.TraceType.ALL_MESSAGES, # Capture tool call history - ) - ``` - - When `tool_alias` is set, the model can request tool calls during generation. Data Designer executes the tools via configured MCP providers and feeds results back until the model produces a final answer. See [Tool Use & MCP](tool_use_and_mcp.md) for full configuration details. - -!!! tip "Performance" - LLM columns are parallelized within each batch using `max_parallel_requests` from your model's inference parameters. See the [Architecture & Performance](architecture-and-performance.md) guide for optimization strategies. 
- -### 💻 LLM-Code Columns - -LLM-Code columns generate code in specific programming languages. They handle the prompting and parsing necessary to extract clean code from the LLM's response—automatically detecting and extracting code from markdown blocks. You provide the prompt and choose the model; the column handles the extraction. - -Supported languages: **Bash, C, C++, C#, COBOL, Go, Java, JavaScript, Kotlin, Python, Ruby, Rust, Scala, Swift, TypeScript**, plus **SQL** dialects (SQLite, PostgreSQL, MySQL, T-SQL, BigQuery, ANSI SQL). - -### 🗂️ LLM-Structured Columns - -LLM-Structured columns generate JSON with a *guaranteed schema*. Define your structure using a Pydantic model or JSON schema, and Data Designer ensures the LLM output conforms—no parsing errors, no schema drift. - -Use for complex nested structures: API responses, configuration files, database records with multiple related fields, or any structured data where type safety matters. Schemas can be arbitrarily complex with nested objects, arrays, enums, and validation constraints, but success depends on the model's capabilities. - -!!! tip "Schema Complexity and Model Choice" - Flat schemas with simple fields are easier and more robustly produced across models. Deeply nested schemas with complex validation constraints are more sensitive to model choice—stronger models handle complexity better. If you're experiencing schema conformance issues, try simplifying the schema or switching to a more capable model. - -### ⚖️ LLM-Judge Columns - -LLM-Judge columns score generated content across multiple quality dimensions using LLMs as evaluators. - -Define scoring rubrics (relevance, accuracy, fluency, helpfulness) and the judge model evaluates each record. Score rubrics specify criteria and scoring options (1-5 scales, categorical grades, etc.), producing quantified quality metrics for every data point. 
- -Use judge columns for data quality filtering (e.g., keep only 4+ rated responses), A/B testing generation strategies, and quality monitoring over time. - -### 🖼️ Image Columns - -Image columns generate images from text prompts using either **diffusion** models (DALL·E, Stable Diffusion, Imagen) or **autoregressive** models (Gemini image, GPT image). - -Use **Jinja2 templating** in the prompt to reference other columns, driving diversity across generated images. For example, reference sampled attributes like style, subject, and composition to produce varied images without manually writing different prompts. - -Image columns require a model configured with `ImageInferenceParams`. Model-specific options (size, quality, aspect ratio) are passed via `extra_body` in the inference parameters. - -**Output modes:** - -- **Preview** (`data_designer.preview()`): Images are stored as base64-encoded strings directly in the DataFrame for quick iteration -- **Create** (`data_designer.create()`): Images are saved to disk in an `images//` folder with UUID filenames; the DataFrame stores relative paths - -Image columns also support `multi_modal_context` for autoregressive models that accept image inputs, enabling image-to-image generation workflows. - -!!! tip "Tutorials" - The image tutorials cover three workflows: [Providing Images as Context](../notebooks/4-providing-images-as-context.ipynb) (image → text), [Generating Images](../notebooks/5-generating-images.ipynb) (text → image), and [Editing Images with Image Context](../notebooks/6-editing-images-with-image-context.ipynb) (image → image). - -### 🧬 Embedding Columns - -Embedding columns generate vector embeddings (numerical representations) for text content using embedding models. These embeddings capture semantic meaning, enabling similarity search, clustering, and semantic analysis. - -Specify a `target_column` containing text, and Data Designer generates embeddings for that content. 
The target column can contain either a single text string or a list of text strings in stringified JSON format. In the latter case, embeddings are generated for each text string in the list. - -Common use cases: - -- **Semantic search**: Generate embeddings for documents, then find similar content by vector similarity -- **Clustering**: Group similar texts based on embedding proximity -- **Recommendation systems**: Match content by semantic similarity -- **Anomaly detection**: Identify outliers in embedding space - -!!! note "Embedding Models" - Embedding columns require an embedding model configured with `EmbeddingInferenceParams`. These models differ from chat completion models—they output vectors rather than text. The generation type is automatically determined by the inference parameters type. - -### 🧩 Expression Columns - -Expression columns handle simple transformations using **Jinja2 templates**—concatenate first and last names, calculate numerical totals, format date strings. No LLM overhead needed. - -Template capabilities: - -- **Variable substitution**: Pull values from any existing column -- **String filters**: Uppercase, lowercase, strip whitespace, replace patterns -- **Conditional logic**: if/elif/else support -- **Arithmetic**: Add, subtract, multiply, divide - -### 🔍 Validation Columns - -Validation columns check generated content against rules and return structured pass/fail results. - -Built-in validation types: - -**Code validation** runs Python or SQL code through a linter to validate the code. - -**Local callable validation** accepts a Python function directly when using Data Designer as a library. - -**Remote validation** sends data to HTTP endpoints for validation-as-a-service. Useful for linters, security scanners, or proprietary systems. - -### 🌱 Seed Dataset Columns - -Seed dataset columns bootstrap generation from existing data. Provide a real dataset, and those columns become available as context for generating new synthetic data. 
- -Typical pattern: use seed data for one part of your schema (real product names and categories), then generate synthetic fields around it (customer reviews, purchase histories, ratings). The seed data provides realism and constraints; generated columns add volume and variation. - -### 🔧 Custom Columns - -Custom columns let you implement your own generation logic using Python functions. Use the `@custom_column_generator` decorator to declare dependencies, and the framework handles DAG ordering and parallelization. - -Two generation strategies: - -- **`cell_by_cell`** (default): Function receives one row, framework parallelizes -- **`full_column`**: Function receives entire DataFrame for vectorized operations - -For LLM access, declare `model_aliases` in the decorator and receive a `models` dict as the third argument. See [Custom Columns](custom_columns.md) for details. - -## Shared Column Properties - -Every column configuration inherits from `SingleColumnConfig` with these standard properties: - -### `name` - -The column's identifier—unique within your configuration, used in Jinja2 references, and becomes the column name in the output DataFrame. Choose descriptive names: `user_review` > `col_17`. - -### `drop` - -Boolean flag (default: `False`) controlling whether the column appears in final output. Setting `drop=True` generates the column (available as a dependency) but excludes it from final output. - -**When to drop columns:** - -- Intermediate calculations that feed expressions but aren't meaningful standalone -- Context columns used only for LLM prompt templates -- Validation results during development unwanted in production - -Dropped columns participate fully in generation and the dependency graph—just filtered out at the end. - -### `column_type` - -Literal string identifying the column type: `"sampler"`, `"llm-text"`, `"expression"`, etc. Set automatically by each configuration class and serves as Pydantic's discriminator for deserialization. 
- -You rarely set this manually—instantiating `LLMTextColumnConfig` automatically sets `column_type="llm-text"`. Serialization is reversible: save to YAML, load later, and Pydantic reconstructs the exact objects. - -### `required_columns` - -Computed property listing columns that must be generated before this one. The framework derives this automatically: - -- For LLM/Expression columns: extracted from Jinja2 template `{{ variables }}` -- For Validation columns: explicitly listed target columns -- For Sampler columns with conditional parameters: columns referenced in conditions - -You read this property for introspection but never set it—always computed from configuration details. - -### `side_effect_columns` - -Computed property listing columns created implicitly alongside the primary column. Currently, only LLM columns produce side effects: - -- `{name}__trace`: Created when `with_trace` is not `TraceType.NONE` on the column. -- `{name}__reasoning_content`: Created when `extract_reasoning_content=True` on the column. - -For detailed information on each column type, refer to the [column configuration code reference](../code_reference/config/column_configs.md). diff --git a/docs/concepts/custom_columns.md b/docs/concepts/custom_columns.md deleted file mode 100644 index 3d9ae3954..000000000 --- a/docs/concepts/custom_columns.md +++ /dev/null @@ -1,195 +0,0 @@ -# Custom Columns - -Custom columns let you implement your own generation logic using Python functions. Use them for multi-step LLM workflows, external API integration, or any scenario requiring full programmatic control. For reusable, distributable components, see [Plugins](../plugins/overview.md) instead. - -## Quick Start - -```python -import data_designer.config as dd - -@dd.custom_column_generator(required_columns=["name"]) -def create_greeting(row: dict) -> dict: - row["greeting"] = f"Hello, {row['name']}!" 
- return row - -config_builder.add_column( - dd.CustomColumnConfig( - name="greeting", - generator_function=create_greeting, - ) -) -``` - -## Function Signatures - -Three signatures are supported. **Parameter names are validated**: - -| Args | Signature | Use Case | -|------|-----------|----------| -| 1 | `fn(row) -> dict` | Simple transforms | -| 2 | `fn(row, generator_params) -> dict` | With typed params | -| 3 | `fn(row, generator_params, models) -> dict` | LLM access via models dict | - -For `full_column` strategy, use `df` instead of `row`. - -For LLM access without params, use `generator_params: None`: - -```python -@dd.custom_column_generator(required_columns=["name"], model_aliases=["my-model"]) -def generate_message(row: dict, generator_params: None, models: dict) -> dict: - response, _ = models["my-model"].generate(prompt=f"Greet {row['name']}") - row["greeting"] = response - return row -``` - -Model aliases are validated before generation starts. If an alias doesn't exist in your config, an error is raised during the health check. - -## Generation Strategies - -| Strategy | Input | Use Case | -|----------|-------|----------| -| `cell_by_cell` (default) | `row: dict` | LLM calls, row-by-row logic | -| `full_column` | `df: DataFrame` | Vectorized DataFrame operations | - -**Recommendation:** Use `cell_by_cell` for LLM calls. The framework handles parallelization automatically. Use `full_column` only for vectorized operations that don't involve LLM calls. - -For `full_column`, set `generation_strategy=dd.GenerationStrategy.FULL_COLUMN`. - -!!! note "Concurrent dispatch" - Sync `cell_by_cell` generators are dispatched concurrently across rows under the async engine. Module-level mutable state (counters, caches, non-thread-safe HTTP clients) needs synchronization or per-row instantiation. For network-bound work, prefer `async def fn(row)` — the engine runs it directly on its event loop and skips the thread bridge. 
- -## The Decorator - -```python -@dd.custom_column_generator( - required_columns=["col1"], # DAG ordering - side_effect_columns=["extra"], # Additional columns created - model_aliases=["model1"], # Required for LLM access -) -``` - -## Models Dict - -The third argument is a dict of `ModelFacade` instances, keyed by alias. **You must declare all models required in your custom column generator in `model_aliases`** - this populates the `models` dict and enables health checks before generation starts. - -```python -@dd.custom_column_generator(model_aliases=["my-model"]) -def my_generator(row: dict, generator_params: None, models: dict) -> dict: - model = models["my-model"] - response, trace = model.generate( - prompt="...", - parser=my_custom_parser, # optional, defaults to identity - system_prompt="...", - max_correction_steps=3, - ) - row["result"] = response - return row -``` - -This gives you direct access to all `ModelFacade` capabilities: custom parsers, correction loops, structured output, tool use, etc. 
- -## Configuration - -| Parameter | Type | Required | Description | -|-----------|------|----------|-------------| -| `name` | str | Yes | Column name | -| `generator_function` | Callable | Yes | Decorated function | -| `generation_strategy` | GenerationStrategy | No | `CELL_BY_CELL` or `FULL_COLUMN` | -| `generator_params` | BaseModel | No | Typed params passed to function | -| `allow_resize` | bool | No | Allow 1:N or N:1 generation | - -### Resizing (1:N and N:1) - -**FULL_COLUMN:** Set `allow_resize=True` and return a DataFrame with more or fewer rows than the input: - -```python -@dd.custom_column_generator( - required_columns=["topic"], - side_effect_columns=["variation_id"], -) -def expand_topics(df: pd.DataFrame, generator_params: None, models: dict) -> pd.DataFrame: - rows = [] - for _, row in df.iterrows(): - for i in range(3): # Generate 3 variations per input - rows.append({ - "topic": row["topic"], - "question": f"Question {i+1} about {row['topic']}", - "variation_id": i, - }) - return pd.DataFrame(rows) - -dd.CustomColumnConfig( - name="question", - generator_function=expand_topics, - generation_strategy=dd.GenerationStrategy.FULL_COLUMN, - allow_resize=True, -) -``` - -**CELL_BY_CELL:** With `allow_resize=True`, your function may return a single row (`dict`) or multiple rows (`list[dict]`). Return `[]` to drop that input row.
- -```python -@dd.custom_column_generator(required_columns=["id"]) -def expand_row(row: dict) -> list[dict]: - return [ - {**row, "variant": "a"}, - {**row, "variant": "b"}, - ] - -dd.CustomColumnConfig( - name="variant", - generator_function=expand_row, - generation_strategy=dd.GenerationStrategy.CELL_BY_CELL, - allow_resize=True, -) -``` - -Use cases: - -- **Expansion (1:N)**: Generate multiple variations per input -- **Retraction (N:1)**: Filter, aggregate, or deduplicate records (FULL_COLUMN) or return `[]` per row (CELL_BY_CELL) - -## Multi-Turn Example - -```python -@dd.custom_column_generator( - required_columns=["topic"], - side_effect_columns=["draft", "critique"], - model_aliases=["writer", "editor"], -) -def writer_editor(row: dict, generator_params: None, models: dict) -> dict: - draft, _ = models["writer"].generate(prompt=f"Write about '{row['topic']}'") - critique, _ = models["editor"].generate(prompt=f"Critique: {draft}") - revised, _ = models["writer"].generate(prompt=f"Revise based on: {critique}\n\nOriginal: {draft}") - - row["final_text"] = revised - row["draft"] = draft - row["critique"] = critique - return row -``` - -## Development Testing - -Test generators with real LLM calls without running the full pipeline: - -```python -data_designer = DataDesigner() -models = data_designer.get_models(["my-model"]) -result = my_generator({"name": "Alice"}, None, models) -``` - -In unit tests that mock model clients, use `MagicMock(spec=ModelFacade)` so async methods are auto-detected: - -```python -from unittest.mock import MagicMock -from data_designer.engine.models.facade import ModelFacade - -mock_model = MagicMock(spec=ModelFacade) -``` - -Mocking only `generate()` will silently no-op under the async engine because the bridge routes through `agenerate()`. 
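The effect of `spec=` can be checked with the standard library alone. The sketch below uses a hypothetical `StandInFacade` class in place of `ModelFacade`, so it runs without Data Designer installed; the `(response, trace)` return shape follows the `generate()` examples earlier on this page, and assuming that `agenerate()` returns the same shape is a guess made for illustration.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


class StandInFacade:
    """Hypothetical stand-in for ModelFacade with one sync and one async method."""

    def generate(self, prompt: str):
        raise NotImplementedError

    async def agenerate(self, prompt: str):
        raise NotImplementedError


mock_model = MagicMock(spec=StandInFacade)

# With a spec, unittest.mock creates an AsyncMock for each `async def` method
# and a regular MagicMock for each sync method.
agenerate_is_async = isinstance(mock_model.agenerate, AsyncMock)
generate_is_async = isinstance(mock_model.generate, AsyncMock)

# Configure the async path, since that is the one the bridge is described as using.
mock_model.agenerate.return_value = ("hello", [])
response, trace = asyncio.run(mock_model.agenerate(prompt="Greet Alice"))
```

Setting `return_value` on the async method is the part that matters for tests running under the async engine.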
- -## See Also - -- [Column Configs Reference](../code_reference/config/column_configs.md) -- [Plugins Overview](../plugins/overview.md) diff --git a/docs/concepts/deployment-options.md b/docs/concepts/deployment-options.md deleted file mode 100644 index 35e325e2f..000000000 --- a/docs/concepts/deployment-options.md +++ /dev/null @@ -1,186 +0,0 @@ -# Deployment Options: Library vs. Microservice - -Data Designer is available as both an **open-source library** and a **NeMo Microservice**. This guide helps you choose the right deployment option for your use case. - -## Deployment Architectures at a Glance - -Data Designer supports three main deployment patterns: - -
- -- **Library + Your LLM Provider** - - --- - - Each user runs the library locally and connects to their choice of LLM provider. - - ![Library with Decentralized Providers](../images/deployment-library-decentralized.png) - -- **Library + Enterprise Gateway** - - --- - - Users run the library locally but share a centralized enterprise LLM gateway with RBAC and governance. - - ![Library with Enterprise Gateway](../images/deployment-enterprise-gateway.png) - -- **SDG as a Service (Microservice)** - - --- - - A centralized SDG service that multiple users access via REST API. - - ![SDG Microservice](../images/deployment-microservice.png) - -
- -## Quick Comparison - -| Aspect | Open-Source Library | NeMo Microservice | -|--------|---------------------|-------------------| -| **What it is** | Python package you import and run | REST API service exposing `preview` and `create` methods | -| **Best for** | Developers with LLM access who want flexibility and customization | Teams using NeMo Microservices platform | -| **LLM Access** | You provide (any OpenAI-compatible API) | Integrated with NeMo Microservices Platform | -| **Installation** | `pip install data-designer` | Deploy via NeMo Microservices platform | -| **Scaling** | You manage inference capacity | Managed alongside other NeMo services | - -!!! success "Same Configuration API" - Both the library and microservice use the **same `DataDesignerConfigBuilder` API**. Start with the library, and your configurations migrate seamlessly if you later adopt the NeMo platform. - -## 📦 When to Use the Open-Source Library - -The library is the right choice for most users. Choose it if you: - -### You Have Access to LLMs - -![Library with Decentralized Providers](../images/deployment-library-decentralized.png){ align=right width="350" } - -You have API keys or endpoints for LLM inference: - -- **Cloud APIs**: NVIDIA API Catalog (build.nvidia.com), OpenAI, Azure OpenAI, Anthropic -- **Self-hosted**: vLLM, TGI, TensorRT-LLM, or any OpenAI-compatible server -- **Enterprise gateways**: Centralized LLM gateway with RBAC, rate limiting, or other enterprise features - -```python -from data_designer.interface import DataDesigner -from data_designer.config import ModelConfig - -# Use any OpenAI-compatible endpoint -model = ModelConfig( - alias="my-model", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", # or "openai", or a custom ModelProvider -) - -dd = DataDesigner() -# Your code controls the full workflow -``` - -### You Need Maximum Flexibility - -- **Custom plugins**: Extend Data Designer with custom column generators, seed readers, or processors -- 
**Local development**: Rapid iteration with immediate feedback -- **Integration**: Embed Data Designer into existing Python pipelines or notebooks -- **Experimentation**: Research workflows with custom models or configurations - -### You Already Have Enterprise LLM Infrastructure - -![Library with Enterprise Gateway](../images/deployment-enterprise-gateway.png){ align=right width="350" } - -!!! tip "Library + Enterprise LLM Gateway" - Many enterprises already have centralized LLM access through API gateways with: - - - Role-based access control (RBAC) - - Rate limiting and quotas - - Audit logging - - Cost allocation - - In this case, **use the library** and point it at your enterprise gateway. You get enterprise-grade LLM access while retaining full control over your Data Designer workflows. - -```python -from data_designer.config import ModelConfig, ModelProvider - -# Define your enterprise gateway as a provider -enterprise_provider = ModelProvider( - name="enterprise-gateway", - endpoint="https://llm-gateway.yourcompany.com/v1", - api_key="ENTERPRISE_LLM_KEY", # Environment variable name (uppercase) or actual key -) - -# Use the provider in your model config -model = ModelConfig( - alias="enterprise-llm", - model="gpt-4", - provider="enterprise-gateway", # References the provider above -) -``` - ---- - -## ☁️ When to Use the Microservice - -![SDG Microservice](../images/deployment-microservice.png){ align=right width="350" } - -The NeMo Microservice exposes Data Designer's `preview` and `create` methods as REST API endpoints. 
Choose it if you: - -### You're Using the NeMo Microservices Platform - -The primary value of the microservice is **integration with other NeMo Microservices**: - -- **NeMo Inference Microservices (NIMs)**: Seamless integration with NVIDIA's optimized inference endpoints -- **NeMo Customizer**: Generate synthetic data for model fine-tuning workflows -- **NeMo Evaluator**: Create evaluation datasets alongside model assessment -- **Unified deployment**: Single platform for your entire AI pipeline - - -### You Want to Expose SDG as a Team Service - -If you need to provide synthetic data generation as a shared service: - -- **Multi-tenant access**: Multiple teams submit generation jobs via API -- **Job management**: Queue, monitor, and manage generation jobs centrally -- **Resource sharing**: Shared infrastructure for SDG workloads - -When users can submit configs containing Jinja templates to a shared engine, template rendering becomes a remote code execution concern and part of your security boundary. See [Security](security.md) for guidance on when to keep the default `JinjaRenderingEngine.SECURE` mode. - ---- - -## 🧭 Decision Flowchart -
-```
-                    ┌─────────────────────────┐
-                    │ Are you using the NeMo  │
-                    │ Microservices platform? │
-                    └───────────┬─────────────┘
-                                │
-                    ┌───────────┴───────────┐
-                    ▼                       ▼
-                   YES                      NO
-                    │                       │
-                    ▼                       ▼
-        ┌───────────────────┐   ┌───────────────────────────┐
-        │ Use Microservice  │   │ Do you need to expose SDG │
-        │                   │   │ as a shared REST service? │
-        │ Integrates with   │   └─────────────┬─────────────┘
-        │ NIMs, Customizer, │                 │
-        │ Evaluator         │     ┌───────────┴───────────┐
-        └───────────────────┘     ▼                       ▼
-                                 YES                      NO
-                                  │                       │
-                                  ▼                       ▼
-                      ┌─────────────────────┐   ┌─────────────────┐
-                      │ Consider if the     │   │ Use the Library │
-                      │ overhead is worth   │   │                 │
-                      │ it vs. library +    │   │ Most flexible   │
-                      │ enterprise gateway  │   │ option for      │
-                      └─────────────────────┘   │ direct use      │
-                                                └─────────────────┘
-```
- ---- - -## Learn More - -- **Library**: Continue with this documentation -- **Microservice**: See the [NeMo Data Designer Microservice documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html){target="_blank"} -- **Security model**: See [Security](security.md) diff --git a/docs/concepts/mcp/configure-mcp-cli.md b/docs/concepts/mcp/configure-mcp-cli.md deleted file mode 100644 index 72b357e99..000000000 --- a/docs/concepts/mcp/configure-mcp-cli.md +++ /dev/null @@ -1,135 +0,0 @@ -# Configuring MCP Using the CLI - -The Data Designer CLI provides an interactive interface for creating and managing MCP providers and tool configurations stored in your Data Designer home directory (default: `~/.data-designer/`). - -## Configuration Files - -The CLI manages two YAML configuration files for MCP: - -- **`mcp_providers.yaml`**: MCP provider configurations -- **`tool_configs.yaml`**: Tool configurations - -!!! note "Custom Directory" - You can customize the configuration directory location with the `DATA_DESIGNER_HOME` environment variable: - ```bash - export DATA_DESIGNER_HOME="/path/to/your/custom/directory" - ``` - -## CLI Commands - -The Data Designer CLI provides commands for MCP configuration: - -```bash -# Configure MCP providers -data-designer config mcp - -# Configure tool configs -data-designer config tools - -# List all configurations (including MCP) -data-designer config list -``` - -!!! tip "Getting help" - See available commands: - ```bash - data-designer config --help - ``` - -## Configuring MCP Providers - -Run the interactive MCP provider configuration command: - -```bash -data-designer config mcp -``` - -### Provider Type Selection - -The wizard first asks you to choose a provider type: - -1. 
**Remote SSE**: Connect to a pre-existing MCP server via HTTP Server-Sent Events -2. **Remote Streamable HTTP**: Connect to a pre-existing MCP server via Streamable HTTP -3. **Local stdio subprocess**: Launch an MCP server as a subprocess - -### Remote SSE / Streamable HTTP Configuration - -When configuring a remote provider (SSE or Streamable HTTP), you'll be prompted for: - -- **Name**: Unique identifier (e.g., `"doc-search"`) -- **Endpoint**: Server endpoint URL (e.g., `"http://localhost:8080/sse"` or `"https://mcp.example.com/mcp"`) -- **API Key**: Optional API key or environment variable name - -### Local Stdio Configuration - -When configuring a Local stdio provider, you'll be prompted for: - -- **Name**: Unique identifier (e.g., `"local-tools"`) -- **Command**: Executable to run (e.g., `"python"`) -- **Arguments**: Command-line arguments (e.g., `"-m my_mcp_server"`) -- **Environment Variables**: Optional environment variables for the subprocess - -### Available Operations - -- **Add a new provider**: Define a new MCP provider -- **Update an existing provider**: Modify provider settings -- **Delete a provider**: Remove a provider -- **Delete all providers**: Remove all MCP providers - -## Configuring Tool Configs - -Run the interactive tool configuration command: - -```bash -data-designer config tools -``` - -!!! info "Provider Required" - You need at least one MCP provider configured before adding tool configs. Run `data-designer config mcp` first if none exist. 
- -### Configuration Options - -When creating a tool config, you'll be prompted for: - -- **Tool Alias**: Unique name for referencing in columns (e.g., `"my-tools"`) -- **Providers**: Select one or more MCP providers (checkbox selection) -- **Allowed Tools**: Optionally restrict to specific tools (leave empty for all) -- **Max Tool Call Turns**: Maximum tool-calling iterations (default: 5) -- **Timeout**: Per-call timeout in seconds (default: 60.0) - -### Available Operations - -- **Add a new tool config**: Define a new tool configuration -- **Update an existing tool config**: Modify settings -- **Delete a tool config**: Remove a tool configuration -- **Delete all tool configs**: Remove all tool configurations - -## Listing Configurations - -View all current configurations: - -```bash -data-designer config list -``` - -This command displays: - -- **Model Providers**: All configured model providers -- **Model Configurations**: All configured models -- **MCP Providers**: All configured MCP providers with their endpoints -- **Tool Configurations**: All configured tool configs with their settings - -## Manual Editing - -You can also edit the YAML files directly for advanced configurations. The files are located at: - -- `~/.data-designer/mcp_providers.yaml` -- `~/.data-designer/tool_configs.yaml` - -After manual edits, the changes take effect the next time you initialize `DataDesigner`. 
- -## See Also - -- **[MCP Providers](mcp-providers.md)**: Learn about provider configuration options -- **[Tool Configurations](tool-configs.md)**: Learn about tool config options -- **[Configure Model Settings with the CLI](../models/configure-model-settings-with-the-cli.md)**: CLI guide for model configuration diff --git a/docs/concepts/mcp/enabling-tools.md b/docs/concepts/mcp/enabling-tools.md deleted file mode 100644 index e343cc7dc..000000000 --- a/docs/concepts/mcp/enabling-tools.md +++ /dev/null @@ -1,105 +0,0 @@ -# Enabling Tools on Columns - -This guide explains how to enable tool use on LLM columns by connecting them to tool configurations via the `tool_alias` parameter. - -## Overview - -To enable tool use on an LLM column, you reference a `ToolConfig` by its alias. During generation, the model can then request tool calls, and Data Designer executes them and feeds the results back to the model. - -## Using tool_alias - -Add the `tool_alias` parameter to any supported LLM column configuration: - -```python -import data_designer.config as dd - -builder.add_column( - dd.LLMTextColumnConfig( - name="answer", - prompt="Use tools as needed to answer: {{ question }}", - model_alias="nvidia-text", - tool_alias="my-tools", # References a ToolConfig - ) -) -``` - -## Supported Column Types - -Tool use is supported on these column configuration types: - -| Column Type | Description | -|------------|-------------| -| `LLMTextColumnConfig` | Text generation with tool access | -| `LLMCodeColumnConfig` | Code generation with tool access | -| `LLMStructuredColumnConfig` | Structured JSON generation with tool access | -| `LLMJudgeColumnConfig` | Judge/scoring with tool access | - -## How It Works - -When `tool_alias` is specified: - -1. **Tool schemas are fetched** from the referenced MCP providers -2. **Model receives tool schemas** with the prompt -3. **Model can request tool calls** in its response -4. 
**Data Designer executes calls** and returns results to the model -5. **Iteration continues** until the model produces a final answer (or limits are reached) - -## Complete Example - -Here's a complete workflow showing provider → ToolConfig → column: - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -# 1. Configure MCP provider -mcp_provider = dd.LocalStdioMCPProvider( - name="demo-mcp", - command="python", - args=["-m", "my_mcp_server"], -) - -# 2. Create DataDesigner instance with provider -data_designer = DataDesigner(mcp_providers=[mcp_provider]) - -# 3. Define tool configuration -tool_config = dd.ToolConfig( - tool_alias="my-tools", - providers=["demo-mcp"], - allow_tools=["search_docs", "get_fact"], - max_tool_call_turns=5, -) - -# 4. Create config builder with tool config -builder = dd.DataDesignerConfigBuilder(tool_configs=[tool_config]) - -# 5. Add columns that use tools -builder.add_column( - dd.SamplerColumnConfig( - name="question", - sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams( - values=["What is machine learning?", "Explain neural networks"] - ), - ) -) - -builder.add_column( - dd.LLMTextColumnConfig( - name="answer", - prompt="Use the available tools to research and answer: {{ question }}", - model_alias="nvidia-text", - tool_alias="my-tools", # Enable tools - with_trace=dd.TraceType.ALL_MESSAGES, # Capture tool call history - ) -) - -# 6. 
Generate data -results = data_designer.preview(builder, num_records=5) -``` - -## See Also - -- **[Tool Configurations](tool-configs.md)**: Configure tool access and limits -- **[Traces](../traces.md)**: Capture and inspect tool call history -- **[MCP Providers](mcp-providers.md)**: Configure MCP server connections diff --git a/docs/concepts/mcp/mcp-providers.md b/docs/concepts/mcp/mcp-providers.md deleted file mode 100644 index 0c03ded91..000000000 --- a/docs/concepts/mcp/mcp-providers.md +++ /dev/null @@ -1,159 +0,0 @@ -# MCP Providers - -MCP providers are external services that host and serve tools via the Model Context Protocol (MCP). Data Designer uses provider configurations to establish connections to these services. - -## Overview - -An MCP provider defines how Data Designer connects to a tool server. Data Designer supports two provider types: - -| Provider Class | Connection Method | Use Case | -|---------------|-------------------|----------| -| `MCPProvider` | SSE or Streamable HTTP | Connect to a pre-existing MCP server | -| `LocalStdioMCPProvider` | Subprocess via stdin/stdout | Launch an MCP server as a subprocess | - -When you create a `ToolConfig`, you reference providers by name, and Data Designer uses those provider settings to communicate with the appropriate MCP servers. - -## MCPProvider (Remote) - -Use `MCPProvider` to connect to a pre-existing MCP server. 
Both SSE (Server-Sent Events) and Streamable HTTP transports are supported: - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -# SSE transport (default) -sse_provider = dd.MCPProvider( - name="remote-mcp", - endpoint="http://localhost:8080/sse", - api_key="MCP_API_KEY", # Environment variable name -) - -# Streamable HTTP transport -http_provider = dd.MCPProvider( - name="remote-tools", - endpoint="https://mcp.example.com/mcp", - api_key="MCP_API_KEY", - provider_type="streamable_http", -) - -data_designer = DataDesigner(mcp_providers=[sse_provider, http_provider]) -``` - -### MCPProvider Fields - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `name` | `str` | Yes | Unique identifier for the provider | -| `endpoint` | `str` | Yes | Endpoint URL for the remote MCP server | -| `api_key` | `str` | No | API key or environment variable name | -| `provider_type` | `str` | No | Transport type: `"sse"` (default) or `"streamable_http"` | - -## LocalStdioMCPProvider (Subprocess) - -Use `LocalStdioMCPProvider` to launch an MCP server as a subprocess: - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -mcp_provider = dd.LocalStdioMCPProvider( - name="demo-mcp", - command="python", - args=["-m", "my_mcp_server_module"], - env={"MY_SERVICE_TOKEN": "..."}, -) - -data_designer = DataDesigner(mcp_providers=[mcp_provider]) -``` - -### LocalStdioMCPProvider Fields - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `name` | `str` | Yes | Unique identifier for the provider | -| `command` | `str` | Yes | Executable to run (e.g., `"python"`, `"node"`) | -| `args` | `list[str]` | No | Command-line arguments | -| `env` | `dict[str, str]` | No | Environment variables for the subprocess | -| `provider_type` | `str` | No | Always `"stdio"` (set automatically) | - -## API Key Configuration - -The `api_key` field can 
be specified in two ways: - -1. **Environment variable name** (recommended): Set `api_key` to the name of an environment variable (e.g., `"MCP_API_KEY"`). Data Designer will resolve it at runtime. - -2. **Plain-text value**: Set `api_key` to the actual API key string. This is less secure and not recommended for production. - -```python -# Method 1: Environment variable (recommended) -provider = dd.MCPProvider( - name="secure-mcp", - endpoint="https://mcp.example.com/sse", - api_key="MCP_API_KEY", # Will be resolved from environment -) - -# Method 2: Direct value (not recommended) -provider = dd.MCPProvider( - name="secure-mcp", - endpoint="https://mcp.example.com/sse", - api_key="actual-api-key-value", -) -``` - -## YAML Configuration - -Both provider types use a `provider_type` discriminator field in YAML configurations. When writing YAML configs manually (e.g., in `~/.data-designer/mcp_providers.yaml`), include the discriminator: - -```yaml -providers: - # Remote SSE provider - - name: doc-search - provider_type: sse - endpoint: http://localhost:8080/sse - api_key: ${MCP_API_KEY} - - # Remote Streamable HTTP provider - - name: remote-tools - provider_type: streamable_http - endpoint: https://mcp.example.com/mcp - api_key: ${MCP_API_KEY} - - # Local stdio provider - - name: local-tools - provider_type: stdio - command: python - args: - - -m - - my_mcp_server - env: - DEBUG: "true" -``` - -## Using Multiple Providers - -You can configure multiple MCP providers and use them together in a single `ToolConfig`: - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -providers = [ - dd.MCPProvider( - name="doc-search-mcp", - endpoint="http://localhost:8080/sse", - ), - dd.LocalStdioMCPProvider( - name="calculator-mcp", - command="python", - args=["-m", "calculator_mcp"], - ), -] - -data_designer = DataDesigner(mcp_providers=providers) -``` - -## See Also - -- **[Tool Configurations](tool-configs.md)**: Configure tool access 
with ToolConfig -- **[Configure MCP with the CLI](configure-mcp-cli.md)**: Use the CLI to manage MCP providers -- **[Enabling Tools on Columns](enabling-tools.md)**: Use tools in LLM columns diff --git a/docs/concepts/mcp/safety-and-limits.md b/docs/concepts/mcp/safety-and-limits.md deleted file mode 100644 index 54a5871e5..000000000 --- a/docs/concepts/mcp/safety-and-limits.md +++ /dev/null @@ -1,144 +0,0 @@ -# Safety and Limits - -This guide covers the safety controls available for tool use, including allowlists, turn budgets, and timeouts. These controls help prevent runaway loops and ensure predictable generation behavior. - -## Overview - -When LLM columns use tools, the model can make multiple tool calls in a loop until it produces a final answer. Without limits, this could lead to: - -- Excessive API calls and costs -- Long generation times -- Infinite loops if the model keeps requesting tools - -Data Designer provides three types of controls: - -| Control | Purpose | -|---------|---------| -| **Tool allowlists** | Restrict which tools can be called | -| **Turn budgets** | Limit iterations of tool-calling | -| **Timeouts** | Cap individual tool call latency | - -## Tool Allowlists - -Restrict which tools are available using `allow_tools`: - -```python -import data_designer.config as dd - -tool_config = dd.ToolConfig( - tool_alias="restricted-tools", - providers=["demo-mcp"], - allow_tools=["search_docs", "list_docs"], # Only these tools -) -``` - -### Behavior - -| Setting | Behavior | -|---------|----------| -| `allow_tools=None` (default) | All tools from the providers are available | -| `allow_tools=["tool1", "tool2"]` | Only the specified tools are available | - -Tools not in the allowlist won't be included in the schemas sent to the model, so the model won't know they exist. - -!!! 
tip "Use allowlists for untrusted tools" - If your MCP providers expose tools that could be dangerous or expensive, use allowlists to restrict access to only the tools you need. - -## Turn Budgets - -Limit the number of tool-calling iterations using `max_tool_call_turns`: - -```python -import data_designer.config as dd - -tool_config = dd.ToolConfig( - tool_alias="limited-tools", - providers=["demo-mcp"], - max_tool_call_turns=5, # Maximum 5 iterations (default) -) -``` - -### Understanding Turns - -A **turn** is one iteration where the LLM requests tool calls. With parallel tool calling, a single turn may execute multiple tools simultaneously. - -| Scenario | Turn Count | -|----------|------------| -| Model requests 1 tool | 1 turn | -| Model requests 3 tools in parallel | 1 turn | -| Model requests 1 tool, then 2 more, then 1 more | 3 turns | - -This approach gives models flexibility to use parallel calling efficiently while still bounding total iterations. - -### Graceful Budget Exhaustion - -When the turn limit is reached, Data Designer doesn't abruptly stop generation. Instead: - -1. The model's tool call request is recorded in the conversation -2. Tool "results" are returned with a refusal message explaining the limit was reached -3. The model receives this feedback and can produce a final response - -This ensures the model can still provide a useful answer based on the tools it already called, rather than failing silently. - -## Timeouts - -Limit how long each tool call can take using `timeout_sec`: - -```python -import data_designer.config as dd - -tool_config = dd.ToolConfig( - tool_alias="fast-tools", - providers=["demo-mcp"], - timeout_sec=30.0, # 30 seconds per tool call -) -``` - -### Timeout Behavior - -When a timeout occurs: - -1. The tool call is terminated -2. An error message is returned to the model -3. 
The model can attempt recovery (retry, skip, or answer without the result) - -```python -# Example error in trace when timeout occurs -{ - "role": "tool", - "content": "Error: Tool 'search_docs' failed: Connection timeout after 30s", - "tool_call_id": "call_abc123" -} -``` - -### Default Timeout - -The default timeout is 60 seconds. Adjust based on your tools: - -| Tool Type | Recommended Timeout | -|-----------|---------------------| -| Fast lookups | 5-10 seconds | -| Database queries | 15-30 seconds | -| External API calls | 30-60 seconds | -| Complex computations | 60+ seconds | - -## Combining Controls - -You can use all controls together for defense in depth: - -```python -import data_designer.config as dd - -tool_config = dd.ToolConfig( - tool_alias="secure-tools", - providers=["demo-mcp"], - allow_tools=["search_docs", "get_fact"], # Restricted tools - max_tool_call_turns=3, # Limited iterations - timeout_sec=15.0, # Fast timeout -) -``` - -## See Also - -- **[Tool Configurations](tool-configs.md)**: Complete ToolConfig reference -- **[Traces](../traces.md)**: Monitor tool usage patterns diff --git a/docs/concepts/mcp/tool-configs.md b/docs/concepts/mcp/tool-configs.md deleted file mode 100644 index c1841b9db..000000000 --- a/docs/concepts/mcp/tool-configs.md +++ /dev/null @@ -1,112 +0,0 @@ -# Tool Configurations - -Tool configurations define how LLM columns access external tools during generation. Each `ToolConfig` specifies which MCP providers to use, which tools are allowed, and operational limits. - -## Overview - -A `ToolConfig` connects LLM columns to MCP providers. When you create column configurations (like `LLMTextColumnConfig` or `LLMCodeColumnConfig`), you reference a tool configuration by its alias. Data Designer uses the tool configuration to determine which tools are available and how to manage tool calls. 
- -## ToolConfig Structure - -The `ToolConfig` class has the following fields: - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `tool_alias` | `str` | Yes | Unique identifier for this tool configuration (referenced by columns) | -| `providers` | `list[str]` | Yes | List of MCP provider names to use (can reference multiple providers) | -| `allow_tools` | `list[str]` | No | Restrict to specific tools (`None` = allow all tools from providers) | -| `max_tool_call_turns` | `int` | No | Maximum tool-calling iterations (default: 5) | -| `timeout_sec` | `float` | No | Per-call timeout in seconds (default: 60.0) | - -## Examples - -### Basic Tool Configuration - -```python -import data_designer.config as dd - -tool_config = dd.ToolConfig( - tool_alias="my-tools", - providers=["demo-mcp"], -) -``` - -### Restricting Allowed Tools - -```python -import data_designer.config as dd - -tool_config = dd.ToolConfig( - tool_alias="search-only", - providers=["demo-mcp"], - allow_tools=["search_docs", "list_docs"], # Only these tools allowed -) -``` - -### Using Multiple Providers - -A single `ToolConfig` can reference multiple MCP providers, allowing tools to be drawn from different sources: - -```python -import data_designer.config as dd - -tool_config = dd.ToolConfig( - tool_alias="multi-search", - providers=["doc-search-mcp", "web-search-mcp"], - allow_tools=["search_docs", "search_web", "list_docs"], - max_tool_call_turns=10, -) -``` - -When the model requests a tool call, Data Designer automatically finds which provider hosts that tool and routes the call appropriately. 
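The routing described above can be pictured as a lookup from tool name to provider. The sketch below is illustrative only and is not Data Designer's implementation: `build_tool_index` and the tool listings are hypothetical, and letting the first provider win on a name clash is an assumption that the text does not confirm.

```python
from typing import Optional


def build_tool_index(
    provider_tools: dict[str, list[str]],
    allow_tools: Optional[list[str]] = None,
) -> dict[str, str]:
    """Map each permitted tool name to the provider that hosts it."""
    index: dict[str, str] = {}
    for provider, tools in provider_tools.items():
        for tool in tools:
            if allow_tools is not None and tool not in allow_tools:
                continue  # filtered out by the allowlist
            # Assumption: the first provider listed wins on a name clash.
            index.setdefault(tool, provider)
    return index


# Hypothetical tool listings, as if fetched from each MCP server.
provider_tools = {
    "doc-search-mcp": ["search_docs", "list_docs"],
    "web-search-mcp": ["search_web", "fetch_page"],
}
index = build_tool_index(
    provider_tools,
    allow_tools=["search_docs", "search_web", "list_docs"],
)

# A call to "search_web" would be routed to the provider recorded in the index.
provider_for_call = index["search_web"]
```

Tools left out of `allow_tools` never enter the index, which matches the idea that the model is not offered them in the first place.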
- -### Setting Operational Limits - -```python -import data_designer.config as dd - -tool_config = dd.ToolConfig( - tool_alias="limited-tools", - providers=["demo-mcp"], - max_tool_call_turns=3, # Maximum 3 tool-calling iterations - timeout_sec=30.0, # 30 seconds per tool call -) -``` - -## Adding to Config Builder - -Tool configurations can be added to the config builder in two ways: - -```python -import data_designer.config as dd - -tool_config = dd.ToolConfig( - tool_alias="my-tools", - providers=["demo-mcp"], -) - -# Method 1: Pass at initialization -builder = dd.DataDesignerConfigBuilder(tool_configs=[tool_config]) - -# Method 2: Add later -builder = dd.DataDesignerConfigBuilder() -builder.add_tool_config(tool_config) -``` - -## Understanding Turn-Based Limiting - -The `max_tool_call_turns` parameter limits how many tool-calling iterations (turns) are permitted, not the total number of individual tool calls. - -!!! note "Turn-based vs call-based counting" - A **turn** is one iteration where the LLM requests tool calls. With parallel tool calling, a single turn may execute multiple tools simultaneously. - - For example, if the model requests 3 tools in parallel, that counts as 1 turn, not 3. This gives models flexibility to use parallel calling efficiently while still bounding total iterations. - -When the turn limit is reached, Data Designer gracefully refuses additional tool calls rather than failing abruptly. The model receives feedback explaining the limit was reached and can produce a final response based on the tools it already called. 
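To make the turn counting concrete, here is a toy loop under stated assumptions. It is a sketch, not the engine's code: `run_with_turn_budget`, the message dictionaries, and the `REFUSAL` text are all hypothetical, and stopping after one refused turn if the model still asks for tools is a choice made only to keep the example finite.

```python
from typing import Callable

REFUSAL = "Tool-call turn limit reached; answer with the results you already have."


def run_with_turn_budget(
    model: Callable[[list[dict]], dict],
    call_tool: Callable[[str], str],
    prompt: str,
    max_tool_call_turns: int = 5,
) -> list[dict]:
    """Run a tool loop that counts turns, not individual tool calls."""
    messages = [{"role": "user", "content": prompt}]
    turns_used = 0
    refused = False
    while True:
        reply = model(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls", [])
        if not tool_calls or refused:
            return messages  # final answer, or the model ignored the refusal
        if turns_used >= max_tool_call_turns:
            refused = True
            results = [REFUSAL for _ in tool_calls]
        else:
            turns_used += 1  # one turn, however many parallel calls it holds
            results = [call_tool(name) for name in tool_calls]
        for name, content in zip(tool_calls, results):
            messages.append({"role": "tool", "name": name, "content": content})


# Scripted stand-in for a model: two tool turns, a third request, then an answer.
script = iter([
    {"role": "assistant", "tool_calls": ["search_docs", "list_docs"]},
    {"role": "assistant", "tool_calls": ["search_docs"]},
    {"role": "assistant", "tool_calls": ["get_fact"]},
    {"role": "assistant", "content": "final answer"},
])
history = run_with_turn_budget(
    model=lambda msgs: next(script),
    call_tool=lambda name: f"result of {name}",
    prompt="question",
    max_tool_call_turns=2,
)
```

With a budget of two turns, the first turn runs two tools in parallel and still counts once; the third request receives the refusal message as its tool result, and the model then answers.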
- -## See Also - -- **[MCP Providers](mcp-providers.md)**: Configure connections to MCP servers -- **[Enabling Tools on Columns](enabling-tools.md)**: Reference tool configs from LLM columns -- **[Safety and Limits](safety-and-limits.md)**: Detailed guide on tool safety controls -- **[Configure MCP with the CLI](configure-mcp-cli.md)**: Use the CLI to manage tool configurations diff --git a/docs/concepts/models/configure-model-settings-with-the-cli.md b/docs/concepts/models/configure-model-settings-with-the-cli.md deleted file mode 100644 index e7c1b5edd..000000000 --- a/docs/concepts/models/configure-model-settings-with-the-cli.md +++ /dev/null @@ -1,142 +0,0 @@ -# Configuring Model Settings Using The CLI - -The Data Designer CLI provides an interactive interface for creating and managing default model providers and model configurations stored in your Data Designer home directory (default: `~/.data-designer/`). - -## Configuration Files - -The CLI manages two YAML configuration files: - -- **`model_providers.yaml`**: Model provider configurations -- **`model_configs.yaml`**: Model configurations - -!!! info "Automatic Configuration" - If these configuration files don't already exist, the Data Designer library automatically creates them with default settings at runtime when first initialized. - -!!! note "Custom Directory" - You can customize the configuration directory location with the `DATA_DESIGNER_HOME` environment variable: - ```bash - export DATA_DESIGNER_HOME="/path/to/your/custom/directory" - ``` - -## CLI Commands - -The Data Designer CLI provides four main configuration commands: - -```bash -# Configure model providers -data-designer config providers - -# Configure models -data-designer config models - -# List current configurations -data-designer config list - -# Reset all configurations -data-designer config reset -``` - -!!! 
tip "Getting help" - See available commands - ```bash - data-designer --help - ``` - - See available sub-commands - ```bash - data-designer config --help - ``` - -## Managing Model Providers - -Run the interactive provider configuration command: - -```bash -data-designer config providers -``` - -### Available Operations - -**Add a new provider**: Define a new provider by entering its name, endpoint URL, provider type, and optionally an API key (as plain text or as an environment variable name). - -**Update an existing provider**: Modify an existing provider's settings. All fields are pre-filled with current values. - -**Delete a provider**: Remove a provider and its associated models. - -**Delete all providers**: Remove all providers and their associated models. - -**Change default provider**: Set which provider is used by default. This option is only available when multiple providers are configured. - -!!! warning "Deprecated: 'Change default provider' workflow" - The "Change default provider" workflow is **deprecated** and will be removed in a future - release alongside the registry-level default. Specify `provider=` explicitly on each - `ModelConfig` instead — the workflow now emits a `DeprecationWarning` when entered. - See [issue #589](https://github.com/NVIDIA-NeMo/DataDesigner/issues/589). - -## Managing Model Configurations - -Run the interactive model configuration command: - -```bash -data-designer config models -``` - -!!! info "Provider Required" - You need at least one provider configured before adding models. Run `data-designer config providers` first if none exist. - -### Available Operations - -**Add a new model configuration** - -Create a new model configuration with the following fields: - -- **Alias**: A unique name for referencing this model in a column configuration. 
-- **Model ID**: The model identifier (e.g., `nvidia/nemotron-3-nano-30b-a3b`) -- **Provider**: Select from available providers (if multiple exist) -- **Temperature**: Sampling temperature (0.0 to 2.0) -- **Top P**: Nucleus sampling parameter (0.0 to 1.0) -- **Max Tokens**: Maximum output length (1 to 100000) - -!!! note "Additional Settings" - To configure additional inference parameter settings or use distribution-based inference parameters, edit the `model_configs.yaml` file directly. - -**Update an existing model configuration**: Modify an existing model's configuration. All fields are pre-filled with current values. - -**Delete a model configuration**: Remove a single model configuration. - -**Delete all model configurations**: Remove all model configurations. The CLI will ask for confirmation before proceeding. - -## Listing Configurations - -View all current configurations: - -```bash -data-designer config list -``` - -This command displays: - -- **Model Providers**: All configured providers with their endpoints (API keys are masked) -- **Default Provider**: The currently selected default provider _(deprecated; see issue #589)_ -- **Model Configurations**: All configured models with their settings - -## Resetting Configurations - -Delete all configuration files: - -```bash -data-designer config reset -``` - -The CLI will show which configuration files exist and ask for confirmation before deleting them. - -!!! danger "Destructive Operation" - This command permanently deletes all configuration files and resets to the default model providers and configurations. You'll need to reconfigure your custom configurations from scratch. 
- -## See Also - -- **[Default Model Settings](default-model-settings.md)**: Pre-configured providers and model settings included with Data Designer -- **[Custom Model Settings](custom-model-settings.md)**: Learn how to create custom providers and model configurations -- **[Model Providers](model-providers.md)**: Learn about the `ModelProvider` class and provider configuration -- **[Model Configurations](model-configs.md)**: Learn about `ModelConfig` -- **[Getting Started](../../index.md)**: Installation and basic usage example diff --git a/docs/concepts/models/custom-model-settings.md b/docs/concepts/models/custom-model-settings.md deleted file mode 100644 index be73ae408..000000000 --- a/docs/concepts/models/custom-model-settings.md +++ /dev/null @@ -1,221 +0,0 @@ -# Custom Model Settings - -While Data Designer ships with pre-configured model providers and configurations, you can create custom configurations to use different models, adjust inference parameters, or connect to custom API endpoints. 
- -## When to Use Custom Settings - -Use custom model settings when you need to: - -- Use models not included in the defaults -- Adjust inference parameters (temperature, top_p, max_tokens) for specific use cases -- Add distribution-based inference parameters for variability -- Connect to self-hosted or custom model endpoints -- Create multiple variants of the same model with different settings - -## Creating and Using Custom Settings - -### Custom Models with Default Providers - -Create custom model configurations that use the default providers (no need to define providers yourself): - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -# Create custom models using default providers -custom_models = [ - # High-temperature for more variability - dd.ModelConfig( - alias="creative-writer", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", # Uses default NVIDIA provider - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=1.2, - top_p=0.98, - max_tokens=4096, - ), - ), - # Low-temperature for less variability - dd.ModelConfig( - alias="fact-checker", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", # Uses default NVIDIA provider - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=0.1, - top_p=0.9, - max_tokens=2048, - ), - ), -] - -# Create DataDesigner (uses default providers) -data_designer = DataDesigner() - -# Pass custom models to config builder -config_builder = dd.DataDesignerConfigBuilder(model_configs=custom_models) - -# Add a topic column using a categorical sampler -config_builder.add_column( - dd.SamplerColumnConfig( - name="topic", - sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams( - values=["Artificial Intelligence", "Space Exploration", "Ancient History", "Climate Science"], - ), - ) -) - -# Use your custom models -config_builder.add_column( - dd.LLMTextColumnConfig( - name="creative_story", - 
model_alias="creative-writer", - prompt="Write a creative short story about {{topic}}.", - ) -) - -config_builder.add_column( - dd.LLMTextColumnConfig( - name="facts", - model_alias="fact-checker", - prompt="List 3 facts about {{topic}}.", - ) -) - -# Preview your dataset -preview_result = data_designer.preview(config_builder=config_builder) -preview_result.display_sample_record() -``` - -!!! note "Default Providers Always Available" - When you only specify `model_configs`, the default model providers (NVIDIA, OpenAI, and OpenRouter) are still available. You only need to create custom providers if you want to connect to different endpoints or modify provider settings. - -!!! warning "Always specify `provider=` on `ModelConfig`" - Leaving `provider` unset (or passing `provider=None`) on `ModelConfig` is **deprecated**. - The legacy "implicit default provider" routing — used when `provider` is omitted — emits - a `DeprecationWarning` and will be removed in a future release. Always reference the - intended provider by name, as the examples below do. See - [issue #589](https://github.com/NVIDIA-NeMo/DataDesigner/issues/589). - -!!! tip "Mixing Custom and Default Models" - When you provide custom `model_configs` to `DataDesignerConfigBuilder`, they **replace** the defaults entirely. To use custom model configs in addition to the default configs, use the add_model_config method: - - ```python - import data_designer.config as dd - - # Load defaults first - config_builder = dd.DataDesignerConfigBuilder() - - # Add custom model to defaults - config_builder.add_model_config( - dd.ModelConfig( - alias="my-custom-model", - model="nvidia/llama-3.3-nemotron-super-49b-v1.5", - provider="nvidia", # Uses default provider - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=0.6, - max_tokens=8192, - ), - ) - ) - - # Now you can use both default and custom models - # Default: nvidia-text, nvidia-reasoning, nvidia-vision, etc. 
- # Custom: my-custom-model - ``` - -### Custom Providers with Custom Models - -Define both custom providers and custom model configurations when you need to connect to services not included in the defaults: - -!!! warning "Network Accessibility" - The custom provider endpoints must be reachable from where Data Designer runs. Ensure network connectivity, firewall rules, and any VPN requirements are properly configured. - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -# Step 1: Define custom providers -custom_providers = [ - dd.ModelProvider( - name="my-custom-provider", - endpoint="https://api.my-llm-service.com/v1", - provider_type="openai", # OpenAI-compatible API - api_key="MY_SERVICE_API_KEY", # Environment variable name - ), - dd.ModelProvider( - name="my-self-hosted-provider", - endpoint="https://my-org.internal.com/llm/v1", - provider_type="openai", - api_key="SELF_HOSTED_API_KEY", - ), -] - -# Step 2: Define custom models -custom_models = [ - dd.ModelConfig( - alias="my-text-model", - model="openai/some-model-id", - provider="my-custom-provider", # References provider by name - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=0.85, - top_p=0.95, - max_tokens=2048, - ), - ), - dd.ModelConfig( - alias="my-self-hosted-text-model", - model="openai/some-hosted-model-id", - provider="my-self-hosted-provider", - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=0.7, - top_p=0.9, - max_tokens=1024, - ), - ), -] - -# Step 3: Create DataDesigner with custom providers -data_designer = DataDesigner(model_providers=custom_providers) - -# Step 4: Create config builder with custom models -config_builder = dd.DataDesignerConfigBuilder(model_configs=custom_models) - -# Step 5: Add a topic column using a categorical sampler -config_builder.add_column( - dd.SamplerColumnConfig( - name="topic", - sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams( - 
values=["Technology", "Healthcare", "Finance", "Education"], - ), - ) -) - -# Step 6: Use your custom model by referencing its alias -config_builder.add_column( - dd.LLMTextColumnConfig( - name="short_news_article", - model_alias="my-text-model", # Reference custom alias - prompt="Write a short news article about the '{{topic}}' topic in 10 sentences.", - ) -) - -config_builder.add_column( - dd.LLMTextColumnConfig( - name="long_news_article", - model_alias="my-self-hosted-text-model", # Reference custom alias - prompt="Write a detailed news article about the '{{topic}}' topic.", - ) -) - -# Step 7: Preview your dataset -preview_result = data_designer.preview(config_builder=config_builder) -preview_result.display_sample_record() -``` - -## See Also - -- **[Default Model Settings](default-model-settings.md)**: Pre-configured providers and model settings -- **[Configure Model Settings With the CLI](configure-model-settings-with-the-cli.md)**: CLI-based configuration -- **[Getting Started](../../index.md)**: Installation and basic usage example diff --git a/docs/concepts/models/default-model-settings.md b/docs/concepts/models/default-model-settings.md deleted file mode 100644 index 50c957c93..000000000 --- a/docs/concepts/models/default-model-settings.md +++ /dev/null @@ -1,134 +0,0 @@ -# Default Model Settings - -Data Designer ships with pre-configured model providers and model configurations that make it easy to start generating synthetic data without manual setup. 
- -## Model Providers - -Data Designer includes a few default model providers that are configured automatically: - -### NVIDIA Provider (`nvidia`) - -- **Endpoint**: `https://integrate.api.nvidia.com/v1` -- **API Key**: Set via `NVIDIA_API_KEY` environment variable -- **Models**: Access to NVIDIA's hosted models from [build.nvidia.com](https://build.nvidia.com) -- **Getting Started**: Sign up and get your API key at [build.nvidia.com](https://build.nvidia.com) - -The NVIDIA provider gives you access to state-of-the-art models including Nemotron and other NVIDIA-optimized models. - -### OpenAI Provider (`openai`) - -- **Endpoint**: `https://api.openai.com/v1` -- **API Key**: Set via `OPENAI_API_KEY` environment variable -- **Models**: Access to OpenAI's model catalog -- **Getting Started**: Get your API key from [platform.openai.com/api-keys](https://platform.openai.com/api-keys) - -The OpenAI provider gives you access to GPT models and other OpenAI offerings. - -### OpenRouter Provider (`openrouter`) - -- **Endpoint**: `https://openrouter.ai/api/v1` -- **API Key**: Set via `OPENROUTER_API_KEY` environment variable -- **Models**: Access to a wide variety of models through OpenRouter's unified API -- **Getting Started**: Get your API key from [openrouter.ai](https://openrouter.ai) - -The OpenRouter provider gives you access to a unified interface for many different language models from various providers. - -## Model Configurations - -Data Designer provides pre-configured model aliases for common use cases. When you create a `DataDesignerConfigBuilder` without specifying `model_configs`, these default configurations are automatically available. 
- -### NVIDIA Models - -The following model configurations are automatically available when `NVIDIA_API_KEY` is set: - -| Alias | Model | Use Case | Inference Parameters | -|-------|-------|----------|---------------------| -| `nvidia-text` | `nvidia/nemotron-3-nano-30b-a3b` | General text generation | `temperature=1.0, top_p=1.0` | -| `nvidia-reasoning` | `nvidia/nemotron-3-super-120b-a12b` | Reasoning and analysis tasks | `temperature=1.0, top_p=0.95, extra_body={"reasoning_effort": "medium"}` | -| `nvidia-vision` | `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning` | Vision and image understanding | `temperature=0.60, top_p=0.95` | -| `nvidia-embedding` | `nvidia/llama-3.2-nv-embedqa-1b-v2` | Text embeddings | `encoding_format="float", extra_body={"input_type": "query"}` | - - -### OpenAI Models - -The following model configurations are automatically available when `OPENAI_API_KEY` is set: - -| Alias | Model | Use Case | Inference Parameters | -|-------|-------|----------|---------------------| -| `openai-text` | `gpt-4.1` | General text generation | `temperature=0.85, top_p=0.95` | -| `openai-reasoning` | `gpt-5` | Reasoning and analysis tasks | `temperature=0.35, top_p=0.95` | -| `openai-vision` | `gpt-5` | Vision and image understanding | `temperature=0.85, top_p=0.95` | -| `openai-embedding` | `text-embedding-3-large` | Text embeddings | `encoding_format="float"` | - -### OpenRouter Models - -The following model configurations are automatically available when `OPENROUTER_API_KEY` is set: - -| Alias | Model | Use Case | Inference Parameters | -|-------|-------|----------|---------------------| -| `openrouter-text` | `nvidia/nemotron-3-nano-30b-a3b` | General text generation | `temperature=1.0, top_p=1.0` | -| `openrouter-reasoning` | `openai/gpt-oss-20b` | Reasoning and analysis tasks | `temperature=0.35, top_p=0.95` | -| `openrouter-vision` | `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free` | Vision and image understanding | `temperature=0.60, top_p=0.95` | 
-| `openrouter-embedding` | `openai/text-embedding-3-large` | Text embeddings | `encoding_format="float"` | - - -## Using Default Settings - -Default settings work out of the box - no configuration needed! Simply create `DataDesigner` and `DataDesignerConfigBuilder` instances without any arguments, and reference the default model aliases in your column configurations. - -For a complete example showing how to use default model settings, see the **[Getting Started](../../index.md)** page. - -### How Default Model Providers and Configurations Work - -When the Data Designer library or the CLI is initialized, default model configurations and providers are stored in the Data Designer home directory for easy access and customization if they do not already exist. These configuration files serve as the single source of truth for model settings. By default they are saved to the following paths: - -- **Model Configs**: `~/.data-designer/model_configs.yaml` -- **Model Providers**: `~/.data-designer/model_providers.yaml` - -!!! tip Tip - While these files provide a convenient way to specify settings for your model providers and configuration you use most often, they can always be set programmatically in your SDG workflow. - -You can customize the home directory location by setting the `DATA_DESIGNER_HOME` environment variable: - -```bash -# In your .bashrc, .zshrc, or similar -export DATA_DESIGNER_HOME="/path/to/your/custom/directory" -``` - -These configuration files can be modified in two ways: - -1. **Using the CLI**: Run CLI commands to add, update, or delete model configurations and providers -2. **Manual editing**: Directly edit the YAML files with your preferred text editor - -Both methods operate on the same files, ensuring consistency across your entire Data Designer setup. - -## Important Notes - -!!! 
warning "API Key Requirements" - While default model configurations are always available, you need to set the appropriate API key environment variable (`NVIDIA_API_KEY`, `OPENAI_API_KEY`, or `OPENROUTER_API_KEY`) to actually use the corresponding models for data generation. Without a valid API key, any attempt to generate data using that provider's models will fail. - -!!! warning "Hosted Provider Data Handling" - The default model providers call hosted endpoints operated by NVIDIA, OpenAI, OpenRouter, or their upstream providers. Provider terms and privacy practices apply independently of Data Designer, and free or trial endpoints may log request data for security, operations, or product improvement. Do not submit confidential information or personal data, including faces, voices, screenshots, regulated data, or other sensitive content, unless the selected provider and endpoint are approved for your use case. - -!!! warning "Deprecated: implicit default provider routing" - The `default:` key in `~/.data-designer/model_providers.yaml` and the registry-level - "default provider" concept are **deprecated** and will be removed in a future release. - Specify `provider=` explicitly on every `ModelConfig` instead — the built-in defaults - above already do this, and a `DeprecationWarning` is now emitted whenever the legacy - routing is exercised. See [issue #589](https://github.com/NVIDIA-NeMo/DataDesigner/issues/589). - -!!! 
tip "Environment Variables" - Store your API keys in environment variables rather than hardcoding them in your scripts: - - ```bash - # In your .bashrc, .zshrc, or similar - export NVIDIA_API_KEY="your-api-key-here" - export OPENAI_API_KEY="your-openai-api-key-here" - export OPENROUTER_API_KEY="your-openrouter-api-key-here" - ``` - -## See Also - -- **[Custom Model Settings](custom-model-settings.md)**: Learn how to create custom providers and model configurations -- **[Configure Model Settings With the CLI](configure-model-settings-with-the-cli.md)**: Learn how to use the CLI to manage model settings -- **[Model Configurations](model-configs.md)**: Learn about model configurations diff --git a/docs/concepts/models/inference-parameters.md b/docs/concepts/models/inference-parameters.md deleted file mode 100644 index 03e932ed4..000000000 --- a/docs/concepts/models/inference-parameters.md +++ /dev/null @@ -1,184 +0,0 @@ -# Inference Parameters - -Inference parameters control how models generate responses during synthetic data generation. Data Designer provides three types of inference parameters: `ChatCompletionInferenceParams` for text/code/structured generation, `EmbeddingInferenceParams` for embedding generation, and `ImageInferenceParams` for image generation. - -## Overview - -When you create a `ModelConfig`, you can specify inference parameters to adjust model behavior. These parameters control aspects like randomness (temperature), diversity (top_p), context size (max_tokens), and more. Data Designer supports both static values and dynamic distribution-based sampling for certain parameters. - -## Chat Completion Inference Parameters - -The `ChatCompletionInferenceParams` class controls how models generate text completions (for text, code, and structured data generation). It provides fine-grained control over generation behavior and supports both static values and dynamic distribution-based sampling. 
- -### Fields - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `temperature` | `float` or `Distribution` | No | Controls randomness in generation (0.0 to 2.0). Higher values = more creative/random | -| `top_p` | `float` or `Distribution` | No | Nucleus sampling parameter (0.0 to 1.0). Controls diversity by filtering low-probability tokens | -| `max_tokens` | `int` | No | Maximum number of tokens to generate in the response (≥ 1) | -| `max_parallel_requests` | `int` | No | Maximum concurrent API requests to this model (default: 4, ≥ 1). See [Concurrency Control](#concurrency-control) below. | -| `timeout` | `int` | No | API request timeout in seconds (≥ 1) | -| `extra_body` | `dict[str, Any]` | No | Additional parameters to include in the API request body | - -!!! note "Default Values" - If `temperature`, `top_p`, or `max_tokens` are not provided, the model provider's default values will be used. Different providers and models may have different defaults. - -!!! 
tip "Controlling Reasoning Effort for Reasoning Models" - For reasoning models like Nemotron 3 Super (`nvidia/nemotron-3-super-120b-a12b`) and GPT-OSS (`gpt-oss-20b`, `gpt-oss-120b`), you can control the reasoning effort using the `extra_body` parameter: - - ```python - import data_designer.config as dd - - # High reasoning effort (more thorough, slower) - inference_parameters = dd.ChatCompletionInferenceParams( - extra_body={"reasoning_effort": "high"} - ) - - # Medium reasoning effort (balanced) - inference_parameters = dd.ChatCompletionInferenceParams( - extra_body={"reasoning_effort": "medium"} - ) - - # Low reasoning effort (faster, less thorough) - inference_parameters = dd.ChatCompletionInferenceParams( - extra_body={"reasoning_effort": "low"} - ) - ``` - -### Temperature and Top P Guidelines - -- **Temperature**: - - `0.0-0.3`: Highly deterministic, focused outputs (ideal for structured/reasoning tasks) - - `0.4-0.7`: Balanced creativity and coherence (general purpose) - - `0.8-1.0`: Creative, diverse outputs (ideal for creative writing) - - `1.0+`: Highly random and experimental - -- **Top P**: - - `0.1-0.5`: Very focused, only most likely tokens - - `0.6-0.9`: Balanced diversity - - `0.95-1.0`: Maximum diversity, including less likely tokens - -!!! tip "Adjusting Temperature and Top P Together" - When tuning both parameters simultaneously, consider these combinations: - - - **For deterministic/structured outputs**: Low temperature (`0.0-0.3`) + moderate-to-high top_p (`0.8-0.95`) - - The low temperature ensures focus, while top_p allows some token diversity - - **For balanced generation**: Moderate temperature (`0.5-0.7`) + high top_p (`0.9-0.95`) - - This is a good starting point for most use cases - - **For creative outputs**: Higher temperature (`0.8-1.0`) + high top_p (`0.95-1.0`) - - Both parameters work together to maximize diversity - - **Avoid**: Setting both very low (overly restrictive) or adjusting both dramatically at once. 
When experimenting, adjust one parameter at a time to understand its individual effect. - -## Distribution-Based Inference Parameters - -For `temperature` and `top_p` in `ChatCompletionInferenceParams`, you can specify distributions instead of fixed values. This allows Data Designer to sample different values for each generation request, introducing controlled variability into your synthetic data. - -### Uniform Distribution - -Samples values uniformly between a low and high bound: - -```python -import data_designer.config as dd - -inference_params = dd.ChatCompletionInferenceParams( - temperature=dd.UniformDistribution( - params=dd.UniformDistributionParams(low=0.7, high=1.0) - ), -) -``` - -### Manual Distribution - -Samples from a discrete set of values with optional weights: - -```python -import data_designer.config as dd - -# Equal probability for each value -inference_params = dd.ChatCompletionInferenceParams( - temperature=dd.ManualDistribution( - params=dd.ManualDistributionParams(values=[0.5, 0.7, 0.9]) - ), -) - -# Weighted probabilities (normalized automatically) -inference_params = dd.ChatCompletionInferenceParams( - top_p=dd.ManualDistribution( - params=dd.ManualDistributionParams( - values=[0.8, 0.9, 0.95], - weights=[0.2, 0.5, 0.3] # 20%, 50%, 30% probability - ) - ), -) -``` - -## Concurrency Control - -The `max_parallel_requests` parameter controls how many concurrent API calls Data Designer makes to a specific model. This directly impacts throughput and should be tuned to match your inference server's capacity. - -!!! tip "Performance Tuning" - For recommended values by deployment type (NVIDIA API Catalog, vLLM, OpenAI, NIMs) and detailed optimization strategies, see the [Architecture & Performance](../architecture-and-performance.md) guide. - -## Embedding Inference Parameters - -The `EmbeddingInferenceParams` class controls how models generate embeddings. 
This is used when working with embedding models for tasks like semantic search or similarity analysis. - -### Fields - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `encoding_format` | `Literal["float", "base64"]` | No | Format of the embedding encoding (default: "float") | -| `dimensions` | `int` | No | Number of dimensions for the embedding | -| `max_parallel_requests` | `int` | No | Maximum concurrent API requests (default: 4, ≥ 1) | -| `timeout` | `int` | No | API request timeout in seconds (≥ 1) | -| `extra_body` | `dict[str, Any]` | No | Additional parameters to include in the API request body | - - -## Image Inference Parameters - -The `ImageInferenceParams` class is used for image generation models, including both diffusion models (DALL·E, Stable Diffusion, Imagen) and autoregressive models (Gemini image, GPT image). Unlike text models, image-specific options are passed entirely via `extra_body`, since they vary significantly between providers. - -### Fields - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `max_parallel_requests` | `int` | No | Maximum concurrent API requests (default: 4, ≥ 1) | -| `timeout` | `int` | No | API request timeout in seconds (≥ 1) | -| `extra_body` | `dict[str, Any]` | No | Model-specific image options (size, quality, aspect ratio, etc.) 
| - -### Examples - -```python -import data_designer.config as dd - -# Autoregressive model (chat completions API, supports image context) -dd.ModelConfig( - alias="image-model", - model="black-forest-labs/flux.2-pro", - provider="openrouter", - inference_parameters=dd.ImageInferenceParams( - extra_body={"height": 512, "width": 512} - ), -) - -# Diffusion model (e.g., DALL·E, Stable Diffusion) -dd.ModelConfig( - alias="dalle", - model="dall-e-3", - provider="openai", - inference_parameters=dd.ImageInferenceParams( - extra_body={"size": "1024x1024", "quality": "hd"} - ), -) -``` - - -## See Also - -- **[Default Model Settings](default-model-settings.md)**: Pre-configured model settings included with Data Designer -- **[Custom Model Settings](custom-model-settings.md)**: Learn how to create custom providers and model configurations -- **[Model Configurations](model-configs.md)**: Learn about configuring model settings -- **[Model Providers](model-providers.md)**: Learn about configuring model providers -- **[Architecture & Performance](../architecture-and-performance.md)**: Understanding separation of concerns and optimizing concurrency diff --git a/docs/concepts/models/model-configs.md b/docs/concepts/models/model-configs.md deleted file mode 100644 index 888a7bdca..000000000 --- a/docs/concepts/models/model-configs.md +++ /dev/null @@ -1,147 +0,0 @@ -# Model Configurations - -Model configurations define the specific models you use for synthetic data generation and their associated inference parameters. Each `ModelConfig` represents a named model that can be referenced throughout your data generation workflows. - -## Overview - -A `ModelConfig` specifies which LLM model to use and how it should behave during generation. When you create column configurations (like `LLMText`, `LLMCode`, or `LLMStructured`), you reference a model by its alias. Data Designer uses the model configuration to determine which model to call and with what parameters. 
- -## ModelConfig Structure - -The `ModelConfig` class has the following fields: - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `alias` | `str` | Yes | Unique identifier for this model configuration (e.g., `"my-text-model"`, `"reasoning-model"`) | -| `model` | `str` | Yes | Model identifier as recognized by the provider (e.g., `"nvidia/nemotron-3-nano-30b-a3b"`, `"gpt-4"`) | -| `inference_parameters` | `InferenceParamsT` | No | Controls model behavior during generation. Use `ChatCompletionInferenceParams` for text/code/structured generation or `EmbeddingInferenceParams` for embeddings. Defaults to `ChatCompletionInferenceParams()` if not provided. The generation type is automatically determined by the inference parameters type. See [Inference Parameters](inference-parameters.md) for details. | -| `provider` | `str` | No | Reference to the name of the Provider to use (e.g., `"nvidia"`, `"openai"`, `"openrouter"`). If not specified, one set as the default provider, which may resolve to the first provider if there are more than one | -| `skip_health_check` | `bool` | No | Whether to skip the health check for this model. Defaults to `False`. Set to `True` to skip health checks when you know the model is accessible or want to defer validation. 
| - - -## Examples - -### Basic Model Configuration - -```python -import data_designer.config as dd - -# Simple model configuration with fixed parameters -model_config = dd.ModelConfig( - alias="my-text-model", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=0.85, - top_p=0.95, - max_tokens=2048, - ), -) -``` - -### Multiple Model Configurations for Different Tasks - -```python -import data_designer.config as dd - -model_configs = [ - # Creative tasks - dd.ModelConfig( - alias="creative-model", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=0.9, - top_p=0.95, - max_tokens=2048, - ), - ), - # Critic tasks - dd.ModelConfig( - alias="critic-model", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=0.25, - top_p=0.95, - max_tokens=2048, - ), - ), - # Reasoning and structured tasks - dd.ModelConfig( - alias="reasoning-model", - model="nvidia/nemotron-3-super-120b-a12b", - provider="nvidia", - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=1.0, - top_p=0.95, - max_tokens=4096, - ), - ), - # Vision tasks - dd.ModelConfig( - alias="vision-model", - model="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning", - provider="nvidia", - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=0.60, - top_p=0.95, - max_tokens=2048, - ), - ), - # Embedding tasks - dd.ModelConfig( - alias="embedding_model", - model="nvidia/llama-3.2-nv-embedqa-1b-v2", - provider="nvidia", - inference_parameters=dd.EmbeddingInferenceParams( - encoding_format="float", - extra_body={ - "input_type": "query" - } - ) - ) -] -``` - -!!! tip "Experiment with max_tokens for Task-Specific Model Configurations" - The number of tokens required to generate a single data entry can vary significantly with use case. 
For example, reasoning models often need more tokens to "think through" problems before generating a response. Note that `max_tokens` specifies the **maximum number of output tokens** to generate in the response, so set this value based on the expected length of the generated content. - -### Skipping Health Checks - -By default, Data Designer runs a health check for each model before starting data generation to ensure the model is accessible and configured correctly. You can skip this health check for specific models by setting `skip_health_check=True`: - -```python -import data_designer.config as dd - -model_config = dd.ModelConfig( - alias="my-model", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=0.85, - top_p=0.95, - max_tokens=2048, - ), - skip_health_check=True, # Skip health check for this model -) -``` - -!!! info "When to Skip Health Checks" - Skipping health checks can be useful when: - - - You've already verified the model is accessible and want to speed up initialization - - You're using a model that doesn't support the standard health check format - - You want to defer model validation until the model is actually used - - Note that skipping health checks means errors will only be discovered during actual data generation. 
- -## See Also - -- **[Inference Parameters](inference-parameters.md)**: Detailed guide to inference parameters and how to configure them -- **[Model Providers](model-providers.md)**: Learn about configuring model providers -- **[Default Model Settings](default-model-settings.md)**: Pre-configured model settings included with Data Designer -- **[Custom Model Settings](custom-model-settings.md)**: Learn how to create custom providers and model configurations -- **[Configure Model Settings With the CLI](configure-model-settings-with-the-cli.md)**: Use the CLI to manage model settings -- **[Column Configurations](../../code_reference/config/column_configs.md)**: Learn how to use models in column configurations -- **[Architecture & Performance](../architecture-and-performance.md)**: Understanding separation of concerns and optimizing concurrency diff --git a/docs/concepts/models/model-providers.md b/docs/concepts/models/model-providers.md deleted file mode 100644 index f8625ae9b..000000000 --- a/docs/concepts/models/model-providers.md +++ /dev/null @@ -1,76 +0,0 @@ -# Model Providers - -Model providers are external services that host and serve models. Data Designer uses the `ModelProvider` class to configure connections to these services. - -## Overview - -A `ModelProvider` defines how Data Designer connects to a provider's API endpoint. When you create a `ModelConfig`, you reference a provider by name, and Data Designer uses that provider's settings to make API calls to the appropriate endpoint. - -!!! warning "Deprecated: implicit default provider routing" - Earlier versions of Data Designer let you omit `provider=` on `ModelConfig` and - fall back to a registry-level default — including the `default:` key in - `~/.data-designer/model_providers.yaml`. That implicit routing is **deprecated** - and will be removed in a future release. Always reference a provider by name on - every `ModelConfig`. A `DeprecationWarning` is now emitted when the legacy path - is exercised. 
See [issue #589](https://github.com/NVIDIA-NeMo/DataDesigner/issues/589). - -## ModelProvider Configuration - -The `ModelProvider` class has the following fields: - -| Field | Type | Required | Description | -|-------|------|----------|-------------| -| `name` | `str` | Yes | Unique identifier for the provider (e.g., `"nvidia"`, `"openai"`, `"openrouter"`) | -| `endpoint` | `str` | Yes | API endpoint URL (e.g., `"https://integrate.api.nvidia.com/v1"`) | -| `provider_type` | `str` | No | Provider type: `"openai"` (default) or `"anthropic"`. See [Supported Provider Types](#supported-provider-types) below | -| `api_key` | `str` | No | API key or environment variable name (e.g., `"NVIDIA_API_KEY"`) | -| `extra_body` | `dict[str, Any]` | No | Additional parameters to include in the request body of all API requests to the provider. | -| `extra_headers` | `dict[str, str]` | No | Additional headers to include in all API requests to the provider. | - -## Supported Provider Types - -Data Designer supports two provider types: - -| Type | Description | -|------|-------------| -| `"openai"` | OpenAI-compatible chat completion API. This is the default and works with most providers, including NVIDIA NIM, vLLM, TGI, OpenRouter, Together AI, and OpenAI itself. | -| `"anthropic"` | Anthropic's native Messages API for Claude models. Use this when connecting directly to Anthropic's API. | - -Most self-hosted and third-party endpoints expose an OpenAI-compatible API, so `provider_type="openai"` is the right choice in the majority of cases. Only use `"anthropic"` when connecting directly to Anthropic's API at `https://api.anthropic.com`. - -> **Note:** Previous versions of Data Designer supported additional provider types (e.g., `"azure"`, `"bedrock"`, `"vertex_ai"`) via a LiteLLM bridge. These are no longer supported. If you were using one of these types, switch to `provider_type="openai"` and point the `endpoint` to an OpenAI-compatible proxy or gateway for that service. 
- -## API Key Configuration - -The `api_key` field can be specified in two ways: - -1. **Environment variable name** (recommended): Set `api_key` to the name of an environment variable (e.g., `"NVIDIA_API_KEY"`). Data Designer will automatically resolve it at runtime. - -2. **Plain-text value**: Set `api_key` to the actual API key string. This is less secure and not recommended for production use. - -```python -# Method 1: Environment variable (recommended) -provider = ModelProvider( - name="nvidia", - endpoint="https://integrate.api.nvidia.com/v1", - api_key="NVIDIA_API_KEY", # Will be resolved from environment -) - -# Method 2: Direct value (not recommended) -provider = ModelProvider( - name="nvidia", - endpoint="https://integrate.api.nvidia.com/v1", - api_key="nvapi-abc123...", # Direct API key -) -``` - -## See Also - -- **[Model Configurations](model-configs.md)**: Learn about configuring models -- **[Inference Parameters](inference-parameters.md)**: Detailed guide to inference parameters and how to configure them -- **[Default Model Settings](default-model-settings.md)**: Pre-configured providers and model settings included with Data Designer -- **[Custom Model Settings](custom-model-settings.md)**: Learn how to create custom providers and model configurations -- **[Model Configurations](model-configs.md)**: Learn about configuring models -- **[Inference Parameters](inference-parameters.md)**: Detailed guide to inference parameters and how to configure them -- **[Configure Model Settings With the CLI](configure-model-settings-with-the-cli.md)**: Use the CLI to manage providers and model settings -- **[Getting Started](../../index.md)**: Installation and basic usage example diff --git a/docs/concepts/person_sampling.md b/docs/concepts/person_sampling.md deleted file mode 100644 index 3c9e5eaf6..000000000 --- a/docs/concepts/person_sampling.md +++ /dev/null @@ -1,283 +0,0 @@ -# Person Sampling in Data Designer - -Person sampling in Data Designer allows you to 
generate synthetic person data for your datasets. There are two distinct approaches, each with different capabilities and use cases. - -## Overview - -Data Designer provides two ways to generate synthetic people: - -1. **Faker-based sampling** - Quick, basic PII generation for testing or when realistic demographic distributions are not relevant for your use case -2. **Nemotron-Personas datasets** - Demographically accurate, rich persona data - ---- - -## Approach 1: Faker-Based Sampling - -### What It Does -Uses the Faker library to generate random personal information. The data is basic and not demographically accurate, but is useful for quick testing, prototyping, or when realistic demographic distributions are not relevant for your use case. - -### Features -- Gives you access to person attributes that Faker exposes -- Quick to set up with no additional downloads -- Generates random names, emails, addresses, phone numbers, etc. -- Supports [all Faker-supported locales](https://faker.readthedocs.io/en/master/locales.html) -- **Not demographically grounded** - data patterns don't reflect real-world demographics - -### Usage Example -```python -import data_designer.config as dd - -config_builder.add_column( - dd.SamplerColumnConfig( - name="customer", - sampler_type=dd.SamplerType.PERSON_FROM_FAKER, - params=dd.PersonFromFakerSamplerParams( - locale="en_US", - age_range=[25, 65], - sex="Female", - ), - ) -) -``` - -For mor details, see the documentation for [`SamplerColumnConfig`](../code_reference/config/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) and [`PersonFromFakerSamplerParams`](../code_reference/config/sampler_params.md#data_designer.config.sampler_params.PersonFromFakerSamplerParams). 
- ---- - -## Approach 2: Nemotron-Personas Datasets - -### What It Does -Uses curated Nemotron-Personas datasets from NVIDIA GPU Cloud (NGC) to generate demographically accurate person data with rich personality profiles and behavioral characteristics. - -The NGC datasets are extended versions of the [open-source Nemotron-Personas datasets on HuggingFace](https://huggingface.co/collections/nvidia/nemotron-personas), with additional fields and enhanced data quality. - -Supported locales: - -- `en_US`: United States -- `en_IN`: India (English) -- `en_SG`: Singapore (English) -- `fr_FR`: France (French) -- `hi_Deva_IN`: India (Devanagari script) -- `hi_Latn_IN`: India (Latin script) -- `ja_JP`: Japan -- `ko_KR`: South Korea (Korean) -- `pt_BR`: Brazil (Portuguese) - -### Features -- **Demographically accurate personal details**: Names, ages, sex, marital status, education, occupation based on census data -- **Rich persona details**: Comprehensive behavioral profiles including: - - Big Five personality traits with scores - - Cultural backgrounds and narratives - - Skills and hobbies - - Career goals and aspirations - - Context-specific personas (professional, financial, healthcare, sports, arts, travel, culinary, etc.) -- Consistent, referenceable attributes across your dataset -- Grounded in real-world demographic distributions - -### Prerequisites - -To use the extended Nemotron-Personas datasets with Data Designer, you need to download them [from NGC](https://catalog.ngc.nvidia.com/search?orderBy=scoreDESC&query=nemotron+personas) and move them to the Data Designer managed assets directory. - -See below for step-by-step instructions. - -### Nemotron-Personas Datasets Setup Instructions - -#### Step 0: Obtain an NGC API Key and install the NGC CLI - -To download the Nemotron-Personas datasets from NGC, you will need to obtain an NGC API key and install the NGC CLI. - -1. **NGC API Key**: Obtain from [NVIDIA GPU Cloud](https://ngc.nvidia.com/) -2. 
**NGC CLI**: [NGC CLI](https://org.ngc.nvidia.com/setup/installers/cli) - - -#### Step 1: Set Your NGC API Key -```bash -export NGC_API_KEY="your-ngc-api-key-here" -``` - -#### Step 2 (option 1): Download Nemotron-Personas Datasets via the Data Designer CLI - -Once you have the NGC CLI and your NGC API key set up, you can download the datasets via the Data Designer CLI. - -You can pass the locales you want to download as arguments to the CLI command: -```bash -data-designer download personas --locale en_US --locale ja_JP -``` - -Or you can use the interactive mode to select the locales you want to download: -```bash -data-designer download personas -``` - -#### Step 2 (option 2): Download Nemotron-Personas Datasets Directly - -Use the NGC CLI to download the datasets: -```bash -# For Nemotron-Personas USA -ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_us" - -# For Nemotron-Personas IN -ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_deva_in" -ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_latn_in" -ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_in" - -# For Nemotron-Personas FR -ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-fr_fr" - -# For Nemotron-Personas JP -ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-ja_jp" - -# For Nemotron-Personas KR -ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-ko_kr" - -# For Nemotron-Personas SG -ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_sg" - -# For Nemotron-Personas BR -ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-pt_br" -``` - -Then move the downloaded dataset to the Data Designer managed assets directory: -```bash 
-mkdir -p ~/.data-designer/managed-assets/datasets/ -mv nemotron-personas-dataset-*/*.parquet ~/.data-designer/managed-assets/datasets/ -``` - -#### Step 3: Use PersonSampler in Your Code -```python -import data_designer.config as dd - -config_builder.add_column( - dd.SamplerColumnConfig( - name="customer", - sampler_type=dd.SamplerType.PERSON, - params=dd.PersonSamplerParams( - locale="en_US", - sex="Female", - age_range=[25, 45], - with_synthetic_personas=True, - ), - ) -) -``` - -For more details, see the documentation for [`SamplerColumnConfig`](../code_reference/config/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) and [`PersonSamplerParams`](../code_reference/config/sampler_params.md#data_designer.config.sampler_params.PersonSamplerParams). - -### Available Data Fields - -**Core Fields (all locales):** - -| Field | Type | Notes | -|-------|------|-------| -| `uuid` | UUID | Unique identifier | -| `first_name` | string | | -| `middle_name` | string | | -| `last_name` | string | | -| `sex` | enum | "Male" or "Female" | -| `birth_date` | date | Derived: year, month, day | -| `street_number` | int | | -| `street_name` | string | | -| `unit` | string | Address line 2 | -| `city` | string | | -| `region` | string | Alias: state | -| `district` | string | Alias: county | -| `postcode` | string | Alias: zipcode | -| `country` | string | | -| `phone_number` | PhoneNumber | Derived: area_code, country_code, prefix, line_number | -| `marital_status` | string | Values: never_married, married_present, separated, widowed, divorced | -| `education_level` | string or None | | -| `bachelors_field` | string or None | | -| `occupation` | string or None | | -| `email_address` | string | | -| `national_id` | string | - -**France-Specific Fields (`fr_FR`):** - -- `household_type` - Household composition (e.g., single person, couple with/without children) -- `monthly_income_eur` - Estimated monthly income in euros -- `first_name_heritage` - Cultural 
origin of the first name -- `name_heritage` - Cultural, linguistic, or geographic origin of the surname -- `is_first_gen_immigrant` - Whether the individual is a first-generation immigrant to France - -**Japan-Specific Fields (`ja_JP`):** - -- `area` - -**Korea-Specific Fields (`ko_KR`):** - -- `economic_activity_status` - Employment / economic activity status -- `family_type` - Household / family composition type -- `housing_type` - Dwelling type (apartment, detached home, etc.) -- `housing_tenure` - Owned vs rented, etc. -- `income_bracket` - Income range -- `military_status` - Military service status -- `drinking_status` - Drinking frequency / status -- `smoking_status` - Smoking frequency / status -- `blood_pressure_status` - Blood pressure health indicator -- `blood_sugar_status` - Blood sugar health indicator -- `bmi_status` - BMI health indicator -- `waist_status` - Waist-circumference health indicator - -**Brazil-Specific Fields (`pt_BR`):** - -- `race` - Census-reported race - -**Singapore-Specific Fields (`en_SG`):** - -- `industry` - Industry of employment -- `preferred_english_name` - Preferred English-form name - -**English Locales Shared Fields (`en_US`, `en_SG`):** - -- `ethnic_background` - Self-identified ethnic background - -**Religion Fields (`en_IN`, `hi_Deva_IN`, `hi_Latn_IN`, `en_SG`, `pt_BR`):** - -- `religion` - Census-reported religion - -**India Locales Fields (`en_IN`, `hi_Deva_IN`, `hi_Latn_IN`):** - -- `education_degree` - Census-reported education degree -- `first_language` - Native language -- `second_language` - Second language (if applicable) -- `third_language` - Third language (if applicable) -- `zone` - Urban vs rural - -**With Synthetic Personas Enabled:** - -- Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) with t-scores and labels -- Cultural background narratives -- Skills and competencies -- Hobbies and interests -- Career goals -- Context-specific personas (professional, 
financial, healthcare, sports, arts & entertainment, travel, culinary, etc.) - -*Japan-specific persona fields (`ja_JP`):* - -- `aspects` -- `digital_skill` - -*Korea-specific persona fields (`ko_KR`):* - -- `family_persona` - -*Religious persona fields (`en_IN`, `hi_Deva_IN`, `hi_Latn_IN`, `en_SG`, `pt_BR`):* - -- `religious_persona` -- `religious_background` - -*India-locales persona fields (`en_IN`, `hi_Deva_IN`, `hi_Latn_IN`):* - -- `linguistic_persona` -- `linguistic_background` - -### Configuration Parameters - -| Parameter | Type | Description | -|-----------|------|-------------| -| `locale` | str | Language/region code - must be one of: "en_US", "en_IN", "en_SG", "fr_FR", "hi_Deva_IN", "hi_Latn_IN", "ja_JP", "ko_KR", "pt_BR" | -| `sex` | str (optional) | Filter by "Male" or "Female" | -| `city` | str or list[str] (optional) | Filter by specific city or cities within locale | -| `age_range` | list[int] (optional) | Two-element list [min_age, max_age] (default: [18, 114]) | -| `with_synthetic_personas` | bool (optional) | Include rich personality profiles (default: False) | -| `select_field_values` | dict (optional) | Custom field-based filtering (e.g., {"state": ["NY", "CA"], "education_level": ["bachelors"]}) | diff --git a/docs/concepts/processors.md b/docs/concepts/processors.md deleted file mode 100644 index 290eff00c..000000000 --- a/docs/concepts/processors.md +++ /dev/null @@ -1,179 +0,0 @@ -# Processors - -Processors are transformations that modify your dataset before or after columns are generated. They run at different stages and can reshape, filter, or augment the data. - -!!! tip "When to Use Processors" - Processors handle transformations that don't fit the "column" model: restructuring the schema for a specific output format, dropping intermediate columns in bulk, or applying batch-wide operations. 
- -## Overview - -Each processor: - -- Receives the complete batch DataFrame -- Applies its transformation -- Passes the result to the next processor (or to output) - -Processors can run at three stages, determined by which callback methods they implement: - -| Stage | When it runs | Callback method | Use cases | -|-------|--------------|-----------------|-----------| -| Pre-batch | After seed columns, before dependent columns | `process_before_batch()` | Transform seed data before other columns are generated | -| Post-batch | After each batch completes | `process_after_batch()` | Drop columns, transform schema per batch | -| After generation | Once, on final dataset after all batches | `process_after_generation()` | Deduplicate, aggregate statistics, final cleanup | - -!!! info "Full Schema Available During Generation" - Each batch carries the full dataset schema during generation. Post-batch schema changes such as column dropping only alter past batches, so all columns remain accessible to generators while building follow-up batches. - -!!! warning "Row-count changes under the async engine" - The async engine (default) enforces row-count invariance in `process_before_batch()` and `process_after_batch()` — a processor returning a different row count raises `DatasetGenerationError`. Run row-filtering or expansion logic in `process_after_generation()`, which operates on the final dataset and supports row-count changes. The legacy sync engine (opt-out via `DATA_DESIGNER_ASYNC_ENGINE=0`) is permissive about row-count changes at all stages. - -A processor can implement any combination of these callbacks. The built-in processors use `process_after_batch()` by default. - -## Processor Types - -### 🗑️ Drop Columns Processor - -Removes specified columns from the output dataset. Dropped columns are saved separately in the `dropped-columns` directory for reference. - -!!! 
tip "Dropping Columns is More Easily Achieved via `drop = True`" - The Drop Columns Processor is different from others in the sense that it does not need to be explicitly added: setting `drop = True` when configuring a column will accomplish the same. - -**Configuration:** - -```python -import data_designer.config as dd - -processor = dd.DropColumnsProcessorConfig( - name="remove_intermediate", - column_names=["temp_calculation", "raw_input", "debug_info"], -) -``` - -**Behavior:** - -- Columns specified in `column_names` are removed from the output -- Original values are preserved in a separate parquet file -- Missing columns produce a warning but don't fail the build -- Column configs are automatically marked with `drop=True` when this processor is added - -**Use Cases:** - -- Removing intermediate columns used only for LLM context -- Cleaning up debug or validation columns before final output -- Separating sensitive data from the main dataset - -### 🔄 Schema Transform Processor - -Creates an additional dataset with a transformed schema using Jinja2 templates. The output is written to a separate directory alongside the main dataset. 
- -**Configuration:** - -```python -import data_designer.config as dd - -processor = dd.SchemaTransformProcessorConfig( - name="chat_format", - template={ - "messages": [ - {"role": "user", "content": "{{ question }}"}, - {"role": "assistant", "content": "{{ answer }}"}, - ], - "metadata": "{{ category | upper }}", - }, -) -``` - -**Behavior:** - -- Each key in `template` becomes a column in the transformed dataset -- Values are Jinja2 templates with access to all columns in the batch -- Complex structures (lists, nested dicts) are supported -- Output is saved to the `processors-files/{name}/` directory -- The original dataset passes through unchanged - -**Template Capabilities:** - -- **Variable substitution**: `{{ column_name }}` -- **Filters**: `{{ text | upper }}`, `{{ text | lower }}`, `{{ text | trim }}` -- **Nested structures**: Arbitrarily deep JSON structures -- **Lists**: `["{{ col1 }}", "{{ col2 }}"]` - -**Use Cases:** - -- Converting flat columns to chat message format -- Restructuring data for specific model training formats -- Creating derived views without modifying the source dataset - -## Using Processors - -Add processors to your configuration using the builder's `add_processor` method: - -```python -import data_designer.config as dd - -builder = dd.DataDesignerConfigBuilder() - -# ... add columns ... - -# Drop intermediate columns -builder.add_processor( - dd.DropColumnsProcessorConfig( - name="cleanup", - column_names=["scratch_work", "raw_context"], - ) -) - -# Transform to chat format -builder.add_processor( - dd.SchemaTransformProcessorConfig( - name="chat_format", - template={ - "messages": [ - {"role": "user", "content": "{{ question }}"}, - {"role": "assistant", "content": "{{ answer }}"}, - ], - }, - ) -) -``` - -### Execution Order - -Processors execute in the order they're added. Plan accordingly when one processor's output affects another. 
- -## Processor Plugins - -You can extend Data Designer with custom processors via the [plugin system](../plugins/overview.md). Once installed, plugin processors are automatically discovered and can be used with `add_processor()` like built-in processors. - -```python -from my_processor_plugin.config import MyProcessorConfig - -builder.add_processor( - MyProcessorConfig( - name="my_processor", - # ... plugin-specific parameters ... - ) -) -``` - -For implementation instructions across all plugin types, see [Build Your Own](../plugins/build_your_own.md). - -## Configuration Parameters - -### Common Parameters - -| Parameter | Type | Description | -|-----------|------|-------------| -| `name` | str | Identifier for the processor, used in output directory names | - -### DropColumnsProcessorConfig - -| Parameter | Type | Description | -|-----------|------|-------------| -| `column_names` | list[str] | Columns to remove from output | - -### SchemaTransformProcessorConfig - -| Parameter | Type | Description | -|-----------|------|-------------| -| `template` | dict[str, Any] | Jinja2 template defining the output schema. Must be JSON-serializable. | diff --git a/docs/concepts/security.md b/docs/concepts/security.md deleted file mode 100644 index 6b365befd..000000000 --- a/docs/concepts/security.md +++ /dev/null @@ -1,203 +0,0 @@ -# Security - -Data Designer can run in two very different trust models: - -- **Trusted / monolithic**: The same user or team writes the config and runs the engine. -- **Untrusted / shared execution**: One user submits a config and a different process, service, or team executes it. - -That distinction matters for features that evaluate user-supplied configuration at runtime, such as Jinja template rendering. In a trusted local workflow, broader template flexibility may be acceptable. In a shared-service deployment, user-supplied Jinja becomes part of the engine's remote code execution surface. 
A template sandbox escape would execute inside the process running Data Designer. - -See [Deployment Options](deployment-options.md) for the architectures where that trust boundary changes. - -## Jinja Rendering Modes - -Data Designer exposes the renderer choice through `RunConfig`: - -```python -import data_designer.config as dd - -run_config = dd.RunConfig( - jinja_rendering_engine=dd.JinjaRenderingEngine.SECURE, -) -``` - -`SECURE` is the default. Opt into `NATIVE` only when you are comfortable treating the config author and the engine operator as the same trust domain. - -| Mode | What it uses | Best fit | -|------|---------------|----------| -| `SECURE` | Data Designer's hardened renderer built on top of Jinja2's sandbox | Shared services, microservices, internal platforms, or any deployment where config submission is separated from execution | -| `NATIVE` | Jinja2's built-in sandbox with Data Designer's variable whitelist | Local library usage and other trusted, monolithic workflows that want broader Jinja behavior | - -!!! warning "Treat untrusted Jinja as a security boundary" - If many users can submit configs to one engine, or if configs are accepted over an API and executed elsewhere, keep `JinjaRenderingEngine.SECURE`. In that model, Jinja templates are no longer just prompt-formatting helpers. They are untrusted user programs being evaluated by your engine. - -## Compatibility Matrix - -`NATIVE` is not an unrestricted Python template engine. The matrix below shows what each mode permits, restricts, or adds on top of Jinja2's standard sandbox behavior. 
- -| Capability | `NATIVE` | `SECURE` | -|------|------|----------| -| Jinja2 `ImmutableSandboxedEnvironment` baseline | Yes | Yes | -| References to explicitly provided dataset variables only | Yes | Yes | -| Standard Jinja built-in filter set | Yes | Subset only | -| Data Designer `jsonpath` filter | Yes | Yes | -| `import`, `macro`, `set`, `extends`, `block` support | Yes | No | -| Nested or recursive `for` loops | Yes | No | -| Unbounded AST complexity | Yes | No | -| Template context sanitized to JSON-compatible types before render | No | Yes | -| Empty, oversized, or built-in-like rendered output is permitted | Yes | No | - -## What `SECURE` Adds on Top of Standard Jinja Sandbox - -The `SECURE` renderer uses a hardened environment implemented in the [renderer source file on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/blob/v0.5.6/packages/data-designer-engine/src/data_designer/engine/processing/ginja/environment.py). Compared with the standard Jinja sandbox, it adds several additional controls. - -### Record Sanitization Before Render - -Before rendering, `SECURE` forces template context through a JSON-compatible serialization step. That means remote templates operate on plain data, not arbitrary Python objects. - -```python -# Intended shape for remote template context -record = { - "user": { - "name": "alice", - "roles": ["admin", "reviewer"], - } -} -``` - -```python -# Not the kind of server-side object SECURE wants to expose directly -record = { - "user": SomePythonObject(...), -} -``` - -In a remote execution setting, exposing rich Python objects increases the risk of attribute- and method-based sandbox escapes. 
Jinja's [sandbox security considerations](https://jinja.palletsprojects.com/en/stable/sandbox/) note that the sandbox is not a complete security boundary, and past escapes have included [`str.format` (CVE-2016-10745)](https://nvd.nist.gov/vuln/detail/CVE-2016-10745), [`str.format_map` (CVE-2019-10906)](https://github.com/advisories/GHSA-462w-v97r-4m45), [indirect `str.format` references (CVE-2024-56326)](https://nvd.nist.gov/vuln/detail/CVE-2024-56326), and [`|attr`-based access to `format` (CVE-2025-27516)](https://nvd.nist.gov/vuln/detail/CVE-2025-27516); PortSwigger's [server-side template injection research](https://portswigger.net/research/server-side-template-injection) covers the broader object-traversal pattern. - -### Filter Allowlist - -`SECURE` keeps only a small approved subset of Jinja filters plus the Data Designer `jsonpath` filter. If a filter is not on that allowlist, the template is rejected. Common excluded filters are: - -| Disallowed filters | Why they are excluded in `SECURE` | -| --- | --- | -| `attr`, `xmlattr` | These add dynamic attribute lookup or attribute-name construction, which widens the object-traversal surface in untrusted templates. | -| `map`, `select`, `reject`, `selectattr`, `rejectattr`, `groupby`, `batch`, `slice`, `sum` | These make templates behave more like a data-processing language and can multiply compute across large inputs. | -| `join`, `format`, `indent`, `wordwrap`, `center`, `filesizeformat` | These expand presentation and composition logic inside the template. `SECURE` keeps formatting logic narrow so templates stay close to interpolation. | -| `default`, `d`, `dictsort`, `count`, `wordcount`, `pprint`, `tojson` | These encourage fallback logic, secondary data shaping, or debug-style output inside the template rather than in the engine or config layer. | -| `safe`, `striptags`, `urlize` | These are primarily HTML-oriented output transforms and are unnecessary for server-side dataset rendering. 
| - -Some omitted convenience filters, such as the `e` alias for `escape`, are excluded because `SECURE` uses a small explicit allowlist. The current implementation does not assign each omitted filter its own separate security rationale. - -Use `NATIVE` when full Jinja filter compatibility matters more than the additional restrictions used for untrusted template execution. - -### Template Features Removed - -`SECURE` rejects `import`, `macro`, `set`, `extends`, and `block`. - -```jinja -{% macro render_name(name) %}{{ name }}{% endmacro %} -{{ render_name(customer_name) }} -``` - -```jinja -{% set temp = user_id %} -{{ temp }} -``` - -Those features are useful in trusted authoring environments, but they also make user templates more expressive and stateful. In a remote execution model, `SECURE` intentionally narrows the language so templates stay closer to data interpolation than to a reusable programming layer. - -### Loop Restrictions - -`SECURE` rejects recursive loops and nested `for` loops. - -```jinja -{% for row in rows %} - {% for item in row %} - {{ item }} - {% endfor %} -{% endfor %} -``` - -Nested and recursive loops are especially risky in shared execution because they can amplify compute cost and output size in ways that are hard to reason about from the outside. - -### AST Complexity Limits - -`SECURE` statically analyzes the parsed Jinja AST and rejects templates that exceed the current limits of 600 nodes or depth 10. - -```jinja -{% if a %} - {% if b %} - {% if c %} - {{ value }} - {% endif %} - {% endif %} -{% endif %} -``` - -This is not about any one feature being unsafe by itself. It is about limiting how much control flow and composition untrusted templates can pack into a single server-side render operation, which helps prevent compute bombs in shared execution. - -### `self` References Blocked - -`SECURE` rejects references to `self`. - -```jinja -{{ self }} -``` - -The point is to avoid exposing template internals back to the submitter. 
In a remote setting, even accidental access to those internals is unnecessary surface area. - -### Rendered Output Guards - -`SECURE` validates rendered output after template execution. It rejects empty output, very large output, and strings that look like Python built-in or function representations. - -```jinja -{{ "" }} -``` - -```text - - -``` - -These checks matter because not all bad outcomes come from parse-time behavior. Some templates are syntactically valid but still produce output that is clearly broken, oversized, or revealing internal implementation details. - -### Sanitized User-Facing Errors - -At the engine boundary, `SECURE` normalizes most template failures into a generic invalid-template message. - -```text -User provided prompt generation template is invalid. -``` - -That matters in remote execution because exception details can leak information about server-side implementation, supported objects, or internal execution paths that untrusted users do not need to see. - -These controls exist because the standard sandbox is a good baseline, but shared-service deployments need a narrower and more defensive execution model. - -## Why This Matters in Multi-User Deployments - -The security posture changes as soon as config submission and execution are separated. - -Examples: - -- A centralized Data Designer service accepts configs from many users. -- An internal platform lets users upload or edit configs that are executed by a background worker. -- A REST API accepts Jinja-containing configs and runs them on server-side infrastructure. - -In those environments, templates are no longer just local convenience syntax. They are untrusted input being evaluated by infrastructure the submitter does not control. In practice, that makes Jinja rendering a remote code execution concern, which is why `SECURE` exists and why it remains the default. 
- -If you are deciding between local library usage and a shared service model, read [Deployment Options](deployment-options.md). The library patterns are often still "trusted" deployments. The shared microservice pattern is not. - -## When To Use `NATIVE` - -Use `NATIVE` when all of the following are true: - -- The person submitting the config is also the person running the engine, or they are in the same trusted operational boundary. -- You want broader standard Jinja behavior than `SECURE` allows. -- You understand that this is a flexibility tradeoff, not the safer default. - -For example, this is often reasonable in a notebook, local script, or other single-user library workflow. - -## Related Reading - -- [Deployment Options](deployment-options.md) -- [Run Config Reference](../code_reference/config/run_config.md) diff --git a/docs/concepts/seed-datasets.md b/docs/concepts/seed-datasets.md deleted file mode 100644 index 581fc2f8f..000000000 --- a/docs/concepts/seed-datasets.md +++ /dev/null @@ -1,352 +0,0 @@ -# Seed Datasets - -Seed datasets let you bootstrap synthetic data generation from existing data. Instead of generating everything from scratch, you provide a dataset whose columns become available as context in your prompts and expressions—grounding your synthetic data in real-world examples. - -!!! tip "When to Use Seed Datasets" - Seed datasets shine when you have **real data you want to build on**: - - - Product catalogs → generate customer reviews - - Medical diagnoses → generate physician notes - - Code snippets → generate documentation - - Company profiles → generate financial reports - - The seed data provides realism and domain specificity; Data Designer adds volume and variation. 
- -## The Basic Pattern - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -# Define your model configuration -model_configs = [ - dd.ModelConfig( - alias="my-model", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - ) -] - -config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs) - -# 1. Attach a seed dataset -seed_source = dd.LocalFileSeedSource(path="products.csv") -config_builder.with_seed_dataset(seed_source) - -# 2. Reference seed columns in your prompts -config_builder.add_column( - dd.LLMTextColumnConfig( - name="review", - model_alias="my-model", - prompt="""\ -Write a customer review for {{ product_name }}. -Category: {{ category }} -Price: ${{ price }} -""", - ) -) -``` - -Every column in your seed dataset becomes available as a Jinja2 variable in prompts and expressions. Data Designer automatically: - -- Reads rows from the seed dataset -- Injects seed column values into templates - -## Seed Sources - -Data Designer supports multiple ways to provide seed data, including: - -### 📁 LocalFileSeedSource - -Load from a local file—CSV, Parquet, or JSON. - -```python -# Single file -seed_source = dd.LocalFileSeedSource(path="data/products.csv") - -# Parquet files with wildcard -seed_source = dd.LocalFileSeedSource(path="data/products/*.parquet") -``` - -!!! note "Supported Formats" - - CSV (`.csv`) - - Parquet (`.parquet`) - - JSON (`.json`, `.jsonl`) - -### 🤗 HuggingFaceSeedSource - -Load directly from HuggingFace datasets without downloading manually. - -```python -seed_source = dd.HuggingFaceSeedSource( - path="datasets/gretelai/symptom_to_diagnosis/data/train.parquet", - token="hf_...", # Optional, for private datasets -) -``` - -### 🐼 DataFrameSeedSource - -Use an in-memory pandas DataFrame—great for preprocessing or combining multiple sources. 
- -```python -import pandas as pd - -df = pd.read_csv("raw_data.csv") -df = df[df["quality_score"] > 0.8] # Filter to high-quality rows - -seed_source = dd.DataFrameSeedSource(df=df) -``` - -!!! warning "Serialization" - `DataFrameSeedSource` can't be serialized to YAML/JSON configs. Use `LocalFileSeedSource` if you need to save and share configurations. - -### 🗂️ DirectorySeedSource - -Treat a directory tree as the seed dataset. Each matching file becomes one seed row, exposing file metadata you can reference in prompts and expressions. - -```python -seed_source = dd.DirectorySeedSource( - path="docs/", - file_pattern="*.md", - recursive=True, -) - -config_builder.with_seed_dataset(seed_source) -config_builder.add_column( - dd.ExpressionColumnConfig( - name="doc_label", - expr="{{ source_kind }}::{{ relative_path }}", - ) -) -``` - -Directory-backed seed datasets expose these columns: - -- `source_kind` — always `"directory_file"` -- `source_path` — full path to the matched file -- `relative_path` — path relative to the configured directory -- `file_name` — basename of the matched file - -!!! note "Filesystem matching" - `file_pattern` matches file names only, not relative paths. `recursive=True` is the default, so nested subdirectories are searched unless you turn it off. - -### 📄 FileContentsSeedSource - -Read matching text files into the seed dataset. Each file becomes one seed row with the same metadata as `DirectorySeedSource`, plus the decoded file contents in a `content` column. - -```python -seed_source = dd.FileContentsSeedSource( - path="docs/", - file_pattern="*.md", - encoding="utf-8", -) - -config_builder.with_seed_dataset(seed_source) -config_builder.add_column( - dd.LLMTextColumnConfig( - name="summary", - model_alias="my-model", - prompt="""\ -Summarize the following document. 
- -File: {{ file_name }} -Path: {{ relative_path }} - -{{ content }} -""", - ) -) -``` - -`FileContentsSeedSource` exposes these seeded columns: - -- `source_kind` — always `"file_contents"` -- `source_path` — full path to the matched file -- `relative_path` — path relative to the configured directory -- `file_name` — basename of the matched file -- `content` — decoded text contents of the matched file - -!!! tip "Custom Filesystem Readers" - If you need custom row construction, fan-out behavior, or expensive hydration logic for any directory-backed seed source, build a custom `FileSystemSeedReader` and pass it via `DataDesigner(seed_readers=[...])`. For packaging and registration, see [Build Your Own](../plugins/build_your_own.md). - -!!! note "Encoding" - `encoding="utf-8"` is the default. Set a different Python codec name if your files use another text encoding. - -### 🤖 AgentRolloutSeedSource - -Parse agent rollout trace files (e.g. from ATIF, Claude Code, Codex, or Hermes Agent) into a structured seed dataset. Each trace becomes one seed row with normalized metadata and the full message history, ready for distillation or analysis pipelines. - -```python -seed_source = dd.AgentRolloutSeedSource( - format=dd.AgentRolloutFormat.CLAUDE_CODE, -) - -config_builder.with_seed_dataset(seed_source) -``` - -!!! info "Dedicated guide" - See [Agent Rollout Ingestion](agent-rollout-ingestion.md) for the rollout-specific guide, including: - - - supported rollout formats and default locations - - format-specific configuration details like `path` and `file_pattern` - - the full normalized seeded-column schema exposed by `AgentRolloutSeedSource` - -!!! tip "Trace Distillation" - See the [Agent Rollout Trace Distillation recipe](../recipes/trace_ingestion/agent_rollout_distillation.md) for a complete example that turns agent traces into supervised fine-tuning data. - -## Sampling Strategies - -Control how rows are read from the seed dataset. 
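As a rough mental model for the two sampling strategies in this section, the sketch below shows ordered reading with wrap-around and shuffling before reading. It is plain Python, not Data Designer's implementation: `ordered_rows` and `shuffled_rows` are hypothetical helper names, and the assumption that shuffled rows wrap around the same way as ordered rows is not confirmed by the docs.

```python
import random
from itertools import cycle, islice


def ordered_rows(seed_rows, num_records):
    """Read rows in their original order, wrapping around when the seed is exhausted."""
    return list(islice(cycle(seed_rows), num_records))


def shuffled_rows(seed_rows, num_records, seed=0):
    """Shuffle a copy of the rows first, then read them (wrap-around is an assumption)."""
    rows = list(seed_rows)
    random.Random(seed).shuffle(rows)
    return ordered_rows(rows, num_records)


seed_rows = ["a", "b", "c"]
print(ordered_rows(seed_rows, 7))  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']
```

With three seed rows and seven requested records, each seed row is reused two or three times, which is the trade-off to keep in mind when the generation volume exceeds the seed size.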
- -### Ordered (Default) - -Rows are read sequentially in their original order. Each generated record corresponds to the next row in the seed dataset. If you generate more records than exist in the seed dataset, it will cycle in order until completion. - -```python -config_builder.with_seed_dataset( - seed_source, - sampling_strategy=dd.SamplingStrategy.ORDERED, -) -``` - -### Shuffle - -Rows are randomly shuffled before sampling. Useful when your seed data has some ordering you want to break. - -```python -config_builder.with_seed_dataset( - seed_source, - sampling_strategy=dd.SamplingStrategy.SHUFFLE, -) -``` - -## Selection Strategies - -Select a subset of your seed dataset—useful for large datasets or parallel processing. - -### IndexRange - -Select a specific range of row indices. - -```python -# Use only rows 100-199 (100 rows total) -config_builder.with_seed_dataset( - seed_source, - selection_strategy=dd.IndexRange(start=100, end=199), -) -``` - -### PartitionBlock - -Split the dataset into N equal partitions and select one. Perfect for distributing work across multiple jobs. - -```python -# Split into 5 partitions, use the 3rd one (index=2, zero-based) -config_builder.with_seed_dataset( - seed_source, - selection_strategy=dd.PartitionBlock(index=2, num_partitions=5), -) -``` - -!!! tip "Parallel Processing" - Run 5 parallel jobs, each with a different partition index, to process a large seed dataset in parallel: - - ```python - # Job 0: PartitionBlock(index=0, num_partitions=5) - # Job 1: PartitionBlock(index=1, num_partitions=5) - # Job 2: PartitionBlock(index=2, num_partitions=5) - # ... - ``` - -### Combining Strategies - -Sampling and selection strategies work together. 
For example, shuffle rows *within* a specific partition: - -```python -config_builder.with_seed_dataset( - seed_source, - sampling_strategy=dd.SamplingStrategy.SHUFFLE, - selection_strategy=dd.PartitionBlock(index=0, num_partitions=10), -) -``` - -## Complete Example - -Here's a complete example generating physician notes from a symptom-to-diagnosis seed dataset: - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -data_designer = DataDesigner() - -model_configs = [ - dd.ModelConfig( - alias="medical-notes", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - ) -] - -config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs) - -# Attach seed dataset (has 'diagnosis' and 'symptoms' columns) -seed_source = dd.LocalFileSeedSource(path="symptom_to_diagnosis.csv") -config_builder.with_seed_dataset(seed_source) - -# Generate patient info -config_builder.add_column( - dd.SamplerColumnConfig( - name="patient", - sampler_type=dd.SamplerType.PERSON_FROM_FAKER, - params=dd.PersonFromFakerSamplerParams(), - ) -) - -config_builder.add_column( - dd.ExpressionColumnConfig( - name="patient_name", - expr="{{ patient.first_name }} {{ patient.last_name }}", - ) -) - -# Generate notes grounded in seed data -config_builder.add_column( - dd.LLMTextColumnConfig( - name="physician_notes", - model_alias="medical-notes", - prompt="""\ -You are a physician writing notes after a patient visit. - -Patient: {{ patient_name }} -Diagnosis: {{ diagnosis }} -Reported Symptoms: {{ symptoms }} - -Write detailed clinical notes for this visit. -""", - ) -) - -# Preview -preview = data_designer.preview(config_builder, num_records=5) -preview.display_sample_record() -``` - -## Best Practices - -### Keep Seed Data Clean - -Garbage in, garbage out. 
Clean your seed data before using it: - -- Remove duplicates -- Fix encoding issues -- Filter out low-quality rows -- Standardize column names - -### Match Generation Volume to Seed Size - -If your seed dataset has 1,000 rows and you generate 10,000 records, each seed row will be used ~10 times. Consider whether that's appropriate for your use case. - -### Use Seed Data for Diversity Control - -Seed datasets are excellent for controlling the distribution of your synthetic data. Want 30% electronics, 50% clothing, 20% home goods? Curate your seed dataset to match. diff --git a/docs/concepts/tool_use_and_mcp.md b/docs/concepts/tool_use_and_mcp.md deleted file mode 100644 index ec2771f3f..000000000 --- a/docs/concepts/tool_use_and_mcp.md +++ /dev/null @@ -1,69 +0,0 @@ -# Tool Use & MCP - -Tool use lets LLM columns call external tools during generation (e.g., lookups, calculations, retrieval, domain services). Data Designer supports tool use via the **Model Context Protocol (MCP)**, which standardizes how tools are discovered and invoked. - -## Quick Start - -1. Configure an MCP provider ([Local](mcp/mcp-providers.md#localstdiomcpprovider-subprocess) or [Remote](mcp/mcp-providers.md#mcpprovider-remote-sse)) -2. Create a [ToolConfig](mcp/tool-configs.md) referencing your provider -3. Add `tool_alias` to your [LLM column](mcp/enabling-tools.md) - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -# 1. Configure provider - -## Local Stdio provider -mcp_provider = dd.LocalStdioMCPProvider( - name="demo-mcp", - command="python", - args=["-m", "my_mcp_server"], -) - -## Remote provider -# mcp_provider = dd.MCPProvider( -# name="remote-mcp", -# endpoint="https://mcp.example.invalid/sse", -# api_key="REMOTE_MCP_API_KEY", -# ) - -data_designer = DataDesigner(mcp_providers=[mcp_provider]) - -# 2. 
Create tool config -tool_config = dd.ToolConfig( - tool_alias="my-tools", - providers=["demo-mcp"], -) - -builder = dd.DataDesignerConfigBuilder(tool_configs=[tool_config]) - -# 3. Use tools in column -builder.add_column( - dd.LLMTextColumnConfig( - name="answer", - prompt="Use tools to answer: {{ question }}", - model_alias="nvidia-text", - tool_alias="my-tools", - ) -) -``` - -## Guides - -| Guide | Description | -|-------|-------------| -| **[MCP Providers](mcp/mcp-providers.md)** | Configure local subprocess or remote SSE providers | -| **[Tool Configs](mcp/tool-configs.md)** | Define tool permissions and limits | -| **[Enabling Tools on Columns](mcp/enabling-tools.md)** | Use tools in LLM generation | -| **[Configure via CLI](mcp/configure-mcp-cli.md)** | Interactive CLI configuration | -| **[Traces](traces.md)** | Capture full conversation history | -| **[Safety & Limits](mcp/safety-and-limits.md)** | Allowlists, budgets, timeouts | - -## Example - -See the [PDF Q&A Recipe](../recipes/mcp_and_tooluse/pdf_qa.md) for a complete working example. - -## Code Reference - -For config objects, see [MCP Configuration Reference](../code_reference/config/mcp.md). For runtime internals, see [Engine MCP Reference](../code_reference/engine/mcp.md). diff --git a/docs/concepts/traces.md b/docs/concepts/traces.md deleted file mode 100644 index 1863ef240..000000000 --- a/docs/concepts/traces.md +++ /dev/null @@ -1,216 +0,0 @@ -# Message Traces - -Traces capture the conversation history during LLM generation, including system prompts, user prompts, model reasoning, tool calls, tool results, and the final response. This visibility is essential for understanding model behavior, debugging generation issues, and iterating on prompts. - -Traces are also useful in certain scenarios as the target output of the workflow, e.g. producing an SFT dataset for fine-tuning tool-use capability.
- -## Overview - -When generating content with LLM columns, you often need to understand what happened during generation: - -- What system prompt was used? -- What did the rendered user prompt look like? -- Did the model provide any reasoning content? -- Which tools were called (if tool use is enabled)? -- What arguments were passed to tools? -- What did tools return? -- Did the model retry after failures? -- How did the model arrive at the final answer? - -Traces provide this visibility by capturing the ordered message history for each generation, including any multi-turn conversations that occur during tool use or retry scenarios. - -## Trace Types - -Data Designer supports three trace modes via the `TraceType` enum: - -| TraceType | Description | -|-----------|-------------| -| `TraceType.NONE` | No trace captured (default) | -| `TraceType.LAST_MESSAGE` | Only the final assistant message is captured | -| `TraceType.ALL_MESSAGES` | Full conversation history (system/user/assistant/tool) | - -## Enabling Traces - -### Per-Column (Recommended) - -Set `with_trace` on specific LLM columns: - -```python -import data_designer.config as dd - -# Capture full conversation history -builder.add_column( - dd.LLMTextColumnConfig( - name="answer", - prompt="Answer: {{ question }}", - model_alias="nvidia-text", - with_trace=dd.TraceType.ALL_MESSAGES, # Full trace - ) -) - -# Capture only the final assistant response -builder.add_column( - dd.LLMTextColumnConfig( - name="summary", - prompt="Summarize: {{ text }}", - model_alias="nvidia-text", - with_trace=dd.TraceType.LAST_MESSAGE, # Just the final response - ) -) -``` - -## Trace Column Naming - -When enabled, LLM columns produce an additional side-effect column: - -- `{column_name}__trace` - -For example, if your column is named `"answer"`, the trace column will be `"answer__trace"`. - -## Trace Data Structure - -Each trace is a `list[dict]` where each dict represents a message in the conversation. 
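A minimal sketch of walking such a list, assuming only the message fields documented in this section (`role`, `content` as a list of `{"type": "text", "text": ...}` blocks, and an optional `tool_calls` list on assistant messages). `summarize_trace` is a hypothetical helper written for illustration, not part of Data Designer.

```python
def summarize_trace(trace):
    """Return one (role, text, tool_names) tuple per message in a trace."""
    summary = []
    for message in trace:
        # `content` is a list of blocks; keep only the text blocks.
        text = "".join(
            block.get("text", "")
            for block in message.get("content") or []
            if block.get("type") == "text"
        )
        # `tool_calls` only appears on assistant messages that request tools.
        tool_names = [
            call["function"]["name"] for call in message.get("tool_calls") or []
        ]
        summary.append((message["role"], text, tool_names))
    return summary


example_trace = [
    {"role": "user", "content": [{"type": "text", "text": "What is the capital of France?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "The capital of France is Paris."}]},
]

for role, text, tool_names in summarize_trace(example_trace):
    print(role, text, tool_names)
```

A helper like this can be applied to each value in a `{column_name}__trace` column when inspecting a preview, for example to count how often a given tool was requested.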
- -### Message Fields by Role - -| Role | Fields | Description | -|------|--------|-------------| -| `system` | `role`, `content` | System prompt setting model behavior. `content` is a list of blocks in ChatML format. | -| `user` | `role`, `content` | User prompt (rendered from template). `content` is a list of blocks (text + multimodal). | -| `assistant` | `role`, `content`, `tool_calls`, `reasoning_content` | Model response; `content` may be empty if only requesting tools. | -| `tool` | `role`, `content`, `tool_call_id` | Tool execution result; `tool_call_id` links to the request. | - -### Example Trace (Simple Generation) - -A basic trace without tool use: - -```python -[ - # System message (if configured) - { - "role": "system", - "content": [{"type": "text", "text": "You are a helpful assistant that provides clear, concise answers."}] - }, - # User message (the rendered prompt) - { - "role": "user", - "content": [{"type": "text", "text": "What is the capital of France?"}] - }, - # Final assistant response - { - "role": "assistant", - "content": [{"type": "text", "text": "The capital of France is Paris."}], - "reasoning_content": None # May contain reasoning if model supports it - } -] -``` - -### Example Trace (With Tool Use) - -When tool use is enabled, traces capture the full conversation including tool calls: - -```python -[ - # System message - { - "role": "system", - "content": [{"type": "text", "text": "You must call tools before answering. 
Only use tool results."}] - }, - # User message (the rendered prompt) - { - "role": "user", - "content": [{"type": "text", "text": "What documents are in the knowledge base about machine learning?"}] - }, - # Assistant requests tool calls - { - "role": "assistant", - "content": [{"type": "text", "text": ""}], - "tool_calls": [ - { - "id": "call_abc123", - "type": "function", - "function": { - "name": "list_docs", - "arguments": "{\"query\": \"machine learning\"}" - } - } - ] - }, - # Tool response (linked by tool_call_id) - { - "role": "tool", - "content": [{"type": "text", "text": "Found 3 documents: intro_ml.pdf, neural_networks.pdf, transformers.pdf"}], - "tool_call_id": "call_abc123" - }, - # Final assistant response - { - "role": "assistant", - "content": [{"type": "text", "text": "The knowledge base contains three documents about machine learning: ..."}] - } -] -``` - -### The tool_calls Structure - -When an assistant message includes tool calls: - -```python -{ - "id": "call_abc123", # Unique ID linking to tool response - "type": "function", # Always "function" for MCP tools - "function": { - "name": "search_docs", # Tool name - "arguments": "{...}" # JSON string of tool arguments - } -} -``` - -## Extracting Reasoning Content - -Some models (particularly those with extended thinking or chain-of-thought capabilities) expose their reasoning process separately via the `reasoning_content` field in assistant messages. While this is included in full traces, you may want to capture it separately without the overhead of storing the entire conversation history. 
- -### Dedicated Reasoning Column - -Set `extract_reasoning_content=True` on any LLM column to create a `{column_name}__reasoning_content` side-effect column: - -```python -import data_designer.config as dd - -builder.add_column( - dd.LLMTextColumnConfig( - name="solution", - prompt="Solve this math problem step by step: {{ problem }}", - model_alias="reasoning-model", - extract_reasoning_content=True, # Creates solution__reasoning_content - ) -) -``` - -The extracted reasoning content: - -- Contains only the `reasoning_content` from the **final** assistant message in the trace -- Is stripped of leading/trailing whitespace -- Is `None` if the model didn't provide reasoning content or if it was whitespace-only - -### When to Use Each Approach - -| Need | Approach | -|------|----------| -| Full conversation history for debugging | `with_trace=True` | -| Just the model's reasoning/thinking | `extract_reasoning_content=True` | -| Both conversation history and separate reasoning | Use both options | -| Fine-tuning data with reasoning | `extract_reasoning_content=True` for clean extraction | - -### Availability - -The `extract_reasoning_content` option is available on all LLM column types: - -- `LLMTextColumnConfig` -- `LLMCodeColumnConfig` -- `LLMStructuredColumnConfig` -- `LLMJudgeColumnConfig` - -## See Also - -- **[Agent Rollout Ingestion](agent-rollout-ingestion.md)**: Import external agent traces from disk into normalized seed rows -- **[Safety and Limits](mcp/safety-and-limits.md)**: Understand turn limits and timeout behavior diff --git a/docs/concepts/validators.md b/docs/concepts/validators.md deleted file mode 100644 index 043694ee7..000000000 --- a/docs/concepts/validators.md +++ /dev/null @@ -1,333 +0,0 @@ -# Validators - -Validators are quality assurance mechanisms in Data Designer that check generated content against rules and return structured pass/fail results. 
They enable automated verification of data for correctness, code quality, and adherence to specifications. - -!!! note "Quality Gates for Generated Data" - Validators act as **quality gates** in your generation pipeline. Use them to filter invalid records, score code quality, verify format compliance, or integrate with external validation services. - -## Overview - -Validation columns execute validation logic against target columns and produce structured results indicating: - -- **`is_valid`**: Boolean pass/fail status -- **Additional metadata**: Error messages, scores, severity levels, and custom fields - -Validators currently support three execution strategies: - -1. **Code validation**: Lint and check Python or SQL code using industry-standard tools -2. **Local callable validation**: Execute custom Python functions for flexible validation logic -3. **Remote validation**: Send data to HTTP endpoints for external validation services - -## Validator Types - -### 🐍 Python Code Validator - -The Python code validator runs generated Python code through [Ruff](https://github.com/astral-sh/ruff), a fast Python linter that checks for syntax errors, undefined variables, and code quality issues. 
- -**Configuration:** - -```python -import data_designer.config as dd - -validator_params = dd.CodeValidatorParams(code_lang=dd.CodeLang.PYTHON) -``` - -**Validation Output:** - -Each validated record returns: - -- **`is_valid`**: `True` if no fatal or error-level issues found -- **`python_linter_score`**: Quality score from 0-10 (based on pylint formula) -- **`python_linter_severity`**: Highest severity level found (`"none"`, `"convention"`, `"refactor"`, `"warning"`, `"error"`, `"fatal"`) -- **`python_linter_messages`**: List of linter messages with line numbers, columns, and descriptions - -**Severity Levels:** - -- **Fatal**: Syntax errors preventing code execution -- **Error**: Undefined names, invalid syntax -- **Warning**: Code smells and potential issues -- **Refactor**: Simplification opportunities -- **Convention**: Style guide violations - -A record is marked valid if it has no messages or only messages at warning/convention/refactor levels. - -**Example Validation Result:** - -```python -{ - "is_valid": False, - "python_linter_score": 0, - "python_linter_severity": "error", - "python_linter_messages": [ - { - "type": "error", - "symbol": "F821", - "line": 1, - "column": 7, - "message": "Undefined name `it`" - } - ] -} -``` - -### 🗄️ SQL Code Validator - -The SQL code validator uses [SQLFluff](https://github.com/sqlfluff/sqlfluff), a dialect-aware SQL linter that checks query syntax and structure. - -**Configuration:** - -```python -import data_designer.config as dd - -validator_params = dd.CodeValidatorParams(code_lang=dd.CodeLang.SQL_POSTGRES) -``` - -!!! tip "Multiple Dialects" - The SQL code validator supports multiple dialects: `SQL_POSTGRES`, `SQL_ANSI`, `SQL_MYSQL`, `SQL_SQLITE`, `SQL_TSQL` and `SQL_BIGQUERY`. 
- -**Validation Output:** - -Each validated record returns: - -- **`is_valid`**: `True` if no parsing errors found -- **`error_messages`**: Concatenated error descriptions (empty string if valid) - -The validator focuses on parsing errors (PRS codes) that indicate malformed SQL. It also checks for common pitfalls like `DECIMAL` definitions without scale parameters. - -**Example Validation Result:** - -```python -# Valid SQL -{ - "is_valid": True, - "error_messages": "" -} - -# Invalid SQL -{ - "is_valid": False, - "error_messages": "PRS: Line 1, Position 1: Found unparsable section: 'NOT SQL'" -} -``` - -### 🔧 Local Callable Validator - -The local callable validator executes custom Python functions for flexible validation logic. - -**Configuration:** - -```python -import pandas as pd - -import data_designer.config as dd - -def my_validation_function(df: pd.DataFrame) -> pd.DataFrame: - """Validate that values are positive. - - Args: - df: DataFrame with target columns - - Returns: - DataFrame with is_valid column and optional metadata - """ - result = pd.DataFrame() - result["is_valid"] = df["price"] > 0 - result["error_message"] = result["is_valid"].apply( - lambda valid: "" if valid else "Price must be positive" - ) - return result - -validator_params = dd.LocalCallableValidatorParams( - validation_function=my_validation_function, - output_schema={ # Optional: enforce output schema - "type": "object", - "properties": { - "data": { - "type": "array", - "items": { - "type": "object", - "properties": { - "is_valid": {"type": ["boolean", "null"]}, - "error_message": {"type": "string"} - }, - "required": ["is_valid"] - } - } - } - } -) -``` - -**Function Requirements:** - -- **Input**: DataFrame with target columns -- **Output**: DataFrame with `is_valid` column (boolean or null) -- **Extra fields**: Any additional columns become validation metadata - -The `output_schema` parameter is optional but recommended—it validates the function's output against a JSON schema, 
catching unexpected return formats. - -### 🌐 Remote Validator - -The remote validator sends data to HTTP endpoints for validation-as-a-service. This is useful for when you have validation software that needs to run on external compute and you can expose it through a service. Some examples are: - -- External linting services -- Security scanners -- Domain-specific validators -- Proprietary validation systems - -!!! note "Authentication" - Currently, the remote validator is only able to perform unauthenticated API calls. When implementing your own service, you can rely on network isolation for security. If you need to reach a service that requires authentication, you should implement a local proxy. - -**Configuration:** - -```python -import data_designer.config as dd - -validator_params = dd.RemoteValidatorParams( - endpoint_url="https://api.example.com/validate", - timeout=30.0, # Request timeout in seconds - max_retries=3, # Retry attempts on failure - retry_backoff=2.0, # Exponential backoff factor - max_parallel_requests=4, # Concurrent request limit - output_schema={ # Optional: enforce response schema - "type": "object", - "properties": { - "data": { - "type": "array", - "items": { - "type": "object", - "properties": { - "is_valid": {"type": ["boolean", "null"]}, - "confidence": {"type": "string"} - } - } - } - } - } -) -``` - -**Request Format:** - -The validator sends POST requests with this structure: - -```json -{ - "data": [ - {"column1": "value1", "column2": "value2"}, - {"column1": "value3", "column2": "value4"} - ] -} -``` - -**Expected Response Format:** - -The endpoint must return: - -```json -{ - "data": [ - { - "is_valid": true, - "custom_field": "any additional metadata" - }, - { - "is_valid": false, - "custom_field": "more metadata" - } - ] -} -``` - -**Retry Behavior:** - -The validator automatically retries on: - -- Network errors -- HTTP status codes: 429 (rate limit), 500, 502, 503, 504 - -Failed requests use exponential backoff: `delay = 
retry_backoff^attempt`. - -**Parallelization:** - -Set `max_parallel_requests` to control concurrency. Higher values improve throughput but increase server load. The validator batches requests according to the `batch_size` parameter in the validation column configuration. - -## Using Validators in Columns - -Add validation columns to your configuration using the builder's `add_column` method: - -```python -import data_designer.config as dd - -builder = dd.DataDesignerConfigBuilder() - -# Generate Python code -builder.add_column( - dd.LLMCodeColumnConfig( - name="sorting_algorithm", - prompt="Write a Python function to sort a list using bubble sort.", - code_lang="python", - model_alias="my-model" - ) -) - -# Validate the generated code -builder.add_column( - dd.ValidationColumnConfig( - name="code_validation", - target_columns=["sorting_algorithm"], - validator_type="code", - validator_params=dd.CodeValidatorParams(code_lang=dd.CodeLang.PYTHON), - batch_size=10, - drop=False, - ) -) -``` - -The `target_columns` parameter specifies which columns to validate. All target columns are passed to the validator together (except for code validators, which process each column separately). - -### Configuration Parameters - -See more about parameters used to instantiate `ValidationColumnConfig` in the [code reference](../code_reference/config/column_configs.md#data_designer.config.column_configs.ValidationColumnConfig). - -### Batch Size Considerations - -Larger batch sizes improve efficiency but consume more memory: - -- **Code validators**: 5-20 records (file I/O overhead) -- **Local callable**: 10-50 records (depends on function complexity) -- **Remote validators**: 1-10 records (network latency, server capacity) - -Adjust based on: - -- Validator computational cost -- Available memory -- Network bandwidth (for remote validators) -- Server rate limits - -If the validation logic uses information from other samples, only samples in the batch will be considered. 
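To illustrate the point about batch-scoped logic, here is a plain-Python sketch, assuming a hypothetical duplicate check that compares each record with the others it can see. The helpers `batched` and `flag_duplicates` are illustrative names only; a real local callable validator receives a DataFrame, as described earlier.

```python
def batched(records, batch_size):
    """Yield consecutive slices of at most `batch_size` records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def flag_duplicates(batch):
    """Mark a record invalid if the same value appeared earlier in this batch."""
    seen = set()
    results = []
    for value in batch:
        results.append({"is_valid": value not in seen})
        seen.add(value)
    return results


records = ["x", "y", "x", "x"]
results = [r for batch in batched(records, 2) for r in flag_duplicates(batch)]
print([r["is_valid"] for r in results])  # [True, True, True, False]
```

The third record duplicates the first, but they fall into different batches, so only the duplicate inside the second batch is flagged. A check like this would need a batch size that covers the whole dataset to see every record at once.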
- -### Multiple Column Validation - -Validate multiple columns simultaneously: - -```python -import data_designer.config as dd - -builder.add_column( - dd.ValidationColumnConfig( - name="multi_column_validation", - target_columns=["column_a", "column_b", "column_c"], - validator_type="remote", - validator_params=dd.RemoteValidatorParams( - endpoint_url="https://api.example.com/validate" - ) - ) -) -``` - -**Note**: Code validators always process each target column separately, even when multiple columns are specified. Local callable and remote validators receive all target columns together. - -## See Also - -- [Validator Parameters Reference](../code_reference/config/validator_params.md): Configuration object schemas diff --git a/docs/concepts/workflow-chaining.md b/docs/concepts/workflow-chaining.md deleted file mode 100644 index 2b8c3cc3f..000000000 --- a/docs/concepts/workflow-chaining.md +++ /dev/null @@ -1,107 +0,0 @@ -# Workflow Chaining - -!!! warning "Experimental Feature" - Workflow chaining is currently **experimental** and under active development. The documentation, examples, workflow API, metadata schema, and artifact layout are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please consider starting [a discussion on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/discussions). - -Workflow chaining lets you split a dataset build into named stages. Each stage runs a normal `DataDesigner.create()` call, writes its own artifact directory, and hands a selected parquet output to the next stage as a `LocalFileSeedSource`. - -Use it when one generation step naturally depends on the cleaned or reshaped output of another step, especially when a processor-only stage is clearer than mixing all transformations into one config. 
- -## Basic shape - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -data_designer = DataDesigner() - -drafts = ( - dd.DataDesignerConfigBuilder(model_configs=[fast_model]) - .with_seed_dataset(dd.LocalFileSeedSource(path="parsed_docs/*.parquet")) - .add_column( - name="chunk_summary", - column_type="llm_text", - model_alias="fast", - prompt="Summarize this passage:\n\n{{ text }}", - ) - .add_column( - name="question", - column_type="llm_text", - model_alias="fast", - prompt="Write a question about this passage:\n\n{{ chunk_summary }}", - ) - .add_column( - name="answer", - column_type="llm_text", - model_alias="fast", - prompt="Answer {{ question }} using this passage:\n\n{{ text }}", - ) -) - -chatml = dd.DataDesignerConfigBuilder().add_processor( - dd.SchemaTransformProcessorConfig( - name="chatml", - template={ - "messages": [ - {"role": "user", "content": "{{ question }}"}, - {"role": "assistant", "content": "{{ answer }}"}, - ], - }, - ) -) - -workflow = data_designer.compose_workflow(name="doc-qa") -workflow.add_stage( - "drafts", - drafts, - num_records=100, - output_processors=[ - dd.DropColumnsProcessorConfig( - name="drop_scratch", - column_names=["text", "chunk_summary"], - ) - ], -) -workflow.add_stage("chatml", chatml, output="processor:chatml") - -results = workflow.run() -training_rows = results.load_dataset() -results.export("chatml.jsonl") -``` - -## Stage outputs - -A stage can expose different views of its data: - -| Surface | What it returns | -|---------|-----------------| -| `results["stage_name"]` | The effective `DatasetCreationResults` for that stage. If the stage uses `output_processors`, this points at the output-processor run. | -| `results.load_stage_output("stage_name")` | The selected output handed to downstream stages. This follows `output="processor:"` and `on_success`. | -| `results.load_dataset()` | The selected output from the final stage. 
| - -Processors added with `config_builder.add_processor(...)` run inside the stage and usually create side artifacts. They do not automatically change what the next stage receives. Use `output_processors=[...]` when a processor should define the stage boundary output. - -## Processor-only stages - -Stages can be processor-only when they receive seed data from an upstream stage: - -```python -cleanup = dd.DataDesignerConfigBuilder().add_processor( - dd.DropColumnsProcessorConfig( - name="drop_private_fields", - column_names=["email", "raw_notes"], - ) -) - -workflow.add_stage("cleanup", cleanup) -``` - -This is useful for final cleanup, schema transforms, and format-specific export preparation. - -## Current limits - -- Stages are linear. DAGs, parallel branches, and joins are planned separately. -- Stage-level resume is not implemented yet. -- `push_to_hub()` does not support selected processor or callback outputs yet. Use `export()` for the selected workflow output. -- `on_success` callbacks are trusted user code. If a callback returns a path, Data Designer reads that path as the next stage input. -- The artifact layout is intended for inspection, but it is not yet a stable public contract. diff --git a/docs/css/mkdocstrings.css b/docs/css/mkdocstrings.css deleted file mode 100644 index 56ba05c64..000000000 --- a/docs/css/mkdocstrings.css +++ /dev/null @@ -1,132 +0,0 @@ -/* Indentation. */ -div.doc-contents:not(.first) { - padding-left: 25px; - border-left: .05rem solid var(--md-typeset-table-color); - } - - /* Mark external links as such. 
*/ - a.external::after, - a.autorefs-external::after { - /* https://primer.style/octicons/arrow-up-right-24 */ - mask-image: url('data:image/svg+xml,'); - -webkit-mask-image: url('data:image/svg+xml,'); - content: ' '; - - display: inline-block; - vertical-align: middle; - position: relative; - - height: 1em; - width: 1em; - background-color: currentColor; - } - - a.external:hover::after, - a.autorefs-external:hover::after { - background-color: var(--md-accent-fg-color); - } - - /* Tree-like output for backlinks. */ - .doc-backlink-list { - --tree-clr: var(--md-default-fg-color); - --tree-font-size: 1rem; - --tree-item-height: 1; - --tree-offset: 1rem; - --tree-thickness: 1px; - --tree-style: solid; - display: grid; - list-style: none !important; - } - - .doc-backlink-list li > span:first-child { - text-indent: .3rem; - } - .doc-backlink-list li { - padding-inline-start: var(--tree-offset); - border-left: var(--tree-thickness) var(--tree-style) var(--tree-clr); - position: relative; - margin-left: 0 !important; - - &:last-child { - border-color: transparent; - } - &::before{ - content: ''; - position: absolute; - top: calc(var(--tree-item-height) / 2 * -1 * var(--tree-font-size) + var(--tree-thickness)); - left: calc(var(--tree-thickness) * -1); - width: calc(var(--tree-offset) + var(--tree-thickness) * 2); - height: calc(var(--tree-item-height) * var(--tree-font-size)); - border-left: var(--tree-thickness) var(--tree-style) var(--tree-clr); - border-bottom: var(--tree-thickness) var(--tree-style) var(--tree-clr); - } - &::after{ - content: ''; - position: absolute; - border-radius: 50%; - background-color: var(--tree-clr); - top: calc(var(--tree-item-height) / 2 * 1rem); - left: var(--tree-offset) ; - translate: calc(var(--tree-thickness) * -1) calc(var(--tree-thickness) * -1); - } - } - - .doc-symbol-toc.doc-symbol-module::after { - content: "module"; - } - - .doc-symbol-toc.doc-symbol-method::after { - content: "method"; - } - - /* Keep API section tables 
readable when Python type annotations are long. */ - div.doc-contents:has(table:has(thead th:nth-child(3))) { - overflow-x: auto; - } - - div.doc-contents table:has(thead th:nth-child(3)) { - table-layout: fixed; - width: 100%; - min-width: 42rem; - } - - div.doc-contents table:has(thead th:nth-child(3)) td { - vertical-align: top; - } - - div.doc-contents table:has(thead th:nth-child(3)) code { - white-space: normal; - overflow-wrap: anywhere; - word-break: normal; - } - - /* Attributes: Name, Type, Description. */ - div.doc-contents table:has(thead th:nth-child(3)):not(:has(thead th:nth-child(4))) th:nth-child(1), - div.doc-contents table:has(thead th:nth-child(3)):not(:has(thead th:nth-child(4))) td:nth-child(1) { - width: clamp(9rem, 18%, 12rem); - } - - div.doc-contents table:has(thead th:nth-child(3)):not(:has(thead th:nth-child(4))) th:nth-child(2), - div.doc-contents table:has(thead th:nth-child(3)):not(:has(thead th:nth-child(4))) td:nth-child(2) { - width: clamp(16rem, 38%, 34rem); - } - - /* Parameters: Name, Type, Description, Default. 
*/ - div.doc-contents table:has(thead th:nth-child(4)) { - min-width: 54rem; - } - - div.doc-contents table:has(thead th:nth-child(4)) th:nth-child(1), - div.doc-contents table:has(thead th:nth-child(4)) td:nth-child(1) { - width: clamp(9rem, 16%, 11rem); - } - - div.doc-contents table:has(thead th:nth-child(4)) th:nth-child(2), - div.doc-contents table:has(thead th:nth-child(4)) td:nth-child(2) { - width: clamp(16rem, 32%, 28rem); - } - - div.doc-contents table:has(thead th:nth-child(4)) th:nth-child(4), - div.doc-contents table:has(thead th:nth-child(4)) td:nth-child(4) { - width: clamp(5rem, 9%, 7rem); - } diff --git a/docs/css/style.css b/docs/css/style.css deleted file mode 100644 index d19b92cd1..000000000 --- a/docs/css/style.css +++ /dev/null @@ -1,244 +0,0 @@ - -.nvidia-green { - color: #76B900; /* NVIDIA green */ -} - -/* Style "NeMo" in the site name with pure CSS */ -.md-header__topic:first-child .md-ellipsis { - visibility: hidden; - position: relative; -} - -/* Make "NeMo" green using a gradient background trick */ -.md-header__topic:first-child .md-ellipsis::after { - content: "NeMo Data Designer"; - visibility: visible; - position: absolute; - left: 0; - top: 0; - background: linear-gradient(90deg, #76B900 0%, #76B900 4.8ch, currentColor 4.8ch, currentColor 100%); - -webkit-background-clip: text; - -webkit-text-fill-color: transparent; - background-clip: text; -} - -/* Prevent selection of prompt and output markers */ -.highlight .gp, -.highlight .go { - user-select: none; -} - -/* Reduce space between logo and site name */ -.md-header__button.md-logo { - margin-right: -1rem; -} - -/* Remove background from Jupyter notebook outputs */ -.jp-RenderedText pre, -.jp-OutputArea-output pre, -.output_stderr, -.output_stdout, -.output_stream, -div.output_area pre, -div.output_subarea pre, -.jupyter-wrapper .output_stream, -.jupyter-wrapper .output_stdout, -.jupyter-wrapper .output_stderr, -.jp-RenderedText[data-mime-type="application/vnd.jupyter.stderr"], 
-.jp-RenderedText[data-mime-type="application/vnd.jupyter.stderr"] pre, -.jp-OutputArea-output[data-mime-type="application/vnd.jupyter.stderr"], -.jp-OutputArea-output[data-mime-type="application/vnd.jupyter.stderr"] pre { - background-color: transparent !important; -} - -/* Override stderr-specific styling */ -.highlight .err { - background-color: transparent !important; - border: none !important; -} - -/* Make notebook cells grow with window size */ -.md-content__inner { - max-width: none !important; -} - -.md-content__inner:has(.jp-Notebook) { - padding-left: 0rem !important; - padding-right: 0.5rem !important; -} - -/* Rich output with horizontal scroll and compact tables */ -.jp-RenderedText pre, -.jp-OutputArea-output pre, -.jp-RenderedHTMLCommon pre, -.jp-RenderedHTML pre { - white-space: pre !important; - overflow-x: auto !important; - font-size: 0.6rem !important; - line-height: 1.1 !important; - padding: 0.25rem !important; - margin: 0.25rem 0 !important; -} - -/* Narrow sidebar */ -.md-sidebar--primary, -.md-sidebar--primary .md-sidebar__scrollwrap { - width: 10rem !important; -} - -/* Hide right sidebar (TOC) by default, JavaScript will show it on Code Reference pages */ -.md-sidebar.md-sidebar--secondary { - display: none !important; -} - -/* Show TOC on Code Reference pages (controlled by JavaScript) */ -body.show-toc .md-sidebar.md-sidebar--secondary { - display: block !important; -} - -/* Add color to TOC links on Concepts pages */ -body.show-toc .md-sidebar--secondary .md-nav__link { - transition: color 0.2s ease; -} - -body.show-toc .md-sidebar--secondary .md-nav__link:hover { - color: #76B900 !important; -} - -body.show-toc .md-sidebar--secondary .md-nav__link--active { - color: #76B900 !important; - font-weight: 500; -} - -/* Move repo and palette icons to the right */ -.md-header__inner { - padding-right: 0rem !important; -} - -/* Reduce space between h1 headings and content below */ -h1 { - margin-bottom: 0.5rem !important; -} -h2 { - 
margin-top: 0.2rem !important; - margin-bottom: 0.2rem !important; -} - -.md-typeset .devnote-dek { - border-left: 0.18rem solid #76B900; - color: var(--md-default-fg-color); - font-size: 1.05rem; - font-weight: 500; - line-height: 1.45; - margin: 0.6rem 0 1rem; - padding-left: 0.8rem; -} - -.md-typeset img.devnote-float-right, -.md-typeset img.devnote-section-graphic { - background: var(--md-default-bg-color); - border: 0.05rem solid var(--md-default-fg-color--lightest); - border-radius: 0.3rem; - box-shadow: 0 0.25rem 0.8rem rgb(0 0 0 / 18%); -} - -.md-typeset img.devnote-float-right { - float: right; - width: min(42%, 28rem); - max-width: 100%; - height: auto; - margin: 0 0 0.7rem 1rem; -} - -.md-typeset img.devnote-section-graphic { - float: right; - width: min(38%, 24rem); - max-width: 100%; - height: auto; - margin: 0.1rem 0 0.7rem 1rem; -} - -.md-typeset .devnote-clear { - clear: right; -} - -.md-post--excerpt .devnote-hide-in-index { - display: none; -} - -@media screen and (max-width: 60em) { - .md-typeset img.devnote-float-right, - .md-typeset img.devnote-section-graphic { - float: none; - display: block; - width: 100%; - margin: 1rem 0; - } -} - -/* Define the company grid layout */ - -#grid-container { - width: 100%; - text-align: center; - } - - #company-grid { - display: inline-block; - margin: 0 auto; - gap: 10px; - align-content: center; - justify-content: center; - grid-auto-flow: column; - } - - [data-md-color-scheme="slate"] #company-grid { - background-color: #ffffff; - border-radius: .5rem; - color: black; - } - - .tile { - display: inline-block; - text-align: center; - width: 120px; - height: 120px; - margin: 10px; - padding: 5px; - border-radius: .5rem; - } - - .tile img { - width: 100px; - } - - .md-typeset__table > table { - max-height: 60vh; - min-width: 100%; - width: max-content; - } - - .md-typeset__table { - display: block; - overflow-x: auto; - } - - .md-typeset__table code { - white-space: nowrap; - word-break: normal; - 
overflow-wrap: normal; - } - - .md-typeset__table > table thead { - position: sticky; - top: 0; - background-color: var(--md-default-bg-color); - } - - .md-typeset__table > table th { - border-bottom: .05rem solid var(--md-typeset-table-color); - } - - .md-typeset__table > table tr:first-child td { - border-top: none; - } diff --git a/docs/devnotes/.authors.yml b/docs/devnotes/.authors.yml deleted file mode 100644 index 860f8f27f..000000000 --- a/docs/devnotes/.authors.yml +++ /dev/null @@ -1,61 +0,0 @@ -authors: - amanoel: - name: Andre Manoel - description: Researcher at NVIDIA - avatar: https://avatars.githubusercontent.com/u/165937436?v=4 - nvidia: - name: NVIDIA NeMo Data Designer Team - description: NeMo Data Designer Core Team - avatar: https://avatars.githubusercontent.com/u/1728152?s=200&v=4 - sthan: - name: Steve Han - description: Researcher at NVIDIA - avatar: https://avatars.githubusercontent.com/u/150830061?v=4 - oliverholworthy: - name: Oliver Holworthy - description: Machine Learning Engineer at NVIDIA - avatar: https://avatars.githubusercontent.com/u/1216955?v=4 - dcorneil: - name: Dane Corneil - description: Researcher at NVIDIA - avatar: https://avatars.githubusercontent.com/u/4678597?v=4 - etramel: - name: Eric Tramel - description: Researcher at NVIDIA - avatar: https://avatars.githubusercontent.com/u/1223539?v=4 - kthadaka: - name: Kirit Thadaka - description: Product Manager at NVIDIA - avatar: https://avatars.githubusercontent.com/u/8190572?v=4 - dnathawani: - name: Dhruv Nathawani - description: Researcher at NVIDIA - avatar: https://avatars.githubusercontent.com/u/128275431?v=4 - ymeyer: - name: Yev Meyer - description: Principal Research Scientist at NVIDIA - avatar: https://avatars.githubusercontent.com/u/11296522?v=4 - mvansegbroeck: - name: Maarten Van Segbroeck - description: Director of Research at NVIDIA - avatar: https://avatars.githubusercontent.com/u/67658125?v=4 - jgreco: - name: Johnny Greco - description: Researcher at NVIDIA 
- avatar: https://avatars.githubusercontent.com/u/10998105?v=4 - nmulepati: - name: Nabin Mulepati - description: Researcher at NVIDIA - avatar: https://avatars.githubusercontent.com/u/5551931?v=4 - davanstrien: - name: Daniel van Strien - description: Machine Learning Librarian at Hugging Face - avatar: https://avatars.githubusercontent.com/u/8995957?v=4 - trintamaki: - name: Tuomas Rintamaki - description: Researcher at NVIDIA - avatar: https://avatars.githubusercontent.com/u/143676174?v=4 - nvilia: - name: Ilia Karmanov - description: Researcher at NVIDIA - avatar: https://avatars.githubusercontent.com/u/130559691?v=4 diff --git a/docs/devnotes/index.md b/docs/devnotes/index.md deleted file mode 100644 index 0f7e66940..000000000 --- a/docs/devnotes/index.md +++ /dev/null @@ -1,5 +0,0 @@ -# Dev Notes - -Welcome to NeMo Data Designer Dev Notes! Here you'll find in-depth guides, tutorials, and insights about synthetic data generation. - - diff --git a/docs/devnotes/posts/assets/async-all-the-way-down/architecture-layers.svg b/docs/devnotes/posts/assets/async-all-the-way-down/architecture-layers.svg deleted file mode 100644 index 122d3d95d..000000000 --- a/docs/devnotes/posts/assets/async-all-the-way-down/architecture-layers.svg +++ /dev/null @@ -1,59 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - Async Engine: Three Layers of Concurrency - - - - - AsyncTaskScheduler - Dependency-aware task dispatch from ExecutionGraph - • Topological ordering via Kahn's algorithm - • Cell-level readiness tracking (CompletionTracker) - • Multi-column generator deduplication - • Salvage rounds for retryable failures - - - - - - - - - Row-Group Admission & Buffer Management - Memory-bounded execution with streaming checkpoints - • Semaphore-gated row-group admission - • Per-group buffer with cell-level writes - • Checkpoint-to-parquet on completion - • Submission semaphore caps in-flight tasks - - - - - - - - ThrottleManager (AIMD Adaptive Concurrency) - Per-model, per-provider 
rate-limit adaptation - • Additive increase on success streaks - • Multiplicative decrease on 429s - • Independent pools per (provider, model) - • Global cap across model aliases - diff --git a/docs/devnotes/posts/assets/async-all-the-way-down/async-engine-hero.png b/docs/devnotes/posts/assets/async-all-the-way-down/async-engine-hero.png deleted file mode 100644 index bd5f58b3a..000000000 Binary files a/docs/devnotes/posts/assets/async-all-the-way-down/async-engine-hero.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/async-all-the-way-down/async-sync-animation.svg b/docs/devnotes/posts/assets/async-all-the-way-down/async-sync-animation.svg deleted file mode 100644 index 5bcdf29d7..000000000 --- a/docs/devnotes/posts/assets/async-all-the-way-down/async-sync-animation.svg +++ /dev/null @@ -1,326 +0,0 @@ - - - - - - - - - - - - Sync Engine - one column at a time - - - topic - summary - trivia - analysis - - ← topic - ← topic - ← summary - - - row 0 - row 1 - row 2 - row 3 - row 4 - row 5 - row 6 - row 7 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 32 steps to complete - - - - - - - Async Engine - cells dispatch as dependencies resolve - - - topic - summary - trivia - analysis - ← topic - ← topic - ← summary - - - row 0 - row 1 - row 2 - row 3 - row 4 - row 5 - row 6 - row 7 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ~14 steps to complete - - - - - waiting - - generating - - done - - dependency - diff --git a/docs/devnotes/posts/assets/async-all-the-way-down/benchmark-results.png 
b/docs/devnotes/posts/assets/async-all-the-way-down/benchmark-results.png deleted file mode 100644 index 790daf748..000000000 Binary files a/docs/devnotes/posts/assets/async-all-the-way-down/benchmark-results.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/async-all-the-way-down/dag-shapes.svg b/docs/devnotes/posts/assets/async-all-the-way-down/dag-shapes.svg deleted file mode 100644 index e87a5b34c..000000000 --- a/docs/devnotes/posts/assets/async-all-the-way-down/dag-shapes.svg +++ /dev/null @@ -1,119 +0,0 @@ - - - - - - - - - - - - - - - - Narrow (sequential) - - - topic - - - - summary - - - - analysis - - - - conclusion - - - Wide (parallel) - - - topic - - - - - - - - - summary - - trivia - - analysis - - conclusion - - trivia_2 - - - Deep (branching) - - - topic - - - - - - summary - - trivia - - - - - analysis - - - - - conclusion - - - Dual-model - - - topic - - - - - - - summary - - trivia - - analysis - - - - - - - judge_sum - - judge_tri - - judge_ana - - - - Generator model - - Judge model - diff --git a/docs/devnotes/posts/assets/async-all-the-way-down/scale-boxplot.png b/docs/devnotes/posts/assets/async-all-the-way-down/scale-boxplot.png deleted file mode 100644 index cdce792e7..000000000 Binary files a/docs/devnotes/posts/assets/async-all-the-way-down/scale-boxplot.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/async-all-the-way-down/scale-model-timeline.png b/docs/devnotes/posts/assets/async-all-the-way-down/scale-model-timeline.png deleted file mode 100644 index 5885b6a46..000000000 Binary files a/docs/devnotes/posts/assets/async-all-the-way-down/scale-model-timeline.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/async-all-the-way-down/speedup-scaling.png b/docs/devnotes/posts/assets/async-all-the-way-down/speedup-scaling.png deleted file mode 100644 index e2838bbe0..000000000 Binary files a/docs/devnotes/posts/assets/async-all-the-way-down/speedup-scaling.png and /dev/null differ diff --git 
a/docs/devnotes/posts/assets/async-all-the-way-down/sync-vs-async-timeline.png b/docs/devnotes/posts/assets/async-all-the-way-down/sync-vs-async-timeline.png deleted file mode 100644 index da21b2b8b..000000000 Binary files a/docs/devnotes/posts/assets/async-all-the-way-down/sync-vs-async-timeline.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/async-all-the-way-down/task-lifecycle.png b/docs/devnotes/posts/assets/async-all-the-way-down/task-lifecycle.png deleted file mode 100644 index 69262a3b5..000000000 Binary files a/docs/devnotes/posts/assets/async-all-the-way-down/task-lifecycle.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/data-designer-got-skills/hero_benchmark_figure.png b/docs/devnotes/posts/assets/data-designer-got-skills/hero_benchmark_figure.png deleted file mode 100644 index 8cf5a13d1..000000000 Binary files a/docs/devnotes/posts/assets/data-designer-got-skills/hero_benchmark_figure.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/data-designer-got-skills/hero_data_designer_got_skills.png b/docs/devnotes/posts/assets/data-designer-got-skills/hero_data_designer_got_skills.png deleted file mode 100644 index d1e8670e3..000000000 Binary files a/docs/devnotes/posts/assets/data-designer-got-skills/hero_data_designer_got_skills.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/data-designer-got-skills/trace-baseline.html b/docs/devnotes/posts/assets/data-designer-got-skills/trace-baseline.html deleted file mode 100644 index 17d62dff1..000000000 --- a/docs/devnotes/posts/assets/data-designer-got-skills/trace-baseline.html +++ /dev/null @@ -1,757 +0,0 @@ - -
- - -Baseline -Typical Baseline Session - -35 tools -1 errors -159s - - -
-
-Agent -Let me start by exploring the Data Designer package and checking available configurations. -
-
-1 -Bash - -.venv/bin/python -c "import data_designer.config as dd, os; print(os.path.dirname(dd.__file__))"
.venv/lib/python3.13/site-packages/data_designer/config
-
-
-
-2 -Bash - -.venv/bin/data-designer config list
output (5,284 chars)
─────────────────────── Data Designer Configurations ─────────────────────── - - 💡 Configuration directory: ~/.data-designer - - Model Providers -┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓ -┃ Name ┃ Endpoint ┃ Type ┃ API Key ┃ Default ┃ -┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩ -│ nvidia │ https://integrate.a… │ openai │ NVIDIA_API_KEY │ ✓ │ -│ openai │ https://inference-a… │ openai │ NVIDIA_INFERENCE_API… │ │ -│ openrouter │ https://openrouter.… │ openai │ OPENROUTER_API_KEY │ │ -└────────────┴──────────────────────┴────────┴───────────────────────┴─────────┘ - - Model Configurations -┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓ -┃ ┃ ┃ ┃ Inference ┃ -┃ Alias ┃ Model ┃ Provider ┃ Parameters ┃ -┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩ -│ nvidia-text │ nvidia/nemotron-3-… │ nvidia │ generation_type=c… │ -│ │ │ │ max_parallel_requ… │ -│ │ │ │ temperature=1.00, │ -│ │ │ │ top_p=1.00 │ -│ nvidia-reasoning │ openai/gpt-oss-20b │ nvidia │ generation_type=c… │ -│ │ │ │ max_parallel_requ… │ -│ │ │ │ temperature=0.35, │ -│ │ │ │ top_p=0.95 │ -│ nvidia-vision │ nvidia/nemotron-na… │ nvidia │ generation_type=c… │ -│ │ │ │ max_parallel_requ… │ -│ │ │ │ temperature=0.85, │ -│ │ │ │ top_p=0.95 │ -│ nvidia-embedding │ nvidia/llama-3.2-n… │ nvidia │ generation_type=e… │ -│ │ │ │ max_parallel_requ… │ -│ │ │ │ extra_body={'inpu… │ -│ │ │ │ 'query'}, │ -│ │ │ │ encoding_format=f… │ -│ openai-text │ us/azure/openai/gp… │ openai │ generation_type=c… │ -│ │ │ │ max_parallel_requ… │ -│ │ │ │ temperature=0.85, │ -│ │ │ │ top_p=0.95 │ -│ openai-reasoning │ openai/openai/gpt-… │ openai │ generation_type=c… │ -│ │ │ │ max_parallel_requ… │ -│ │ │ │ extra_body={'reas… │ -│ │ │ │ 'medium'} │ -│ openai-vision │ openai/openai/gpt-… │ openai │ generation_type=c… │ -│ │ │ │ max_parallel_requ… │ -│ │ │ │ extra_body={'reas… │ -│ │ │ │ 'medium'} │ -│ 
openai-embedding │ text-embedding-3-l… │ openai │ generation_type=e… │ -│ │ │ │ max_parallel_requ… │ -│ │ │ │ encoding_format=f… │ -│ openrouter-text │ nvidia/nemotron-3-… │ openrouter │ generation_type=c… │ -│ │ │ │ max_parallel_requ… │ -│ │ │ │ temperature=1.00, │ -│ │ │ │ top_p=1.00 │ -│ openrouter-reasoning │ openai/gpt-oss-20b │ openrouter │ generation_type=c… │ -│ │ │ │ max_parallel_requ… │ -│ │ │ │ temperature=0.35, │ -│ │ │ │ top_p=0.95 │ -│ openrouter-vision │ nvidia/nemotron-na… │ openrouter │ generation_type=c… │ -│ │ │ │ max_parallel_requ… │ -│ │ │ │ temperature=0.85, │ -│ │ │ │ top_p=0.95 │ -│ openrouter-embedding │ openai/text-embedd… │ openrouter │ generation_type=e… │ -│ │ │ │ max_parallel_requ… │ -│ │ │ │ encoding_format=f… │ -└──────────────────────┴─────────────────────┴────────────┴────────────────────┘ - - ⚠️ MCP providers have not been configured. Run 'data-designer config mcp' to -configure them. - - ⚠️ Tool configs have not been configured. Run 'data-designer config tools' -to configure them.
-
-
-
-Agent -Now let me explore the Data Designer package structure and understand the API. -
-
-3 -Agent - -Explore — Explore Data Designer API
prompt (974 chars)
Explore the Data Designer package thoroughly. The package is located at: -.venv/lib/python3.13/site-packages/data_designer/ - -I need to understand: -1. How to create a DataDesignerConfig or ConfigBuilder — look for the main config builder API -2. What column types are available (especially for LLM generation, code validation, sampling/categorical, LLM-as-judge) -3. How columns are defined and added to a config -4. How to specify model aliases for columns -5. Any example configs or tests that show how to build a config -6. How the `data-designer validate` CLI command works (what it expects) -7. The `load_config_builder` function pattern — what should it return? - -Be very thorough — check __init__.py files, column types, config module, CLI module, and any examples directory. Look at the actual class signatures and constructors.
subagent work (25 tool calls, 1 error)
1 Bash find .venv/lib/python3.13/site-packages/data_designer
-
2 Bash find .venv/lib/python3.13/site-packages/data_designer
-
3 Read data_designer/__init__.py error
-
4 Read data_designer/config/__init__.py
-
5 Read data_designer/config/config_builder.py
-
6 Read data_designer/config/column_types.py
-
7 Bash find .venv/lib/python3.13/site-packages/data_designer
-
8 Read data_designer/config/column_configs.py (limit=300)
-
9 Read data_designer/config/column_configs.py (offset=300, limit=300)
-
10 Bash find .venv/lib/python3.13/site-packages/data_designer
-
11 Bash find .venv/lib/python3.13/site-packages/data_designer
-
12 Read data_designer/cli/commands/validate.py
-
13 Read data_designer/cli/utils/config_loader.py
-
14 Read data_designer/cli/controllers/generation_controller.py (limit=200)
-
15 Read data_designer/config/models.py (limit=250)
-
16 Read data_designer/config/models.py (offset=250, limit=250)
-
17 Read data_designer/config/models.py (offset=500, limit=200)
-
18 Read data_designer/config/sampler_params.py (limit=200)
-
19 Bash find .venv/lib/python3.13/site-packages/data_designer
-
20 Read data_designer/config/data_designer_config.py (limit=200)
-
21 Bash find .venv/lib/python3.13/site-packages/data_designer
-
22 Read data_designer/config/validator_params.py (limit=200)
-
23 Read data_designer/interface/data_designer.py (limit=150)
-
24 Read data_designer/config/utils/code_lang.py
-
25 Read data_designer/config/sampler_params.py (offset=200, limit=300)
subagent report (12,076 chars)
Now I have comprehensive information about the Data Designer package. Let me compile a thorough report. - -## Data Designer Package Exploration Report - -I've conducted a thorough exploration of the Data Designer package located at `.venv/lib/python3.13/site-packages/data_designer/`. Here's my comprehensive understanding: - -### 1. Creating a DataDesignerConfig or ConfigBuilder - -**Main API:** The primary builder class is `DataDesignerConfigBuilder` located at: -- `.venv/lib/python3.13/site-packages/data_designer/config/config_builder.py` - -**Constructor:** -```python -DataDesignerConfigBuilder( - model_configs: list[ModelConfig] | str | Path | None = None, - tool_configs: list[ToolConfig] | None = None -) -``` - -**Key Methods:** -- `__init__()`: Initialize with optional model configs (defaults to local models) and tool configs -- `add_column()`: Add columns to the config (most commonly used) -- `add_model_config()`: Add model configurations -- `add_tool_config()`: Add MCP tool configurations -- `with_seed_dataset()`: Attach a seed dataset source -- `build()`: Returns a `DataDesignerConfig` object -- `get_builder_config()`: Returns a `BuilderConfig` wrapping the config -- `write_config()`: Save to YAML/JSON file -- `from_config()`: Create builder from existing config file or dict - -**Alternative Creation:** -```python -# From existing config file/dict -builder = DataDesignerConfigBuilder.from_config(config_path_or_dict) -``` - -### 2. Available Column Types - -**Location:** `.venv/lib/python3.13/site-packages/data_designer/config/column_types.py` - -**Column Types (enum: DataDesignerColumnType):** - -1. **LLM Generation Columns:** - - `llm-text` → `LLMTextColumnConfig`: Free-form text generation with Jinja2 templates - - `llm-code` → `LLMCodeColumnConfig`: Code generation with language specification (Python, SQL, JS, etc.) 
- - `llm-structured` → `LLMStructuredColumnConfig`: Structured JSON output using JSON schema or Pydantic models - - `llm-judge` → `LLMJudgeColumnConfig`: Multi-dimensional evaluation/scoring of other columns - -2. **Sampling/Categorical Columns:** - - `sampler` → `SamplerColumnConfig`: Numerical and categorical sampling (13+ sampler types available) - -3. **Code Validation:** - - `validation` → `ValidationColumnConfig`: Run validation code (Python/SQL) or call validation functions - -4. **Other Column Types:** - - `seed-dataset` → `SeedDatasetColumnConfig`: Columns sourced from seed data (auto-created) - - `expression` → `ExpressionColumnConfig`: Jinja2-based derived/computed columns - - `embedding` → `EmbeddingColumnConfig`: Generate embeddings for text - - `image` → `ImageColumnConfig`: Generate images from prompts - - `custom` → `CustomColumnConfig`: User-defined generation logic with decorators - -### 3. How Columns Are Defined and Added - -**Two approaches:** - -**Approach 1: Direct column objects** -```python -from data_designer.config import LLMTextColumnConfig, SamplerColumnConfig - -text_col = LLMTextColumnConfig( - name="review", - prompt="Write a product review about {{ product_name }}", - model_alias="claude-3.5-sonnet" -) -builder.add_column(text_col) - -# Sampler column -sampler_col = SamplerColumnConfig( - name="age", - sampler_type="uniform", - params={"low": 18, "high": 65} -) -builder.add_column(sampler_col) -``` - -**Approach 2: Via kwargs (using add_column)** -```python -builder.add_column( - name="review", - column_type="llm-text", - prompt="Write a product review about {{ product_name }}", - model_alias="claude-3.5-sonnet" -) -``` - -**Column Properties:** -- Each column config extends `SingleColumnConfig` with a `column_type` discriminator field -- Columns support: - - `required_columns`: List of columns this column depends on - - `side_effect_columns`: Additional columns created as side effects (e.g., traces, reasoning_content) - - Jinja2 
template validation for prompts - - Type conversion and formatting options - -### 4. Model Aliases for Columns - -**Location:** `.venv/lib/python3.13/site-packages/data_designer/config/models.py` - -**Model Configuration:** -```python -from data_designer.config import ModelConfig, ChatCompletionInferenceParams - -model_config = ModelConfig( - alias="my-claude", # User-defined alias referenced in columns - model="claude-3.5-sonnet", - inference_parameters=ChatCompletionInferenceParams( - temperature=0.7, - top_p=0.95, - max_tokens=2000 - ), - provider=None, # Optional custom provider - skip_health_check=False -) - -builder.add_model_config(model_config) -``` - -**Inference Parameter Types:** -- `ChatCompletionInferenceParams`: For LLM generation (temperature, top_p, max_tokens, dynamic distributions) -- `EmbeddingInferenceParams`: For embeddings (encoding_format, dimensions) -- `ImageInferenceParams`: For image generation (extra_body for model-specific options) - -**Model Properties:** -- `alias`: User-defined reference used in columns (e.g., `model_alias="my-claude"`) -- `model`: Model identifier (e.g., "claude-3.5-sonnet") -- `inference_parameters`: Supports fixed values or distributions for sampling -- `generation_type`: Auto-determined from inference_parameters (chat-completion, embedding, or image) - -### 5. Available Sampler Types - -**Location:** `.venv/lib/python3.13/site-packages/data_designer/config/sampler_params.py` - -**All Sampler Types (SamplerType enum):** - -1. **Statistical Distributions:** - - `uniform`: Continuous uniform distribution - - `gaussian`: Normal/Gaussian distribution - - `binomial`: Binomial distribution (success count) - - `bernoulli`: Binary (0/1) outcome - - `bernoulli_mixture`: Mixed zero-inflation with continuous distribution - - `poisson`: Poisson count distribution - - `scipy`: Any scipy.stats distribution - -2. 
**Categorical/Discrete:** - - `category`: Discrete categories with optional weights - - `subcategory`: Hierarchical sampling based on parent category - -3. **Temporal:** - - `datetime`: Uniform datetime sampling within range - - `timedelta`: Time offset relative to reference column - -4. **Special:** - - `uuid`: UUID generation (with optional prefix/formatting) - - `person`: Synthetic person data with demographic filtering (from managed datasets or Faker) - - `person_from_faker`: Pure Faker-based person generation - -### 6. `data-designer validate` CLI Command - -**Location:** `.venv/lib/python3.13/site-packages/data_designer/cli/commands/validate.py` - -**Usage:** -```bash -data-designer validate <config_source> -``` - -**What it validates:** -- Checks configuration is well-formed -- Resolves all references (models, columns, seed datasets, etc.) -- Validates model aliases are defined -- Validates column dependencies and required columns -- Checks constraint validity -- Validates processor configurations - -**Accepted Config Sources:** -1. **YAML/JSON files**: Local paths or HTTP(S) URLs -2. **Python modules**: `.py` files that define a `load_config_builder()` function - -**Controller Implementation:** -```python -# In GenerationController.run_validate() -data_designer = DataDesigner() -data_designer.validate(config_builder) -``` - -Exits with code 0 on success, code 1 on validation failure. - -### 7. `load_config_builder` Function Pattern - -**Location:** `.venv/lib/python3.13/site-packages/data_designer/cli/utils/config_loader.py` - -**Function Signature Required:** -```python -def load_config_builder() -> DataDesignerConfigBuilder: - """Must return a DataDesignerConfigBuilder instance.""" - ... -``` - -**What it should return:** -- A `DataDesignerConfigBuilder` instance (not BuilderConfig, not DataDesignerConfig) -- The builder is used by the CLI to access the full configuration - -**Validation Requirements:** -1. 
The `.py` file **must** define `load_config_builder()` function -2. Function must be callable and return exactly a `DataDesignerConfigBuilder` -3. The CLI will call this function and validate the returned builder -4. The parent directory of the `.py` file is temporarily added to sys.path to support local imports - -**Example Pattern:** -```python -# my_config.py -from data_designer.config import DataDesignerConfigBuilder - -def load_config_builder() -> DataDesignerConfigBuilder: - builder = DataDesignerConfigBuilder() - - # Add model - builder.add_model_config(...) - - # Add columns - builder.add_column(...) - - return builder - -# CLI usage: -# data-designer validate my_config.py -``` - -### 8. LLM-as-Judge Column Configuration - -**Location:** `.venv/lib/python3.13/site-packages/data_designer/config/column_configs.py` - -**Class:** `LLMJudgeColumnConfig` (extends `LLMTextColumnConfig`) - -**Configuration Example:** -```python -from data_designer.config import LLMJudgeColumnConfig, Score - -judge_col = LLMJudgeColumnConfig( - name="quality_assessment", - prompt="Evaluate the quality of: {{ generated_text }}", - model_alias="claude-3.5-sonnet", - system_prompt="You are a quality evaluator...", - scores=[ - Score( - name="Relevance", - description="How relevant is the content to the topic?", - options={ - 1: "Not relevant", - 2: "Somewhat relevant", - 3: "Highly relevant" - } - ), - Score( - name="Fluency", - description="How fluent and natural is the writing?", - options={ - "poor": "Difficult to understand", - "good": "Clear and readable", - "excellent": "Excellent quality" - } - ) - ], - with_trace=TraceType.ALL_MESSAGES, - extract_reasoning_content=False -) - -builder.add_column(judge_col) -``` - -**Key Features:** -- Inherits all LLMTextColumnConfig capabilities (prompts, traces, reasoning extraction, tool calling) -- `scores`: List of `Score` objects defining evaluation dimensions -- Each score has: name, description, options (dict of value → description) -- 
Can reference other columns in prompts via Jinja2 templating -- Supports system prompts and tool aliases for complex evaluations - -### Key File Locations Summary - -| Component | File Path | -|-----------|-----------| -| Config Builder | `.../data_designer/config/config_builder.py` | -| Column Types | `.../data_designer/config/column_types.py` | -| Column Configs | `.../data_designer/config/column_configs.py` | -| Model Config | `.../data_designer/config/models.py` | -| Sampler Params | `.../data_designer/config/sampler_params.py` | -| Validator Params | `.../data_designer/config/validator_params.py` | -| Code Languages | `.../data_designer/config/utils/code_lang.py` | -| Validate CLI | `.../data_designer/cli/commands/validate.py` | -| Config Loader | `.../data_designer/cli/utils/config_loader.py` | -| Main Interface | `.../data_designer/interface/data_designer.py` | -| Data Config | `.../data_designer/config/data_designer_config.py` | -agentId: af6f4be5db40ad2af (use SendMessage with to: 'af6f4be5db40ad2af' to continue this agent) -<usage>total_tokens: 87439 -tool_uses: 25 -duration_ms: 75309</usage>
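The `load_config_builder` contract described in section 7 above (the `.py` file must define the function, and the file's parent directory is temporarily added to `sys.path`) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's actual `config_loader.py`: the helper name `load_builder_from_file` is hypothetical, and the sketch does not check the type of the returned object.

```python
import importlib.util
import sys
from pathlib import Path


def load_builder_from_file(path):
    """Hypothetical helper: import the file at `path` and call its load_config_builder()."""
    module_path = Path(path).resolve()
    parent = str(module_path.parent)
    # Temporarily expose the parent directory so the config file can import sibling modules.
    sys.path.insert(0, parent)
    try:
        spec = importlib.util.spec_from_file_location(module_path.stem, module_path)
        if spec is None or spec.loader is None:
            raise ValueError(f"cannot import {module_path}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        factory = getattr(module, "load_config_builder", None)
        if not callable(factory):
            raise ValueError(f"{module_path.name} must define load_config_builder()")
        return factory()
    finally:
        # Remove the entry added above (the first occurrence), leaving any earlier entries alone.
        if parent in sys.path:
            sys.path.remove(parent)
```

A real loader would presumably also verify that the returned object is a `DataDesignerConfigBuilder`; that check is left out here so the sketch runs without the library installed.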
-
-
-
-Agent -Let me check a few specific details about the column configs and validation column. -
-
-4 -Read - -data_designer/config/column_configs.py - -
-
-Agent -Now let me check the validator params and the config builder's `add_column` method, plus the `__init__` exports. -
-
-5 -Read - -data_designer/config/validator_params.py - -
-
-6 -Read - -data_designer/config/__init__.py - -
-
-Agent -Let me check the sampler params for category type and the config builder's `add_column` signature. -
-
-7 -Grep - -class CategorySamplerParams
results (1,072 chars)
45:class CategorySamplerParams(ConfigBase): -46- """Parameters for categorical sampling with optional probability weighting. -47- -48- Samples values from a discrete set of categories. When weights are provided, values are -49- sampled according to their assigned probabilities. Without weights, uniform sampling is used. -50- -51- Attributes: -52- values: List of possible categorical values to sample from. Can contain strings, integers, -53- or floats. Must contain at least one value. -54- weights: Optional unnormalized probability weights for each value. If provided, must be -55- the same length as `values`. Weights are automatically normalized to sum to 1.0. -56- Larger weights result in higher sampling probability for the corresponding value. -57- """ -58- -59- values: list[str | int | float] = Field( -60- ..., -61- min_length=1, -62- description="List of possible categorical values that can be sampled from.", -63- ) -64- weights: list[float] | None = Field( -65- default=None,
-
-
-
-8 -Grep - -def add_column
results (1,590 chars)
267: def add_column( -268- self, -269- column_config: ColumnConfigT | None = None, -270- *, -271- name: str | None = None, -272- column_type: DataDesignerColumnType | None = None, -273- **kwargs, -274- ) -> Self: -275- """Add a Data Designer column configuration to the current Data Designer configuration. -276- -277- If no column config object is provided, you must provide the `name`, `column_type`, and any -278- additional keyword arguments that are required by the column config constructor. -279- -280- Args: -281- column_config: Data Designer column config object to add. -282- name: Name of the column to add. This is only used if `column_config` is not provided. -283- column_type: Column type to add. This is only used if `column_config` is not provided. -284- **kwargs: Additional keyword arguments to pass to the column constructor. -285- -286- Returns: -287- The current Data Designer config builder instance. -288- -289- Raises: -290- BuilderConfigurationError: If the column name collides with an existing seed dataset column. -291- """ -292- if column_config is None: -293- if name is None or column_type is None: -294- raise BuilderConfigurationError( -295- "🛑 You must provide either a 'column_config' object or 'name' *and* 'column_type' " -296- f"with additional keyword arguments. You provided {column_config=}, {name=}, and {column_type=}." -297- )
-
-
-
-Agent -Now I have everything I need. Let me write the config file. -
-
-9 -Write - -config.py
written content (6,872 chars)
from data_designer.config import ( - CategorySamplerParams, - CodeLang, - CodeValidatorParams, - DataDesignerConfigBuilder, - LLMCodeColumnConfig, - LLMJudgeColumnConfig, - LLMTextColumnConfig, - SamplerColumnConfig, - Score, - ValidationColumnConfig, - ValidatorType, -) - - -def load_config_builder() -> DataDesignerConfigBuilder: - builder = DataDesignerConfigBuilder() - - # --- Difficulty level --- - builder.add_column( - SamplerColumnConfig( - name="difficulty", - sampler_type="category", - params=CategorySamplerParams( - values=["beginner", "intermediate", "advanced"], - weights=[0.35, 0.40, 0.25], - ), - ) - ) - - # --- Subtopic --- - builder.add_column( - SamplerColumnConfig( - name="subtopic", - sampler_type="category", - params=CategorySamplerParams( - values=[ - "data cleaning", - "exploratory analysis", - "aggregation and groupby operations", - "visualization with matplotlib/seaborn", - "statistical testing", - "feature engineering", - "working with messy or missing data", - ], - ), - ) - ) - - # --- Natural language instruction --- - builder.add_column( - LLMTextColumnConfig( - name="instruction", - model_alias="nvidia-text", - prompt=( - "Write a clear, specific natural-language instruction for a {{ difficulty }}-level " - "Python data science task in the area of {{ subtopic }}.\n\n" - "The instruction should describe a concrete problem that a data scientist might encounter " - "and should be solvable in a single self-contained Python script. " - "Do NOT include any code in your response — only the task description.\n\n" - "Vary the complexity to match the difficulty level:\n" - "- beginner: simple operations using pandas, basic plots, straightforward cleaning\n" - "- intermediate: multi-step workflows, combining libraries, moderate data manipulation\n" - "- advanced: complex pipelines, statistical modeling, performance optimization, " - "advanced visualization techniques\n\n" - "Return only the instruction text, nothing else." 
- ), - ) - ) - - # --- Python solution --- - builder.add_column( - LLMCodeColumnConfig( - name="solution", - model_alias="nvidia-reasoning", - code_lang=CodeLang.PYTHON, - prompt=( - "Write a complete, self-contained Python solution for the following " - "{{ difficulty }}-level data science task:\n\n" - "{{ instruction }}\n\n" - "Requirements:\n" - "- The code must be syntactically valid and runnable as-is.\n" - "- Include all necessary imports.\n" - "- Generate or define any sample data needed (do not load from external files).\n" - "- Use appropriate libraries (pandas, numpy, matplotlib, seaborn, scipy, " - "scikit-learn, etc.) as needed.\n" - "- Add brief inline comments explaining key steps.\n" - "- Follow Python best practices and PEP 8 style.\n" - ), - ) - ) - - # --- Code validation --- - builder.add_column( - ValidationColumnConfig( - name="code_validation", - target_columns=["solution"], - validator_type=ValidatorType.CODE, - validator_params=CodeValidatorParams(code_lang=CodeLang.PYTHON), - ) - ) - - # --- LLM-as-judge evaluation --- - builder.add_column( - LLMJudgeColumnConfig( - name="quality_score", - model_alias="nvidia-reasoning", - prompt=( - "Evaluate the following Python data science solution against its instruction.\n\n" - "**Difficulty:** {{ difficulty }}\n" - "**Subtopic:** {{ subtopic }}\n\n" - "**Instruction:**\n{{ instruction }}\n\n" - "**Solution:**\n```python\n{{ solution }}\n```\n\n" - "Score the solution on each dimension below. Be rigorous but fair." - ), - scores=[ - Score( - name="correctness", - description=( - "Does the code correctly implement what the instruction asks for? " - "Consider logical correctness, proper use of libraries, and whether " - "the output would match expectations." 
- ), - options={ - 1: "Completely wrong or does not address the instruction", - 2: "Major errors that would produce incorrect results", - 3: "Partially correct but has notable issues", - 4: "Mostly correct with minor issues", - 5: "Fully correct and complete implementation", - }, - ), - Score( - name="code_quality", - description=( - "Assess the code style, readability, and Pythonic practices. " - "Consider variable naming, structure, comments, and adherence " - "to best practices." - ), - options={ - 1: "Very poor quality, unreadable or badly structured", - 2: "Below average, multiple style issues", - 3: "Acceptable but could be improved", - 4: "Good quality, clean and well-structured", - 5: "Excellent quality, exemplary Python code", - }, - ), - Score( - name="instruction_clarity", - description=( - "How clear and well-defined is the original instruction? " - "Consider whether the task is unambiguous, specific enough to implement, " - "and appropriately scoped for the stated difficulty level." - ), - options={ - 1: "Vague or incomprehensible instruction", - 2: "Unclear, missing important details", - 3: "Adequate but could be more specific", - 4: "Clear and well-defined", - 5: "Excellent clarity, perfectly scoped for difficulty level", - }, - ), - ], - ) - ) - - return builder -
File created successfully at: ./config.py
-
-
-
-Agent -Now let me validate it. -
-
-10 -Bash - -.venv/bin/data-designer validate config.py
────────────────────────── Data Designer Validate ────────────────────────── - - Config: config.py - -[21:25:04] [INFO] ✅ Validation passed - ✅ Configuration is valid
-
-
-
-Agent - -
message (999 chars)
The configuration validates successfully. Here's what the pipeline does: - -1. **`difficulty`** — Sampler column that randomly picks from beginner (35%), intermediate (40%), advanced (25%) -2. **`subtopic`** — Sampler column that uniformly samples from 7 data science subtopics (data cleaning, exploratory analysis, aggregation/groupby, visualization, statistical testing, feature engineering, messy/missing data) -3. **`instruction`** — LLM text column (`nvidia-text`) that generates a natural language task description conditioned on the difficulty and subtopic -4. **`solution`** — LLM code column (`nvidia-reasoning`) that generates a complete, self-contained Python solution for the instruction -5. **`code_validation`** — Validation column that checks the generated Python code for syntactic correctness -6. **`quality_score`** — LLM judge column (`nvidia-reasoning`) that scores each (instruction, solution) pair on three dimensions (correctness, code quality, instruction clarity) using a 1–5 scale
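The weighted `difficulty` sampler summarized above relies on the behavior quoted from the `CategorySamplerParams` docstring: weights are unnormalized, are normalized to sum to 1.0, and sampling is uniform when no weights are given. A minimal standard-library sketch of that behavior, assuming nothing about the library's real implementation (`sample_category` is a hypothetical name):

```python
import random


def sample_category(values, weights=None, rng=None):
    """Pick one value; uniform when weights is None, otherwise proportional to weights."""
    rng = rng or random.Random()
    if not values:
        raise ValueError("values must contain at least one item")
    if weights is None:
        return rng.choice(values)
    if len(weights) != len(values):
        raise ValueError("weights must be the same length as values")
    total = sum(weights)
    # Normalize so the probabilities sum to 1.0, mirroring the docstring's description.
    probabilities = [w / total for w in weights]
    return rng.choices(values, weights=probabilities, k=1)[0]


# Same values and weights as the `difficulty` column in config.py.
rng = random.Random(0)
draws = [
    sample_category(["beginner", "intermediate", "advanced"], [0.35, 0.40, 0.25], rng)
    for _ in range(1000)
]
```

Over many draws the share of each label should approach its normalized weight, which is how the 35% / 40% / 25% split in the summary arises.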
-
-
-
-
diff --git a/docs/devnotes/posts/assets/data-designer-got-skills/trace-skill.html b/docs/devnotes/posts/assets/data-designer-got-skills/trace-skill.html deleted file mode 100644 index 48e6027fa..000000000 --- a/docs/devnotes/posts/assets/data-designer-got-skills/trace-skill.html +++ /dev/null @@ -1,472 +0,0 @@ - - - -
- - -Skill -Typical Skill Session - -9 tools -0 errors -92s - - -
-
-Agent -I'll start by learning about the Data Designer library to understand the available column types, samplers, validators, and processors. -
-
-1 -Bash - -data-designer agent context 2>&1
output (7,427 chars)
Data Designer v0.5.4rc2 - -## Config Module - -The config module contains all user-facing configuration types. Do not search other modules in the library. -config_root: .venv/lib/python3.13/site-packages/data_designer/config - -builder: {config_root}/config_builder.py -base: {config_root}/base.py (read for inherited fields shared by columns and processors) -All config types are accessible via: import data_designer.config as dd - -## Types - -### columns -file: {config_root}/column_configs.py - -type description -------------------------- --------------------------------------------------------------------------------- -CustomColumnConfig Configuration for custom user-defined column generators. -EmbeddingColumnConfig Configuration for embedding generation columns. -ExpressionColumnConfig Configuration for derived columns using Jinja2 expressions. -ImageColumnConfig Configuration for image generation columns. -LLMCodeColumnConfig Configuration for code generation columns using Large Language Models. -LLMJudgeColumnConfig Configuration for LLM-as-a-judge quality assessment and scoring columns. -LLMStructuredColumnConfig Configuration for structured JSON generation columns using Large Language Models. -LLMTextColumnConfig Configuration for text generation columns using Large Language Models. -SamplerColumnConfig Configuration for columns generated using numerical samplers. -SeedDatasetColumnConfig Configuration for columns sourced from seed datasets. -ValidationColumnConfig Configuration for validation columns that validate existing columns. - -### constraints -file: {config_root}/sampler_constraints.py - -type description --------------------------- -------------------------------------------------------------------------- -ColumnInequalityConstraint Constrain a sampler column to be less/greater than another sampler column. -ScalarInequalityConstraint Constrain a sampler column to be less/greater than a scalar value. 
- -### processors -file: {config_root}/processors.py - -type description ------------------------------- --------------------------------------------------------------------------------- -DropColumnsProcessorConfig Drop columns from the output dataset (prefer ``drop=True`` in the column config). -SchemaTransformProcessorConfig Configuration for transforming the dataset schema using Jinja2 templates. - -### samplers -file: {config_root}/sampler_params.py - -type description ------------------------------ ------------------------------------------------------------------------------------- -BernoulliSamplerParams Parameters for sampling from a Bernoulli distribution. -BernoulliMixtureSamplerParams Parameters for sampling from a Bernoulli mixture distribution. -BinomialSamplerParams Parameters for sampling from a Binomial distribution. -CategorySamplerParams Parameters for categorical sampling with optional probability weighting. -DatetimeSamplerParams Parameters for uniform datetime sampling within a specified range. -GaussianSamplerParams Parameters for sampling from a Gaussian (Normal) distribution. -PersonSamplerParams Parameters for sampling synthetic person data with demographic attributes. -PersonFromFakerSamplerParams Parameters for sampling synthetic person data with demographic attributes from Faker. -PoissonSamplerParams Parameters for sampling from a Poisson distribution. -ScipySamplerParams Parameters for sampling from any scipy.stats continuous or discrete distribution. -SubcategorySamplerParams Parameters for subcategory sampling conditioned on a parent category column. -TimeDeltaSamplerParams Parameters for sampling time deltas relative to a reference datetime column. -UniformSamplerParams Parameters for sampling from a continuous Uniform distribution. -UUIDSamplerParams Parameters for generating UUID (Universally Unique Identifier) values. 
- -### validators -file: {config_root}/validator_params.py - -type description ----------------------------- ----------------------------------------------------------------------------------------------------- -CodeValidatorParams Configuration for code validation. Supports Python and SQL code validation. -LocalCallableValidatorParams Configuration for local callable validation. Expects a function to be passed that validates the data. -RemoteValidatorParams Configuration for remote validation. Sends data to a remote endpoint for validation. - -## Model Aliases - -default_provider: nvidia - -model_alias model generation_type provider ----------------- --------------------------------- --------------- -------- -nvidia-embedding nvidia/llama-3.2-nv-embedqa-1b-v2 embedding nvidia -nvidia-reasoning openai/gpt-oss-20b chat-completion nvidia -nvidia-text nvidia/nemotron-3-nano-30b-a3b chat-completion nvidia -nvidia-vision nvidia/nemotron-nano-12b-v2-vl chat-completion nvidia -openai-embedding text-embedding-3-large embedding openai -openai-reasoning openai/openai/gpt-5.2 chat-completion openai -openai-text us/azure/openai/gpt-4.1 chat-completion openai -openai-vision openai/openai/gpt-5.2 chat-completion openai - -## Persona Datasets - -locale size installed ----------- ------- --------- -en_IN 2.39 GB True -en_SG 0.30 GB True -en_US 1.24 GB True -hi_Deva_IN 4.14 GB True -hi_Latn_IN 2.7 GB True -ja_JP 1.69 GB True -pt_BR 2.33 GB True - -## Commands - -command_pattern description ------------------------------------------- ------------------------------------------------------------------- -data-designer agent context Prints output from all agent subcommands to bootstrap context. -data-designer agent types [family] Type names, descriptions, and source files for one or all families. -data-designer agent state model-aliases Model aliases and usability status. -data-designer agent state persona-datasets Persona locales and install status.
-
-
-
-Agent -Now let me inspect the specific config schemas I'll need. -
-
-2 -Read - -data_designer/config/base.py - -
-
-3 -Read - -data_designer/config/column_configs.py - -
-
-4 -Read - -data_designer/config/sampler_params.py - -
-
-5 -Read - -data_designer/config/validator_params.py - -
-
-6 -Read - -data_designer/config/config_builder.py - -
-
-7 -Read - -data_designer/config/processors.py - -
-
-Agent - -
message (863 chars)
Now I have a complete understanding of the library. Here are my key design decisions: - -**Schema:** -1. `difficulty` — Category sampler: beginner, intermediate, advanced (equal weights) -2. `subtopic` — Category sampler: 7 data science subtopics -3. `instruction` — LLM text column: generates a natural language task description conditioned on difficulty + subtopic -4. `solution` — LLM code column (Python): generates a complete solution conditioned on instruction + difficulty + subtopic -5. `code_validation` — Validation column (code validator, Python): checks syntactic correctness of the solution -6. `quality_score` — LLM judge column: scores correctness, code quality, and instruction clarity on a 1-5 scale - -**Diversity axes:** the 3 difficulty levels × 7 subtopics give 21 combinations for broad coverage. The LLM prompts emphasize varying libraries, complexity, and operation types.
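To make the diversity axes concrete, the difficulty and subtopic values can be crossed explicitly. This is only an illustrative sketch using the values from the config in this trace; the samplers draw each column independently rather than enumerating the grid.

```python
from itertools import product

difficulties = ["beginner", "intermediate", "advanced"]
subtopics = [
    "data cleaning",
    "exploratory analysis",
    "aggregation and groupby operations",
    "visualization with matplotlib/seaborn",
    "statistical testing",
    "feature engineering",
    "working with messy or missing data",
]

# Every (difficulty, subtopic) pair the two category samplers can produce.
combinations = list(product(difficulties, subtopics))
print(len(combinations))  # 21
```

With uniform sampling on both columns, each pair is equally likely, so a run of a few hundred records should touch every combination.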
-
-
-
-8 -Write - -text_to_python.py
written content (7,540 chars)
# /// script -# dependencies = [ -# "data-designer", -# ] -# /// -import data_designer.config as dd - - -def load_config_builder() -> dd.DataDesignerConfigBuilder: - config_builder = dd.DataDesignerConfigBuilder() - - # 1. Difficulty level - config_builder.add_column( - dd.SamplerColumnConfig( - name="difficulty", - sampler_type="category", - params=dd.CategorySamplerParams( - values=["beginner", "intermediate", "advanced"], - ), - ) - ) - - # 2. Subtopic - config_builder.add_column( - dd.SamplerColumnConfig( - name="subtopic", - sampler_type="category", - params=dd.CategorySamplerParams( - values=[ - "data cleaning", - "exploratory analysis", - "aggregation and groupby operations", - "visualization with matplotlib/seaborn", - "statistical testing", - "feature engineering", - "working with messy or missing data", - ], - ), - ) - ) - - # 3. Natural language instruction - config_builder.add_column( - dd.LLMTextColumnConfig( - name="instruction", - model_alias="nvidia-text", - system_prompt=( - "You are an expert data science educator who writes clear, realistic Python coding tasks. " - "Your instructions should be specific enough that a developer can implement a complete solution " - "without ambiguity. Vary the libraries used (pandas, numpy, scipy, sklearn, matplotlib, seaborn, etc.) " - "and the operations required to match the difficulty level." 
- ), - prompt=( - "Write a single, self-contained Python coding instruction for a {{ difficulty }}-level data science task " - "in the subtopic of '{{ subtopic }}'.\n\n" - "Requirements:\n" - "- The task should be realistic and something a data scientist would actually encounter.\n" - "- For beginner tasks, focus on simple operations with clear steps.\n" - "- For intermediate tasks, combine multiple operations or require domain knowledge.\n" - "- For advanced tasks, involve complex logic, performance considerations, or sophisticated techniques.\n" - "- The instruction should describe what the code should do, including any expected input/output behavior.\n" - "- Do NOT include the solution — only the task description.\n\n" - "Return only the instruction text, nothing else." - ), - ) - ) - - # 4. Python solution - config_builder.add_column( - dd.LLMCodeColumnConfig( - name="solution", - model_alias="nvidia-text", - code_lang="python", - system_prompt=( - "You are an expert Python data scientist. Write clean, well-structured, and correct Python code. " - "The code must be completely self-contained — include all necessary imports, sample data creation, " - "and function definitions. Do not rely on external files or databases. " - "Use realistic variable names and follow PEP 8 conventions." - ), - prompt=( - "Write a complete, self-contained Python solution for the following {{ difficulty }}-level " - "data science task in the area of '{{ subtopic }}'.\n\n" - "Task:\n{{ instruction }}\n\n" - "Requirements:\n" - "- Include all necessary imports at the top.\n" - "- Create any sample/mock data needed so the code runs independently.\n" - "- The code must be syntactically valid and runnable as-is.\n" - "- Match the complexity to the {{ difficulty }} level.\n" - "- Add brief inline comments explaining key steps.\n\n" - "Return only the Python code." - ), - ) - ) - - # 5. 
Code validation - config_builder.add_column( - dd.ValidationColumnConfig( - name="code_validation", - target_columns=["solution"], - validator_type="code", - validator_params=dd.CodeValidatorParams(code_lang="python"), - ) - ) - - # 6. LLM-as-a-judge quality scoring - config_builder.add_column( - dd.LLMJudgeColumnConfig( - name="quality_score", - model_alias="nvidia-text", - prompt=( - "Evaluate the following Python data science coding task and its solution.\n\n" - "Difficulty: {{ difficulty }}\n" - "Subtopic: {{ subtopic }}\n\n" - "Instruction:\n{{ instruction }}\n\n" - "Solution:\n{{ solution }}\n\n" - "Score each dimension below on a 1-5 scale using the provided rubrics." - ), - scores=[ - dd.Score( - name="correctness", - description=( - "Does the solution correctly implement what the instruction asks for? " - "Consider whether the code logic is sound, handles edge cases appropriately, " - "and would produce the expected output." - ), - options={ - 1: "Completely incorrect or does not address the instruction.", - 2: "Major logical errors; partially addresses the instruction.", - 3: "Mostly correct with minor bugs or missing edge cases.", - 4: "Correct implementation with only trivial issues.", - 5: "Fully correct, handles edge cases, and produces expected output.", - }, - ), - dd.Score( - name="code_quality", - description=( - "Assess the code quality including readability, proper use of libraries, " - "PEP 8 compliance, appropriate variable naming, and idiomatic Python/pandas usage." - ), - options={ - 1: "Very poor quality: unreadable, no structure, bad naming.", - 2: "Below average: some structure but many style issues.", - 3: "Acceptable: readable but could improve in idiom or style.", - 4: "Good: clean, well-structured, follows conventions.", - 5: "Excellent: idiomatic, elegant, and highly readable.", - }, - ), - dd.Score( - name="instruction_clarity", - description=( - "How clear and unambiguous is the instruction? 
Could a competent developer " - "implement the solution from the instruction alone without guessing?" - ), - options={ - 1: "Extremely vague or incomprehensible.", - 2: "Unclear with significant ambiguity.", - 3: "Understandable but some details are missing or ambiguous.", - 4: "Clear with only minor room for interpretation.", - 5: "Crystal clear, specific, and unambiguous.", - }, - ), - ], - ) - ) - - return config_builder -
File created successfully at: ./text_to_python.py
-
-
-
-Agent -Now let me validate the config. -
-
-9 -Bash - -data-designer validate text_to_python.py 2>&1
────────────────────────── Data Designer Validate ────────────────────────── - - Config: text_to_python.py - -[22:15:26] [INFO] ✅ Validation passed - ✅ Configuration is valid
-
-
-
-Agent - -
message (1,025 chars)
Validation passed. Here's a summary of the dataset pipeline: - -| Column | Type | Description | -|---|---|---| -| `difficulty` | Category sampler | beginner, intermediate, advanced (uniform) | -| `subtopic` | Category sampler | 7 data science subtopics (data cleaning, exploratory analysis, aggregation/groupby, visualization, statistical testing, feature engineering, messy/missing data) | -| `instruction` | LLM text | Natural language task description conditioned on difficulty + subtopic | -| `solution` | LLM code (Python) | Complete self-contained Python solution for the instruction | -| `code_validation` | Code validator | Checks the solution for syntactic correctness | -| `quality_score` | LLM judge | Scores correctness, code_quality, and instruction_clarity (1-5 scale with structured reasoning) | - -To generate the dataset, run: - -``` -data-designer create text_to_python.py --num-records <N> --dataset-name text-to-python -``` - -Note: generation can take a long time for large record counts (50+), so choose `N` accordingly.
-
-
-
-
diff --git a/docs/devnotes/posts/assets/design-principles/design-principles-hero.png b/docs/devnotes/posts/assets/design-principles/design-principles-hero.png deleted file mode 100644 index 03e6c8d41..000000000 Binary files a/docs/devnotes/posts/assets/design-principles/design-principles-hero.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/have-it-your-way/customization-blocks-confusion.png b/docs/devnotes/posts/assets/have-it-your-way/customization-blocks-confusion.png deleted file mode 100644 index 9f0bf87b1..000000000 Binary files a/docs/devnotes/posts/assets/have-it-your-way/customization-blocks-confusion.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/have-it-your-way/data-designer-plugins-hero.png b/docs/devnotes/posts/assets/have-it-your-way/data-designer-plugins-hero.png deleted file mode 100644 index ebc6a7089..000000000 Binary files a/docs/devnotes/posts/assets/have-it-your-way/data-designer-plugins-hero.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/owning-the-model-stack/aimd-concurrency-over-time.png b/docs/devnotes/posts/assets/owning-the-model-stack/aimd-concurrency-over-time.png deleted file mode 100644 index 69e295846..000000000 Binary files a/docs/devnotes/posts/assets/owning-the-model-stack/aimd-concurrency-over-time.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/owning-the-model-stack/native-model-client-hero.png b/docs/devnotes/posts/assets/owning-the-model-stack/native-model-client-hero.png deleted file mode 100644 index 11f642c1b..000000000 Binary files a/docs/devnotes/posts/assets/owning-the-model-stack/native-model-client-hero.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/owning-the-model-stack/native-model-client-layers.png b/docs/devnotes/posts/assets/owning-the-model-stack/native-model-client-layers.png deleted file mode 100644 index 4600519f6..000000000 Binary files a/docs/devnotes/posts/assets/owning-the-model-stack/native-model-client-layers.png 
and /dev/null differ diff --git a/docs/devnotes/posts/assets/owning-the-model-stack/retry-boundary.png b/docs/devnotes/posts/assets/owning-the-model-stack/retry-boundary.png deleted file mode 100644 index d01d8f7cd..000000000 Binary files a/docs/devnotes/posts/assets/owning-the-model-stack/retry-boundary.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/owning-the-model-stack/throttle-keying.png b/docs/devnotes/posts/assets/owning-the-model-stack/throttle-keying.png deleted file mode 100644 index 693998f2b..000000000 Binary files a/docs/devnotes/posts/assets/owning-the-model-stack/throttle-keying.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/push-datasets-to-hugging-face-hub/push-to-hub-hero.png b/docs/devnotes/posts/assets/push-datasets-to-hugging-face-hub/push-to-hub-hero.png deleted file mode 100644 index 037da6418..000000000 Binary files a/docs/devnotes/posts/assets/push-datasets-to-hugging-face-hub/push-to-hub-hero.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/push-datasets-to-hugging-face-hub/push-to-hub-pipeline.png b/docs/devnotes/posts/assets/push-datasets-to-hugging-face-hub/push-to-hub-pipeline.png deleted file mode 100644 index de8133405..000000000 Binary files a/docs/devnotes/posts/assets/push-datasets-to-hugging-face-hub/push-to-hub-pipeline.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/push-datasets-to-hugging-face-hub/push-to-hub-round-trip.png b/docs/devnotes/posts/assets/push-datasets-to-hugging-face-hub/push-to-hub-round-trip.png deleted file mode 100644 index cc483ef8d..000000000 Binary files a/docs/devnotes/posts/assets/push-datasets-to-hugging-face-hub/push-to-hub-round-trip.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/push-datasets-to-hugging-face-hub/push-to-hub-schema-transform.png b/docs/devnotes/posts/assets/push-datasets-to-hugging-face-hub/push-to-hub-schema-transform.png deleted file mode 100644 index c3c88a06f..000000000 Binary files 
a/docs/devnotes/posts/assets/push-datasets-to-hugging-face-hub/push-to-hub-schema-transform.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/retrieval-sdg-toolkit/pipeline.svg b/docs/devnotes/posts/assets/retrieval-sdg-toolkit/pipeline.svg deleted file mode 100644 index 397169cb6..000000000 --- a/docs/devnotes/posts/assets/retrieval-sdg-toolkit/pipeline.svg +++ /dev/null @@ -1,154 +0,0 @@ - Retriever SDG Toolkit Pipeline - A four-stage flowchart showing source documents flowing through document bundling and chunking, artifact extraction and grounded QA generation, deduplication and judging, then conversion into retriever training and evaluation data. - Source Documents - Docs / policies / tickets / manuals - Stage 1 - Bundle Docs - single + multi-doc groups - Stage 1 - Chunk Docs - stable segment IDs - Stage 2 - Extract Artifacts - concepts / entities / links - Stage 2 - Generate QA - grounded multi-hop questions - Stage 3 - Deduplicate - near-duplicate queries - Stage 3 - Judge Quality - relevance / support / clarity - Stage 4 - Convert - train/val, BEIR qrels, AutoModel data diff --git a/docs/devnotes/posts/assets/search-agent/browsecomp-benchmark-results.jpg b/docs/devnotes/posts/assets/search-agent/browsecomp-benchmark-results.jpg deleted file mode 100644 index fe7d423a9..000000000 Binary files a/docs/devnotes/posts/assets/search-agent/browsecomp-benchmark-results.jpg and /dev/null differ diff --git a/docs/devnotes/posts/assets/search-agent/wikidata-graph-walk.png b/docs/devnotes/posts/assets/search-agent/wikidata-graph-walk.png deleted file mode 100644 index d437e2097..000000000 Binary files a/docs/devnotes/posts/assets/search-agent/wikidata-graph-walk.png and /dev/null differ diff --git
a/docs/devnotes/posts/assets/structured-outputs-from-nemotron/structured-outputs-sample-record.png b/docs/devnotes/posts/assets/structured-outputs-from-nemotron/structured-outputs-sample-record.png deleted file mode 100644 index c7777f910..000000000 Binary files a/docs/devnotes/posts/assets/structured-outputs-from-nemotron/structured-outputs-sample-record.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/text-to-sql/bird-benchmark-results.jpg b/docs/devnotes/posts/assets/text-to-sql/bird-benchmark-results.jpg deleted file mode 100644 index 535bc5960..000000000 Binary files a/docs/devnotes/posts/assets/text-to-sql/bird-benchmark-results.jpg and /dev/null differ diff --git a/docs/devnotes/posts/assets/text-to-sql/text-to-sql-pipeline.jpg b/docs/devnotes/posts/assets/text-to-sql/text-to-sql-pipeline.jpg deleted file mode 100644 index f057488ed..000000000 Binary files a/docs/devnotes/posts/assets/text-to-sql/text-to-sql-pipeline.jpg and /dev/null differ diff --git a/docs/devnotes/posts/assets/vlm-long-document-understanding/classification-taxonomy.png b/docs/devnotes/posts/assets/vlm-long-document-understanding/classification-taxonomy.png deleted file mode 100644 index f0397af29..000000000 Binary files a/docs/devnotes/posts/assets/vlm-long-document-understanding/classification-taxonomy.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/vlm-long-document-understanding/iterative-loop.png b/docs/devnotes/posts/assets/vlm-long-document-understanding/iterative-loop.png deleted file mode 100644 index ae71b25ce..000000000 Binary files a/docs/devnotes/posts/assets/vlm-long-document-understanding/iterative-loop.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/vlm-long-document-understanding/pipeline-architecture.png b/docs/devnotes/posts/assets/vlm-long-document-understanding/pipeline-architecture.png deleted file mode 100644 index b007b3540..000000000 Binary files 
a/docs/devnotes/posts/assets/vlm-long-document-understanding/pipeline-architecture.png and /dev/null differ diff --git a/docs/devnotes/posts/assets/vlm-long-document-understanding/vlm-sdg-hero.png b/docs/devnotes/posts/assets/vlm-long-document-understanding/vlm-sdg-hero.png deleted file mode 100644 index c070e18e1..000000000 Binary files a/docs/devnotes/posts/assets/vlm-long-document-understanding/vlm-sdg-hero.png and /dev/null differ diff --git a/docs/devnotes/posts/async-all-the-way-down.md b/docs/devnotes/posts/async-all-the-way-down.md deleted file mode 100644 index 14338337b..000000000 --- a/docs/devnotes/posts/async-all-the-way-down.md +++ /dev/null @@ -1,284 +0,0 @@ ---- -date: 2026-04-02 -authors: - - amanoel - - nmulepati - - etramel - - jgreco ---- - -# **Async All the Way Down** - -Data Designer's execution engine now schedules work at the cell level rather than the column level. Instead of running each column to completion before starting the next, the async engine dispatches a cell as soon as its specific upstream dependencies complete. Multi-model pipelines keep every endpoint saturated, and single-model pipelines benefit from AIMD-based adaptive concurrency. The result is faster pipelines with no changes to your config. - - - -
- -![From sequential columns to parallel task execution](assets/async-all-the-way-down/async-engine-hero.png) - -
- -This post walks through how we built the new execution layer, what it does differently, and what it means for pipelines at scale. - ---- - -## **The Bottleneck Was Structural** - -Every Data Designer pipeline carries a map of what can run in parallel. Consider a pipeline that generates a `topic`, writes a `summary` and a `trivia` fact from that topic, then produces an `analysis` of the summary. `summary` and `trivia` both depend on `topic`, so they could run alongside each other. `analysis` depends on `summary`, so it has to wait - but only on the same row's summary, not the entire column. These references form a per-cell dependency graph. The previous engine used that graph to order columns, but within each batch it ran each column to completion before starting the next. A row's `analysis` couldn't start until *every* row of `summary` in that batch had finished, even though it only needed its own. - -Now add one more column: `conclusion` depends on `analysis`. The dependency graph now has a branch (`trivia` runs independently) and a chain (`summary` → `analysis` → `conclusion`). That's the "Deep" shape below: - -
- -![DAG shapes used in benchmarks](assets/async-all-the-way-down/dag-shapes.svg){ style="max-width:85%; height:auto" } - -
- -In the sync engine, this pipeline takes about 8.5 seconds for 10 records with `max_parallel_requests=16`. Columns run sequentially even when they're independent — `trivia` waits for `summary` to complete despite not needing its output. Most of the wall-clock time is spent waiting on LLM responses that could have been in flight simultaneously. - -The fix isn't "make the LLM faster." It's "stop waiting when you don't have to." The figure below shows the same deep pipeline under both engines, with each bar representing the time span a column is actively generating: - -
-![Sync engine runs columns sequentially; async engine overlaps independent columns](assets/async-all-the-way-down/sync-vs-async-timeline.png){ style="max-width:100%; height:auto" } -
- -In the sync timeline (top), columns run one after another — all rows of `summary` finish before `trivia` begins. In the async timeline (bottom), the picture is fundamentally different. `summary` and `trivia` start at the same time since they share the same dependency. But the real gain is what happens next: each row's `analysis` kicks off the moment that row's `summary` completes, even while other rows of `summary` and `trivia` are still generating independently. `conclusion` fires per-row as each `analysis` result lands. Same pipeline, same config — across our benchmark trials, this shape averaged 1.3x faster. No wasted cycles, no idle slots — just the dependency graph doing exactly what it was always meant to do. - -Another way to see it: look at the dataset as a grid where each cell is one (row, column) task. The animation below shows four columns of the deep pipeline (`topic`, `summary`, `trivia`, `analysis`) across 8 rows. In the sync engine, cells fill column by column — every row of `summary` must finish before any row of `trivia` starts. In the async engine, each cell dispatches the moment its own upstream cell completes. A row's `analysis` starts as soon as that row's `summary` is done, while other rows of `summary` and `trivia` are still generating in parallel. - -
---8<-- "docs/devnotes/posts/assets/async-all-the-way-down/async-sync-animation.svg" -
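The per-cell dispatch described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the `AsyncTaskScheduler` implementation: the `DEPS` mapping, `run_cell`, and `run_grid` are hypothetical names, `asyncio.sleep` stands in for an LLM call, and there is no semaphore, frontier queue, or throttling. It shows only that a cell waits on its own row's upstream cells rather than on a whole column.

```py
import asyncio

# Hypothetical column dependencies for the "Deep" shape described above.
# Insertion order happens to be a valid topological order.
DEPS = {
    "topic": [],
    "summary": ["topic"],
    "trivia": ["topic"],
    "analysis": ["summary"],
    "conclusion": ["analysis"],
}


async def run_cell(row, column, cells, order, delay):
    # Wait only for this row's upstream cells, never for an entire column.
    for upstream in DEPS[column]:
        await cells[(row, upstream)]
    await asyncio.sleep(delay)  # stand-in for a model call
    order.append((row, column))


async def run_grid(num_rows, delays=None):
    """Run every (row, column) cell and return the completion order."""
    delays = delays or {}
    cells, order = {}, []
    for row in range(num_rows):
        for column in DEPS:
            delay = delays.get((row, column), 0.0)
            cells[(row, column)] = asyncio.create_task(
                run_cell(row, column, cells, order, delay)
            )
    await asyncio.gather(*cells.values())
    return order


# Row 0's summary is slow; row 1 should not have to wait for it.
completion_order = asyncio.run(run_grid(2, {(0, "summary"): 0.05}))
```

In this toy run, row 1's `analysis` finishes while row 0's `summary` is still sleeping, which is the behavior the grid animation depicts.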
- ---- - -## **Three Layers of Concurrency** - -Getting this right required solving three problems at different levels of the stack. We built a layered system where each layer manages one concern. - -
- -![Three layers: AsyncTaskScheduler for dependency dispatch, row-group admission for memory, ThrottleManager for rate limits](assets/async-all-the-way-down/architecture-layers.svg){ style="max-width:100%; height:auto" } - -
- -### **Layer 1: Dependency-aware dispatch** - -At the top sits the `AsyncTaskScheduler`. It builds an `ExecutionGraph` from your column configs using [Kahn's algorithm](https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm) for topological ordering, then tracks per-cell completion via a `CompletionTracker`. When a cell completes, the tracker determines which downstream cells are now ready and pushes them onto the dispatch queue. - -The scheduler maintains a *frontier* — the set of tasks whose inputs are all satisfied. Dispatch is a loop: pull ready tasks from the frontier, acquire a [semaphore](https://en.wikipedia.org/wiki/Semaphore_(programming)) slot, spawn a worker. When the worker completes, mark the cell done, which may add new tasks to the frontier. The loop runs until every cell in every row group has completed or been dropped. - -There's a subtlety in how the scheduler manages its task slots, and getting it right required a delicate dance between two semaphores. A naïve approach would hold a submission slot for the entire lifetime of a task. That's fine for the outbound HTTP call — the slot is released before the request goes out. But the `ThrottleManager` can impose an internal timeout while waiting for a permit during AIMD cooldown, and *that* wait would hold the submission slot hostage. If enough tasks are blocked waiting for throttle permits, the scheduler can't dispatch new work even when the frontier has ready tasks. - -The fix is a one-way semaphore handoff. The scheduler maintains two pools: a *submission* semaphore that caps how many tasks can be dispatched, and an *LLM-wait* semaphore (sized larger) for tasks that are blocked on a model call. When a task is about to call the model, it acquires an LLM-wait slot and releases its submission slot in the same atomic operation — stepping from one pool to the other mid-flight. The dispatch loop immediately sees a free submission slot and can send another task. 
When the LLM responds, the LLM-wait slot is released. Non-LLM generators (samplers, Jinja expressions) skip the handoff and hold their submission slot for the full duration, which is fine because they complete quickly. - -```py -if is_llm_bound: - await self._llm_wait_semaphore.acquire() - holds_llm_wait = True - self._submission_semaphore.release() - holds_submission = False -``` - -This keeps the dispatch loop saturated without unbounded coroutine growth — the submission semaphore controls how fast tasks enter, and the LLM-wait semaphore controls how many are waiting on the network. - -### **Layer 2: Row-group admission** - -Below the scheduler, the 10,000 rows you requested aren't all in memory at once. They're partitioned into row groups that checkpoint to parquet independently. A semaphore limits how many row groups are in flight simultaneously, preventing memory from growing unboundedly on large runs. - -When a row group completes — all columns generated for all its rows — the buffer manager flushes it to disk and releases the memory. Partial results appear on disk during generation. If something fails, you keep everything that already checkpointed. This is also the basis for fault tolerance, discussed later — the unit of loss is a row group, not the entire run. - -### **Layer 3: Adaptive rate limiting** - -At the bottom, each (provider, model) pair gets an independent concurrency pool with [additive-increase, multiplicative-decrease (AIMD)](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) rate adaptation. When the provider returns a 429, the pool cuts its concurrency. On streaks of successful requests, it gradually increases. Because this happens per-model, a judge model running on one provider can saturate its endpoint while a generator on another provider is backing off. The [Owning the Model Stack](owning-the-model-stack.md) dev note covers this layer in depth. - -### **How they compose** - -The layers are independent. 
The scheduler decides *what* to run next. The row-group layer decides *how much* to keep in memory at once. The throttle layer discovers *how fast* each provider will accept requests. No layer needs to know about the others. - -A single task's lifecycle makes the composition concrete: - -
- Task lifecycle: Frontier → Dispatch → Generator → Throttle → LLM Call → Complete, with downstream cells looping back to the frontier -
- -A cell enters the frontier when its upstream dependencies are satisfied. The dispatch loop acquires a submission slot and hands it to a worker. The worker runs the generator, which acquires a throttle permit before making the LLM call. On completion, permits are released, the cell is marked done in the `CompletionTracker`, and any downstream cells whose dependencies are now satisfied enter the frontier. The cycle continues until every cell has completed or been dropped. - ---- - -## **Benchmark Results** - -We tested four DAG shapes that represent common pipeline patterns. All benchmarks used 10 records with `max_parallel_requests=16`, running 4 measured trials (interleaved sync/async to reduce temporal bias) after a warmup. - -
- -![Grouped bar chart showing sync vs async wall clock time across four workload shapes](assets/async-all-the-way-down/benchmark-results.png){ style="max-width:100%; height:auto" } - -
- -The pattern is clear: speedup scales with the amount of parallelism available in the DAG. - -
- -| Workload | DAG shape | Sync | Async | Speedup | -| :--- | :--- | :---: | :---: | :---: | -| **Narrow** | 4-column sequential chain | 5.2s | 4.6s | 1.1x | -| **Deep** | Chain + independent branch | 8.5s | 6.6s | 1.3x | -| **Wide** | 5 independent columns | 6.7s | 4.5s | 1.5x | -| **Dual-model** | 3 generators + 3 judges | 10.0s | 6.1s | 1.6x | - -
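As a quick sanity check, the speedup column can be recomputed from the sync and async timings in the table. The snippet below is a minimal sketch; the `results` dictionary and `speedup` helper are illustrative names, and the numbers are copied from the table rather than measured again.

```py
# Wall-clock seconds copied from the benchmark table: (sync, async).
results = {
    "Narrow": (5.2, 4.6),
    "Deep": (8.5, 6.6),
    "Wide": (6.7, 4.5),
    "Dual-model": (10.0, 6.1),
}


def speedup(sync_s, async_s):
    # Speedup is sync time divided by async time, rounded to one decimal.
    return round(sync_s / async_s, 1)


speedups = {name: speedup(*timings) for name, timings in results.items()}
```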
- -
- -![Speedup increases from 1.1x to 1.6x as DAG parallelism increases](assets/async-all-the-way-down/speedup-scaling.png){ style="max-width:100%; height:auto" } - -
- -The **narrow** workload is a sequential chain with no cross-column parallelism. The async engine still ekes out a small gain from overlapping row-level dispatch, but there's no structural parallelism to exploit. This is expected: async can't speed up a fundamentally serial pipeline. - -The **dual-model** workload is the most interesting case. Three generation columns use one model, and three judge columns use another. Each model gets its own ThrottleManager pool. The judge model starts processing rows as soon as the first generator finishes, running at full concurrency while the generator is still producing. In the sync engine, all generation has to finish before any judging starts. - -### **At higher record counts** - -The benchmarks above use 10 records deliberately — small batches isolate the scheduling benefit from rate-limit effects. At higher record counts, the bottleneck shifts. The async engine dispatches requests more aggressively, which means it discovers the provider's rate limits sooner. When a 429 hits, the AIMD controller backs off, and the backoff can cascade through downstream columns that were waiting on the throttled model's output. - -This is where the per-model throttle pools become important. Single-model pipelines are most susceptible to cascading backoff because all columns compete for the same pool. Multi-model pipelines hold up well because each model adapts independently — a 429 on the generator model doesn't slow down the judge. In our larger runs, dual-model and multi-provider workloads consistently showed the largest async gains. - -The primary tuning lever is `max_parallel_requests` per model. Set it to a generous upper bound and let AIMD find the real ceiling. See the [Owning the Model Stack](owning-the-model-stack.md) dev note for the full story on adaptive concurrency. - -### **At scale with self-hosted inference** - -Rate limits are a property of hosted API providers. 
With self-hosted vLLM on your own GPUs, the bottleneck shifts from API quotas to GPU throughput, and the async engine's aggressive dispatch becomes an advantage rather than a risk. - -We ran the dual-model pipeline at 100k records on a Slurm cluster with NVIDIA A100-80GB GPUs: one node running a 120B generator model (TP=4, DP=2) and a second node running an 8B judge model (TP=1, DP=8). Each job processed a 10k-record shard, with 10 shards running in parallel. This is a two-node setup, but the same approach extends to as many nodes and models as your pipeline needs. - -
- -![Boxplot showing 1.6x speedup across 10 shards of 10k records on self-hosted vLLM](assets/async-all-the-way-down/scale-boxplot.png){ style="max-width:85%; height:auto" } - -
- -Across 10 shards, the async engine averaged 16 minutes per shard versus 25 minutes for sync, a consistent 1.6x speedup with low variance. No rate limits, no AIMD backoff, just the scheduling difference. - -The model activity timeline shows why. In sync mode, DD processes each column to completion before starting the next, so the generator and judge models take turns. In async mode, the judge starts processing rows as soon as the first generator results land, keeping both models busy simultaneously. - -
- -![Model activity timeline showing sequential vs overlapping model usage](assets/async-all-the-way-down/scale-model-timeline.png){ style="max-width:100%; height:auto" } - -
- -Look at the dot strips beneath each Gantt chart. In sync mode, each model endpoint is at full capacity while it's active - but only one is active at a time. The generator GPUs sit idle while the judge runs, and vice versa. When a single self-hosted endpoint is already saturated, async scheduling alone can't push more throughput through it. The speedup here comes from pipelines with multiple endpoints, where async keeps all of them busy simultaneously instead of leaving half your GPUs idle. - ---- - -## **Beyond Speed** - -The performance numbers are satisfying, but raw throughput is only part of the picture. The async engine changes several things about the experience of running large pipelines. - -### **Progress you can see** - -Because rows complete out of order and row groups checkpoint independently, results start appearing on disk within seconds. The new progress bars — sticky ANSI bars that redraw in-place at the bottom of the terminal — update on every task completion rather than waiting for a full column to finish. Log messages from the scheduler and throttle layer render above the bars, so you see both the high-level progress and the per-event detail. A 10-minute generation run no longer means staring at nothing until the end. - -``` - column 'topic' ████████████████████████████████████░░░░ 89% | 890/1000 | 148.3 rec/s | eta 1s | 0 failed - column 'summary' ██████████████████████████░░░░░░░░░░░░░░ 65% | 650/1000 | 108.3 rec/s | eta 3s | 2 failed - column 'trivia' █████████████████████████████░░░░░░░░░░░ 72% | 720/1000 | 120.0 rec/s | eta 2s | 0 failed - column 'analysis' ██████████████░░░░░░░░░░░░░░░░░░░░░░░░░░ 35% | 350/1000 | 87.5 rec/s | eta 7s | 1 failed -``` - -When tracing is enabled (`DATA_DESIGNER_ASYNC_TRACE=1` or `RunConfig(async_trace=True)`), the scheduler also records a `TaskTrace` for every task: when it was dispatched, when it acquired a semaphore slot, when it completed, and its status. 
These traces are available on the result object after the run, so you can reconstruct the scheduler's timeline and understand where time was spent. - -### **Fault tolerance** - -Failures in a long-running pipeline are not exceptional — they're expected. Model endpoints return 429s, connections time out, prompts produce unparseable output. The scheduler classifies errors into two buckets. - -Retryable errors (rate limits, timeouts, transient server errors) are deferred rather than dropped. The task stays on the frontier so it can be re-attempted. If a row group stalls — all of its pending tasks are deferred and nothing is in flight — the scheduler detects the deadlock and runs *salvage rounds*: it re-dispatches the deferred tasks inline, up to a configurable maximum number of attempts. Tasks that still fail after salvage are dropped, and the row group is checkpointed with whatever succeeded. This prevents a stalled row group from holding its semaphore slot forever and blocking admission of new row groups. - -Non-retryable errors (malformed output, validation failures) drop the row immediately. The `CompletionTracker` knows which downstream tasks depended on that row and removes them from the frontier, so no work is wasted on a row that's already lost. - -In both cases, completed row groups are already on disk. The unit of loss is at most one row group, not the entire run. If the scheduler detects a sustained high error rate, it can shut down early, preserving everything that already checkpointed. - -### **Multi-model concurrency** - -Multi-model pipelines are where the architecture pays for itself. With independent throttle pools per model, there's no reason not to use the right model for each job: a reasoning model for generation, a smaller model for judging, an embedding model for deduplication, each running at its own optimal concurrency. The previous engine supported multi-model configs, but running them concurrently is what makes them practical at scale. 
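To make the idea of independent per-model throttle pools concrete, here is a minimal sketch of an AIMD limit kept per (provider, model) pair. It is an assumption-laden illustration, not the `ThrottleManager` code: `AimdPool`, its fields, the streak length, and the halving factor are all hypothetical choices, and `max_limit` only plays the role that `max_parallel_requests` is described as playing (a generous upper bound).

```py
from dataclasses import dataclass


@dataclass
class AimdPool:
    """Hypothetical concurrency limit for one (provider, model) pair."""

    limit: float
    max_limit: float
    min_limit: float = 1.0
    increase: float = 1.0      # additive step after a success streak
    decrease: float = 0.5      # multiplicative factor applied on a 429
    streak_needed: int = 5     # successes required before increasing
    _streak: int = 0

    def on_success(self):
        self._streak += 1
        if self._streak >= self.streak_needed:
            self.limit = min(self.max_limit, self.limit + self.increase)
            self._streak = 0

    def on_rate_limited(self):
        # A 429 cuts the limit and resets the success streak.
        self.limit = max(self.min_limit, self.limit * self.decrease)
        self._streak = 0


# Each model adapts on its own; the keys here are placeholder names.
pools = {
    ("provider-a", "generator"): AimdPool(limit=16, max_limit=16),
    ("provider-b", "judge"): AimdPool(limit=16, max_limit=16),
}
```

Because each pool holds its own state, a rate-limit response recorded against the generator's pool leaves the judge's limit untouched, which is the property the multi-model discussion relies on.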
- -### **Adoption** - -Adoption is opt-in. Set `DATA_DESIGNER_ASYNC_ENGINE=1` in your environment. Your existing pipeline definitions, dependency graph, column configs, model aliases all stay the same. We're keeping it behind an environment variable while we harden edge cases, but the goal is to make async the default. - ---- - -## **The Build** - -This was a ground-up rebuild of the execution layer, delivered across six PRs over four weeks. - -It started with the data structures: `ExecutionGraph`, `CompletionTracker`, and task models ([#356](https://github.com/NVIDIA-NeMo/DataDesigner/pull/356)). Next came the generator migration ([#378](https://github.com/NVIDIA-NeMo/DataDesigner/pull/378)), where we added symmetric `generate()`/`agenerate()` bridging so every generator works in both modes without rewriting. The core scheduler and buffer manager followed in [#404](https://github.com/NVIDIA-NeMo/DataDesigner/pull/404), then integration into `DatasetBuilder` with callbacks and trace export ([#429](https://github.com/NVIDIA-NeMo/DataDesigner/pull/429)). The `ThrottledModelClient` and dual-semaphore scheduler landed in [#449](https://github.com/NVIDIA-NeMo/DataDesigner/pull/449), wiring AIMD concurrency control into every outbound model request. A final polish pass ([#456](https://github.com/NVIDIA-NeMo/DataDesigner/pull/456)) added async preview, unified lifecycle callbacks, and sticky ANSI progress bars. - -The symmetric bridging was critical for adoption. Every `ColumnGenerator` has both a `generate()` and an `agenerate()` method. 
Implement one, and the base class synthesizes the other: - -```py -class ColumnGenerator: - def generate(self, data): - # If only agenerate() is overridden, bridge to it synchronously - if not self._is_overridden("agenerate"): - raise NotImplementedError - return _run_coroutine_sync(self.agenerate(data)) - - async def agenerate(self, data): - # If only generate() is overridden, run it in a thread pool - if not self._is_overridden("generate"): - raise NotImplementedError - return await asyncio.to_thread(self.generate, data.copy()) -``` - -Generator authors implement whichever method is natural — sync for CPU-bound work, async for generators that make their own network calls — and the base class handles bridging. No existing generator needed to be rewritten. Plugin authors get async support for free. - ---- - -## **Try It** - -Enable the async engine on any existing pipeline by setting an environment variable: - -```bash -DATA_DESIGNER_ASYNC_ENGINE=1 python my_pipeline.py -``` - -Pair it with the new progress bars for real-time feedback: - -```py -from data_designer.config.run_config import RunConfig -from data_designer.interface import DataDesigner - -dd = DataDesigner() -dd.set_run_config(RunConfig( - progress_bar=True, -)) -result = dd.create( - config_builder=config, - num_records=1000, -) -``` - -Pipelines with independent columns or multi-model setups will see the largest gains. Sequential chains will run at roughly the same speed. No config changes required. - -The dependencies were always per-cell. Now the engine schedules them that way. - ---- - -## **Update** - -The async engine is now the default execution path. Set `DATA_DESIGNER_ASYNC_ENGINE=0` to opt back into the legacy sync engine for one transitional release. The [Architecture & Performance](../../concepts/architecture-and-performance.md#async-engine) page covers the configuration knobs and behaviors worth knowing about. - ---- - -Key Resources: - -1. 
[NeMo Data Designer on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner) -2. [Data Designer Documentation](https://nvidia-nemo.github.io/DataDesigner/) -3. [Owning the Model Stack: Adaptive Concurrency](owning-the-model-stack.md) — companion dev note on the native client layer and AIMD throttling -4. [Async Engine Plan (#346)](https://github.com/NVIDIA-NeMo/DataDesigner/issues/346) — original issue and architecture plan - -*Want to learn more about NeMo Data Designer? Check out our [documentation](https://nvidia-nemo.github.io/DataDesigner/) and start building your own synthetic data pipelines today.* diff --git a/docs/devnotes/posts/data-designer-got-skills.md b/docs/devnotes/posts/data-designer-got-skills.md deleted file mode 100644 index 33513d743..000000000 --- a/docs/devnotes/posts/data-designer-got-skills.md +++ /dev/null @@ -1,239 +0,0 @@ ---- -date: 2026-03-24 -authors: - - jgreco ---- - - -# **Data Designer Got Skills** - -*Lessons from building an agent-first CLI and skill for Data Designer* - -We just published the `data-designer` skill, which leverages agent-focused CLI commands in [Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) to efficiently generate datasets. Just describe the dataset you want and your agent will craft the Data Designer configuration for you — schema design, validation, preview, generation — interactively or on full autopilot (just tell the agent to "be opinionated" or "surprise me"). - - - -![Data Designer Got Skills](assets/data-designer-got-skills/hero_data_designer_got_skills.png){ width=100% } - -Instead of asking agents to explore the source code, a single CLI command (`data-designer agent context`) delivers curated, code-derived context in one read, and three more commands (`validate`, `preview`, `create`) handle the rest. The agent's only job is writing the configuration. 
Combined with the new skill, this reduces token usage by **~80%**, errors by **~90%**, and wall-clock time by **~47%** — all while improving output quality (mean judge score **4.0 → 4.7**). We benchmarked our approach across 228 sessions each for the skill and a baseline. - -In today's Dev Note, we'll walk through the challenges agents face when using new libraries, how we designed a CLI and skill to help them, and the benchmark results in detail. - -!!! tip "TL;DR – Tips for building agent skills for your library" - - 1. **Consolidate your public API surface.** Keep your user-facing API in a small, well-documented set of files, separate from execution internals. This can significantly reduce the number of files agents need to read in the usage context. - - 2. **Build CLI commands that surface this context.** Agents love CLIs! Build one that exposes code-derived API context, config validation, and workflow execution as commands — agents shouldn't have to crawl your source code or learn standard execution patterns. - - 3. **Always review your session histories.** This has become the "always look at your data" of 2026. Reviewing sessions was one of the most important steps in our skill development process, showing us exactly where and how agents get stuck and waste tokens. - - 4. **Benchmark your skill against a baseline.** We saw large gains in efficiency, error rates, and output quality, but only because we ran hundreds of controlled sessions to verify. Benchmarking along the way helped guide our design decisions and gave us confidence that we were moving in the right direction. - ---- - -## **Agents as First-Class Users** - -Agents have become first-class users of basically all software. Somewhere in the last few months, we crossed a threshold. 
Models like [Opus 4.5](https://www.anthropic.com/news/claude-opus-4-5) and [Codex 5.1](https://openai.com/index/gpt-5-1-for-developers/), paired with maturing harnesses like [Claude Code](https://code.claude.com/docs/en/overview), [Codex](https://chatgpt.com/codex), and [OpenCode](https://opencode.ai/), have become _really_ good. They're real users of your library, and their experience with your API matters. - -We use agents to both build Data Designer and use it to generate datasets. When we started watching how they actually interact with the tool, a pattern emerged. They spend most of their tokens in the wrong place. Crawling engine internals, reading DAG resolution logic, reconstructing the API after reading most of the source code. They get there eventually, which is impressive, but the path they take is wasteful. - -The problem isn't the agent. Data Designer has a small config API — three or four files that contain nearly all the context you need for the typical use case. But nothing was pointing the agent at those files instead of the backend engine. If your library has a CLI, it's worth asking: does it serve your agent users as well as it serves your human ones? A single CLI command that delivers curated API context can replace dozens of tool calls spent on source-code exploration. - ---- - -## **The Baseline: Let the Agent Figure It Out** - -To see what this looks like in practice, let's walk through a simple example. We prompted Claude Code to build a text-to-python dataset with Data Designer, providing a relatively detailed dataset description, instructions to locate the library source, and a CLI command to discover valid [model aliases](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) in the user's environment. - -The prompt we used is shown below. Note that the hints at the bottom matter more than you might think. 
Providing the package path and the CLI commands up front streamlines the work the agent needs to do to understand the library and use it. - -
-The prompt - -```text -I need to generate a text-to-python dataset focused on data science and analytics for -supervised fine-tuning (SFT) a code LLM. - -Each record should include at least: -- A natural language instruction describing a data science task in Python. -- A difficulty level sampled from beginner, intermediate, and advanced. -- A subtopic sampled from areas like data cleaning, exploratory analysis, aggregation - and groupby operations, visualization with matplotlib/seaborn, statistical testing, - feature engineering, and working with messy or missing data. -- A complete Python solution generated by an LLM that correctly implements the instruction. - The code should be syntactically valid and self-contained. -- A code validation column that checks the generated solution for syntactic - correctness and reports any issues found. -- An LLM-as-a-judge column that scores each (instruction, solution) pair on correctness, - code quality, and instruction clarity. Use a 1-5 scale and return structured results. - -The instructions should be realistic and diverse — vary the complexity, libraries used, -and required operations to match the difficulty level. - -Use Data Designer for this task. - -You can find the installed package directory by running: - - python -c "import data_designer.config as dd, os; print(os.path.dirname(dd.__file__))" - -Check the available model aliases (those with API keys) by running: - - data-designer config list - -Write a Python file with a function called `load_config_builder` that returns the config builder. - -Validate that it is configured correctly using: - - data-designer validate -``` - -
- -While the agent didn't exactly start from zero, we didn't give it much and it still figured it out – impressive! It found the package, explored the source, pieced together the API, and produced a valid configuration. But look at the path it took: - -- The main agent located the package and ran `config list`, then spawned a **subagent** to "Explore the Data Designer package thoroughly." -- The subagent read **14 source files** (some two or three times), hit an error on `__init__.py`, recovered, and returned a detailed report. **25 tool calls** inside the subagent alone. -- Back in the main agent, it re-read `column_configs.py`, `validator_params.py`, and `config/__init__.py` (files the subagent already covered), grepped for `CategorySamplerParams` and `add_column`, then wrote the config and validated. -- Final tally: **35 tool calls**, **1 error**, **159 seconds**, **~1.4M tokens** between the main agent and the subagent. - -Review the full session below: - ---8<-- "docs/devnotes/posts/assets/data-designer-got-skills/trace-baseline.html" - -While this session started from a clear context, real sessions often don't start that way. There's prior context from the user's other work, back-and-forth as they iterate with the agent on the design, maybe a second or third run. Every token spent on exploration is context budget that can't go toward the actual task. Lean context is essential for multi-turn sessions. - ---- - -## **Shortening the Path: Data Designer's Agent CLI and Skill** - -Data Designer's CLI was previously only used for model configuration and downloading assets. But agents are first-class users now, and they already know how to run commands and read stdout. We saw an opportunity to extend the CLI with commands designed specifically for agent consumption. The `data-designer` skill leverages these new commands with workflows for interactive and autopilot dataset generation. 
- -```bash -# Bootstrap all code-derived agent context -data-designer agent context - -# Validate a config script -data-designer validate - -# Generate a small sample to inspect and iterate on -data-designer preview - -# Generate the full dataset -data-designer create --num-records --dataset-name -``` - -`agent context` dynamically generates a structured reference from the library's source code. Column types, sampler parameters, validator configs, constraints, processors, available model aliases with their providers, installed persona datasets, and the exact files to read for needed context — everything the subagent spent 25 tool calls piecing together, delivered in one read. Because the output is derived from the code at runtime, it stays in sync as the API evolves. - -The other three commands standardize config validation and dataset generation. `validate` catches config errors before any generation tokens are spent. `preview` generates a small sample to inspect and iterate on (the agent can enter this self-improvement loop on its own). `create` runs the full generation. Everything downstream of the configuration is Data Designer's domain. DAG construction, batching, execution. The agent never touches it. - -### Coding best practices still matter - -Data Designer's modular design and clear boundary between configuration and execution predate any agent work. This design, which we chose for testability and maintainability, turns out to be exactly what agents need – a small, predictable set of files that fully describes the API surface. `agent context` exploits this boundary. It dumps the config layer and nothing else. If your library has a similar separation, you're already most of the way there. You just need to surface it. - -### The Skill in action - -Let's see the [skill](https://github.com/NVIDIA-NeMo/DataDesigner/tree/main/skills/data-designer) in action. Same dataset task as before, but this time the prompt is just the dataset description.
No package path, no `config list`, no validate command. The skill provides all of that. - -
-The prompt - -```text -I need to generate a text-to-python dataset focused on data science and analytics for -supervised fine-tuning (SFT) a code LLM. - -Each record should include at least: -- A natural language instruction describing a data science task in Python. -- A difficulty level sampled from beginner, intermediate, and advanced. -- A subtopic sampled from areas like data cleaning, exploratory analysis, aggregation - and groupby operations, visualization with matplotlib/seaborn, statistical testing, - feature engineering, and working with messy or missing data. -- A complete Python solution generated by an LLM that correctly implements the instruction. - The code should be syntactically valid and self-contained. -- A code validation column that checks the generated solution for syntactic - correctness and reports any issues found. -- An LLM-as-a-judge column that scores each (instruction, solution) pair on correctness, - code quality, and instruction clarity. Use a 1-5 scale and return structured results. - -The instructions should be realistic and diverse — vary the complexity, libraries used, -and required operations to match the difficulty level. -``` - -
- ---8<-- "docs/devnotes/posts/assets/data-designer-got-skills/trace-skill.html" - -The skill session followed a direct, linear path: `agent context` → read 6 config files → write config → validate. **9 tool calls, 0 errors, 92 seconds, ~210k tokens.** Compare that to the baseline: **35 tool calls, 1 error, 159 seconds, ~1.4M tokens.** - -Of course, these are individual sessions, and there's variance in both directions. Sometimes the baseline finds a lucky path and performs closer to the skill. Sometimes the skill takes a wrong turn. That said, the examples above are representative of the typical (median) outcomes we observed. To see whether the pattern holds, we ran **228 sessions each** for the skill and baseline, as described in the next section. - ---- - -## **Measuring the Difference** - -![Data Designer Skill Benchmark](assets/data-designer-got-skills/hero_benchmark_figure.png){ width=100% } - -Evaluating agent skills is harder than it might seem. Behavior is non-deterministic, sensitive to context, and varies with prompt wording. Environment isolation is critical: coding agents explore their surroundings before they start working, so if a baseline session can discover the skill files on disk, it will use them. We observed this failure mode early on and had to ensure each session got a fully isolated environment. [LangChain's writeup](https://blog.langchain.com/evaluating-skills/) on evaluating skills is an excellent read that covers many of the same challenges. - -In our experiment setup, each session started from a clean slate (new directory, fresh git history, clean venv with no skill files present for baseline runs). We used the text-to-python use case across three prompt detail levels (low, medium, high), half at high reasoning effort and half at low. Claude Code was run in headless mode (i.e., via `claude -p`).
Each session ends when the agent produces a validated configuration — we stop at `data-designer validate` rather than running full generation, both for easier automation and because once the config is valid, generation is just a simple `data-designer create` away. The main results are shown in the figure above and are summarized below. - -- ⚡ **Our skill and agent CLI use ~80% fewer tokens (panel a).** The skill replaces source-code exploration with directed context. Output tokens fall **65%**, tool calls **72%**, errors **90%**, wall clock time **47%**. Every downstream metric improves. - -- 📈 **Beyond the efficiency gains, the skill also produces higher-quality results (panel b).** We used an LLM judge (GPT-5.3 Codex) on a 1–5 scale. Mean quality score went from **4.0 → 4.7**. The standout is feature utilization — how well the agent uses the library's capabilities — which jumped **3.1 → 4.6**. The skill surfaces capabilities like diversity axes, sampler types, and validators directly in the context. - -- 🛡️ **Errors are nearly eliminated at high reasoning effort (panel c).** Mean errors per session drop from **1.18 → 0.04** when reasoning effort is high, and **1.67 → 0.25** when it's low. Fewer errors mean fewer recovery loops, fewer tokens burned on retries, less chance of the agent going down a dead end. The table below breaks down where the errors come from. The skill nearly wipes out file/path and import errors, and cuts config validation failures by more than two-thirds. - -
- - **Error Breakdown by Category** - - | Group | Total | % | Baseline | Skill | - |---|---|---|---|---| - | **File/Path Not Found** | 228 | 63.5% | 216 | 12 | - | **Config Validation Failures** | 92 | 25.6% | 70 | 22 | - | **Import Errors** | 32 | 8.9% | 32 | 0 | - | **Tool/Environment Issues** | 7 | 1.9% | 7 | 0 | - -
- -- ⏱️ **Wall-clock time is cut roughly in half (panel d).** **193s → 101s** with high reasoning, **172s → 92s** with low. Less exploration, fewer errors, fewer retries. The time savings follow naturally. - ---- - -## **Getting Started** - -First, you will need to install Data Designer and set up your model providers. The [quickstart guide](https://github.com/NVIDIA-NeMo/DataDesigner#quick-start) in our README walks through this. We recommend using a virtual environment to manage dependencies. - -Next, install the skill. Note that while the skill should work with other coding agents that support skills, our development and testing has focused on Claude Code at this stage. - -**Install via [skills.sh](https://skills.sh):** - -```bash -npx skills add NVIDIA-NeMo/DataDesigner -``` - -!!! tip - When prompted, make sure to select **Claude Code** as an additional agent. - -After installation, open Claude Code and type `/data-designer`, or just tell it you want to generate a dataset along with a description of what you want and the skill will kick in. - -The skill has two modes. In interactive mode (the default), the agent asks clarifying questions and has you make key design decisions (diversity axes, sampling strategies, model selection). You review sample records, give feedback, and iterate until it's right. - -Autopilot mode is the opposite. The agent reads your description, makes its own design decisions (and tells you what they are), then validates and generates without waiting. To enter this mode, just tell the agent to "be opinionated", "surprise me", or imply that you don't want to be involved in the design process. - -Both produce the same artifact. A standalone Python script calling Data Designer's public API. Re-runnable, modifiable, version-controllable. - ---- - -## **What's Next** - -Everything described in this post is live, and we're paying close attention to how people use it. 
Feedback from early adopters is very welcome and will help us shape what comes next. - -On the automation side, the agent already asks if you want it to review the generated dataset and suggest improvements. We're working on closing that loop (generate config, preview, review, improve, repeat) so the agent runs a few iterations on its own before handing you the result. - -We also plan to add domain-specific SDG references that the agent can draw on for specialized use cases (healthcare, finance, legal, etc.). The goal is for the agent to bring domain expertise to dataset design alongside library knowledge. - -Stay tuned. - -👋 Thanks for reading and happy dataset building! - ---- diff --git a/docs/devnotes/posts/deep-research-trajectories.md b/docs/devnotes/posts/deep-research-trajectories.md deleted file mode 100644 index c20c4d2e6..000000000 --- a/docs/devnotes/posts/deep-research-trajectories.md +++ /dev/null @@ -1,1070 +0,0 @@ ---- -date: 2026-02-10 -authors: - - etramel ---- - -# **Deep Research Trajectories with NeMo Data Designer and MCP Tool Use** - -Data Designer [v0.5.0](https://github.com/NVIDIA-NeMo/DataDesigner/releases/tag/v0.5.0)'s MCP [tool-use support](../../concepts/tool_use_and_mcp.md) lets you generate multi-turn research trajectories, the kind of data needed to train deep research agents that iteratively search, read, and synthesize evidence before answering a question. - - - ---- - -![OpenResearcher benchmark results across deep research tasks. 
Source: Li, Jiang, Ma et al., 2026.](https://huggingface.co/OpenResearcher/OpenResearcher-30B-A3B/resolve/main/imgs/teaser.png){ align=right width=500 } - -Deep research agents like [OpenResearcher](https://github.com/TIGER-AI-Lab/OpenResearcher) (Li, Jiang, Ma et al., 2026) and [Universal Deep Research](https://arxiv.org/abs/2509.00244) (Belcak & Molchanov, 2025) generate long reasoning chains interleaved with tool calls: formulating queries, retrieving documents, reading passages, refining hypotheses, and eventually synthesizing an answer. Training these agents requires trajectory data capturing the full multi-turn interaction between a model and its tools: every search, every document opened, every dead end explored. - -OpenResearcher demonstrated something worth paying attention to: synthetic trajectories generated against a *local* retriever ([BM25](https://dl.acm.org/doi/abs/10.1561/1500000019) over a static corpus, no web APIs) are sufficient to train [Nemotron Nano 3](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) to outperform GPT-4.1 on deep research benchmarks. The data format (complete tool-use traces showing how a model moves through an information space) matters more than model scale. Nemotron Nano 3, with only 3B active parameters, beats models orders of magnitude larger on multi-hop research tasks. - -This post shows how to generate that same kind of training data using Data Designer's MCP tool-use capabilities. We build a retriever as an MCP server, construct a corpus with known-good evidence, run a teacher model through the full research process, and use an LLM judge for rejection sampling. The result is a pipeline that produces high-quality research trajectories you can use for supervised fine-tuning or as a starting point for RL. - -Here's what one of those trajectories looks like, a 4-hop question answered correctly by Claude Opus 4.5 using the pipeline described below. 
Each line is a tool call; parallel calls within the same turn are grouped. - -
-Example trajectory: 4-hop question, 31 turns, 49 tool calls - -
- - -
Q: Who was the mother of the person under whom the colonizer of the winner of a border war with Libya in the 1st century BC reached its greatest extent?
-
Reference: Marcia
-
⭐ = golden passage (contains evidence for the answer)
- -
T1
-
T2
-
T3
-
T4
openmd_001400
-
T5
-
T6
-
T7
openmd_000167
-
T8
-
T9
openmd_001676 ⭐
-
T10
-
T11
openfw_021746
-
T12
-
T13
openmd_001054 ⭐
-
T14
-
T15
openmd_002075 ⭐
openmd_002510 ⭐
-
T16
-
T17
openmd_002117
-
T18
-
T19
-
T20
openmd_000722
-
T21
-
T22
-
T23
-
T24
openmd_002076 ⭐
-
T25
-
T26
-
T27
-
T28
-
T29
-
T30
-
T31
answer
Based on my extensive research, I have gathered the following evidence:

1. The Roman Empire reached its greatest territorial extent under Trajan (117 CE) — confirmed by multiple documents
2. Trajan's mother was from the Gens Marcia — the Trajan document states he "on his mother's side from the Gens Marcia, of an Italic family of Sabine origin"
3. Egypt was colonized/annexed by Rome after Cleopatra's death (30 BC), making it a Roman province
4. Egypt had border conflicts with Libya (ancient Marmaricae/Cyrenaica region) in the 1st century BC

The answer is Marcia.
- -
-
- ---- - -## **Step 1: Building the Retrieval MCP Server** - -OpenResearcher's key design choice is a three-tool browser interface rather than a single retrieval call. The paper argues (and their ablations confirm) that separating search, document opening, and in-document search forces the model to develop genuine research strategies: skimming results, diving into promising documents, hunting for specific evidence within them. A single monolithic "retrieve" tool collapses this entire workflow into one step, which produces shorter and less useful training trajectories. - -We implement the same three tools as an MCP server that Data Designer can invoke during generation. Our retriever uses [BM25S](https://github.com/xhluca/bm25s) for fast lexical search over the corpus: - -```python -from mcp.server.fastmcp import FastMCP - -mcp_server = FastMCP("corpus-retriever") - -@mcp_server.tool() -def search(query: str, top_k: int = 10) -> dict: - """Search for candidate documents to explore.""" - # BM25S search over the corpus, returns ranked results with snippets - ... - -@mcp_server.tool(name="open") -def open_document(doc_id: str) -> dict: - """Open a document for detailed inspection with cursor-numbered chunks.""" - # Returns content formatted as [1] paragraph... [2] paragraph... - ... - -@mcp_server.tool() -def find(doc_id: str, query: str) -> dict: - """Find matching passages inside a document by keyword.""" - # Returns matching chunks with cursor positions - ... - -if __name__ == "__main__": - mcp_server.run() -``` - -`search` returns a ranked list of document IDs with short snippets, enough for the model to decide which documents look promising. `open` returns the full document content, split into cursor-numbered chunks so the model can reference specific passages. `find` does targeted keyword search *within* a single document, letting the model locate specific evidence without reading the entire thing. 
The cursor-based chunking across `open` and `find` gives the model a way to scan long documents incrementally, the way a human researcher would scan a paper for the relevant section rather than reading it cover to cover. - -The server runs as a local stdio process, which means Data Designer launches and manages it automatically. No external services, no API keys for retrieval, no rate limits. - ---- - -## **Step 2: Building the Corpus** - -The corpus design follows directly from OpenResearcher's most striking ablation result. They tested what happens when you vary the retrieval corpus while keeping the reasoning model fixed (GPT-OSS-120B). The results, from the [OpenResearcher Appendix](https://boiled-honeycup-4c7.notion.site/Appendix-301e290627b58082abffd1ea2c262eb2): - -
- -| Corpus | BrowseComp-Plus Accuracy | -| :---- | :----: | -| Golden passages only (BrowseComp-Plus corpus) | 56.0% | -| 15M FineWeb + golden passages | 31.2% | -| 15M FineWeb only | 0.71% | - -
- -Without golden passages (documents known to contain evidence for the question), accuracy drops to nearly zero. The model can't learn research strategies from trajectories where every search is a dead end. - -The original OpenResearcher corpus uses 15M documents from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) as distractors alongside 10K golden passages. For this demonstration, we use a lighter-weight approach: we construct the corpus from multi-hop QA datasets: [HotpotQA](https://arxiv.org/abs/1809.09600) (2-hop questions requiring two pieces of linked evidence) and [MuSiQue](https://arxiv.org/abs/2108.00573) (2-4 hop questions composed from single-hop sub-questions). Each question comes with annotated supporting passages, the specific paragraphs that contain the evidence needed to answer it. Golden passages go into the corpus alongside non-supporting passages from the same datasets as distractors, at roughly a 1:9 ratio. The model has to search through noise to find the signal, which is exactly the skill we want the training data to teach. - -The key constraint is that golden passages must be *findable but not obvious*. If the corpus is too small or the golden passages are too easy to identify, the trajectories won't transfer to real-world research where evidence is sparse. The distractor ratio controls this difficulty, and the paper's ablations give us a good starting point for tuning it. - ---- - -## **Step 3: The Data Designer Pipeline** - -With the retriever server and corpus ready, the Data Designer pipeline ties everything together. We configure a teacher model, point it at the MCP retriever, and let it research each question from scratch. For this demo we hosted our own inference server, but anyone can try this pipeline using [Nemotron Nano 3 on build.nvidia.com](https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b) with a free API key using the model configuration shown below. 
- -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -# Search rollout model for trajectory generation -config = dd.DataDesignerConfigBuilder() -config.add_model_config( - dd.ModelConfig( - alias="search_rollout_model", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=1.0, - top_p=0.95, - max_tokens=16384, - ), - ) -) -``` - -The temperature and top\_p settings matter here. We want diverse research strategies across seeds (different query formulations, different document exploration orders) so that rejection sampling has a rich pool to select from. Setting temperature to 1.0 with top\_p at 0.95 gives enough variation that the same question can produce meaningfully different trajectories across seeds. - -The MCP tool configuration tells Data Designer which server to use and how many tool-call turns to allow: - -```python -# MCP retriever tool configuration -tool_config = dd.ToolConfig( - tool_alias="knowledge-base", - providers=["corpus-retriever"], - max_tool_call_turns=150, -) -config.add_tool_config(tool_config) -``` - -We set `max_tool_call_turns` high (150) because deep research trajectories can be long. Our longest observed trajectory used 25 tool calls across 53 messages. Capping too low would truncate the most interesting research chains. - -The seed dataset contains the research questions alongside reference answers (which we'll use for rejection sampling in Step 4): - -```python -config.with_seed_dataset( - dd.LocalFileSeedSource(path="questions.jsonl"), -) - -config.add_column( - dd.ExpressionColumnConfig( - name="research_question", - expr="{{ question }}", - ) -) -``` - -The core of the pipeline is the research column, where the teacher model receives a question and a system prompt instructing it to use the retriever tools: - -```python -SYSTEM_PROMPT = """You are a thorough research assistant. 
You have access to three tools \ -for navigating a knowledge base: -- search(query, top_k): Find candidate documents relevant to your query -- open(doc_id): Open a document to read its full content in numbered chunks -- find(doc_id, query): Locate specific passages within a document by keyword - -Your task is to research the given question by searching for relevant documents, \ -reading their content, and synthesizing an answer from the evidence you find. \ -Be systematic: formulate search queries, explore promising results, and gather \ -evidence before answering. Cite specific passages when possible.""" - -config.add_column( - dd.LLMTextColumnConfig( - name="research_answer", - prompt="Research and answer thoroughly:\n\n{{ research_question }}", - model_alias="search_rollout_model", - system_prompt=SYSTEM_PROMPT, - tool_alias="knowledge-base", - with_trace=dd.TraceType.ALL_MESSAGES, - extract_reasoning_content=True, - ) -) -``` - -Two settings are doing the important work here. `with_trace=dd.TraceType.ALL_MESSAGES` captures the *entire* interaction (every tool call, every tool response, every intermediate reasoning step) into a separate trace column in ChatML format. This is the training data: the full trajectory of how the model moved through the information space. `extract_reasoning_content=True` pulls out the model's internal chain-of-thought separately, so you can include or exclude it depending on your training setup. - ---- - -## **Step 4: Rejection Sampling with an LLM Judge** - -Not every trajectory leads to a correct answer. OpenResearcher's approach is straightforward. Generate multiple trajectories per question, score them for correctness, and keep only the ones that got the right answer. 
We implement this with Data Designer's `LLMJudgeColumnConfig`, using a separate (smaller) model as the judge: - -```python -# Judge model for rejection sampling -config.add_model_config( - dd.ModelConfig( - alias="judge", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - ) -) - -config.add_column( - dd.LLMJudgeColumnConfig( - name="correctness", - model_alias="judge", - prompt=( - "Question: {{ research_question }}\n" - "Reference answer: {{ answer }}\n" - "Generated answer: {{ research_answer }}\n" - "Does the generated answer correctly address the question?" - ), - scores=[ - dd.Score( - name="correct", - description="Is the answer factually correct?", - options={ - 1: "Correct", - 0: "Incorrect", - }, - ), - ], - ) -) -``` - -The judge compares the generated answer against the reference answer from the seed dataset. Using a smaller model as judge is deliberate. We don't need the judge to *reason* about the question, just to compare two answers for factual agreement. This keeps costs down when scoring thousands of trajectories. - -In practice, you'd generate multiple trajectories per question (varying the random seed) and filter to `correctness.correct == 1`. The incorrect trajectories aren't wasted; they can serve as negative examples for preference-based training methods like DPO. - ---- - -## **Multi-Turn Tool Calling: Rough Edges in the Open Model Ecosystem** - -The pipeline described above is straightforward in principle. In practice, getting multi-turn tool calling to work reliably with open-weight models served through vLLM turned out to be the hardest part of this project. - -We tested two open-weight models on a self-hosted [vLLM (v0.15.1)](https://github.com/vllm-project/vllm/releases/tag/v0.15.1) instance: [GPT-OSS-120B](https://huggingface.co/openai/gpt-oss-120b) and [Kimi K2.5](https://huggingface.co/moonshotai/Kimi-K2.5). Both failed to produce usable research trajectories, for related but distinct reasons. 
- -**GPT-OSS-120B** uses a ["Harmony" output format](https://cookbook.openai.com/articles/openai-harmony) that routes text through named channels (reasoning, final answer, tool calls). When tools are involved, vLLM's parser consistently routes the model's output to the wrong channel: everything lands in `reasoning_content` while the `content` field stays empty. This happens at all `reasoning_effort` levels. The model does the research (calls tools, reads documents, formulates queries) but the final synthesized answer never appears where the serving layer expects it. This is a known issue in vLLM's Harmony format handling. Here's the final message from a typical trajectory. The model has been researching for 5 tool calls but produces no answer: - -```json -{ - "role": "assistant", - "content": [{"type": "text", "text": ""}], - "reasoning_content": "It seems that the knowledge base may have a page about - Colin Bateman that includes his biography. Possibly the 'md_001100' entry is - about a footballer, not the author. The author Colin Bateman likely ...", - "tool_calls": null -} -``` - -The model's reasoning shows it *has* the answer (it identified Colin Bateman as the author), but the content field is empty and no tool call is emitted. The trajectory ends here with nothing to show for it. - -**Kimi K2.5** exhibits a different failure mode. With its thinking mode enabled, it has the same channel-routing problem as GPT-OSS. With thinking mode disabled, the model produces content text, but after the first tool result, it *narrates* what it plans to do next rather than emitting another tool call. The serving layer sees text content without tool calls and treats it as the final answer, terminating the research loop after a single search: - -```json -{ - "role": "assistant", - "content": "I found that 'Cycle of Violence' was written by Colin Bateman, - described as a 'Northern Irish author'. 
Now let me search for more details - about his birthplace to confirm his birth country.", - "reasoning_content": "The search results clearly show that 'Cycle of Violence' - was written by Colin Bateman, a Northern Irish author...", - "tool_calls": null -} -``` - -The model intends to keep researching ("let me search for more details") but describes the action instead of calling the tool. The framework sees content, no tool calls, and stops. We tried multiple tokenizer modes, prompt variations, and vLLM configurations; open issues on the model's HuggingFace page confirm this is a broader compatibility gap. - -The original OpenResearcher codebase handles this by bypassing vLLM's tool call parser entirely. They hit the raw `/completions` endpoint ([`openai_generator.py`](https://github.com/TIGER-AI-Lab/OpenResearcher/blob/main/utils/openai_generator.py#L153-L177)), parse the tool-call XML tags from the output with regex, and continue looping until the model emits an explicit answer marker, either a dedicated tag or `final answer:` ([`deploy_agent.py`](https://github.com/TIGER-AI-Lab/OpenResearcher/blob/main/deploy_agent.py#L313-L407)). - -The open-source tool-calling stack is growing and maturing quickly, but multi-turn tool use with reasoning models is still a rough edge. For now, the practical path is to use models with battle-tested tool-calling support through their native APIs, which is what we do in the results below. - ---- - -## **Results** - -We ran 64 questions uniformly sampled across 2, 3, and 4-hop difficulty levels from MuSiQue, with 50K FineWeb web documents as distractors (a 1:100 golden-to-distractor ratio). We tested two models, Claude Opus 4.5 (via API) and Nemotron Nano 3 (30B total / 3B active params, self-hosted via vLLM with reasoning disabled). - -
- -| | Claude Opus 4.5 | Nemotron Nano 3 | -| :---- | :----: | :----: | -| **Samples** | 64 (55 completed) | 64 (61 completed) | -| **Overall accuracy** | 41/55 (75%) | 32/61 (52%) | -| **2-hop accuracy** | 18/23 (78%) | 13/23 (57%) | -| **3-hop accuracy** | 15/18 (83%) | 11/22 (50%) | -| **4-hop accuracy** | 8/14 (57%) | 8/16 (50%) | -| **Avg tool calls** | 16.8 | 11.8 | -| **Max tool calls** | 57 | 63 | -| **Avg messages per trajectory** | 40.4 | 26.5 | -| **Max messages per trajectory** | 117 | 129 | - -
- -Opus is 22 points more accurate, but Nano runs roughly 5x faster on self-hosted hardware. Both models show tool usage scaling with hop count. Nano uses fewer tools but achieves lower accuracy, with the largest gap on 3-hop questions (83% vs 50%). Splitting by correctness reveals the same pattern in both models: incorrect trajectories are longer. - -**Claude Opus 4.5:** - -
- -| Outcome | Hops | Count | Avg Tool Calls | Avg Messages | Avg Answer Length | -| :---- | :----: | :----: | :----: | :----: | :----: | -| **Correct** | 2 | 18 | 7.3 | 18.9 | 1,072 chars | -| | 3 | 15 | 14.9 | 35.7 | 1,372 chars | -| | 4 | 8 | 21.0 | 50.6 | 1,705 chars | -| | **All** | **41** | **12.8** | **31.2** | **1,305 chars** | -| **Incorrect** | 2 | 5 | 21.0 | 48.6 | 1,534 chars | -| | 3 | 3 | 25.7 | 63.0 | 1,795 chars | -| | 4 | 6 | 36.0 | 85.2 | 1,903 chars | -| | **All** | **14** | **28.4** | **67.4** | **1,748 chars** | - -
- -**Nemotron Nano 3:** - -
- -| Outcome | Hops | Count | Avg Tool Calls | Avg Messages | Avg Answer Length | -| :---- | :----: | :----: | :----: | :----: | :----: | -| **Correct** | 2 | 13 | 6.5 | 16.1 | 773 chars | -| | 3 | 11 | 12.7 | 28.5 | 708 chars | -| | 4 | 8 | 8.0 | 19.0 | 1,600 chars | -| | **All** | **32** | **9.0** | **21.1** | **957 chars** | -| **Incorrect** | 2 | 10 | 10.1 | 23.2 | 799 chars | -| | 3 | 11 | 18.0 | 39.0 | 1,163 chars | -| | 4 | 8 | 16.2 | 35.5 | 848 chars | -| | **All** | **29** | **14.8** | **32.6** | **951 chars** | - -
- -Correct trajectories are shorter at every hop level for both models. Incorrect trajectories use roughly 1.6x (Nano) to 2.2x (Opus) as many tool calls because the model keeps searching when it can't find evidence. Opus also writes longer answers when it is wrong (1,748 vs 1,305 chars on average), while Nano's answer length is about the same either way (951 vs 957 chars). This anti-correlation between trajectory length and correctness holds for both models, which suggests trajectory length alone could serve as a lightweight filter during rejection sampling. - ---- - -## **Closing Remarks** - -Thanks to the [OpenResearcher](https://github.com/TIGER-AI-Lab/OpenResearcher) team for their work showing that synthetic research trajectories over local retrieval can train small models to compete with much larger ones. Their results suggest we're only beginning to understand how LLMs interact with search tools and how the structure of those interactions shapes what models learn. We're excited to see where the community takes synthetic data research using [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) as both the models and the tooling continue to improve. - ---- - -## **Try For Yourself** - -
-Full source: openresearcher_demo.py - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -# Models -config = dd.DataDesignerConfigBuilder() -config.add_model_config( - dd.ModelConfig( - alias="search_rollout_model", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - inference_parameters=dd.ChatCompletionInferenceParams( - temperature=1.0, - top_p=0.95, - max_tokens=16384, - ), - ) -) -config.add_model_config( - dd.ModelConfig( - alias="judge", - model="nvidia/nemotron-3-nano-30b-a3b", - provider="nvidia", - ) -) - -# MCP retriever -tool_config = dd.ToolConfig( - tool_alias="knowledge-base", - providers=["corpus-retriever"], - max_tool_call_turns=150, -) -config.add_tool_config(tool_config) - -# Seed questions with reference answers -config.with_seed_dataset( - dd.LocalFileSeedSource(path="questions.jsonl"), -) - -config.add_column( - dd.ExpressionColumnConfig( - name="research_question", - expr="{{ question }}", - ) -) - -# Research trajectory generation -config.add_column( - dd.LLMTextColumnConfig( - name="research_answer", - prompt="Research and answer thoroughly:\n\n{{ research_question }}", - model_alias="search_rollout_model", - system_prompt=SYSTEM_PROMPT, - tool_alias="knowledge-base", - with_trace=dd.TraceType.ALL_MESSAGES, - extract_reasoning_content=True, - ) -) - -# Rejection sampling judge -config.add_column( - dd.LLMJudgeColumnConfig( - name="correctness", - model_alias="judge", - prompt=( - "Question: {{ research_question }}\n" - "Reference answer: {{ answer }}\n" - "Generated answer: {{ research_answer }}\n" - "Does the generated answer correctly address the question?" 
- ), - scores=[ - dd.Score( - name="correct", - description="Is the answer factually correct?", - options={ - 1: "Correct", - 0: "Incorrect", - }, - ), - ], - ) -) - -# Run -mcp_provider = dd.LocalStdioMCPProvider( - name="corpus-retriever", - command="uv", - args=["run", "retriever_mcp.py", "serve"], - env={"CORPUS_PATH": "corpus.jsonl"}, -) -data_designer = DataDesigner(mcp_providers=[mcp_provider]) -results = data_designer.create( - config_builder=config, - num_records=1000, - dataset_name="research-trajectories", -) -``` - -
- -
-Full source: prepare_corpus.py - -```python -# /// script -# requires-python = ">=3.10" -# dependencies = ["datasets", "huggingface_hub", "pyarrow"] -# /// - -"""Prepare a retrieval corpus and question set for the OpenResearcher demo. - -Builds corpus.jsonl and questions.jsonl from two sources: - - 1. MuSiQue — multi-hop QA dataset (2/3/4-hop) with golden passages - 2. FineWeb — web documents as distractors (matches the OpenResearcher paper) - -Golden passages (documents containing evidence for the answer) are mixed with -FineWeb distractors at roughly 1:100 ratio, so the model must search through -noise to find the signal. - -Usage: - uv run prepare_corpus.py -""" - -from __future__ import annotations - -import json -import random -from pathlib import Path -from urllib.parse import urlparse - -# --------------------------------------------------------------------------- -# Configuration -# --------------------------------------------------------------------------- - -NUM_QUESTIONS = 192 # 64 per hop level (2, 3, 4) -NUM_FINEWEB_DISTRACTORS = 50_000 -FINEWEB_SHARD = 0 -OUTPUT_DIR = "data" - -# --------------------------------------------------------------------------- -# MuSiQue extraction -# --------------------------------------------------------------------------- - -def prepare_musique(num_questions: int) -> tuple[list[dict], list[dict]]: - """Load MuSiQue and extract multi-hop questions with golden passages. - - Samples uniformly across hop counts (2, 3, 4) so the dataset has balanced - difficulty. Golden passages (is_supporting=True) go into the corpus; - non-golden passages from the same examples serve as additional distractors. - - Returns: - (questions, corpus_docs) where corpus_docs have is_golden=True/False. 
- """ - from datasets import load_dataset - - print("Loading MuSiQue (train split)...") - dataset = load_dataset("bdsaglam/musique", split="train") - - # Bucket answerable examples by hop count - hop_buckets: dict[int, list[dict]] = {} - for example in dataset: - if not example.get("answerable", False): - continue - num_hops = len(example.get("question_decomposition", [])) - if num_hops < 2: - continue - hop_buckets.setdefault(num_hops, []).append(example) - - # Sample uniformly: equal questions per hop level - available_hops = sorted(hop_buckets.keys()) - per_hop = num_questions // len(available_hops) - selected_examples = [] - for h in available_hops: - bucket = hop_buckets[h] - n = min(per_hop, len(bucket)) - selected_examples.extend(random.sample(bucket, n)) - - print(f" Selected {len(selected_examples)} questions across hops {available_hops}") - - # Build questions and corpus docs - questions: list[dict] = [] - golden_titles: dict[str, str] = {} - nongolden_titles: dict[str, str] = {} - - for example in selected_examples: - num_hops = len(example["question_decomposition"]) - questions.append({ - "id": f"mq_{len(questions):06d}", - "question": example["question"], - "answer": example["answer"], - "source": "musique", - "num_hops": num_hops, - "seed_id": 0, - }) - - for para in example.get("paragraphs", []): - title = para.get("title", "").strip() - content = para.get("paragraph_text", "").strip() - if not title or not content: - continue - if para.get("is_supporting", False): - if len(content) > len(golden_titles.get(title, "")): - golden_titles[title] = content - else: - if len(content) > len(nongolden_titles.get(title, "")): - nongolden_titles[title] = content - - # Golden passages - corpus_docs = [ - {"title": t, "content": c, "source": "musique", "is_golden": True} - for t, c in sorted(golden_titles.items()) - ] - # Non-golden passages (skip titles already in golden set) - corpus_docs.extend( - {"title": t, "content": c, "source": "musique", "is_golden": 
False} - for t, c in sorted(nongolden_titles.items()) - if t not in golden_titles - ) - - print(f" Golden passages: {len(golden_titles)}") - print(f" Non-golden passages: {len(corpus_docs) - len(golden_titles)}") - return questions, corpus_docs - - -# --------------------------------------------------------------------------- -# FineWeb distractor caching -# --------------------------------------------------------------------------- - -def cache_fineweb(shard_index: int, max_docs: int) -> list[dict]: - """Download a FineWeb parquet shard and extract English documents. - - Uses huggingface_hub for direct shard download (faster than load_dataset) - and pyarrow for memory-efficient row-group-at-a-time reading. - - Returns: - List of distractor documents with title (domain) and content (text). - """ - from huggingface_hub import hf_hub_download - import pyarrow.parquet as pq - - filename = f"sample/10BT/{shard_index:03d}_00000.parquet" - print(f"Downloading FineWeb shard: {filename}") - parquet_path = hf_hub_download( - repo_id="HuggingFaceFW/fineweb", - repo_type="dataset", - filename=filename, - ) - - pf = pq.ParquetFile(parquet_path) - print(f" {pf.metadata.num_rows:,} rows in shard") - - docs: list[dict] = [] - for rg_idx in range(pf.metadata.num_row_groups): - table = pf.read_row_group(rg_idx, columns=["text", "url", "language", "token_count"]) - batch = table.to_pydict() - - for text, url, lang, tok_count in zip( - batch["text"], batch["url"], batch["language"], batch["token_count"] - ): - if lang != "en" or tok_count < 50: - continue - text = text.strip() - if not text: - continue - - # Use domain as title - try: - domain = urlparse(url).netloc.removeprefix("www.") - except Exception: - domain = "unknown" - - docs.append({ - "title": domain, - "content": text, - "source": "fineweb", - "is_golden": False, - }) - if len(docs) >= max_docs: - break - - if len(docs) >= max_docs: - break - - print(f" Extracted {len(docs):,} English documents (min 50 tokens)") - return 
docs - - -# --------------------------------------------------------------------------- -# Main -# --------------------------------------------------------------------------- - -def main() -> None: - output_dir = Path(OUTPUT_DIR) - output_dir.mkdir(parents=True, exist_ok=True) - - # Extract MuSiQue questions and golden passages - questions, corpus_docs = prepare_musique(NUM_QUESTIONS) - - # Download FineWeb distractors - fineweb_docs = cache_fineweb(FINEWEB_SHARD, NUM_FINEWEB_DISTRACTORS) - corpus_docs.extend(fineweb_docs) - - # Deduplicate by title (keep longest content) - title_to_best: dict[str, dict] = {} - for doc in corpus_docs: - title = doc["title"] - if title not in title_to_best or len(doc["content"]) > len(title_to_best[title]["content"]): - title_to_best[title] = doc - - corpus = list(title_to_best.values()) - random.shuffle(corpus) - - # Assign stable IDs - prefix_map = {"musique": "md", "fineweb": "fw"} - source_counters: dict[str, int] = {} - for doc in corpus: - prefix = prefix_map.get(doc["source"], "xx") - idx = source_counters.get(doc["source"], 0) - doc["id"] = f"{prefix}_{idx:06d}" - source_counters[doc["source"]] = idx + 1 - - # Write corpus.jsonl - corpus_path = output_dir / "corpus.jsonl" - with open(corpus_path, "w") as f: - for doc in corpus: - f.write(json.dumps(doc, ensure_ascii=False) + "\n") - - # Write questions.jsonl - random.shuffle(questions) - questions_path = output_dir / "questions.jsonl" - with open(questions_path, "w") as f: - for q in questions: - f.write(json.dumps(q, ensure_ascii=False) + "\n") - - # Summary - golden = sum(1 for d in corpus if d["is_golden"]) - nongolden = len(corpus) - golden - print(f"\nCorpus: {len(corpus):,} docs ({golden} golden, {nongolden} distractors)") - print(f"Questions: {len(questions)}") - print(f"Output: {corpus_path.resolve()}") - print(f" {questions_path.resolve()}") - - -if __name__ == "__main__": - main() -``` - -
- -
-Full source: retriever_mcp.py - -```python -# /// script -# requires-python = ">=3.10" -# dependencies = ["mcp", "bm25s", "PyStemmer"] -# /// - -"""MCP Server: BM25S Corpus Retriever for OpenResearcher-style Deep Research - -A single-file MCP server that indexes a JSONL corpus and exposes BM25S -lexical search via three browser tools: - - - search(query, top_k): ranked document discovery - - open(doc_id): full document inspection with cursor-numbered chunks - - find(doc_id, query): in-document evidence lookup - -Corpus format (JSONL, one document per line): - {"id": "wiki_123", "title": "Christopher Nolan", "content": "Christopher Edward Nolan is a..."} - -Server mode (used by Data Designer): - CORPUS_PATH=corpus.jsonl uv run retriever_mcp.py serve -""" - -from __future__ import annotations - -import argparse -import json -import os -import re -import sys - -import bm25s -from mcp.server.fastmcp import FastMCP - -MCP_SERVER_NAME = "corpus-retriever" - -# Global state — populated at server startup -_bm25_retriever: bm25s.BM25 | None = None -_corpus: list[dict[str, str]] = [] -_id_to_index: dict[str, int] = {} - -mcp_server = FastMCP(MCP_SERVER_NAME) - - -def load_corpus(corpus_path: str) -> list[dict[str, str]]: - """Load a JSONL corpus file into a list of document dicts.""" - docs: list[dict[str, str]] = [] - with open(corpus_path, "r", encoding="utf-8") as f: - for line_num, line in enumerate(f, 1): - line = line.strip() - if not line: - continue - try: - doc = json.loads(line) - except json.JSONDecodeError as e: - print(f"Warning: skipping malformed JSON at line {line_num}: {e}", file=sys.stderr) - continue - if "id" not in doc or "content" not in doc: - print(f"Warning: skipping line {line_num}, missing 'id' or 'content'", file=sys.stderr) - continue - docs.append({ - "id": str(doc["id"]), - "title": str(doc.get("title", "")), - "content": str(doc["content"]), - }) - return docs - - -def build_index(docs: list[dict[str, str]]) -> bm25s.BM25: - """Build a BM25S 
index over title + content for each document.""" - corpus_texts = [f"{d['title']} {d['content']}" for d in docs] - corpus_tokens = bm25s.tokenize(corpus_texts, stopwords="en") - retriever = bm25s.BM25() - retriever.index(corpus_tokens) - return retriever - - -def initialize(corpus_path: str) -> None: - """Load corpus and build index into global state.""" - global _bm25_retriever, _corpus, _id_to_index - print(f"Loading corpus from {corpus_path}...", file=sys.stderr) - _corpus = load_corpus(corpus_path) - if not _corpus: - print("Warning: corpus is empty", file=sys.stderr) - return - _id_to_index = {doc["id"]: idx for idx, doc in enumerate(_corpus)} - print(f"Building BM25S index over {len(_corpus)} documents...", file=sys.stderr) - _bm25_retriever = build_index(_corpus) - print(f"Index ready. {len(_corpus)} documents indexed.", file=sys.stderr) - - -def _chunk_content(content: str) -> list[str]: - """Split document content into cursor-addressable chunks.""" - paragraph_chunks = [c.strip() for c in re.split(r"\n\s*\n+", content) if c.strip()] - if len(paragraph_chunks) > 1: - return paragraph_chunks - line_chunks = [line.strip() for line in content.splitlines() if line.strip()] - if line_chunks: - return line_chunks - stripped = content.strip() - return [stripped] if stripped else [] - - -@mcp_server.tool() -def search(query: str, top_k: int = 10) -> dict: - """Search for candidate documents to explore. - - Args: - query: Search query string. - top_k: Maximum number of ranked results (default: 10). 
- """ - global _bm25_retriever, _corpus - if _bm25_retriever is None or not _corpus: - return {"error": "Search index not initialized", "results": []} - query_tokens = bm25s.tokenize([query], stopwords="en") - k = max(1, min(top_k, len(_corpus))) - results, scores = _bm25_retriever.retrieve(query_tokens, k=k) - search_results: list[dict] = [] - for i in range(results.shape[1]): - doc_idx = results[0, i] - score = float(scores[0, i]) - if score <= 0: - continue - doc = _corpus[doc_idx] - snippet = doc["content"][:500] - if len(doc["content"]) > 500: - snippet += "..." - search_results.append({ - "id": doc["id"], - "title": doc["title"], - "snippet": snippet, - "score": round(score, 4), - }) - return {"results": search_results, "query": query, "total": len(search_results)} - - -@mcp_server.tool(name="open") -def open_document(doc_id: str) -> dict: - """Open a document for detailed inspection with cursor-numbered chunks. - - Args: - doc_id: The document ID (from search results). - """ - global _corpus, _id_to_index - if not _corpus: - return {"error": "Corpus not loaded"} - idx = _id_to_index.get(doc_id) - if idx is None: - return {"error": f"Document not found: {doc_id}"} - doc = _corpus[idx] - chunks = _chunk_content(doc["content"]) - numbered_chunks = [{"cursor": i + 1, "text": chunk} for i, chunk in enumerate(chunks)] - formatted = "\n".join(f"[{e['cursor']}] {e['text']}" for e in numbered_chunks) - return { - "id": doc["id"], - "title": doc["title"], - "content": formatted, - "chunks": numbered_chunks, - "total_chunks": len(numbered_chunks), - } - - -@mcp_server.tool() -def find(doc_id: str, query: str) -> dict: - """Find matching passages inside a document by keyword. - - Args: - doc_id: Document ID to search within. - query: Text to find (case-insensitive substring and keyword matching). 
- """ - global _corpus, _id_to_index - if not _corpus: - return {"error": "Corpus not loaded", "matches": []} - idx = _id_to_index.get(doc_id) - if idx is None: - return {"error": f"Document not found: {doc_id}", "matches": []} - query_text = query.strip().lower() - if not query_text: - return {"error": "Query must be non-empty", "matches": []} - doc = _corpus[idx] - chunks = _chunk_content(doc["content"]) - query_terms = [term for term in re.findall(r"\w+", query_text) if term] - matches: list[dict] = [] - for i, chunk in enumerate(chunks, start=1): - haystack = chunk.lower() - if query_text in haystack or (query_terms and all(t in haystack for t in query_terms)): - matches.append({"cursor": i, "text": chunk}) - return { - "doc_id": doc["id"], - "title": doc["title"], - "query": query, - "matches": matches, - "total_matches": len(matches), - } - - -def serve() -> None: - """Run as MCP server subprocess (called by Data Designer).""" - corpus_path = os.environ.get("CORPUS_PATH", "corpus.jsonl") - initialize(corpus_path) - mcp_server.run() - - -if __name__ == "__main__": - parser = argparse.ArgumentParser(description="BM25S corpus retriever MCP server") - subparsers = parser.add_subparsers(dest="command") - subparsers.add_parser("serve", help="Run the MCP server (reads CORPUS_PATH from env)") - stats_parser = subparsers.add_parser("stats", help="Print corpus statistics") - stats_parser.add_argument("--corpus-path", default="corpus.jsonl") - args = parser.parse_args() - if args.command == "serve": - serve() - elif args.command == "stats": - docs = load_corpus(args.corpus_path) - total_chars = sum(len(d["content"]) for d in docs) - print(f"Corpus: {args.corpus_path}") - print(f"Documents: {len(docs)}") - print(f"Total content: {total_chars:,} chars (~{total_chars // 4:,} tokens)") - else: - parser.print_help() -``` - -
- -Key Resources: - -1. [NeMo Data Designer on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner) -2. [OpenResearcher on GitHub](https://github.com/TIGER-AI-Lab/OpenResearcher) -3. [OpenResearcher blog post](https://boiled-honeycup-4c7.notion.site/OpenResearcher-A-Fully-Open-Pipeline-for-Long-Horizon-Deep-Research-Trajectory-Synthesis-2f7e290627b5800cb3a0cd7e8d6ec0ea) -4. [HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering](https://arxiv.org/abs/1809.09600) -5. [MuSiQue: Multi-hop Questions via Single-hop Question Composition](https://arxiv.org/abs/2108.00573) -6. [BM25S: Fast lexical search in Python](https://github.com/xhluca/bm25s) diff --git a/docs/devnotes/posts/design-principles.md b/docs/devnotes/posts/design-principles.md deleted file mode 100644 index 9ff8bd7d8..000000000 --- a/docs/devnotes/posts/design-principles.md +++ /dev/null @@ -1,158 +0,0 @@ ---- -date: 2026-02-10 -authors: - - kthadaka ---- - -# **Designing Data Designer: Why SDG Is a Systems Problem** - -Synthetic data generation is more than a single prompt to a large language model. In this post, we walk through the design principles behind NeMo Data Designer and explain why we built it as a composable orchestration framework \- treating SDG as a system of specialized stages rather than a monolithic generation task. - - - -![Designing Data Designer: from monolithic prompts to composable pipelines](assets/design-principles/design-principles-hero.png) - ---- - -When people first encounter synthetic data generation, the instinct is natural: write a detailed prompt, call a powerful LLM, and collect the output. For quick experiments and small datasets, this works fine. But as you push toward production-quality data at scale \- millions of records, multiple interrelated fields, strict quality requirements \- the cracks start to show. 
We built Data Designer around the conviction that SDG is fundamentally a *systems* problem, and the framework's architecture reflects that belief at every level. - -This post isn't about a specific dataset or benchmark result. Instead, we want to pull back the curtain on the design principles that shaped Data Designer itself, and share the reasoning behind the decisions we made. - ---- - -## **A Prompt Is Not All You Need** - -The most common approach to synthetic data generation looks something like this: pack everything you need into one prompt \- the schema, the constraints, the style guidelines, the quality criteria \- and send it to the biggest model you have access to. Crank up `max_tokens`, parse the output, and repeat. - -This approach has a ceiling, and you hit it faster than you'd expect. - -**Quality degrades when you overload a single call.** Asking one model to simultaneously generate content, maintain diversity across a dataset, follow a complex schema, and self-assess quality is asking it to solve several distinct problems at once. The model has to allocate its "attention budget" across all of these competing objectives, and something always gives. Usually it's the subtler requirements \- diversity narrows, edge cases get dropped, and the data starts looking suspiciously uniform. - -**Prompts become unmaintainable.** As requirements grow, the prompt balloons. What started as a clean paragraph becomes a multi-page document with nested instructions, conditional logic described in natural language, and examples that compete for context window space. At this point, iterating on one aspect of generation (say, adjusting the complexity distribution or tweaking the output schema) means editing a fragile mega-prompt and hoping nothing else breaks. - -**There are no quality gates.** In a single-call setup, validation happens *after* you've already spent the compute. 
If 30% of your records are malformed or low-quality, you find out at the end and either filter them out (wasting the tokens) or re-generate (wasting even more). There's no mechanism to catch problems between stages, because there *are* no stages. - -**Scaling is limited.** A single model call is a single point of failure. You can parallelize across records, but you can't parallelize across *stages* of generation, and you can't route different parts of the task to models that are better suited for them. - -None of these are problems with LLMs themselves \- they're problems with treating SDG as a single-step task. The fix isn't a better model. It's a better architecture. - ---- - -## **SDG as a System of Specialized Stages** - -The shift in thinking is straightforward: instead of asking one model to do everything, decompose the generation task into a pipeline of focused stages, each responsible for one well-defined job. - -Regardless of what you're generating \- QA pairs for retrieval training, reasoning traces for pretraining, multi-turn conversations for alignment, product reviews for testing, or labeled examples for classification \- a well-decomposed SDG pipeline typically has four kinds of stages: - -1. **Seed curation.** Control what goes in. Whether you're sampling from an existing corpus, selecting subsets of your data, or generating realistic persona profiles with demographic and personality attributes, the seed data defines the distribution your synthetic data will cover. This is where you control diversity and domain coverage \- before any LLM is involved \- so that downstream generation stages inherit that diversity naturally through their prompts. - -2. **Staged generation.** Each generation step has a focused job. One stage might extract structured metadata from a document. Another might generate content grounded in that metadata. A third might transform or enrich that content further. 
Because each stage has a narrow scope, its prompt is simple, its output is predictable, and it's easy to iterate on independently. - -3. **Dependency management.** Later stages build on earlier outputs. A content generation stage needs access to extracted metadata. A formatting stage needs the generated content. These dependencies form a directed acyclic graph (DAG), and the system needs to resolve that graph automatically \- so you can focus on defining the stages, not orchestrating them. - -4. **Quality control.** Validation and scoring aren't afterthoughts \- they're explicit stages in the pipeline. An LLM judge can evaluate the output of a generation stage and a validator can check structural constraints. Because these run as part of the generation pipeline, you can identify quality issues early and make informed decisions about which records to keep before investing in further downstream processing. - -This decomposition buys you something that a single prompt never can: the ability to reason about, test, and improve each stage independently. - ---- - -## **Design Principles Behind Data Designer** - -With that framing in mind, here are the principles that guided Data Designer's architecture. - -### **Declarative over imperative** - -When you define a Data Designer workflow, you describe the structure of the dataset you want \- not the execution plan for how to generate it. You declare columns, their types, their prompts or schemas, and the models they should use. The framework handles the rest: resolving dependencies, scheduling execution, managing parallelism, batching requests, and retrying failures. - -This is a deliberate choice. We wanted the configuration to read like a description of the desired *output*, not a script full of API calls and error handling. It makes workflows easier to read, easier to share, and easier to modify \- you can swap a model, adjust a prompt, or add a validation stage without rewriting control flow. 
- -### **Columns as composable units** - -The core abstraction in Data Designer is the *column*. Each column represents a single field in your dataset, and each column has a well-defined generation strategy: it might be an LLM text generation call, a structured output with a Pydantic schema, an embedding computation, a sampler, a Jinja2 expression that combines other columns, or a quality evaluation from an LLM judge. - -Columns reference each other through Jinja2 templates. When one column's prompt includes `{{ document_artifacts }}`, the framework knows that column depends on the `document_artifacts` column and must run after it. These references are automatically extracted to build a dependency graph, and the framework topologically sorts the graph to determine execution order. You don't write orchestration code \- you just write columns, and the DAG emerges from the references between them. - -This composability is what makes it possible to go from a simple two-column workflow to a complex multi-stage pipeline without changing the underlying execution model. - -### **Multi-model by design** - -Not every stage in a pipeline needs the same model. Extracting structured metadata from a document is a different task than generating creative long-form content, which is a different task than scoring quality, which is a different task than computing embeddings. - -Data Designer treats multi-model orchestration as a first-class concern. Each column can specify its own model alias, and the framework manages model routing, per-model parallelism limits, and usage tracking independently. In practice, this means you can use a large reasoning model for your hardest generation stage, a smaller and faster model for evaluation and scoring, and a dedicated embedding model for semantic representations \- all within the same workflow, without writing any routing logic yourself. 
- -### **Quality as a first-class stage** - -In Data Designer, quality control isn't a post-processing step you bolt on after generation. Validators and LLM-as-judge evaluations are column types, just like generation columns. They participate in the same dependency graph, run in the same execution engine, and their outputs are available to downstream stages. - -This means you can define a pipeline where a judge evaluates generated records immediately after they're created, and a downstream expression column flags records below a quality threshold \- all within a single workflow definition. Quality scores are part of the pipeline, not something you remember to compute afterwards. - -### **Extensibility via plugins** - -No framework can anticipate every use case. Data Designer's plugin system lets you define custom column generators that work alongside the built-in types. A plugin is a Python class that inherits from the base column generator, packages with a configuration schema, and registers itself through a standard entry point. Once installed, it's indistinguishable from a built-in column type \- it participates in dependency resolution, batching, and parallel execution like everything else. - -This is how domain-specific functionality gets added without forking the framework. If your use case requires embedding-based deduplication with FAISS indices and cosine similarity thresholds, for instance, you can build it as a plugin and drop it into any pipeline that needs it. - ---- - -## **What This Looks Like in Practice** - -These principles apply to any SDG use case. Whether you're generating reasoning traces for pretraining (as in our [RQA dev note](rqa.md)), multi-turn conversations for alignment tuning, labeled examples for text classification, product reviews for testing a recommendation system, or code-repair pairs for training a coding assistant \- the same decomposition applies. 
You identify the stages, define the columns, declare the dependencies, and let the framework handle execution. - -To make one example concrete, consider a pipeline for generating training data for a retrieval model. The goal is to produce high-quality question-answer pairs grounded in a corpus of documents, with quality scoring. We choose this example because it exercises several stages and model types in a single workflow, but the pattern generalizes to any domain. - -In a single-prompt approach, you'd try to pack all of this into one call: "Given this document, generate diverse QA pairs of varying complexity and only include high-quality ones." The model would do its best, but you'd have limited control over any individual aspect. - -With Data Designer, the same task decomposes into a pipeline of focused stages: - -``` - Seed Documents Seed dataset column ingests documents - │ from local files or HuggingFace - ▼ -┌─────────────────────────┐ -│ Artifact Extraction │ LLM extracts key concepts, entities, -│ │ relationships from each document -└───────────┬─────────────┘ - │ - ▼ -┌─────────────────────────┐ -│ QA Generation │ LLM generates questions & answers grounded -│ │ in the extracted artifacts -└───────────┬─────────────┘ - │ - ▼ -┌─────────────────────────┐ -│ Quality Evaluation │ LLM judge scores each QA pair -│ │ on relevance, accuracy, clarity -└───────────┬─────────────┘ - │ - ▼ - Final Dataset -``` - -Each box is a column. Each one can use a different model. Each one has a focused prompt or algorithm. And because they're declared as columns with explicit dependencies, the framework handles the execution order, the batching, and the parallelism. - -The critical insight \- and the one that applies regardless of your use case \- is that every stage is independently *configurable*, *testable*, and *replaceable*. Want to try a different model for quality evaluation? Swap the model alias on that column. Want to tighten quality thresholds? 
Adjust the judge column's scoring rubric. Want to add a new stage that generates hard negatives for contrastive learning? Add a column and declare its dependencies. The rest of the pipeline doesn't change. - ---- - -## **Summary** - -Synthetic data generation at scale is a systems problem, not just a prompting problem. The design principles behind Data Designer reflect this: - -1. **Declarative over imperative** \- describe the dataset you want, not the execution plan -2. **Columns as composable units** \- each stage is self-contained, with dependencies resolved automatically via a DAG -3. **Multi-model by design** \- match the model to the task, with per-column model routing -4. **Quality as a first-class stage** \- validators and judges are part of the pipeline, not afterthoughts -5. **Extensibility via plugins** \- add domain-specific logic without forking the framework - -The result is a general-purpose framework where complex, multi-stage generation workflows \- whether you're building retrieval training data, reasoning datasets, conversational corpora, or something we haven't imagined yet \- are expressed as simple column declarations. The hard problems of orchestration, dependency resolution, batching, and error handling are solved once, in the framework, rather than reimplemented in every project. - -Key Resources: - -1. [NeMo Data Designer on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner) -2. [Data Designer Documentation](https://nvidia-nemo.github.io/DataDesigner/) -3. [RQA Dev Note: Graduate-Level Science Reasoning Data](rqa.md) - ---- - -*Want to learn more about NeMo Data Designer? 
Check out our [documentation](https://nvidia-nemo.github.io/DataDesigner/) and start building your own synthetic data pipelines today.* diff --git a/docs/devnotes/posts/have-it-your-way.md b/docs/devnotes/posts/have-it-your-way.md deleted file mode 100644 index a76263ce3..000000000 --- a/docs/devnotes/posts/have-it-your-way.md +++ /dev/null @@ -1,271 +0,0 @@ ---- -date: 2026-05-05 -authors: - - jgreco - - etramel ---- - -# **Have It Your Way: Customizing Data Designer with Plugins** - -

A plugin framework for the custom pieces every real project ends up needing

- -![Data Designer plugin extensions](assets/have-it-your-way/data-designer-plugins-hero.png){ .devnote-float-right .devnote-hide-in-index } - -Data Designer is built around a simple idea: describe the dataset you want, and let the framework handle execution. A config points to seed data, defines generated columns, picks models, and shapes the final records — no orchestration code required. [Data Designer plugins](../../plugins/overview.md) keep that promise when a project needs something custom. - -As of Data Designer [v0.6.0](https://github.com/NVIDIA-NeMo/DataDesigner/releases/tag/v0.6.0), plugins are out of experimental mode and stable. They are the supported path for turning reusable project-specific logic into normal Data Designer components. - - - -What does "something custom" actually look like? Picture a robotics team sitting on a pile of [Isaac Sim](https://developer.nvidia.com/isaac/sim)-generated warehouse runs, trying to turn robot poses, camera views, and event metadata into instruction data. With an internal simulation-log plugin, the user-facing part can still be this small: - -```bash -uv pip install data-designer-isaac-logs -``` - -```python -from data_designer_isaac_logs.config import IsaacRunSeedSource -from data_designer_isaac_logs.config import WarehouseEventLabelColumnConfig -from data_designer_isaac_logs.config import RobotSFTProcessor - -config_builder.with_seed_dataset( - IsaacRunSeedSource( - run_dir="s3://warehouse-sim/rare-events/", - streams=("robot_pose", "overhead_rgb", "event_log"), - max_events=10_000, - ) -) -config_builder.add_column( - WarehouseEventLabelColumnConfig( - name="safety_instruction", - pose_column="robot_pose", - event_log_column="event_log", - ) -) -config_builder.add_processor(RobotSFTProcessor(output_column="messages")) -``` - -That is the point of plugins: install a package, import its config classes, and keep the workflow declarative. 
The Isaac run reader, event labeler, and trainer-format processor own the project-specific parsing and trainer-facing shape. Data Designer still does the framework work, from component discovery and dependency ordering to model execution and output handling. - ---- - -## **Customization Is the Normal Case** - -![A confused engineer trying to fit custom building blocks into the wrong framework slots](assets/have-it-your-way/customization-blocks-confusion.png){ .devnote-section-graphic } - -The mess usually starts innocently. A team defines a Data Designer config, then discovers that its seed data lives in an internal layout, its generated column needs a domain simulator, and its trainer expects a slightly different record shape. Someone writes a small reader beside the notebook. Someone patches a generator into a project folder. Someone adds a cleanup script after preview because the final export has one more organization-specific rule. Each choice is reasonable because every project brings a different corpus, policy model, domain vocabulary, or training stack. - -The problem is that the custom behavior now lives around Data Designer instead of inside the Data Designer workflow. It is harder to validate, harder to share, harder to version, and easier to lose. Plugins give that bespoke work a clean package boundary – a name, typed config, runtime implementation, entry point, and tests that travel together. Users still declare the dataset they want, but the local reader, domain generator, or trainer-format processor becomes a normal Data Designer component instead of another layer of glue. - -
- ---- - -## **Where Plugins Fit** - -The first plugin boundaries match the places where real projects most often need customization. - -
- -

📥 Seed reader plugins bring new source systems into Data Designer. Use them for databases, document stores, object stores, internal APIs, file collections, or corpus layouts that need custom hydration before generation can begin.

- -

🧬 Column generator plugins create new column types. Use them when a value should be produced during generation and should participate in dependency ordering like any other column. This is the right place for simulators, domain libraries, retrieval-backed generation, deterministic rule systems, or custom model-backed generation.

- -

🔧 Processor plugins transform records before or after generation. Use them for redaction, cleanup, deduplication, export views, organization-specific schemas, or training formats that should not be hidden inside prompts.

- -
- -These boundaries are intentionally narrow. A plugin should own the behavior that is specific to your use case. Data Designer validates configs and resolves dependencies. It plans batches, runs models, records logs, shows previews, then writes the output. That split lets custom components use the normal workflow without moving orchestration into the project. - -What about [custom columns](../../concepts/custom_columns.md)? Start with a custom column when you are prototyping column-generator behavior or need a one-off column that only one project uses. Custom columns keep the logic in a Python function inside the config, with declared dependencies and optional model access. When that logic needs a stable config schema, tests, packaging, docs, or reuse across teams, promote it to a column generator plugin. - ---- - -## **Author a Plugin: From Glue Code to Seed Reader** - -To make this concrete, let's walk through a full example. Consider a markdown seed reader. The one-off version might be a helper function that walks a directory, splits files into sections, returns a DataFrame, and then gets copied into the next project that needs it. That can work for one project. It becomes a problem when the reader needs options, tests, documentation, versioning, or reuse across teams. At that point, the helper has become a capability whether or not it is packaged like one. - -A plugin packages that same helper as a small Python project: - -- A user-facing config class describes the options. -- An implementation class does the work. -- A `Plugin` object connects the config to the implementation. -- An entry point registers the plugin with Data Designer. - -The config class declares the user-facing options. 
For a directory-backed reader, Data Designer's `FileSystemSeedSource` already has fields for `path`, `file_pattern`, and `recursive`, we just need to define the seed type discriminator: - -```python -# config.py -from __future__ import annotations - -from typing import Literal - -from data_designer.config.seed_source import FileSystemSeedSource - - -class MarkdownSectionSeedSource(FileSystemSeedSource): - """Configure the markdown sections seed reader.""" - - seed_type: Literal["markdown-sections"] = "markdown-sections" -``` - -The implementation class is where the old helper code should move. For a filesystem seed reader, Data Designer gives you a small interface instead of a blank page: implement `build_manifest(...)` to build a cheap index of candidate inputs, and implement `hydrate_row(...)` to turn each selected manifest row into one or more dataset rows. That split matters because Data Designer can plan work against the lightweight manifest before paying the cost of reading files, parsing sections, or calling project-specific libraries. The parser can still be a normal helper function; the reader class is the framework boundary. - -```python -# impl.py -from __future__ import annotations - -from pathlib import Path -from typing import Any, ClassVar - -from data_designer.engine.resources.seed_reader import ( - FileSystemSeedReader, - SeedReaderFileSystemContext, -) - -from data_designer_markdown_sections.config import MarkdownSectionSeedSource - - -class MarkdownSectionSeedReader(FileSystemSeedReader[MarkdownSectionSeedSource]): - output_columns: ClassVar[list[str]] = [ - "relative_path", - "file_name", - "section_index", - "section_header", - "section_content", - ] - - def build_manifest( - self, - *, - context: SeedReaderFileSystemContext, - ) -> list[dict[str, str]]: - # Fast path: enumerate candidate files and return cheap metadata. 
- matched_paths = self.get_matching_relative_paths( - context=context, - file_pattern=self.source.file_pattern, - recursive=self.source.recursive, - ) - return [ - {"relative_path": relative_path, "file_name": Path(relative_path).name} - for relative_path in matched_paths - ] - - def hydrate_row( - self, - *, - manifest_row: dict[str, Any], - context: SeedReaderFileSystemContext, - ) -> list[dict[str, Any]]: - # Expensive path: hydrate only the selected manifest rows. - # This is where parsing, fan-out, and source-specific cleanup belong. - relative_path = str(manifest_row["relative_path"]) - file_name = str(manifest_row["file_name"]) - with context.fs.open(relative_path, "r", encoding="utf-8") as handle: - markdown_text = handle.read() - - return [ - { - "relative_path": relative_path, - "file_name": file_name, - "section_index": section_index, - "section_header": section_header, - "section_content": section_content, - } - for section_index, (section_header, section_content) in enumerate( - extract_markdown_sections(markdown_text) - ) - ] -``` - -The same rule applies to column generators and processors: choose the closest base class, keep options on the config object, implement the narrow runtime method, and leave orchestration out of the plugin. 
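The reader's `hydrate_row` calls `extract_markdown_sections`, which this excerpt neither imports nor defines. Below is a minimal sketch of what such a helper might look like, assuming sections are delimited by ATX headings (`#` through `######`) and that each section comes back as a `(header, content)` pair, the shape the list comprehension unpacks. The regex and the handling of text before the first heading are assumptions for illustration, not the recipe's actual implementation.

```python
from __future__ import annotations

import re

# Hypothetical helper: the post calls extract_markdown_sections() but does not show it.
# Assumption: a section starts at any ATX heading (one to six '#' followed by whitespace).
_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def extract_markdown_sections(markdown_text: str) -> list[tuple[str, str]]:
    """Split markdown text into (section_header, section_content) pairs."""
    sections: list[tuple[str, str]] = []
    header = ""  # text before the first heading gets an empty header
    body: list[str] = []

    def flush() -> None:
        # Emit the current section unless it is an empty preamble.
        if header or any(line.strip() for line in body):
            sections.append((header, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        match = _HEADING.match(line)
        if match:
            flush()
            header = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return sections
```

This sketch does not skip `#` lines that appear inside fenced code blocks, so a production parser would need extra state for that case.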
- -Two small files connect the plugin to Data Designer — a `Plugin` descriptor that names the config and implementation, and a Python entry point that exposes them at install time: - -```python -# plugin.py -from data_designer.plugins import Plugin, PluginType - -plugin = Plugin( - config_qualified_name="data_designer_markdown_sections.config.MarkdownSectionSeedSource", - impl_qualified_name="data_designer_markdown_sections.impl.MarkdownSectionSeedReader", - plugin_type=PluginType.SEED_READER, -) -``` - -```toml -# pyproject.toml -[project.entry-points."data_designer.plugins"] -markdown-sections = "data_designer_markdown_sections.plugin:plugin" -``` - -After that, users do not import engine internals or run registration code. They import the config class and use it: - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner -from data_designer_markdown_sections.config import MarkdownSectionSeedSource - -builder = dd.DataDesignerConfigBuilder() -builder.with_seed_dataset( - MarkdownSectionSeedSource( - path="docs/", - file_pattern="*.md", - ) -) -builder.add_column( - dd.LLMTextColumnConfig( - name="question", - model_alias="nvidia-text", - prompt="Write a question about this section: {{ section_content }}", - ) -) - -results = DataDesigner().preview(builder, num_records=5) -``` - -No custom orchestration. No separate DataFrame preparation step. The reader is part of the Data Designer workflow. - ---- - -## **Building the Plugin Ecosystem** - -Reusable plugins also need a discovery layer. Once a plugin is useful beyond one project, users need a simple way to find the right package, install it, and get back to declaring datasets. That is why Data Designer includes a built-in NVIDIA plugin catalog and a CLI workflow for discovery and installation. 
- -The NVIDIA catalog is backed by [NVIDIA-NeMo/DataDesignerPlugins](https://github.com/NVIDIA-NeMo/DataDesignerPlugins), a dedicated home for first-party plugin packages, packaging examples, and plugin-specific docs. Keeping those packages outside the core repository lets them carry optional dependencies, target narrower use cases, and move at their own pace while still using the same plugin interface once installed. - -For users, the first-party path is short: list what is available, search for what you need, and install by package name or alias. - -```bash -data-designer plugin list -data-designer plugin search -data-designer plugin install -``` - -After installation, there is no separate registration step. Data Designer discovers the package's entry points, so users import the plugin's config classes and keep building the same declarative workflow. - -Catalogs are not limited to NVIDIA plugins. A platform group can publish a catalog of approved internal plugins backed by an internal package index or direct package references. A community can publish a catalog for a domain or workflow. The catalog gives users a trusted path to the plugins they prefer, while plugin packages remain independently versioned and distributed. - -```bash -data-designer plugin catalog add -data-designer plugin --catalog install -``` - -This provides a foundation for a rich Data Designer plugin ecosystem: the core framework provides the stable runtime, plugin authors provide specialized capabilities, and catalogs make those capabilities discoverable. For more information, see [Discover Plugins](../../plugins/discover.md). - ---- - -## **Where to Go Next** - -Interested in building your own plugin? Here are some resources to get you started: - -1. [Plugins overview](../../plugins/overview.md) — learn how plugins fit into Data Designer -2. [Build Your Own](../../plugins/build_your_own.md) — follow the authoring guide for seed readers, column generators, and processors -3. 
[Using Models in Plugins](../../plugins/models.md) — call configured models from plugin code -4. [Markdown Section Seed Reader recipe](../../recipes/plugin_development/markdown_seed_reader.md) — study the complete version of the example from this post -5. [Discover Plugins](../../plugins/discover.md) — learn how to discover and install plugins -6. [DataDesignerPlugins on GitHub](https://github.com/NVIDIA-NeMo/DataDesignerPlugins) — explore first-party plugin packages - -Moving plugins out of experimental mode means Data Designer no longer has to predict every customization users will need. The framework provides the pipeline. Plugins supply the custom pieces. - -🎨 🔌 Thanks for reading and happy plugin building! diff --git a/docs/devnotes/posts/owning-the-model-stack.md b/docs/devnotes/posts/owning-the-model-stack.md deleted file mode 100644 index a6e480c35..000000000 --- a/docs/devnotes/posts/owning-the-model-stack.md +++ /dev/null @@ -1,229 +0,0 @@ ---- -date: 2026-03-25 -authors: - - nmulepati ---- - -# **Owning the Model Stack: Adaptive Concurrency FTW!** - -Picture this: you're generating a million-record dataset. Thirty two concurrent requests per model, three models in the pipeline, two providers. Everything hums along for the first ten minutes — then one provider starts returning 429s, your retry logic kicks in, and suddenly you're in a feedback loop where retries *cause* more 429s. The run stalls. You restart with lower concurrency, waste throughput for hours, and wonder if there's a better way. - -There is. This post is about the native model client layer we built with adaptive throttling (a system that discovers provider capacity at runtime) replacing our dependency on LiteLLM along the way. 
- - - -![From chaotic request flow to calibrated concurrency via adaptive throttling](assets/owning-the-model-stack/native-model-client-hero.png) - -## **Why We Made the Move** - -LiteLLM gave us a fast path to multi-provider support early on: "just call any model" without writing HTTP adapters from scratch. As Data Designer's workloads scaled to millions of records across multiple models and providers, we wanted more control over what happens between our orchestrator and the provider API. - -The biggest opportunity was **adaptive concurrency**. When you start hitting rate limits at scale, you don't want to just retry. You want the system to *learn* the provider's actual capacity and adjust on the fly. That adaptation needs to be aware of your pipeline's topology. Which models share an endpoint? Which routes share a rate-limit budget? Building that required owning the transport layer. - -We also saw a chance to simplify. We were only using a slice of what LiteLLM provides. A purpose-built stack meant less surface area, faster startup, and a transport lifecycle we could reason about end to end. - -So we built a native client layer. Thin HTTP adapters with adaptive rate-limit handling, deterministic retry policy, and canonical error normalization. The rest of this post walks through how it works. - -## **Architecture: The Native Client Layer** - -The replacement is a layered stack where each layer does one thing. `ModelFacade`, the public orchestration surface that column generators call, didn't change at all. Everything below it is new. - -
- -![Native model client architecture: six layers from ModelFacade down to provider HTTP APIs](assets/owning-the-model-stack/native-model-client-layers.png){ style="max-width:75%; height:auto" } - -
- -From top to bottom: - -1. **ModelFacade**: orchestrates correction loops, MCP tool-calling, and usage tracking. This is the public API. Column generators talk to this layer, and it was untouched during the migration. If you've written a Data Designer pipeline, nothing about your code changes. - -2. **ThrottledModelClient**: the new layer. It's a decorator around `HttpModelClient` — same `ModelClient` protocol, but every outbound call is wrapped with a throttle permit: acquire a concurrency slot before the call, release it after, and feed the outcome (success, 429, or error) back to `ThrottleManager`. This is where adaptive throttling lives. - -3. **ThrottleManager**: the Additive Increase / Multiplicative Decrease (AIMD) controller that `ThrottledModelClient` delegates to. A single instance is created at pipeline startup and shared across all model clients. It owns all the mutable concurrency state — per-domain AIMD counters, global caps, cascade dampening, and cooldown timers. - -4. **HttpModelClient**: an abstract base class that defines the interface for all provider adapters. It owns the shared `httpx` transport lifecycle — connection pooling, timeouts, and transport-level retries for transient failures (502, 503, 504). Boring but important. - -5. **Provider Adapters**: `OpenAICompatibleClient` and `AnthropicClient`, both extending `HttpModelClient`. Each adapter translates between our canonical request/response types and the provider's wire format. Provider-specific shapes are contained here and never leak upward. - -6. **Provider HTTP APIs**: the actual endpoints (OpenAI, NVIDIA NIM, vLLM, Anthropic Messages API). - -The boundary between `ModelFacade` and the client layer is defined by canonical types. `ChatCompletionRequest`, `ChatCompletionResponse`, `EmbeddingRequest`, `EmbeddingResponse`, `ImageGenerationRequest`, `ImageGenerationResponse`, and `ProviderError`. These are plain dataclasses. No provider SDK objects cross this line. 
A `ModelClient` protocol defines the contract that all adapters implement, and that's the only interface the rest of the system sees. - -## **Adaptive Throttling: The Centerpiece** - -With this client stack in place, we had the foundation to build something that wasn't possible before. Adaptive concurrency control. Let's start with the problem. - -### **The guessing game** - -When you're calling LLM APIs at scale, you need to pick a concurrency level: how many requests to keep in flight at once. Providers publish RPM and TPM limits, but the actual capacity you can sustain depends on factors they don't tell you (current load, your prompt lengths, what other tenants are doing). You could run benchmarking passes to get a better estimate, but that's time-consuming, costs real tokens, and the answer can shift between runs anyway. Set concurrency too high and you trigger 429 storms that cascade through your pipeline. Set it too low and you leave throughput on the table for hours. - -What you actually want is a system that *discovers* the provider's capacity at runtime and adjusts automatically. That's what AIMD does. - -### **AIMD: Additive Increase / Multiplicative Decrease** - -If you've studied networking, this will sound familiar. [AIMD](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) is the algorithm behind TCP congestion control. We apply the same idea to LLM API concurrency: - -- **During optional startup ramp**: if `rampup_seconds` is set, start a new route at one concurrent request and climb linearly toward `max_parallel_requests` over that duration. -- **On success**: after a window of consecutive successful requests (default: 25), increase the concurrency limit by 1. Slow, cautious growth. -- **On 429**: multiply the current limit by a reduce factor (default: 0.75, a 25% cut). Fast, decisive pullback. Then apply a cooldown using the provider's `Retry-After` header when available, or a default of 2 seconds. 
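The two core rules above can be sketched as a toy controller. This is an illustration of AIMD under stated assumptions, not the `ThrottleManager` implementation: the class name `AimdLimit` is hypothetical, the limit is rounded down after a decrease and floored at 1 (assumptions), and the startup ramp, cooldown, ceiling, and cascade handling are left out.

```python
from dataclasses import dataclass


@dataclass
class AimdLimit:
    """Toy AIMD controller (hypothetical; illustrative only)."""

    max_parallel_requests: int
    reduce_factor: float = 0.75   # multiplicative decrease on 429
    success_window: int = 25      # consecutive successes before an increase
    additive_increase: int = 1    # step size for each increase
    current_limit: int = 0
    success_streak: int = 0

    def __post_init__(self) -> None:
        # Without a startup ramp, begin at the configured maximum.
        self.current_limit = self.max_parallel_requests

    def on_success(self) -> None:
        # Additive increase: grow by one step after a full window of successes.
        self.success_streak += 1
        if self.success_streak >= self.success_window:
            self.success_streak = 0
            self.current_limit = min(
                self.max_parallel_requests,
                self.current_limit + self.additive_increase,
            )

    def on_rate_limited(self) -> None:
        # Multiplicative decrease: cut the limit and restart the success streak.
        # Assumption: round down, and never drop below one request in flight.
        self.success_streak = 0
        self.current_limit = max(1, int(self.current_limit * self.reduce_factor))
```

Starting from `AimdLimit(max_parallel_requests=20)`, a single call to `on_rate_limited()` moves the limit from 20 to 15, and it then takes 25 consecutive `on_success()` calls to regain one slot.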
- -The asymmetry is deliberate. You probe upward slowly because overshooting wastes requests. You pull back quickly because staying above the limit wastes *everything* because every request in the burst gets rejected. This is the same insight that makes TCP work: be optimistic cautiously, be pessimistic decisively. - -The result is that the system converges on the provider's actual capacity without you setting it. By default it starts at your configured `max_parallel_requests`; for cold inference servers, you can set `rampup_seconds` to ease in from 1 request to that configured peak. Either way, once a 429 arrives, the controller discovers the real limit through rate-limit signals and settles into a steady state that tracks the provider's capacity as it changes. - -
- -![AIMD concurrency control over time: initial phase, 429 drop, recovery, ceiling stabilization, steady state](assets/owning-the-model-stack/aimd-concurrency-over-time.png){ style="max-width:75%; height:auto" } - -
- -This is especially useful when you're self-hosting your inference stack (running vLLM or NVIDIA NIM on your own hardware) as long as the serving framework returns 429s when it's at capacity. The capacity of a self-hosted endpoint depends on your GPU count, model size, quantization, batch settings, and whatever else is sharing the cluster. That capacity might change between runs, or even mid-run if other workloads spin up. If your serving layer signals overload with 429s, you don't need to figure any of that out. Point Data Designer at your endpoint, set `max_parallel_requests` to a generous upper bound, and the system self-adjusts to whatever your infrastructure can actually handle. - -### **Startup ramp** - -Some inference servers do not handle an immediate cold burst well, even when their steady-state capacity is high. For those endpoints, `ThrottleConfig(rampup_seconds=...)` enables a time-based startup ramp. Each throttle domain starts at one concurrent request and linearly climbs toward the configured `max_parallel_requests` ceiling over the ramp duration. - -The ramp is optimistic but interruptible. If no 429s arrive, it reaches the configured peak. If a 429 arrives during the ramp, the ramp is aborted immediately and the domain switches to normal AIMD behavior: multiplicative decrease, cooldown, ceiling recording when the decrease reveals a higher failed limit, and additive recovery. - -### **Ceiling stabilization** - -Classic AIMD has a well-known problem, the sawtooth. After a 429 drops the limit, additive increase climbs all the way back to the configured max, hits another 429, drops again, and repeats. Every climb wastes requests, and the 429 bursts are predictable. - -We dampen it with **ceiling stabilization**. After the first 429, the system records the pre-decrease limit as a `rate_limit_ceiling`. 
Subsequent additive increases don't climb all the way back to `max_parallel_requests` — they stop at `ceiling * (1 + ceiling_overshoot)` (by default 10% above the observed limit). This lets the system probe gently above what it knows works — the 10% overshoot band — without repeatedly slamming into the wall. If the probe succeeds (no 429), the limit keeps rising within the overshoot band while the ceiling stays put. If a 429 fires at or below the existing ceiling, the ceiling is updated downward to the lower observed limit via `min(existing_ceiling, prev_limit)`, tightening the band over time. The result is that oscillations shrink and the system converges on a tight band around the provider's real capacity. - -### **Cascade dampening** - -Here's a subtlety that bit us during testing. When the system is running at capacity and 429s start coming back, it's not just one request that fails. Multiple in-flight requests hit the rate limit at the same time. Without dampening, each 429 triggers its own multiplicative decrease. If you have 5 concurrent 429s and each one cuts the limit by 25%, you've collapsed from 20 to 4 in a single burst. That's way too aggressive. - -**Cascade dampening** fixes this. Only the first 429 in a burst triggers a decrease. Subsequent 429s in the same cascade are counted (for observability) but don't further reduce the limit. The cascade resets on the next successful request. Simple, but it makes the difference between a graceful pullback and a collapse. - -### **Two-level keying** - -Real pipelines aren't simple. A single provider+model combination might serve chat completions, embeddings, and image generation, potentially on different rate-limit budgets. And multiple [model aliases](../../concepts/models/model-configs.md) in your pipeline might point to the same underlying provider and model (say, one alias for generation and another for judging, both hitting the same NVIDIA endpoint). 
- -The throttle manager handles this with two-level keying: - -
- -![Two-level throttle keying: global cap per provider+model, independent domain states for chat, embedding, image](assets/owning-the-model-stack/throttle-keying.png){ style="max-width:75%; height:auto" } - -
- -- **Global cap**: keyed by `(provider_name, model_id)`. When multiple model aliases target the same provider and model, the effective max is `min()` of their configured `max_parallel_requests`. This enforces the most conservative limit for shared upstream capacity, because the provider doesn't care what you *call* the model, it sees the same API key. - -- **Domain state**: keyed by `(provider_name, model_id, throttle_domain)`. Each domain (`chat`, `embedding`, `image`, `healthcheck`) maintains its own AIMD state: `current_limit`, `in_flight`, `blocked_until`, `success_streak`, and `rate_limit_ceiling`. Domains float independently but are always capped by the global max. - -The practical effect is that a burst of 429s on the chat route doesn't starve embedding requests, and vice versa. Each route adapts to its own capacity independently while respecting the shared upstream limit. - -## **The Retry Boundary** - -There's a design choice here that isn't obvious until you think about it, and getting it wrong would break the entire throttling system. - -The transport layer (via `httpx` with `RetryTransport`) handles transient server failures like 502, 503, 504, and connection errors. These are hiccups. The server is temporarily broken. Retry with exponential backoff and jitter, and move on. - -But **429 is explicitly excluded from transport retries**. - -
- -![Retry boundary: 502/503/504 retried at transport, 429 passed through to ThrottledModelClient for AIMD feedback](assets/owning-the-model-stack/retry-boundary.png){ style="max-width:75%; height:auto" } - -
- -Why? Because if the retry layer swallows 429s, the throttle manager never learns the provider is overloaded. The whole AIMD feedback loop depends on seeing raw rate-limit signals. A 429 must bubble up to `ThrottledModelClient` so it can call `release_rate_limited()`, cut the concurrency limit, apply the cooldown, and record the ceiling. The next attempt then re-enters the throttle acquire path, waiting for a permit, before making another HTTP call. - -The split is clean and worth remembering. Transport retries handle *server problems*. Throttle adaptation handles *capacity problems*. The provider is working fine, you're just sending too many requests. Conflating the two is how you get retry storms. - -One caveat: this boundary behaves differently depending on the execution mode. In async mode (currently experimental, enabled with `DATA_DESIGNER_ASYNC_ENGINE=1`), 429s bypass transport retries entirely and flow straight to `ThrottledModelClient` for AIMD feedback — this is the full adaptive loop described above. In sync mode, 429s are retried at the transport layer since there's no salvage queue to re-attempt failed rows. AIMD is still wired up but only fires if all transport retries are exhausted. This is temporary — once the async engine graduates from experimental, it will become the default path and the sync codepath will be retired. See [Async All the Way Down](async-all-the-way-down.md) for the full story on the async engine. - -## **Configuration** - -The throttle system is designed to work well out of the box. The defaults are conservative and handle most workloads without tuning. The primary user-facing knob is still `max_parallel_requests` on your model's inference parameters, which sets the hard upper bound for concurrency. AIMD floats below it. 
- -For workloads where you want to fine-tune the adaptation behavior, `ThrottleConfig` is available on `RunConfig`: - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -data_designer = DataDesigner() -data_designer.set_run_config( - dd.RunConfig( - throttle=dd.ThrottleConfig( - reduce_factor=0.75, - success_window=25, - cooldown_seconds=2.0, - ceiling_overshoot=0.10, - rampup_seconds=0.0, - ) - ) -) -config_builder = dd.DataDesignerConfigBuilder( - model_configs=[ - dd.ModelConfig( - alias="reasoning-model", - model="nvidia/nemotron-3-super-120b-a12b", - provider="nvidia", - inference_parameters=dd.ChatCompletionInferenceParams( - max_parallel_requests=32, - ), - ), - ], -) - -# ... add columns to config_builder ... - -create_result = data_designer.create( - config_builder, - num_records=10_000, -) -``` - -| Parameter | Default | What it does | -|---|---|---| -| `reduce_factor` | 0.75 | Multiplicative decrease on 429 (0.75 = reduce by 25%) | -| `additive_increase` | 1 | How much to increase the limit after a success window | -| `success_window` | 25 | Consecutive successes before additive increase | -| `cooldown_seconds` | 2.0 | Default cooldown when no `Retry-After` header | -| `ceiling_overshoot` | 0.10 | How far above the observed ceiling to probe (10%) | -| `rampup_seconds` | 0.0 | Optional startup ramp duration. `0.0` keeps the previous immediate-start behavior | - -In practice, the parameter most worth adjusting after a 429 is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all. For cold self-hosted endpoints, set `rampup_seconds` to ease into the first burst without changing steady-state AIMD behavior. 
- -Most users will never need to touch any of these. The system adapts automatically. - -## **What It Looks Like in the Logs** - -`ThrottleManager` logs every state transition at `INFO` level, so the adaptation story is visible in your terminal as the run progresses. - -``` -# When the system hits a 429 and cuts concurrency: -🪫📉 'nvidia/nemotron-3-super-120b-a12b' [chat] server rate-limited — concurrency reduced from 20 → 15 (retrying in 2s) - -# If the provider's capacity is lower than a previously observed ceiling, the log includes the estimated server limit: -🪫📉 'nvidia/nemotron-3-super-120b-a12b' [chat] server rate-limited at 15 (server limit ~12) — concurrency reduced to 11 (retrying in 2s) - -# As successes accumulate and the limit climbs back: -🪫📈🔥 'nvidia/nemotron-3-super-120b-a12b' [chat] concurrency increased from 11 → 12 - -# When the limit reaches the ceiling band: -🔋✅ 'nvidia/nemotron-3-super-120b-a12b' [chat] concurrency recovered to 13 parallel requests - -# And if no 429s have been observed and the limit reaches the configured max: -🔋✅ 'nvidia/nemotron-3-super-120b-a12b' [chat] concurrency fully recovered (20 parallel requests) -``` - -Reading these lines in sequence tells you exactly what happened: where the system started, when it hit the wall, how far it pulled back, and how it recovered. No guessing, no metrics pipeline required. - -## **Where This Leaves Us** - -This shipped in Data Designer v0.5.4. If you're using Data Designer today, nothing changes in your pipeline code. `ModelFacade` is the same API it's always been. What changes is what happens underneath. The system now discovers provider capacity at runtime, isolates throttle state per route, and separates retry logic from rate-limit adaptation. Adaptive throttling is enabled by default for all providers. You don't opt in or configure anything; it just starts learning. 
If you want to see this fully in action, turn on async mode — see [Async All the Way Down](async-all-the-way-down.md) for details. - -For most workloads, the defaults are all you need. Set `max_parallel_requests` to a generous upper bound and let AIMD find the right level. If you're running against a stack that returns 429s, the system adapts to the available capacity without any tuning. If you want finer control, `ThrottleConfig` is there — but the goal is that you spend your time designing datasets, not tuning concurrency knobs. - -Key Resources: - -1. [NeMo Data Designer on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner) -2. [Design Principles Dev Note](design-principles.md) - -*Want to learn more about NeMo Data Designer? Check out our [documentation](https://nvidia-nemo.github.io/DataDesigner/) and start building your own synthetic data pipelines today.* diff --git a/docs/devnotes/posts/push-datasets-to-hugging-face-hub.md b/docs/devnotes/posts/push-datasets-to-hugging-face-hub.md deleted file mode 100644 index 28c11c4d1..000000000 --- a/docs/devnotes/posts/push-datasets-to-hugging-face-hub.md +++ /dev/null @@ -1,320 +0,0 @@ ---- -date: 2026-04-16 -authors: - - nmulepati - - davanstrien ---- - -# **Push Datasets to Hugging Face Hub** - -You just generated 10k multilingual greetings (or some other cool dataset). Now what — email a parquet file? -Nah. Call `.push_to_hub()` and you've got a live dataset page on Hugging Face. Done and dusted 🚢. 
- - - -![Push to Hub Hero](assets/push-datasets-to-hugging-face-hub/push-to-hub-hero.png){ width=100% } - ---- - -Here's the full flow — build a multilingual greeting dataset with a conversation -training processor, generate it, and push it to the Hub in one go: - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -data_designer = DataDesigner() -config_builder = dd.DataDesignerConfigBuilder() - -config_builder.add_column( - dd.SamplerColumnConfig( - name="language", - sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams( - values=["English", "Spanish", "French", "German", "Italian"], - ), - drop=True, - ) -) - -config_builder.add_column( - dd.LLMTextColumnConfig( - name="greeting", - model_alias="nvidia-text", - prompt="Write a casual greeting in {{ language }}.", - ) -) -config_builder.add_column( - dd.LLMTextColumnConfig( - name="response", - model_alias="nvidia-text", - prompt="Write a helpful agent response to this greeting: '{{ greeting }}'.", - ) -) - -# Reshape into an OpenAI-style conversation training format -config_builder.add_processor( - dd.SchemaTransformProcessorConfig( - name="conversations", - template={ - "messages": [ - {"role": "user", "content": "{{ greeting }}"}, - {"role": "assistant", "content": "{{ response }}"}, - ] - }, - ) -) - -results = data_designer.create(config_builder, num_records=10_000) - -# Ship it: -url = results.push_to_hub( - "my-org/multilingual-greetings", - "10k synthetic agent/user conversations across 5 languages.", - tags=["greetings", "multilingual", "conversation"], -) -print(url) # https://huggingface.co/datasets/my-org/multilingual-greetings -``` - ---- -## Two Ways In - same outcome - -**From results** (the happy path) — you just ran `.create()`, you have the -results object, call `.push_to_hub()` on it. 
- -**From a folder** (the "I closed my notebook" path) — you saved artifacts to -disk earlier and want to push them later: - -```python -from data_designer.integrations.huggingface import HuggingFaceHubClient - -url = HuggingFaceHubClient.push_to_hub_from_folder( - dataset_path="./my-saved-dataset", - repo_id="my-org/multilingual-greetings", - description="10k synthetic agent/user conversations across 5 languages.", -) -``` - ---- -## What You Get on the Hub - -Once pushed, your dataset is live in the Hugging Face ecosystem: - -- **Dataset Viewer** — browsable in the browser immediately. Each processor - config shows up as a separate subset tab (more on this in - [Processors Get First-Class Treatment](#processors-get-first-class-treatment)). -- **Streaming** — parquet means consumers can stream without downloading: - - ```python - from datasets import load_dataset - - ds = load_dataset("my-org/multilingual-greetings", "conversations", split="train", streaming=True) - ``` - -- **[Dataset Viewer API](https://huggingface.co/docs/dataset-viewer/)** — row - pagination, text search, column statistics, and parquet shard URLs with no - extra setup. - ---- -## What Gets Uploaded - -![Push to Hub Pipeline](assets/push-datasets-to-hugging-face-hub/push-to-hub-pipeline.png) - -Everything. The upload pipeline runs in this order: - -``` -1. README.md ← auto-generated dataset card -2. data/*.parquet ← your main dataset (remapped from parquet-files/) -3. images/* ← if you have image columns (skipped otherwise) -4. {processor}/* ← processor outputs (remapped from processors-files/) -5. builder_config.json -6. metadata.json ← paths rewritten to match HF repo layout -``` - -Each step is its own commit on the HF repo, so you get a clean history. - -This is especially nice for large datasets. Data Designer writes output in -batched parquet partitions — generate 100k records and you'll have dozens of -parquet files across `parquet-files/`, `processors-files/`, and maybe `images/`. 
-Manually uploading all of that, organizing it into the right HF repo structure, -writing the dataset card YAML configs, and rewriting metadata paths would be -tedious and error-prone. `push_to_hub` handles the whole thing in one call — -folder uploads, path remapping, config registration, dataset card generation, -all of it. - -Re-pushing to the same `repo_id` updates the existing repo — no need to delete -and recreate. - ---- -## Processors Get First-Class Treatment - -![Schema Transform for Conversation Training](assets/push-datasets-to-hugging-face-hub/push-to-hub-schema-transform.png) - -Notice the `SchemaTransformProcessorConfig` in the example above. That's doing -the heavy lifting — it takes the raw `greeting` and `response` columns and -reshapes each row into an OpenAI-style `messages` array: - -```python -config_builder.add_processor( - dd.SchemaTransformProcessorConfig( - name="conversations", - template={ - "messages": [ - {"role": "user", "content": "{{ greeting }}"}, - {"role": "assistant", "content": "{{ response }}"}, - ] - }, - ) -) -``` - -The template is Jinja2 all the way down. Keys become columns in the output, -values get rendered per-row with the actual column data. The template dict must -be JSON-serializable — strings, lists, nested objects, all fair game. So you can -build arbitrarily complex conversation schemas (multi-turn, system prompts, -tool calls) just by adding more entries to the `messages` list. - -The processor runs after each batch and writes its output to a separate parquet -file alongside the main dataset. The main dataset (`data/`) still has the raw -columns — the processor output is an *additional* view, not a replacement. 
- -**When you push to hub, each processor gets its own top-level directory and its -own HF dataset config.** So the `conversations` processor from our example ends -up like this on HF: - -``` -my-org/multilingual-greetings/ -├── README.md -├── data/ -│ ├── batch_00000.parquet ← raw columns (greeting, response) -│ └── batch_00001.parquet -├── conversations/ -│ ├── batch_00000.parquet ← transformed (messages array) -│ └── batch_00001.parquet -├── builder_config.json -└── metadata.json -``` - -The dataset card YAML frontmatter registers each processor as its own named -config: - -```yaml -configs: -- config_name: data - data_files: "data/*.parquet" - default: true -- config_name: conversations - data_files: "conversations/*.parquet" -``` - -So consumers grab exactly the format they need: - -```python -from datasets import load_dataset - -# Raw columns — good for analysis -df = load_dataset("my-org/multilingual-greetings", "data", split="train") - -# Conversation format — ready for fine-tuning -df_conv = load_dataset("my-org/multilingual-greetings", "conversations", split="train") -print(df_conv[0]) -# {'messages': [{'role': 'user', 'content': 'Hey! Como estás?'}, -# {'role': 'assistant', 'content': 'Hola! Estoy bien, gracias...'}]} -``` - -The Quick Start section in the generated README includes these snippets -automatically — one `load_dataset` call per processor. - -**Metadata paths are rewritten too.** Local paths like -`processors-files/conversations/batch_00000.parquet` become -`conversations/batch_00000.parquet` so file references in the metadata match -the actual HF repo structure. - -If there are no processors, all of this is silently skipped — no empty -directories, no phantom configs. - ---- -## The Auto-Generated Dataset Card - -This is the fun part. The upload generates a full HuggingFace dataset card from -your run metadata. 
It pulls from `metadata.json` and `builder_config.json` to -build: - -- A **Quick Start** section with `load_dataset` code (including processor subsets) -- A **Dataset Summary** with record count, column count, completion % -- A **Schema & Statistics** table — per-column type, uniqueness, null rate, token stats -- **Generation Details** — how many columns of each config type -- A **Citation** block so people can cite your dataset - -Tags default to `["synthetic", "datadesigner"]` plus whatever you pass in. -Size category (`n<1K`, `1K - -## **Why Retriever Data Is the Bottleneck** - -If you are building a RAG system, you have probably hit this wall: the generator is good, the vector database is fast, the prompt is carefully tuned, and the answer is still wrong because the right passage never made it into context. - -That is a retrieval problem. More specifically, it is often a data problem. General-purpose embedding models understand broad semantic similarity, but they do not know the fine-grained distinctions in your product docs, tickets, policies, codebase, manuals, or internal taxonomy. To improve that, you need domain-specific retriever training and evaluation data: realistic queries, positive passages, held-out evals, and enough metadata to know whether the retriever actually found the right evidence. - -The hard part is not asking an LLM to write questions about a document. The hard part is keeping every generated question tied to the exact chunk, document, or multi-hop evidence set that a retriever should recover. Many RAG tutorials stop at chunk, embed, retrieve, and prompt. Fine-tuning recipes often begin once labeled query-passage pairs already exist. The gap in between is where developers lose the most time. - -The plugin fills that gap. 
It packages a retrieval SDG toolkit that starts with a directory of documents, generates synthetic query-positive examples with NeMo Data Designer, filters them, and exports them for retriever fine-tuning and BEIR-style evaluation. - -This is not just a demo package. The same plugin produced the [Retrieval-Synthetic-NVDocs-v1](https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1) dataset from NVIDIA public documentation, and it powers the bootstrap SDG stage for both the NeMo [embedding fine-tune recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/embed) and [reranking fine-tune recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/e6e8a3281a11b8e1b7b47af098bbf54416c68d47/src/nemotron/recipes/rerank). It is now available as a standalone Data Designer plugin for generating high-quality, complex, multi-document, multi-hop retrieval data compatible with [AutoModel](https://github.com/NVIDIA-NeMo/Automodel). - -This post walks through what the plugin does, why the generated labels matter, and how to make your first small run useful before you scale it up. - ---- - -## **From Documents to Retriever Data** - -The plugin packages a four-stage Data Designer pipeline: - -![Retriever SDG pipeline: source documents flow through document bundling and chunking, artifact extraction and QA generation, deduplication and judging, and conversion into training and evaluation artifacts](assets/retrieval-sdg-toolkit/pipeline.svg){ style="max-width:100%; height:auto" } - -The package contributes two Data Designer extensions: - -| Plugin | Type | Why it matters | -| --- | --- | --- | -| `document-chunker` | seed reader | Turns text files into sentence chunks with stable segment IDs, so each query can point back to the passages that answer it. | -| `embedding-dedup` | column generator | Removes near-duplicate generated questions before judging and export, so the training data has more variety. 
| - -For local runs, the current package exposes a Python API and a CLI: - -| Surface | Use it when | -| --- | --- | -| `build_qa_generation_pipeline(...)` | You want to customize the Data Designer config in Python. | -| `data-designer-retrieval-sdg generate` | You want the packaged end-to-end generation flow. | -| `data-designer-retrieval-sdg convert` | You want trainer-ready and BEIR-ready files from generated JSON. | - -This is still Data Designer: users declare the corpus and generation settings; the engine handles dependency ordering, model calls, async scheduling, previews, and dataset output. - ---- - -## **Step 1: Chunk Documents So Labels Survive** - -For retriever training, chunking is not just preprocessing. The chunk IDs become labels. If a generated query uses chunks 3, 7, and 8, those IDs have to survive generation, filtering, splitting, and export. - -The `document-chunker` seed reader handles that boundary: - -```python -from data_designer_retrieval_sdg.seed_source import DocumentChunkerSeedSource - -seed_source = DocumentChunkerSeedSource( - path="./docs", - file_pattern="*", - recursive=True, - file_extensions=[".txt", ".md"], - min_text_length=50, - sentences_per_chunk=5, - num_sections=1, -) -``` - -Each row includes the original file name, full text, sentence chunks, structured section text, and bundle metadata. The important part is that chunks carry `chunk_id` values. Those IDs are what later become positive documents in training and qrels. 
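To make the "chunk IDs become labels" idea concrete, here is a minimal sketch of sentence chunking with stable IDs. It is not the `document-chunker` implementation: the `Chunk` class, the `chunk_sentences` helper, the naive regex sentence splitter, and the zero-based numbering are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: int  # position of the chunk within its document (hypothetical, zero-based)
    text: str


def chunk_sentences(text: str, sentences_per_chunk: int = 5) -> list[Chunk]:
    """Group sentences into fixed-size chunks with sequential IDs (illustrative only)."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        Chunk(
            chunk_id=start // sentences_per_chunk,
            text=" ".join(sentences[start:start + sentences_per_chunk]),
        )
        for start in range(0, len(sentences), sentences_per_chunk)
    ]


doc = "A one. B two. C three. D four. E five."
chunks = chunk_sentences(doc, sentences_per_chunk=2)

# Lookup table from ID to text: the same IDs can later be referenced by generated queries.
by_id = {c.chunk_id: c.text for c in chunks}
```

Because the IDs depend only on the document text and the chunk size, re-running the chunker on unchanged input yields the same IDs, which is the property that lets labels survive later filtering and export.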
- -For questions that span multiple documents, such as "How does the migration guide change the deployment recommendation from the architecture overview?", enable multi-document bundling: - -```python -seed_source = DocumentChunkerSeedSource( - path="./docs", - file_extensions=[".txt", ".md"], - multi_doc=True, - bundle_size=2, - bundle_strategy="doc_balanced", - max_docs_per_bundle=3, -) -``` - -That gives the model opportunities to generate cross-document questions while still tracking which document each segment came from. - ---- - -## **Step 2: Generate Questions That Point Back to Evidence** - -The pipeline first extracts document artifacts - concepts, relationships, themes, entities, processes, insights, technical terms, and contextual factors. Then it asks the model to generate standalone questions grounded in the chunked context. - -As a library, the path is compact: - -```python -from data_designer.interface import DataDesigner -from data_designer_retrieval_sdg import ( - DocumentChunkerSeedSource, - build_qa_generation_pipeline, -) - -seed_source = DocumentChunkerSeedSource( - path="./docs", - file_extensions=[".txt", ".md"], - sentences_per_chunk=5, -) - -config_builder = build_qa_generation_pipeline( - seed_source=seed_source, - num_pairs=7, - min_hops=2, - max_hops=4, - min_complexity=4, - similarity_threshold=0.9, -) - -results = DataDesigner().create( - config_builder=config_builder, - num_records=200, - dataset_name="retrieval_sdg", -) -``` - -A useful generated example looks like this: - -```json -{ - "question": "How do the deployment requirements change once the system moves from evaluation to production?", - "answer": "Production adds stricter reliability, monitoring, and access-control requirements beyond the evaluation setup.", - "question_complexity": 4, - "query_type": "multi_hop", - "reasoning_type": "procedural", - "segment_ids": [3, 7, 8], - "hop_count": 2, - "hop_contexts": [ - {"hop_number": 1, "segment_ids": [3], "summary": "Evaluation 
setup and baseline requirements."}, - {"hop_number": 2, "segment_ids": [7, 8], "summary": "Production deployment constraints."} - ] -} -``` - -Notice what is different from a generic QA generator: - -- The question does not say "according to segment 3." -- The answer is grounded in the source text. -- The `segment_ids` preserve the retrieval labels. -- Multi-hop questions keep hop-level evidence summaries. - -That combination is what makes the data useful for retriever training and not just QA evaluation. - ---- - -## **Step 3: Deduplicate and Judge Before Export** - -Synthetic generators are enthusiastic. Ask for seven questions per document across a large corpus and you will get repeats: the same policy phrased three ways, the same setup requirement asked with slightly different wording, the same "how does X relate to Y" pattern over and over. - -This stage has two gates: first remove near-repeated questions, then judge whether the remaining examples are grounded enough to train or evaluate a retriever. - -### **Deduplicate Near-Repeated Questions** - -The `embedding-dedup` column removes near duplicates inside each generated list: - -```python -from data_designer_retrieval_sdg.config import EmbeddingDedupColumnConfig - -config_builder.add_column( - EmbeddingDedupColumnConfig( - name="deduplicated_qa_pairs", - source_column="qa_generation", - items_key="pairs", - text_field="question", - model_alias="embed", - similarity_threshold=0.9, - ) -) -``` - -The implementation embeds the question text, computes cosine similarity, and greedily drops items above the threshold. It also implements native `agenerate()`, so it participates directly in [Data Designer's async scheduler](async-all-the-way-down.md) and uses `model.agenerate_text_embeddings(...)` instead of becoming a separate side job. - -This is a small detail that has a large downstream effect: fewer duplicate queries means cleaner training data and more informative held-out evals. 
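The greedy step described above (embed, compare by cosine similarity, drop items above the threshold) can be sketched in a few lines. This is a simplified, synchronous illustration, not the plugin's code: the `greedy_dedup` helper is hypothetical, and the two-dimensional vectors stand in for real embedding-model output.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_dedup(items: list[str], embeddings: list[list[float]],
                 similarity_threshold: float = 0.9) -> list[str]:
    """Keep an item only if it is not too similar to any item already kept."""
    kept: list[int] = []  # indices of retained items, in input order
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= similarity_threshold for j in kept):
            kept.append(i)
    return [items[i] for i in kept]


questions = [
    "How do I rotate keys?",
    "How can I rotate keys?",          # near-duplicate of the first question
    "What is the retention policy?",
]
# Toy embeddings (assumed values); real ones would come from an embedding model.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]

unique_questions = greedy_dedup(questions, vectors, similarity_threshold=0.9)
```

The result depends on input order: the first question in a cluster of near-duplicates is the one that survives. Lowering the threshold drops more items, which matches the "more aggressive" direction mentioned in the tuning table later in the post.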
- -### **Judge Grounded Quality** - -Retriever data quality is easy to overestimate. A generated question might sound fluent but be unsupported. An answer might be correct but require a chunk that was not marked positive. A multi-hop question might only need one hop in practice. - -The plugin adds an LLM judge column after deduplication. Each retained QA pair is scored for: - -- Relevance -- Factual accuracy -- Context support -- Clarity -- Overall quality - -The converter defaults to `--quality-threshold 7.0`, keeping only pairs whose overall score passes the threshold. It also drops records where the number of judged pairs does not match the number of deduplicated pairs, because silent misalignment is worse than losing a row. - -Your first inspection pass should focus on the rejected and borderline examples. If many low-scoring examples share the same failure mode, tune chunk size, document cleanup, model choice, or question complexity before scaling up. - ---- - -## **Step 4: Export What Training and Eval Actually Need** - -The final conversion step rebuilds a deduplicated corpus from the generated chunks, maps `segment_ids` to positive document IDs, filters by quality, and writes both training and evaluation formats. - -For training: - -```text -train.json -val.json -corpus/ - train.parquet - merlin_metadata.json -``` - -For evaluation: - -```text -eval_beir/ - corpus.jsonl - queries.jsonl - qrels/ - test.tsv -``` - -This is one of the main reasons the plugin exists. It is easy to generate questions. It is harder to keep training examples, corpus records, and qrels aligned enough that the numbers mean something. - ---- - -## **How to Know the First Run Is Working** - -Before scaling, look at a small sample and ask: - -- Would a real user ask this question? -- Can the answer be supported by the listed `segment_ids`? -- Are multi-hop examples genuinely multi-hop, or would one passage answer them? 
-- Are rejected examples failing because the source text is messy, the chunks are too small, or the model is too weak? -- Does the BEIR eval contain held-out documents that are meaningfully different from training documents? - -Then iterate: - -| Symptom | Try | -| --- | --- | -| Questions are too shallow | Raise `--min-complexity`, increase `--min-hops`, or use a stronger generation model. | -| Answers are unsupported | Lower chunk size, clean input documents, or raise the quality threshold. | -| Too many duplicates | Lower `--similarity-threshold` to make dedup more aggressive. | -| Cross-document eval is weak | Enable `--multi-doc` and use `doc_balanced` or `interleaved` bundling. | -| Not enough examples survive filtering | Add more documents, lower the quality threshold carefully, or improve document formatting. | - -The goal of the first run is not volume. The goal is to learn how your corpus behaves. - ---- - -## **How Plugins Unlock Custom Retrieval Pipelines** - -Retrieval SDG needs document-specific seed reading, question deduplication, quality judging, and conversion logic. Packaging those pieces as a plugin gives teams a repeatable path from their own corpus to retriever data while preserving declarative Data Designer configs. - -The retrieval SDG plugin includes: - -- A seed reader with a stable config schema and tests. -- A reusable embedding-dedup column that can be used outside this pipeline. -- A CLI for generating and converting retrieval data. -- Conversion logic for retriever training and BEIR evaluation. -- Compatibility metadata and installation through the default NVIDIA Data Designer plugin catalog. - -Users still write declarative configs: - -```python -from data_designer_retrieval_sdg import DocumentChunkerSeedSource -from data_designer_retrieval_sdg import build_qa_generation_pipeline -``` - -No registry mutation. No engine internals. No custom chunking pre-process that has to stay manually aligned with supporting evidence. 
- -That is the bigger plugin story: Data Designer provides the orchestration framework, and plugins package domain-specific pieces for custom use cases without bloating the core library. - ---- - -## **Try It Yourself** - -Do not start by generating a million examples. Pick 20-100 representative documents, run a preview, inspect the labels, and only then scale up. - -Install the plugin: - -```bash -data-designer plugin install retrieval-sdg -``` - -Run a preview: - -```bash -data-designer-retrieval-sdg generate \ - --input-dir ./my_documents \ - --output-dir ./generated_output \ - --num-files 50 \ - --num-pairs 7 \ - --preview -``` - -If the preview looks reasonable, run the full job: - -```bash -data-designer-retrieval-sdg generate \ - --input-dir ./my_documents \ - --output-dir ./generated_output \ - --num-files 50 \ - --num-pairs 7 -``` - -Convert the generated data: - -```bash -data-designer-retrieval-sdg convert ./generated_output \ - --corpus-id my_corpus \ - --quality-threshold 7.0 -``` - -That produces the training and evaluation artifacts you need to keep moving: - -```text -generated_output_train_eval/ - train.json - val.json - corpus/ - train.parquet - merlin_metadata.json - eval_beir/ - corpus.jsonl - queries.jsonl - qrels/ - test.tsv -``` - -Start here: - -- [data-designer-retrieval-sdg on GitHub](https://github.com/NVIDIA-NeMo/DataDesignerPlugins/tree/main/plugins/data-designer-retrieval-sdg) -- [DataDesignerPlugins catalog](https://github.com/NVIDIA-NeMo/DataDesignerPlugins) -- [Domain-specific embedding fine-tuning recipe](https://huggingface.co/blog/nvidia/domain-specific-embedding-finetune) -- [Retrieval-Synthetic-NVDocs-v1 dataset](https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1) -- [NeMo embedding fine-tune recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/embed) -- [NeMo reranking fine-tune 
recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/e6e8a3281a11b8e1b7b47af098bbf54416c68d47/src/nemotron/recipes/rerank) -- [AutoModel](https://github.com/NVIDIA-NeMo/Automodel) - -If your RAG system is failing because the retriever does not understand your domain, this is the action step: create the data that lets you measure and improve it. Bring a folder of documents, run the plugin, inspect the labels, and use the output to train and evaluate the retriever you actually need. diff --git a/docs/devnotes/posts/rqa.md b/docs/devnotes/posts/rqa.md deleted file mode 100644 index a72c33fa1..000000000 --- a/docs/devnotes/posts/rqa.md +++ /dev/null @@ -1,247 +0,0 @@ ---- -date: 2026-02-04 -authors: - - dcorneil - - etramel ---- - -# **Graduate-Level Science Reasoning Data with NeMo Data Designer** - -Using NeMo Data Designer, we created the RQA (Reasoning Question-Answer) dataset: a massive collection of graduate-level, reasoning-heavy science samples designed to push the boundaries of model performance. - - - ---- - -Inference-time reasoning has transformed LLM capabilities, boosting performance in difficult domains like math and science. While reasoning is introduced in the post-training phase using Reinforcement Learning (RL), it builds on patterns that the model has seen throughout pretraining. In fact, research from NVIDIA has shown that [front-loading examples of reasoning into the pretraining phase](https://research.nvidia.com/labs/adlr/Synergy/) can have a positive, compounding impact on the quality of the final model. When training Nemotron 3 Nano, our goal was to introduce rich and diverse examples of reasoning directly into pretraining, laying the groundwork for reasoning RL in post-training. - -Using NeMo Data Designer, we created the RQA (Reasoning Question-Answer) dataset: a massive collection of graduate-level, reasoning-heavy science samples designed to push the boundaries of model performance. 
Each sample contains a question, a trace from a reasoning LLM attempting to answer that question, and the final resulting answer. As we’ll show in the results, introducing RQA into pretraining **didn’t just result in stronger scientific reasoning \- it improved math and coding performance as well**. - -This blog post walks you through how we built it, and how you can adapt our approach for your own reasoning-intensive datasets. - -![RQA Blog](../../images/rqa-blog.png) - ---- - -## **Step 1: Curating High-Quality Science Seeds from Essential-Web** - -For our reasoning dataset, we knew that both quality and diversity were critical. We wanted to show the model examples of reasoning through difficult scientific problems, and we wanted to make sure that those problems covered as wide a range of scientific domains as possible. Using seed passages from web text was an obvious choice, because it allowed us to use the seed data to control both quality and diversity. - -We started with [Essential-Web](https://arxiv.org/abs/2506.14111), a Common Crawl (web text) dataset where each document has been labelled with respect to both quality and subject. For instance, documents are labelled with an estimated *Education Level*, where *Graduate Level* indicates that the text “requires graduate-level education or domain expertise. Assumes deep background knowledge and specialized training to comprehend”. These labels let us rapidly filter down the documents to the highest-quality seeds for our scientific reasoning dataset. - -Starting from the [STEM subset of Essential-Web](https://huggingface.co/datasets/EssentialAI/eai-taxonomy-stem-w-dclm), we filtered to documents that were: - -1. Undergraduate-to-graduate education level -2. Advanced reasoning depth -3. High technical correctness -4. 
Advanced [Bloom taxonomy levels](https://en.wikipedia.org/wiki/Bloom's_taxonomy) for both cognitive processes (Analyze, Evaluate or Create) and knowledge domains (Conceptual, Procedural or Metacognitive) -5. In the English language and over 1000 characters. - -The resulting subset consisted of roughly 14 million documents, mostly academic. Since many of the documents were very long, we extracted random chunks of \<4096 characters in length. - -Essential-AI also labelled the documents according to the [Free Decimal Correspondence (FDC) code](https://everybodyslibraries.com/about-the-free-decimal-correspondence/#:~:text=What%20is%20the%20Free%20Decimal,group%20of%20subjects%20and%20disciplines.), a public-domain analogue of the Dewey Decimal system. Using the FDC code, we could see that the topics weren’t equally balanced across scientific domains; for instance, Medicine & Health was heavily over-represented. Since we planned to generate \<14 million samples in total, we aimed to capture as broad a range of topics as possible in the subset of seeds we used. - -To arrive at a smaller set of seed documents balanced by topic, we used a hierarchical round-robin approach. First, we rotated between selecting seed documents across 8 major domains (Biology, Chemistry, Computer Science, Engineering, Math, Medicine/Health, Physics, and Other). Within each high-level domain, we further rotated between seed documents based on their 3-digit FDC codes; for instance, given a Physics sample with code 535 (*Light*), the next Physics sample might be from code 536 (*Heat*), then 537 (*Electricity*) and so on, ensuring that no single subdomain dominates. We continued the round robin selection at the first and second decimal place of the FDC code, where they existed. - -We tested approaches using both the first 4.5 million and the first 9 million seeds according to the round-robin approach described above. 
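The hierarchical round-robin described above can be sketched as a recursive rotation over grouping keys. This is an illustration under stated assumptions, not the code used to build RQA: the `round_robin` helper, the `domain` and `fdc` field names, and the sorted visiting order of groups are all hypothetical, and only two levels (domain, then 3-digit FDC code) are shown. Deeper levels for the first and second decimal places would be added as further key functions.

```python
from collections import defaultdict
from itertools import islice

_DONE = object()  # sentinel marking an exhausted group


def round_robin(docs, key_fns):
    """Yield docs, rotating across groups at each level of key_fns."""
    if not key_fns:
        yield from docs
        return
    groups = defaultdict(list)
    for doc in docs:
        groups[key_fns[0](doc)].append(doc)
    # One lazy sub-rotation per group, visited in sorted key order (an assumption).
    iterators = [round_robin(groups[key], key_fns[1:]) for key in sorted(groups)]
    while iterators:
        still_active = []
        for it in iterators:
            doc = next(it, _DONE)
            if doc is _DONE:
                continue  # this group is exhausted; drop it from the rotation
            yield doc
            still_active.append(it)
        iterators = still_active


docs = [
    {"id": "p1", "domain": "Physics", "fdc": "535.2"},
    {"id": "p2", "domain": "Physics", "fdc": "535.4"},
    {"id": "p3", "domain": "Physics", "fdc": "536.1"},
    {"id": "m1", "domain": "Medicine/Health", "fdc": "610.7"},
    {"id": "m2", "domain": "Medicine/Health", "fdc": "610.7"},
    {"id": "m3", "domain": "Medicine/Health", "fdc": "616.0"},
]

key_fns = [
    lambda d: d["domain"],   # level 1: major domain
    lambda d: d["fdc"][:3],  # level 2: 3-digit FDC code
]

ordered_ids = [d["id"] for d in round_robin(docs, key_fns)]
first_four = [d["id"] for d in islice(round_robin(docs, key_fns), 4)]
```

Taking the first N items of the rotation, as `islice` does here, mirrors the idea of using "the first 4.5 million" or "the first 9 million" seeds: any prefix of the ordering stays as balanced across domains and subdomains as the available documents allow.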
- ---- - -## **Step 2: Generating Challenging Questions** - -With our seed documents ready, we moved to NeMo Data Designer to design the actual dataset. While the seed documents ground our dataset in the types of advanced scientific topics we’re interested in, they don’t typically show the *active process* of thinking through a difficult scientific problem; instead, scientific papers usually show the polished end result of advanced reasoning. This is where LLMs come in. - -We first needed examples of the type of tough questions that Nemotron might be asked by a user in the real world. To do this, we used Data Designer to prompt a reasoning-enabled LLM to generate a graduate-level question *inspired by* each seed passage: - -```py -import data_designer.config as dd -from data_designer.interface import DataDesigner - -QUESTION_PROMPT = """ -Task: Generate a highly challenging, graduate-level reasoning question -inspired by the following passage. - -Follow these instructions: -1. The text serves only as inspiration for the question. You *must not* - reference the text directly in any way. -2. The question should be appropriate for an advanced graduate-level exam - in a course specialized in this topic. -3. Ensure that the question requires higher-order reasoning beyond simple - recall, such as mathematical reasoning, quantitative analysis, or synthesis. -4. Tag the question with "Question:". 
- -Text: -{{ seed_passage }} - -Question: [question] -""" - -# Configure the workflow with a reasoning-enabled model -config = dd.DataDesignerConfigBuilder(model_configs=[ - dd.ModelConfig( - alias="reasoning-model", - model="qwen/qwen3-235b-a22b", - provider="nvidia", - ), -]) - -config.with_seed_dataset( - dd.LocalFileSeedSource(path="path/to/seed_data.parquet"), - sampling_strategy=dd.SamplingStrategy.SHUFFLE, -) - -config.add_column( - dd.LLMTextColumnConfig( - name="question", - prompt=QUESTION_PROMPT, - model_alias="reasoning-model", - ) -) -``` - -Note that our prompt emphasizes that the question shouldn’t reference the source passage. We want questions that stand on their own, without including the source passage itself; since these are passages from Common Crawl, we can expect that they appear in the pretraining data already, and our focus here is on generating new tokens. - ---- - -## **Step 3: Generating High-Quality Answers with Reasoning Traces** - -If you’ve ever tried to read a teacher’s answer key before, you know that sometimes the person who *wrote* the question isn’t always the best at explaining how to *answer* it. In the real world, reasoning involves a lot of what-ifs, dead ends and backtracking \- the types of behavior we can only get from a model when it has never seen the question before. This is why we chose to decouple answer generation from question generation, ensuring that the model doesn’t have any context about how the question was generated or the source passage itself when it attempts to answer it. - -Below, we prompt the LLM directly with the questions we generated above, then capture the resulting reasoning trace and final answer for our RQA samples. 
- -```py -config.add_column( - dd.LLMTextColumnConfig( - name="answer", - prompt="{{ question }}", # Present just the question - model_alias="reasoning-model", - extract_reasoning_content=True, # Extract reasoning into separate column - ) -) - -# Combine question, reasoning trace, and answer into final sample -config.add_column( - dd.ExpressionColumnConfig( - name="rqa_sample", - expr="{{ question }}\n\n{{ answer__reasoning_content }}\n\n{{ answer }}", - ) -) -``` - -In the resulting dataset, we see the following new columns concatenated to the seed data columns: - -- `question` -- `answer` -- `answer__reasoning_content` -- `rqa_sample` - -The `question` and `answer` columns are the final result of the calls to our reasoning LLM, while `answer__reasoning_content` is the reasoning trace generated by the LLM when generating the answer. Typically we discard the reasoning trace, but here it’s critical: we want to include the model’s chain-of-thought in the final training data, distilling strong priors in Nemotron Nano 3 about *how* to work through a challenging problem. The final column, `rqa_sample`, uses Jinja2 syntax to combine all three fields into the final sample for training. - -We chose to use the same frontier reasoning model to answer the questions as we used to generate them \- leveraging the model’s advanced capabilities both for formulating a tough, well-formed question and for puzzling through the answer. But with Data Designer, this choice is up to you \- you can mix-and-match models any way you like. - ---- - -## **Results: Measurable Improvements in STEM Reasoning** - -To evaluate the impact of the RQA data, we ran continued pretraining experiments on an internal checkpoint of [Nemotron-H 8B](https://research.nvidia.com/labs/adlr/nemotronh/). Nemotron-H used a two-phase pretraining approach (you can read more about it in our white paper [here](https://arxiv.org/pdf/2504.03624)). 
We intervened at the Phase 2 training stage, comparing the result of replacing either 4% or 8% of the existing data blend with RQA samples (taking weight from high-quality Common Crawl data). We ran the intervention for 18k steps, between a checkpoint at 140k steps and a checkpoint at 158k steps. - -| Data Blend | Validation Loss (↓) | MMLU-Pro (with CoT, ↑) | Math 500 (with CoT, ↑) | GSM8K (with CoT, ↑) | Humaneval+ (↑) | MBPP+ (↑) | -| :---- | :---- | :---- | :---- | :---- | :---- | :---- | -| **Baseline data blend (140k steps)** | 1.309 | 36.99 | \- | 79.98 | 38.14 | 48.68 | -| **Baseline data blend (158k steps)** | 1.258 | 43.39 | 71.00 | 81.96 | 42.71 | 53.31 | -| **with RQA (4.5m @4%, 158k steps)** | 1.256 | 44.31 | **73.40** | 82.79 | **47.20** | **54.84** | -| **with RQA (9m @8%, 158k steps)** | **1.255** | **45.80** | **73.40** | **84.76** | 45.61 | 53.80 | - -One of the most surprising (and exciting\!) results was that RQA didn’t just improve performance on tests of scientific reasoning like MMLU-Pro \- it also improved performance on benchmarks associated with math reasoning (Math 500, GSM8K) and coding capabilities (Humaneval+, MBPP+). This shows how early introduction of advanced reasoning capabilities can produce robust improvements across different domains. - -You can check out the RQA dataset we generated for Nemotron 3 Nano [here](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Specialized-v1/viewer/Nemotron-Pretraining-RQA). - ---- - -## **Get Started with Data Designer** - -Apart from the seed data, the entire pipeline is reproducible using NeMo Data Designer. Note how Data Designer handles complex data formatting with ease, leveraging Jinja2 templates in prompt generation and built-in logic to extract reasoning traces from model responses. 
- -```py -import data_designer.config as dd -from data_designer.interface import DataDesigner - -# Configure your model -model_configs = [ - dd.ModelConfig( - alias="reasoning-model", - model="qwen/qwen3-235b-a22b", - provider="nvidia", - inference_parameters=dd.ChatCompletionInferenceParams( - max_tokens=8192, - timeout=300, # 5 minute timeout for long reasoning chains - ), - ), -] - -# Build the workflow -config = dd.DataDesignerConfigBuilder(model_configs=model_configs) -config.with_seed_dataset( - dd.LocalFileSeedSource(path="path/to/your_seed_data.parquet"), - sampling_strategy=dd.SamplingStrategy.SHUFFLE, -) - -# Generate questions -config.add_column( - dd.LLMTextColumnConfig( - name="question", - prompt=QUESTION_PROMPT, - model_alias="reasoning-model", - ) -) - -# Generate answers with reasoning trace -config.add_column( - dd.LLMTextColumnConfig( - name="answer", - prompt="{{ question }}", - model_alias="reasoning-model", - extract_reasoning_content=True, # Extract reasoning into separate column - ) -) - -# Combine into final sample -config.add_column( - dd.ExpressionColumnConfig( - name="rqa_sample", - expr="{{ question }}\n\n{{ answer__reasoning_content }}\n\n{{ answer }}", - ) -) - -# Run generation and save to disk -data_designer = DataDesigner() -result = data_designer.create( - config_builder=config, - num_records=N_RECORDS, - dataset_name="rqa_dataset", -) -``` - ---- - -## **Summary** - -The RQA dataset demonstrates that targeted synthetic data generation can meaningfully improve advanced reasoning capabilities. By: - -1. Curating high-quality scientific seed data -2. Generating challenging, standalone questions from those seeds -3. Using powerful reasoning models to reason through how to answer those questions - -… we created a dataset that pushes models toward graduate-level science reasoning \- and generalizable improvements on math and code as well. - -Key Resources: - -1. 
[NeMo Data Designer on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner) -2. [Nemotron 3 Nano Technical Report](https://arxiv.org/pdf/2512.20848) -3. [Essential-Web](https://arxiv.org/abs/2506.14111) - -The workflow is fully configurable and extensible: swap in your own seed data, adjust the prompts, or add custom validators. Data Designer makes it possible to iterate rapidly on synthetic data pipelines, turning what used to be months of manual annotation into hours of programmable generation. - ---- - -*Want to learn more about NeMo Data Designer? Check out our [documentation](https://github.com/NVIDIA-NeMo/DataDesigner) and start building your own high-fidelity synthetic datasets today.* diff --git a/docs/devnotes/posts/search-agent.md b/docs/devnotes/posts/search-agent.md deleted file mode 100644 index a635572e9..000000000 --- a/docs/devnotes/posts/search-agent.md +++ /dev/null @@ -1,587 +0,0 @@ ---- -date: 2026-03-12 -authors: - - dnathawani ---- - -# **Search Agent SFT Data: Teaching LLMs to Browse the Web** - -Training search agents requires trajectory data --- the full multi-turn interaction showing how a model searches, reads, reasons, and answers. We built a four-stage pipeline that generates synthetic search trajectories from Wikidata knowledge graph paths, converts them into BrowseComp-style riddles using NeMo Data Designer, generates multi-step search rollouts with live web search via Tavily, and post-processes the results into SFT-ready training data. - - - ---- - -## **Why This Matters: The Agentic Shift** - -The industry is moving from **models that answer questions** to **agents that take actions**. Real-world AI applications orchestrate multiple steps --- searching the web, querying databases, reading documents, calling APIs --- with the LLM as the reasoning engine deciding *what to do next*. 
- -Consider this question from OpenAI's [BrowseComp](https://openai.com/index/browsecomp/) benchmark: - -> *Between 1990 and 1994 inclusive, what teams played in a soccer match with a Brazilian referee that had four yellow cards, two for each team where three of the total four were not issued during the first half, and four substitutions, one of which was for an injury in the first 25 minutes of the match.* -> -> **Answer:** Ireland v Romania - -You can't answer this from memory. You need to search, read results, refine your query, search again, and piece it together --- exactly what we want AI agents to do. Training a model for this requires trajectory data: the full record of every search query, every result evaluation, and every reasoning step, not just the final answer. - -Creating this data by hand takes 15-30 minutes per example. At the thousands of trajectories needed for SFT, that's months of annotation work. We needed a way to generate it synthetically. - ---- - -## **End-to-End Pipeline Architecture** - -``` - SEARCH AGENT SFT PIPELINE - ========================= - - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 1: SEED DATA (Wikidata KG Walks) │ - │ │ - │ Random walks on the Wikidata knowledge graph │ - │ ├─ Anti-meta filters (no category/template/list-y nodes) │ - │ ├─ Hop range: 4 minimum, 8 maximum │ - │ └─ SPARQL queries to fetch neighbors │ - │ │ - │ Output: seed JSONL with hops[], seed_entity, final_answer_entity, path_length │ - │ 50,000 seeds generated │ - └─────────────────────────────────────────┬───────────────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 2: SEARCH RIDDLE GENERATION (Data Designer) │ - │ │ - │ user_query_draft ──────────► user_query_obfuscated │ - │ (chain clues from path, (BrowseComp-style rewrite: │ - │ hide intermediate nodes, concise, natural, no breadcrumbs, │ - │ don't name the 
answer) 1-2 sentences max) │ - │ │ - │ + Heuristic filters: answer leakage, intermediate node leakage, INVALID_PATH │ - │ 50,000 → 37,000 valid seeds → 24,000 valid questions │ - └─────────────────────────────────────────┬───────────────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 3: SEARCH TRAJECTORY ROLLOUTS │ - │ │ - │ Thought-Action-Observation loop with live web search (Tavily API) │ - │ ├─ Rollout model: MiniMax-M2 (strong BrowseComp + tool-calling scores) │ - │ ├─ Average ~12 tool calls per sample │ - │ ├─ Multiple rollouts per question for rejection sampling │ - │ └─ 6,974 completed (stop) / 177 truncated (length) │ - │ │ - │ 24,000 questions → 7,000 valid trajectories │ - └─────────────────────────────────────────┬───────────────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 4: POST-PROCESSING → SFT DATASET │ - │ │ - │ ├─ Normalize tool outputs to consistent JSON "tool response" shape │ - │ ├─ Drop broken/truncated interactions │ - │ ├─ Select best rollout per question (min tool calls among correct) │ - │ ├─ Write OpenAI-messages style: messages[], tools[], metadata{} │ - │ └─ Manual review + LLM spot-checking (Gemini) │ - │ │ - │ ~7,000 SFT records for Nemotron Super │ - └─────────────────────────────────────────────────────────────────────────────────────┘ -``` - ---- - -## **Step 1: Seed Data from Wikidata Knowledge Graph Walks** - -The core idea: start at a random entity in the [Wikidata](https://www.wikidata.org/) knowledge graph and perform a random walk through its relations, producing a chain of hops that becomes a multi-hop search problem. Each chain provides a `seed_entity` (start), a `final_answer_entity` (destination), and a `readable_path` describing the edges traversed. - -We used **Wikidata SPARQL queries** to fetch neighbors at each hop. 
The number of hops is directly proportional to the number of tool calls the model would need to solve questions derived from that path --- more hops means harder riddles. - -![Wikidata subgraph showing a random walk from NVIDIA through Jensen Huang, Oregon State University, Benton County to Thomas Hart Benton](assets/search-agent/wikidata-graph-walk.png){ style="max-width:100%; height:auto" } - -``` -START ENTITY: NVIDIA (Q182477) - ⬇ [chief executive officer (P169)] - NODE: Jensen Huang (Q332838) - ⬇ [educated at (P69)] - NODE: Oregon State University (Q861888) - ⬇ [located in the administrative territorial entity (P131)] - NODE: Benton County (Q115372) - ⬇ [named after (P138)] - NODE: Thomas Hart Benton (Q178712) -``` - -``` -START ENTITY: toothache (Q143925) - ⬇ [risk factor (P564)] - NODE: smoking (Q662860) - ⬇ [has effect (P1542)] - NODE: Crohn's disease (Q1472) - ⬇ [drug or therapy used for treatment (P2176)] - NODE: TNF inhibitor (Q1536078) - ⬇ [(is possible treatment of) (P2175)] reverse relation - NODE: Behçet's disease (Q911427) - ⬇ [symptoms and signs (P780)] - NODE: inflammation (Q101991) - ⬇ [drug or therapy used for treatment (P2176)] - NODE: (±)-flurbiprofen (Q419890) - ⬇ [significant drug interaction (P769)] - NODE: parecoxib (Q347941) - ⬇ [significant drug interaction (P769)] - NODE: ibuprofen (Q186969) -``` - -### Heuristics to Keep Walks Coherent - -Unrestricted random walks go off the rails quickly --- you'd get paths like `CEO → Human Being → Civilization → Indus Valley`. We applied several filters: - -- **Anti-meta filters:** Avoid category nodes, template pages, list-y entities, and other degenerate hops that exist for Wikidata bookkeeping rather than representing real-world relationships. -- **Hop range: 4 minimum, 8 maximum.** Below 4 hops, the questions aren't difficult enough to require multi-step search. Above 8, the path wanders off-topic and produces unsolvable riddles. 
-- **Generic entity filtering:** Remove seeds where the `final_answer_entity` is too abstract ("technology", "people", "field", "concept"). These produce questions where any answer could be correct. - -The resulting seed dataset: **50,000 JSONL records**, each containing `hops[]`, `seed_entity`, `final_answer_entity`, `readable_path`, and `path_length`. - -### A Note on Ground Truth Staleness - -An important caveat when using Wikidata as a seed source: the knowledge graph reflects a snapshot in time. Models with current parametric knowledge or live search results may find answers that are factually correct *today* but disagree with the KG-derived ground truth. For example, a question about "which country contains the headquarters of the owner of U.S. Steel?" has ground truth "United States" from Wikidata --- but Nippon Steel (Japan) announced its acquisition of U.S. Steel in Dec 2023 and completed it in June 2025, making "Japan" the correct answer now. This staleness affects both question quality (paths through outdated facts) and evaluation (correct model answers flagged as wrong). We revisit this challenge in the [Correctness Challenge](#the-correctness-challenge) section below. - ---- - -## **Step 2: Creating Search Riddles with Data Designer** - -Each seed path needs to be converted into two things: a search question (obfuscated so it doesn't leak the answer) and a ground truth target entity (the final node in the path). We use two chained LLM columns in Data Designer for this. - -**Stage 1 --- Draft question:** Chain clues from the knowledge path into a multi-hop riddle. Critical rules: don't name intermediate nodes, don't name the final answer, skip weak or illogical hops, and output `INVALID_PATH` if the path is unsalvageable. - -**Stage 2 --- Obfuscated question:** Rewrite the draft in BrowseComp style --- concise, natural, 1-2 sentences max. The solver must figure out *what* to search rather than following explicit breadcrumbs.
No relational breadcrumbing like "X is member of Y; Y is based in Z...". - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -config = dd.DataDesignerConfigBuilder(model_configs=[ - dd.ModelConfig( - alias="riddle-gen", - model="qwen/qwen3-235b-a22b", - provider="nvidia", - ), -]) - -config.with_seed_dataset( - dd.LocalFileSeedSource(path="search_agent_seeds.parquet"), - sampling_strategy=dd.SamplingStrategy.SHUFFLE, -) - -# Stage 1: Draft question from knowledge path -config.add_column(dd.LLMTextColumnConfig( - name="user_query_draft", - model_alias="riddle-gen", - prompt=( - "You are an expert Search Evaluator designing Grandmaster-Level search tests.\n" - "Create a complex, multi-step search riddle based on this knowledge path:\n\n" - "{{ readable_path }}\n\n" - "Start Entity: {{ seed_entity }}\n" - "Final Answer Entity: {{ final_answer_entity }}\n\n" - "RULES:\n" - "1. DO NOT name the intermediate nodes. Hide them behind descriptions.\n" - "2. DO NOT name the Final Answer.\n" - "3. Chain the clues logically.\n" - "4. If a step is weak or nonsensical, IGNORE IT.\n" - "5. Output INVALID_PATH if the path is unsalvageable.\n\n" - "Return ONLY the question string (or INVALID_PATH)." - ), -)) - -# Stage 2: BrowseComp-style obfuscation -config.add_column(dd.LLMTextColumnConfig( - name="user_query_obfuscated", - model_alias="riddle-gen", - prompt=( - "Rewrite this search riddle to be MORE obfuscated and natural.\n\n" - "Original: {{ user_query_draft }}\n" - "Secret path: {{ readable_path }}\n\n" - "REQUIREMENTS:\n" - "1. DO NOT reveal the step-by-step plan. No breadcrumb chains.\n" - "2. DO NOT name intermediate or final entities.\n" - "3. 1-2 sentences max. Sound like a real user question.\n" - "4. If original == INVALID_PATH, output INVALID_PATH.\n\n" - "Return ONLY the rewritten question." 
- ), -)) -``` - -**Example transformation (NVIDIA path):** - -``` -Draft: "Starting from NVIDIA, identify the current CEO, then find - where they received their bachelor's degree, determine which - county houses that university's main campus, and finally - identify the nickname of the 19th-century U.S. Senator - the county is named after." - -Obfuscated: "Identify the nickname ('Old ____') of the 19th-century U.S. - Senator who is the namesake of the specific county that houses - the main campus of the university where the current CEO of - NVIDIA received his bachelor's degree." - -Answer: "Old Bullion" -``` - -The obfuscated version requires the solver to: - -1. Identify Jensen Huang as NVIDIA's CEO -2. Find where he got his bachelor's degree (Oregon State, not Stanford) -3. Identify the county (Benton County, OR) -4. Find who the county is named after (Thomas Hart Benton) -5. Find his nickname --- forcing one final hop to verify it's the Senator, not the painter - ---- - -## **Step 3: Search Trajectory Rollouts** - -For each riddle, we generate multi-step tool-using conversation rollouts where the model makes several web-search tool calls, reads tool outputs, reasons about results, and ends with a final answer. The agent operates through a cyclical **thought-action-observation loop**: - -1. **Thought:** Analyze the current state, identify the next missing link, formulate a search query -2. **Action (Tool Call):** Execute `web_search(query="...")`, retrieving real data from the web -3. **Observation & Synthesis:** Read the results, extract the relevant fact, loop back to step 1 - -The rollout model was [MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2), chosen for its strong performance on BrowseComp and tool-calling benchmarks. 
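
The thought-action-observation loop above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the rollout harness used to build the dataset: `run_rollout`, `scripted_model`, and `fake_search` are hypothetical names, the model and the search tool are stubbed so the example runs offline, and the 25-call budget is an assumption borrowed from the cap mentioned in the example system prompt.

```python
import json

MAX_TOOL_CALLS = 25  # assumed budget, matching the cap in the example system prompt


def run_rollout(question, model_step, web_search, max_tool_calls=MAX_TOOL_CALLS):
    """Run one thought-action-observation loop.

    Returns (messages, finish_reason). `model_step` and `web_search` are
    hypothetical callables standing in for the rollout model and search tool.
    """
    messages = [{"role": "user", "content": f"Problem: {question}"}]
    calls_made = 0
    while True:
        # Thought: the model reads the history and decides what to do next.
        reply = model_step(messages)
        messages.append(reply)
        pending = reply.get("tool_calls", [])
        if not pending:
            return messages, "stop"  # no tool call means a final answer
        if calls_made + len(pending) > max_tool_calls:
            return messages, "tool_budget_exceeded"
        for call in pending:
            # Action: execute the requested search.
            args = json.loads(call["function"]["arguments"])
            result = web_search(args["query"])
            # Observation: feed the result back as a tool message.
            messages.append(
                {"role": "tool", "name": "web-search", "content": json.dumps(result)}
            )
            calls_made += 1


def scripted_model(messages):
    """Stub model: search twice, then answer."""
    seen = sum(1 for m in messages if m["role"] == "tool")
    if seen < 2:
        return {
            "role": "assistant",
            "content": f"Thought: I still need clue {seen + 1}.",
            "tool_calls": [
                {"function": {"name": "web-search",
                              "arguments": json.dumps({"query": f"clue {seen + 1}"})}}
            ],
        }
    return {"role": "assistant", "content": "Final Answer: example entity"}


def fake_search(query):
    """Stub search tool returning one canned result."""
    return {"results": [{"title": f"Result for {query}", "content": "placeholder text"}]}


trace, finish_reason = run_rollout("example riddle", scripted_model, fake_search)
print(finish_reason, len(trace))
```

Swapping the stubs for a real chat-completion call and a real search client leaves the loop itself unchanged. The returned `messages` list has roughly the same shape as the condensed trajectory records in this post.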
- -### Rollout Statistics - -```bash -$ jq -r '.finish_reason[-1] // "MISSING"' shard_*/output-rs0.jsonl-async \ - | sort | uniq -c | sort -nr - 6974 stop - 177 length -``` - -- **6,974 completed** (`stop`) --- the model reached a final answer -- **177 truncated** (`length`) --- hit the context limit before answering -- **Average ~12 tool calls** per sample - -### What a Trajectory Looks Like - -Each trajectory captures the full search process. Here's a condensed example (simplified for readability --- production records include OpenAI-spec fields like `tool_call_id`, `id`, and `type` on tool call messages): - -```json -{ - "messages": [ - { - "role": "system", - "content": "You are an expert search agent. Your goal is to answer the user's question efficiently using verifiable online sources.\n\nExecution Rules:\n1. Search Results...\n2. Planning...\n3. Tool Use...\n4. MAXIMUM 25 Calls...\n5. Verification...\n6. Output Format - The last line must be EXACTLY: \"Final Answer: \"...\n" - }, - { - "role": "user", - "content": "Problem: Which equatorial coordinate system, using an angular measurement instead of right ascension, can be reached by starting with lawrencium, then the physicist it honors, the university they attended...?" - }, - { - "role": "assistant", - "content": "\nWe need to parse the clues step by step:\n1. Start with lawrencium... named after Ernest Lawrence...\n2. The university they attended... 
need to verify...\n", - "tool_calls": [{"function": {"name": "web-search", "arguments": "{\"query\": \"Ernest Lawrence attended university\"}"}}] - }, - { - "role": "tool", - "name": "web-search", - "content": "{\"results\": [{\"title\": \"Physics - Ernest Lawrence\", \"content\": \"He attended Manchester High School prior to entering Manchester University...\"}, {\"title\": \"2025 Hall of Fame...\", \"content\": \"...Ernest Lawrence graduated from USD in 1922...\"}]}" - }, - { - "role": "assistant", - "content": "\nErnest Lawrence attended University of South Dakota (BS) and Yale (PhD)... The clue mentions 'six-month academic schedule' — likely 'semester'...\n", - "tool_calls": [{"function": {"name": "web-search", "arguments": "{\"query\": \"semester six month academic schedule\"}"}}] - }, - { - "role": "tool", - "name": "web-search", - "content": "{\"results\": [{\"title\": \"Academic Calendar\", \"content\": \"Many colleges follow a semester plan...\"}]}" - }, - { - "role": "assistant", - "content": "Final Answer: first equatorial coordinate system" - } - ], - "metadata": { - "ground_truth": "first equatorial coordinate system", - "num_hops_in_graph": 8, - "seed_entity": "lawrencium", - "final_answer_entity": "first equatorial coordinate system", - "num_tool_calls": 2, - "finish_reason": ["tool_calls", "tool_calls", "stop"] - }, - "tools": [ - { - "type": "function", - "function": { - "name": "web-search", - "description": "Search the web for a query.", - "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]} - } - } - ] -} -``` - ---- - -## **Step 4: Post-Processing to SFT Dataset** - -Raw rollouts need cleanup before they become trainable SFT records: - -1. **Normalize tool outputs** into a consistent JSON "tool response" shape -2. **Drop broken/truncated interactions** (the 177 `length` records) -3. **Select the best rollout per question** (minimum tool calls among correct ones) -4. 
**Write OpenAI-messages style** JSONL with `messages[]`, `tools[]`, and `metadata{}` -5. **Manual review + LLM spot-checking** --- we reviewed as much SFT data as we could manually and used Gemini to spot-check chunks - ---- - -## **Production Yield Analysis** - -``` - PIPELINE YIELD - ============== - - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ 50,000 ───74%──► 37,000 ───65%──► 24,000 ───29%──► 7,000 │ - │ Seeds Valid Seeds Valid Questions Valid Trajectories │ - │ │ - │ Total Yield: 14% │ - └─────────────────────────────────────────────────────────────────────────────────────┘ -``` - -| Stage | Input | Output | Yield | -|-------|-------|--------|-------| -| Seed Creation (Wikidata walks) | 50,000 | 37,000 | 74% | -| Riddle / Question Generation (DD) | 37,000 | 24,000 | 65% | -| Search Trajectory Rollouts (Tavily) | 24,000 | 7,000 | 29% | -| **End-to-End** | **50,000** | **~7,000** | **~14%** | - -The 14% yield might seem low, but each surviving record is a *verified, multi-turn search trajectory* showing a model successfully navigating web search. The alternative --- human annotation at 15-30 minutes per trajectory --- would take months for the same volume. - ---- - -## **The Correctness Challenge** - -Measuring correctness in post-processing was one of the hardest parts of this project, for reasons that go beyond typical evaluation: - -**1. Questions can have multiple valid answers.** A question about "which country contains X" might have a valid answer at multiple levels of granularity, or the entity might have multiple correct associations. - -**2. Wikidata has stale ground truth.** The knowledge graph reflects a snapshot in time. The model's parametric knowledge or live search results may be more current. For example: - -> **Question:** "...city that contains the headquarters of the owner of U.S. Steel?" -> -> **Ground truth (from Wikidata):** United States -> -> **Model answer:** Japan -> -> **Reality:** U.S. 
Steel's acquisition by Nippon Steel (Japan) was announced in Dec 2023 and completed in June 2025. The model's answer is *factually correct today* but wrong according to the outdated KG path. - -### Accuracy Results - -We evaluated 657 trajectories against ground truth using fuzzy matching: - -``` ------------------------------------------------------------------------------------------------------------- -# | GROUND TRUTH | #TC | MODEL ANSWER | STATUS ------------------------------------------------------------------------------------------------------------- -1 | Ramsar Convention | 3 | Ramsar Convention (the Convention. | ✅ MATCH -2 | United States | 4 | United States | ✅ MATCH -3 | South Korea | 3 | Uzbekistan | ❌ MISS -4 | France | 3 | Germany | ❌ MISS -5 | Joseph Poelaert | 4 | Joseph Poelaert | ✅ MATCH -... -653 | Bangladesh | 11 | Bangladesh | ✅ MATCH -654 | Portal:Arithmetic | 10 | Portal:Arithmetic | ✅ MATCH -655 | Monumento 6 Gran Vía.. | 11 | Monumento V (the Monumento a los.. | ❌ MISS -656 | Tehran | 11 | Constantinople | ❌ MISS -657 | United Kingdom | 11 | Germany | ❌ MISS ------------------------------------------------------------------------------------------------------------- - -📊 RESULTS: 181/657 (27.5%) Correct -``` - -The 27.5% accuracy on this sample is for *raw, unfiltered* trajectories. After the full pipeline (rejection sampling, best-rollout selection, manual review), the final SFT dataset has much higher quality. The low raw accuracy underscores why multi-stage filtering is essential. - ---- - -## **Implementing with Data Designer's MCP Integration** - -The same pipeline is reproducible with Data Designer's MCP integration.
Three components make this work: - -**`LocalStdioMCPProvider`** launches a Tavily MCP server as a subprocess: - -```python -import os -import sys - -from data_designer.config.mcp import LocalStdioMCPProvider, ToolConfig - -# tavily_server_path is assumed to point at your local Tavily MCP server script -tavily_provider = LocalStdioMCPProvider( - name="tavily", - command=sys.executable, - args=[str(tavily_server_path)], - env={"TAVILY_API_KEY": os.environ["TAVILY_API_KEY"]}, -) -``` - -**`ToolConfig`** controls safety and limits: - -```python -tool_config = ToolConfig( - tool_alias="tavily", - providers=["tavily"], - allow_tools=["tavily_search"], - max_tool_call_turns=15, - timeout_sec=300.0, -) -``` - -**`tool_alias` + `with_trace`** on the LLM column enable tool calling and capture the full conversation: - -```python -config.add_column(dd.LLMTextColumnConfig( - name="agent_solution_raw", - system_prompt="You are an expert search agent...", - prompt="Problem: {{ user_query_obfuscated }}", - model_alias="search-agent", - tool_alias="tavily", - with_trace=dd.TraceType.ALL_MESSAGES, -)) -``` - -The resulting `agent_solution_raw__trace` column contains the complete ChatML conversation --- every user message, every tool call with arguments, every tool response with search results, and the final assistant response. This trace *is* the SFT training data. - -**Safety controls matter here.** `allow_tools` prevents the model from calling unexpected tools. `max_tool_call_turns=15` prevents infinite search loops. `timeout_sec=300` prevents hung connections. Without these, a fraction of records would consume unbounded resources. - ---- - -## **BrowseComp Benchmark Results** - -This dataset was shipped as part of **Nemotron Super v3** post-training (SFT + RL). On the [BrowseComp benchmark](https://openai.com/index/browsecomp/) (1,266 web browsing problems), Nemotron Super went from **0% to 31.28% accuracy** --- approaching GPT-OSS-120B at 33.89%.
- -![BrowseComp Benchmark Results — Nemotron Super accuracy improves from 0% to 31.28%](assets/search-agent/browsecomp-benchmark-results.jpg){ style="max-width:100%; height:auto" } - -| Model | BrowseComp Accuracy (%) | -|-------|------------------------| -| Nemotron Super (before synthetic search agent data) | 0.00 | -| **Nemotron Super (after synthetic search agent data, SFT + RL)** | **31.28** | -| GPT-OSS-120B | 33.89 | - -Before this work, Nemotron Super had **zero** web browsing capability --- it had never been trained on tool-use trajectories with search. Including our synthetic search agent dataset in the SFT blend, combined with other RL datasets in later training stages, enabled the model to go from no capability to near-competitive with GPT-OSS-120B on one of the hardest agentic benchmarks. This dev note focuses on the SFT data generation pipeline. - ---- - -## **Key Takeaways** - -1. **Wikidata provides infinite seed diversity.** Random walks on a knowledge graph with 100M+ entities produce an inexhaustible supply of multi-hop problems. The hop count directly controls difficulty --- 4 hops for warm-up, 8 for expert-level riddles. - -2. **Two-stage obfuscation prevents leakage.** Draft questions tend to follow the path structure too closely (breadcrumbing). The obfuscation rewrite produces concise, natural questions that force the solver to figure out *what* to search. - -3. **Low yield is expected and acceptable.** 14% end-to-end yield from 50k seeds still produces ~7,000 high-quality trajectories --- enough for meaningful SFT impact. Multi-hop search is genuinely hard, and most generated paths or questions are legitimately unsolvable. - -4. **Stale knowledge graphs are a real problem.** Wikidata doesn't update in real-time. Models with current parametric knowledge or live search results will disagree with ground truth on entities that have changed (mergers, leadership changes, geopolitical shifts). Correctness evaluation needs to account for this. 
- -5. **Iterate on seeds, not just prompts.** Seed filtering (removing generic answers, constraining hop counts, anti-meta filters) has as much impact on quality as prompt engineering. Filter early, save compute. - -6. **Traces are the training data.** The full thought-action-observation loop --- every search query formulation, every result evaluation, every reasoning step --- is what teaches tool-use capability. Final answers alone do not teach the process. - ---- - -## **Next Steps** - -- **Scale question generation.** Generate closer to 25,000 filtered questions using Data Designer, up from the current 7k trajectories. -- **Push difficulty higher.** Target questions where `num_tool_calls` consistently exceeds 15, requiring deeper reasoning chains. -- **Explore fresher knowledge bases.** Wikidata staleness is a real limitation. Investigate more recently updated, freely available knowledge bases for seed generation. -- **Search RL environment.** Use the filtered questions as an RL environment where the model receives a reward for correct final answers, complementing the SFT data. - ---- - -## **Try For Yourself** - -The snippet below shows the core pattern: seed data, two-stage riddle generation, and an MCP-enabled agent trajectory with full trace capture. - -
-Minimal example: search agent trajectory pipeline - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -MODEL_ALIAS = "nvidia-text" - -# Tavily MCP provider (hosted endpoint, no local server needed) -mcp_provider = dd.MCPProvider( - name="tavily", - endpoint="https://mcp.tavily.com/mcp/?tavilyApiKey=YOUR_KEY", - provider_type="streamable_http", -) - -tool_config = dd.ToolConfig( - tool_alias="tavily-search", - providers=["tavily"], - allow_tools=["tavily_search"], - max_tool_call_turns=25, - timeout_sec=300.0, -) - -config = dd.DataDesignerConfigBuilder(tool_configs=[tool_config]) -config.with_seed_dataset( - dd.LocalFileSeedSource(path="seeds.jsonl"), - sampling_strategy=dd.SamplingStrategy.SHUFFLE, -) - -# Stage 2a: Draft question from knowledge path -config.add_column(dd.LLMTextColumnConfig( - name="user_query_draft", model_alias=MODEL_ALIAS, - prompt=( - "Create a multi-step search riddle from this knowledge path:\n" - "{{ readable_path }}\n" - "Start: {{ seed_entity }}. Answer: {{ final_answer_entity }}\n" - "Do NOT name intermediate nodes or the answer. Return ONLY the question." - ), -)) - -# Stage 2b: BrowseComp-style obfuscation -config.add_column(dd.LLMTextColumnConfig( - name="user_query_obfuscated", model_alias=MODEL_ALIAS, - prompt=( - "Rewrite this riddle to be concise and natural (1-2 sentences).\n" - "Original: {{ user_query_draft }}\n" - "No breadcrumb chains. No entity names. If INVALID_PATH, output INVALID_PATH." - ), -)) - -# Stage 3: Agent trajectory with MCP tool calling -config.add_column(dd.LLMTextColumnConfig( - name="agent_solution_raw", model_alias=MODEL_ALIAS, - system_prompt="You are an expert search agent. 
Use tavily_search to find the answer.", - prompt="Problem: {{ user_query_obfuscated }}", - tool_alias="tavily-search", - with_trace=dd.TraceType.ALL_MESSAGES, -)) - -# Run -data_designer = DataDesigner(mcp_providers=[mcp_provider]) -preview = data_designer.preview(config, num_records=5) -preview.display_sample_record() -``` - -
- -
-Full recipe: search_agent.py (self-contained, runnable) - -[Download Code :octicons-download-24:](../../assets/recipes/mcp_and_tooluse/search_agent.py){ .md-button download="search_agent.py" } - -```python ---8<-- "assets/recipes/mcp_and_tooluse/search_agent.py" -``` - -
- ---- - -Key Resources: - -1. [NeMo Data Designer on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner) -2. [BrowseComp Benchmark (OpenAI)](https://openai.com/index/browsecomp/) -3. [Wikidata Knowledge Graph](https://www.wikidata.org/) -4. [Tavily Search API](https://tavily.com/) -5. [MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2) -6. [GTC 2026 Workshop: Building Search Agents with NeMo Data Designer](https://www.nvidia.com//gtc/session-catalog/sessions/gtc26-dlit81572/) - ---- diff --git a/docs/devnotes/posts/structured-outputs-from-nemotron.md b/docs/devnotes/posts/structured-outputs-from-nemotron.md deleted file mode 100644 index c18486810..000000000 --- a/docs/devnotes/posts/structured-outputs-from-nemotron.md +++ /dev/null @@ -1,441 +0,0 @@ ---- -date: 2026-02-18 -authors: - - dnathawani ---- - -# **Structured Outputs for Nemotron: Teaching Models to Produce Valid JSON, YAML, and XML** - -Using [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner), an orchestration framework for generating high-quality synthetic data at scale, we built an iterative pipeline that generates diverse, schema-constrained structured outputs across JSON, YAML, and XML. Through multiple rounds of prompt refinement, rejection sampling, and programmatic validation, we produced a 9,949-sample dataset of verified structured output training data. - - - -The dataset is publicly available: **[Download it on HuggingFace](https://huggingface.co/datasets/nvidia/Nemotron-RL-instruction_following-structured_outputs)** (CC BY 4.0). - -This post walks through the full SDG pipeline: schema generation, multi-format rollouts, rejection sampling, and the caveats we discovered along the way. - ---- - -## **Why This Matters: Reliably Building Agentic AI Applications** - -Every modern AI application that integrates LLMs into a software stack hits the same wall: the model needs to produce *structured* output (JSON, YAML, XML) that downstream code can parse and act on. 
Function calling, API responses, database inserts, configuration generation, and form filling all require the model to follow a schema precisely. - -The stakes are high. When an LLM serves as a backend for tool-calling agents, a single malformed JSON response doesn't just produce a bad answer; it crashes the entire agentic pipeline. The function call fails, the agent can't recover, and the user sees an error. OpenAI, Anthropic, and Google have all invested heavily in structured output guarantees for exactly this reason. - -On standard benchmarks, our baseline model failed to produce valid structured outputs up to 35% of the time, depending on schema complexity and format. For an API serving thousands of requests, that's hundreds of failures per hour. Our goal was to reduce this as much as possible through targeted synthetic data. - ---- - -## **Impact on Model Performance** - -Training Nemotron Nano v3 with this data improved JSONSchemaBench accuracy from 80.2% to 86.9% and StructEval-Text from 64.5% to 72.1% (+7.6pp overall), confirming the data generalizes across formats. 
We evaluated across two public benchmarks:
-
-[**JSONSchemaBench**](https://github.com/guidance-ai/jsonschemabench) tests conformance to JSON schemas of varying complexity (nesting depth, field count):
-
-| Model | %Valid |
-|-------|--------|
-| Baseline SFT Model | 80.2% |
-| **Nemotron Nano v3 (trained with this data)** | **86.9%** |
-| Baseline Qwen3-30B-A3B-Thinking | 92.8% |
-| Baseline GPT-OSS-20B | 95.8% |
-
-[**StructEval-Text**](https://github.com/StructEval/StructEval) tests structured output across multiple formats (CSV, JSON, TOML, XML, YAML):
-
-| Model | CSV | JSON | TOML | XML | YAML | **Overall** |
-|-------|-----|------|------|-----|------|-------------|
-| Baseline SFT | 92.5% | 68.5% | 19.2% | 44.0% | 64.2% | **64.5%** |
-| **Nemotron Nano v3** | **99.1%** | **77.3%** | **28.4%** | **49.5%** | **72.2%** | **72.1%** |
-| Qwen3-30B-Instruct | 99.9% | 71.5% | 17.0% | 54.8% | 77.4% | 72.6% |
-| GPT-OSS-20B | 98.7% | 79.4% | 94.1% | 53.7% | 88.9% | 81.3% |
-
-Key observations:
-- **TOML is the hardest format** -- even GPT-OSS-20B only reaches 94%, and smaller models struggle below 30%. Our data improved TOML from 19.2% to 28.4% but there's significant room to grow.
-- **CSV is nearly solved** -- all models score above 92%.
-- **JSON and YAML benefit most** from our synthetic data, with 8-9 percentage point gains.
-- **XML remains stubborn** -- gains were modest (44% to 49.5%), suggesting XML-specific training data may be needed.
- ---- - -## **Structured Outputs Pipeline** - -### High-Level Architecture - -The pipeline generates structured output training data through a multi-stage process, with rejection sampling to select the best of multiple rollouts: - -``` - STRUCTURED OUTPUTS SDG PIPELINE - ============================== - - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 1: SEED DATA │ - │ │ - │ topic_category (10+ categories) topic_subtopic (20 per category) │ - │ 200+ unique topic pairs across diverse domains │ - └─────────────────────────────────────────┬───────────────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 2: DIVERSITY CONTROLS │ - │ │ - │ Schema Controls Content Controls Prompt Controls │ - │ ├─ rigidity: strict/mod ├─ format: json/yaml/xml ├─ length: short/med/long │ - │ ├─ fields: 3 / 4 / 5 ├─ turns: 2 / 3 / 4 ├─ type: imp/interr/decl │ - │ ├─ complexity: simple/ ├─ tone ├─ tone: formal/neutral/… │ - │ │ moderate/complex └─ detail_level ├─ mode: sys/user/both │ - │ └─ depth: 2-3 / 4-5 / 6-8 └─ placement: 6 orderings │ - └─────────────────────────────────────────┬───────────────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 3: LLM GENERATION (Qwen3-235B) │ - │ │ - │ structured_schema ────► conversation_pairs ────► document │ - │ (unique per record) (multi-turn Q&A) (paragraph or bulleted facts) │ - │ │ │ - │ ▼ │ - │ user_prompt │ - │ (diverse instructions) │ - │ │ │ - │ ▼ │ - │ ┌────────────────────┐ │ - │ │ 3x ROLLOUTS │ │ - │ │ output_1 │ │ - │ │ output_2 │ │ - │ │ output_3 │ │ - │ └─────────┬──────────┘ │ - └────────────────────────────────────┬────────────────────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 4: VALIDATION + REJECTION SAMPLING │ - 
│ │ - │ For each rollout: │ - │ 1. Parse check (json.loads / yaml.safe_load / xml.etree) │ - │ 2. Schema conformance (jsonschema Draft202012Validator) │ - │ 3. Diagnostic: valid / parse_error / schema_violation │ - │ │ - │ Pick best valid rollout ────► fallback: parse + length heuristic │ - └─────────────────────────────────────────┬───────────────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ OUTPUT: SFT / RLVR DATA │ - │ │ - │ messages: [{system}, {user: prompt + schema + doc}, {assistant}] │ - │ 9,949 samples · JSON format · CC BY 4.0 │ - └─────────────────────────────────────────────────────────────────────────────────────┘ -``` - -### What Makes This Pipeline Different - -A simple "generate JSON from a prompt" approach won't cut it. This pipeline has several distinctive design choices: - -1. **Per-record schema generation.** Every training example gets a *unique* schema. The model doesn't memorize a handful of fixed structures; it learns to follow arbitrary schemas. - -2. **Multi-format support.** The same pipeline produces JSON, YAML, and XML training data by controlling the `output_format` sampler. Schemas are generated natively in the target format, then normalized to JSON Schema for validation. - -3. **Depth and complexity control.** Three complexity tiers (`simple`/`moderate`/`complex`) target different nesting depths (2-3, 4-5, 6-8 levels deep). Complex schemas aim for 7-10 levels of nesting when feasible. - -4. **Rejection sampling with 3 rollouts.** For each record, we generate 3 candidate structured outputs and validate all of them. We pick the best valid one. If none are valid, we fall back to a heuristic ranker (parse success + length). - -5. **Prompt diversity as a first-class concern.** The user prompt varies across sentence length, sentence type, tone, prompt mode (system_only/user_only/both), and placement variant (6 orderings of prompt/schema/document). 
- ---- - -**Step 1: Seed Data and Schema Generation** - -We start with 200+ topic pairs spanning Leisure Activities, Daily Life, Social Interactions, Education, Events, Pets, Parenting, Fitness, Arts, Environment, Developer Configs, Business Docs, Media/Feeds, Geospatial, Civic/Alerts, IoT/Telemetry, and ML/AI. - -Schema generation is controlled by four samplers: - -```python -config.add_column(dd.SamplerColumnConfig( - name="schema_complexity", - sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams(values=["simple", "moderate", "complex"]), -)) - -config.add_column(dd.SamplerColumnConfig( - name="schema_depth_target", - sampler_type=dd.SamplerType.SUBCATEGORY, - params=dd.SubcategorySamplerParams( - category="schema_complexity", - values={ - "simple": ["2-3"], - "moderate": ["4-5"], - "complex": ["6-8"], - }, - ), -)) -``` - -The schema generation prompt instructs the model (Qwen3-235B-A22B) to produce a JSON Schema with the target depth, at least one boolean and one enum property, `additionalProperties: false` at all levels, and a complete `required` list. Each schema is then normalized to JSON Schema draft 2020-12 for validation. - ---- - -**Step 2: Conversation and Document Generation** - -From each schema, we generate multi-turn Q&A pairs whose answers naturally contain the data that the schema describes. These conversation pairs are then transformed into a "document" (either a paragraph or a bulleted list of facts, varying by detail level setting). This document provides the grounding context for the structured output generation. - ---- - -**Step 3: Prompt Assembly and Rollouts** - -The user prompt is assembled from diverse instruction templates conditioned on length, type, and tone. It always asks for a strictly conforming structured instance in the requested `output_format`. The schema and document are placed in one of 6 ordering variants (`prompt_schema_document`, `schema_document_prompt`, etc.) 
to prevent the model from relying on positional cues. - -We generate 3 rollout completions per record. Each rollout may include `...` reasoning traces, which are stripped before parsing/validation but retained in the final `messages` for reasoning-mode training. - ---- - -**Step 4: Validation and Rejection Sampling** - -Each rollout goes through a three-stage validation pipeline: - -1. **Parse check**: Can the output be parsed as valid JSON/YAML/XML? (XML parsing includes sanitization for pre-root noise, comments, and bare `&` escaping.) -2. **Schema validation**: Does the parsed output conform to the normalized JSON Schema? Uses `jsonschema.Draft202012Validator`. -3. **Rejection sampling**: Among the 3 rollouts, pick the best valid one. If none pass, fall back to a parse-success + length heuristic. - -Each rollout gets a diagnostic: `valid`, `parse_error`, `schema_missing`, `schema_violation`, or `schema_unchecked`. - ---- - -**A Note on `LLMStructuredColumnConfig`** - -Data Designer provides `LLMStructuredColumnConfig`, a column type that guarantees the LLM output conforms to a schema by using Pydantic models or JSON Schema as the `output_format`. This is ideal when every record shares the same structure (e.g., a `QAPair` or `ProductInfo` schema). The framework handles prompting, parsing, and retry logic automatically, ensuring zero schema drift. - -In our pipeline, however, schemas are *dynamic*: each record has a unique, per-record schema generated by the LLM. Since `LLMStructuredColumnConfig` requires a fixed `output_format` at config time, we use `LLMTextColumnConfig` for the generation step and validate conformance programmatically with `jsonschema.Draft202012Validator`. If your use case has a fixed output schema, `LLMStructuredColumnConfig` is the simpler and more reliable choice. 
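The diagnose-then-select logic of Step 4 can be sketched roughly as follows. This is a minimal illustration, not the pipeline's implementation: it handles JSON only, takes the schema check as a callable (in the real pipeline that role is played by `jsonschema.Draft202012Validator`), and assumes that "best" means the first valid rollout and that the fallback ranks parseable outputs ahead of unparseable ones, then by length. The names `diagnose`, `pick_rollout`, and `has_required_keys` are hypothetical.

```python
import json
from typing import Callable


def diagnose(raw: str, conforms: Callable[[object], bool]) -> str:
    """Label one rollout as valid, parse_error, or schema_violation."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return "parse_error"
    return "valid" if conforms(parsed) else "schema_violation"


def pick_rollout(rollouts: list[str], conforms: Callable[[object], bool]) -> tuple[str, str]:
    """Return (chosen_output, diagnostic) for one record."""
    labeled = [(raw, diagnose(raw, conforms)) for raw in rollouts]
    for raw, label in labeled:
        if label == "valid":
            return raw, label  # assumption: the first valid rollout counts as "best"
    # Fallback (assumed ranking): parseable outputs first, then longer outputs.
    return max(labeled, key=lambda item: (item[1] != "parse_error", len(item[0])))


def has_required_keys(obj: object) -> bool:
    """Hypothetical stand-in for a real schema check: require two keys."""
    return isinstance(obj, dict) and obj.keys() >= {"name", "armed"}


rollouts = [
    '{"name": "front door"',                  # truncated output: parse_error
    '{"name": "front door"}',                 # parses but misses a key: schema_violation
    '{"name": "front door", "armed": true}',  # valid
]
chosen, label = pick_rollout(rollouts, has_required_keys)
```

Keeping the diagnostic next to the chosen output makes it easy to filter out fallback picks later if only fully valid records are wanted.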
-
----
-
-**Sample Record**
-
-Here's what one complete record looks like after running `display_sample_record()` (topic: "Daily Life / Discussing home security", format: JSON, strict schema):
-
-![Sample record output from display_sample_record()](assets/structured-outputs-from-nemotron/structured-outputs-sample-record.png)
-*Screenshot of Data Designer's `display_sample_record()` output showing seed columns, generated samplers, JSON schema, conversation, structured JSON, and both LLM-as-a-Judge scores.*
-
----
-
-**Published Dataset**
-
-The dataset is publicly available on HuggingFace:
-
-**[nvidia/Nemotron-RL-instruction_following-structured_outputs](https://huggingface.co/datasets/nvidia/Nemotron-RL-instruction_following-structured_outputs)**
-
-| Metric | Value |
-|--------|-------|
-| Train split | 9,437 samples |
-| Validation split | 512 samples |
-| Total | 9,949 samples |
-| Format | JSON (schema-constrained) |
-| Generator model | Qwen3-235B-A22B-Instruct-2507 |
-| Rollouts per record | 3 (rejection sampling, best picked) |
-| License | CC BY 4.0 |
-
-Each record contains:
-- `responses_create_params`: the full prompt (document + schema + instructions) and model response
-- `schema_str`: the JSON Schema the output must conform to (118 chars to 8k chars)
-- `schema_type`: format type (`json`)
-- `schema_fields_count`: number of required top-level fields (varies from 5 to 12+)
-
-The dataset is designed for use with [NeMo Gym](https://github.com/NVIDIA/NeMo-RL) for **Reinforcement Learning from Verifiable Reward (RLVR)**, where schema conformance provides a deterministic, programmatically verifiable reward signal with no LLM judge needed.
-
----
-
-**Future Work**
-
-- **TOML, XML, and scaling to more formats.** TOML conformance lags significantly; its nuances (inline tables, array of tables, datetime formats) are underrepresented in generator pretraining. XML parsing is fragile due to pre-root noise (comments, processing instructions, stray text).
Markdown tables, Protocol Buffers, and SQL DDL are all structured formats that could benefit from this approach. -- **Schema complexity has diminishing returns.** Depth targets of 6-8 push the generator model to its limits. Many "complex" schemas end up with artificial nesting. The sweet spot is 4-5 levels of meaningful nesting. - ---- - -## **Key Takeaways** - -1. **Structured output is a learnable skill.** Targeted synthetic data dramatically improves schema conformance. The baseline-to-trained gap proves this is not a fixed model capability. -2. **Per-record schemas are essential.** Models trained on fixed schemas generalize poorly. A unique schema per record was the single most impactful design choice. -3. **Diversity at every level.** Diverse topics, diverse schemas (depth/width/rigidity), diverse formats, diverse prompts. Each dimension independently improves robustness. -4. **Rejection sampling is cheap insurance.** 3x rollouts push per-record validity from ~80% to >95%. The marginal token cost is small compared to the quality gain. -5. **Validation must be programmatic.** LLM judges assess *design quality* but cannot reliably detect *schema violations*. `jsonschema` + format parsers are non-negotiable. -6. **The hardest formats need the most data.** TOML and XML lag behind JSON and YAML. The pipeline makes it easy to oversample hard formats. - -## **Try It Yourself** - -The demo below generates JSON structured outputs. To extend to YAML or XML, add an `output_format` sampler and reference it in your prompts: - -```python -config.add_column(dd.SamplerColumnConfig( - name="output_format", sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams(values=["json", "yaml", "xml"]), -)) -``` - -Then update the structured output prompt to request `{{ output_format }}` and adjust your validation logic to parse each format accordingly (`json.loads`, `yaml.safe_load`, `xml.etree`). - -
-Full source: structured_outputs_demo.py - -```python -"""Structured Outputs Demo - -Generate schema-conforming JSON with per-record schemas, conversations, -LLM judge evaluation, and programmatic validation. - - pip install data-designer pandas jsonschema - python structured_outputs_demo.py -""" - -import json - -import pandas as pd - -import data_designer.config as dd -from data_designer.interface import DataDesigner - -MODEL_ALIAS = "nvidia-text" - -# Seed data: 20 topic pairs across 5 categories -topics = { - "Leisure Activities": ["Planning a weekend activity", "Discussing a recent movie", - "Talking about hobbies", "Recommending a video game"], - "Daily Life": ["Making dinner plans", "Weather conversation", - "Morning routines", "Managing schedules"], - "Education": ["Choosing a college major", "Study techniques", - "Learning a new language", "Online courses"], - "Technology": ["Smartphone features", "Home automation", - "Cloud computing basics", "AI assistants"], - "Health": ["Workout routines", "Healthy eating", - "Sleep habits", "Stress management"], -} -rows = [{"topic_category": cat, "topic_subtopic": sub} - for cat, subs in topics.items() for sub in subs] -seed_df = pd.DataFrame(rows) - -# Build the pipeline (uses default NVIDIA provider via NVIDIA_API_KEY) -data_designer = DataDesigner() -config = dd.DataDesignerConfigBuilder() - -config.with_seed_dataset( - dd.DataFrameSeedSource(df=seed_df), - sampling_strategy=dd.SamplingStrategy.SHUFFLE, -) - -# Diversity samplers -config.add_column(dd.SamplerColumnConfig( - name="schema_rigidity", sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams(values=["strict", "moderate"]), -)) -config.add_column(dd.SamplerColumnConfig( - name="schema_fields_count", sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams(values=["3", "4", "5"]), -)) -config.add_column(dd.SamplerColumnConfig( - name="num_turns", sampler_type=dd.SamplerType.CATEGORY, - 
params=dd.CategorySamplerParams(values=["2", "3", "4"]), -)) - -# LLM columns: schema -> conversation -> structured JSON -config.add_column(dd.LLMTextColumnConfig( - name="json_schema", model_alias=MODEL_ALIAS, - prompt=( - 'Design a JSON Schema for a response object named "scene_response".\n' - "Controls:\n" - "- Rigidity: {{ schema_rigidity }}\n" - "- Number of top-level properties: {{ schema_fields_count }}\n" - "- Topic Category: {{ topic_category }}\n" - "- Topic Subtopic: {{ topic_subtopic }}\n\n" - "Requirements:\n" - '1. Include a "name" field set to "scene_response"\n' - '2. Include a "schema" field containing a valid JSON Schema (draft 2020-12)\n' - '3. Set "strict": true\n' - "4. Use {{ schema_fields_count }} top-level properties\n" - "5. Include at least one boolean and one enum property\n" - '6. Set "additionalProperties": false\n\n' - "Return ONLY the JSON object, no markdown fences." - ), -)) - -config.add_column(dd.LLMTextColumnConfig( - name="conversation", model_alias=MODEL_ALIAS, - prompt=( - "Create a {{ num_turns }}-turn Q&A conversation about a scene " - "related to {{ topic_category }} / {{ topic_subtopic }}.\n" - "The conversation should naturally lead to information that fits this JSON schema:\n" - "{{ json_schema }}\n\n" - "Format each turn as:\nQ: [question]\nA: [answer]" - ), -)) - -config.add_column(dd.LLMTextColumnConfig( - name="structured_json", model_alias=MODEL_ALIAS, - prompt=( - "Based on the following conversation and JSON schema, generate a JSON object " - "that strictly conforms to the schema.\n\n" - "Conversation:\n{{ conversation }}\n\n" - "JSON Schema:\n{{ json_schema }}\n\n" - "Return ONLY the valid JSON object, no explanation." 
- ), -)) - -# LLM judges: schema quality (0-3) + JSON validity (binary) -config.add_column(dd.LLMJudgeColumnConfig( - name="quality_score", model_alias=MODEL_ALIAS, - prompt=( - "Evaluate the quality of this JSON schema for the topic.\n\n" - "Schema:\n{{ json_schema }}\n\n" - "Topic: {{ topic_category }} / {{ topic_subtopic }}" - ), - scores=[dd.Score( - name="schema_quality", - description="Overall quality of the JSON schema design", - options={3: "Excellent", 2: "Good", 1: "Fair", 0: "Poor"}, - )], -)) - -config.add_column(dd.LLMJudgeColumnConfig( - name="json_validity_llm", model_alias=MODEL_ALIAS, - prompt=( - "Does the structured JSON conform to the schema?\n\n" - "JSON Schema:\n{{ json_schema }}\n\n" - "Structured JSON:\n{{ structured_json }}" - ), - scores=[dd.Score( - name="json_validity", - description="Does the structured JSON conform to the schema?", - options={1: "Valid", 0: "Invalid"}, - )], -)) - -# Generate and display -preview = data_designer.preview(config, num_records=10) -preview.display_sample_record() - -# Programmatic validation -import jsonschema - -df = preview.dataset if hasattr(preview, "dataset") else preview -valid = 0 -for _, row in df.iterrows(): - try: - schema_obj = json.loads(row["json_schema"]) if isinstance(row["json_schema"], str) else row["json_schema"] - json_obj = json.loads(row["structured_json"]) if isinstance(row["structured_json"], str) else row["structured_json"] - if "schema" in schema_obj: - jsonschema.Draft202012Validator(schema_obj["schema"]).validate(json_obj) - valid += 1 - except Exception as e: - print(f" Record failed: {e}") - -print(f"\nValidation: {valid}/{len(df)} records pass ({valid/len(df)*100:.0f}%)") -``` - -
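The demo's programmatic check only parses JSON. For the YAML and XML extension described before the demo, the parse step could be dispatched on `output_format` roughly as sketched below. This is an assumption-laden illustration: `parse_output` and `is_parseable` are hypothetical helpers, PyYAML is treated as optional, and turning a parsed XML tree into something a JSON Schema validator accepts is left out.

```python
import json
import xml.etree.ElementTree as ET

try:
    import yaml  # PyYAML; optional in this sketch
except ImportError:
    yaml = None

# Parser-specific exceptions that are normalized to ValueError below.
PARSE_ERRORS = (json.JSONDecodeError, ET.ParseError) + (
    (yaml.YAMLError,) if yaml is not None else ()
)


def parse_output(text: str, output_format: str):
    """Parse model output in the requested format; raise ValueError on failure."""
    try:
        if output_format == "json":
            return json.loads(text)
        if output_format == "yaml":
            if yaml is None:
                raise ValueError("PyYAML is not installed")
            return yaml.safe_load(text)
        if output_format == "xml":
            return ET.fromstring(text)  # returns an Element, not a dict
    except PARSE_ERRORS as exc:
        raise ValueError(f"{output_format} parse error: {exc}") from exc
    raise ValueError(f"unsupported output_format: {output_format!r}")


def is_parseable(text: str, output_format: str) -> bool:
    """Convenience wrapper for the parse-check stage."""
    try:
        parse_output(text, output_format)
        return True
    except ValueError:
        return False
```

Normalizing every parser's exception to `ValueError` lets the validation loop record a single `parse_error` diagnostic regardless of format.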
- ---- - -**Key Resources:** - -- **Dataset (download):** [nvidia/Nemotron-RL-instruction_following-structured_outputs](https://huggingface.co/datasets/nvidia/Nemotron-RL-instruction_following-structured_outputs) (CC BY 4.0) -- **NeMo Data Designer:** [github.com/NVIDIA-NeMo/DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) -- **NeMo Gym:** [github.com/NVIDIA/NeMo-RL](https://github.com/NVIDIA/NeMo-RL) -- **Nemotron Nano v3 Technical Report:** [research.nvidia.com](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf) | [arxiv.org/abs/2512.20848](https://arxiv.org/abs/2512.20848) - ---- diff --git a/docs/devnotes/posts/text-to-sql.md b/docs/devnotes/posts/text-to-sql.md deleted file mode 100644 index d0f25e176..000000000 --- a/docs/devnotes/posts/text-to-sql.md +++ /dev/null @@ -1,593 +0,0 @@ ---- -date: 2026-04-14 -authors: - - dnathawani - - ymeyer - - mvansegbroeck ---- - -# **Engineering an Enterprise-Grade Text-to-SQL Dataset with NeMo Data Designer** - -While LLMs have mastered generic coding, Text-to-SQL remains one of the most challenging frontiers in enterprise AI. In many ways this is due to (i) SQL tasks relying on both code and data and (ii) real-world data and databases being quite messy. Focusing on careful data design that accounts for real-world diversity and complexity, we built a [NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) pipeline that includes conditional sampling, three-stage LLM generation, code validators, and multi-dimensional judge scoring to generate reasoning-heavy text-to-SQL samples across PostgreSQL, MySQL, and SQLite, and automatically filter down to the highest quality 96.5k records. Each sample pairs a natural-language prompt and a fully synthetic database schema context with a target SQL query. 
To improve robustness and mimic the messiness of production databases, the pipeline injects distractor tables and columns into the schema context, forcing the model to learn to ignore irrelevant schema elements. The final dataset is validated and filtered through per-dialect syntax validators and five LLM-as-a-critic judges. - - -
- -![Text-to-SQL Synthetic Data Pipeline](assets/text-to-sql/text-to-sql-pipeline.jpg){ width=800 } - -
---- - -## **The "Real-World" Gap: Why Academic Data Wasn't Enough** - -The gap between academic benchmarks and the messy reality of enterprise data warehouses is massive. On academic benchmarks like Spider (where schemas are clean, tables are few, and queries are straightforward), frontier models score above 85%. On [BIRD](https://bird-bench.github.io/) (which introduces dirty data, larger schemas, and external knowledge requirements), the best open models reach roughly 70% execution accuracy --- and on [Spider 2.0 Lite](https://spider2-sql.github.io/) (which uses real enterprise databases with hundreds of tables, multiple dialects, and complex business logic), even the best models score below 50%. - -The problem isn't model capability --- it's **training data**. Most open-source text-to-SQL datasets assume a "happy path": intuitive column names, perfect data types, and straightforward questions. Production SQL is different: - -- **Dialect specificity.** Generic "SQL" doesn't compile. We needed valid, executable code for MySQL, PostgreSQL, and SQLite that respects their unique syntax --- `date('now')` in SQLite vs. `CURRENT_DATE` in Postgres, `DISTINCT ON` in PostgreSQL vs. nested subqueries in MySQL. -- **Dirty data.** Real columns contain currency symbols (`$57,500`), mixed date formats, and JSON blobs. The model needs to learn *defensive SQL*: writing queries that use `CAST`, `STR_TO_DATE`, and string manipulation functions to clean data at query time before attempting any aggregation. We explicitly prompted the generation engine to introduce anti-patterns like storing dates as text (`'01-Jan-2023'`), including currency symbols in pricing columns, or burying critical flags inside JSON blobs. -- **Distractor tables and schema linking.** In production, you rarely get just the 2 tables you need; you're more likely to get a schema with 50 tables, many of which look identical. 
We injected semantically similar "distractor" tables into every context --- `sales_orders` vs. `sales_orders_archive`, `customer_leads` vs. `active_customers` --- forcing the model to perform schema linking based on column constraints and relationships, not just table names. -- **Industry-specific schemas.** Healthcare EHR tables look nothing like financial trading systems. The column names, relationships, and business logic are domain-specific. -- **Complexity gradients.** Junior analysts write simple SELECTs; senior engineers write recursive CTEs with window functions. Training data needs the full spectrum. - -**Domain diversity and complexity coverage matter more than dataset size.** - ---- - -## **Pipeline Overview** - -The pipeline generates text-to-SQL training data through a five-stage process. Each record flows through seeding & diversification, three LLM generation steps, and a validation + quality scoring layer. All three LLM generation stages use a reasoning model whose internal chain-of-thought improves schema design and SQL correctness. The pipeline runs independently for each SQL dialect, with dialect-specific prompts, validators, and judge prompts. - -
-ASCII version of the pipeline diagram - -``` - TEXT-TO-SQL SDG PIPELINE - ======================== - - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 1: SEEDING & DIVERSIFICATION │ - │ │ - │ Domain Controls SQL Controls Prompt Controls │ - │ ├─ industry_sector (60) ├─ sql_complexity (3 tiers) ├─ instruction_style │ - │ ├─ topic (~700) ├─ sql_concept (89 buckets) │ (5 styles) │ - │ ├─ data_quality_challenge ├─ sql_task_type (12 cats) ├─ linguistic_register│ - │ │ (5 categories) └─ sql_task_concept (94) │ (5 registers) │ - │ └─ knowledge_dependency └─ politeness_level │ - │ (3 categories) (4 levels) │ - └─────────────────────────────────────────┬───────────────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 2: PROMPT GENERATION (Reasoning LLM) │ - │ │ - │ Generates a natural-language request to a data assistant. │ - │ Grounded in sampled metadata; no SQL jargon; realistic thresholds. │ - │ Style adapts to instruction_style × linguistic_register × politeness_level. │ - └─────────────────────────────────────────┬───────────────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 3: SCHEMA + DATA GENERATION (Reasoning LLM) │ - │ │ - │ Generates dialect-specific DDL (CREATE TABLE) + sample data (INSERT). │ - │ ├─ 3–5 core tables with PKs, FKs, and realistic constraints │ - │ ├─ 1–2 distractor tables (plausible but unnecessary, with FK links) │ - │ ├─ 3–5 distractor columns per table (created_at, updated_by, etc.) 
│ - │ └─ Dirty data injected per data_quality_concept (mixed formats, embedded chars) │ - └─────────────────────────────────────────┬───────────────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 4: SQL GENERATION (Reasoning LLM) │ - │ │ - │ Generates dialect-specific SQL (SQLite / MySQL / PostgreSQL). │ - │ ├─ References only tables/columns from the schema context │ - │ ├─ Handles dirty data with cleaning logic (CAST, REPLACE, SUBSTR, regex) │ - │ ├─ Ignores distractor tables and columns │ - │ └─ Anchors relative time to max date in data (no CURRENT_DATE / NOW()) │ - └─────────────────────────────────────────┬───────────────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────────────────────────────────────────────────────┐ - │ STAGE 5: VALIDATION + QUALITY SCORING │ - │ │ - │ Syntax Validator 5 LLM Judges (0–4 scores) │ - │ ├─ SQL_SQLITE ├─ Prompt: naturalness, specificity, no SQL jargon │ - │ ├─ SQL_MYSQL ├─ SQL: relevance, readability, scalability, standards │ - │ └─ SQL_POSTGRES ├─ Context: relevance, readability, scalability, stds │ - │ ├─ Data Quality: cleaning correctness, efficiency │ - │ └─ Knowledge: application correctness, clarity │ - │ │ - │ 96.5k records pass validation and quality filtering │ - └─────────────────────────────────────────────────────────────────────────────────────┘ -``` - -
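Two of the Stage 4 rules in the diagram, cleaning dirty values at query time and anchoring relative time to the latest date in the data, can be tried out with Python's built-in `sqlite3` module. The table, column names, and rows below are hypothetical, and dates are stored in ISO format to keep the sketch short (the post describes messier text dates in the actual data).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount TEXT, order_date TEXT);
    INSERT INTO orders (amount, order_date) VALUES
        ('$57,500', '2023-01-05'),
        ('$1,200',  '2023-03-20'),
        ('$800',    '2023-03-28');
    """
)

# Strip '$' and ',' before casting, and measure "the last 30 days" from the
# newest row instead of from the wall clock, so the result never drifts.
query = """
SELECT SUM(CAST(REPLACE(REPLACE(amount, '$', ''), ',', '') AS REAL)) AS total
FROM orders
WHERE order_date >= date((SELECT MAX(order_date) FROM orders), '-30 days')
"""
total = conn.execute(query).fetchone()[0]
print(total)  # 2000.0 (only the two March orders fall inside the window)
```

Anchoring on `MAX(order_date)` keeps the query's result stable no matter when it is executed, which matters when generated SQL is validated long after the sample data was written.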
-
----
-
-## **Step 1: Seeding & Diversification -- Controlling Diversity at the Source**
-
-Rather than relying on LLM creativity alone for diversity, the pipeline samples structured metadata that deterministically controls every axis of variation. A JSON taxonomy file defines the problem space:
-
-| Axis | Categories | Subcategories | Role |
-|------|-----------|---------------|------|
-| Industry sector | 60 | ~700 topics | Domain grounding (Healthcare, FinServ, Gaming, ...) |
-| SQL complexity | 3 tiers | 89 concepts | Difficulty level (Beginner → Advanced) |
-| SQL task type | 12 categories | 94 concepts | What the query does (analytics, transformation, ...) |
-| Data quality | 5 challenges | 12 concepts | Dirty data to inject and clean |
-| Knowledge dependency | 3 categories | 9 concepts | Implicit reasoning required |
-| Instruction style | 5 styles | -- | imperative, declarative, interrogative, contextual, abbreviated |
-| Linguistic register | 5 registers | -- | formal, conversational, technical, academic, direct |
-| Politeness level | 4 levels | -- | none, minimal, polite, very polite |
-
-Standard categorical samplers draw independently from their value lists. Data Designer's `SubcategorySamplerParams` creates hierarchical dependencies --- what we call "Semantic Blueprints" --- that ensure internally consistent records. When `industry_sector` samples "Healthcare", `topic` is drawn only from healthcare-specific subcategories. When `sql_complexity` samples "Beginner", `sql_concept` is restricted to foundational SQL operations. This is the difference between realistic training data and random noise.
-
-!!! note "Code snippets in this post are illustrative"
-    The code blocks below show the key configuration patterns for each pipeline stage. Model aliases (`prompt_gen`, `context_gen`, etc.) and companion files (`prompts.py`, `rubrics.py`) are referenced but not fully defined inline.
For a complete, runnable pipeline, see the [Enterprise Text-to-SQL Recipe](../../recipes/code_generation/enterprise_text_to_sql/). - -```python -import data_designer.config as dd - -config = dd.DataDesignerConfigBuilder() - -# Industry → Topic (two-level conditional) -config.add_column(dd.SamplerColumnConfig( - name="industry_sector", - sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams(values=[ - "Healthcare", "Finance", "Technology", "Retail", "Manufacturing", - "Aerospace", "Energy", "Telecommunications", "Transportation", "Education", - # ... 60 industries total - ]), -)) - -config.add_column(dd.SamplerColumnConfig( - name="topic", - sampler_type=dd.SamplerType.SUBCATEGORY, - params=dd.SubcategorySamplerParams( - category="industry_sector", - values={ - "Healthcare": ["Electronic Health Records", "Telemedicine Platforms", - "Clinical Trials", "Patient Scheduling", "Insurance Claims"], - "Finance": ["Fraud Detection", "Trading Systems", "Risk Assessment", - "Portfolio Management", "Regulatory Compliance"], - "Technology": ["Cloud Platforms", "ML Pipelines", "DevOps Tools", - "API Gateway Logs", "User Analytics"], - "Retail": ["Inventory Management", "Customer Segmentation", - "Pricing Optimization", "Supply Chain", "Returns Processing"], - # ... 
700 subcategories across all industries - }, - ), -)) - -# Complexity → SQL Concept (two-level conditional) -config.add_column(dd.SamplerColumnConfig( - name="sql_complexity", - sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams(values=["Beginner", "Intermediate", "Advanced"]), -)) - -config.add_column(dd.SamplerColumnConfig( - name="sql_concept", - sampler_type=dd.SamplerType.SUBCATEGORY, - params=dd.SubcategorySamplerParams( - category="sql_complexity", - values={ - "Beginner": ["Basic SELECT Statements", "WHERE Clauses", "Simple Aggregations", ...], - "Intermediate": ["Window Functions", "CASE Expressions", "Correlated Subqueries", ...], - "Advanced": ["Recursive CTEs", "Frame Clauses", "Pivot/Unpivot", ...], - }, - ), -)) - -# Dialect control (one value per run; the pipeline runs once per dialect) -config.add_column(dd.SamplerColumnConfig( - name="sql_dialect", - sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams(values=["SQLite"]), # or "MySQL", "PostgreSQL" -)) - -# Task type restricted by complexity via conditional_params -task_types = { - "Foundational Queries & DML": [...], - "Data Quality & Validation": [...], - "Advanced Analytics & Windowing": [...], - "Schema, DDL & Performance": [...], - # ... 12 task types total -} - -task_type_conditional_params = { - "sql_complexity == 'Beginner'": dd.CategorySamplerParams( - values=["Foundational Queries & DML", "Data Quality & Validation", ...] - ), - "sql_complexity == 'Advanced'": dd.CategorySamplerParams( - values=["Advanced Analytics & Windowing", "Schema, DDL & Performance", ...] - ), -} - -config.add_column(dd.SamplerColumnConfig( - name="sql_task_type", - sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams(values=list(task_types.keys())), - conditional_params=task_type_conditional_params, -)) -``` - -Prompt diversity is controlled independently through three additional samplers (instruction style, linguistic register, politeness level). 
Because these are combinatorial (5 × 5 × 4 = 100 style combinations), even records with identical domain and SQL metadata will produce stylistically distinct prompts. A CFO asking "Can you pull the Q3 numbers?" and an engineer saying "Write a query that joins sales on customer_id" should both produce correct SQL. - ---- - -## **Step 2: Generating Natural-Language Prompts** - -The prompt generation step produces a single natural-language request to a data assistant. The LLM receives all sampled metadata via Jinja2 template variables and must produce a request that: - -- Describes a **business problem**, not a SQL specification (no SQL jargon allowed) -- Matches the sampled instruction style, linguistic register, and politeness level -- Implicitly requires the sampled SQL concept, task type, data quality handling, and knowledge dependency -- Uses realistic thresholds appropriate for small sample data (5-10 rows per table) - -```python -config.add_column(dd.LLMTextColumnConfig( - name="sql_prompt", - model_alias="prompt_gen", - system_prompt=( - "You write natural-language requests to a data assistant. " - "You adapt your writing style based on the specified instruction style, " - "linguistic register, and politeness level." 
- ), - prompt=( - "Write a single-sentence, natural-language request to a data assistant.\n\n" - "## Style Requirements\n" - "* Instruction Style: {{ instruction_style }}\n" - "* Linguistic Register: {{ linguistic_register }}\n" - "* Politeness Level: {{ politeness_level }}\n\n" - "## Grounding Requirements\n" - "* Industry: {{ industry_sector }} / {{ topic }}\n" - "* SQL Complexity: {{ sql_complexity }} ({{ sql_concept }})\n" - "* Task: {{ sql_task_type }} ({{ sql_task_concept }})\n" - "* Data Quality: {{ data_quality_challenge }} ({{ data_quality_concept }})\n" - "* Knowledge: {{ knowledge_dependency }} ({{ knowledge_concept }})\n" - ), -)) -``` - -Here are example prompts generated from the same underlying SQL concept (window functions) but with different style settings: - -| Style | Example Prompt | -|-------|---------------| -| imperative / formal / none | List each sales representative alongside their quarterly revenue and the running total across the team, ordered by performance. | -| interrogative / conversational / polite | Hey, could you show me how each rep's quarterly numbers stack up against the team's running total? | -| abbreviated / direct / none | Sales rep quarterly revenue, running team total, ranked by performance | -| contextual / academic / polite | For the upcoming performance review, could you provide each representative's quarterly revenue figures alongside a cumulative team total? | - ---- - -## **Step 3: Schema and Data Generation with Distractor Injection** - -This is the most distinctive stage of the pipeline. For each record, the LLM generates a complete database schema (DDL) and sample data (INSERT statements) in the target SQL dialect. 
The schema must include both the tables needed to answer the prompt *and* deliberate noise: - -- **3–5 core tables** directly related to the industry/topic, connected via foreign keys -- **1–2 distractor tables** that are plausible for the domain but *not* needed to answer the prompt, each with FK relationships to core tables and 5-10 rows of realistic data -- **3–5 distractor columns per table** (e.g., `created_at`, `updated_by`, `description`, `is_active`) that are realistic but irrelevant to the query -- **Dirty data** injected according to the sampled `data_quality_concept` -- stored in TEXT/VARCHAR columns so the schema itself doesn't enforce type correctness - -In production, you rarely get just the 2 tables you need; you're more likely to get a schema with 50 tables, many of which look identical. Injecting semantically similar "distractor" tables --- `sales_orders` vs. `sales_orders_archive`, `customer_leads` vs. `active_customers` --- forces the model to perform schema linking based on column constraints and relationships, not just table names. This is the skill gap between academic benchmarks and production. - -The schema prompt requires four clearly labeled sections (`-- Core Tables`, `-- Distractor Tables`, `-- Sample Data for Core Tables`, `-- Sample Data for Distractor Tables`) and enforces determinism by forbidding real-time functions like `NOW()` or `CURRENT_DATE` in INSERT statements. 
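The two structural rules in the paragraph above (four labeled sections, no real-time functions in INSERT statements) lend themselves to a mechanical check. The following is a minimal sketch, assuming the generated `sql_context` is available as a plain string; `check_sql_context` is a hypothetical helper written for illustration and is not part of Data Designer. The statement split on `;` is deliberately naive and would mis-handle string literals that contain semicolons.

```python
import re

# Section headers the schema prompt asks for, as listed in the post.
REQUIRED_SECTIONS = [
    "-- Core Tables",
    "-- Distractor Tables",
    "-- Sample Data for Core Tables",
    "-- Sample Data for Distractor Tables",
]

# Real-time functions named in the post; extend per dialect as needed (assumption).
FORBIDDEN = re.compile(r"\b(NOW\s*\(|CURRENT_DATE\b)", re.IGNORECASE)


def check_sql_context(sql_context: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    lines = [line.strip() for line in sql_context.splitlines()]
    for header in REQUIRED_SECTIONS:
        if header not in lines:
            problems.append(f"missing section header: {header}")
    # Naive statement split on ';' (hypothetical simplification, see lead-in).
    for statement in sql_context.split(";"):
        # Drop comment lines so a section header above an INSERT does not hide it.
        body = "\n".join(
            line for line in statement.splitlines() if not line.strip().startswith("--")
        ).strip()
        if body.upper().startswith("INSERT") and FORBIDDEN.search(body):
            problems.append(f"real-time function in INSERT: {body[:60]}")
    return problems
```

A check along these lines could run on the output of the schema-generation column, whose configuration follows.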
- -```python -config.add_column(dd.LLMCodeColumnConfig( - name="sql_context", - model_alias="context_gen", - system_prompt="You are an expert SQL database architect who designs well-structured, normalized schemas.", - prompt=( - "Generate {{ sql_dialect }} DDL and sample data for tables relevant to the instruction.\n" - "Instruction: {{ sql_prompt }}\n\n" - "Requirements:\n" - "* Include 3–5 core tables for {{ industry_sector }}/{{ topic }}\n" - "* Include 1–2 distractor tables (plausible but NOT needed for the instruction)\n" - "* Include 3–5 distractor columns per table\n" - "* Introduce {{ data_quality_concept }} dirty data issues\n" - "* Use section headers: -- Core Tables, -- Distractor Tables, etc.\n" - "* No NOW()/CURRENT_DATE in INSERT statements\n" - ), - code_lang=dd.CodeLang.SQL_SQLITE, # or SQL_MYSQL, SQL_POSTGRES -)) -``` - ---- - -## **Step 4: Dialect-Specific SQL Generation** - -The SQL generation step receives the natural-language prompt and the generated schema context, then produces an executable query in the target dialect. The prompt enforces several constraints that are critical for training quality: - -- **Only reference defined tables/columns** -- the LLM is strictly forbidden from inventing schema elements -- **Handle dirty data** -- the query must clean data issues (CAST, REPLACE, SUBSTR, regex) before computing results -- **Ignore distractors** -- no unnecessary joins or column selections; distractor elements must be left untouched -- **Anchor relative time** -- instead of `CURRENT_DATE`, anchor to `(SELECT MAX(date_col) FROM table)` for reproducibility -- **Dialect-specific syntax** -- SQLite uses `strftime`, MySQL uses `DATE_SUB`, PostgreSQL uses `::` casting and `interval`. 
Each dialect also has prompt-level constraints to ensure portability (e.g., SQLite prompts exclude `LATERAL` joins and `REGEXP_REPLACE`; MySQL prompts exclude `REGEXP_REPLACE` for pre-8.0 compatibility and `CONVERT_TZ` to avoid unpopulated timezone table issues) - -```python -config.add_column(dd.LLMCodeColumnConfig( - name="sql", - model_alias="sql_gen", - system_prompt="You are an expert SQL programmer. Return only the final SQL.", - prompt=( - "Write {{ sql_dialect }} SQL for the instruction using only the provided database context.\n" - "Instruction: {{ sql_prompt }}\n\n" - "Database Context:\n{{ sql_context }}\n\n" - "* Handle {{ data_quality_concept }} issues with cleaning logic\n" - "* Apply {{ knowledge_concept }}\n" - "* Match {{ sql_complexity }} level using {{ sql_concept }}\n" - "* Do NOT join distractor tables or select distractor columns\n" - ), - code_lang=dd.CodeLang.SQL_SQLITE, # or SQL_MYSQL, SQL_POSTGRES -)) -``` - -The pipeline runs independently for each dialect (SQLite, MySQL, PostgreSQL), producing ~32k records per dialect that are combined into the final 96.5k-record dataset. Separating prompt, schema, and query generation across three stages is essential --- when you ask a single prompt to generate all three, the SQL tends to reference tables that don't exist in the schema, or the schema doesn't contain the columns the SQL needs. - -The chain-of-thought traces from the reasoning model teach it to *think like a Data Engineer*: decomposing complex problems, handling edge cases, and verifying logic before writing a single line of code. A typical reasoning trace looks like: - -> "The user wants to filter by date, but the 'timestamp' column is stored as TEXT. I need to first normalize this column using STR_TO_DATE before I can apply the WHERE clause..." - ---- - -## **Step 5: The Quality Waterfall** - -Generating 300,000 samples is straightforward. Ensuring they are correct is the hard part. 
We implemented a rigorous "Quality Waterfall" that rejected over 68% of the generated data. - -### Hard Validation - -Data Designer's built-in code validator checks each SQL query for syntactic correctness against the target dialect: - -```python -config.add_column(dd.ValidationColumnConfig( - name="sql_validity_result", - target_columns=["sql"], - validator_type=dd.ValidatorType.CODE, - validator_params=dd.CodeValidatorParams(code_lang=dd.CodeLang.SQL_SQLITE), -)) -``` - -The validator returns `is_valid` (boolean) and `error_messages` (string). Records that fail parsing are flagged immediately. Supported dialects: `SQL_SQLITE`, `SQL_POSTGRES`, `SQL_MYSQL`, `SQL_TSQL`, `SQL_BIGQUERY`, `SQL_ANSI`. - -### Five LLM Judges - -Beyond syntax validity, we evaluate record *quality* across five judges, each scoring on a 0-4 scale: - -| Judge | What It Evaluates | Scoring Criteria | -|-------|-------------------|-----------------| -| Prompt Judge | Natural-language prompt quality | Naturalness of wording, specificity and clarity, absence of SQL jargon | -| SQL Judge | Generated SQL quality | Relevance (penalizes unnecessary joins to distractor tables), readability, scalability, standards compliance | -| Context Judge | Schema + sample data quality | Relevance (penalizes missing distractors and bare-minimum schemas), readability, scalability, standards compliance | -| Data Quality Judge | Cleaning logic in SQL | Correctness of cleaning logic, efficiency of cleaning method | -| Knowledge Judge | Implicit knowledge application | Correctness of knowledge application, clarity of inference | - -The SQL judge rubric explicitly penalizes distractor usage: - -> *"The SQL should only JOIN or reference tables that are strictly necessary to answer the prompt. 
The database context may include distractor tables that look relevant but are not needed -- penalize queries that unnecessarily join or reference these tables."* - -Each judge provides a score *and* reasoning for each dimension, making it easy to diagnose why a record scored low. After configuring the five `LLMJudgeColumnConfig` columns (see the [full recipe](../../recipes/code_generation/enterprise_text_to_sql/) for complete judge definitions), expression columns extract numeric scores into flat columns for downstream filtering: - -```python -config.add_column(dd.ExpressionColumnConfig( - name="sql_relevance_score", - expr="{{ sql_judge_result.relevance.score if sql_judge_result.relevance.score is not none else '' }}", -)) -``` - ---- - -## **Rich Metadata for Precision Training** - -We didn't just generate text pairs --- we generated structured data. Unlike standard datasets that give you a black box of question → SQL, every single record is tagged with rich, granular metadata: - -| Field | Description | Example Values | -|-------|-------------|----------------| -| `industry_sector` | Domain vertical | Healthcare, Finance, Aerospace | -| `topic` | Specific subdomain | Electronic Health Records, Fraud Detection | -| `sql_complexity` | Difficulty tier | Beginner, Intermediate, Advanced | -| `sql_concept` | Target SQL skill | Window Functions, Recursive CTEs | -| `sql_dialect` | Target database | PostgreSQL, MySQL, SQLite | -| `instruction_style` | Prompt style | imperative, interrogative, contextual | -| `linguistic_register` | Language register | formal, conversational, technical | -| `politeness_level` | Politeness level | none, minimal, polite, very polite | -| `data_quality_challenge` | Dirty data type | Type Mismatches, Temporal Drift | -| `knowledge_dependency` | Reasoning required | Domain Knowledge, Implicit Logic | -| 15 judge scores | Per-dimension scores | 0-4 across 5 judges | - -This allows researchers and engineers to "slice and dice" the training 
data with surgical precision. If you want to fine-tune a model specifically for Finance analytics using Window Functions in PostgreSQL, you can filter for exactly that subset. - ---- - -## **Results** - -| Metric | Value | -|--------|-------| -| Records generated | 300,000 | -| Records after Quality Waterfall | 96,500 | -| Rejection rate | 68% | -| SQL dialects | PostgreSQL, MySQL, SQLite | -| Industry coverage | 60 distinct industries | -| Topic coverage | ~700 distinct subcategories | -| SQL concept coverage | 89 concepts across 3 complexity tiers | -| Syntax validation | 100% verified | -| LLM judges | 5 judges, 15 scoring dimensions | -| Minimum judge score | ≥ 3/4 across all dimensions | - -The high rejection rate is a feature, not a bug. By generating 3x more data than we needed and filtering aggressively, we ensured every record in the final dataset is both syntactically valid and semantically meaningful. - ---- - -## **BIRD Benchmark Results** - -This dataset was shipped in the SFT stage of **Nemotron Super v3**. On the [BIRD SQL benchmark](https://bird-bench.github.io/) (1,534 dev samples, 5-run average), Nemotron Super achieves **41.80% EX** (execution accuracy) --- outperforming GPT-OSS-120B at 38.25%. Including our synthetic dataset in the SFT blend raised Nemotron Super's EX on BIRD by **15 points**, from 26.77% to 41.80%. - -![BIRD SQL Benchmark Results - Nemotron Super EX improves from 26.77% to 41.80%](assets/text-to-sql/bird-benchmark-results.jpg){ width=800 } - -| Model | BIRD EX (%) | -|-------|-------------| -| Nemotron Super (before synthetic text-to-SQL SFT data) | 26.77 | -| GPT-OSS-120B | 38.25 | -| **Nemotron Super (after synthetic text-to-SQL SFT data)** | **41.80** | - -**Caveat on BIRD:** BIRD measures *execution accuracy* (EX) --- whether the query returns the correct result set when run against the ground-truth database. 
This is stricter than exact-match or string similarity, but it can also be inflated by semantically different queries that happen to produce identical result sets on small test data. BIRD's dev set includes dirty data, external knowledge requirements, and multi-table schemas, making it more representative of production SQL than earlier benchmarks like Spider --- but it does not cover all production challenges (e.g., multi-statement transactions, DDL, stored procedures, or the hundreds-of-tables schemas common in enterprise warehouses). Results here are on the 1,534-sample dev split averaged over 5 runs. - ---- - -## **Key Takeaways** - -1. **Conditional sampling prevents incoherent records.** `SubcategorySamplerParams` ensures "Geospatial SQL" only appears with "Advanced" complexity, and "Electronic Health Records" only appears with "Healthcare". Independent samplers would produce nonsensical combinations that confuse training. - -2. **Three-stage generation beats one-shot.** Separating prompt, schema, and query generation ensures the SQL actually references the tables that exist. One-shot generation frequently hallucinates tables. - -3. **Dirty data must be intentional.** Explicitly prompting for anti-patterns (dates as text, currency symbols, JSON blobs) forces the model to learn defensive SQL. Clean schemas produce clean-only training data. - -4. **Distractor tables teach schema linking.** Injecting semantically similar but irrelevant tables forces the model to *read* the schema instead of guessing from table names. This is the skill gap between academic benchmarks and production. - -5. **Per-dialect generation avoids lowest-common-denominator SQL.** Rather than generating ANSI SQL and hoping it works everywhere, the pipeline produces dialect-specific schemas and queries with appropriate syntax (`strftime` vs `DATE_SUB` vs `interval`). Each dialect gets its own tailored prompts, validators, and judge prompts. - -6. 
**Hard validators are non-negotiable for code.** LLM judges can assess quality, but they can't reliably detect syntax errors. Syntax validators catch parsing failures that the judge misses. - -7. **Multi-dimension scoring enables targeted filtering.** A query that scores 4 on Relevance but 1 on Efficiency tells you the model understood the task but wrote a bad plan. You can filter differently depending on what you're training for. - -8. **Chain-of-thought teaches reasoning, not just syntax.** Including reasoning traces in the training data teaches models to decompose problems, handle edge cases, and verify logic --- acting as a Data Engineer rather than a translator. - ---- - -## **Next Steps** - -- **Code Sandbox for semantic correctness.** The current Quality Waterfall validates syntax and assesses quality (LLM judges), but it doesn't verify whether the query actually returns the right results. A natural next step would be adding Code Sandbox support to Data Designer --- executing generated SQL against a ground-truth database and comparing results to enable execution-based filtering, end-to-end verification, and hard negative mining for preference training. -- **RL on BIRD.** Run reinforcement learning experiments using the [NeMo Gym](https://github.com/NVIDIA-NeMo/Gym) RL environment for BIRD, training models to improve execution accuracy through reward signals from actual query execution. -- **Schema representation.** Improve how schemas are represented in prompts to close the gap with SOTA approaches that use richer structural encodings (e.g., foreign key graphs, column descriptions, value examples). -- **More benchmarks.** Incorporate additional SQL benchmarks --- [Spider 2.0](https://spider2-sql.github.io/), [LiveSQLBench](https://livesqlbench.ai/) --- to evaluate generalization beyond BIRD and drive the next iteration of the pipeline. 
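As a rough illustration of the execution-based filtering idea in the first bullet above, the sketch below loads a schema and its sample data into an in-memory SQLite database, runs a candidate query and a reference query, and compares the result sets. It uses only Python's standard `sqlite3` module. `run_query` and `results_match` are hypothetical helpers, not Data Designer features, and the order-insensitive comparison is an assumption that would need revisiting for queries where `ORDER BY` matters.

```python
import sqlite3
from collections import Counter


def run_query(sql_context: str, query: str) -> list[tuple]:
    """Build a fresh in-memory database from the DDL/INSERTs and run one query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sql_context)  # executes every statement in the context
        return conn.execute(query).fetchall()
    finally:
        conn.close()


def results_match(sql_context: str, candidate_sql: str, reference_sql: str) -> bool:
    """True if both queries execute and return the same multiset of rows."""
    try:
        candidate = run_query(sql_context, candidate_sql)
        reference = run_query(sql_context, reference_sql)
    except sqlite3.Error:
        return False  # treat any execution failure as a mismatch
    # Counter compares rows regardless of order but respects duplicates.
    return Counter(candidate) == Counter(reference)
```

Each query gets its own fresh database, so a candidate that mutates data cannot affect the reference run. This only covers SQLite; MySQL and PostgreSQL would need a real server or container behind the same interface.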
- ---- - -## **Try It Yourself** - -The snippet below builds a simplified text-to-SQL pipeline for SQLite using Data Designer. It covers the core stages -- seeding & diversification, prompt generation, schema generation with distractors, SQL generation, syntax validation, and LLM judge scoring. - -
-Minimal example: text-to-SQL pipeline for SQLite - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -MODEL_ALIAS = "nvidia-text" - -# Build the pipeline (uses default NVIDIA provider via NVIDIA_API_KEY) -data_designer = DataDesigner() -config = dd.DataDesignerConfigBuilder() - -# --- Stage 1: Seeding & diversification --- -config.add_column(dd.SamplerColumnConfig( - name="industry_sector", sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams(values=["Healthcare", "Financial Services", "Retail"]), -)) -config.add_column(dd.SamplerColumnConfig( - name="sql_complexity", sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams(values=["Beginner", "Intermediate", "Advanced"]), -)) -config.add_column(dd.SamplerColumnConfig( - name="instruction_style", sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams( - values=["imperative", "declarative", "interrogative", "contextual", "abbreviated"] - ), -)) - -# --- Stage 2: Natural-language prompt --- -config.add_column(dd.LLMTextColumnConfig( - name="sql_prompt", model_alias=MODEL_ALIAS, - prompt=( - "Write a natural-language request to a data assistant about {{ industry_sector }}.\n" - "Style: {{ instruction_style }}. Complexity: {{ sql_complexity }}.\n" - "Describe the business problem without SQL jargon." - ), -)) - -# --- Stage 3: Schema + data with distractors --- -config.add_column(dd.LLMCodeColumnConfig( - name="sql_context", model_alias=MODEL_ALIAS, - prompt=( - "Generate SQLite DDL and sample data for: {{ sql_prompt }}\n" - "Include 3-5 core tables, 1-2 distractor tables, distractor columns per table.\n" - "Use section headers: -- Core Tables, -- Distractor Tables, etc." 
- ), - code_lang=dd.CodeLang.SQL_SQLITE, -)) - -# --- Stage 4: SQL generation --- -config.add_column(dd.LLMCodeColumnConfig( - name="sql", model_alias=MODEL_ALIAS, - prompt=( - "Write SQLite SQL for: {{ sql_prompt }}\n" - "Database Context:\n{{ sql_context }}\n" - "Ignore distractor tables/columns. Handle dirty data." - ), - code_lang=dd.CodeLang.SQL_SQLITE, -)) - -# --- Stage 5: Validation + judge --- -config.add_column(dd.ValidationColumnConfig( - name="sql_validity", - target_columns=["sql"], - validator_type=dd.ValidatorType.CODE, - validator_params=dd.CodeValidatorParams(code_lang=dd.CodeLang.SQL_SQLITE), -)) - -config.add_column(dd.LLMJudgeColumnConfig( - name="sql_judge", model_alias=MODEL_ALIAS, - prompt=( - "Grade the SQL quality.\n" - "Prompt: {{ sql_prompt }}\nContext: {{ sql_context }}\nSQL: {{ sql }}\n" - "Penalize unnecessary joins to distractor tables." - ), - scores=[ - dd.Score(name="relevance", description="Uses only necessary tables/columns", - options={"4": "Perfect", "3": "Minor extras", "2": "Unnecessary joins", "1": "Largely irrelevant", "0": "Wrong"}), - dd.Score(name="readability", description="Code clarity and formatting", - options={"4": "Excellent", "3": "Good", "2": "Adequate", "1": "Poor", "0": "Unreadable"}), - ], -)) - -# Generate -preview = data_designer.preview(config, num_records=10) -preview.display_sample_record() -``` - -
- -
-Full recipe: enterprise_text_to_sql.py (self-contained, runnable) - -[Download Code :octicons-download-24:](../../assets/recipes/code_generation/enterprise_text_to_sql.py){ .md-button download="enterprise_text_to_sql.py" } - -```python ---8<-- "assets/recipes/code_generation/enterprise_text_to_sql.py" -``` - -
- ---- - -## **Summary** - -This dataset is the result of a cross-functional effort across the NeMo Data Designer and Nemotron teams at NVIDIA, combining expertise in synthetic data generation, SQL engineering, and large-scale model training. - -Because this pipeline is encapsulated in Data Designer, the configuration can be shared with any team --- allowing them to fork our baseline, swap in their own schemas or industry verticals, and generate a custom, high-fidelity dataset for their specific domain. - ---- - -**Key Resources:** - -- **NeMo Data Designer:** [github.com/NVIDIA-NeMo/DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) -- **BIRD Benchmark:** [bird-bench.github.io](https://bird-bench.github.io/) -- **Spider 2.0 Benchmark:** [spider2-sql.github.io](https://spider2-sql.github.io/) ---- - -*Want to learn more about NeMo Data Designer? Check out our [documentation](https://github.com/NVIDIA-NeMo/DataDesigner) and start building your own high-fidelity synthetic datasets today.* diff --git a/docs/devnotes/posts/vlm-long-document-understanding.md b/docs/devnotes/posts/vlm-long-document-understanding.md deleted file mode 100644 index 791bd04af..000000000 --- a/docs/devnotes/posts/vlm-long-document-understanding.md +++ /dev/null @@ -1,604 +0,0 @@ ---- -date: 2026-04-28 -authors: - - nmulepati - - trintamaki - - etramel - - nvilia ---- - -# **Training a VLM to Understand Long Documents: An Iterative SDG Story** - -How do you teach a VLM to read charts, cross-reference tables, and reason over 100+ page PDFs? We generated ~11.4M synthetic visual question-answer pairs (~45B tokens, including questions, answers, thinking traces, and vision tokens) with NeMo Data Designer to improve long-document visual reasoning in a multimodal model. We used [MMLongBench-Doc](https://arxiv.org/abs/2407.01523) as our main evaluation target throughout the project, tracking both overall progress and the specific document-reasoning capabilities the model was still missing. 
In this post, we cover what worked and what didn't. - - - -
- -![From scattered PDF documents to structured visual question-answer pairs via synthetic data generation](assets/vlm-long-document-understanding/vlm-sdg-hero.png) - -
- ---- - -MMLongBench-Doc tests whether a VLM can answer questions about long, multi-page PDF documents — the kind with tables, charts, diagrams, and dense text spread across dozens of pages. The benchmark is hard because it requires genuine visual reasoning: reading bar charts, counting elements in diagrams, synthesizing evidence from tables, charts, and text spread across dozens of pages, and knowing when a question simply can't be answered from the available content. - -Our starting point was an early version of *Nemotron-3-Nano-Omni-30B-A3B*, a multimodal model that responded "Unanswerable" to almost everything on this benchmark. The score: **26%**. We needed training data that would teach the model to actually look at documents and reason about what it sees. - -Here's how we got there. - ---- - -## **The Approach: Iterative Pipeline Development** - -Rather than designing one pipeline and generating millions of rows, we worked in phases. Each phase produced a new SDG pipeline, a new batch of training data, and a new round of evaluation and failure analysis that showed which document-reasoning capabilities were still missing. The tight feedback loop between SDG pipeline changes and training results was essential. - -
- -![The iterative development loop: design pipeline, generate data, train model, evaluate and analyze, repeat](assets/vlm-long-document-understanding/iterative-loop.png){ style="max-width:85%; height:auto" } - -
- -To keep our iterations fast, we measured the usefulness of our generated data in SFT and RL training runs that used primarily our generated data and a small amount of other data sources. We also experimented with different training configs such as sequence length and different strategies for representing multiple Q&A for the same document. In parallel, we ran bigger experiments with more complete datamixes to make sure our improvements would carry over to the final *Nemotron-3-Nano-Omni-30B-A3B* training recipe. - -The final training blend drew from four data-generation streams — an OCR-based text pipeline, a classification-filtered visual QA pipeline targeting charts, tables, and diagrams, a general single-page QA pipeline, and multi-page / whole-document QA pipelines — all filtered by a strong independent judge before contributing to the SFT dataset. Each stream is described in the phases below. - -
- -![Pipeline architecture: four data-generation streams — text-LLM QA, classification-filtered visual QA, general single-page QA, and multi-page / whole-document QA — all filtered by a strong independent judge before contributing to the SFT training blend](assets/vlm-long-document-understanding/pipeline-architecture.png){ style="max-width:85%; height:auto" } - -
 - -The rest of this post walks through the work in order: first the seed data we collected, then each pipeline phase in sequence, followed by the infrastructure lessons and what we learned along the way. - -!!! note - Unless explicitly noted otherwise, the scores below come from experimental development runs rather than the final released model. The data generated through this effort was incorporated into the final training blend for the released *Nemotron-3-Nano-Omni-30B-A3B* model. - ---- - -## **Seed Data: Building the Document Corpus** - -Every SDG pipeline needs source material. Ours was a growing corpus of real PDF documents, rendered as page images with one row per page. As training results revealed gaps, we expanded the corpus to cover the missing document types and reasoning modes. - -- **Common Crawl PDFs** were the starting point. We filtered a large collection of web-crawled PDFs to medium- and long-form documents where cross-page reasoning and long-context visual understanding matter most, yielding ~8.2 million page images. Common Crawl gave us breadth: reports, presentations, manuals, forms, and everything else that ends up on the public web. - -- **SEC filings** came next. We added 10-K filings and other financial documents in late January '26 to expand coverage of dense tables, cross-referenced numerical data, and long-form financial layouts. They were useful, but their extreme page counts also taught us a lesson: squeezing very long documents into fixed context windows reduced image resolution enough to hurt training until we learned to cap extreme page counts and increase sequence length. - -- **arXiv papers** expanded coverage of scientific figures, equations, multi-panel plots, and citation-heavy layouts. Across multiple random samples from the arXiv 2023 corpus, we generated over 2.4 million page images. - -- **Technical manuals and brochures** arrived in late February '26 from an internal data acquisition team.
This added ~432K English-language in-domain pages rich in diagrams, schematics, and step-by-step procedural layouts — exactly the content types where our model scored lowest. - -The benchmark itself contains publicly available documents that may be included in large-scale text pretraining datasets. However, our model's very low initial score suggests the model hasn't memorized the documents or benchmark question and answer pairs. We used heuristics such as document name matching to try to check that we don't inadvertently add benchmark documents to our long-document dataset. Additionally, we audited our generated Q&A and found negligible Q&A overlap with the benchmark; where overlap occurred, the corresponding documents were entirely different. - -Regardless of source, each dataset then went through the same preparation pipeline: PDFs were rendered to per-page PNG images, stored on the shared filesystem, and referenced by file path in parquet files. Grouping scripts then produced three seed variants: **per-page seeds** (one page image per row, for classification and single-page QA), **windowed seeds** (disjoint sets of 2–8 consecutive pages, with window size scaling by document length, for cross-page QA), and **whole-document seeds** (all pages of a PDF grouped together for full-document QA). Together, these gave us single-page, multi-page, and whole-document generation from the same source documents. - ---- - -## **Phase 1: Getting Data on the Board** - -We started simple on purpose. The first pipeline split the problem into OCR transcription with [Nemotron Parse](https://build.nvidia.com/nvidia/nemotron-parse) and question-answer generation from those transcripts with GPT-OSS-120B. 
- -Nemotron Parse produced structured text with bounding boxes and semantic classes such as title, section, caption, tables, and image, and its transcriptions were consistently accurate even on dense layouts with tables, charts, and mixed content: - -```python -config.add_column( - dd.LLMTextColumnConfig( - name="raw_ocr_output", - model_alias="ocr", # Nemotron-Parse-v1.1 - prompt="", # Nemotron Parse does not require a user prompt - multi_modal_context=[ - dd.ImageContext( - column_name="png_path", - data_type=dd.ModalityDataType.URL, - image_format=dd.ImageFormat.PNG, - ), - ], - drop=True, - ) -) - -config.add_column( - dd.CustomColumnConfig( - name="transcribed_texts", - generator_function=parse_ocr_output, - ) -) -``` - -The GPT-OSS stage then generated questions and answers from those transcripts: - -```python -class QuestionAnswer(BaseModel): - question: str = Field(..., description="The question to be answered.") - answer: str = Field(..., description="The correct answer to the question.") - -config.add_column( - dd.LLMStructuredColumnConfig( - name="question_and_answer", - model_alias="gpt-oss", # GPT-OSS-120B - prompt=QUESTION_ANSWER_PROMPT, - output_format=QuestionAnswer, - ) -) -``` - -Splitting OCR and QA into separate pipelines meant we only needed to run Nemotron Parse once per seed dataset. After that, we could iterate on QA prompts and generation parameters as many times as needed without re-transcribing. The first large-scale run produced millions of OCR transcriptions across dozens of SLURM jobs, followed by a comparable number of QA pairs. - -While we stopped generating new OCR-based QA after Phase 1, we still sampled a subset of the QA generated in this phase to retain in the final training blend. - -**Result:** A modest gain in raw overall accuracy, from around 26% to around 28%. The model stopped answering "Unanswerable" to everything and started producing real answers across all categories. The approach was validated. 
- ---- - -## **Phase 2: Seeing the Document** - -Text-only QA from OCR misses everything that makes documents visual: the layout of a bar chart, the structure of a flowchart, the spatial relationships in a floor plan. MMLongBench-Doc specifically tests these capabilities. We needed the model generating training data to actually *look* at the pages. - -Our first attempt was to reuse Nemotron Parse's bounding boxes and coarse element tags to scope QA generation. We fed this metadata into both the GPT-OSS text pipeline and the Qwen3-VL visual pipeline, hoping to tell the model which region of the page to focus on. For GPT-OSS, the boxes provided spatial context alongside the OCR transcript. For Qwen3-VL, we asked the model to restrict its visual attention to the specified region. - -Neither approach worked well. The bbox labels were too coarse to guide useful QA generation: a `table` tag doesn't tell you whether the content is a financial table, a comparison table, or a pivot table, and the `image` tag was even broader. Because the source corpus contained all kinds of images, asking the model to focus only on OCR-tagged `image` regions surfaced everything from company logos and decorative graphics to data-rich charts and diagrams, which produced many low-value QA pairs. Qwen3-VL also couldn't reliably stay inside the specified bounding box when we tried to enforce the spatial constraint, which further reduced QA quality. - -This failure directly motivated what came next: instead of trying to constrain a model's attention to a region, we built a **classification stage** that examines each page image as a whole and determines what visual elements it actually contains. - -
- -![Classification taxonomy: 8 primary categories with 45 subcategories covering quantitative, logic diagrams, hierarchical, spatial, schematic, tabular, infographic, and none](assets/vlm-long-document-understanding/classification-taxonomy.png){ style="max-width:90%; height:auto" } - -
- -Each category maps to specific subcategories — for example, QUANTITATIVE includes bar charts, line graphs, scatter plots, pie charts, area graphs, histograms, box plots, heatmaps, and bubble charts. TABULAR distinguishes simple tables from nested, pivot, comparison, and financial tables. The NONE category catches decorative images, photographs, plain text blocks, and presentation slides with only bullet points. - -The classification prompt instructs the model to score reasoning complexity on a 1–10 scale: high complexity (8–10) requires multi-step inference like cross-referencing data sources or conditional logic chains, medium (4–7) requires single-step analysis like direct comparisons, and low (1–3) covers simple lookups. - -```python -class PageClassification(BaseModel): - contains_reasoning_content: bool - primary_categories: list[VisualElementCategory] - subcategories: list[VisualElementSubcategory] - reasoning_complexity_score: int - justification: str - - -config.add_column( - dd.LLMStructuredColumnConfig( - name="page_classification", - model_alias="qwen-vl", # Qwen3-VL-30B-A3B - prompt=CLASSIFICATION_PROMPT, - output_format=PageClassification, - multi_modal_context=[ - dd.ImageContext( - column_name="png_path", - data_type=dd.ModalityDataType.URL, - image_format=dd.ImageFormat.PNG, - ) - ], - ) -) -``` - -The taxonomy covers 8 primary categories with 45 subcategories. Running classification over our Common Crawl corpus revealed how much of the data actually contained visual reasoning content: - -
- - -
Page classification results (bar chart):

| Category | Pages |
| :--- | :---: |
| NONE | 5.9M (72%) |
| TABULAR | 1.5M |
| INFOGRAPHIC | 763K |
| QUANTITATIVE | 561K |
| SCHEMATIC | 277K |
| SPATIAL | 157K |
| LOGIC_DIAGRAMS | 124K |
| HIERARCHICAL | 61K |

Filter: `contains_reasoning_content = true`

Kept for QA: 2.3M pages (28%)
- -
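The filter in the figure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the `keep_for_qa` helper and the sample `records` are hypothetical, and only the `contains_reasoning_content` and `primary_categories` field names come from the `PageClassification` schema. The last lines check the reported split using the page counts from the figure.

```python
def keep_for_qa(records):
    """Keep only pages whose classification flagged visual reasoning content."""
    return [
        r for r in records
        if r["page_classification"]["contains_reasoning_content"]
    ]


# Hypothetical classified pages (paths and labels are made up for illustration).
records = [
    {"png_path": "doc1/page_001.png",
     "page_classification": {"contains_reasoning_content": False,
                             "primary_categories": ["NONE"]}},
    {"png_path": "doc1/page_002.png",
     "page_classification": {"contains_reasoning_content": True,
                             "primary_categories": ["TABULAR"]}},
    {"png_path": "doc2/page_001.png",
     "page_classification": {"contains_reasoning_content": True,
                             "primary_categories": ["QUANTITATIVE", "TABULAR"]}},
    {"png_path": "doc2/page_002.png",
     "page_classification": {"contains_reasoning_content": False,
                             "primary_categories": ["NONE"]}},
]

kept = keep_for_qa(records)
print(f"kept {len(kept)} of {len(records)} sample pages")

# Sanity check on the reported numbers: 5.9M NONE pages and 2.3M kept pages.
none_pages, kept_pages = 5.9e6, 2.3e6
kept_share = kept_pages / (none_pages + kept_pages)
print(f"kept share: {kept_share:.0%}, filtered share: {1 - kept_share:.0%}")
```

A page can carry more than one primary category, which would explain why the per-category counts in the figure add up to more than the 2.3M pages kept.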
- -72% of pages were plain text, decorative images, or content without reasoning potential. This filtering step was critical: the expensive QA generation stage only ran on the pages where visual reasoning questions were actually possible. - -The QA stage then used the classification to guide question generation. The prompt instructs the model to focus on the specific visual element type identified: - -```python -config.add_column( - dd.LLMStructuredColumnConfig( - name="question", - model_alias="qwen-vl-instruct", # Qwen3-VL-235B-A22B-Instruct-FP8 - prompt=QUESTION_PROMPT, - output_format=Question, - multi_modal_context=[ - dd.ImageContext( - column_name="png_path", - data_type=dd.ModalityDataType.URL, - image_format=dd.ImageFormat.PNG, - ) - ], - ) -) -``` - -**Result:** Training with the visual QA data pushed the score to **37%** — an +11 point improvement over the 26% baseline. The model was learning to read charts and tables. - ---- - -## **Phase 3: Thinking Models and Prompt Engineering** - -The jump from Phase 2 to Phase 3 was driven by two observations from the training results: the model was getting questions wrong that it should have gotten right, and many of the generated questions were either too complex or too trivial. - -We made three changes simultaneously: - -**1. Always use a strong thinking model.** We moved to Qwen3-VL-235B-A22B-**Thinking**, which generates an internal chain-of-thought before producing its answer. While smaller models are faster to run, bigger and stronger models are often qualitatively better at data generation. Data Designer captures the reasoning trace separately via the `extract_reasoning_content=True` flag. - -**2. Dropped structured output for free-form text.** The Phase 2 pipeline used `LLMStructuredColumnConfig` with Pydantic schemas for questions and answers. This guaranteed parseable output but constrained the model's generation. 
When we inspected the thinking traces from the structured output runs, the chain-of-thought was polluted with JSON formatting: the model was spending reasoning tokens figuring out how to fit its answer into the schema rather than reasoning about the document. We switched every column to `LLMTextColumnConfig`. - -**3. Rewrote every prompt.** Based on what we observed in the data produced by Phase 2, we added more instructions and guardrails. The question prompt nearly doubled in length. The key additions include: - -- **Complexity targeting:** Questions should require at least one step of reasoning — not direct lookup. Also, we explicitly listed anti-patterns to avoid such as ambiguous questions. - -- **Verifiability checklist:** Before outputting a question, the model must ask itself: "Can I answer this by looking at the visual? Can I verify if an answer is correct? Is there a clear, unambiguous correct answer?" - -- **Tolerance for correctness:** The answer correctness judge was updated to accept ±5% for numerical answers and equivalent formats ("25%" = "0.25" = "1/4"). The Phase 2 judge was too strict, marking correct answers as wrong over minor formatting differences. - -```python -config.add_column( - dd.LLMTextColumnConfig( - name="answer", - model_alias="qwen-vl-thinking", # Qwen3-VL-235B-A22B-Thinking-FP8 - prompt=ANSWER_PROMPT, - extract_reasoning_content=True, - multi_modal_context=[...], - ) -) -``` - -**Result:** These changes pushed the score to **39%**. The thinking model and prompt improvements were additive — better questions, better answers, and cleaner reasoning traces all contributed. - ---- - -## **Phase 4: Targeting Weak Spots** - -Phase 3 improved overall accuracy, but the per-category breakdown revealed persistent gaps. Image, chart, and layout categories were still lagging behind text, and multi-page reasoning remained the weakest overall at just 17%. 
The model was also saying "Not answerable" too often for questions where the answer was clearly present in the document. - -We reduced the share of unanswerable Q&A we sampled and added explicit guidance to the answer prompt to reduce unnecessary refusals, encouraging the model to attempt an answer before concluding something was unanswerable. - -**Multi-page QA** was the other major addition in this phase. Real-world document understanding often requires reasoning across multiple pages — "What is the total revenue across all quarterly reports?" can't be answered from a single page. - -Our first attempt used **disjoint windows** of consecutive pages (2 pages for short documents up to 8 pages for long ones), requiring cross-page reasoning within each window. We started with windows rather than full documents to make generation faster. Windows were a practical starting point, but they had a ceiling: a model that only sees 4 pages at a time can't generate questions like "how many appendices are in this document?" The real breakthrough came when we switched to a **whole-document** approach using Qwen3.5-397B-A17B, a newer and larger model in the Qwen family, feeding all pages of a PDF to the model at once and asking it to generate questions that genuinely require the entire document. - -```python -config.add_column( - dd.LLMTextColumnConfig( - name="question", - model_alias="qwen3p5-vl", # Qwen3.5-397B-A17B-FP8 - prompt=WHOLE_DOC_PROMPT, - multi_modal_context=[ - dd.ImageContext( - column_name="png_paths", # all pages of the document - data_type=dd.ModalityDataType.URL, - image_format=dd.ImageFormat.PNG, - ) - ], - ) -) -``` - -The whole-document prompt was informed by failure analysis of our model on long-document QA. It emphasized capabilities the model was still weak at: counting and aggregation across pages, extracting complete lists from repeated layouts, cross-page computation, and lookup chains that require evidence from multiple sections of a document. 
The prompt included question-type-specific templates for in-context learning and required the generator to verify that each question genuinely depended on evidence spread across more than one page. - -**Result:** Whole-document QA produced the first strong gains on cross-page reasoning. After SFT with the combined data, the score reached **53%**, with the multi-page subcategory jumping from 26% to 44%. - ---- - -## **Phase 5: Quality at Scale** - -By this point we had broad coverage of the main document-reasoning modes we cared about. The final phase focused on improving the quality, correctness, and training signal of the data we already had. - -**Improved reasoning traces.** We found that our model's reasoning traces were not well structured. We improved the quality of the generated reasoning traces by prompting the data generation model to think in a structured and stepwise fashion. For example, for questions spanning the whole document, we want the model to first scan the pages and find all content relevant to the question. Without this scanning phase, our model sometimes stopped too early, at the first relevant evidence it found. - -**High-quality filtering.** We used a strong independent judge to score a sample of generated QA pairs across five rubrics, with weights tuned to what matters most for training. The judge's role was strictly that of a filter: it flagged low-quality pairs for removal, but none of its outputs (scores, explanations) were included in the SFT training data.
This filtering step dropped about 10% of our samples: - -```python -FINAL_SCORE_WEIGHTS = { - "Answer Correctness": 0.35, - "Training Signal Strength": 0.30, - "Question Quality": 0.15, - "Visual Grounding": 0.10, - "Format Compliance": 0.10, -} - -config.add_column( - dd.LLMJudgeColumnConfig( - name="qa_quality_judge", - model_alias="strong-judge-vlm", - prompt=JUDGE_PROMPT, - scores=[ - answer_correctness_score, - question_quality_score, - visual_grounding_score, - format_compliance_score, - training_signal_score, - ], - multi_modal_context=[ - dd.ImageContext( - column_name="png_base64", - data_type=dd.ModalityDataType.BASE64, - image_format=dd.ImageFormat.PNG, - ) - ], - ) -) - -config.add_column( - dd.CustomColumnConfig( - name="weighted_quality_score", - generator_function=compute_weighted_score, - ) -) -``` - -Using a different model family for evaluation and data generation provides independent quality assessment. The heaviest weight goes to Answer Correctness (0.35) because wrong answers poison training, followed by Training Signal Strength (0.30) because multi-page reasoning was one of the weakest remaining capabilities we needed the data to strengthen. - -**Result:** Best SFT score of **55.7%**. - ---- - -## **RL on top of SFT** - -SFT got the model reading documents, but the error analysis showed two persistent issues: the model was often confidently wrong about what it saw, and it refused too quickly when the evidence was actually there. RL was partly about fixing those specific failures, but what surprised us was how much the rest of the benchmark moved along with them. - -We ran GRPO on the best SFT checkpoint. The single-page rollouts came mostly from [MMPR](https://huggingface.co/datasets/OpenGVLab/MMPR), where the answers are numbers, short lists, or multiple-choice and score cleanly against ground truth, which is why online RL with automatic scoring made more sense than a preference-based approach like MPO. 
For the harder multi-page cases we used an online LLM judge in place of string matching. The biggest improvements landed on the visual categories, which had been the SFT model's weakest areas. - -**Unanswerable data as a hard negative.** We generated "unanswerable" training examples by swapping the image in existing VQA pairs, drawing from 15+ source datasets at zero annotation cost. Mixed into the rollouts at 5–7%, these image-question mismatches reduced hallucinations on unanswerable questions as expected. The surprise was that they also improved accuracy on answerable questions, especially on image, layout, and text. Our best guess is a combination of language-bias mitigation (forcing the model to actually look at the image rather than guess from the question), a hard-negative effect on image-text alignment, and a learned evidence-checking behavior. - -**GRPO over-reinforcing refusal.** "Not answerable" is a short, near-constant output, and GRPO is happy to over-reinforce it once it becomes a viable response in the rollout pool. Mid-training we saw overall accuracy peak, dip as the model leaned too hard on refusal, and then recover as other categories caught up. Without dense evaluation we would have missed this shape entirely, and the eventual fix was a wider plateau rather than earlier stopping. - -**Multi-page remains the largest gap**, and the reason is the verifier rather than the RL algorithm or the compute budget. Our single-page gains rode on MMPR's clean answer formats. The multi-page data we built from MMLongBench-style sources has free-form string answers, where rule-based scoring fails on negations, numerical tolerances, and phrasing variants. The online LLM judge gets around this, but calibrating the judge well enough to train against reliably is now the gating step. - -**Result:** Overall accuracy moved from 55.7% to **59.0%**. 
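The image-swap construction described above can be sketched in a few lines of Python. This is an illustrative sketch under assumptions, not the project's code: the `make_unanswerable` and `mix_rollouts` helpers, the record fields, and the use of "Not answerable" as the target string are hypothetical, since the post does not show the rollout data format. Only the 5–7% mix rate comes from the text.

```python
import random

REFUSAL = "Not answerable"  # assumed target string for mismatched pairs


def make_unanswerable(vqa_pairs, rng):
    """Pair each question with the image of a different example.

    Assumes images are unique per example. A swapped image could still happen
    to answer the question, so real data would likely need a further check.
    """
    if len(vqa_pairs) < 2:
        raise ValueError("need at least two examples to swap images")
    negatives = []
    for i, pair in enumerate(vqa_pairs):
        j = rng.randrange(len(vqa_pairs) - 1)
        if j >= i:
            j += 1  # skip index i so the image always comes from another example
        negatives.append({
            "image": vqa_pairs[j]["image"],
            "question": pair["question"],
            "answer": REFUSAL,
        })
    return negatives


def mix_rollouts(answerable, negatives, rate, rng):
    """Add negatives so they make up roughly `rate` of the mixed pool."""
    if not 0 <= rate < 1:
        raise ValueError("rate must be in [0, 1)")
    n = min(round(rate * len(answerable) / (1 - rate)), len(negatives))
    mixed = list(answerable) + rng.sample(negatives, n)
    rng.shuffle(mixed)
    return mixed


rng = random.Random(0)
pairs = [
    {"image": f"img_{i}.png", "question": f"question {i}", "answer": f"answer {i}"}
    for i in range(94)
]
negatives = make_unanswerable(pairs, rng)
mixed = mix_rollouts(pairs, negatives, rate=0.06, rng=rng)
share = sum(m["answer"] == REFUSAL for m in mixed) / len(mixed)
print(f"{len(mixed)} rollout prompts, {share:.0%} unanswerable")
```

Because the negatives reuse existing images and questions, nothing new has to be annotated, which matches the zero annotation cost noted above.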
- ---- - -## **Infrastructure and Scale** - -Every pipeline in this project ran on SLURM clusters with NVIDIA A100, H100, and B200 GPUs. We used an internal wrapper tool to orchestrate Data Designer jobs on the cluster — each job booted vLLM servers hosting the generation models and ran a Data Designer client alongside them. Large runs were sharded across many SLURM jobs, each processing a partition of the seed dataset. - -Beyond the basic setup, running this at scale broke things we didn't expect. Here are the problems that cost us the most time: - -**File paths over base64.** Our first attempt embedded base64-encoded page images directly in the seed parquet files. This made each row enormous and caused DuckDB (which Data Designer uses for dataset loading) to stall for 10+ minutes per job as it globbed across millions of rows. The fix was simple but transformative: store file paths in the parquet and use vLLM's `--allowed-local-media-path` flag to let the inference server read images directly from the shared filesystem. Dataset loading went from minutes to seconds. - -**Scaling throughput.** We used tensor parallelism to fit models of different sizes (TP=2 for the 30B classification model, TP=4 for the 235B, TP=8 for the 397B QA models), with data-parallel replication to run multiple vLLM server instances per node and fill remaining GPUs. Early runs with the 397B and 235B models in FP16 were slow, so we switched to FP8 quantized versions — a significant speedup with no noticeable quality loss. - -**Benchmarking tool.** Each generation job involved a different model, quantization level, and parallelism configuration, and we had limited ways to predict how long a full run would take before committing hundreds of GPU-hours to it. We built an internal benchmarking tool that ran short profiling jobs to measure throughput (requests/min) and latency distributions for each configuration. 
This let us estimate wall-clock time, catch configuration issues early, and choose the right parallelism settings before launching at scale. - ---- - -## **What We Learned** - -**Classification-first filtering saves everything downstream.** Running a cheap classification model (Qwen3-VL-30B-A3B) over all pages before expensive QA generation filtered out 72% of pages that had no visual reasoning potential. This saved enormous compute on the QA stage and improved data quality since every generated question targeted actual visual content. - -**Structured output schemas leak into thinking traces.** When we used `LLMStructuredColumnConfig` with Pydantic schemas, the thinking models' reasoning traces were contaminated with JSON structures — the model would "think" in JSON rather than natural language. Switching to `LLMTextColumnConfig` with `extract_reasoning_content=True` gave us clean reasoning traces and better questions and answers. - -**Multi-page and whole-document reasoning was the missing capability.** Single-page QA drove steady improvements, but progress plateaued until we added multi-page and whole-document data. Questions requiring reasoning across an entire document — "What is the total across these three quarterly tables?" or "How many appendices are in this document?" — were the training signal the model was missing. - -**Independent evaluation catches what self-evaluation misses.** Using the same model to generate and judge QA pairs (Qwen judging Qwen) has blind spots. Bringing in another strong VLM as an independent judge caught quality issues that self-evaluation missed, and the filtered data produced measurably better training results. - -**Bounding-box scoping didn't help QA generation.** We tried using bbox annotations to scope QA generation to specific page regions, but the downstream QA model couldn't leverage the spatial information effectively. 
This held for both text-only and visual pipelines (see Phase 2), and directly motivated the classification-first approach that worked. - ---- - -## **Results** - -### Evaluation on MMLongBench-Doc - -The table below summarizes performance on MMLongBench-Doc at key checkpoints, including overall accuracy and the per-category breakdown. Baseline refers to the early *Nemotron-3-Nano-Omni-30B-A3B* checkpoint we started from: - -| Category | Baseline | Jan 14 | Jan 23 | Jan 29 | Feb 6 | Feb 9 | Mar 12 | Mar 25 | **Mar 26 (+RL)** | -| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | -| **overall accuracy** | 26.32 | 27.73 | 36.95 | 39.06 | 45.41 | 46.98 | 53.22 | 55.68 | **59.00** | -| text | 30.37 | 33.85 | 35.16 | 35.75 | 39.92 | 44.22 | 51.36 | 54.72 | 57.38 | -| layout | 25.48 | 31.50 | 29.02 | 33.95 | 34.73 | 37.62 | 50.62 | 54.90 | 58.28 | -| table | 21.00 | 26.34 | 23.91 | 28.56 | 37.94 | 41.75 | 55.30 | 57.13 | 58.73 | -| chart | 23.73 | 30.05 | 30.36 | 35.26 | 35.99 | 35.27 | 45.94 | 44.71 | 50.85 | -| image | 23.42 | 27.33 | 25.52 | 32.21 | 35.06 | 38.56 | 47.27 | 52.42 | 55.27 | -| single-page | 35.66 | 41.91 | 42.42 | 47.98 | 50.79 | 53.77 | 58.49 | 61.18 | 64.88 | -| multi-page | 14.49 | 18.41 | 15.40 | 17.06 | 23.65 | 25.57 | 43.94 | 47.22 | 48.41 | -| unanswerable | 25.20 | 13.41 | 57.72 | 53.65 | 66.66 | 64.93 | 56.40 | 57.21 | 62.90 | - -A few things stand out. **Multi-page** was the hardest category to move: it sat at 14–18% for weeks until the whole-document QA pipeline landed in Phase 4, when it jumped to 44% (a 3x improvement over baseline). **Table** showed a strong response to visual QA data, going from 21% to 55%. **Unanswerable** spiked early (the model learned to stop refusing everything) and stayed high. - -By Mar 12, the model matched or exceeded Qwen3-Omni-30B in every category of the MMLongBench-Doc benchmark.
After RL, the final score reached **59%**, closing much of the gap to larger reference models on this benchmark: - -
- - -
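MMLongBench-Doc overall accuracy by model (bar chart):

| Model | Overall accuracy |
| :--- | :---: |
| Nemotron-3-Nano-Omni-30B-A3B (baseline) | 26% |
| Qwen3-Omni-30B-A3B (measured) | 45% |
| Nemotron-3-Nano-Omni-30B-A3B (ours, SFT) | 55.7% |
| Qwen3-VL-235B-A22B | 56.2% |
| Qwen3.5-Omni-Plus | 57.5% |
| Nemotron-3-Nano-Omni-30B-A3B (ours, SFT + RL) | 59% |
| Qwen3.5-35B-A3B | 59.5% |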
- -
- -### Impact on Other Benchmarks - -The synthetic data we generated was blended into the model's broader training recipe alongside other document-understanding and visual-reasoning data. Throughout the project, we tracked impact on other vision benchmarks and found that the data produced positive lift beyond MMLongBench-Doc. The visual reasoning skills developed for chart reading, table understanding, and cross-page reasoning generalized well. Refer to the [Nemotron-3-Nano-Omni technical report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Omni-report.pdf) for final results across the full evaluation suite. - - -## **Upstream Contributions: Features Born from This Project** - -One of the most valuable outcomes of this project was the set of features and fixes it drove back into Data Designer itself. When you push a framework to its limits on a real workload — millions of multimodal records across hundreds of SLURM jobs — you find the gaps fast. Every issue we hit became a PR that benefits all Data Designer users. 
- -- **Multi-image support & images-before-text ordering** ([#257](https://github.com/NVIDIA-NeMo/DataDesigner/pull/257)) — list-of-images columns for multi-page VLM prompts, with image contexts placed before text for better VLM quality -- **Non-LLM concurrency controls** ([#242](https://github.com/NVIDIA-NeMo/DataDesigner/pull/242)) — user-tunable parallelism for plugin-based generators -- **Early shutdown fixes** ([#201](https://github.com/NVIDIA-NeMo/DataDesigner/pull/201), [#203](https://github.com/NVIDIA-NeMo/DataDesigner/pull/203)) — race condition and incomplete disable flag under high error rates -- **RunConfig overhaul** ([#186](https://github.com/NVIDIA-NeMo/DataDesigner/pull/186), [#208](https://github.com/NVIDIA-NeMo/DataDesigner/pull/208), [#209](https://github.com/NVIDIA-NeMo/DataDesigner/pull/209)) — fine-grained shutdown, retry, and buffer controls -- **LLM text response deserialization fix** ([#233](https://github.com/NVIDIA-NeMo/DataDesigner/pull/233)) — preserve mixed-type text responses as-is -- **Seed dataset partitioning** ([#8](https://github.com/NVIDIA-NeMo/DataDesigner/pull/8)) — `PartitionBlock` and `IndexRange` strategies for splitting work across SLURM jobs -- **Wildcard seed paths** ([#12](https://github.com/NVIDIA-NeMo/DataDesigner/pull/12)) — glob support for consuming partitioned parquet output - -These span the full stack — config, engine, interface, integrations — and collectively made Data Designer viable for production-scale multimodal SDG. - ---- - -## **The Compound Effect** - -No single pipeline got us from 26% to 59% — a **2.3× lift**. Each iteration broke something, revealed a gap, or overturned an assumption. The bbox approach failed, so we built a classification taxonomy. Single-page QA plateaued, so we added multi-page. Self-evaluation had blind spots, so we brought in a different model family as judge. 
The progress came from shortening the loop between data generation, training, evaluation, and failure analysis so we could identify missing capabilities and address them quickly. - -As noted at the top of this post, the scores above reflect our experimental development runs. The data generated through this effort was incorporated into the final training blend for the released *Nemotron-3-Nano-Omni-30B-A3B* model, which achieves **57.5%** overall accuracy on MMLongBench-Doc. For the full details on the model, training recipe, and complete benchmark results, see the [Nemotron-3-Nano-Omni technical report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Omni-report.pdf). - ---- - -## **Try For Yourself** - -The recipes below are the self-contained, runnable scripts for each stage of the pipeline described in this post. They are ordered to match the pipeline flow — run them in sequence, feeding the output of each stage as seed data to the next. - - - -
- -| # | Recipe | | -| :---: | :--- | :--- | -| 01 | Seed dataset preparation | [:octicons-download-24:](../../assets/recipes/vlm_long_doc/01-seed-dataset-preparation.py){ download="01-seed-dataset-preparation.py" } | -| 02 | Nemotron Parse OCR transcription | [:octicons-download-24:](../../assets/recipes/vlm_long_doc/02-nemotron-parse-ocr-sdg.py){ download="02-nemotron-parse-ocr-sdg.py" } | -| 03 | Text QA from OCR transcripts | [:octicons-download-24:](../../assets/recipes/vlm_long_doc/03-text-qa-sdg.py){ download="03-text-qa-sdg.py" } | -| 04 | Page classification | [:octicons-download-24:](../../assets/recipes/vlm_long_doc/04-page-classification-sdg.py){ download="04-page-classification-sdg.py" } | -| 05 | Visual QA | [:octicons-download-24:](../../assets/recipes/vlm_long_doc/05-visual-qa-sdg.py){ download="05-visual-qa-sdg.py" } | -| 06 | Single-page QA | [:octicons-download-24:](../../assets/recipes/vlm_long_doc/06-single-page-qa-sdg.py){ download="06-single-page-qa-sdg.py" } | -| 07 | Multi-page windowed QA | [:octicons-download-24:](../../assets/recipes/vlm_long_doc/07-multi-page-windowed-qa-sdg.py){ download="07-multi-page-windowed-qa-sdg.py" } | -| 08 | Whole-document QA | [:octicons-download-24:](../../assets/recipes/vlm_long_doc/08-whole-document-qa-sdg.py){ download="08-whole-document-qa-sdg.py" } | -| 09 | Frontier judgement QA filter | [:octicons-download-24:](../../assets/recipes/vlm_long_doc/09-frontier-judge-sdg.py){ download="09-frontier-judge-sdg.py" } | - -
- ---- - -Key Resources: - -1. [NeMo Data Designer on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner) -2. [Nemotron-3-Nano-Omni technical report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Omni-report.pdf) -3. [NVIDIA Tech post announcing Nemotron-3-Nano-Omni](https://developer.nvidia.com/blog/nvidia-nemotron-3-nano-omni-powers-multimodal-agent-reasoning-in-a-single-efficient-open-model) -4. [Hugging Face post introducing Nemotron-3-Nano-Omni](https://huggingface.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence) -5. [Nemotron-3-Nano-Omni HF Model page](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16) -6. [MMLongBench-Doc Benchmark](https://arxiv.org/abs/2407.01523) -7. [Qwen3-VL Model Family](https://huggingface.co/Qwen) diff --git a/docs/images/deployment-enterprise-gateway.png b/docs/images/deployment-enterprise-gateway.png deleted file mode 100644 index 8b337406d..000000000 Binary files a/docs/images/deployment-enterprise-gateway.png and /dev/null differ diff --git a/docs/images/deployment-library-decentralized.png b/docs/images/deployment-library-decentralized.png deleted file mode 100644 index cc3a29623..000000000 Binary files a/docs/images/deployment-library-decentralized.png and /dev/null differ diff --git a/docs/images/deployment-microservice.png b/docs/images/deployment-microservice.png deleted file mode 100644 index e870278c3..000000000 Binary files a/docs/images/deployment-microservice.png and /dev/null differ diff --git a/docs/images/rqa-blog.png b/docs/images/rqa-blog.png deleted file mode 100644 index f43fe156e..000000000 Binary files a/docs/images/rqa-blog.png and /dev/null differ diff --git a/docs/images/top-models.png b/docs/images/top-models.png deleted file mode 100644 index ee6754b7a..000000000 Binary files a/docs/images/top-models.png and /dev/null differ diff --git a/docs/index.md b/docs/index.md deleted file mode 100644 index c9297db97..000000000 --- a/docs/index.md +++ 
/dev/null @@ -1,103 +0,0 @@ -# 🎨 NeMo Data Designer - -[![GitHub](https://img.shields.io/badge/github-repo-952fc6?logo=github)](https://github.com/NVIDIA-NeMo/DataDesigner) [![License](https://img.shields.io/badge/License-Apache_2.0-0074df.svg)](https://opensource.org/licenses/Apache-2.0) [![NeMo Microservices](https://img.shields.io/badge/NeMo-Microservices-76b900)](https://docs.nvidia.com/nemo/microservices/latest/index.html) - -👋 Welcome! Data Designer is an orchestration framework for generating high-quality synthetic data. You provide LLM endpoints (NVIDIA, OpenAI, vLLM, etc.), and Data Designer handles batching, parallelism, validation, and more. - -**Configure** columns and models → **Preview** samples and iterate → **Create** your full dataset at scale. - -Unlike raw LLM calls, Data Designer gives you statistical diversity, field correlations, automated validation, and reproducible workflows. For details, see [Architecture & Performance](concepts/architecture-and-performance.md). - -📝 Want to hear from the team? Check out our **[Dev Notes](devnotes/index.md)** for deep dives, best practices, and insights. - -## Install - -```bash -pip install data-designer -``` - -## Setup - -Get an API key from one of the default providers and set it as an environment variable: - -```bash -# NVIDIA (build.nvidia.com) - recommended -export NVIDIA_API_KEY="your-api-key-here" - -# OpenAI (platform.openai.com) -export OPENAI_API_KEY="your-openai-api-key-here" - -# OpenRouter (openrouter.ai) -export OPENROUTER_API_KEY="your-openrouter-api-key-here" -``` - -Verify your configuration is ready: - -```bash -data-designer config list -``` - -This displays the pre-configured model providers and models. See [CLI Configuration](concepts/models/configure-model-settings-with-the-cli.md) to customize. 
- -## Your First Dataset - -Let's generate multilingual greetings to see Data Designer in action: - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -# Initialize with default model providers -data_designer = DataDesigner() -config_builder = dd.DataDesignerConfigBuilder() - -# Add a sampler column to randomly select a language -config_builder.add_column( - dd.SamplerColumnConfig( - name="language", - sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams( - values=["English", "Spanish", "French", "German", "Italian"], - ), - ) -) - -# Add an LLM text generation column -config_builder.add_column( - dd.LLMTextColumnConfig( - name="greeting", - model_alias="nvidia-text", - prompt="Write a casual and formal greeting in {{ language }}.", - ) -) - -# Generate a preview -results = data_designer.preview(config_builder) -results.display_sample_record() -``` - -🎉 That's it! You've just designed your first synthetic dataset. - -## 🚀 Next Steps - -
- -- :material-book-open-variant: **[Tutorials](notebooks/README.md)** - - Step-by-step notebooks covering core features - -- :material-chef-hat: **[Recipes](recipes/cards.md)** - - Ready-to-use examples for common use cases - -- :material-cog: **[Concepts](concepts/columns.md)** - - Deep dive into columns, models, and configuration - -
- -## Learn More - -- **[Deployment Options](concepts/deployment-options.md)** – Library vs. NeMo Microservice -- **[Model Configuration](concepts/models/default-model-settings.md)** – Configure LLM providers and models -- **[Architecture & Performance](concepts/architecture-and-performance.md)** – Optimize for throughput and scale diff --git a/docs/js/toc-toggle.js b/docs/js/toc-toggle.js deleted file mode 100644 index 22f7e079a..000000000 --- a/docs/js/toc-toggle.js +++ /dev/null @@ -1,25 +0,0 @@ -// Wrap in a check to ensure document$ exists -if (typeof document$ !== "undefined") { - document$.subscribe(function() { - // Check if this is a Code Reference page (contains mkdocstrings content) - const isCodeReferencePage = document.querySelector(".doc.doc-contents"); - - // Check if this is a Concepts page (URL contains /concepts/) - const isConceptsPage = window.location.pathname.includes("/concepts/"); - - // Check if this is a Plugins page (URL contains /plugins/) - const isPluginsPage = window.location.pathname.includes("/plugins/"); - - if (isCodeReferencePage || isConceptsPage || isPluginsPage) { - // Show TOC for Code Reference, Concepts, and Plugins pages by adding class to body - document.body.classList.add("show-toc"); - console.log("Code Reference, Concepts, or Plugins page detected - showing TOC"); - } else { - // Hide TOC for all other pages by removing class from body - document.body.classList.remove("show-toc"); - console.log("Non-Code Reference/Concepts/Plugins page - hiding TOC"); - } - }); -} else { - console.error("document$ observable not found - Material theme may not be loaded"); -} diff --git a/docs/notebook_source/1-the-basics.py b/docs/notebook_source/1-the-basics.py index f44f59d3b..184cd0505 100644 --- a/docs/notebook_source/1-the-basics.py +++ b/docs/notebook_source/1-the-basics.py @@ -37,7 +37,7 @@ # # - `DataDesigner` is the main object responsible for managing the data generation process. 
# -# - When initialized without arguments, the [default model providers](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) are used. +# - When initialized without arguments, the [default model providers](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) are used. # # %% @@ -50,7 +50,7 @@ # # - The "model alias" is used to reference the model in the Data Designer config (as we will see below). # -# - The "model provider" is the external service that hosts the model (see the [model config](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) docs for more details). +# - The "model provider" is the external service that hosts the model (see the [model config](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) docs for more details). # # - By default, we use [build.nvidia.com](https://build.nvidia.com/models) as the model provider. # @@ -324,11 +324,11 @@ # # Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about: # -# - [Structured outputs, jinja expressions, and conditional generation](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/2-structured-outputs-and-jinja-expressions/) +# - [Structured outputs, jinja expressions, and conditional generation](https://docs.nvidia.com/nemo/datadesigner/tutorials/structured-outputs-jinja-expressions-and-conditional-generation) # -# - [Seeding synthetic data generation with an external dataset](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/3-seeding-with-a-dataset/) +# - [Seeding synthetic data generation with an external dataset](https://docs.nvidia.com/nemo/datadesigner/tutorials/seeding-with-an-external-dataset) # -# - [Providing images as context](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/4-providing-images-as-context/) +# - [Providing images as 
context](https://docs.nvidia.com/nemo/datadesigner/tutorials/providing-images-as-context) # -# - [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/) +# - [Generating images](https://docs.nvidia.com/nemo/datadesigner/tutorials/generating-images) # diff --git a/docs/notebook_source/2-structured-outputs-and-jinja-expressions.py b/docs/notebook_source/2-structured-outputs-and-jinja-expressions.py index f389709f6..05a3c443e 100644 --- a/docs/notebook_source/2-structured-outputs-and-jinja-expressions.py +++ b/docs/notebook_source/2-structured-outputs-and-jinja-expressions.py @@ -19,7 +19,7 @@ # # In this notebook, we will continue our exploration of Data Designer, demonstrating more advanced data generation using structured outputs, Jinja expressions, and conditional generation with `skip.when`. # -# If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series. +# If this is your first time using Data Designer, we recommend starting with the [first notebook](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics) in this tutorial series. # # %% [markdown] @@ -39,7 +39,7 @@ # # - `DataDesigner` is the main object that is used to interface with the library. # -# - When initialized without arguments, the [default model providers](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) are used. +# - When initialized without arguments, the [default model providers](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) are used. # # %% @@ -52,7 +52,7 @@ # # - The "model alias" is used to reference the model in the Data Designer config (as we will see below). 
# -# - The "model provider" is the external service that hosts the model (see the [model config](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) docs for more details). +# - The "model provider" is the external service that hosts the model (see the [model config](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) docs for more details). # # - By default, we use [build.nvidia.com](https://build.nvidia.com/models) as the model provider. # @@ -458,9 +458,9 @@ class ProductReview(BaseModel): # # Check out the following notebook to learn more about: # -# - [Seeding synthetic data generation with an external dataset](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/3-seeding-with-a-dataset/) +# - [Seeding synthetic data generation with an external dataset](https://docs.nvidia.com/nemo/datadesigner/tutorials/seeding-with-an-external-dataset) # -# - [Providing images as context](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/4-providing-images-as-context/) +# - [Providing images as context](https://docs.nvidia.com/nemo/datadesigner/tutorials/providing-images-as-context) # -# - [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/) +# - [Generating images](https://docs.nvidia.com/nemo/datadesigner/tutorials/generating-images) # diff --git a/docs/notebook_source/3-seeding-with-a-dataset.py b/docs/notebook_source/3-seeding-with-a-dataset.py index e4f9218ee..10c6fb64d 100644 --- a/docs/notebook_source/3-seeding-with-a-dataset.py +++ b/docs/notebook_source/3-seeding-with-a-dataset.py @@ -19,7 +19,7 @@ # # In this notebook, we will demonstrate how to seed synthetic data generation in Data Designer with an external dataset. # -# If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series. 
+# If this is your first time using Data Designer, we recommend starting with the [first notebook](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics) in this tutorial series. # # %% [markdown] @@ -39,7 +39,7 @@ # # - `DataDesigner` is the main object responsible for managing the data generation process. # -# - When initialized without arguments, the [default model providers](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) are used. +# - When initialized without arguments, the [default model providers](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) are used. # # %% @@ -52,7 +52,7 @@ # # - The "model alias" is used to reference the model in the Data Designer config (as we will see below). # -# - The "model provider" is the external service that hosts the model (see the [model config](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) docs for more details). +# - The "model provider" is the external service that hosts the model (see the [model config](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) docs for more details). # # - By default, we use [build.nvidia.com](https://build.nvidia.com/models) as the model provider. 
# @@ -272,7 +272,7 @@ # # Check out the following notebook to learn more about: # -# - [Providing images as context](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/4-providing-images-as-context/) +# - [Providing images as context](https://docs.nvidia.com/nemo/datadesigner/tutorials/providing-images-as-context) # -# - [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/) +# - [Generating images](https://docs.nvidia.com/nemo/datadesigner/tutorials/generating-images) # diff --git a/docs/notebook_source/4-providing-images-as-context.py b/docs/notebook_source/4-providing-images-as-context.py index 7d849c89a..5a9833c4b 100644 --- a/docs/notebook_source/4-providing-images-as-context.py +++ b/docs/notebook_source/4-providing-images-as-context.py @@ -23,7 +23,7 @@ # - ✨ **Visual Document Processing**: Converting images to chat-ready format for model consumption # - 🔍 **Vision-Language Generation**: Using vision models to generate detailed summaries from images # -# If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series. +# If this is your first time using Data Designer, we recommend starting with the [first notebook](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics) in this tutorial series. # # %% [markdown] @@ -56,7 +56,7 @@ # # - `DataDesigner` is the main object responsible for managing the data generation process. # -# - When initialized without arguments, the [default model providers](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) are used. +# - When initialized without arguments, the [default model providers](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) are used. # # %% @@ -69,7 +69,7 @@ # # - The config builder provides an intuitive interface for building this configuration. 
# -# - When initialized without arguments, the [default model configurations](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) are used. +# - When initialized without arguments, the [default model configurations](https://docs.nvidia.com/nemo/datadesigner/concepts/models/default-model-settings) are used. # # %% @@ -260,5 +260,5 @@ def convert_image_to_chat_format(record, height: int) -> dict: # - Combine vision-based descriptions with other column types for multi-modal workflows # - Apply this pattern to other vision tasks like image captioning, OCR validation, or visual question answering # -# - [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/) with Data Designer +# - [Generating images](https://docs.nvidia.com/nemo/datadesigner/tutorials/generating-images) with Data Designer # diff --git a/docs/notebook_source/5-generating-images.py b/docs/notebook_source/5-generating-images.py index dfdc5782a..ba8ed00fc 100644 --- a/docs/notebook_source/5-generating-images.py +++ b/docs/notebook_source/5-generating-images.py @@ -27,7 +27,7 @@ # # > **Prerequisites**: This tutorial uses [OpenRouter](https://openrouter.ai) with the Flux 2 Pro image model. Set `OPENROUTER_API_KEY` in your environment before running. # -# If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series. +# If this is your first time using Data Designer, we recommend starting with the [first notebook](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics) in this tutorial series. 
# # %% [markdown] @@ -291,9 +291,9 @@ # %% [markdown] # ## ⏭️ Next steps # -# - [The basics](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/): samplers and LLM text columns -# - [Structured outputs and Jinja](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/2-structured-outputs-and-jinja-expressions/) -# - [Seeding with a dataset](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/3-seeding-with-a-dataset/) -# - [Providing images as context](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/4-providing-images-as-context/) -# - [Image-to-image editing](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/6-editing-images-with-image-context/): edit existing images with seed datasets +# - [The basics](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics): samplers and LLM text columns +# - [Structured outputs and Jinja](https://docs.nvidia.com/nemo/datadesigner/tutorials/structured-outputs-jinja-expressions-and-conditional-generation) +# - [Seeding with a dataset](https://docs.nvidia.com/nemo/datadesigner/tutorials/seeding-with-an-external-dataset) +# - [Providing images as context](https://docs.nvidia.com/nemo/datadesigner/tutorials/providing-images-as-context) +# - [Image-to-image editing](https://docs.nvidia.com/nemo/datadesigner/tutorials/image-to-image-editing): edit existing images with seed datasets # diff --git a/docs/notebook_source/6-editing-images-with-image-context.py b/docs/notebook_source/6-editing-images-with-image-context.py index a322d84a0..211f1e761 100644 --- a/docs/notebook_source/6-editing-images-with-image-context.py +++ b/docs/notebook_source/6-editing-images-with-image-context.py @@ -23,11 +23,11 @@ # - 🔗 **Chaining image columns**: Use `ImageContext` to pass generated images to a follow-up editing column # - 🎲 **Sampler-driven diversity**: Combine sampled accessories and settings for varied edits # -# This tutorial uses an **autoregressive** model (one that supports both 
text-to-image *and* image-to-image generation via the chat completions API). Diffusion models (DALL·E, Stable Diffusion, etc.) do not support image context—see [Tutorial 5](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/) for text-to-image generation with diffusion models. +# This tutorial uses an **autoregressive** model (one that supports both text-to-image *and* image-to-image generation via the chat completions API). Diffusion models (DALL·E, Stable Diffusion, etc.) do not support image context—see [Tutorial 5](https://docs.nvidia.com/nemo/datadesigner/tutorials/generating-images) for text-to-image generation with diffusion models. # # > **Prerequisites**: This tutorial uses [OpenRouter](https://openrouter.ai) with the Flux 2 Pro model. Set `OPENROUTER_API_KEY` in your environment before running. # -# If this is your first time using Data Designer, we recommend starting with the [first notebook](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/) in this tutorial series. +# If this is your first time using Data Designer, we recommend starting with the [first notebook](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics) in this tutorial series. 
# # %% [markdown] @@ -264,7 +264,7 @@ def display_before_after(row, index: int, base_path: Path | None = None) -> None # # Related tutorials: # -# - [The basics](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/1-the-basics/): samplers and LLM text columns -# - [Providing images as context](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/4-providing-images-as-context/): image-to-text with VLMs -# - [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/): text-to-image generation with diffusion models +# - [The basics](https://docs.nvidia.com/nemo/datadesigner/tutorials/the-basics): samplers and LLM text columns +# - [Providing images as context](https://docs.nvidia.com/nemo/datadesigner/tutorials/providing-images-as-context): image-to-text with VLMs +# - [Generating images](https://docs.nvidia.com/nemo/datadesigner/tutorials/generating-images): text-to-image generation with diffusion models # diff --git a/docs/notebook_source/README.md b/docs/notebook_source/README.md index 053f9ca8b..7572af626 100644 --- a/docs/notebook_source/README.md +++ b/docs/notebook_source/README.md @@ -6,7 +6,7 @@ In this folder you can find all our tutorial notebooks in `.py` format. They can make convert-execute-notebooks ``` -from the root of the repository. This will not only convert but also execute all of the notebooks -- for that to work, make sure you went through our [Quick Start](https://nvidia-nemo.github.io/DataDesigner/quick-start/) and have API keys set. A new folder `docs/notebooks` will be created, including `README.md` and `pyproject.toml` files. +from the root of the repository. This will not only convert but also execute all of the notebooks -- for that to work, make sure you went through our [Quick Start](https://docs.nvidia.com/nemo/datadesigner/getting-started/welcome) and have API keys set. A new folder `docs/notebooks` will be created, including `README.md` and `pyproject.toml` files. 
Alternatively, you can use Jupytext directly diff --git a/docs/overrides/main.html b/docs/overrides/main.html deleted file mode 100644 index 2dc45e869..000000000 --- a/docs/overrides/main.html +++ /dev/null @@ -1,31 +0,0 @@ -{% extends "base.html" %} - -{% block outdated %} - You're not viewing the latest version. - - Click here to go to latest. - -{% endblock %} - -{% block content %} -{% if page.nb_url %} -
- - Open In Colab - - -
- - {% include ".icons/material/download.svg" %} - -{% endif %} - -{{ super() }} -{% endblock content %} diff --git a/docs/plugins/build_your_own.md b/docs/plugins/build_your_own.md deleted file mode 100644 index 649b8bdd7..000000000 --- a/docs/plugins/build_your_own.md +++ /dev/null @@ -1,307 +0,0 @@ -# Build Your Own - -Data Designer supports three plugin types: **column generators**, **seed readers**, and **processors**. They all use the same package shape: a config class, an implementation class, and a `Plugin` object registered through a `data_designer.plugins` entry point. - -Use this page as the implementation checklist for plugin packages. Each tab below shows the core files for one plugin type. - -## Package shape - -Use the same structure for each plugin package: - -```text -data-designer-my-plugin/ -|-- pyproject.toml -`-- src/ - `-- data_designer_my_plugin/ - |-- __init__.py - |-- config.py - |-- impl.py - `-- plugin.py -``` - -## Implementation patterns - -=== "Column generator" - - This `index-multiplier` plugin adds a custom column whose value is the row index multiplied by a configurable integer. - - !!! note "Model-backed generators" - If your column generator interacts with models, include at least one `model_alias` field in the config and use the model registry from the implementation. See [Using Models in Plugins](models.md) for the registry access pattern. - - !!! info "Full-column vs cell-by-cell generators" - The example below uses `ColumnGeneratorFullColumn` because it can fill the whole batch from the DataFrame index. Use `ColumnGeneratorCellByCell` when each row can be generated independently from its upstream values and your `generate` method should receive and return a row dictionary. 
Cell-by-cell generation is especially useful for independent LLM calls because the async engine can run rows concurrently; the built-in [LLM completion generators](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/llm_completion.py) are good examples. Prefer `ColumnGeneratorFullColumn` for vectorized pandas operations, batched external APIs, or logic that needs to inspect or update the full batch at once. - - `config.py`: - - ```python - from __future__ import annotations - - from typing import Literal - - from data_designer.config.base import SingleColumnConfig - - - class IndexMultiplierColumnConfig(SingleColumnConfig): - column_type: Literal["index-multiplier"] = "index-multiplier" - multiplier: int = 2 - - @staticmethod - def get_column_emoji() -> str: - return "✖️" - - @property - def required_columns(self) -> list[str]: - return [] - - @property - def side_effect_columns(self) -> list[str]: - return [] - ``` - - `impl.py`: - - ```python - from __future__ import annotations - - from typing import TYPE_CHECKING - - from data_designer.engine.column_generators.generators.base import ColumnGeneratorFullColumn - - from data_designer_index_multiplier.config import IndexMultiplierColumnConfig - - if TYPE_CHECKING: - import pandas as pd - - - class IndexMultiplierColumnGenerator(ColumnGeneratorFullColumn[IndexMultiplierColumnConfig]): - def generate(self, data: pd.DataFrame) -> pd.DataFrame: - data[self.config.name] = data.index * self.config.multiplier - return data - ``` - - `plugin.py`: - - ```python - from __future__ import annotations - - from data_designer.plugins import Plugin, PluginType - - plugin = Plugin( - config_qualified_name="data_designer_index_multiplier.config.IndexMultiplierColumnConfig", - impl_qualified_name="data_designer_index_multiplier.impl.IndexMultiplierColumnGenerator", - plugin_type=PluginType.COLUMN_GENERATOR, - ) - ``` - - Entry point: - - ```toml - 
[project.entry-points."data_designer.plugins"] - index-multiplier = "data_designer_index_multiplier.plugin:plugin" - ``` - - For the generator implementation contract, see [Column Generators](../code_reference/engine/column_generators.md). For inline custom functions, see [Custom Columns](../concepts/custom_columns.md). - -=== "Seed reader" - - This `prefixed-text-files` plugin loads text files from a directory and emits a seed dataset with prefixed file contents. - - `config.py`: - - ```python - from __future__ import annotations - - from typing import Literal - - from data_designer.config.seed_source import FileSystemSeedSource - - - class PrefixedTextSeedSource(FileSystemSeedSource): - seed_type: Literal["prefixed-text-files"] = "prefixed-text-files" - prefix: str = "plugin" - ``` - - `impl.py`: - - ```python - from __future__ import annotations - - from pathlib import Path - from typing import Any - - import data_designer.lazy_heavy_imports as lazy - from data_designer.engine.resources.seed_reader import ( - FileSystemSeedReader, - SeedReaderFileSystemContext, - ) - - from data_designer_prefixed_text_seed_reader.config import PrefixedTextSeedSource - - - class PrefixedTextSeedReader(FileSystemSeedReader[PrefixedTextSeedSource]): - output_columns = ["relative_path", "file_name", "prefixed_content"] - - def build_manifest( - self, - *, - context: SeedReaderFileSystemContext, - ) -> lazy.pd.DataFrame | list[dict[str, str]]: - matched_paths = self.get_matching_relative_paths( - context=context, - file_pattern=self.source.file_pattern, - recursive=self.source.recursive, - ) - return [ - { - "relative_path": relative_path, - "file_name": Path(relative_path).name, - } - for relative_path in matched_paths - ] - - def hydrate_row( - self, - *, - manifest_row: dict[str, Any], - context: SeedReaderFileSystemContext, - ) -> dict[str, str]: - relative_path = str(manifest_row["relative_path"]) - with context.fs.open(relative_path, "r", encoding="utf-8") as handle: - content 
= handle.read().strip() - return { - "relative_path": relative_path, - "file_name": str(manifest_row["file_name"]), - "prefixed_content": f"{self.source.prefix}:{content}", - } - ``` - - `plugin.py`: - - ```python - from __future__ import annotations - - from data_designer.plugins import Plugin, PluginType - - plugin = Plugin( - config_qualified_name="data_designer_prefixed_text_seed_reader.config.PrefixedTextSeedSource", - impl_qualified_name="data_designer_prefixed_text_seed_reader.impl.PrefixedTextSeedReader", - plugin_type=PluginType.SEED_READER, - ) - ``` - - Entry point: - - ```toml - [project.entry-points."data_designer.plugins"] - prefixed-text-files = "data_designer_prefixed_text_seed_reader.plugin:plugin" - ``` - - For the engine API behind this example, see [Seed Readers](../code_reference/engine/seed_readers.md). - -=== "Processor" - - This `regex-filter` plugin filters rows whose column value matches a regular expression. - - `config.py`: - - ```python - from __future__ import annotations - - from typing import Literal - - from pydantic import Field - - from data_designer.config.base import ProcessorConfig - - - class RegexFilterProcessorConfig(ProcessorConfig): - processor_type: Literal["regex-filter"] = "regex-filter" - column: str = Field(description="Column to match against.") - pattern: str = Field(description="Regex pattern to match.") - invert: bool = Field(default=False, description="If True, keep rows that do not match.") - ``` - - `impl.py`: - - ```python - from __future__ import annotations - - from typing import TYPE_CHECKING - - from data_designer.engine.processing.processors.base import Processor - - from data_designer_regex_filter.config import RegexFilterProcessorConfig - - if TYPE_CHECKING: - import pandas as pd - - - class RegexFilterProcessor(Processor[RegexFilterProcessorConfig]): - def process_after_generation(self, data: pd.DataFrame) -> pd.DataFrame: - mask = data[self.config.column].astype(str).str.contains(self.config.pattern, 
regex=True) - if self.config.invert: - mask = ~mask - return data[mask].reset_index(drop=True) - ``` - - `plugin.py`: - - ```python - from __future__ import annotations - - from data_designer.plugins import Plugin, PluginType - - plugin = Plugin( - config_qualified_name="data_designer_regex_filter.config.RegexFilterProcessorConfig", - impl_qualified_name="data_designer_regex_filter.impl.RegexFilterProcessor", - plugin_type=PluginType.PROCESSOR, - ) - ``` - - Entry point: - - ```toml - [project.entry-points."data_designer.plugins"] - regex-filter = "data_designer_regex_filter.plugin:plugin" - ``` - - For callback selection and processor execution details, see [Processors](../concepts/processors.md). For the engine API behind this example, see [Engine Processors code reference](../code_reference/engine/processors.md). - -## Install and use locally - -Install any plugin package in editable mode from the package directory: - -```bash -uv pip install -e . -``` - -The editable install registers the `data_designer.plugins` entry point so Data Designer can discover the plugin. - -!!! note "Restart your kernel after installing" - Data Designer caches the plugin registry on first import, so an `import data_designer` that already happened in your Python process — typical in a notebook — won't pick up a freshly installed plugin. After `uv pip install -e .`, restart the kernel (or interpreter) so the next import rebuilds the registry. - -## Validate plugins - -Data Designer provides a testing utility for common plugin structure checks: - -```python -from data_designer.engine.testing.utils import assert_valid_plugin -from data_designer_index_multiplier.plugin import plugin - -assert_valid_plugin(plugin) -``` - -`assert_valid_plugin` checks that the plugin's config inherits from `ConfigBase` and that the implementation class inherits from the appropriate base for its plugin type (`ConfigurableTask` for column generators, `SeedReader` for seed readers). 
- -For published plugins, add at least one functional test that runs the plugin through `DataDesigner.preview(...)`. This catches packaging and entry point issues that a direct implementation test can miss. - -## Multiple plugins in one package - -A single Python package can register multiple plugins by defining multiple `Plugin` objects and entry points: - -```toml -[project.entry-points."data_designer.plugins"] -my-column-generator = "my_package.plugins.column_generator.plugin:column_generator_plugin" -my-seed-reader = "my_package.plugins.seed_reader.plugin:seed_reader_plugin" -my-processor = "my_package.plugins.processor.plugin:processor_plugin" -``` diff --git a/docs/plugins/discover.md b/docs/plugins/discover.md deleted file mode 100644 index 053bd2933..000000000 --- a/docs/plugins/discover.md +++ /dev/null @@ -1,101 +0,0 @@ -# Discover Plugins - -The Data Designer CLI is the recommended way to discover and install published plugins. It uses plugin catalogs to show install details and compatibility before installing the selected plugin package into your current environment or active `uv` project. - -Plugins are distributed as Python packages. A single package can expose one or more runtime plugins, so the CLI installs and uninstalls packages rather than individual runtime plugin names. - -## NVIDIA catalog - -The default `nvidia` catalog is maintained in the [DataDesignerPlugins repository](https://github.com/NVIDIA-NeMo/DataDesignerPlugins). You do not need to configure it before using the CLI. - -You can also browse the first-party [plugin documentation](https://nvidia-nemo.github.io/DataDesignerPlugins/plugins/) and [plugin package source](https://github.com/NVIDIA-NeMo/DataDesignerPlugins/tree/main/plugins) directly. - -## Find a plugin package - -When a CLI command requires a plugin package argument, you can pass either the full package name or the package alias. The package alias is the package name without the `data-designer-` prefix. 
For example, `data-designer-github` can be addressed as `github`. - -Start by listing or searching the compatible packages in the default catalog. Search can match package names, package aliases, descriptions, runtime plugin names, and runtime plugin types. - -```bash -# List compatible plugin packages from the default NVIDIA catalog -data-designer plugin list - -# Search for a package -data-designer plugin search github - -# Inspect one package before installing it -data-designer plugin info github -``` - -## Install a plugin package - -Install the package by full package name or package alias: - -```bash -data-designer plugin install github -``` - -After installation, Data Designer discovers the package's `data_designer.plugins` entry points. Use `installed` to see the plugin packages available in the current Python environment and the runtime plugins they expose: - -```bash -data-designer plugin installed -``` - -Uninstall with the same package name or alias: - -```bash -data-designer plugin uninstall github -``` - -!!! note - Plugins are ordinary Python packages. You can still publish a plugin to PyPI or another package index and install it directly with `pip` or `uv`. This is the path we recommend for individual plugin developers from the community. See [Community plugins](#community-plugins) below. - -## How catalogs work - -A plugin catalog is a JSON file that tells Data Designer which plugin packages are available and how to install them. The catalog can be hosted anywhere that serves raw JSON. Each entry points to an installable Python package and includes its docs URL, Python and Data Designer compatibility requirements, the runtime plugins it exposes after installation, and the installer metadata needed to fetch the package. - -The package itself can live in any Python package index, or be referenced with any valid [PEP 508 direct reference](https://packaging.python.org/en/latest/specifications/dependency-specifiers/#direct-references). 
The package does not have to live in the same repository as the catalog. - -The NVIDIA catalog is published at: - -```text -https://nvidia-nemo.github.io/DataDesignerPlugins/catalog/plugins.json -``` - -The NVIDIA plugin packages are served from a PyPI-compatible Python Simple API index published beside that catalog: - -```text -https://nvidia-nemo.github.io/DataDesignerPlugins/simple/ -``` - -Catalog discovery and runtime plugin discovery are separate. Reading a catalog lets the CLI show available packages and install plans without importing plugin code. Runtime plugins become available only after their package is installed and Data Designer discovers the package's `data_designer.plugins` entry points. - -Other catalogs can follow the same pattern as the NVIDIA plugin repository: publish a raw `catalog/plugins.json` file and, for index-backed packages, a PyPI-compatible hosted package index. Catalog entries can also point to packages on the installer's default index or to direct package references. - -## Use another catalog - -Add a catalog when a team or community publishes a compatible catalog JSON file. For example, an internal platform team might publish a catalog that lists approved Data Designer plugin packages and points each package at an internal Python package index. Teammates can then add that catalog once and install approved plugins by package name or alias. - -Choose a short catalog name and use it with `--catalog`: - -```bash -data-designer plugin catalog add -data-designer plugin --catalog list -data-designer plugin --catalog install -``` - -For published catalogs, prefer sharing the raw catalog JSON URL. Local catalog files and directories are useful while authoring or testing a catalog before publishing it. 
- -```bash -# See configured catalog names -data-designer plugin catalog list - -# Remove a catalog -data-designer plugin catalog remove -``` - -## Community plugins - -We do not have any community plugins to list here yet, but yours could be the first! If you build a plugin that could be useful to other Data Designer users, we would love to hear about it. - -To get started, follow the patterns in the [plugin overview](overview.md) and [Build Your Own](build_your_own.md) guides, then publish your plugin package to PyPI. When your plugin is ready, open an issue on the [Data Designer GitHub repository](https://github.com/NVIDIA-NeMo/DataDesigner/issues) with the package name, source repository, documentation link, supported Data Designer versions, and the plugin types it provides. The Data Designer team will review the plugin and add it here if it seems generally useful for the community. diff --git a/docs/plugins/models.md b/docs/plugins/models.md deleted file mode 100644 index c3bb25228..000000000 --- a/docs/plugins/models.md +++ /dev/null @@ -1,194 +0,0 @@ -# Using Models in Plugins - -Model access belongs in column generator implementations, not config objects. Keep the config declarative by asking users for model aliases, then resolve those aliases at runtime through the model registry. - -Do not construct model clients in plugin configs, read API keys in configs, or bypass Data Designer's model providers. The engine builds a `ResourceProvider` and exposes its model registry to every generator at: - -```python -self.resource_provider.model_registry -``` - -## Access the registry - -Use a model-aware column generator base whenever your plugin needs the registry: - -| Need | Base class | Registry access | -|------|------------|-----------------| -| Primary model alias | `ColumnGeneratorWithModel` | Use `self.model`, `self.model_config`, and `self.inference_parameters`. 
| -| Multiple aliases or provider inspection | `ColumnGeneratorWithModelRegistry` | Use `self.get_model(alias)`, `self.get_model_config(alias)`, and `self.get_model_provider_name(alias)`. | - -`ColumnGeneratorWithModel` is a convenience subclass of `ColumnGeneratorWithModelRegistry`. It expects the config to have a `model_alias` field and resolves that one alias for you. For independent model calls, return `GenerationStrategy.CELL_BY_CELL` so the runtime can fan out rows like the built-in LLM, embedding, and image generators. Use full-column generation only when your plugin intentionally calls a batched API for the whole DataFrame. - -```python -from __future__ import annotations - -from data_designer.config.column_configs import GenerationStrategy -from data_designer.engine.column_generators.generators.base import ColumnGeneratorWithModel -from data_designer.engine.models.parsers.errors import ParserException - -from data_designer_sentiment_label.config import SentimentLabelColumnConfig - - -def parse_sentiment_label(response: str) -> str: - label = response.strip().lower() - if label not in {"positive", "neutral", "negative"}: - raise ParserException("Expected exactly one of: positive, neutral, negative.", source=response) - return label - - -class SentimentLabelColumnGenerator(ColumnGeneratorWithModel[SentimentLabelColumnConfig]): - @staticmethod - def get_generation_strategy() -> GenerationStrategy: - return GenerationStrategy.CELL_BY_CELL - - async def agenerate(self, data: dict) -> dict: - label, _ = await self.model.agenerate( - prompt=f"Classify the sentiment of this text: {data[self.config.source_column]}", - system_prompt="Return exactly one label: positive, neutral, or negative.", - parser=parse_sentiment_label, - max_correction_steps=self.resource_provider.run_config.max_conversation_correction_steps, - max_conversation_restarts=self.resource_provider.run_config.max_conversation_restarts, - purpose=f"running generation for column '{self.config.name}'", 
- ) - data[self.config.name] = label - return data -``` - -The matching config must include `model_alias: str` as a normal user-facing field: - -```python -from __future__ import annotations - -from typing import Literal - -from data_designer.config.base import SingleColumnConfig - - -class SentimentLabelColumnConfig(SingleColumnConfig): - column_type: Literal["sentiment-label"] = "sentiment-label" - source_column: str - model_alias: str - - @property - def required_columns(self) -> list[str]: - return [self.source_column] - - @property - def side_effect_columns(self) -> list[str]: - return [] -``` - -Users set that alias from default model settings or from `DataDesignerConfigBuilder(model_configs=...)`. - -## Use multiple models - -If your plugin uses multiple model aliases, inherit from `ColumnGeneratorWithModelRegistry` and resolve each alias explicitly with `self.get_model(...)`. - -The startup model health check pings every alias your column declares. By default, `SingleColumnConfig.get_model_aliases()` returns the primary `model_alias` field, which covers single-model plugins for free. A config for this pattern might also define `judge_model_alias`, `critic_model_alias`, or another task-specific alias. Override `get_model_aliases()` to return every alias the column depends on so a typo, missing API key, or unreachable endpoint surfaces at startup instead of at first generation. 
- -The matching config opts every alias into the standard startup health check by listing them all in `get_model_aliases()`: - -```python -from __future__ import annotations - -from typing import Literal - -from data_designer.config.base import SingleColumnConfig - - -class PairwiseJudgeColumnConfig(SingleColumnConfig): - column_type: Literal["pairwise-judge"] = "pairwise-judge" - question_column: str - model_alias: str - judge_model_alias: str - - @property - def required_columns(self) -> list[str]: - return [self.question_column] - - @property - def side_effect_columns(self) -> list[str]: - return [] - - def get_model_aliases(self) -> list[str]: - return [self.model_alias, self.judge_model_alias] -``` - -```python -from __future__ import annotations - -from data_designer.config.column_configs import GenerationStrategy -from data_designer.engine.column_generators.generators.base import ColumnGeneratorWithModelRegistry -from data_designer.engine.models.parsers.errors import ParserException - -from data_designer_pairwise_judge.config import PairwiseJudgeColumnConfig - - -def parse_score(response: str) -> int: - text = response.strip() - if text not in {"1", "2", "3", "4", "5"}: - raise ParserException("Expected an integer score from 1 to 5.", source=response) - return int(text) - - -class PairwiseJudgeColumnGenerator(ColumnGeneratorWithModelRegistry[PairwiseJudgeColumnConfig]): - @staticmethod - def get_generation_strategy() -> GenerationStrategy: - return GenerationStrategy.CELL_BY_CELL - - async def agenerate(self, data: dict) -> dict: - generator_model = self.get_model(self.config.model_alias) - judge_model = self.get_model(self.config.judge_model_alias) - retry_kwargs = { - "max_correction_steps": self.resource_provider.run_config.max_conversation_correction_steps, - "max_conversation_restarts": self.resource_provider.run_config.max_conversation_restarts, - } - - draft, _ = await generator_model.agenerate( - prompt=f"Draft an answer for: 
{data[self.config.question_column]}", - purpose=f"drafting an answer for column '{self.config.name}'", - **retry_kwargs, - ) - score, _ = await judge_model.agenerate( - prompt=f"Score this answer from 1 to 5: {draft}", - system_prompt="Return exactly one integer from 1 to 5.", - parser=parse_score, - purpose=f"judging an answer for column '{self.config.name}'", - **retry_kwargs, - ) - data[self.config.name] = {"draft": draft, "score": score} - return data -``` - -If your config has no `model_alias` field at all (uncommon but valid), override `get_model_aliases()` to return whichever fields name your model dependencies — the default implementation reads `model_alias` via `getattr` and returns an empty list when it is absent, so it will not crash on configs without it. - -## What the registry returns - -`get_model(...)` returns a `ModelFacade`. Call the facade based on the modality your plugin needs: - -- Chat completion aliases use `model.generate(...)` or `await model.agenerate(...)` and return `(parsed_output, trace)`. -- Embedding aliases use `model.generate_text_embeddings(...)` or `await model.agenerate_text_embeddings(...)` and return `list[list[float]]`. -- Image aliases use `model.generate_image(...)` or `await model.agenerate_image(...)` and return `list[str]` of base64-encoded image data. - -Choose a model alias whose `ModelConfig.inference_parameters.generation_type` matches the facade method you call. The facade merges the alias's configured inference parameters into each request. - -Pass runtime context such as `prompt`, `system_prompt`, `parser`, `tool_alias`, `multi_modal_context`, `max_correction_steps`, `max_conversation_restarts`, and `purpose` at the call site. Parser functions should raise `ParserException` for invalid model responses; that is what allows `ModelFacade.generate(...)` and `ModelFacade.agenerate(...)` to run correction turns and conversation restarts. - -Prefer implementing `agenerate(...)` for model-backed plugins. 
The base `generate(...)` method can bridge to `agenerate(...)` for sync runs when the subclass only implements async generation. If your plugin has a sync-specific path, implement both `generate(...)` and `agenerate(...)`, as the built-in generators do. - -## Health checks and scheduling - -The model-aware bases mark the generator as LLM-bound, so the async scheduler treats the work like other model calls. - -Plugin discovery treats column generator implementations that inherit from `ColumnGeneratorWithModelRegistry` as model-generated column types for startup model health checks. The standard health-check collection calls `SingleColumnConfig.get_model_aliases()` on each column config and pings every alias it returns. The default implementation returns the column's primary `model_alias` (or an empty list for configs without one); configs with multiple model fields should override it so the startup check exercises every endpoint they depend on. - -## Built-in patterns - -The built-in model-backed generators use these same hooks: - -- `LLMTextCellGenerator`, `LLMCodeCellGenerator`, `LLMStructuredCellGenerator`, and `LLMJudgeCellGenerator` inherit through a chat-completion base that uses `ColumnGeneratorWithModel`. They render prompts from row data, call `self.model.generate(...)` or `self.model.agenerate(...)`, pass parsers into the `ModelFacade`, and store optional trace side-effect columns. -- `EmbeddingCellGenerator` uses `ColumnGeneratorWithModel` but calls the facade's embedding methods instead of chat completion. -- `ImageCellGenerator` uses `ColumnGeneratorWithModel`, renders a prompt, calls the facade's image methods, and writes generated media through the artifact storage supplied by the same `ResourceProvider`. -- `CustomColumnGenerator` is the inline-function counterpart: when users declare `model_aliases`, it builds a `models` dict from `resource_provider.model_registry`. 
Packaged plugins usually use `ColumnGeneratorWithModel` or `ColumnGeneratorWithModelRegistry` directly instead of recreating that dict. - -See [Column Generators](../code_reference/engine/column_generators.md) for the full base-class API and [Custom Model Settings](../concepts/models/custom-model-settings.md) for configuring model aliases. diff --git a/docs/plugins/overview.md b/docs/plugins/overview.md deleted file mode 100644 index b071966c8..000000000 --- a/docs/plugins/overview.md +++ /dev/null @@ -1,35 +0,0 @@ -# Data Designer Plugins - -Plugins let you add new object types to Data Designer without modifying the core library. Once installed, plugins behave like native Data Designer objects: they use the same declarative config patterns, builder APIs, discovery flow, and runtime execution paths as the built-in objects. - -## Supported plugin types - -Data Designer supports three plugin types: - -- **Column generator plugins**: Custom [column generators](../code_reference/engine/column_generators.md) you pass to the config builder's [add_column](../code_reference/config/config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder.add_column) method. -- **Seed reader plugins**: Custom [seed readers](../code_reference/engine/seed_readers.md) that load data from new sources, such as databases, cloud storage, or custom file formats. -- **Processor plugins**: Custom [processor implementations](../code_reference/engine/processors.md) configured by processor config objects that transform data before batches, after batches, or after generation completes. Pass them to the config builder's [add_processor](../code_reference/config/config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder.add_processor) method. - -## Use an Installed Plugin - -Plugin packages register their `Plugin` objects through Python package [entry points](https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata). 
Data Designer discovers installed plugin entry points automatically, so no extra registration code is required. Once a plugin package is installed, use its new object types in your Data Designer workflow. - -If you install a plugin after `data_designer` has already been imported, restart the Python process so plugin discovery can rebuild from the new entry points. - -## Build a Plugin - -For implementation instructions across all plugin types, see the [Build Your Own](build_your_own.md) section. - -## Find Plugins - -Use the Data Designer CLI to discover and install published plugin packages from catalogs. See [Discover Plugins](discover.md) for the catalog workflow, first-party plugin documentation, and source links. - -## Discovery troubleshooting - -If a plugin is installed but not available, check these items first: - -- The entry point group must be exactly `data_designer.plugins`. -- Check the value of the `DISABLE_DATA_DESIGNER_PLUGINS` environment variable. If it is set to `true`, entry point discovery is disabled. -- The plugin discriminator default must be a string. Use `column_type`, `seed_type`, or `processor_type`, depending on the plugin type. -- Avoid duplicate plugin names. Discovery stores plugins by `plugin.name`, which comes from the discriminator default. -- For plugin packages under development, call `assert_valid_plugin` on the plugin object to catch common structural issues at import time. diff --git a/docs/recipes/cards.md b/docs/recipes/cards.md deleted file mode 100644 index 302198c79..000000000 --- a/docs/recipes/cards.md +++ /dev/null @@ -1,221 +0,0 @@ -# Use Case Recipes - -Recipes are a collection of code examples that demonstrate how to leverage Data Designer in specific use cases. -Each recipe is a self-contained example that can be run independently. - -!!! question "New to Data Designer?" - Recipes provide working code for specific use cases without detailed explanations. 
If you're learning Data Designer for the first time, we recommend starting with our [tutorial notebooks](../../notebooks/), which offer step-by-step guidance and explain core concepts. Once you're familiar with the basics, return here for practical, ready-to-use implementations. - -!!! tip Prerequisite - These recipes use the Open AI model provider by default. Ensure your OpenAI model provider has been set up using the Data Designer CLI before running a recipe. - -
- -- :material-snake:{ .lg .middle } **Text to Python** - - Generate a dataset of natural language instructions paired with Python code implementations, with varying complexity levels and industry focuses. - - --- - - **Demonstrates:** - - - Python code generation - - Python code validation - - LLM-as-judge - - --- - - [:material-book-open-page-variant: View Recipe](code_generation/text_to_python.md){ .md-button } - [Download Code :octicons-download-24:](../assets/recipes/code_generation/text_to_python.py){ .md-button download="text_to_python.py" } - -- :material-database:{ .lg .middle } **Text to SQL** - - Generate a dataset of natural language instructions paired with SQL code implementations, with varying complexity levels and industry focuses. - - --- - - **Demonstrates:** - - - SQL code generation - - SQL code validation - - LLM-as-judge - - --- - - [:material-book-open-page-variant: View Recipe](code_generation/text_to_sql.md){ .md-button } - [Download Code :octicons-download-24:](../assets/recipes/code_generation/text_to_sql.py){ .md-button download="text_to_sql.py" } - -- :material-database-search:{ .lg .middle } **Nemotron Super Text to SQL** - - Generate enterprise-grade text-to-SQL training data used for Nemotron Super v3 SFT -- dialect-specific SQL, distractor injection, dirty data, 5 LLM judges with 15 scoring dimensions. 
- - --- - - **Demonstrates:** - - - Dialect-specific SQL generation (SQLite, MySQL, PostgreSQL) - - Distractor table/column and dirty data injection - - Conditional sampling with SubcategorySamplerParams - - 5 LLM judges with 15 score extraction columns - - --- - - [:material-book-open-page-variant: View Recipe](code_generation/enterprise_text_to_sql.md){ .md-button } - [Download Code :octicons-download-24:](../assets/recipes/code_generation/enterprise_text_to_sql.py){ .md-button download="enterprise_text_to_sql.py" } - - -- :material-chat:{ .lg .middle } **Product Info QA** - - Generate a dataset that contains information about products and associated question/answer pairs. - - --- - - **Demonstrates:** - - - Structured outputs - - Expression columns - - LLM-as-judge - - --- - - [:material-book-open-page-variant: View Recipe](qa_and_chat/product_info_qa.md){ .md-button } - [Download Code :octicons-download-24:](../assets/recipes/qa_and_chat/product_info_qa.py){ .md-button download="product_info_qa.py" } - - -- :material-chat:{ .lg .middle } **Multi-Turn Chat** - - Generate a dataset of multi-turn chat conversations between a user and an AI assistant. - - --- - - **Demonstrates:** - - - Structured outputs - - Expression columns - - LLM-as-judge - - --- - - [:material-book-open-page-variant: View Recipe](qa_and_chat/multi_turn_chat.md){ .md-button } - [Download Code :octicons-download-24:](../assets/recipes/qa_and_chat/multi_turn_chat.py){ .md-button download="multi_turn_chat.py" } - -- :material-source-branch:{ .lg .middle } **Agent Rollout Trace Distillation** - - Read agent rollout traces from disk and turn each imported rollout into a structured workflow record inside a Data Designer pipeline. 
- - --- - - **Demonstrates:** - - - `AgentRolloutSeedSource` across ATIF, Claude Code, Codex, and Hermes Agent rollout formats - - Using normalized trace columns in generation prompts - - Distilling agent traces into reusable structured records - - --- - - [:material-book-open-page-variant: View Recipe](trace_ingestion/agent_rollout_distillation.md){ .md-button } - [:material-file-document-outline: Ingestion Guide](../concepts/agent-rollout-ingestion.md){ .md-button } - [Download Code :octicons-download-24:](../assets/recipes/trace_ingestion/agent_rollout_distillation.py){ .md-button download="agent_rollout_distillation.py" } - - -- :material-tools:{ .lg .middle } **Basic MCP Tool Use** - - Minimal example of MCP tool calling with Data Designer. Defines a simple MCP server with basic tools and generates data that requires tool calls to complete. - - --- - - **Demonstrates:** - - - MCP tool calling with LocalStdioMCPProvider - - Simple tool server definition - - Tool-augmented text generation - - --- - - [:material-book-open-page-variant: View Recipe](mcp_and_tooluse/basic_mcp.md){ .md-button } - [Download Code :octicons-download-24:](../assets/recipes/mcp_and_tooluse/basic_mcp.py){ .md-button download="basic_mcp.py" } - -- :material-tools:{ .lg .middle } **PDF Document QA (MCP + Tool Use)** - - Generate grounded Q&A pairs from PDF documents using MCP tool calls and BM25 search. 
- - --- - - **Demonstrates:** - - - MCP tool calling with LocalStdioMCPProvider - - BM25 lexical search for retrieval - - Retrieval-grounded QA generation - - Per-column trace capture - - --- - - [:material-book-open-page-variant: View Recipe](mcp_and_tooluse/pdf_qa.md){ .md-button } - [Download Code :octicons-download-24:](../assets/recipes/mcp_and_tooluse/pdf_qa.py){ .md-button download="pdf_qa.py" } - -- :material-magnify:{ .lg .middle } **Nemotron Super Search Agent (MCP + Tool Use)** - - Generate multi-turn search agent trajectories used for Nemotron Super post-training -- Tavily web search via MCP, Wikidata KG seeding, BrowseComp-style question generation. - - --- - - **Demonstrates:** - - - MCP tool calling with Tavily web search - - Wikidata knowledge graph seeding - - Two-stage question generation (draft + BrowseComp obfuscation) - - Full trajectory capture with traces - - Structured output formatting - - --- - - [:material-book-open-page-variant: View Recipe](mcp_and_tooluse/search_agent.md){ .md-button } - [Download Code :octicons-download-24:](../assets/recipes/mcp_and_tooluse/search_agent.py){ .md-button download="search_agent.py" } - -- :material-file-document-multiple:{ .lg .middle } **Markdown Section Seed Reader** - - Define a custom `FileSystemSeedReader` inline and turn Markdown files into one seed row per heading section. 
- - --- - - **Demonstrates:** - - - Single-file custom seed reader pattern - - `hydrate_row()` fanout from `1 -> N` - - Manifest-based file selection semantics - - `DirectorySeedSource` customization without a new `seed_type` - - --- - - [:material-book-open-page-variant: View Recipe](plugin_development/markdown_seed_reader.md){ .md-button } - [Download Code :octicons-download-24:](../assets/recipes/plugin_development/markdown_seed_reader.py){ .md-button download="markdown_seed_reader.py" } - -- :material-file-eye:{ .lg .middle } **VLM Long-Document Understanding** - - A 9-recipe pipeline for generating visual QA training data from long PDF documents — OCR, page classification, single-page / multi-page / whole-document QA, and frontier-model quality filtering. Used to generate SFT data for Nemotron-3-Nano-Omni-30B-A3B's training recipe on long document understanding. - - --- - - **Demonstrates:** - - - Multi-modal image context (`LLMTextColumnConfig`, `LLMStructuredColumnConfig`) - - Classification-first filtering with visual taxonomy - - Thinking models with `extract_reasoning_content` - - Multi-image and whole-document VLM generation - - `LLMJudgeColumnConfig` with multi-rubric scoring - - --- - - | # | Recipe | | - | :---: | :--- | :--- | - | 01 | [Seed Dataset Preparation](vlm_long_doc/seed_dataset_preparation.md) | [Download :octicons-download-24:](../assets/recipes/vlm_long_doc/01-seed-dataset-preparation.py){ download="01-seed-dataset-preparation.py" } | - | 02 | [Nemotron Parse OCR](vlm_long_doc/nemotron_parse_ocr.md) | [Download :octicons-download-24:](../assets/recipes/vlm_long_doc/02-nemotron-parse-ocr-sdg.py){ download="02-nemotron-parse-ocr-sdg.py" } | - | 03 | [Text QA from OCR Transcripts](vlm_long_doc/text_qa.md) | [Download :octicons-download-24:](../assets/recipes/vlm_long_doc/03-text-qa-sdg.py){ download="03-text-qa-sdg.py" } | - | 04 | [Page Classification](vlm_long_doc/page_classification.md) | [Download 
:octicons-download-24:](../assets/recipes/vlm_long_doc/04-page-classification-sdg.py){ download="04-page-classification-sdg.py" } | - | 05 | [Visual QA](vlm_long_doc/visual_qa.md) | [Download :octicons-download-24:](../assets/recipes/vlm_long_doc/05-visual-qa-sdg.py){ download="05-visual-qa-sdg.py" } | - | 06 | [Single-Page QA](vlm_long_doc/single_page_qa.md) | [Download :octicons-download-24:](../assets/recipes/vlm_long_doc/06-single-page-qa-sdg.py){ download="06-single-page-qa-sdg.py" } | - | 07 | [Multi-Page Windowed QA](vlm_long_doc/multi_page_windowed_qa.md) | [Download :octicons-download-24:](../assets/recipes/vlm_long_doc/07-multi-page-windowed-qa-sdg.py){ download="07-multi-page-windowed-qa-sdg.py" } | - | 08 | [Whole-Document QA](vlm_long_doc/whole_document_qa.md) | [Download :octicons-download-24:](../assets/recipes/vlm_long_doc/08-whole-document-qa-sdg.py){ download="08-whole-document-qa-sdg.py" } | - | 09 | [Frontier Judge QA Filter](vlm_long_doc/frontier_judge.md) | [Download :octicons-download-24:](../assets/recipes/vlm_long_doc/09-frontier-judge-sdg.py){ download="09-frontier-judge-sdg.py" } | - -
diff --git a/docs/recipes/code_generation/enterprise_text_to_sql.md b/docs/recipes/code_generation/enterprise_text_to_sql.md deleted file mode 100644 index e58ceb80a..000000000 --- a/docs/recipes/code_generation/enterprise_text_to_sql.md +++ /dev/null @@ -1,10 +0,0 @@ -# Nemotron Super Text to SQL - -!!! note "Dev Note" - For a deep dive into the pipeline design, distractor injection strategy, quality waterfall analysis, and BIRD benchmark results, see [Engineering an Enterprise-Grade Text-to-SQL Dataset with NeMo Data Designer](../../../devnotes/engineering-an-enterprise-grade-text-to-sql-dataset-with-nemo-data-designer/). - -[Download Code :octicons-download-24:](../../../assets/recipes/code_generation/enterprise_text_to_sql.py){ .md-button download="enterprise_text_to_sql.py" } - -```python ---8<-- "assets/recipes/code_generation/enterprise_text_to_sql.py" -``` diff --git a/docs/recipes/code_generation/text_to_python.md b/docs/recipes/code_generation/text_to_python.md deleted file mode 100644 index 81575e61a..000000000 --- a/docs/recipes/code_generation/text_to_python.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/code_generation/text_to_python.py){ .md-button download="text_to_python.py" } - -```python ---8<-- "assets/recipes/code_generation/text_to_python.py" -``` diff --git a/docs/recipes/code_generation/text_to_sql.md b/docs/recipes/code_generation/text_to_sql.md deleted file mode 100644 index af736cc3d..000000000 --- a/docs/recipes/code_generation/text_to_sql.md +++ /dev/null @@ -1,7 +0,0 @@ -# Text to SQL - -[Download Code :octicons-download-24:](../../../assets/recipes/code_generation/text_to_sql.py){ .md-button download="text_to_sql.py" } - -```python ---8<-- "assets/recipes/code_generation/text_to_sql.py" -``` diff --git a/docs/recipes/mcp_and_tooluse/basic_mcp.md b/docs/recipes/mcp_and_tooluse/basic_mcp.md deleted file mode 100644 index 2f525accd..000000000 --- 
a/docs/recipes/mcp_and_tooluse/basic_mcp.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/mcp_and_tooluse/basic_mcp.py){ .md-button download="basic_mcp.py" } - -```python ---8<-- "assets/recipes/mcp_and_tooluse/basic_mcp.py" -``` diff --git a/docs/recipes/mcp_and_tooluse/pdf_qa.md b/docs/recipes/mcp_and_tooluse/pdf_qa.md deleted file mode 100644 index 2c6f3e9e3..000000000 --- a/docs/recipes/mcp_and_tooluse/pdf_qa.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/mcp_and_tooluse/pdf_qa.py){ .md-button download="pdf_qa.py" } - -```python ---8<-- "assets/recipes/mcp_and_tooluse/pdf_qa.py" -``` diff --git a/docs/recipes/mcp_and_tooluse/search_agent.md b/docs/recipes/mcp_and_tooluse/search_agent.md deleted file mode 100644 index 9ab8b789a..000000000 --- a/docs/recipes/mcp_and_tooluse/search_agent.md +++ /dev/null @@ -1,13 +0,0 @@ -# Nemotron Super Search Agent - -!!! note "Dev Note" - For a deep dive into the pipeline design, production yield analysis, correctness challenges, and key takeaways, see [Search Agent SFT Data: Teaching LLMs to Browse the Web](../../../devnotes/search-agent-sft-data-teaching-llms-to-browse-the-web/). - -!!! tip "Seed Dataset" - This recipe includes built-in demo seeds (3 Wikidata knowledge graph paths) for quick testing. For production use, generate your own seed dataset from Wikidata random walks -- the dev note above describes the seed generation process (SPARQL queries, anti-meta filters, hop range 4-8). Each seed row needs: `seed_entity`, `final_answer_entity`, `readable_path`, `num_hops_in_graph`, and `ground_truth`. Pass your seed file via `--seed-path`. 
- -[Download Code :octicons-download-24:](../../../assets/recipes/mcp_and_tooluse/search_agent.py){ .md-button download="search_agent.py" } - -```python ---8<-- "assets/recipes/mcp_and_tooluse/search_agent.py" -``` diff --git a/docs/recipes/plugin_development/markdown_seed_reader.md b/docs/recipes/plugin_development/markdown_seed_reader.md deleted file mode 100644 index 6f81582df..000000000 --- a/docs/recipes/plugin_development/markdown_seed_reader.md +++ /dev/null @@ -1,31 +0,0 @@ -# Markdown Section Seed Reader - -Turn a directory of Markdown files into a seed dataset with one row per section. This recipe stays in the same single-file format as the other recipes: it creates sample files, defines an inline `FileSystemSeedReader[DirectorySeedSource]`, and passes that reader to `DataDesigner(seed_readers=[...])`. - -This keeps the example focused on the actual seed reader contract: - -- implementing `build_manifest(...)` -- returning `1:N` hydrated rows from `hydrate_row(...)` -- declaring `output_columns` for the hydrated schema -- keeping `IndexRange` selection manifest-based - -Because the example reuses `DirectorySeedSource`, it does not register a brand-new `seed_type`. To package the same reader as an installable plugin, see [Build Your Own](../../plugins/build_your_own.md). 
- -## Run the Recipe - -Run the script directly: - -```bash -uv run markdown_seed_reader.py -``` - -The script prints two previews: - -- the full section dataset across all Markdown files -- a manifest-only selection using `IndexRange(start=1, end=1)` that still returns every section from the selected file - -[Download Code :octicons-download-24:](../../assets/recipes/plugin_development/markdown_seed_reader.py){ .md-button download="markdown_seed_reader.py" } - -```python ---8<-- "assets/recipes/plugin_development/markdown_seed_reader.py" -``` diff --git a/docs/recipes/qa_and_chat/multi_turn_chat.md b/docs/recipes/qa_and_chat/multi_turn_chat.md deleted file mode 100644 index d9b404d27..000000000 --- a/docs/recipes/qa_and_chat/multi_turn_chat.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/qa_and_chat/multi_turn_chat.py){ .md-button download="multi_turn_chat.py" } - -```python ---8<-- "assets/recipes/qa_and_chat/multi_turn_chat.py" -``` diff --git a/docs/recipes/qa_and_chat/product_info_qa.md b/docs/recipes/qa_and_chat/product_info_qa.md deleted file mode 100644 index a3a41e4e0..000000000 --- a/docs/recipes/qa_and_chat/product_info_qa.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/qa_and_chat/product_info_qa.py){ .md-button download="product_info_qa.py" } - -```python ---8<-- "assets/recipes/qa_and_chat/product_info_qa.py" -``` diff --git a/docs/recipes/trace_ingestion/agent_rollout_distillation.md b/docs/recipes/trace_ingestion/agent_rollout_distillation.md deleted file mode 100644 index 21e433dfe..000000000 --- a/docs/recipes/trace_ingestion/agent_rollout_distillation.md +++ /dev/null @@ -1,16 +0,0 @@ -[Download Code :octicons-download-24:](../../assets/recipes/trace_ingestion/agent_rollout_distillation.py){ .md-button download="agent_rollout_distillation.py" } - -This recipe ingests built-in agent rollout traces with `AgentRolloutSeedSource(...)`, selecting the 
format with -`--format` and optionally overriding the input directory with `--trace-dir`. It works with `atif`, `claude_code`, -`codex`, and `hermes_agent`; `atif` expects standalone `.json` trajectory files and requires `--trace-dir`, while -`claude_code`, `codex`, and `hermes_agent` can use their default locations when `--trace-dir` is omitted. The pipeline turns each -imported trace into a compact task digest, a standalone instruction-response pair for coding-assistant SFT, and a -judge-scored quality signal you can use for downstream filtering. It supports both full dataset creation and in-memory -preview mode via `--preview`. - -!!! info "Looking for ingestion details?" - See [Agent Rollout Ingestion](../../concepts/agent-rollout-ingestion.md) for supported formats, default paths, normalized columns, and rollout-specific parsing behavior. This recipe stays focused on the distillation pipeline. - -```python ---8<-- "assets/recipes/trace_ingestion/agent_rollout_distillation.py" -``` diff --git a/docs/recipes/vlm_long_doc/frontier_judge.md b/docs/recipes/vlm_long_doc/frontier_judge.md deleted file mode 100644 index bbf11d149..000000000 --- a/docs/recipes/vlm_long_doc/frontier_judge.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/vlm_long_doc/09-frontier-judge-sdg.py){ .md-button download="09-frontier-judge-sdg.py" } - -```python ---8<-- "assets/recipes/vlm_long_doc/09-frontier-judge-sdg.py" -``` diff --git a/docs/recipes/vlm_long_doc/multi_page_windowed_qa.md b/docs/recipes/vlm_long_doc/multi_page_windowed_qa.md deleted file mode 100644 index 28de86d0b..000000000 --- a/docs/recipes/vlm_long_doc/multi_page_windowed_qa.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/vlm_long_doc/07-multi-page-windowed-qa-sdg.py){ .md-button download="07-multi-page-windowed-qa-sdg.py" } - -```python ---8<-- "assets/recipes/vlm_long_doc/07-multi-page-windowed-qa-sdg.py" -``` diff --git 
a/docs/recipes/vlm_long_doc/nemotron_parse_ocr.md b/docs/recipes/vlm_long_doc/nemotron_parse_ocr.md deleted file mode 100644 index ed6943989..000000000 --- a/docs/recipes/vlm_long_doc/nemotron_parse_ocr.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/vlm_long_doc/02-nemotron-parse-ocr-sdg.py){ .md-button download="02-nemotron-parse-ocr-sdg.py" } - -```python ---8<-- "assets/recipes/vlm_long_doc/02-nemotron-parse-ocr-sdg.py" -``` diff --git a/docs/recipes/vlm_long_doc/page_classification.md b/docs/recipes/vlm_long_doc/page_classification.md deleted file mode 100644 index 953685ba7..000000000 --- a/docs/recipes/vlm_long_doc/page_classification.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/vlm_long_doc/04-page-classification-sdg.py){ .md-button download="04-page-classification-sdg.py" } - -```python ---8<-- "assets/recipes/vlm_long_doc/04-page-classification-sdg.py" -``` diff --git a/docs/recipes/vlm_long_doc/seed_dataset_preparation.md b/docs/recipes/vlm_long_doc/seed_dataset_preparation.md deleted file mode 100644 index b9a1f2b1b..000000000 --- a/docs/recipes/vlm_long_doc/seed_dataset_preparation.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/vlm_long_doc/01-seed-dataset-preparation.py){ .md-button download="01-seed-dataset-preparation.py" } - -```python ---8<-- "assets/recipes/vlm_long_doc/01-seed-dataset-preparation.py" -``` diff --git a/docs/recipes/vlm_long_doc/single_page_qa.md b/docs/recipes/vlm_long_doc/single_page_qa.md deleted file mode 100644 index 01a2bebd3..000000000 --- a/docs/recipes/vlm_long_doc/single_page_qa.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/vlm_long_doc/06-single-page-qa-sdg.py){ .md-button download="06-single-page-qa-sdg.py" } - -```python ---8<-- "assets/recipes/vlm_long_doc/06-single-page-qa-sdg.py" -``` diff --git 
a/docs/recipes/vlm_long_doc/text_qa.md b/docs/recipes/vlm_long_doc/text_qa.md deleted file mode 100644 index 4e6cbad35..000000000 --- a/docs/recipes/vlm_long_doc/text_qa.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/vlm_long_doc/03-text-qa-sdg.py){ .md-button download="03-text-qa-sdg.py" } - -```python ---8<-- "assets/recipes/vlm_long_doc/03-text-qa-sdg.py" -``` diff --git a/docs/recipes/vlm_long_doc/visual_qa.md b/docs/recipes/vlm_long_doc/visual_qa.md deleted file mode 100644 index e729dab88..000000000 --- a/docs/recipes/vlm_long_doc/visual_qa.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/vlm_long_doc/05-visual-qa-sdg.py){ .md-button download="05-visual-qa-sdg.py" } - -```python ---8<-- "assets/recipes/vlm_long_doc/05-visual-qa-sdg.py" -``` diff --git a/docs/recipes/vlm_long_doc/whole_document_qa.md b/docs/recipes/vlm_long_doc/whole_document_qa.md deleted file mode 100644 index cb624ceee..000000000 --- a/docs/recipes/vlm_long_doc/whole_document_qa.md +++ /dev/null @@ -1,5 +0,0 @@ -[Download Code :octicons-download-24:](../../../assets/recipes/vlm_long_doc/08-whole-document-qa-sdg.py){ .md-button download="08-whole-document-qa-sdg.py" } - -```python ---8<-- "assets/recipes/vlm_long_doc/08-whole-document-qa-sdg.py" -``` diff --git a/fern/AGENTS.md b/fern/AGENTS.md index c1d90db86..d679e99e0 100644 --- a/fern/AGENTS.md +++ b/fern/AGENTS.md @@ -30,7 +30,7 @@ The `docs-website` branch must already contain the historical Fern archive (`v0. Frozen `vX.Y.Z.yml` navs on `docs-website` must point only at their own `vX.Y.Z/pages/...` files. The release sync materializes shared historical pages into each version folder before publishing. -Dev Notes publishing patches only Dev Notes from `main` into the current latest docs on `docs-website`, mirroring the MkDocs workflow. +Dev Notes publishing patches only Dev Notes from `main` into the current latest docs on `docs-website`. 
## Release Prep diff --git a/fern/README.md b/fern/README.md index c6aad488a..81601270b 100644 --- a/fern/README.md +++ b/fern/README.md @@ -1,16 +1,15 @@ # Fern Docs -This folder is the Fern Docs build for NeMo Data Designer. The site currently deploys to **`datadesigner.docs.buildwithfern.com/nemo/datadesigner`**; [`docs.yml`](docs.yml) also declares the future `docs.nvidia.com/nemo/datadesigner` custom domain. +This folder is the Fern Docs build for NeMo Data Designer. The site deploys to **`datadesigner.docs.buildwithfern.com/nemo/datadesigner`** with the `docs.nvidia.com/nemo/datadesigner` custom domain declared in [`docs.yml`](docs.yml). -## Migration phase +## Current state -Data Designer is moving from MkDocs to Fern over several releases. During that transition: +Data Designer docs are Fern-first: -- Keep the MkDocs build and release archive working. -- Keep Fern working in parallel for local checks and hosted validation. -- Treat `docs/` as the docs source of truth unless a page has already been intentionally moved to Fern-only MDX. +- Edit docs prose under `fern/`. - Treat `docs/notebook_source/*.py` as the notebook source of truth. - Keep generated Fern API reference and notebook artifacts gitignored. +- Keep the legacy MkDocs `gh-pages` archive frozen for releases `0.5.7` and older. ## Prerequisites @@ -61,15 +60,15 @@ make serve-fern-docs-locally ## CI and publishing -Fern publishing runs alongside MkDocs during migration: +Fern publishing uses the dedicated Fern workflows: - `.github/workflows/build-fern-docs.yml` runs on release publication or manual dispatch. It snapshots release docs into the CI-managed `docs-website` branch, builds executed notebooks from the release source, runs `make check-fern-docs` from `docs-website`, and publishes Fern. - `.github/workflows/publish-fern-devnotes.yml` runs on `main` when Dev Notes or Fern Dev Notes assets change, plus manual dispatch. 
It patches only Dev Notes into the `docs-website` branch's current latest docs, reuses the last docs notebook artifact, runs `make check-fern-docs`, and publishes Fern. -- `.github/workflows/docs-preview.yml` remains the PR preview workflow and posts both MkDocs and Fern preview links for same-repository PRs. It converts tutorial sources without execution outputs for preview builds. Fork PRs still run docs build/checks, but skip hosted previews because those require deployment secrets. +- `.github/workflows/docs-preview.yml` posts Fern preview links for same-repository PRs. Fork PRs still run docs checks, but skip hosted previews because those require deployment secrets. These workflows require the org-level `DOCS_FERN_TOKEN` secret. The workflows expose it to the Fern CLI as `FERN_TOKEN`. -Fern release snapshots live on `docs-website`, not on `main`. This mirrors the MkDocs `gh-pages` model without mixing Fern source state into the MkDocs output branch. The branch stores a source snapshot, not only `fern/`, because `make check-fern-docs` needs the Python packages and workspace metadata. Pushes to `docs-website` use `GITHUB_TOKEN`, so publishing happens inline in the same workflow instead of relying on a second workflow trigger. +Fern release snapshots live on `docs-website`, not on `main`. The branch stores a source snapshot, not only `fern/`, because `make check-fern-docs` needs the Python packages and workspace metadata. Pushes to `docs-website` use `GITHUB_TOKEN`, so publishing happens inline in the same workflow instead of relying on a second workflow trigger. The `docs-website` branch is an orphan-style publish branch. Published commits include `fern/publish-metadata.json` with the source repository, ref, SHA, release tag when applicable, and published branch. @@ -107,7 +106,7 @@ Each frozen `vX.Y.Z.yml` nav on `docs-website` must point only at that version's Normal GitHub releases do not need a dedicated pre-release Fern PR. 
The release workflow snapshots the release into `docs-website` and publishes from that branch. -Dev Notes publishing mirrors MkDocs: it patches only the Dev Notes nav and pages from `main` into the current latest docs on `docs-website`, then republishes Fern. +Dev Notes publishing patches only the Dev Notes nav and pages from `main` into the current latest docs on `docs-website`, then republishes Fern. ## Folder layout @@ -118,7 +117,7 @@ fern/ ├── fern.config.json ← organization, fern-api version pin ├── main.css ← bundled NVIDIA theme CSS ├── assets/ ← logos, favicon, recipe assets, devnote post images -├── images/ ← /images/* references from MDX (mirror of docs/images) +├── images/ ← /images/* references from MDX ├── styles/ ← component-level CSS (notebook-viewer, authors, metrics-table, …) ├── components/ ← React components used by MDX │ ├── NotebookViewer.tsx ← renders converted .ipynb cells diff --git a/mkdocs.yml b/mkdocs.yml deleted file mode 100644 index b36566d1b..000000000 --- a/mkdocs.yml +++ /dev/null @@ -1,228 +0,0 @@ -site_name: NeMo Data Designer -site_url: https://nvidia-nemo.github.io/DataDesigner/ -repo_url: https://github.com/NVIDIA-NeMo/DataDesigner - -nav: - - Getting Started: - - Welcome: index.md - - Contributing: CONTRIBUTING.md - - Concepts: - - Columns: concepts/columns.md - - Seed Datasets: concepts/seed-datasets.md - - Agent Rollout Ingestion: concepts/agent-rollout-ingestion.md - - Models: - - Default Model Settings: concepts/models/default-model-settings.md - - Configure with the CLI: concepts/models/configure-model-settings-with-the-cli.md - - Custom Model Settings: concepts/models/custom-model-settings.md - - Model Providers: concepts/models/model-providers.md - - Model Configs: concepts/models/model-configs.md - - Inference Parameters: concepts/models/inference-parameters.md - - Custom Columns: concepts/custom_columns.md - - Validators: concepts/validators.md - - Processors: concepts/processors.md - - Workflow Chaining: 
concepts/workflow-chaining.md - - Person Sampling: concepts/person_sampling.md - - Traces: concepts/traces.md - - Tool Use & MCP: - - Overview: concepts/tool_use_and_mcp.md - - MCP Providers: concepts/mcp/mcp-providers.md - - Tool Configs: concepts/mcp/tool-configs.md - - Enabling Tools: concepts/mcp/enabling-tools.md - - CLI Configuration: concepts/mcp/configure-mcp-cli.md - - Safety & Limits: concepts/mcp/safety-and-limits.md - - Architecture & Performance: concepts/architecture-and-performance.md - - Deployment Options: concepts/deployment-options.md - - Security: concepts/security.md - - Tutorials: - - Overview: notebooks/README.md - - The Basics: notebooks/1-the-basics.ipynb - - Structured Outputs, Jinja Expressions, and Conditional Generation: notebooks/2-structured-outputs-and-jinja-expressions.ipynb - - Seeding with an External Dataset: notebooks/3-seeding-with-a-dataset.ipynb - - Providing Images as Context: notebooks/4-providing-images-as-context.ipynb - - Generating Images: notebooks/5-generating-images.ipynb - - Image-to-Image Editing: notebooks/6-editing-images-with-image-context.ipynb - - Recipes: - - Recipe Cards: recipes/cards.md - - Code Generation: - - Text to Python: recipes/code_generation/text_to_python.md - - Text to SQL: recipes/code_generation/text_to_sql.md - - "Nemotron Super Text to SQL": recipes/code_generation/enterprise_text_to_sql.md - - Plugin Development: - - Markdown Section Seed Reader Plugin: recipes/plugin_development/markdown_seed_reader.md - - QA and Chat: - - Product Info QA: recipes/qa_and_chat/product_info_qa.md - - Multi-Turn Chat: recipes/qa_and_chat/multi_turn_chat.md - - Trace Ingestion: - - Agent Rollout Trace Distillation: recipes/trace_ingestion/agent_rollout_distillation.md - - MCP and Tool Use: - - "Basic MCP Tool Use": recipes/mcp_and_tooluse/basic_mcp.md - - "PDF Document QA": recipes/mcp_and_tooluse/pdf_qa.md - - "Nemotron Super Search Agent": recipes/mcp_and_tooluse/search_agent.md - - VLM Long-Document 
Understanding: - - "Seed Dataset Preparation": recipes/vlm_long_doc/seed_dataset_preparation.md - - "Nemotron Parse OCR": recipes/vlm_long_doc/nemotron_parse_ocr.md - - "Text QA from OCR Transcripts": recipes/vlm_long_doc/text_qa.md - - "Page Classification": recipes/vlm_long_doc/page_classification.md - - "Visual QA": recipes/vlm_long_doc/visual_qa.md - - "Single-Page QA": recipes/vlm_long_doc/single_page_qa.md - - "Multi-Page Windowed QA": recipes/vlm_long_doc/multi_page_windowed_qa.md - - "Whole-Document QA": recipes/vlm_long_doc/whole_document_qa.md - - "Frontier Judge QA Filter": recipes/vlm_long_doc/frontier_judge.md - - Plugins: - - Overview: plugins/overview.md - - Build Your Own: plugins/build_your_own.md - - Using Models: plugins/models.md - - Discover Plugins: plugins/discover.md - - Code Reference: - - Overview: code_reference/index.md - # Keep module reference pages ordered alphabetically by nav label within each package group. - - Config: - - Overview: code_reference/config/index.md - - analysis: code_reference/config/analysis.md - - column_configs: code_reference/config/column_configs.md - - config_builder: code_reference/config/config_builder.md - - data_designer_config: code_reference/config/data_designer_config.md - - mcp: code_reference/config/mcp.md - - models: code_reference/config/models.md - - plugins: code_reference/config/plugins.md - - processors: code_reference/config/processors.md - - run_config: code_reference/config/run_config.md - - sampler_params: code_reference/config/sampler_params.md - - seeds: code_reference/config/seeds.md - - validator_params: code_reference/config/validator_params.md - - Engine: - - Overview: code_reference/engine/index.md - - column_generators: code_reference/engine/column_generators.md - - mcp: code_reference/engine/mcp.md - - processors: code_reference/engine/processors.md - - seed_readers: code_reference/engine/seed_readers.md - - Interface: - - Overview: code_reference/interface/index.md - - 
data_designer: code_reference/interface/data_designer.md - - errors: code_reference/interface/errors.md - - results: code_reference/interface/results.md - - Dev Notes: - # NOTE: Order is most recent -> oldest (so sidebar shows recent first!) - - devnotes/index.md - - Retriever SDG Plugin: devnotes/posts/retrieval-sdg-toolkit.md - - Have It Your Way: devnotes/posts/have-it-your-way.md - - VLM Long Document Understanding: devnotes/posts/vlm-long-document-understanding.md - - Push Datasets to Hugging Face Hub: devnotes/posts/push-datasets-to-hugging-face-hub.md - - "Text-to-SQL for Nemotron Super": devnotes/posts/text-to-sql.md - - "Async All the Way Down": devnotes/posts/async-all-the-way-down.md - - Owning the Model Stack: devnotes/posts/owning-the-model-stack.md - - Data Designer Got Skills: devnotes/posts/data-designer-got-skills.md - - Search Agent: devnotes/posts/search-agent.md - - Structured Outputs from Nemotron: devnotes/posts/structured-outputs-from-nemotron.md - - Deep Research Trajectories: devnotes/posts/deep-research-trajectories.md - - Design Principles: devnotes/posts/design-principles.md - - RQA Dataset: devnotes/posts/rqa.md - -theme: - name: material - custom_dir: docs/overrides - font: - text: Roboto - code: Fira Code - icon: - logo: fontawesome/solid/palette - repo: fontawesome/brands/github-alt - favicon: assets/palette-favicon.png - palette: - - media: "(prefers-color-scheme: dark)" - scheme: slate - primary: black - toggle: - icon: material/lightbulb-outline - name: Switch to light mode - - media: "(prefers-color-scheme: light)" - scheme: default - primary: black - toggle: - icon: material/lightbulb - name: Switch to dark mode - features: - - navigation.path - - navigation.footer - - navigation.indexes - - content.code.copy - - content.code.select - - content.code.annotate - -extra: - version: - provider: mike - alias: true - default: latest - -watch: - - packages/data-designer-config/src/data_designer - - 
packages/data-designer-engine/src/data_designer - - packages/data-designer/src/data_designer - - docs/ - -plugins: - - search - - blog: - blog_dir: devnotes - blog_toc: true - post_date_format: long - post_url_format: "{slug}" - categories: false - archive: true - archive_toc: true - - mike: - alias_type: symlink - canonical_version: latest - - mkdocs-jupyter: - execute: false - include_requirejs: true - ignore_h1_titles: True - include_source: True - ignore: - - "assets/recipes/**/*.py" - - mkdocstrings: - handlers: - python: - paths: - - packages/data-designer-config/src - - packages/data-designer-engine/src - - packages/data-designer/src - options: - show_symbol_type_heading: true - show_symbol_type_toc: true - show_root_toc_entry: true - show_object_full_path: false - filters: ["!^_"] - docstring_options: - ignore_init_summary: false - merge_init_into_class: true - docstring_section_style: table - summary: true - -extra_css: - - css/style.css - - css/mkdocstrings.css - -extra_javascript: - - js/toc-toggle.js - -markdown_extensions: - - tables - - pymdownx.superfences - - attr_list - - md_in_html - - admonition - - pymdownx.snippets: - base_path: - - docs/ - - . 
- url_download: true - - pymdownx.highlight: - pygments_lang_class: true - use_pygments: true - - pymdownx.tabbed: - alternate_style: true - - pymdownx.emoji: - emoji_index: !!python/name:material.extensions.emoji.twemoji - emoji_generator: !!python/name:material.extensions.emoji.to_svg diff --git a/pyproject.toml b/pyproject.toml index 828de9cd3..af1fa521f 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -45,17 +45,7 @@ dev = [ ] docs = [ "jupytext>=1.16.0,<2", - "mistune>=3.2.1,<4", # 3.2.1 fixes security advisory pulled in by nbconvert - "mike>=2.1.3,<3", - "mkdocs-jupyter>=0.25.1,<1", - "mkdocs-material>=9.6.22,<10", - "mkdocs-redirects>=1.2.2,<2", - "mkdocs>=1.6.1,<2", - "mkdocstrings-python>=1.18.2,<2", - "mkdocstrings>=0.30.1,<1", - "nbconvert>=7.17.1,<8", # 7.17.1 fixes security advisory pulled in by mkdocs-jupyter "py2fern==0.1.6", - "pymdown-extensions>=10.21.2,<11", ] notebooks = [ "aiohttp>=3.13.5,<4", # 3.13.5 fixes CVE-2026-22815 and CVE-2026-34513 through CVE-2026-34525 diff --git a/uv.lock b/uv.lock index d21c51538..cea7c0666 100644 --- a/uv.lock +++ b/uv.lock @@ -338,20 +338,6 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/a0/59/76ab57e3fe74484f48a53f8e337171b4a2349e506eabe136d7e01d059086/backports_asyncio_runner-1.2.0-py3-none-any.whl", hash = "sha256:0da0a936a8aeb554eccb426dc55af3ba63bcdc69fa1a600b5bb305413a4477b5", size = 12313, upload-time = "2025-07-02T02:27:14.263Z" }, ] -[[package]] -name = "backrefs" -version = "6.2" -source = { registry = "https://pypi.org/simple" } -sdist = { url = "https://files.pythonhosted.org/packages/4e/a6/e325ec73b638d3ede4421b5445d4a0b8b219481826cc079d510100af356c/backrefs-6.2.tar.gz", hash = "sha256:f44ff4d48808b243b6c0cdc6231e22195c32f77046018141556c66f8bab72a49", size = 7012303, upload-time = "2026-02-16T19:10:15.828Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/1b/39/3765df263e08a4df37f4f43cb5aa3c6c17a4bdd42ecfe841e04c26037171/backrefs-6.2-py310-none-any.whl", hash = 
"sha256:0fdc7b012420b6b144410342caeb8adc54c6866cf12064abc9bb211302e496f8", size = 381075, upload-time = "2026-02-16T19:10:04.322Z" }, - { url = "https://files.pythonhosted.org/packages/0f/f0/35240571e1b67ffb19dafb29ab34150b6f59f93f717b041082cdb1bfceb1/backrefs-6.2-py311-none-any.whl", hash = "sha256:08aa7fae530c6b2361d7bdcbda1a7c454e330cc9dbcd03f5c23205e430e5c3be", size = 392874, upload-time = "2026-02-16T19:10:06.314Z" }, - { url = "https://files.pythonhosted.org/packages/e3/63/77e8c9745b4d227cce9f5e0a6f68041278c5f9b18588b35905f5f19c1beb/backrefs-6.2-py312-none-any.whl", hash = "sha256:c3f4b9cb2af8cda0d87ab4f57800b57b95428488477be164dd2b47be54db0c90", size = 398787, upload-time = "2026-02-16T19:10:08.274Z" }, - { url = "https://files.pythonhosted.org/packages/c5/71/c754b1737ad99102e03fa3235acb6cb6d3ac9d6f596cbc3e5f236705abd8/backrefs-6.2-py313-none-any.whl", hash = "sha256:12df81596ab511f783b7d87c043ce26bc5b0288cf3bb03610fe76b8189282b2b", size = 400747, upload-time = "2026-02-16T19:10:09.791Z" }, - { url = "https://files.pythonhosted.org/packages/af/75/be12ba31a6eb20dccef2320cd8ccb3f7d9013b68ba4c70156259fee9e409/backrefs-6.2-py314-none-any.whl", hash = "sha256:e5f805ae09819caa1aa0623b4a83790e7028604aa2b8c73ba602c4454e665de7", size = 412602, upload-time = "2026-02-16T19:10:12.317Z" }, - { url = "https://files.pythonhosted.org/packages/21/f8/d02f650c47d05034dcd6f9c8cf94f39598b7a89c00ecda0ecb2911bc27e9/backrefs-6.2-py39-none-any.whl", hash = "sha256:664e33cd88c6840b7625b826ecf2555f32d491800900f5a541f772c485f7cda7", size = 381077, upload-time = "2026-02-16T19:10:13.74Z" }, -] - [[package]] name = "beautifulsoup4" version = "4.14.3" @@ -876,6 +862,8 @@ dependencies = [ { name = "mcp" }, { name = "networkx", version = "3.4.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" }, { name = "networkx", version = "3.6.1", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" }, + { name = 
"numpy", version = "2.2.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" }, + { name = "numpy", version = "2.4.3", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" }, { name = "python-multipart" }, { name = "ruff" }, { name = "scipy", version = "1.15.3", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" }, @@ -903,6 +891,7 @@ requires-dist = [ { name = "marko", specifier = ">=2.1.2,<3" }, { name = "mcp", specifier = ">=1.26.0,<2" }, { name = "networkx", specifier = ">=3.0,<4" }, + { name = "numpy", specifier = ">=1.23.5,<3" }, { name = "python-multipart", specifier = ">=0.0.27,<1" }, { name = "ruff", specifier = ">=0.14.10,<1" }, { name = "scipy", specifier = ">=1.11.0,<2" }, @@ -928,17 +917,7 @@ dev = [ ] docs = [ { name = "jupytext" }, - { name = "mike" }, - { name = "mistune" }, - { name = "mkdocs" }, - { name = "mkdocs-jupyter" }, - { name = "mkdocs-material" }, - { name = "mkdocs-redirects" }, - { name = "mkdocstrings" }, - { name = "mkdocstrings-python" }, - { name = "nbconvert" }, { name = "py2fern" }, - { name = "pymdown-extensions" }, ] notebooks = [ { name = "aiohttp" }, @@ -971,17 +950,7 @@ dev = [ ] docs = [ { name = "jupytext", specifier = ">=1.16.0,<2" }, - { name = "mike", specifier = ">=2.1.3,<3" }, - { name = "mistune", specifier = ">=3.2.1,<4" }, - { name = "mkdocs", specifier = ">=1.6.1,<2" }, - { name = "mkdocs-jupyter", specifier = ">=0.25.1,<1" }, - { name = "mkdocs-material", specifier = ">=9.6.22,<10" }, - { name = "mkdocs-redirects", specifier = ">=1.2.2,<2" }, - { name = "mkdocstrings", specifier = ">=0.30.1,<1" }, - { name = "mkdocstrings-python", specifier = ">=1.18.2,<2" }, - { name = "nbconvert", specifier = ">=7.17.1,<8" }, { name = "py2fern", specifier = "==0.1.6" }, - { name = "pymdown-extensions", specifier = ">=10.21.2,<11" }, ] notebooks = [ { name = "aiohttp", specifier = ">=3.13.5,<4" }, @@ -1372,53 
+1341,6 @@ http = [ { name = "aiohttp" }, ] -[[package]] -name = "ghp-import" -version = "2.1.0" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "python-dateutil" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/d9/29/d40217cbe2f6b1359e00c6c307bb3fc876ba74068cbab3dde77f03ca0dc4/ghp-import-2.1.0.tar.gz", hash = "sha256:9c535c4c61193c2df8871222567d7fd7e5014d835f97dc7b7439069e2413d343", size = 10943, upload-time = "2022-05-02T15:47:16.11Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/f7/ec/67fbef5d497f86283db54c22eec6f6140243aae73265799baaaa19cd17fb/ghp_import-2.1.0-py3-none-any.whl", hash = "sha256:8337dd7b50877f163d4c0289bc1f1c7f127550241988d568c1db512c4324a619", size = 11034, upload-time = "2022-05-02T15:47:14.552Z" }, -] - -[[package]] -name = "griffe" -version = "2.0.0" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "griffecli" }, - { name = "griffelib" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/04/56/28a0accac339c164b52a92c6cfc45a903acc0c174caa5c1713803467b533/griffe-2.0.0.tar.gz", hash = "sha256:c68979cd8395422083a51ea7cf02f9c119d889646d99b7b656ee43725de1b80f", size = 293906, upload-time = "2026-03-23T21:06:53.402Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/8b/94/ee21d41e7eb4f823b94603b9d40f86d3c7fde80eacc2c3c71845476dddaa/griffe-2.0.0-py3-none-any.whl", hash = "sha256:5418081135a391c3e6e757a7f3f156f1a1a746cc7b4023868ff7d5e2f9a980aa", size = 5214, upload-time = "2026-02-09T19:09:44.105Z" }, -] - -[[package]] -name = "griffecli" -version = "2.0.0" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "colorama" }, - { name = "griffelib" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/a4/f8/2e129fd4a86e52e58eefe664de05e7d502decf766e7316cc9e70fdec3e18/griffecli-2.0.0.tar.gz", hash = "sha256:312fa5ebb4ce6afc786356e2d0ce85b06c1c20d45abc42d74f0cda65e159f6ef", size = 56213, upload-time 
= "2026-03-23T21:06:54.8Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/e6/ed/d93f7a447bbf7a935d8868e9617cbe1cadf9ee9ee6bd275d3040fbf93d60/griffecli-2.0.0-py3-none-any.whl", hash = "sha256:9f7cd9ee9b21d55e91689358978d2385ae65c22f307a63fb3269acf3f21e643d", size = 9345, upload-time = "2026-02-09T19:09:42.554Z" }, -] - -[[package]] -name = "griffelib" -version = "2.0.0" -source = { registry = "https://pypi.org/simple" } -sdist = { url = "https://files.pythonhosted.org/packages/ad/06/eccbd311c9e2b3ca45dbc063b93134c57a1ccc7607c5e545264ad092c4a9/griffelib-2.0.0.tar.gz", hash = "sha256:e504d637a089f5cab9b5daf18f7645970509bf4f53eda8d79ed71cce8bd97934", size = 166312, upload-time = "2026-03-23T21:06:55.954Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/4d/51/c936033e16d12b627ea334aaaaf42229c37620d0f15593456ab69ab48161/griffelib-2.0.0-py3-none-any.whl", hash = "sha256:01284878c966508b6d6f1dbff9b6fa607bc062d8261c5c7253cb285b06422a7f", size = 142004, upload-time = "2026-02-09T19:09:40.561Z" }, -] - [[package]] name = "h11" version = "0.16.0" @@ -2269,15 +2191,6 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/40/44/3ee09a5b60cb44c4f2fbc1c9015cfd6ff5afc08f991cab295d3024dcbf2d/lxml-6.1.0-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:7da13bb6fbadfafb474e0226a30570a3445cfd47c86296f2446dafbd77079ace", size = 3508860, upload-time = "2026-04-18T04:32:48.619Z" }, ] -[[package]] -name = "markdown" -version = "3.10.2" -source = { registry = "https://pypi.org/simple" } -sdist = { url = "https://files.pythonhosted.org/packages/2b/f4/69fa6ed85ae003c2378ffa8f6d2e3234662abd02c10d216c0ba96081a238/markdown-3.10.2.tar.gz", hash = "sha256:994d51325d25ad8aa7ce4ebaec003febcce822c3f8c911e3b17c52f7f589f950", size = 368805, upload-time = "2026-02-09T14:57:26.942Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/de/1f/77fa3081e4f66ca3576c896ae5d31c3002ac6607f9747d2e3aa49227e464/markdown-3.10.2-py3-none-any.whl", hash = 
"sha256:e91464b71ae3ee7afd3017d9f358ef0baf158fd9a298db92f1d4761133824c36", size = 108180, upload-time = "2026-02-09T14:57:25.787Z" }, -] - [[package]] name = "markdown-it-py" version = "4.0.0" @@ -2442,32 +2355,6 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/b3/38/89ba8ad64ae25be8de66a6d463314cf1eb366222074cfda9ee839c56a4b4/mdurl-0.1.2-py3-none-any.whl", hash = "sha256:84008a41e51615a49fc9966191ff91509e3c40b939176e643fd50a5c2196b8f8", size = 9979, upload-time = "2022-08-14T12:40:09.779Z" }, ] -[[package]] -name = "mergedeep" -version = "1.3.4" -source = { registry = "https://pypi.org/simple" } -sdist = { url = "https://files.pythonhosted.org/packages/3a/41/580bb4006e3ed0361b8151a01d324fb03f420815446c7def45d02f74c270/mergedeep-1.3.4.tar.gz", hash = "sha256:0096d52e9dad9939c3d975a774666af186eda617e6ca84df4c94dec30004f2a8", size = 4661, upload-time = "2021-02-05T18:55:30.623Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/2c/19/04f9b178c2d8a15b076c8b5140708fa6ffc5601fb6f1e975537072df5b2a/mergedeep-1.3.4-py3-none-any.whl", hash = "sha256:70775750742b25c0d8f36c55aed03d24c3384d17c951b3175d898bd778ef0307", size = 6354, upload-time = "2021-02-05T18:55:29.583Z" }, -] - -[[package]] -name = "mike" -version = "2.1.4" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "jinja2" }, - { name = "mkdocs" }, - { name = "pyparsing" }, - { name = "pyyaml" }, - { name = "pyyaml-env-tag" }, - { name = "verspec" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/ec/09/de1cab0018eb5f1fbd9dcc26b6e61f9453c5ec2eb790949d6ed75e1ffe55/mike-2.1.4.tar.gz", hash = "sha256:75d549420b134603805a65fc67f7dcd9fcd0ad1454fb2c893d9e844cba1aa6e4", size = 38190, upload-time = "2026-03-08T02:46:29.187Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/48/f7/10f5e101db25741b91e4f4792c5d97b4fa834ead5cf509ae91097d939424/mike-2.1.4-py3-none-any.whl", hash = 
"sha256:39933e992e155dd70f2297e749a0ed78d8fd7942bc33a3666195d177758a280e", size = 33820, upload-time = "2026-03-08T02:46:28.149Z" }, -] - [[package]] name = "mistune" version = "3.2.1" @@ -2480,150 +2367,6 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/2a/7f/a946aa4f8752b37102b41e64dca18a1976ac705c3a0d1dfe74d820a02552/mistune-3.2.1-py3-none-any.whl", hash = "sha256:78cdb0ba5e938053ccf63651b352508d2efa9411dc8810bfb05f2dc5140c0048", size = 53749, upload-time = "2026-05-03T14:33:20.551Z" }, ] -[[package]] -name = "mkdocs" -version = "1.6.1" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "click" }, - { name = "colorama", marker = "sys_platform == 'win32'" }, - { name = "ghp-import" }, - { name = "jinja2" }, - { name = "markdown" }, - { name = "markupsafe" }, - { name = "mergedeep" }, - { name = "mkdocs-get-deps" }, - { name = "packaging" }, - { name = "pathspec" }, - { name = "pyyaml" }, - { name = "pyyaml-env-tag" }, - { name = "watchdog" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/bc/c6/bbd4f061bd16b378247f12953ffcb04786a618ce5e904b8c5a01a0309061/mkdocs-1.6.1.tar.gz", hash = "sha256:7b432f01d928c084353ab39c57282f29f92136665bdd6abf7c1ec8d822ef86f2", size = 3889159, upload-time = "2024-08-30T12:24:06.899Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/22/5b/dbc6a8cddc9cfa9c4971d59fb12bb8d42e161b7e7f8cc89e49137c5b279c/mkdocs-1.6.1-py3-none-any.whl", hash = "sha256:db91759624d1647f3f34aa0c3f327dd2601beae39a366d6e064c03468d35c20e", size = 3864451, upload-time = "2024-08-30T12:24:05.054Z" }, -] - -[[package]] -name = "mkdocs-autorefs" -version = "1.4.4" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "markdown" }, - { name = "markupsafe" }, - { name = "mkdocs" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/52/c0/f641843de3f612a6b48253f39244165acff36657a91cc903633d456ae1ac/mkdocs_autorefs-1.4.4.tar.gz", hash = 
"sha256:d54a284f27a7346b9c38f1f852177940c222da508e66edc816a0fa55fc6da197", size = 56588, upload-time = "2026-02-10T15:23:55.105Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/28/de/a3e710469772c6a89595fc52816da05c1e164b4c866a89e3cb82fb1b67c5/mkdocs_autorefs-1.4.4-py3-none-any.whl", hash = "sha256:834ef5408d827071ad1bc69e0f39704fa34c7fc05bc8e1c72b227dfdc5c76089", size = 25530, upload-time = "2026-02-10T15:23:53.817Z" }, -] - -[[package]] -name = "mkdocs-get-deps" -version = "0.2.2" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "mergedeep" }, - { name = "platformdirs" }, - { name = "pyyaml" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/ce/25/b3cccb187655b9393572bde9b09261d267c3bf2f2cdabe347673be5976a6/mkdocs_get_deps-0.2.2.tar.gz", hash = "sha256:8ee8d5f316cdbbb2834bc1df6e69c08fe769a83e040060de26d3c19fad3599a1", size = 11047, upload-time = "2026-03-10T02:46:33.632Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/88/29/744136411e785c4b0b744d5413e56555265939ab3a104c6a4b719dad33fd/mkdocs_get_deps-0.2.2-py3-none-any.whl", hash = "sha256:e7878cbeac04860b8b5e0ca31d3abad3df9411a75a32cde82f8e44b6c16ff650", size = 9555, upload-time = "2026-03-10T02:46:32.256Z" }, -] - -[[package]] -name = "mkdocs-jupyter" -version = "0.25.1" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "ipykernel" }, - { name = "jupytext" }, - { name = "mkdocs" }, - { name = "mkdocs-material" }, - { name = "nbconvert" }, - { name = "pygments" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/6c/23/6ffb8d2fd2117aa860a04c6fe2510b21bc3c3c085907ffdd851caba53152/mkdocs_jupyter-0.25.1.tar.gz", hash = "sha256:0e9272ff4947e0ec683c92423a4bfb42a26477c103ab1a6ab8277e2dcc8f7afe", size = 1626747, upload-time = "2024-10-15T14:56:32.373Z" } -wheels = [ - { url = 
"https://files.pythonhosted.org/packages/08/37/5f1fd5c3f6954b3256f8126275e62af493b96fb6aef6c0dbc4ee326032ad/mkdocs_jupyter-0.25.1-py3-none-any.whl", hash = "sha256:3f679a857609885d322880e72533ef5255561bbfdb13cfee2a1e92ef4d4ad8d8", size = 1456197, upload-time = "2024-10-15T14:56:29.854Z" }, -] - -[[package]] -name = "mkdocs-material" -version = "9.7.5" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "babel" }, - { name = "backrefs" }, - { name = "colorama" }, - { name = "jinja2" }, - { name = "markdown" }, - { name = "mkdocs" }, - { name = "mkdocs-material-extensions" }, - { name = "paginate" }, - { name = "pygments" }, - { name = "pymdown-extensions" }, - { name = "requests" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/74/76/5c202fecdc45d53e83e03a85bae70c48b6c81e9f87f0bc19a9e9c723bdc0/mkdocs_material-9.7.5.tar.gz", hash = "sha256:f76bdab532bad1d9c57ca7187b37eccf64dd12e1586909307f8856db3be384ea", size = 4097749, upload-time = "2026-03-10T15:43:22.809Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/45/e1/e8080dcfa95cca267662a6f4afe29237452bdeb5a2a6555ac83646d21915/mkdocs_material-9.7.5-py3-none-any.whl", hash = "sha256:7cf9df2ff121fd098ff6e05c732b0be3699afca9642e2dfe4926c40eb5873eec", size = 9305251, upload-time = "2026-03-10T15:43:19.089Z" }, -] - -[[package]] -name = "mkdocs-material-extensions" -version = "1.3.1" -source = { registry = "https://pypi.org/simple" } -sdist = { url = "https://files.pythonhosted.org/packages/79/9b/9b4c96d6593b2a541e1cb8b34899a6d021d208bb357042823d4d2cabdbe7/mkdocs_material_extensions-1.3.1.tar.gz", hash = "sha256:10c9511cea88f568257f960358a467d12b970e1f7b2c0e5fb2bb48cab1928443", size = 11847, upload-time = "2023-11-22T19:09:45.208Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/5b/54/662a4743aa81d9582ee9339d4ffa3c8fd40a4965e033d77b9da9774d3960/mkdocs_material_extensions-1.3.1-py3-none-any.whl", hash = 
"sha256:adff8b62700b25cb77b53358dad940f3ef973dd6db797907c49e3c2ef3ab4e31", size = 8728, upload-time = "2023-11-22T19:09:43.465Z" }, -] - -[[package]] -name = "mkdocs-redirects" -version = "1.2.2" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "mkdocs" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/f1/a8/6d44a6cf07e969c7420cb36ab287b0669da636a2044de38a7d2208d5a758/mkdocs_redirects-1.2.2.tar.gz", hash = "sha256:3094981b42ffab29313c2c1b8ac3969861109f58b2dd58c45fc81cd44bfa0095", size = 7162, upload-time = "2024-11-07T14:57:21.109Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/c4/ec/38443b1f2a3821bbcb24e46cd8ba979154417794d54baf949fefde1c2146/mkdocs_redirects-1.2.2-py3-none-any.whl", hash = "sha256:7dbfa5647b79a3589da4401403d69494bd1f4ad03b9c15136720367e1f340ed5", size = 6142, upload-time = "2024-11-07T14:57:19.143Z" }, -] - -[[package]] -name = "mkdocstrings" -version = "0.30.1" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "jinja2" }, - { name = "markdown" }, - { name = "markupsafe" }, - { name = "mkdocs" }, - { name = "mkdocs-autorefs" }, - { name = "pymdown-extensions" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/c5/33/2fa3243439f794e685d3e694590d28469a9b8ea733af4b48c250a3ffc9a0/mkdocstrings-0.30.1.tar.gz", hash = "sha256:84a007aae9b707fb0aebfc9da23db4b26fc9ab562eb56e335e9ec480cb19744f", size = 106350, upload-time = "2025-09-19T10:49:26.446Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/7b/2c/f0dc4e1ee7f618f5bff7e05898d20bf8b6e7fa612038f768bfa295f136a4/mkdocstrings-0.30.1-py3-none-any.whl", hash = "sha256:41bd71f284ca4d44a668816193e4025c950b002252081e387433656ae9a70a82", size = 36704, upload-time = "2025-09-19T10:49:24.805Z" }, -] - -[[package]] -name = "mkdocstrings-python" -version = "1.19.0" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "griffe" }, - { name = "mkdocs-autorefs" }, - { name 
= "mkdocstrings" }, - { name = "typing-extensions", marker = "python_full_version < '3.11'" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/75/1c/3af8413919b0839b96a78f60e8bd0dfd26c844d3717eeb77f80b43f5be1c/mkdocstrings_python-1.19.0.tar.gz", hash = "sha256:917aac66cf121243c11db5b89f66b0ded6c53ec0de5318ff5e22424eb2f2e57c", size = 204010, upload-time = "2025-11-10T13:30:55.915Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/98/5c/2597cef67b6947b15c47f8dba967a0baf19fbdfdc86f6e4a8ba7af8b581a/mkdocstrings_python-1.19.0-py3-none-any.whl", hash = "sha256:395c1032af8f005234170575cc0c5d4d20980846623b623b35594281be4a3059", size = 143417, upload-time = "2025-11-10T13:30:54.164Z" }, -] - [[package]] name = "multidict" version = "6.7.1" @@ -3075,15 +2818,6 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/df/b2/87e62e8c3e2f4b32e5fe99e0b86d576da1312593b39f47d8ceef365e95ed/packaging-26.2-py3-none-any.whl", hash = "sha256:5fc45236b9446107ff2415ce77c807cee2862cb6fac22b8a73826d0693b0980e", size = 100195, upload-time = "2026-04-24T20:15:22.081Z" }, ] -[[package]] -name = "paginate" -version = "0.5.7" -source = { registry = "https://pypi.org/simple" } -sdist = { url = "https://files.pythonhosted.org/packages/ec/46/68dde5b6bc00c1296ec6466ab27dddede6aec9af1b99090e1107091b3b84/paginate-0.5.7.tar.gz", hash = "sha256:22bd083ab41e1a8b4f3690544afb2c60c25e5c9a63a30fa2f483f6c60c8e5945", size = 19252, upload-time = "2024-08-25T14:17:24.139Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/90/96/04b8e52da071d28f5e21a805b19cb9390aa17a47462ac87f5e2696b9566d/paginate-0.5.7-py2.py3-none-any.whl", hash = "sha256:b885e2af73abcf01d9559fd5216b57ef722f8c42affbb63942377668e35c7591", size = 13746, upload-time = "2024-08-25T14:17:22.55Z" }, -] - [[package]] name = "pandas" version = "2.3.3" @@ -3741,19 +3475,6 @@ crypto = [ { name = "cryptography" }, ] -[[package]] -name = "pymdown-extensions" -version = "10.21.2" -source = { registry = 
"https://pypi.org/simple" } -dependencies = [ - { name = "markdown" }, - { name = "pyyaml" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/df/08/f1c908c581fd11913da4711ea7ba32c0eee40b0190000996bb863b0c9349/pymdown_extensions-10.21.2.tar.gz", hash = "sha256:c3f55a5b8a1d0edf6699e35dcbea71d978d34ff3fa79f3d807b8a5b3fa90fbdc", size = 853922, upload-time = "2026-03-29T15:01:55.233Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/f7/27/a2fc51a4a122dfd1015e921ae9d22fee3d20b0b8080d9a704578bf9deece/pymdown_extensions-10.21.2-py3-none-any.whl", hash = "sha256:5c0fd2a2bea14eb39af8ff284f1066d898ab2187d81b889b75d46d4348c01638", size = 268901, upload-time = "2026-03-29T15:01:53.244Z" }, -] - [[package]] name = "pymupdf" version = "1.27.2" @@ -3770,15 +3491,6 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/e6/a4/e9257882f0569a21d51207a58f7586a799e76dc6b4008029a04f2329194c/pymupdf-1.27.2-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:261c916915cede4c546559810d3210277f86f31b52dd3de138f1e12d95a4c6b6", size = 24985149, upload-time = "2026-03-10T12:39:02.636Z" }, ] -[[package]] -name = "pyparsing" -version = "3.3.2" -source = { registry = "https://pypi.org/simple" } -sdist = { url = "https://files.pythonhosted.org/packages/f3/91/9c6ee907786a473bf81c5f53cf703ba0957b23ab84c264080fb5a450416f/pyparsing-3.3.2.tar.gz", hash = "sha256:c777f4d763f140633dcb6d8a3eda953bf7a214dc4eff598413c070bcdc117cbc", size = 6851574, upload-time = "2026-01-21T03:57:59.36Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/10/bd/c038d7cc38edc1aa5bf91ab8068b63d4308c66c4c8bb3cbba7dfbc049f9c/pyparsing-3.3.2-py3-none-any.whl", hash = "sha256:850ba148bd908d7e2411587e247a1e4f0327839c40e2e5e6d05a007ecc69911d", size = 122781, upload-time = "2026-01-21T03:57:55.912Z" }, -] - [[package]] name = "pytest" version = "9.0.3" @@ -4021,18 +3733,6 @@ wheels = [ { url = 
"https://files.pythonhosted.org/packages/f1/12/de94a39c2ef588c7e6455cfbe7343d3b2dc9d6b6b2f40c4c6565744c873d/pyyaml-6.0.3-cp314-cp314t-win_arm64.whl", hash = "sha256:ebc55a14a21cb14062aa4162f906cd962b28e2e9ea38f9b4391244cd8de4ae0b", size = 149341, upload-time = "2025-09-25T21:32:56.828Z" }, ] -[[package]] -name = "pyyaml-env-tag" -version = "1.1" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "pyyaml" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/eb/2e/79c822141bfd05a853236b504869ebc6b70159afc570e1d5a20641782eaa/pyyaml_env_tag-1.1.tar.gz", hash = "sha256:2eb38b75a2d21ee0475d6d97ec19c63287a7e140231e4214969d0eac923cd7ff", size = 5737, upload-time = "2025-05-13T15:24:01.64Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/04/11/432f32f8097b03e3cd5fe57e88efb685d964e2e5178a48ed61e841f7fdce/pyyaml_env_tag-1.1-py3-none-any.whl", hash = "sha256:17109e1a528561e32f026364712fee1264bc2ea6715120891174ed1b980d2e04", size = 4722, upload-time = "2025-05-13T15:23:59.629Z" }, -] - [[package]] name = "pyzmq" version = "27.1.0" @@ -4958,15 +4658,6 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/83/e4/d04a086285c20886c0daad0e026f250869201013d18f81d9ff5eada73a88/uvicorn-0.41.0-py3-none-any.whl", hash = "sha256:29e35b1d2c36a04b9e180d4007ede3bcb32a85fbdfd6c6aeb3f26839de088187", size = 68783, upload-time = "2026-02-16T23:07:22.357Z" }, ] -[[package]] -name = "verspec" -version = "0.1.0" -source = { registry = "https://pypi.org/simple" } -sdist = { url = "https://files.pythonhosted.org/packages/e7/44/8126f9f0c44319b2efc65feaad589cadef4d77ece200ae3c9133d58464d0/verspec-0.1.0.tar.gz", hash = "sha256:c4504ca697b2056cdb4bfa7121461f5a0e81809255b41c03dda4ba823637c01e", size = 27123, upload-time = "2020-11-30T02:24:09.646Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/a4/ce/3b6fee91c85626eaf769d617f1be9d2e15c1cca027bbdeb2e0d751469355/verspec-0.1.0-py3-none-any.whl", hash = 
"sha256:741877d5633cc9464c45a469ae2a31e801e6dbbaa85b9675d481cda100f11c31", size = 19640, upload-time = "2020-11-30T02:24:08.387Z" }, -] - [[package]] name = "virtualenv" version = "21.2.0" @@ -4983,38 +4674,6 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/c6/59/7d02447a55b2e55755011a647479041bc92a82e143f96a8195cb33bd0a1c/virtualenv-21.2.0-py3-none-any.whl", hash = "sha256:1bd755b504931164a5a496d217c014d098426cddc79363ad66ac78125f9d908f", size = 5825084, upload-time = "2026-03-09T17:24:35.378Z" }, ] -[[package]] -name = "watchdog" -version = "6.0.0" -source = { registry = "https://pypi.org/simple" } -sdist = { url = "https://files.pythonhosted.org/packages/db/7d/7f3d619e951c88ed75c6037b246ddcf2d322812ee8ea189be89511721d54/watchdog-6.0.0.tar.gz", hash = "sha256:9ddf7c82fda3ae8e24decda1338ede66e1c99883db93711d8fb941eaa2d8c282", size = 131220, upload-time = "2024-11-01T14:07:13.037Z" } -wheels = [ - { url = "https://files.pythonhosted.org/packages/0c/56/90994d789c61df619bfc5ce2ecdabd5eeff564e1eb47512bd01b5e019569/watchdog-6.0.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:d1cdb490583ebd691c012b3d6dae011000fe42edb7a82ece80965b42abd61f26", size = 96390, upload-time = "2024-11-01T14:06:24.793Z" }, - { url = "https://files.pythonhosted.org/packages/55/46/9a67ee697342ddf3c6daa97e3a587a56d6c4052f881ed926a849fcf7371c/watchdog-6.0.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:bc64ab3bdb6a04d69d4023b29422170b74681784ffb9463ed4870cf2f3e66112", size = 88389, upload-time = "2024-11-01T14:06:27.112Z" }, - { url = "https://files.pythonhosted.org/packages/44/65/91b0985747c52064d8701e1075eb96f8c40a79df889e59a399453adfb882/watchdog-6.0.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:c897ac1b55c5a1461e16dae288d22bb2e412ba9807df8397a635d88f671d36c3", size = 89020, upload-time = "2024-11-01T14:06:29.876Z" }, - { url = 
"https://files.pythonhosted.org/packages/e0/24/d9be5cd6642a6aa68352ded4b4b10fb0d7889cb7f45814fb92cecd35f101/watchdog-6.0.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:6eb11feb5a0d452ee41f824e271ca311a09e250441c262ca2fd7ebcf2461a06c", size = 96393, upload-time = "2024-11-01T14:06:31.756Z" }, - { url = "https://files.pythonhosted.org/packages/63/7a/6013b0d8dbc56adca7fdd4f0beed381c59f6752341b12fa0886fa7afc78b/watchdog-6.0.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:ef810fbf7b781a5a593894e4f439773830bdecb885e6880d957d5b9382a960d2", size = 88392, upload-time = "2024-11-01T14:06:32.99Z" }, - { url = "https://files.pythonhosted.org/packages/d1/40/b75381494851556de56281e053700e46bff5b37bf4c7267e858640af5a7f/watchdog-6.0.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:afd0fe1b2270917c5e23c2a65ce50c2a4abb63daafb0d419fde368e272a76b7c", size = 89019, upload-time = "2024-11-01T14:06:34.963Z" }, - { url = "https://files.pythonhosted.org/packages/39/ea/3930d07dafc9e286ed356a679aa02d777c06e9bfd1164fa7c19c288a5483/watchdog-6.0.0-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:bdd4e6f14b8b18c334febb9c4425a878a2ac20efd1e0b231978e7b150f92a948", size = 96471, upload-time = "2024-11-01T14:06:37.745Z" }, - { url = "https://files.pythonhosted.org/packages/12/87/48361531f70b1f87928b045df868a9fd4e253d9ae087fa4cf3f7113be363/watchdog-6.0.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:c7c15dda13c4eb00d6fb6fc508b3c0ed88b9d5d374056b239c4ad1611125c860", size = 88449, upload-time = "2024-11-01T14:06:39.748Z" }, - { url = "https://files.pythonhosted.org/packages/5b/7e/8f322f5e600812e6f9a31b75d242631068ca8f4ef0582dd3ae6e72daecc8/watchdog-6.0.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:6f10cb2d5902447c7d0da897e2c6768bca89174d0c6e1e30abec5421af97a5b0", size = 89054, upload-time = "2024-11-01T14:06:41.009Z" }, - { url = 
"https://files.pythonhosted.org/packages/68/98/b0345cabdce2041a01293ba483333582891a3bd5769b08eceb0d406056ef/watchdog-6.0.0-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:490ab2ef84f11129844c23fb14ecf30ef3d8a6abafd3754a6f75ca1e6654136c", size = 96480, upload-time = "2024-11-01T14:06:42.952Z" }, - { url = "https://files.pythonhosted.org/packages/85/83/cdf13902c626b28eedef7ec4f10745c52aad8a8fe7eb04ed7b1f111ca20e/watchdog-6.0.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:76aae96b00ae814b181bb25b1b98076d5fc84e8a53cd8885a318b42b6d3a5134", size = 88451, upload-time = "2024-11-01T14:06:45.084Z" }, - { url = "https://files.pythonhosted.org/packages/fe/c4/225c87bae08c8b9ec99030cd48ae9c4eca050a59bf5c2255853e18c87b50/watchdog-6.0.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:a175f755fc2279e0b7312c0035d52e27211a5bc39719dd529625b1930917345b", size = 89057, upload-time = "2024-11-01T14:06:47.324Z" }, - { url = "https://files.pythonhosted.org/packages/30/ad/d17b5d42e28a8b91f8ed01cb949da092827afb9995d4559fd448d0472763/watchdog-6.0.0-pp310-pypy310_pp73-macosx_10_15_x86_64.whl", hash = "sha256:c7ac31a19f4545dd92fc25d200694098f42c9a8e391bc00bdd362c5736dbf881", size = 87902, upload-time = "2024-11-01T14:06:53.119Z" }, - { url = "https://files.pythonhosted.org/packages/5c/ca/c3649991d140ff6ab67bfc85ab42b165ead119c9e12211e08089d763ece5/watchdog-6.0.0-pp310-pypy310_pp73-macosx_11_0_arm64.whl", hash = "sha256:9513f27a1a582d9808cf21a07dae516f0fab1cf2d7683a742c498b93eedabb11", size = 88380, upload-time = "2024-11-01T14:06:55.19Z" }, - { url = "https://files.pythonhosted.org/packages/a9/c7/ca4bf3e518cb57a686b2feb4f55a1892fd9a3dd13f470fca14e00f80ea36/watchdog-6.0.0-py3-none-manylinux2014_aarch64.whl", hash = "sha256:7607498efa04a3542ae3e05e64da8202e58159aa1fa4acddf7678d34a35d4f13", size = 79079, upload-time = "2024-11-01T14:06:59.472Z" }, - { url = 
"https://files.pythonhosted.org/packages/5c/51/d46dc9332f9a647593c947b4b88e2381c8dfc0942d15b8edc0310fa4abb1/watchdog-6.0.0-py3-none-manylinux2014_armv7l.whl", hash = "sha256:9041567ee8953024c83343288ccc458fd0a2d811d6a0fd68c4c22609e3490379", size = 79078, upload-time = "2024-11-01T14:07:01.431Z" }, - { url = "https://files.pythonhosted.org/packages/d4/57/04edbf5e169cd318d5f07b4766fee38e825d64b6913ca157ca32d1a42267/watchdog-6.0.0-py3-none-manylinux2014_i686.whl", hash = "sha256:82dc3e3143c7e38ec49d61af98d6558288c415eac98486a5c581726e0737c00e", size = 79076, upload-time = "2024-11-01T14:07:02.568Z" }, - { url = "https://files.pythonhosted.org/packages/ab/cc/da8422b300e13cb187d2203f20b9253e91058aaf7db65b74142013478e66/watchdog-6.0.0-py3-none-manylinux2014_ppc64.whl", hash = "sha256:212ac9b8bf1161dc91bd09c048048a95ca3a4c4f5e5d4a7d1b1a7d5752a7f96f", size = 79077, upload-time = "2024-11-01T14:07:03.893Z" }, - { url = "https://files.pythonhosted.org/packages/2c/3b/b8964e04ae1a025c44ba8e4291f86e97fac443bca31de8bd98d3263d2fcf/watchdog-6.0.0-py3-none-manylinux2014_ppc64le.whl", hash = "sha256:e3df4cbb9a450c6d49318f6d14f4bbc80d763fa587ba46ec86f99f9e6876bb26", size = 79078, upload-time = "2024-11-01T14:07:05.189Z" }, - { url = "https://files.pythonhosted.org/packages/62/ae/a696eb424bedff7407801c257d4b1afda455fe40821a2be430e173660e81/watchdog-6.0.0-py3-none-manylinux2014_s390x.whl", hash = "sha256:2cce7cfc2008eb51feb6aab51251fd79b85d9894e98ba847408f662b3395ca3c", size = 79077, upload-time = "2024-11-01T14:07:06.376Z" }, - { url = "https://files.pythonhosted.org/packages/b5/e8/dbf020b4d98251a9860752a094d09a65e1b436ad181faf929983f697048f/watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl", hash = "sha256:20ffe5b202af80ab4266dcd3e91aae72bf2da48c0d33bdb15c66658e685e94e2", size = 79078, upload-time = "2024-11-01T14:07:07.547Z" }, - { url = "https://files.pythonhosted.org/packages/07/f6/d0e5b343768e8bcb4cda79f0f2f55051bf26177ecd5651f84c07567461cf/watchdog-6.0.0-py3-none-win32.whl", hash 
= "sha256:07df1fdd701c5d4c8e55ef6cf55b8f0120fe1aef7ef39a1c6fc6bc2e606d517a", size = 79065, upload-time = "2024-11-01T14:07:09.525Z" }, - { url = "https://files.pythonhosted.org/packages/db/d9/c495884c6e548fce18a8f40568ff120bc3a4b7b99813081c8ac0c936fa64/watchdog-6.0.0-py3-none-win_amd64.whl", hash = "sha256:cbafb470cf848d93b5d013e2ecb245d4aa1c8fd0504e863ccefa32445359d680", size = 79070, upload-time = "2024-11-01T14:07:10.686Z" }, - { url = "https://files.pythonhosted.org/packages/33/e8/e40370e6d74ddba47f002a32919d91310d6074130fe4e17dabcafc15cbf1/watchdog-6.0.0-py3-none-win_ia64.whl", hash = "sha256:a1914259fa9e1454315171103c6a30961236f508b9b623eae470268bbcc6a22f", size = 79067, upload-time = "2024-11-01T14:07:11.845Z" }, -] - [[package]] name = "wcwidth" version = "0.6.0"