docs: add async service ingestor examples#2182

Open
colesmcintosh wants to merge 3 commits into NVIDIA:main from colesmcintosh:docs-async-ingestor-api-jobs

Conversation

@colesmcintosh colesmcintosh commented May 31, 2026

Description

Adds service-mode API guide examples for async ingestor jobs.

This documents:

  • when to use ingest(), ingest_stream(), and aingest_stream()
  • a synchronous streaming example using ingest_stream()
  • a true async example using async for event in ingestor.aingest_stream()
  • the event shapes emitted by service-mode streaming ingest jobs

Closes #1134.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

Validation:

  • git diff --check (clean)
  • mkdocs build -f mkdocs.yml --strict passes
  • verified Markdown code fences and event table counts

Signed-off-by: Cole McIntosh <colemcintosh6@gmail.com>

copy-pr-bot Bot commented May 31, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@colesmcintosh colesmcintosh marked this pull request as ready for review June 1, 2026 22:17
@colesmcintosh colesmcintosh requested review from a team as code owners June 1, 2026 22:17
Contributor

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

This PR adds a new "Service-mode async ingest jobs" section to the API reference, documenting when to use ingest(), ingest_stream(), and aingest_stream(), along with complete code examples and an event shapes table. The previously flagged issues around asyncio.run() compatibility and the unhandled job_started event have both been addressed with clarifying prose in this revision.

  • Adds synchronous streaming example (using match/case) and an async streaming example (using if/elif), both demonstrating the main event types emitted by service-mode ingest jobs.
  • Adds a reference table covering all nine event types with their key fields and a note that unhandled events (e.g., job_started) can safely be ignored.
  • Includes guidance on when to prefer each ingest surface and how to avoid asyncio.run() in already-running event loops.

Confidence Score: 5/5

Documentation-only change that adds no executable production code; safe to merge.

The change is entirely additive documentation. Both code examples are syntactically correct Python, the event shapes table is internally consistent, and the previously raised concerns about asyncio.run() compatibility and the unhandled job_started event have been resolved with explanatory prose in this revision.

No files require special attention.

Important Files Changed

Filename Overview
docs/docs/extraction/nemo-retriever-api-reference.md Adds 131 lines of new documentation covering service-mode async ingest, with two code examples and a full event shapes reference table; no logic issues found.

Sequence Diagram

sequenceDiagram
    participant C as Client (Python)
    participant S as NeMo Retriever Service

    C->>S: "create_ingestor(run_mode="service") + .extract().embed()"
    C->>S: Upload documents (ingest_stream / aingest_stream)
    S-->>C: "job_created {job_id, expected_documents}"
    S-->>C: "upload_complete {filename, document_id} (per file)"
    S-->>C: "upload_failed {filename, error} (on failure)"
    S-->>C: "job_started {job_id, status, ...}"
    loop Per document
        S-->>C: "document_complete {document_id, status, result_rows, ...}"
    end
    loop Progress milestones
        S-->>C: "job_progress {completed, failed, remaining, ...}"
    end
    S-->>C: "job_finalized | job_partial | job_failed {job_id, status, ...}"
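The event order in the diagram can be exercised without a running service. The sketch below is a stand-in, not the NeMo Retriever client: `fake_aingest_stream` and `heartbeat` are hypothetical names, and every field value (job and document IDs, status strings) is made up for illustration. It only shows an `async for` consumer receiving events in the documented order while another task keeps running.

```python
import asyncio


async def fake_aingest_stream(filenames):
    """Hypothetical stand-in for a service-mode stream; all values are made up."""
    job_id = "job-0"
    yield {"event": "job_created", "job_id": job_id, "expected_documents": len(filenames)}
    for i, name in enumerate(filenames):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"event": "upload_complete", "job_id": job_id,
               "filename": name, "document_id": f"doc-{i}"}
    yield {"event": "job_started", "job_id": job_id, "status": "processing"}
    for i, _ in enumerate(filenames):
        await asyncio.sleep(0)
        yield {"event": "document_complete", "job_id": job_id,
               "document_id": f"doc-{i}", "status": "completed"}
    yield {"event": "job_finalized", "job_id": job_id, "status": "completed"}


async def heartbeat(ticks):
    """Other async work that should keep moving while the stream is consumed."""
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0)


async def main():
    seen, ticks = [], []
    other = asyncio.create_task(heartbeat(ticks))
    async for event in fake_aingest_stream(["a.pdf", "b.pdf"]):
        seen.append(event["event"])
    await other  # make sure the concurrent task has finished
    return seen, ticks


# Standalone-script entry point; no event loop is running yet.
seen, ticks = asyncio.run(main())
print(seen)
print(len(ticks))
```

With a real client, the `async for` line would iterate over `ingestor.aingest_stream()` instead of the stub.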

Reviews (2): Last reviewed commit: "docs: clarify async entry point and unha..."

Comment on lines +91 to +95
asyncio.run(main())
```

Use this form from async web services, task runners, or notebooks that need to
keep other async work moving while ingestion is in flight.
Contributor

P2 asyncio.run() incompatible with the stated use cases (notebooks, web services)

The closing asyncio.run(main()) only works in a standalone script. Jupyter notebooks already have a running event loop, so this call raises RuntimeError: asyncio.run() cannot be called from a running event loop. FastAPI and other async web frameworks have the same restriction: callers there would await main() directly or inline the async for loop. The guidance text says "Use this form from async web services, task runners, or notebooks", which is exactly where asyncio.run() cannot be called.
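The restriction can be reproduced with the standard library alone. In this minimal sketch, `main` is a placeholder coroutine rather than the real ingest loop, and `caller_inside_running_loop` is a hypothetical stand-in for a notebook cell or request handler that already runs inside an event loop.

```python
import asyncio


async def main():
    """Placeholder for a coroutine that would consume the async stream."""
    return "done"


async def caller_inside_running_loop():
    # A loop is already running here, as in a notebook cell or a request handler.
    error = None
    coro = main()
    try:
        asyncio.run(coro)  # raises RuntimeError instead of nesting a second loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        error = str(exc)
    # Inside a running loop, await the coroutine directly instead.
    result = await main()
    return error, result


# A standalone script has no running loop, so asyncio.run() is the entry point.
error, result = asyncio.run(caller_inside_running_loop())
print(error)
print(result)
```

The outer `asyncio.run()` succeeds because nothing is running yet; only the nested call fails.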

Comment on lines +97 to +113
### Event shapes

The streaming APIs yield dictionaries. Check the `event` key first, then read the
fields that apply to that event type:

| Event | Meaning | Key fields |
| --- | --- | --- |
| `job_created` | The service created one aggregate job for the submitted document set. | `job_id`, `expected_documents` |
| `upload_complete` | One local file uploaded and was assigned a service document ID. | `job_id`, `filename`, `document_id` |
| `document_complete` | One document reached a terminal document state. | `job_id`, `document_id`, `status`, `result_rows`, `elapsed_s`, `error` |
| `upload_failed` | One local file could not be uploaded. | `job_id`, `filename`, `error` |
| `job_started` | At least one document in the job started processing. | `job_id`, `status`, `expected_documents`, `counts`, `completed`, `failed`, `remaining`, `progress_pct`, `started_at` |
| `job_progress` | The job reached a progress reporting milestone. | `job_id`, `status`, `expected_documents`, `counts`, `completed`, `failed`, `remaining`, `progress_pct`, `elapsed_s` |
| `job_finalized` | All documents completed successfully. | `job_id`, `status`, `expected_documents`, `counts`, `completed`, `failed`, `remaining`, `progress_pct`, `elapsed_s`, `finalized_at` |
| `job_partial` | Some documents completed and some failed. | `job_id`, `status`, `expected_documents`, `counts`, `completed`, `failed`, `remaining`, `progress_pct`, `elapsed_s`, `finalized_at` |
| `job_failed` | Every document in the job failed. | `job_id`, `status`, `expected_documents`, `counts`, `completed`, `failed`, `remaining`, `progress_pct`, `elapsed_s`, `finalized_at` |
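As a rough illustration of "check the `event` key first", the sketch below collects per-document outcomes until one of the three terminal job events arrives. The field names come from the table; the helper name `summarize`, the sample values, and the status strings are assumptions made for the example, not output from the service.

```python
TERMINAL_EVENTS = {"job_finalized", "job_partial", "job_failed"}


def summarize(events):
    """Collect per-document outcomes until a terminal job event arrives."""
    documents = {}
    for event in events:
        kind = event["event"]  # check the event key first
        if kind == "document_complete":
            documents[event["document_id"]] = event["status"]
        elif kind in TERMINAL_EVENTS:
            return {
                "outcome": kind,
                "completed": event["completed"],
                "failed": event["failed"],
                "documents": documents,
            }
    return None  # the stream ended without a terminal event


# Sample events: field names follow the table, values are made up.
sample = [
    {"event": "job_created", "job_id": "job-0", "expected_documents": 2},
    {"event": "document_complete", "job_id": "job-0", "document_id": "doc-0",
     "status": "completed", "result_rows": 12, "elapsed_s": 1.5, "error": None},
    {"event": "document_complete", "job_id": "job-0", "document_id": "doc-1",
     "status": "failed", "result_rows": 0, "elapsed_s": 0.4, "error": "parse error"},
    {"event": "job_partial", "job_id": "job-0", "status": "partial",
     "expected_documents": 2, "completed": 1, "failed": 1, "remaining": 0},
]

summary = summarize(sample)
print(summary)
```

Events the helper does not recognize, such as `job_created` here, are skipped without error.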

Contributor

P2 job_started event documented in the table but absent from both code examples

The event shapes table lists job_started with its full set of key fields (job_id, status, expected_documents, counts, completed, failed, remaining, progress_pct, started_at), but neither the synchronous nor the asynchronous code example handles it. Readers who copy the examples verbatim will silently drop job_started events. A catch-all (case _: ... / else: ...) branch or a brief note clarifying that job_started does not require a handler in typical use would remove the ambiguity.

@colesmcintosh
Author

Thanks for the review. Both points addressed in 574e93b:

  1. asyncio.run() in the async example — good catch. asyncio.run() cannot be nested inside a running loop, which is exactly the Jupyter/FastAPI case the prose calls out. I kept asyncio.run(main()) for the standalone-script path and added a note that inside an already-running loop you await main() (or inline the async for loop) instead.

  2. job_started not handled in the examples — added a sentence to the Event shapes intro clarifying that the examples handle the events most callers act on and that any unhandled event (such as job_started) can be safely ignored. I left the example code as-is rather than adding no-op branches, since the note conveys the intent without cluttering the illustrations.

Development

Successfully merging this pull request may close these issues.

[DOC]: Add docs for async ingestor API jobs

1 participant