docs: add async service ingestor examples #2182
Signed-off-by: Cole McIntosh <colemcintosh6@gmail.com>
Greptile Summary
This PR adds a new "Service-mode async ingest jobs" section to the API reference, documenting when to use
| Filename | Overview |
|---|---|
| docs/docs/extraction/nemo-retriever-api-reference.md | Adds 131 lines of new documentation covering service-mode async ingest, with two code examples and a full event shapes reference table; no logic issues found. |
Sequence Diagram
sequenceDiagram
participant C as Client (Python)
participant S as NeMo Retriever Service
C->>S: "create_ingestor(run_mode='service') + .extract().embed()"
C->>S: Upload documents (ingest_stream / aingest_stream)
S-->>C: "job_created {job_id, expected_documents}"
S-->>C: "upload_complete {filename, document_id} (per file)"
S-->>C: "upload_failed {filename, error} (on failure)"
S-->>C: "job_started {job_id, status, ...}"
loop Per document
S-->>C: "document_complete {document_id, status, result_rows, ...}"
end
loop Progress milestones
S-->>C: "job_progress {completed, failed, remaining, ...}"
end
S-->>C: "job_finalized | job_partial | job_failed {job_id, status, ...}"
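The ordering in the diagram suggests a simple consumption pattern: read events until one of the three terminal events arrives. The sketch below illustrates that idea only. `fake_stream` is a hypothetical stand-in for the service stream, and the `status` values and IDs are placeholders, not values confirmed by this PR.

```python
from typing import Dict, Iterable, Iterator, Optional

# Terminal event names, taken from the last arrow in the diagram.
TERMINAL_EVENTS = {"job_finalized", "job_partial", "job_failed"}


def fake_stream() -> Iterator[Dict]:
    """Hypothetical stand-in for the service stream; values are placeholders."""
    yield {"event": "job_created", "job_id": "job-1", "expected_documents": 1}
    yield {"event": "upload_complete", "filename": "a.pdf", "document_id": "doc-1"}
    yield {"event": "job_started", "job_id": "job-1", "status": "placeholder"}
    yield {"event": "document_complete", "document_id": "doc-1", "result_rows": 3}
    yield {"event": "job_finalized", "job_id": "job-1", "status": "placeholder"}


def wait_for_terminal(events: Iterable[Dict]) -> Optional[Dict]:
    """Consume events in order and return the first terminal one, if any."""
    for event in events:
        if event.get("event") in TERMINAL_EVENTS:
            return event
    return None  # the stream ended without a terminal event


final = wait_for_terminal(fake_stream())
print(final["event"])
```

Returning `None` when the stream ends early keeps the caller responsible for deciding whether a missing terminal event is an error.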
Reviews (2): Last reviewed commit: "docs: clarify async entry point and unha..."
```
asyncio.run(main())
```

Use this form from async web services, task runners, or notebooks that need to keep other async work moving while ingestion is in flight.
**`asyncio.run()` incompatible with the stated use cases (notebooks, web services)**
The closing `asyncio.run(main())` only works where no event loop is running yet, such as a standalone script. Jupyter notebooks already have a running event loop, so this call raises `RuntimeError: asyncio.run() cannot be called from a running event loop`. FastAPI and other async web frameworks have the same restriction; callers there would `await main()` directly or inline the `async for` loop. The guidance text says "Use this form from async web services, task runners, or notebooks", which is exactly where `asyncio.run()` cannot be called.
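The difference between the two entry points can be sketched with the standard library alone. `fake_aingest_stream` below is a hypothetical stand-in for `ingestor.aingest_stream()`, used only so the example runs without a service; the helper names are also illustrative.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_aingest_stream() -> AsyncIterator[Dict]:
    """Hypothetical stand-in for ingestor.aingest_stream()."""
    for name in ("job_created", "job_finalized"):
        await asyncio.sleep(0)  # hand control back to the loop between events
        yield {"event": name, "job_id": "job-1"}


async def main() -> List[str]:
    seen = []
    async for event in fake_aingest_stream():
        seen.append(event["event"])
    return seen


def run_from_script() -> List[str]:
    """Standalone script: no loop is running, so asyncio.run() is fine."""
    return asyncio.run(main())


async def run_from_async_caller() -> List[str]:
    """Notebook cell or async web handler: await the coroutine instead."""
    return await main()


async def misuse() -> str:
    """Show what happens when asyncio.run() is called under a running loop."""
    coro = main()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


script_result = run_from_script()
error_message = asyncio.run(misuse())
print(script_result)
print(error_message)
```

In a notebook, the equivalent of `run_from_async_caller` is a top-level `await main()` in a cell.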
### Event shapes

The streaming APIs yield dictionaries. Check the `event` key first, then read the fields that apply to that event type:

| Event | Meaning | Key fields |
| --- | --- | --- |
| `job_created` | The service created one aggregate job for the submitted document set. | `job_id`, `expected_documents` |
| `upload_complete` | One local file uploaded and was assigned a service document ID. | `job_id`, `filename`, `document_id` |
| `document_complete` | One document reached a terminal document state. | `job_id`, `document_id`, `status`, `result_rows`, `elapsed_s`, `error` |
| `upload_failed` | One local file could not be uploaded. | `job_id`, `filename`, `error` |
| `job_started` | At least one document in the job started processing. | `job_id`, `status`, `expected_documents`, `counts`, `completed`, `failed`, `remaining`, `progress_pct`, `started_at` |
| `job_progress` | The job reached a progress reporting milestone. | `job_id`, `status`, `expected_documents`, `counts`, `completed`, `failed`, `remaining`, `progress_pct`, `elapsed_s` |
| `job_finalized` | All documents completed successfully. | `job_id`, `status`, `expected_documents`, `counts`, `completed`, `failed`, `remaining`, `progress_pct`, `elapsed_s`, `finalized_at` |
| `job_partial` | Some documents completed and some failed. | `job_id`, `status`, `expected_documents`, `counts`, `completed`, `failed`, `remaining`, `progress_pct`, `elapsed_s`, `finalized_at` |
| `job_failed` | Every document in the job failed. | `job_id`, `status`, `expected_documents`, `counts`, `completed`, `failed`, `remaining`, `progress_pct`, `elapsed_s`, `finalized_at` |
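One way to apply the "check the `event` key first" rule is a lookup from event name to its key fields. The sketch below copies the field names from the event shapes table; the `KEY_FIELDS` mapping and `read_fields` helper are illustrative names, not part of the library. It assumes a listed field may be absent on a given event, so it reads with `dict.get`.

```python
from typing import Any, Dict, List

# Fields shared by all job-level events, per the event shapes table.
_JOB_FIELDS = [
    "job_id", "status", "expected_documents", "counts",
    "completed", "failed", "remaining", "progress_pct",
]

# Event name -> key fields, copied from the table.
KEY_FIELDS: Dict[str, List[str]] = {
    "job_created": ["job_id", "expected_documents"],
    "upload_complete": ["job_id", "filename", "document_id"],
    "upload_failed": ["job_id", "filename", "error"],
    "document_complete": [
        "job_id", "document_id", "status", "result_rows", "elapsed_s", "error",
    ],
    "job_started": _JOB_FIELDS + ["started_at"],
    "job_progress": _JOB_FIELDS + ["elapsed_s"],
    "job_finalized": _JOB_FIELDS + ["elapsed_s", "finalized_at"],
    "job_partial": _JOB_FIELDS + ["elapsed_s", "finalized_at"],
    "job_failed": _JOB_FIELDS + ["elapsed_s", "finalized_at"],
}


def read_fields(event: Dict[str, Any]) -> Dict[str, Any]:
    """Check the `event` key first, then pick out that type's key fields."""
    wanted = KEY_FIELDS.get(event.get("event"), [])
    # Absent fields come back as None rather than raising KeyError.
    return {field: event.get(field) for field in wanted}


sample = {"event": "job_created", "job_id": "job-1", "expected_documents": 2}
print(read_fields(sample))
```

Unknown event names yield an empty dictionary, which lets callers skip them without special handling.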
**`job_started` event documented in the table but absent from both code examples**
The event shapes table lists `job_started` with its full set of key fields (`job_id`, `status`, `expected_documents`, `counts`, `completed`, `failed`, `remaining`, `progress_pct`, `started_at`), but neither the synchronous nor the asynchronous code example handles it. Readers who copy the examples verbatim will silently drop `job_started` events. A catch-all (`case _: ...` / `else: ...`) branch or a brief note clarifying that `job_started` does not require a handler in typical use would remove the ambiguity.
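A catch-all branch of the kind suggested here could look like the following sketch. It is not the code from the PR: the `handle` function and the `unhandled` counter are hypothetical, and the sample events use placeholder values.

```python
from collections import Counter
from typing import Any, Dict, Optional


def handle(event: Dict[str, Any], unhandled: Counter) -> Optional[str]:
    """Dispatch on the `event` key; count anything without a dedicated branch."""
    name = event.get("event")
    if name == "document_complete":
        return f"document {event.get('document_id')}: {event.get('status')}"
    elif name in ("job_finalized", "job_partial", "job_failed"):
        return f"job {event.get('job_id')} ended with {name}"
    else:
        # Catch-all: job_started and any other event type land here,
        # so they are recorded instead of being silently dropped.
        unhandled[name] += 1
        return None


unhandled: Counter = Counter()
events = [
    {"event": "job_started", "job_id": "job-1"},
    {"event": "document_complete", "document_id": "doc-1", "status": "placeholder"},
    {"event": "job_finalized", "job_id": "job-1"},
]
messages = [handle(e, unhandled) for e in events]
print(messages, dict(unhandled))
```

Counting unhandled events rather than raising keeps the loop tolerant of event types a client does not act on.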
Thanks for the review. Both points addressed in 574e93b:
Description
Adds service-mode API guide examples for async ingestor jobs.
This documents:
- `ingest()`, `ingest_stream()`, and `aingest_stream()`
- `ingest_stream()`
- `async for event in ingestor.aingest_stream()`

Closes #1134.
Checklist
Validation:
- `git diff --check` (clean)
- `mkdocs build -f mkdocs.yml --strict` passes