
Commit 97f285c

changing default models to nemotron reasoning models and adding a chat feature (#409)
1 parent 4aee4e9 commit 97f285c

23 files changed (+2198, -596 lines)

community/ai-vws-sizing-advisor/CHANGELOG.md

Lines changed: 40 additions & 0 deletions
@@ -3,6 +3,46 @@ All notable changes to this project will be documented in this file.
 The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
 
 
+## [2.3] - 2026-01-08
+
+This release focuses on improved sizing recommendations, enhanced Nemotron model integration, and comprehensive documentation updates.
+
+### Added
+- **Demo Screenshots** — Added visual examples showcasing the Configuration Wizard, RAG-powered sizing recommendations, and Local Deployment verification
+- **Official Documentation Link** — Added link to [NVIDIA vGPU Docs Hub](https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html) in README
+
+### Changed
+- **README Overhaul** — Reorganized documentation to highlight NVIDIA Nemotron models
+  - Llama-3.3-Nemotron-Super-49B powers the RAG backend
+  - Nemotron-3 Nano 30B (FP8) as default for workload sizing
+  - New Demo section with screenshots demonstrating key features
+
+- **Sizing Recommendation Improvements**
+  - Enhanced 95% usable capacity rule for profile selection (5% reserved for system overhead)
+  - Improved profile selection logic: picks smallest profile where (profile × 0.95) >= workload
+  - Better handling of edge cases near profile boundaries
+
+- **GPU Passthrough Logic**
+  - Automatic passthrough recommendation when workload exceeds max single vGPU profile
+  - Clearer passthrough examples in RAG context (e.g., 92GB on BSE → 2× BSE GPU passthrough)
+  - Calculator now returns `vgpu_profile: null` with multi-GPU passthrough recommendation
+
+- **vLLM Local Deployment**
+  - Updated to vLLM v0.12.0 for proper NemotronH (hybrid Mamba-Transformer) architecture support
+  - Improved GPU memory utilization calculations for local testing
+  - Better max-model-len auto-detection (only set when explicitly specified)
+
+- **Chat Improvements**
+  - Enhanced conversational mode with vGPU configuration context
+  - Better model extraction from sizing responses for follow-up questions
+  - Improved context handling for RAG vs inference workload discussions
+
+### Improved
+- **Nemotron Model Integration**
+  - Default model changed to Nemotron-3 Nano 30B FP8 in configuration wizard
+  - Nemotron thinking prompt support for enhanced reasoning
+  - Better model matching for Nemotron variants in calculator
+
 ## [2.2] - 2025-11-04
 
 ### Changed
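The profile-selection and passthrough rules described in this changelog can be summarized with a short sketch. The profile list, helper name, and return shape below are illustrative assumptions, not the advisor's actual code; only the 95% usable-capacity rule and the null-profile passthrough fallback come from the entries above.

```python
from math import ceil

USABLE_FRACTION = 0.95  # 5% of each profile's frame buffer reserved for system overhead

def recommend_profile(workload_gb, profile_sizes_gb=(8, 12, 16, 24, 48),
                      physical_gpu_gb=48):
    """Pick the smallest profile where (profile * 0.95) >= workload,
    otherwise fall back to a multi-GPU passthrough recommendation."""
    for size in sorted(profile_sizes_gb):
        if size * USABLE_FRACTION >= workload_gb:
            return {"vgpu_profile": f"{size}Q", "gpu_count": 1}
    # Workload exceeds the largest single vGPU profile: recommend passthrough.
    gpus = ceil(workload_gb / (physical_gpu_gb * USABLE_FRACTION))
    return {"vgpu_profile": None, "passthrough": True, "gpu_count": gpus}

# A 92 GB workload on a hypothetical 96 GB card needs two passthrough GPUs:
# recommend_profile(92, profile_sizes_gb=(24, 48, 96), physical_gpu_gb=96)
# -> {'vgpu_profile': None, 'passthrough': True, 'gpu_count': 2}
```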

community/ai-vws-sizing-advisor/README.md

Lines changed: 72 additions & 16 deletions
@@ -1,18 +1,67 @@
 # AI vWS Sizing Advisor
 
+<p align="center">
+  <img src="deployment_examples/example_rag_config.png" alt="AI vWS Sizing Advisor" width="800">
+</p>
+
+<p align="center">
+  <strong>RAG-powered vGPU sizing recommendations for AI Virtual Workstations</strong><br>
+  Powered by NVIDIA NeMo™ and Nemotron models
+</p>
+
+<p align="center">
+  <a href="https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html">Official Documentation</a> •
+  <a href="#demo">Demo</a> •
+  <a href="#deployment">Quick Start</a> •
+  <a href="./CHANGELOG.md">Changelog</a>
+</p>
+
+---
+
 ## Overview
 
 AI vWS Sizing Advisor is a RAG-powered tool that helps you determine the optimal NVIDIA vGPU sizing configuration for AI workloads on NVIDIA AI Virtual Workstation (AI vWS). Using NVIDIA vGPU documentation and best practices, it provides tailored recommendations for optimal performance and resource efficiency.
 
+### Powered by NVIDIA Nemotron
+
+This tool leverages **NVIDIA Nemotron models** for intelligent sizing recommendations:
+
+- **[Llama-3.3-Nemotron-Super-49B](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1)** — Powers the RAG backend for intelligent conversational sizing guidance
+- **[Nemotron-3 Nano 30B](https://build.nvidia.com/nvidia/nvidia-nemotron-3-nano-30b-a3b-fp8)** — Default model for workload sizing calculations (FP8 optimized)
+
+### Key Capabilities
+
 Enter your workload requirements and receive validated recommendations including:
 
-- **vGPU Profile** - Recommended profile (e.g., L40S-24Q) based on your workload
-- **Resource Requirements** - vCPUs, GPU memory, system RAM needed
-- **Performance Estimates** - Expected latency, throughput, and time to first token
-- **Live Testing** - Instantly deploy and validate your configuration locally using vLLM containers
+- **vGPU Profile** — Recommended profile (e.g., L40S-24Q) based on your workload
+- **Resource Requirements** — vCPUs, GPU memory, system RAM needed
+- **Performance Estimates** — Expected latency, throughput, and time to first token
+- **Live Testing** — Instantly deploy and validate your configuration locally using vLLM containers
 
 The tool differentiates between RAG and inference workloads by accounting for embedding vectors and database overhead. It intelligently suggests GPU passthrough when jobs exceed standard vGPU profile limits.
 
+---
+
+## Demo
+
+### Configuration Wizard
+
+Configure your workload parameters including model selection, GPU type, quantization, and token sizes:
+
+<p align="center">
+  <img src="deployment_examples/configuration_wizard.png" alt="Configuration Wizard" width="700">
+</p>
+
+### Local Deployment Verification
+
+Validate your configuration by deploying a vLLM container locally and comparing actual GPU memory usage against estimates:
+
+<p align="center">
+  <img src="deployment_examples/local_deployment.png" alt="Local Deployment" width="700">
+</p>
+
+---
+
 ## Prerequisites
 
 ### Hardware
@@ -44,8 +93,10 @@ docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
 > **Note:** Docker must be at `/usr/bin/docker` (verified in `deploy/compose/docker-compose-rag-server.yaml`). User must be in docker group or have socket permissions.
 
 ### API Keys
-- **NVIDIA Build API Key** (Required) - [Get your key](https://build.nvidia.com/settings/api-keys)
-- **HuggingFace Token** (Optional) - [Create token](https://huggingface.co/settings/tokens) for gated models
+- **NVIDIA Build API Key** (Required) — [Get your key](https://build.nvidia.com/settings/api-keys)
+- **HuggingFace Token** (Optional) — [Create token](https://huggingface.co/settings/tokens) for gated models
+
+---
 
 ## Deployment
 
@@ -74,28 +125,32 @@ npm install
 npm run dev
 ```
 
+---
+
 ## Usage
 
-2. **Select Workload Type:** RAG or Inference
+1. **Select Workload Type:** RAG or Inference
 
-3. **Enter Parameters:**
-   - Model name (e.g., `meta-llama/Llama-2-7b-chat-hf`)
+2. **Enter Parameters:**
+   - Model name (default: **Nemotron-3 Nano 30B FP8**)
    - GPU type
    - Prompt size (input tokens)
    - Response size (output tokens)
-   - Quantization (FP16, INT8, INT4)
+   - Quantization (FP16, FP8, INT8, INT4)
   - For RAG: Embedding model and vector dimensions
 
-4. **View Recommendations:**
+3. **View Recommendations:**
   - Recommended vGPU profiles
   - Resource requirements (vCPUs, RAM, GPU memory)
   - Performance estimates
 
-5. **Test Locally** (optional):
+4. **Test Locally** (optional):
   - Run local inference with a containerized vLLM server
   - View performance metrics
   - Compare actual results versus suggested profile configuration
 
+---
+
 ## Management Commands
 
 ```bash
@@ -120,6 +175,8 @@ The stop script automatically performs Docker cleanup operations:
 - Optionally removes dangling images (`--cleanup-images`)
 - Optionally removes all data volumes (`--volumes`)
 
+---
+
 ## Adding Documents to RAG Context
 
 The tool includes NVIDIA vGPU documentation by default. To add your own:
@@ -134,8 +191,7 @@ curl -X POST -F "file=@./vgpu_docs/your-document.pdf" http://localhost:8082/v1/i
 
 **Supported formats:** PDF, TXT, DOCX, HTML, PPTX
 
-
-
+---
 
 ## License
 
@@ -145,6 +201,6 @@ Models governed by [NVIDIA AI Foundation Models Community License](https://docs.
 
 ---
 
-**Version:** 2.2 (November 2025) - See [CHANGELOG.md](./CHANGELOG.md)
+**Version:** 2.3 (January 2026) — See [CHANGELOG.md](./CHANGELOG.md)
 
-**Support:** [GitHub Issues](https://github.com/NVIDIA/GenerativeAIExamples/issues) | [NVIDIA Forums](https://forums.developer.nvidia.com/)
+**Support:** [GitHub Issues](https://github.com/NVIDIA/GenerativeAIExamples/issues) | [NVIDIA Forums](https://forums.developer.nvidia.com/) | [Official Docs](https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html)
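The "Test Locally" step in the README's Usage section launches a containerized vLLM server and compares measured GPU memory against the estimate. A minimal sketch of how such a launch command could be assembled is shown below; the image tag, port mapping, and model identifier are assumptions, and only the rule of passing `--max-model-len` solely when explicitly specified reflects the changelog notes.

```python
from typing import Optional

def build_vllm_launch(model: str,
                      gpu_memory_utilization: float = 0.9,
                      max_model_len: Optional[int] = None) -> list:
    """Assemble a docker command for a local vLLM test deployment (illustrative)."""
    cmd = [
        "docker", "run", "--rm", "--gpus", "all", "-p", "8000:8000",
        "vllm/vllm-openai:v0.12.0",  # assumed image tag; the changelog only states vLLM v0.12.0
        "--model", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    # Per the changelog, --max-model-len is only passed when explicitly specified,
    # letting vLLM auto-detect the context length otherwise.
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd

# Model identifier taken from the build.nvidia.com link above; the id used for
# locally downloaded weights may differ.
print(" ".join(build_vllm_launch("nvidia/nvidia-nemotron-3-nano-30b-a3b-fp8")))
```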

community/ai-vws-sizing-advisor/deploy/compose/docker-compose-ingestor-server.yaml

Lines changed: 18 additions & 5 deletions
@@ -1,3 +1,11 @@
+# ============================================================================
+# CENTRALIZED MODEL CONFIGURATION
+# Change these values to use different models throughout the application
+# ============================================================================
+x-model-config:
+  # Embedding Model Configuration
+  embedding-model: &embedding-model "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"
+
 services:
 
   # Main ingestor server which is responsible for ingestion
@@ -38,10 +46,14 @@ services:
       NGC_API_KEY: ${NGC_API_KEY:?"NGC_API_KEY is required"}
 
       ##===Embedding Model specific configurations===
+      # Model name - pulls from centralized config at top of file (can be overridden by env var)
+      APP_EMBEDDINGS_MODELNAME: *embedding-model
       # url on which embedding model is hosted. If "", Nvidia hosted API is used
-      APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL-"nemoretriever-embedding-ms:8000"}
-      APP_EMBEDDINGS_MODELNAME: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}
-      APP_EMBEDDINGS_DIMENSIONS: ${APP_EMBEDDINGS_DIMENSIONS:-2048}
+      APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL:-"nemoretriever-embedding-ms:8000"}
+      # Embedding dimensions - IMPORTANT: Must match your embedding model!
+      #   nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1: 4096
+      #   nvidia/nv-embedqa-mistral-7b-v2: 2048
+      APP_EMBEDDINGS_DIMENSIONS: ${APP_EMBEDDINGS_DIMENSIONS:-4096}
 
       ##===NV-Ingest Connection Configurations=======
       APP_NVINGEST_MESSAGECLIENTHOSTNAME: ${APP_NVINGEST_MESSAGECLIENTHOSTNAME:-"nv-ingest-ms-runtime"}
@@ -115,9 +127,10 @@ services:
       - AUDIO_INFER_PROTOCOL=grpc
       - CUDA_VISIBLE_DEVICES=0
       - MAX_INGEST_PROCESS_WORKERS=${MAX_INGEST_PROCESS_WORKERS:-16}
-      - EMBEDDING_NIM_MODEL_NAME=${EMBEDDING_NIM_MODEL_NAME:-${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-7b-v2}}
+      # Embedding model - uses APP_EMBEDDINGS_MODELNAME which pulls from centralized config
+      - EMBEDDING_NIM_MODEL_NAME=${APP_EMBEDDINGS_MODELNAME:-nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1}
       # Incase of self-hosted embedding model, use the endpoint url as - https://integrate.api.nvidia.com/v1
-      - EMBEDDING_NIM_ENDPOINT=${EMBEDDING_NIM_ENDPOINT:-${APP_EMBEDDINGS_SERVERURL-http://nemoretriever-embedding-ms:8000/v1}}
+      - EMBEDDING_NIM_ENDPOINT=${EMBEDDING_NIM_ENDPOINT:-http://nemoretriever-embedding-ms:8000/v1}
       - INGEST_LOG_LEVEL=DEFAULT
       - INGEST_EDGE_BUFFER_SIZE=64
       # Message client for development
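The dimension values called out in the comments above must stay in sync with whichever embedding model is selected. A small sanity check along these lines (purely illustrative, not part of this commit) can catch a mismatch before documents are ingested:

```python
import os

# Dimensions documented in the compose file comments above; extend as needed.
KNOWN_DIMENSIONS = {
    "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1": 4096,
    "nvidia/nv-embedqa-mistral-7b-v2": 2048,
}

model = os.environ.get("APP_EMBEDDINGS_MODELNAME",
                       "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1")
dims = int(os.environ.get("APP_EMBEDDINGS_DIMENSIONS", "4096"))

expected = KNOWN_DIMENSIONS.get(model)
if expected is not None and expected != dims:
    raise SystemExit(
        f"APP_EMBEDDINGS_DIMENSIONS={dims} does not match {model} "
        f"(expected {expected}); fix it before ingesting documents."
    )
print(f"Embedding config OK: {model} @ {dims} dimensions")
```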

community/ai-vws-sizing-advisor/deploy/compose/docker-compose-rag-server.yaml

Lines changed: 22 additions & 20 deletions
@@ -1,3 +1,14 @@
+# ============================================================================
+# CENTRALIZED MODEL CONFIGURATION
+# Change these values to use different models throughout the application
+# ============================================================================
+x-model-config:
+  # Chat/LLM Model Configuration
+  llm-model: &llm-model "nvidia/llama-3.3-nemotron-super-49b-v1"
+
+  # Embedding Model Configuration
+  embedding-model: &embedding-model "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"
+
 services:
 
   # Main orchestrator server which stiches together all calls to different services to fulfill the user request
@@ -35,25 +46,16 @@ services:
       VECTOR_DB_TOPK: ${VECTOR_DB_TOPK:-100}
 
       ##===LLM Model specific configurations===
-      APP_LLM_MODELNAME: ${APP_LLM_MODELNAME:-"meta/llama-3.1-8b-instruct"}
+      # Model name - pulls from centralized config at top of file (can be overridden by env var)
+      APP_LLM_MODELNAME: *llm-model
       # url on which llm model is hosted. If "", Nvidia hosted API is used
-      APP_LLM_SERVERURL: ${APP_LLM_SERVERURL-""}
-
-      ##===Query Rewriter Model specific configurations===
-      APP_QUERYREWRITER_MODELNAME: ${APP_QUERYREWRITER_MODELNAME:-"meta/llama-3.1-8b-instruct"}
-      # url on which query rewriter model is hosted. If "", Nvidia hosted API is used
-      APP_QUERYREWRITER_SERVERURL: ${APP_QUERYREWRITER_SERVERURL-"nim-llm-llama-8b-ms:8000"}
+      APP_LLM_SERVERURL: ${APP_LLM_SERVERURL:-""}
 
       ##===Embedding Model specific configurations===
+      # Model name - pulls from centralized config at top of file (can be overridden by env var)
+      APP_EMBEDDINGS_MODELNAME: *embedding-model
       # url on which embedding model is hosted. If "", Nvidia hosted API is used
-      APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL-""}
-      APP_EMBEDDINGS_MODELNAME: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}
-
-      ##===Reranking Model specific configurations===
-      # url on which ranking model is hosted. If "", Nvidia hosted API is used
-      APP_RANKING_SERVERURL: ${APP_RANKING_SERVERURL-""}
-      APP_RANKING_MODELNAME: ${APP_RANKING_MODELNAME:-nv-rerank-qa-mistral-4b:1}
-      ENABLE_RERANKER: ${ENABLE_RERANKER:-True}
+      APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL:-""}
 
       NVIDIA_API_KEY: ${NGC_API_KEY:?"NGC_API_KEY is required"}
 
@@ -65,7 +67,7 @@ services:
 
       # enable multi-turn conversation in the rag chain - this controls conversation history usage
       # while doing query rewriting and in LLM prompt
-      ENABLE_MULTITURN: ${ENABLE_MULTITURN:-False}
+      ENABLE_MULTITURN: ${ENABLE_MULTITURN:-True}
 
       # enable query rewriting for multiturn conversation in the rag chain.
       # This will improve accuracy of the retrieiver pipeline but increase latency due to an additional LLM call
@@ -139,10 +141,10 @@ services:
       context: ../../frontend
       dockerfile: ./Dockerfile
       args:
-        # Model name for LLM
-        NEXT_PUBLIC_MODEL_NAME: ${APP_LLM_MODELNAME:-meta/llama-3.1-8b-instruct}
-        # Model name for embeddings
-        NEXT_PUBLIC_EMBEDDING_MODEL: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}
+        # Model name for LLM - pulls from centralized config at top of file
+        NEXT_PUBLIC_MODEL_NAME: *llm-model
+        # Model name for embeddings - pulls from centralized config at top of file
+        NEXT_PUBLIC_EMBEDDING_MODEL: *embedding-model
         # Model name for reranking
         NEXT_PUBLIC_RERANKER_MODEL: ${APP_RANKING_MODELNAME:-nv-rerank-qa-mistral-4b:1}
         # URL for rag server container
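Both compose files now define model names once under `x-model-config` using YAML anchors and reuse them with aliases. The short sketch below (illustrative, and assuming PyYAML is installed) shows how an alias such as `*llm-model` resolves to the single string defined at the top of the file, so changing the anchored value updates every place that references it:

```python
import yaml  # assumes PyYAML is available

snippet = """
x-model-config:
  llm-model: &llm-model "nvidia/llama-3.3-nemotron-super-49b-v1"
services:
  rag-server:
    environment:
      APP_LLM_MODELNAME: *llm-model
"""

doc = yaml.safe_load(snippet)
# The alias resolves to the anchored value at load time.
print(doc["services"]["rag-server"]["environment"]["APP_LLM_MODELNAME"])
# -> nvidia/llama-3.3-nemotron-super-49b-v1
```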
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# ============================================================================
+# CENTRALIZED MODEL CONFIGURATION
+# ============================================================================
+# This file centralizes all model configurations for the RAG system.
+# Source this file or set these environment variables to change models.
+#
+# Usage:
+#   source model_config.env
+#   docker compose -f docker-compose-rag-server.yaml up
+#
+# ============================================================================
+
+# ----------------------------------------------------------------------------
+# CHAT/LLM MODEL CONFIGURATION
+# ----------------------------------------------------------------------------
+# The main language model used for generating responses
+# Default: nvidia/llama-3.3-nemotron-super-49b-v1
+#
+# Other options:
+#   - meta/llama-3.1-405b-instruct
+#   - meta/llama-3.1-70b-instruct
+#   - meta/llama-3.1-8b-instruct
+#   - mistralai/mixtral-8x22b-instruct-v0.1
+#
+export APP_LLM_MODELNAME="nvidia/llama-3.3-nemotron-super-49b-v1"
+
+# LLM Server URL (leave empty "" to use NVIDIA hosted API)
+export APP_LLM_SERVERURL=""
+
+# ----------------------------------------------------------------------------
+# EMBEDDING MODEL CONFIGURATION
+# ----------------------------------------------------------------------------
+# The embedding model used for vectorizing documents and queries
+# Default: nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1
+#
+# Other options:
+#   - nvidia/nv-embedqa-mistral-7b-v2
+#   - nvidia/nv-embed-v2
+#   - nvidia/llama-3.2-nv-embedqa-1b-v2
+#
+export APP_EMBEDDINGS_MODELNAME="nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"
+
+# Embedding Server URL (leave empty "" to use NVIDIA hosted API, or set to self-hosted)
+# Example for self-hosted: "nemoretriever-embedding-ms:8000"
+export APP_EMBEDDINGS_SERVERURL=""
+
+# Embedding dimensions (adjust based on your embedding model)
+# IMPORTANT: This MUST match your chosen embedding model!
+#   - nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1: 4096 (current default)
+#   - nvidia/nv-embedqa-mistral-7b-v2: 2048
+#   - nvidia/nv-embed-v2: 4096
+export APP_EMBEDDINGS_DIMENSIONS="4096"
+
+# ----------------------------------------------------------------------------
+# REFLECTION MODEL CONFIGURATION (for response quality checking)
+# ----------------------------------------------------------------------------
+# Model used for reflection/self-checking if ENABLE_REFLECTION=true
+export REFLECTION_LLM="mistralai/mixtral-8x22b-instruct-v0.1"
+export REFLECTION_LLM_SERVERURL="nim-llm-mixtral-8x22b:8000"
+
+# ----------------------------------------------------------------------------
+# CAPTION MODEL CONFIGURATION (for image/chart understanding)
+# ----------------------------------------------------------------------------
+# Model used for generating captions for images, charts, and tables
+export APP_NVINGEST_CAPTIONMODELNAME="meta/llama-3.2-11b-vision-instruct"
+export APP_NVINGEST_CAPTIONENDPOINTURL="http://vlm-ms:8000/v1/chat/completions"
+export VLM_CAPTION_MODEL_NAME="meta/llama-3.2-11b-vision-instruct"
+export VLM_CAPTION_ENDPOINT="http://vlm-ms:8000/v1/chat/completions"
+
+# ----------------------------------------------------------------------------
+# ADDITIONAL NOTES
+# ----------------------------------------------------------------------------
+# 1. After changing models, you may need to rebuild containers:
+#    docker compose -f docker-compose-rag-server.yaml build --no-cache rag-playground
+#
+# 2. For self-hosted models, make sure the corresponding NIM services are running
+#
+# 3. The embedding dimensions must match your chosen embedding model
+#
+# 4. When switching between hosted and self-hosted, update both the model name
+#    and the server URL accordingly
Three new binary screenshot images (848 KB, 843 KB, and 773 KB) under `deployment_examples/` are part of the commit but are not rendered here.
