**community/ai-vws-sizing-advisor/CHANGELOG.md**
All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
## [2.3] - 2026-01-08
This release focuses on improved sizing recommendations, enhanced Nemotron model integration, and comprehensive documentation updates.
### Added
- **Demo Screenshots** — Added visual examples showcasing the Configuration Wizard, RAG-powered sizing recommendations, and Local Deployment verification
- **Official Documentation Link** — Added link to [NVIDIA vGPU Docs Hub](https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html) in README

### Changed

- **README Overhaul** — Reorganized documentation to highlight NVIDIA Nemotron models
  - Llama-3.3-Nemotron-Super-49B powers the RAG backend
  - Nemotron-3 Nano 30B (FP8) as default for workload sizing
  - New Demo section with screenshots demonstrating key features
- **Sizing Recommendation Improvements**
  - Enhanced 95% usable capacity rule for profile selection (5% reserved for system overhead; see the sketch below)
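
To make the 95% rule concrete, here is a minimal sketch of how such a usable-capacity filter can work; the profile names, sizes, and function below are illustrative assumptions, not the advisor's actual code:

```python
# Hypothetical sketch of a 95% usable-capacity rule: a vGPU profile
# qualifies only if the workload fits within 95% of its framebuffer,
# leaving 5% reserved for system overhead. Profile names and sizes
# are illustrative, not the tool's actual catalog.
USABLE_FRACTION = 0.95

def fits(profile_gb: float, required_gb: float) -> bool:
    return required_gb <= profile_gb * USABLE_FRACTION

profiles_gb = {"L40S-12Q": 12, "L40S-24Q": 24, "L40S-48Q": 48}
required = 23.0  # GB the workload needs

candidates = [name for name, gb in profiles_gb.items() if fits(gb, required)]
print(candidates)  # 23.0 > 24 * 0.95 = 22.8, so only ['L40S-48Q'] qualifies
```

Note how the 5% reserve excludes the 24 GB profile even though its raw framebuffer would nominally hold the workload.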
---

**community/ai-vws-sizing-advisor/README.md**

AI vWS Sizing Advisor is a RAG-powered tool that helps you determine the optimal NVIDIA vGPU sizing configuration for AI workloads on NVIDIA AI Virtual Workstation (AI vWS). Using NVIDIA vGPU documentation and best practices, it provides tailored recommendations for performance and resource efficiency.
### Powered by NVIDIA Nemotron
This tool leverages **NVIDIA Nemotron models** for intelligent sizing recommendations:
- **[Llama-3.3-Nemotron-Super-49B](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1)** — Powers the RAG backend for intelligent conversational sizing guidance
- **[Nemotron-3 Nano 30B](https://build.nvidia.com/nvidia/nvidia-nemotron-3-nano-30b-a3b-fp8)** — Default model for workload sizing calculations (FP8 optimized)
### Key Capabilities
Enter your workload requirements and receive validated recommendations including:
- **vGPU Profile** — Recommended profile (e.g., L40S-24Q) based on your workload
- **Resource Requirements** — vCPUs, GPU memory, system RAM needed
- **Performance Estimates** — Expected latency, throughput, and time to first token
- **Live Testing** — Instantly deploy and validate your configuration locally using vLLM containers
The tool differentiates between RAG and inference workloads by accounting for embedding vectors and database overhead. It intelligently suggests GPU passthrough when jobs exceed standard vGPU profile limits.
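
As a rough sketch of that decision logic under stated assumptions (the profile ceiling, RAG overhead figure, and function names below are illustrative, not the tool's internals):

```python
# Hypothetical sketch: RAG workloads add embedding-vector and database
# overhead on top of plain inference memory; if the total exceeds the
# largest standard vGPU profile, full-GPU passthrough is suggested.
# All numbers and names are illustrative assumptions.
LARGEST_PROFILE_GB = 48  # e.g., a 48 GB-class profile

def required_memory_gb(inference_gb: float, is_rag: bool,
                       rag_overhead_gb: float = 4.0) -> float:
    # RAG adds memory for embedding vectors and vector-database overhead.
    return inference_gb + (rag_overhead_gb if is_rag else 0.0)

def recommend(inference_gb: float, is_rag: bool) -> str:
    total = required_memory_gb(inference_gb, is_rag)
    if total > LARGEST_PROFILE_GB:
        return f"GPU passthrough (~{total:.0f} GB needed)"
    return f"vGPU profile with >= {total:.0f} GB framebuffer"

print(recommend(46.0, is_rag=True))   # 50 GB total -> GPU passthrough
print(recommend(20.0, is_rag=False))  # fits a standard vGPU profile
```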
---
## Demo
### Configuration Wizard
Configure your workload parameters including model selection, GPU type, quantization, and token sizes:
---

Verify that Docker can access the GPU:

```
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```
> **Note:** Docker must be at `/usr/bin/docker` (verified in `deploy/compose/docker-compose-rag-server.yaml`), and the user must be in the docker group or have socket permissions.
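
A quick, standard-library self-check for those requirements (a convenience sketch, not part of the tool; it assumes a Unix host, which the `grp` module requires):

```python
# Check the note's requirements: docker binary at /usr/bin/docker and
# write access to the Docker socket (typically via docker group
# membership). Unix-only; convenience sketch, not the tool's own check.
import getpass
import grp
import os

print("docker binary present:", os.path.exists("/usr/bin/docker"))
print("socket writable:", os.access("/var/run/docker.sock", os.W_OK))
try:
    in_docker_group = getpass.getuser() in grp.getgrnam("docker").gr_mem
except KeyError:  # no 'docker' group on this host
    in_docker_group = False
print("in docker group:", in_docker_group)
```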
### API Keys
- **NVIDIA Build API Key** (Required) — [Get your key](https://build.nvidia.com/settings/api-keys)
- **HuggingFace Token** (Optional) — [Create token](https://huggingface.co/settings/tokens) for gated models
---
## Deployment
```
npm install
npm run dev
```
---
## Usage
1. **Select Workload Type:** RAG or Inference
2. **Enter Parameters:**
   - Model name (default: **Nemotron-3 Nano 30B FP8**)
   - GPU type
   - Prompt size (input tokens)
   - Response size (output tokens)
   - Quantization (FP16, FP8, INT8, INT4)
   - For RAG: Embedding model and vector dimensions
3. **View Recommendations:**
   - Recommended vGPU profiles
   - Resource requirements (vCPUs, RAM, GPU memory)
   - Performance estimates
4. **Test Locally** (optional):
   - Run local inference with a containerized vLLM server
   - View performance metrics
   - Compare actual results versus the suggested profile configuration (a rough memory-estimate sketch follows this list)
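
To give a feel for the numbers behind these recommendations, here is a back-of-the-envelope estimate of GPU memory from parameter count and quantization; this is a generic rule of thumb with an assumed overhead factor, not the advisor's exact formula:

```python
# Back-of-the-envelope GPU memory estimate: model weights at the chosen
# quantization plus a flat overhead allowance for KV cache, activations,
# and runtime buffers. Generic rule of thumb, not the advisor's formula.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}

def weights_gb(params_billion: float, quant: str) -> float:
    # 1B parameters at 1 byte each is roughly 1 GB of weights.
    return params_billion * BYTES_PER_PARAM[quant]

def estimate_gb(params_billion: float, quant: str,
                overhead_frac: float = 0.2) -> float:
    # overhead_frac is an assumed allowance; real KV-cache needs grow
    # with prompt/response token counts and concurrency.
    return weights_gb(params_billion, quant) * (1 + overhead_frac)

# A 30B-parameter model at FP8: ~30 GB of weights, ~36 GB with overhead,
# which points at a 48 GB-class profile under a 95% usable-capacity rule.
print(f"{estimate_gb(30, 'FP8'):.0f} GB")  # -> 36 GB
```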