Production vLLM inference server as a Flox environment. Installs flox/vllm-flox-runtime (model provisioning and serving scripts) and flox-cuda/python3Packages.vllm (vLLM + CUDA + Python) from the Flox catalog.
- vLLM: 0.15.1
- CUDA: requires NVIDIA driver with CUDA support
- Platform: Linux only (x86_64-linux)
# Activate and start the vLLM service
flox activate --start-services
# Override the model at activation time
VLLM_MODEL=DeepSeek-R1-Distill-Qwen-7B \
VLLM_MODEL_ORG=deepseek-ai \
flox activate --start-services

# Health check (no auth required)
curl http://127.0.0.1:8000/health
# List loaded models
curl http://127.0.0.1:8000/v1/models \
-H "Authorization: Bearer sk-vllm-local-dev"
# Chat completion
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-vllm-local-dev" \
-d '{
"model": "Phi-3.5-mini-instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 256
}'

The examples/ directory contains self-contained demo scripts that start a vLLM server, run smoke tests (health, models, completions, chat), and shut down cleanly.
The default model is microsoft/Phi-3.5-mini-instruct-AWQ (~2.2 GB, AWQ 4-bit quantization). It ships as a Flox package (a Nix store path) and is resolved via the flox source. No download required — the model is available immediately after activation. AWQ 4-bit quantization works on all CUDA GPUs, including Tesla T4 (sm75).
Override the default port:
VLLM_PORT=8800 flox activate --start-services

The service command chains three scripts in a pipeline:
vllm-preflight && vllm-resolve-model && vllm-serve
┌──────────────────────────────────────────────────────┐
│ Consuming Environment (.flox/env/manifest.toml) │
│ │
│ [install] │
│ flox/vllm-flox-runtime # 3-script pipeline │
│ flox/vllm-python312-cuda* # vLLM + CUDA │
│ (optional) flox/vllm-flox-monitoring │
│ │
│ [services] │
│ vllm → vllm-preflight │
│ && vllm-resolve-model │
│ && vllm-serve │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ vllm-preflight │ │
│ │ Port reclaim ← /proc/net/tcp + /proc/<pid>/ │ │
│ │ GPU health ← NVML → nvidia-smi → skip │ │
│ ├─────────────────────────────────────────────────┤ │
│ │ vllm-resolve-model │ │
│ │ Sources: flox → local → hf-cache → r2 → hub │ │
│ │ Output: per-model .env file (mode 600) │ │
│ ├─────────────────────────────────────────────────┤ │
│ │ vllm-serve │ │
│ │ Loads .env → validates args → exec vllm serve│ │
│ └─────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
- vllm-preflight — Reclaims the port if occupied by a stale vLLM process, checks GPU health via NVML or nvidia-smi, optionally executes a downstream command.
- vllm-resolve-model — Provisions the model from configured sources with locking and atomic swaps, validates the model directory (config, tokenizer, weight shards), writes a per-model env file.
- vllm-serve — Loads the env file (safe or trusted mode), validates all required vars, builds the vllm serve argv from env vars + config.yaml, and execs.
Scripts are provided by the flox/vllm-flox-runtime package (~1,700 lines of hardened Bash) and available on PATH after activation.
The server exposes an OpenAI-compatible API. All authenticated endpoints require the Authorization: Bearer <VLLM_API_KEY> header.
| Endpoint | Method | Auth | Description |
|---|---|---|---|
| /health | GET | No | Health check. Returns 200 OK when ready |
| /v1/models | GET | Yes | List loaded models |
| /v1/chat/completions | POST | Yes | Chat completions (streaming supported) |
| /v1/completions | POST | Yes | Text completions (streaming supported) |
| /metrics | GET | No | Prometheus metrics |
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-vllm-local-dev" \
-d '{
"model": "Phi-3.5-mini-instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain TCP in one paragraph."}
],
"max_tokens": 256,
"temperature": 0.7
}'

curl --no-buffer http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-vllm-local-dev" \
-d '{
"model": "Phi-3.5-mini-instruct",
"messages": [{"role": "user", "content": "Write a haiku about CUDA."}],
"max_tokens": 64,
"stream": true
}'

curl http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-vllm-local-dev" \
-d '{
"model": "Phi-3.5-mini-instruct",
"prompt": "The capital of France is",
"max_tokens": 32
}'

Settings are split between a static config file and runtime environment variables.
A default config is bundled in the vllm-flox-runtime package and auto-copied to $FLOX_ENV_CACHE/vllm-config.yaml on first run. Edit that copy to customize. These settings are read by vllm-serve and passed directly to vllm serve via --config. host and port are overridden by the VLLM_HOST/VLLM_PORT env vars (passed as CLI args, which take precedence over config file values).
| Parameter | Default | Description |
|---|---|---|
| host | 0.0.0.0 | Bind address |
| port | 8000 | HTTP listen port |
| dtype | float16 | Weight data type. Use float16 for AWQ-quantized models; auto selects BF16 for BF16 models, FP16 for FP16/FP32 models |
| gpu-memory-utilization | 0.85 | Per-GPU VRAM fraction for KV cache. Reduce if you see OOM during prefill. Increase for cards with more headroom (e.g., 0.92 for 24 GB, 0.95 for 48 GB+) |
| quantization | awq | Quantization method. Set to match the model's quantization (e.g., awq, gptq). Remove for unquantized models |
| disable-log-requests | true | Suppress per-request logging |
| uvicorn-log-level | warning | Uvicorn server log level |
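Assembled from the defaults above, the cached config might look like the following sketch (field order and any comments in the actual bundled file may differ):

```yaml
# $FLOX_ENV_CACHE/vllm-config.yaml — sketch of the documented defaults
host: 0.0.0.0
port: 8000
dtype: float16
gpu-memory-utilization: 0.85
quantization: awq
disable-log-requests: true
uvicorn-log-level: warning
```

Remember that host and port here are overridden by VLLM_HOST/VLLM_PORT at runtime, since CLI args take precedence over config file values.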
All vars use ${VAR:-default} in the on-activate hook so they can be overridden at activation time:
VLLM_MAX_MODEL_LEN=16384 VLLM_KV_CACHE_DTYPE=fp8 flox activate --start-services

| Variable | Default | Description |
|---|---|---|
| VLLM_MODEL | Phi-3.5-mini-instruct-AWQ | Model directory name. Must be a single safe path element (no /, \, ., .., or control characters) |
| VLLM_MODEL_ORG | microsoft | HuggingFace org. Used to derive the model ID as $VLLM_MODEL_ORG/$VLLM_MODEL when VLLM_MODEL_ID is not set |
| VLLM_MODEL_SOURCES | flox,local,hf-cache,hf-hub | Comma-separated source order for model provisioning. Available sources: flox, local, hf-cache, r2, hf-hub |
| VLLM_MODELS_DIR | $FLOX_ENV_PROJECT/models | Root directory for model storage and HF cache. Created automatically on activation |
| VLLM_SERVED_MODEL_NAME | Phi-3.5-mini-instruct | Model name returned in /v1/models responses and used in API requests |
| Variable | Default | Description |
|---|---|---|
| VLLM_HOST | 0.0.0.0 | Server bind address |
| VLLM_PORT | 8000 | Server listen port. Must be 1-65535 |
| VLLM_API_KEY | sk-vllm-local-dev | Bearer token for API authentication |
| Variable | Default | Description |
|---|---|---|
| VLLM_TENSOR_PARALLEL_SIZE | 1 | Number of GPUs for tensor parallelism. Must be > 0 |
| VLLM_PIPELINE_PARALLEL_SIZE | 1 | Number of GPUs for pipeline parallelism. Must be > 0 |
| VLLM_PREFIX_CACHING | false | Automatic prefix caching. Accepts true/false/1/0/yes/no |
| VLLM_KV_CACHE_DTYPE | auto | KV cache precision. auto matches model dtype; fp8 halves KV cache memory at minor quality cost. Must not contain whitespace |
| VLLM_MAX_MODEL_LEN | 4096 | Max sequence length (input + output tokens). Must not exceed the model's native context length. Lower values reduce memory. Must be > 0 |
| VLLM_MAX_NUM_BATCHED_TOKENS | 4096 | Chunked prefill budget. Increase for throughput at the cost of higher per-request latency. Must be > 0 |
| Variable | Default | Description |
|---|---|---|
| VLLM_LOGGING_LEVEL | WARNING | vLLM Python log level (DEBUG, INFO, WARNING, ERROR) |
| PROMETHEUS_MULTIPROC_DIR | $FLOX_ENV_CACHE/vllm-prometheus | Directory for Prometheus client multiprocess metrics. Created automatically on activation |
Searches configured sources in order, validates the model directory, and writes an env file that vllm-serve loads. The first source that produces a valid model wins.
Sources are tried in the order specified by VLLM_MODEL_SOURCES. The script's internal default is flox,local,hf-cache,r2,hf-hub; the manifest sets flox,local,hf-cache,hf-hub.
| Source | What it checks | Skip condition | Resolution |
|---|---|---|---|
| flox | $FLOX_ENV/share/models/hub/models--<slug>/snapshots/ | FLOX_ENV not set | Sets HF_HOME to the flox package model path |
| local | $VLLM_MODELS_DIR/<VLLM_MODEL>/ | Directory missing or fails validation | Sets VLLM_MODEL_PATH to the local directory |
| hf-cache | $VLLM_MODELS_DIR/hub/models--<slug>/snapshots/ | No usable snapshot found | Sets HF_HOME to $VLLM_MODELS_DIR |
| r2 | Downloads from s3://$R2_BUCKET/$R2_MODELS_PREFIX/$VLLM_MODEL/ | aws CLI missing, R2_BUCKET/R2_MODELS_PREFIX not set, or credentials fail | Stages to temp dir, validates, atomic-swaps into $VLLM_MODELS_DIR/<VLLM_MODEL>/ |
| hf-hub | Downloads from HuggingFace Hub using hf/huggingface-cli/python3 | No download tool found | Stages to temp dir, validates, atomic-swaps into $VLLM_MODELS_DIR/<VLLM_MODEL>/ |
Required:
| Variable | Description |
|---|---|
| VLLM_MODEL | Model name (single safe path element) |
| VLLM_MODELS_DIR | Base directory for local models and HF cache |
Optional:
| Variable | Default | Description |
|---|---|---|
| VLLM_MODEL_ID | Derived from $VLLM_MODEL_ORG/$VLLM_MODEL | Explicit HuggingFace model ID (org/name). When empty, derived from VLLM_MODEL_ORG (which must then be set) |
| VLLM_MODEL_ORG | (none; manifest sets microsoft) | Org prefix for deriving the model ID. Required when VLLM_MODEL_ID is empty |
| VLLM_MODEL_SOURCES | flox,local,hf-cache,r2,hf-hub | Comma-separated source order |
| FLOX_ENV | (set by Flox) | Flox environment path. Required for the flox source |
| FLOX_ENV_CACHE | (set by Flox) | Cache directory for env files. Required when VLLM_MODEL_ENV_FILE is not set |
| VLLM_MODEL_ENV_FILE | $FLOX_ENV_CACHE/vllm-model.<slug>.<hash>.env | Override env file output path |
| R2_BUCKET | (none) | Cloudflare R2 bucket name. Required for the r2 source |
| R2_MODELS_PREFIX | (none) | R2 key prefix for models. Required for the r2 source |
| R2_ENDPOINT_URL | (none) | AWS CLI endpoint URL for R2 |
| VLLM_RESOLVE_LOCK_TIMEOUT | 300 | Seconds to wait for the per-model lock |
| VLLM_SKIP_TOKENIZER_CHECK | 0 | Set to 1 to skip tokenizer asset validation |
| VLLM_KEEP_LOGS | 0 | Set to 1 to keep download logs even on success. Logs are always kept on failure |
| HF_TOKEN | (none) | HuggingFace token for gated model access |
Every candidate model directory must pass three checks before it is accepted:
1. config.json — must exist at the directory root.
2. Tokenizer assets — at least one recognized tokenizer file must exist in <dir>/, <dir>/tokenizer/, or <dir>/tokenizer_files/. Recognized files:
   - tokenizer.json, tokenizer.model, spiece.model
   - vocab.json + merges.txt
   - vocab.txt
   - tokenizer_config.json + (vocab.json or vocab.txt)
   Skip this check with VLLM_SKIP_TOKENIZER_CHECK=1.
3. Weight shards — determined by the presence of shard index files:
   - If *.index.json exists: all shard files referenced in weight_map must exist.
   - If no index but a -00001-of-NNNNN pattern is detected: all N shards must exist.
   - Otherwise: at least one weight-like file (.safetensors, .bin, .pt, .pth, .gguf) must exist.
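The gist of these checks can be sketched in shell. This is a simplified illustration, not the real script: has_any is a hypothetical helper, the tokenizer list is reduced, and the index-file and sharded-weights cases are omitted.

```shell
# Simplified sketch of model-directory validation (not the real script).
has_any() { for f in "$@"; do [ -e "$f" ] && return 0; done; return 1; }

check_model_dir() {
  dir="$1"
  # 1. config.json at the directory root
  [ -f "$dir/config.json" ] || { echo "missing config.json" >&2; return 1; }
  # 2. at least one recognized tokenizer asset (reduced list for illustration)
  has_any "$dir"/tokenizer.json "$dir"/tokenizer.model "$dir"/spiece.model \
    || { echo "no tokenizer assets" >&2; return 1; }
  # 3. no-index case: at least one weight-like file
  has_any "$dir"/*.safetensors "$dir"/*.bin "$dir"/*.pt "$dir"/*.pth "$dir"/*.gguf \
    || { echo "no weight files" >&2; return 1; }
}
```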
Written atomically (mktemp + mv) to $FLOX_ENV_CACHE/vllm-model.<slug>.<hash12>.env with mode 600 (umask 077). Contains:
# generated by vllm-resolve-model
export HF_HOME='/path/to/hf/home' # when resolved via flox or hf-cache
export VLLM_MODEL='Llama-3.1-8B-Instruct'
export VLLM_MODEL_ID='meta-llama/Llama-3.1-8B-Instruct'
export VLLM_MODEL_PATH='/path/to/models/Llama-3.1-8B-Instruct' # when resolved locally
export _VLLM_RESOLVED_MODEL='meta-llama/Llama-3.1-8B-Instruct'
export _VLLM_RESOLVED_VIA='hf-hub'

The <slug> is the model ID with unsafe characters mapped to -. The <hash12> is the first 12 hex chars of the SHA-256 of the model ID, computed with whichever tool is available: sha256sum, shasum, openssl, or python3.
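For illustration, the naming scheme can be reproduced in shell. This sketch assumes "unsafe" means anything outside [A-Za-z0-9._-]; the script's actual character mapping may differ.

```shell
model_id='meta-llama/Llama-3.1-8B-Instruct'

# <slug>: map characters outside an assumed safe set to '-'
slug=$(printf '%s' "$model_id" | tr -c 'A-Za-z0-9._-' '-')

# <hash12>: first 12 hex chars of SHA-256 of the model ID
hash12=$(printf '%s' "$model_id" | sha256sum | cut -c1-12)

echo "vllm-model.${slug}.${hash12}.env"
```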
Gated models that require authentication need a HuggingFace token:
HF_TOKEN=hf_... flox activate --start-services

Restrict sources to avoid network access:
VLLM_MODEL_SOURCES=local flox activate --start-services # local only
VLLM_MODEL_SOURCES=local,hf-cache flox activate --start-services # local + cached- Per-model lock: acquired before any source search. Uses
flockif available, falls back tomkdir-based locking with stale PID detection. Timeout:VLLM_RESOLVE_LOCK_TIMEOUTseconds (default 300). - Atomic swap (r2 and hf-hub only): downloads stage into a temp directory under
$VLLM_MODELS_DIR/.staging/. After validation, the staged directory replaces the target via backup+rename. If interrupted, the next run restores the newest backup automatically.
Pre-flight validation: reclaims the vLLM port if occupied, checks GPU health, and optionally executes a downstream command.
Platform: Linux only (requires /proc).
vllm-preflight # checks only
vllm-preflight ./start.sh arg1 arg2 # checks, then exec command
vllm-preflight -- python -m vllm ... # checks, then exec command (after --)

Stable contract — these codes are safe to match on programmatically.
| Code | Meaning | When |
|---|---|---|
| 0 | Success | Port free (or reclaimed), GPU OK, downstream command exec'd |
| 1 | Validation error | Bad env var value, GPU hard failure, bad config, python3 not found |
| 2 | Port owned by non-vLLM process | A non-vLLM listener holds the port. Will not kill |
| 3 | Different UID | vLLM process on the port belongs to another user. Will not kill (unless VLLM_ALLOW_OTHER_UID_KILL=1) |
| 4 | Not attributable | Listener found but cannot map socket inodes to PIDs (permissions / hidepid) |
| 5 | Stop failed | Sent SIGTERM/SIGKILL but port is still listening after timeout |
In --dry-run mode (VLLM_DRY_RUN=1), exit codes are 0/2/3/4 only (never 5, since nothing is killed).
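Because the codes are a stable contract, a wrapper can branch on them. A sketch — preflight_status is a hypothetical helper that runs whatever command you pass it and maps the exit code to a label:

```shell
# Dispatch on vllm-preflight's documented exit codes.
preflight_status() {
  "$@"   # run the preflight command passed as arguments
  case $? in
    0) echo "ok" ;;
    2) echo "port-held-by-non-vllm" ;;
    3) echo "port-held-by-other-uid" ;;
    4) echo "listener-not-attributable" ;;
    5) echo "stop-failed" ;;
    *) echo "validation-error" ;;
  esac
}

# e.g.: preflight_status env VLLM_DRY_RUN=1 vllm-preflight
```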
| Variable | Default | Validation | Description |
|---|---|---|---|
| VLLM_HOST | 0.0.0.0 | IP/hostname | Bind address to check |
| VLLM_PORT | 8000 | Integer, 1-65535 | Port to check and reclaim |
| VLLM_DRY_RUN | 0 | 0 or 1 | Report what would happen without sending signals |
| VLLM_SKIP_GPU_CHECK | 0 | 0 or 1 | Skip all GPU checks |
| VLLM_MIN_FREE_GPU_GB | 4 | Numeric, >= 0 | Minimum free GPU memory (GiB). Hard-fails if memory is in VLLM_GPU_FAIL_ON |
| VLLM_MAX_GPU_TEMP_C | 85 | Integer, >= 1 | Hard-fail if GPU temperature exceeds this (Celsius) |
| VLLM_MAX_GPU_UTIL_PCT | 95 | Integer, 0-100 | Hard-fail if GPU utilization exceeds this percentage |
| VLLM_GPU_FAIL_ON | temperature | Comma-separated | Conditions that trigger hard failure: temperature, memory, utilization |
| VLLM_GPU_DEVICES | (unset) | CSV | GPU device indices/UUIDs to check. Falls back to CUDA_VISIBLE_DEVICES |
| VLLM_ALLOW_OTHER_UID_KILL | 0 | 0 or 1 | Allow killing vLLM processes owned by other UIDs |
| VLLM_STOP_TIMEOUT | 15 | Integer, >= 0 | Seconds for the full stop cycle (SIGTERM → SIGKILL) |
| VLLM_PROCESS_SIGNATURES | (unset) | Comma-separated | Additional cmdline signatures to identify as vLLM processes |
| VLLM_PREFLIGHT_LOCKFILE | /tmp/vllm-preflight.{port}.lock | File path | Lock file path. Port-keyed by default |
| VLLM_PREFLIGHT_JSON | 0 | 0 or 1 | JSON output on stdout. Incompatible with a downstream command |
| VLLM_PREFLIGHT_PROXY_CHILD | 1 | 0 or 1 | Proxy mode: start child, wait for bind, forward signals. Set 0 for plain exec |
| VLLM_START_BIND_TIMEOUT | 60 | Numeric, > 0 | Max seconds to wait for downstream to bind the target port (proxy mode) |
| VLLM_START_BIND_POLL | 0.2 | Numeric, > 0 | Poll interval while waiting for downstream bind (proxy mode) |
VLLM_PROCESS_SIGNATURES example for unusual launchers:
VLLM_PROCESS_SIGNATURES='my.custom.launcher,another_entrypoint'

- Parses /proc/net/tcp and /proc/net/tcp6 for LISTEN-state sockets matching the configured host and port (including the wildcard 0.0.0.0/:: catchall).
- Maps socket inodes to PIDs by scanning /proc/<pid>/fd/ symlinks.
- Reads /proc/<pid>/cmdline and /proc/<pid>/exe to classify each listener as vLLM or non-vLLM:
  - Built-in signatures: vllm.entrypoints.openai.api_server, vllm.entrypoints.api_server, vllm.entrypoints.openai.run_server, vllm serve, vllm.entrypoints.
  - Custom signatures: set VLLM_PROCESS_SIGNATURES for unusual launchers.
- Non-vLLM listener → exit 2 (refuses to kill).
- Different UID → exit 3 (unless VLLM_ALLOW_OTHER_UID_KILL=1).
- Unmappable inodes → exit 4 (e.g., hidepid restrictions).
- Own vLLM → builds the process tree via the /proc/<pid>/stat parent chain, sends SIGTERM to all descendants (post-order), waits up to VLLM_STOP_TIMEOUT seconds (default 15), then SIGKILLs any survivors.
- Polls until the port is free or VLLM_STOP_TIMEOUT expires. If still listening → exit 5.
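The first step above can be illustrated with a small awk filter over the /proc/net/tcp format. A sketch only: it reads the table from stdin and prints the socket inodes of LISTEN entries on one port, ignoring the host match and the IPv6 table that the real script also handles.

```shell
# Print inodes of LISTEN-state (st == 0A) sockets on the given port.
# Reads /proc/net/tcp-format text from stdin; inode is field 10.
scan_listeners() {
  hexport=$(printf '%04X' "$1")
  awk -v p=":$hexport" 'NR > 1 && $2 ~ p"$" && $4 == "0A" { print $10 }'
}

# Live usage on Linux: scan_listeners 8000 < /proc/net/tcp
```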
Runs after port reclaim. Three-tier cascade:
- NVML (preferred): ctypes probe of libcuda.so.1 + libnvidia-ml.so.1. Per-GPU: name, memory, temperature, utilization, pstate, clock throttle reasons. Hard-fails if the driver is present but reports 0 devices.
- nvidia-smi (fallback): same fields via CSV output.
- Neither available: warning, continue.

Threshold checks:

- Memory: hard-fails if free < VLLM_MIN_FREE_GPU_GB and memory is in VLLM_GPU_FAIL_ON.
- Temperature: hard-fails if > VLLM_MAX_GPU_TEMP_C and temperature is in VLLM_GPU_FAIL_ON (the default).
- Utilization: hard-fails if > VLLM_MAX_GPU_UTIL_PCT and utilization is in VLLM_GPU_FAIL_ON.
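The threshold logic amounts to one gate per metric, enabled only when that condition appears in VLLM_GPU_FAIL_ON. A sketch with a hypothetical gpu_within_limits helper, defaults mirroring the documented ones:

```shell
# Return 0 when a GPU passes the configured thresholds; 1 on a hard failure.
# Args: free_gb temp_c util_pct
gpu_within_limits() {
  fail_on=",${VLLM_GPU_FAIL_ON:-temperature},"
  case "$fail_on" in *,memory,*)
    [ "$1" -ge "${VLLM_MIN_FREE_GPU_GB:-4}" ] || return 1 ;;
  esac
  case "$fail_on" in *,temperature,*)
    [ "$2" -le "${VLLM_MAX_GPU_TEMP_C:-85}" ] || return 1 ;;
  esac
  case "$fail_on" in *,utilization,*)
    [ "$3" -le "${VLLM_MAX_GPU_UTIL_PCT:-95}" ] || return 1 ;;
  esac
  return 0
}
```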
When VLLM_PREFLIGHT_JSON=1, a single JSON object is printed to stdout. Human-readable logs still go to stderr. Incompatible with downstream command execution.
Examples:
{"ok":true,"host":"0.0.0.0","port":8000,"port_action":"noop","reclaimed_roots":[],"blocked_roots":[],"could_scan":true,"gpus":[],"gpu_check_source":"skipped"}
{"ok":true,"host":"0.0.0.0","port":8000,"port_action":"stopped","reclaimed_roots":[12345],"blocked_roots":[],"could_scan":true,"gpus":[...],"gpu_check_source":"nvml"}

Prevents two concurrent preflight runs from racing on the same port:
- flock (preferred): opens $VLLM_PREFLIGHT_LOCKFILE with flock -n. Validates that the lockfile is not a symlink and is a regular file.
- mkdir fallback: creates $VLLM_PREFLIGHT_LOCKFILE.d/ atomically. Detects stale locks via a PID file.
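The mkdir fallback can be sketched as follows. lock_acquire/lock_release are hypothetical names, and the stale-PID detection here is deliberately naive (kill -0 also fails on a live process owned by another user); the real script is more thorough.

```shell
# mkdir-based lock with naive stale-PID detection (sketch).
lock_acquire() {
  lockdir="$1.d"
  if mkdir "$lockdir" 2>/dev/null; then
    echo "$$" > "$lockdir/pid"; return 0        # won the lock
  fi
  holder=$(cat "$lockdir/pid" 2>/dev/null)
  if [ -n "$holder" ] && ! kill -0 "$holder" 2>/dev/null; then
    rm -rf "$lockdir"                            # holder is gone: reclaim
    mkdir "$lockdir" 2>/dev/null && { echo "$$" > "$lockdir/pid"; return 0; }
  fi
  return 1                                       # held by a live process
}

lock_release() { rm -rf "$1.d"; }
```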
Loads the resolved model env file and executes vllm serve with validated arguments.
vllm-serve # standard launch
vllm-serve --print-cmd # print the vllm serve argv to stderr, then exec
vllm-serve --dry-run # print the argv to stderr and exit 0 (no exec)
vllm-serve -h # show help
vllm-serve -- --extra-flag val # pass extra args through to vllm serve

Always required:
| Variable | Validation | Description |
|---|---|---|
| FLOX_ENV_CACHE | Must be a directory | Cache directory (for the default vllm-config.yaml). Not required if VLLM_CONFIG_FILE is set |
| VLLM_HOST | Non-empty | Server bind address |
| VLLM_PORT | Positive integer | Server listen port |
| VLLM_TENSOR_PARALLEL_SIZE | Positive integer | Tensor parallelism GPU count |
| VLLM_PIPELINE_PARALLEL_SIZE | Positive integer | Pipeline parallelism GPU count |
| VLLM_KV_CACHE_DTYPE | Non-empty, no whitespace | KV cache dtype (e.g., auto, fp8) |
| VLLM_MAX_MODEL_LEN | Positive integer | Max sequence length |
| VLLM_MAX_NUM_BATCHED_TOKENS | Positive integer | Chunked prefill budget |
| VLLM_SERVED_MODEL_NAME | Non-empty | Model name for API responses |
Required when VLLM_MODEL_ENV_FILE is not set (the standard case):
| Variable | Description |
|---|---|
| FLOX_ENV_CACHE | Cache directory. Must exist as a directory |
| VLLM_MODEL_ID | Full model ID (org/model); otherwise VLLM_MODEL_ORG and VLLM_MODEL must both be set |
| Variable | Default | Description |
|---|---|---|
| VLLM_MODEL_ENV_FILE | Derived from FLOX_ENV_CACHE + model ID | Explicit env file path. Bypasses the standard derivation |
| VLLM_PREFIX_CACHING | false | Enable automatic prefix caching. Accepts true/false/1/0/yes/no |
| VLLM_CONFIG_FILE | $FLOX_ENV_CACHE/vllm-config.yaml | Override config file path. Default is auto-copied from the package on first run |
| VLLM_ENV_FILE_TRUSTED | false | Skip safe-mode env file parsing and source the file directly. Accepts true/false/1/0/yes/no |
Two modes:
Safe mode (default, VLLM_ENV_FILE_TRUSTED=false): The env file is parsed by a Python script that enforces a restricted .env subset:
- Lines must be KEY=VALUE or export KEY=VALUE.
- Values may be single-quoted, double-quoted, or unquoted.
- Double-quoted values support \\, \", \n, \t escapes.
- Trailing # comments are allowed after quoted values.
- No multiline values, no ${VAR} interpolation, no command substitution.
- Generates sanitized export KEY='value' lines, then sources the sanitized output.
- Requires python3 (or python) on PATH.
Trusted mode (VLLM_ENV_FILE_TRUSTED=true): The env file is sourced directly as shell code. Only use this if you fully trust the env file contents.
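To illustrate the difference between the modes, here is a crude shell approximation of safe-mode loading. It is a sketch, much narrower than the real Python parser: only single-quoted (export) KEY='value' lines are accepted, and everything else is rejected rather than parsed.

```shell
# Crude approximation of safe-mode env loading: accept only
# (export) KEY='single-quoted value' lines; reject anything that could
# interpolate or run code when evaluated.
safe_load_env() {
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in ''|'#'*) continue ;; esac
    if printf '%s\n' "$line" |
       grep -Eq "^(export )?[A-Za-z_][A-Za-z0-9_]*='[^']*'[[:space:]]*(# .*)?$"; then
      eval "$line"   # safe: value is a literal single-quoted string
    else
      echo "rejected: $line" >&2; return 1
    fi
  done < "$1"
}
```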
The env file must define _VLLM_RESOLVED_MODEL or vllm-serve exits with an error.
vllm-serve builds the final argv as:
vllm serve <_VLLM_RESOLVED_MODEL> \
--config <config_file> \
--host <VLLM_HOST> \
--port <VLLM_PORT> \
--tensor-parallel-size <VLLM_TENSOR_PARALLEL_SIZE> \
--pipeline-parallel-size <VLLM_PIPELINE_PARALLEL_SIZE> \
--kv-cache-dtype <VLLM_KV_CACHE_DTYPE> \
--max-model-len <VLLM_MAX_MODEL_LEN> \
--max-num-batched-tokens <VLLM_MAX_NUM_BATCHED_TOKENS> \
--served-model-name <VLLM_SERVED_MODEL_NAME> \
[--enable-prefix-caching] # when VLLM_PREFIX_CACHING is true/1/yes
[extra args...] # anything after -- on the vllm-serve command line

The env var to vLLM CLI flag mapping:
| Env var | CLI flag |
|---|---|
| _VLLM_RESOLVED_MODEL | positional (model argument) |
| VLLM_CONFIG_FILE or $FLOX_ENV_CACHE/vllm-config.yaml | --config |
| VLLM_HOST | --host |
| VLLM_PORT | --port |
| VLLM_TENSOR_PARALLEL_SIZE | --tensor-parallel-size |
| VLLM_PIPELINE_PARALLEL_SIZE | --pipeline-parallel-size |
| VLLM_KV_CACHE_DTYPE | --kv-cache-dtype |
| VLLM_MAX_MODEL_LEN | --max-model-len |
| VLLM_MAX_NUM_BATCHED_TOKENS | --max-num-batched-tokens |
| VLLM_SERVED_MODEL_NAME | --served-model-name |
| VLLM_PREFIX_CACHING | --enable-prefix-caching (when truthy) |
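The mapping can be sketched as a shell function that assembles the argv. build_vllm_argv is a hypothetical name; it prints the command rather than execing it, and it handles only the truthy spellings shown, unlike the real script:

```shell
# Assemble (and print, rather than exec) the vllm serve argv from env vars.
build_vllm_argv() {
  set -- vllm serve "$_VLLM_RESOLVED_MODEL" \
    --config "${VLLM_CONFIG_FILE:-$FLOX_ENV_CACHE/vllm-config.yaml}" \
    --host "$VLLM_HOST" \
    --port "$VLLM_PORT" \
    --tensor-parallel-size "$VLLM_TENSOR_PARALLEL_SIZE" \
    --pipeline-parallel-size "$VLLM_PIPELINE_PARALLEL_SIZE" \
    --kv-cache-dtype "$VLLM_KV_CACHE_DTYPE" \
    --max-model-len "$VLLM_MAX_MODEL_LEN" \
    --max-num-batched-tokens "$VLLM_MAX_NUM_BATCHED_TOKENS" \
    --served-model-name "$VLLM_SERVED_MODEL_NAME"
  case "${VLLM_PREFIX_CACHING:-false}" in
    true|1|yes) set -- "$@" --enable-prefix-caching ;;
  esac
  printf '%s ' "$@"; echo
}
```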
- If VLLM_CONFIG_FILE is set, use that path.
- Otherwise, use $FLOX_ENV_CACHE/vllm-config.yaml.
- If that file doesn't exist, the default config bundled in the package (<pkg-root>/share/vllm-flox-runtime/config.yaml) is auto-copied there on first run.
- The file must exist and be readable, or vllm-serve exits with an error.
# 2-way tensor parallel (most common)
VLLM_TENSOR_PARALLEL_SIZE=2 flox activate --start-services
# 4-way pipeline parallel
VLLM_PIPELINE_PARALLEL_SIZE=4 flox activate --start-services
# 4-way hybrid: TP=2 x PP=2
VLLM_TENSOR_PARALLEL_SIZE=2 \
VLLM_PIPELINE_PARALLEL_SIZE=2 \
flox activate --start-services

TP (tensor parallelism) shards weight matrices across GPUs — reduces per-GPU memory, best for latency. PP (pipeline parallelism) distributes layers sequentially across GPUs — useful when TP alone isn't enough. TP x PP must equal your total GPU count.
# Override at activation time
VLLM_MODEL=Qwen2.5-7B-Instruct \
VLLM_MODEL_ORG=Qwen \
flox activate --start-services
# Or edit the on-activate defaults in manifest.toml and restart
flox services restart vllm

flox services status # check service state
flox services logs vllm # tail service logs
flox services logs vllm -f # follow logs
flox services restart vllm # restart the vLLM service
flox services stop # stop all services
flox activate --start-services # activate and start in one step

Install flox/vllm-flox-monitoring alongside this environment to add Prometheus + Grafana:
# Add to [install]
prometheus.pkg-path = "prometheus"
grafana.pkg-path = "grafana"
vllm-flox-monitoring.pkg-path = "flox/vllm-flox-monitoring"
# Add to end of on-activate
# . vllm-monitoring-init
# Add to [services]
# prometheus.command = "vllm-monitoring-prometheus"
# grafana.command = "vllm-monitoring-grafana"

Key override env vars: PROMETHEUS_PORT (default 9090), GF_SERVER_HTTP_PORT (default 3000).
curl http://127.0.0.1:9090/api/v1/targets # Prometheus targets
curl http://127.0.0.1:3000/api/health # Grafana healthRaw vLLM metrics are always available at http://127.0.0.1:8000/metrics without additional packages.
Deploy vLLM to Kubernetes using the Flox "Imageless Kubernetes" (uncontained) pattern. The Flox containerd shim pulls the environment from FloxHub at pod startup, replacing the need for a container image.
- A Kubernetes cluster with the Flox containerd shim installed on GPU nodes
- NVIDIA GPU operator or device plugin configured
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml

| File | Purpose |
|---|---|
| k8s/namespace.yaml | Creates the vllm namespace |
| k8s/deployment.yaml | Single-replica Recreate deployment with Flox shim, GPU resources, emptyDir volume, health probes |
| k8s/service.yaml | ClusterIP service on port 8000 |
| k8s/pvc.yaml | (optional) 50 Gi ReadWriteOnce volume for persistent model storage |
The deployment uses runtimeClassName: flox and image: flox/empty:1.0.0 — the Flox shim intercepts pod creation, pulls flox/vllm-runtime from FloxHub, activates the environment, then runs the entrypoint (vllm-preflight && vllm-resolve-model && vllm-serve). The Recreate strategy ensures the old pod releases its GPU before the new pod starts.
By default the deployment uses an emptyDir volume at /models. The default Phi-3.5-mini-instruct-AWQ model is included as a Flox package and resolved via the flox source — no download required at startup, so ephemeral storage is sufficient.
For persistent model storage (survives pod restarts, avoids re-downloading large models), apply k8s/pvc.yaml and update the deployment volume:
kubectl apply -f k8s/pvc.yaml

# In deployment.yaml, replace the emptyDir volume:
volumes:
- name: vllm-models
persistentVolumeClaim:
claimName: vllm-models

Set the storageClassName in k8s/pvc.yaml to match your cluster:
storageClassName: gp3 # AWS EBS
storageClassName: standard-rwo # GKE
storageClassName: managed-premium # AKS

Create a Kubernetes Secret for API authentication and gated model access, then uncomment the secretKeyRef blocks in the deployment:
kubectl -n vllm create secret generic vllm-secrets \
--from-literal=api-key='your-production-api-key' \
--from-literal=hf-token='hf_...'

Without the secret, VLLM_API_KEY defaults to sk-vllm-local-dev from the on-activate hook.
Override the model via pod environment variables:
env:
- name: VLLM_MODEL
value: "Qwen2.5-7B-Instruct"
- name: VLLM_MODEL_ORG
value: "Qwen"

For multi-GPU inference, set VLLM_TENSOR_PARALLEL_SIZE and request additional GPUs:
env:
- name: VLLM_TENSOR_PARALLEL_SIZE
value: "2"
resources:
limits:
nvidia.com/gpu: 2

The startupProbe allows 10 minutes (60 failures x 10s) for warm starts (a Flox-bundled model or a cached model on a PVC). For cold starts (first-time model download), increase the threshold:
startupProbe:
failureThreshold: 120 # 20 minutes for cold start

Liveness and readiness probes are gated behind the startup probe and will not kill slow-starting pods.
# Watch pod startup
kubectl -n vllm get pods -w
# Check logs
kubectl -n vllm logs -f deployment/vllm
# Health check (from within the cluster)
kubectl -n vllm run curl --rm -it --image=curlimages/curl -- \
curl http://vllm-server:8000/health
# Port-forward for local access
kubectl -n vllm port-forward svc/vllm-server 8000:8000
curl http://localhost:8000/health

The service defaults to ClusterIP. For external access, change the type or add an Ingress:
# Quick LoadBalancer
kubectl -n vllm patch svc vllm-server -p '{"spec":{"type":"LoadBalancer"}}'
# Or use port-forward for development
kubectl -n vllm port-forward svc/vllm-server 8000:8000

Common issues and their solutions. Exit codes refer to vllm-preflight.
vllm-preflight automatically reclaims the port from stale vLLM processes. If it exits with code 2, a non-vLLM process is using the port.
# Find what's on the port
ss -tlnp | grep :8000
# Either stop that process or change the port
VLLM_PORT=8001 flox activate --start-services

Another user's vLLM process holds the port. Either ask them to stop it, or:
VLLM_ALLOW_OTHER_UID_KILL=1 flox activate --start-servicesYour system restricts /proc/<pid>/fd visibility (hidepid mount option). Run as the owning user, relax hidepid, or free the port manually.
The port is still listening after SIGTERM + SIGKILL + timeout. Increase the timeout or investigate manually:
VLLM_STOP_TIMEOUT=30 flox activate --start-services

Verify with nvidia-smi. This environment requires NVIDIA driver 575+. To skip the GPU check (e.g., for CPU-only testing):
VLLM_SKIP_GPU_CHECK=1 flox activate --start-services

Gated models require a HuggingFace token:
HF_TOKEN=hf_... flox activate --start-services

Ensure you've accepted the model's license on the HuggingFace website.
Reduce memory pressure:
- Lower gpu-memory-utilization in .flox/cache/vllm-config.yaml (e.g., 0.85).
- Reduce VLLM_MAX_MODEL_LEN (e.g., 2048).
- Use VLLM_KV_CACHE_DTYPE=fp8 to halve KV cache memory.
- Increase tensor parallelism to spread the model across GPUs.
Some models use non-standard tokenizer layouts. Skip the check:
VLLM_SKIP_TOKENIZER_CHECK=1 flox activate --start-services

If a previous run was killed mid-operation, the lock file may be stale:
# For vllm-preflight
rm -f /tmp/vllm-preflight.*.lock
# For vllm-resolve-model (lockfile is next to the env file)
rm -f "$FLOX_ENV_CACHE"/vllm-model.*.lock

The mkdir-based fallback lock includes stale PID detection and self-cleans.
vllm-serve cannot find the model env file. Run vllm-resolve-model first:
vllm-resolve-model && vllm-serve

Or specify the env file explicitly:
VLLM_MODEL_ENV_FILE=/path/to/env vllm-serve

vllm-serve --print-cmd # print the vllm serve argv to stderr, then run it
vllm-serve --dry-run # print the argv and exit without running

VLLM_LOGGING_LEVEL=DEBUG flox activate --start-services

vllm-runtime/
.flox/env/manifest.toml # Flox manifest (packages, on-activate hook, service)
.flox/cache/vllm-config.yaml # vLLM server config (auto-copied from package on first run)
k8s/ # Kubernetes manifests (Flox uncontained pattern)
models/ # Model cache (created on activation)
examples/ # Demo scripts (flox-bundled and HF-cached models)
README.md # This file
Scripts (vllm-preflight, vllm-resolve-model, vllm-serve) are provided by the flox/vllm-flox-runtime package and available on PATH after activation. They are not stored in this directory. The default config.yaml (gpu-memory-utilization, dtype, logging) is bundled in the package and auto-copied to .flox/cache/vllm-config.yaml on first run — edit that copy to customize.
The runtime scripts handle untrusted input (model names, env files, lock files) and apply defense-in-depth.
The model env file is a trust boundary. In safe mode (default), vllm-serve parses the file with a restrictive Python parser that rejects shell interpolation and command substitution. In trusted mode, the file is sourced directly — only enable this for env files you control.
Even in safe mode, the env file can set arbitrary environment variables (e.g., PATH, LD_LIBRARY_PATH, HF_HOME), so protect its location.
- Env files: written with umask 077 and chmod 600 — readable only by the owning user.
- Lock files: created with umask 077. Symlink safety is checked before opening.
- Staging directories: created under $VLLM_MODELS_DIR/.staging/ with umask 077.
- vllm-preflight validates that the lockfile is not a symlink and is a regular file before opening.
- vllm-resolve-model uses per-model lock files (one per env file path) to prevent concurrent provisioning of the same model.
- The mkdir-based fallback includes stale PID detection to recover from crashes.