From fe72c7312a681961545b43aeb9f72b1671ed33e4 Mon Sep 17 00:00:00 2001
From: AlexHe99
Date: Tue, 24 Dec 2024 18:47:05 +0800
Subject: [PATCH 1/5] Update deploying_with_k8s.md with AMD ROCm GPU example

Add the example of using AMD ROCm GPU

Signed-off-by: Alex He
---
 docs/source/serving/deploying_with_k8s.md | 73 +++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/docs/source/serving/deploying_with_k8s.md b/docs/source/serving/deploying_with_k8s.md
index d27db826cd00..81ffc3e3703a 100644
--- a/docs/source/serving/deploying_with_k8s.md
+++ b/docs/source/serving/deploying_with_k8s.md
@@ -119,6 +119,79 @@ spec:
       periodSeconds: 5
 ```
 
+- AMD ROCm GPU
+
+You can refer to the `deployment.yaml` below if using AMD ROCm GPU like MI300X.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: mistral-7b
+  namespace: default
+  labels:
+    app: mistral-7b
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: mistral-7b
+  template:
+    metadata:
+      labels:
+        app: mistral-7b
+    spec:
+      volumes:
+      # PVC
+      - name: cache-volume
+        persistentVolumeClaim:
+          claimName: mistral-7b
+      # vLLM needs to access the host's shared memory for tensor parallel inference.
+      - name: shm
+        emptyDir:
+          medium: Memory
+          sizeLimit: "8Gi"
+      hostNetwork: true
+      hostIPC: true
+      containers:
+      - name: mistral-7b
+        image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
+        securityContext:
+          seccompProfile:
+            type: Unconfined
+          runAsGroup: 44
+          capabilities:
+            add:
+            - SYS_PTRACE
+        command: ["/bin/sh", "-c"]
+        args: [
+          "vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
+        ]
+        env:
+        - name: HUGGING_FACE_HUB_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: hf-token-secret
+              key: token
+        ports:
+        - containerPort: 8000
+        resources:
+          limits:
+            cpu: "10"
+            memory: 20G
+            amd.com/gpu: "1"
+          requests:
+            cpu: "6"
+            memory: 6G
+            amd.com/gpu: "1"
+        volumeMounts:
+        - name: cache-volume
+          mountPath: /root/.cache/huggingface
+        - name: shm
+          mountPath: /dev/shm
+```
+The full example is at https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve.
+
 2. **Create a Kubernetes Service for vLLM**
 
 Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:

From b8401799c52cd58d367af915b0d5c908b5fffb32 Mon Sep 17 00:00:00 2001
From: AlexHe99
Date: Fri, 27 Dec 2024 10:09:12 +0800
Subject: [PATCH 2/5] Update docs/source/serving/deploying_with_k8s.md

Good suggestion! Thank you.

Co-authored-by: Cyrus Leung
Signed-off-by: Alex He
---
 docs/source/serving/deploying_with_k8s.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/serving/deploying_with_k8s.md b/docs/source/serving/deploying_with_k8s.md
index 81ffc3e3703a..f25e7640472e 100644
--- a/docs/source/serving/deploying_with_k8s.md
+++ b/docs/source/serving/deploying_with_k8s.md
@@ -190,7 +190,7 @@ spec:
   - name: shm
     mountPath: /dev/shm
 ```
-The full example is at https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve.
+The full example is at <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
 
 2. **Create a Kubernetes Service for vLLM**

From 5d7a897d1e6b0ecd64ee63f37901d4030bf7f883 Mon Sep 17 00:00:00 2001
From: AlexHe99
Date: Fri, 27 Dec 2024 10:29:14 +0800
Subject: [PATCH 3/5] Update deploying_with_k8s.md

- Split it to two sub-section about the writing deployment.yaml for
  NVIDIA GPU and AMD GPU.

Signed-off-by: Alex He
---
 docs/source/serving/deploying_with_k8s.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/docs/source/serving/deploying_with_k8s.md b/docs/source/serving/deploying_with_k8s.md
index f25e7640472e..77f848088ea4 100644
--- a/docs/source/serving/deploying_with_k8s.md
+++ b/docs/source/serving/deploying_with_k8s.md
@@ -47,7 +47,11 @@ data:
   token: "REPLACE_WITH_TOKEN"
 ```
 
-Create a deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model:
+Next to create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.
+
+Here are two examples for using NVIDIA GPU and AMD GPU.
+
+- NVIDIA GPU
 
 ```yaml
 apiVersion: apps/v1
@@ -119,7 +123,7 @@ spec:
       periodSeconds: 5
 ```
 
-- AMD ROCm GPU
+- AMD GPU
 
 You can refer to the `deployment.yaml` below if using AMD ROCm GPU like MI300X.
 
@@ -190,7 +194,7 @@ spec:
   - name: shm
     mountPath: /dev/shm
 ```
-The full example is at <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
+You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
 
 2. **Create a Kubernetes Service for vLLM**

From 3fc12d017052cb2f3a4e41f29910457cc08f1736 Mon Sep 17 00:00:00 2001
From: AlexHe99
Date: Fri, 27 Dec 2024 10:30:21 +0800
Subject: [PATCH 4/5] Update deploying_with_k8s.md

Signed-off-by: Alex He
---
 docs/source/serving/deploying_with_k8s.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/serving/deploying_with_k8s.md b/docs/source/serving/deploying_with_k8s.md
index 77f848088ea4..47ad926e2080 100644
--- a/docs/source/serving/deploying_with_k8s.md
+++ b/docs/source/serving/deploying_with_k8s.md
@@ -49,7 +49,7 @@ data:
 
 Next to create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.
 
-Here are two examples for using NVIDIA GPU and AMD GPU.
+Here are two exampels for using NVIDIA GPU and AMD GPU.
 
 - NVIDIA GPU

From aabd116b6cd2e3b439c00773fb382f45a375f9db Mon Sep 17 00:00:00 2001
From: AlexHe99
Date: Fri, 27 Dec 2024 10:34:50 +0800
Subject: [PATCH 5/5] Update deploying_with_k8s.md

Signed-off-by: Alex He
---
 docs/source/serving/deploying_with_k8s.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/serving/deploying_with_k8s.md b/docs/source/serving/deploying_with_k8s.md
index 47ad926e2080..77f848088ea4 100644
--- a/docs/source/serving/deploying_with_k8s.md
+++ b/docs/source/serving/deploying_with_k8s.md
@@ -49,7 +49,7 @@ data:
 
 Next to create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.
 
-Here are two exampels for using NVIDIA GPU and AMD GPU.
+Here are two examples for using NVIDIA GPU and AMD GPU.
 
 - NVIDIA GPU
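The deployment these patches add runs `vllm serve` on port 8000, and vLLM exposes an OpenAI-compatible REST API there. Once the `mistral-7b` Service from the next step is reachable (for example via `kubectl port-forward svc/mistral-7b 8000:8000`), a client can POST to `/v1/completions`. A minimal client sketch follows; the base URL assumes a local port-forward, and the helper name `completion_request` is ours for illustration, not part of vLLM:

```python
import json
import urllib.request

# Assumption: `kubectl port-forward svc/mistral-7b 8000:8000` is running,
# so the vLLM server is reachable on localhost.
BASE_URL = "http://localhost:8000"

def completion_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a request for vLLM's OpenAI-compatible
    /v1/completions endpoint."""
    body = json.dumps({
        # Must match the model passed to `vllm serve` in the deployment args.
        "model": "mistralai/Mistral-7B-v0.3",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = completion_request("San Francisco is a")
# To actually query the server: urllib.request.urlopen(req)
```

Sending the request returns a JSON body whose `choices[0].text` field holds the generated completion, which is a quick way to verify the pod and Service are wired up correctly.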