Commit 17c0637

Add monitoring for CodeGen/CodeTrans deployed by Docker compose. (#2322)
Signed-off-by: Yao, Qing <qing.yao@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 72f2e01 commit 17c0637

33 files changed, with 968 additions and 32 deletions

CodeGen/README.md

Lines changed: 49 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -106,19 +106,58 @@ flowchart LR
106106

107107
This CodeGen example can be deployed manually on various hardware platforms using Docker Compose or Kubernetes. Select the appropriate guide based on your target environment:
108108

109-
| Hardware | Deployment Mode | Guide Link |
110-
| :-------------- | :------------------- | :----------------------------------------------------------------------- |
111-
| Intel Xeon CPU | Single Node (Docker) | [Xeon Docker Compose Guide](./docker_compose/intel/cpu/xeon/README.md) |
112-
| Intel Gaudi HPU | Single Node (Docker) | [Gaudi Docker Compose Guide](./docker_compose/intel/hpu/gaudi/README.md) |
113-
| AMD EPYC CPU | Single Node (Docker) | [EPYC Docker Compose Guide](./docker_compose/amd/cpu/epyc/README.md) |
114-
| AMD ROCm GPU | Single Node (Docker) | [ROCm Docker Compose Guide](./docker_compose/amd/gpu/rocm/README.md) |
115-
| Intel Xeon CPU | Kubernetes (Helm) | [Kubernetes Helm Guide](./kubernetes/helm/README.md) |
116-
| Intel Gaudi HPU | Kubernetes (Helm) | [Kubernetes Helm Guide](./kubernetes/helm/README.md) |
117-
| Intel Xeon CPU | Kubernetes (GMC) | [Kubernetes GMC Guide](./kubernetes/gmc/README.md) |
118-
| Intel Gaudi HPU | Kubernetes (GMC) | [Kubernetes GMC Guide](./kubernetes/gmc/README.md) |
109+
| Hardware | Deployment Mode | Guide Link |
110+
| :-------------- | :----------------------------------- | :--------------------------------------------------------------------------------------- |
111+
| Intel Xeon CPU | Single Node (Docker) | [Xeon Docker Compose Guide](./docker_compose/intel/cpu/xeon/README.md) |
112+
| Intel Xeon CPU | Single Node (Docker) with Monitoring | [Xeon Docker Compose with Monitoring Guide](./docker_compose/intel/cpu/xeon/README.md) |
113+
| Intel Gaudi HPU | Single Node (Docker) | [Gaudi Docker Compose Guide](./docker_compose/intel/hpu/gaudi/README.md) |
114+
| Intel Gaudi HPU | Single Node (Docker) with Monitoring | [Gaudi Docker Compose with Monitoring Guide](./docker_compose/intel/hpu/gaudi/README.md) |
115+
| AMD EPYC CPU | Single Node (Docker) | [EPYC Docker Compose Guide](./docker_compose/amd/cpu/epyc/README.md) |
116+
| AMD ROCm GPU | Single Node (Docker) | [ROCm Docker Compose Guide](./docker_compose/amd/gpu/rocm/README.md) |
117+
| Intel Xeon CPU | Kubernetes (Helm) | [Kubernetes Helm Guide](./kubernetes/helm/README.md) |
118+
| Intel Gaudi HPU | Kubernetes (Helm) | [Kubernetes Helm Guide](./kubernetes/helm/README.md) |
119+
| Intel Xeon CPU | Kubernetes (GMC) | [Kubernetes GMC Guide](./kubernetes/gmc/README.md) |
120+
| Intel Gaudi HPU | Kubernetes (GMC) | [Kubernetes GMC Guide](./kubernetes/gmc/README.md) |
119121

120122
_Note: Building custom microservice images can be done using the resources in [GenAIComps](https://github.com/opea-project/GenAIComps)._
121123

124+
## Monitoring
125+
126+
The CodeGen example supports monitoring capabilities for Intel Xeon and Intel Gaudi platforms. Monitoring includes:
127+
128+
- **Prometheus**: For metrics collection and querying
129+
- **Grafana**: For visualization and dashboards
130+
- **Node Exporter**: For system metrics collection
131+
132+
### Monitoring Features
133+
134+
- Real-time metrics collection from all CodeGen microservices
135+
- Pre-configured dashboards for:
136+
  - vLLM/TGI performance metrics
137+
  - CodeGen MegaService metrics
138+
  - System resource utilization
139+
  - Node-level metrics
140+
141+
### Enabling Monitoring
142+
143+
Monitoring can be enabled by using the `compose.monitoring.yaml` file along with the main compose file:
144+
145+
```bash
146+
# For Intel Xeon
147+
docker compose -f compose.yaml -f compose.monitoring.yaml up -d
148+
149+
# For Intel Gaudi
150+
docker compose -f compose.yaml -f compose.monitoring.yaml up -d
151+
```
152+
153+
### Accessing Monitoring Services
154+
155+
Once deployed with monitoring, you can access:
156+
157+
- **Prometheus**: `http://${HOST_IP}:9090`
158+
- **Grafana**: `http://${HOST_IP}:3000` (username: `admin`, password: `admin`)
159+
- **Node Exporter**: `http://${HOST_IP}:9100`
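
A quick way to confirm that the three endpoints listed above respond is sketched below. It is a minimal sketch, assuming `curl` is installed and that the stack uses the default ports from this commit; it relies on the Prometheus `/-/healthy` and Grafana `/api/health` endpoints and the Node Exporter `/metrics` page. The function names `health_urls` and `check_monitoring` are hypothetical helpers and are not part of this commit.

```bash
# Hypothetical helpers (not part of the commit) for probing the monitoring stack.

health_urls() {
  # $1 = host or IP where the monitoring containers are published
  echo "http://$1:9090/-/healthy"   # Prometheus health endpoint
  echo "http://$1:3000/api/health"  # Grafana health endpoint
  echo "http://$1:9100/metrics"     # Node Exporter metrics page
}

check_monitoring() {
  # Probe each URL and report OK or FAIL; never aborts on a failed probe.
  health_urls "$1" | while read -r url; do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
    fi
  done
}
```

After the stack is up, `check_monitoring "${HOST_IP}"` prints one line per service. A `FAIL` line usually means the container is still starting or the port is blocked, so `docker ps` and the service logs are the next place to look.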
160+
122161
## Benchmarking
123162

124163
Guides for evaluating the performance and accuracy of this CodeGen deployment are available:

CodeGen/docker_compose/intel/cpu/xeon/README.md

Lines changed: 60 additions & 3 deletions
@@ -49,7 +49,8 @@ This uses the default vLLM-based deployment using `compose.yaml`.
4949
# export https_proxy="your_https_proxy"
5050
# export no_proxy="localhost,127.0.0.1,${HOST_IP}" # Add other hosts if necessary
5151
source intel/set_env.sh
52-
cd /intel/cpu/xeon
52+
cd intel/cpu/xeon
53+
bash grafana/dashboards/download_opea_dashboard.sh
5354
```
5455

5556
_Note: The compose file might read additional variables from set_env.sh. Ensure all required variables like ports (`LLM_SERVICE_PORT`, `MEGA_SERVICE_PORT`, etc.) are set if not using defaults from the compose file._
@@ -146,7 +147,7 @@ Key parameters are configured via environment variables set before running `dock
146147
Most of these parameters are defined in `set_env.sh`; you can either modify this file or override individual environment variables by setting them yourself.
147148
148149
```shell
149-
source CodeGen/docker_compose/set_env.sh
150+
source CodeGen/docker_compose/intel/set_env.sh
150151
```
151152
152153
#### Compose Files
@@ -252,7 +253,63 @@ Users can interact with the backend service using the `Neural Copilot` VS Code e
252253
- **"Container name is in use"**: Stop existing containers (`docker compose down`) or change `container_name` in the compose file.
253254
- **Resource Issues:** CodeGen models can be memory-intensive. Monitor host RAM usage. Increase Docker resources if needed.
254255

255-
## Stopping the Application
256+
## Monitoring Deployment
257+
258+
To enable monitoring for the CodeGen application, you can use the monitoring Docker Compose file along with the main deployment.
259+
260+
### Option #1: Default Deployment (without monitoring)
261+
262+
To deploy the CodeGen services without monitoring, execute:
263+
264+
```bash
265+
docker compose up -d
266+
```
267+
268+
### Option #2: Deployment with Monitoring
269+
270+
> NOTE: To enable monitoring, the `compose.monitoring.yaml` file needs to be merged with the default `compose.yaml` file.
271+
272+
To deploy with monitoring:
273+
274+
```bash
275+
bash grafana/dashboards/download_opea_dashboard.sh
276+
docker compose -f compose.yaml -f compose.monitoring.yaml up -d
277+
```
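
Because `docker compose` merges every file passed with `-f`, the same file list has to be supplied to `up`, `config`, and `down`. The helper below is a minimal sketch of one way to keep that list in a single place. The function name `compose_files` and its `monitoring` argument are hypothetical and are not defined anywhere in this commit.

```bash
# Hypothetical helper (not part of the commit): build the -f arguments once.

compose_files() {
  # $1 = "monitoring" to include the monitoring overlay; anything else = base file only.
  if [ "${1:-}" = "monitoring" ]; then
    echo "-f compose.yaml -f compose.monitoring.yaml"
  else
    echo "-f compose.yaml"
  fi
}

# Usage (run from the directory that holds both compose files):
#   docker compose $(compose_files monitoring) config --services   # preview the merged service list
#   docker compose $(compose_files monitoring) up -d
#   docker compose $(compose_files monitoring) down
```

The command substitution is left unquoted on purpose so the shell splits it into separate arguments. Previewing with `config --services` before `up -d` is a cheap way to confirm that `prometheus`, `grafana`, and `node-exporter` appear alongside the CodeGen services.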
278+
279+
### Accessing Monitoring Services
280+
281+
Once deployed with monitoring, you can access:
282+
283+
- **Prometheus**: `http://${HOST_IP}:9090`
284+
- **Grafana**: `http://${HOST_IP}:3000` (username: `admin`, password: `admin`)
285+
- **Node Exporter**: `http://${HOST_IP}:9100`
286+
287+
### Monitoring Components
288+
289+
The monitoring stack includes:
290+
291+
- **Prometheus**: For metrics collection and querying
292+
- **Grafana**: For visualization and dashboards
293+
- **Node Exporter**: For system metrics collection
294+
295+
### Monitoring Dashboards
296+
297+
The `download_opea_dashboard.sh` script downloads the following dashboards, and Grafana loads them through its provisioning configuration:
298+
299+
- vLLM Dashboard
300+
- TGI Dashboard
301+
- CodeGen MegaService Dashboard
302+
- Node Exporter Dashboard
303+
304+
### Stopping the Application
305+
306+
If monitoring is enabled, execute the following command:
307+
308+
```bash
309+
docker compose -f compose.yaml -f compose.monitoring.yaml down
310+
```
311+
312+
If monitoring is not enabled, execute:
256313

257314
```bash
258315
docker compose down # for vLLM (compose.yaml)
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
1+
# Copyright (C) 2024 Intel Corporation
2+
# SPDX-License-Identifier: Apache-2.0
3+
4+
services:
5+
prometheus:
6+
image: prom/prometheus:v2.52.0
7+
container_name: opea_prometheus
8+
user: root
9+
volumes:
10+
- ./prometheus.yaml:/etc/prometheus/prometheus.yaml
11+
- ./prometheus_data:/prometheus
12+
command:
13+
- '--config.file=/etc/prometheus/prometheus.yaml'
14+
ports:
15+
- '9090:9090'
16+
ipc: host
17+
restart: unless-stopped
18+
19+
grafana:
20+
image: grafana/grafana:11.0.0
21+
container_name: grafana
22+
volumes:
23+
- ./grafana_data:/var/lib/grafana
24+
- ./grafana/dashboards:/var/lib/grafana/dashboards
25+
- ./grafana/provisioning:/etc/grafana/provisioning
26+
user: root
27+
environment:
28+
GF_SECURITY_ADMIN_PASSWORD: admin
29+
GF_RENDERING_CALLBACK_URL: http://grafana:3000/
30+
GF_LOG_FILTERS: rendering:debug
31+
no_proxy: ${no_proxy}
32+
host_ip: ${host_ip}
33+
depends_on:
34+
- prometheus
35+
ports:
36+
- '3000:3000'
37+
ipc: host
38+
restart: unless-stopped
39+
40+
node-exporter:
41+
image: prom/node-exporter
42+
container_name: node-exporter
43+
volumes:
44+
- /proc:/host/proc:ro
45+
- /sys:/host/sys:ro
46+
- /:/rootfs:ro
47+
command:
48+
- '--path.procfs=/host/proc'
49+
- '--path.sysfs=/host/sys'
50+
- --collector.filesystem.ignored-mount-points
51+
- "^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($$|/)"
52+
environment:
53+
no_proxy: ${no_proxy}
54+
ports:
55+
- 9100:9100
56+
restart: always
57+
deploy:
58+
mode: global
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
#!/bin/bash
2+
# Copyright (C) 2025 Intel Corporation
3+
# SPDX-License-Identifier: Apache-2.0
4+
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
5+
cd "$SCRIPT_DIR"
6+
if ls *.json 1> /dev/null 2>&1; then
7+
  rm *.json
8+
fi
9+
10+
wget https://raw.githubusercontent.com/opea-project/GenAIEval/refs/heads/main/evals/benchmark/grafana/vllm_grafana.json
11+
wget https://raw.githubusercontent.com/opea-project/GenAIEval/refs/heads/main/evals/benchmark/grafana/tgi_grafana.json
12+
wget https://raw.githubusercontent.com/opea-project/GenAIEval/refs/heads/main/evals/benchmark/grafana/codegen_megaservice_grafana.json
13+
wget https://raw.githubusercontent.com/opea-project/GenAIEval/refs/heads/main/evals/benchmark/grafana/node_grafana.json
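
The script above deletes any existing `*.json` files and then runs four independent `wget` calls, so a failed download goes unnoticed. The variant below is a sketch of a more defensive approach, not the committed script: it loops over the same four file names, overwrites each target with `wget -O`, and stops at the first failure. The names `BASE_URL`, `DASHBOARDS`, `dashboard_url`, and `download_dashboards` are hypothetical.

```bash
# Hypothetical variant (not part of the commit) of the dashboard download step.

BASE_URL="https://raw.githubusercontent.com/opea-project/GenAIEval/refs/heads/main/evals/benchmark/grafana"
DASHBOARDS="vllm_grafana.json tgi_grafana.json codegen_megaservice_grafana.json node_grafana.json"

dashboard_url() {
  # $1 = dashboard file name
  echo "${BASE_URL}/$1"
}

download_dashboards() {
  # $1 = target directory. -O overwrites the target, so no prior cleanup is needed.
  for name in $DASHBOARDS; do
    wget -q -O "$1/$name" "$(dashboard_url "$name")" || {
      rm -f "$1/$name"   # drop the partial file left behind by a failed download
      echo "failed to download $name" >&2
      return 1
    }
  done
}

# Usage: download_dashboards grafana/dashboards
```

Returning a non-zero status lets a calling script skip `docker compose up` when a dashboard is missing, instead of starting Grafana with an incomplete set.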
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
1+
# Copyright (C) 2025 Intel Corporation
2+
# SPDX-License-Identifier: Apache-2.0
3+
4+
apiVersion: 1
5+
6+
providers:
7+
  - name: 'default'
8+
    orgId: 1
9+
    folder: ''
10+
    type: file
11+
    disableDeletion: false
12+
    updateIntervalSeconds: 10 #how often Grafana will scan for changed dashboards
13+
    options:
14+
      path: /var/lib/grafana/dashboards
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
# Copyright (C) 2025 Intel Corporation
2+
# SPDX-License-Identifier: Apache-2.0
3+
4+
# config file version
5+
apiVersion: 1
6+
7+
# list of datasources that should be deleted from the database
8+
deleteDatasources:
9+
  - name: Prometheus
10+
    orgId: 1
11+
12+
# list of datasources to insert/update depending
13+
# what's available in the database
14+
datasources:
15+
  # <string, required> name of the datasource. Required
16+
  - name: Prometheus
17+
    # <string, required> datasource type. Required
18+
    type: prometheus
19+
    # <string, required> access mode. direct or proxy. Required
20+
    access: proxy
21+
    # <int> org id. will default to orgId 1 if not specified
22+
    orgId: 1
23+
    # <string> url
24+
    url: http://$host_ip:9090
25+
    # <string> database password, if used
26+
    password:
27+
    # <string> database user, if used
28+
    user:
29+
    # <string> database name, if used
30+
    database:
31+
    # <bool> enable/disable basic auth
32+
    basicAuth: false
33+
    # <string> basic auth username, if used
34+
    basicAuthUser:
35+
    # <string> basic auth password, if used
36+
    basicAuthPassword:
37+
    # <bool> enable/disable with credentials headers
38+
    withCredentials:
39+
    # <bool> mark as default datasource. Max one per org
40+
    isDefault: true
41+
    # <map> fields that will be converted to json and stored in json_data
42+
    jsonData:
43+
      httpMethod: GET
44+
      graphiteVersion: "1.1"
45+
      tlsAuth: false
46+
      tlsAuthWithCACert: false
47+
    # <string> json object of data that will be encrypted.
48+
    secureJsonData:
49+
      tlsCACert: "..."
50+
      tlsClientCert: "..."
51+
      tlsClientKey: "..."
52+
    version: 1
53+
    # <bool> allow users to edit datasources from the UI.
54+
    editable: true
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
# Copyright (C) 2025 Intel Corporation
2+
# SPDX-License-Identifier: Apache-2.0
3+
# [IP_ADDR]:{PORT_OUTSIDE_CONTAINER} -> {PORT_INSIDE_CONTAINER} / {PROTOCOL}
4+
global:
5+
  scrape_interval: 5s
6+
  external_labels:
7+
    monitor: "my-monitor"
8+
scrape_configs:
9+
  - job_name: "prometheus"
10+
    static_configs:
11+
      - targets: ["opea_prometheus:9090"]
12+
  - job_name: "vllm"
13+
    metrics_path: /metrics
14+
    static_configs:
15+
      - targets: ["vllm-server:80"]
16+
  - job_name: "tgi"
17+
    metrics_path: /metrics
18+
    static_configs:
19+
      - targets: [ "tgi-service:80" ]
20+
  - job_name: "codegen-backend-server"
21+
    metrics_path: /metrics
22+
    static_configs:
23+
      - targets: ["codegen-xeon-backend-server:7778"]
24+
  - job_name: "prometheus-node-exporter"
25+
    metrics_path: /metrics
26+
    static_configs:
27+
      - targets: ["node-exporter:9100"]
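
Each `job_name` in the scrape configuration above should show up as a target in Prometheus once the stack is running. The sketch below lists the configured job names so they can be compared against what Prometheus reports. It assumes job names are written in double quotes, as they are in this file, and the function name `list_scrape_jobs` is a hypothetical helper that is not part of this commit.

```bash
# Hypothetical helper (not part of the commit): list job names from a Prometheus config.

list_scrape_jobs() {
  # $1 = path to prometheus.yaml
  # Split each job_name line on double quotes; field 2 is the quoted job name.
  awk -F'"' '/job_name:/ { print $2 }' "$1"
}

# Usage:
#   list_scrape_jobs prometheus.yaml
#   curl -s "http://${HOST_IP}:9090/api/v1/targets"   # live target health as reported by Prometheus
```

A job that is listed in the file but whose target is reported as down usually points at a container name or port mismatch. For example, the `vllm` and `tgi` jobs cannot both be up when only one of the two serving backends is deployed.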

CodeGen/docker_compose/intel/hpu/gaudi/README.md

Lines changed: 60 additions & 2 deletions
@@ -49,7 +49,10 @@ This uses the default vLLM-based deployment using `compose.yaml`.
4949
# export https_proxy="your_https_proxy"
5050
# export no_proxy="localhost,127.0.0.1,${HOST_IP}" # Add other hosts if necessary
5151
source intel/set_env.sh
52-
cd /intel/hpu/gaudi
52+
cd intel/hpu/gaudi
53+
cd grafana/dashboards
54+
bash download_opea_dashboard.sh
55+
cd ../..
5356
```
5457

5558
_Note: The compose file might read additional variables from set_env.sh. Ensure all required variables like ports (`LLM_SERVICE_PORT`, `MEGA_SERVICE_PORT`, etc.) are set if not using defaults from the compose file._
@@ -228,7 +231,62 @@ Use the `Neural Copilot` extension configured with the CodeGen backend URL: `htt
228231
- **Model Download Issues:** Check `HF_TOKEN`, internet access, proxy settings. Check LLM service logs.
229232
- **Connection Errors:** Verify `HOST_IP`, ports, and proxy settings. Use `docker ps` and check service logs.
230233

231-
## Stopping the Application
234+
## Monitoring Deployment
235+
236+
To enable monitoring for the CodeGen application on Gaudi, you can use the monitoring Docker Compose file along with the main deployment.
237+
238+
### Option #1: Default Deployment (without monitoring)
239+
240+
To deploy the CodeGen services without monitoring, execute:
241+
242+
```bash
243+
docker compose up -d
244+
```
245+
246+
### Option #2: Deployment with Monitoring
247+
248+
> NOTE: To enable monitoring, the `compose.monitoring.yaml` file needs to be merged with the default `compose.yaml` file.
249+
250+
To deploy with monitoring:
251+
252+
```bash
253+
docker compose -f compose.yaml -f compose.monitoring.yaml up -d
254+
```
255+
256+
### Accessing Monitoring Services
257+
258+
Once deployed with monitoring, you can access:
259+
260+
- **Prometheus**: `http://${HOST_IP}:9090`
261+
- **Grafana**: `http://${HOST_IP}:3000` (username: `admin`, password: `admin`)
262+
- **Node Exporter**: `http://${HOST_IP}:9100`
263+
264+
### Monitoring Components
265+
266+
The monitoring stack includes:
267+
268+
- **Prometheus**: For metrics collection and querying
269+
- **Grafana**: For visualization and dashboards
270+
- **Node Exporter**: For system metrics collection
271+
272+
### Monitoring Dashboards
273+
274+
The `download_opea_dashboard.sh` script downloads the following dashboards, and Grafana loads them through its provisioning configuration:
275+
276+
- vLLM Dashboard
277+
- TGI Dashboard
278+
- CodeGen MegaService Dashboard
279+
- Node Exporter Dashboard
280+
281+
### Stopping the Application
282+
283+
If monitoring is enabled, execute the following command:
284+
285+
```bash
286+
docker compose -f compose.yaml -f compose.monitoring.yaml down
287+
```
288+
289+
If monitoring is not enabled, execute:
232290

233291
```bash
234292
docker compose down # for vLLM (compose.yaml)
