support different types of computing hardware #5138
## Description

### Motivation
Currently, OpenPAI supports the most widely used computing devices: Nvidia GPU, AMD GPU, and CPU. In addition, it has the potential to support other types of devices, e.g. AI computing chips (NPUs).
### Goal
Decouple OpenPAI services from specific hardware types. One OpenPAI service container can support a list of hardware types.
### Requirements
For every type of computing device, the vendor should guarantee that:
- each machine has only one type of computing device
- the driver and the k8s device plugin are successfully deployed on each machine
- devices work correctly with docker and k8s
- compatible frameworks and docker images are available
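The first requirement can be enforced with a simple check against a node's allocatable resources. The sketch below is hypothetical (not actual OpenPAI code), and the list of known device resource names is an assumption for illustration:

```javascript
// Hypothetical validation sketch: check that a node advertises at most one
// known compute-device resource type. The resource-name list is an
// assumption for illustration, not an exhaustive registry.
const knownDeviceResources = ['nvidia.com/gpu', 'amd.com/gpu'];

function validateSingleDeviceType(nodeAllocatable) {
  const found = knownDeviceResources.filter((r) => r in nodeAllocatable);
  if (found.length > 1) {
    throw new Error(`Node exposes multiple device types: ${found.join(', ')}`);
  }
  return found[0] || null; // null means a CPU-only node
}
```

Such a check could run in the quick-start pre-checks, since the node's allocatable resources already reflect whether the device plugin was deployed successfully.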
### MVP with default scheduler

Assuming there is only one type of computing device in the cluster, we can build a minimum viable solution with the default scheduler by:
- configure `ComputeDevice` (default is `nvidia.com/gpu`) in deployment and record it in configmap
- add an option to turn off the HiveD scheduler in quick start
- bypass (or adjust) pre-checks according to `ComputeDevice` in quick start
- change `nvidia.com/gpu` to `ComputeDevice` in rest server
- change vc resource information when using the default scheduler
`pai/src/rest-server/src/models/v2/job/k8s.js`, lines 483 to 487 in 2fb370a:

```javascript
memory: `${config.taskRoles[taskRole].resourcePerInstance.memoryMB}Mi`,
'github.com/fuse': 1,
'nvidia.com/gpu':
  config.taskRoles[taskRole].resourcePerInstance.gpu,
...(infinibandDevice && { 'rdma/hca': 1 }),
```
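One way to decouple this snippet from Nvidia is to key the device count by the configured `ComputeDevice` resource name instead of the hard-coded `'nvidia.com/gpu'`. The sketch below is hypothetical (function and parameter names are illustrative, not the actual patch):

```javascript
// Hypothetical sketch: parameterize the hard-coded 'nvidia.com/gpu' key with
// the configured ComputeDevice resource name. Names are assumptions.
function buildResourceLimits(resourcePerInstance, computeDevice, infinibandDevice) {
  return {
    cpu: resourcePerInstance.cpu,
    memory: `${resourcePerInstance.memoryMB}Mi`,
    'github.com/fuse': 1,
    // device count keyed by the configured resource name, e.g. 'amd.com/gpu'
    [computeDevice]: resourcePerInstance.gpu,
    ...(infinibandDevice && { 'rdma/hca': 1 }),
  };
}
```

With this shape, supporting a new device type in the rest server reduces to changing the `ComputeDevice` value recorded in the configmap.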
Besides the necessary work, we (the pai-dev team and device vendors) could provide better support by:
- refactoring and organizing device-related code in `devices` subfolders. The basic idea is to quickly locate device-related code and to isolate code for different devices (e.g. different device vendors should avoid editing the same file). If a component must support diverse types of computing devices, it will contain a `devices` folder. PAI services should take these files into consideration at build time, so that one container can support a list of different machine models. Other components, like the deploy script, should check these files at runtime.
- providing a monitoring tool like `nvidia-smi` and a prometheus exporter
- updating webportal terms
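The `devices` subfolder idea can be sketched as a small registry that maps the configured `ComputeDevice` to a vendor-specific module. Everything below (mapping, paths, function name) is an assumption for illustration:

```javascript
// Hypothetical sketch of the `devices` subfolder layout: resolve which
// device-specific module a service should load from the configured
// ComputeDevice. The mapping and paths are assumptions for illustration.
const deviceModuleByResource = {
  'nvidia.com/gpu': 'devices/nvidia',
  'amd.com/gpu': 'devices/amd',
};

function resolveDeviceModule(computeDevice) {
  const modulePath = deviceModuleByResource[computeDevice];
  if (!modulePath) {
    throw new Error(`No device module registered for ${computeDevice}`);
  }
  return modulePath; // a real service would require() this path
}
```

Because each vendor owns only its own entry and folder, different device vendors never need to edit the same file.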
### Perfect support with HiveD

By enabling HiveD, we could get better support:
- allow multiple device types in a cluster
- support virtual clusters
- topology-aware scheduling to guarantee sharing safety in DL scenarios
Some extra effort is required to achieve this:
- offer a container runtime for every device type. A container runtime is a modified version of runc that adds a custom pre-start hook to all containers. Two examples are nvidia-container-runtime and the runtime for AMD Radeon Open Compute.
- describe machines and devices in `layout.yaml` (replace master.csv / worker.csv by layout.yaml #5151)
- make sure HiveD config generation is independent of computing devices
- add appropriate environment variables in rest-server when generating the pod spec, in addition to `NVIDIA_VISIBLE_DEVICES` and `PAI_AMD_VISIBLE_DEVICES`.
`pai/src/rest-server/src/models/v2/job/k8s.js`, lines 656 to 676 in 2fb370a:

```javascript
if (config.taskRoles[taskRole].resourcePerInstance.gpu > 0) {
  frameworkTaskRole.task.pod.spec.containers[0].env.push(
    {
      name: 'NVIDIA_VISIBLE_DEVICES',
      valueFrom: {
        fieldRef: {
          fieldPath: `metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']`,
        },
      },
    },
    {
      name: 'PAI_AMD_VISIBLE_DEVICES',
      valueFrom: {
        fieldRef: {
          fieldPath: `metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']`,
        },
      },
    },
  );
}
```
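Instead of always pushing both variables, the isolation env-variable name could be looked up per device type. The sketch below is a hypothetical generalization, not the actual fix; the mapping is an assumption:

```javascript
// Hypothetical generalization: derive the isolation env-variable name per
// device type instead of always pushing both NVIDIA_VISIBLE_DEVICES and
// PAI_AMD_VISIBLE_DEVICES. The mapping is an assumption for illustration.
const isolationEnvNameByDevice = {
  'nvidia.com/gpu': 'NVIDIA_VISIBLE_DEVICES',
  'amd.com/gpu': 'PAI_AMD_VISIBLE_DEVICES',
};

function isolationEnv(computeDevice) {
  const name = isolationEnvNameByDevice[computeDevice];
  if (!name) return []; // device type without an isolation variable
  return [
    {
      name,
      valueFrom: {
        fieldRef: {
          // HiveD writes the assigned leaf cells into this pod annotation
          fieldPath: `metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']`,
        },
      },
    },
  ];
}
```

A new vendor would then only register its variable name in the mapping (ideally from its `devices` folder) rather than editing the pod-spec generation code.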
Some optional work items include:
- clarify and unify the machine sku description in `layout.yaml` and HiveD skus
- make the `sku`-to-(cpu, gpu, mem) conversion simple, predictable, and decoupled from devices (CPU/GPU/Memory information to SKU definition API #5148)
- health report for computing devices. This is not mandatory since a node-level health check is already provided by k8s.
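A device-agnostic version of the sku-to-(cpu, gpu, mem) conversion could simply split a machine SKU evenly across its compute devices, whatever their type. The sketch below uses assumed field names, not the actual #5148 API:

```javascript
// Hypothetical sketch: split a machine SKU evenly across its compute
// devices, independent of the device type. Field names are assumptions,
// not the actual #5148 API.
function skuToPerDeviceShare(sku) {
  const n = sku.computeDeviceCount;
  return {
    cpu: Math.floor(sku.cpuCores / n),
    memoryMB: Math.floor(sku.memoryMB / n),
    device: 1, // one compute device of the SKU's type
  };
}
```

Because the function never inspects the device type, the same conversion works for Nvidia GPUs, AMD GPUs, or future NPUs.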