Note: All the configurations listed below are managed within the hami-scheduler-device ConfigMap. You can update these configurations using one of the following methods:

- Directly edit the ConfigMap: If HAMi has already been successfully installed, you can manually update the hami-scheduler-device ConfigMap using the `kubectl edit` command:

  ```bash
  kubectl edit configmap hami-scheduler-device -n <namespace>
  ```

  After making changes, restart the related HAMi components to apply the updated configurations.

- Modify the Helm Chart: Update the corresponding values, then reapply the Helm Chart to regenerate the ConfigMap.
- `nvidia.deviceMemoryScaling`: Float type, default: 1. The ratio for NVIDIA device memory scaling; values greater than 1 enable virtual device memory (experimental feature). For an NVIDIA GPU with M memory, if `nvidia.deviceMemoryScaling` is set to S, the vGPUs split from this GPU will get a total of S * M memory in Kubernetes with the HAMi device plugin.
- `nvidia.deviceSplitCount`: Integer type, default: 10. Maximum number of tasks assigned to a single GPU device.
- `nvidia.migstrategy`: String type, default: "none". "none" ignores MIG features; "mixed" allocates MIG devices as separate resources.
- `nvidia.disablecorelimit`: String type, default: "false". "true" disables the core limit; "false" enables it.
- `nvidia.defaultMem`: Integer type, default: 0. The default device memory of the current task, in MB. 0 means use 100% of device memory.
- `nvidia.defaultCores`: Integer type, default: 0. Percentage of GPU cores reserved for the current task. If set to 0, the task may fit on any GPU with enough device memory. If set to 100, it will use an entire GPU card exclusively. Note: when a container requests `nvidia.com/gpu` and its GPU memory reservation is exclusive (for example, `nvidia.com/gpumem-percentage` is 100, or memory fields are omitted so `nvidia.defaultMem` remains 0 and defaults to 100%), and the pod spec does not set `nvidia.com/gpucores`, HAMi defaults `nvidia.com/gpucores` to 100 during admission. Non-exclusive memory requests, or pods that already set `nvidia.com/gpucores`, remain unchanged.
- `nvidia.defaultGPUNum`: Integer type, default: 1. If the configured value is 0, it does not take effect and is filtered out. When a user does not set the `nvidia.com/gpu` key in the pod resources, the webhook checks the `nvidia.com/gpumem`, `nvidia.com/gpumem-percentage`, and `nvidia.com/gpucores` keys; if any of them has a value, the webhook adds the `nvidia.com/gpu` key with this default value to the resource limits map.
- `nvidia.memoryFactor`: Integer type, default: 1. During resource requests, the actual value of `nvidia.com/gpumem` is multiplied by this factor. If the mock-device-plugin is deployed, the `nvidia.com/gpumem` value in `node.status.capacity` is also amplified by the same multiple.
- `nvidia.resourceCountName`: String type, vGPU number resource name, default: "nvidia.com/gpu"
- `nvidia.resourceMemoryName`: String type, vGPU memory size resource name, default: "nvidia.com/gpumem"
- `nvidia.resourceMemoryPercentageName`: String type, vGPU memory fraction resource name, default: "nvidia.com/gpumem-percentage"
- `nvidia.resourceCoreName`: String type, vGPU cores resource name, default: "nvidia.com/gpucores"
- `nvidia.resourcePriorityName`: String type, vGPU task priority name, default: "nvidia.com/priority"
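A minimal pod spec that exercises these resource names, assuming the default names listed above (the image, memory, and core values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1         # number of vGPUs (nvidia.resourceCountName)
          nvidia.com/gpumem: 4000   # device memory in MB (nvidia.resourceMemoryName)
          nvidia.com/gpucores: 30   # percentage of GPU cores (nvidia.resourceCoreName)
```

If you have renamed any resource via the `nvidia.resource*Name` options, use the renamed keys in the limits map instead.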
HAMi allows configuring per-node behavior for the device plugin. Edit the device plugin ConfigMap:

```bash
kubectl -n kube-system edit cm hami-device-plugin
```

- `name`: Name of the node.
- `operatingmode`: Operating mode of the node, can be "hami-core" or "mig", default: "hami-core".
- `devicememoryscaling`: Overcommit ratio of device memory.
- `devicecorescaling`: Overcommit ratio of device core.
- `devicesplitcount`: Allowed number of tasks sharing a device.
- `filterdevices`: Devices that are not registered to HAMi.
  - `uuid`: UUIDs of devices to ignore.
  - `index`: Indexes of devices to ignore.
  - A device is ignored by HAMi if it is in the `uuid` or `index` list.
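A sketch of what one per-node entry could look like, using the fields listed above. The `nodeconfig` key, node name, and values here are illustrative assumptions; check the structure of your installed hami-device-plugin ConfigMap before editing:

```yaml
nodeconfig:
  - name: gpu-node-1            # must match the Kubernetes node name
    operatingmode: hami-core
    devicememoryscaling: 1.5    # overcommit device memory by 50%
    devicecorescaling: 1.0
    devicesplitcount: 10
    filterdevices:
      uuid:
        - GPU-AAA               # this device will not be registered to HAMi
      index:
        - 0                     # neither will the device at index 0
```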
You can customize your vGPU support by setting the following parameters using `--set`, for example:

```bash
helm install hami hami-charts/hami --set devicePlugin.deviceMemoryScaling=5 ...
```

- `devicePlugin.service.schedulerPort`: Integer type, default: 31998. Scheduler webhook service nodePort.
- `devicePlugin.deviceListStrategy`: String type, default: "envvar". Sets the strategy for exposing devices to containers. "envvar" uses the `NVIDIA_VISIBLE_DEVICES` environment variable, while "cdi-annotations" uses the Container Device Interface (CDI).
- `devicePlugin.nvidiaDriverRoot`: String type. Specifies the root of the NVIDIA driver installation on the host. Used when `deviceListStrategy` is `cdi-annotations`. If not set via Helm, it defaults to `/`.
- `devicePlugin.nvidiaHookPath`: String type. Specifies the path to the `nvidia-ctk` binary on the GPU node. Used when `deviceListStrategy` is `cdi-annotations`. If not set via Helm, it defaults to `/usr/bin/nvidia-ctk`.
- `scheduler.defaultSchedulerPolicy.nodeSchedulerPolicy`: String type, default: "binpack". The GPU node scheduling policy: "binpack" tries to allocate tasks to the same GPU node as much as possible, while "spread" tries to allocate tasks to different GPU nodes as much as possible.
- `scheduler.defaultSchedulerPolicy.gpuSchedulerPolicy`: String type, default: "spread". The GPU scheduling policy: "binpack" tries to allocate tasks to the same GPU as much as possible, while "spread" tries to allocate tasks to different GPUs as much as possible.
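For more than one or two overrides, the same settings can be kept in a values file instead of repeated `--set` flags. A sketch combining the options above (values are illustrative, not recommendations):

```yaml
# values.yaml — apply with: helm install hami hami-charts/hami -f values.yaml
devicePlugin:
  deviceMemoryScaling: 5
  deviceListStrategy: envvar
  service:
    schedulerPort: 31998
scheduler:
  defaultSchedulerPolicy:
    nodeSchedulerPolicy: binpack
    gpuSchedulerPolicy: spread
```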
Webhook Selector Configs
The admission webhook supports flexible label-based filtering through namespaceSelector and objectSelector. By default, the webhook excludes namespaces/pods with the label hami.io/webhook: ignore. You can add additional filtering criteria to control which namespaces and pods the webhook applies to.
- `scheduler.admissionWebhook.namespaceSelector.matchLabels`: Map type, default is empty. Add labels that namespaces must have for the webhook to apply. For example, to only apply the webhook to namespaces with a specific label:

  ```yaml
  scheduler:
    admissionWebhook:
      namespaceSelector:
        matchLabels:
          app.kubernetes.io/part-of: kubeflow-profile
  ```

- `scheduler.admissionWebhook.namespaceSelector.matchExpressions`: Array type, default is empty. Add label selector expressions to filter namespaces. Supported operators: `In`, `NotIn`, `Exists`, `DoesNotExist`. For example:

  ```yaml
  scheduler:
    admissionWebhook:
      namespaceSelector:
        matchExpressions:
          - key: environment
            operator: In
            values:
              - production
              - staging
  ```

- `scheduler.admissionWebhook.objectSelector.matchLabels`: Map type, default is empty. Add labels that pods must have for the webhook to apply. For example, to only apply the webhook to pods managed by HAMi:

  ```yaml
  scheduler:
    admissionWebhook:
      objectSelector:
        matchLabels:
          app.kubernetes.io/managed-by: hami
  ```

- `scheduler.admissionWebhook.objectSelector.matchExpressions`: Array type, default is empty. Add label selector expressions to filter pods. For example, to only apply to pods with a specific opt-in label:

  ```yaml
  scheduler:
    admissionWebhook:
      objectSelector:
        matchExpressions:
          - key: hami.io/enable
            operator: In
            values:
              - "true"
  ```
Webhook TLS Certificate Configs
In Kubernetes, in order for the API server to communicate with the webhook component, the webhook requires a TLS certificate that the API server is configured to trust. HAMi scheduler provides two methods to generate/configure the required TLS certificate.
- `scheduler.patch.enabled`: Boolean type, default: true. If true, Helm will use kube-webhook-certgen (job-patch) to generate a self-signed certificate and create a secret.
- `scheduler.certManager.enabled`: Boolean type, default: false. If true, cert-manager will generate a self-signed certificate. Note: this option requires cert-manager to be installed in your cluster first. See the cert-manager installation documentation for more details.
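To switch from the default job-patch mechanism to cert-manager, the natural move is to flip both flags together so only one certificate source is active (a sketch of the relevant Helm values; verify against your chart version):

```yaml
scheduler:
  patch:
    enabled: false     # turn off the kube-webhook-certgen job
  certManager:
    enabled: true      # let cert-manager issue the webhook certificate
```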
- `nvidia.com/use-gpuuuid`: String type, e.g. "GPU-AAA,GPU-BBB". If set, devices allocated to this pod must be among the UUIDs defined in this string.
- `nvidia.com/nouse-gpuuuid`: String type, e.g. "GPU-AAA,GPU-BBB". If set, devices allocated to this pod will NOT be among the UUIDs defined in this string.
- `nvidia.com/nouse-gputype`: String type, e.g. "Tesla V100-PCIE-32GB, NVIDIA A10". If set, devices allocated to this pod will NOT be among the types defined in this string.
- `nvidia.com/use-gputype`: String type, e.g. "Tesla V100-PCIE-32GB, NVIDIA A10". If set, devices allocated to this pod MUST be one of the types defined in this string.
- `hami.io/node-scheduler-policy`: String type, "binpack" or "spread".
  - binpack: the scheduler will try to allocate the pod to GPU nodes that are already in use.
  - spread: the scheduler will try to allocate the pod to different GPU nodes.
- `hami.io/gpu-scheduler-policy`: String type, "binpack" or "spread".
  - binpack: the scheduler will try to allocate the pod to the same GPU card.
  - spread: the scheduler will try to allocate the pod to different GPU cards.
- `nvidia.com/vgpu-mode`: String type, "hami-core" or "mig". The type of vGPU instance this pod wishes to use.
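These annotations go in the pod metadata. For example, a pod that must land on a V100 card and prefers packing onto already-used nodes while spreading across cards (the image and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-annotated
  annotations:
    nvidia.com/use-gputype: "Tesla V100-PCIE-32GB"
    hami.io/node-scheduler-policy: "binpack"   # pack onto GPU nodes already in use
    hami.io/gpu-scheduler-policy: "spread"     # but spread across GPU cards on that node
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1
```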
- `GPU_CORE_UTILIZATION_POLICY`: String type, "default", "force", or "disable"; default: "default". Currently this parameter can be specified during `helm install` through `--set devices.nvidia.gpuCorePolicy=force`, and is then automatically injected into the container environment variables.
  - "default" means the default utilization policy.
  - "force" means the container will always limit core utilization below "nvidia.com/gpucores".
  - "disable" means the container will ignore the utilization limitation set by "nvidia.com/gpucores" during task execution.
- `CUDA_DISABLE_CONTROL`: Bool type, "true" or "false"; default: false.
  - "true" means HAMi-core will not be used inside the container; as a result, there will be no resource isolation or limitation in that container. For debugging only.
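Since these are ordinary container environment variables, one way to set them for a single workload (rather than cluster-wide via Helm) is in the container spec itself. A sketch, assuming HAMi-core reads the variables from the container environment:

```yaml
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      env:
        - name: GPU_CORE_UTILIZATION_POLICY
          value: "force"          # always enforce the nvidia.com/gpucores limit
        - name: CUDA_DISABLE_CONTROL
          value: "false"          # keep HAMi-core isolation enabled
```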