fix: correct device filter initialization order #857
Merged
archlitchi merged 1 commit into Project-HAMi:master on Feb 24, 2025
Ensure DevicePluginFilterDevice is initialized before FilterDeviceToRegister. This fixes the initialization sequence to make device filtering work.

Signed-off-by: Nimbus318 <136771156+Nimbus318@users.noreply.github.com>
47f2dea to 365dbea
/lgtm
archlitchi added a commit that referenced this pull request on May 6, 2025
* Fix: Update handling of version strings in Helm template and helpers.tpl (#845)
* Update condition to include regexReplaceAll for outputting proper numbers from minor versions
  Signed-off-by: HJJ256 <harshjalan27@yahoo.com>
* Update condition
  Signed-off-by: HJJ256 <harshjalan27@yahoo.com>
* Update definition of strippedKubeVersion to handle variety of version numbering systems
  Signed-off-by: HJJ256 <harshjalan27@yahoo.com>
* Update job-createSecret.yaml
  Signed-off-by: HJJ256 <harshjalan27@yahoo.com>
* Update job-patchWebhook.yaml
  Signed-off-by: HJJ256 <harshjalan27@yahoo.com>

---------

Signed-off-by: HJJ256 <harshjalan27@yahoo.com>

* Update libvgpu.so (#876)
* update libvgpu
  Signed-off-by: limengxuan <391013634@qq.com>
* fix: disable passDeviceSpecsEnabled by default (#872)
  Due to potential pod startup issues in certain environments, set passDeviceSpecsEnabled to false by default. This configuration can still be enabled via helm values for environments that need it to handle runtime GPU access issues.
  Signed-off-by: Nimbus318 <136771156+Nimbus318@users.noreply.github.com>
* fix: Remove the pkg/k8sutil/client.go and replace it with HAMi/pkg/util/client in pkg/scheduler/scheduler.go (#681)
  Signed-off-by: Shouren Yang <yangshouren@gmail.com>
* fix: correct device filter initialization order (#857)
  Ensure DevicePluginFilterDevice is initialized before FilterDeviceToRegister. This fixes the initialization sequence to make device filtering work.
  Signed-off-by: Nimbus318 <136771156+Nimbus318@users.noreply.github.com>
* fix parseNvidiaNumaInfo index out of range (#889)
  Signed-off-by: bin <bin.pan@daocloud.io>
* fix conflict
  Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
* fix conflict
  Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
* fix ubuntu base image (#944)
  Signed-off-by: bin <bin.pan@daocloud.io>
* fix: Add error handling for nvml.Init in NvidiaDevicePlugin (#982)
  Signed-off-by: yxxhero <aiopsclub@163.com>
* Bump golang.org/x/net from 0.26.0 to 0.33.0
  Bumps [golang.org/x/net](https://github.com/golang/net) from 0.26.0 to 0.33.0.
  - [Commits](golang/net@v0.26.0...v0.33.0)
  ---
  updated-dependencies:
  - dependency-name: golang.org/x/net
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
* Fix Dockerfile to make CI pass (#846)
* update dockerfile
  Signed-off-by: limengxuan <391013634@qq.com>

---------

Signed-off-by: HJJ256 <harshjalan27@yahoo.com>
Signed-off-by: limengxuan <391013634@qq.com>
Signed-off-by: Nimbus318 <136771156+Nimbus318@users.noreply.github.com>
Signed-off-by: Shouren Yang <yangshouren@gmail.com>
Signed-off-by: bin <bin.pan@daocloud.io>
Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Signed-off-by: yxxhero <aiopsclub@163.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Harsh Jaykumar Jalan <harshjalan27@yahoo.com>
Co-authored-by: 霓漠Nimbus <136771156+Nimbus318@users.noreply.github.com>
Co-authored-by: Shouren Yang <yangshouren@gmail.com>
Co-authored-by: bin.pan <bin.pan@daocloud.io>
Co-authored-by: yxxhero <11087727+yxxhero@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
archlitchi added a commit that referenced this pull request on May 6, 2025

The commit message repeats the first referenced commit above and adds one entry after `update dockerfile`:

* update version
  Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
archlitchi added a commit that referenced this pull request on May 6, 2025

The commit message repeats the first referenced commit above and adds two entries after `update dockerfile`, plus two trailers:

* update version
  Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
* Optimize E2E with pod status check (#847)
  Signed-off-by: wen.rui <wen.rui@daocloud.io>

Signed-off-by: wen.rui <wen.rui@daocloud.io>
Co-authored-by: Rei1010 <56469400+Rei1010@users.noreply.github.com>
This PR fixes an initialization order issue: FilterDeviceToRegister always returns false because DevicePluginFilterDevice has not been initialized when the function is called. NewNVMLResourceManagers, which uses FilterDeviceToRegister, runs before NewNvidiaDevicePlugin reads the config file.
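The ordering problem described above can be illustrated with a minimal, self-contained sketch. This is not the HAMi code: every name below (`filterConfig`, `devicePluginFilterDevice`, `filterDeviceToRegister`, `buildDevices`, `loadConfig`, and the `GPU-0`/`GPU-1` identifiers) is a hypothetical stand-in, and the sketch assumes the filter is a package-level value that stays nil until the config is read.

```go
package main

import "fmt"

// filterConfig is a hypothetical stand-in for filter settings read from a config file.
type filterConfig struct {
	UUIDs []string // devices that should NOT be registered
}

// devicePluginFilterDevice is nil until loadConfig runs (assumed, for illustration).
var devicePluginFilterDevice *filterConfig

// filterDeviceToRegister reports whether a device should be filtered out.
// With an uninitialized filter it can only answer false.
func filterDeviceToRegister(uuid string) bool {
	if devicePluginFilterDevice == nil {
		return false
	}
	for _, u := range devicePluginFilterDevice.UUIDs {
		if u == uuid {
			return true
		}
	}
	return false
}

// buildDevices stands in for resource-manager construction: it keeps every
// device the filter does not reject.
func buildDevices(all []string) []string {
	var kept []string
	for _, u := range all {
		if !filterDeviceToRegister(u) {
			kept = append(kept, u)
		}
	}
	return kept
}

// loadConfig stands in for reading the config file and populating the filter.
func loadConfig() {
	devicePluginFilterDevice = &filterConfig{UUIDs: []string{"GPU-1"}}
}

func main() {
	all := []string{"GPU-0", "GPU-1"}

	// Buggy order: devices are enumerated before the config is loaded,
	// so nothing is filtered.
	buggy := buildDevices(all)

	// Fixed order: load the config first, then enumerate.
	loadConfig()
	fixed := buildDevices(all)

	fmt.Println(buggy) // [GPU-0 GPU-1]
	fmt.Println(fixed) // [GPU-0]
}
```

In this sketch the fix is purely a matter of call order: the filter logic itself is unchanged, which matches the PR's framing of the change as a reordering of configuration loading relative to plugin initialization.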
What type of PR is this?
/kind bug
What this PR does / why we need it:
Refactor nvidia device plugin configuration loading:
This change improves code organization by separating config loading from plugin initialization.
Which issue(s) this PR fixes:
Fixes #856
Special notes for your reviewer:
Testing:
This fix ensures the device filtering feature works as intended without any side effects on existing functionality.
Does this PR introduce a user-facing change?:
No