fix: correct device filter initialization order #857

Merged
archlitchi merged 1 commit into Project-HAMi:master from Nimbus318:fix/device-filter-init-order on Feb 24, 2025
Nimbus318 (Contributor) commented Feb 10, 2025

Found an initialization order issue where FilterDeviceToRegister always returns false because DevicePluginFilterDevice is not initialized when the function is called. This happens because NewNVMLResourceManagers (which uses FilterDeviceToRegister) is called before the config file is read in NewNvidiaDevicePlugin.
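
To make the failure mode concrete, here is a minimal sketch of the ordering problem. It is not the HAMi code: `FilterDeviceToRegister` and `DevicePluginFilterDevice` reuse the names from this PR, but their bodies, the `FilterDevice` struct, and the helpers `buildDeviceList` (standing in for `NewNVMLResourceManagers`) and `loadConfig` (standing in for the config read) are hypothetical simplifications.

```go
package main

import "fmt"

// FilterDevice is a hypothetical, trimmed-down filter config.
type FilterDevice struct {
	UUID []string
}

// DevicePluginFilterDevice stays nil until the config has been read.
var DevicePluginFilterDevice *FilterDevice

// FilterDeviceToRegister reports whether a device should be skipped.
// With an uninitialized filter it can only return false.
func FilterDeviceToRegister(uuid string) bool {
	if DevicePluginFilterDevice == nil {
		return false
	}
	for _, u := range DevicePluginFilterDevice.UUID {
		if u == uuid {
			return true
		}
	}
	return false
}

// buildDeviceList is a stand-in for the device enumeration step:
// it consults the filter for every device it sees.
func buildDeviceList(all []string) []string {
	var kept []string
	for _, uuid := range all {
		if !FilterDeviceToRegister(uuid) {
			kept = append(kept, uuid)
		}
	}
	return kept
}

// loadConfig is a stand-in for reading the config file.
func loadConfig() {
	DevicePluginFilterDevice = &FilterDevice{UUID: []string{"GPU-b"}}
}

func main() {
	all := []string{"GPU-a", "GPU-b"}

	// Buggy order: devices are enumerated before the config is read,
	// so nothing is filtered.
	early := buildDeviceList(all)

	// Correct order: config first, then enumeration.
	loadConfig()
	late := buildDeviceList(all)

	fmt.Println(len(early), len(late)) // prints "2 1"
}
```

In this sketch the filter silently does nothing in the first call, which matches the symptom described above: no error, just unfiltered devices.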

What type of PR is this?
/kind bug

What this PR does / why we need it:
Refactor nvidia device plugin configuration loading:

  • Extract config loading from NewNvidiaDevicePlugin into LoadNvidiaDevicePluginConfig
  • Pass pre-loaded config and mode to NewNvidiaDevicePlugin

This change improves code organization by separating config loading from plugin initialization.
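
A rough sketch of the shape of this refactor, under stated assumptions: `loadPluginConfig` plays the role of `LoadNvidiaDevicePluginConfig` and `newDevicePlugin` plays the role of `NewNvidiaDevicePlugin`, but the signatures, the `pluginConfig` fields, the JSON format, and the mode value are all hypothetical and chosen only to keep the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pluginConfig is a hypothetical, trimmed-down plugin config.
type pluginConfig struct {
	Mode        string   `json:"mode"`
	FilterUUIDs []string `json:"filterUUIDs"`
}

// loadPluginConfig parses the config up front, before any device is enumerated.
func loadPluginConfig(raw []byte) (*pluginConfig, error) {
	var cfg pluginConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("parse device plugin config: %w", err)
	}
	return &cfg, nil
}

// devicePlugin is a hypothetical stand-in for the plugin object.
type devicePlugin struct {
	mode    string
	devices []string
}

// newDevicePlugin receives an already-loaded config, so the filter is
// guaranteed to be populated by the time devices are registered.
func newDevicePlugin(cfg *pluginConfig, all []string) *devicePlugin {
	skip := make(map[string]bool, len(cfg.FilterUUIDs))
	for _, u := range cfg.FilterUUIDs {
		skip[u] = true
	}
	p := &devicePlugin{mode: cfg.Mode}
	for _, uuid := range all {
		if !skip[uuid] {
			p.devices = append(p.devices, uuid)
		}
	}
	return p
}

func main() {
	// "example-mode" is a placeholder, not a real plugin mode.
	raw := []byte(`{"mode":"example-mode","filterUUIDs":["GPU-b"]}`)

	cfg, err := loadPluginConfig(raw)
	if err != nil {
		fmt.Println(err)
		return
	}
	p := newDevicePlugin(cfg, []string{"GPU-a", "GPU-b"})
	fmt.Println(p.mode, p.devices)
}
```

Passing the loaded config into the constructor makes the dependency explicit: the constructor cannot run without a config, so the ordering mistake is no longer possible at the call site.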

Previous attempt
The previous attempt using InitDeviceFilter was not ideal. This new approach provides a cleaner separation of concerns.

Which issue(s) this PR fixes:
Fixes #856

Special notes for your reviewer:
Testing:

  1. Verified device filtering works correctly with test config
  2. Confirmed existing functionality remains unchanged
  3. Added device filter config works as expected
    [Screenshots: three CleanShot captures, 2025-02-10]

This fix ensures the device filtering feature works as intended without any side effects on existing functionality.

Does this PR introduce a user-facing change?:
No

wawa0210 added the kind/bug (Something isn't working) label Feb 11, 2025
Ensure DevicePluginFilterDevice is initialized before FilterDeviceToRegister.
This fixes the initialization sequence to make device filtering work.

Signed-off-by: Nimbus318 <136771156+Nimbus318@users.noreply.github.com>
archlitchi (Member) commented

/lgtm

archlitchi merged commit 8cb8f03 into Project-HAMi:master Feb 24, 2025
ouyangluwei163 mentioned this pull request Apr 8, 2025 (45 tasks)
archlitchi mentioned this pull request May 6, 2025 (11 tasks)
archlitchi added a commit that referenced this pull request May 6, 2025
* Fix: Update handling of version strings in Helm template and helpers.tpl (#845)

* Update condition to include regexReplaceAll for outputting proper numbers from minor versions

Signed-off-by: HJJ256 <harshjalan27@yahoo.com>

* Update condition

Signed-off-by: HJJ256 <harshjalan27@yahoo.com>

* Update definition of strippedKubeVersion to handle variety of version numbering systems

Signed-off-by: HJJ256 <harshjalan27@yahoo.com>

* Update job-createSecret.yaml

Signed-off-by: HJJ256 <harshjalan27@yahoo.com>

* Update job-patchWebhook.yaml

Signed-off-by: HJJ256 <harshjalan27@yahoo.com>

---------

Signed-off-by: HJJ256 <harshjalan27@yahoo.com>

* Update libvgpu.so (#876)

* update libvgpu

Signed-off-by: limengxuan <391013634@qq.com>

* fix: disable passDeviceSpecsEnabled by default (#872)

Due to potential pod startup issues in certain environments,
set passDeviceSpecsEnabled to false by default.
This configuration can still be enabled via helm values
for environments that need it to handle runtime GPU access issues.

Signed-off-by: Nimbus318 <136771156+Nimbus318@users.noreply.github.com>

* fix: Remove the pkg/k8sutil/client.go and replace it with HAMi/pkg/util/client in pkg/scheduler/scheduler.go (#681)

Signed-off-by: Shouren Yang <yangshouren@gmail.com>

* fix: correct device filter initialization order (#857)

Ensure DevicePluginFilterDevice is initialized before FilterDeviceToRegister.
This fixes the initialization sequence to make device filtering work.

Signed-off-by: Nimbus318 <136771156+Nimbus318@users.noreply.github.com>

* fix parseNvidiaNumaInfo index out of range (#889)

Signed-off-by: bin <bin.pan@daocloud.io>

* fix conflict

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* fix conflict

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* fix ubuntu base image (#944)

Signed-off-by: bin <bin.pan@daocloud.io>

* fix: Add error handling for nvml.Init in NvidiaDevicePlugin (#982)

Signed-off-by: yxxhero <aiopsclub@163.com>

* Bump golang.org/x/net from 0.26.0 to 0.33.0

Bumps [golang.org/x/net](https://github.com/golang/net) from 0.26.0 to 0.33.0.
- [Commits](golang/net@v0.26.0...v0.33.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Fix Dockerfile to make CI pass (#846)

* update dockerfile

Signed-off-by: limengxuan <391013634@qq.com>

---------

Signed-off-by: HJJ256 <harshjalan27@yahoo.com>
Signed-off-by: limengxuan <391013634@qq.com>
Signed-off-by: Nimbus318 <136771156+Nimbus318@users.noreply.github.com>
Signed-off-by: Shouren Yang <yangshouren@gmail.com>
Signed-off-by: bin <bin.pan@daocloud.io>
Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Signed-off-by: yxxhero <aiopsclub@163.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Harsh Jaykumar Jalan <harshjalan27@yahoo.com>
Co-authored-by: 霓漠Nimbus <136771156+Nimbus318@users.noreply.github.com>
Co-authored-by: Shouren Yang <yangshouren@gmail.com>
Co-authored-by: bin.pan <bin.pan@daocloud.io>
Co-authored-by: yxxhero <11087727+yxxhero@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
archlitchi added a commit that referenced this pull request May 6, 2025
Same commit message as the previous referencing commit, with one additional entry:

* update version

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
archlitchi added a commit that referenced this pull request May 6, 2025
Same commit message as the previous referencing commit, with one additional entry:

* Optimize E2E with pod status check (#847)

Signed-off-by: wen.rui <wen.rui@daocloud.io>
Co-authored-by: Rei1010 <56469400+Rei1010@users.noreply.github.com>
Linked issue: Device filtering not working due to incorrect initialization order