Skip to content

fix: surface reconcileUsersAcl errors in cluster status#140

Merged
jdheyburn merged 1 commit into
valkey-io:mainfrom
daanvinken:fix/acl-missing-secret-status
Apr 22, 2026
Merged

fix: surface reconcileUsersAcl errors in cluster status#140
jdheyburn merged 1 commit into
valkey-io:mainfrom
daanvinken:fix/acl-missing-secret-status

Conversation

@daanvinken
Copy link
Copy Markdown
Contributor

@daanvinken daanvinken commented Apr 15, 2026

This PR closes #139

Summary

When a user in spec.users references a password secret that doesn't exist, reconcileUsersAcl logs an error but skips the user and returns nil. The ACL file is written without that user, and the cluster reports Ready. I think it makes sense to fail hard.

Additionally, when reconcileUsersAcl does return an error (e.g. internal secret creation fails), the condition is set but updateStatus is never called here, so the error isn't persisted to the status either.

Implementation

  • users.go: replace continue with return fmt.Errorf(...) when fetchUserPasswords fails
  • valkeycluster_controller.go: add missing updateStatus call on the reconcileUsersAcl error path

Testing

Before the fix, a missing password secret is invisible:

NAME            STATE   REASON           AGE
acl-fail-test   Ready   ClusterHealthy   10m

After the fix:

 ༼ つ ▀_▀ ༽つ  ~  src  valkey-ope  k describe valkeycluster acl-fail-test | tail -n 15                        fix/acl-missing-secret-status  2✎  5+  ⎈ kind-valkey-repro  21:55:45
    Last Transition Time:  2026-04-15T19:51:06Z
    Message:               user testuser: no password or reference found
    Observed Generation:   1
    Reason:                UsersACLError
    Status:                False
    Type:                  Ready
  Message:                 user testuser: no password or reference found
  Ready Shards:            0
  Reason:                  UsersACLError
  Shards:                  0
  State:                   Failed
Events:
  Type    Reason          Age    From                      Message
  ----    ------          ----   ----                      -------
  Normal  ServiceCreated  4m46s  valkeycluster-controller  Created headless Service
 ༼ つ ▀_▀ ༽つ  ~  src  valkey-ope  k get valkeycluster                                                        fix/acl-missing-secret-status  2✎  5+  ⎈ kind-valkey-repro  21:55:53
NAME            STATE    REASON          AGE
acl-fail-test   Failed   UsersACLError   4m51s

Limitations

None.

Checklist

  • This Pull Request is related to one issue.
  • Commit message explains what changed and why
  • Tests are added or updated.
  • [NA] Documentation files are updated.
  • I have run pre-commit locally (pre-commit run --all-files or hooks on commit)

@daanvinken daanvinken force-pushed the fix/acl-missing-secret-status branch from b0df181 to 6a7b182 Compare April 15, 2026 20:01
@daanvinken daanvinken changed the title fix: surface reconcileUsersAcl errors in cluster status (#139) fix: surface reconcileUsersAcl errors in cluster status Apr 15, 2026
@daanvinken
Copy link
Copy Markdown
Contributor Author

daanvinken commented Apr 15, 2026

I considered failing soft (skip the user, set a Degraded condition and emit a warning event) so a missing secret doesn't block the whole cluster.

However, the Degraded condition gets cleared when the cluster reaches Ready at the end of the reconcile, so the warning disappears from status. Checking for any warnings before marking the cluster as healthy feels messy, as we'd be adding semantics where Degraded means different things depending on who set it..

And the warning event fires every reconcile cycle unless you add dedup logic, but the above problem persists. The hard error is simpler and keeps the problem visible.

Any input here is welcome!

@daanvinken daanvinken force-pushed the fix/acl-missing-secret-status branch from 6a7b182 to bd86d50 Compare April 20, 2026 08:02
@jdheyburn
Copy link
Copy Markdown
Collaborator

@daanvinken I ran CI checks, would you be able to take a look at those that have failed please?

reconcileUsersAcl silently skipped users with missing password
secrets and returned nil, causing the cluster to report Ready
with incomplete ACLs. Return an error instead so the cluster
status reflects the misconfiguration.

Also add the missing updateStatus call on the reconcileUsersAcl
error path so the condition is persisted to the API server.

Skip password lookup for disabled users since they don't need
credentials.

Signed-off-by: Daan Vinken <daanvinken@tythus.com>
@daanvinken daanvinken force-pushed the fix/acl-missing-secret-status branch from bd86d50 to ff0ab55 Compare April 20, 2026 19:58
@daanvinken
Copy link
Copy Markdown
Contributor Author

Yeah should be good now!

Copy link
Copy Markdown
Collaborator

@jdheyburn jdheyburn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@jdheyburn jdheyburn merged commit 7ef0469 into valkey-io:main Apr 22, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

reconcileUsersAcl silently skips users with missing password secrets

2 participants