KubernetesHook: add AWS exec-auth botocore guardrails for EKS token flow (#60943)#61936
KubernetesHook: add AWS exec-auth botocore guardrails for EKS token flow (#60943)#61936Vamsi-klu wants to merge 2 commits into
Conversation
|
Static tests are failing @Vamsi-klu |
|
@Srabasti i have updated the PR with the relevant checks. Can you please review this and let me know your feedback? Thanks! |
| _AWS_EXEC_AUTH_VERSION_CHECK_MODE_FIELD = "exec_auth_aws_cli_version_check_mode" | ||
| _AWS_EXEC_AUTH_VERSION_CHECK_MODES = {"warn", "fail", "ignore"} | ||
| _AWS_EXEC_AUTH_AWS_BINARY_NAMES = {"aws", "aws.exe", "aws2", "aws2.exe"} | ||
| _AWS_EXEC_AUTH_FIXED_BOTOCORE_VERSION = (1, 40, 2) | ||
| _BOTOCORE_VERSION_PATTERN = re.compile(r"botocore/(?P<version>\d+(?:\.\d+){1,2})") |
There was a problem hiding this comment.
I do not understand why AKS specific settings are introduced in K8s provider package (as well as RST docs above and tests below). Why is this not makde in the AWS specific package?
I my view K8s standard package should not be tainted by AWS, Google or Azure specific handling if no K8s standard.
There was a problem hiding this comment.
Thanks for the feedback @jscheffl, that's a fair point about keeping the K8s provider cloud agnostic.
The reason this was placed in KubernetesHook is that the vulnerability affects any kubeconfig using aws eks get-token exec auth, not just users going through EksHook or EksPodOperator. Many users configure a generic Kubernetes connection with a kubeconfig file that happens to use AWS exec auth and they never interact with the AWS provider at all.
That said, I agree AWS specific constants and detection logic shouldn't live here. How about this approach:
In the K8s provider, add a minimal, generic exec auth validation hook point in KubernetesHook.get_conn() (for example a discoverable entry point or registry pattern like _validate_exec_auth(kubeconfig, context)). No AWS/GCP/Azure specific code, just a generic extension mechanism.
In the AWS provider, register a validator that handles the AWS specific detection (aws eks get-token binary matching, botocore version parsing, subprocess call) and the connection extra field (exec_auth_aws_cli_version_check_mode). All AWS constants, helpers, tests, and docs move there.
This way the K8s hook stays cloud agnostic while still protecting users who use generic KubernetesHook with an EKS kubeconfig. It also makes the pattern extensible so any other provider could register their own exec auth validators in the future.
Would this approach work for you, or would you prefer everything moved entirely into the AWS provider package?
There was a problem hiding this comment.
Not an expert here... @o-nikolas / @eladkal how were such things handled in the past if cloud-provider specific stuff needed to be integrated into a base provider package?
There was a problem hiding this comment.
Off the top of my head I can't think of previous case quite like this. Can you @eladkal?
I see both sides, I don't like AWS stuff in the K8s hook, but I also see the point @Vamsi-klu mentions. And the workaround sounds quite a bit more complicated than the change already is, which I don't love 😬
There was a problem hiding this comment.
I really don't understand what we are trying to solve here.
but if k8s provider needs optional stuff from amazon provider it should have set it as optional dependency
There was a problem hiding this comment.
I'm pretty hesitant to add this much code to add a min version check like this, especially one that isn't really directly related to this provider.
The lower bound in the amazon provider is already higher than what we'd defend against here. And, our constraint files are already putting in a higher version. (2.11.1 has the new version, 2.11.0 did not, fwiw)
If we went with the proposed generic validator + validate botocore version in the aws provider, we'd not have a problem because of the min version over there (ignoring when that min was bumped over there).
I'm just wondering if this is now not really a likely issue with the latest versions of stuff already...
There was a problem hiding this comment.
I think we better just close this as won't fix. This is just too much to accommodate older Boto versions where it's very unlikely that anyone needs this by now.
There was a problem hiding this comment.
I'll close this PR and focus on other issues . Thanks everyone who commented your thoughts on this PR
|
Collaborators for this PR: @codingrealitylabs and @girlcoder-gaming. They helped me raise this PR. |
Why this change
Issue #60943 reports intermittent KubernetesPodOperator task failures on Celery workers when multiple tasks start together and kubeconfig uses
aws eks get-tokenexec auth.The failure mode is subtle:
aws eks get-token) can fail due to older botocore race behavior around~/.aws/cli/cache403 ForbiddenThis PR adds explicit runtime guardrails for that path so operators get a clear signal before task execution fails in a misleading way.
Impact of the change
This adds a policy-driven runtime check only when kubeconfig exec auth actually uses
aws eks get-token:warn(default): emits an actionable warning if botocore is vulnerable (< 1.40.2) or version cannot be detectedfail: hard-fails early with a clear error to enforce platform policyignore: bypasses the check when users intentionally manage this externallyOperational impact:
403fail) without forcing everyone into that modewarnScope and non-goals
aws eks get-token) because this is the concrete failing path in Race Condition in AWS CLI Cache Creation During Parallel KubernetesPodOperator Authentication #60943.403responses, and does not change auth flow for non-AWS exec plugins.Configuration
New Kubernetes connection extra:
exec_auth_aws_cli_version_check_mode:warn(default) |fail|ignoreValidation
aws eks get-token)aws --versionwarn,fail,ignore, invalid fallback)get_connand default kubeconfig client pathAIRFLOW_HOME=/tmp/airflow-60943 uv run --python 3.12 -m pytest providers/cncf/kubernetes/tests/unit/cncf/kubernetes/hooks/test_kubernetes.py -qcloses #60943