KubernetesHook: add AWS exec-auth botocore guardrails for EKS token flow (#60943) #61936

Closed
Vamsi-klu wants to merge 2 commits into apache:main from Vamsi-klu:codex/60943-exec-auth-version-guardrails

Conversation

@Vamsi-klu

Contributor

Why this change

Issue #60943 reports intermittent KubernetesPodOperator task failures on Celery workers when multiple tasks start together and kubeconfig uses aws eks get-token exec auth.

The failure mode is subtle:

  • the auth subprocess (aws eks get-token) can fail due to older botocore race behavior around ~/.aws/cli/cache
  • Kubernetes client then proceeds with invalid/empty auth and surfaces a generic 403 Forbidden
  • this looks identical to real RBAC failures, so operators often lose time debugging the wrong problem

This PR adds explicit runtime guardrails for that path so operators get a clear signal before task execution fails in a misleading way.

Impact of the change

This adds a policy-driven runtime check only when kubeconfig exec auth actually uses aws eks get-token:

  • warn (default): emits an actionable warning if botocore is vulnerable (< 1.40.2) or version cannot be detected
  • fail: hard-fails early with a clear error to enforce platform policy
  • ignore: bypasses the check when users intentionally manage this externally
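
The three modes above can be sketched as a small dispatch function. This is a minimal illustration, not the PR's actual code: `apply_policy`, `ExecAuthVersionError`, and the string return values are hypothetical names, and falling back to `warn` for an unrecognized mode is an assumption based on the "invalid fallback" test case mentioned in the Validation section.

```python
import logging

log = logging.getLogger(__name__)

VALID_MODES = {"warn", "fail", "ignore"}
FIXED_BOTOCORE_VERSION = (1, 40, 2)  # first version described as not vulnerable


class ExecAuthVersionError(RuntimeError):
    """Hypothetical error raised in fail mode."""


def apply_policy(mode, botocore_version):
    """Apply the check mode to a detected botocore version.

    botocore_version is a tuple of ints, or None when it could not be detected.
    Returns "ignored", "ok", or "warned"; raises ExecAuthVersionError in fail mode.
    """
    mode = (mode or "warn").strip().lower()
    if mode not in VALID_MODES:
        # Assumption: an invalid value falls back to the default mode.
        log.warning("Unknown check mode %r, falling back to 'warn'", mode)
        mode = "warn"
    if mode == "ignore":
        return "ignored"
    if botocore_version is not None and botocore_version >= FIXED_BOTOCORE_VERSION:
        return "ok"
    message = (
        "aws eks get-token exec auth detected with botocore "
        f"{botocore_version or 'unknown'}; versions below 1.40.2 may fail "
        "under concurrent task starts and surface as 403 Forbidden."
    )
    if mode == "fail":
        raise ExecAuthVersionError(message)
    log.warning(message)
    return "warned"
```

With this shape, `apply_policy("warn", (1, 39, 0))` logs a warning and continues, while `apply_policy("fail", None)` raises before any Kubernetes call is made.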

Operational impact:

  • Improves diagnosability of a production issue that often appears as ambiguous 403
  • Reduces MTTR by surfacing root-cause guidance at connection/auth setup time
  • Adds governance controls for teams that need strict enforcement (fail) without forcing everyone into that mode
  • Keeps backwards compatibility with default warn

Scope and non-goals

Configuration

New Kubernetes connection extra:

  • exec_auth_aws_cli_version_check_mode: warn (default) | fail | ignore

Validation

  • Added unit coverage for:
    • kubeconfig exec-auth detection (aws eks get-token)
    • botocore version parsing from aws --version
    • mode behavior (warn, fail, ignore, invalid fallback)
    • integration points in get_conn and default kubeconfig client path
  • Test command used:
    • AIRFLOW_HOME=/tmp/airflow-60943 uv run --python 3.12 -m pytest providers/cncf/kubernetes/tests/unit/cncf/kubernetes/hooks/test_kubernetes.py -q
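
For readers unfamiliar with the detection step covered by these tests, the following sketch shows one way to recognize `aws eks get-token` exec auth in a parsed kubeconfig. It relies on the standard kubeconfig layout (`users[].user.exec.command` and `args`); the function name `uses_aws_eks_get_token` is hypothetical and the PR's real helper may differ.

```python
import re
from typing import Any, Dict, Optional

# Binary names to match; mirrors the set quoted in the review thread.
AWS_BINARY_NAMES = {"aws", "aws.exe", "aws2", "aws2.exe"}


def uses_aws_eks_get_token(kubeconfig: Dict[str, Any], user_name: Optional[str] = None) -> bool:
    """Return True if a kubeconfig user authenticates via `aws eks get-token`.

    If user_name is given, only that user entry is inspected.
    """
    for entry in kubeconfig.get("users") or []:
        if user_name is not None and entry.get("name") != user_name:
            continue
        exec_cfg = (entry.get("user") or {}).get("exec") or {}
        command = str(exec_cfg.get("command") or "")
        # Take the last path component, accepting both / and \ separators.
        binary = re.split(r"[\\/]", command)[-1].lower()
        args = [str(arg) for arg in exec_cfg.get("args") or []]
        # Global options such as --region may precede the subcommand,
        # so look for the adjacent pair "eks", "get-token" anywhere in args.
        has_get_token = any(a == "eks" and b == "get-token" for a, b in zip(args, args[1:]))
        if binary in AWS_BINARY_NAMES and has_get_token:
            return True
    return False
```

Restricting the check to configs where this returns True is what keeps the guardrail inactive for non-EKS clusters.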

closes #60943

@Srabasti

Contributor

Static tests are failing @Vamsi-klu
Please run prek locally in your branch. Prek will fix any formatting errors, and then you can push the commit from your branch.

@Vamsi-klu

Contributor Author

@Srabasti I have updated the PR with the relevant checks. Could you please review it and share your feedback? Thanks!

Comment on lines +63 to +67
_AWS_EXEC_AUTH_VERSION_CHECK_MODE_FIELD = "exec_auth_aws_cli_version_check_mode"
_AWS_EXEC_AUTH_VERSION_CHECK_MODES = {"warn", "fail", "ignore"}
_AWS_EXEC_AUTH_AWS_BINARY_NAMES = {"aws", "aws.exe", "aws2", "aws2.exe"}
_AWS_EXEC_AUTH_FIXED_BOTOCORE_VERSION = (1, 40, 2)
_BOTOCORE_VERSION_PATTERN = re.compile(r"botocore/(?P<version>\d+(?:\.\d+){1,2})")
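
As a rough illustration of how the pattern and the fixed-version tuple above might be combined, here is a minimal sketch. The helper names `parse_botocore_version` and `needs_attention` are hypothetical, and the sample strings in the usage note are illustrative rather than captured CLI output. Treating an undetectable version as needing attention follows the PR description's warn-when-undetected behavior.

```python
import re
from typing import Optional, Tuple

_AWS_EXEC_AUTH_FIXED_BOTOCORE_VERSION = (1, 40, 2)
_BOTOCORE_VERSION_PATTERN = re.compile(r"botocore/(?P<version>\d+(?:\.\d+){1,2})")


def parse_botocore_version(aws_version_output: str) -> Optional[Tuple[int, ...]]:
    """Extract the botocore version from `aws --version` style output, if present."""
    match = _BOTOCORE_VERSION_PATTERN.search(aws_version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group("version").split("."))


def needs_attention(version: Optional[Tuple[int, ...]]) -> bool:
    """True when the version is unknown or older than the fixed release."""
    if version is None:
        return True
    # Pad two-part versions such as (1, 40) to (1, 40, 0) before comparing.
    padded = version + (0,) * (3 - len(version))
    return padded < _AWS_EXEC_AUTH_FIXED_BOTOCORE_VERSION
```

For example, a string ending in `botocore/1.40.0` parses to `(1, 40, 0)` and needs attention, while output with no `botocore/` token at all yields `None` and is handled the same way.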

Contributor

I do not understand why AWS-specific settings are introduced in the K8s provider package (as well as the RST docs above and the tests below). Why is this not made in the AWS-specific package?

In my view, the K8s standard package should not be tainted by AWS-, Google-, or Azure-specific handling that is not part of the K8s standard.

Contributor Author

Thanks for the feedback @jscheffl, that's a fair point about keeping the K8s provider cloud agnostic.

The reason this was placed in KubernetesHook is that the vulnerability affects any kubeconfig using aws eks get-token exec auth, not just users going through EksHook or EksPodOperator. Many users configure a generic Kubernetes connection with a kubeconfig file that happens to use AWS exec auth and they never interact with the AWS provider at all.

That said, I agree AWS specific constants and detection logic shouldn't live here. How about this approach:

  1. In the K8s provider, add a minimal, generic exec auth validation hook point in KubernetesHook.get_conn() (for example a discoverable entry point or registry pattern like _validate_exec_auth(kubeconfig, context)). No AWS/GCP/Azure specific code, just a generic extension mechanism.
  2. In the AWS provider, register a validator that handles the AWS specific detection (aws eks get-token binary matching, botocore version parsing, subprocess call) and the connection extra field (exec_auth_aws_cli_version_check_mode). All AWS constants, helpers, tests, and docs move there.

This way the K8s hook stays cloud agnostic while still protecting users who use generic KubernetesHook with an EKS kubeconfig. It also makes the pattern extensible so any other provider could register their own exec auth validators in the future.
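
To make the proposal concrete, here is a minimal sketch of the registry variant. Everything in it is hypothetical: `register_exec_auth_validator`, `validate_exec_auth`, the `findings` key, and the simplified AWS check are illustrative names and logic, not existing Airflow APIs. In a real split, the first half would live in the K8s provider and the decorated function in the AWS provider.

```python
from typing import Any, Callable, Dict, List

# --- generic part (would live in the K8s provider; no cloud-specific code) ---
ExecAuthValidator = Callable[[Dict[str, Any], Dict[str, Any]], None]
_EXEC_AUTH_VALIDATORS: List[ExecAuthValidator] = []


def register_exec_auth_validator(validator: ExecAuthValidator) -> ExecAuthValidator:
    """Register a validator; usable as a decorator."""
    _EXEC_AUTH_VALIDATORS.append(validator)
    return validator


def validate_exec_auth(kubeconfig: Dict[str, Any], context: Dict[str, Any]) -> None:
    """Run every registered validator; a validator may raise to abort."""
    for validator in list(_EXEC_AUTH_VALIDATORS):
        validator(kubeconfig, context)


# --- provider-specific part (would live in the AWS provider) ---
@register_exec_auth_validator
def aws_eks_validator(kubeconfig: Dict[str, Any], context: Dict[str, Any]) -> None:
    """Simplified stand-in for the AWS detection and version check."""
    for entry in kubeconfig.get("users") or []:
        exec_cfg = (entry.get("user") or {}).get("exec") or {}
        if exec_cfg.get("command") == "aws":
            context.setdefault("findings", []).append(
                f"user {entry.get('name')!r} uses AWS exec auth"
            )
```

The hook would then call `validate_exec_auth(kubeconfig, context)` once inside `get_conn()`, without knowing which providers, if any, have registered validators.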

Would this approach work for you, or would you prefer everything moved entirely into the AWS provider package?

Contributor

Not an expert here... @o-nikolas / @eladkal how were such things handled in the past if cloud-provider specific stuff needed to be integrated into a base provider package?

Contributor

Off the top of my head, I can't think of a previous case quite like this. Can you, @eladkal?

I see both sides: I don't like AWS stuff in the K8s hook, but I also see the point @Vamsi-klu mentions. And the workaround sounds quite a bit more complicated than the change already is, which I don't love 😬

Contributor

I really don't understand what we are trying to solve here. But if the K8s provider needs optional stuff from the amazon provider, it should declare it as an optional dependency.

Member

I'm pretty hesitant to add this much code to add a min version check like this, especially one that isn't really directly related to this provider.

The lower bound in the amazon provider is already higher than what we'd defend against here. And, our constraint files are already putting in a higher version. (2.11.1 has the new version, 2.11.0 did not, fwiw)

If we went with the proposed generic validator + validate botocore version in the aws provider, we'd not have a problem because of the min version over there (ignoring when that min was bumped over there).

I'm just wondering if this is now not really a likely issue with the latest versions of stuff already...

Contributor

I think we better just close this as won't fix. This is just too much to accommodate older Boto versions where it's very unlikely that anyone needs this by now.

Contributor Author

I'll close this PR and focus on other issues. Thanks to everyone who shared their thoughts on this PR.

@Vamsi-klu

Contributor Author

Collaborators for this PR: @codingrealitylabs and @girlcoder-gaming. They helped me raise this PR.

Development

Successfully merging this pull request may close these issues.

Race Condition in AWS CLI Cache Creation During Parallel KubernetesPodOperator Authentication

6 participants