feat: vendor-pluggable S3 credentials for native scans by mbutrovich · Pull Request #4309 · apache/datafusion-comet

mbutrovich · 2026-05-13T01:21:45Z

Which issue does this PR close?

Closes #4332.

Rationale for this change

Comet's native scan paths (object_store for raw Parquet, opendal via iceberg-rust for Iceberg) bypass Spark's Hadoop S3A credential infrastructure. Vendors with per-path STS, REST-vended creds, or other custom mechanisms cannot reach Comet through any existing SPI. AWSCredentialsProvider.getCredentials() is parameterless, Hadoop S3A custom signers never return credentials outside the signing pipeline, and Spark's CloudCredentialsProvider yields one JWT per service name with no path argument.

This PR adds a narrow, S3-specific SPI plus JNI plumbing to call it from native code. Activation is config-driven and modeled on parquet.crypto.factory.class (PME KMS, #2447). The user names one vendor class in a Spark or Hadoop config and the vendor dispatches across backends inside it.
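As a rough illustration of the config-driven activation, the sketch below collects the two activation keys named in this PR into a plain map. The vendor class `com.example.VendorS3Provider`, the catalog name, and the use of the `spark.hadoop.` and `spark.sql.catalog.<name>.` prefixes are assumptions based on the usual Spark conventions, not text taken from the PR. The user guide remains the authority on exact spelling, including the per-bucket override form, which is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ActivationConfigSketch {
  // Hypothetical vendor class; it is not part of Comet.
  static final String VENDOR_CLASS = "com.example.VendorS3Provider";

  /** Builds the Spark settings that would name one vendor class for both scan paths. */
  static Map<String, String> sparkConf(String icebergCatalogName) {
    Map<String, String> conf = new LinkedHashMap<>();
    // Parquet path: Hadoop-style key, forwarded via Spark's spark.hadoop.* prefix (assumed).
    conf.put("spark.hadoop.fs.s3a.comet.credential.provider.class", VENDOR_CLASS);
    // Iceberg path: property set on the Spark catalog (prefix form assumed).
    conf.put(
        "spark.sql.catalog." + icebergCatalogName + ".s3.comet.credential.provider.class",
        VENDOR_CLASS);
    return conf;
  }
}
```

Naming the same class under both keys matches the intent stated above: one vendor class, with backend dispatch handled inside it.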

Design rationale (keying, lifecycle, returns-or-throws, no Comet-side cache, property-bag handling, error-fidelity caveats) lives in the contributor guide page s3-credential-provider-design.md. Operator setup and vendor contract live in the user guide page s3-credential-providers.md.

What is in this PR

Java SPI under org.apache.comet.cloud.s3 (in the spark module, since #4325, "refactor: Move most of comet-common module into comet-spark", collapsed common to a minimal bootstrap): CometS3CredentialProvider (AutoCloseable, default initialize(Map)), CometS3Credentials, CometS3AccessMode, CometS3CredentialContext, and CometS3CredentialDispatcher keyed by (FQCN, dispatchKey, catalogProperties) with ensureInitialized(...) returning a long handle, hot-path getCredentialsForPath(handle, ...), and a JVM shutdown hook that closes every cached provider.
Shared org.apache.comet.util.ClassLoaders.loadClass prefers the thread context ClassLoader. Both the dispatcher and IcebergReflection.loadClass delegate to it.
Rust CometS3CredentialBridge (under native/core/src/cloud/s3/) implementing object_store::CredentialProvider and reqsign_core::ProvideCredential, plus a JNI handle in native/jni-bridge.
Activation keys: fs.s3a.comet.credential.provider.class (with per-bucket override) for Parquet, and s3.comet.credential.provider.class on the Spark catalog property for Iceberg. dispatchKey is the bucket on the Parquet path and the V2 catalog name on the Iceberg path.
The unfiltered FileIO property bag crosses JNI as catalog_properties. The storage-prefix filter (s3., gcs., adls., client.) moves native-side to iceberg_scan.rs::load_file_io.
IcebergScanExec gets a manual redacting Debug so plan dumps do not leak the property bag.
iceberg-rust pin bumped to 83b4595 (for reqsign-core 3.0 and CustomAwsCredentialLoader). testcontainers bumped to 1.21.4 and docker-java to 3.7.1 for modern Docker daemons.
Reference IcebergRESTVendedS3Provider (test scope, Spark 4.x build only) wrapping Iceberg's VendedCredentialsProvider. Test scope keeps iceberg-aws and AWS SDK v2 off Comet's runtime classpath.
New user guide and contributor guide pages (linked above).
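The storage-prefix filter mentioned in the list above can be sketched as follows. In the PR the filter lives native-side in Rust (iceberg_scan.rs::load_file_io); this Java version only illustrates the logic, and the class and method names are hypothetical. The four prefixes are the ones listed in the description.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StoragePrefixFilterSketch {
  // Prefixes named in the PR description.
  static final List<String> STORAGE_PREFIXES = Arrays.asList("s3.", "gcs.", "adls.", "client.");

  /** Returns only the entries whose key starts with one of the storage prefixes. */
  static Map<String, String> storageProperties(Map<String, String> catalogProperties) {
    Map<String, String> kept = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : catalogProperties.entrySet()) {
      for (String prefix : STORAGE_PREFIXES) {
        if (e.getKey().startsWith(prefix)) {
          kept.put(e.getKey(), e.getValue());
          break; // one matching prefix is enough
        }
      }
    }
    return kept;
  }
}
```

Under this shape the provider's initialize(Map) can still see the unfiltered bag, while only storage-prefixed keys are handed to the storage layer.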

How are these changes tested?

JUnit CometS3CredentialDispatcherTest: handle round-trip, ensureInitialized idempotence, distinct dispatchKey and catalogProperties isolation, closeAll swallows provider exceptions, missing-class / wrong-interface / no-arg-ctor / empty-FQCN failure modes, get-without-init guard.
JUnit IcebergRESTVendedS3ProviderTest (Spark 4.x).
End-to-end CometS3CredentialBridgeSuite (Minio): Parquet on S3, Iceberg on S3, REST plus SPI integration with a sentinel non-storage-prefix key reaching initialize(Map), multi-catalog isolation across two catalogs sharing one FQCN. Added to dev/ci/check-suites.py ignore list (manual, like other Docker-dependent S3 suites).
Confirmed end-to-end with a downstream custom credential provider.
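The dispatcher behaviours these tests exercise (handle round-trip, idempotent ensureInitialized, isolation by dispatchKey and catalogProperties, the get-without-init guard) can be sketched roughly as below. This is a simplified stand-in, not the PR's CometS3CredentialDispatcher: the class name is hypothetical, the provider is an opaque Object, and a Supplier replaces the reflective class loading.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

final class HandleDispatcherSketch {
  /** Hypothetical key mirroring the (FQCN, dispatchKey, catalogProperties) triple. */
  static final class Key {
    final String providerClassName;
    final String dispatchKey;
    final Map<String, String> catalogProperties;

    Key(String providerClassName, String dispatchKey, Map<String, String> catalogProperties) {
      this.providerClassName = providerClassName;
      this.dispatchKey = dispatchKey == null ? "" : dispatchKey;
      // Defensive copy so later mutation by the caller cannot change the key.
      this.catalogProperties =
          catalogProperties == null ? Collections.emptyMap() : new HashMap<>(catalogProperties);
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Key)) return false;
      Key k = (Key) o;
      return Objects.equals(providerClassName, k.providerClassName)
          && dispatchKey.equals(k.dispatchKey)
          && catalogProperties.equals(k.catalogProperties);
    }

    @Override
    public int hashCode() {
      return Objects.hash(providerClassName, dispatchKey, catalogProperties);
    }
  }

  private final ConcurrentHashMap<Key, Long> keyToHandle = new ConcurrentHashMap<>();
  private final ConcurrentHashMap<Long, Object> handleToProvider = new ConcurrentHashMap<>();
  private final AtomicLong nextHandle = new AtomicLong(1);

  /** Idempotent: the same triple always yields the same handle and one provider instance. */
  long ensureInitialized(
      String fqcn, String dispatchKey, Map<String, String> props, Supplier<Object> factory) {
    return keyToHandle.computeIfAbsent(
        new Key(fqcn, dispatchKey, props),
        k -> {
          long handle = nextHandle.getAndIncrement();
          handleToProvider.put(handle, factory.get());
          return handle;
        });
  }

  /** Hot-path lookup by handle; fails if the handle was never initialized. */
  Object providerFor(long handle) {
    Object provider = handleToProvider.get(handle);
    if (provider == null) {
      throw new IllegalStateException("no provider initialized for handle " + handle);
    }
    return provider;
  }
}
```

A long handle keeps the native side free of JVM object references: it only has to pass a number back across JNI on each credential request.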

mbutrovich · 2026-05-21T18:15:07Z

CC @snmvaughan

karuppayya

Left some comments. Will do another pass later today

karuppayya · 2026-05-21T19:24:50Z

+    InstanceKey key = new InstanceKey(providerClassName, dispatchKey == null ? "" : dispatchKey);
+    Map<String, String> props =
+        catalogProperties == null ? Collections.emptyMap() : catalogProperties;
+    INSTANCES.computeIfAbsent(


A vendor whose initialize throws gets re-attempted on every get_credential call from object_store. Should we cache the error per key and back off? Maybe a follow-up.
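One possible shape for that follow-up is a per-key failure record with exponential backoff. The sketch below is not part of this PR; the class name, the string key, and the delay parameters are all hypothetical, and the clock is injected so the behaviour can be tested.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

final class InitFailureBackoffSketch {
  private static final class Failure {
    final RuntimeException error;
    final long retryAtMillis;
    final long delayMillis;

    Failure(RuntimeException error, long retryAtMillis, long delayMillis) {
      this.error = error;
      this.retryAtMillis = retryAtMillis;
      this.delayMillis = delayMillis;
    }
  }

  private final ConcurrentHashMap<String, Failure> failures = new ConcurrentHashMap<>();
  private final long baseDelayMillis;
  private final long maxDelayMillis;
  private final LongSupplier clock;

  InitFailureBackoffSketch(long baseDelayMillis, long maxDelayMillis, LongSupplier clock) {
    this.baseDelayMillis = baseDelayMillis;
    this.maxDelayMillis = maxDelayMillis;
    this.clock = clock;
  }

  /**
   * Runs init unless a recent failure for the same key is still backing off.
   * Not strictly atomic under contention; good enough for a sketch.
   */
  void initialize(String key, Runnable init) {
    long now = clock.getAsLong();
    Failure prev = failures.get(key);
    if (prev != null && now < prev.retryAtMillis) {
      // Still backing off: surface the cached error without calling the vendor again.
      throw new IllegalStateException(
          "initialize failed recently for " + key + "; next attempt at " + prev.retryAtMillis,
          prev.error);
    }
    try {
      init.run();
      failures.remove(key); // success clears the backoff state
    } catch (RuntimeException e) {
      long delay =
          prev == null ? baseDelayMillis : Math.min(maxDelayMillis, prev.delayMillis * 2);
      failures.put(key, new Failure(e, now + delay, delay));
      throw e;
    }
  }
}
```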

mbutrovich · 2026-05-21T21:46:40Z

Thanks for the feedback @karuppayya! I think I addressed everything but:

A vendor whose initialize throws gets re-attempted on every get_credential call from object_store. Should we cache the error per key and back off? Maybe a follow-up.

I will update my internal credential provider to align with these SPI changes and test again.

mbutrovich · 2026-05-21T23:07:03Z

Updated my internal implementation to match the latest SPI changes, and things are working well!

karuppayya · 2026-05-22T19:03:06Z

I guess the JNI call to getCredentialsForPath runs through the Comet tokio runtime, which is sized at spark.executor.cores worker threads. The call seems to be synchronous.
Since the call duration is non-deterministic and entirely controlled by the vendor's implementation, this can potentially block unrelated work on the runtime today and as the system grows. Do you think this is an issue?

That said, the current shape is a reasonable starting point. We can keep refining in subsequent PRs.

mbutrovich · 2026-05-22T19:13:17Z

I guess the JNI call to getCredentialsForPath runs through the Comet tokio runtime, which is sized at spark.executor.cores worker threads. The call seems to be synchronous. Since the call duration is non-deterministic and entirely controlled by the vendor's implementation, this can potentially block unrelated work on the runtime today and as the system grows. Do you think this is an issue?

I think there are still opportunities to figure out how to get better parallelism and hide I/O latency in Comet's execution model, but yeah right now it's fairly restricted. I think at least for the OpenDAL/Iceberg case we have a knob you can tune to fire off more tasks for data loading, which I think would introduce parallelism on this path.
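Because the design has no Comet-side cache and the JNI call is synchronous, one way a vendor could keep the hot path short is to cache credentials inside the provider and refresh only shortly before expiry. The sketch below illustrates that idea under assumptions: `Creds`, the path-prefix key, and the `slowFetch` function (standing in for an STS or REST call) are hypothetical and are not the SPI's real types.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Supplier;

final class CachingVendorSketch {
  /** Hypothetical credential holder with an expiry time. */
  static final class Creds {
    final String accessKeyId;
    final Instant expiresAt;

    Creds(String accessKeyId, Instant expiresAt) {
      this.accessKeyId = accessKeyId;
      this.expiresAt = expiresAt;
    }
  }

  private final ConcurrentHashMap<String, Creds> cache = new ConcurrentHashMap<>();
  private final Function<String, Creds> slowFetch;
  private final Duration refreshMargin;
  private final Supplier<Instant> clock;

  CachingVendorSketch(
      Function<String, Creds> slowFetch, Duration refreshMargin, Supplier<Instant> clock) {
    this.slowFetch = slowFetch;
    this.refreshMargin = refreshMargin;
    this.clock = clock;
  }

  /** Returns cached credentials while they are comfortably valid, otherwise refetches. */
  Creds credentialsFor(String pathPrefix) {
    Instant now = clock.get();
    // compute() serializes callers for the same key, so one refresh serves all of them.
    return cache.compute(
        pathPrefix,
        (k, cached) ->
            cached != null && now.plus(refreshMargin).isBefore(cached.expiresAt)
                ? cached
                : slowFetch.apply(k));
  }
}
```

With this shape only the first call per prefix, and calls near expiry, pay the vendor's network latency. Every other call returns from memory, which limits how long a runtime worker thread stays blocked.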

parthchandra · 2026-05-22T18:20:38Z

+        ) {
+            Ok(b) => Some(b),
+            Err(e) => {
+                log::warn!(


It's probably a better idea to fail here and let the user fix the error. Falling through to the default provider chain will either fail or, worse, succeed and lead to curious results.

Good call. The user explicitly named a provider, so silently falling back hides a real misconfiguration and can resolve to the wrong identity. Changed to propagate the error out of create_store.

parthchandra · 2026-05-22T18:26:32Z

+  private static final class InstanceKey {
+    final String providerClassName;
+    final String dispatchKey;
+    final Map<String, String> catalogProperties;


This could lead to KEY_TO_HANDLE getting quite large if there are many (JVM) sessions or if some catalog implementation refreshes some catalog property per table. We could limit the KEY_TO_HANDLE size and evict older keys to keep this limited.

Fair point on the catalog-refresh case, the design doesn't forbid it.

A few options I considered and where they break down:

Driver-side SparkListener / session-close hook: KEY_TO_HANDLE lives on executors, so a driver-side hook doesn't reach it. onApplicationEnd only fires at app shutdown, which the existing JVM shutdown hook already covers.

Per-session clearing on the executor: Spark has no "session ended on executor" event because executors are session-agnostic. In Spark Connect / Thrift Server one JVM serves many sessions concurrently. InstanceKey is (providerClassName, dispatchKey, catalogProperties) with no session identity, and two sessions configured with the same triple collapse to one entry via computeIfAbsent, so clearing on session X close would invalidate session Y's live bridge.

Plain LRU: the handle is held by native CometS3CredentialBridge instances by value and reused across scans, with no Drop callback into the JVM, so eviction can invalidate a live bridge mid-job.

The path that's both bounded and safe under parallel sessions is reference counting: a JNI callback from CometS3CredentialBridge::Drop decrements, entry evicts at zero. That's a real change rather than a one-line cap.

Do you have a catalog in mind that churns catalogProperties per table? Otherwise I would prefer to land this as-is and open a followup for the refcounted lifecycle once we have a concrete trigger.
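For reference, the reference-counted lifecycle described above could look roughly like the sketch below on the JVM side. It is not part of this PR: the class name is hypothetical, and in a real implementation `release` would be driven by a JNI callback from the native bridge's Drop, which is not shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class RefCountedRegistrySketch<K> {
  private static final class Entry {
    final AutoCloseable provider;
    int refs;

    Entry(AutoCloseable provider) {
      this.provider = provider;
    }
  }

  private final Map<K, Entry> entries = new HashMap<>();

  /** Returns the shared provider for key, creating it on first use, and takes a reference. */
  synchronized AutoCloseable acquire(K key, Supplier<AutoCloseable> factory) {
    Entry e = entries.get(key);
    if (e == null) {
      e = new Entry(factory.get());
      entries.put(key, e);
    }
    e.refs++;
    return e.provider;
  }

  /** Drops one reference; the entry is evicted and closed when the count reaches zero. */
  synchronized void release(K key) {
    Entry e = entries.get(key);
    if (e == null) {
      throw new IllegalStateException("release without matching acquire: " + key);
    }
    if (--e.refs == 0) {
      entries.remove(key);
      try {
        e.provider.close();
      } catch (Exception ignored) {
        // Swallow provider close failures so eviction always completes.
      }
    }
  }

  synchronized int size() {
    return entries.size();
  }
}
```

Two sessions that share a key take two references, so one session finishing cannot invalidate the other's live bridge. This is the property a plain LRU cap does not give.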

Let's have a followup for this. My knowledge of this is theoretical, but https://gravitino.apache.org/docs/0.9.0-incubating/security/credential-vending can return per-table credentials by creating 'Assume Role' credentials scoped to the table.
This is similar to per-file encryption in PME: access to specific files/tables can be restricted for users.

parthchandra · 2026-05-22T21:04:26Z

-            // Extract vended credentials from FileIO (REST catalog credential vending).
-            // FileIO properties take precedence over Hadoop-derived properties because
-            // they contain per-table credentials vended by the REST catalog.
+            // Forward the full FileIO property bag (including credentials.uri, OAuth tokens,


Properties like OAuth tokens, bearer tokens, etc. should not really be here as this will get baked into a protobuf that is sent unencrypted over the wire to executors. Also, if tokens have an expiry then they need to be refreshed or the credentials provider will fail.

On the wire: this rides the same channel that already carries Hadoop delegation tokens, S3A vended credentials, and Iceberg REST credentials from driver to executors via SparkSession / Hadoop conf, so the property bag here isn't a new exposure relative to that baseline. Deployments that need wire encryption already have spark.network.crypto.enabled.

On expiry: the properties forwarded in the proto are the catalog bootstrap identity (REST URI, OAuth client config), not the live credential. getCredentialsForPath is called per request and is the refresh contract, which is why the SPI is shaped this way rather than serializing a one-shot credential into the plan.

Were you flagging a specific provider where the bootstrap bag itself carries a short-lived bearer token?

Oh right. Probably worth adding a documentation note that spark.network.crypto.enabled should be set to true

Thanks again, @parthchandra. This sent me back to check Spark's own posture before adding the note.

Spark RPC is plaintext by default. TransportContext.java:257 only installs an SslHandler when spark.ssl.rpc.enabled is true, and that defaults to false (TransportConf.java:273-275). The AES alternative spark.network.crypto.enabled also defaults to false (Network.scala:30-34), and the two are mutually exclusive (SecurityManager.scala:283). With both off, the Netty channel is raw TCP.

The same channel already carries Hadoop delegation tokens, CloudCredentialsProvider JWTs, shuffle blocks, and serialized closures. Spark's docs/security.md covers the custom delegation token SPI (lines 927-944) without recommending RPC encryption for it, and only flags crypto in two specific contexts: YARN secret distribution (line 64) and spark.io.encryption.enabled (line 285).

The bootstrap property bag here is the same category of data on the same channel, so a "set spark.network.crypto.enabled=true" callout would be more prescriptive than Spark is for equivalent mechanisms. RPC encryption is a deployment-wide call, not a per-SPI one.

Added a short "Wire encryption" subsection to the user guide that notes catalog config rides the Netty RPC channel, names the two opt-in mechanisms (spark.network.crypto.enabled and spark.ssl.rpc.enabled), and links Spark's security guide rather than prescribing a specific knob.
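The defaults described above (both mechanisms off, and the two mutually exclusive) can be expressed as a small configuration check. This is an illustrative helper only: the class and enum names are hypothetical, and treating "both enabled" as an error is this sketch's choice rather than a statement about how Spark itself reacts.

```java
import java.util.Map;

final class RpcEncryptionCheckSketch {
  enum Mode { PLAINTEXT, AES, TLS }

  /** Classifies the RPC encryption posture from a Spark-style config map. */
  static Mode mode(Map<String, String> conf) {
    // Both keys default to false, per the discussion above.
    boolean aes = Boolean.parseBoolean(conf.getOrDefault("spark.network.crypto.enabled", "false"));
    boolean tls = Boolean.parseBoolean(conf.getOrDefault("spark.ssl.rpc.enabled", "false"));
    if (aes && tls) {
      throw new IllegalArgumentException(
          "spark.network.crypto.enabled and spark.ssl.rpc.enabled are mutually exclusive");
    }
    if (tls) return Mode.TLS;
    return aes ? Mode.AES : Mode.PLAINTEXT;
  }
}
```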

mbutrovich added 8 commits May 12, 2026 17:49

cloud credential provider JVM side

e877e7e

cloud credential provider native side

892697b

hook up iceberg-rust, had to bump iceberg-rust

67b6f9b

docs

22e30b0

tests

58b1364

cleanup

858f901

cleanup

92b0416

Merge branch 'main' into credential_provider

2da7184

# Conflicts: # docs/source/contributor-guide/index.md

mbutrovich changed the title ~~Credential provider~~ feat: Spark custom credential providers for native scans May 13, 2026

mbutrovich added 5 commits May 12, 2026 21:48

fix native test failure in CI

4d03053

update contributor guide about multiple providers

a8cbe8e

add access mode to the SPI since a provider might have different cred…

9d07ff0

…entials for read vs. write

clean up iceberg path discrepancy

9b1e622

run prettier on the docs

08afb7d

andygrove added this to the 0.17.0 milestone May 13, 2026

andygrove added this to Comet Development May 13, 2026

github-project-automation Bot moved this to Todo in Comet Development May 13, 2026

andygrove assigned mbutrovich May 13, 2026

Merge branch 'main' into credential_provider

0cd8a36

mbutrovich force-pushed the credential_provider branch from b549155 to 0cd8a36 Compare May 14, 2026 14:33

mbutrovich moved this from Todo to In progress in Comet Development May 14, 2026

andygrove reviewed May 14, 2026

Comment thread common/src/main/java/org/apache/comet/cloud/CometCloudCredentialDispatcher.java Outdated

cleanup to get ready for review

d9596de

mbutrovich changed the title ~~feat: Spark custom credential providers for native scans~~ feat: vendor-pluggable S3 credentials for Comet native scans May 14, 2026

mbutrovich changed the title ~~feat: vendor-pluggable S3 credentials for Comet native scans~~ feat: vendor-pluggable S3 credentials for native scans May 14, 2026

mbutrovich marked this pull request as ready for review May 14, 2026 16:18

mbutrovich mentioned this pull request May 15, 2026

feat: Credential provider support #4335

Closed

mbutrovich added 3 commits May 15, 2026 14:50

Merge branch 'main' into credential_provider

7050069

Replaced the ServiceLoader-based S3 credential SPI with config-driven…

de5d5c2

… activation (fs.s3a.comet.credential.provider.class for Parquet, s3.comet.credential.provider.class for Iceberg), so the bridge is opt-in per Spark config rather than implicit on classpath presence.

Cleanup.

17825c0

mbutrovich and others added 5 commits May 18, 2026 19:34

Update file structure after apache#4325.

9f58cb0

Merge branch 'main' into credential_provider

176ff52

Merge branch 'main' into credential_provider

0c13ae6

Update docs, add contributor guide page about credential provider.

5853dff

fix format

c776613

mbutrovich requested review from comphead and parthchandra May 21, 2026 18:14

rename docs

6738ca0

karuppayya reviewed May 21, 2026

mbutrovich and others added 2 commits May 21, 2026 16:17

Merge branch 'main' into credential_provider

b85d645

Address PR feedback.

1895a17

mbutrovich requested a review from karuppayya May 21, 2026 23:16

mbutrovich added 2 commits May 21, 2026 19:30

Clean up docs trying to get line count on the diff down.

e37d61e

fix format.

f08805e

karuppayya approved these changes May 22, 2026

parthchandra reviewed May 22, 2026

mbutrovich and others added 8 commits May 22, 2026 17:47

Address PR feedback.

6b0c853

Merge branch 'main' into credential_provider

43d5d45

Merge branch 'main' into credential_provider

acd6cdb

Merge branch 'main' into credential_provider

2a81fdb

Merge branch 'main' into credential_provider

f308022

Merge branch 'main' into credential_provider

5c40070

# Conflicts: # docs/source/contributor-guide/index.md

Merge remote-tracking branch 'origin/credential_provider' into creden…

0a7fccc

…tial_provider

Update user guide about encryption.

59e0331

mbutrovich requested a review from parthchandra May 26, 2026 22:48
