S3FileIO: Always normalize listPrefix key to end with trailing slash#16656
S3FileIO: Always normalize listPrefix key to end with trailing slash#16656bolma-lila wants to merge 2 commits into
Conversation
S3FileIO.listPrefix() only normalizes the prefix with a trailing slash for S3 Directory Buckets, but not for standard S3 buckets. This causes two problems: 1. Security: ListObjectsV2 with prefix "warehouse/ns/table" also matches sibling keys like "warehouse/ns/table_secret/..." since S3 does pure string prefix matching. 2. STS compatibility: REST catalogs that vend credentials (e.g., Lakekeeper) scope the session policy to "s3:prefix": "<key>/*". A prefix without trailing slash doesn't match, causing 403. Apply toDirectoryPath() unconditionally, consistent with how Trino's S3FileSystem.directoryKey() normalizes keys before listing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- testListPrefixNormalizesTrailingSlashForGeneralPurposeBucket: verifies
via mock that ListObjectsV2Request always receives a trailing-slash
prefix even when the caller omits it (general-purpose bucket)
- testListPrefixDoesNotMatchSiblingPaths: verifies via MinIO that
listPrefix("s3://bucket/warehouse/ns/table") only returns objects
under "table/" and not under "table_sibling/"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| // Always normalize the prefix to end with "/" to prevent matching sibling prefixes. | ||
| // Without this, prefix "warehouse/ns/table" would also match "warehouse/ns/table_other/...". | ||
| // This is also required for STS session policies that scope ListBucket to "<key>/*". | ||
| uri = uri.toDirectoryPath(); |
There was a problem hiding this comment.
After this change s3.directory-bucket.list-prefix-as-directory (default true) has no production reader: isS3DirectoryBucketListPrefixAsDirectory() is now referenced only by a unit test, and S3URI.useS3DirectoryBucket() becomes production-dead. It also removes the only opt-out that existed for directory buckets. Either repurpose the flag to gate this now-unconditional normalization, or remove the dead property and its getter from S3FileIOProperties.
| && client.s3FileIOProperties().isS3DirectoryBucketListPrefixAsDirectory()) { | ||
| uri = uri.toDirectoryPath(); | ||
| } | ||
| // Always normalize the prefix to end with "/" to prevent matching sibling prefixes. |
There was a problem hiding this comment.
SupportsPrefixOperations.listPrefix documents that object stores "may allow for arbitrary prefixes"; this forces directory-only semantics for general-purpose buckets with no opt-out, while GCSFileIO.listPrefix (GCSFileIO.java:298) and ADLSFileIO.listPrefix (ADLSFileIO.java:219) still pass the raw prefix - so the sibling-match behavior persists there and S3 now diverges. Iceberg's own orphan-file listing already appends "/" before calling listPrefix (FileSystemWalker.java:68), so the remaining benefit is external/STS callers. Consider normalizing consistently across FileIOs or documenting why only S3 changes.
|
|
||
| // Call listPrefix WITHOUT trailing slash | ||
| List<FileInfo> fileInfoList = | ||
| StreamSupport.stream( |
There was a problem hiding this comment.
Streams.stream(...) (already imported and used in the sibling test below) is simpler than the StreamSupport/Spliterators construction: Streams.stream(localS3FileIo.listPrefix("s3://bucket/warehouse/ns/table")).collect(Collectors.toList()).
|
Thanks @wombatu-kun for the careful review — all three points are fair, and digging into them led me to conclude this change isn't the right fix, so I'm going to close it. Context on what I was actually chasing: the concrete failure was a Why this PR isn't needed: that's an Airbyte-caller problem, and it's fixed at the caller in airbytehq/airbyte#78624 by normalizing the location before
So as you noted, the only remaining beneficiary would be external/STS callers — and for those, the right place is the caller (as Iceberg already does in
Closing this in favor of the caller-side fix. If there's appetite for sibling-prefix hardening as defense-in-depth, I'd be happy to revisit it as a properly flag-gated, cross-FileIO change in a separate PR. Thanks again. |
Description
S3FileIO.listPrefix()currently only normalizes the prefix with a trailing slash for S3 Directory Buckets (viatoDirectoryPath()), but not for standard S3 buckets. This causes the prefix passed toListObjectsV2to potentially match sibling keys.Problem
Given a table location
s3://bucket/warehouse/ns/table, callinglistPrefix("s3://bucket/warehouse/ns/table")setsprefix=warehouse/ns/tablein theListObjectsV2Request. S3 returns all keys starting with that string, including:warehouse/ns/table/metadata/...(correct — belongs to this table)warehouse/ns/table_archive/data/...(wrong — different table with similar name)This also breaks STS credential vending from REST catalogs like Lakekeeper, which scope the session policy to:
{"Condition": {"StringLike": {"s3:prefix": "warehouse/ns/table/*"}}}A prefix without trailing slash does not match
StringLikepattern.../*, resulting in403 AccessDenied.Fix
Apply
toDirectoryPath()unconditionally inlistPrefix(), not just for Directory Buckets. This ensures the key always ends with/before being used as a ListObjects prefix, matching how Trino normalizes keys viadirectoryKey().Testing
TestS3FileIO.testListPrefix) creates objects underprefix/<scale>/subdirectories, so the normalized prefixpath/to/list/still matches all expected objectsListObjectsV2(prefix="key/")succeeds whereListObjectsV2(prefix="key")returns 403Related