CASSANALYTICS-42 Add S3-backed Cassandra batch reader by liucao-dd · Pull Request #206 · apache/cassandra-analytics

liucao-dd · 2026-05-12T22:10:22Z

Patch information

Jira: https://issues.apache.org/jira/browse/CASSANALYTICS-42
CEP: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-56%3A+Spark+Bulk+Reading+from+Cassandra+Backup+Uploaded+to+Object+Storage

Summary

Adds an S3-backed Cassandra batch reader so Spark jobs can read SSTables directly from object-storage backups without going through a live cluster. Built on top of the existing CassandraDataLayer / bulk-reader abstractions and exposed through Spark SQL via a new S3CassandraDataSource.

This PR ships only the generic SPI and the S3 reference implementation. Concrete backup-provider implementations are out of scope and intended to live in downstream repos that plug into the BackupReader SPI introduced here.

What's included

Backup reader SPI (cassandra-analytics-core/.../spark/data/backup/): BackupReader, BackupReaderConfig, BackupReaderFactory, BackupReaderRegistry — pluggable abstraction so different backup providers can be wired in without changes to core.
S3 data layer (cassandra-analytics-core):
- S3CassandraDataLayer with token-aware partitioning and SSTable selection over S3-resident backups, including SSTableTokenBounds pruning that correctly handles the Murmur3 wrap-around.
- S3ClientCache, S3ClientConfig, S3DataSourceClientConfig, S3SizingFactory, S3TableSizeProvider, and S3SnapshotTimeProvider (in cassandra-analytics-common).
Spark SQL integration: S3CassandraDataSource, S3CassandraPrebuiltReadContext(Registry), S3CassandraTokenIndexPrebuilder, plus refinements to CassandraScanBuilder, CassandraPartitioning, CassandraTable, and a new CassandraSourceStatistics.
Per-task metrics: SparkCustomMetricsStats plus Spark CustomTaskMetric classes (TaskTotal* / Total*) for S3 GET/HEAD latency, summary read latency, skipped/corrupt SSTable counts, opened SSTable duration, mutable metadata drift, and head fallback counts. Threaded through BackupReader read paths via a per-task Stats argument.
Bridge plumbing: minor changes to CassandraBridgeImplementation and SummaryDbUtils (both bridges) to thread the per-task Stats parameter; existing call sites get backwards-compatible overloads. VersionRunner is published via java-test-fixtures so the bridges can share it in tests.

Tests

All new tests live in the modules where the code lives (cassandra-analytics-core, bridges):

S3ClientCacheTest, S3ClientConfigTest, S3DataSourceClientConfigTest, S3DataSourceClientConfigBufferTest
S3SSTableLeakTests, SSTableTokenIndexTest
BackupReaderFactorySerializationTest plus a FakeBackupReader test fixture exercising the SPI without a real S3 endpoint
S3CassandraPrebuiltReadContextTest
Updated ReaderUtilsTests, SSTableReaderTests, SummaryDbTests for the new Stats-threaded signatures

This code already runs in production internally, powering ~200 Spark pipelines that ingest data from S3 backups into a data lake, with table sizes ranging from 10MB to 300TB.

Reviewer notes

The diff is large (~9.4k LOC added across ~80 files) because it introduces both the SPI and a complete S3 implementation plus the metrics surface. It is posted in one piece mostly to show the overall idea; the actual merge can be split into smaller PRs if preferred.

Please also ignore my shadow jar build changes; those are artifacts of our internal build system, and we can discuss cleaning them up later.

Known gap currently deferred

S3 credentials are not wired through Sidecar.
We use an in-memory buffer when reading from S3. This works fine in production with S3-integrated encryption (e.g. KMS), but it may not work if another encryption mechanism requires handling outside the S3 API.

Introduce the public backup-reader registration contract and S3 client configuration primitives so S3-backed readers can be wired without a bundled backup implementation.

Implement the S3-backed data layer and SSTable token-index builder behind the backup-reader contract, including Summary.db reader support needed for remote SSTables.

Wire the S3 data layer into the Spark DataSource path with scan statistics, custom task metrics, and prebuilt token-index contexts for efficient planning.

Add focused coverage for S3 config validation, client caching, token-index construction, prebuilt contexts, and Summary.db reader edge cases.

VersionRunner is the parameterized-test base used to iterate Cassandra version bridges. Until now it lived under cassandra-analytics-core's test sources, so external test consumers loading it via the existing testImplementation(testFixtures(project(...))) wiring couldn't see it. This commit relocates it to src/testFixtures and exposes the test fixtures artifact so any downstream test module can depend on it through the standard Gradle testFixtures mechanism. Existing in-module tests continue to resolve VersionRunner because Gradle automatically adds the testFixtures output to the local test compile classpath.

- Enable the java-test-fixtures plugin in cassandra-analytics-core.
- Declare testFixturesApi dependencies for cassandra-bridge and JUnit Jupiter API so consumers compile against fixtures without redeclaring transitive dependencies.
- Move VersionRunner from src/test/java to src/testFixtures/java with no source changes.

Concurrent Spark tasks share a canonical BackupReader instance via ReaderInternCache so that S3 client and manifest state aren't re-allocated per task. With the prior shape, each task installed its own Stats sink through BackupReader.setStats(), which silently raced with sibling tasks on the same executor and attributed S3 GET/HEAD durations, mutable-metadata drift counts, and similar metrics to whichever task wrote setStats() last. Move from per-reader mutable state to per-call argument passing so the correct task-scoped Stats receives every S3 operation measurement.

Interface changes (BackupReader):
- Drop setStats(Stats); the interface no longer carries mutable per-task state.
- Add a trailing Stats parameter to readAsync, readMutableMetadataAsync, getAsync, getMutableMetadataAsync, and exists. Implementations route S3 metrics through the supplied sink rather than a captured field.

Config changes (BackupReaderConfig):
- Drop the transient stats field and the withStats(Stats) helper. Stats is now exclusively supplied on individual read calls, which removes a sharp edge around closure capture and (de)serialization.
- Bump serialVersionUID to reflect the shape change.

Wiring changes:
- S3CassandraDataLayer passes the task's SparkCustomMetricsStats (context.stats) into every BackupReader call site, including the ranged GET, streaming GET, mutable-metadata variants, and exists().
- The readObject path no longer reapplies setStats on the interned reader; it just recreates the executor-local Spark metrics sink.
- SSTableTokenIndexBuilder, whose prebuild path does not flow metrics back to Spark, supplies Stats.DoNothingStats.INSTANCE explicitly.

Test changes:
- FakeBackupReader is updated to match the new signatures and no longer stores a Stats field.
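The move from per-reader mutable state to per-call argument passing can be illustrated with a small, self-contained Java sketch. It is not the code in this PR: StatsSketch, CountingStats, SharedReaderSketch, and PerCallStatsDemo are hypothetical stand-ins, the "read" only copies the key bytes instead of calling S3, and the sink counts calls rather than recording latencies.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, cut-down stand-in for the per-task Stats sink. */
interface StatsSketch {
    void s3GetCompleted(long durationNanos);
}

/** Counts GET calls for one task. */
final class CountingStats implements StatsSketch {
    private final AtomicLong gets = new AtomicLong();

    @Override
    public void s3GetCompleted(long durationNanos) {
        gets.incrementAndGet();
    }

    long gets() {
        return gets.get();
    }
}

/** Shared reader with no per-task fields: the sink arrives with every call. */
final class SharedReaderSketch {
    byte[] read(String key, StatsSketch stats) {
        long start = System.nanoTime();
        // Stand-in for a remote GET; a real reader would call the object store here.
        byte[] payload = key.getBytes(StandardCharsets.UTF_8);
        stats.s3GetCompleted(System.nanoTime() - start);
        return payload;
    }
}

final class PerCallStatsDemo {
    /** Runs two "tasks" against one shared reader and returns their GET counts. */
    static long[] runTwoTasks() {
        SharedReaderSketch shared = new SharedReaderSketch();
        CountingStats taskA = new CountingStats();
        CountingStats taskB = new CountingStats();
        Thread a = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                shared.read("a-" + i, taskA);
            }
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                shared.read("b-" + i, taskB);
            }
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] {taskA.gets(), taskB.gets()};
    }
}
```

Because the reader holds no sink of its own, there is nothing for sibling tasks to overwrite: each measurement lands in whichever sink the caller passed, regardless of thread interleaving.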

SSTableTokenBounds.overlaps() previously normalized firstToken > lastToken by swapping the endpoints. For an SSTable whose covered range wraps the ring (i.e. crosses the Murmur3 zero boundary between Long.MAX_VALUE and Long.MIN_VALUE), the swap silently converted the actual covered band [first, MAX] U [MIN, last] into its complement (last, first). The data layer uses overlaps() to filter SSTables in listInstance(), so a query range falling inside the true covered band could be reported as non-overlapping and the SSTable would be skipped, silently dropping rows. The fix follows the same wrap-around convention used elsewhere in the project (e.g. RangeUtils.calculateTokenRanges), where firstToken > lastToken means the range crosses the boundary rather than being inverted.

Changes:
- SSTableTokenBounds.overlaps(): when firstToken > lastToken, model the bounds as the two segments [first, MAX] and [MIN, last] and report overlap if the query range hits either segment. Non-wrap bounds keep the existing single-segment isConnected() check.
- Constrain the new helper constants to the Murmur3 token domain (Long.MIN_VALUE / Long.MAX_VALUE) and document the assumption; the reader path is Murmur3-only today.
- SSTableTokenIndexBuilder.toLong(): replace longValue() with longValueExact() so a non-Murmur3 token that ever reaches this path fails loudly instead of silently truncating to the low 64 bits.

Tests:
- SSTableTokenIndexTest: rename the inverted-bounds test to invertedBoundsModelWrapAround and assert the actual wrap semantics (queries inside either segment overlap, queries in the gap do not, endpoints are inclusive).
- Add boundaryAndSingletonBoundsOverlap covering point ranges, shared endpoints, and adjacent-but-disjoint ranges on the well-formed path.
- Add extremeTokenBoundsCoverRing exercising Long.MIN_VALUE / Long.MAX_VALUE for both the full-ring and the degenerate MAX..MIN inverted form.
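The two-segment wrap-around check can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than the code in this PR: TokenBoundsSketch and intersects are hypothetical names, the query range is assumed to be a closed, non-wrapping interval, and plain comparisons stand in for the isConnected() check mentioned in the commit message.

```java
import java.math.BigInteger;

/** Hypothetical stand-in for SSTable token bounds on the Murmur3 ring. */
final class TokenBoundsSketch {
    // Murmur3 token domain, as assumed by the commit message.
    static final long MIN_TOKEN = Long.MIN_VALUE;
    static final long MAX_TOKEN = Long.MAX_VALUE;

    final long firstToken;
    final long lastToken;

    TokenBoundsSketch(long firstToken, long lastToken) {
        this.firstToken = firstToken;
        this.lastToken = lastToken;
    }

    /** True when the closed query range [queryStart, queryEnd] touches the covered tokens. */
    boolean overlaps(long queryStart, long queryEnd) {
        if (firstToken <= lastToken) {
            // Well-formed bounds: one closed segment.
            return intersects(firstToken, lastToken, queryStart, queryEnd);
        }
        // firstToken > lastToken means the range crosses the ring boundary,
        // so it covers [first, MAX] and [MIN, last]; swapping would yield the complement.
        return intersects(firstToken, MAX_TOKEN, queryStart, queryEnd)
            || intersects(MIN_TOKEN, lastToken, queryStart, queryEnd);
    }

    /** Closed-interval intersection; endpoints count as overlapping. */
    private static boolean intersects(long aStart, long aEnd, long bStart, long bEnd) {
        return aStart <= bEnd && bStart <= aEnd;
    }

    /** Fails with ArithmeticException instead of truncating tokens that do not fit in 64 bits. */
    static long toLong(BigInteger token) {
        return token.longValueExact();
    }
}
```

With bounds (100, -100), a query over [200, 300] or [-300, -200] overlaps, while a query over [-50, 50] falls in the gap and does not. BigInteger.longValue() would silently keep only the low 64 bits of an out-of-range token, which is why the exact variant is the safer choice on this path.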

jberragan

Thanks for the patch, this is a great addition! I will take a closer look, but a few high level questions:

Did you consider putting this in a separate module (e.g. cassandra-analytics-s3)? The Sidecar DataLayer should never have been put in cassandra-analytics-core and should eventually be moved to its own module.
I suppose every user might have a slightly different backup path format, is this made pluggable by the BackupReader interface?

liucao-dd · 2026-05-12T23:55:10Z

> Thanks for the patch, this is a great addition! I will take a closer look, but a few high level questions:
>
> Did you consider putting this in a separate module (e.g. cassandra-analytics-s3)? The Sidecar DataLayer should never have been put in cassandra-analytics-core and should eventually be moved to its own module.

I wasn't aware that we wanted the Sidecar data layer to live outside of core. Happy to discuss the appropriate modularization. An S3-specific module could work, or we could have a module specific to batch reads. The S3 implementation inherits a good amount of work from the Sidecar one.

> I suppose every user might have a slightly different backup path format, is this made pluggable by the BackupReader interface?

Yes. I didn't include our internal concrete implementation of this interface in this PR because it is tied to a vendor we use. The goal is to make it pluggable to fit the needs of different organizations, using BackupReaderRegistry.register().
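As a rough illustration of how a registry of this kind lets each organization plug in its own backup path format, here is a minimal Java sketch. All names and signatures in it (ReaderSketch, ReaderFactorySketch, RegistrySketch, the "prefix" option, and the path layout) are hypothetical; the actual BackupReader, BackupReaderFactory, and BackupReaderRegistry contracts in this PR may look different.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical provider contract: maps a logical SSTable to an object-store key. */
interface ReaderSketch {
    String resolveKey(String keyspace, String table, String sstableFile);
}

/** Hypothetical factory: builds a reader from string options. */
interface ReaderFactorySketch {
    ReaderSketch create(Map<String, String> options);
}

/** Hypothetical registry: core looks providers up by name, so it needs no provider code. */
final class RegistrySketch {
    private static final Map<String, ReaderFactorySketch> FACTORIES = new ConcurrentHashMap<>();

    private RegistrySketch() {
    }

    static void register(String name, ReaderFactorySketch factory) {
        if (FACTORIES.putIfAbsent(name, factory) != null) {
            throw new IllegalStateException("Backup reader already registered: " + name);
        }
    }

    static ReaderSketch create(String name, Map<String, String> options) {
        ReaderFactorySketch factory = FACTORIES.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("No backup reader registered as: " + name);
        }
        return factory.create(options);
    }
}
```

A downstream module would register its provider once at startup, for example `RegistrySketch.register("vendor-x", options -> (ks, table, file) -> options.get("prefix") + "/" + ks + "/" + table + "/" + file)`, and the job would then select it by name through an option.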

liucao-dd added 7 commits May 12, 2026 04:15

Add S3 reader configuration foundations

ba73508

Add S3 Cassandra data layer internals

d8807ca

Expose S3 Cassandra through Spark SQL

4007648

Cover S3 Cassandra reader behavior

f206cea

jberragan reviewed May 12, 2026

Fix javadoc issues

33a659f

liucao-dd wants to merge 8 commits into apache:trunk from liucao-dd:s3-batch-reader-on-trunk

2 participants
