CASSANALYTICS-42 Add S3-backed Cassandra batch reader #206
Introduce the public backup-reader registration contract and S3 client configuration primitives so S3-backed readers can be wired without a bundled backup implementation.
Implement the S3-backed data layer and SSTable token-index builder behind the backup-reader contract, including Summary.db reader support needed for remote SSTables.
Wire the S3 data layer into the Spark DataSource path with scan statistics, custom task metrics, and prebuilt token-index contexts for efficient planning.
Add focused coverage for S3 config validation, client caching, token-index construction, prebuilt contexts, and Summary.db reader edge cases.
VersionRunner is the parameterized-test base used to iterate Cassandra version bridges. Until now it lived under cassandra-analytics-core's test sources, so external test consumers loading it via the existing testImplementation(testFixtures(project(...))) wiring couldn't see it.
This commit relocates it to src/testFixtures and exposes the test fixtures artifact so any downstream test module can depend on it through the standard Gradle testFixtures mechanism. Existing in-module tests continue to resolve VersionRunner because Gradle automatically adds the testFixtures output to the local test compile classpath.
- Enable the java-test-fixtures plugin in cassandra-analytics-core.
- Declare testFixturesApi dependencies for cassandra-bridge and JUnit Jupiter API so consumers compile against fixtures without redeclaring transitive dependencies.
- Move VersionRunner from src/test/java to src/testFixtures/java with no source changes.
Concurrent Spark tasks share a canonical BackupReader instance via ReaderInternCache so that S3 client and manifest state aren't re-allocated per task. With the prior shape, each task installed its own Stats sink through BackupReader.setStats(), which silently raced with sibling tasks on the same executor and attributed S3 GET/HEAD durations, mutable-metadata drift counts, and similar metrics to whichever task wrote setStats() last.
Move from per-reader mutable state to per-call argument passing so the correct task-scoped Stats receives every S3 operation measurement.
Interface changes (BackupReader):
- Drop setStats(Stats); the interface no longer carries mutable per-task state.
- Add a trailing Stats parameter to readAsync, readMutableMetadataAsync, getAsync, getMutableMetadataAsync, and exists. Implementations route S3 metrics through the supplied sink rather than a captured field.
Config changes (BackupReaderConfig):
- Drop the transient stats field and the withStats(Stats) helper. Stats is now exclusively supplied on individual read calls, which removes a sharp edge around closure capture and (de)serialization.
- Bump serialVersionUID to reflect the shape change.
Wiring changes:
- S3CassandraDataLayer passes the task's SparkCustomMetricsStats (context.stats) into every BackupReader call site, including the ranged GET, streaming GET, mutable-metadata variants, and exists().
- The readObject path no longer reapplies setStats on the interned reader; it just recreates the executor-local Spark metrics sink.
- SSTableTokenIndexBuilder, whose prebuild path does not flow metrics back to Spark, supplies Stats.DoNothingStats.INSTANCE explicitly.
Test changes:
- FakeBackupReader is updated to match the new signatures and no longer stores a Stats field.
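The move from a setStats() field to a per-call argument can be illustrated with a minimal sketch. Every name below (TaskStats, CountingTaskStats, SharedReaderSketch, PerCallStatsDemo) is hypothetical and greatly simplified; none of it is the PR's actual BackupReader or Stats API. It only shows that when a shared reader holds no sink of its own, each task's measurements land in that task's sink even under concurrency.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a task-scoped metrics sink. */
interface TaskStats
{
    void recordGet(String key, long nanos);
}

/** Counts the GET operations reported to this sink. */
final class CountingTaskStats implements TaskStats
{
    final AtomicInteger gets = new AtomicInteger();

    @Override
    public void recordGet(String key, long nanos)
    {
        gets.incrementAndGet();
    }
}

/** Hypothetical shared reader: it keeps no Stats field, so there is nothing for tasks to race on. */
final class SharedReaderSketch
{
    byte[] read(String key, TaskStats stats)
    {
        long start = System.nanoTime();
        byte[] data = key.getBytes(StandardCharsets.UTF_8); // placeholder for a real S3 GET
        stats.recordGet(key, System.nanoTime() - start);    // reported to the caller's own sink
        return data;
    }
}

final class PerCallStatsDemo
{
    /** Runs two concurrent "tasks" against one reader and returns the GET count seen by each sink. */
    static int[] runTwoTasks(int readsA, int readsB)
    {
        SharedReaderSketch reader = new SharedReaderSketch();
        CountingTaskStats statsA = new CountingTaskStats();
        CountingTaskStats statsB = new CountingTaskStats();
        Thread taskA = new Thread(() -> {
            for (int i = 0; i < readsA; i++)
            {
                reader.read("a-" + i, statsA);
            }
        });
        Thread taskB = new Thread(() -> {
            for (int i = 0; i < readsB; i++)
            {
                reader.read("b-" + i, statsB);
            }
        });
        taskA.start();
        taskB.start();
        try
        {
            taskA.join();
            taskB.join();
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[]{statsA.gets.get(), statsB.gets.get()};
    }

    public static void main(String[] args)
    {
        int[] counts = runTwoTasks(3, 5);
        System.out.println(counts[0] + " " + counts[1]); // prints "3 5"
    }
}
```

With a setter-based design, both threads would write to whichever sink was installed last; passing the sink as an argument removes that shared mutable state entirely.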
SSTableTokenBounds.overlaps() previously normalized firstToken > lastToken by swapping the endpoints. For an SSTable whose covered range wraps the ring (i.e. crosses the Murmur3 zero boundary between Long.MAX_VALUE and Long.MIN_VALUE), the swap silently converted the actual covered band [first, MAX] U [MIN, last] into its complement (last, first). The data layer uses overlaps() to filter SSTables in listInstance(), so a query range falling inside the true covered band could be reported as non-overlapping and the SSTable would be skipped, silently dropping rows.
This is the same wrap-around convention used elsewhere in the project (e.g. RangeUtils.calculateTokenRanges), where firstToken > lastToken means the range crosses the boundary rather than being inverted.
Changes:
- SSTableTokenBounds.overlaps(): when firstToken > lastToken, model the bounds as the two segments [first, MAX] and [MIN, last] and report overlap if the query range hits either segment. Non-wrap bounds keep the existing single-segment isConnected() check.
- Constrain the new helper constants to the Murmur3 token domain (Long.MIN_VALUE / Long.MAX_VALUE) and document the assumption; the reader path is Murmur3-only today.
- SSTableTokenIndexBuilder.toLong(): replace longValue() with longValueExact() so a non-Murmur3 token that ever reaches this path fails loudly instead of silently truncating to the low 64 bits.
Tests:
- SSTableTokenIndexTest: rename the inverted-bounds test to invertedBoundsModelWrapAround and assert the actual wrap semantics (queries inside either segment overlap, queries in the gap do not, endpoints are inclusive).
- Add boundaryAndSingletonBoundsOverlap covering point ranges, shared endpoints, and adjacent-but-disjoint ranges on the well-formed path.
- Add extremeTokenBoundsCoverRing exercising Long.MIN_VALUE / Long.MAX_VALUE for both the full-ring and the degenerate MAX..MIN inverted form.
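The two-segment wrap-around check and the longValueExact() change described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: WrapAwareBounds and its methods are hypothetical names, plain closed long intervals stand in for whatever range type the real SSTableTokenBounds uses, and the query range is assumed to be non-wrapping (queryLo <= queryHi).

```java
import java.math.BigInteger;

/**
 * Hypothetical sketch of closed token bounds over the Murmur3 domain.
 * first > last is read as "wraps past Long.MAX_VALUE to Long.MIN_VALUE".
 */
final class WrapAwareBounds
{
    private final long first;
    private final long last;

    WrapAwareBounds(long first, long last)
    {
        this.first = first;
        this.last = last;
    }

    /** Closed-interval intersection test for two non-wrapping segments. */
    private static boolean segmentsOverlap(long aLo, long aHi, long bLo, long bHi)
    {
        return aLo <= bHi && bLo <= aHi;
    }

    /** True if the closed query range [queryLo, queryHi] touches the covered band. */
    boolean overlaps(long queryLo, long queryHi)
    {
        if (first <= last)
        {
            // Well-formed bounds: a single segment.
            return segmentsOverlap(first, last, queryLo, queryHi);
        }
        // Wrapping bounds: the covered band is [first, MAX] U [MIN, last].
        return segmentsOverlap(first, Long.MAX_VALUE, queryLo, queryHi)
               || segmentsOverlap(Long.MIN_VALUE, last, queryLo, queryHi);
    }

    /** The swap-the-endpoints behaviour, kept only to show how it misjudges wrapping bounds. */
    boolean overlapsWithSwap(long queryLo, long queryHi)
    {
        return segmentsOverlap(Math.min(first, last), Math.max(first, last), queryLo, queryHi);
    }

    /** Throws ArithmeticException instead of truncating when the token does not fit in a long. */
    static long toLongExact(BigInteger token)
    {
        return token.longValueExact();
    }

    public static void main(String[] args)
    {
        WrapAwareBounds wrapping = new WrapAwareBounds(100L, -100L);
        System.out.println(wrapping.overlaps(200L, 300L));         // true: inside [100, MAX]
        System.out.println(wrapping.overlaps(0L, 50L));            // false: inside the gap
        System.out.println(wrapping.overlapsWithSwap(200L, 300L)); // false: covered range wrongly skipped

        BigInteger tooBig = BigInteger.valueOf(Long.MAX_VALUE).add(BigInteger.ONE);
        System.out.println(tooBig.longValue() == Long.MIN_VALUE);  // true: silent truncation
    }
}
```

In this sketch the swapped form reports no overlap for a query that lies entirely inside the covered band, which is the row-dropping case the commit message describes.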
jberragan
left a comment
Thanks for the patch, this is a great addition! I will take a closer look, but a few high level questions:
- Did you consider putting this in a separate module (e.g. cassandra-analytics-s3)? The Sidecar DataLayer should never have been put in cassandra-analytics-core and should eventually be moved to its own module.
- I suppose every user might have a slightly different backup path format; is this made pluggable by the BackupReader interface?
Wasn't aware that we wanted the Sidecar data layer to live outside of core; happy to discuss the appropriate modularization. An S3-specific module could work, or we could have a batch-read-specific module? The S3 implementation inherits a good amount of work from the Sidecar one.
yes, i didn't include the internal concrete implementation of this interface on this PR because it is tied to a vendor we use. But the goal is to make it pluggable per needs of different organizations, using the |
Patch information
Jira: https://issues.apache.org/jira/browse/CASSANALYTICS-42
CEP: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-56%3A+Spark+Bulk+Reading+from+Cassandra+Backup+Uploaded+to+Object+Storage
Summary
Adds an S3-backed Cassandra batch reader so Spark jobs can read SSTables directly from object-storage backups without going through a live cluster. Built on top of the existing CassandraDataLayer / bulk-reader abstractions and exposed through Spark SQL via a new S3CassandraDataSource.
This PR ships only the generic SPI and the S3 reference implementation. Concrete backup-provider implementations are out of scope and intended to live in downstream repos that plug into the BackupReader SPI introduced here.
What's included
- In cassandra-analytics-core/.../spark/data/backup/: BackupReader, BackupReaderConfig, BackupReaderFactory, BackupReaderRegistry. A pluggable abstraction so different backup providers can be wired in without changes to core.
- In cassandra-analytics-core: S3CassandraDataLayer with token-aware partitioning and SSTable selection over S3-resident backups, including SSTableTokenBounds pruning that correctly handles the Murmur3 wrap-around.
- S3ClientCache, S3ClientConfig, S3DataSourceClientConfig, S3SizingFactory, S3TableSizeProvider, and S3SnapshotTimeProvider (in cassandra-analytics-common).
- S3CassandraDataSource, S3CassandraPrebuiltReadContext(Registry), S3CassandraTokenIndexPrebuilder, plus refinements to CassandraScanBuilder, CassandraPartitioning, CassandraTable, and a new CassandraSourceStatistics.
- SparkCustomMetricsStats plus Spark CustomTaskMetric classes (TaskTotal* / Total*) for S3 GET/HEAD latency, summary read latency, skipped/corrupt SSTable counts, opened SSTable duration, mutable metadata drift, and head fallback counts. Threaded through BackupReader read paths via a per-task Stats argument.
- CassandraBridgeImplementation and SummaryDbUtils (both bridges) to thread the per-task Stats parameter; existing call sites get backwards-compatible overloads.
- VersionRunner is published via java-test-fixtures so the bridges can share it in tests.
Tests
All new tests live in the modules where the code lives (cassandra-analytics-core, bridges):
- S3ClientCacheTest, S3ClientConfigTest, S3DataSourceClientConfigTest, S3DataSourceClientConfigBufferTest
- S3SSTableLeakTests, SSTableTokenIndexTest
- BackupReaderFactorySerializationTest plus a FakeBackupReader test fixture exercising the SPI without a real S3 endpoint
- S3CassandraPrebuiltReadContextTest
- ReaderUtilsTests, SSTableReaderTests, SummaryDbTests for the new Stats-threaded signatures
The code is internally powering ~200 Spark pipelines in production for ingesting data from S3 backup into data lake, ranging from 10MB to 300TB table size.
Reviewer notes
The diff is large (~9.4k LOC added across ~80 files) because it introduces both the SPI and a complete S3 implementation plus the metrics surface. This is mostly to show the overall idea and actual shipping will likely come with smaller PRs if preferred.
Please also ignore my shadow jar build changes; those are artifacts of our internal build system, and we can discuss cleaning them up later.
Known gap currently deferred