Skip to content

CASSANALYTICS-128: Add flag to allow bulk write to indexed tables#181

Merged
jmckenzie-dev merged 1 commit into
apache:trunkfrom
jmckenzie-dev:allow_write_2i
Mar 13, 2026
Merged

CASSANALYTICS-128: Add flag to allow bulk write to indexed tables#181
jmckenzie-dev merged 1 commit into
apache:trunkfrom
jmckenzie-dev:allow_write_2i

Conversation

@jmckenzie-dev
Copy link
Copy Markdown
Contributor

Patch by Josh McKenzie; reviewed by TBD for CASSANALYTICS-128

Copy link
Copy Markdown
Contributor

@sarankk sarankk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Josh, Lgtm. 1 minor nit. Would it be useful to test writer with this flag in integration test too?

* with the understanding that the secondary indexes will NOT be updated by the bulk write and must be
* rebuilt separately after the job completes.
*/
SKIP_SECONDARY_INDEX_CHECK,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about defining it in the spark conf, instead of write options?

The rationale is that the toggle is to for advanced use case and not directly related to the write behavior. There is another existing spark conf, org.apache.cassandra.spark.bulkwriter.BulkSparkConf#SKIP_CLEAN that skips cleaning up SSTable when job fails.

Admittedly, there are some existing inconsistency of where to have conf and where to have write options in the project.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would putting this in BulkSparkConf force it session-wide? i.e. a user would lose the ability to reason about and easily configure this setting on a per-table / per operation basis vs. the public exposure of it via WriterOptions?

My intuition right now is that this is something that probably should have been enabled by default all this time w/a configurable guardrail to turn it off, so while I'm sympathetic to the idea of taking a small step from "don't allow it" to "allow it but make it hard to use", the risk of this being easily accessible for users is that they'll bulk write to a table that will then have a long running index building operation happen in the background. Which, other than "load on node" and "application might read a partial index if you don't have automation that clearly delineates when a bulk insert and reindex finish from application accessing it", doesn't represent a structural or correctness risk to the data.

Copy link
Copy Markdown
Contributor

@yifan-c yifan-c Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Josh explained to me offline too. I think it makes sense to have it as part of writer option. Especially in the scenario when the spark job has multiple write sessions, writer option allows finer control to toggle the flag on specific tables / write sessions.

@jmckenzie-dev
Copy link
Copy Markdown
Contributor Author

jmckenzie-dev commented Mar 11, 2026

Would it be useful to test writer with this flag in integration test too?

Eh, I didn't because ultimately it's a validation error inside the TableSchema construction so basically, if we can get past that point then we should be good to go.

The rest of the integration path would effectively be testing whether bulk import on cassandra with a table that has legacy SecondaryIndexes on it works correctly. Which, to be completely honest, I'm not super confident of (dug around a bit and ImportTest.java doesn't seem to have coverage). BUT - that's a core C* test coverage problem and not an analytics / sidecar problem.

So maybe a follow up JIRA to add test coverage in C* proper for legacy 2i on imports would be reasonable?

@jmckenzie-dev
Copy link
Copy Markdown
Contributor Author

Copy link
Copy Markdown
Contributor

@jyothsnakonisa jyothsnakonisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

this.ttl = MapUtils.getOrDefault(options, WriterOptions.TTL.name(), null);
this.timestamp = MapUtils.getOrDefault(options, WriterOptions.TIMESTAMP.name(), null);
this.quoteIdentifiers = MapUtils.getBoolean(options, WriterOptions.QUOTE_IDENTIFIERS.name(), false, "quote identifiers");
this.skipSecondaryIndexCheck = MapUtils.getBoolean(options, WriterOptions.SKIP_SECONDARY_INDEX_CHECK.name(), false, "skip secondary index check");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the default value be true?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe default value shouldn't be true, because user need to understand that rebuild need to happen after the job.
@jmckenzie-dev can we print a note/warning to the user mentioning rebuild needs to happen after the job, so they are aware of penalty of using this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call; added:

        if (!skipSecondaryIndexCheck)
        {
            validateNoSecondaryIndexes(tableInfo);
        }
        else if (tableInfo.hasSecondaryIndex())
        {
            LOGGER.warn("Bulk writing to tables with SecondaryIndexes will have an asynchronous index rebuild "
                      + "take place automatically after writing. Reads against the index during this time "
                      + "window will produce inconsistent or stale results until index rebuild is complete.");
        }

// process today. 2i and SAI have different ergonomics here regarding if stale data is served during index build;
// ultimately we want the bulk writer to also write native SAI index files alongside sstables but until
// then, this is allowable and fine for users who Know What They're Doing.
if (!skipSecondaryIndexCheck)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a test if possible that check this behavior, basically the if block

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not explicitly tested with the following?

    @ParameterizedTest
    @MethodSource("org.apache.cassandra.bridge.VersionRunner#supportedVersions")
    public void testSecondaryIndexAllowedWithSkipCheck(String cassandraVersion)
    {
        TableSchema schema = getValidSchemaBuilder(cassandraVersion)
                .withHasSecondaryIndex()
                .withSkipSecondaryIndexCheck()
                .build();
        assertThat(schema).isNotNull();
    }

As TableSchemaTestCommon.MockTableSchemaBuilder#build calls the TableSchema constructor with 2i enabled + the flag to allow, and testSecondaryIndexIsUnsupprted tests the other case.

Or am I misunderstanding the request?

* By default, bulk writes to tables with secondary indexes are rejected.
* Setting this option to {@code true} allows bulk writes to proceed on tables that have secondary indexes,
* with the understanding that the secondary indexes will NOT be updated by the bulk write and must be
* rebuilt separately after the job completes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does C* detect and rebuild them, or does the user need to run a command to trigger the rebuild?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C* automatically rebuilds them on import. Part of the SSTableImporter logic paths.

@skoppu22
Copy link
Copy Markdown
Contributor

LGTM

Patch by Josh McKenzie; reviewed by Jyothsna Konisa and Shailaja Koppu for CASSANALYTICS-128
@jmckenzie-dev jmckenzie-dev merged commit 978d969 into apache:trunk Mar 13, 2026
tejaslodayadd added a commit to tejaslodayadd/cassandra-analytics that referenced this pull request May 18, 2026
…oc and comments

Several source files contain non-ASCII characters in JavaDoc and inline
comments (micro sign, rightwards arrow, em-dash). javac compiled with
the US-ASCII default encoding rejects these with "unmappable character"
errors, which the JIRA reproduces.

Replace the offenders in src/main/ across the affected modules with
ASCII-equivalent JavaDoc:

- micro sign 'mu' in time-unit lists in CreateSnapshotRequest,
  RequestContext, and SidecarClient becomes the HTML entity
  &apache#181;s. The 'mu-s' symbol still renders in generated Javadoc HTML
  but the source is ASCII.
- rightwards arrow in AbstractBulkWriterContext and RecordWriter
  becomes ->.
- em-dash in KafkaPublisher becomes --.

Test files are intentionally left untouched: their use of the two-dot
leader character matches the actual output of Guava's
com.google.common.collect.Range#toString and changing it would break
the assertions.

Patch by Tejas Lodaya for CASSANALYTICS-140
tejaslodayadd added a commit to tejaslodayadd/cassandra-analytics that referenced this pull request May 19, 2026
…oc and comments

Several source files contain non-ASCII characters in JavaDoc and inline
comments (micro sign, rightwards arrow, em-dash). javac compiled with
the US-ASCII default encoding rejects these with "unmappable character"
errors, which the JIRA reproduces.

Replace the offenders in src/main/ across the affected modules with
ASCII-equivalent JavaDoc:

- micro sign 'mu' in time-unit lists in CreateSnapshotRequest,
  RequestContext, and SidecarClient becomes the HTML entity
  &apache#181;s. The 'mu-s' symbol still renders in generated Javadoc HTML
  but the source is ASCII.
- rightwards arrow in AbstractBulkWriterContext and RecordWriter
  becomes ->.
- em-dash in KafkaPublisher becomes --.

Test files are intentionally left untouched: their use of the two-dot
leader character matches the actual output of Guava's
com.google.common.collect.Range#toString and changing it would break
the assertions.

Patch by Tejas Lodaya for CASSANALYTICS-140
frankgh pushed a commit that referenced this pull request May 19, 2026
…oc and comments (#210)

Several source files contain non-ASCII characters in JavaDoc and inline
comments (micro sign, rightwards arrow, em-dash). javac compiled with
the US-ASCII default encoding rejects these with "unmappable character"
errors, which the JIRA reproduces.

Replace the offenders in src/main/ across the affected modules with
ASCII-equivalent JavaDoc:

- micro sign 'mu' in time-unit lists in CreateSnapshotRequest,
  RequestContext, and SidecarClient becomes the HTML entity
  µs. The 'mu-s' symbol still renders in generated Javadoc HTML
  but the source is ASCII.
- rightwards arrow in AbstractBulkWriterContext and RecordWriter
  becomes ->.
- em-dash in KafkaPublisher becomes --.

Test files are intentionally left untouched: their use of the two-dot
leader character matches the actual output of Guava's
com.google.common.collect.Range#toString and changing it would break
the assertions.

Patch by Tejas Lodaya; reviewed by Saranya Krishnakumar, Francisco Guerrero for CASSANALYTICS-140
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants