49 commits
3d9735c
Add AcknowledgingRecordSupplier interface and AcknowledgeType enum fo…
Shekharrajak Apr 14, 2026
8351a82
Add ShareGroupIndexTask and ShareGroupIndexTaskIOConfig for Kafka sha…
Shekharrajak Apr 14, 2026
ac827e7
Add KafkaShareGroupRecordSupplier wrapping KafkaShareConsumer with ex…
Shekharrajak Apr 14, 2026
da77a47
Add ShareGroupIndexTaskRunner with poll-parse-publish-ack ingestion loop
Shekharrajak Apr 14, 2026
10fa462
Register ShareGroupIndexTask in KafkaIndexTaskModule and bump Kafka c…
Shekharrajak Apr 14, 2026
24f1b19
Add unit tests for ShareGroupIndexTaskIOConfig, KafkaShareGroupRecord…
Shekharrajak Apr 14, 2026
6b4a40e
Add E2E test for share group ingestion with embedded Druid cluster an…
Shekharrajak Apr 14, 2026
3187c96
Add share group ingestion documentation with usage guide and examples
Shekharrajak Apr 14, 2026
6ffbfcf
Add ShareGroupKafkaResource with share group broker config for E2E tests
Shekharrajak Apr 14, 2026
57fbdfd
Fix IOConfig test: add type field to JSON deserialization test
Shekharrajak Apr 14, 2026
cb8c96d
Fix unit tests: add TimestampSpec to DataSchema, use multi-topic Kafk…
Shekharrajak Apr 14, 2026
6593b57
Add end-to-end demo runbook for share group ingestion with Druid UI
Shekharrajak Apr 15, 2026
1e13898
Implement PendingSegmentAllocatingTask for segment allocation compat
Shekharrajak Apr 15, 2026
f0aecb1
Enable APPEND lock type for share group task to support concurrent se…
Shekharrajak Apr 15, 2026
5703d98
Set share.acknowledgement.mode=explicit for ShareConsumer to enable e…
Shekharrajak Apr 15, 2026
928c0e5
Add RENEW ack type, wakeup() and acquisitionLockTimeoutMs() to Acknow…
Shekharrajak May 7, 2026
0a52999
Promote StreamChunkReader to public and widen parse() with PECS for c…
Shekharrajak May 7, 2026
9cc61a7
Add ShareGroupConsumerProperties.sanitize() to strip Kafka 4.2.0 shar…
Shekharrajak May 7, 2026
7b9ea57
Use clean acknowledge overload, add RENEW/wakeup/lockTimeout/sanitize…
Shekharrajak May 7, 2026
0f9bb3c
Fix multi-row data loss, add DIP factory, push/persist, try-finally, …
Shekharrajak May 7, 2026
5b33214
Add embedded IT validating multi-row JSON-array ingestion via share g…
Shekharrajak May 7, 2026
1413d39
Document consumer-property restrictions, graceful stop, lock duration…
Shekharrajak May 7, 2026
431e192
Fix share-consumer ack to use cached ConsumerRecord reference
Shekharrajak May 8, 2026
436f818
Register TaskRealtimeMetricsMonitor in share-group runner
Shekharrajak May 8, 2026
31579e6
Tighten share-group task and consumer-property javadocs
Shekharrajak May 8, 2026
7d7bd13
Tighten seekablestream common javadocs
Shekharrajak May 8, 2026
fd09eca
Update share-group unit tests for new ack contract
Shekharrajak May 8, 2026
bc61817
Move share-group ITs to embedded-tests and add probe test
Shekharrajak May 8, 2026
210cd04
Pin kafka-clients 4.2.0 in embedded-tests
Shekharrajak May 8, 2026
2356d65
Update docs for embedded-tests share-group IT location
Shekharrajak May 8, 2026
d2e94de
Trim share-group code comments to essentials
Shekharrajak May 8, 2026
446e004
Tighten share-group docs and fix spellcheck
Shekharrajak May 8, 2026
461b50b
Wait for segments before SQL in share-group IT
Shekharrajak May 8, 2026
1226944
Retry SQL row count to absorb broker catalog refresh lag
Shekharrajak May 8, 2026
52dfecc
fix: checkstyle violations in share-group sources and tests
Shekharrajak May 8, 2026
0b53ae7
fix: forbidden-API violations in share-group code
Shekharrajak May 8, 2026
d2882c0
ShareGroupIndexTask: use realtime task priority (75)
Shekharrajak May 9, 2026
1a8bed2
chore(test): use ${apache.kafka.version} for kafka-clients in embedde…
Shekharrajak May 13, 2026
244ffde
docs: drop Phase 1/Phase 2 phrasing in share-group ingestion doc
Shekharrajak May 13, 2026
6a25c78
docs: drop remaining Phase 1/Phase 2 phrasing in tuningConfig section
Shekharrajak May 13, 2026
5231b7c
docs: drop Phase 1 phrasing in tuningConfig intro
Shekharrajak May 13, 2026
40d5ed7
docs(share-group): replace download steps with source build for the demo
Shekharrajak May 13, 2026
626b468
docs(share-group): offer source-build (recommended) and JAR-overlay o…
Shekharrajak May 14, 2026
f182aae
fix(checkstyle): drop unused mockito imports and move static block to…
Shekharrajak May 14, 2026
a25b7e7
fix(forbidden-api): pass StandardCharsets.UTF_8 to String.getBytes in…
Shekharrajak May 14, 2026
539f3c8
fix(share-group): re-sanitise consumer props after dynamic-config pro…
Shekharrajak May 14, 2026
8cbbd70
docs(share-group): set share.auto.offset.reset=earliest before produc…
Shekharrajak May 14, 2026
bff1344
docs(share-group): backtick ABI and 37.x to satisfy mdspell
Shekharrajak May 14, 2026
2d094d5
test(share-group): align assertions with Optional topic and non-null …
Shekharrajak May 14, 2026
350 changes: 350 additions & 0 deletions docs/ingestion/kafka-share-group-ingestion.md
@@ -0,0 +1,350 @@
---
id: kafka-share-group-ingestion
title: "Kafka share group ingestion"
sidebar_label: "Kafka share group ingestion"
description: "Queue-semantics ingestion from Apache Kafka using share groups (KIP-932). Scale consumers beyond partition count with at-least-once delivery."
---

<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->

:::info
Requires Apache Kafka 4.0 or higher with share groups (KIP-932) enabled on the broker.
:::

## Overview

Kafka share groups (KIP-932) let multiple consumers read from the same partition concurrently. The broker manages per-record acquisition locks and explicit acknowledgement, so consumer count is not capped by partition count, joining or leaving consumers does not pause the group, and a slow record does not block its partition.

Druid's `ShareGroupIndexTask` consumes from a share group and publishes segments with at-least-once delivery: records are acknowledged only after their segments are atomically registered in the metadata store.

## When to use share group ingestion

| Scenario | Consumer group | Share group |
|----------|---------------|-------------|
| Workers needed exceed partition count | Idle workers | All workers active |
| Elastic scaling (auto-scale events) | Rebalancing pause (30-60s) | Zero pause |
| Per-message processing time varies | Head-of-line blocking | Independent processing |
| Ordered processing required per partition | Yes | No (delivery order not guaranteed) |

Choose share groups when throughput and elastic scaling matter more than strict per-partition ordering.

## Task spec

Submit a `ShareGroupIndexTask` to the Overlord. There are no start/end offsets -- the broker tracks them.

```json
{
  "type": "index_kafka_share_group",
  "dataSchema": {
    "dataSource": "my_datasource",
    "timestampSpec": {
      "column": "__time",
      "format": "auto"
    },
    "dimensionsSpec": {
      "useSchemaDiscovery": true
    },
    "granularitySpec": {
      "segmentGranularity": "DAY",
      "queryGranularity": "NONE"
    }
  },
  "ioConfig": {
    "type": "kafka_share_group",
    "topic": "my_topic",
    "groupId": "druid-share-group",
    "consumerProperties": {
      "bootstrap.servers": "kafka-broker:9092"
    },
    "inputFormat": {
      "type": "json"
    },
    "pollTimeout": 2000
  },
  "tuningConfig": {
    "type": "KafkaTuningConfig",
    "maxRowsPerSegment": 5000000
  }
}
```

## IO configuration

| Property | Type | Required | Default | Description |
|----------|------|----------|---------|-------------|
| `topic` | String | Yes | -- | Kafka topic to consume from. |
| `groupId` | String | Yes | -- | Share group identifier. Multiple tasks with the same `groupId` share the workload. |
| `consumerProperties` | Map | Yes | -- | Kafka consumer properties. Must include `bootstrap.servers`. See [Consumer property restrictions](#consumer-property-restrictions). |
| `inputFormat` | Object | Yes | -- | Input format for parsing records (json, csv, avro, etc.). |
| `pollTimeout` | Long | No | 2000 | Poll timeout in milliseconds. |

### Consumer property restrictions

Share consumers (KIP-932) reject some configuration keys that are valid for regular consumer groups. Druid strips the keys below from `consumerProperties` (logging a `WARN` per stripped key) before constructing the `KafkaShareConsumer`:

| Stripped key | Why |
|--------------|-----|
| `auto.offset.reset` | Initial position is broker-controlled for share groups. |
| `enable.auto.commit` | Share consumers always require explicit `acknowledge()` + `commitSync()`. |
| `group.instance.id` | Share groups do not support static membership. |
| `isolation.level` | Always read-committed for share groups. |
| `partition.assignment.strategy` | Broker controls per-record delivery for share groups. |
| `interceptor.classes` | Not supported for share consumers. |
| `session.timeout.ms` | Share groups have no consumer-group session model. |
| `heartbeat.interval.ms` | Share groups have no heartbeat. |
| `group.protocol` | Always `SHARE` for share consumers. |
| `group.remote.assignor` | Not applicable to share groups. |

`share.acknowledgement.mode=explicit` is set automatically and must not be overridden.
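
The stripping behavior can be sketched as a pure map filter. This is an illustrative sketch, not the actual `ShareGroupConsumerProperties.sanitize()` implementation; the class name `SanitizeSketch` is invented for the example:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public final class SanitizeSketch
{
  // Keys rejected by KafkaShareConsumer; mirrors the table above.
  private static final Set<String> DISALLOWED = Set.of(
      "auto.offset.reset",
      "enable.auto.commit",
      "group.instance.id",
      "isolation.level",
      "partition.assignment.strategy",
      "interceptor.classes",
      "session.timeout.ms",
      "heartbeat.interval.ms",
      "group.protocol",
      "group.remote.assignor"
  );

  /** Drops disallowed keys and forces explicit acknowledgement mode. */
  public static Map<String, Object> sanitize(Map<String, Object> props)
  {
    Map<String, Object> out = new HashMap<>();
    for (Map.Entry<String, Object> e : props.entrySet()) {
      if (!DISALLOWED.contains(e.getKey())) {
        out.put(e.getKey(), e.getValue());
      }
    }
    // Set automatically by Druid; user-supplied overrides are replaced.
    out.put("share.acknowledgement.mode", "explicit");
    return out;
  }
}
```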

### Tuning configuration

`tuningConfig` accepts the standard `KafkaTuningConfig` fields. The runner currently honors:

- `maxRowsInMemory` / `maxBytesInMemory`: trigger a mid-batch persist when the appenderator signals `isPersistRequired`.
- `maxRowsPerSegment`: when reached during a batch, the runner logs the event; over-threshold segments are pushed at the end-of-batch publish boundary.

Mid-batch checkpoint and sequence rollover are not supported.

## How it works

1. The task subscribes to the topic with a `KafkaShareConsumer` using the configured `groupId`.
2. The broker delivers batches of records with per-record acquisition locks.
3. Each polled record is parsed by `StreamChunkReader` (the same multi-row parser as `KafkaIndexTask`); a record may produce zero, one, or many `InputRow`s. All resulting rows are added to the appenderator before the record is acknowledged.
4. Parse failures go through `ParseExceptionHandler` (so `maxParseExceptions` is honored). Bytes/processed/unparseable counters are incremented exactly once per row.
5. Segments persist mid-batch on memory pressure and unconditionally at end-of-batch, then publish atomically via `SegmentTransactionalAppendAction`.
6. After a successful publish, every offset in the batch is acknowledged with `ACCEPT` and a `commitSync()` flushes acknowledgements to the broker.
7. On task failure or graceful stop before publish, unacknowledged records are redelivered by the broker after the acquisition lock expires.
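
The ordering in steps 3 through 6 is the crux: every row lands in the appenderator and the publish succeeds before any record is acknowledged. A minimal sketch of that control flow, using stand-in interfaces (`Publisher`, `Acker`, and `processBatch` are placeholders for illustration, not Druid or Kafka API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public final class AckAfterPublishSketch
{
  // Stand-ins for the real Appenderator / KafkaShareConsumer machinery.
  interface Publisher { void publish(List<String> rows); }       // atomic segment publish
  interface Acker { void accept(long offset); void commit(); }   // acknowledge + commitSync

  /**
   * One batch: parse every record into rows, add all rows, publish
   * atomically, and only then ACCEPT and commit the offsets.
   */
  public static void processBatch(
      List<String> records,                        // raw payloads; index stands in for offset
      Function<String, List<String>> parser,       // one record may yield zero, one, or many rows
      Publisher publisher,
      Acker acker
  )
  {
    List<String> rows = new ArrayList<>();
    for (String record : records) {
      rows.addAll(parser.apply(record));           // all rows added before any ack
    }
    publisher.publish(rows);                       // throws on failure -> nothing is acked
    for (long offset = 0; offset < records.size(); offset++) {
      acker.accept(offset);                        // ACCEPT only after publish succeeded
    }
    acker.commit();                                // flush acknowledgements to the broker
  }
}
```

If `publish` throws, the loop never acknowledges, and the broker redelivers the batch after the acquisition lock expires, which is exactly the at-least-once contract described above.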

## Safety invariants

1. **ACK after publish:** `ACCEPT` is sent only after the segment is registered in the metadata store. No data loss on task failure.
2. **Multi-row safe:** every row produced from a record is added to the appenderator before that record is acknowledged.
3. **Resource safe:** `Appenderator` and `KafkaShareConsumer` are released on every exit path.
4. **Terminal state:** every polled record reaches exactly one terminal state -- `ACCEPT`, `RELEASE`, or broker redelivery after lock expiry.

## Graceful stop

When the Overlord asks a task to stop, the runner calls `KafkaShareConsumer.wakeup()`. The in-flight `poll()` throws `WakeupException`; the runner exits the loop after committing any in-flight batch. Records polled but not yet published remain unacknowledged and are redelivered by the broker after the acquisition lock expires.

## Acquisition lock duration

The broker controls the lock via `group.share.record.lock.duration.ms`. The runner logs the effective value once after the first poll:

```
Effective broker acquisition lock timeout for share-group[my-group]: 30000 ms
```

A single thread does both poll and publish. If a batch exceeds the lock duration, in-flight records may be redelivered (duplicates). Tune `pollTimeout`, `maxRowsInMemory`, and `maxRowsPerSegment` so each cycle stays well under the lock window.
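
A rough cycle-time budget makes the tuning concrete. The numbers below are illustrative, not measured Druid figures, and the helper names are invented for the sketch:

```java
public final class LockBudgetSketch
{
  /** Conservative cycle estimate: poll wait + per-row processing + publish overhead. */
  public static long estimateCycleMs(
      long pollTimeoutMs, long rowsPerBatch, double msPerRow, long publishOverheadMs)
  {
    return pollTimeoutMs + (long) (rowsPerBatch * msPerRow) + publishOverheadMs;
  }

  /** Require 50% headroom so jitter is unlikely to push the cycle past the lock. */
  public static boolean fitsLockWindow(long cycleMs, long lockDurationMs)
  {
    return cycleMs * 2 <= lockDurationMs;
  }
}
```

For example, with `pollTimeout` of 2000 ms, 100,000 rows per batch at an assumed 0.05 ms/row, and 1000 ms of publish overhead, the estimated cycle is 8000 ms, which fits comfortably inside a 30,000 ms lock window with headroom to spare.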

## Scaling

Tasks with the same `groupId` share the workload automatically; you can run more tasks than partitions:

```
Topic: 4 partitions
Tasks with same groupId: 20
Result: All 20 tasks actively consuming (broker distributes records)
```

Adding or removing tasks does not trigger a rebalancing pause.

## Delivery semantics

At-least-once. On task failure, records between the last committed acknowledgement and the failure point are redelivered, which may produce duplicates across restarts. A deduplication cache is planned.
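
Until that cache lands, consumers of the datasource must tolerate duplicates or deduplicate downstream. One possible shape for such a cache, keyed by partition and offset, is sketched below; this is hypothetical and not part of this patch:

```java
import java.util.HashSet;
import java.util.Set;

public final class DedupSketch
{
  // Seen (partition, offset) pairs; a real cache would bound this (e.g. LRU or TTL).
  private final Set<Long> seen = new HashSet<>();

  /** Returns true exactly once per (partition, offset); redeliveries return false. */
  public boolean firstDelivery(int partition, long offset)
  {
    // Pack the pair into one long; assumes offsets stay below 2^40.
    return seen.add(((long) partition << 40) | offset);
  }
}
```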

## Metrics

In addition to the standard ingestion metrics (`ingest/events/processed`, `ingest/events/unparseable`, `ingest/persists/count`, etc.), share-group ingestion emits:

| Metric | Description |
|--------|-------------|
| `ingest/shareGroup/commitFailures` | Per-batch count of partitions whose `commitSync()` failed. A non-zero value means the affected records will be redelivered; alert on sustained non-zero values. |

## Limitations (current release)

- Single-threaded ingestion per task; a future enhancement may add a background `RENEW` thread to extend the broker lock for long-running batches.
- No supervisor integration; tasks are submitted manually via the Overlord API. A `KafkaShareGroupSupervisor` is planned as a future enhancement.
- No deduplication cache (at-least-once).
- Delivery order within a partition is not guaranteed.
- Mid-batch checkpoint / sequence rollover is not supported. If a batch grossly exceeds `maxRowsPerSegment` the runner still publishes correctly (multiple segments per batch), but the threshold is only checked at end-of-batch boundaries.

## Demo: end-to-end validation with Druid UI

### Prerequisites

- Java 17
- Kafka 4.2.0 (with share groups enabled)
- Druid checked out from this repository (built from source)

### Step 1: Start Kafka with share groups

```bash
cd kafka_2.13-4.2.0

KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format --standalone -t "$KAFKA_CLUSTER_ID" -c config/server.properties

echo "group.share.enable=true" >> config/server.properties
echo "group.share.record.lock.duration.ms=30000" >> config/server.properties

bin/kafka-server-start.sh config/server.properties
```

### Step 2: Create topic and configure the share group

```bash
cd kafka_2.13-4.2.0

bin/kafka-topics.sh --create --topic druid-share-test --partitions 4 --bootstrap-server localhost:9092

# Set share-group reset to earliest so the task picks up records that already exist
# in the topic. The default broker setting is 'latest', which would skip pre-existing
# records and ingest zero rows even though the producer ran successfully.
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type groups --entity-name druid-demo-share-group \
  --add-config share.auto.offset.reset=earliest
```

### Step 3: Produce sample messages

```bash
cd kafka_2.13-4.2.0

bin/kafka-console-producer.sh --topic druid-share-test --bootstrap-server localhost:9092
```

Paste these JSON records:

```json
{"__time":"2025-06-01T00:00:00.000Z","item":"widget_a","value":100,"category":"electronics"}
{"__time":"2025-06-01T01:00:00.000Z","item":"widget_b","value":250,"category":"clothing"}
{"__time":"2025-06-01T02:00:00.000Z","item":"widget_c","value":50,"category":"electronics"}
{"__time":"2025-06-01T03:00:00.000Z","item":"widget_d","value":175,"category":"food"}
{"__time":"2025-06-01T04:00:00.000Z","item":"widget_e","value":320,"category":"electronics"}
```

### Step 4: Build Druid and run it

You can run the demo against either a freshly built Druid distribution or an existing stable Druid binary with the share-group JARs overlaid. Pick the one that matches your environment.

#### Option A: Build the full Druid distribution from source (recommended)

Builds the full distribution from this repository so the share-group code is packaged natively, with no JAR overlay required:

```bash
cd /path/to/druid
JAVA_HOME=$(/usr/libexec/java_home -v 17) \
  mvn clean install -Pdist -T1C -DskipTests \
  -Dforbiddenapis.skip=true -Dcheckstyle.skip=true \
  -Dpmd.skip=true -Dmaven.javadoc.skip=true -Denforcer.skip=true

tar -xzf distribution/target/apache-druid-*-bin.tar.gz -C /tmp
cd /tmp/apache-druid-*

bin/start-druid
```

> Tip: For faster iteration, build only the `kafka-indexing-service` module with `mvn package -pl extensions-core/kafka-indexing-service -am -DskipTests -T1C` and overlay the resulting JAR onto the distribution from a previous full build (Option B steps below).

#### Option B: Overlay the share-group JAR onto a downloaded Druid binary (faster, best-effort)

If you already have a Druid release binary and want to avoid a full source build, you can replace the kafka-indexing-service JAR in that distribution with the one built from this branch.

> Caveat: The branch builds against `38.0.0-SNAPSHOT`. Druid does **not** guarantee extension `ABI` compatibility across major versions, so the overlay may fail at runtime against an older binary (`37.x` or earlier). Use the most recent stable Druid release available, and prefer Option A for a reliable demo.

```bash
cd /path/to/druid
JAVA_HOME=$(/usr/libexec/java_home -v 17) mvn package \
  -pl extensions-core/kafka-indexing-service -am \
  -Pskip-static-checks -DskipTests -T1C -q

# Use the latest stable Druid release available; 37.0.0 is the example below.
DRUID_VERSION=37.0.0
cd /tmp
curl -O "https://dlcdn.apache.org/druid/${DRUID_VERSION}/apache-druid-${DRUID_VERSION}-bin.tar.gz"
tar -xzf "apache-druid-${DRUID_VERSION}-bin.tar.gz"
cd "apache-druid-${DRUID_VERSION}"

rm extensions/druid-kafka-indexing-service/*.jar
cp /path/to/druid/extensions-core/kafka-indexing-service/target/druid-kafka-indexing-service-*.jar \
  extensions/druid-kafka-indexing-service/
cp ~/.m2/repository/org/apache/kafka/kafka-clients/4.2.0/kafka-clients-4.2.0.jar \
  extensions/druid-kafka-indexing-service/

bin/start-druid
```

### Step 5: Submit task via Druid console

Open `http://localhost:8888`, go to the **Ingestion** tab, click **Submit JSON task**, and paste:

```json
{
  "type": "index_kafka_share_group",
  "dataSchema": {
    "dataSource": "share_group_demo",
    "timestampSpec": {"column": "__time", "format": "auto"},
    "dimensionsSpec": {"useSchemaDiscovery": true},
    "granularitySpec": {"segmentGranularity": "DAY", "queryGranularity": "NONE"}
  },
  "ioConfig": {
    "type": "kafka_share_group",
    "topic": "druid-share-test",
    "groupId": "druid-demo-share-group",
    "consumerProperties": {"bootstrap.servers": "localhost:9092"},
    "inputFormat": {"type": "json"},
    "pollTimeout": 2000
  },
  "tuningConfig": {"type": "KafkaTuningConfig"}
}
```

### Step 6: Query data

Go to the **Query** tab and run:

```sql
SELECT COUNT(*) AS total_rows FROM share_group_demo;
SELECT category, COUNT(*) AS cnt, SUM(value) AS total FROM share_group_demo GROUP BY category;
```

## Running tests

Unit tests:

```bash
mvn test -pl extensions-core/kafka-indexing-service \
-Dtest="org.apache.druid.indexing.kafka.ShareGroupIndexTaskIOConfigTest,\
org.apache.druid.indexing.kafka.KafkaShareGroupRecordSupplierTest,\
org.apache.druid.indexing.kafka.ShareGroupIndexTaskTest,\
org.apache.druid.indexing.kafka.ShareGroupIndexTaskRunnerTest,\
org.apache.druid.indexing.kafka.ShareGroupConsumerPropertiesTest" \
-Dsurefire.failIfNoSpecifiedTests=false \
-Pskip-static-checks -Dweb.console.skip=true -T1C
```

E2E test (requires Docker; Testcontainers starts an `apache/kafka:4.1.1` broker with `group.share.enable=true`):

```bash
mvn test -pl embedded-tests -am \
  -Dtest="org.apache.druid.testing.embedded.indexing.EmbeddedShareGroupIngestionTest" \
  -Dsurefire.failIfNoSpecifiedTests=false \
  -Pskip-static-checks -Dweb.console.skip=true -T1C
```
17 changes: 11 additions & 6 deletions embedded-tests/pom.xml
@@ -40,6 +40,17 @@
</parent>

<dependencies>
    <!-- Declared first so its classes win on the test classpath over older
         kafka classes shaded into druid-protobuf-extensions (Confluent
         transitive). Required for share-group APIs (KIP-932) such as
         ConfigResource.Type.GROUP. -->
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>${apache.kafka.version}</version>
      <scope>test</scope>
    </dependency>

<!-- Test dependencies -->
<dependency>
<groupId>org.apache.druid</groupId>
@@ -532,12 +543,6 @@
<artifactId>commons-codec</artifactId>
<scope>test</scope>
</dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>${apache.kafka.version}</version>
      <scope>test</scope>
    </dependency>
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>testcontainers</artifactId>