feat(connectors): Delta Lake Sink Connector #2889
kriti-sc wants to merge 11 commits into apache:master from
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@             Coverage Diff              @@
##             master    #2889      +/-   ##
============================================
- Coverage     72.17%   72.02%    -0.15%
  Complexity      930      930
============================================
  Files          1122     1124        +2
  Lines         93502    93430       -72
  Branches      70851    70789       -62
============================================
- Hits          67488    67297      -191
- Misses        23447    23563      +116
- Partials       2567     2570        +3
```
```rust
})?;

// Flush buffers to object store and commit to Delta log
let version = match state.writer.flush_and_commit(&mut state.table).await {
```
Every consume() call does flush_and_commit(), which creates a new parquet file and a new JSON transaction log entry in _delta_log/. At high throughput with, say, 1000-message batches at 1M ops/sec, that's 1000 parquet files and 1000 log entries per second. Delta Lake degrades catastrophically under this: metadata parsing slows, checkpoint overhead grows, and cloud object store LIST calls become a bottleneck. The sink needs a buffering strategy: accumulate across multiple consume() calls and flush on a configurable row count, byte threshold, or time interval, not per batch.
There are two knobs, poll_interval and batch_size, to control Delta write frequency and file size. Does this address your concern, or are you thinking of a different issue, @hubcio?
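The buffering strategy the reviewer describes can be sketched as a small threshold policy checked on each consume() call; only when a row-count, byte-size, or age threshold trips does the sink call flush_and_commit(). The names `FlushPolicy` and `should_flush` are illustrative, not from this PR:

```rust
use std::time::Duration;

/// Hypothetical flush policy: accumulate rows across multiple consume()
/// calls and flush only once any configured threshold is crossed.
pub struct FlushPolicy {
    pub max_rows: usize,
    pub max_bytes: usize,
    pub max_age: Duration,
}

impl FlushPolicy {
    /// True when the buffered rows, bytes, or buffer age exceed a limit.
    pub fn should_flush(&self, rows: usize, bytes: usize, age: Duration) -> bool {
        rows >= self.max_rows || bytes >= self.max_bytes || age >= self.max_age
    }
}
```

With, e.g., max_rows = 100_000 and max_age = 60s, a 1M ops/sec workload would produce on the order of ten commits per second instead of a thousand, keeping _delta_log growth and checkpoint overhead manageable.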
```rust
let version = match state.writer.flush_and_commit(&mut state.table).await {
    Ok(v) => v,
    Err(e) => {
        state.writer.reset();
```
When write() succeeds (line 133) but flush_and_commit() fails here, reset() clears the internal parquet buffer, and those messages are permanently lost with no retry path. Since the connector runtime uses AutoCommitWhen::PollingMessages (offset committed before consume), the consumer offset may already have been advanced past these messages. There's no retry, no DLQ, no metric: the messages just vanish.
Agreed. Was planning to propose a DLQ strategy across sinks and sources. Is it ok to defer this concern for later? @hubcio
Additionally, added a TODO for implementing a retry strategy and metrics.
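A minimal sketch of the retry-then-DLQ shape being deferred here: bound the commit retries, and only after they are exhausted divert the payload to a dead-letter store instead of dropping it. All names (`Outcome`, `commit_with_retry`, the Vec-backed DLQ) are hypothetical stand-ins, not this PR's API:

```rust
/// Hypothetical commit result: either a Delta table version, or the
/// payload was parked in the DLQ after retries were exhausted.
#[derive(Debug)]
pub enum Outcome {
    Committed(u64),
    DeadLettered,
}

/// Try the commit up to 1 + max_retries times; on total failure, push the
/// payload to a dead-letter buffer rather than silently discarding it.
pub fn commit_with_retry<F>(
    mut attempt: F,
    max_retries: u32,
    dlq: &mut Vec<Vec<u8>>,
    payload: Vec<u8>,
) -> Outcome
where
    F: FnMut() -> Result<u64, String>,
{
    for _ in 0..=max_retries {
        if let Ok(version) = attempt() {
            return Outcome::Committed(version);
        }
        // A real implementation would back off and emit a retry metric here.
    }
    dlq.push(payload); // last resort: never lose the data
    Outcome::DeadLettered
}
```

In a production sink the DLQ would be a durable store (e.g. a dedicated stream or bucket prefix), and the offset would only be committed after one of the two outcomes is reached.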
```rust
StructField::new("count", DataType::Primitive(PrimitiveType::Integer), true),
StructField::new("amount", DataType::Primitive(PrimitiveType::Double), true),
StructField::new("active", DataType::Primitive(PrimitiveType::Boolean), true),
StructField::new("timestamp", DataType::Primitive(PrimitiveType::Long), true),
```
The test fixture declares timestamp as PrimitiveType::Long, not PrimitiveType::Timestamp, so all integration tests bypass the coercion logic entirely. An e2e test with a Timestamp-typed column would catch issues like the TimestampNtz gap and the null-to-"null" coercion bug at the integration level.
Agree on fixing the schema type. Timestamp better reflects real usage.
Disagree on the second part. Catching specific bugs like the TimestampNtz gap or null coercion is the unit test's responsibility. My rubric with the integration test is to verify the happy path end-to-end.
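The null-coercion bug mentioned above is the classic pitfall of stringifying a field before writing it: a null must become an absent column value, never the literal string "null". A sketch with an illustrative stand-in type (`FieldValue` is hypothetical, not this PR's type):

```rust
/// Hypothetical field value, standing in for whatever the sink decodes
/// from an incoming message.
#[derive(Debug, PartialEq)]
pub enum FieldValue {
    Null,
    Text(String),
}

/// Coerce a field into an optional string column value. The buggy variant
/// would format the value first, turning Null into Some("null".to_string()).
pub fn coerce_to_string(v: &FieldValue) -> Option<String> {
    match v {
        FieldValue::Null => None, // absent value, not the string "null"
        FieldValue::Text(s) => Some(s.clone()),
    }
}
```

A unit test asserting `coerce_to_string(&FieldValue::Null) == None` pins the behavior, which matches the division of labor described above: unit tests for coercion edge cases, the integration test for the happy path.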
Which issue does this PR close?
Closes #1852
Rationale
Delta Lake is an open-source storage layer that brings ACID transactions to data lakes, and it is very popular in modern streaming analytics architectures.
What changed?
Introduces a Delta Lake Sink Connector that enables writing data from Iggy to Delta Lake.
The Delta Lake writing logic is heavily inspired by the kafka-delta-ingest project, providing a proven starting point for writing to Delta Lake.
Local Execution
Executed locally using a sample data producer with the record schema `user_id: String, user_type: u8, email: String, source: String, state: String, created_at: DateTime<Utc>, message: String`.
AI Usage
If AI tools were used, please answer: