Feat: Faster per-record processing #301
Walkthrough
The recent changes focus on refactoring the code to utilize a new …
Sequence Diagram(s)
sequenceDiagram
participant Client
participant RecordProcessor
participant StreamRecordHandler
participant StreamRecord
Client->>RecordProcessor: Call process_record_message()
RecordProcessor->>StreamRecordHandler: Initialize handler with schema
RecordProcessor->>StreamRecord: Create StreamRecord with handler
StreamRecord->>StreamRecordHandler: Validate and process record
StreamRecordHandler-->>StreamRecord: Processed data
StreamRecord-->>RecordProcessor: Processed StreamRecord
RecordProcessor-->>Client: Return processed record
sequenceDiagram
participant TestSuite
participant TestCase
participant StreamRecordHandler
participant StreamRecord
TestSuite->>TestCase: Run test cases
TestCase->>StreamRecordHandler: Initialize handler with test schema
TestCase->>StreamRecord: Create StreamRecord with handler
StreamRecord->>StreamRecordHandler: Validate and process test record
StreamRecordHandler-->>StreamRecord: Processed test data
StreamRecord-->>TestCase: Return processed test record
TestCase-->>TestSuite: Test pass/fail result
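As a rough illustration of the flow in the diagrams above, the sketch below builds a per-stream handler once from a JSON schema and then reuses it for every record. The class names `HandlerSketch` and `RecordSketch`, the lower-casing normalizer, and the `normalize_keys` behavior shown here are assumptions made for illustration; they are not the actual `StreamRecordHandler` or `StreamRecord` implementations in `airbyte/records.py`.

```python
from typing import Any


class HandlerSketch:
    """Hypothetical stand-in for a per-stream handler (not the real StreamRecordHandler)."""

    def __init__(self, json_schema: dict[str, Any], *, normalize_keys: bool = True) -> None:
        self.normalize_keys = normalize_keys
        # Stream-level work, done once: map each case-insensitive index key
        # to the key that records will use for storage.
        self._index_to_key: dict[str, str] = {}
        for name in json_schema.get("properties", {}):
            self._index_to_key[name.lower()] = name.lower() if normalize_keys else name

    def to_display_key(self, key: str) -> str:
        """Record-level work: a dictionary lookup, with a fallback for unknown keys."""
        index = key.lower()
        if index not in self._index_to_key:
            # Cache keys not declared in the schema so later lookups take the fast path.
            self._index_to_key[index] = index if self.normalize_keys else key
        return self._index_to_key[index]


class RecordSketch(dict):
    """Hypothetical record type that delegates key handling to a shared handler."""

    def __init__(self, data: dict[str, Any], handler: HandlerSketch) -> None:
        super().__init__({handler.to_display_key(k): v for k, v in data.items()})
        self._handler = handler

    def __getitem__(self, key: str) -> Any:
        return super().__getitem__(self._handler.to_display_key(key))

    def __contains__(self, key: object) -> bool:
        return isinstance(key, str) and super().__contains__(self._handler.to_display_key(key))


# The handler is built once per stream, then shared by every record in that stream.
schema = {"properties": {"Id": {"type": "integer"}, "Name": {"type": "string"}}}
handler = HandlerSketch(schema)
raw_records = [{"Id": 1, "Name": "a"}, {"ID": 2, "NAME": "b"}]
records = [RecordSketch(raw, handler) for raw in raw_records]
```

The design point is that anything derived from the schema alone is computed in `HandlerSketch.__init__`, so the per-record path is reduced to dictionary lookups.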
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- airbyte/_future_cdk/record_processor.py (3 hunks)
- airbyte/_future_cdk/sql_processor.py (3 hunks)
- airbyte/_processors/file/base.py (3 hunks)
- airbyte/records.py (3 hunks)
- airbyte/sources/base.py (3 hunks)
Additional comments not posted (20)
airbyte/_processors/file/base.py (3)
16-16: Import statement change is appropriate. The addition of `StreamRecordHandler` is necessary for the new functionality.
144-144: Function signature change is appropriate. The function now accepts `stream_record_handler` instead of `stream_schema`, aligning with the new object-oriented approach.
169-169: Ensure correct usage of `stream_record_handler`. The `stream_record_handler` is correctly used in the `StreamRecord.from_record_message` call.
airbyte/_future_cdk/record_processor.py (5)
26-26: Import statement change is appropriate. The addition of `StreamRecordHandler` is necessary for the new functionality.
160-160: Function signature change is appropriate. The function now accepts `stream_record_handler` instead of `stream_schema`, aligning with the new object-oriented approach.
184-184: Variable replacement is appropriate. The variable `stream_schemas` has been replaced with `stream_record_handlers`, aligning with the new object-oriented approach.
192-197: Ensure correct instantiation of `StreamRecordHandler`. The `StreamRecordHandler` is correctly instantiated with the required parameters.
202-202: Ensure correct usage of `stream_record_handler`. The `stream_record_handler` is correctly used in the `process_record_message` call.
airbyte/records.py (6)
77-77: Import statement change is appropriate. The import statement now includes only `NameNormalizerBase`, which is necessary for the new functionality.
92-97: New class `StreamRecordHandler` is appropriate. The class introduction is necessary for the new object-oriented approach.
99-145: Ensure correct implementation of `StreamRecordHandler` methods. The methods in `StreamRecordHandler` are correctly implemented to handle key normalization and processing.
194-221: Ensure correct implementation of `StreamRecord` methods. The `__init__` method and other methods in `StreamRecord` are correctly updated to handle field processing.
228-243: Ensure correct usage of `stream_record_handler` in `from_record_message`. The `stream_record_handler` is correctly used in the `from_record_message` method.
247-279: Ensure correct implementation of dictionary operations in `StreamRecord`. The methods `__getitem__`, `__setitem__`, `__delitem__`, and `__contains__` are correctly updated to handle key processing and dictionary operations.
airbyte/sources/base.py (3)
7-7: Import statement change is appropriate. The addition of `StreamRecordHandler` is necessary for the new functionality.
456-461: Ensure correct instantiation of `StreamRecordHandler`. The `StreamRecordHandler` is correctly instantiated with the required parameters.
466-466: Ensure correct usage of `stream_record_handler`. The `stream_record_handler` is correctly used in the `StreamRecord.from_record_message` call.
airbyte/_future_cdk/sql_processor.py (3)
65-65: Import statement for `StreamRecordHandler` added. The import statement for `StreamRecordHandler` is correctly added.
231-231: Function signature updated to use `stream_record_handler`. The function signature is updated to replace `stream_schema` with `stream_record_handler`.
242-242: Function call updated to use `stream_record_handler`. The function call to `process_record_message` is updated to use `stream_record_handler`.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (4)
- airbyte/_future_cdk/record_processor.py (3 hunks)
- airbyte/records.py (2 hunks)
- airbyte/sources/base.py (3 hunks)
- tests/unit_tests/test_text_normalization.py (4 hunks)
Files skipped from review as they are similar to previous changes (3)
- airbyte/_future_cdk/record_processor.py
- airbyte/records.py
- airbyte/sources/base.py
Additional comments not posted (7)
tests/unit_tests/test_text_normalization.py (7)
5-5: Import statement addition looks good. The addition of `StreamRecordHandler` is consistent with the refactoring described.
8-17: Fixture addition looks good. The `stream_json_schema` fixture is well-defined and provides a reusable JSON schema for the tests.
20-36: Function modification looks good. The changes to `test_record_columns_list` are consistent with the refactoring, and the function correctly initializes and uses `StreamRecordHandler`.
53-60: Function modification looks good. The changes to `test_case_insensitive_dict` are consistent with the refactoring, and the function correctly initializes and uses `StreamRecordHandler` with the specified parameters.
119-128: Function modification looks good. The changes to `test_case_insensitive_dict_w` are consistent with the refactoring, and the function correctly initializes and uses `StreamRecordHandler` with the updated schema.
154-163: Function modification looks good. The changes to `test_case_insensitive_w_pretty_keys` are consistent with the refactoring, and the function correctly initializes and uses `StreamRecordHandler` with `normalize_keys=False`.
216-220: Function adjustment looks good. The changes to `test_lower_case_normalizer` improve readability without altering the logic.
This brings per-record parsing time down from approx. `20us` to roughly `10us`. We accomplish this by decoupling stream-level logic from record-level processing: we build the logic handler for each stream once, at the start of the stream, and then reuse that handler's logic on each individual record.
Note: each write to the gzip file takes `10-13us`. `write` is entirely one step performed by the `gzip` module, so the only options to speed it up are: (a) write more records at once instead of one at a time, (b) use another library if there is a faster option, or (c) do both (a) and (b), but also just migrate to something else entirely, like parquet.
Summary by CodeRabbit
New Features
- `StreamRecordHandler`, to enhance stream record processing capabilities.
Refactor
- `StreamRecordHandler` instead of dictionaries for stream schemas, improving schema handling and processing efficiency.
Tests
- `StreamRecordHandler` with updated stream schema handling parameters.
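Option (a) from the note in the PR description, writing more records per `write` call, could look roughly like the sketch below. It uses only the standard-library `gzip` and `json` modules. The function names, the JSONL layout, and the batch size of 1000 are illustrative assumptions rather than what this PR implements, and any speedup would need to be measured on real data.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Iterable


def write_one_at_a_time(path: Path, records: Iterable[dict[str, Any]]) -> None:
    """Baseline: one write call on the gzip file per record."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def write_in_batches(
    path: Path,
    records: Iterable[dict[str, Any]],
    batch_size: int = 1000,  # assumed value, tune by measurement
) -> None:
    """Buffer serialized lines and pass them to the gzip file in larger chunks."""
    buffer: list[str] = []
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            buffer.append(json.dumps(record) + "\n")
            if len(buffer) >= batch_size:
                f.write("".join(buffer))
                buffer.clear()
        if buffer:  # flush the final partial batch
            f.write("".join(buffer))
```

Both functions produce the same decompressed content; they differ only in how many times `write` is called.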