Feat: Multiple perf improvements and viztracer performance profiling#298
Conversation
WalkthroughWalkthroughThis update introduces performance optimizations, enhanced schema management, and new configuration settings across multiple files. Key changes include adding caching for schema normalization and trace visualization configurations with VizTracer. Additionally, new scripts for performance testing and profiling with Snowflake and BigQuery caches have been added. Parallel file upload capabilities in Snowflake processing and increased batch sizes in file processing further enhance the performance. Changes
Sequence DiagramsSchema Management with CachingsequenceDiagram
participant User
participant SQLProcessor
User->>SQLProcessor: _ensure_schema_exists()
SQLProcessor->>SQLProcessor: Check _known_schemas_list
alt Schema exists
SQLProcessor->>User: Return
else Schema does not exist
SQLProcessor->>SQLProcessor: Create schema
SQLProcessor->>SQLProcessor: Add schema to _known_schemas_list
SQLProcessor->>User: Return
end
Parallel File Upload in Snowflake ProcessingsequenceDiagram
participant SQLProcessor
participant ThreadPoolExecutor
participant Snowflake
SQLProcessor->>ThreadPoolExecutor: Initialize with MAX_UPLOAD_THREADS (8)
loop For each file to upload
SQLProcessor->>ThreadPoolExecutor: Submit file upload task
end
ThreadPoolExecutor->>Snowflake: Upload files in parallel
Snowflake->>SQLProcessor: Acknowledge uploads
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
|
CodeRabbit (@coderabbitai) review |
Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 4
Outside diff range, codebase verification and nitpick comments (2)
examples/run_perf_test_reads.py (2)
16-18: Consider adding a script header or usage documentation.The VizTracer commands are included directly in the script comments. It would be beneficial to encapsulate these in a script header or provide a usage section in the documentation to guide users on how to execute the script properly.
99-119: Enhance command-line argument validation.The script uses
argparseto handle command-line arguments, which is good. However, consider adding more detailed help messages or validations to ensure that the user inputs are within expected ranges or formats. This will improve the user experience and prevent runtime errors due to invalid inputs.+ parser.add_argument("-e", type=int, required=True, choices=range(2, 7), help="The scale, as a power of 10. Choose between 2 and 6.")
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (1)
poetry.lockis excluded by!**/*.lock
Files selected for processing (10)
- .viztracerrc (1 hunks)
- airbyte/_future_cdk/sql_processor.py (3 hunks)
- airbyte/_processors/file/base.py (1 hunks)
- airbyte/_processors/sql/snowflake.py (3 hunks)
- airbyte/_util/name_normalizers.py (2 hunks)
- airbyte/progress.py (4 hunks)
- examples/run_bigquery_faker.py (1 hunks)
- examples/run_perf_test_cache_snowflake.py (1 hunks)
- examples/run_perf_test_reads.py (1 hunks)
- pyproject.toml (1 hunks)
Files skipped from review due to trivial changes (1)
- pyproject.toml
Additional comments not posted (13)
.viztracerrc (1)
4-8: Review of VizTracer configuration settings.The configuration settings for VizTracer are as follows:
max_stack_depth = 20: Limits the stack depth to 20, which is reasonable for most applications.unique_output_file = True: Ensures that each output file is unique.output_file = viztracer_report.json: Specifies the output file name.tracer_entries = 5_000_000: Allocates memory for 5 million entries.These settings seem appropriate for performance profiling and are consistent with best practices.
examples/run_perf_test_cache_snowflake.py (3)
7-7: Performance profiling command.The command to run the performance profiling is specified in the docstring. This is a good practice as it provides clear instructions on how to execute the script for its intended purpose.
18-38: Sensitive information handling and Snowflake cache setup.The script retrieves sensitive information from Google Cloud's Secret Manager and uses it to configure a Snowflake cache. It's crucial to ensure that
credentials_jsonand other sensitive data are handled securely and not exposed in logs or error messages.Verification successful
Sensitive information handling and Snowflake cache setup verified.
The script does not expose sensitive information in logs or error messages. The
print(f"Stream {name}: {len(list(result[name]))} records")is safe asnameis a static string andresultis derived from a cache read operation.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that sensitive information is not logged or exposed in error messages. # Test: Search for logging statements that might expose sensitive information. rg --type python -A 1 $'print|log' ./examples/run_perf_test_cache_snowflake.pyLength of output: 110
Script:
#!/bin/bash # Description: Verify that sensitive information is not logged or exposed in error messages. # Test: Search for logging statements that might expose sensitive information. rg 'print|log' ./examples/run_perf_test_cache_snowflake.pyLength of output: 123
Script:
#!/bin/bash # Description: Verify that sensitive information is not logged or exposed in error messages. # Search for the definitions and usages of `name` and `result` in the specified file. rg 'name|result' ./examples/run_perf_test_cache_snowflake.pyLength of output: 308
40-49: Data retrieval and processing.The script configures an Airbyte source, selects streams, and reads data using the configured Snowflake cache. It's important to verify that the
readmethod handles exceptions properly, especially since it interacts with external systems.examples/run_bigquery_faker.py (1)
52-52: Improved handling of missingdataset_id.The change to use
.getwith a default value fordataset_idis a good practice. It prevents the script from failing ifdataset_idis not provided, making the script more robust.airbyte/_util/name_normalizers.py (1)
52-52: Caching implementation innormalizemethod.The addition of
@functools.cacheto thenormalizemethod is an excellent use of Python's built-in caching mechanism. This should significantly improve performance by avoiding redundant calculations, especially in scenarios where normalization is frequently called with the same inputs.airbyte/_processors/sql/snowflake.py (2)
6-6: Validate the import of ThreadPoolExecutor.Ensure that the
ThreadPoolExecutoris used appropriately within the context of the file and that its import does not introduce unnecessary dependencies or complexity.
30-30: Confirm the usage ofMAX_UPLOAD_THREADS.The addition of the
MAX_UPLOAD_THREADSconstant is appropriate for controlling the number of concurrent threads. Verify that this constant is used consistently across the file and does not lead to resource exhaustion.Verification successful
Confirmed the usage of
MAX_UPLOAD_THREADS.The constant
MAX_UPLOAD_THREADSis used in theThreadPoolExecutorto control the number of concurrent threads, which is appropriate and should help prevent resource exhaustion.
MAX_UPLOAD_THREADSis defined and used inThreadPoolExecutorwithmax_workers=MAX_UPLOAD_THREADS.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of MAX_UPLOAD_THREADS across the file. # Test: Search for the constant usage. Expect: Usage in ThreadPoolExecutor and possibly elsewhere. rg --type python 'MAX_UPLOAD_THREADS' airbyte/_processors/sql/snowflake.pyLength of output: 105
Script:
#!/bin/bash # Description: Verify the usage of MAX_UPLOAD_THREADS across the file. # Test: Search for the constant usage. Expect: Usage in ThreadPoolExecutor and possibly elsewhere. rg 'MAX_UPLOAD_THREADS' airbyte/_processors/sql/snowflake.pyLength of output: 160
airbyte/_processors/file/base.py (1)
26-26: Confirm the impact of increasedDEFAULT_BATCH_SIZE.Increasing the
DEFAULT_BATCH_SIZEcould significantly affect memory usage and performance. It's important to ensure that this change has been thoroughly tested, especially in scenarios with large datasets.airbyte/progress.py (1)
Line range hint
88-402:
Review changes to time formatting and type handling.The changes to support float values for time calculations are crucial for precision. Ensure that these changes are consistent throughout the file and that they do not introduce any regressions in time tracking or display functions.
airbyte/_future_cdk/sql_processor.py (3)
165-165: Attribute_known_schemas_listadded: Good for performance, consider adding maintenance comments.The addition of
_known_schemas_listis a positive change for performance as it reduces the number of database schema checks. However, consider adding comments on how and when this list should be updated or invalidated to ensure it remains accurate.
311-313: Optimized schema existence check: Good for performance, consider handling cache inconsistencies.The method now first checks the cached list of known schemas which is a good optimization. However, consider adding error handling or a mechanism to handle situations where the cache might be stale or inaccurate.
381-397: Addedforce_refreshparameter to_get_schemas_list: Enhances flexibility in cache management.The addition of
force_refreshparameter allows for better control over schema list caching, which can be crucial in environments where schema changes are frequent. Consider adding unit tests to ensure that this parameter works as expected, especially in scenarios where the cache needs to be bypassed.
|
/fix-pr
|
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- .viztracerrc (1 hunks)
- airbyte/_processors/sql/snowflake.py (3 hunks)
Files skipped from review due to trivial changes (1)
- .viztracerrc
Additional context used
Ruff
airbyte/_processors/sql/snowflake.py
136-136: Use
logging.exceptioninstead oflogging.errorReplace with
exception(TRY400)
136-136: Undefined name
logging(F821)
Additional comments not posted (2)
airbyte/_processors/sql/snowflake.py (2)
6-6: Import of ThreadPoolExecutor is appropriate for the context.The import of
ThreadPoolExecutoraligns with the PR's objective to improve performance through parallel processing. This is a good use of Python's concurrent programming capabilities to enhance data upload speeds.
30-31: ConstantMAX_UPLOAD_THREADSis well-defined.Setting a constant for the maximum number of upload threads is a good practice. It helps maintain the code's configurability and makes it easier to adjust the performance tuning.
viztracerviztracer performance profiling
There was a problem hiding this comment.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (4)
- .gitignore (1 hunks)
- airbyte/_processors/sql/snowflake.py (4 hunks)
- airbyte/caches/bigquery.py (2 hunks)
- examples/run_perf_test_reads.py (1 hunks)
Files skipped from review due to trivial changes (1)
- .gitignore
Files skipped from review as they are similar to previous changes (2)
- airbyte/_processors/sql/snowflake.py
- examples/run_perf_test_reads.py
Additional context used
Learnings (1)
airbyte/caches/bigquery.py (1)
Learnt from: aaronsteers PR: airbytehq/PyAirbyte#281 File: airbyte/caches/bigquery.py:40-43 Timestamp: 2024-07-09T04:12:04.672Z Learning: The `BigQueryCache.get_arrow_dataset` method should have a docstring that correctly states the reason for the `NotImplementedError` as BigQuery not supporting `to_arrow`, instead of incorrectly mentioning `pd.read_sql_table`.
Additional comments not posted (2)
airbyte/caches/bigquery.py (2)
20-20: Approved: Import statement forNoReturn.The import of
NoReturnfrom thetypingmodule is correctly placed and necessary for the changes in theget_arrow_datasetmethod.
41-41: Approved: Change in return type and updated exception handling.Changing the return type from
NonetoNoReturnclarifies thatget_arrow_datasetis not intended to return any value and will always raise an exception. This is a good practice for functions that are designed to not return under normal operation.Additionally, the docstring and the exception message have been updated to reflect the correct reason why
get_arrow_datasetis not implemented, which aligns with the previous learning. The provided URL in the docstring is a good practice as it offers more context on the issue.
Summary by CodeRabbit
New Features
Enhancements
Bug Fixes
Chores
.gitignoreto exclude VizTracer log files and packaged documentation zips.General Perf Improvements (Read-to-DuckDB)
Before (15K records per second):
After (23K records per second):
Snowflake Perf Improvements (50K records)
Before (load takes 32 seconds):
After (load takes 25 seconds):
Snowflake Perf Improvements (500K records)
Before (load takes 2 min, 43 seconds):
After (load takes 25 seconds):
Snowflake 2M Records test:
Total time is 3min, 31s - 1min 53sec to extract the data and 1min 48sec to load it to Snowflake.
Comparable Airbyte Cloud job here (internal link):
https://cloud.airbyte.com/workspaces/19d7a891-8e0e-40ac-8a8c-5faf8d11e47c/connections/c7986ffb-124c-4c65-be06-a74970f2e6fb/status
Snowflake 50M Records test:
Comparable Airbyte Cloud job here (internal link):
https://cloud.airbyte.com/workspaces/19d7a891-8e0e-40ac-8a8c-5faf8d11e47c/connections/b356cca2-f81d-4f40-bf8b-42d50de0027e/job-history