Feat: Multiple perf improvements and `viztracer` performance profiling by aaronsteers · Pull Request #298 · airbytehq/PyAirbyte

Aaron ("AJ") Steers (aaronsteers) · 2024-07-17T15:37:51Z

Summary by CodeRabbit

New Features
- Introduced default configuration settings for VizTracer to enhance program execution tracing and visualization.
- Added scripts for profiling performance of Snowflake cache and read throughput.
Enhancements
- Improved schema handling with force refresh option for enhanced data integrity.
- Updated batch size limit for file writing to improve performance.
- Enabled parallel file uploads in Snowflake using thread pools.
- Applied caching to the normalize method for better performance.
- Enhanced time calculation precision in progress tracking.
Bug Fixes
- Fixed a bug in dataset name handling for BigQuery integration tests.
Chores
- Updated .gitignore to exclude VizTracer log files and packaged documentation zips.

General Perf Improvements (Read-to-DuckDB)

Before (15K records per second):

After (23K records per second):

Snowflake Perf Improvements (50K records)

Before (load takes 32 seconds):

After (load takes 25 seconds):

Snowflake Perf Improvements (500K records)

Before (load takes 2 min, 43 seconds):

After (load takes 25 seconds):

Snowflake 2M Records test:

Total time is 3min, 31s - 1min 53sec to extract the data and 1min 48sec to load it to Snowflake.

Comparable Airbyte Cloud job here (internal link):

https://cloud.airbyte.com/workspaces/19d7a891-8e0e-40ac-8a8c-5faf8d11e47c/connections/c7986ffb-124c-4c65-be06-a74970f2e6fb/status

Snowflake 50M Records test:

Comparable Airbyte Cloud job here (internal link):

https://cloud.airbyte.com/workspaces/19d7a891-8e0e-40ac-8a8c-5faf8d11e47c/connections/b356cca2-f81d-4f40-bf8b-42d50de0027e/job-history

coderabbitai · 2024-07-17T15:37:56Z

Walkthrough

This update introduces performance optimizations, enhanced schema management, and new configuration settings across multiple files. Key changes include adding caching for schema normalization and trace visualization configurations with VizTracer. Additionally, new scripts for performance testing and profiling with Snowflake and BigQuery caches have been added. Parallel file upload capabilities in Snowflake processing and increased batch sizes in file processing further enhance the performance.

Changes

Files	Summary of Changes
.viztracerrc, .gitignore	Added VizTracer configuration settings and updated .gitignore to exclude VizTracer log files and packaged docs.
airbyte/_future_cdk/sql_processor.py	Introduced schema caching and a force refresh parameter for schema list retrieval.
airbyte/_processors/file/base.py	Updated the default batch size from 10,000 to 100,000.
airbyte/_processors/sql/snowflake.py	Added parallel file upload using ThreadPoolExecutor with a maximum of 8 threads.
airbyte/_util/name_normalizers.py	Added caching to the normalize method in LowerCaseNormalizer using functools.cache.
airbyte/progress.py	Changed time formatting and type handling for improved precision in time calculations and display updates.
examples/run_bigquery_faker.py	Modified to use a default dataset name if not specified in the configuration.
examples/run_perf_test_cache_snowflake.py	New script for profiling performance with Snowflake cache.
examples/run_perf_test_reads.py	New script for profiling read throughput performance with command-line arguments and VizTracer for performance profiling.
airbyte/caches/bigquery.py	Updated return type of get_arrow_dataset to NoReturn, raising NotImplementedError explicitly.

Sequence Diagrams

Schema Management with Caching

sequenceDiagram
    participant User
    participant SQLProcessor

    User->>SQLProcessor: _ensure_schema_exists()
    SQLProcessor->>SQLProcessor: Check _known_schemas_list
    alt Schema exists
        SQLProcessor->>User: Return
    else Schema does not exist
        SQLProcessor->>SQLProcessor: Create schema
        SQLProcessor->>SQLProcessor: Add schema to _known_schemas_list
        SQLProcessor->>User: Return
    end

Parallel File Upload in Snowflake Processing

sequenceDiagram
    participant SQLProcessor
    participant ThreadPoolExecutor
    participant Snowflake

    SQLProcessor->>ThreadPoolExecutor: Initialize with MAX_UPLOAD_THREADS (8)
    loop For each file to upload
        SQLProcessor->>ThreadPoolExecutor: Submit file upload task
    end
    ThreadPoolExecutor->>Snowflake: Upload files in parallel
    Snowflake->>SQLProcessor: Acknowledge uploads

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai generate interesting stats about this repository and render them as a table.
- @coderabbitai show all the console.log statements in this repository.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

Aaron ("AJ") Steers (aaronsteers) · 2024-07-17T15:38:18Z

CodeRabbit (@coderabbitai) review

coderabbitai · 2024-07-17T15:38:23Z

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 4

Outside diff range, codebase verification and nitpick comments (2)

examples/run_perf_test_reads.py (2)
16-18: Consider adding a script header or usage documentation.

The VizTracer commands are included directly in the script comments. It would be beneficial to encapsulate these in a script header or provide a usage section in the documentation to guide users on how to execute the script properly.

99-119: Enhance command-line argument validation.

The script uses argparse to handle command-line arguments, which is good. However, consider adding more detailed help messages or validations to ensure that the user inputs are within expected ranges or formats. This will improve the user experience and prevent runtime errors due to invalid inputs.
+ parser.add_argument("-e", type=int, required=True, choices=range(2, 7), help="The scale, as a power of 10. Choose between 2 and 6.")

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 326683c and c282122.

Files ignored due to path filters (1)

poetry.lock is excluded by !**/*.lock

Files selected for processing (10)

.viztracerrc (1 hunks)
airbyte/_future_cdk/sql_processor.py (3 hunks)
airbyte/_processors/file/base.py (1 hunks)
airbyte/_processors/sql/snowflake.py (3 hunks)
airbyte/_util/name_normalizers.py (2 hunks)
airbyte/progress.py (4 hunks)
examples/run_bigquery_faker.py (1 hunks)
examples/run_perf_test_cache_snowflake.py (1 hunks)
examples/run_perf_test_reads.py (1 hunks)
pyproject.toml (1 hunks)

Files skipped from review due to trivial changes (1)

pyproject.toml

Additional comments not posted (13)

.viztracerrc (1)

4-8: Review of VizTracer configuration settings.

The configuration settings for VizTracer are as follows:

max_stack_depth = 20: Limits the stack depth to 20, which is reasonable for most applications.

unique_output_file = True: Ensures that each output file is unique.

output_file = viztracer_report.json: Specifies the output file name.

tracer_entries = 5_000_000: Allocates memory for 5 million entries.

These settings seem appropriate for performance profiling and are consistent with best practices.
examples/run_perf_test_cache_snowflake.py (3)
7-7: Performance profiling command.

The command to run the performance profiling is specified in the docstring. This is a good practice as it provides clear instructions on how to execute the script for its intended purpose.

18-38: Sensitive information handling and Snowflake cache setup.

The script retrieves sensitive information from Google Cloud's Secret Manager and uses it to configure a Snowflake cache. It's crucial to ensure that credentials_json and other sensitive data are handled securely and not exposed in logs or error messages.

Verification successful

Sensitive information handling and Snowflake cache setup verified.

The script does not expose sensitive information in logs or error messages. The print statement found only outputs the stream name and the number of records, which are not sensitive.

print(f"Stream {name}: {len(list(result[name]))} records") is safe as name is a static string and result is derived from a cache read operation.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that sensitive information is not logged or exposed in error messages.

# Test: Search for logging statements that might expose sensitive information.
rg --type python -A 1 $'print|log' ./examples/run_perf_test_cache_snowflake.py
Length of output: 110

Script:
#!/bin/bash
# Description: Verify that sensitive information is not logged or exposed in error messages.

# Test: Search for logging statements that might expose sensitive information.
rg 'print|log' ./examples/run_perf_test_cache_snowflake.py
Length of output: 123

Script:
#!/bin/bash
# Description: Verify that sensitive information is not logged or exposed in error messages.

# Search for the definitions and usages of `name` and `result` in the specified file.
rg 'name|result' ./examples/run_perf_test_cache_snowflake.py
Length of output: 308
40-49: Data retrieval and processing.

The script configures an Airbyte source, selects streams, and reads data using the configured Snowflake cache. It's important to verify that the read method handles exceptions properly, especially since it interacts with external systems.
examples/run_bigquery_faker.py (1)

52-52: Improved handling of missing dataset_id.

The change to use .get with a default value for dataset_id is a good practice. It prevents the script from failing if dataset_id is not provided, making the script more robust.

airbyte/_util/name_normalizers.py (1)

52-52: Caching implementation in normalize method.

The addition of @functools.cache to the normalize method is an excellent use of Python's built-in caching mechanism. This should significantly improve performance by avoiding redundant calculations, especially in scenarios where normalization is frequently called with the same inputs.
airbyte/_processors/sql/snowflake.py (2)
6-6: Validate the import of ThreadPoolExecutor.

Ensure that the ThreadPoolExecutor is used appropriately within the context of the file and that its import does not introduce unnecessary dependencies or complexity.

30-30: Confirm the usage of MAX_UPLOAD_THREADS.

The addition of the MAX_UPLOAD_THREADS constant is appropriate for controlling the number of concurrent threads. Verify that this constant is used consistently across the file and does not lead to resource exhaustion.

Verification successful

Confirmed the usage of MAX_UPLOAD_THREADS.

The constant MAX_UPLOAD_THREADS is used in the ThreadPoolExecutor to control the number of concurrent threads, which is appropriate and should help prevent resource exhaustion.

MAX_UPLOAD_THREADS is defined and used in ThreadPoolExecutor with max_workers=MAX_UPLOAD_THREADS.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the usage of MAX_UPLOAD_THREADS across the file.

# Test: Search for the constant usage. Expect: Usage in ThreadPoolExecutor and possibly elsewhere.
rg --type python 'MAX_UPLOAD_THREADS' airbyte/_processors/sql/snowflake.py
Length of output: 105

Script:
#!/bin/bash
# Description: Verify the usage of MAX_UPLOAD_THREADS across the file.

# Test: Search for the constant usage. Expect: Usage in ThreadPoolExecutor and possibly elsewhere.
rg 'MAX_UPLOAD_THREADS' airbyte/_processors/sql/snowflake.py
Length of output: 160
airbyte/_processors/file/base.py (1)

26-26: Confirm the impact of increased DEFAULT_BATCH_SIZE.

Increasing the DEFAULT_BATCH_SIZE could significantly affect memory usage and performance. It's important to ensure that this change has been thoroughly tested, especially in scenarios with large datasets.

airbyte/progress.py (1)

Line range hint 88-402:
Review changes to time formatting and type handling.

The changes to support float values for time calculations are crucial for precision. Ensure that these changes are consistent throughout the file and that they do not introduce any regressions in time tracking or display functions.

airbyte/_future_cdk/sql_processor.py (3)

165-165: Attribute _known_schemas_list added: Good for performance, consider adding maintenance comments.

The addition of _known_schemas_list is a positive change for performance as it reduces the number of database schema checks. However, consider adding comments on how and when this list should be updated or invalidated to ensure it remains accurate.

311-313: Optimized schema existence check: Good for performance, consider handling cache inconsistencies.

The method now first checks the cached list of known schemas which is a good optimization. However, consider adding error handling or a mechanism to handle situations where the cache might be stale or inaccurate.

381-397: Added force_refresh parameter to _get_schemas_list: Enhances flexibility in cache management.

The addition of force_refresh parameter allows for better control over schema list caching, which can be crucial in environments where schema changes are frequent. Consider adding unit tests to ensure that this parameter works as expected, especially in scenarios where the cache needs to be bypassed.

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c282122 and c85f8de.

Files selected for processing (1)

pyproject.toml (2 hunks)

Files skipped from review as they are similar to previous changes (1)

pyproject.toml

Aaron ("AJ") Steers (aaronsteers) · 2024-07-17T15:59:36Z

/fix-pr

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.
(This job requires that the PR author has "Allow edits from maintainers" enabled.)

PR auto-fix job started... Check job output.

🟦 Job completed successfully (no changes).

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c85f8de and b62b212.

Files selected for processing (2)

.viztracerrc (1 hunks)
airbyte/_processors/sql/snowflake.py (3 hunks)

Files skipped from review due to trivial changes (1)

.viztracerrc

Additional context used

Ruff

airbyte/_processors/sql/snowflake.py

136-136: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

136-136: Undefined name logging

(F821)

Additional comments not posted (2)

airbyte/_processors/sql/snowflake.py (2)

6-6: Import of ThreadPoolExecutor is appropriate for the context.

The import of ThreadPoolExecutor aligns with the PR's objective to improve performance through parallel processing. This is a good use of Python's concurrent programming capabilities to enhance data upload speeds.

30-31: Constant MAX_UPLOAD_THREADS is well-defined.

Setting a constant for the maximum number of upload threads is a good practice. It helps maintain the code's configurability and makes it easier to adjust the performance tuning.

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b62b212 and b9b8e82.

Files selected for processing (4)

.gitignore (1 hunks)
airbyte/_processors/sql/snowflake.py (4 hunks)
airbyte/caches/bigquery.py (2 hunks)
examples/run_perf_test_reads.py (1 hunks)

Files skipped from review due to trivial changes (1)

.gitignore

Files skipped from review as they are similar to previous changes (2)

airbyte/_processors/sql/snowflake.py
examples/run_perf_test_reads.py

Additional context used

Learnings (1)

airbyte/caches/bigquery.py (1)

Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#281
File: airbyte/caches/bigquery.py:40-43
Timestamp: 2024-07-09T04:12:04.672Z
Learning: The `BigQueryCache.get_arrow_dataset` method should have a docstring that correctly states the reason for the `NotImplementedError` as BigQuery not supporting `to_arrow`, instead of incorrectly mentioning `pd.read_sql_table`.

Additional comments not posted (2)

airbyte/caches/bigquery.py (2)

20-20: Approved: Import statement for NoReturn.

The import of NoReturn from the typing module is correctly placed and necessary for the changes in the get_arrow_dataset method.

41-41: Approved: Change in return type and updated exception handling.

Changing the return type from None to NoReturn clarifies that get_arrow_dataset is not intended to return any value and will always raise an exception. This is a good practice for functions that are designed to not return under normal operation.

Additionally, the docstring and the exception message have been updated to reflect the correct reason why get_arrow_dataset is not implemented, which aligns with the previous learning. The provided URL in the docstring is a good practice as it offers more context on the issue.

Aaron ("AJ") Steers (aaronsteers) added 10 commits July 17, 2024 08:29

perf: optimize _ensure_schema_exists()

e329ba8

cherry-picked

5508bca

chore: add perf-test example scripts

ab1cc9b

feat(perf): improve precision of progress prints

9003e91

perf: cache string normalization (2.5x speed boost! 🔥)

4b563a3

perf: increase batch size from 10K to 100K

d9b1285

perf(snowflake): upload using up to 8 concurrent threads

93ce1e9

feat(perf): robust test script for perf

2a5c042

fix(drive-by): 'dataset_id' no longer part of secret config for bigquery

9d65d68

chore(perf): reasonable max entries for viztracer

c282122

Aaron ("AJ") Steers (aaronsteers) added the enable-ai-review label Jul 17, 2024

coderabbitai Bot requested changes Jul 17, 2024

View reviewed changes

Comment thread .viztracerrc Outdated

Comment thread examples/run_perf_test_reads.py

Comment thread examples/run_perf_test_reads.py

Comment thread airbyte/_processors/sql/snowflake.py Outdated

Aaron ("AJ") Steers (aaronsteers) added 2 commits July 17, 2024 08:48

chore: ignore new SQLAlchemy warning

c85f8de

chore: fix ini comments

765b38c

coderabbitai Bot reviewed Jul 17, 2024

View reviewed changes

apply suggestion

b62b212

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai Bot approved these changes Jul 17, 2024

View reviewed changes

coderabbitai Bot requested changes Jul 17, 2024

View reviewed changes

Comment thread airbyte/_processors/sql/snowflake.py Outdated

Aaron ("AJ") Steers (aaronsteers) added 2 commits July 17, 2024 09:26

chore: lint fixes

4c2b088

chore: use custom exception type instead of log message

b5040b8

coderabbitai Bot approved these changes Jul 17, 2024

View reviewed changes

Aaron ("AJ") Steers (aaronsteers) changed the title ~~Feat: Multiple perf improvements, add performance profiling with viztracer~~ Feat: Multiple perf improvements and viztracer performance profiling Jul 17, 2024

Aaron ("AJ") Steers (aaronsteers) added 3 commits July 17, 2024 09:37

chore: gitignore viztracer reports

6122b1e

delete now-redundant profiling script

bdfd246

chore: add more usage examples

b9b8e82

coderabbitai Bot reviewed Jul 17, 2024

View reviewed changes

Aaron ("AJ") Steers (aaronsteers) enabled auto-merge (squash) July 17, 2024 16:51

Aaron ("AJ") Steers (aaronsteers) merged commit 1acd6df into main Jul 17, 2024

Aaron ("AJ") Steers (aaronsteers) deleted the feat/perf-improvements-and-profiling branch July 17, 2024 16:55

coderabbitai Bot mentioned this pull request Sep 23, 2024

Chore: Bump to Sqlalchemy 2.0 #396

Merged

coderabbitai Bot mentioned this pull request Mar 28, 2025

Feat: Support pandas in BigQuery cache #597

Open

Uh oh!

Conversation

Aaron ("AJ") Steers (aaronsteers) commented Jul 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

General Perf Improvements (Read-to-DuckDB)

Snowflake Perf Improvements (50K records)

Snowflake Perf Improvements (500K records)

Snowflake 2M Records test:

Snowflake 50M Records test:

Uh oh!

coderabbitai Bot commented Jul 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagrams

Schema Management with Caching

Parallel File Upload in Snowflake Processing

Chat

CodeRabbit Commands (invoked as PR comments)

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

Aaron ("AJ") Steers (aaronsteers) commented Jul 17, 2024

Uh oh!

coderabbitai Bot commented Jul 17, 2024

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Aaron ("AJ") Steers (aaronsteers) commented Jul 17, 2024 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Aaron ("AJ") Steers (aaronsteers) commented Jul 17, 2024 •

edited

Loading

coderabbitai Bot commented Jul 17, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

Aaron ("AJ") Steers (aaronsteers) commented Jul 17, 2024 •

edited by github-actions Bot

Loading