
fix(concurrent): Fix memory issue following concurrency in source-jira#150

Merged
Maxime Carbonneau-Leclerc (maxi297) merged 1 commit into main from maxi297/memory-issue-associated-with-jinja on Dec 9, 2024

Conversation

@maxi297 Maxime Carbonneau-Leclerc (maxi297) commented Dec 6, 2024

DISCLAIMER: I don't fully understand this one, but it seems to have helped. One reason it isn't clear is that the main memory issue wasn't known at the time; this change was made because we were observing a high number of objects allocated by StreamPartitionAccessEnvironment in the heap.

Before this change, memory usage was increasing faster.
We shipped this change as part of airbyte/source-jira:3.4.0-dev.657e80a526. The process was still being killed because of memory usage, but usage was increasing more slowly. See the difference between the two executions here.

[Image: memory usage of the two executions]

Ultimately, I don't have a strong opinion on whether this needs to be merged, but it seems to help, which is why I'm opening this PR. Based on this comment from a pallets developer, they seem to intend the lib to be thread-safe, so I feel somewhat confident that this change should be fine (that, plus the fact that the current sync without the memory issue is running without problems for the user).

Summary by CodeRabbit

  • New Features

    • Enhanced security measures for Jinja interpolation, restricting certain extensions and built-in functions.
    • Introduced new constants for managing deprecated keywords.
  • Bug Fixes

    • Updated type handling for methods to improve compatibility with type checking tools.
  • Refactor

    • Simplified initialization process for the Jinja environment, improving code structure.

coderabbitai Bot commented Dec 6, 2024

Walkthrough

The changes in the jinja.py file involve updates to the JinjaInterpolation and StreamPartitionAccessEnvironment classes. Key modifications include the addition of class-level constants for managing deprecated keywords and restricting certain Jinja extensions and built-in functions. The initialization logic has been simplified by utilizing a shared _ENVIRONMENT instance, and several method signatures have been updated to enhance type safety. These adjustments aim to improve the structure and security of the Jinja interpolation logic.

Changes

File Path: airbyte_cdk/sources/declarative/interpolation/jinja.py
Change Summary:

  • Added variables: _ALIASES, _RESTRICTED_EXTENSIONS, _RESTRICTED_BUILTIN_FUNCTIONS.
  • Updated method signature for _find_undeclared_variables from set[str] to Set[str].
  • Updated method signature for _compile from Optional[str] to str.
  • Removed constructor from JinjaInterpolation and simplified initialization logic using _ENVIRONMENT.
  • Updated eval method to reference _ALIASES instead of ALIASES.
  • Updated is_safe_attribute method to include a type ignore comment for MyPy.
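
To illustrate why moving the environment from the constructor to a shared class-level attribute can reduce allocations, here is a minimal sketch. It does not use the CDK code or Jinja itself: FakeEnvironment, PerInstanceInterpolation, SharedInterpolation and count_environments are hypothetical stand-ins chosen for this example, and only the name _ENVIRONMENT comes from the change summary above.

```python
from typing import Callable


class FakeEnvironment:
    """Hypothetical stand-in for a templating environment that is costly to build."""

    instances_created = 0

    def __init__(self) -> None:
        FakeEnvironment.instances_created += 1
        # Pretend this is the registration of filters and macros.
        self.filters = {"upper": str.upper}


class PerInstanceInterpolation:
    """Before: every interpolation object builds its own environment."""

    def __init__(self) -> None:
        self._environment = FakeEnvironment()


class SharedInterpolation:
    """After: one environment is built when the class body runs, then reused."""

    _ENVIRONMENT = FakeEnvironment()


def count_environments(factory: Callable[[], object], n: int) -> int:
    """Return how many environments are created while building n objects."""
    before = FakeEnvironment.instances_created
    _objects = [factory() for _ in range(n)]
    return FakeEnvironment.instances_created - before


per_instance = count_environments(PerInstanceInterpolation, 100)
shared = count_environments(SharedInterpolation, 100)
print(per_instance, shared)
```

In this sketch the per-instance variant allocates one environment per object, while the shared variant allocates none after the class is defined. Whether this fully explains the memory behavior observed in source-jira is not established by this PR.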

Sequence Diagram(s)

sequenceDiagram
    participant J as JinjaInterpolation
    participant E as StreamPartitionAccessEnvironment
    participant F as Filters and Macros

    J->>E: Create instance of StreamPartitionAccessEnvironment
    E->>F: Update filters and globals
    E-->>J: Return updated environment
    J->>J: Use _ENVIRONMENT for rendering


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/interpolation/jinja.py (5)

7-7: Consider documenting the type ignore rationale

The type ignore comment suggests there might be an underlying type system issue. Would you consider adding a more detailed comment explaining why this is necessary and if there's a plan to address it in the future? wdyt? 🤔

Also, since this method affects attribute access security, should we add some documentation about which attributes are considered safe and why? This could help future maintainers understand the security implications.

Also applies to: 30-30


33-50: Great security improvements! A few thoughts on documentation

The security restrictions and aliases are well thought out! 🔒 To make this even better for future maintainers:

  1. Would you consider adding a link to any security guidelines or discussions that led to these restrictions?
  2. For _RESTRICTED_BUILTIN_FUNCTIONS, maybe we could add a comment about the specific risks of each function (like we have for range)?

51-58: Nice optimization using singleton pattern! Consider thread-safety documentation

The singleton pattern with module-level initialization is a great way to address the memory issues! Since this is a concurrent context (as mentioned in PR objectives), would you consider adding a comment explaining that this initialization is thread-safe due to Python's module import lock? This could help prevent future modifications that might break thread safety.


Line range hint 89-108: Consider optimizing alias lookup performance

The alias handling looks good, but for performance in tight loops, would you consider creating a set of reserved keywords at module level? Something like:

_RESERVED_KEYWORDS = set(_ALIASES.keys())

Then we could do a faster lookup:

if alias in _RESERVED_KEYWORDS and alias in context:

What do you think? 🤔
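
For context, a minimal sketch of what alias handling of this kind might look like, under the assumption that each alias maps a deprecated keyword to its current equivalent. The alias table contents and the build_context helper are hypothetical and are not taken from jinja.py.

```python
from typing import Any, Dict, Mapping

# Hypothetical alias table: deprecated keyword -> current keyword.
_ALIASES: Mapping[str, str] = {"old_name": "new_name"}


def build_context(**parameters: Any) -> Dict[str, Any]:
    """Copy the parameters and expose each current keyword under its alias too."""
    context = dict(parameters)
    for alias, equivalent in _ALIASES.items():
        if alias in context:
            # The alias is reserved, so callers must not pass it directly.
            raise ValueError(f"Found reserved keyword {alias} in interpolation context.")
        if equivalent in context:
            context[alias] = context[equivalent]
    return context


print(build_context(new_name=1))
```

Note that membership tests on a dict are already hash-based, so a separate set of keys would mostly document intent rather than change lookup cost.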


136-148: Document caching behavior for memory management

The @cache decorators are a great addition for performance! Since memory usage is a concern in this PR, would you consider adding a comment about:

  1. Expected cache size growth
  2. When/if the cache gets cleared
  3. Any potential memory implications in long-running scenarios

This could help users understand the memory trade-offs. wdyt?
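
To make the three points above concrete, here is a small sketch of how functools.cache grows, how it can be cleared, and how functools.lru_cache bounds the size. The functions compile_unbounded and compile_bounded are hypothetical stand-ins, not the cached methods in jinja.py.

```python
from functools import cache, lru_cache


@cache
def compile_unbounded(template: str) -> str:
    """Unbounded cache: one entry is kept for every distinct template string."""
    return template.strip()


@lru_cache(maxsize=2)
def compile_bounded(template: str) -> str:
    """Bounded cache: least recently used entries are evicted past maxsize."""
    return template.strip()


# Feed five distinct inputs to both functions.
for i in range(5):
    compile_unbounded(f"template-{i}")
    compile_bounded(f"template-{i}")

unbounded_size = compile_unbounded.cache_info().currsize
bounded_size = compile_bounded.cache_info().currsize

# The cache is only emptied when cache_clear() is called explicitly.
compile_unbounded.cache_clear()
cleared_size = compile_unbounded.cache_info().currsize

print(unbounded_size, bounded_size, cleared_size)
```

With an unbounded cache, memory grows with the number of distinct inputs seen, which matters mostly when template strings are generated dynamically rather than drawn from a fixed set.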

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 59505ff and 7f1680c.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/interpolation/jinja.py (5 hunks)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/interpolation/jinja.py (1)

136-148: Verify concurrent access patterns

Let's verify that the cached methods are being accessed safely in concurrent scenarios:

✅ Verification successful

The caching implementation looks safe for concurrent use 🔒

From the codebase analysis, I can see that JinjaInterpolation is used in a straightforward, stateless manner across the codebase. The cached methods _find_undeclared_variables and _compile are only used internally within the class, and there's no evidence of concurrent access patterns or parallel processing in any of the usage sites. The Python @cache decorator is thread-safe by design, and the way these methods are used (pure functions with immutable string inputs) makes them inherently safe for concurrent access.

The main usages are in:

  • Interpolated string/mapping/boolean classes that use JinjaInterpolation in a sequential manner
  • DateTime based cursor for template evaluation
  • Test files that verify basic functionality

Would you agree that we can keep the current implementation as is? The caching provides performance benefits without introducing thread-safety concerns. wdyt?
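
As a rough way to exercise the thread-safety claim, the sketch below calls a cached pure function from several threads. find_variables is a hypothetical stand-in for the cached methods (it simply treats words starting with "$" as variables) and the thread count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import cache
from typing import FrozenSet


@cache
def find_variables(template: str) -> FrozenSet[str]:
    """Hypothetical stand-in: every word starting with '$' is a variable."""
    return frozenset(word[1:] for word in template.split() if word.startswith("$"))


# Many calls, but only two distinct template strings.
templates = ["$a and $b", "$c", "$a and $b"] * 50

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(find_variables, templates))

print(find_variables.cache_info().currsize)
```

The cache stays coherent under concurrent calls, although the wrapped function may run more than once for the same key if two threads miss at the same time. That is harmless for pure functions like this one.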

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for concurrent usage patterns of these methods
rg -l "ThreadPoolExecutor|ProcessPoolExecutor|Thread|Process" | xargs rg "_find_undeclared_variables|_compile"

# Search for potential concurrent iterator usage
ast-grep --pattern 'for $_ in $_._find_undeclared_variables'

Length of output: 175


Script:

#!/bin/bash
# Let's check for any async/concurrent patterns in files that use JinjaInterpolation
rg -l "JinjaInterpolation" | xargs rg "async|await|concurrent|parallel|Thread|Process"

# Check how JinjaInterpolation is being used
rg -A 3 "JinjaInterpolation" 

# Look for test files that might reveal concurrent usage
fd "test.*py" | xargs rg "JinjaInterpolation"

Length of output: 10361

@maxi297 Maxime Carbonneau-Leclerc (maxi297) deleted the maxi297/memory-issue-associated-with-jinja branch December 9, 2024 17:07
Labels

bug Something isn't working
