fix(concurrent): Fix memory issue following concurrency in source-jira#150
Conversation
📝 WalkthroughWalkthroughThe changes in the Changes
Sequence Diagram(s)sequenceDiagram
participant J as JinjaInterpolation
participant E as StreamPartitionAccessEnvironment
participant F as Filters and Macros
J->>E: Create instance of StreamPartitionAccessEnvironment
E->>F: Update filters and globals
E-->>J: Return updated environment
J->>J: Use _ENVIRONMENT for rendering
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/interpolation/jinja.py (5)
7-7: Consider documenting the type ignore rationaleThe type ignore comment suggests there might be an underlying type system issue. Would you consider adding a more detailed comment explaining why this is necessary and if there's a plan to address it in the future? wdyt? 🤔
Also, since this method affects attribute access security, should we add some documentation about which attributes are considered safe and why? This could help future maintainers understand the security implications.
Also applies to: 30-30
33-50: Great security improvements! A few thoughts on documentationThe security restrictions and aliases are well thought out! 🔒 To make this even better for future maintainers:
- Would you consider adding a link to any security guidelines or discussions that led to these restrictions?
- For
_RESTRICTED_BUILTIN_FUNCTIONS, maybe we could add a comment about the specific risks of each function (like we have forrange)?
51-58: Nice optimization using singleton pattern! Consider thread-safety documentationThe singleton pattern with module-level initialization is a great way to address the memory issues! Since this is a concurrent context (as mentioned in PR objectives), would you consider adding a comment explaining that this initialization is thread-safe due to Python's module import lock? This could help prevent future modifications that might break thread safety.
Line range hint
89-108: Consider optimizing alias lookup performanceThe alias handling looks good, but for performance in tight loops, would you consider creating a set of reserved keywords at module level? Something like:
_RESERVED_KEYWORDS = set(_ALIASES.keys())Then we could do a faster lookup:
if alias in _RESERVED_KEYWORDS and alias in context:What do you think? 🤔
136-148: Document caching behavior for memory managementThe
@cachedecorators are a great addition for performance! Since memory usage is a concern in this PR, would you consider adding a comment about:
- Expected cache size growth
- When/if the cache gets cleared
- Any potential memory implications in long-running scenarios
This could help users understand the memory trade-offs. wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/interpolation/jinja.py(5 hunks)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/interpolation/jinja.py (1)
136-148: Verify concurrent access patterns
Let's verify that the cached methods are being accessed safely in concurrent scenarios:
✅ Verification successful
The caching implementation looks safe for concurrent use 🔒
From the codebase analysis, I can see that JinjaInterpolation is used in a straightforward, stateless manner across the codebase. The cached methods _find_undeclared_variables and _compile are only used internally within the class, and there's no evidence of concurrent access patterns or parallel processing in any of the usage sites. The Python @cache decorator is thread-safe by design, and the way these methods are used (pure functions with immutable string inputs) makes them inherently safe for concurrent access.
The main usages are in:
- Interpolated string/mapping/boolean classes that use JinjaInterpolation in a sequential manner
- DateTime based cursor for template evaluation
- Test files that verify basic functionality
Would you agree that we can keep the current implementation as is? The caching provides performance benefits without introducing thread-safety concerns. wdyt?
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for concurrent usage patterns of these methods
rg -l "ThreadPoolExecutor|ProcessPoolExecutor|Thread|Process" | xargs rg "_find_undeclared_variables|_compile"
# Search for potential concurrent iterator usage
ast-grep --pattern 'for $_ in $_._find_undeclared_variables'
Length of output: 175
Script:
#!/bin/bash
# Let's check for any async/concurrent patterns in files that use JinjaInterpolation
rg -l "JinjaInterpolation" | xargs rg "async|await|concurrent|parallel|Thread|Process"
# Check how JinjaInterpolation is being used
rg -A 3 "JinjaInterpolation"
# Look for test files that might reveal concurrent usage
fd "test.*py" | xargs rg "JinjaInterpolation"
Length of output: 10361
DISCLAIMER: I don't exactly understand this one but it seems to have helped. One of the reason this isn't clear is that the main memory issue wasn't known and this was done because we were observing a high number of objects being allocated by
StreamPartitionAccessEnvironmentin the heap.Before this change , the memory usage was increasing faster.
We added this change as part of
airbyte/source-jira:3.4.0-dev.657e80a526and the process was still being killed because of memory usage but it was increasing slower. See the difference between the two executions here.Ultimately, I don't have a strong opinion as if this would need to be merged but it seems to help hence why I'm opening this PR. Based on this comment from a pallets developer, they seem to intend the lib to be thread safe so I feel somewhat confident that this change should be fine (this + the fact that the current sync without the memory issue is currently running without problems for the user).
Summary by CodeRabbit
New Features
Bug Fixes
Refactor