⚡️ Speed up method _PreChunkAccumulator.will_fit by 33%
#73
+7
−2
📄 33% (0.33x) speedup for `_PreChunkAccumulator.will_fit` in `unstructured/chunking/base.py`

⏱️ Runtime: 333 nanoseconds → 250 nanoseconds (best of 53 runs)

📝 Explanation and details
The optimization achieves a 33% speedup by introducing a `@lazyproperty` decorator to cache the text-length computation in the `PreChunk` class.

Key Changes:

- Added `_text_length` as a `@lazyproperty` that computes and caches `len(self._text)`
- Replaced `len(self._text)` calls with `self._text_length` in `can_combine()`
- Replaced `len(self.combine(pre_chunk)._text)` with `self.combine(pre_chunk)._text_length`
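To make the pattern concrete, here is a minimal, hypothetical sketch of the change described above. The names (`PreChunk`, `_text`, `can_combine`, `combine`) follow the description, but the class is simplified, and `functools.cached_property` stands in for the library's own `@lazyproperty` decorator, which memoizes the same way:

```python
from functools import cached_property


class PreChunk:
    """Simplified sketch of the cached-length pattern; not the real
    unstructured.chunking.base.PreChunk, whose @lazyproperty behaves
    like functools.cached_property."""

    def __init__(self, texts, separator="\n\n", max_characters=500):
        self._texts = texts
        self._separator = separator
        self._max_characters = max_characters

    @cached_property
    def _text(self):
        # Concatenation runs at most once per instance, then is memoized.
        return self._separator.join(self._texts)

    @cached_property
    def _text_length(self):
        # Cached length: repeated can_combine() checks reuse this integer
        # instead of calling len(self._text) each time.
        return len(self._text)

    def combine(self, other):
        return PreChunk(
            self._texts + other._texts, self._separator, self._max_characters
        )

    def can_combine(self, other):
        # The optimized form: cached length rather than len(..._text).
        return self.combine(other)._text_length <= self._max_characters
```

On first access, `_text_length` is computed and stored on the instance; every later check is a plain attribute read.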
Why This is Faster:

The original code calls `len(self._text)` each time `can_combine()` is invoked. Since `_text` is itself a `@lazyproperty` that concatenates all element text with separators, every check repeats the attribute lookup and `len()` call on that string. Caching the length as its own `@lazyproperty` does that work once per instance and thereafter returns a plain integer.

The line profiler shows the optimization's impact:

- `len(self._text)`: 157,000 ns → `self._text_length`: 167,000 ns (slightly slower due to property access overhead on first call)
- `len(self.combine(pre_chunk)._text)`: 184,000 ns → `self.combine(pre_chunk)._text_length`: 135,000 ns (27% faster, showing the benefit of reusing the cached length)
This optimization is particularly beneficial when:

- `can_combine()` is called repeatedly during chunking operations (as seen in `_PreChunkAccumulator.will_fit()`)
- pre-chunk text is large, making the repeated `len()` operation more expensive
- the same `PreChunk` instance is checked multiple times for combination eligibility

The test case shows this is effective for typical chunking workflows where the accumulator repeatedly checks whether pre-chunks will fit together.
✅ Correctness verification report:

- ⚙️ Existing Unit Tests
- 🔎 Concolic Coverage Tests
To edit these changes, run `git checkout codeflash/optimize-_PreChunkAccumulator.will_fit-mjxj8mfo` and push.