@codeflash-ai codeflash-ai bot commented Jan 2, 2026

📄 33% (1.33x) speedup for `_PreChunkAccumulator.will_fit` in `unstructured/chunking/base.py`

⏱️ Runtime: 333 nanoseconds → 250 nanoseconds (best of 53 runs)

📝 Explanation and details

The optimization achieves a **33% speedup** by introducing a `@lazyproperty` decorator to cache the text length computation in the `PreChunk` class.

**Key Changes:**

1. Added `_text_length` as a `@lazyproperty` that computes and caches `len(self._text)` (see the sketch below)
2. Replaced direct `len(self._text)` calls with `self._text_length` in `can_combine()`
3. Replaced `len(self.combine(pre_chunk)._text)` with `self.combine(pre_chunk)._text_length`
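
A minimal sketch of the cached-length pattern. This is hedged: the real `PreChunk` in `unstructured/chunking/base.py` carries more state and options, and `unstructured` ships its own `lazyproperty` decorator; `functools.cached_property` behaves equivalently for this purpose, and `text_separator` is a hypothetical field name used only for illustration.

```python
from functools import cached_property as lazyproperty  # stand-in for unstructured's decorator


class PreChunk:
    """Simplified stand-in for the real PreChunk in unstructured/chunking/base.py."""

    def __init__(self, elements, text_separator: str = "\n\n"):
        self._elements = elements
        self._text_separator = text_separator  # hypothetical field; the real options differ

    @lazyproperty
    def _text(self) -> str:
        # Concatenates all element text with separators; computed once, then cached.
        return self._text_separator.join(e.text for e in self._elements)

    @lazyproperty
    def _text_length(self) -> int:
        # The added cache: len() runs once, later reads are plain attribute lookups.
        return len(self._text)
```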

**Why This is Faster:**
The original code calls `len(self._text)` each time `can_combine()` is invoked. `_text` is itself a `@lazyproperty` that concatenates all element text with separators, so the string is built only once, but every subsequent check still pays for the `_text` attribute lookup plus the `len()` builtin call.

Caching the length in its own `@lazyproperty` collapses that pair of operations into a single cached attribute read on every check after the first. The line profiler shows the optimization's impact:

- Line with `len(self._text)`: 157,000 ns → `self._text_length`: 167,000 ns (slightly slower due to property-access overhead on the first call)
- Line with `len(self.combine(pre_chunk)._text)`: 184,000 ns → `self.combine(pre_chunk)._text_length`: 135,000 ns (**27% faster**, showing the benefit of reusing the cached length)
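
To see the per-call difference in isolation, here is a hypothetical micro-benchmark using the standard-library `timeit` module (`C` is a toy class, not the library's; absolute numbers vary by machine):

```python
import timeit

setup = """
class C:
    def __init__(self):
        self._text = "x" * 100_000
        self._text_length = len(self._text)  # computed once up front

c = C()
"""

# Repeated len(): an attribute lookup plus a builtin call per iteration.
print(timeit.timeit("len(c._text)", setup=setup))
# Cached length: a single attribute read per iteration.
print(timeit.timeit("c._text_length", setup=setup))
```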

**Impact on Workloads:**
This optimization is particularly beneficial when:

- `can_combine()` is called repeatedly during chunking operations, as in `_PreChunkAccumulator.will_fit()` (sketched after this list)
- many pre-chunks are processed, so the small per-call savings accumulate
- the same `PreChunk` instance is checked multiple times for combination eligibility, reusing the cached length each time
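
Building on the `PreChunk` sketch above, a hypothetical fit check that consumes the cached lengths might look like the following. The library's real `can_combine()` goes through `combine()` and applies its own rules; this sketch, with an illustrative `TEXT_SEPARATOR` and a free-standing `will_fit`, only shows where `_text_length` pays off.

```python
TEXT_SEPARATOR = "\n\n"  # illustrative; not necessarily the library's separator


def will_fit(current: PreChunk, nxt: PreChunk, maxlen: int) -> bool:
    """True if combining the two pre-chunks would stay within maxlen."""
    # Both lengths are cached attribute reads after their first access, so
    # checking the same pre-chunk against many candidates stays cheap.
    combined = current._text_length + len(TEXT_SEPARATOR) + nxt._text_length
    return combined <= maxlen
```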

The test case shows this is effective for typical chunking workflows where the accumulator repeatedly checks if pre-chunks will fit together.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | 154 Passed |
| 🔎 Concolic Coverage Tests | 3 Passed |
| 🌀 Generated Regression Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

To edit these changes, run `git checkout codeflash/optimize-_PreChunkAccumulator.will_fit-mjxj8mfo` and push.

codeflash-ai bot requested a review from aseembits93 on January 2, 2026 at 23:56
codeflash-ai bot added the ⚡️ codeflash label (Optimization PR opened by Codeflash AI) on January 2, 2026