Fix performance regression for prefix caching #4270
Conversation
Pull request overview
This PR fixes a performance regression in prefix caching introduced by #4223 that caused increased cache misses when sending long and repeated requests in bursts. The fix enables caching and matching of prefix blocks from partially prefilled sequences.
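For background, prefix caching operates on fixed-size KV blocks, so a sequence that has only been partially prefilled can still contribute the complete blocks it already holds. Below is a minimal sketch of that arithmetic; `CacheableBlocks` is a hypothetical helper, not a turbomind function, while `cache_len` and `block_seq_len` mirror names that appear in the diff further down.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative helper (not turbomind code): how many whole KV blocks of a
// partially prefilled prompt can be reused by the prefix cache. Only tokens
// that are already prefilled (cache_len) and that fill a complete block count.
std::size_t CacheableBlocks(std::size_t cache_len, std::size_t prompt_len, std::size_t block_seq_len)
{
    const std::size_t prefilled = std::min(cache_len, prompt_len);
    return prefilled / block_seq_len;  // the trailing partial block is not cached
}

int main()
{
    // e.g. 300 of 1000 prompt tokens prefilled, 128 tokens per block -> 2 blocks
    return static_cast<int>(CacheableBlocks(300, 1000, 128));
}
```

For example, with 128 tokens per block, a sequence that has prefilled 300 of its 1000 prompt tokens still yields two complete blocks that later requests with the same prefix can reuse.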
Changes:
- Modified `CachePrompt` and `PrefixMatch` functions to allow caching and matching of blocks from partially prefilled sequences (where `cache_len < prompt.size()`)
- Changed the `Materialize` return value attribute from `[[nodiscard]]` to `[[maybe_unused]]`, since the return value is not always used (see the sketch after this list)
- Replaced `FT_CHECK` macros with `TM_CHECK` variants for consistency
- Refactored code formatting and variable initialization in several files
- Added `async_` parameter to benchmark script
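Regarding the `Materialize` attribute change in the list above, here is a self-contained sketch of the practical difference between the two attributes; `SequenceManagerSketch` and the `std::vector<int>` return type are placeholders, not the real interface.

```cpp
#include <vector>

struct SequenceManagerSketch {
    // With [[nodiscard]], every caller that ignored the return value produced
    // a discarded-result warning, even when only the side effect was wanted.
    // [[nodiscard]] std::vector<int> Materialize();

    // Without [[nodiscard]] the compiler no longer warns when the result is
    // discarded; [[maybe_unused]] here mainly documents that not every call
    // site consumes the return value.
    [[maybe_unused]] std::vector<int> Materialize() { return {1, 2, 3}; }
};

int main()
{
    SequenceManagerSketch m;
    m.Materialize();                // result intentionally ignored: no warning
    auto blocks = m.Materialize();  // result used
    return static_cast<int>(blocks.size());
}
```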
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/turbomind/models/llama/SequenceManager.h | Changed Materialize attribute from [[nodiscard]] to [[maybe_unused]] |
| src/turbomind/models/llama/SequenceManager.cc | Enabled caching/matching for partially prefilled sequences; replaced FT_CHECK with TM_CHECK; cleaned up formatting |
| src/turbomind/models/llama/BlockTrie.cc | Refactored Match and Cache functions with clearer variable names and improved control flow |
| src/turbomind/models/llama/BlockManager.h | Removed unused function declaration |
| src/turbomind/models/llama/BlockManager.cc | Replaced FT_CHECK with TM_CHECK; removed unused function; added timestamp initialization |
| src/turbomind/engine/engine.cc | Removed unused variables; removed unnecessary lambda captures |
| benchmark/profile_pipeline_api.py | Added async_ parameter support |
    BlockIds block_ids;
    UniqueIds block_unique_ids;
    std::tie(block_ids, block_unique_ids) = block_trie_->Cache(seq, seq.prompt);
    if (auto& seq = *sequences[i]; !seq.prompt.empty()) {
The condition `!seq.prompt.empty()` should also check if there are blocks available to cache. When `seq.cache_len < seq.prompt.size()` but `seq.blocks` is empty or insufficient, attempting to cache could lead to unexpected behavior. Consider adding a check for `seq.blocks.size()` to ensure blocks exist before caching.
Suggested change:

    - if (auto& seq = *sequences[i]; !seq.prompt.empty()) {
    + if (auto& seq = *sequences[i]; !seq.prompt.empty() && !seq.blocks.empty()) {
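A small standalone illustration of the guard this suggestion adds; the `Seq` struct and `CacheIfPossible` function are toy stand-ins, and only the condition itself mirrors the suggested code.

```cpp
#include <cstdio>
#include <vector>

// Toy stand-ins for the real types; only the guard logic is the point here.
struct Seq {
    std::vector<int> prompt;  // prompt token ids
    std::vector<int> blocks;  // KV block ids already owned by the sequence
};

// Hypothetical caching call: with no blocks there is nothing to insert into
// the trie, so the guarded version simply skips the sequence.
void CacheIfPossible(const Seq& seq)
{
    if (!seq.prompt.empty() && !seq.blocks.empty()) {
        std::printf("caching %zu block(s)\n", seq.blocks.size());
    }
    else {
        std::printf("skipped: prompt or blocks empty\n");
    }
}

int main()
{
    CacheIfPossible({{1, 2, 3}, {}});      // prompt present, no blocks -> skipped
    CacheIfPossible({{1, 2, 3}, {7, 8}});  // both present -> cached
}
```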
    }

    /// TODO: `Unlock` and `Lock` can't be batched because there may be repeated blocks between sequences
    if (const int offset = seq.cache_len / block_seq_len_; offset < block_ids.size()) {
The offset calculation using integer division could lead to data loss if `seq.cache_len` is not perfectly aligned with `block_seq_len_`. Consider adding a check or comment to clarify the expected behavior when `seq.cache_len % block_seq_len_ != 0`.
Suggested change:

    - if (const int offset = seq.cache_len / block_seq_len_; offset < block_ids.size()) {
    + // NOTE: seq.cache_len is expected to be aligned to block_seq_len_. The integer division below
    + // intentionally converts from cached tokens to a whole-block offset; any partial block (non-zero
    + // remainder) is ignored here and will be effectively truncated.
    + const int offset = seq.cache_len / block_seq_len_;
    + if (seq.cache_len % block_seq_len_ != 0 && rank_ == 0) {
    +     TM_LOG_DEBUG(
    +         "[SeqMgr][match] ID %llu, cache_len %d not aligned to block_seq_len_ %d; truncating to %d blocks",
    +         seq.id, seq.cache_len, block_seq_len_, offset);
    + }
    + if (offset < static_cast<int>(block_ids.size())) {
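To make the truncation behaviour concrete, here is a tiny worked example of the division discussed above; the values are arbitrary, and `block_seq_len` simply mirrors `block_seq_len_` from the diff.

```cpp
#include <cstdio>

int main()
{
    const int block_seq_len = 128;  // tokens per KV block
    const int cache_len     = 300;  // prefilled tokens, not block-aligned

    const int offset    = cache_len / block_seq_len;  // 2 whole blocks
    const int remainder = cache_len % block_seq_len;  // 44 tokens excluded from the match

    std::printf("offset=%d, truncated tokens=%d\n", offset, remainder);
    // prints: offset=2, truncated tokens=44
}
```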
In #4223, the change of scheduling policy causes more cache misses when sending long and repeated requests in a burst. This PR fixes it by enabling caching and matching of prefix blocks from partially prefilled sequences.