Fix performance regression for prefix caching #4270
Conversation
Pull request overview
This PR fixes a performance regression in prefix caching introduced by #4223 that caused increased cache misses when sending long and repeated requests in bursts. The fix enables caching and matching of prefix blocks from partially prefilled sequences.
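For background, prefix caching operates on fixed-size KV blocks, so a sequence that has only been partially prefilled can still contribute the complete blocks it already holds. Below is a minimal sketch of that arithmetic; `CacheableBlocks` is a hypothetical helper, not a turbomind function, while `cache_len` and `block_seq_len` mirror names that appear in the diff further down.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative helper (not turbomind code): how many whole KV blocks of a
// partially prefilled prompt can be reused by the prefix cache. Only tokens
// that are already prefilled (cache_len) and that fill a complete block count.
std::size_t CacheableBlocks(std::size_t cache_len, std::size_t prompt_len, std::size_t block_seq_len)
{
    const std::size_t prefilled = std::min(cache_len, prompt_len);
    return prefilled / block_seq_len;  // the trailing partial block is not cached
}

int main()
{
    // e.g. 300 of 1000 prompt tokens prefilled, 128 tokens per block -> 2 blocks
    return static_cast<int>(CacheableBlocks(300, 1000, 128));
}
```

For example, with 128 tokens per block, a sequence that has prefilled 300 of its 1000 prompt tokens still yields two complete blocks that later requests with the same prefix can reuse.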
Changes:
- Modified `CachePrompt` and `PrefixMatch` functions to allow caching and matching of blocks from partially prefilled sequences (where `cache_len < prompt.size()`)
- Changed the `Materialize` return value attribute from `[[nodiscard]]` to `[[maybe_unused]]`, since the return value is not always used (see the sketch after this list)
- Replaced `FT_CHECK` macros with `TM_CHECK` variants for consistency
- Refactored code formatting and variable initialization in several files
- Added `async_` parameter to benchmark script
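Regarding the `Materialize` attribute change in the list above, here is a self-contained sketch of the practical difference between the two attributes; `SequenceManagerSketch` and the `std::vector<int>` return type are placeholders, not the real interface.

```cpp
#include <vector>

struct SequenceManagerSketch {
    // With [[nodiscard]], every caller that ignored the return value produced
    // a discarded-result warning, even when only the side effect was wanted.
    // [[nodiscard]] std::vector<int> Materialize();

    // Without [[nodiscard]] the compiler no longer warns when the result is
    // discarded; [[maybe_unused]] here mainly documents that not every call
    // site consumes the return value.
    [[maybe_unused]] std::vector<int> Materialize() { return {1, 2, 3}; }
};

int main()
{
    SequenceManagerSketch m;
    m.Materialize();                // result intentionally ignored: no warning
    auto blocks = m.Materialize();  // result used
    return static_cast<int>(blocks.size());
}
```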
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/turbomind/models/llama/SequenceManager.h | Changed Materialize attribute from [[nodiscard]] to [[maybe_unused]] |
| src/turbomind/models/llama/SequenceManager.cc | Enabled caching/matching for partially prefilled sequences; replaced FT_CHECK with TM_CHECK; cleaned up formatting |
| src/turbomind/models/llama/BlockTrie.cc | Refactored Match and Cache functions with clearer variable names and improved control flow |
| src/turbomind/models/llama/BlockManager.h | Removed unused function declaration |
| src/turbomind/models/llama/BlockManager.cc | Replaced FT_CHECK with TM_CHECK; removed unused function; added timestamp initialization |
| src/turbomind/engine/engine.cc | Removed unused variables; removed unnecessary lambda captures |
| benchmark/profile_pipeline_api.py | Added async_ parameter support |
    BlockIds block_ids;
    UniqueIds block_unique_ids;
    std::tie(block_ids, block_unique_ids) = block_trie_->Cache(seq, seq.prompt);
    if (auto& seq = *sequences[i]; !seq.prompt.empty()) {
The condition `!seq.prompt.empty()` should also check if there are blocks available to cache. When `seq.cache_len < seq.prompt.size()` but `seq.blocks` is empty or insufficient, attempting to cache could lead to unexpected behavior. Consider adding a check for `seq.blocks.size()` to ensure blocks exist before caching.
Suggested change:

    - if (auto& seq = *sequences[i]; !seq.prompt.empty()) {
    + if (auto& seq = *sequences[i]; !seq.prompt.empty() && !seq.blocks.empty()) {
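A small standalone illustration of the guard this suggestion adds; the `Seq` struct and `CacheIfPossible` function are toy stand-ins, and only the condition itself mirrors the suggested code.

```cpp
#include <cstdio>
#include <vector>

// Toy stand-ins for the real types; only the guard logic is the point here.
struct Seq {
    std::vector<int> prompt;  // prompt token ids
    std::vector<int> blocks;  // KV block ids already owned by the sequence
};

// Hypothetical caching call: with no blocks there is nothing to insert into
// the trie, so the guarded version simply skips the sequence.
void CacheIfPossible(const Seq& seq)
{
    if (!seq.prompt.empty() && !seq.blocks.empty()) {
        std::printf("caching %zu block(s)\n", seq.blocks.size());
    }
    else {
        std::printf("skipped: prompt or blocks empty\n");
    }
}

int main()
{
    CacheIfPossible({{1, 2, 3}, {}});      // prompt present, no blocks -> skipped
    CacheIfPossible({{1, 2, 3}, {7, 8}});  // both present -> cached
}
```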
    }

    /// TODO: `Unlock` and `Lock` can't be batched because there may be repeated blocks between sequences
    if (const int offset = seq.cache_len / block_seq_len_; offset < block_ids.size()) {
The offset calculation using integer division could lead to data loss if `seq.cache_len` is not perfectly aligned with `block_seq_len_`. Consider adding a check or comment to clarify the expected behavior when `seq.cache_len % block_seq_len_ != 0`.
Suggested change:

    - if (const int offset = seq.cache_len / block_seq_len_; offset < block_ids.size()) {
    + // NOTE: seq.cache_len is expected to be aligned to block_seq_len_. The integer division below
    + // intentionally converts from cached tokens to a whole-block offset; any partial block (non-zero
    + // remainder) is ignored here and will be effectively truncated.
    + const int offset = seq.cache_len / block_seq_len_;
    + if (seq.cache_len % block_seq_len_ != 0 && rank_ == 0) {
    +     TM_LOG_DEBUG(
    +         "[SeqMgr][match] ID %llu, cache_len %d not aligned to block_seq_len_ %d; truncating to %d blocks",
    +         seq.id, seq.cache_len, block_seq_len_, offset);
    + }
    + if (offset < static_cast<int>(block_ids.size())) {
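To make the truncation behaviour concrete, here is a tiny worked example of the division discussed above; the values are arbitrary, and `block_seq_len` simply mirrors `block_seq_len_` from the diff.

```cpp
#include <cstdio>

int main()
{
    const int block_seq_len = 128;  // tokens per KV block
    const int cache_len     = 300;  // prefilled tokens, not block-aligned

    const int offset    = cache_len / block_seq_len;  // 2 whole blocks
    const int remainder = cache_len % block_seq_len;  // 44 tokens excluded from the match

    std::printf("offset=%d, truncated tokens=%d\n", offset, remainder);
    // prints: offset=2, truncated tokens=44
}
```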
In #4223, the change of scheduling policy causes more cache misses when sending long and repeated requests in a burst. This PR fixes it by enabling caching and matching of prefix blocks from partially prefilled sequences.