[Refactor] improve context compression budgeting #874
…rt token counting
- Rewrite infinite_context.ts: class -> function, structured output (summary + recent messages)
- Rewrite infinite_context_chain.ts: class -> simple compressChunk function
- Add scratchpad compression in agent loop (legacy-executor.ts)
- Extract shared countMessageTokens/countMessagesTokens to utils/count_tokens.ts with usage_metadata baseline optimization
- Update chat_history.ts and model.ts cropMessages to use baseline optimization
- Fix multimodal warning: 'chatluna-multimodal-service' -> 'multimodal-service'
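Walkthrough
This PR refactors ChatLuna's context compression pipeline, converting the class-based …
Changes
- Infinite context and scratchpad compression system refactor
- Multimodal plugin name update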
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Code Review
This pull request refactors the infinite context management system, moving from a class-based manager to functional utilities like compressIfNeeded and compressChunk. It introduces scratchpad compression for the agent executor to handle long tool-call loops and optimizes token counting by leveraging usage_metadata from previous AI responses as a baseline. Feedback focuses on ensuring that AbortSignal is correctly propagated through the new asynchronous compression paths to prevent unnecessary background processing and addressing a logic error in the token counting optimization that skips valid baseline messages. Additionally, it was noted that compression thresholds should be unified across the codebase.
…n trigger
Instead of estimating tokens by formatting scratchpad text, use the real input_tokens from the AI message's usage_metadata returned by the LLM call. This is accurate because it is the count the model actually consumed.
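A minimal sketch of the trigger this commit describes, not the code from the PR. Only `usage_metadata.input_tokens` comes from the commit message; the `AiMessageLike` shape, the `estimateTokens` fallback (about four characters per token), and the `shouldCompressScratchpad` name are assumptions made for illustration. The ratio is left as a parameter because the discussion mentions both 0.84 and 0.85.

```typescript
// Hypothetical message shape; only usage_metadata.input_tokens is taken from the PR text.
interface AiMessageLike {
    content: string
    usage_metadata?: { input_tokens: number }
}

// Assumed fallback estimator, used only when no usage metadata is available.
function estimateTokens(text: string): number {
    return Math.ceil(text.length / 4)
}

// Prefer the provider-reported input token count over a local estimate.
function shouldCompressScratchpad(
    lastAiMessage: AiMessageLike,
    scratchpadText: string,
    maxTokenLimit: number,
    ratio: number
): boolean {
    const usedTokens =
        lastAiMessage.usage_metadata?.input_tokens ??
        estimateTokens(scratchpadText)
    return usedTokens > maxTokenLimit * ratio
}
```

With a 1000-token limit and a ratio of 0.85, a reported `input_tokens` of 900 would trigger compression, while a short scratchpad with no metadata would not.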
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9256c50b33
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/core/src/llm-core/agent/legacy-executor.ts`:
- Around line 381-395: The compression condition uses only scratchpadTokens
(from formatScratchpadForCount and tokenCounter) against maxTokenLimit * 0.84,
but the actual prompt also includes input['chat_history']; update the check in
legacy-executor.ts to include chat history tokens: format and count
input['chat_history'] (using the same tokenCounter), then compute either
totalTokens = scratchpadTokens + chatHistoryTokens and compare totalTokens to
maxTokenLimit * 0.84, or compute remainingBudget = maxTokenLimit -
chatHistoryTokens and compare scratchpadTokens to remainingBudget * 0.84;
trigger compression when the combined/remaining-based threshold is exceeded
(adjust the existing if that currently tests scratchpadTokens).
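The two alternatives in this comment can be sketched as follows. This is an illustration under assumptions, not the executor's code: the function names are hypothetical, and the token counts are passed in as plain numbers rather than computed with tokenCounter.

```typescript
// Variant 1: compare scratchpad + chat history against the overall threshold.
function exceedsCombinedThreshold(
    scratchpadTokens: number,
    chatHistoryTokens: number,
    maxTokenLimit: number,
    ratio: number
): boolean {
    return scratchpadTokens + chatHistoryTokens > maxTokenLimit * ratio
}

// Variant 2: compare the scratchpad against what is left after chat history.
function exceedsRemainingBudget(
    scratchpadTokens: number,
    chatHistoryTokens: number,
    maxTokenLimit: number,
    ratio: number
): boolean {
    const remainingBudget = maxTokenLimit - chatHistoryTokens
    return scratchpadTokens > remainingBudget * ratio
}
```

The variants are not equivalent. With maxTokenLimit = 10000, 6000 chat history tokens, 3000 scratchpad tokens, and a ratio of 0.84, the scratchpad-only check does not fire (3000 is below 8400), the combined check fires (9000 exceeds 8400), and the remaining-budget check does not (3000 is below 3360).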
In `@packages/core/src/llm-core/chat/infinite_context.ts`:
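- Around line 111-123: The current implementation uses splitMessages() to keep
a fixed number of recent rounds (1 to 3), and compressIfNeeded() only records
outputTokens without re-checking threshold/maxTokenLimit. If the retained
recent rounds are long, the result can still exceed the budget and fail on the
next call. Instead, backfill recent rounds from newest to oldest by token
budget: in splitMessages or compressIfNeeded, introduce a budget calculation
based on threshold/maxTokenLimit (using threshold, maxTokenLimit, inputTokens,
and outputTokens), and accumulate the most recent complete rounds one at a time
until the accumulated tokens reach the budget cap. After building
resultMessages, recompute and set outputTokens, the compressed flag,
remainingMessageCount, and the messages field so they reflect the actual
compression result (referenced symbols: splitMessages, compressIfNeeded,
resultMessages, outputTokens, threshold, maxTokenLimit,
remainingMessageCount).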
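The backfill suggested here can be sketched as a loop that walks the rounds from newest to oldest and stops before the budget is exceeded. The names `Round`, `selectRecentRounds`, and the injected `countTokens` callback are hypothetical; how the budget is derived from threshold and maxTokenLimit is left to the caller.

```typescript
// Hypothetical types: a round is one complete exchange (human, AI, tool messages).
interface RoundMessage {
    content: string
}
type Round = RoundMessage[]

interface Selection {
    kept: Round[] // most recent rounds that fit, in chronological order
    keptTokens: number // real token cost of the kept rounds
    olderCount: number // rounds left over for summarization
}

function selectRecentRounds(
    rounds: Round[],
    budget: number,
    countTokens: (round: Round) => number
): Selection {
    const kept: Round[] = []
    let keptTokens = 0
    // Walk backwards so the newest rounds are kept first.
    for (let i = rounds.length - 1; i >= 0; i--) {
        const cost = countTokens(rounds[i])
        if (keptTokens + cost > budget) break
        kept.unshift(rounds[i])
        keptTokens += cost
    }
    return { kept, keptTokens, olderCount: rounds.length - kept.length }
}
```

Because only whole rounds are added, a tool call is never separated from its result, and `keptTokens` can be reused when recomputing outputTokens after compression.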
In `@packages/core/src/llm-core/platform/model.ts`:
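- Around line 835-891: Adding baselineTokens to totalTokens in a single step
(when baselineIdx/baselineRoundIdx is used) underestimates the cost of the AI
reply and tool messages that follow the baseline within the same round. Fix: do
not treat baselineTokens as the cost of all of rounds 0..baselineRoundIdx. In
the branch that handles i <= baselineRoundIdx with selectedRounds empty, call
countRoundTokens(conversationRounds[j]) round by round to sum the real token
count of each round in 0..baselineRoundIdx, use that sum to decide
exceedsLimit/truncated, and then unshift those rounds into selectedRounds one
at a time (instead of adding baselineTokens directly and unshifting in one go,
which duplicates baselineRoundIdx). Referenced symbols: baselineIdx,
baselineRoundIdx, baselineTokens, conversationRounds, selectedRounds,
totalTokens, countRoundTokens, maxTokenLimit.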
In `@packages/core/src/llm-core/prompt/chat_history.ts`:
- Around line 72-137: The baseline calculation underestimates historical tokens
because findBaseline/baseline.tokens is treated as the full cost up to baseline
while runtime.usedTokens has already subtracted the current request
(input/scratchpad) and the baseline AI reply token count is not added back; this
causes selectedRounds to include too much history when chatHistory ends with an
AI message. Fix by computing the true baseline cost as baseline.tokens plus the
token count of the baseline AI message if that message is not already included
in runtime.usedTokens (i.e., when current request tokens were removed), or
alternatively recompute the baseline segment by calling countMessagesTokens on
rounds[0..baselineRoundIdx] instead of trusting baseline.tokens; update the
logic in the loop that unwraps the bulkRounds (the block using baselineRoundIdx,
baseline.tokens, runtime.usedTokens, selectedRounds, availableLimit and
countMessagesTokens) and likewise apply the same correction in the analogous
code at lines 198-217 so usedTokens correctly reflects all messages up to and
including the baseline AI message before comparing to availableLimit.
In `@packages/core/src/middlewares/chat/read_chat_message.ts`:
- Line 252: The warning strings reference the old plugin name
"chatluna-multimodal-service" while the code checks for
ctx.chatluna.getPlugin('multimodal-service'); update all warning/error messages
in this file that mention "chatluna-multimodal-service" (the messages near the
checks around ctx.chatluna.getPlugin('multimodal-service')) to use
"multimodal-service" so the logged/printed plugin name matches the actual plugin
id the code looks up (apply to the other similar messages in the same file).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b2d3d4ce-1642-4765-a201-18f725a050f0
📒 Files selected for processing (9)
packages/core/src/llm-core/agent/legacy-executor.ts
packages/core/src/llm-core/chain/infinite_context_chain.ts
packages/core/src/llm-core/chat/app.ts
packages/core/src/llm-core/chat/infinite_context.ts
packages/core/src/llm-core/platform/model.ts
packages/core/src/llm-core/prompt/chat_history.ts
packages/core/src/llm-core/prompt/system_prompts.ts
packages/core/src/llm-core/utils/count_tokens.ts
packages/core/src/middlewares/chat/read_chat_message.ts
- count_tokens.ts: allow baseline when it's the last message (baselineIdx >= 0)
- Pass AbortSignal through compression chain (app.ts -> infinite_context -> compressChunk, legacy-executor -> compressScratchpad -> compressChunk)
- Unify compression threshold to 0.85
- Fix compacted messages detection: use reference equality (compacted !== messages) instead of length comparison
- Revert chat_history.ts baseline optimization (unreliable in prompt pipeline context where system tokens differ between calls)
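To illustrate why the compacted-messages check moved from length comparison to reference equality, here is a small sketch. The `compactMessages` function is a hypothetical stand-in that returns the same array instance when nothing changes and a new array otherwise; only the `compacted !== messages` check itself is taken from the commit message.

```typescript
type Message = { role: string; content: string }

// Hypothetical compaction step: returns the input array untouched when it
// already fits, otherwise a new array with a summary plus the last message.
function compactMessages(messages: Message[], maxChars: number): Message[] {
    const total = messages.reduce((n, m) => n + m.content.length, 0)
    if (total <= maxChars) return messages
    const last = messages[messages.length - 1]
    return [{ role: 'system', content: '[summary of earlier messages]' }, last]
}

// Reference equality detects compaction even when the length is unchanged.
function wasCompacted(original: Message[], compacted: Message[]): boolean {
    return compacted !== original
}
```

When a two-message history is compacted into a summary plus the last message, the result still has two entries, so a length comparison would report "unchanged" while the reference check correctly reports a change.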
…text
- cropMessages baseline now counts the AI message itself and subsequent tool messages in the same round (usage_metadata.input_tokens only covers messages before the AI response)
- Update warning messages to show both plugin names for clarity
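The baseline correction in this commit can be sketched as below. This is an assumption-laden illustration rather than the cropMessages implementation: the `Msg` shape, the character-based `countMessageTokens` stand-in, and the `baselineCost` name are hypothetical, and the sketch treats "subsequent tool messages in the same round" as the run of tool messages directly after the baseline AI message.

```typescript
// Hypothetical message shape for illustration.
interface Msg {
    role: 'human' | 'ai' | 'tool'
    content: string
    usage_metadata?: { input_tokens: number }
}

// Assumed stand-in for a real per-message token counter.
function countMessageTokens(message: Msg): number {
    return Math.ceil(message.content.length / 4)
}

// Cost of everything up to the end of the baseline round:
// input_tokens covers only the messages before the AI response, so the AI
// message itself and the tool messages that follow it are added on top.
function baselineCost(messages: Msg[], baselineIdx: number): number {
    const baseline = messages[baselineIdx]
    let total = baseline.usage_metadata?.input_tokens ?? 0
    total += countMessageTokens(baseline)
    for (
        let i = baselineIdx + 1;
        i < messages.length && messages[i].role === 'tool';
        i++
    ) {
        total += countMessageTokens(messages[i])
    }
    return total
}
```

For a baseline AI message reporting 100 input tokens, with an 8-character reply and one 16-character tool result, the sketch yields 100 + 2 + 4 = 106 instead of the 100 that the uncorrected baseline would use.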
This PR refactors ChatLuna context compression to better account for chat history, tool messages, and agent scratchpad tokens.
New Features
- usage_metadata.input_tokens as the scratchpad compression trigger baseline.
Bug fixes
Other Changes
- yarn lint-fix completed with no errors. Existing max-len warnings remain in read_chat_message.ts.