fix: ensure vision-capable models for image attachments#388
Conversation
- Add requireVision parameter to getModel() to select vision-capable models - Update researcher agent to detect images and use vision models - Update resolution-search agent to use vision models for image analysis - Fix AWS Bedrock configuration with valid Claude 3.5 Sonnet model ID - Improve type safety for multimodal content in actions.tsx Fixes the issue where image attachments were not returning tokens because non-vision models were being used for multimodal content processing.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
WalkthroughIntroduces image-detection and vision-aware model selection across agents and utilities, adds explicit Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Action as Action Handler
participant Agent as Agent (Researcher / ResolutionSearch)
participant ModelSelection as getModel(requireVision)
participant ModelAPI as Model Provider
Client->>Action: Submit message (may include images)
Action->>Action: Build message object\ncontent typed as CoreMessage['content']
Action->>Agent: Invoke agent with messages
Agent->>Agent: Inspect message.content parts\nset hasImage flag
Agent->>ModelSelection: Call getModel(hasImage)
alt hasImage = true
Note right of ModelSelection `#BFD7EA`: Vision required\nxAI may be considered/skipped based on flag
ModelSelection->>ModelAPI: Return vision-capable model
else hasImage = false
Note right of ModelSelection `#E8F6E8`: Vision not required\nstandard provider selection
ModelSelection->>ModelAPI: Return standard model
end
Agent->>ModelAPI: streamText / generateObject using selected model
ModelAPI->>Agent: Response / stream
Agent->>Client: Return streamed/generated output
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Compliance Guide 🔍Below is a summary of compliance checks for this PR:
Compliance status legend🟢 - Fully Compliant🟡 - Partial Compliant 🔴 - Not Compliant ⚪ - Requires Further Human Verification 🏷️ - Compliance label |
||||||||||||||||||||||||
PR Code Suggestions ✨Explore these optional code suggestions:
|
|||||||||||
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
app/actions.tsxlib/agents/researcher.tsxlib/agents/resolution-search.tsxlib/utils/index.ts
🧰 Additional context used
🧬 Code graph analysis (3)
lib/agents/resolution-search.tsx (2)
lib/db/schema.ts (1)
messages(26-37)lib/utils/index.ts (1)
getModel(19-62)
app/actions.tsx (3)
components/chat-panel.tsx (1)
e(72-107)lib/actions/chat.ts (1)
msg(119-127)lib/actions/chat-db.ts (1)
msg(117-121)
lib/agents/researcher.tsx (2)
lib/db/schema.ts (1)
messages(26-37)lib/utils/index.ts (1)
getModel(19-62)
🔇 Additional comments (5)
lib/agents/researcher.tsx (2)
99-103: Image detection logic looks correct.The implementation properly checks for multimodal content by inspecting message.content arrays for image parts. This aligns with the PR objective to enable vision-capable model selection when images are present.
106-106: Type assertion is necessary due to missing return type annotation ongetModel().The
getModel()function inlib/utils/index.tslacks an explicit return type annotation and returns instances from different provider factories (createXai,createAmazonBedrock,createOpenAI). Without theas LanguageModelcast, TypeScript cannot guarantee that these provider instances are recognized asLanguageModel. To ensure type safety across all callers, the assertion should be retained. Alternatively, add an explicit return type annotation togetModel()to eliminate the need for repeated casts at every call site.Likely an incorrect or invalid review comment.
app/actions.tsx (1)
253-282: Type safety improvements look good.The explicit typing of
contentasCoreMessage['content'](line 254) and the message cast (line 282) properly align the multimodal content handling with the expectedCoreMessagestructure. The conditional logic correctly handles both string content (text-only) and array content (with images).lib/utils/index.ts (2)
19-38: requireVision parameter correctly gates model selection.The implementation properly skips xAI's non-vision model when
requireVisionis true, ensuring that only vision-capable models (Bedrock Claude or OpenAI gpt-4o) are selected for image processing tasks. This directly addresses the root cause described in the PR objectives.
24-24: The default Bedrock model ID is valid and supported.The model
anthropic.claude-3-5-sonnet-20241022-v2:0is a current Claude 3.5 Sonnet version supported by AWS Bedrock. No changes needed. Regional availability and IAM permissions depend on your AWS configuration—verify these at runtime using your AWS credentials if the model fails to load.
| // Check if any message contains an image (resolution search is specifically for image analysis) | ||
| const hasImage = messages.some(message => | ||
| Array.isArray(message.content) && | ||
| message.content.some(part => part.type === 'image') | ||
| ) | ||
|
|
||
| // Use generateObject to get the full object at once. | ||
| const { object } = await generateObject({ | ||
| model: getModel(), | ||
| model: getModel(hasImage), |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider validating that images are actually present.
Since resolutionSearch is specifically designed for satellite image analysis (per the system prompt), consider adding a validation check to ensure images are present before proceeding. This would provide clearer error messages if called incorrectly.
🔎 Suggested validation
const hasImage = messages.some(message =>
Array.isArray(message.content) &&
message.content.some(part => part.type === 'image')
)
+
+ if (!hasImage) {
+ throw new Error('resolutionSearch requires at least one image in the messages')
+ }
const { object } = await generateObject({📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Check if any message contains an image (resolution search is specifically for image analysis) | |
| const hasImage = messages.some(message => | |
| Array.isArray(message.content) && | |
| message.content.some(part => part.type === 'image') | |
| ) | |
| // Use generateObject to get the full object at once. | |
| const { object } = await generateObject({ | |
| model: getModel(), | |
| model: getModel(hasImage), | |
| // Check if any message contains an image (resolution search is specifically for image analysis) | |
| const hasImage = messages.some(message => | |
| Array.isArray(message.content) && | |
| message.content.some(part => part.type === 'image') | |
| ) | |
| if (!hasImage) { | |
| throw new Error('resolutionSearch requires at least one image in the messages') | |
| } | |
| // Use generateObject to get the full object at once. | |
| const { object } = await generateObject({ | |
| model: getModel(hasImage), |
🤖 Prompt for AI Agents
In lib/agents/resolution-search.tsx around lines 42 to 50, the code only detects
image parts but does not halt when none exist; add an explicit validation after
computing hasImage that checks if hasImage is false and then return or throw a
clear, contextual error (e.g., "resolutionSearch requires at least one image in
messages") so the function short-circuits before calling
generateObject/getModel; ensure the validation uses the same
Array.isArray(message.content) && message.content.some(...) check and that the
error path provides a helpful message for callers and logs where appropriate.
| // Check if any message contains an image (resolution search is specifically for image analysis) | ||
| const hasImage = messages.some(message => | ||
| Array.isArray(message.content) && | ||
| message.content.some(part => part.type === 'image') | ||
| ) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider extracting duplicate image detection logic.
This exact image detection pattern appears in both lib/agents/researcher.tsx (lines 99-103) and here. Consider extracting it to a shared utility function to maintain consistency and reduce duplication.
🔎 Example helper function
Add to lib/utils/index.ts:
+export function hasImageContent(messages: CoreMessage[]): boolean {
+ return messages.some(message =>
+ Array.isArray(message.content) &&
+ message.content.some(part => part.type === 'image')
+ )
+}Then use it in both files:
- const hasImage = messages.some(message =>
- Array.isArray(message.content) &&
- message.content.some(part => part.type === 'image')
- )
+ const hasImage = hasImageContent(messages)Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In lib/agents/resolution-search.tsx around lines 42 to 46 the image-detection
logic (checking message.content is an array and any part.type === 'image') is
duplicated elsewhere (lib/agents/researcher.tsx lines 99-103); extract this into
a shared utility (e.g., export function hasImage(messages: Message[]): boolean)
placed in lib/utils/index.ts, implement the same array-and-part-type checks once
in that function, export it, then replace the inline detection in both files
with an import of the new hasImage utility and call it (adjust imports/typing
accordingly).
There was a problem hiding this comment.
The core direction is correct, but getModel(requireVision) does not actually guarantee a vision-capable model when Bedrock is configured with an arbitrary BEDROCK_MODEL_ID, which can reintroduce the original failure mode. app/actions.tsx still has a lossy conversion path (joining parts to string) that may break if non-text parts exist, and it uses a broad as CoreMessage cast that can mask structural issues. Image detection logic is duplicated across agents and is inconsistent in resolution-search (computed from a different message set than the one sent to the model). Overall reliability would improve with a centralized messagesHaveImage() helper and stricter runtime selection/validation for vision models.
Additional notes (4)
-
Maintainability |
app/actions.tsx:279-282
Casting the entire object toCoreMessage(} as CoreMessage) hides structural issues and makes it easy to accidentally omit required fields (or include incompatible ones) without noticing. Since you already have aCoreMessage['content']variable, it’s better to construct an object that is explicitly aCoreMessage(or ensure the arraymessagesis typed so the push is checked) rather than forcing a cast at the end. -
Maintainability |
lib/agents/researcher.tsx:99-103
ThehasImagedetection is duplicated across agents and is subtly coupled to the exact content representation (Array.isArray(message.content)+part.type === 'image'). If the message format evolves (different part types, or nested content), you’ll need to update multiple locations and risk inconsistent behavior.
Given this PR’s theme (vision gating), centralizing image detection will reduce drift and make it easier to test.
- Compatibility |
lib/utils/index.ts:19-19
getModel(requireVision)currently assumes Bedrock is vision-capable whenever credentials exist and a model id is set. However,BEDROCK_MODEL_IDis configurable and could be set to a non-vision model, which would reintroduce the original failure mode whenrequireVision=true.
Since this function is now responsible for guaranteeing vision support, it should enforce that guarantee (or have an explicit allowlist of known vision-capable IDs) when requireVision is true.
- Readability |
lib/utils/index.ts:40-40
WhenrequireVisionis true, the function skips xAI (good), but the selection order is still Bedrock-first, then OpenAI. That’s fine, but there’s no explicit check that the Bedrock region is set (awsRegion). IfAWS_REGIONis missing, this will likely fail at runtime and prevent fallback.
Given the goal is reliability for image inputs, consider requiring region (or catching Bedrock creation errors and falling back).
Summary of changes
What changed
- Added vision-aware model selection by extending
getModel()to accept arequireVision: booleanparameter and using it to avoid xAI non-vision models when images are present. - Fixed AWS Bedrock default model id by reading
process.env.BEDROCK_MODEL_IDwith a fallback toanthropic.claude-3-5-sonnet-20241022-v2:0. - Added image detection in both agents:
lib/agents/researcher.tsxdetects multimodal image parts and callsgetModel(hasImage).lib/agents/resolution-search.tsxdoes the same for image analysis and callsgetModel(hasImage).
- Improved message content typing in
app/actions.tsxby usingCoreMessage['content']instead ofany, and casting the pushed message toCoreMessage.
Files touched
app/actions.tsxlib/agents/researcher.tsxlib/agents/resolution-search.tsxlib/utils/index.ts
| const hasImage = messageParts.some(part => part.type === 'image') | ||
| const content = hasImage | ||
| ? (messageParts as any) | ||
| // Properly type the content based on whether it contains images | ||
| const content: CoreMessage['content'] = hasImage | ||
| ? messageParts as CoreMessage['content'] | ||
| : messageParts.map(part => part.text).join('\n') |
There was a problem hiding this comment.
content is typed as CoreMessage['content'], but the code only checks hasImage to decide whether to send an array vs. a joined string. This can silently produce an invalid payload when non-image non-text parts exist (e.g., future file, audio, tool parts) or if a text part is missing .text. In that case, the non-image branch will do part.text and potentially drop/flatten data.
Given the stated goal (multimodal correctness), it’s safer to build content by validating/normalizing parts and only joining when you’re sure all parts are text.
Suggestion
Tighten the branching to ensure the string path only runs when every part is text, otherwise keep the array form:
const isAllText = messageParts.every(p => p.type === 'text')
const content: CoreMessage['content'] = isAllText
? messageParts.map(p => p.text).join('\n')
: (messageParts as CoreMessage['content'])This avoids lossy conversion if additional part types are introduced later. Reply with "@CharlieHelps yes please" if you'd like me to add a commit with this change.
| const filteredMessages = messages.filter(msg => msg.role !== 'system'); | ||
|
|
||
| // Check if any message contains an image (resolution search is specifically for image analysis) | ||
| const hasImage = messages.some(message => | ||
| Array.isArray(message.content) && | ||
| message.content.some(part => part.type === 'image') | ||
| ) | ||
|
|
||
| // Use generateObject to get the full object at once. | ||
| const { object } = await generateObject({ | ||
| model: getModel(), | ||
| model: getModel(hasImage), |
There was a problem hiding this comment.
filteredMessages is computed, but hasImage is checked against messages (unfiltered). If a system message ever contains image parts (or the caller passes different arrays), you could incorrectly select a vision model when it’s not needed, or vice versa.
For correctness and consistency, compute hasImage from the actual messages array you pass to the model (filteredMessages).
Suggestion
Base the check on filteredMessages:
const hasImage = filteredMessages.some(message =>
Array.isArray(message.content) &&
message.content.some(part => part.type === 'image')
)Reply with "@CharlieHelps yes please" if you'd like me to add a commit with this fix.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
lib/utils/index.ts (2)
37-37: Fix misleading fallback message.The error message claims fallback to OpenAI, but the actual fallback order is Gemini → Bedrock → OpenAI.
🔎 Suggested fix
- console.warn('xAI API unavailable, falling back to OpenAI:') + console.warn('xAI API unavailable, falling back to next provider')
55-69: Consider adding error handling for consistency.Unlike xAI and Gemini providers, the Bedrock block lacks a try-catch. If Bedrock initialization fails, the function won't fall back to OpenAI. Adding error handling would improve reliability and consistency, especially for vision-required scenarios.
🔎 Suggested improvement
if (awsAccessKeyId && awsSecretAccessKey) { - const bedrock = createAmazonBedrock({ - bedrockOptions: { - region: awsRegion, - credentials: { - accessKeyId: awsAccessKeyId, - secretAccessKey: awsSecretAccessKey, - }, - }, - }) - const model = bedrock(bedrockModelId, { - additionalModelRequestFields: { top_k: 350 }, - }) - return model + try { + const bedrock = createAmazonBedrock({ + bedrockOptions: { + region: awsRegion, + credentials: { + accessKeyId: awsAccessKeyId, + secretAccessKey: awsSecretAccessKey, + }, + }, + }) + const model = bedrock(bedrockModelId, { + additionalModelRequestFields: { top_k: 350 }, + }) + return model + } catch (error) { + console.warn('AWS Bedrock unavailable, falling back to next provider:', error) + } }
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
lib/utils/index.ts
🔇 Additional comments (4)
lib/utils/index.ts (4)
19-19: LGTM! Clean function signature with clear intent.The
requireVisionparameter with a sensible default provides an explicit, backward-compatible API for vision-aware model selection.
71-76: LGTM! Clear documentation of vision support.The updated comment accurately reflects that gpt-4o supports vision, making it a suitable fallback for multimodal inputs.
41-51: Gemini 3 Pro vision support is confirmed.Gemini 3 Pro is natively multimodal and supports image inputs (up to 900 images per prompt, with support for PNG, JPEG, WebP, HEIC, and HEIF formats). The code correctly does not skip Gemini when
requireVision=true.
25-25: No issues found. The model ID 'anthropic.claude-3-5-sonnet-20241022-v2:0' is valid and documented in AWS Bedrock, and Claude 3.5 Sonnet supports vision/multimodal capabilities as required by therequireVisionparameter in thegetModel()function.
| // If vision is required, skip models that don't support it | ||
| if (!requireVision && xaiApiKey) { |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Good vision-gating logic.
Correctly skips xAI when vision is required, as xAI models don't support multimodal inputs.
Optional: Consider clarifying the comment
- // If vision is required, skip models that don't support it
+ // If vision is required, skip xAI (no vision support) and try vision-capable providers📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // If vision is required, skip models that don't support it | |
| if (!requireVision && xaiApiKey) { | |
| // If vision is required, skip xAI (no vision support) and try vision-capable providers | |
| if (!requireVision && xaiApiKey) { |
🤖 Prompt for AI Agents
In lib/utils/index.ts around lines 27 to 28, the existing comment "If vision is
required, skip models that don't support it" is misleading given the
conditional; update the comment to accurately describe the logic (e.g., explain
that when vision is not required and an xAI API key is present, xAI-only models
are considered, otherwise xAI is skipped) so future readers understand the
condition and intent.
This commit documents that the branch has been synchronized with the latest changes from main branch. All recent updates have been merged including: - Gemini 3 Pro model support (PR #389) - Image attachment token fixes (PR #388) - Comprehensive E2E test suite (PR #350) - Playwright GitHub Actions CI/CD - All dependency updates and bug fixes The Supabase backend implementation and collaboration features from this PR have been preserved and are compatible with the latest main branch changes.
User description
Problem
When users attach images to their messages, the AI does not return tokens (streaming responses). The system appears to process the image but fails to generate a proper response.
Root Cause
The
getModel()function was returning models that don't support vision/multimodal inputs:Solution
getModel(): AddedrequireVisionparameter to ensure vision-capable models are used when images are presentChanges
lib/utils/index.ts: AddedrequireVisionparameter togetModel()lib/agents/researcher.tsx: Added image detection logiclib/agents/resolution-search.tsx: Added image detection logicapp/actions.tsx: Improved content type safetyTesting
The fix ensures that when images are attached:
Fixes the issue described in the attached documentation.
PR Type
Bug fix, Enhancement
Description
Add
requireVisionparameter togetModel()for vision-capable model selectionFix AWS Bedrock configuration with valid Claude 3.5 Sonnet model ID
Detect images in researcher and resolution-search agents
Improve type safety for multimodal content handling
Diagram Walkthrough
File Walkthrough
index.ts
Add vision parameter and fix Bedrock model configlib/utils/index.ts
requireVisionparameter togetModel()function to conditionallyselect vision-capable models
model ID
multimodal inputs
actions.tsx
Improve type safety for multimodal contentapp/actions.tsx
contentas
CoreMessage['content']CoreMessageresearcher.tsx
Add image detection for vision model selectionlib/agents/researcher.tsx
content
hasImageflag togetModel()to request vision-capable models whenneeded
resolution-search.tsx
Add image detection for resolution search agentlib/agents/resolution-search.tsx
hasImageflag togetModel()for vision-capable model selectionanalysis
Summary by CodeRabbit
New Features
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.