[V1.2.3] Add dedicated OCR and video analysis actions (#155) #196
Added name character limit to 20
Two bugs were fixed:

Bug 1: `prompt_cache_key` is an OpenAI-specific routing hint passed in `extra_body`. xAI's API ignores it, so it does nothing for cache routing with Grok. It is now skipped when `self.provider == "grok"`.

Bug 2: The wrong field was used to read cached tokens:
* OpenAI → `usage.prompt_tokens_details.cached_tokens`
* Grok (xAI) → `usage.prompt_cache_hit_tokens`

The code always read the OpenAI field, so Grok always reported 0 cached tokens, making every call look like a full cache miss. I fixed this by branching on `self.provider == "grok"` to read the correct field. Additionally, I updated the cache metrics log to show the actual provider name (grok, openai, etc.).

----------------
I pushed the fixes to the same branch, `feature/CLI`.
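A minimal sketch of the two fixes described above, not the code from this PR. The helper names `cached_tokens_from_usage` and `build_extra_body` are hypothetical, and the usage objects are stand-ins built with `SimpleNamespace`; only the field names and the `provider == "grok"` branch come from the description.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def cached_tokens_from_usage(usage: Any, provider: str) -> int:
    """Return the cached prompt token count, reading the provider's own field."""
    if provider == "grok":
        # Per the description, xAI reports cache hits in a top-level usage field.
        return int(getattr(usage, "prompt_cache_hit_tokens", 0) or 0)
    # OpenAI-style responses nest the count under prompt_tokens_details.
    details = getattr(usage, "prompt_tokens_details", None)
    return int(getattr(details, "cached_tokens", 0) or 0)


def build_extra_body(provider: str, cache_key: Optional[str]) -> Dict[str, Any]:
    """Attach prompt_cache_key unless the provider is Grok, which ignores it."""
    if provider == "grok" or not cache_key:
        return {}
    return {"prompt_cache_key": cache_key}


# Stand-in usage objects shaped like the two fields named above.
openai_usage = SimpleNamespace(
    prompt_tokens_details=SimpleNamespace(cached_tokens=128)
)
grok_usage = SimpleNamespace(prompt_cache_hit_tokens=64)
```

Reading the Grok object through the OpenAI path returns 0 in this sketch, which is the symptom the description reports.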
…into feature/CLI
The current approach to video understanding can be improved. Among the providers we have set up, only Google's model/API supports video understanding (Qwen and Moonshot do too, but we haven't set them up as providers yet). Here is what you can do for now: if the user has a Google API key set up, use the Gemini API to perform video understanding; if not, the agent can fall back to your current approach (which might be expensive and less effective). You can refer to the … I haven't tested the PR yet; once this is improved, I will test it.
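The suggested selection logic could look roughly like this. It is a sketch only: the function name `choose_video_backend`, the environment variable name `GOOGLE_API_KEY`, and the backend labels are assumptions, not names taken from the project.

```python
import os
from typing import Mapping, Optional


def choose_video_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick Gemini when a Google API key is configured, else the keyframe fallback."""
    env = os.environ if env is None else env
    # GOOGLE_API_KEY is an assumed variable name; the project may store the key elsewhere.
    if env.get("GOOGLE_API_KEY", "").strip():
        return "gemini"
    # No key available: fall back to the existing frame-based approach.
    return "keyframes"
```

Passing the environment as a parameter keeps the decision easy to test without touching the real process environment.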
…d_video, OpenCV as fallback
- Install issues on Mac fixed
- Python compatibility & syntax issues fixed
- CLI skills updated
- Local LLM compatibility issues fixed
- Image action error fixed
…exceptions in describe_image_bytes
…r and understand_video
…model
- Merge generate_multimodal_multi_image into generate_multimodal (image_bytes_list param)
- Add json_mode param to describe_image_bytes; describe_image_ocr is now a thin wrapper
- understand_video pulls model from get_vlm_model() with gemini-1.5-pro fallback
- Add test suites: gemini_client_multimodal, vlm_json_mode, ocr_wrapper, understand_video_model
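The wrapper and fallback pattern in this commit message can be sketched as follows. The function bodies are stand-ins, not the project's implementation: the signatures, the `json_mode` default, the OCR prompt text, and the helper `resolve_video_model` are all assumptions. Only the names `describe_image_bytes`, `describe_image_ocr`, `get_vlm_model`, and the `gemini-1.5-pro` fallback come from the message.

```python
from typing import Callable, Dict, Optional

DEFAULT_VIDEO_MODEL = "gemini-1.5-pro"  # fallback named in the commit message


def resolve_video_model(get_vlm_model: Callable[[], Optional[str]]) -> str:
    """Use the configured VLM model, falling back to the default if none is set."""
    try:
        model = get_vlm_model()
    except Exception:
        # Treat a failing lookup the same as "no model configured" (assumption).
        model = None
    return model or DEFAULT_VIDEO_MODEL


def describe_image_bytes(
    image_bytes: bytes, prompt: str, json_mode: bool = True
) -> Dict[str, object]:
    """Stand-in for the real VLM call; it only records how it was invoked."""
    return {"prompt": prompt, "json_mode": json_mode, "size": len(image_bytes)}


def describe_image_ocr(image_bytes: bytes) -> Dict[str, object]:
    """Thin wrapper: same call path, with JSON enforcement switched off."""
    return describe_image_bytes(
        image_bytes, "Extract all text from this image.", json_mode=False
    )
```

With a single `json_mode` flag, the OCR path reuses the main function instead of duplicating it in a separate plain-text variant.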
Feature/cli
Update version from 1.2.2 to 1.2.3 in settings config
ahmad-ajmal left a comment
Just one small fix and the merge conflicts
fb0be05 to 866601e
I don't think you need to change 89 files in this PR.
@ahmad-ajmal, hey, it's so many files because I just deleted, throughout the entire project, the `from __future__ import annotations` call I had used for testing.
        except Exception as e:
            return {'status': 'error', 'summary': '', 'file_path': '', 'file_saved': False, 'message': str(e)}


    execute = perform_ocr
What is this line for? (Same in `understand_video`.)
            except Exception as e:
                logger.warning(f"[VLM] Failed to report usage: {e}")


        def _openai_describe_bytes_plain(self, image_bytes: bytes, sys: str | None, usr: str) -> Dict[str, Any]:
This is almost identical to `_openai_describe_bytes`.
    @@ -1,4 +1,5 @@
    # -*- coding: utf-8 -*-
    from __future__ import annotations
very confusing change - is there a reason for this?
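For context on what the import in question does (this is standard Python behaviour, not a statement of the author's intent): under `from __future__ import annotations`, annotations are stored as strings and are not evaluated when the function is defined. A signature such as `sys: str | None` therefore does not fail on Python versions older than 3.10. The sketch below uses `exec` on a source string only so that the `__future__` import can sit at the top of its own module-like unit; the function name `describe` is illustrative.

```python
import textwrap

# Source for a tiny module that uses the future import and a PEP 604 union.
source = textwrap.dedent(
    """
    from __future__ import annotations

    def describe(sys: str | None, usr: str) -> dict:
        return {"sys": sys, "usr": usr}
    """
)

namespace = {}
exec(source, namespace)

# With the future import, annotations are kept as plain strings.
annotations = namespace["describe"].__annotations__
```

Here `annotations["sys"]` is the string `"str | None"`, not an evaluated type, which is why removing or adding the import across many files can change whether older interpreters can import them.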
    @@ -1,4 +1,5 @@
    # -*- coding: utf-8 -*-
    from __future__ import annotations
Same here; is there a reason for this?
Don't push this - we don't want MCPs enabled by default.
Don't push this - we don't want skills enabled by default.
revert the
…config, skills_config
…be_bytes via json_mode param
…nnotations in log_events and profiler
…everted bulk deletion)
…tead of delegating entirely to InternalActionInterface
What changed
- … `perform_ocr` …
- … `understand_video` …
- … `describe_image_ocr` and `describe_video_frames` …
- … `_openai_describe_bytes_plain` for raw text output (no JSON enforcement)
- … `GeminiClient.generate_multimodal` to accept multiple image frames via `image_bytes_list`
- … `perform_ocr` and `understand_video` in `InternalActionInterface`
- … `describe_image` action
- … `opencv-python-headless` to requirements for keyframe extraction

Closes #155
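As an illustration of the keyframe extraction that `opencv-python-headless` is added for, here is a hedged sketch, not the PR's implementation. The function names `sample_frame_indices` and `extract_keyframes`, the even-spacing strategy, the default of 8 frames, and the JPEG encoding are all assumptions. `cv2` is imported lazily so the index logic can be used without OpenCV installed.

```python
from typing import List


def sample_frame_indices(total_frames: int, max_frames: int) -> List[int]:
    """Return up to max_frames evenly spaced frame indices."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]


def extract_keyframes(video_path: str, max_frames: int = 8) -> List[bytes]:
    """Read the sampled frames with OpenCV and return them as JPEG bytes."""
    import cv2  # lazy import: provided by opencv-python-headless

    cap = cv2.VideoCapture(video_path)
    frames: List[bytes] = []
    try:
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        for idx in sample_frame_indices(total, max_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the chosen frame
            ok, frame = cap.read()
            if not ok:
                continue  # skip frames that cannot be decoded
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(buf.tobytes())
    finally:
        cap.release()
    return frames
```

A list of JPEG byte strings like this is the kind of input a multi-image call taking `image_bytes_list` would accept.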