Skip to content

feat(floware): image url base search and tag based document fetch#221

Merged
vizsatiz merged 1 commit intodevelopfrom
CU-86d20fcdn-Feature-to-support-imageurl-based-search-in-kb
Feb 18, 2026
Merged

feat(floware): image url base search and tag based document fetch#221
vizsatiz merged 1 commit intodevelopfrom
CU-86d20fcdn-Feature-to-support-imageurl-based-search-in-kb

Conversation

@vizsatiz
Copy link
Member

@vizsatiz vizsatiz commented Feb 18, 2026

Summary by CodeRabbit

  • New Features
    • Knowledge base documents can now be filtered using a new query filter parameter for refined search results
    • Retrieval system now supports cloud-hosted images from Google Cloud Storage (gs://) and AWS S3 (s3://) with automatic cloud resource handling

@coderabbitai
Copy link

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

The pull request refactors knowledge base document retrieval using a QueryGenerator-driven approach, adds support for cloud-hosted images (gs:// and s3://) in RAG retrieval with CloudStorageManager integration, and introduces dynamic query generation for document listing with optional filtering.

Changes

Cohort / File(s) Summary
Query Generation
wavefront/server/modules/knowledge_base_module/.../queries/generate_query.py
Adds get_documents_list_query method to QueryGenerator for constructing SQL queries with dynamic WHERE clauses, file_type filtering, and OData-style metadata filtering. Note: Method is defined twice (duplicate) in the file.
Document Retrieval Refactoring
wavefront/server/modules/knowledge_base_module/.../controllers/knowledge_base_document_controller.py
Refactors get_documents endpoint to use QueryGenerator instead of direct ORM queries. Adds query_filter parameter, replaces ORM data fetching with execute_query(), introduces _document_row_to_dict helper for row normalization, and adds ValueError handling for query construction/execution.
Cloud Image Support
wavefront/server/modules/knowledge_base_module/.../controllers/rag_retreival_controller.py
Extends ImagePayload with image_url field for cloud-hosted images (gs:// or s3://). Integrates CloudStorageManager dependency in retrieve_query endpoint, implements URL validation and parsing, fetches and base64-encodes images from cloud storage, and adds comprehensive error handling for URL and storage failures.

Sequence Diagrams

sequenceDiagram
    participant Client
    participant RAG_Controller as RAG Retrieval<br/>Controller
    participant Storage_Manager as CloudStorageManager
    participant Cloud_Storage as Cloud Storage<br/>(GCS/S3)
    participant Encoder as Base64<br/>Encoder

    Client->>RAG_Controller: POST retrieve_query<br/>(payload with image_url)
    RAG_Controller->>Storage_Manager: resolve image_data<br/>(from image_url)
    Storage_Manager->>Storage_Manager: Parse & validate URL<br/>(gs:// or s3://)
    alt Valid Cloud URL
        Storage_Manager->>Cloud_Storage: Fetch image
        Cloud_Storage-->>Storage_Manager: Image bytes
        Storage_Manager->>Encoder: Encode to base64
        Encoder-->>Storage_Manager: base64 string
        Storage_Manager-->>RAG_Controller: image_data (base64)
    else Invalid/Missing
        Storage_Manager-->>RAG_Controller: Error (400/422)
    end
    RAG_Controller->>RAG_Controller: Proceed with inference
    RAG_Controller-->>Client: Query results + response
Loading
sequenceDiagram
    participant Client
    participant Doc_Controller as Document<br/>Controller
    participant QueryGen as QueryGenerator
    participant Database as Database

    Client->>Doc_Controller: GET /documents<br/>(kb_id, file_type,<br/>query_filter, offset, limit)
    Doc_Controller->>QueryGen: get_documents_list_query<br/>(params)
    QueryGen->>QueryGen: Build dynamic WHERE<br/>clause + filters
    QueryGen-->>Doc_Controller: (sql_query, params)
    Doc_Controller->>Database: execute_query<br/>(sql_query, params)
    Database-->>Doc_Controller: Raw rows
    Doc_Controller->>Doc_Controller: _document_row_to_dict<br/>(normalize UUIDs/dates)
    Doc_Controller-->>Client: Formatted<br/>documents JSON
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hopping through clouds to fetch images bright,
QueryGenerator queries crafted just right,
Documents flow through pipelines so clean,
Building the finest retrieval machine! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat(floware): image url base search and tag based document fetch' partially aligns with the changeset. While image URL support is added in rag_retrieval_controller.py, the title overlooks the significant QueryGenerator refactoring and knowledge_base_document_controller changes that enable tag-based filtering.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch CU-86d20fcdn-Feature-to-support-imageurl-based-search-in-kb

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/knowledge_base_document_controller.py (2)

214-214: QueryGenerator is instantiated on every request.

The QueryGenerator constructor creates an ODataQueryParser each time. Since it holds no request-specific state, consider making it a module-level singleton or injecting it via the DI container to avoid repeated allocation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/knowledge_base_document_controller.py`
at line 214, QueryGenerator is being instantiated per request in
knowledge_base_document_controller (creating a new ODataQueryParser each time);
replace the per-request new QueryGenerator() with a single shared instance or
DI-provided instance to avoid repeated allocation. Create a module-level
singleton (e.g., QUERY_GENERATOR = QueryGenerator()) or add
QueryGenerator/ODataQueryParser as a dependency injected into the controller
class or factory, then replace the local instantiation in the handler with the
shared/injected symbol (QueryGenerator or QUERY_GENERATOR) so the
ODataQueryParser is created once.

230-234: Only ValueError is caught — unexpected exceptions will return a 500 without a structured error body.

If execute_query raises a database-level exception (e.g., SQLAlchemyError), it won't be caught here and will propagate as an unhandled 500. Consider whether you want a broader catch for operational errors, similar to how upload_document (line 174) handles Exception.

Proposed fix
     except ValueError as e:
         return JSONResponse(
             status_code=status.HTTP_400_BAD_REQUEST,
             content=response_formatter.buildErrorResponse(str(e)),
         )
+    except Exception as e:
+        logger.error(f'Error fetching documents: {e}')
+        return JSONResponse(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            content=response_formatter.buildErrorResponse(
+                'An error occurred while fetching documents'
+            ),
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/knowledge_base_document_controller.py`
around lines 230 - 234, The current except block in the handler that calls
execute_query only catches ValueError, so database/operational errors (e.g.,
SQLAlchemyError) will propagate as unstructured 500s; update the try/except in
the same controller function (the one that currently has "except ValueError as
e") to keep the existing ValueError branch but add a broader except (either
except Exception as e or except SQLAlchemyError as e plus Exception as fallback)
that returns a JSONResponse with
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR and
content=response_formatter.buildErrorResponse(str(e)), and ensure you also log
the exception (using the controller's logger) similar to how upload_document
handles Exception to preserve consistent error formatting and diagnostics.
wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py (2)

134-147: The if payload.image_url: pass control flow is confusing.

Using pass to fall through to the code after the if/elif/else block works but is non-obvious. Consider inverting the logic for clarity.

Suggested restructure
-    if payload.image_url:
-        pass
-    elif payload.image_data:
-        return (payload.image_data, None)
-    else:
+    if not payload.image_url and not payload.image_data:
         return (
             None,
             JSONResponse(
                 status_code=status.HTTP_400_BAD_REQUEST,
                 content=response_formatter.buildErrorResponse(
                     'Query or Image data should not be empty'
                 ),
             ),
         )
+    if not payload.image_url:
+        return (payload.image_data, None)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py`
around lines 134 - 147, The control flow using "if payload.image_url: pass" is
confusing—restructure the logic in rag_retreival_controller.py to explicitly
handle all three cases: if neither payload.image_url nor payload.image_data
return the JSONResponse error using response_formatter.buildErrorResponse, if
payload.image_data return (payload.image_data, None), and otherwise (i.e.,
payload.image_url present) return (None, payload.image_url); update the branch
that currently uses payload.image_url/pass to the explicit return so the
behavior is clear.

183-192: Broad except Exception may leak internal details to the client.

The caught exception's string representation (line 189) could contain bucket paths, credentials errors, or SDK internals. Consider logging the full exception server-side and returning a generic message to the client.

Proposed fix
     except Exception as e:
+        import logging
+        logging.exception('Failed to fetch image from cloud storage')
         return (
             None,
             JSONResponse(
                 status_code=status.HTTP_400_BAD_REQUEST,
                 content=response_formatter.buildErrorResponse(
-                    f'Failed to fetch image from storage: {e!s}'
+                    'Failed to fetch image from cloud storage'
                 ),
             ),
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py`
around lines 183 - 192, Replace the broad except block that returns the
exception string to the client (the except Exception as e branch that builds the
JSONResponse using response_formatter.buildErrorResponse(f'Failed to fetch image
from storage: {e!s}')): log the full exception server-side using the module's
logger (e.g., logger.exception or process_logger.error with exception info) and
change the response content to a generic error message (e.g., "Failed to fetch
image from storage") so sensitive details in e are not sent to the client; keep
the same status_code and JSONResponse/response_formatter usage but remove e from
the client-facing string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py`:
- Around line 102-120: _infer that _parse_cloud_image_url currently accepts URLs
like "gs://bucket/" because rest.partition('/') yields an empty key; update
_parse_cloud_image_url to validate that the "key" component is non-empty after
partitioning for both gs:// and s3:// branches and raise ValueError (e.g.,
"Invalid gs:// URL: missing path after bucket" / "Invalid s3:// URL: missing
path after bucket") when key == "" so callers don't receive an empty object key.
- Around line 181-193: The call to cloud_storage.read_file(bucket, key) is
synchronous and will block the async event loop; change it to run in a thread
using asyncio.to_thread (e.g., content = await
asyncio.to_thread(cloud_storage.read_file, bucket, key)) and preserve the
existing try/except so any exception raised during the threaded call is caught
and returned the same way; also add import asyncio at the top of the module so
asyncio.to_thread is available.

---

Nitpick comments:
In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/knowledge_base_document_controller.py`:
- Line 214: QueryGenerator is being instantiated per request in
knowledge_base_document_controller (creating a new ODataQueryParser each time);
replace the per-request new QueryGenerator() with a single shared instance or
DI-provided instance to avoid repeated allocation. Create a module-level
singleton (e.g., QUERY_GENERATOR = QueryGenerator()) or add
QueryGenerator/ODataQueryParser as a dependency injected into the controller
class or factory, then replace the local instantiation in the handler with the
shared/injected symbol (QueryGenerator or QUERY_GENERATOR) so the
ODataQueryParser is created once.
- Around line 230-234: The current except block in the handler that calls
execute_query only catches ValueError, so database/operational errors (e.g.,
SQLAlchemyError) will propagate as unstructured 500s; update the try/except in
the same controller function (the one that currently has "except ValueError as
e") to keep the existing ValueError branch but add a broader except (either
except Exception as e or except SQLAlchemyError as e plus Exception as fallback)
that returns a JSONResponse with
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR and
content=response_formatter.buildErrorResponse(str(e)), and ensure you also log
the exception (using the controller's logger) similar to how upload_document
handles Exception to preserve consistent error formatting and diagnostics.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py`:
- Around line 134-147: The control flow using "if payload.image_url: pass" is
confusing—restructure the logic in rag_retreival_controller.py to explicitly
handle all three cases: if neither payload.image_url nor payload.image_data
return the JSONResponse error using response_formatter.buildErrorResponse, if
payload.image_data return (payload.image_data, None), and otherwise (i.e.,
payload.image_url present) return (None, payload.image_url); update the branch
that currently uses payload.image_url/pass to the explicit return so the
behavior is clear.
- Around line 183-192: Replace the broad except block that returns the exception
string to the client (the except Exception as e branch that builds the
JSONResponse using response_formatter.buildErrorResponse(f'Failed to fetch image
from storage: {e!s}')): log the full exception server-side using the module's
logger (e.g., logger.exception or process_logger.error with exception info) and
change the response content to a generic error message (e.g., "Failed to fetch
image from storage") so sensitive details in e are not sent to the client; keep
the same status_code and JSONResponse/response_formatter usage but remove e from
the client-facing string.

Comment on lines +102 to +120
def _parse_cloud_image_url(url: str) -> Tuple[str, str, str]:
"""
Parse gs:// or s3:// URL into (scheme, bucket, key).
Returns (scheme, bucket, key) or raises ValueError.
"""
url = (url or '').strip()
if url.startswith('gs://'):
rest = url[5:]
if '/' not in rest:
raise ValueError('Invalid gs:// URL: missing path after bucket')
bucket, _, key = rest.partition('/')
return ('gs', bucket, key)
if url.startswith('s3://'):
rest = url[5:]
if '/' not in rest:
raise ValueError('Invalid s3:// URL: missing path after bucket')
bucket, _, key = rest.partition('/')
return ('s3', bucket, key)
raise ValueError('image_url must be in gs:// or s3:// format')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Empty object key passes validation for URLs like gs://bucket/.

When the URL is gs://bucket/, rest.partition('/') produces ('bucket', '/', '') — the function returns an empty key, which will cause a confusing error downstream when the storage read fails. Add a check for a non-empty key.

Proposed fix
     if url.startswith('gs://'):
         rest = url[5:]
         if '/' not in rest:
             raise ValueError('Invalid gs:// URL: missing path after bucket')
         bucket, _, key = rest.partition('/')
+        if not key:
+            raise ValueError('Invalid gs:// URL: missing object key after bucket')
         return ('gs', bucket, key)
     if url.startswith('s3://'):
         rest = url[5:]
         if '/' not in rest:
             raise ValueError('Invalid s3:// URL: missing path after bucket')
         bucket, _, key = rest.partition('/')
+        if not key:
+            raise ValueError('Invalid s3:// URL: missing object key after bucket')
         return ('s3', bucket, key)
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 111-111: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 117-117: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 120-120: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py`
around lines 102 - 120, _infer that _parse_cloud_image_url currently accepts
URLs like "gs://bucket/" because rest.partition('/') yields an empty key; update
_parse_cloud_image_url to validate that the "key" component is non-empty after
partitioning for both gs:// and s3:// branches and raise ValueError (e.g.,
"Invalid gs:// URL: missing path after bucket" / "Invalid s3:// URL: missing
path after bucket") when key == "" so callers don't receive an empty object key.

Comment on lines +181 to +193
try:
content = cloud_storage.read_file(bucket, key)
except Exception as e:
return (
None,
JSONResponse(
status_code=status.HTTP_400_BAD_REQUEST,
content=response_formatter.buildErrorResponse(
f'Failed to fetch image from storage: {e!s}'
),
),
)
image_bytes = content.read() if hasattr(content, 'read') else content
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Blocking synchronous I/O call inside an async function will stall the event loop.

cloud_storage.read_file(bucket, key) is a synchronous SDK call. In the same codebase, upload_document (in knowledge_base_document_controller.py, line 130) wraps the analogous cloud_storage.save_small_file with asyncio.to_thread(...). The same treatment is needed here to avoid blocking the FastAPI event loop under concurrent requests.

Proposed fix
     try:
-        content = cloud_storage.read_file(bucket, key)
+        content = await asyncio.to_thread(cloud_storage.read_file, bucket, key)
     except Exception as e:

You'll also need to add import asyncio at the top of the file.

🧰 Tools
🪛 Ruff (0.15.1)

[warning] 183-183: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py`
around lines 181 - 193, The call to cloud_storage.read_file(bucket, key) is
synchronous and will block the async event loop; change it to run in a thread
using asyncio.to_thread (e.g., content = await
asyncio.to_thread(cloud_storage.read_file, bucket, key)) and preserve the
existing try/except so any exception raised during the threaded call is caught
and returned the same way; also add import asyncio at the top of the module so
asyncio.to_thread is available.

@vizsatiz vizsatiz merged commit 35d68f8 into develop Feb 18, 2026
10 checks passed
@vizsatiz vizsatiz deleted the CU-86d20fcdn-Feature-to-support-imageurl-based-search-in-kb branch February 18, 2026 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant