feat(floware): image url base search and tag based document fetch by vizsatiz · Pull Request #221 · rootflo/wavefront

vizsatiz · 2026-02-18T07:30:54Z

Summary by CodeRabbit

New Features
- Knowledge base documents can now be filtered using a new query filter parameter for refined search results
- Retrieval system now supports cloud-hosted images from Google Cloud Storage (gs://) and AWS S3 (s3://) with automatic cloud resource handling

coderabbitai · 2026-02-18T07:31:16Z

📝 Walkthrough

Walkthrough

The pull request refactors knowledge base document retrieval using a QueryGenerator-driven approach, adds support for cloud-hosted images (gs:// and s3://) in RAG retrieval with CloudStorageManager integration, and introduces dynamic query generation for document listing with optional filtering.

Changes

Cohort / File(s)	Summary
Query Generation `wavefront/server/modules/knowledge_base_module/.../queries/generate_query.py`	Adds `get_documents_list_query` method to QueryGenerator for constructing SQL queries with dynamic WHERE clauses, file_type filtering, and OData-style metadata filtering. Note: Method is defined twice (duplicate) in the file.
Document Retrieval Refactoring `wavefront/server/modules/knowledge_base_module/.../controllers/knowledge_base_document_controller.py`	Refactors `get_documents` endpoint to use QueryGenerator instead of direct ORM queries. Adds `query_filter` parameter, replaces ORM data fetching with `execute_query()`, introduces `_document_row_to_dict` helper for row normalization, and adds ValueError handling for query construction/execution.
Cloud Image Support `wavefront/server/modules/knowledge_base_module/.../controllers/rag_retreival_controller.py`	Extends ImagePayload with `image_url` field for cloud-hosted images (gs:// or s3://). Integrates CloudStorageManager dependency in `retrieve_query` endpoint, implements URL validation and parsing, fetches and base64-encodes images from cloud storage, and adds comprehensive error handling for URL and storage failures.

Sequence Diagrams

sequenceDiagram
    participant Client
    participant RAG_Controller as RAG Retrieval<br/>Controller
    participant Storage_Manager as CloudStorageManager
    participant Cloud_Storage as Cloud Storage<br/>(GCS/S3)
    participant Encoder as Base64<br/>Encoder

    Client->>RAG_Controller: POST retrieve_query<br/>(payload with image_url)
    RAG_Controller->>Storage_Manager: resolve image_data<br/>(from image_url)
    Storage_Manager->>Storage_Manager: Parse & validate URL<br/>(gs:// or s3://)
    alt Valid Cloud URL
        Storage_Manager->>Cloud_Storage: Fetch image
        Cloud_Storage-->>Storage_Manager: Image bytes
        Storage_Manager->>Encoder: Encode to base64
        Encoder-->>Storage_Manager: base64 string
        Storage_Manager-->>RAG_Controller: image_data (base64)
    else Invalid/Missing
        Storage_Manager-->>RAG_Controller: Error (400/422)
    end
    RAG_Controller->>RAG_Controller: Proceed with inference
    RAG_Controller-->>Client: Query results + response

sequenceDiagram
    participant Client
    participant Doc_Controller as Document<br/>Controller
    participant QueryGen as QueryGenerator
    participant Database as Database

    Client->>Doc_Controller: GET /documents<br/>(kb_id, file_type,<br/>query_filter, offset, limit)
    Doc_Controller->>QueryGen: get_documents_list_query<br/>(params)
    QueryGen->>QueryGen: Build dynamic WHERE<br/>clause + filters
    QueryGen-->>Doc_Controller: (sql_query, params)
    Doc_Controller->>Database: execute_query<br/>(sql_query, params)
    Database-->>Doc_Controller: Raw rows
    Doc_Controller->>Doc_Controller: _document_row_to_dict<br/>(normalize UUIDs/dates)
    Doc_Controller-->>Client: Formatted<br/>documents JSON

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hopping through clouds to fetch images bright,
QueryGenerator queries crafted just right,
Documents flow through pipelines so clean,
Building the finest retrieval machine! ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'feat(floware): image url base search and tag based document fetch' partially aligns with the changeset. While image URL support is added in rag_retrieval_controller.py, the title overlooks the significant QueryGenerator refactoring and knowledge_base_document_controller changes that enable tag-based filtering.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch CU-86d20fcdn-Feature-to-support-imageurl-based-search-in-kb

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (4)

wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/knowledge_base_document_controller.py (2)

214-214: QueryGenerator is instantiated on every request.

The QueryGenerator constructor creates an ODataQueryParser each time. Since it holds no request-specific state, consider making it a module-level singleton or injecting it via the DI container to avoid repeated allocation.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/knowledge_base_document_controller.py`
at line 214, QueryGenerator is being instantiated per request in
knowledge_base_document_controller (creating a new ODataQueryParser each time);
replace the per-request new QueryGenerator() with a single shared instance or
DI-provided instance to avoid repeated allocation. Create a module-level
singleton (e.g., QUERY_GENERATOR = QueryGenerator()) or add
QueryGenerator/ODataQueryParser as a dependency injected into the controller
class or factory, then replace the local instantiation in the handler with the
shared/injected symbol (QueryGenerator or QUERY_GENERATOR) so the
ODataQueryParser is created once.

230-234: Only ValueError is caught — unexpected exceptions will return a 500 without a structured error body.

If execute_query raises a database-level exception (e.g., SQLAlchemyError), it won't be caught here and will propagate as an unhandled 500. Consider whether you want a broader catch for operational errors, similar to how upload_document (line 174) handles Exception.

Proposed fix

     except ValueError as e:
         return JSONResponse(
             status_code=status.HTTP_400_BAD_REQUEST,
             content=response_formatter.buildErrorResponse(str(e)),
         )
+    except Exception as e:
+        logger.error(f'Error fetching documents: {e}')
+        return JSONResponse(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            content=response_formatter.buildErrorResponse(
+                'An error occurred while fetching documents'
+            ),
+        )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/knowledge_base_document_controller.py`
around lines 230 - 234, The current except block in the handler that calls
execute_query only catches ValueError, so database/operational errors (e.g.,
SQLAlchemyError) will propagate as unstructured 500s; update the try/except in
the same controller function (the one that currently has "except ValueError as
e") to keep the existing ValueError branch but add a broader except (either
except Exception as e or except SQLAlchemyError as e plus Exception as fallback)
that returns a JSONResponse with
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR and
content=response_formatter.buildErrorResponse(str(e)), and ensure you also log
the exception (using the controller's logger) similar to how upload_document
handles Exception to preserve consistent error formatting and diagnostics.

wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py (2)

134-147: The if payload.image_url: pass control flow is confusing.

Using pass to fall through to the code after the if/elif/else block works but is non-obvious. Consider inverting the logic for clarity.

Suggested restructure

-    if payload.image_url:
-        pass
-    elif payload.image_data:
-        return (payload.image_data, None)
-    else:
+    if not payload.image_url and not payload.image_data:
         return (
             None,
             JSONResponse(
                 status_code=status.HTTP_400_BAD_REQUEST,
                 content=response_formatter.buildErrorResponse(
                     'Query or Image data should not be empty'
                 ),
             ),
         )
+    if not payload.image_url:
+        return (payload.image_data, None)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py`
around lines 134 - 147, The control flow using "if payload.image_url: pass" is
confusing—restructure the logic in rag_retreival_controller.py to explicitly
handle all three cases: if neither payload.image_url nor payload.image_data
return the JSONResponse error using response_formatter.buildErrorResponse, if
payload.image_data return (payload.image_data, None), and otherwise (i.e.,
payload.image_url present) return (None, payload.image_url); update the branch
that currently uses payload.image_url/pass to the explicit return so the
behavior is clear.

183-192: Broad except Exception may leak internal details to the client.

The caught exception's string representation (line 189) could contain bucket paths, credentials errors, or SDK internals. Consider logging the full exception server-side and returning a generic message to the client.

Proposed fix

     except Exception as e:
+        import logging
+        logging.exception('Failed to fetch image from cloud storage')
         return (
             None,
             JSONResponse(
                 status_code=status.HTTP_400_BAD_REQUEST,
                 content=response_formatter.buildErrorResponse(
-                    f'Failed to fetch image from storage: {e!s}'
+                    'Failed to fetch image from cloud storage'
                 ),
             ),
         )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py`
around lines 183 - 192, Replace the broad except block that returns the
exception string to the client (the except Exception as e branch that builds the
JSONResponse using response_formatter.buildErrorResponse(f'Failed to fetch image
from storage: {e!s}')): log the full exception server-side using the module's
logger (e.g., logger.exception or process_logger.error with exception info) and
change the response content to a generic error message (e.g., "Failed to fetch
image from storage") so sensitive details in e are not sent to the client; keep
the same status_code and JSONResponse/response_formatter usage but remove e from
the client-facing string.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py`:
- Around line 102-120: _infer that _parse_cloud_image_url currently accepts URLs
like "gs://bucket/" because rest.partition('/') yields an empty key; update
_parse_cloud_image_url to validate that the "key" component is non-empty after
partitioning for both gs:// and s3:// branches and raise ValueError (e.g.,
"Invalid gs:// URL: missing path after bucket" / "Invalid s3:// URL: missing
path after bucket") when key == "" so callers don't receive an empty object key.
- Around line 181-193: The call to cloud_storage.read_file(bucket, key) is
synchronous and will block the async event loop; change it to run in a thread
using asyncio.to_thread (e.g., content = await
asyncio.to_thread(cloud_storage.read_file, bucket, key)) and preserve the
existing try/except so any exception raised during the threaded call is caught
and returned the same way; also add import asyncio at the top of the module so
asyncio.to_thread is available.

---

Nitpick comments:
In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/knowledge_base_document_controller.py`:
- Line 214: QueryGenerator is being instantiated per request in
knowledge_base_document_controller (creating a new ODataQueryParser each time);
replace the per-request new QueryGenerator() with a single shared instance or
DI-provided instance to avoid repeated allocation. Create a module-level
singleton (e.g., QUERY_GENERATOR = QueryGenerator()) or add
QueryGenerator/ODataQueryParser as a dependency injected into the controller
class or factory, then replace the local instantiation in the handler with the
shared/injected symbol (QueryGenerator or QUERY_GENERATOR) so the
ODataQueryParser is created once.
- Around line 230-234: The current except block in the handler that calls
execute_query only catches ValueError, so database/operational errors (e.g.,
SQLAlchemyError) will propagate as unstructured 500s; update the try/except in
the same controller function (the one that currently has "except ValueError as
e") to keep the existing ValueError branch but add a broader except (either
except Exception as e or except SQLAlchemyError as e plus Exception as fallback)
that returns a JSONResponse with
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR and
content=response_formatter.buildErrorResponse(str(e)), and ensure you also log
the exception (using the controller's logger) similar to how upload_document
handles Exception to preserve consistent error formatting and diagnostics.

In
`@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py`:
- Around line 134-147: The control flow using "if payload.image_url: pass" is
confusing—restructure the logic in rag_retreival_controller.py to explicitly
handle all three cases: if neither payload.image_url nor payload.image_data
return the JSONResponse error using response_formatter.buildErrorResponse, if
payload.image_data return (payload.image_data, None), and otherwise (i.e.,
payload.image_url present) return (None, payload.image_url); update the branch
that currently uses payload.image_url/pass to the explicit return so the
behavior is clear.
- Around line 183-192: Replace the broad except block that returns the exception
string to the client (the except Exception as e branch that builds the
JSONResponse using response_formatter.buildErrorResponse(f'Failed to fetch image
from storage: {e!s}')): log the full exception server-side using the module's
logger (e.g., logger.exception or process_logger.error with exception info) and
change the response content to a generic error message (e.g., "Failed to fetch
image from storage") so sensitive details in e are not sent to the client; keep
the same status_code and JSONResponse/response_formatter usage but remove e from
the client-facing string.

coderabbitai · 2026-02-18T07:35:33Z

.../modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py

+def _parse_cloud_image_url(url: str) -> Tuple[str, str, str]:
+    """
+    Parse gs:// or s3:// URL into (scheme, bucket, key).
+    Returns (scheme, bucket, key) or raises ValueError.
+    """
+    url = (url or '').strip()
+    if url.startswith('gs://'):
+        rest = url[5:]
+        if '/' not in rest:
+            raise ValueError('Invalid gs:// URL: missing path after bucket')
+        bucket, _, key = rest.partition('/')
+        return ('gs', bucket, key)
+    if url.startswith('s3://'):
+        rest = url[5:]
+        if '/' not in rest:
+            raise ValueError('Invalid s3:// URL: missing path after bucket')
+        bucket, _, key = rest.partition('/')
+        return ('s3', bucket, key)
+    raise ValueError('image_url must be in gs:// or s3:// format')


⚠️ Potential issue | 🟡 Minor

Empty object key passes validation for URLs like gs://bucket/.

When the URL is gs://bucket/, rest.partition('/') produces ('bucket', '/', '') — the function returns an empty key, which will cause a confusing error downstream when the storage read fails. Add a check for a non-empty key.

Proposed fix

if url.startswith('gs://'): rest = url[5:] if '/' not in rest: raise ValueError('Invalid gs:// URL: missing path after bucket') bucket, _, key = rest.partition('/') + if not key: + raise ValueError('Invalid gs:// URL: missing object key after bucket') return ('gs', bucket, key) if url.startswith('s3://'): rest = url[5:] if '/' not in rest: raise ValueError('Invalid s3:// URL: missing path after bucket') bucket, _, key = rest.partition('/') + if not key: + raise ValueError('Invalid s3:// URL: missing object key after bucket') return ('s3', bucket, key)

🧰 Tools

🪛 Ruff (0.15.1)

[warning] 111-111: Avoid specifying long messages outside the exception class

(TRY003)

[warning] 117-117: Avoid specifying long messages outside the exception class

(TRY003)

[warning] 120-120: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py` around lines 102 - 120, _infer that _parse_cloud_image_url currently accepts URLs like "gs://bucket/" because rest.partition('/') yields an empty key; update _parse_cloud_image_url to validate that the "key" component is non-empty after partitioning for both gs:// and s3:// branches and raise ValueError (e.g., "Invalid gs:// URL: missing path after bucket" / "Invalid s3:// URL: missing path after bucket") when key == "" so callers don't receive an empty object key.

coderabbitai · 2026-02-18T07:35:33Z

.../modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py

+    try:
+        content = cloud_storage.read_file(bucket, key)
+    except Exception as e:
+        return (
+            None,
+            JSONResponse(
+                status_code=status.HTTP_400_BAD_REQUEST,
+                content=response_formatter.buildErrorResponse(
+                    f'Failed to fetch image from storage: {e!s}'
+                ),
+            ),
+        )
+    image_bytes = content.read() if hasattr(content, 'read') else content


⚠️ Potential issue | 🟠 Major

Blocking synchronous I/O call inside an async function will stall the event loop.

cloud_storage.read_file(bucket, key) is a synchronous SDK call. In the same codebase, upload_document (in knowledge_base_document_controller.py, line 130) wraps the analogous cloud_storage.save_small_file with asyncio.to_thread(...). The same treatment is needed here to avoid blocking the FastAPI event loop under concurrent requests.

Proposed fix

try: - content = cloud_storage.read_file(bucket, key) + content = await asyncio.to_thread(cloud_storage.read_file, bucket, key) except Exception as e:

You'll also need to add import asyncio at the top of the file.

🧰 Tools

🪛 Ruff (0.15.1)

[warning] 183-183: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@wavefront/server/modules/knowledge_base_module/knowledge_base_module/controllers/rag_retreival_controller.py` around lines 181 - 193, The call to cloud_storage.read_file(bucket, key) is synchronous and will block the async event loop; change it to run in a thread using asyncio.to_thread (e.g., content = await asyncio.to_thread(cloud_storage.read_file, bucket, key)) and preserve the existing try/except so any exception raised during the threaded call is caught and returned the same way; also add import asyncio at the top of the module so asyncio.to_thread is available.

feat(floware): image url base search and tag based document fetch

7f1f1fe

coderabbitai bot reviewed Feb 18, 2026

View reviewed changes

vizsatiz merged commit 35d68f8 into develop Feb 18, 2026
10 checks passed

vizsatiz deleted the CU-86d20fcdn-Feature-to-support-imageurl-based-search-in-kb branch February 18, 2026 12:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(floware): image url base search and tag based document fetch#221

feat(floware): image url base search and tag based document fetch#221
vizsatiz merged 1 commit intodevelopfrom
CU-86d20fcdn-Feature-to-support-imageurl-based-search-in-kb

vizsatiz commented Feb 18, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 18, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagrams

Estimated Code Review Effort

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Feb 18, 2026

Uh oh!

coderabbitai bot Feb 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

vizsatiz commented Feb 18, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagrams

Estimated Code Review Effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vizsatiz commented Feb 18, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 18, 2026 •

edited

Loading