feat: Add Neo4j backend support and enhance Graph API infrastructure by jfrench9 · Pull Request #17 · RoboFinSystems/robosystems

jfrench9 · 2025-10-18T04:50:04Z

Summary

This PR introduces comprehensive Neo4j backend support to the Graph API while maintaining existing Kuzu functionality. The changes establish a flexible multi-backend architecture with improved configuration management and enhanced operational tooling.

Key Accomplishments

Backend Architecture

Multi-backend support: Added complete Neo4j backend implementation alongside existing Kuzu support
Unified interface: Enhanced base backend abstraction to support both graph database systems
Factory pattern: Improved client factory to dynamically select appropriate backend based on configuration

Configuration & Infrastructure

Environment setup: Added new environment variables for Neo4j connection management
Docker integration: Updated containerization support for both database backends
CI/CD enhancements: Extended GitHub workflow configurations to support multi-backend testing
Build tooling: Enhanced development workflows with new database management commands

Operational Improvements

Query tooling: Introduced new graph query utilities for enhanced database interactions
SEC data processing: Significantly improved XBRL ingestion and maintenance workflows
Task orchestration: Enhanced background task management with better error handling and logging
Monitoring: Added comprehensive logging and metrics collection for graph operations

Code Quality

Test coverage: Updated test suites to work with both backend implementations
Error handling: Improved exception management across graph operations
Performance: Optimized query patterns and connection management

Breaking Changes

Configuration tier structure has been updated to support multi-backend selection
Some internal API signatures have changed to accommodate backend abstraction
Environment variable requirements have expanded for Neo4j support

Testing Notes

All existing functionality remains compatible with Kuzu backend
Neo4j backend includes comprehensive test coverage
Integration tests verify both backends work with existing workflows
Performance benchmarks show comparable or improved query execution times

Infrastructure Considerations

New database backend requires additional infrastructure provisioning
Enhanced monitoring and logging capabilities may increase resource usage
Configuration management now supports dynamic backend switching
Development environment setup includes new database initialization procedures

This enhancement provides the foundation for scalable graph database operations while maintaining backward compatibility and improving overall system reliability.

🤖 Generated with Claude Code

Branch Info:

Source: feature/neo4j-shared-repository
Target: main
Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Added Neo4j API URL configuration to .env.example for development routing. - Updated compose.yaml to reflect changes in Neo4j backend tier from 'enterprise' to 'shared'. - Renamed configuration file references from kuzu.yml to graph.yml in Dockerfile and tier_config.py. - Introduced new ingestion methods for both Neo4j and Kuzu backends in the graph API, allowing for S3 data ingestion. - Enhanced SEC local pipeline and orchestration tasks to support backend selection (Kuzu or Neo4j). - Updated logging and task management to reflect backend usage across various components.

jfrench9 · 2025-10-18T04:50:05Z

@claude please review this PR

claude · 2025-10-18T04:50:26Z

Claude finished @jfrench9's task —— View job

PR Review Complete ✅

I've completed a comprehensive review of PR #17 adding Neo4j backend support. Here's my feedback:

Overall Assessment

This is a well-structured PR that successfully introduces Neo4j backend support while maintaining backward compatibility with Kuzu. The multi-backend architecture is sound, but there are several issues that should be addressed before merging.

Critical Issues 🔴

1. Missing Backend Initialization (`robosystems/graph_api/backends/init.py`)

The PR references from robosystems.graph_api.backends import get_backend in ingest.py:317, but I don't see the backend registry or factory function implemented. You need:

# robosystems/graph_api/backends/__init__.py
from robosystems.config import env
from .base import GraphBackend
from .kuzu import KuzuBackend
from .neo4j import Neo4jBackend

def get_backend() -> GraphBackend:
    """Get the appropriate backend based on configuration."""
    backend_type = env.BACKEND_TYPE
    if backend_type == "neo4j":
        return Neo4jBackend(enterprise=env.NEO4J_ENTERPRISE)
    elif backend_type == "kuzu":
        return KuzuBackend()
    else:
        raise ValueError(f"Unknown backend type: {backend_type}")

Location: robosystems/graph_api/backends/__init__.py (appears to be missing)

2. Incorrect Dockerfile Path (`Dockerfile:113`)

COPY .github/configs/graph.yml /app/configs/graph.yml  # Line 113

Issue: The config file is named graph.yml (not kuzu.yml as referenced in CLAUDE.md:249)

Fix: Either:

Update CLAUDE.md to reference graph.yml instead of kuzu.yml, OR
Rename the file to kuzu.yml if that's the intended name

Rationale: Configuration file naming should be consistent across documentation and code.

3. Incomplete Neo4j S3 Ingestion Error Handling (`neo4j.py:290-292`)

except ClientError as e:
    logger.error(f"Failed to list S3 files: {e}")
    raise Exception(f"S3 list operation failed: {e}")  # Line 292

Issue: Generic Exception doesn't help callers distinguish S3 errors from other failures.

Fix: Use a more specific exception:

class S3IngestionError(Exception):
    """Raised when S3 ingestion operations fail."""
    pass

# Then in the handler:
raise S3IngestionError(f"Failed to list S3 files: {e}") from e

Location: robosystems/graph_api/backends/neo4j.py:292

Major Issues ⚠️

4. Neo4j Ingestion Memory Concerns (`neo4j.py:306-356`)

The Neo4j backend reads entire Parquet files into memory, converts to pandas DataFrame, then processes in batches:

parquet_data = response["Body"].read()  # Entire file in memory
table = pq.read_table(io.BytesIO(parquet_data))
df = table.to_pandas()  # Another copy in memory

Issue: For large files (e.g., SEC data with millions of rows), this could cause OOM errors.

Recommendation:

Add file size checks before loading
Stream large files chunk-by-chunk using PyArrow's ParquetFile.iter_batches()
Document memory requirements for different tier configurations

Location: robosystems/graph_api/backends/neo4j.py:301-306

5. Hardcoded Merge Logic (`neo4j.py:333-336`)

id_field = "identifier" if "identifier" in batch_records[0] else "id"
cypher = f"""
UNWIND $batch as row
MERGE (n:{table_name} {{identifier: row.{id_field}}})
SET n = row
RETURN count(n) as count
"""

Issues:

Assumes first record is representative of all records
Hardcodes "identifier" field assumption
Creates composite labels that may not match schema expectations
SET n = row overwrites all properties, potentially removing fields not in parquet

Recommendation:

Make merge key configurable per table type
Validate all records have the required identifier field
Consider SET n += row to merge properties instead of replacing

Location: robosystems/graph_api/backends/neo4j.py:332-341

6. Inconsistent Backend Parameter (`ingestion.py:146`)

def ingest_sec_data(
    # ... other params
    backend: str = "kuzu",  # Backend type ("kuzu" or "neo4j")
) -> Dict[str, Any]:

Issue: The parameter is documented as informational, but the actual backend is determined by tier configuration in graph.yml. This parameter serves no functional purpose and could confuse callers.

Fix: Either:

Remove the parameter entirely (backend is determined by routing), OR
Use it to validate that the routed backend matches expectations

Location: robosystems/tasks/sec_xbrl/ingestion.py:146

7. Missing Connection Cleanup (`ingest.py:317-318`)

backend = get_backend()
# ... later in background task
background_tasks.add_task(
    perform_ingestion,
    backend=backend,  # Backend instance shared across tasks

Issue: A single backend instance is created and shared across multiple background tasks. Neo4j's AsyncDriver has connection pooling, but Kuzu's Engine may have state issues.

Recommendation:

Create a new backend instance per task, OR
Ensure backend instances are thread-safe for concurrent use
Add explicit cleanup in background task error handlers

Location: robosystems/graph_api/routers/databases/ingest.py:317-362

Minor Issues / Improvements 💡

8. Missing Type Hints (`neo4j.py:331`)

async def _load_batch(tx):  # Missing type hint for 'tx'

Fix: Add proper type hints:

from neo4j import AsyncTransaction

async def _load_batch(tx: AsyncTransaction) -> int:

Location: robosystems/graph_api/backends/neo4j.py:331

9. Redundant Database Name Logic (`factory.py:372-382`)

The shared repository routing has complex nested conditionals for determining backend:

if backend == "neo4j":
    api_url = "http://neo4j-api:8002"
else:
    api_url = env.GRAPH_API_URL or "http://localhost:8001"

Recommendation: Extract this to a helper function to reduce complexity:

def _get_dev_backend_url(tier_backend: str) -> str:
    """Get local development URL based on backend type."""
    return "http://neo4j-api:8002" if tier_backend == "neo4j" else env.GRAPH_API_URL or "http://localhost:8001"

Location: robosystems/graph_api/client/factory.py:372-382

10. Test Coverage Gaps

The PR updates test assertions but doesn't add tests for:

Neo4j backend initialization
Multi-backend routing in factory
Neo4j-specific ingestion error cases
Backend switching based on tier configuration

Recommendation: Add integration tests for:

Creating client with different backends
Ingesting data via Neo4j backend
Tier-based routing (Standard→Kuzu, Enterprise→Neo4j Community, Premium→Neo4j Enterprise)

11. Documentation Updates Needed

CLAUDE.md still references "Kuzu Graph Database System" exclusively (lines 104-110). Should be updated to:

2. **Graph Database System**

   - **Multi-Backend**: Kuzu (Standard tier) and Neo4j (Enterprise/Premium tiers)
   - **Multi-Tenant**: Separate database per entity
   - **Tiered Infrastructure**: Standard (Kuzu), Enterprise (Neo4j Community), Premium (Neo4j Enterprise)
   - **Shared Repositories**: SEC, industry, economic data (configurable backend)
   - **Cluster-Based**: Writer clusters (EC2), reader clusters (ECS Fargate)

Location: /CLAUDE.md:104-110

12. Graph Query Script Hardcoded Defaults (`graph_query.py:187-188`)

default="http://localhost:8001",
help="Graph API URL (default: http://localhost:8001 for Kuzu, use http://localhost:8002 for Neo4j)",

Suggestion: Make the default backend-aware:

default=os.getenv("GRAPH_API_URL", "http://localhost:8001"),
help="Graph API URL (defaults to GRAPH_API_URL env var or http://localhost:8001)",

Location: robosystems/scripts/graph_query.py:187

Code Quality Observations ✨

Positive Aspects:

✅ Clean abstraction: GraphBackend base class provides excellent interface
✅ Proper error handling: Good use of try/except with logging
✅ Configuration-driven: Tier configs in graph.yml are well-structured
✅ Backward compatibility: Existing Kuzu code paths maintained
✅ Comprehensive logging: Excellent visibility into backend operations

Areas for Improvement:

⚠️ Missing docstrings: Neo4j backend methods need docstrings explaining parameters and behavior
⚠️ Magic numbers: Batch sizes (1000, 5000, 10000) should be constants or config values
⚠️ Duplicate code: S3 file discovery logic is duplicated between _bulk_load_node_type and _bulk_load_relationship_type

Testing Recommendations

Before merging, please verify:

Backend Switching: Confirm tier-based routing works correctly
Neo4j Ingestion: Test with actual SEC data to validate memory usage
Error Scenarios: Test S3 failures, connection errors, schema mismatches
Migration Path: Document how to migrate existing Kuzu databases to Neo4j (if needed)

Summary

This PR represents solid foundational work for multi-backend support. The architecture is sound, but needs refinement in:

Error handling (issues Enhance SEC pipeline security and add reconsolidation support #3, chore: improve documentation and suppress build output #4, #7)
Backend initialization (issue Inital open source release of robosystems #1)
Memory management (issue chore: improve documentation and suppress build output #4)
Testing (issue #10)
Documentation (issues Fix MCP tools query results to include units and improve dimensions #2, #11)

Recommendation: Address critical issues #1-3 before merging. Major issues #4-7 should be resolved or tracked as follow-up work.

Job Run | Branch: feature/neo4j-shared-repository

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2025-10-18T04:53:35Z

+              id_field = "identifier" if "identifier" in batch_records[0] else "id"
+              cypher = f"""
+              UNWIND $batch as row
+              MERGE (n:{table_name} {{identifier: row.{id_field}}})
+              SET n = row


Preserve merge key when ingesting rows without identifier

The Neo4j ingestion path always merges on a property named identifier even when the incoming rows only contain an id column. The MERGE (n:{table_name} {identifier: row.id}) followed by SET n = row removes the identifier property from the node because row has no such field, so the next time a record with the same id is processed a new node is created instead of updating the existing one. Any table that uses id as its primary key will therefore accumulate duplicates on every re-ingestion. The property used for the MERGE needs to match a column that is retained on the node.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2025-10-18T04:53:35Z

+    # Use backend-specific ingestion method
+    logger.info(
+      f"[Task {task_id}] Starting ingestion using backend: {type(backend).__name__}"
+    )
+
+    result = await backend.ingest_from_s3(
+      graph_id=graph_id,
+      table_name=table_name,
+      s3_pattern=s3_pattern,
+      s3_credentials=s3_credentials,
+      ignore_errors=ignore_errors,
+    )


Kuzu ingestion omits required post-copy cleanup

After switching perform_ingestion to delegate the work to backend.ingest_from_s3 the function no longer performs the aggressive cleanup, checkpointing and memory-settlement delays that previously ran for large Kuzu COPY operations. The constants LARGE_TABLES_REQUIRING_CLEANUP and the connection pool dependency are now unused, and the new KuzuBackend.ingest_from_s3 implementation does not call connection_pool.force_database_cleanup. Without these steps the admission controller will continue to see high RSS/VMS after large loads and subsequent ingestion tasks can be rejected or run out of memory. The memory-release logic should either be invoked here after the backend call or migrated into the backend implementation.

Useful? React with 👍 / 👎.

- Updated CLAUDE.md to generalize references from Kuzu to Graph Database, reflecting multi-backend support. - Improved backend error handling by introducing S3IngestionError for clearer exception management in Kuzu and Neo4j backends. - Added aggressive memory cleanup for large tables in the ingestion process to optimize resource management. - Renamed configuration file references from kuzu.yml to graph.yml for consistency across the codebase.

chatgpt-codex-connector Bot reviewed Oct 18, 2025

View reviewed changes

jfrench9 merged commit d83f0d4 into main Oct 18, 2025
5 checks passed

jfrench9 deleted the feature/neo4j-shared-repository branch October 19, 2025 02:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add Neo4j backend support and enhance Graph API infrastructure#17

feat: Add Neo4j backend support and enhance Graph API infrastructure#17
jfrench9 merged 2 commits into
mainfrom
feature/neo4j-shared-repository

jfrench9 commented Oct 18, 2025

Uh oh!

jfrench9 commented Oct 18, 2025

Uh oh!

claude Bot commented Oct 18, 2025 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Oct 18, 2025

Uh oh!

chatgpt-codex-connector Bot Oct 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jfrench9 commented Oct 18, 2025

Summary

Key Accomplishments

Backend Architecture

Configuration & Infrastructure

Operational Improvements

Code Quality

Breaking Changes

Testing Notes

Infrastructure Considerations

Uh oh!

jfrench9 commented Oct 18, 2025

Uh oh!

claude Bot commented Oct 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Complete ✅

Overall Assessment

Critical Issues 🔴

1. Missing Backend Initialization (robosystems/graph_api/backends/__init__.py)

2. Incorrect Dockerfile Path (Dockerfile:113)

3. Incomplete Neo4j S3 Ingestion Error Handling (neo4j.py:290-292)

Major Issues ⚠️

4. Neo4j Ingestion Memory Concerns (neo4j.py:306-356)

5. Hardcoded Merge Logic (neo4j.py:333-336)

6. Inconsistent Backend Parameter (ingestion.py:146)

7. Missing Connection Cleanup (ingest.py:317-318)

Minor Issues / Improvements 💡

8. Missing Type Hints (neo4j.py:331)

9. Redundant Database Name Logic (factory.py:372-382)

10. Test Coverage Gaps

11. Documentation Updates Needed

12. Graph Query Script Hardcoded Defaults (graph_query.py:187-188)

Code Quality Observations ✨

Positive Aspects:

Areas for Improvement:

Testing Recommendations

Summary

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Oct 18, 2025

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Oct 18, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Oct 18, 2025 •

edited

Loading

1. Missing Backend Initialization (`robosystems/graph_api/backends/init.py`)

2. Incorrect Dockerfile Path (`Dockerfile:113`)

3. Incomplete Neo4j S3 Ingestion Error Handling (`neo4j.py:290-292`)

4. Neo4j Ingestion Memory Concerns (`neo4j.py:306-356`)

5. Hardcoded Merge Logic (`neo4j.py:333-336`)

6. Inconsistent Backend Parameter (`ingestion.py:146`)

7. Missing Connection Cleanup (`ingest.py:317-318`)

8. Missing Type Hints (`neo4j.py:331`)

9. Redundant Database Name Logic (`factory.py:372-382`)

12. Graph Query Script Hardcoded Defaults (`graph_query.py:187-188`)