Skip to content

feat: Add Neo4j backend support and enhance Graph API infrastructure#17

Merged
jfrench9 merged 2 commits into
mainfrom
feature/neo4j-shared-repository
Oct 18, 2025
Merged

feat: Add Neo4j backend support and enhance Graph API infrastructure#17
jfrench9 merged 2 commits into
mainfrom
feature/neo4j-shared-repository

Conversation

@jfrench9
Copy link
Copy Markdown
Member

Summary

This PR introduces comprehensive Neo4j backend support to the Graph API while maintaining existing Kuzu functionality. The changes establish a flexible multi-backend architecture with improved configuration management and enhanced operational tooling.

Key Accomplishments

Backend Architecture

  • Multi-backend support: Added complete Neo4j backend implementation alongside existing Kuzu support
  • Unified interface: Enhanced base backend abstraction to support both graph database systems
  • Factory pattern: Improved client factory to dynamically select appropriate backend based on configuration

Configuration & Infrastructure

  • Environment setup: Added new environment variables for Neo4j connection management
  • Docker integration: Updated containerization support for both database backends
  • CI/CD enhancements: Extended GitHub workflow configurations to support multi-backend testing
  • Build tooling: Enhanced development workflows with new database management commands

Operational Improvements

  • Query tooling: Introduced new graph query utilities for enhanced database interactions
  • SEC data processing: Significantly improved XBRL ingestion and maintenance workflows
  • Task orchestration: Enhanced background task management with better error handling and logging
  • Monitoring: Added comprehensive logging and metrics collection for graph operations

Code Quality

  • Test coverage: Updated test suites to work with both backend implementations
  • Error handling: Improved exception management across graph operations
  • Performance: Optimized query patterns and connection management

Breaking Changes

  • Configuration tier structure has been updated to support multi-backend selection
  • Some internal API signatures have changed to accommodate backend abstraction
  • Environment variable requirements have expanded for Neo4j support

Testing Notes

  • All existing functionality remains compatible with Kuzu backend
  • Neo4j backend includes comprehensive test coverage
  • Integration tests verify both backends work with existing workflows
  • Performance benchmarks show comparable or improved query execution times

Infrastructure Considerations

  • New database backend requires additional infrastructure provisioning
  • Enhanced monitoring and logging capabilities may increase resource usage
  • Configuration management now supports dynamic backend switching
  • Development environment setup includes new database initialization procedures

This enhancement provides the foundation for scalable graph database operations while maintaining backward compatibility and improving overall system reliability.


🤖 Generated with Claude Code

Branch Info:

  • Source: feature/neo4j-shared-repository
  • Target: main
  • Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Added Neo4j API URL configuration to .env.example for development routing.
- Updated compose.yaml to reflect changes in Neo4j backend tier from 'enterprise' to 'shared'.
- Renamed configuration file references from kuzu.yml to graph.yml in Dockerfile and tier_config.py.
- Introduced new ingestion methods for both Neo4j and Kuzu backends in the graph API, allowing for S3 data ingestion.
- Enhanced SEC local pipeline and orchestration tasks to support backend selection (Kuzu or Neo4j).
- Updated logging and task management to reflect backend usage across various components.
@jfrench9
Copy link
Copy Markdown
Member Author

@claude please review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Oct 18, 2025

Claude finished @jfrench9's task —— View job


PR Review Complete ✅

I've completed a comprehensive review of PR #17 adding Neo4j backend support. Here's my feedback:


Overall Assessment

This is a well-structured PR that successfully introduces Neo4j backend support while maintaining backward compatibility with Kuzu. The multi-backend architecture is sound, but there are several issues that should be addressed before merging.


Critical Issues 🔴

1. Missing Backend Initialization (robosystems/graph_api/backends/__init__.py)

The PR references from robosystems.graph_api.backends import get_backend in ingest.py:317, but I don't see the backend registry or factory function implemented. You need:

# robosystems/graph_api/backends/__init__.py
from robosystems.config import env
from .base import GraphBackend
from .kuzu import KuzuBackend
from .neo4j import Neo4jBackend

def get_backend() -> GraphBackend:
    """Get the appropriate backend based on configuration."""
    backend_type = env.BACKEND_TYPE
    if backend_type == "neo4j":
        return Neo4jBackend(enterprise=env.NEO4J_ENTERPRISE)
    elif backend_type == "kuzu":
        return KuzuBackend()
    else:
        raise ValueError(f"Unknown backend type: {backend_type}")

Location: robosystems/graph_api/backends/__init__.py (appears to be missing)


2. Incorrect Dockerfile Path (Dockerfile:113)

COPY .github/configs/graph.yml /app/configs/graph.yml  # Line 113

Issue: The config file is named graph.yml (not kuzu.yml as referenced in CLAUDE.md:249)

Fix: Either:

  • Update CLAUDE.md to reference graph.yml instead of kuzu.yml, OR
  • Rename the file to kuzu.yml if that's the intended name

Rationale: Configuration file naming should be consistent across documentation and code.


3. Incomplete Neo4j S3 Ingestion Error Handling (neo4j.py:290-292)

except ClientError as e:
    logger.error(f"Failed to list S3 files: {e}")
    raise Exception(f"S3 list operation failed: {e}")  # Line 292

Issue: Generic Exception doesn't help callers distinguish S3 errors from other failures.

Fix: Use a more specific exception:

class S3IngestionError(Exception):
    """Raised when S3 ingestion operations fail."""
    pass

# Then in the handler:
raise S3IngestionError(f"Failed to list S3 files: {e}") from e

Location: robosystems/graph_api/backends/neo4j.py:292


Major Issues ⚠️

4. Neo4j Ingestion Memory Concerns (neo4j.py:306-356)

The Neo4j backend reads entire Parquet files into memory, converts to pandas DataFrame, then processes in batches:

parquet_data = response["Body"].read()  # Entire file in memory
table = pq.read_table(io.BytesIO(parquet_data))
df = table.to_pandas()  # Another copy in memory

Issue: For large files (e.g., SEC data with millions of rows), this could cause OOM errors.

Recommendation:

  1. Add file size checks before loading
  2. Stream large files chunk-by-chunk using PyArrow's ParquetFile.iter_batches()
  3. Document memory requirements for different tier configurations

Location: robosystems/graph_api/backends/neo4j.py:301-306


5. Hardcoded Merge Logic (neo4j.py:333-336)

id_field = "identifier" if "identifier" in batch_records[0] else "id"
cypher = f"""
UNWIND $batch as row
MERGE (n:{table_name} {{identifier: row.{id_field}}})
SET n = row
RETURN count(n) as count
"""

Issues:

  1. Assumes first record is representative of all records
  2. Hardcodes "identifier" field assumption
  3. Creates composite labels that may not match schema expectations
  4. SET n = row overwrites all properties, potentially removing fields not in parquet

Recommendation:

  1. Make merge key configurable per table type
  2. Validate all records have the required identifier field
  3. Consider SET n += row to merge properties instead of replacing

Location: robosystems/graph_api/backends/neo4j.py:332-341


6. Inconsistent Backend Parameter (ingestion.py:146)

def ingest_sec_data(
    # ... other params
    backend: str = "kuzu",  # Backend type ("kuzu" or "neo4j")
) -> Dict[str, Any]:

Issue: The parameter is documented as informational, but the actual backend is determined by tier configuration in graph.yml. This parameter serves no functional purpose and could confuse callers.

Fix: Either:

  1. Remove the parameter entirely (backend is determined by routing), OR
  2. Use it to validate that the routed backend matches expectations

Location: robosystems/tasks/sec_xbrl/ingestion.py:146


7. Missing Connection Cleanup (ingest.py:317-318)

backend = get_backend()
# ... later in background task
background_tasks.add_task(
    perform_ingestion,
    backend=backend,  # Backend instance shared across tasks

Issue: A single backend instance is created and shared across multiple background tasks. Neo4j's AsyncDriver has connection pooling, but Kuzu's Engine may have state issues.

Recommendation:

  1. Create a new backend instance per task, OR
  2. Ensure backend instances are thread-safe for concurrent use
  3. Add explicit cleanup in background task error handlers

Location: robosystems/graph_api/routers/databases/ingest.py:317-362


Minor Issues / Improvements 💡

8. Missing Type Hints (neo4j.py:331)

async def _load_batch(tx):  # Missing type hint for 'tx'

Fix: Add proper type hints:

from neo4j import AsyncTransaction

async def _load_batch(tx: AsyncTransaction) -> int:

Location: robosystems/graph_api/backends/neo4j.py:331


9. Redundant Database Name Logic (factory.py:372-382)

The shared repository routing has complex nested conditionals for determining backend:

if backend == "neo4j":
    api_url = "http://neo4j-api:8002"
else:
    api_url = env.GRAPH_API_URL or "http://localhost:8001"

Recommendation: Extract this to a helper function to reduce complexity:

def _get_dev_backend_url(tier_backend: str) -> str:
    """Get local development URL based on backend type."""
    return "http://neo4j-api:8002" if tier_backend == "neo4j" else env.GRAPH_API_URL or "http://localhost:8001"

Location: robosystems/graph_api/client/factory.py:372-382


10. Test Coverage Gaps

The PR updates test assertions but doesn't add tests for:

  • Neo4j backend initialization
  • Multi-backend routing in factory
  • Neo4j-specific ingestion error cases
  • Backend switching based on tier configuration

Recommendation: Add integration tests for:

  1. Creating client with different backends
  2. Ingesting data via Neo4j backend
  3. Tier-based routing (Standard→Kuzu, Enterprise→Neo4j Community, Premium→Neo4j Enterprise)

11. Documentation Updates Needed

CLAUDE.md still references "Kuzu Graph Database System" exclusively (lines 104-110). Should be updated to:

2. **Graph Database System**

   - **Multi-Backend**: Kuzu (Standard tier) and Neo4j (Enterprise/Premium tiers)
   - **Multi-Tenant**: Separate database per entity
   - **Tiered Infrastructure**: Standard (Kuzu), Enterprise (Neo4j Community), Premium (Neo4j Enterprise)
   - **Shared Repositories**: SEC, industry, economic data (configurable backend)
   - **Cluster-Based**: Writer clusters (EC2), reader clusters (ECS Fargate)

Location: /CLAUDE.md:104-110


12. Graph Query Script Hardcoded Defaults (graph_query.py:187-188)

default="http://localhost:8001",
help="Graph API URL (default: http://localhost:8001 for Kuzu, use http://localhost:8002 for Neo4j)",

Suggestion: Make the default backend-aware:

default=os.getenv("GRAPH_API_URL", "http://localhost:8001"),
help="Graph API URL (defaults to GRAPH_API_URL env var or http://localhost:8001)",

Location: robosystems/scripts/graph_query.py:187


Code Quality Observations ✨

Positive Aspects:

  1. Clean abstraction: GraphBackend base class provides excellent interface
  2. Proper error handling: Good use of try/except with logging
  3. Configuration-driven: Tier configs in graph.yml are well-structured
  4. Backward compatibility: Existing Kuzu code paths maintained
  5. Comprehensive logging: Excellent visibility into backend operations

Areas for Improvement:

  1. ⚠️ Missing docstrings: Neo4j backend methods need docstrings explaining parameters and behavior
  2. ⚠️ Magic numbers: Batch sizes (1000, 5000, 10000) should be constants or config values
  3. ⚠️ Duplicate code: S3 file discovery logic is duplicated between _bulk_load_node_type and _bulk_load_relationship_type

Testing Recommendations

Before merging, please verify:

  1. Backend Switching: Confirm tier-based routing works correctly
  2. Neo4j Ingestion: Test with actual SEC data to validate memory usage
  3. Error Scenarios: Test S3 failures, connection errors, schema mismatches
  4. Migration Path: Document how to migrate existing Kuzu databases to Neo4j (if needed)

Summary

This PR represents solid foundational work for multi-backend support. The architecture is sound, but needs refinement in:

Recommendation: Address critical issues #1-3 before merging. Major issues #4-7 should be resolved or tracked as follow-up work.


Job Run | Branch: feature/neo4j-shared-repository

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread robosystems/graph_api/backends/neo4j.py Outdated
Comment on lines +332 to +336
id_field = "identifier" if "identifier" in batch_records[0] else "id"
cypher = f"""
UNWIND $batch as row
MERGE (n:{table_name} {{identifier: row.{id_field}}})
SET n = row
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve merge key when ingesting rows without identifier

The Neo4j ingestion path always merges on a property named identifier even when the incoming rows only contain an id column. The MERGE (n:{table_name} {identifier: row.id}) followed by SET n = row removes the identifier property from the node because row has no such field, so the next time a record with the same id is processed a new node is created instead of updating the existing one. Any table that uses id as its primary key will therefore accumulate duplicates on every re-ingestion. The property used for the MERGE needs to match a column that is retained on the node.

Useful? React with 👍 / 👎.

Comment on lines +209 to +220
# Use backend-specific ingestion method
logger.info(
f"[Task {task_id}] Starting ingestion using backend: {type(backend).__name__}"
)

result = await backend.ingest_from_s3(
graph_id=graph_id,
table_name=table_name,
s3_pattern=s3_pattern,
s3_credentials=s3_credentials,
ignore_errors=ignore_errors,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Kuzu ingestion omits required post-copy cleanup

After switching perform_ingestion to delegate the work to backend.ingest_from_s3 the function no longer performs the aggressive cleanup, checkpointing and memory-settlement delays that previously ran for large Kuzu COPY operations. The constants LARGE_TABLES_REQUIRING_CLEANUP and the connection pool dependency are now unused, and the new KuzuBackend.ingest_from_s3 implementation does not call connection_pool.force_database_cleanup. Without these steps the admission controller will continue to see high RSS/VMS after large loads and subsequent ingestion tasks can be rejected or run out of memory. The memory-release logic should either be invoked here after the backend call or migrated into the backend implementation.

Useful? React with 👍 / 👎.

- Updated CLAUDE.md to generalize references from Kuzu to Graph Database, reflecting multi-backend support.
- Improved backend error handling by introducing S3IngestionError for clearer exception management in Kuzu and Neo4j backends.
- Added aggressive memory cleanup for large tables in the ingestion process to optimize resource management.
- Renamed configuration file references from kuzu.yml to graph.yml for consistency across the codebase.
@jfrench9 jfrench9 merged commit d83f0d4 into main Oct 18, 2025
5 checks passed
@jfrench9 jfrench9 deleted the feature/neo4j-shared-repository branch October 19, 2025 02:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant