feat: Add Neo4j backend support and enhance Graph API infrastructure #17
- Added Neo4j API URL configuration to .env.example for development routing.
- Updated compose.yaml to reflect changes in Neo4j backend tier from 'enterprise' to 'shared'.
- Renamed configuration file references from kuzu.yml to graph.yml in Dockerfile and tier_config.py.
- Introduced new ingestion methods for both Neo4j and Kuzu backends in the graph API, allowing for S3 data ingestion.
- Enhanced SEC local pipeline and orchestration tasks to support backend selection (Kuzu or Neo4j).
- Updated logging and task management to reflect backend usage across various components.
@claude please review this PR
Claude finished @jfrench9's task
PR Review Complete ✅
I've completed a comprehensive review of PR #17 adding Neo4j backend support. Here's my feedback:
Overall Assessment
This is a well-structured PR that successfully introduces Neo4j backend support while maintaining backward compatibility with Kuzu. The multi-backend architecture is sound, but there are several issues that should be addressed before merging.
Critical Issues 🔴
1. Missing Backend Initialization (
💡 Codex Review
Here are some automated review suggestions for this pull request.
id_field = "identifier" if "identifier" in batch_records[0] else "id"
cypher = f"""
UNWIND $batch as row
MERGE (n:{table_name} {{identifier: row.{id_field}}})
SET n = row
Preserve merge key when ingesting rows without identifier
The Neo4j ingestion path always merges on a property named identifier even when the incoming rows only contain an id column. The MERGE (n:{table_name} {identifier: row.id}) followed by SET n = row removes the identifier property from the node because row has no such field, so the next time a record with the same id is processed a new node is created instead of updating the existing one. Any table that uses id as its primary key will therefore accumulate duplicates on every re-ingestion. The property used for the MERGE needs to match a column that is retained on the node.
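One way to keep the merge key on the node is sketched below. It is an illustration under stated assumptions, not the PR's actual fix: `build_merge_query` is a hypothetical helper, and only the Cypher text is built here, so no Neo4j connection is needed. The key property name follows whichever column the rows carry, and `SET n += row` merges the row's properties into the node instead of replacing them all.

```python
from typing import Any


def build_merge_query(table_name: str, batch_records: list[dict[str, Any]]) -> str:
    """Build a Cypher MERGE whose key property is a column retained on the node.

    Hypothetical helper for illustration; not part of the PR.
    """
    if not batch_records:
        raise ValueError("batch_records must not be empty")
    # Labels cannot be passed as Cypher parameters, so reject anything that
    # is not a plain identifier before interpolating it into the query.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")

    # Same selection rule as the excerpt above.
    id_field = "identifier" if "identifier" in batch_records[0] else "id"

    # The MERGE property name now matches the row column, so the key is
    # still present after SET and a re-ingestion matches the same node.
    return (
        "UNWIND $batch AS row\n"
        f"MERGE (n:{table_name} {{{id_field}: row.{id_field}}})\n"
        "SET n += row"
    )


print(build_merge_query("Entity", [{"id": "a1", "name": "Example"}]))
```

Because the key column is part of `row`, `SET n = row` would also preserve it here. `+=` is used only so that properties written by other processes are not wiped, which may or may not be the desired behavior for this pipeline.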
# Use backend-specific ingestion method
logger.info(
    f"[Task {task_id}] Starting ingestion using backend: {type(backend).__name__}"
)

result = await backend.ingest_from_s3(
    graph_id=graph_id,
    table_name=table_name,
    s3_pattern=s3_pattern,
    s3_credentials=s3_credentials,
    ignore_errors=ignore_errors,
)
Kuzu ingestion omits required post-copy cleanup
After switching perform_ingestion to delegate the work to backend.ingest_from_s3, the function no longer performs the aggressive cleanup, checkpointing and memory-settlement delays that previously ran for large Kuzu COPY operations. The constant LARGE_TABLES_REQUIRING_CLEANUP and the connection pool dependency are now unused, and the new KuzuBackend.ingest_from_s3 implementation does not call connection_pool.force_database_cleanup. Without these steps, the admission controller will continue to see high RSS/VMS after large loads, and subsequent ingestion tasks can be rejected or run out of memory. The memory-release logic should either be invoked here after the backend call or migrated into the backend implementation.
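The first option, invoking cleanup right after the backend call, could look roughly like the sketch below. Everything except the names `ingest_from_s3`, `LARGE_TABLES_REQUIRING_CLEANUP` and `force_database_cleanup` is hypothetical: the stub classes stand in for the real backend and connection pool, the table name in the constant is a placeholder, and the `force_database_cleanup(graph_id)` signature is an assumption, since the PR excerpt does not show it.

```python
import asyncio

# Placeholder contents; the real table list is not shown in the PR.
LARGE_TABLES_REQUIRING_CLEANUP = {"Fact"}


class FakeBackend:
    """Stub standing in for KuzuBackend (assumption for this sketch)."""

    async def ingest_from_s3(self, **kwargs):
        return {"status": "success", "table_name": kwargs["table_name"]}


class FakeConnectionPool:
    """Stub pool that records which graphs were cleaned up."""

    def __init__(self):
        self.cleaned = []

    def force_database_cleanup(self, graph_id):  # assumed signature
        self.cleaned.append(graph_id)


async def ingest_with_cleanup(backend, connection_pool, graph_id, table_name,
                              settle_seconds=0.0, **ingest_kwargs):
    """Delegate ingestion to the backend, then release memory for large tables."""
    try:
        return await backend.ingest_from_s3(
            graph_id=graph_id, table_name=table_name, **ingest_kwargs
        )
    finally:
        # Runs on success and on failure, so a failed COPY does not leave
        # memory pinned for the next task's admission check.
        if table_name in LARGE_TABLES_REQUIRING_CLEANUP:
            connection_pool.force_database_cleanup(graph_id)
            await asyncio.sleep(settle_seconds)  # memory-settlement delay


pool = FakeConnectionPool()
result = asyncio.run(
    ingest_with_cleanup(FakeBackend(), pool, "graph_1", "Fact",
                        s3_pattern="s3://example-bucket/fact/*.parquet")
)
```

Keeping the cleanup in the caller leaves the backend interface unchanged for Neo4j, at the cost of a Kuzu-specific concern living outside `KuzuBackend`. Moving it into the backend would reverse that trade-off.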
- Updated CLAUDE.md to generalize references from Kuzu to Graph Database, reflecting multi-backend support.
- Improved backend error handling by introducing S3IngestionError for clearer exception management in Kuzu and Neo4j backends.
- Added aggressive memory cleanup for large tables in the ingestion process to optimize resource management.
- Renamed configuration file references from kuzu.yml to graph.yml for consistency across the codebase.
Summary
This PR introduces comprehensive Neo4j backend support to the Graph API while maintaining existing Kuzu functionality. The changes establish a flexible multi-backend architecture with improved configuration management and enhanced operational tooling.
Key Accomplishments
Backend Architecture
Configuration & Infrastructure
Operational Improvements
Code Quality
Breaking Changes
Testing Notes
Infrastructure Considerations
This enhancement provides the foundation for scalable graph database operations while maintaining backward compatibility and improving overall system reliability.
🤖 Generated with Claude Code
Branch Info:
feature/neo4j-shared-repository → main
Co-Authored-By: Claude noreply@anthropic.com