feat: implement cross-endpoint routing reliability infrastructure #149
- Replace ManifestClient (mothership) with StateManagerClient for peer-to-peer architecture
- Make get_endpoint_for_function, get_resource_for_function, and is_local_function async
- Remove manifest_client parameter from ServiceRegistry.__init__
- Query State Manager directly for full manifest with extracted resources_endpoints mapping
- Update docstrings to reflect peer-to-peer model and State Manager dependency
- Ensure manifest cache refresh happens before routing decisions
Remove the ManifestClient HTTP client and the /manifest endpoint. Both are replaced by the peer-to-peer StateManagerClient model, in which all endpoints query State Manager directly. This eliminates a single point of failure and simplifies the architecture.

Changes:
- Delete ManifestClient (src/tetra_rp/runtime/manifest_client.py)
- Delete ManifestClient tests (tests/unit/runtime/test_manifest_client.py)
- Remove /manifest endpoint from lb_handler.py
- Remove /manifest endpoint tests from test_lb_handler.py and test_lb_remote_execution.py
- Update integration tests to use StateManagerClient mocks
- Remove FLASH_MOTHERSHIP_ID environment variable references
- Update documentation to reflect peer-to-peer architecture
- Update CLI test-mothership command output

Peer-to-peer architecture benefits:
- No single point of failure (no mothership dependency)
- All endpoints are equal peers
- Simpler deployment model
- Consistent service discovery via State Manager GraphQL API
- Fixed test_service_registry.py: Added RUNPOD_ENDPOINT_ID env var to tests that call _ensure_manifest_loaded() (test_is_local_function_remote, test_get_resource_for_function_remote)
- Fixed test_production_wrapper.py: Updated async method mocks to use AsyncMock properly for all get_resource_for_function calls
- Fixed test_cross_endpoint_routing.py: Added RUNPOD_ENDPOINT_ID env var to test_manifest_loading_on_demand
- Fixed conftest.py: Made worker_temp_dir fixture compatible with both parallel (xdist) and serial execution by using request fixture to detect worker_id instead of requiring it

All 707 tests now pass with 65% coverage. Tests use peer-to-peer routing via StateManagerClient as per recent hub-and-spoke cleanup.
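The AsyncMock fix above can be illustrated with a minimal sketch. Only the method name get_resource_for_function comes from this PR; the `route` helper and the convention that `None` means "local" are assumptions made for the example, not the project's actual behavior.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock

# Stand-in for the service registry; a plain MagicMock attribute would return
# a non-awaitable object, so the now-async method is replaced with AsyncMock.
registry = MagicMock()
registry.get_resource_for_function = AsyncMock(return_value=None)


async def route(function_name: str) -> str:
    """Hypothetical caller: treats a missing resource as 'local' (assumption)."""
    resource = await registry.get_resource_for_function(function_name)
    return "local" if resource is None else "remote"


result = asyncio.run(route("my_func"))
```

AsyncMock also records awaits, so a test can assert on `assert_awaited_once_with(...)` rather than only on call counts.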
Cleanup:
- Remove /manifest endpoint from LoadBalancer handler generator (lb_handler_generator.py)
- Complete peer-to-peer architecture migration - all endpoints now query State Manager directly

Phase 2 Foundation (Reliability Features):
- Add reliability_config.py: Centralized configuration for circuit breaker, load balancing, retry logic, and metrics collection with environment variable support
- Add circuit_breaker.py: Circuit breaker pattern implementation with state machine (CLOSED, OPEN, HALF_OPEN) using sliding window failure detection
- Add load_balancer.py: Load balancing strategies (round-robin, least-connections, random) for distributing requests across multiple endpoints
- Add retry_manager.py: Retry logic with exponential backoff, jitter, and circuit breaker integration for handling transient failures
- Add metrics.py: Structured logging-based metrics collection with helpers for circuit breaker, retry, and load balancer telemetry

Next: Phase 2 will implement upfront provisioning in deployment flow.
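For readers unfamiliar with the pattern, here is a minimal sketch of a circuit breaker with the three states named above and a sliding time window of failures. It is not the code in circuit_breaker.py; the class name, parameter names, and the injectable `clock` are assumptions chosen for illustration.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # requests flow normally
    OPEN = "open"            # requests are rejected
    HALF_OPEN = "half_open"  # one trial request is allowed through


class CircuitBreaker:
    """Illustrative breaker; all names and defaults here are hypothetical."""

    def __init__(self, failure_threshold=3, window_seconds=10.0,
                 open_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.open_timeout = open_timeout
        self._clock = clock
        self._failures = deque()  # timestamps of recent failures
        self._opened_at = None
        self.state = State.CLOSED

    def allow_request(self) -> bool:
        if self.state is State.OPEN:
            # After the timeout, let a probe request through.
            if self._clock() - self._opened_at >= self.open_timeout:
                self.state = State.HALF_OPEN
                return True
            return False
        return True

    def record_success(self) -> None:
        # A successful probe closes the circuit and clears history.
        if self.state is State.HALF_OPEN:
            self._failures.clear()
            self.state = State.CLOSED

    def record_failure(self) -> None:
        now = self._clock()
        if self.state is State.HALF_OPEN:
            self._trip(now)  # failed probe: reopen immediately
            return
        self._failures.append(now)
        # Drop failures that have slid out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = State.OPEN
        self._opened_at = now
```

A caller would check `allow_request()` before each cross-endpoint call and report the outcome with `record_success()` or `record_failure()`. Passing a fake `clock` keeps tests deterministic.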
Pull request overview
This PR implements cross-endpoint routing reliability infrastructure with circuit breaker patterns and transitions from a hub-and-spoke to peer-to-peer architecture for distributed state management.
Changes:
- Introduces circuit breaker, retry, and load balancer components for resilient cross-endpoint communication
- Migrates from ManifestClient (HTTP-based) to StateManagerClient (GraphQL-based) for peer-to-peer state synchronization
- Converts synchronous service registry methods to async to support State Manager queries
- Removes obsolete hub-and-spoke manifest endpoint and JSON normalization utilities
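As a rough illustration of the retry component listed above, the following sketch retries an async call with exponential backoff and full jitter. It is not the retry_manager.py implementation; the function name, parameters, default values, and the injectable `sleep` are all assumptions.

```python
import asyncio
import random


async def retry_with_backoff(func, *, max_attempts=3, base_delay=0.1,
                             max_delay=2.0,
                             retry_on=(ConnectionError, TimeoutError),
                             sleep=asyncio.sleep):
    """Call the async callable `func`, retrying on the given exceptions.

    The delay cap doubles each attempt (base_delay, 2x, 4x, ...) up to
    max_delay, and the actual sleep is drawn uniformly from [0, cap]
    ("full jitter") to avoid synchronized retries across callers.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            await sleep(random.uniform(0, cap))
```

Exceptions outside `retry_on` propagate immediately, which keeps non-transient errors from being retried.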
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tests/unit/runtime/test_service_registry.py | Updates tests for async service registry methods and State Manager client integration |
| tests/unit/runtime/test_retry_manager.py | New tests for retry manager with exponential backoff functionality |
| tests/unit/runtime/test_reliability_config.py | New tests for reliability configuration system |
| tests/unit/runtime/test_production_wrapper.py | Updates production wrapper tests to use async service registry |
| tests/unit/runtime/test_manifest_client.py | Removes obsolete ManifestClient tests |
| tests/unit/runtime/test_load_balancer.py | New tests for load balancer strategies |
| tests/unit/runtime/test_lb_handler.py | Removes obsolete manifest endpoint tests |
| tests/unit/runtime/test_circuit_breaker.py | New tests for circuit breaker pattern implementation |
| tests/unit/core/utils/test_json.py | Removes obsolete JSON normalization tests |
| tests/integration/test_lb_remote_execution.py | Removes obsolete manifest endpoint integration tests |
| tests/integration/test_cross_endpoint_routing.py | Updates for StateManagerClient and removes FLASH_MOTHERSHIP_ID references |
| tests/conftest.py | Fixes worker_temp_dir fixture for pytest-xdist compatibility |
| src/tetra_rp/runtime/service_registry.py | Migrates to peer-to-peer StateManagerClient and makes methods async |
| src/tetra_rp/runtime/retry_manager.py | New retry manager with exponential backoff |
| src/tetra_rp/runtime/reliability_config.py | New centralized reliability configuration |
| src/tetra_rp/runtime/production_wrapper.py | Updates to use async service registry methods |
| src/tetra_rp/runtime/metrics.py | New metrics collection via structured logging |
| src/tetra_rp/runtime/manifest_client.py | Removes obsolete HTTP-based manifest client |
| src/tetra_rp/runtime/load_balancer.py | New load balancer with multiple strategies |
| src/tetra_rp/runtime/lb_handler.py | Removes obsolete manifest endpoint |
| src/tetra_rp/runtime/circuit_breaker.py | New circuit breaker implementation |
| src/tetra_rp/core/utils/json.py | Removes obsolete JSON normalization utility |
| src/tetra_rp/cli/commands/test_mothership.py | Updates documentation for peer-to-peer architecture |
| src/tetra_rp/cli/commands/build_utils/lb_handler_generator.py | Removes manifest endpoint generation |
| docs/Flash_Deploy_Guide.md | New comprehensive deployment guide |
| docs/Cross_Endpoint_Routing.md | Updates for peer-to-peer architecture |
Copilot reviewed 26 out of 27 changed files in this pull request and generated no new comments.
- Fix worker_id access pattern in conftest.py with safer dict.get() approach
- Use default_factory for mutable tuple in RetryConfig dataclass
- Replace deprecated datetime.utcnow() with datetime.now(timezone.utc)
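Two of these fixes can be shown in a short sketch. The RetryConfig field names and defaults below are hypothetical (only the class name and the use of default_factory come from the commit message); the datetime call is the standard timezone-aware replacement for utcnow().

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Tuple, Type


@dataclass
class RetryConfig:
    # Hypothetical fields, for illustration only.
    max_attempts: int = 3
    # default_factory builds the default per instance instead of sharing
    # one class-level object across all instances.
    retryable_exceptions: Tuple[Type[BaseException], ...] = field(
        default_factory=lambda: (ConnectionError, TimeoutError)
    )


# datetime.utcnow() returns a naive datetime and is deprecated as of
# Python 3.12; datetime.now(timezone.utc) returns a timezone-aware value.
now = datetime.now(timezone.utc)
config = RetryConfig()
```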
Copilot reviewed 26 out of 27 changed files in this pull request and generated 2 comments.
- Update docstring in conftest.py to clarify worker_id is extracted from request
- Remove unused asyncio import from reliability_config.py
- Use built-in TimeoutError instead of asyncio.TimeoutError (asyncio.TimeoutError is an alias of the built-in as of Python 3.11)
The FLASH_MOTHERSHIP_ID variable was originally intended for child endpoints to identify their parent mothership. However, the current peer-to-peer architecture uses RUNPOD_ENDPOINT_ID directly for State Manager queries, making this variable redundant. The variable was set but never consumed anywhere in the codebase, so it is safe to remove without affecting functionality.

Changes:
- Remove FLASH_MOTHERSHIP_ID from environment dict in mothership_provisioner.py
- Remove test assertion validating FLASH_MOTHERSHIP_ID presence
- Update documentation references to remove mentions of FLASH_MOTHERSHIP_ID
- All tests continue to pass
Summary
Implements cross-endpoint routing reliability infrastructure with circuit breaker patterns and peer-to-peer state management.
See Cross-Endpoint Routing and Flash Deploy Guide for technical details.
Core Changes
Implementation Details
Tickets
AE-1747
Test plan
- make test-unit
- make test-integration
- make test
- make quality-check