feat: implement cross-endpoint routing reliability infrastructure #149

Merged
deanq merged 15 commits into main from deanq/ae-1747-cross-endpoint-routing-clean
Jan 27, 2026
deanq commented Jan 22, 2026

Summary

Implements cross-endpoint routing reliability infrastructure with circuit breaker patterns and peer-to-peer state management.

See the Cross-Endpoint Routing doc (docs/Cross_Endpoint_Routing.md) and the Flash Deploy Guide (docs/Flash_Deploy_Guide.md) for technical details.

Core Changes

  • Cross-Endpoint Routing Reliability: Circuit breaker and retry mechanisms for resilient request handling
  • State Manager Refactor: Migrate from hub-and-spoke to peer-to-peer architecture for distributed state
  • Infrastructure Cleanup: Remove obsolete hub-and-spoke infrastructure
  • Test Fixes: Resolve async test failures in cross-endpoint routing tests

Implementation Details

  • Circuit breaker pattern for endpoint request resilience
  • Peer-to-peer State Manager for distributed coordination
  • Proper async/await handling in routing tests
  • Service registry integration with new architecture
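
For readers unfamiliar with the circuit breaker pattern named above, here is a minimal sketch of the idea, not the code in circuit_breaker.py. The class name, method names, thresholds, and the injectable `clock` parameter are all assumptions made for illustration; only the three states (CLOSED, OPEN, HALF_OPEN) and the sliding failure window come from this PR's commit notes.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # requests flow normally
    OPEN = "open"            # requests are rejected
    HALF_OPEN = "half_open"  # probe requests are allowed through


class CircuitBreaker:
    """Illustrative breaker; names and defaults are hypothetical."""

    def __init__(self, failure_threshold=3, window_seconds=60.0,
                 recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = deque()  # timestamps of recent failures
        self._opened_at = None
        self.state = State.CLOSED

    def allow_request(self):
        if self.state is State.OPEN:
            # After the recovery timeout, move to HALF_OPEN and allow probes.
            if self._clock() - self._opened_at >= self.recovery_timeout:
                self.state = State.HALF_OPEN
                return True
            return False
        return True

    def record_success(self):
        # Any success (including a successful probe) closes the circuit.
        self._failures.clear()
        self.state = State.CLOSED

    def record_failure(self):
        now = self._clock()
        self._failures.append(now)
        # Drop failures that have slid out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        # A failed probe, or too many recent failures, opens the circuit.
        if (self.state is State.HALF_OPEN
                or len(self._failures) >= self.failure_threshold):
            self.state = State.OPEN
            self._opened_at = now
```

A caller would check `allow_request()` before each cross-endpoint call and report the outcome with `record_success()` or `record_failure()`. Passing a fake `clock` keeps tests deterministic.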

Tickets

AE-1747

Test plan

  • Run unit tests: make test-unit
  • Run integration tests: make test-integration
  • Full test suite: make test
  • Verify coverage: make quality-check

deanq added 5 commits January 22, 2026 11:58
- Replace ManifestClient (mothership) with StateManagerClient for peer-to-peer architecture
- Make get_endpoint_for_function, get_resource_for_function, and is_local_function async
- Remove manifest_client parameter from ServiceRegistry.__init__
- Query State Manager directly for full manifest with extracted resources_endpoints mapping
- Update docstrings to reflect peer-to-peer model and State Manager dependency
- Ensure manifest cache refresh happens before routing decisions
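
A rough sketch of what the async lookups in this commit could look like, assuming a client that returns the full manifest in one call. `FakeStateManagerClient`, `RegistrySketch`, `get_manifest`, the `function_registry` key, and the TTL value are hypothetical stand-ins; only the three async method names, `_ensure_manifest_loaded`, and the `resources_endpoints` mapping appear in this PR.

```python
import asyncio
import time


class FakeStateManagerClient:
    """Hypothetical stand-in for a State Manager client."""

    def __init__(self, manifest):
        self._manifest = manifest
        self.calls = 0

    async def get_manifest(self):
        self.calls += 1
        return self._manifest


class RegistrySketch:
    """Illustrative registry: refreshes its cache before each routing decision."""

    def __init__(self, client, local_resource, cache_ttl=300.0):
        self._client = client
        self._local = local_resource
        self._ttl = cache_ttl
        self._manifest = None
        self._loaded_at = 0.0

    async def _ensure_manifest_loaded(self):
        stale = time.monotonic() - self._loaded_at > self._ttl
        if self._manifest is None or stale:
            self._manifest = await self._client.get_manifest()
            self._loaded_at = time.monotonic()

    async def get_resource_for_function(self, name):
        await self._ensure_manifest_loaded()
        return self._manifest["function_registry"].get(name)

    async def is_local_function(self, name):
        resource = await self.get_resource_for_function(name)
        # Unknown functions are treated as local in this sketch.
        return resource is None or resource == self._local

    async def get_endpoint_for_function(self, name):
        resource = await self.get_resource_for_function(name)
        if resource is None or resource == self._local:
            return None  # nothing to route to
        return self._manifest["resources_endpoints"].get(resource)
```

Because every public method awaits `_ensure_manifest_loaded()` first, callers never route on a missing manifest, which is the ordering the last bullet above asks for.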

Remove the ManifestClient HTTP client and the /manifest endpoint; both are replaced
by the peer-to-peer StateManagerClient model, in which all endpoints query State Manager
directly. This eliminates a single point of failure and simplifies the architecture.

Changes:
- Delete ManifestClient (src/tetra_rp/runtime/manifest_client.py)
- Delete ManifestClient tests (tests/unit/runtime/test_manifest_client.py)
- Remove /manifest endpoint from lb_handler.py
- Remove /manifest endpoint tests from test_lb_handler.py and test_lb_remote_execution.py
- Update integration tests to use StateManagerClient mocks
- Remove FLASH_MOTHERSHIP_ID environment variable references
- Update documentation to reflect peer-to-peer architecture
- Update CLI test-mothership command output

Peer-to-peer architecture benefits:
- No single point of failure (no mothership dependency)
- All endpoints are equal peers
- Simpler deployment model
- Consistent service discovery via State Manager GraphQL API

- Fixed test_service_registry.py: Added RUNPOD_ENDPOINT_ID env var to tests
  that call _ensure_manifest_loaded() (test_is_local_function_remote,
  test_get_resource_for_function_remote)
- Fixed test_production_wrapper.py: Updated async method mocks to use AsyncMock
  properly for all get_resource_for_function calls
- Fixed test_cross_endpoint_routing.py: Added RUNPOD_ENDPOINT_ID env var to
  test_manifest_loading_on_demand
- Fixed conftest.py: Made worker_temp_dir fixture compatible with both
  parallel (xdist) and serial execution by using request fixture to detect
  worker_id instead of requiring it

All 707 tests now pass with 65% coverage. Tests use peer-to-peer routing
via StateManagerClient as per recent hub-and-spoke cleanup.
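
The AsyncMock fix described above follows a standard pattern: once a method becomes a coroutine, a plain MagicMock return value can no longer be awaited. A small illustration, where the `route` helper and the `"gpu_worker"` value are made up for the example:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock

# The registry itself can stay a MagicMock; only the async method needs AsyncMock.
registry = MagicMock()
registry.get_resource_for_function = AsyncMock(return_value="gpu_worker")


async def route(registry, function_name):
    # Hypothetical caller that awaits the now-async registry method.
    resource = await registry.get_resource_for_function(function_name)
    return f"remote:{resource}" if resource else "local"


result = asyncio.run(route(registry, "train"))

# AsyncMock records awaits, so the test can assert on them directly.
registry.get_resource_for_function.assert_awaited_once_with("train")
```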

Cleanup:
- Remove /manifest endpoint from LoadBalancer handler generator (lb_handler_generator.py)
- Complete peer-to-peer architecture migration - all endpoints now query State Manager directly

Phase 2 Foundation (Reliability Features):
- Add reliability_config.py: Centralized configuration for circuit breaker, load balancing,
  retry logic, and metrics collection with environment variable support
- Add circuit_breaker.py: Circuit breaker pattern implementation with state machine (CLOSED,
  OPEN, HALF_OPEN) using sliding window failure detection
- Add load_balancer.py: Load balancing strategies (round-robin, least-connections, random)
  for distributing requests across multiple endpoints
- Add retry_manager.py: Retry logic with exponential backoff, jitter, and circuit breaker
  integration for handling transient failures
- Add metrics.py: Structured logging-based metrics collection with helpers for circuit
  breaker, retry, and load balancer telemetry
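
As a rough illustration of the retry behavior listed above (exponential backoff with jitter), here is a generic sketch rather than the contents of retry_manager.py. The function name, parameters, and default exception tuple are assumptions, and the circuit breaker integration is left out to keep the example short.

```python
import asyncio
import random


async def retry_async(func, *, max_attempts=3, base_delay=0.5, max_delay=10.0,
                      retryable=(ConnectionError, TimeoutError),
                      sleep=asyncio.sleep):
    """Call `func()` until it succeeds or attempts run out (hypothetical helper)."""
    for attempt in range(max_attempts):
        try:
            return await func()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: the cap doubles each attempt, up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": sleep a random duration between 0 and the cap.
            await sleep(random.uniform(0, cap))
```

Jitter spreads retries from many callers over time so that a recovering endpoint is not hit by synchronized bursts. The `sleep` parameter exists only so tests can avoid real delays.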

Next: Phase 2 will implement upfront provisioning in deployment flow.
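
The centralized configuration with environment variable support mentioned in this commit could take a shape like the one below. This is a guess at the general pattern only: the class name, field names, defaults, and the `EXAMPLE_CB_*` variable names are invented for the example and are not the names used in reliability_config.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CircuitBreakerSettings:
    """Hypothetical settings object; fields and defaults are illustrative."""

    enabled: bool = True
    failure_threshold: int = 5
    timeout_seconds: float = 60.0

    @classmethod
    def from_env(cls, env=None):
        # Accept a mapping so tests do not have to mutate os.environ.
        env = os.environ if env is None else env
        return cls(
            enabled=env.get("EXAMPLE_CB_ENABLED", "true").lower()
            in ("1", "true", "yes"),
            failure_threshold=int(env.get("EXAMPLE_CB_FAILURE_THRESHOLD", "5")),
            timeout_seconds=float(env.get("EXAMPLE_CB_TIMEOUT_SECONDS", "60.0")),
        )
```

Unset variables fall back to the defaults, so a deployment only needs to set the values it wants to override.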
deanq changed the title from "feat: implement cross-endpoint routing reliability infrastructure (AE-1747)" to "feat: implement cross-endpoint routing reliability infrastructure" Jan 22, 2026
deanq requested a review from Copilot January 22, 2026 20:45

Copilot AI left a comment

Pull request overview

This PR implements cross-endpoint routing reliability infrastructure with circuit breaker patterns and transitions from a hub-and-spoke to peer-to-peer architecture for distributed state management.

Changes:

  • Introduces circuit breaker, retry, and load balancer components for resilient cross-endpoint communication
  • Migrates from ManifestClient (HTTP-based) to StateManagerClient (GraphQL-based) for peer-to-peer state synchronization
  • Converts synchronous service registry methods to async to support State Manager queries
  • Removes obsolete hub-and-spoke manifest endpoint and JSON normalization utilities

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| tests/unit/runtime/test_service_registry.py | Updates tests for async service registry methods and State Manager client integration |
| tests/unit/runtime/test_retry_manager.py | New tests for retry manager with exponential backoff functionality |
| tests/unit/runtime/test_reliability_config.py | New tests for reliability configuration system |
| tests/unit/runtime/test_production_wrapper.py | Updates production wrapper tests to use async service registry |
| tests/unit/runtime/test_manifest_client.py | Removes obsolete ManifestClient tests |
| tests/unit/runtime/test_load_balancer.py | New tests for load balancer strategies |
| tests/unit/runtime/test_lb_handler.py | Removes obsolete manifest endpoint tests |
| tests/unit/runtime/test_circuit_breaker.py | New tests for circuit breaker pattern implementation |
| tests/unit/core/utils/test_json.py | Removes obsolete JSON normalization tests |
| tests/integration/test_lb_remote_execution.py | Removes obsolete manifest endpoint integration tests |
| tests/integration/test_cross_endpoint_routing.py | Updates for StateManagerClient and removes FLASH_MOTHERSHIP_ID references |
| tests/conftest.py | Fixes worker_temp_dir fixture for pytest-xdist compatibility |
| src/tetra_rp/runtime/service_registry.py | Migrates to peer-to-peer StateManagerClient and makes methods async |
| src/tetra_rp/runtime/retry_manager.py | New retry manager with exponential backoff |
| src/tetra_rp/runtime/reliability_config.py | New centralized reliability configuration |
| src/tetra_rp/runtime/production_wrapper.py | Updates to use async service registry methods |
| src/tetra_rp/runtime/metrics.py | New metrics collection via structured logging |
| src/tetra_rp/runtime/manifest_client.py | Removes obsolete HTTP-based manifest client |
| src/tetra_rp/runtime/load_balancer.py | New load balancer with multiple strategies |
| src/tetra_rp/runtime/lb_handler.py | Removes obsolete manifest endpoint |
| src/tetra_rp/runtime/circuit_breaker.py | New circuit breaker implementation |
| src/tetra_rp/core/utils/json.py | Removes obsolete JSON normalization utility |
| src/tetra_rp/cli/commands/test_mothership.py | Updates documentation for peer-to-peer architecture |
| src/tetra_rp/cli/commands/build_utils/lb_handler_generator.py | Removes manifest endpoint generation |
| docs/Flash_Deploy_Guide.md | New comprehensive deployment guide |
| docs/Cross_Endpoint_Routing.md | Updates for peer-to-peer architecture |

Comment threads (all marked Outdated):

  • tests/conftest.py: 1 thread
  • src/tetra_rp/runtime/reliability_config.py: 1 thread
  • src/tetra_rp/runtime/circuit_breaker.py: 8 threads
deanq requested a review from Copilot January 22, 2026 21:09

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 27 changed files in this pull request and generated no new comments.


- Fix worker_id access pattern in conftest.py with safer dict.get() approach
- Use default_factory for mutable tuple in RetryConfig dataclass
- Replace deprecated datetime.utcnow() with datetime.now(timezone.utc)
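
Two of the fixes above are common Python idioms and can be shown together. The dataclass below is a made-up example (its name and fields are not taken from reliability_config.py): `field(default_factory=...)` builds the default per instance, and `datetime.now(timezone.utc)` returns a timezone-aware value, unlike `datetime.utcnow()`, which returns a naive datetime and is deprecated since Python 3.12.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RetrySettings:
    """Hypothetical config object illustrating both fixes."""

    max_attempts: int = 3
    # default_factory builds the default when each instance is created.
    retryable_exceptions: tuple = field(
        default_factory=lambda: (ConnectionError, TimeoutError)
    )
    # Timezone-aware timestamp, evaluated at instantiation time.
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


settings = RetrySettings()
```

A plain `created_at: datetime = datetime.now(timezone.utc)` would be evaluated once at class definition and shared by every instance, which is another reason to prefer the factory form.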
deanq requested a review from Copilot January 22, 2026 21:47

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 27 changed files in this pull request and generated 2 comments.


Comment thread tests/conftest.py
Comment thread src/tetra_rp/runtime/reliability_config.py Outdated
deanq and others added 2 commits January 22, 2026 14:12
- Update docstring in conftest.py to clarify worker_id is extracted from request
- Remove unused asyncio import from reliability_config.py
- Use built-in TimeoutError instead of asyncio.TimeoutError (equivalent in Python 3.11+, where asyncio.TimeoutError is an alias of the built-in)
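
A note on the TimeoutError change: `asyncio.TimeoutError` is an alias of the built-in `TimeoutError` from Python 3.11 onward; on earlier versions they are distinct classes. The helper below is a hypothetical example (not code from this PR) of a handler that behaves the same on both sides of that boundary by catching both names.

```python
import asyncio


async def call_with_timeout(coro, seconds):
    """Hypothetical helper: await `coro`, or report a timeout."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except (TimeoutError, asyncio.TimeoutError):
        # On 3.11+ both names refer to the same class; listing both
        # keeps the handler correct on older interpreters too.
        return "timed out"


result = asyncio.run(call_with_timeout(asyncio.sleep(1, result="done"), 0.01))
```

If the project only supports Python 3.11 or newer, catching the built-in alone is sufficient.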
Comment thread src/tetra_rp/runtime/reliability_config.py
Comment thread src/tetra_rp/runtime/service_registry.py
deanq and others added 5 commits January 26, 2026 10:01
The FLASH_MOTHERSHIP_ID variable was originally intended for child endpoints
to identify their parent mothership. However, the current peer-to-peer
architecture uses RUNPOD_ENDPOINT_ID directly for State Manager queries,
making this variable redundant.

This variable was set but never consumed anywhere in the codebase, making
it safe to remove without affecting functionality.

Changes:
- Remove FLASH_MOTHERSHIP_ID from environment dict in mothership_provisioner.py
- Remove test assertion validating FLASH_MOTHERSHIP_ID presence
- Update documentation references to remove mentions of FLASH_MOTHERSHIP_ID
- All tests continue to pass
deanq requested a review from jhcipar January 27, 2026 21:55
deanq merged commit cb6a226 into main Jan 27, 2026
7 checks passed
deanq deleted the deanq/ae-1747-cross-endpoint-routing-clean branch January 27, 2026 22:19
This was referenced Feb 6, 2026