Commit 5631c32

feat: enable multi-datacenter support (#266)
* feat: enable multi-datacenter support
* docs: update SDK reference and deploy guide for multi-datacenter support
* feat: add datacenter alias on NetworkVolume
* fix: pass locations through manifest and resource provisioner
* fix: remove dead _is_prod_environment function
* fix: address copilot review on multi-volume support
* fix: address PR review comments for multi-datacenter support
* style: fix ruff formatting in manifest.py
* fix: remove redundant type annotation on _deployed_volume_ids reset
* fix: align datacenter enum to s3-enabled DCs
* fix: address post-rebase review comments
* fix: address PR review findings for multi-datacenter support
  - fix US_GA_1 references to US_GA_2 in docs, tests (enum member does not exist)
  - add field_serializer for datacenter on ServerlessResource for consistency
  - strip runtime-assigned volume IDs from config_hash to prevent false drift
  - add tests for multi-volume manifest serialization
  - add tests for multi-volume provisioner reconstruction
  - add test verifying volume id does not affect config hash
1 parent df9ea77 commit 5631c32

21 files changed

Lines changed: 2442 additions & 1646 deletions

docs/Flash_Deploy_Guide.md

Lines changed: 21 additions & 2 deletions
@@ -289,13 +289,14 @@ result = await vllm.post("/v1/completions", {"prompt": "hello"})
289289
### Persistent Storage
290290

291291
```python
292-
from runpod_flash import Endpoint, GpuGroup, NetworkVolume
292+
from runpod_flash import Endpoint, GpuGroup, DataCenter, NetworkVolume, PodTemplate
293293

294-
vol = NetworkVolume(id="vol_abc123")
294+
vol = NetworkVolume(name="model-cache", size=100, datacenter=DataCenter.US_GA_2)
295295

296296
@Endpoint(
297297
name="model-server",
298298
gpu=GpuGroup.AMPERE_80,
299+
datacenter=DataCenter.US_GA_2,
299300
volume=vol,
300301
template=PodTemplate(containerDiskInGb=100),
301302
)
@@ -304,6 +305,24 @@ async def serve(data: dict) -> dict:
304305
...
305306
```
306307

308+
Multiple volumes across datacenters:
309+
310+
```python
311+
volumes = [
312+
NetworkVolume(name="models-us", size=100, datacenter=DataCenter.US_GA_2),
313+
NetworkVolume(name="models-eu", size=100, datacenter=DataCenter.EU_RO_1),
314+
]
315+
316+
@Endpoint(
317+
name="global-server",
318+
gpu=GpuGroup.AMPERE_80,
319+
datacenter=[DataCenter.US_GA_2, DataCenter.EU_RO_1],
320+
volume=volumes,
321+
)
322+
async def serve(data: dict) -> dict:
323+
...
324+
```
325+
307326
## Troubleshooting
308327

309328
### Build Issues

docs/Flash_SDK_Reference.md

Lines changed: 36 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -20,8 +20,8 @@ Endpoint(
2020
dependencies: Optional[List[str]] = None,
2121
system_dependencies: Optional[List[str]] = None,
2222
accelerate_downloads: bool = True,
23-
volume: Optional[NetworkVolume] = None,
24-
datacenter: DataCenter = DataCenter.EU_RO_1,
23+
volume: Optional[Union[NetworkVolume, List[NetworkVolume]]] = None,
24+
datacenter: Optional[Union[DataCenter, List[DataCenter], str, List[str]]] = None,
2525
env: Optional[Dict[str, str]] = None,
2626
gpu_count: int = 1,
2727
execution_timeout_ms: int = 0,
@@ -46,8 +46,8 @@ Endpoint(
4646
| `dependencies` | `list[str]` | `None` | Python packages to install (e.g., `["torch", "numpy==1.24"]`). |
4747
| `system_dependencies` | `list[str]` | `None` | System packages to install. |
4848
| `accelerate_downloads` | `bool` | `True` | Enable accelerated downloads. |
49-
| `volume` | `NetworkVolume` | `None` | Network volume for persistent storage. |
50-
| `datacenter` | `DataCenter` | `EU_RO_1` | Preferred datacenter. |
49+
| `volume` | `NetworkVolume` or list | `None` | Network volume(s) for persistent storage. One volume per datacenter. |
50+
| `datacenter` | `DataCenter`, list, `str`, or `None` | `None` | Datacenter(s) to deploy into. `None` means all available DCs. Accepts a single value, a list, or string DC IDs. CPU endpoints must use DCs in `CPU_DATACENTERS`. |
5151
| `env` | `dict[str, str]` | `None` | Environment variables for the endpoint. |
5252
| `gpu_count` | `int` | `1` | GPUs per worker. |
5353
| `execution_timeout_ms` | `int` | `0` | Max execution time in ms. 0 = no limit. |
@@ -335,8 +335,25 @@ CPU instance selection. Can also be passed as a string to `cpu=`.
335335

336336
| Value | Location |
337337
|-------|----------|
338-
| `DataCenter.EU_RO_1` | Europe - Romania (default) |
339-
| `DataCenter.US_TX_3` | US - Texas |
338+
| `DataCenter.US_CA_2` | US - California |
339+
| `DataCenter.US_GA_2` | US - Georgia |
340+
| `DataCenter.US_IL_1` | US - Illinois |
341+
| `DataCenter.US_KS_2` | US - Kansas |
342+
| `DataCenter.US_MD_1` | US - Maryland |
343+
| `DataCenter.US_MO_1` | US - Missouri |
344+
| `DataCenter.US_MO_2` | US - Missouri |
345+
| `DataCenter.US_NC_1` | US - North Carolina |
346+
| `DataCenter.US_NC_2` | US - North Carolina |
347+
| `DataCenter.US_NE_1` | US - Nebraska |
348+
| `DataCenter.US_WA_1` | US - Washington |
349+
| `DataCenter.EU_CZ_1` | Europe - Czech Republic |
350+
| `DataCenter.EU_RO_1` | Europe - Romania |
351+
| `DataCenter.EUR_IS_1` | Europe - Iceland |
352+
| `DataCenter.EUR_NO_1` | Europe - Norway |
353+
354+
When `datacenter=None` (the default), the endpoint is available in all data centers.
355+
356+
CPU endpoints are restricted to the `CPU_DATACENTERS` subset: `EU_RO_1`.
340357

341358
### CudaVersion
342359

@@ -350,12 +367,22 @@ CPU instance selection. Can also be passed as a string to `cpu=`.
350367

351368
### NetworkVolume
352369

353-
Persistent storage that survives worker restarts.
370+
Persistent storage that survives worker restarts. Each volume is tied to a specific datacenter.
354371

355372
```python
356-
from runpod_flash import NetworkVolume
373+
from runpod_flash import NetworkVolume, DataCenter
357374

375+
# existing volume by ID
358376
vol = NetworkVolume(id="vol_abc123")
377+
378+
# create a new volume in a specific datacenter
379+
vol = NetworkVolume(name="my-models", size=100, datacenter=DataCenter.US_GA_2)
380+
381+
# multiple volumes across datacenters (one per DC)
382+
volumes = [
383+
NetworkVolume(name="models-us", size=100, datacenter=DataCenter.US_GA_2),
384+
NetworkVolume(name="models-eu", size=100, datacenter=DataCenter.EU_RO_1),
385+
]
359386
```
360387

361388
### PodTemplate
@@ -392,6 +419,7 @@ from runpod_flash import (
392419
CpuInstanceType,
393420
CudaVersion,
394421
DataCenter,
422+
CPU_DATACENTERS,
395423
NetworkVolume,
396424
PodTemplate,
397425
ServerlessScalerType,
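
Reviewer note: the reference above says `datacenter` accepts a single value, a list, a string DC ID, or `None`, and that CPU endpoints must stay within `CPU_DATACENTERS`. The following is a minimal sketch of how such an argument could be coerced and checked. It is not the SDK's implementation: `normalize_datacenters` and `check_cpu_datacenters` are hypothetical helper names, and the `DataCenter` enum here is a three-member stand-in for the real one.

```python
from enum import Enum
from typing import List, Optional, Union


class DataCenter(str, Enum):
    # Stand-in with three members; the real enum in this commit has more.
    US_GA_2 = "US-GA-2"
    EU_CZ_1 = "EU-CZ-1"
    EU_RO_1 = "EU-RO-1"


# Mirrors the CPU_DATACENTERS subset documented in the reference.
CPU_DATACENTERS = frozenset({DataCenter.EU_RO_1})

DataCenterArg = Optional[Union[DataCenter, str, List[DataCenter], List[str]]]


def normalize_datacenters(value: DataCenterArg) -> Optional[List[DataCenter]]:
    """Hypothetical helper: coerce the `datacenter=` argument to a list.

    None is passed through, since the docs treat it as "all available DCs".
    """
    if value is None:
        return None
    items = value if isinstance(value, list) else [value]
    # Looking up an existing member returns that member unchanged, and a
    # canonical string such as "EU-RO-1" resolves by value.
    return [DataCenter(item) for item in items]


def check_cpu_datacenters(datacenters: Optional[List[DataCenter]]) -> None:
    """Hypothetical check: reject CPU endpoints outside CPU_DATACENTERS."""
    unsupported = [dc for dc in (datacenters or []) if dc not in CPU_DATACENTERS]
    if unsupported:
        names = ", ".join(dc.value for dc in unsupported)
        raise ValueError(f"CPU endpoints are not supported in: {names}")


mixed = normalize_datacenters([DataCenter.US_GA_2, "EU-RO-1"])
check_cpu_datacenters([DataCenter.EU_RO_1])  # passes silently
```

How the SDK treats `datacenter=None` for CPU endpoints is not shown in this diff, so the sketch simply skips the check in that case.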

src/runpod_flash/__init__.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@
1717
from .client import remote
1818
from .endpoint import Endpoint, EndpointJob
1919
from .core.resources import (
20+
CPU_DATACENTERS,
2021
CpuInstanceType,
2122
CpuLiveLoadBalancer,
2223
CpuLiveServerless,
@@ -58,6 +59,7 @@
5859

5960
_RESOURCE_NAMES = frozenset(
6061
{
62+
"CPU_DATACENTERS",
6163
"CpuInstanceType",
6264
"CpuLiveLoadBalancer",
6365
"CpuLiveServerless",
@@ -104,6 +106,7 @@ def __getattr__(name):
104106
return remote
105107
elif name in _RESOURCE_NAMES:
106108
from .core.resources import (
109+
CPU_DATACENTERS,
107110
CpuInstanceType,
108111
CpuLiveLoadBalancer,
109112
CpuLiveServerless,
@@ -126,6 +129,7 @@ def __getattr__(name):
126129
)
127130

128131
attrs = {
132+
"CPU_DATACENTERS": CPU_DATACENTERS,
129133
"CpuInstanceType": CpuInstanceType,
130134
"CpuLiveLoadBalancer": CpuLiveLoadBalancer,
131135
"CpuLiveServerless": CpuLiveServerless,
@@ -173,6 +177,7 @@ def __getattr__(name):
173177
"Endpoint",
174178
"EndpointJob",
175179
"remote",
180+
"CPU_DATACENTERS",
176181
"CpuInstanceType",
177182
"CpuLiveLoadBalancer",
178183
"CpuLiveServerless",

src/runpod_flash/cli/commands/build_utils/manifest.py

Lines changed: 35 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,24 @@
2525
RESERVED_PATHS = ["/execute", "/ping"]
2626

2727

28+
def _serialize_network_volume(nv) -> dict:
29+
"""Serialize a NetworkVolume to a manifest-safe dict."""
30+
nv_config: dict = {}
31+
if nv.name is not None:
32+
nv_config["name"] = nv.name
33+
if getattr(nv, "id", None) is not None:
34+
nv_config["id"] = nv.id
35+
if nv.size is not None:
36+
nv_config["size"] = nv.size
37+
if hasattr(nv, "dataCenterId") and nv.dataCenterId is not None:
38+
nv_config["dataCenterId"] = (
39+
nv.dataCenterId.value
40+
if hasattr(nv.dataCenterId, "value")
41+
else nv.dataCenterId
42+
)
43+
return nv_config
44+
45+
2846
@dataclass
2947
class ManifestFunction:
3048
"""Function entry in manifest."""
@@ -218,24 +236,29 @@ def _extract_config_properties(config: Dict[str, Any], resource_config) -> None:
218236
):
219237
config["scalerValue"] = resource_config.scalerValue
220238

239+
if hasattr(resource_config, "locations") and resource_config.locations:
240+
config["locations"] = resource_config.locations
241+
221242
if hasattr(resource_config, "env") and resource_config.env:
222243
env_dict = dict(resource_config.env)
223244
env_dict.pop("RUNPOD_API_KEY", None)
224245
if env_dict:
225246
config["env"] = env_dict
226247

227-
if hasattr(resource_config, "networkVolume") and resource_config.networkVolume:
228-
nv = resource_config.networkVolume
229-
nv_config = {"name": nv.name}
230-
if nv.size is not None:
231-
nv_config["size"] = nv.size
232-
if hasattr(nv, "dataCenterId") and nv.dataCenterId is not None:
233-
nv_config["dataCenterId"] = (
234-
nv.dataCenterId.value
235-
if hasattr(nv.dataCenterId, "value")
236-
else nv.dataCenterId
237-
)
238-
config["networkVolume"] = nv_config
248+
if (
249+
hasattr(resource_config, "networkVolumes")
250+
and resource_config.networkVolumes
251+
):
252+
config["networkVolumes"] = [
253+
_serialize_network_volume(nv) for nv in resource_config.networkVolumes
254+
]
255+
256+
elif (
257+
hasattr(resource_config, "networkVolume") and resource_config.networkVolume
258+
):
259+
config["networkVolume"] = _serialize_network_volume(
260+
resource_config.networkVolume
261+
)
239262

240263
elif (
241264
hasattr(resource_config, "networkVolumeId")

src/runpod_flash/core/api/runpod.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -273,6 +273,10 @@ async def save_endpoint(self, input_data: Dict[str, Any]) -> Dict[str, Any]:
273273
locations
274274
name
275275
networkVolumeId
276+
networkVolumeIds {
277+
networkVolumeId
278+
dataCenterId
279+
}
276280
flashEnvironmentId
277281
scalerType
278282
scalerValue

src/runpod_flash/core/resources/__init__.py

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818
)
1919
from .serverless_cpu import CpuServerlessEndpoint
2020
from .template import PodTemplate
21-
from .network_volume import NetworkVolume, DataCenter
21+
from .network_volume import NetworkVolume, DataCenter, CPU_DATACENTERS
2222
from .load_balancer_sls_resource import (
2323
CpuLoadBalancerSlsResource,
2424
LoadBalancerSlsResource,
@@ -33,6 +33,7 @@
3333
"CpuLoadBalancerSlsResource",
3434
"CpuServerlessEndpoint",
3535
"CudaVersion",
36+
"CPU_DATACENTERS",
3637
"DataCenter",
3738
"DeployableResource",
3839
"GpuGroup",

src/runpod_flash/core/resources/load_balancer_sls_resource.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -342,6 +342,7 @@ class CpuLoadBalancerSlsResource(CpuEndpointMixin, LoadBalancerSlsResource):
342342
"allowedCudaVersions",
343343
"imageName",
344344
"networkVolume",
345+
"networkVolumes",
345346
"python_version",
346347
}
347348

@@ -401,6 +402,7 @@ def config_hash(self) -> str:
401402
"flashEnvironmentId",
402403
"imageName",
403404
"networkVolume",
405+
"networkVolumes",
404406
"instanceIds", # CPU-specific
405407
"workersMin", # Scaling
406408
"workersMax",

src/runpod_flash/core/resources/network_volume.py

Lines changed: 69 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@
66
from pydantic import (
77
Field,
88
field_serializer,
9+
model_validator,
910
)
1011

1112
from ..api.runpod import RunpodRestClient
@@ -17,12 +18,50 @@
1718

1819

1920
class DataCenter(str, Enum):
20-
"""
21-
Enum representing available data centers for network volumes.
22-
#TODO: Add more data centers as needed. Lock this to the available data center.
23-
"""
24-
21+
"""Enum representing available RunPod data centers."""
22+
23+
# north america
24+
US_CA_2 = "US-CA-2"
25+
US_GA_2 = "US-GA-2"
26+
US_IL_1 = "US-IL-1"
27+
US_KS_2 = "US-KS-2"
28+
US_MD_1 = "US-MD-1"
29+
US_MO_1 = "US-MO-1"
30+
US_MO_2 = "US-MO-2"
31+
US_NC_1 = "US-NC-1"
32+
US_NC_2 = "US-NC-2"
33+
US_NE_1 = "US-NE-1"
34+
US_WA_1 = "US-WA-1"
35+
36+
# europe
37+
EU_CZ_1 = "EU-CZ-1"
2538
EU_RO_1 = "EU-RO-1"
39+
EUR_IS_1 = "EUR-IS-1"
40+
EUR_NO_1 = "EUR-NO-1"
41+
42+
@classmethod
43+
def from_string(cls, value: str) -> "DataCenter":
44+
"""Parse a datacenter ID string into a DataCenter enum.
45+
46+
Accepts the canonical form (e.g. "EU-RO-1") as well as common
47+
variations like lowercase or underscore-separated.
48+
"""
49+
normalized = value.strip().upper().replace("_", "-")
50+
try:
51+
return cls(normalized)
52+
except ValueError:
53+
valid = ", ".join(dc.value for dc in cls)
54+
raise ValueError(
55+
f"Unknown datacenter '{value}'. Valid datacenters: {valid}"
56+
)
57+
58+
59+
# data centers that support CPU serverless endpoints
60+
CPU_DATACENTERS: frozenset[DataCenter] = frozenset(
61+
{
62+
DataCenter.EU_RO_1,
63+
}
64+
)
2665

2766

2867
class NetworkVolume(DeployableResource):
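
Reviewer note: a quick sketch of what `DataCenter.from_string` accepts, using a two-member stand-in enum so it runs on its own. The method body follows the hunk above. The only deviation is `from None` on the re-raise, which is an addition in this sketch to keep the traceback short and is not part of the commit.

```python
from enum import Enum


class DataCenter(str, Enum):
    # Two members only; the full list is in the hunk above.
    US_GA_2 = "US-GA-2"
    EU_RO_1 = "EU-RO-1"

    @classmethod
    def from_string(cls, value: str) -> "DataCenter":
        # Trim whitespace, upper-case, and map underscores to hyphens.
        normalized = value.strip().upper().replace("_", "-")
        try:
            return cls(normalized)
        except ValueError:
            valid = ", ".join(dc.value for dc in cls)
            raise ValueError(
                f"Unknown datacenter '{value}'. Valid datacenters: {valid}"
            ) from None


print(DataCenter.from_string(" eu_ro_1 ") is DataCenter.EU_RO_1)  # True
print(DataCenter.from_string("us-ga-2").value)  # US-GA-2
```

An ID that is not an enum member, such as `"US-GA-1"`, raises `ValueError` with the list of valid values.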
@@ -41,22 +80,42 @@ class NetworkVolume(DeployableResource):
4180
"name",
4281
}
4382

44-
# Internal fixed value
45-
dataCenterId: DataCenter = Field(default=DataCenter.EU_RO_1, frozen=True)
83+
# public alias -- users pass datacenter=, which syncs to dataCenterId for the API
84+
datacenter: Optional[DataCenter] = Field(default=None, exclude=True)
85+
dataCenterId: DataCenter = Field(default=DataCenter.EU_RO_1)
4686

4787
id: Optional[str] = Field(default=None)
48-
name: str
88+
name: Optional[str] = None
4989
size: Optional[int] = Field(default=100, gt=0) # Size in GB
5090

91+
@model_validator(mode="before")
92+
@classmethod
93+
def sync_datacenter_alias(cls, data):
94+
"""Allow datacenter= as a user-friendly alias for dataCenterId."""
95+
if isinstance(data, dict):
96+
dc = data.pop("datacenter", None)
97+
if dc is not None and "dataCenterId" not in data:
98+
data["dataCenterId"] = dc
99+
return data
100+
101+
@model_validator(mode="after")
102+
def require_name_or_id(self):
103+
"""Either name or id must be provided."""
104+
if not self.name and not self.id:
105+
raise ValueError("either 'name' or 'id' must be provided")
106+
return self
107+
51108
def __str__(self) -> str:
52109
return f"{self.__class__.__name__}:{self.id}"
53110

54111
@property
55112
def resource_id(self) -> str:
56113
"""Unique resource ID based on name and datacenter for idempotent behavior."""
57-
# Use name + datacenter to ensure idempotence
58114
resource_type = self.__class__.__name__
59-
config_key = f"{self.name}:{self.dataCenterId.value}"
115+
if self.name:
116+
config_key = f"{self.name}:{self.dataCenterId.value}"
117+
else:
118+
config_key = f"id:{self.id}"
60119
hash_obj = hashlib.md5(f"{resource_type}:{config_key}".encode())
61120
return f"{resource_type}_{hash_obj.hexdigest()}"
62121
