Merged
23 changes: 2 additions & 21 deletions CLAUDE.md
@@ -44,7 +44,7 @@ Pre-commit hooks automatically run:
**CODObject** (`cod_object.py`)
- Primary interface for interacting with cloud-optimized DICOM series
- Manages series-level tar archives and metadata in GCS
- Handles access modes, serialization/deserialization, and state synchronization
- Handles access modes and state synchronization
- Key URI pattern: `<datastore_path>/studies/<study_uid>/series/<series_uid>.tar`
- Must be used as context manager for `mode="w"` to ensure proper lock release and sync

@@ -74,7 +74,6 @@ Pre-commit hooks automatically run:
- `mode="a"`: Append access; acquires exclusive lock automatically (raises `LockAcquisitionError` if exists); fetches remote tar if it exists; appends to existing tar/metadata on sync
- `sync_on_exit=True` (default): Auto-syncs and releases lock on context exit for `mode="w"` or `mode="a"`
- `sync_on_exit=False`: No lock acquired, no auto-sync; useful for local testing/debugging
- Locks persist through serialization/deserialization (for Apache Beam workflows)
- Locks deliberately "hang" on errors to indicate series corruption
- User must use context manager for proper lock release

@@ -98,12 +97,6 @@ Pre-commit hooks automatically run:
- v1.0: Uncompressed DICOM JSON dict, UIDs parsed from metadata
- v2.0: Zstandard-compressed metadata, explicit UID/pixeldata indexing, ~5-10x size reduction

**Serialization/Deserialization**
- CODObject designed for Apache Beam workflows
- `.serialize()` returns dict for passing between DoFns
- `.deserialize()` reconstructs object, can re-acquire persistent lock
- Pattern: Acquire lock in first DoFn, pass serialized object through pipeline, release in last DoFn

### Project Structure

```
@@ -137,7 +130,7 @@ cloud_optimized_dicom/
- `smart-open`: Unified remote file access

**Optional:**
- `apache-beam[gcp]`: Data processing (CODObject serialization compatible); install with `pip install cloud-optimized-dicom[beam]`. Without Beam, metric counters silently no-op.
- `apache-beam[gcp]`: Data processing; install with `pip install cloud-optimized-dicom[beam]`. Without Beam, metric counters silently no-op.
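The silent no-op without Beam is typically achieved with an import guard. A hedged sketch of one common pattern (not necessarily this library's actual code; the counter name is hypothetical):

```python
try:
    from apache_beam.metrics import Metrics  # real counters when Beam is installed
except ImportError:
    class _NoOpCounter:
        def inc(self, n: int = 1) -> None:
            pass  # silently discard the metric update

    class Metrics:  # minimal stand-in for the one call site we need
        @staticmethod
        def counter(namespace: str, name: str) -> "_NoOpCounter":
            return _NoOpCounter()

appended = Metrics.counter("cod", "instances_appended")
appended.inc()  # safe whether or not Beam is installed
```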

**Test:**
- `pydicom==2.3.0`: Original pydicom for validation
@@ -178,18 +171,6 @@ cod = CODObject(client=..., datastore_path=..., mode="w")
del cod # Lock still exists remotely!
```

**Apache Beam Pattern:**
```python
def first_dofn():
cod = CODObject(..., mode="w") # No context manager
yield cod.serialize() # Lock persists

def last_dofn(serialized_cod):
with CODObject.deserialize(**serialized_cod) as cod: # Reacquires lock
pass # Work happens here
# sync() called automatically, lock released
```

### Testing Notes

- Tests require GCS authentication (service account JSON key in `GCP_SA_KEY` secret for CI)
89 changes: 25 additions & 64 deletions README.md
@@ -1,5 +1,10 @@
# Cloud Optimized DICOM

[![PyPI version](https://img.shields.io/pypi/v/cloud-optimized-dicom)](https://pypi.org/project/cloud-optimized-dicom/)
[![Python versions](https://img.shields.io/pypi/pyversions/cloud-optimized-dicom)](https://pypi.org/project/cloud-optimized-dicom/)
[![License](https://img.shields.io/pypi/l/cloud-optimized-dicom)](https://pypi.org/project/cloud-optimized-dicom/)
[![Tests](https://github.com/gradienthealth/cloud_optimized_dicom/actions/workflows/test.yml/badge.svg)](https://github.com/gradienthealth/cloud_optimized_dicom/actions/workflows/test.yml)

A library for efficiently storing and interacting with DICOM files in the cloud.

# Development Setup
@@ -98,93 +103,49 @@ In either case, what happens if they both attempt to modify the same `CODObject`
To avoid the "first process gets overwritten by second process" outcome, we introduce the concept of "locking".

### Terminology & Concepts
A **lock** is just a file with a specific name (by default, `.cod.lock`).
A **lock** is just a file with a specific name (`.gradient.lock`).

**Acquiring a lock** means that the `CODObject` will upload a lock blob to the datastore and store its generation number. If the lock already exists, the `CODObject` will raise a `LockAcquisitionError`.
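Under the hood this is a compare-and-create: the upload succeeds only if no lock blob exists yet. A local-filesystem sketch of the same idea (hypothetical names; the real implementation uses a GCS generation precondition, not the filesystem):

```python
import os
import tempfile

def acquire_lock(path: str) -> None:
    """Atomically create the lock file, failing if it already exists.

    This mirrors a GCS upload with if_generation_match=0: the create is
    atomic, so exactly one caller can win the lock.
    """
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # stand-in for LockAcquisitionError
        raise RuntimeError(f"lock already held: {path}")
    os.close(fd)

lock_path = os.path.join(tempfile.mkdtemp(), ".gradient.lock")
acquire_lock(lock_path)           # first caller wins
try:
    acquire_lock(lock_path)       # second caller is rejected
    contended = False
except RuntimeError:
    contended = True
assert contended
```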

**State change operations** are any operations that constitute a change to the datastore (namely, appending to it).

By default, state change operations are **clean**, but they can also be **dirty**, meaning they are confined to the user's local environment and will not alter the remote datastore.

### The `CODObject(lock=?)` argument
`CODObject`s take a `lock` argument which defaults to `None`. Instantiation behavior depends on this flag:
### Access Modes
`CODObject`s take a `mode` argument that controls locking and sync behavior:

- `lock=None` -> error is raised (user is required to acknowledge their lock choice by setting this flag).
- `lock=True` -> `CODObject` will attempt to acquire a lock, and will raise an error if it cannot.
- `lock=False` -> `CODObject` will not attempt to acquire a lock. Any regular state change operations that are attempted will raise an error. dirty state change operations will be permitted, but the user will again be required to acknowledge the dirtiness of the operation by setting dirty=True in the operation call. See the state change operations section below for more info.
- `mode="r"` -> Read-only. No lock is acquired. Write operations will raise a `WriteOperationInReadModeError`.
- `mode="w"` -> Write (overwrite). A lock is acquired automatically. Starts fresh with empty metadata/tar locally. Overwrites remote tar/metadata on sync.
- `mode="a"` -> Append. A lock is acquired automatically. Fetches remote tar if it exists. Appends to existing tar/metadata on sync.

Because `CODObject(lock=True)` instantiation raises an error if the lock cannot be acquired (already exists), it is guaranteed that no other writing-enabled `CODObject(lock=True)` will be created on the same series while one already exists, thus avoiding the race condition where two workers attempt to create CODObjects with the same study/series UIDs.
Because `mode="w"` and `mode="a"` raise an error if the lock cannot be acquired (already exists), it is guaranteed that no other writing-enabled `CODObject` will be created on the same series while one already exists, thus avoiding the race condition where two workers attempt to create CODObjects with the same study/series UIDs.

### When is a lock necessary?
When the operation you are attempting involves actually modifying the COD datastore itself (example: ingesting new files), a lock is required
When the operation you are attempting involves actually modifying the COD datastore itself (example: ingesting new files), use `mode="w"` or `mode="a"`.

### Why would I ever set `lock=False`?
In some cases, like exporting or otherwise just reading data from COD but not altering it, you may not want your operation to be blocked if another process is interacting with the datastore.
For read-only operations like exporting or reading data from COD, use `mode="r"` so your operation is not blocked if another process is writing to the datastore.

### Lock Release & Management
`CODObject` is designed to be used as a context manager.
When you enter a `with` statement using a `CODObject(lock=True)`, the lock will persist for the duration of the statement, and will be released when the statement ends.
This way, all cleanup (including lock release) is handled for you.
When you enter a `with` statement, the lock will persist for the duration of the statement. On successful exit, changes are automatically synced and the lock is released.
```python
with CODObject(client=..., datastore_path=..., lock=True) as cod:
with CODObject(client=..., datastore_path=..., mode="w") as cod:
cod.append(instances)
cod.sync()
# lock exists within context
assert cod._locker.get_lock_blob().exists() is True
# lock is released when context is exited
assert cod._locker.get_lock_blob().exists() is False
# sync() called automatically, lock released
```
In the case of an error, locks are deliberately left **hanging** to indicate that the series is corrupt in some way and needs user attention.

If an exception occurs in user code (before sync), the lock is **released** — only local state was affected, so the remote datastore is not corrupt:
```python
with CODObject(client=..., datastore_path=..., lock=True) as cod:
with CODObject(client=..., datastore_path=..., mode="w") as cod:
raise ValueError("test")
# assertion will pass; lock file persists
assert cod._locker.get_lock_blob().exists() is True
# lock is released; sync was skipped since no work reached the remote datastore
```

Locks are NOT automatically released when a `CODObject` goes out of scope,
which is an explicit design choice to allow for lock persistence across serialization/deserialization (see below).
However, if the sync itself fails (meaning remote state may be partially written), the lock is deliberately left **hanging** to signal that the series may be corrupt and needs attention.
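The exit policy above boils down to three cases: release on user error, sync then release on success, and leave the lock hanging when sync fails. A minimal sketch of such an `__exit__` (hypothetical class, not the library's actual implementation):

```python
class LockedWriter:
    """Toy model of the context-manager exit policy described above."""

    def __init__(self, fail_sync: bool = False):
        self.lock_exists = True      # pretend the lock blob was uploaded
        self.synced = False
        self._fail_sync = fail_sync

    def sync(self) -> None:
        if self._fail_sync:
            raise IOError("simulated partial write")
        self.synced = True           # real code writes the tar/metadata here

    def release(self) -> None:
        self.lock_exists = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            # User code failed before sync: remote state untouched, so release.
            self.release()
            return False             # re-raise the user's exception
        try:
            self.sync()
        except Exception:
            raise                    # sync failed: propagate, leave lock hanging
        self.release()
        return False
```

On a clean exit the lock is released; on a user error it is also released (nothing remote changed); only a failed sync leaves it in place as a corruption signal.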

The tradeoff, however, is that it is possible to accidentally create hanging locks:
Locks are NOT automatically released when a `CODObject` goes out of scope. Always use a context manager (`with` statement) to ensure proper cleanup:
```python
cod_a = CODObject(client=..., datastore_path=..., lock=True)
# do some stuff
cod_a.append(instances)
cod_a.sync()
del cod_a
# cod_a is now out of scope, but lock still exists in the remote datastore
cod_b = CODObject(client=..., datastore_path=..., lock=True)
# the above will raise a LockAcquisitionError because the lock persists
# Incorrect: Lock persists indefinitely
cod = CODObject(client=..., datastore_path=..., mode="w")
del cod # Lock still exists remotely!
```
**It is YOUR responsibility as the user of this class to make sure your locks are released.**

## Serialization/Deserialization
COD was designed with apache beam workflows in mind. For this reason, `CODObject`s can be serialized into a dictionary, so that they can be conveniently shuffled or otherwise passed between `DoFn`s.

Furthermore, because CODObjects store lock generation numbers, they can actually re-acquire an existing lock if they had it previously and were serialized/deserialized. Consider the following recommended workflow:
```python
def dofn_first():
# note the LACK of "with" context manager here
cod_obj = CODObject(client=..., datastore_path=..., lock=True)
# do some stuff
yield cod_obj.serialize() # lock persists

# ... (other dofns here, also without context managers)

def dofn_last(serialized_cod):
# persistent lock reacquired during deserialization
with CODObject.deserialize(**serialized_cod, client=...) as cod_obj:
# do some stuff
cod_obj.append(instances)
cod_obj.sync()
# lock released when "with" block exited
```

It would of course work perfectly well to use a `with` statement in each `DoFn`,
but it would be unnecessarily inefficient as a unique lock would be acquired and released in each `DoFn`.


## Instance URI management: `dicom_uri` vs `_original_path` vs `dependencies`
Two main principles govern how the `Instance` class manages URIs:
1. It should be as simple and straightforward as possible to instantiate an `Instance`
47 changes: 4 additions & 43 deletions cloud_optimized_dicom/cod_object.py
@@ -3,7 +3,7 @@
import tarfile
import warnings
from tempfile import mkdtemp
from typing import Callable, Literal, Optional, Union
from typing import Literal, Optional, Union

import numpy as np
from google.api_core.exceptions import NotFound
@@ -66,7 +66,6 @@ class CODObject:
temp_dir: str - If a temp_dir with data pertaining to this series already exists, provide it here.
override_errors: bool - If `True`, delete any existing error.log and upload a new one.
empty_lock_override_age: float - If `None`, do not override a stale lock if it exists. If `float`, override a stale lock if it exists and is older than the given age (in hours).
lock_generation: int - The generation of the lock file. Should only be set if instantiation from serialized cod object.
lock: bool - DEPRECATED. Use mode="r", mode="w", or mode="a" instead.
"""

@@ -86,12 +85,6 @@ def __init__(
temp_dir: str = None,
override_errors: bool = False,
empty_lock_override_age: float = None,
# fields user should not set
lock_generation: int = None,
metadata: SeriesMetadata = None,
_tar_synced: bool = False,
_metadata_synced: bool = True,
_thumbnail_synced: bool = False,
# deprecated
lock: bool = None,
):
@@ -120,9 +113,8 @@ def __init__(
self.series_uid = series_uid
self._validate_uids()
self.hashed_uids = hashed_uids
self._metadata = metadata
self.override_errors = override_errors
self.lock_generation = lock_generation
self.lock_generation = None
# check for error.log existence - if it exists, fail initialization
if (
error_log_blob := storage.Blob.from_string(
@@ -158,8 +150,8 @@ def __init__(
elif mode in ("r", "a"):
# Read and append modes fetch existing metadata
self._get_metadata(create_if_missing=create_if_missing)
self._tar_synced = _tar_synced
self._metadata_synced = _metadata_synced
self._tar_synced = False
self._metadata_synced = True
else:
raise ValueError(f"Unexpected mode: {mode!r}")
# if the thumbnail exists, it is not synced (we did not fetch it)
@@ -837,37 +829,6 @@ def assert_instance_belongs_to_cod_object(
)
return True

# Serialization methods
def serialize(self) -> dict:
"""Serialize the object into a dict"""
state = self.__dict__.copy()
# remove client (cannot pickle)
del state["client"]
# remove locker (will be recreated on deserialization)
del state["_locker"]
# use metadata's to_dict() method to serialize
state["_metadata"] = self._metadata.to_dict()
return state

@classmethod
def deserialize(
cls,
serialized_obj: dict,
client: storage.Client,
uid_hash_func: Optional[Callable] = None,
) -> "CODObject":
metadata_dict = serialized_obj.pop("_metadata")
# Extract mode and sync_on_exit from serialized state
mode = serialized_obj.pop("_mode")
sync_on_exit = serialized_obj.pop("_sync_on_exit", True)
cod_object = CODObject(
**serialized_obj, client=client, mode=mode, sync_on_exit=sync_on_exit
)
cod_object._metadata = SeriesMetadata.from_dict(
metadata_dict, uid_hash_func=uid_hash_func
)
return cod_object

def cleanup_temp_dir(self):
"""Clean temp dir (if not done already)"""
# clean up temp dir
8 changes: 0 additions & 8 deletions cloud_optimized_dicom/locker.py
@@ -20,7 +20,6 @@ class CODLocker:

Args:
cod_object (CODObject): The COD object to lock.
lock_generation (int): (optional) The generation of the lock file to re-acquire if the lock was already known.
"""

def __init__(self, cod_object: "CODObject"):
@@ -35,16 +34,9 @@ def acquire(
create_if_missing (bool): Passthrough for CODObject.get_metadata (see documentation there)
empty_lock_override_age (float): The age (in hours) of a lock file to consider "stale" and override (if it is empty)
"""
# if the lock already exists, assert generation matches (re-acquisition case)
if (lock_blob := self.get_lock_blob()).exists():
lock_blob.reload()
lock_uri = f"gs://{lock_blob.bucket.name}/{lock_blob.name}"
# Reacquire case
if lock_blob.generation == self.cod_object.lock_generation:
logger.info(
f"COD:LOCK:REACQUIRED:{lock_uri} (generation: {self.cod_object.lock_generation})"
)
return
# No override case
if not empty_lock_override_age:
raise LockAcquisitionError(
18 changes: 0 additions & 18 deletions cloud_optimized_dicom/tests/test_cod_object.py
@@ -137,24 +137,6 @@ def test_extract_locally(self):
self.assertEqual(ds.SeriesInstanceUID, self.test_series_uid)
self.assertEqual(ds.SOPInstanceUID, self.test_instance_uid)

def test_serialize_deserialize(self):
"""Test serialization and deserialization"""
with CODObject(
client=self.client,
datastore_path=self.datastore_path,
study_uid="1.2.3.4.5.6.7.8.9.0",
series_uid="1.2.3.4.5.6.7.8.9.0",
mode="w",
sync_on_exit=False,
) as cod_obj:
serialized = cod_obj.serialize()
with CODObject.deserialize(serialized, self.client) as deserialized:
reserialized = deserialized.serialize()
# Assert all public fields are equal
for field in serialized:
if not field.startswith("_"):
self.assertEqual(serialized[field], reserialized[field])

def test_instance_read_after_sync(self):
"""Test that an instance can be read after a sync"""
delete_uploaded_blobs(self.client, [self.datastore_path])
31 changes: 0 additions & 31 deletions cloud_optimized_dicom/tests/test_hashed_uids.py
@@ -202,34 +202,3 @@ def test_append_diff_hash_dupe_with_hashed_uids(self):
)
append_result = cod_object.append([diff_hash_dupe])
self.assertEqual(append_result.conflict[0], diff_hash_dupe)

def test_serialize_deserialize_with_hashed_uids(self):
"""Test that serialize and deserialize work with hashed uids"""
cod_object = CODObject(
datastore_path=self.datastore_path,
client=self.client,
study_uid=example_hash_function(self.test_study_uid),
series_uid=example_hash_function(self.test_series_uid),
mode="w",
sync_on_exit=False,
hashed_uids=True,
)
instance = Instance(
dicom_uri=self.local_instance_path, uid_hash_func=example_hash_function
)
cod_object.append([instance])
# serialize the cod_object
serialized_cod_object = cod_object.serialize()
# deserialize the cod_object
deserialized_cod_object = CODObject.deserialize(
serialized_cod_object,
client=self.client,
uid_hash_func=example_hash_function,
)
# verify the deserialized cod_object is equal to the original cod_object
self.assertEqual(deserialized_cod_object.study_uid, cod_object.study_uid)
self.assertEqual(deserialized_cod_object.series_uid, cod_object.series_uid)
self.assertEqual(len(deserialized_cod_object._metadata.instances), 1)
self.assertEqual(
deserialized_cod_object._metadata.instances, cod_object._metadata.instances
)
9 changes: 7 additions & 2 deletions pyproject.toml
@@ -4,14 +4,19 @@ build-backend = "setuptools.build_meta"

[project]
name = "cloud-optimized-dicom"
version = "0.2.2"
version = "0.2.3"
description = "A library for efficiently storing and interacting with DICOM files in the cloud"
readme = "README.md"
authors = [
{name = "Cal Nightingale", email = "cal@gradienthealth.io"}
]
license = "MIT"
requires-python = ">=3.8"
requires-python = ">=3.11,<3.14"
classifiers = [
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
]
dependencies = [
"smart-open==7.0.4",
"ratarmountcore==0.7.1",