A reproducible, single-node benchmark harness for durable-stream servers: declarative workload suites, a Kubernetes client fleet, and exact cross-fleet HDR-percentile merging, runnable on a local kind cluster or on GKE. Workloads are server-agnostic and run against any supported implementation.
Currently supported implementations: durable-streams (Rust), the Node.js reference server (@durable-streams/server), ursula, and S2 (s2lite).
Results from a run across them: results/REPORT.md.
A suite is a JSON file (suites/*.json) that declares the workload, the systems and configs, and the sweep. scripts/bench brings up a cluster, deploys each server fresh, drives a Kubernetes client fleet, merges per-pod HDR histograms into fleet-wide percentiles, records per-cell results under results/<suite>/, and tears down. Reports regenerate from local results, no cluster required.
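The merge step is what makes the fleet-wide percentiles exact: histograms that share bucket boundaries add count by count, so a percentile read from the summed histogram reflects every sample, whereas averaging per-pod percentiles does not. The following is a minimal sketch of that idea, not ds-bench's actual code. It assumes a simplified histogram representation (a dict from bucket upper bound in microseconds to sample count), and the function names and sample data are hypothetical.

```python
from collections import Counter

def merge_histograms(per_pod):
    """Sum bucket counts across pods; keys are bucket upper bounds (microseconds)."""
    merged = Counter()
    for hist in per_pod:
        merged.update(hist)  # Counter.update adds counts for matching keys
    return merged

def percentile(hist, q):
    """Return the smallest bucket bound that covers at least q percent of samples."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    threshold = total * q / 100.0
    seen = 0
    for bound in sorted(hist):
        seen += hist[bound]
        if seen >= threshold:
            return bound
    return max(hist)

# Hypothetical data: pod A is fast and busy, pod B is slow and lightly loaded.
pod_a = {100: 900, 200: 90, 400: 10}
pod_b = {100: 10, 800: 80, 1600: 10}

fleet = merge_histograms([pod_a, pod_b])
fleet_p99 = percentile(fleet, 99)  # p99 over all 1100 samples
naive_p99 = (percentile(pod_a, 99) + percentile(pod_b, 99)) / 2  # biased by pod B
```

With this data the merged p99 and the averaged p99 disagree, because the average gives the lightly loaded pod the same weight as the busy one.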
Each of the four workloads is its own declarative suite:
- Write throughput (suites/run-{durable,ursula,s2,node}.json): append/s at saturation, plus tail latency and pod memory. Drives concurrent appends across many streams while the client fleet ramps a per-cardinality pod ladder; once server throughput stops climbing it pins the load and confirms the peak append rate, the latency at that peak, and the server's peak pod memory. This saturation walk finds the server's ceiling rather than assuming a pod count. The memory figure is the pod cgroup working set (anon plus active page cache), so a resident cache and an OS-paging design are compared on equal terms across every implementation.
- Sustained load (suites/sustained.json): latency and server-memory stability over time. Holds a fixed, modest append rate across a set of streams for a long window and watches whether latency and the server's resident memory stay flat, surfacing slow drift or leaks that a short burst would miss.
- Catch-up / reconnect (suites/catchup-{durable,ursula,s2}.json): per-client catch-up latency and body size. Pre-populates a stream, then has many clients reconnect and replay it from the beginning all at once, recording how long each client takes to catch up, how large its response is, and the aggregate replay throughput. Each implementation replays through whatever native read path it offers, whether a snapshot plus the tail since that snapshot or a full scan of the log.
- SSE fan-out (scripts/run-sse.sh): per-event delivery latency and memory vs subscriber count. Has a single writer publish to one stream while a growing number of subscribers stream it, measuring the per-event end-to-end delivery latency as the fan-out widens.
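The saturation walk described under write throughput can be pictured as a ladder climb that stops once an extra rung no longer buys a meaningful gain. The sketch below is illustrative only: the `measure` callback, the 5% `min_gain` threshold, and the function names are assumptions made for the example, and the harness's real saturation classifier may use different criteria.

```python
def find_saturation(pod_ladder, measure, min_gain=0.05):
    """Climb the ladder until the throughput gain drops below min_gain.

    Returns (pods, appends_per_sec) for the best rung seen.
    """
    best_pods, best_rate = None, 0.0
    for pods in pod_ladder:
        rate = measure(pods)
        if best_pods is not None and rate < best_rate * (1 + min_gain):
            # Throughput stopped climbing: keep the higher of the two rungs.
            if rate > best_rate:
                best_pods, best_rate = pods, rate
            break
        best_pods, best_rate = pods, rate
    return best_pods, best_rate

# Hypothetical server that tops out at 50k append/s regardless of client count.
def fake_measure(pods):
    return min(pods * 12_000, 50_000)

peak = find_saturation([1, 2, 4, 8, 16], fake_measure)
```

Here the walk stops at the 8-pod rung and never runs the 16-pod rung, since moving from 4 to 8 pods gains less than the threshold.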
- kubectl, python3 (3.x, stdlib only), Docker.
- Local: kind.
- Remote: gcloud authenticated; an Artifact Registry repo. Override PROJECT (defaults to gcloud config get-value project), AR_LOCATION (default europe-west1), AR_REPO (default ds-bench), ZONE, and the machine types (SERVER_MACHINE, CLIENT_MACHINE) for your environment.
- The durable-streams server source checked out alongside this repo, only if you build its image yourself. ursula and S2 use upstream-published images (ghcr.io/tonbo-io/ursula, ghcr.io/s2-streamstore/s2), so there is no source to vendor.
DS_TARGET=local scripts/cluster-up.sh # kind cluster + MinIO + metrics ConfigMap
DS_TARGET=local scripts/build-images.sh # build server + ds-bench images, load into kind
# `*-local` suites use small ladders/counts that fit a single kind node:
DS_TARGET=local scripts/bench suites/write-throughput-local.json run # run a workload
DS_TARGET=local scripts/bench suites/catchup-local.json run # another workload
scripts/bench suites/write-throughput-local.json report # (re)generate its report
DS_TARGET=local scripts/cluster-down.sh # tear down

DS_TARGET=local runs everything against the kind cluster, with no cloud and no teardown of kind. The full suites/*.json are sized for a multi-node GKE cluster; for kind, use the *-local suites (or copy one and shrink stream_counts / pod_ladder / clients). run writes results/<suite>/, and re-running resumes, skipping finished cells.
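Resuming amounts to expanding the sweep into cells and skipping every cell that already has a recorded result. The sketch below illustrates that logic under stated assumptions: the suite is reduced to two keys, `modes` and `stream_counts` (names that appear in this README, though the real schema is not shown here), and the result store is modelled as a dict keyed by a hypothetical `mode/stream_count` string. The actual suite and store formats may differ.

```python
import itertools
import json

def expand_cells(suite):
    """Expand a suite's sweep into (mode, stream_count) cells."""
    return list(itertools.product(suite["modes"], suite["stream_counts"]))

def pending_cells(suite, done):
    """Return cells with no recorded result, preserving sweep order."""
    return [c for c in expand_cells(suite) if f"{c[0]}/{c[1]}" not in done]

# Hypothetical suite and result store, for illustration only.
suite = json.loads('{"modes": ["wal", "memory"], "stream_counts": [1, 100]}')
done = {"wal/1": {"status": "finished"}}

todo = pending_cells(suite, done)  # everything except the finished wal/1 cell
```

Keying results by cell identity rather than by run order is what lets an interrupted run pick up where it stopped without repeating finished work.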
PROJECT=my-project scripts/build-images.sh # Cloud Build → Artifact Registry
PROJECT=my-project scripts/bench suites/run-durable.json run # one system
PROJECT=my-project scripts/run-matrix.sh # all systems, ≤3 GKE clusters in parallel

scripts/cluster-up.sh (invoked by bench) creates a server node pool (one node, server CPU-pinned) and a Spot client pool. See scripts/target-env.sh for all overridable env (registry, zone, machine types, pull policy). Remote clusters are billable: they tear down on clean completion, and scripts/teardown-watchdog.sh is a deadline safety net.
A workload is server-agnostic: the same suite runs against any supported implementation, chosen by the suite's modes and the server image that gets deployed. Every server points at the same single-node MinIO, and only the system under test is running while it is measured.
- durable-streams (Rust): the server this harness was built alongside. Runs WAL-backed (--durability wal, a sharded committer, with or without the resident tail cache) or without a WAL (--durability memory); suites/run-durable.json runs the wal / wal-tailcache / memory variants side by side.
- Node.js reference server (@durable-streams/server): the protocol's reference implementation. It shares the wire protocol, so it reuses the durable API style and runs as mode node (in-memory storage). Being TypeScript rather than a compiled binary, its image is built from the ../durable-streams monorepo (pnpm workspace, started under Node); see dockerfiles/durable-node.Dockerfile. build-images.sh builds it by default (BUILD_NODE=0 to skip).
- ursula: a single-node Raft server. The storage backend is chosen at deploy time via URSULA_WAL (memory keeps the log in RAM; disk, the default, writes a WAL and fsyncs on every commit), so suites/run-ursula.json covers both.
- S2 (s2lite): object-store-backed.
Adding another implementation comes down to a deployment manifest, a ds-bench API style for its wire protocol, and a few addressing lines in deploy_mode and reset_state in scripts/lib-bench.sh.
- Run the write-throughput suites individually (scripts/bench suites/run-<system>.json run) or all at once with scripts/run-matrix.sh (≤3 GKE clusters in parallel).
- Raw data: each run writes its per-cell data under results/<suite>/ (the cells.json result-and-resume store, the merged HDR histograms, and the sidecar samples.csv).
- Published dataset: the report and curated data for the run in this repo live in results/.
- Regenerate reports (purely from local files, no cluster): scripts/bench suites/<suite>.json report for most workloads, python3 scripts/catchup_report.py suites/catchup-*.json for catch-up, and scripts/run-sse.sh for SSE.
Framework logic is unit-tested, no cluster required:
(cd scripts && for t in *_test.py; do python3 "$t"; done) # subshell, so the next line still runs from the repo root
for t in scripts/*_test.sh; do bash "$t"; done

These cover the suite loader, the per-cell result stores, the saturation classifier, the catch-up and sustained runners, and the report renderers.
These benchmarks target single-node deployments. Within that scope, every run is kept on equal footing — one node per server, identical workload parameters, a shared single-node MinIO, fresh data each run, and only the system under test running while it is measured — and all numbers are generated by ds-bench on equal hardware, not reused from any implementation's published results.
Future work: extend the harness to replicated and other deployment topologies; the current workloads and suites assume a single node.
The benchmark methodology is based on ursula's published benchmark (ursula.tonbo.io/benchmark); ds-bench is derived from its ursula-bench (Apache-2.0).