BHRT Distributed Rendering with Slurm + Tailscale

Distribute black hole ray tracing frame renders across multiple machines connected via a Tailscale mesh, with Slurm scheduling the work.

Architecture

┌────────────────────────────────────────────────────────┐
│  Your machine (macOS)                                  │
│    submit_render.sh  ──sbatch──►  Slurm controller     │
│    collect_frames.sh ◄──rsync──  compute nodes         │
│    make_video.sh     ──ffmpeg──►  blackhole.mp4        │
└────────────────────────────────────────────────────────┘
        │ Tailscale VPN (100.x.y.z)
        ├───── node01 (8 CPU)   ── renders frames 0-19
        ├───── node02 (16 CPU)  ── renders frames 20-39
        ├───── gpu01  (GPU)     ── renders frames 40-59
        └───── ...

Each frame is an independent Slurm array task. Slurm distributes tasks across available nodes automatically. Frames are collected back to the head node and assembled into a video.
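Under the hood, submit_render.sh boils down to a single job-array submission. A minimal sketch of the equivalent sbatch call follows (illustrative only: the exported variable names T0, DT, and PREFIX are assumptions, not the script's actual interface):

# One array index per frame; render_frame.sbatch derives its frame time
# as t = t0 + dt * SLURM_ARRAY_TASK_ID.
sbatch --array=0-119 \
       --partition=cpu \
       --cpus-per-task=4 \
       --mem=4G \
       --export=ALL,T0=0.0,DT=0.5,PREFIX=frame \
       slurm/render_frame.sbatch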

Quick Start

1. Set up Tailscale on every machine

# Install: https://tailscale.com/download
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

# Verify connectivity between nodes:
tailscale ping <other-node>
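
# List every machine on the tailnet with its 100.x.y.z address
# (slurm.conf will reference these addresses later):
tailscale status
tailscale ip -4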

2. Bootstrap the head node

# On the machine that will run slurmctld:
sudo ./slurm/setup_node.sh --head

This installs Slurm + MUNGE, builds the renderer at /opt/bhrt/, and generates a MUNGE key.
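Before moving on, it is worth confirming that the daemons came up; the service names below assume a systemd-based install:

systemctl status slurmctld munge   # both should be active (running)
ls -l /etc/munge/munge.key         # key should exist, owned by munge, mode 0400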

3. Configure the cluster

Edit /etc/slurm/slurm.conf on the head node:

# Set the head node hostname:
SlurmctldHost=your-head-node-tailscale-hostname

# Add your nodes (use Tailscale IPs from `tailscale ip -4`):
NodeName=node01  NodeAddr=100.64.0.1  CPUs=8   RealMemory=16000  State=UNKNOWN
NodeName=node02  NodeAddr=100.64.0.2  CPUs=16  RealMemory=32000  State=UNKNOWN
NodeName=gpu01   NodeAddr=100.64.0.10 CPUs=16  RealMemory=64000  Gres=gpu:1  State=UNKNOWN

# Define partitions:
PartitionName=cpu  Nodes=node01,node02  Default=YES  MaxTime=INFINITE  State=UP
PartitionName=gpu  Nodes=gpu01          Default=NO   MaxTime=INFINITE  State=UP
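After editing slurm.conf, reload the controller so the new node and partition definitions take effect:

sudo systemctl restart slurmctld
# Many settings can also be picked up without a restart:
sudo scontrol reconfigure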

4. Set up compute nodes

# Copy MUNGE key from head to each compute node:
scp /etc/munge/munge.key node01:/etc/munge/munge.key
scp /etc/munge/munge.key node02:/etc/munge/munge.key

# MUNGE rejects keys with loose permissions; fix ownership and mode:
ssh node01 'sudo chown munge:munge /etc/munge/munge.key && sudo chmod 400 /etc/munge/munge.key'
ssh node02 'sudo chown munge:munge /etc/munge/munge.key && sudo chmod 400 /etc/munge/munge.key'

# Copy slurm.conf to each compute node:
scp /etc/slurm/slurm.conf node01:/etc/slurm/slurm.conf
scp /etc/slurm/slurm.conf node02:/etc/slurm/slurm.conf

# Run setup on each compute node:
ssh node01 'sudo /opt/bhrt/slurm/setup_node.sh'
ssh node02 'sudo /opt/bhrt/slurm/setup_node.sh'

5. Verify the cluster

sinfo                          # Should show all nodes in "idle" state
./slurm/cluster_status.sh      # Pretty-printed cluster overview
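With the example three-node layout above, the sinfo output should look roughly like this:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu*         up   infinite      2   idle node[01-02]
gpu          up   infinite      1   idle gpu01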

6. Submit a render

# Render 120 frames, distributed across all nodes:
./slurm/submit_render.sh -n 120 -dt 0.5 -t0 0.0

# With GPU acceleration:
./slurm/submit_render.sh -n 120 --gpu

# With multi-layer EXR output (for flaresim / grading):
./slurm/submit_render.sh -n 120 --exr

# With both EXR and HDR:
./slurm/submit_render.sh -n 120 --exr --hdr

# Dry run (preview the sbatch command):
./slurm/submit_render.sh -n 120 --dry-run

7. Monitor progress

./slurm/cluster_status.sh -w   # Live dashboard (refreshes every 5s)
squeue -u $USER                # Raw Slurm queue

8. Collect frames and make the video

# Pull frames from all nodes (or verify if using shared FS):
./slurm/collect_frames.sh

# If using a shared filesystem:
./slurm/collect_frames.sh --shared-fs

# Assemble TGA frames into video:
./make_video.sh -i build/frames -o blackhole.mp4

# Assemble from EXR (with automatic tonemapping):
./make_video.sh -i build/frames -f exr -o blackhole.mp4

# Post-process EXR frames through flaresim before encoding:
# for f in build/frames/frame_*.exr; do
#   ./flaresim/build/flaresim --input "$f" --output "build/frames/graded_$(basename "$f")" ...
# done
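For reference, make_video.sh is an ffmpeg wrapper; a roughly equivalent manual invocation (the frame filename pattern here is an assumption) is:

# Assumes frames named frame_0000.tga, frame_0001.tga, ...
ffmpeg -framerate 30 -i build/frames/frame_%04d.tga \
       -c:v libx264 -pix_fmt yuv420p blackhole.mp4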

File Overview

File                 Purpose
slurm.conf.template  Slurm configuration template — edit and copy to /etc/slurm/slurm.conf
setup_node.sh        Bootstrap script for each node (installs deps, builds renderer, starts daemons)
render_frame.sbatch  Slurm batch script — renders a single frame (one per array task)
submit_render.sh     Main entry point — submits a job array to Slurm
collect_frames.sh    Gathers frames from nodes, verifies completeness
cluster_status.sh    Live cluster dashboard (Tailscale + Slurm + frame progress)

Options Reference

submit_render.sh

Flag       Default    Description
-n         60         Number of frames
-t0        0.0        Starting time value
-dt        0.5        Time step per frame
-p         frame      Filename prefix
-d         /opt/bhrt  Project directory on nodes
-P         cpu        Slurm partition
-c         4          CPUs per task
-m         4G         Memory per task
-T         01:00:00   Wall time limit per frame
--gpu      off        Use GPU partition + request 1 GPU
--exr      off        Also output multi-layer OpenEXR per frame
--hdr      off        Also output Radiance HDR per frame
--dry-run  off        Preview without submitting

collect_frames.sh

Flag         Default         Description
-n           from job info   Expected frame count
-d           /opt/bhrt       Project directory
-L           ./build/frames  Local collection directory
--exr        from job info   Expect / collect EXR frames
--hdr        from job info   Expect / collect HDR frames
--shared-fs  off             Skip rsync, verify in-place
--nodes      auto-detect     Comma-separated node list
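For example, to collect 120 frames from an explicit pair of nodes:

./slurm/collect_frames.sh -n 120 --nodes node01,node02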

Networking Notes

  • All Slurm traffic (slurmctld ↔ slurmd) flows through the Tailscale mesh
  • Node addresses in slurm.conf must be Tailscale IPs (100.x.y.z) or hostnames
  • SSH for rsync in collect_frames.sh also routes through Tailscale
  • No port forwarding or public IPs required
  • Tailscale ACLs can restrict which nodes can communicate (see the sketch below)
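A minimal ACL sketch, assuming every cluster machine carries a tag:bhrt tag (the tag name is an assumption; 6817 and 6818 are Slurm's default slurmctld and slurmd ports):

// Tailscale ACL policy (HuJSON), edited in the admin console
{
  "acls": [
    // Allow cluster nodes to reach each other over SSH/rsync and Slurm
    {"action": "accept", "src": ["tag:bhrt"], "dst": ["tag:bhrt:22,6817,6818"]}
  ]
}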

Troubleshooting

Nodes show as "down" in sinfo

# Check slurmd is running on the node:
ssh node01 'systemctl status slurmd'

# Check Tailscale connectivity:
tailscale ping node01

# Resume a drained node:
sudo scontrol update NodeName=node01 State=RESUME

MUNGE authentication errors

# Verify same key on all nodes:
md5sum /etc/munge/munge.key    # must match everywhere

# Restart MUNGE:
sudo systemctl restart munge
munge -n | unmunge             # test locally
ssh node01 'munge -n' | unmunge  # test cross-node
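
# If the keys match but auth still fails, check ownership and mode:
# munged refuses keys it considers insecure.
ls -l /etc/munge/munge.key     # should be munge:munge, mode 0400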

Frames missing after render

# Check which tasks failed:
sacct -j <JOB_ID> --format=JobID,State,ExitCode,NodeList

# View error logs:
cat /opt/bhrt/build/logs/frame_<JOB_ID>_<TASK_ID>.err

# Re-render only missing frames:
./slurm/collect_frames.sh -n 120   # reports missing indices
sbatch --array=5,12,47 slurm/render_frame.sbatch
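
# Caveat: render_frame.sbatch presumably reads its render parameters from
# the submission environment, so export the same values as the original
# run when re-submitting manually (variable names here are assumptions):
T0=0.0 DT=0.5 PREFIX=frame sbatch --array=5,12,47 --export=ALL slurm/render_frame.sbatch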

Clock skew between nodes

MUNGE credentials expire five minutes after they are issued (by default), so node clocks must agree to within that window. Keep them synced with NTP:

sudo timedatectl set-ntp true
timedatectl status