Pooling per-node scratch storage with mergerfs
After my VSCode-on-Slurm post seemed to land well, here’s another small tool that has saved me an embarrassing amount of cd-ing around our cluster: mergerfs.
TL;DR: mergerfs is a FUSE filesystem that lets you mount N directories as if they were one. I use it to pool the per-node scratch storage on our SLURM cluster into a single tidy mount under my home directory, so I can
`ls`, `tail`, and `rsync` across nodes without thinking about which node a file lives on.
📥 In a hurry? Download cluster_mergerfs.sh — the interactive wrapper script discussed below.
The problem
Like a lot of academic clusters, ours doesn’t have one big shared scratch pool. Each compute node has its own local-ish fast storage:
```
/cluster/node1/<user>/...
/cluster/node2/<user>/...
/cluster/node3/<user>/...
/cluster/node4/<user>/...
```
This is great for IO performance — jobs hit the storage attached to the node they run on — but bad for the human running them. A single training run scheduled across multiple nodes leaves logs scattered everywhere. Tailing one log means knowing which node produced it. Backing things up means looping over nodes. Pointing tools (TensorBoard, viewers, scripts) at a single root is impossible.
What I want is a view: one path that looks like the union of all those per-node directories, transparently.
That’s exactly what mergerfs gives you.
What mergerfs is (mental model)
mergerfs is a userspace, policy-driven union filesystem. You hand it a list of branches (existing directories) and a mount point, and it presents the union of those branches at the mount point. Reads transparently fall through to whichever branch has the file; writes get routed to a branch according to a create policy you configure.
Crucially, mergerfs doesn’t move data. Each file still lives on exactly one branch on disk — mergerfs just makes the namespace look unified. If you bypass the mount and look at a branch directly, you still see the real underlying files. That’s what makes it safe to bolt on top of an existing setup without migrations.
Prerequisites
Before any of this works, two things need to be true on every node you want to mount on:
- mergerfs is installed. It's packaged for most distributions (`apt install mergerfs`, `dnf install mergerfs`, etc.); otherwise grab a release from the project's GitHub. How you push that out is up to you (Ansible, Salt, your distro's image, or just SSH-ing it in) — the important thing is that the binary is there and on `PATH`.
- FUSE allows `user_allow_other`. mergerfs is intended to run as a normal user (don't run it as root), and you'll almost certainly want the merged view to be readable by other tools/users on the box — daemons, jobs running as a different user, root doing backups. That requires `user_allow_other` in `/etc/fuse.conf`:

  ```
  # /etc/fuse.conf
  user_allow_other
  ```

  This is a one-time, root-level change per node.
SSH from where you run the script
The script orchestrates every node by SSHing into it (`ssh -o ConnectTimeout=5 <node> bash <<EOF ... EOF`), so the host you launch it from needs to be able to SSH into every cluster node without a password prompt. In practice that means:
- Key-based auth set up to all nodes (an `ssh-agent` with your key loaded, or a key without a passphrase). The script doesn't handle interactive password entry — it'll just hang.
- Hostnames resolvable by the names that come back from `sinfo` / `scontrol show hostnames`. If your SLURM short hostnames aren't directly DNS-resolvable, add an entry per node in `~/.ssh/config`:

  ```
  Host node1
      HostName node1.cluster.example.org
      User <your-user>
  ```

- `bash` available on the remote side (the script runs heredoc'd bash on each node).
- No host-key prompts on first contact. Either pre-populate `~/.ssh/known_hosts`, or add `StrictHostKeyChecking=accept-new` in `~/.ssh/config` for your cluster hosts so the first SSH doesn't block waiting for a y/n.
If you can run `ssh <node> hostname` against every node and it just prints the hostname with no prompts, you're set.
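That check is easy to automate. Here is a sketch of a preflight loop; the `check_ssh_nodes` helper and the `SSH_CMD` override are my conventions for illustration (and testing), not part of ssh or the script:

```shell
#!/usr/bin/env bash
# Preflight: verify passwordless SSH to every node. BatchMode=yes makes ssh
# fail fast instead of hanging on a password prompt, which is exactly what
# you want inside a script. SSH_CMD lets you substitute a stub for dry runs.
check_ssh_nodes() {
    local ssh_cmd=${SSH_CMD:-ssh}
    local failed=0
    for node in "$@"; do
        local out
        if out=$($ssh_cmd -o BatchMode=yes -o ConnectTimeout=5 "$node" hostname 2>/dev/null); then
            echo "ok:   $node -> $out"
        else
            echo "FAIL: $node (fix keys/known_hosts first)"
            failed=1
        fi
    done
    return $failed
}

# Real usage would be something like:
#   check_ssh_nodes $(scontrol show hostnames)
```

The nonzero return code when any node fails makes it easy to gate the rest of a script on the preflight passing.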
My mergerfs options
Here are the options I mount with, and why each one:
```
cache.files=off,use_ino,func.getattr=newest,
category.create=mfs,moveonenospc=true,
minfreespace=300G,allow_other
```
- `cache.files=off` — disable kernel page caching of file content. With networked-ish per-node storage this avoids stale-cache surprises; performance is rarely the bottleneck for what I use the merged view for (browsing logs, rsync).
- `use_ino` — preserve the underlying filesystem's inode numbers in the merged view. Tools that compare files by `(dev, ino)` (rsync, tar) behave correctly.
- `func.getattr=newest` — if the same path exists on several branches, `stat()` returns the version with the most recent mtime. Handy when a job's log file ends up on multiple nodes.
- `category.create=mfs` — most-free-space create policy. New files go to whichever branch currently has the most free space. Keeps things naturally balanced.
- `moveonenospc=true` — if a write fails with `ENOSPC`, mergerfs will transparently move the file to another branch and retry. Saves a job from dying because the branch it landed on filled up.
- `minfreespace=300G` — branches with less than 300 GB free are excluded from new file creation. With ML workloads writing big checkpoints this prevents pathological "write 200 GB onto a branch that had 250 GB and lock everything up" cases.
- `allow_other` — combined with `user_allow_other` in `fuse.conf` (above), allows other users on the box to read the mount.
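To make the create policy concrete, here is a rough shell sketch of what "most free space wins" plus `minfreespace` amounts to. The `pick_mfs_branch` helper and its KB-based threshold are my simplifications for illustration; mergerfs implements this natively inside the FUSE layer:

```shell
#!/usr/bin/env bash
# pick_mfs_branch: crude approximation of mergerfs's category.create=mfs.
# Prints the branch with the most free space, skipping any branch below a
# minimum free threshold (the minfreespace idea, here in 1K blocks).
pick_mfs_branch() {
    local min_free_kb=$1; shift
    local best_branch="" best_avail=-1
    for branch in "$@"; do
        local avail
        # df -P: POSIX format, available space is column 4 of line 2
        avail=$(df -P "$branch" | awk 'NR==2 {print $4}')
        (( avail < min_free_kb )) && continue   # excluded, like minfreespace
        if (( avail > best_avail )); then
            best_avail=$avail
            best_branch=$branch
        fi
    done
    [[ -n $best_branch ]] && printf '%s\n' "$best_branch"
}

# Example: with a zero threshold, picks whichever directory's filesystem
# currently reports the most free space.
pick_mfs_branch 0 /tmp /var/tmp
```

If every branch is below the threshold the function prints nothing and returns nonzero, which mirrors the failure mode you hit when all branches are under `minfreespace`.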
A script to manage it across the cluster
Mounting one merged view by hand is `mergerfs <opts> /a:/b:/c /mnt`. Doing that interactively across N nodes for M merge sets, and remembering to clean up, gets old fast. So I wrote a small interactive bash tool. It can:
- Mount any subset of merge sets on any subset of nodes (creates source dirs if missing, refuses to mount if a source is unreachable, replaces stale mounts).
- Check mounts — for each (set, node), show whether it's active, the merged-view `df`, the active mount options, per-source disk usage, and flag any branch that has dropped below `minfreespace` (so you know mergerfs has stopped writing to it). Optional file-count and `du -sh` (gated, because they're slow on big trees).
- Unmount safely (`fusermount -uz`).
- Kill mergerfs process — for the case where a mount has gone unresponsive: it locates the right `mergerfs` PID by matching the mount point at the end of `/proc/<pid>/cmdline` (so a path that appears as a source in another mergerfs instance doesn't false-match), `kill -9`s it, then attempts `fusermount -uz`, falling back to `umount -l` if needed.
- Show rsync commands — emit ready-to-paste `rsync` lines for migrating each merge set into a single consolidated directory, for the day you outgrow the union view and want to physically merge.
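The "show rsync commands" action boils down to string slicing on a merge-set entry. A simplified sketch (the `emit_rsync_cmds` name and the `-aHv` flag choice are mine; the real script has more guard rails):

```shell
#!/usr/bin/env bash
# emit_rsync_cmds: given one "mount:src1:src2:..." entry, print one rsync
# line per source branch, consolidating everything into the mount directory.
emit_rsync_cmds() {
    local entry=$1
    local mount_point=${entry%%:*}   # text before the first ':'
    local sources=${entry#*:}        # everything after it
    local IFS=':'                    # split the remaining sources on ':'
    for src in $sources; do
        printf 'rsync -aHv %s/ %s/\n' "$src" "$mount_point"
    done
}

emit_rsync_cmds "/home/me/logs:/cluster/node1/me/logs:/cluster/node2/me/logs"
```

The trailing slashes matter: `src/` copies the directory's contents rather than nesting the directory itself under the destination.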
Node discovery uses `sinfo` + `scontrol show hostnames` so it always picks up what SLURM currently knows about, plus the head node.
The whole thing is one self-contained file. Configure your merge sets at the top — each entry is `mount_point:source1:source2:...:sourceN` — and run it. The interactive menu does the rest.
📥 Download cluster_mergerfs.sh (~700 lines, single file, no dependencies beyond mergerfs, ssh, and standard SLURM tooling).
The configuration block at the top is the only thing you need to edit:
```bash
# Format: local_mount:path1:path2:path3
MERGE_SETS=(
    "/home/<user>/projects/projectA/logs:/cluster/node1/<user>/projectA_logs:/cluster/node2/<user>/projectA_logs:/cluster/node3/<user>/projectA_logs:/cluster/node4/<user>/projectA_logs"
    "/home/<user>/projects/projectB/scenes:/cluster/node1/<user>/projectB/scenes"
    "/home/<user>/projects/projectB/outputs:/cluster/node1/<user>/projectB/outputs"
)

MERGERFS_OPTS="cache.files=off,use_ino,func.getattr=newest,category.create=mfs,moveonenospc=true,minfreespace=300G,allow_other"
```
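Conveniently, mergerfs itself separates its branch list with `:`, so the sources part of an entry can be handed to mergerfs verbatim. A sketch of how one entry could become a mount command (`build_mount_cmd` is an illustrative helper, not taken from the script):

```shell
#!/usr/bin/env bash
MERGERFS_OPTS="cache.files=off,use_ino,func.getattr=newest,category.create=mfs,moveonenospc=true,minfreespace=300G,allow_other"

# build_mount_cmd: turn one MERGE_SETS entry ("mount:src1:src2:...") into
# the mergerfs invocation. The sources slice needs no re-joining, because
# mergerfs also uses ':' as its branch separator.
build_mount_cmd() {
    local entry=$1
    local mount_point=${entry%%:*}   # text before the first ':'
    local branches=${entry#*:}       # everything after it
    printf 'mergerfs -o %s %s %s\n' "$MERGERFS_OPTS" "$branches" "$mount_point"
}

build_mount_cmd "/home/me/logs:/cluster/node1/me/logs:/cluster/node2/me/logs"
```

In the real script this string is what gets shipped to each node over SSH rather than executed locally.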
A few practical notes
- mergerfs mounts are per-machine. Each compute node sees the merged view through its own mergerfs process; there's no cluster-wide FS magic. The script just SSH-loops the same mount on every node so you get a consistent view wherever you land.
- Run as your user, not root. Combined with `allow_other`, that gives you the right ownership and avoids weird `chown` aftermath.
- Don't write into a branch directly while the merged view is mounted on top of it — mergerfs caches some metadata, so changes that bypass the union can confuse it briefly. If in doubt, write through the mount point.
- `minfreespace` is your friend. Without it, a single full branch can break creates pretty unintuitively. With it, mergerfs just rotates traffic onto branches that still have headroom.
- If a mount hangs, the "Kill mergerfs process" action is much safer than `kill -9 $(pgrep mergerfs)` — it specifically matches the mount point at the end of the cmdline so it never kills a different instance.
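That mount-point matching relies on `/proc/<pid>/cmdline` being NUL-separated, so the last argument can be extracted unambiguously. A sketch of just the matching predicate (the `cmdline_ends_with_mount` name is mine):

```shell
#!/usr/bin/env bash
# cmdline_ends_with_mount: does this NUL-separated cmdline file name the
# given mount point as its LAST argument? A path that only appears earlier
# (e.g. as a source branch of another mergerfs instance) must not match.
cmdline_ends_with_mount() {
    local cmdline_file=$1 mount_point=$2
    local last
    # Turn NULs into newlines, then keep the final non-empty field.
    last=$(tr '\0' '\n' < "$cmdline_file" | awk 'NF {last=$0} END {print last}')
    [[ $last == "$mount_point" ]]
}

# Demo with a fabricated cmdline file (real ones live in /proc/<pid>/cmdline):
tmp=$(mktemp)
printf '%s\0' mergerfs -o allow_other "/a:/b" "/mnt/merged" > "$tmp"
cmdline_ends_with_mount "$tmp" /mnt/merged && echo "match"
cmdline_ends_with_mount "$tmp" "/a:/b" || echo "source path ignored"
rm -f "$tmp"
```

The full kill action would loop over `/proc/[0-9]*/cmdline`, apply this predicate, and only then signal the matching PID.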
Wrap-up
mergerfs isn’t going to replace a real distributed filesystem, and that’s fine — it’s not trying to. What it does give you is a five-minute Saturday-afternoon way to make per-node storage feel like one place, without moving any data and without admin-team meetings about provisioning. Combined with a small wrapper script, it’s been one of the higher quality-of-life upgrades to my day-to-day workflow on the cluster.
Disclaimer
The script linked above is released under the MIT License and is provided “as is”, without warranty of any kind, express or implied. It performs operations that touch filesystems, mount points, and remote machines via SSH, and includes destructive actions (unmount, kill -9). You are solely responsible for understanding what it does before running it, for testing it in a safe environment first, and for any data loss, downtime, or other consequences that may result from its use. Use at your own risk.