PostgreSQL Across Two Machines, Byte-Perfect: The Three Bugs in the Way

KEIBIDROP mounts another machine's folder over FUSE, peer-to-peer, end-to-end encrypted. You read a file and the bytes stream from the other peer on demand, with no cloud in the middle.

Two earlier posts set this up. In the file-descriptor recycling post we fixed the race that made initdb produce half-empty files, after which PostgreSQL ran on a single Linux FUSE mount. In the cross-machine post we tried the harder version, writing the database on one machine and reading it on another, and it almost worked: 977 files synced, a few went missing, the database would not start.

This is the resolution. Initialize a PostgreSQL data directory on a Linux VPS's mount, insert and update rows, stop, remove the lock file; then start PostgreSQL on a macOS machine against the same directory, synced peer-to-peer, with nothing prefetched. The receipt:

# VPS (Linux): initdb on the FUSE mount, 1000 inserts + 333 updates, CHECKPOINT, stop, rm postmaster.pid
SELECT count(*) || '|' || sum(n) || '|' || md5(string_agg(val, ',' ORDER BY id)) FROM t;
  -> 1000|500500|96f27461ce6316f124efc9b0ffba995a

# Mac (macOS): start postgres on the SAME data dir, synced peer-to-peer
  -> 1000|500500|96f27461ce6316f124efc9b0ffba995a     # identical

Same PostgreSQL 16 data directory. Written on Linux, read on macOS, over a cold on-demand FUSE mount across a bridged WAN link. Byte-identical. The Mac then INSERTed a row and VACUUMed clean, with no errors in the server log.

Getting from "almost" to byte-identical took three bugs, each a concrete race, each now reproducible and fixed.

The foundation: 0.3.3

0.3.3 was about making on-demand reads correct, which is the prerequisite for everything else.

Random seeks return the right bytes. The chunk bitmap is 512 KiB-granular. An on-demand read of a few KB used to mark the whole chunk present, so a later read elsewhere in that chunk took the fast path and served sparse-hole zeros. Fixed by fetching chunk-aligned blocks and only marking chunks that are fully written.
Linux copies got about 10x faster. We were reporting ext4's 4 KiB st_blksize instead of our own 256 KiB block size, so cp and dd used 4 KiB buffers and a 1 GB read crawled. The fix was a single Getattr field.
Size-based prefetch, default off. Aggressive prefetch saturates a relay link and starves the seeks you actually want, so it is opt-in.

You cannot run PostgreSQL on a filesystem that returns the wrong bytes for a random read. This was step zero.
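
The chunk-bitmap fix can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not KEIBIDROP's code: the 512 KiB granularity comes from the description above, while chunkBitmap, alignToChunks, and markWritten are hypothetical names. It widens a small read to whole chunks and marks a chunk present only when the written range covers all of it.

```go
package main

import "fmt"

// chunkSize is the bitmap granularity described in the post (512 KiB).
const chunkSize = 512 << 10

// alignToChunks widens a read of n bytes at off to whole chunks,
// clamped to the file size, and returns the byte range to fetch.
func alignToChunks(off, n, fileSize int64) (start, end int64) {
	start = (off / chunkSize) * chunkSize
	end = ((off + n + chunkSize - 1) / chunkSize) * chunkSize
	if end > fileSize {
		end = fileSize
	}
	return start, end
}

// chunkBitmap records which chunks are fully present in the local cache.
// Hypothetical type, for illustration only.
type chunkBitmap struct {
	bits     []bool
	fileSize int64
}

func newChunkBitmap(fileSize int64) *chunkBitmap {
	n := (fileSize + chunkSize - 1) / chunkSize
	return &chunkBitmap{bits: make([]bool, n), fileSize: fileSize}
}

// markWritten marks only the chunks entirely covered by [start, end).
// The final, short chunk counts as full once the range reaches end of file.
func (b *chunkBitmap) markWritten(start, end int64) {
	for i := range b.bits {
		cs := int64(i) * chunkSize
		ce := cs + chunkSize
		if ce > b.fileSize {
			ce = b.fileSize
		}
		if start <= cs && end >= ce {
			b.bits[i] = true
		}
	}
}

// present reports whether the chunk containing off is fully cached.
func (b *chunkBitmap) present(off int64) bool {
	return b.bits[off/chunkSize]
}

func main() {
	const fileSize = 3*chunkSize + 100
	bm := newChunkBitmap(fileSize)

	// A 4 KiB write in the middle of chunk 1 must not mark the chunk.
	bm.markWritten(chunkSize+8192, chunkSize+8192+4096)
	fmt.Println(bm.present(chunkSize)) // false

	// Fetching the chunk-aligned range instead covers the whole chunk.
	s, e := alignToChunks(chunkSize+8192, 4096, fileSize)
	bm.markWritten(s, e)
	fmt.Println(s, e, bm.present(chunkSize)) // 524288 1048576 true
}
```

With the old behavior, the first markWritten call would have set the bit, and a later read elsewhere in chunk 1 would have been served zeros from the sparse cache file.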

Bug one: data races on file metadata

A File object lives in three maps: RemoteFiles (guarded by RemoteFilesLock), AllFileMap (AfmLock), and OpenFileHandlers (OpenMapLock). The same pointer gets pulled from one into another when a notification moves a file from remote to local.

Every FUSE handler touched that File's fields under whichever map lock it had used to find the File:

Getattr  : mutates f.stat in place    under AfmLock
Read     : reads   f.stat.Size        under OpenMapLock
OpenEx   : reads   f.IsLocalPresent   under AfmLock
Release  : writes  f.IsLocalPresent   under OpenMapLock

Same field, two different locks, no mutual exclusion. A read of f.stat.Size could tear against Getattr rewriting the struct, producing a wrong clamp length and wrong bytes. The windows are tiny, which is why it presented as "corruption once in a while" and never reproduced on demand.

We built the reader with go build -race and ran concurrent git fsck and git status against a freshly-arrived repo. The race detector flags the unsynchronized access whether or not the timing happens to corrupt:

WARNING: DATA RACE
  Write by goroutine 79: (*Dir).Getattr  fuse_directory.go:907   (copyFusestatFromFusestat)
  Read  by goroutine 68: (*Dir).Read     fuse_directory.go:1907  (f.stat.Size)

Four distinct races, all the same shape. The fix is a per-File RWMutex taken innermost, after the map lock and never the reverse, guarding the metadata. Re-running the same stress went from 4 races to 0, full suite green, no deadlock.
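
A minimal sketch of that lock shape, assuming simplified types: the File, FS, and handler names below are illustrative stand-ins, not the real KEIBIDROP structs. The same *File sits in two maps with two different map locks, and its metadata is only ever touched under the File's own RWMutex, acquired after the map lock.

```go
package main

import (
	"fmt"
	"sync"
)

// Stat is a stand-in for the FUSE stat struct.
type Stat struct {
	Size  int64
	Mtime int64
}

// File carries its own lock, so every handler agrees on what guards the
// metadata regardless of which map it used to find the File.
type File struct {
	mu   sync.RWMutex
	stat Stat
}

// FS holds two of the maps; each keeps its own lock (hypothetical layout).
type FS struct {
	afmLock     sync.RWMutex
	allFileMap  map[string]*File
	openMapLock sync.RWMutex
	openFiles   map[uint64]*File
}

// getattr takes the map lock first, then the File lock innermost.
func (fs *FS) getattr(path string, s Stat) bool {
	fs.afmLock.RLock()
	defer fs.afmLock.RUnlock()
	f, ok := fs.allFileMap[path]
	if !ok {
		return false
	}
	f.mu.Lock()
	f.stat = s
	f.mu.Unlock()
	return true
}

// readSize finds the File through a different map, but reads the size
// under the same per-File lock that getattr writes under.
func (fs *FS) readSize(fh uint64) (int64, bool) {
	fs.openMapLock.RLock()
	defer fs.openMapLock.RUnlock()
	f, ok := fs.openFiles[fh]
	if !ok {
		return 0, false
	}
	f.mu.RLock()
	defer f.mu.RUnlock()
	return f.stat.Size, true
}

func main() {
	f := &File{}
	fs := &FS{
		allFileMap: map[string]*File{"/repo/HEAD": f},
		openFiles:  map[uint64]*File{1: f},
	}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func(n int64) { defer wg.Done(); fs.getattr("/repo/HEAD", Stat{Size: n}) }(int64(i))
		go func() { defer wg.Done(); fs.readSize(1) }()
	}
	wg.Wait()

	fs.getattr("/repo/HEAD", Stat{Size: 42})
	size, _ := fs.readSize(1)
	fmt.Println(size) // 42
}
```

Run under go run -race, this version stays quiet. Deleting the f.mu calls reproduces the same write-versus-read report shape as the warning above.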

This is the bug that mattered for PostgreSQL, which opens hundreds of descriptors concurrently at startup. Concurrent metadata access is its normal mode, not an edge case.

Bug two: git's atomic renames did not propagate

Clone a repo on peer A, run git status on peer B, get fatal: bad object HEAD. The .git/objects directory is empty on B even though A finished cloning.

git writes its final artifacts to a temp name and atomically renames them in:

tmp_pack_XXXXXX  -> pack-<hash>.pack    (also .idx, .rev)
index.lock       -> index

The destination only ever appears via rename, never create. On the receiving peer, the rename handler materialized the destination only if the source was already tracked. The source is a transient temp the debounce layer never sent, so the destination, the actual 22 MB pack, never landed. The working tree synced; the object store did not.

It was intermittent, so we instrumented the FUSE Rename, Create, and Unlink handlers and re-cloned:

GITOP-create  tmp_pack_XXXXXX
GITRENAME     tmp_pack_XXXXXX -> pack-<hash>.pack   willNotify=true

The sender fired the rename correctly, with the destination's stat. The receiver dropped it. The fix: the rename event carries the destination's attributes, so always create the destination as an on-demand file. Five consecutive big-repo clones afterward had the pack present and git fsck --full clean, every time.
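
The receiver-side change can be sketched as follows. The types and the applyRename name are hypothetical and locking is omitted for brevity. The point is only the branch: when the rename's source was never tracked, the destination is still created from the attributes carried in the event.

```go
package main

import "fmt"

// Attr is the destination metadata carried by the rename event.
type Attr struct {
	Size  int64
	Mtime int64
}

// RemoteFile is a placeholder for a file whose bytes live on the peer.
type RemoteFile struct {
	Attr     Attr
	OnDemand bool
}

// Receiver tracks remote files by path (hypothetical, single-goroutine).
type Receiver struct {
	remote map[string]*RemoteFile
}

// applyRename handles a rename notification from the peer.
func (r *Receiver) applyRename(src, dst string, dstAttr Attr) {
	if f, ok := r.remote[src]; ok {
		// Source is tracked: move the entry and refresh its attributes.
		delete(r.remote, src)
		f.Attr = dstAttr
		r.remote[dst] = f
		return
	}
	// Source never arrived, for example a temp file the debounce layer
	// skipped. The old behavior returned here and lost the destination.
	// Instead, create the destination as an on-demand file.
	r.remote[dst] = &RemoteFile{Attr: dstAttr, OnDemand: true}
}

func main() {
	r := &Receiver{remote: map[string]*RemoteFile{}}

	// The temp source was never announced, only the rename was.
	r.applyRename("objects/pack/tmp_pack_example", "objects/pack/pack-example.pack", Attr{Size: 22 << 20})

	f := r.remote["objects/pack/pack-example.pack"]
	fmt.Println(f != nil, f.OnDemand, f.Attr.Size) // true true 23068672
}
```

The file names in main are made up for the example. Any destination that only ever appears via rename takes the second branch.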

Bug three: silent notification drops

The sender debounces notifications for about 200 ms per path, because a clone is a storm of creates, writes, and renames. It then re-stats each file before sending so the peer gets the final size. If the file was gone at that instant, because a temp flickered or git created and then deleted a .keep marker, the sender silently dropped the notification.

We only saw it after adding end-to-end logging to the notify pipeline, enqueue to send to receive, with size, mtime, and reason at every drop point. A file would enqueue and never send:

notify-drop: refreshAttrFromDisk false  path=.../pack-*.keep  lstatErr="...no such file or directory"

The fix: do not drop. Keep the notification with the size captured at enqueue. A file that genuinely vanished reads back as EOF on the peer, so the worst case is a harmless empty phantom, never a lost real file that merely flickered. The bounded notify channel also dropped notifications on overflow; we raised its capacity from 2048 to 16384.

We deliberately did not filter by suffix. Dropping anything ending in .lock would be convenient, but PostgreSQL, editors, and build tools all use .lock legitimately. Losing a real file is worse than a cosmetic phantom.

Empirical results

Everything below ran on the cold path, prefetch off, between a Linux VPS and a macOS machine over a bridged link, the same setup that used to corrupt.

git clone. KEIBIDROP (315 files) and go-fp (38 files). Peer git fsck --full --strict: clean. Missing files: 0. git checkout: works.

git-LFS. The siui-integration repo: 3.6 GB across 13 LFS objects, the largest 431 MB. LFS names each object by the sha256 of its content, so the check is exact. Read every object on-demand over the bridge, sha256 it, compare to its filename:

LFS RESULT: 0 corrupt / 13 checked

PostgreSQL. The handoff at the top. A 977-file data directory, 1000 rows plus 333 updates written on the VPS, read on macOS byte-identical, written to from macOS, VACUUM clean. Cold: the Mac's cache backing held 20 MB after the run, fetched as PostgreSQL read it; nothing was prefetched.

 id  |    val     |  n
-----+------------+-----
  99 | row99_upd  |  99      <- updated on the VPS
 100 | row100     | 100
 102 | row102_upd | 102      <- updated on the VPS
1001 | from_mac   | 12345    <- inserted on the Mac

Benchmarks held: encrypted transport around 1.9 GB/s, FUSE write around 1 GB/s, on-demand read end-to-end 190 to 330 MB/s. The FUSE kernel layer is 54% of the read path, so the per-File lock added for bug one is invisible against it.

Why this is the bar

A database is the adversarial case for a network filesystem. It wants fsync durability, it mmaps files, it holds hundreds of descriptors open at once, it does random single-page reads, and it guards itself with a lock file. Most network filesystems' documentation tells you not to put a database on them.

KEIBIDROP's backing store is another laptop behind a NAT, reached over an encrypted, relay-bridged connection, fetched a chunk at a time as you read. If PostgreSQL round-trips byte-perfect across that on the cold path, the integrity question for ordinary files is settled.

What is still rough

Runtime lock files (postmaster.pid, *.lock) sync as empty phantoms on the receiver if they are removed mid-flight, the keep-don't-drop tradeoff above. They read as EOF, and the PostgreSQL handoff already has you remove the stale lock, which you would do anyway. Still a wart.
Windows and WinFsp: rename delivery is correct by construction, since the receiver fix is platform-agnostic Go, but it has not yet been exercised end-to-end on a Windows peer.
Benchmarks that isolate the per-File lock on the hot read path are still to do.

The corruption that haunted git, LFS, and PostgreSQL on FUSE came down to three concrete bugs: unsynchronized metadata, a dropped rename, and a dropped notification. Each is now a fixed line of code with a test that reproduces it. That is the difference between flaky and fixed.