How File Descriptor Recycling Broke PostgreSQL on Our FUSE Filesystem

POSIX fd semantics, concurrent goroutines, and 50% data loss during initdb

8 min read | KEIBIDROP Series

PostgreSQL's initdb creates around 224 files in under a second. On our FUSE filesystem, about half of them ended up as 0 bytes. The database would not start. WAL segments had invalid magic numbers. System indexes were missing. Running under strace, everything worked. Without strace, 90% failure rate.

We spent two sessions chasing this. The first session narrowed it down: not a DirectIO issue (we tested with DirectIO always on, always off, same failure), not a page cache issue, not a permission issue. The bug disappeared under observation, which pointed at a timing-dependent race.

How our FUSE handler manages file handles

When a program opens a file on our FUSE mount, the kernel calls our Open handler. We call syscall.Open on the real file in the backing directory, get back a file descriptor number (say, 42), store it in a map, and return it to the kernel as the file handle. The kernel passes that same number back on every subsequent Read, Write, and Release call for that file.

Our map looked like this:

OpenFileHandlers map[uint64]*File  // key = fd number from syscall.Open

Write used the handle directly as the fd for the pwrite syscall:

n, err := platPwrite(int(fh), buff, offset)

Release closed the fd and deleted the map entry:

platClose(int(fh))
delete(d.OpenFileHandlers, fh)

The handle was the fd. The fd was the handle. This is where the problem lived.

The POSIX guarantee that matters here

The open(2) specification requires returning the lowest-numbered unused file descriptor. The Linux close(2) man page says it plainly: "The kernel always releases the file descriptor early in the close operation, freeing it for reuse." The fd is recyclable the instant you close it, before any pending I/O flushes complete.

In Go, with many goroutines calling CreateEx and Release concurrently on different files, the kernel recycles fd numbers constantly. CreateEx opens file A and gets fd 42. Release for file A closes fd 42 while a Write for file A is still in flight. CreateEx then opens file B and gets fd 42 again. When the stale Write finally executes, its pwrite on fd 42 lands in file B.

The FUSE kernel module makes this worse. On non-fuseblk mounts, FUSE_RELEASE is queued asynchronously. There is no ordering guarantee between a Write and a Release for the same file handle. The kernel can dispatch them to the userspace daemon in either order. libfuse issue #746 documents a related race in the high-level API.

POSIX mandates lowest-available fd. Close frees the number instantly. In a concurrent FUSE handler, this means fd numbers collide constantly.

Why strace hides it

strace interposes on every syscall, serializing them and adding microseconds of latency at each step. Those microseconds are enough that fd 42 does not get recycled before the in-flight Write completes. Remove strace, and the window between close and reopen collapses to nanoseconds. The race hits on almost every initdb run because 224 files cycling through a small pool of fd numbers at high speed produces frequent collisions.

The fix

We decoupled the FUSE handle namespace from the kernel fd namespace. A process-wide atomic counter generates handle IDs that never repeat:

type HandleEntry struct {
    FD   int   // the actual kernel fd for syscalls
    File *File // metadata
}

var nextHandleID atomic.Uint64

func allocHandleID() uint64 {
    return nextHandleID.Add(1)
}

The map became map[uint64]*HandleEntry. When opening a file, we allocate a unique handle ID, store the real fd alongside the File struct, and return the handle ID to the kernel. When writing, we look up the handle to get the actual fd. The kernel can recycle fd 42 as many times as it wants; each file gets a distinct handle ID in our map and there are no collisions.

We changed every platPwrite(int(fh), ...) to platPwrite(entry.FD, ...) and every d.OpenFileHandlers[uint64(fd)] = f to d.OpenFileHandlers[handleID] = &HandleEntry{FD: fd, File: f}. About 80 lines across four files. No lock patterns changed. The Opendir/Releasedir path was left alone since it has a simple open-use-close lifecycle without a map.

Verification

  1. Added the new types to types.go. Ran go vet. No new warnings beyond a pre-existing one about an unused streamCancel variable.
  2. Changed all map initializations. Ran go vet again.
  3. Updated CreateEx and OpenEx (four code paths each). Built successfully.
  4. Updated Read, Write, Release, Fsync. Built with race detector. Ran filesystem unit tests.
  5. Ran all 70 integration tests. All passed.
  6. Wrote a stress test: 250 goroutines each doing Create, Write, Release in a tight loop on different files, then verifying every file has the correct content. Ran 10 times with the race detector. Zero cross-contamination, zero races detected.
  7. Ran PostgreSQL initdb on a Linux FUSE mount. 968 files, 168 legitimately empty (matching a native non-FUSE initdb exactly). Started the server, created a table, inserted 15,000 rows with concurrent updates, ran VACUUM ANALYZE, queried aggregates, shut down cleanly. 980 files, 42 MB.

The uint64 counter will not wrap in practice. At one billion handle allocations per second it takes 585 years. The HandleEntry struct adds 8 bytes per open file, which is negligible.

What we confirmed about the OS behavior

POSIX open(2) mandates lowest-available fd on both Linux and macOS (which inherits this from its BSD layer). The Linux close(2) man page explicitly states early fd release. The Go sync/atomic.Uint64.Add operation is lock-free and sequentially consistent across goroutines, with automatic 64-bit alignment on 32-bit systems since Go 1.19. Windows uses opaque HANDLEs rather than small integers, but WinFSP/cgofuse creates a POSIX-style abstraction, so the fix applies there too.

We also confirmed that the FUSE protocol provides no ordering guarantee for Release relative to other operations on the same handle. Our previous EBADF fallback code (which reopened the file by path when a write failed with bad-fd) was a workaround for symptoms of this race. The opaque handle fix addresses the cause.
