The Write/Release Race Condition

The Symptom

Copy a 1GB file into the KEIBIDROP FUSE mount, and about one in five times the copy fails partway through:

cp: error writing '/mnt/keibidrop/large_file.bin': Bad file descriptor

It happened more often under load, and frequently on the Intel MacBook I developed on. The destination file would be there but truncated, with around 548KB of the 1GB written and the rest gone. "Bad file descriptor" in FUSE tells you something is wrong without telling you what, where, or why.

Finding the Bug

The logs told the story, eventually. A single file copy generates roughly 10,000 log lines of FUSE operations; finding the relevant pair required searching for EBADF and then looking at the timestamps immediately preceding it. When I found the failing sequence, I saw this:

22:03:54.976  Release() path=/large_file.bin fh=7 -- closing fd
22:03:54.977  Write()   path=/large_file.bin fh=7 -- pwrite failed: EBADF

One millisecond. Release closed the file descriptor at .976, and Write tried to use it at .977.

The kernel sent Release (the "this file handle is done" signal) before the last Write had completed. The Write was already in flight when Release arrived and closed the underlying file descriptor out from under it.

This should not happen in a well-behaved filesystem. But FUSE is not a normal filesystem. The kernel dispatches FUSE operations asynchronously across multiple threads. Write and Release can be in-flight simultaneously on different goroutines, and the kernel makes no guarantee about ordering at the edges.

The Code (Before)

Here is the Write handler as it existed before the fix. See if you can spot the problem:

func (fs *KeibiFS) Write(path string, buff []byte, ofst int64, fh uint64) int {
    fs.mu.RLock()
    entry, ok := fs.files[path]
    fileHandle := entry.fh
    fs.mu.RUnlock()  // <-- DANGER: lock released here

    if !ok {
        return -fuse.ENOENT
    }

    // By this point, Release() may have already closed fileHandle
    n, err := syscall.Pwrite(int(fileHandle), buff, ofst)
    if err != nil {
        log.Printf("Write pwrite failed: %v", err)
        return -fuse.EIO
    }

    return n
}

And the Release handler:

func (fs *KeibiFS) Release(path string, fh uint64) int {
    fs.mu.Lock()
    entry, ok := fs.files[path]
    if ok {
        syscall.Close(int(entry.fh))
        delete(fs.files, path)
    }
    fs.mu.Unlock()
    return 0
}

The Write handler acquires a read lock, copies the file handle, releases the read lock, and then calls Pwrite with the copied handle. The Release handler acquires a write lock and closes the handle.

The window is between RUnlock() and Pwrite(). In that gap, Release can acquire the write lock, close the file descriptor, and return before Write gets to use the handle it just copied.

Source: Write handler | Release handler | OpenMapLock declaration

The Race

Here is the timeline, step by step:

Thread A (Write)          Thread B (Release)
─────────────────         ──────────────────
RLock()
  read fh = 7
RUnlock()
                          Lock()       // succeeds, no readers
                            Close(7)   // fd 7 is gone
                            delete(path)
                          Unlock()
Pwrite(7, ...)
  → EBADF                // fd 7 no longer exists

Thread A reads the file handle and drops the lock. Thread B grabs the exclusive lock, closes the file descriptor, and releases. Thread A then tries to use a file descriptor that no longer exists.

The reason this is intermittent is timing. On fast, idle hardware, Write usually finishes the Pwrite before Release arrives. Under load, the writing goroutine is more likely to be descheduled inside the window, and a large file generates many Write calls, so Release gets more chances to slip in.

The Fix

Hold the read lock during the actual Pwrite syscall:

func (fs *KeibiFS) Write(path string, buff []byte, ofst int64, fh uint64) int {
    fs.mu.RLock()
    defer fs.mu.RUnlock()  // hold through the entire operation

    entry, ok := fs.files[path]
    if !ok {
        return -fuse.ENOENT
    }

    n, err := syscall.Pwrite(int(entry.fh), buff, ofst)
    if err != nil {
        log.Printf("Write pwrite failed: %v", err)
        return -fuse.EIO
    }

    return n
}

That is it. The essential change is one line: defer fs.mu.RUnlock() instead of an early RUnlock() before the syscall, so the handle is looked up and used under the same lock.

Why This Works

Go's sync.RWMutex has specific semantics that make this safe:

Multiple goroutines can hold RLock simultaneously, so multiple Write calls on different paths (or even the same path) do not block each other. File copies with many concurrent writes continue to work at full speed.
Lock() waits for all existing RLocks to release, so Release cannot acquire the exclusive lock until every in-flight Write has finished its Pwrite and released its read lock.
There is no deadlock risk. Write only ever acquires a read lock, Release only ever acquires a write lock, and neither calls the other. With a single lock and no nested acquisition, there is no lock ordering to get wrong.

Source: OpenFileCounter (reference counting)

The cost is that Release blocks until all concurrent Writes finish, but Pwrite on a local file descriptor takes microseconds, so the delay is negligible.

Why It Was Intermittent

The time between RUnlock() and Pwrite() in the old code was on the order of nanoseconds. The goroutine scheduler rarely paused Thread A in that window during small file copies. Copying a large file generates hundreds of Write calls, each with its own window, and under load the probability of a context switch inside one of them increases. The bug appeared roughly one in five times on large files, which is the worst kind of race condition: frequent enough to be a real problem, rare enough to make you doubt your own logs.

Lessons

FUSE operations interleave in ways you do not expect

The FUSE protocol allows the kernel to dispatch multiple operations concurrently. Write and Release for the same file can be in-flight on different threads at the same time. You cannot assume that "the last Write finishes before Release begins." The kernel makes no such guarantee.

Brief locks are not always correct

The original code followed a common pattern: lock, read shared state, unlock, use the copied value. This is fine when the copied value remains valid after unlock, but a file descriptor can be invalidated by another operation (Close). The copy is a handle that refers to external state that can change.

RWMutex is your friend

The fix works because RWMutex allows concurrent reads. If we used a plain Mutex, holding it during Pwrite would serialize all writes and destroy throughput for multi-file copies. RWMutex gives us the safety of "Release cannot run during Write" without the cost of "only one Write at a time."

Log timestamps are crucial

Without millisecond-resolution timestamps on every FUSE operation, this bug would have been nearly impossible to diagnose. The one-millisecond gap between Release and the failed Write was the entire clue. If the logs had only shown "Write failed: EBADF" without timing, I would still be looking for the cause.

Always log timestamps in FUSE handlers, both for debugging and for understanding the concurrency model of your own code.

The Meta-Lesson

There is a well-known rule in concurrent programming: "Release locks before I/O." The reasoning is sound, since holding a lock during a network call that might take seconds (or hang forever) is a recipe for deadlocks and starvation.

This rule has an implicit assumption: the I/O is slow and unpredictable. Network calls, disk I/O over NFS, and HTTP requests can take milliseconds to minutes, and you do not want to hold a lock for that duration.

Pwrite on a local file descriptor is a different kind of I/O entirely. It is a syscall that completes in microseconds. Holding a read lock for the duration of a local Pwrite will cause no starvation; nobody will notice. Dropping the lock, on the other hand, lets another goroutine close the file descriptor out from under you.

The rule is better stated as "release locks before slow I/O." For fast, local syscalls, holding the lock is sometimes the only correct option. I spent weeks chasing this race condition, and the fix was a one-line change; the original "optimization" of releasing the lock early saved microseconds and cost correctness. Catherine McGeoch's A Guide to Experimental Algorithmics shaped how I approach this kind of performance debugging: measure first, change one variable, measure again.