The st_blksize Trick: 10x FUSE Write Performance

Overriding st_blksize from 4 KB to 2 MiB yielded a 10x throughput improvement

7 min read | KEIBIDROP Series

The Problem

We were testing KEIBIDROP's FUSE mount by copying a 1 GB file into it. The operation completed, but it was painfully slow. Profiling revealed the issue: the kernel was issuing 25,600 individual write calls, each carrying exactly 4,096 bytes. The throughput topped out at around 300 MB/s.

For a system that encrypts every chunk with ChaCha20-Poly1305 before sending it over the wire, those 25,600 round trips through the FUSE layer added up fast. Each write call means a context switch from kernel space to userspace, a lock acquisition in our filesystem handler, an encryption pass, and a network send. At 4 KB per call, the overhead dominated the actual data transfer.

The numbers: 1 GB / 4 KB = 262,144 potential write calls. The kernel batched some of those, but we still saw ~25,600 calls, each carrying the full cost of a FUSE round-trip.
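The call-count arithmetic can be sanity-checked with a toy cp-style loop. This is illustrative only; simulateCopy is a name invented here, not KEIBIDROP code:

```go
package main

import "fmt"

// simulateCopy counts the write calls a cp-style loop would issue
// when copying `total` bytes using the advertised st_blksize buffer.
func simulateCopy(total, bufSize int64) int64 {
    calls := int64(0)
    for remaining := total; remaining > 0; remaining -= bufSize {
        calls++
    }
    return calls
}

func main() {
    const gib = int64(1) << 30
    for _, bs := range []int64{4 << 10, 256 << 10, 2 << 20} {
        fmt.Printf("st_blksize %7d B -> %6d write calls\n", bs, simulateCopy(gib, bs))
    }
}
```

At 4 KiB the loop issues 262,144 calls for 1 GiB; at 2 MiB it issues 512. The measured ~25,600 sits between those because the kernel batches opportunistically.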

The Discovery

The answer came from reading the kernel source, not the FUSE documentation. When userspace programs like cp or macOS's fcopyfile need to copy a file, they call stat() on the destination to determine the optimal I/O buffer size. The field they read is st_blksize, the "preferred block size for filesystem I/O."

On HFS+ and APFS, this value defaults to 4,096 bytes. Most FUSE filesystems inherit this default without thinking about it. The kernel dutifully reports 4,096, and cp dutifully writes in 4 KB chunks.

The important detail is that st_blksize is advisory; it is a hint from the filesystem to userspace about what buffer size will give the best performance, not a hardware constraint. For a FUSE filesystem that batches writes into encrypted network packets, 4 KB is absurdly small.

// What cp-style copy loops do (simplified):
buf_size = stat(destination).st_blksize;  // 4096 by default
while (bytes_remaining > 0) {
    n = read(src, buf, buf_size);
    write(dst, buf, n);                   // one FUSE round-trip per call
}
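To see what a given mount actually advertises, you can read the field directly on a Unix-like system. A minimal sketch using Go's raw syscall package; preferredBlockSize is a name invented here:

```go
package main

import (
    "fmt"
    "syscall"
)

// preferredBlockSize returns the st_blksize that stat() reports for path,
// i.e. the buffer size cp-style tools will use when writing to it.
func preferredBlockSize(path string) (int64, error) {
    var st syscall.Stat_t
    if err := syscall.Stat(path, &st); err != nil {
        return 0, err
    }
    return int64(st.Blksize), nil
}

func main() {
    bs, err := preferredBlockSize("/tmp")
    if err != nil {
        panic(err)
    }
    fmt.Printf("st_blksize for /tmp: %d bytes\n", bs)
}
```

Pointing this at your FUSE mount before and after the override is a quick way to confirm the new value is actually reaching userspace.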

The Fix

The fix is almost embarrassingly simple. Override st_blksize in the FUSE Getattr handler to report a larger block size:

// FilesystemBlockSize is the preferred I/O block size reported
// to userspace via stat(). Larger values mean fewer, bigger
// write calls from cp/fcopyfile/rsync.
const FilesystemBlockSize = 2 << 20 // 2 MiB

func (fs *KeibiFS) Getattr(path string, stat *fuse.Stat_t, fh uint64) int {
    // ... fill in size, mode, timestamps ...
    stat.Blksize = FilesystemBlockSize
    return 0
}


We also need to update Statfs() to report consistent block sizing for the filesystem as a whole:

func (fs *KeibiFS) Statfs(path string, stat *fuse.Statfs_t) int {
    stat.Bsize  = uint64(FilesystemBlockSize)
    stat.Frsize = uint64(FilesystemBlockSize)
    // ... rest of statfs fields ...
    return 0
}


The Result

After the change, copying that same 1 GB file looked completely different:

Metric           Before (4 KB)   After (2 MiB)
Write calls      ~25,600         ~1,000
Bytes per call   4 KB            ~1 MB
Throughput       300 MB/s        3,400 MB/s
Improvement      baseline        ~10x

The write calls dropped by roughly 25x. The throughput increased by over 10x. The file copy that previously felt sluggish became nearly instantaneous on local transfers. For encrypted network transfers, the improvement was even more dramatic because we were amortizing the per-call encryption overhead across much larger chunks.

Platform-Specific Values

The optimal block size is not the same everywhere. We tested across platforms and settled on these values:

Platform          Optimal st_blksize   Notes
macOS Intel       2 MiB                Sweet spot for the fcopyfile path
macOS M-series    up to 10 MiB         Larger caches tolerate bigger blocks
Linux             256 KiB              FUSE max_write caps the effective size
Windows (WinFsp)  up to 10 MiB         Similar to M-series behavior

We use Go build tags to select the right value at compile time:

// blocksize_darwin.go
//go:build darwin

package filesystem

const FilesystemBlockSize = 2 << 20 // 2 MiB

// blocksize_linux.go
//go:build linux

package filesystem

const FilesystemBlockSize = 256 << 10 // 256 KiB

// blocksize_windows.go
//go:build windows

package filesystem

const FilesystemBlockSize = 2 << 20 // 2 MiB


Why It Works

The 10x improvement comes from four compounding effects:

  1. Fewer syscalls. Each FUSE write call requires a context switch from kernel to userspace and back. Going from 25,600 calls to 1,000 eliminates 24,600 round-trips. At roughly 2-5 microseconds per context switch, that alone saves 50-120 ms.
  2. Better disk I/O alignment. Storage devices have internal page sizes (typically 4-16 KB for SSDs). Larger writes let the OS batch these into sequential operations, reducing write amplification.
  3. Reduced lock contention. Our FUSE handler acquires a mutex per write to maintain file consistency (see Write/Release Race). Fewer writes means fewer lock acquisitions.
  4. CPU cache efficiency. Processing 1 MB of data in a single call keeps the working set hot in L2/L3 cache. Processing 4 KB at a time means the encryption state, file metadata, and network buffers get evicted and reloaded 25x more often.
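The first point can be back-of-enveloped from the article's own figures. A toy estimate, not a measurement:

```go
package main

import "fmt"

// roundTripSavingsMs estimates the context-switch time saved by
// reducing the number of FUSE write calls, at a given cost per switch.
func roundTripSavingsMs(callsBefore, callsAfter int64, usPerSwitch float64) float64 {
    return float64(callsBefore-callsAfter) * usPerSwitch / 1000.0
}

func main() {
    // ~25,600 calls before, ~1,000 after, 2-5 us per round-trip.
    fmt.Printf("saved at 2us/switch: %.0f ms\n", roundTripSavingsMs(25600, 1000, 2))
    fmt.Printf("saved at 5us/switch: %.0f ms\n", roundTripSavingsMs(25600, 1000, 5))
}
```

That recovers the 50-120 ms range quoted above; the remaining speedup comes from the other three effects compounding.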

The Trap: Don't Go Too High

There is a sweet spot, and overshooting it hurts. Going beyond the optimal range introduces memory pressure, TLB thrashing, and diminishing returns. On Apple Silicon and Windows, the sweet spot goes up to about 10 MiB; on Intel Macs, 2 MiB; on Linux, 256 KiB due to FUSE max_write caps. Beyond those ranges, throughput deteriorates on all platforms. These numbers come from empirical testing across a previous FUSE project and KEIBIDROP itself.

The sweet spot curve: at 4 KB, syscall overhead dominates; at 1-2 MiB, throughput is maximized with reasonable memory usage; at 16+ MiB, you start paying for memory pressure and TLB thrashing with diminishing throughput gains.

At very large block sizes, you start hitting memory pressure. Each in-flight write consumes st_blksize bytes of kernel buffer space. With multiple concurrent copies, that adds up. On a machine with 8 GB of RAM, having 16 concurrent writes at 64 MiB each would consume 1 GB just for write buffers.

There is also the issue of latency. If a write fails halfway through a 64 MiB chunk, the entire chunk must be retried. With 2 MiB chunks, you lose at most 2 MiB of progress on failure.

How to Find Your Optimal

If you are building a FUSE filesystem, here is a simple instrumentation approach to find your optimal block size:

import (
    "fmt"
    "sync"
)

// WriteStats records the size distribution of incoming FUSE write calls.
type WriteStats struct {
    mu         sync.Mutex
    callCount  int64
    totalBytes int64
    minSize    int64
    maxSize    int64
}

// Record tracks one write of n bytes.
func (ws *WriteStats) Record(n int64) {
    ws.mu.Lock()
    defer ws.mu.Unlock()
    ws.callCount++
    ws.totalBytes += n
    if ws.minSize == 0 || n < ws.minSize {
        ws.minSize = n
    }
    if n > ws.maxSize {
        ws.maxSize = n
    }
}

// Report summarizes the distribution; humanize formats byte counts
// for display.
func (ws *WriteStats) Report() string {
    ws.mu.Lock()
    defer ws.mu.Unlock()
    avg := int64(0)
    if ws.callCount > 0 {
        avg = ws.totalBytes / ws.callCount
    }
    return fmt.Sprintf(
        "writes=%d total=%s avg=%s min=%s max=%s",
        ws.callCount,
        humanize(ws.totalBytes),
        humanize(avg),
        humanize(ws.minSize),
        humanize(ws.maxSize),
    )
}
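Report leans on a humanize helper for byte counts. A minimal sketch of one possible implementation (this exact version is an assumption, not necessarily what KEIBIDROP ships):

```go
package main

import "fmt"

// humanize renders a byte count with a binary-unit suffix (KiB, MiB, ...).
func humanize(n int64) string {
    const unit = 1024
    if n < unit {
        return fmt.Sprintf("%d B", n)
    }
    div, exp := int64(unit), 0
    for v := n / unit; v >= unit; v /= unit {
        div *= unit
        exp++
    }
    return fmt.Sprintf("%.1f %ciB", float64(n)/float64(div), "KMGTPE"[exp])
}

func main() {
    fmt.Println(humanize(4096))    // 4.0 KiB
    fmt.Println(humanize(2 << 20)) // 2.0 MiB
}
```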

Run your typical workload (file copies, IDE operations, git clones) with different st_blksize values and compare the stats. You are looking for the point where increasing the block size stops improving throughput.

The Meta-Lesson

Default values in systems software exist for compatibility, not for performance. The 4,096-byte block size is a safe default that works on every filesystem from FAT32 to ZFS. It is the value that will never cause a problem and will never be optimal.

When you control the implementation and you know your filesystem's characteristics, your network's MTU, and your encryption's chunk size, you should override every default that touches your hot path.

This is empirical systems engineering: measure, hypothesize, change one variable, measure again. Catherine McGeoch's A Guide to Experimental Algorithmics describes exactly this methodology, and it applies to systems tuning as well as it does to algorithm benchmarking. The st_blksize fix took one line of code to implement and weeks of profiling to discover.
