The Problem
We were testing KEIBIDROP's FUSE mount by copying a 1 GB file into it. The operation completed, but it was painfully slow. Profiling revealed the issue: the kernel was issuing 25,600 individual write calls, each carrying exactly 4,096 bytes. The throughput topped out at around 300 MB/s.
For a system that encrypts every chunk with ChaCha20-Poly1305 before sending it over the wire, those 25,600 round trips through the FUSE layer added up fast. Each write call means a context switch from kernel space to userspace, a lock acquisition in our filesystem handler, an encryption pass, and a network send. At 4 KB per call, the overhead dominated the actual data transfer.
The Discovery
The answer came from reading the kernel source, not the FUSE documentation. When a userspace copy path — the cp utility, or macOS's fcopyfile() library routine — prepares to copy a file, it calls stat() on the destination to determine the optimal I/O buffer size. The field it reads is st_blksize, the "preferred block size for filesystem I/O."
On HFS+ and APFS, this value defaults to 4,096 bytes. Most FUSE filesystems inherit this default without thinking about it. The kernel dutifully reports 4,096, and cp dutifully writes in 4 KB chunks.
The important detail is that st_blksize is advisory; it is a hint from the filesystem to userspace about what buffer size will give the best performance, not a hardware constraint. For a FUSE filesystem that batches writes into encrypted network packets, 4 KB is absurdly small.
```c
// What the kernel does (simplified):
buf_size = stat(destination).st_blksize; // 4096 by default
while (bytes_remaining > 0) {
    n = read(src, buf, buf_size);
    write(dst, buf, n); // one FUSE round-trip per call
}
```
The Fix
The fix is almost embarrassingly simple. Override st_blksize in the FUSE Getattr handler to report a larger block size:
```go
// FilesystemBlockSize is the preferred I/O block size reported
// to userspace via stat(). Larger values mean fewer, bigger
// write calls from cp/fcopyfile/rsync.
const FilesystemBlockSize = 2 << 20 // 2 MiB

func (fs *KeibiFS) Getattr(path string, stat *fuse.Stat_t, fh uint64) int {
	// ... fill in size, mode, timestamps ...
	stat.Blksize = FilesystemBlockSize
	return 0
}
```
We also need to update Statfs() to report consistent block sizing for the filesystem as a whole:
```go
func (fs *KeibiFS) Statfs(path string, stat *fuse.Statfs_t) int {
	stat.Bsize = uint64(FilesystemBlockSize)
	stat.Frsize = uint64(FilesystemBlockSize)
	// ... rest of statfs fields ...
	return 0
}
```
The Result
After the change, copying that same 1 GB file looked completely different:
| Metric | Before (4 KB) | After (2 MiB) |
|---|---|---|
| Write calls | ~25,600 | ~1,000 |
| Bytes per call | 4 KB | ~1 MB |
| Throughput | 300 MB/s | 3,400 MB/s |
| Improvement | baseline | ~10x |
The write calls dropped by roughly 25x. The throughput increased by over 10x. The file copy that previously felt sluggish became nearly instantaneous on local transfers. For encrypted network transfers, the improvement was even more dramatic because we were amortizing the per-call encryption overhead across much larger chunks.
Platform-Specific Values
The optimal block size is not the same everywhere. We tested across platforms and settled on these values:
| Platform | Optimal st_blksize | Notes |
|---|---|---|
| macOS Intel | 2 MiB | Sweet spot for fcopyfile path |
| macOS M-series | up to 10 MiB | Larger caches tolerate bigger blocks |
| Linux | 256 KiB | FUSE max_write caps effective size |
| Windows (WinFsp) | up to 10 MiB | Similar to M-series behavior |
We use Go build tags to select the right value at compile time:
```go
// blocksize_darwin.go
//go:build darwin

package filesystem

const FilesystemBlockSize = 2 << 20 // 2 MiB
```

```go
// blocksize_linux.go
//go:build linux

package filesystem

const FilesystemBlockSize = 256 << 10 // 256 KiB
```

```go
// blocksize_windows.go
//go:build windows

package filesystem

const FilesystemBlockSize = 2 << 20 // 2 MiB
```
Why It Works
The 10x improvement comes from four compounding effects:
- Fewer syscalls. Each FUSE write call requires a context switch from kernel to userspace and back. Going from 25,600 calls to 1,000 eliminates 24,600 round-trips. At roughly 2-5 microseconds per context switch, that alone saves 50-120 ms.
- Better disk I/O alignment. Storage devices have internal page sizes (typically 4-16 KB for SSDs). Larger writes let the OS batch these into sequential operations, reducing write amplification.
- Reduced lock contention. Our FUSE handler acquires a mutex per write to maintain file consistency (see Write/Release Race). Fewer writes means fewer lock acquisitions.
- CPU cache efficiency. Processing 1 MB of data in a single call keeps the working set hot in L2/L3 cache. Processing 4 KB at a time means the encryption state, file metadata, and network buffers get evicted and reloaded 25x more often.
The Trap: Don't Go Too High
There is a sweet spot, and overshooting it hurts. Going beyond the optimal range introduces memory pressure, TLB thrashing, and diminishing returns. On Apple Silicon and Windows, the sweet spot goes up to about 10 MiB; on Intel Macs, 2 MiB; on Linux, 256 KiB due to FUSE max_write caps. Beyond those ranges, throughput deteriorates on all platforms. These numbers come from empirical testing across a previous FUSE project and KEIBIDROP itself.
At very large block sizes, you start hitting memory pressure. Each in-flight write consumes st_blksize bytes of kernel buffer space. With multiple concurrent copies, that adds up. On a machine with 8 GB of RAM, having 16 concurrent writes at 64 MiB each would consume 1 GB just for write buffers.
There is also the issue of latency. If a write fails halfway through a 64 MiB chunk, the entire chunk must be retried. With 2 MiB chunks, you lose at most 2 MiB of progress on failure.
How to Find Your Optimal
If you are building a FUSE filesystem, here is a simple instrumentation approach to find your optimal block size:
```go
package filesystem

import (
	"fmt"
	"sync"
)

// WriteStats accumulates per-call write sizes so you can see what
// block size the kernel is actually using against your filesystem.
type WriteStats struct {
	mu         sync.Mutex
	callCount  int64
	totalBytes int64
	minSize    int64
	maxSize    int64
}

func (ws *WriteStats) Record(n int64) {
	ws.mu.Lock()
	defer ws.mu.Unlock()
	ws.callCount++
	ws.totalBytes += n
	if ws.minSize == 0 || n < ws.minSize {
		ws.minSize = n
	}
	if n > ws.maxSize {
		ws.maxSize = n
	}
}

func (ws *WriteStats) Report() string {
	ws.mu.Lock()
	defer ws.mu.Unlock()
	avg := int64(0)
	if ws.callCount > 0 {
		avg = ws.totalBytes / ws.callCount
	}
	return fmt.Sprintf(
		"writes=%d total=%s avg=%s min=%s max=%s",
		ws.callCount,
		humanize(ws.totalBytes),
		humanize(avg),
		humanize(ws.minSize),
		humanize(ws.maxSize),
	)
}

// humanize renders a byte count with a binary-unit suffix (B, KiB, MiB, ...).
// One possible implementation; substitute your own formatting helper.
func humanize(n int64) string {
	const unit = 1024
	if n < unit {
		return fmt.Sprintf("%d B", n)
	}
	div, exp := int64(unit), 0
	for m := n / unit; m >= unit; m /= unit {
		div *= unit
		exp++
	}
	return fmt.Sprintf("%.1f %ciB", float64(n)/float64(div), "KMGTPE"[exp])
}
```
Run your typical workload (file copies, IDE operations, git clones) with different st_blksize values and compare the stats. You are looking for the point where increasing the block size stops improving throughput.
The Meta-Lesson
Default values in systems software exist for compatibility, not for performance. The 4,096-byte block size is a safe default that works on every filesystem from FAT32 to ZFS. It is the value that will never cause a problem and will never be optimal.
When you control the implementation and you know your filesystem's characteristics, your network's MTU, and your encryption's chunk size, you should override every default that touches your hot path.
This is empirical systems engineering: measure, hypothesize, change one variable, measure again. Catherine McGeoch's A Guide to Experimental Algorithmics describes exactly this methodology, and it applies to systems tuning as well as it does to algorithm benchmarking. The st_blksize fix took one line of code to implement and weeks of profiling to discover.