KEIBIDROP Benchmark Matrix

Loopback, WAN, WiFi LAN, iOS, and FUSE-over-WAN

8 min read | KEIBIDROP Series | May 2026

In March we measured 442 MB/s on loopback. After fixing a double-copy in the encryption layer and increasing the gRPC HTTP/2 window from 64 KiB to 16 MiB, we now measure 662 MB/s on the same laptop. This post reports the full matrix: loopback, laptop-to-VPS (24 ms, 500 Mbps wire), iPhone-to-laptop over WiFi, iPhone through the bridge relay, and FUSE reads over WAN. All traffic is encrypted with AES-256-GCM.

Lab Setup

Laptop: Intel i7-9750H @ 2.60 GHz (AES-NI), 32 GB RAM, macOS, NVMe SSD
VPS: Ubuntu 24.04, x86_64, Timisoara, Romania
iPhone: iOS, ARM64, same WiFi as laptop (Iasi, Romania)
RTT laptop-VPS: 24 ms (Iasi to Timisoara)
ISP wire: 500 Mbps
WiFi router: 4 Gbps (WiFi 5, 802.11ac)
Cipher: AES-256-GCM (hardware-accelerated, auto-negotiated)
Tool: kd bench-pull (agentic CLI) + test suite for FUSE loopback

Methodology: Pre-allocated random files (1-3 GB). Each benchmark runs alone, no concurrent traffic. bench-pull times the full PullFile call, reports MB/s, and deletes the local copy. FUSE benchmarks use dd reading through the mount. Files are pre-generated and pre-shared before timing starts.
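
The measurement described above amounts to timing one full transfer and dividing bytes by seconds. A minimal Go sketch of that idea follows. The helper names (timedCopy, throughputMBps) are hypothetical and not taken from the bench-pull source, an in-memory reader stands in for the PullFile call, and 1 MB is assumed to be 1,000,000 bytes.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"time"
)

// throughputMBps converts a byte count and elapsed seconds into MB/s.
// Assumption: 1 MB = 1,000,000 bytes.
func throughputMBps(n int64, seconds float64) float64 {
	if seconds <= 0 {
		return 0
	}
	return float64(n) / 1e6 / seconds
}

// timedCopy drains src into dst and reports bytes moved and wall-clock
// seconds. It stands in for timing one full transfer call.
func timedCopy(dst io.Writer, src io.Reader) (int64, float64, error) {
	start := time.Now()
	n, err := io.Copy(dst, src)
	return n, time.Since(start).Seconds(), err
}

func main() {
	// 8 MiB of zeros as a stand-in for a pre-generated file.
	payload := bytes.NewReader(make([]byte, 8<<20))
	n, secs, err := timedCopy(io.Discard, payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes in %.6f s (%.1f MB/s)\n", n, secs, throughputMBps(n, secs))
}
```

Discarding the output mirrors the methodology's "delete the local copy" step only loosely: the real tool writes to disk first, so disk I/O is part of its measured time.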

Results

Loopback (Intel i7-9750H, macOS, AES-NI, NVMe SSD)

Mode    | Size   | MB/s | Time
no-FUSE | 3 GB   | 662  | 4.6 s
FUSE    | 1 MB   | 89   | 11 ms
FUSE    | 10 MB  | 136  | 73 ms
FUSE    | 100 MB | 200  | 500 ms
FUSE    | 1 GB   | 152  | 6.7 s

Laptop to VPS (Iasi to Timisoara, 24 ms RTT, 500 Mbps wire, bridge relay)

Direction | Mode         | Size | MB/s | Time
Upload    | no-FUSE      | 3 GB | 56.9 | 54 s
Download  | no-FUSE      | 3 GB | 54.3 | 56.5 s
Download  | FUSE (macOS) | 3 GB | 22.5 | 137 s
Upload    | FUSE (Linux) | 1 GB | 7.9  | 136 s

iPhone to Laptop (WiFi 5, 4 Gbps router, ARM64 phone, AES-256-GCM)

Path                   | Size   | MB/s | Time
WiFi LAN               | 612 MB | 45.3 | 12.9 s
WiFi LAN               | 1 GB   | 47.0 | 21.8 s
Bridge (via Timisoara) | 1 GB   | 42.9 | 23.9 s

What Changed Since March

Three changes account for the 442 to 662 MB/s improvement:

First, we eliminated a double-copy in SecureConn.Read(). A CPU profile showed 29% of time in runtime.memmove, not in actual encryption. An intermediate bytes.Buffer was copying every decrypted message twice. Replacing it with a direct slice reference removed one full copy per message. About 30 lines changed. (See the SecureConn post for the full analysis.)
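
To illustrate the shape of that fix, here is a minimal sketch, not the actual SecureConn code. The types bufferedReader and sliceReader and the deliver method are hypothetical names; the sketch only contrasts staging plaintext in a bytes.Buffer (two copies per message) with holding a slice reference (one copy).

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// bufferedReader mimics the slower pattern: each decrypted message is
// copied into a bytes.Buffer, then copied again into the caller's slice.
type bufferedReader struct {
	buf bytes.Buffer
}

func (r *bufferedReader) deliver(plaintext []byte)   { r.buf.Write(plaintext) }  // copy 1
func (r *bufferedReader) Read(p []byte) (int, error) { return r.buf.Read(p) }    // copy 2

// sliceReader keeps a reference to the decrypted message and copies only
// once, into the caller's slice. It assumes the previous message was fully
// consumed and that the producer does not reuse the slice in the meantime.
type sliceReader struct {
	pending []byte
}

func (r *sliceReader) deliver(plaintext []byte) { r.pending = plaintext } // no copy

func (r *sliceReader) Read(p []byte) (int, error) {
	if len(r.pending) == 0 {
		return 0, io.EOF
	}
	n := copy(p, r.pending) // the only copy
	r.pending = r.pending[n:]
	return n, nil
}

// drain reads r to EOF in fixed-size chunks and returns everything read.
func drain(r io.Reader, chunk int) []byte {
	var out []byte
	p := make([]byte, chunk)
	for {
		n, err := r.Read(p)
		out = append(out, p[:n]...)
		if err != nil {
			return out
		}
	}
}

func main() {
	msg := []byte("decrypted payload")
	a, b := &bufferedReader{}, &sliceReader{}
	a.deliver(msg)
	b.deliver(msg)
	fmt.Printf("%q %q\n", drain(a, 4), drain(b, 4))
}
```

Both readers return identical bytes; the difference is only in how many times each message is moved in memory, which is what showed up as runtime.memmove in the profile.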

Second, we increased the gRPC HTTP/2 flow control window from 64 KiB to 16 MiB and the transport I/O buffers from 32 KiB to 4 MiB. Go gRPC ships with defaults sized for small RPC messages. With 4 MiB file chunks, the 64 KiB window caused the receiver to throttle the sender on every chunk. The change is four constants in pkg/config/constants.go.
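
A rough way to see why the window size matters on the WAN path: with a fixed flow-control window, at most one window of data can be in flight per round-trip. The sketch below applies that standard bound to the 24 ms link. It assumes a fixed window with no dynamic resizing, and the function names are illustrative; the outputs are model values, not KEIBIDROP measurements.

```go
package main

import "fmt"

// windowLimitMBps is the throughput ceiling when at most windowBytes may be
// unacknowledged per round-trip (1 MB = 1,000,000 bytes).
func windowLimitMBps(windowBytes int, rttSeconds float64) float64 {
	return float64(windowBytes) / 1e6 / rttSeconds
}

// bdpBytes is the bandwidth-delay product: the bytes that must be in flight
// to keep a link of linkMbps busy across one round-trip.
func bdpBytes(linkMbps, rttSeconds float64) float64 {
	return linkMbps * 1e6 / 8 * rttSeconds
}

func main() {
	const rtt = 0.024 // 24 ms laptop-VPS round-trip
	for _, w := range []int{64 << 10, 16 << 20} {
		fmt.Printf("window %8d B: at most %.1f MB/s at 24 ms\n", w, windowLimitMBps(w, rtt))
	}
	fmt.Printf("BDP of a 500 Mbps, 24 ms path: %.0f bytes\n", bdpBytes(500, rtt))
}
```

Under these assumptions a 64 KiB window cannot fill a 500 Mbps link at 24 ms, while 16 MiB is well above the bandwidth-delay product.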

Third, we replaced the bidirectional Read RPC (one request per chunk, one round-trip per chunk) with a server-streaming StreamFile RPC. The client sends one request; the server pushes all chunks. Per-chunk RTT drops to zero. On loopback this barely matters, but on a 24 ms WAN link it is the difference between 167 MB/s theoretical and 55 MB/s actual.
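
The cost of one round-trip per chunk can be sketched with a simple model. The 167 MB/s figure matches 4 MB divided by 24 ms, which is the reading assumed here; the second function adds the chunk's own transfer time on a 500 Mbps (62.5 MB/s) wire. Function names are illustrative and the results are model outputs, not measurements.

```go
package main

import "fmt"

// rttCeilingMBps is the best case for one chunk per round-trip on an
// infinitely fast link: chunk size divided by RTT.
func rttCeilingMBps(chunkMB, rttSeconds float64) float64 {
	return chunkMB / rttSeconds
}

// perChunkMBps models request-per-chunk on a finite link: every chunk pays
// one round-trip plus its own transfer time.
func perChunkMBps(chunkMB, rttSeconds, linkMBps float64) float64 {
	return chunkMB / (rttSeconds + chunkMB/linkMBps)
}

func main() {
	// Assumed inputs: 4 MB chunks, 24 ms RTT, 500 Mbps wire.
	const chunkMB, rtt, link = 4.0, 0.024, 62.5
	fmt.Printf("RTT-only ceiling:  %.0f MB/s\n", rttCeilingMBps(chunkMB, rtt))
	fmt.Printf("request-per-chunk: %.1f MB/s\n", perChunkMBps(chunkMB, rtt, link))
	fmt.Printf("streaming ceiling: %.1f MB/s\n", link)
}
```

With streaming, the round-trip term is paid once per file rather than once per chunk, so the wire rate becomes the ceiling.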

Loopback: 662 MB/s

No network involved. The number measures software overhead only: AES-256-GCM encryption, gRPC/protobuf framing, and NVMe I/O. 662 MB/s is for 3 GB over the no-FUSE path on an Intel i7-9750H (macOS, AES-NI hardware acceleration). The same code on an i5-14600KF (Windows) measured 749 MB/s in the March round. ARM machines without AES hardware instructions will be slower.

FUSE adds 77% overhead: 152 MB/s for 1 GB. Each FUSE read dispatches a kernel-to-userspace context switch, about 300 us per call on macOS. With iosize=524288 (512 KiB), a 1 GB file requires ~2048 dispatches. We tested 4 MiB iosize and it made small files 33% slower with no gain on large files. 512 KiB is the empirical optimum for macFUSE on Intel.
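
The dispatch arithmetic can be checked with a few lines of Go. This is a back-of-the-envelope sketch using the figures quoted above (512 KiB iosize, about 300 us per dispatch); the function names are illustrative.

```go
package main

import "fmt"

// dispatches returns how many FUSE reads of ioSize bytes are needed to
// cover fileBytes, rounding up for a partial final read.
func dispatches(fileBytes, ioSize int64) int64 {
	return (fileBytes + ioSize - 1) / ioSize
}

// dispatchSeconds is the total context-switch cost for n dispatches at
// perCallMicros microseconds each.
func dispatchSeconds(n int64, perCallMicros float64) float64 {
	return float64(n) * perCallMicros / 1e6
}

func main() {
	const fileBytes = 1 << 30 // 1 GiB
	for _, io := range []int64{512 << 10, 4 << 20} {
		n := dispatches(fileBytes, io)
		fmt.Printf("iosize %7d B: %4d dispatches, ~%.2f s in context switches\n",
			io, n, dispatchSeconds(n, 300))
	}
}
```

At 512 KiB the context switches alone account for roughly 0.6 s of the 6.7 s total, so the dispatch count is one contributor to the FUSE cost rather than the whole of it.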

Laptop to VPS: 55 MB/s

Laptop in Iasi, VPS in Timisoara. 24 ms RTT. ISP cap: 500 Mbps. Connection through the bridge relay.

Upload (VPS pulls from laptop): 56.9 MB/s, 3 GB in 54 seconds. Download (laptop pulls from VPS): 54.3 MB/s, 3 GB in 56.5 seconds. The two directions are nearly symmetric and average about 55 MB/s, which is 88% of the 500 Mbps (62.5 MB/s) wire. The remaining 12% goes to encryption, gRPC framing, and one extra hop through the bridge relay.
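
The percentage of wire used follows from converting the ISP rate to megabytes per second. A small sketch of that conversion, with illustrative function names:

```go
package main

import "fmt"

// mbpsToMBps converts megabits per second to megabytes per second.
func mbpsToMBps(mbps float64) float64 {
	return mbps / 8
}

// wireUtilization is the fraction of the wire rate that a measured
// throughput (in MB/s) represents.
func wireUtilization(measuredMBps, wireMbps float64) float64 {
	return measuredMBps / mbpsToMBps(wireMbps)
}

func main() {
	fmt.Printf("500 Mbps = %.1f MB/s\n", mbpsToMBps(500))
	for _, m := range []float64{56.9, 54.3, 55} {
		fmt.Printf("%.1f MB/s uses %.0f%% of the wire\n", m, 100*wireUtilization(m, 500))
	}
}
```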

The bridge relay forwards encrypted TCP between the two peers. It cannot read file contents or metadata; it never sees the decryption keys.

FUSE Over WAN

Both peers have FUSE mounted. Reading a remote file through the mount triggers on-demand gRPC reads over the 24 ms link.

macOS laptop reading 3 GB from VPS: 22.5 MB/s. macFUSE issues 512 KiB reads and the stream pool prefetches with 4 parallel gRPC streams. Compared to the no-FUSE path on the same link (55 MB/s), FUSE adds 59% overhead.

Linux VPS reading 1 GB from laptop: 7.9 MB/s. Linux libfuse uses 128 KiB max_read by default. We tested 512 KiB max_read and measured 7.5 MB/s, effectively unchanged. The 24 ms per-chunk round-trip is the limit, not the read size. macOS is 2.8x faster on the same link because macFUSE issues larger reads and the kernel readahead is more aggressive.

The same files over no-FUSE PullFile on the same connection complete at 55 MB/s. FUSE over WAN therefore costs roughly 2.4x (macOS) to 7x (Linux) in throughput.

iPhone to Laptop: 47 MB/s

iPhone and laptop on the same WiFi (802.11ac, 4 Gbps router). Connected via mDNS LAN discovery, no relay. The laptop pulls a file from the phone using bench-pull.

612 MB: 45.3 MB/s. 1 GB: 47.0 MB/s. Throughput is consistent across both sizes. The WiFi radio is not the limit (4 Gbps capable). 47 MB/s is 376 Mbps, and we attribute that ceiling to the phone's ARM CPU doing AES-256-GCM in software. The laptop's Intel AES-NI gets 662 MB/s for the same operation.

Through the bridge relay (phone to Timisoara, Timisoara to laptop, 24 ms detour): 42.9 MB/s, 9% slower than direct WiFi. The phone's encryption CPU is the ceiling in both cases. Adding a relay in Timisoara barely matters because the phone is already saturated.

All iOS benchmarks ran with the screen on and the app in foreground. iOS throttles network I/O for backgrounded apps.

Evolution: March to May 2026

Metric                  | March    | April    | May
No-FUSE loopback (1 GB) | 442 MB/s | 623 MB/s | 662 MB/s
FUSE loopback (1 GB)    | 233 MB/s | 220 MB/s | 152 MB/s *
WAN (bridge, 24 ms RTT) | --       | --       | 55 MB/s
iOS WiFi LAN            | --       | --       | 47 MB/s
iOS bridge              | --       | --       | 43 MB/s

* The May FUSE number (152 MB/s) is lower than April (220 MB/s). Different test methodology (test suite vs manual), macOS thermal state, and background processes all affect FUSE results. The iosize=4MiB experiment confirmed no improvement is possible from tuning the mount option alone.

The no-FUSE path improved 50% from March to May. The April SecureConn fix was the largest single gain (+41%). The May HTTP/2 window tuning and StreamFile RPC added another 6%.

Where the Time Goes

Layer              | Loopback | WAN (24 ms)       | WiFi (phone)
Raw TCP (baseline) | ~20 GB/s | ~60 MB/s (wire)   | ~108 MB/s (WiFi 5 max)
+ AES-256-GCM      | 662 MB/s | 55 MB/s           | 47 MB/s
+ FUSE kernel      | 152 MB/s | 22.5 MB/s (macOS) | n/a (no FUSE on iOS)

On loopback, AES-256-GCM is the ceiling. On WAN, the ISP wire is the ceiling. On WiFi, the phone's ARM CPU is the ceiling. FUSE adds 59-86% on top of whichever layer already limits the path.
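
The overhead percentages in this post are throughput reductions relative to the no-FUSE path on the same link. A short sketch that recomputes them from the measured figures (function names are illustrative):

```go
package main

import "fmt"

// overheadPct is the throughput lost to FUSE, as a percentage of the
// no-FUSE baseline on the same path.
func overheadPct(baselineMBps, fuseMBps float64) float64 {
	return (1 - fuseMBps/baselineMBps) * 100
}

// slowdown is the same comparison expressed as a factor.
func slowdown(baselineMBps, fuseMBps float64) float64 {
	return baselineMBps / fuseMBps
}

func main() {
	cases := []struct {
		name           string
		baseline, fuse float64
	}{
		{"loopback (macOS)", 662, 152},
		{"WAN (macOS)", 55, 22.5},
		{"WAN (Linux)", 55, 7.9},
	}
	for _, c := range cases {
		fmt.Printf("%-17s %.0f%% overhead, %.1fx slower\n",
			c.name, overheadPct(c.baseline, c.fuse), slowdown(c.baseline, c.fuse))
	}
}
```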

gRPC Configuration

BlockSize       = 4 MiB       // per-chunk size for StreamFile
GRPCMaxMsgSize  = 20 MiB      // max protobuf message
GRPCStreamBuffer = 16 MiB     // streaming read buffer
GRPCWindowSize  = 16 MiB      // HTTP/2 flow control window
GRPCIOBufferSize = 4 MiB      // transport read/write buffers
macOS iosize    = 512 KiB     // FUSE read size (sweet spot)
Linux max_read  = 128 KiB     // FUSE read size (libfuse default)

Go gRPC ships with 64 KiB windows and 32 KiB I/O buffers. These are sized for small RPC messages and throttle bulk file transfers. The four constants above are the result of the tuning commit. The macOS iosize was validated empirically at 512 KiB; 4 MiB performed worse on small files.
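
As a sketch of how such values might be grouped and sanity-checked in Go: the struct, field names, and validate rule below are hypothetical and do not reproduce pkg/config/constants.go. In grpc-go, sizes like these are typically supplied through options such as grpc.InitialWindowSize, grpc.InitialConnWindowSize, grpc.ReadBufferSize, grpc.WriteBufferSize, and grpc.MaxRecvMsgSize; whether KEIBIDROP wires them exactly that way is an assumption.

```go
package main

import "fmt"

const (
	KiB = 1 << 10
	MiB = 1 << 20
)

// transferConfig groups the sizes from the listing above.
// Field names are hypothetical.
type transferConfig struct {
	BlockSize    int // per-chunk size for StreamFile
	MaxMsgSize   int // max protobuf message
	WindowSize   int // HTTP/2 flow control window
	IOBufferSize int // transport read/write buffers
}

// tuned returns the values reported in this post.
func tuned() transferConfig {
	return transferConfig{
		BlockSize:    4 * MiB,
		MaxMsgSize:   20 * MiB,
		WindowSize:   16 * MiB,
		IOBufferSize: 4 * MiB,
	}
}

// validate flags configurations where one chunk cannot fit in a single
// message or in one flow-control window. The window rule is a tuning
// heuristic chosen for this sketch, not a gRPC requirement.
func (c transferConfig) validate() error {
	if c.BlockSize > c.MaxMsgSize {
		return fmt.Errorf("block size %d exceeds max message size %d", c.BlockSize, c.MaxMsgSize)
	}
	if c.BlockSize > c.WindowSize {
		return fmt.Errorf("block size %d exceeds window %d; sender will stall mid-chunk", c.BlockSize, c.WindowSize)
	}
	return nil
}

func main() {
	fmt.Println("tuned:", tuned().validate())
	old := tuned()
	old.WindowSize, old.IOBufferSize = 64*KiB, 32*KiB // the defaults cited above
	fmt.Println("defaults:", old.validate())
}
```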

Benchmark data: docs/benchmarks/2026-05-22-kd-benchmark-matrix.md | gRPC frame size tuning: commit 2d7225d
