KeibiDrop Benchmark Matrix: Loopback, WAN, WiFi, and iOS

In March we measured 442 MB/s on loopback. After fixing a double-copy in the encryption layer and increasing the gRPC HTTP/2 window from 64 KiB to 16 MiB, we now measure 662 MB/s on the same laptop. This post reports the full matrix: loopback, laptop-to-VPS (24 ms, 500 Mbps wire), iPhone-to-laptop over WiFi, iPhone through bridge relay, and FUSE reads over WAN. All traffic encrypted with AES-256-GCM.

Lab Setup

Laptop	Intel i7-9750H @ 2.60 GHz (AES-NI), 32 GB RAM, macOS, NVMe SSD
VPS	Ubuntu 24.04, x86_64, Timisoara, Romania
iPhone	iOS, ARM64, same WiFi as laptop (Iasi, Romania)
RTT laptop-VPS	24 ms (Iasi to Timisoara)
ISP wire	500 Mbps
WiFi router	4 Gbps (WiFi 5, 802.11ac)
Cipher	AES-256-GCM (hardware-accelerated, auto-negotiated)
Tool	`kd bench-pull` (agentic CLI) + test suite for FUSE loopback

Methodology: Pre-allocated random files (1-3 GB). Each benchmark runs alone, no concurrent traffic. bench-pull times the full PullFile call, reports MB/s, and deletes the local copy. FUSE benchmarks use dd reading through the mount. Files are pre-generated and pre-shared before timing starts.

Results

Loopback (Intel i7-9750H, macOS, AES-NI, NVMe SSD)

Mode	Size	MB/s	Time
no-FUSE	3 GB	662	4.6 s
FUSE	1 MB	89	11 ms
FUSE	10 MB	136	73 ms
FUSE	100 MB	200	500 ms
FUSE	1 GB	152	6.7 s

Laptop to VPS (Iasi to Timisoara, 24 ms RTT, 500 Mbps wire, bridge relay)

Direction	Mode	Size	MB/s	Time
Upload	no-FUSE	3 GB	56.9	54 s
Download	no-FUSE	3 GB	54.3	56.5 s
Download	FUSE (macOS)	3 GB	22.5	137 s
Upload	FUSE (Linux)	1 GB	7.9	136 s

iPhone to Laptop (WiFi 5, 4 Gbps router, ARM64 phone, AES-256-GCM)

Path	Size	MB/s	Time
WiFi LAN	612 MB	45.3	12.9 s
WiFi LAN	1 GB	47.0	21.8 s
Bridge (via Timisoara)	1 GB	42.9	23.9 s

What Changed Since March

Three changes account for the 442 to 662 MB/s improvement:

First, we eliminated a double-copy in SecureConn.Read(). A CPU profile showed 29% of time in runtime.memmove, not in actual encryption. An intermediate bytes.Buffer was copying every decrypted message twice. Replacing it with a direct slice reference removed one full copy per message. About 30 lines changed. (See the SecureConn post for the full analysis.)

Second, we increased the gRPC HTTP/2 flow control window from 64 KiB to 16 MiB and the transport I/O buffers from 32 KiB to 4 MiB. Go gRPC ships with defaults sized for small RPC messages. With 4 MiB file chunks, the 64 KiB window caused the receiver to throttle the sender on every chunk. The change is four constants in pkg/config/constants.go.

Third, we replaced the bidirectional Read RPC (one request per chunk, one round-trip per chunk) with a server-streaming StreamFile RPC. The client sends one request; the server pushes all chunks. Per-chunk RTT drops to zero. On loopback this barely matters, but on a 24 ms WAN link it is the difference between 167 MB/s theoretical and 55 MB/s actual.

Loopback: 662 MB/s

No network involved. The number measures software overhead only: AES-256-GCM encryption, gRPC/protobuf framing, and NVMe I/O. 662 MB/s is for 3 GB over the no-FUSE path on an Intel i7-9750H (macOS, AES-NI hardware acceleration). The same code on an i5-14600KF (Windows) measured 749 MB/s in the March round. ARM machines without AES hardware instructions will be slower.

FUSE cuts throughput by 77%: 152 MB/s for 1 GB, against 662 MB/s without it. Each FUSE read dispatches a kernel-to-userspace context switch, about 300 us per call on macOS. With iosize=524288 (512 KiB), a 1 GB file requires 2048 dispatches. We tested 4 MiB iosize and it made small files 33% slower with no gain on large files. 512 KiB was the best setting we measured for macFUSE on Intel.

Laptop to VPS: 55 MB/s

Laptop in Iasi, VPS in Timisoara. 24 ms RTT. ISP cap: 500 Mbps. Connection through the bridge relay.

Upload (VPS pulls from laptop): 56.9 MB/s, 3 GB in 54 seconds. Download (laptop pulls from VPS): 54.3 MB/s, 3 GB in 56.5 seconds. The two directions are nearly symmetric and reach about 88% of the 500 Mbps wire (62.5 MB/s). The remaining 12% goes to encryption, gRPC framing, and one extra hop through the bridge relay.
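The wire-utilization figure is a unit conversion followed by a ratio. A minimal sketch, with hypothetical helper names and decimal megabytes assumed:

```go
package main

import "fmt"

// mbpsToMBps converts a line rate in megabits per second to megabytes per second.
func mbpsToMBps(mbps float64) float64 { return mbps / 8 }

// utilization is the fraction of the wire rate that a measured throughput uses.
func utilization(measuredMBps, wireMbps float64) float64 {
	return measuredMBps / mbpsToMBps(wireMbps)
}

func main() {
	for _, m := range []float64{56.9, 54.3, 55.0} {
		fmt.Printf("%.1f MB/s is %.0f%% of a 500 Mbps wire\n", m, 100*utilization(m, 500))
	}
}
```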

The bridge relay forwards encrypted TCP between the two peers. It cannot read file contents or metadata; it never sees the decryption keys.

FUSE Over WAN

Both peers have FUSE mounted. Reading a remote file through the mount triggers on-demand gRPC reads over the 24 ms link.

macOS laptop reading 3 GB from VPS: 22.5 MB/s. macFUSE issues 512 KiB reads and the stream pool prefetches with 4 parallel gRPC streams. Compared to the no-FUSE path on the same link (55 MB/s), FUSE cuts throughput by 59%.

Linux VPS reading 1 GB from laptop: 7.9 MB/s. Linux libfuse uses 128 KiB max_read by default. We tested 512 KiB max_read and measured essentially the same throughput, 7.5 MB/s. The 24 ms per-chunk round-trip is the limit, not the read size. macOS is 2.8x faster on the same link because macFUSE issues larger reads and the kernel readahead is more aggressive.

The same files over no-FUSE PullFile on the same connection transfer at 55 MB/s. FUSE over WAN costs roughly 2.4x on macOS and 7x on Linux.

iPhone to Laptop: 47 MB/s

iPhone and laptop on the same WiFi (802.11ac, 4 Gbps router). Connected via mDNS LAN discovery, no relay. The laptop pulls a file from the phone using bench-pull.

612 MB: 45.3 MB/s. 1 GB: 47.0 MB/s. Consistent across both sizes. The WiFi link is not the limit: the router is rated at 4 Gbps, and even the ~108 MB/s WiFi 5 maximum is more than double what we measure. 47 MB/s is 376 Mbps, a ceiling we attribute to the phone's ARM CPU doing AES-256-GCM in software. The laptop's Intel AES-NI gets 662 MB/s for the same operation.

Through the bridge relay (phone to Timisoara, Timisoara to laptop, 24 ms detour): 42.9 MB/s, 9% slower than direct WiFi. The phone's encryption CPU is the ceiling in both cases. Adding a relay in Timisoara barely matters because the phone is already saturated.

All iOS benchmarks ran with the screen on and the app in foreground. iOS throttles network I/O for backgrounded apps.

Evolution: March to May 2026

Metric	March	April	May
No-FUSE loopback (1 GB)	442 MB/s	623 MB/s	662 MB/s
FUSE loopback (1 GB)	233 MB/s	220 MB/s	152 MB/s *
WAN (bridge, 24 ms RTT)	--	--	55 MB/s
iOS WiFi LAN	--	--	47 MB/s
iOS bridge	--	--	43 MB/s

* The May FUSE number (152 MB/s) is lower than April (220 MB/s). Different test methodology (test suite vs manual), macOS thermal state, and background processes all affect FUSE results. The iosize=4 MiB experiment showed no gain from raising that mount option, so the drop is unlikely to be recovered by iosize tuning alone.

The no-FUSE path improved 50% from March to May. The April SecureConn fix was the largest single gain (+41%). The May HTTP/2 window tuning and StreamFile RPC added another 6%.

Where the Time Goes

Layer	Loopback	WAN (24 ms)	WiFi (phone)
Raw TCP (baseline)	~20 GB/s	~60 MB/s (wire)	~108 MB/s (WiFi 5 max)
+ AES-256-GCM	662 MB/s	55 MB/s	47 MB/s
+ FUSE kernel	152 MB/s	22.5 MB/s (macOS)	n/a (no FUSE on iOS)

On loopback, AES-256-GCM is the ceiling. On WAN, the ISP wire is the ceiling. On WiFi, the phone's ARM CPU is the ceiling. FUSE then cuts throughput by a further 59-86% relative to whichever layer already limits the path.

gRPC Configuration

BlockSize        = 4 MiB      // per-chunk size for StreamFile
GRPCMaxMsgSize   = 20 MiB     // max protobuf message
GRPCStreamBuffer = 16 MiB     // streaming read buffer
GRPCWindowSize   = 16 MiB     // HTTP/2 flow control window
GRPCIOBufferSize = 4 MiB      // transport read/write buffers
macOS iosize     = 512 KiB    // FUSE read size (sweet spot)
Linux max_read   = 128 KiB    // FUSE read size (libfuse default)

Go gRPC ships with 64 KiB windows and 32 KiB I/O buffers. These are sized for small RPC messages and throttle bulk file transfers. The four constants above are the result of the tuning commit. The macOS iosize was validated empirically at 512 KiB; 4 MiB performed worse on small files.
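Expressed as Go constants, the gRPC values might look like the sketch below. The constant names and values come from the listing above; everything else is an assumption. In particular, `checkSizes` is a hypothetical helper, and the relationships it checks (a block fits in one message, the window holds at least one block) are sanity checks chosen for this illustration, not rules taken from the KeibiDrop source. In grpc-go, sizes like these are typically passed through options such as `grpc.WithInitialWindowSize`, `grpc.WithReadBufferSize`, `grpc.WithWriteBufferSize`, and `grpc.MaxCallRecvMsgSize`; whether KeibiDrop wires them exactly that way is not stated here.

```go
package main

import "fmt"

const (
	MiB = 1 << 20

	BlockSize        = 4 * MiB  // per-chunk size for StreamFile
	GRPCMaxMsgSize   = 20 * MiB // max protobuf message
	GRPCStreamBuffer = 16 * MiB // streaming read buffer
	GRPCWindowSize   = 16 * MiB // HTTP/2 flow control window
	GRPCIOBufferSize = 4 * MiB  // transport read/write buffers
)

// checkSizes verifies two relationships this sketch assumes are desirable:
// a block must fit inside one message, and the flow-control window should
// hold at least one whole block.
func checkSizes(block, maxMsg, window int) error {
	if block > maxMsg {
		return fmt.Errorf("block %d exceeds max message size %d", block, maxMsg)
	}
	if window < block {
		return fmt.Errorf("window %d is smaller than one block %d", window, block)
	}
	return nil
}

func main() {
	if err := checkSizes(BlockSize, GRPCMaxMsgSize, GRPCWindowSize); err != nil {
		panic(err)
	}
	fmt.Println("sizes are mutually consistent")
}
```

Running the same check with the 64 KiB default window fails the second condition, which is the situation the tuning addressed.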

Benchmark data: docs/benchmarks/2026-05-22-kd-benchmark-matrix.md | gRPC frame size tuning: commit 2d7225d