Optimizing Encrypted P2P Transfer: From 225 to 441 MB/s

Cipher caching, combined writes, push-based streaming, and the irreducible 48%

8 min read | KEIBIDROP Series

Six changes took encrypted gRPC from 437 to 441 MB/s and FUSE end-to-end from 225 to 231 MB/s. The encryption tax is 1.49x (33%), and the remaining 48% gap is FUSE kernel overhead that cannot be optimized away.

The Setup

KEIBIDROP transfers files between two peers over encrypted gRPC. The full stack looks like this:

Disk I/O -> FUSE kernel -> FUSE daemon -> gRPC framing -> ChaCha20-Poly1305 -> TCP -> Peer

To understand where time goes, we built micro-benchmarks for each layer and measured throughput with 1GB files on an Intel MacBook Pro.

Baseline Numbers

Layer                      Throughput   Overhead
Raw disk (SSD)             ~5 GB/s      --
Raw gRPC (no encryption)   981 MB/s     5x vs disk
Encrypted gRPC (ChaCha20)  437 MB/s     2.2x vs raw gRPC
FUSE end-to-end            225 MB/s     1.9x vs encrypted gRPC

The encryption layer costs 2.2x. FUSE adds another 1.9x.

Optimization 1: Cache the AEAD Cipher

Our SecureConn wraps every TCP connection with ChaCha20-Poly1305 encryption. The original code created a new AEAD cipher for every single message:

// Before: creating cipher per-message
func (s *SecureWriter) Write(p []byte) (int, error) {
    aead, _ := chacha20poly1305.NewX(s.key)  // expensive!
    nonce := s.nextNonce()
    ciphertext := aead.Seal(nil, nonce, p, nil)
    // ...
}

The fix is obvious -- cache the cipher:

// After: cipher created once in constructor
type SecureWriter struct {
    aead cipher.AEAD  // created once
    // ...
}

func NewSecureWriter(w io.Writer, key []byte) (*SecureWriter, error) {
    aead, err := chacha20poly1305.NewX(key)
    // ...
    return &SecureWriter{aead: aead, ...}, nil
}

This is safe because the nonce is a monotonic counter -- each message gets a unique nonce even with a cached cipher. During rekey, a new SecureWriter is created with the new key and a fresh cipher instance.

Optimization 2: Single Combined TCP Write

The original code made two separate TCP writes per message -- one for the length header, one for the encrypted payload:

// Before: two syscalls per message
binary.Write(s.w, binary.BigEndian, uint32(len(ciphertext)+nonceSize))
s.w.Write(append(nonce, ciphertext...))

We combined everything into a single write:

// After: one allocation, one syscall
buf := make([]byte, lengthHeaderSize+nonceSize+len(ciphertext))
binary.BigEndian.PutUint32(buf[:lengthHeaderSize], uint32(nonceSize+len(ciphertext)))
copy(buf[lengthHeaderSize:], nonce)
copy(buf[lengthHeaderSize+nonceSize:], ciphertext)
s.w.Write(buf)

One syscall instead of two. At 512KB gRPC chunks, this cuts the number of write() syscalls in half.

Optimization 3: In-Place Decryption

The reader was allocating a new buffer for every decryption:

// Before: allocates new buffer
plaintext, err := aead.Open(nil, nonce, ciphertext, nil)

ChaCha20-Poly1305 supports in-place decryption -- the plaintext can overwrite the ciphertext buffer:

// After: reuses existing buffer
plaintext, err := s.aead.Open(ciphertext[:0], nonce, ciphertext, nil)

This eliminates one allocation per message. For a 1GB transfer with 512KB chunks, that's ~2000 fewer allocations.

Optimization 4: Async Cache Writes in FUSE Read

When a FUSE Read fetches data from the remote peer, it needs to:

  1. Return the data to the kernel (FUSE response)
  2. Write the data to the local cache file
  3. Update the chunk bitmap

The original code did all three synchronously. But FUSE only needs step 1 to complete -- the cache write can happen in the background:

// Copy data into FUSE buffer (this is what the kernel needs)
n := copy(buff, data)

// Cache write happens async -- FUSE returns immediately
if cacheFD != nil {
    f.CacheWg.Add(1)
    go func() {
        defer f.CacheWg.Done()
        cacheFD.WriteAt(cacheData, cacheOffset)
        bitmap.SetRange(cacheOffset, len(cacheData))
    }()
}
return n

The CacheWg (WaitGroup) ensures all writes complete before Release() closes the file. This dropped cache write overhead from 16% to 1.8%.

Optimization 5: Push-Based StreamFile RPC

Our original file streaming used a request-response pattern: the client requests each chunk, the server sends it back. For sequential prefetch, this means one round-trip per chunk.

We added a server-streaming RPC where the server pushes all chunks without waiting for requests:

rpc StreamFile (StreamFileRequest) returns (stream StreamFileResponse);

The server reads the file sequentially and streams chunks as fast as the connection allows. The client just receives. For prefetch (background download of entire files), this eliminates all per-chunk round-trip latency.

On-demand reads (random access from FUSE) still use the old bidirectional stream -- they need to specify offset and size.

Optimization 6: Silencing Hot-Path Logs

We had ~120 slog.Info/Debug/Warn calls in the FUSE handler code. Even structured logging has overhead -- allocating the log record, formatting fields, checking log level. On a hot path called thousands of times per second, it adds up.

Commenting out non-error logs in the Read/Write/Getattr handlers gave a measurable 2-6% throughput improvement. We kept all logger.Error calls since those don't fire on the happy path.

After Optimizations

Layer            Before     After      Gain
Encrypted gRPC   437 MB/s   441 MB/s   +0.9%
FUSE end-to-end  225 MB/s   231 MB/s   +2.7%

The Irreducible 48%

The gap between encrypted gRPC (441 MB/s) and FUSE end-to-end (231 MB/s) is 48%. Where does it go?

Every FUSE Read involves:

  1. Kernel receives read request from userspace app
  2. Kernel sends request to FUSE daemon (context switch)
  3. FUSE daemon processes request
  4. FUSE daemon sends response to kernel (context switch)
  5. Kernel copies data to userspace app

Steps 2 and 4 are kernel-to-userspace context switches. They're inherent to how FUSE works -- the kernel can't encrypt data or talk to network peers, so it must delegate to our daemon.

We confirmed this by measuring raw FUSE overhead (local file read through FUSE vs direct read): ~48% overhead, matching our encrypted transfer gap.

This is the cost of userspace filesystems. It's why production systems that need maximum throughput use in-kernel filesystem clients (like the kernel's NFS or CIFS implementations) instead of FUSE. For our use case -- secure P2P file sharing -- the tradeoff is acceptable. And the FUSE overhead (48%) is the dominant cost, nearly twice the encryption tax (33%).

The Encryption Tax

We built a micro-benchmark that compares identical gRPC transfers over plain TCP vs ChaCha20-Poly1305 encrypted connections:

Transport                      MB/s   Duration (100MB)
Plain TCP + gRPC               657    152ms
Encrypted (ChaCha20-Poly1305)  441    227ms

Encryption overhead: 1.49x. ChaCha20-Poly1305 costs 33% of transfer time. This is reasonable for an AEAD construction -- a stream cipher plus a Poly1305 authenticator -- running over every 512KB chunk.

Our earlier estimate of 2.2x was comparing across different test runs with different system load. The controlled benchmark shows the real cost is much lower.

The full overhead breakdown for FUSE end-to-end (231 MB/s):

Plain gRPC:     657 MB/s  (ceiling)
  -33% encryption
Encrypted gRPC: 441 MB/s
  -48% FUSE kernel overhead
FUSE E2E:       231 MB/s

The FUSE kernel overhead (48%) is nearly twice the encryption cost (33%). If we wanted to improve throughput, reducing FUSE transitions would have more impact than switching ciphers.

What We Learned

  1. Without isolating each layer, we would have optimized the wrong thing. The biggest win (async cache writes, 16% to 1.8%) was only visible when we measured cache write overhead independently.
  2. Caching the AEAD cipher and in-place decryption are tiny code changes with measurable impact. At ~2000 messages per 1GB transfer, every allocation counts.
  3. Structured logging frameworks are not free. On hot paths, even checking the log level has overhead. We saw 2-6% throughput gain from commenting out non-error logs.
  4. FUSE has an inherent 48% overhead. No amount of optimization in our code can fix kernel-to-userspace context switches. Knowing this prevents wasted effort.
  5. Server-streaming eliminates round-trip latency for prefetch. Request-response is still needed for random access where you need to specify offset and size.
  6. Our initial 2.2x encryption estimate was wrong -- it compared numbers from different runs. A controlled A/B benchmark (same test, same system, plain vs encrypted) showed the real cost is 1.49x. Always compare apples to apples.
