Cutting 29% CPU from the Encrypted Transport

Double copies, buffer pools, and where the time actually goes

8 min read | KEIBIDROP Series

KEIBIDROP encrypts every byte that crosses the wire. Every gRPC message, every file chunk, every heartbeat passes through SecureConn, our AEAD encryption wrapper over raw TCP. When we profiled a 1 GB transfer, we found that 29% of CPU time was spent on memmove inside the encryption layer. Not on actual encryption. On copying bytes between buffers that didn't need to exist.

The Profile

Andrei ran a CPU profile on a 1 GB no-FUSE PullFile transfer on Linux (Intel i5, AES-NI):

flat   flat%   function
1.74s  30.4%   syscall.Syscall6        (TCP send/recv)
1.67s  29.2%   runtime.memmove         (memory copies)
0.82s  14.3%   aes/gcm.gcmAesEnc       (AES-GCM encrypt)
0.42s   7.3%   aes/gcm.gcmAesDec       (AES-GCM decrypt)

Encryption itself was only 21.6% of CPU. Memory copies were 29.2%. For a 1 GB transfer, the process allocated 3.85 GB of heap memory. Something was wrong.

The Double-Copy

The culprit was SecureConn.Read() in pkg/session/secureconn.go:

// Before: two copies per message
func (s *SecureConn) Read(p []byte) (int, error) {
    msg, err := s.r.Read()       // decrypt into fresh []byte
    if err != nil {
        return 0, err
    }
    s.readBuf.Write(msg)         // copy 1: plaintext -> bytes.Buffer
    n, _ := s.readBuf.Read(p)    // copy 2: bytes.Buffer -> caller's buffer
    return n, nil
}

Every decrypted message (up to 4 MiB) was copied twice through heap-allocated buffers. For 1 GB of data split into 1 MiB chunks, that's 2 GB of unnecessary memmove.

The bytes.Buffer existed to handle partial reads: when the caller's buffer p is smaller than the decrypted message, the leftover has to go somewhere. But bytes.Buffer copies the entire payload on every Write, which makes it the wrong tool for merely parking a remainder.

The Fix: Direct Slice Handoff

Replace bytes.Buffer with a simple leftover []byte field:

// After: zero extra copies
func (s *SecureConn) Read(p []byte) (int, error) {
    if len(s.leftover) > 0 {
        n := copy(p, s.leftover)
        s.leftover = s.leftover[n:]
        if len(s.leftover) == 0 {
            s.leftover = nil
        }
        return n, nil
    }
    msg, err := s.r.Read()    // decrypt into fresh []byte
    if err != nil {
        return 0, err
    }
    n := copy(p, msg)         // single copy into caller's buffer
    if n < len(msg) {
        s.leftover = msg[n:]  // keep remainder, no copy
    }
    return n, nil
}

The decrypted msg slice is used directly. If the caller can't consume it all, we keep a reference to the remainder. No intermediate buffer, no extra allocation.

The Pool: Eliminating Allocation Churn

The second fix targets SecureWriter.Write(), which allocated a fresh buffer for every outgoing message:

// Before: allocate per message
buf := make([]byte, lengthHeaderSize+encSize)

With 4 MiB chunks and a 1 GB file, that's 256 allocations of ~4 MiB each. Each buffer becomes garbage the moment the TCP write returns, since the kernel has already copied the data out. Under production load with the default GOGC, this churn forces frequent GC cycles and their attendant stop-the-world phases.

// After: pool reuse
var encBufPool = sync.Pool{}

// In Write():
if poolBuf, ok := encBufPool.Get().([]byte); ok && cap(poolBuf) >= totalSize {
    buf = poolBuf[:totalSize]
} else {
    buf = make([]byte, totalSize)
}
// ... encrypt into buf, write to TCP ...
defer encBufPool.Put(buf[:0])

The pool is safe here because net.Conn.Write copies data into kernel space before returning. The buffer won't be read again after Write completes.
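One common refinement, sketched here rather than taken from the PR: putting a bare []byte into a sync.Pool boxes the slice header into an interface, allocating on every Put (staticcheck flags this as SA6002). Pooling a *[]byte avoids it. getBuf/putBuf are hypothetical helpers, and the 4 MiB default matches the chunk size above:

```go
package main

import (
	"fmt"
	"sync"
)

// Pooling *[]byte instead of []byte avoids allocating a new slice
// header (interface box) on every Put.
var encBufPool = sync.Pool{
	New: func() any { b := make([]byte, 0, 4<<20); return &b },
}

// getBuf returns a buffer of length n, growing the pooled backing
// array only when it is too small.
func getBuf(n int) *[]byte {
	p := encBufPool.Get().(*[]byte)
	if cap(*p) < n {
		*p = make([]byte, n)
	}
	*p = (*p)[:n]
	return p
}

// putBuf resets the length and returns the buffer for reuse.
func putBuf(p *[]byte) {
	*p = (*p)[:0]
	encBufPool.Put(p)
}

func main() {
	p := getBuf(1024)
	copy(*p, "frame header + ciphertext would go here")
	fmt.Println(len(*p), cap(*p) >= 1024) // prints "1024 true"
	putBuf(p)
}
```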

Results

Measured on Intel Core i7-9750H @ 2.60 GHz, macOS, AES-NI:

SecureConn Micro-Benchmark (encryption layer only)

Block size                    Throughput
1 MiB                         2,097 MB/s
4 MiB                         1,850 MB/s
Raw net.Pipe (no crypto)      20,617 MB/s
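The raw net.Pipe baseline can be reproduced with a harness along these lines. Sizes are shrunk here so it runs in milliseconds; a real benchmark would use testing.B and much larger totals:

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// pipeThroughput measures how fast bytes move through an in-memory
// net.Pipe with no crypto: the ceiling every encrypted layer is
// compared against. Closing the writer's end makes the reader see EOF.
func pipeThroughput(blockSize, total int) float64 {
	c1, c2 := net.Pipe()
	go func() {
		defer c1.Close()
		buf := make([]byte, blockSize)
		for sent := 0; sent < total; sent += blockSize {
			c1.Write(buf)
		}
	}()
	start := time.Now()
	n, _ := io.Copy(io.Discard, c2)
	secs := time.Since(start).Seconds()
	return float64(n) / secs / 1e6 // MB/s
}

func main() {
	mbps := pipeThroughput(1<<20, 64<<20) // 1 MiB blocks, 64 MiB total
	fmt.Printf("raw net.Pipe: %.0f MB/s\n", mbps)
}
```

Absolute numbers vary by machine; what matters is measuring the same pipe both with and without the encryption wrapper.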

Full Pipeline (encrypted gRPC, loopback)

Size      MB/s
10 MB     432
100 MB    530
1 GB      525

No-FUSE PullFile (file download, loopback)

Size      MB/s
10 MB     332
100 MB    647
1 GB      623

FUSE End-to-End (write -> peer pulls, loopback)

Size      MB/s
100 MB    233
1 GB      220

On the Linux benchmark machine (Intel i5), the full PullFile pipeline previously ran at 430 MB/s against a SecureConn ceiling of 861 MB/s. The double-copy removal and buffer pooling raise that ceiling substantially.

Overhead Breakdown

For a 1 GB encrypted transfer through gRPC, the overhead stacks like this:

Layer                        Cost
Raw TCP (net.Pipe)           20,617 MB/s (baseline)
+ AES-256-GCM encryption     2,097 MB/s  (~10x overhead from crypto)
+ gRPC framing + protobuf    525 MB/s    (4x overhead from protocol)
+ FUSE kernel boundary       220 MB/s    (2.4x overhead from VFS)

Encryption is the dominant cost, followed by gRPC/protobuf marshaling. FUSE adds another 2.4x on top. For WAN transfers, none of this matters. The ISP bandwidth is the bottleneck, not CPU.
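The per-layer multipliers follow directly from dividing adjacent layers' measured throughput:

```go
package main

import "fmt"

// Each layer's slowdown is the ratio of the previous layer's measured
// throughput to its own (numbers from the tables above, in MB/s).
func main() {
	layers := []struct {
		name string
		mbps float64
	}{
		{"raw net.Pipe", 20617},
		{"+ AES-256-GCM", 2097},
		{"+ gRPC framing + protobuf", 525},
		{"+ FUSE kernel boundary", 220},
	}
	for i := 1; i < len(layers); i++ {
		ratio := layers[i-1].mbps / layers[i].mbps
		fmt.Printf("%-26s %.1fx slower than the layer above\n",
			layers[i].name, ratio)
	}
	// 9.8x for crypto, 4.0x for gRPC/protobuf, 2.4x for FUSE
}
```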

What We Didn't Change

PR #106 also tested whether removing the pullStreamFile() server-push RPC in favor of pullParallelRead() caused a regression. The comparison used a Latin-square interleaved schedule with 4 repetitions to control for thermal throttling and background load, following Georges et al., "Statistically Rigorous Java Performance Evaluation" (OOPSLA 2007). The verdict:

The pull strategy change is not the bottleneck. The parallel read approach stays.
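The interleaving itself is simple to generate: a cyclic Latin square guarantees each configuration runs once per round and once in each position, so slow thermal drift hits every configuration equally. This generator is illustrative, not the PR's actual harness:

```go
package main

import "fmt"

// latinSquare returns a k x k cyclic Latin square: row i is the list
// of configurations rotated by i, so every configuration appears
// exactly once per row (round) and once per column (position).
func latinSquare(configs []string) [][]string {
	k := len(configs)
	rounds := make([][]string, k)
	for i := range rounds {
		rounds[i] = make([]string, k)
		for j := range rounds[i] {
			rounds[i][j] = configs[(i+j)%k]
		}
	}
	return rounds
}

func main() {
	configs := []string{"streamFile", "parallelRead"}
	// Run each round's benchmarks in order; repeat the square for
	// however many repetitions the experiment calls for (4 in the PR).
	for _, round := range latinSquare(configs) {
		fmt.Println(round)
	}
}
```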

Lines Changed

File             Change                                                     Lines
secureconn.go    Replace bytes.Buffer with leftover []byte, add sync.Pool   ~30
constants.go     BlockSize 1 MiB -> 4 MiB                                   1

Three targeted changes, no API changes, no protocol changes. The wire format is identical.

Andrei Cristian found the double-copy, wrote the fix, built the benchmark infrastructure, and profiled it on Linux.
