SecureConn is our AEAD encryption wrapper over raw TCP. When we profiled a 1 GB transfer, we found that 29% of CPU time was spent on memmove inside the encryption layer. Not on actual encryption. On copying bytes between buffers that didn't need to exist.
The Profile
Andrei ran a CPU profile on a 1 GB no-FUSE PullFile transfer on Linux (Intel i5, AES-NI):
| flat | flat% | function |
|---|---|---|
| 1.74s | 30.4% | syscall.Syscall6 (TCP send/recv) |
| 1.67s | 29.2% | runtime.memmove (memory copies) |
| 0.82s | 14.3% | aes/gcm.gcmAesEnc (AES-GCM encrypt) |
| 0.42s | 7.3% | aes/gcm.gcmAesDec (AES-GCM decrypt) |
Encryption itself was only 21.6% of CPU. Memory copies were 29.2%. For a 1 GB transfer, the process allocated 3.85 GB of heap memory. Something was wrong.
The Double-Copy
The culprit was SecureConn.Read() in pkg/session/secureconn.go:
```go
// Before: two copies per message
func (s *SecureConn) Read(p []byte) (int, error) {
	msg, err := s.r.Read() // decrypt into fresh []byte
	if err != nil {
		return 0, err
	}
	s.readBuf.Write(msg)      // copy 1: plaintext -> bytes.Buffer
	n, _ := s.readBuf.Read(p) // copy 2: bytes.Buffer -> caller's buffer
	return n, nil
}
```
Every decrypted message (up to 4 MiB) was copied twice through heap-allocated buffers. For 1 GB of data split into 1 MiB chunks, that's 2 GB of unnecessary memmove.
The bytes.Buffer was there to handle partial reads. When the caller's buffer p is smaller than the decrypted message, the leftover needs to go somewhere. But bytes.Buffer does a full copy on Write, which is the wrong tool for this.
The Fix: Direct Slice Handoff
Replace bytes.Buffer with a simple leftover []byte field:
```go
// After: zero extra copies
func (s *SecureConn) Read(p []byte) (int, error) {
	if len(s.leftover) > 0 {
		n := copy(p, s.leftover)
		s.leftover = s.leftover[n:]
		if len(s.leftover) == 0 {
			s.leftover = nil // release the backing array for GC
		}
		return n, nil
	}
	msg, err := s.r.Read() // decrypt into fresh []byte
	if err != nil {
		return 0, err
	}
	n := copy(p, msg) // single copy into caller's buffer
	if n < len(msg) {
		s.leftover = msg[n:] // keep remainder, no copy
	}
	return n, nil
}
```
The decrypted msg slice is used directly. If the caller can't consume it all, we keep a reference to the remainder. No intermediate buffer, no extra allocation.
The Pool: Eliminating Allocation Churn
The second fix targets SecureWriter.Write(), which allocated a fresh buffer for every outgoing message:
```go
// Before: allocate per message
buf := make([]byte, lengthHeaderSize+encSize)
```
With 4 MiB chunks and a 1 GB file, that's 256 allocations of ~4 MiB each. These buffers become garbage immediately after the TCP write completes (the kernel copies the data), so under production load with the default GOGC this churn forces frequent GC cycles and the pauses that come with them.
```go
// After: pool reuse
var encBufPool = sync.Pool{}

// In Write():
if poolBuf, ok := encBufPool.Get().([]byte); ok && cap(poolBuf) >= totalSize {
	buf = poolBuf[:totalSize]
} else {
	buf = make([]byte, totalSize)
}
// ... encrypt into buf, write to TCP ...
defer encBufPool.Put(buf[:0])
```
The pool is safe here because net.Conn.Write copies data into kernel space before returning. The buffer won't be read again after Write completes.
Results
Measured on Intel Core i7-9750H @ 2.60 GHz, macOS, AES-NI:
SecureConn Micro-Benchmark (encryption layer only)
| Block Size | Throughput |
|---|---|
| 1 MiB | 2,097 MB/s |
| 4 MiB | 1,850 MB/s |
| Raw net.Pipe (no crypto) | 20,617 MB/s |
Full Pipeline (encrypted gRPC, loopback)
| Size | MB/s |
|---|---|
| 10 MB | 432 |
| 100 MB | 530 |
| 1 GB | 525 |
No-FUSE PullFile (file download, loopback)
| Size | MB/s |
|---|---|
| 10 MB | 332 |
| 100 MB | 647 |
| 1 GB | 623 |
FUSE End-to-End (write -> peer pulls, loopback)
| Size | MB/s |
|---|---|
| 100 MB | 233 |
| 1 GB | 220 |
On the Linux benchmark machine (Intel i5), the full PullFile pipeline was previously at 430 MB/s, approaching the SecureConn ceiling of 861 MB/s. The double-copy removal and buffer pooling push this ceiling significantly higher.
Overhead Breakdown
For a 1 GB encrypted transfer through gRPC, the overhead stacks like this:
| Layer | Cost |
|---|---|
| Raw TCP (net.Pipe) | 20,617 MB/s (baseline) |
| + AES-256-GCM encryption | 2,097 MB/s (10x overhead from crypto) |
| + gRPC framing + protobuf | 525 MB/s (4x overhead from protocol) |
| + FUSE kernel boundary | 220 MB/s (2.4x overhead from VFS) |
Encryption is the dominant cost, followed by gRPC/protobuf marshaling. FUSE adds another 2.4x on top. For WAN transfers, none of this matters. The ISP bandwidth is the bottleneck, not CPU.
What We Didn't Change
PR #106 also tested whether removing the pullStreamFile() server-push RPC in favor of pullParallelRead() caused a regression. Using a Latin-square interleaved methodology with 4 repetitions (to control for thermal throttling and background load, as described in Georges et al., "Statistically Rigorous Java Performance Evaluation," OOPSLA 2007):
- T1 (current, parallel read): 410 MB/s at 1 GB
- T2 (reverted, stream file): 441 MB/s at 1 GB
- Ratio: 1.08x, within noise (stdev was 130+ MB/s)
The pull strategy change is not the bottleneck. The parallel read approach stays.
Lines Changed
| File | Change | Lines |
|---|---|---|
| secureconn.go | Replace bytes.Buffer with leftover []byte, add sync.Pool | ~30 |
| constants.go | BlockSize 1 MiB -> 4 MiB | 1 |
Three targeted changes, no API changes, no protocol changes. The wire format is identical.