SecureConn is our AEAD encryption wrapper over raw TCP. When we profiled a 1 GB transfer, we found that 29% of CPU time was spent on memmove inside the encryption layer. Not on actual encryption. On copying bytes between buffers that didn't need to exist.
The Profile
Andrei ran a CPU profile on a 1 GB no-FUSE PullFile transfer on Linux (Intel i5, AES-NI):
| flat | flat% | function |
|---|---|---|
| 1.74s | 30.4% | syscall.Syscall6 (TCP send/recv) |
| 1.67s | 29.2% | runtime.memmove (memory copies) |
| 0.82s | 14.3% | aes/gcm.gcmAesEnc (AES-GCM encrypt) |
| 0.42s | 7.3% | aes/gcm.gcmAesDec (AES-GCM decrypt) |
Encryption itself was only 21.6% of CPU. Memory copies were 29.2%. For a 1 GB transfer, the process allocated 3.85 GB of heap memory. Something was wrong.
The Double-Copy
The culprit was SecureConn.Read() in pkg/session/secureconn.go:
```go
// Before: two copies per message
func (s *SecureConn) Read(p []byte) (int, error) {
	msg, err := s.r.Read() // decrypt into fresh []byte
	if err != nil {
		return 0, err
	}
	s.readBuf.Write(msg)      // copy 1: plaintext -> bytes.Buffer
	n, _ := s.readBuf.Read(p) // copy 2: bytes.Buffer -> caller's buffer
	return n, nil
}
```
Every decrypted message (up to 4 MiB) was copied twice through heap-allocated buffers. For 1 GB of data split into 1 MiB chunks, that's 2 GB of unnecessary memmove.
The bytes.Buffer was there to handle partial reads. When the caller's buffer p is smaller than the decrypted message, the leftover needs to go somewhere. But bytes.Buffer does a full copy on Write, which is the wrong tool for this.
The Fix: Direct Slice Handoff
Replace bytes.Buffer with a simple leftover []byte field:
```go
// After: zero extra copies
func (s *SecureConn) Read(p []byte) (int, error) {
	if len(s.leftover) > 0 {
		n := copy(p, s.leftover)
		s.leftover = s.leftover[n:]
		if len(s.leftover) == 0 {
			s.leftover = nil // release the backing array for GC
		}
		return n, nil
	}
	msg, err := s.r.Read() // decrypt into fresh []byte
	if err != nil {
		return 0, err
	}
	n := copy(p, msg) // single copy into caller's buffer
	if n < len(msg) {
		s.leftover = msg[n:] // keep remainder, no copy
	}
	return n, nil
}
```
The decrypted msg slice is used directly. If the caller can't consume it all, we keep a reference to the remainder. No intermediate buffer, no extra allocation.
The Pool: Eliminating Allocation Churn
The second fix targets SecureWriter.Write(), which allocated a fresh buffer for every outgoing message:
```go
// Before: allocate per message
buf := make([]byte, lengthHeaderSize+encSize)
```
With 4 MiB chunks and a 1 GB file, that's 256 allocations of ~4 MiB each. These buffers become garbage immediately after the TCP write completes (the kernel copies the data), so under production load with the default GOGC this churn forces frequent GC cycles and the pauses that come with them.
```go
// After: pool reuse
var encBufPool = sync.Pool{}

// In Write():
if poolBuf, ok := encBufPool.Get().([]byte); ok && cap(poolBuf) >= totalSize {
	buf = poolBuf[:totalSize]
} else {
	buf = make([]byte, totalSize)
}
// ... encrypt into buf, write to TCP ...
defer encBufPool.Put(buf[:0])
```
The pool is safe here because net.Conn.Write copies data into kernel space before returning. The buffer won't be read again after Write completes.
Results
Measured on Intel Core i7-9750H @ 2.60 GHz, macOS, AES-NI:
SecureConn Micro-Benchmark (encryption layer only)
| Block Size | Throughput |
|---|---|
| 1 MiB | 2,097 MB/s |
| 4 MiB | 1,850 MB/s |
| Raw net.Pipe (no crypto) | 20,617 MB/s |
Full Pipeline (encrypted gRPC, loopback)
| Size | MB/s |
|---|---|
| 10 MB | 432 |
| 100 MB | 530 |
| 1 GB | 525 |
No-FUSE PullFile (file download, loopback)
| Size | MB/s |
|---|---|
| 10 MB | 332 |
| 100 MB | 647 |
| 1 GB | 623 |
FUSE End-to-End (write -> peer pulls, loopback)
| Size | MB/s |
|---|---|
| 100 MB | 233 |
| 1 GB | 220 |
On the Linux benchmark machine (Intel i5), the full PullFile pipeline was previously at 430 MB/s, approaching the SecureConn ceiling of 861 MB/s. The double-copy removal and buffer pooling push this ceiling significantly higher.
Overhead Breakdown
For a 1 GB encrypted transfer through gRPC, the overhead stacks like this:
| Layer | Cost |
|---|---|
| Raw TCP (net.Pipe) | 20,617 MB/s (baseline) |
| + AES-256-GCM encryption | 2,097 MB/s (10x overhead from crypto) |
| + gRPC framing + protobuf | 525 MB/s (4x overhead from protocol) |
| + FUSE kernel boundary | 220 MB/s (2.4x overhead from VFS) |
Encryption is the dominant cost, followed by gRPC/protobuf marshaling. FUSE adds another 2.4x on top. For WAN transfers, none of this matters. The ISP bandwidth is the bottleneck, not CPU.
What We Didn't Change
PR #106 also tested whether removing the pullStreamFile() server-push RPC in favor of pullParallelRead() caused a regression. Using a Latin-square interleaved methodology with 4 repetitions (to control for thermal throttling and background load, as described in Georges et al., "Statistically Rigorous Java Performance Evaluation," OOPSLA 2007):
- T1 (current, parallel read): 410 MB/s at 1 GB
- T2 (reverted, stream file): 441 MB/s at 1 GB
- Ratio: 1.08x, within noise (stdev was 130+ MB/s)
The pull strategy change is not the bottleneck. The parallel read approach stays.
Lines Changed
| File | Change | Lines |
|---|---|---|
| secureconn.go | Replace bytes.Buffer with leftover []byte, add sync.Pool | ~30 |
| constants.go | BlockSize 1 MiB -> 4 MiB | 1 |
Three targeted changes, no API changes, no protocol changes. The wire format is identical.