Why We Did This
Blog post #12 benchmarked KEIBIDROP against other file transfer tools. One result stood out: our Go reimplementation of LocalSend's protocol hit 704 MB/s using TLS 1.3 with AES-128-GCM, while KEIBIDROP's encrypted gRPC topped out at 442 MB/s with ChaCha20-Poly1305.
We isolated the cipher as the variable by forcing LocalSend's protocol to use ChaCha20-Poly1305 instead of AES-GCM. The result: 612 MB/s, only 13% slower than AES-GCM. That 13% is free performance we were leaving on the table.
On modern x86 CPUs, AES-GCM runs on dedicated hardware (AES-NI instructions). ChaCha20 runs in software. On ARM64 (Apple Silicon), both have hardware support. The choice matters most on x86.
Wire Compatibility
AES-256-GCM and ChaCha20-Poly1305 have identical parameters:
| Parameter | AES-256-GCM | ChaCha20-Poly1305 |
|---|---|---|
| Key size | 32 bytes | 32 bytes |
| Nonce size | 12 bytes | 12 bytes |
| Auth tag | 16 bytes | 16 bytes |
Same key, same nonce, same tag, same wire format: [4-byte length][12-byte nonce][ciphertext+tag]. The only thing that changes is which cipher.AEAD is instantiated. No wire protocol changes, no protobuf changes, no nonce generator changes.
Implementation
Cipher Abstraction
New file pkg/crypto/cipher.go wraps both ciphers behind Go's cipher.AEAD interface:

```go
type CipherSuite string

const (
	CipherChaCha20 CipherSuite = "chacha20-poly1305"
	CipherAES256   CipherSuite = "aes-256-gcm"
)

func NewAEAD(suite CipherSuite, key []byte) (cipher.AEAD, error) {
	switch suite {
	case CipherAES256:
		block, err := aes.NewCipher(key)
		if err != nil {
			return nil, err
		}
		return cipher.NewGCM(block)
	case CipherChaCha20:
		return chacha20poly1305.New(key)
	default:
		return nil, fmt.Errorf("unsupported cipher suite: %q", suite)
	}
}
```
Every call site that previously used chacha20poly1305.New(kek) now calls NewAEAD(suite, kek). The return type is the same (cipher.AEAD), so all downstream code is unchanged.
Hardware Detection
```go
func HasHardwareAES() bool {
	switch runtime.GOARCH {
	case "amd64":
		return cpu.X86.HasAES // AES-NI
	case "arm64":
		return cpu.ARM64.HasAES // ARMv8 AES extension
	}
	return false
}
```
Uses golang.org/x/sys/cpu (already a dependency) to detect hardware AES support at runtime. Apple Silicon Macs also have hardware AES, so they get the speedup too.
Negotiation
During the handshake, each peer advertises its supported ciphers:
```json
{
  "fingerprint": "...",
  "public_keys": {...},
  "enc_seeds": {...},
  "port": 26001,
  "supported_ciphers": ["aes-256-gcm", "chacha20-poly1305"]
}
```
If both peers support AES-256-GCM, they use it. If either peer lacks hardware AES, they fall back to ChaCha20-Poly1305. The negotiation is deterministic: both peers independently compute the same result because they exchange cipher lists and use the same selection algorithm.
Domain-Separated Key Derivation
Same input secrets, different HKDF labels:
```go
// ChaCha20 key derivation
deriveKeyInternal(sha512.New, "KeibiDrop-ChaCha20-Poly1305-SEK", 32, secrets...)

// AES-256 key derivation
deriveKeyInternal(sha512.New, "KeibiDrop-AES-256-GCM-SEK", 32, secrets...)
```
Even with identical input secrets, the two ciphers produce different keys. This is standard practice: you never reuse keys across different algorithms.
Benchmark Results
Tested on Intel Core i7-9750H (AES-NI), macOS, localhost loopback:
| Metric | ChaCha20 (before) | AES-256-GCM (after) | Change |
|---|---|---|---|
| Encrypted gRPC (1GB) | 442 MB/s | 490 MB/s | +11% |
| FUSE E2E (1GB) | 233 MB/s | 233 MB/s | same |
| gRPC baseline (100MB) | 367 MB/s | 503 MB/s | +37% |
The encrypted gRPC layer got an 11% boost. The FUSE E2E number is unchanged because the 48% kernel overhead (context switches between kernel and userspace) is the bottleneck at that layer. Faster encryption does not help when you are waiting for the kernel.
The 100MB gRPC baseline shows the clearest picture: 37% faster at the encryption layer. This is the AES-NI hardware advantage.
Updated Comparison
With AES-256-GCM, KEIBIDROP vs the competition at 1GB:
| Tool | Speed | Gap from KEIBIDROP |
|---|---|---|
| LocalSend protocol (AES-GCM) | 704 MB/s | 1.44x faster (less framing) |
| KEIBIDROP gRPC (AES-256-GCM) | 490 MB/s | -- |
| KEIBIDROP FUSE (AES-256-GCM) | 233 MB/s | 0.48x (FUSE overhead) |
| croc (local relay) | 153 MB/s | 3.2x slower |
| wormhole-william | 126 MB/s | 3.9x slower |
The gap with LocalSend's raw protocol narrowed from 1.59x to 1.44x. The remaining difference is gRPC framing, protobuf encoding, and chunk bitmap tracking: the overhead from resume support and bidirectional multiplexing.
What Changed
| File | Change |
|---|---|
| pkg/crypto/cipher.go | New: CipherSuite type, HasHardwareAES, NegotiateCipher, NewAEAD, DeriveKey |
| pkg/crypto/asymmetric.go | Added DeriveAES256Key with domain-separated HKDF label |
| pkg/session/secureconn.go | SecureWriter/SecureReader/SecureConn take CipherSuite parameter |
| pkg/session/handshake.go | PeerHandshakeMessage gains SupportedCiphers field; negotiation in handshake |
| pkg/session/session.go | Session stores negotiated CipherSuite |
| pkg/session/rekey.go | Re-keying uses session's negotiated cipher suite |
~80 lines of new code, ~30 lines changed. Zero wire format changes. All existing tests pass unchanged (they now run with AES-256-GCM on AES-NI hardware).
Security Notes
AES-256-GCM provides the same security guarantees as ChaCha20-Poly1305: authenticated encryption with associated data (AEAD), 256-bit key strength, 128-bit authentication. Key derivation is domain-separated so the same secrets produce different keys for different ciphers, preventing cross-algorithm key reuse. ChaCha20-Poly1305 remains the fallback for any hardware without AES-NI (older x86, RISC-V, etc.).
One caveat: AES-GCM fails catastrophically on nonce reuse (a single repeated nonce leaks the authentication key), while ChaCha20-Poly1305 degrades more gracefully. With a 96-bit nonce, the conventional safety limit is 2^32 messages under the same key (NIST's bound for randomly generated GCM nonces; KEIBIDROP's counter-based generation avoids collisions outright until the counter is reset). KEIBIDROP re-keys after 1 GB or 1 million messages, whichever comes first. At that rotation rate, nonce reuse under AES-GCM is not a practical concern.