Adding AES-256-GCM with Hardware Acceleration

Automatic AES-NI detection and cipher negotiation

8 min read | KEIBIDROP Series

After benchmarking against croc, wormhole, and LocalSend, we found that KEIBIDROP's ChaCha20-Poly1305 encryption was leaving performance on the table on x86 machines with AES-NI. We added AES-256-GCM as an alternative cipher with automatic hardware detection. If both peers have AES-NI (Intel/AMD x86) or hardware AES (Apple Silicon ARM64), KEIBIDROP now defaults to AES-256-GCM. Encrypted gRPC throughput improved from 442 MB/s to 490 MB/s (+11%).

Why We Did This

Blog post #12 benchmarked KEIBIDROP against other file transfer tools. One result stood out: our Go reimplementation of LocalSend's protocol hit 704 MB/s using TLS 1.3 with AES-128-GCM, while KEIBIDROP's encrypted gRPC topped out at 442 MB/s with ChaCha20-Poly1305.

We isolated the cipher as the variable by forcing LocalSend's protocol to use ChaCha20-Poly1305 instead of AES-GCM. The result: 612 MB/s, 13% slower than the 704 MB/s AES-GCM run on the same protocol. That 13% is the cost of the cipher alone, and it is performance we were leaving on the table.

On modern x86 CPUs, AES-GCM runs on dedicated instructions (AES-NI). ChaCha20 has no dedicated instructions and runs in software. ARM64 chips such as Apple Silicon also provide AES instructions. The numbers in this post were measured on x86, which is where we saw the gap.

Wire Compatibility

AES-256-GCM and ChaCha20-Poly1305 have identical parameters:

| Parameter | AES-256-GCM | ChaCha20-Poly1305 |
|---|---|---|
| Key size | 32 bytes | 32 bytes |
| Nonce size | 12 bytes | 12 bytes |
| Auth tag | 16 bytes | 16 bytes |

Same key, same nonce, same tag, same wire format: [4-byte length][12-byte nonce][ciphertext+tag]. The only thing that changes is which cipher.AEAD is instantiated. No wire protocol changes, no protobuf changes, no nonce generator changes.
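To make the framing concrete, here is a minimal sketch of sealing and opening one frame in that layout. It uses only the standard library, so the demo AEAD is AES-256-GCM; a ChaCha20-Poly1305 AEAD would slot into the same two functions unchanged. The names sealFrame, openFrame, and newDemoAEAD are hypothetical, and the sketch assumes the 4-byte length is big-endian and counts the nonce plus ciphertext and tag, which the post does not specify.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/binary"
	"errors"
	"fmt"
)

// newDemoAEAD builds an AES-GCM AEAD from the given key (32 bytes for AES-256).
func newDemoAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealFrame encrypts plaintext and returns one frame laid out as
// [4-byte length][12-byte nonce][ciphertext+tag].
// Assumption: the length is big-endian and covers nonce + ciphertext + tag.
func sealFrame(aead cipher.AEAD, nonce, plaintext []byte) ([]byte, error) {
	if len(nonce) != aead.NonceSize() {
		return nil, errors.New("wrong nonce size")
	}
	bodyLen := len(nonce) + len(plaintext) + aead.Overhead()
	frame := make([]byte, 4, 4+bodyLen)
	binary.BigEndian.PutUint32(frame, uint32(bodyLen))
	frame = append(frame, nonce...)
	// Seal appends ciphertext+tag to frame and returns the extended slice.
	return aead.Seal(frame, nonce, plaintext, nil), nil
}

// openFrame parses one frame and returns the authenticated plaintext.
func openFrame(aead cipher.AEAD, frame []byte) ([]byte, error) {
	ns := aead.NonceSize()
	if len(frame) < 4+ns+aead.Overhead() {
		return nil, errors.New("frame too short")
	}
	if int(binary.BigEndian.Uint32(frame[:4])) != len(frame)-4 {
		return nil, errors.New("length prefix mismatch")
	}
	nonce, ciphertext := frame[4:4+ns], frame[4+ns:]
	return aead.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32) // all-zero demo key; never use a fixed key in practice
	aead, err := newDemoAEAD(key)
	if err != nil {
		panic(err)
	}
	// Random nonce for the demo only; the post describes a counter-based generator.
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		panic(err)
	}
	frame, err := sealFrame(aead, nonce, []byte("hello"))
	if err != nil {
		panic(err)
	}
	plaintext, err := openFrame(aead, frame)
	if err != nil {
		panic(err)
	}
	fmt.Printf("frame=%d bytes, plaintext=%q\n", len(frame), plaintext)
}
```

Because both functions take a cipher.AEAD and ask it for its nonce size and overhead, nothing in the framing code needs to know which cipher was negotiated.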

Implementation

Cipher Abstraction

New file pkg/crypto/cipher.go wraps both ciphers behind Go's cipher.AEAD interface:

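import (
    "crypto/aes"
    "crypto/cipher"
    "fmt"

    "golang.org/x/crypto/chacha20poly1305"
)

type CipherSuite string

const (
    CipherChaCha20 CipherSuite = "chacha20-poly1305"
    CipherAES256   CipherSuite = "aes-256-gcm"
)

// NewAEAD returns the AEAD for the given suite, keyed with a 32-byte key.
func NewAEAD(suite CipherSuite, key []byte) (cipher.AEAD, error) {
    switch suite {
    case CipherAES256:
        block, err := aes.NewCipher(key)
        if err != nil {
            return nil, err
        }
        return cipher.NewGCM(block)
    case CipherChaCha20:
        return chacha20poly1305.New(key)
    default:
        return nil, fmt.Errorf("unsupported cipher suite %q", suite)
    }
}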

Every call site that previously used chacha20poly1305.New(kek) now calls NewAEAD(suite, kek). The return type is the same (cipher.AEAD), so all downstream code is unchanged.

Hardware Detection

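// HasHardwareAES reports whether the CPU has AES instructions.
// Needs "runtime" and "golang.org/x/sys/cpu".
func HasHardwareAES() bool {
    switch runtime.GOARCH {
    case "amd64":
        return cpu.X86.HasAES    // AES-NI
    case "arm64":
        return cpu.ARM64.HasAES  // ARMv8 AES extension
    }
    return false
}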

This uses golang.org/x/sys/cpu (already a dependency) to detect hardware AES at runtime: AES-NI on x86 and the ARMv8 AES extension on ARM64. Apple Silicon Macs have the latter, so they get the speedup too.

Negotiation

During the handshake, each peer advertises its supported ciphers:

{
  "fingerprint": "...",
  "public_keys": {...},
  "enc_seeds": {...},
  "port": 26001,
  "supported_ciphers": ["aes-256-gcm", "chacha20-poly1305"]
}

If both peers support AES-256-GCM, they use it. If either peer lacks hardware AES, they fall back to ChaCha20-Poly1305. The negotiation is deterministic: both peers independently compute the same result because they exchange cipher lists and use the same selection algorithm.
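The selection rule can be sketched in a few lines. This is an illustration, not the NegotiateCipher function from pkg/crypto/cipher.go, whose signature the post does not show. The helpers localSuites and negotiate are hypothetical, and the sketch assumes a peer advertises "aes-256-gcm" only when it has hardware AES.

```go
package main

import "fmt"

type CipherSuite string

const (
	CipherChaCha20 CipherSuite = "chacha20-poly1305"
	CipherAES256   CipherSuite = "aes-256-gcm"
)

// localSuites builds the list a peer would advertise in the handshake.
// Assumption: AES-256-GCM is listed only when hardware AES is present.
func localSuites(hasHardwareAES bool) []CipherSuite {
	if hasHardwareAES {
		return []CipherSuite{CipherAES256, CipherChaCha20}
	}
	return []CipherSuite{CipherChaCha20}
}

// contains reports whether list includes suite s.
func contains(list []CipherSuite, s CipherSuite) bool {
	for _, c := range list {
		if c == s {
			return true
		}
	}
	return false
}

// negotiate picks AES-256-GCM only if both lists contain it and falls back
// to ChaCha20-Poly1305 otherwise. The result does not depend on argument
// order, so both peers reach the same answer independently.
func negotiate(local, remote []CipherSuite) CipherSuite {
	if contains(local, CipherAES256) && contains(remote, CipherAES256) {
		return CipherAES256
	}
	return CipherChaCha20
}

func main() {
	withAES := localSuites(true)
	withoutAES := localSuites(false)
	fmt.Println(negotiate(withAES, withAES))       // both peers have hardware AES
	fmt.Println(negotiate(withAES, withoutAES))    // remote lacks it
	fmt.Println(negotiate(withoutAES, withAES))    // local lacks it
	fmt.Println(negotiate(withAES, []CipherSuite{})) // peer that sent no list
}
```

Treating an empty or missing list as ChaCha20-only is one way to stay compatible with peers that predate the supported_ciphers field; whether KEIBIDROP does this is not stated in the post.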

Domain-Separated Key Derivation

Same input secrets, different HKDF labels:

// ChaCha20 key derivation
deriveKeyInternal(sha512.New, "KeibiDrop-ChaCha20-Poly1305-SEK", 32, secrets...)

// AES-256 key derivation
deriveKeyInternal(sha512.New, "KeibiDrop-AES-256-GCM-SEK", 32, secrets...)

Even with identical input secrets, the two ciphers produce different keys. This is standard practice: you never reuse keys across different algorithms.
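The effect of the label can be shown with a minimal HKDF-SHA512 built from the standard library's HMAC, following the extract-then-expand construction of RFC 5869. This is a sketch under assumptions: the body of deriveKeyInternal is not shown in the post, so the sketch assumes the label is used as the HKDF info string with an empty salt, and it takes a single secret rather than the variadic secrets list. The function name hkdfSHA512 is hypothetical.

```go
package main

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha512"
	"fmt"
)

// hkdfSHA512 is a minimal HKDF (RFC 5869) limited to outputs of at most
// 64 bytes, which need only one expand block.
// Assumption: the label is the HKDF "info" input and no salt is supplied.
func hkdfSHA512(secret []byte, label string, n int) []byte {
	if n > sha512.Size {
		panic("sketch supports at most one HMAC-SHA512 output block")
	}
	// Extract: PRK = HMAC(salt, secret), with salt defaulting to HashLen zero bytes.
	extract := hmac.New(sha512.New, make([]byte, sha512.Size))
	extract.Write(secret)
	prk := extract.Sum(nil)

	// Expand: T(1) = HMAC(PRK, info || 0x01); the key is the first n bytes.
	expand := hmac.New(sha512.New, prk)
	expand.Write([]byte(label))
	expand.Write([]byte{0x01})
	return expand.Sum(nil)[:n]
}

func main() {
	secret := []byte("shared secret from the handshake (demo value)")

	chachaKey := hkdfSHA512(secret, "KeibiDrop-ChaCha20-Poly1305-SEK", 32)
	aesKey := hkdfSHA512(secret, "KeibiDrop-AES-256-GCM-SEK", 32)

	fmt.Printf("chacha20 key: %x\n", chachaKey)
	fmt.Printf("aes-256 key:  %x\n", aesKey)
	fmt.Println("keys differ:", !bytes.Equal(chachaKey, aesKey))
}
```

Both peers derive the same key for the negotiated suite because the label is a fixed constant, while the two suites never share key material.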

Benchmark Results

Tested on Intel Core i7-9750H (AES-NI), macOS, localhost loopback:

| Metric | ChaCha20 (before) | AES-256-GCM (after) | Change |
|---|---|---|---|
| Encrypted gRPC (1GB) | 442 MB/s | 490 MB/s | +11% |
| FUSE E2E (1GB) | 233 MB/s | 233 MB/s | same |
| gRPC baseline (100MB) | 367 MB/s | 503 MB/s | +37% |

The encrypted gRPC layer got an 11% boost. The FUSE E2E number is unchanged because the 48% kernel overhead (context switches between kernel and userspace) is the bottleneck at that layer. Faster encryption does not help when you are waiting for the kernel.

The 100MB gRPC baseline shows the clearest picture: 37% faster at the encryption layer. This is the AES-NI hardware advantage.

Updated Comparison

With AES-256-GCM, KEIBIDROP vs the competition at 1GB:

| Tool | Speed | Gap from KEIBIDROP |
|---|---|---|
| LocalSend protocol (AES-GCM) | 704 MB/s | 1.44x faster (less framing) |
| KEIBIDROP gRPC (AES-256-GCM) | 490 MB/s | -- |
| KEIBIDROP FUSE (AES-256-GCM) | 233 MB/s | 0.48x (FUSE overhead) |
| croc (local relay) | 153 MB/s | 3.2x slower |
| wormhole-william | 126 MB/s | 3.9x slower |

The gap with LocalSend's raw protocol narrowed from 1.59x to 1.44x. The remaining difference is gRPC framing, protobuf encoding, and chunk bitmap tracking: the overhead from resume support and bidirectional multiplexing.
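The ratios quoted here follow directly from the measured throughputs. A short sketch that recomputes them from the numbers in this post (the helper ratio and the constant names are ours, for illustration):

```go
package main

import "fmt"

// Throughputs in MB/s, copied from the benchmark tables in this post.
const (
	localSend  = 704.0 // LocalSend protocol, AES-GCM
	grpcBefore = 442.0 // KEIBIDROP gRPC, ChaCha20-Poly1305
	grpcAfter  = 490.0 // KEIBIDROP gRPC, AES-256-GCM
	fuse       = 233.0 // KEIBIDROP FUSE, AES-256-GCM
	croc       = 153.0 // croc, local relay
	wormhole   = 126.0 // wormhole-william
)

// ratio returns how many times larger a is than b.
func ratio(a, b float64) float64 { return a / b }

func main() {
	fmt.Printf("LocalSend vs gRPC before: %.2fx\n", ratio(localSend, grpcBefore))
	fmt.Printf("LocalSend vs gRPC after:  %.2fx\n", ratio(localSend, grpcAfter))
	fmt.Printf("FUSE vs gRPC after:       %.2fx\n", ratio(fuse, grpcAfter))
	fmt.Printf("gRPC after vs croc:       %.1fx\n", ratio(grpcAfter, croc))
	fmt.Printf("gRPC after vs wormhole:   %.1fx\n", ratio(grpcAfter, wormhole))
	fmt.Printf("gRPC gain from AES-GCM:   %+.0f%%\n", (ratio(grpcAfter, grpcBefore)-1)*100)
}
```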

What Changed

| File | Change |
|---|---|
| pkg/crypto/cipher.go | New: CipherSuite type, HasHardwareAES, NegotiateCipher, NewAEAD, DeriveKey |
| pkg/crypto/asymmetric.go | Added DeriveAES256Key with domain-separated HKDF label |
| pkg/session/secureconn.go | SecureWriter/SecureReader/SecureConn take CipherSuite parameter |
| pkg/session/handshake.go | PeerHandshakeMessage gains SupportedCiphers field; negotiation in handshake |
| pkg/session/session.go | Session stores negotiated CipherSuite |
| pkg/session/rekey.go | Re-keying uses session's negotiated cipher suite |

~80 lines of new code, ~30 lines changed. Zero wire format changes. All existing tests pass unchanged (they now run with AES-256-GCM on AES-NI hardware).

Security Notes

AES-256-GCM provides the same security guarantees as ChaCha20-Poly1305: authenticated encryption with associated data (AEAD), 256-bit key strength, and a 128-bit authentication tag. Key derivation is domain-separated so the same secrets produce different keys for different ciphers, preventing cross-algorithm key reuse. ChaCha20-Poly1305 remains the fallback for any hardware without AES acceleration (older x86, RISC-V, etc.).

One caveat: AES-GCM fails catastrophically on nonce reuse (leaks the authentication key), while ChaCha20-Poly1305 degrades more gracefully. With a 96-bit nonce and counter-based generation, the risk is a nonce collision after 2^32 messages under the same key. KEIBIDROP re-keys after 1 GB or 1 million messages, whichever comes first. At that rotation rate, nonce reuse under AES-GCM is not a practical concern.
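The rotation rule is easy to express as a small counter. The sketch below is illustrative only: the type keyUsage and its methods are hypothetical and not taken from pkg/session/rekey.go, and it assumes "1 GB" means 2^30 bytes.

```go
package main

import "fmt"

const (
	maxBytesPerKey    = 1 << 30   // assumption: "1 GB" read as 2^30 bytes
	maxMessagesPerKey = 1_000_000 // from the post: 1 million messages
)

// keyUsage tracks how much traffic the current key has protected.
type keyUsage struct {
	messages uint64
	bytes    uint64
}

// record accounts for one encrypted message of msgLen bytes.
func (u *keyUsage) record(msgLen int) {
	u.messages++
	u.bytes += uint64(msgLen)
}

// needsRekey reports whether either limit has been reached.
func (u *keyUsage) needsRekey() bool {
	return u.bytes >= maxBytesPerKey || u.messages >= maxMessagesPerKey
}

// messagesUntilRekey simulates fixed-size messages under a fresh key and
// returns how many are sent before a re-key is required.
func messagesUntilRekey(msgLen int) uint64 {
	var u keyUsage
	for !u.needsRekey() {
		u.record(msgLen)
	}
	return u.messages
}

func main() {
	// 64 KiB chunks hit the byte limit first.
	fmt.Println("64 KiB messages per key:", messagesUntilRekey(64<<10))
	// 100-byte messages hit the message limit first.
	fmt.Println("100 B messages per key: ", messagesUntilRekey(100))
	// Headroom between the message limit and the 2^32 bound.
	fmt.Println("2^32 / message limit:   ", (uint64(1)<<32)/maxMessagesPerKey)
}
```

Under these assumptions a key never protects more than one million messages, which is more than three orders of magnitude below the 2^32 bound.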
