Smooth On-Demand Playback Over 200 ms

Abstract. A follow-up to the Romania to Singapore measurements. Fetching a 16 MiB block per round trip (PR #182) made a full-speed on-demand read fast: 22 to 30 MB/s, matching whole-file transfer. But a video player does not read at full speed. It reads at the playback bitrate and keeps only a few seconds buffered, so it kept hitting cold 16 MiB block boundaries and freezing at each one while the block was fetched. This change adds predictive read-ahead: while the player consumes one block, the engine fetches the next ones. Measured with a paced reader over the same bridge, a cold 3 MB/s playback went from 5 freezes (2.6 s frozen) to 0 on the Windows mount, and from 13 freezes (23.6 s frozen) to 1 on Linux. Read-ahead is on by default (read_ahead_window_mb=64). Measured on commit a649faa (PR #186).

Why fast reads still stuttered

The previous post fixed slow on-demand reads. Each read used to ask the peer for one 512 KiB chunk and wait a full round trip for it; over a 200 ms link that settled at about 2 MB/s regardless of file size. Fetching a whole 16 MiB block per round trip instead, and serving the reads in between from local disk, raised a straight-through read to 22 to 30 MB/s. That is the right number for copying a file.
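The arithmetic behind those figures is short. The sketch below computes the ceiling that latency alone imposes when every request waits one full round trip. The helper name is made up for illustration, and the calculation ignores transfer time, so measured numbers should sit below it.

```python
# Ceiling on throughput when each request costs one full round trip.
# Transfer time is ignored, so this is an upper bound, not a prediction.

KIB = 1024
MIB = 1024 * KIB
RTT_S = 0.200  # about 200 ms, as measured over the bridge


def latency_bound_mb_per_s(request_bytes: int, rtt_s: float) -> float:
    """MB/s delivered if every request of this size waits exactly one RTT."""
    return request_bytes / rtt_s / 1e6


chunk_ceiling = latency_bound_mb_per_s(512 * KIB, RTT_S)
block_ceiling = latency_bound_mb_per_s(16 * MIB, RTT_S)

print(f"512 KiB per round trip: at most {chunk_ceiling:.1f} MB/s")
print(f"16 MiB per round trip:  at most {block_ceiling:.1f} MB/s")
```

The 512 KiB ceiling comes out near 2.6 MB/s, in line with the roughly 2 MB/s observed. The 16 MiB ceiling is far above the measured 22 to 30 MB/s, which suggests that with whole-block fetches the link's bandwidth, not its latency, sets the limit.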

A player is not a copy. It reads at the video's bitrate, say 3 MB/s for a Blu-ray, and stops once it has a few seconds buffered, then reads a little more. Every time it crosses into a 16 MiB block that is not in the cache, the read blocks for at least a full round trip while that block arrives. At 200 ms that is a freeze the viewer sees, and it happens at every block. The headline throughput looked great and the playback was choppy.

Throughput is the wrong number for this

A full-speed read reports an average. One read can average 30 MB/s while it sat frozen for two seconds in the middle, because the average counts how many bytes arrived, not when. A viewer cares about exactly when.

So we measure with a paced reader. It consumes the file at a target bitrate the way a player would, sleeping to that schedule, and any read that takes longer than the time the player had budgeted for it counts as a stall, with the duration it blocked. The result is the pair of numbers a viewer actually feels: how many times it froze, and for how long in total. Every trial is cold: fresh file, cleared cache, dropped page cache, so the reads really hit the network.
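A paced reader of this kind fits in a few lines. What follows is a minimal sketch, not the benchmark tool itself: the names, the 256 KiB read size and the exact stall rule (a read counts as a freeze when it takes longer than the playback time of the bytes it returns) are assumptions chosen to match the description above. The clock and sleep functions are injectable so the logic can be checked without a real network.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class StallReport:
    freezes: int = 0       # reads that missed their deadline
    frozen_s: float = 0.0  # total time spent in those reads
    worst_s: float = 0.0   # longest single read, stalled or not


def paced_read(read: Callable[[int], bytes],
               bitrate_bytes_per_s: float,
               read_size: int = 256 * 1024,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> StallReport:
    """Consume a stream at a fixed bitrate and record every missed deadline."""
    # Time the player can afford to spend on one read before it runs dry.
    budget_s = read_size / bitrate_bytes_per_s
    report = StallReport()
    while True:
        started = clock()
        data = read(read_size)
        elapsed = clock() - started
        if not data:  # end of file
            return report
        report.worst_s = max(report.worst_s, elapsed)
        if elapsed > budget_s:
            # The read outlasted its budget: the viewer saw a freeze.
            report.freezes += 1
            report.frozen_s += elapsed
        else:
            # On schedule: idle for the rest of the budget, as a player would.
            sleep(budget_s - elapsed)
```

With a file opened on the mount, a call such as paced_read(f.read, 3_000_000) yields the same three figures the results report: freezes, total time frozen, and the worst single read.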

Read-ahead

While the player consumes the block it already has, the engine fetches the next ones. It tracks the read position, and once a read is part of a steady forward run it fetches ahead of the playhead and keeps a window buffered, four blocks (64 MiB) by default. The player reaches each new block and finds it already on disk.

Two details decide whether it actually helps on a single shared link:

Fetch in order, one block at a time, at full speed. Fetching the whole window in parallel splits the one link four ways, so the block the player needs next arrives slower than if it had been fetched alone. Fetching sequentially gets the nearest block there first, which is the one about to be read.
Only after real sequential consumption. Read-ahead commits only once the read has actually consumed enough of a stream, so opening a folder of videos to generate thumbnails, which pokes at the start and end of each file, does not pull every file in the folder.
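A rough calculation shows why fetch order matters. It assumes a link that splits its bandwidth evenly across concurrent fetches and it ignores round-trip time. The 25 MB/s figure is simply a value inside the measured 22 to 30 MB/s range, and the function is illustrative, not engine code.

```python
from typing import List

BLOCK_MB = 16 * 1024 * 1024 / 1e6  # one 16 MiB block, in MB
LINK_MB_PER_S = 25.0               # assumed link speed, within the measured range


def arrival_times_s(blocks: int, parallel: bool) -> List[float]:
    """Seconds until each block of the window is complete on disk."""
    per_block = BLOCK_MB / LINK_MB_PER_S
    if parallel:
        # Every fetch gets an equal share, so all blocks finish together, late.
        return [blocks * per_block] * blocks
    # One at a time: block i is done after i + 1 full-speed transfers.
    return [(i + 1) * per_block for i in range(blocks)]


in_order = arrival_times_s(4, parallel=False)
in_parallel = arrival_times_s(4, parallel=True)

print(f"next block, fetched in order:    {in_order[0]:.2f} s")
print(f"next block, fetched in parallel: {in_parallel[0]:.2f} s")
```

Under these assumptions the whole window lands at the same moment either way, but the block the player needs next arrives four times sooner when the blocks are fetched in order.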

On a seek, the read-ahead resets to the new position and cancels the fetch still in flight for the old spot, so skipping ahead does not wait behind work the player no longer needs.
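Taken together, the rules above amount to a small planner. The sketch below is one way to express them, under stated assumptions: the class and method names are hypothetical, the 2 MiB commit threshold is a placeholder (the post does not give the real value), and the caller is assumed to decide whether a read continues the stream or is a seek.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

BLOCK = 16 * 1024 * 1024  # block size used by the on-demand path


@dataclass
class ReadAheadPlanner:
    window_blocks: int = 4                # 4 x 16 MiB = 64 MiB window
    commit_bytes: int = 2 * 1024 * 1024   # hypothetical commit threshold
    cached: Set[int] = field(default_factory=set)
    in_flight: Optional[int] = None       # at most one fetch at a time
    playhead: int = 0
    consumed: int = 0                     # bytes read since the last seek

    def on_read(self, offset: int, length: int) -> None:
        """Record a read that continues the current stream."""
        self.playhead = offset + length
        self.consumed += length

    def on_seek(self, offset: int) -> Optional[int]:
        """Reset to a new position; return the block whose fetch to cancel."""
        cancelled, self.in_flight = self.in_flight, None
        self.playhead, self.consumed = offset, 0
        return cancelled

    def on_fetched(self, block: int) -> None:
        """Mark a block as present on local disk."""
        self.cached.add(block)
        if self.in_flight == block:
            self.in_flight = None

    def next_fetch(self) -> Optional[int]:
        """Pick the nearest missing block ahead of the playhead, if any."""
        if self.consumed < self.commit_bytes or self.in_flight is not None:
            return None  # not committed yet, or the link is already busy
        current = self.playhead // BLOCK
        for block in range(current + 1, current + 1 + self.window_blocks):
            if block not in self.cached:
                self.in_flight = block
                return block
        return None  # window is full
```

A thumbnailer that pokes at the start of a file never reaches the threshold, so next_fetch keeps returning None. A player that keeps reading gets the following block, then the one after it, one at a time.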

The numbers

Paced playback at 3 MB/s, cold, over the bridge at about 200 ms round-trip. Freezes is the count of reads that missed their deadline; frozen is the total time blocked; worst is the single longest stall.

Reader	File	Without read-ahead	With read-ahead
Windows (WinFsp)	100 MB	5 freezes, 2.6 s frozen, worst 1.0 s	0 freezes, 0 s frozen, worst 0.04 s
Linux (libfuse)	200 MB	13 freezes, 23.6 s frozen, worst 2.9 s	1 freeze, 1.5 s frozen, worst 1.7 s

Without read-ahead the file stutters at every block boundary, each one a full round trip. With it, the buffer stays ahead of the playhead and the reads land on disk. The one freeze left on Linux is the first block: the startup fetch, before anything is buffered, which no read-ahead can avoid. The worst single stall dropped from a second or more to tens of milliseconds on the Windows mount.

Reads do not arrive in order

The first cut detected a sequential read by comparing it to the previous one: if this read begins where the last ended, it is sequential. That worked on Linux and did nothing on Windows, with identical settings.

The reason is that the kernel can serve one sequential scan from several worker threads at once, so the reads arrive at the filesystem out of order: 8.0, 8.5, 8.3, 9.0, 8.8 MB and so on. Compared pairwise, half of them look like backward jumps, so the detector called the access random and never read ahead. Linux happened to hand the reads over in order on this path, so it worked there by accident; WinFsp did not.

The fix is to stop comparing adjacent reads. The engine tracks the furthest point the read has reached and treats any read near that point as part of the same stream, wherever it falls in the arrival order. Order stops mattering, and the same code now works on Windows, Linux, macOS and BSD. A regression test feeds the read path deliberately shuffled offsets to keep it that way.
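The difference between the two detectors is easy to reproduce. The sketch below is an illustration, not the engine's code: the class name, the 4 MiB slack and the fixed interleaving pattern are assumptions. It feeds the same out-of-order reads to a pairwise check and to a furthest-point check and counts how many reads each one recognises as part of the stream.

```python
from typing import List, Optional, Tuple

MIB = 1024 * 1024
READ = 512 * 1024  # size of each read in this example


class FrontierDetector:
    """Treats any read near the furthest point reached as the same stream."""

    def __init__(self, slack_bytes: int = 4 * MIB) -> None:
        self.slack_bytes = slack_bytes          # hypothetical tolerance
        self.frontier: Optional[int] = None     # furthest byte read so far

    def observe(self, offset: int, length: int) -> bool:
        end = offset + length
        near = (self.frontier is not None
                and abs(offset - self.frontier) <= self.slack_bytes)
        # In-stream reads only ever push the frontier forward;
        # a far-away read is a seek and starts a new stream there.
        self.frontier = max(self.frontier, end) if near else end
        return near


def pairwise_hits(reads: List[Tuple[int, int]]) -> int:
    """The first-cut rule: sequential only if it starts where the last ended."""
    hits, prev_end = 0, None
    for offset, length in reads:
        if offset == prev_end:
            hits += 1
        prev_end = offset + length
    return hits


def interleaved_reads(groups: int) -> List[Tuple[int, int]]:
    """A forward scan whose reads arrive slightly out of order."""
    order = (0, 2, 1, 3)  # arrival order within each group of four reads
    return [((g * 4 + k) * READ, READ) for g in range(groups) for k in order]


reads = interleaved_reads(8)
detector = FrontierDetector()
frontier_hits = sum(detector.observe(o, n) for o, n in reads)

print(f"pairwise: {pairwise_hits(reads)} of {len(reads)} reads look sequential")
print(f"frontier: {frontier_hits} of {len(reads)} reads look sequential")
```

On this input the pairwise rule recognises 7 of 32 reads, while the furthest-point rule recognises 31 of 32; the only miss is the very first read, which has nothing to be near yet.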

A stale binary almost hid the whole thing

For a stretch the Windows numbers showed no change at all, read-ahead on or off. The cause was not the code. The Windows build had been failing for days and the deploy was shipping a six-day-old binary that predated the feature. The build passed a linker path of C:\Program Files (x86)\WinFsp\lib to the compiler unquoted; the spaces split the flag and the build failed, but the deploy step checked only whether the old executable still existed, found it, and reported success. So a failed build looked like a successful one and kept serving the previous binary.

The lesson is unglamorous: check the timestamp on the artifact you are testing before you trust a benchmark. The build now passes the path in its space-free short form, deletes the old binary before compiling, and exits with the compiler error when it fails, so a broken build stops being mistaken for a working one.
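The same guard can be written as a short script. This is a sketch under assumptions, not the project's build: the function names are made up, and instead of the short-form path it passes the arguments as a list, a different technique that also keeps a path with spaces in one piece.

```python
import subprocess
import time
from pathlib import Path
from typing import List, Optional


def rebuild(command: List[str], artifact: Path) -> None:
    """Run a build so that a failure cannot leave a stale artifact behind."""
    # Delete the previous output first: if the build fails, nothing is
    # left for a later deploy step to mistake for a fresh binary.
    artifact.unlink(missing_ok=True)
    # Arguments are passed as a list, so a path containing spaces stays
    # a single argument. check=True raises CalledProcessError on failure.
    subprocess.run(command, check=True)
    if not artifact.exists():
        raise FileNotFoundError(f"build succeeded but produced no {artifact}")


def is_fresh(artifact: Path, max_age_s: float,
             now: Optional[float] = None) -> bool:
    """True if the artifact was modified within the last max_age_s seconds."""
    now = time.time() if now is None else now
    return now - artifact.stat().st_mtime <= max_age_s
```

A benchmark harness could call is_fresh on the binary it is about to measure and refuse to run when the check fails, which would have flagged the six-day-old executable immediately.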

Tradeoffs and what is next

A seek still costs one cold block. Jumping to a new spot fetches the 16 MiB block around it before the first frame appears, a second or two at this latency, and then read-ahead takes over and playback is smooth. For interactive playback that is the right tradeoff; prefetching more on a guess would waste the link.

The next inefficiency is editing. Today, changing a shared file re-fetches the whole file; the planned change invalidates only the byte ranges that actually changed. That matters for workflows where the same large file moves back and forth between people to be read, edited and handed off, video post-production and audiobook production among them. That is a separate post.

Medians over repeated cold trials at 3 MB/s paced playback, both peers reached over the TCP bridge at about 200 ms round-trip, on commit a649faa. Windows reader is a WinFsp mount in Singapore; Linux reader is a libfuse mount on the Timisoara VPS.