FUSE Disconnect and Reconnect

cgofuse mounts once per process, so we stopped unmounting

10 min read | KEIBIDROP Series

Background

KEIBIDROP mounts a FUSE virtual filesystem so that files shared by a peer appear as a regular folder. The user can connect to a peer, browse and stream files through the mount, disconnect, and connect again to the same or a different peer. During a debugging session we found that the disconnect path had six separate bugs, each masking the next. This post documents the bugs, their causes, and the resulting design change.

1. Stop() blocks indefinitely

The Run() goroutine calls host.Mount() from the cgofuse library, which blocks for the lifetime of the FUSE session. When the user clicks Disconnect, Stop() cancels the Go context and waits on a done channel that Run() should close. Because Run() is stuck inside a blocking C call, it never reaches a select statement that could observe ctx.Done(), so the done channel is never closed.

On macOS, host.Mount() does not always return after host.Unmount(). The unmount syscall succeeds, but the cgofuse event loop keeps blocking. We confirmed this with timestamped logs: Unmount completed at T+0, and Mount had still not returned at T+5s.

We moved Mount() into a separate goroutine and added a select on both the mount completion channel and ctx.Done(). Stop() now has a 5-second timeout with forced cleanup.

2. Health monitor races with disconnect

The disconnect sequence calls UnmountFilesystem() first, then Stop(). The health monitor runs on a 5-second interval. When the FUSE operations drain and the gRPC connection degrades during unmount, the health monitor detects the peer as unreachable and starts the reconnection manager. The UI shows "Connecting..." while the user is trying to disconnect.

We moved StopConnectionResilience() inside UnmountFilesystem() so that calling it from any interface (Rust UI, Go CLI, agent CLI) stops the health monitor and reconnection manager before touching FUSE state. External callers do not need to handle this ordering.

3. Cancel button blocks the UI thread

The Disconnect button already spawned a background thread for FFI calls to avoid blocking the Slint event loop. The Cancel button (for aborting a connection attempt) did not. KD_Disconnect() calls Stop(), which can block for up to 5 seconds. Calling it on the UI thread freezes the window. We moved the Cancel button's FFI calls to a background thread.

4. FUSE handlers block unmount for 30 seconds

The FUSE Read() handler creates gRPC streams to fetch remote file chunks. These streams were created with context.Background(), so cancelling the session context had no effect on them. When the peer disconnects, each stream attempt waits out a 10-second gRPC timeout, and the retry loop makes three attempts per handler. The macOS kernel will not release the FUSE mount until every pending FUSE operation returns, so host.Unmount() blocks for up to 30 seconds (3 attempts × 10 seconds) per open file.

We added an FsCtx field to the filesystem struct. All FUSE handlers (Read, Open, prefetch, on-demand stream creation) derive their contexts from FsCtx. On disconnect, FsCtx is cancelled before draining operations, so handlers see context.Canceled and return immediately.

5. FUSE cannot remount after disconnect

cgofuse registers a process-global FUSE signal handler the first time host.Mount() is called. There is no API to unregister it. A second call to host.Mount() on a new FileSystemHost fails with "fuse: cannot register signal source". This is a constraint of the cgofuse library, not of FUSE itself. The practical consequence is that a process gets one FUSE mount for its entire lifetime.

We changed the disconnect path to call ClearFiles() instead of Unmount(). ClearFiles() cancels the filesystem context, clears all file and directory maps, and creates a fresh context. The mount stays alive as an empty folder. On reconnect, setupFilesystem() reuses the existing FS object, re-wires the gRPC callbacks for the new session, and files repopulate through ADD_FILE notifications from the peer. The real Unmount() only runs on application exit.

6. Files cannot be opened after reconnect

After fixing bug 5, the mount persisted across reconnects and files appeared in the directory listing. Opening any file returned a permission error. The gRPC stream creation failed with "context canceled".

The cause was in the Run() goroutine's ctx.Done handler. During a temporary disconnect (not app exit), it called Unmount(), which cancels the filesystem context. By that point ClearFiles() had already created a fresh context for the next session, and the late Unmount() call cancelled it. Every subsequent FUSE handler inherited a dead context.

We removed the Unmount() call from the temporary disconnect path. Only the permanent shutdown path (triggered by Shutdown()) calls Unmount().

Resulting design

The FUSE mount is created once and persists for the lifetime of the process. On disconnect, the file maps are cleared and the filesystem context is refreshed. The mount point shows an empty folder. On reconnect, the existing FS object is reused with new gRPC callbacks, and the Run() goroutine detects that the mount is already active and waits for the next context cancellation instead of calling Mount() again. The exposed API is two methods: UnmountFilesystem() (clears state, stops resilience) and Stop() (ends the session). Both are safe to call from any thread, in any order, multiple times.

Notes

The cgofuse one-mount limitation applies to macOS (macFUSE), Linux (libfuse), and Windows (WinFsp), since the signal handler registration is in the cgofuse Go wrapper, not in the platform-specific FUSE implementations. The ClearFiles/Unmount split is cross-platform.

The six bugs appeared to be one bug ("the app freezes on disconnect"). Each fix exposed the next. The timestamps in the structured logs were the primary debugging tool: which goroutine blocked, which channel was never signaled, which context was already cancelled.
