Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
From: Jingbo Xu
Date: Tue May 28 2024 - 05:45:35 EST
Hi, Christian,
Thanks for the review.
On 5/28/24 4:38 PM, Christian Brauner wrote:
> On Fri, May 24, 2024 at 02:40:28PM +0800, Jingbo Xu wrote:
>> Background
>> ==========
>> The fd of '/dev/fuse' serves as a message transmission channel between
>> FUSE filesystem (kernel space) and fuse server (user space). Once the
>> fd gets closed (intentionally or unintentionally), the FUSE filesystem
>> gets aborted, and any attempt at filesystem access gets an -ECONNABORTED
>> error until the FUSE filesystem is finally unmounted.
>>
>> It is one of the requisites in a production environment to provide
>> uninterruptible filesystem service. The most straightforward way, and
>> maybe the most widely used way, is to have another dedicated user
>> daemon (similar to systemd fdstore) keep the device fd open. When the
>> fuse daemon recovers from a crash, it can retrieve the device fd from the
>> fdstore daemon through socket takeover (Unix domain socket) method [1]
>> or pidfd_getfd() syscall [2]. In this way, as long as the fdstore
>> daemon doesn't exit, the FUSE filesystem won't get aborted once the fuse
>> daemon crashes, though the filesystem service may hang there for a while
>> when the fuse daemon gets restarted and has not been completely
>> recovered yet.
>>
>> This scheme indeed works and was deployed in our internal
>> production environment until the following issues were encountered:
>>
>> 1. The fdstore daemon may be killed by mistake, in which case the FUSE
>> filesystem gets aborted and irrecoverable.
>
> That's only a problem if you use the fdstore of the per-user instance.
> The main fdstore is part of PID 1 and you can't kill that. So really,
> systemd needs to hand the fds from the per-user instance to the main
> fdstore.
Systemd has indeed implemented its own fdstore mechanism in user space.
Nowadays, more and more fuse daemons run inside containers, but a
container generally has no systemd inside it.
>
>> 2. In scenarios of containerized deployment, the fuse daemon is deployed
>> in a container POD, and a dedicated fdstore daemon needs to be deployed
>> for each fuse daemon. The fdstore daemon could consume a certain amount of
>> resources (e.g. memory footprint), which is not conducive to the dense
>> container deployment.
>>
>> 3. Each fuse daemon implementation needs to implement its own fdstore
>> daemon. If we implement the fuse recovery mechanism on the kernel side,
>> all fuse daemon implementations could reuse this mechanism.
>
> You can just use the global fdstore. That is a design limitation not an
> inherent limitation.
What I initially meant is that each fuse daemon implementation (e.g.
s3fs, ossfs, and those of other vendors) needs to build its own, yet
similar, mechanism for daemon failover. For container scenarios there is
no common fdstore component comparable to the systemd fdstore.
I admit that it's controversial to implement a kernel-side fdstore.
Thus I only implemented a failover mechanism for the fuse server in this
RFC patch. But I also understand Miklos's concern, since what we really
need in order to support daemon failover is just something like an
fdstore to keep the device fd alive.
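For readers not familiar with the user-space approach, the socket
takeover idea from the background section can be sketched roughly as
below. This is only a minimal illustration under my own assumptions, not
code from the patch set or from any existing fuse daemon: a pipe stands
in for the '/dev/fuse' fd, both sides run in one process over a
socketpair, and the helper names (hand_over_fd, take_over_fd, demo) are
made up for the example. It needs Python 3.9+ for socket.send_fds() and
socket.recv_fds().

```python
import os
import socket


def hand_over_fd(keeper_sock, fd):
    """Keeper side: pass `fd` to the peer as SCM_RIGHTS ancillary data."""
    # At least one byte of ordinary data must accompany the ancillary message.
    socket.send_fds(keeper_sock, [b"F"], [fd])


def take_over_fd(daemon_sock):
    """Restarted-daemon side: receive one fd from the keeper."""
    _data, fds, _flags, _addr = socket.recv_fds(daemon_sock, 1, 1)
    return fds[0]


def demo():
    # Unix domain socket pair standing in for the keeper <-> daemon channel.
    keeper_sock, daemon_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # A pipe stands in for the /dev/fuse fd (assumption for illustration only).
    read_end, kept_fd = os.pipe()

    # The "daemon" had its own copy of the fd and then crashed (copy closed).
    daemon_fd = os.dup(kept_fd)
    os.close(daemon_fd)

    # The keeper still holds kept_fd, so the channel stays open; the restarted
    # daemon retrieves a fresh descriptor for the same open file.
    hand_over_fd(keeper_sock, kept_fd)
    new_fd = take_over_fd(daemon_sock)

    os.write(new_fd, b"alive")
    data = os.read(read_end, 5)

    for fd in (read_end, kept_fd, new_fd):
        os.close(fd)
    keeper_sock.close()
    daemon_sock.close()
    return data
```

The point of the sketch is only that the channel survives as long as one
process keeps a reference to the open file, which is exactly the role
the fdstore daemon plays today.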
--
Thanks,
Jingbo