Re: [RFC] Another take at restarting FUSE servers

From: Darrick J. Wong

Date: Wed Nov 05 2025 - 19:21:38 EST


On Wed, Nov 05, 2025 at 11:48:21PM +0100, Bernd Schubert wrote:
>
>
> On 11/5/25 23:42, Darrick J. Wong wrote:
> > On Wed, Nov 05, 2025 at 11:24:01PM +0100, Bernd Schubert wrote:
> >>
> >>
> >> On 11/4/25 14:10, Amir Goldstein wrote:
> >>> On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Tue, Sep 16 2025, Amir Goldstein wrote:
> >>>>
> >>>>> On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
> >>>>>>> On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@xxxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 9/15/25 09:07, Amir Goldstein wrote:
> >>>>>>>>> On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 9/12/25 13:41, Amir Goldstein wrote:
> >>>>>>>>>>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@xxxxxxxxxxx> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 8/1/25 12:15, Luis Henriques wrote:
> >>>>>>>>>>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> >>>>>>>>>>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> >>>>>>>>>>>>>>>>> could restart itself. It's unclear if doing so will actually enable us
> >>>>>>>>>>>>>>>>> to clear the condition that caused the failure in the first place, but I
> >>>>>>>>>>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts
> >>>>>>>>>>>>>>>>> aren't totally crazy.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I'm trying to understand what the failure scenario is here. Is this
> >>>>>>>>>>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what
> >>>>>>>>>>>>>>>> is supposed to happen with respect to open files, metadata and data
> >>>>>>>>>>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run
> >>>>>>>>>>>>>>>> e2fsck -fy, but if there are dirty inodes on the system, that's going
> >>>>>>>>>>>>>>>> potentially to be out of sync, right?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> What are the recovery semantics that we hope to be able to provide?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> <echoing what we said on the ext4 call this morning>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
> >>>>>>>>>>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> >>>>>>>>>>>>>>> would initiate GETATTR requests on all the cached inodes to validate
> >>>>>>>>>>>>>>> that they still exist; and then resend all the unacknowledged requests
> >>>>>>>>>>>>>>> that were pending at the time. It might be the case that you have to do
> >>>>>>>>>>>>>>> that in the reverse order; I only know enough about the design of fuse
> >>>>>>>>>>>>>>> to suspect that to be true.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Anyhow once those are complete, I think we can resume operations with
> >>>>>>>>>>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are
> >>>>>>>>>>>>>>> fuse_make_bad'd, which effectively revokes them.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> >>>>>>>>>>>>>> but probably GETATTR is a better option.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So, are you currently working on any of this? Are you implementing this
> >>>>>>>>>>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer
> >>>>>>>>>>>>>> look at fuse2fs too.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Sorry for joining the discussion late, I was totally occupied, day and
> >>>>>>>>>>>>> night. Added Kevin to CC, who is going to work on recovery on our
> >>>>>>>>>>>>> DDN side.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> >>>>>>>>>>>>> server restart we want kernel to recover inodes and their lookup count.
> >>>>>>>>>>>>> Now inode recovery might be hard, because we currently only have a
> >>>>>>>>>>>>> 64-bit node-id - which is used by most fuse applications as a memory
> >>>>>>>>>>>>> pointer.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> >>>>>>>>>>>>> outstanding requests. And that ends up in most cases in sending requests
> >>>>>>>>>>>>> with invalid node-IDs, that are cast and might provoke random memory
> >>>>>>>>>>>>> access on restart. Kind of the same issue why fuse nfs export or
> >>>>>>>>>>>>> open_by_handle_at doesn't work well right now.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> >>>>>>>>>>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
> >>>>>>>>>>>>> And then FUSE_REVALIDATE_FH on server restart.
> >>>>>>>>>>>>> The file handles could be stored into the fuse inode and also used for
> >>>>>>>>>>>>> NFS export.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
> >>>>>>>>>>>>> Adding Amir to CC.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> >>>>>>>>>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@xxxxxxxxxxxxxx/
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the reference Amir! I even had been in that thread.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> >>>>>>>>>>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
> >>>>>>>>>>>>> Any objections against that?
> >>>>>>>>>>
> >>>>>>>>>> What if you actually /can/ reuse a nodeid after a restart? Consider
> >>>>>>>>>> fuse4fs, where the nodeid is the on-disk inode number. After a restart,
> >>>>>>>>>> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> >>>>>>>>>> didn't delete it, obviously.
> >>>>>>>>>
> >>>>>>>>> FUSE_LOOKUP_HANDLE is a contract.
> >>>>>>>>> If fuse4fs can reuse nodeid after restart then by all means, it should sign
> >>>>>>>>> this contract, otherwise there is no way for client to know that the
> >>>>>>>>> nodeids are persistent.
> >>>>>>>>> If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
> >>>>>>>>> API trivial.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I suppose you could just ask for refreshed stat information and either
> >>>>>>>>>> the server gives it to you and the fuse_inode lives; or the server
> >>>>>>>>>> returns ENOENT and then we mark it bad. But I'd have to see code
> >>>>>>>>>> patches to form a real opinion.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> You could make fuse4fs_handle := <nodeid:fuse_instance_id>
> >>>>>>>>> where fuse_instance_id can be its start time or random number.
> >>>>>>>>> for auto invalidate, or maybe the fuse_instance_id should be
> >>>>>>>>> a native part of FUSE protocol so that client knows to only invalidate
> >>>>>>>>> attr cache in case of fuse_instance_id change?
> >>>>>>>>>
> >>>>>>>>> In any case, instead of a storm of revalidate messages after
> >>>>>>>>> server restart, do it lazily on demand.
> >>>>>>>>
> >>>>>>>> For a network file system, probably. For fuse4fs or other block
> >>>>>>>> based file systems, not sure. Darrick has the example of fsck.
> >>>>>>>> Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
> >>>>>>>> fuse-server gets restarted, fsck'ed and some files get removed.
> >>>>>>>> Now reading these inodes would still work - wouldn't it
> >>>>>>>> be better to invalidate the cache before going into operation
> >>>>>>>> again?
> >>>>>>>
> >>>>>>> Forgive me, I was making a wrong assumption that fuse4fs
> >>>>>>> was using ext4 filehandle as nodeid, but of course it does not.
> >>>>>>
> >>>>>> Well now that you mention it, there /is/ a risk of shenanigans like
> >>>>>> that. Consider:
> >>>>>>
> >>>>>> 1) fuse4fs mount an ext4 filesystem
> >>>>>> 2) crash the fuse4fs server
> >>>>>> <fuse4fs server restart stalls...>
> >>>>>> 3) e2fsck -fy /dev/XXX deletes inode 17
> >>>>>> 4) someone else mounts the fs, makes some changes that result in 17
> >>>>>> being reallocated, user says "OOOOOPS", unmounts it
> >>>>>> 5) fuse4fs server finally restarts, and reconnects to the kernel
> >>>>>>
> >>>>>> Hey, inode 17 is now a different file!!
> >>>>>>
> >>>>>> So maybe the nodeid has to be an actual file handle. Oh wait, no,
> >>>>>> everything's (potentially) fine because fuse4fs supplied i_generation to
> >>>>>> the kernel, and fuse_stale_inode will mark it bad if that happens.
> >>>>>>
> >>>>>> Hm ok then, at least there's a way out. :)
> >>>>>>
> >>>>>
> >>>>> Right.
> >>>>>
> >>>>>>> The reason I made this wrong assumption is because fuse4fs *can*
> >>>>>>> already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
> >>>>>>> which is what my fuse passthrough library [1] does.
> >>>>>>>
> >>>>>>> My claim was that although fuse4fs could support safe restart, which
> >>>>>>> cannot read from recycled inode number with current FUSE protocol,
> >>>>>>> doing so with FUSE_HANDLE protocol would express a commitment
> >>>>>>
> >>>>>> Pardon my naïvete, but what is FUSE_HANDLE?
> >>>>>>
> >>>>>> $ git grep -w FUSE_HANDLE fs
> >>>>>> $
> >>>>>
> >>>>> Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE):
> >>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@xxxxxxxxxxxxxx/
> >>>>>
> >>>>> Which means to communicate a variable sized "nodeid"
> >>>>> which can also be declared as an object id that survives server restart.
> >>>>>
> >>>>> Basically, the reason that I brought up LOOKUP_HANDLE is to
> >>>>> properly support NFS export of fuse filesystems.
> >>>>>
> >>>>> My incentive was to support a proper fuse server restart/remount/re-export
> >>>>> with the same fsid in /etc/exports, but this gives us a better starting point
> >>>>> for fuse server restart/re-connect.
> >>>>
> >>>> Sorry for resurrecting (again!) this discussion. I've been thinking about
> >>>> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation.
> >>>> However, I feel there are other operations that will need to return this
> >>>> new handle.
> >>>>
> >>>> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid.
> >>>> Doesn't this mean that, if the user-space server supports the new
> >>>> LOOKUP_HANDLE, it should also return a handle in reply to the CREATE
> >>>> request?
> >>>
> >>> Yes, I think that's what it means.
> >>>
> >>>> The same question applies for TMPFILE, LINK, etc. Or is there
> >>>> something special about the LOOKUP operation that I'm missing?
> >>>>
> >>>
> >>> Any command returning fuse_entry_out.
> >>>
> >>> READDIRPLUS, MKNOD, MKDIR, SYMLINK
> >>
> >> Btw, check out <libfuse>/doc/libfuse-operations.txt for these
> >> things. With double checking, though, the file was mostly created by AI
> >> (just added a correction today). With that it is easy to see the missing
> >> FUSE_TMPFILE.
> >>
> >>
> >>>
> >>> fuse_entry_out was extended once and fuse_reply_entry()
> >>> sends the size of the struct.
> >>
> >> Sorry, I'm confused. Where does fuse_reply_entry() send the size?
> >>
> >>> However fuse_reply_create() sends it with fuse_open_out
> >>> appended and fuse_add_direntry_plus() does not seem to write
> >>> record size at all, so server and client will need to agree on the
> >>> size of fuse_entry_out and this would need to be backward compat.
> >>> If both server and client declare support for FUSE_LOOKUP_HANDLE
> >>> it should be fine (?).
> >>
> >> If max_handle size becomes a value in fuse_init_out, server and
> >> client would use it? I think appended fuse_open_out could just
> >> follow the dynamic actual size of the handle - code that
> >> serializes/deserializes the response has to look up the actual
> >> handle size then. For example I wouldn't know what to put in
> >> for any of the example/passthrough* file systems as handle size -
> >> would need to be 128B, but the actual size will be typically
> >> much smaller.
> >
> > name_to_handle_at ?
> >
> > I guess the problem here is that technically speaking filesystems could
> > have variable sized handles depending on the file. Sometimes you encode
> > just the ino/gen of the child file, but other times you might know the
> > parent and put that in the handle too.
>
> Yeah, I don't think it would be reliable for *all* file systems to use
> name_to_handle_at on startup on some example file/directory. At least
> not without knowing all the details of the underlying passthrough file
> system.

I think if you can send arbitrarily sized outblobs back to the kernel
then it would be ok for a filesystem to have different handle sizes for
a file, just so long as it doesn't change during the lifetime of a file.
Obviously you couldn't then have a meaningful fs-wide max_handle_size.
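
To make the "arbitrarily sized outblob" idea concrete, here is a rough
userspace sketch of a length-prefixed handle record. None of this is
existing FUSE protocol or libfuse API: struct demo_handle, the demo_*
helpers and DEMO_MAX_HANDLE are all made-up names, and the 128 byte cap is
only borrowed from the "max 128 byte file handle" suggestion upthread.
Each record carries its own length, so two files can have different handle
sizes and the parser never needs an fs-wide size to find the end of one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical upper bound on a single handle, not a real FUSE constant. */
#define DEMO_MAX_HANDLE 128

/* Hypothetical in-memory form of a variable-sized handle. */
struct demo_handle {
	uint32_t len;                    /* bytes actually used in data[] */
	uint8_t  data[DEMO_MAX_HANDLE];
};

/*
 * Serialize a handle as <u32 length><length bytes> into buf.
 * Returns the number of bytes written, or 0 if it does not fit.
 */
static size_t demo_handle_pack(const struct demo_handle *h,
			       uint8_t *buf, size_t buflen)
{
	size_t need;

	if (h->len > DEMO_MAX_HANDLE)
		return 0;
	need = sizeof(h->len) + h->len;
	if (need > buflen)
		return 0;
	memcpy(buf, &h->len, sizeof(h->len));
	memcpy(buf + sizeof(h->len), h->data, h->len);
	return need;
}

/*
 * Parse one record from buf. Returns the number of bytes consumed,
 * or 0 if the record is truncated or claims an oversized handle.
 */
static size_t demo_handle_unpack(struct demo_handle *h,
				 const uint8_t *buf, size_t buflen)
{
	uint32_t len;

	if (buflen < sizeof(len))
		return 0;
	memcpy(&len, buf, sizeof(len));
	if (len > DEMO_MAX_HANDLE || buflen - sizeof(len) < len)
		return 0;
	h->len = len;
	memcpy(h->data, buf + sizeof(len), len);
	return sizeof(len) + len;
}

/*
 * The "doesn't change during the lifetime of a file" rule: a handle
 * seen later for the same file must match in both length and content.
 */
static int demo_handle_same(const struct demo_handle *a,
			    const struct demo_handle *b)
{
	return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```

The length is copied in host byte order, which is only reasonable because
the sketch assumes both ends live on the same machine. A reply that appends
something after the handle (the way fuse_reply_create appends fuse_open_out
today) would start at the offset returned by demo_handle_unpack().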

--D

>
> Thanks,
> Bernd
>