Re: [PATCH V10 00/10] famfs: port into fuse

From: Joanne Koong

Date: Thu Apr 16 2026 - 11:57:26 EST

On Tue, Apr 14, 2026 at 5:10 PM John Groves <John@xxxxxxxxxx> wrote:
>
> On 26/04/14 03:13PM, Joanne Koong wrote:
> > On Tue, Apr 14, 2026 at 11:57 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Apr 14, 2026 at 08:41:42AM -0500, John Groves wrote:
> > > > On 26/04/14 03:19PM, Miklos Szeredi wrote:
> > > > > On Fri, 10 Apr 2026 at 21:44, Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> > > > >
> > > > > > Overall, my intention with bringing this up is just to make sure we're
> > > > > > at least aware of this alternative before anything is merged and
> > > > > > permanent. If Miklos and you think we should land this series, then
> > > > > > I'm on board with that.
> > > > >
> > > > > TBH, I'd prefer not to add the famfs specific mapping interface if not
> > > > > absolutely necessary. This was the main sticking point originally,
> > > > > but there seemed to be no better alternative.
> > > > >
> > > > > However with the bpf approach this would be gone, which is great.
> > >
> > > Well... you can't get away with having *no* mapping interface at all.
> >
> > Yes but the mapping interface should be *generic*, not one that is so
> > specifically tailored to one server. fuse will have to support this
> > forever.
>
> Mapping interfaces being generic is a nice idea, but I'm no sure it's
> realistic in a generalized sense. But other mitigating comments below.
>
> >
> > > You still have to define a UABI that BPF programs can use to convey
> > > mapping data into fsdax/iomap. BTF is a nice piece of work that smooths
> > > over minor fluctuations in struct layout between a running kernel and
> > > a precompiled BPF program, but fundamentally we still need a fuse-native
> > > representation.
> > >
> > > That last sentence was an indirect way of saying: No, we're not going
> > > to export struct iomap to userspace. The fuse-iomap patchset provides
> > > all the UABI pieces we need for regular filesystems (ext4) and hardware
> > > adjacent filesystems (famfs) to exchange file mapping data with the
> > > kernel. This has been out for review since last October, but the lack
> > > of engagement with that patchset (or its February resubmission) doesn't
> > > leave me with confidence that any of it is going anywhere.
> > >
> > > Note: The reason for bolting BPF atop fuse-iomap is so that famfs can
> > > upload bpf programs to generate interleaved mappings. It's not so hard
> > > to convert famfs' iomapping paths to use fuse-iomap, but I haven't
> > > helped him do that because:
> > >
> > > a) I have no idea what Miklos' thoughts are about merging any of the
> > > famfs stuff.
> > >
> > > b) I also have no idea what his thoughts are about fuse-iomap. The
> > > sparse replies are not encouraging.
> > >
> > > c) It didn't seem fair to John to make him take on a whole new patchset
> > > dependency given (a) and (b).
> > >
> > > d) Nobody ever replied to my reply to the LSFMM thread about "can we do
> > > some code review of fuse iomap without waiting three months for LSFMM?"
> > > I've literally done nothing with fuse-iomap for two of the three months
> > > requested.
> > >
> > > > > So let us please at least have a try at this. I'm not into bpf yet,
> > > > > but willing to learn.
> > >
> > > I sent out the patches to enable exactly this sort of experimentation
> > > two months ago, and have not received any responses:
> > >
> > > https://lore.kernel.org/linux-fsdevel/177188736765.3938194.6770791688236041940.stgit@frogsfrogsfrogs/
> > >
> > > I would like to say this as gently as possible: I don't know what the
> > > problem here is, Miklos -- are you uninterested in the work? Do you
> > > have too many other things to do inside RH that you can't talk about?
> > > Is it too difficult to figure out how the iomap stuff fits into the rest
> > > of the fuse codebase? Do you need help from the rest of us to get
> > > reviews done? Is there something else with which I could help?
> > >
> > > Because ... over the past few years, many of my team's filesystem
> > > projects have endured monthslong review cycles and often fail to get
> > > merged. This has led to burnout and frustration among my teammates such
> > > that many of them chose to move on to other things. For the remaining
> > > people, it was very difficult to justify continuing headcount when
> > > progress on projects is so slow that individuals cannot achieve even one
> > > milestone per quarter on any project.
> > >
> > > There's now nobody left here but me.
> > >
> > > I'm not blaming you (Miklos) for any of this, but that is the current
> > > deplorable state of things.
> > >
> > > > > Thanks,
> > > > > Miklos
> > > >
> > > > Thanks for responding...
> > > >
> > > > My short response: Noooooooooo!!!!!!
> > > >
> > > > I very strongly object to making this a prerequisite to merging. This
> > > > is an untested idea that will certainly delay us by at least a couple
> > > > of merge windows when products are shipping now, and the existing approach
> > > > has been in circulation for a long time. It is TOO LATE!!!!!!
> > >
> > > /me notes that has "we're shipping so you have to merge it over peoples'
> > > concerns" rarely carries the day in LKML land, and has never ended well
> > > in the few cases that it happens. As Ted is fond of saying, this is a
> > > team sport, not an individual effort. Unfortunately, to abuse your
> > > sports metaphor, we all play for the ******* A's.
> > >
> > > That said, you're clearly pissed at the goalposts changing yet again,
> > > and that's really not fair that we collectively keep moving them.
> > >
> > > It's a rotten situation that I could have even helped you to solve both
> > > our problems via fuse-iomap, but I just couldn't motivate myself to
> > > entwine our two projects until the technical direction questions got
> > > answered.
> > >
> > > > Famfs is not a science project, it's enablement for actual products and
> > > > early versions are available now!!!
> > > >
> > > > That doesn't mean we couldn't convert later IF THERE ARE NO HIDDEN PROBLEMS.
> > >
> > > Heck, the fuse command field is a u32. There are plenty of numberspace
> > > left, and the kernel can just *stop issuing them*.
> >
> > I don't think the problem is the command field. As I understand it, if
> > this lands and is converted over later, none of the famfs code in this
> > series can be removed from fuse. If fuse has native non-bpf support
> > for famfs, then it will always need to have that. That's the part that
> > worries me.
>
> I believe this basic premise is completely wrong. Here is why:
>
> There is a FUSE_DAX_FMAP capability that the kernel may advertise or not
> at init time; this capability "is" the famfs GET_FMAP AND GET_DAXDEV
> commands. In the future, if we find a way to use BPF (or some other
> mechanism) to avoid needing those fuse messages, the kernel could be updated
> to NEVER advertise the FUSE_DAX_FMAP capability. All of the famfs-specific
> code could be taken out of kernels that never advertise that capability.

I’m not sure the capability bit can be used like that (though I am
hoping it can!). As I understand it, once the kernel advertises a
capability, it must continue supporting it in future kernels else
userspace programs that rely on it will break.

John, with what you have in this series, is there any way we can make
it generalizable so it can be used by any dax-based server? Would you
be open to that?

My main gripe is with the layout encoding for FUSE_GET_FMAP and how it
bakes in famfs-specific layout concepts (simple vs interleaved
extents, file type, strip definitions, etc) into the uapi and forces
the kernel to interpret/process this super-specific encoding in a
hardcoded way. Is it not possible to just move this all to the server
side? Could the famfs server just preprocess the mappings and take
care of the simple vs interleaved extents stuff and do all the logic
that's in [1] in userspace and simply just give the kernel a list of
generic extent mapping information? The kernel could then just cache
this generic mapping per inode and do the lookup on fault with no
famfs-specific logic needed in the kernel. Or is there a reason the
famfs processing logic needs to happen in the kernel? That's the part
I'm unclear about.

I think this is pretty much what Darrick is doing in his iomap
patchset [2] except in his case, the mappings are fetched lazily on
access, whereas I guess for your case that would be too slow so it
needs to be prepopulated at open time. imo that seems fine since
prepopulating mappings would be a useful feature in general. So could
we rename FUSE_GET_FMAP to something like FUSE_IOMAP_GETMAP and have
it just return super generic mapping info? Darrick already adds a
"struct fuse_iomap_io" [3] to the uapi, could we just use that and add
a uint32_t dev_id; to that?

Looking more at Darrick's patchset, his caching [4] uses the b+ tree
and is more performant but imo also more complex, but I don't think
that would be necessary for this series. imo we could just use a basic
interval tree for now for famfs's needs and optimize / converge it
later.

As well, are you open to renaming the FUSE_DAX_FMAP capability and
FUSE_FAMFS_DAX config to a generic FUSE_IOMAP_DAX naming scheme
instead?

Additionally, it would be really nice to replace FUSE_GET_DAXDEV and
add the daxdev config stuff to the generic FUSE_IOMAP_CONFIG Darrick
has for iomap [5], but this one might be more contentious in getting
the config uapi solidified. I think in general though for famfs, it
makes more sense for all the dax devices to be set up upfront all at
once at connection init time instead of discovered lazily on every
file open, which as I understand it, would make things less complex
server side for famfs too, so even if this doesn't go through
FUSE_IOMAP_CONFIG, maybe it makes sense to have this be on init where
the server sends all the daxdev device information all at once? That
seems less confusing (and better performance-wise) of a flow than
having it be lazily discovered on open, though if I'm misunderstanding
something here please let me know.

I'm not sure how Miklos feels about this, but I think this would get
rid of all the famfs-specific logic in fuse and would be adding
infrastrucutre that other dax-backed servers in the future would use.
On my end, this eases my concerns and would be just as good as what
the bpf approach is trying to accomplish.

There's some ties between your and Darrick's work as you guys are both
using iomap, and maybe/hopefully this aligns with Darrick's thoughts
combining the two, but in my opinion (and maybe Darrick disagrees with
this), to get famfs out the door, we wouldn't need all the code in
Darrick's patchset to land before famfs. In my view, we would only
need to borrow/steal the 'struct fuse_iomap_io' definition in uapi and
hopefully converge on a FUSE_IOMAP_CONFIG uapi format, for famfs to
land, if you agree this direction makes sense. I think we just need to
get the uapi stuff figured out and everything else could be optimized
in the future post-merge. Darrick, do you agree with this assessment?

John, is this a middle ground you think is reasonable? I know this
whole thing has kind of been a mess with the moving goalposts - if the
above would work for famfs but you're slammed with other work and
don't have the bandwidth to make these changes, I'm happy to help out
with the restructuring if that would be useful.

>
> Simple, really. Can't re-use the message opcodes, but as Darrick pointed out
> those are not a scarce resource.
>
> >
> > >
> > > > What are the risks of converting to BPF?
> >
> > I think maybe there is a misinterpretation of what the alternative
> > approach entails. From my point of view, the alternative approach is
> > not that different from what is already in this series. The only piece
> > of the famfs logic that would need to use bpf is the logic for
> > finding/computing the extent mappings (which is the famfs-specific
> > logic that would not be applicable to any other server). That famfs
> > bpf code is minimal and already written [1], as it is just the logic
> > that is in patch 6 [2] in this series copied over. No other part of
> > famfs touches bpf. The rest is renaming the functions in
> > fs/fuse/famfs.c to generic fuse_iomap_dax_XXX names (the logic is the
> > same logic in this series, eg invoking the lower-level calls to
> > dax_iomap_rw/fault/etc) and moving the daxdev setup/initialization to
> > connection initialization time where the server passes that daxdev
> > setup info/configs upfront. I don't think this would delay things by
> > several merge windows, as the code is already mostly written. If it
> > would be helpful, I can clean up what's in the prototype and send that
> > out.
> >
> > I think the part that is not clear yet and needs to be verified is
> > whether this approach runs into any technical limitations on famfs's
> > production workloads. For example, does the overhead of using bpf maps
> > lead to a noticeable performance drop on real workloads? In the
> > future, will there be too many extent mappings on high-scale systems
> > to make this feasible? etc. If there are technical reasons why the
> > famfs logic has to be in fuse, then imo we should figure that out and
> > ideally that's the discussion we should be having. I am not a cxl
> > expert so perhaps there is something missing in the approach that
> > makes it not sufficient on production systems. If we don't end up
> > going with the alternative approach, I still think this series should
> > try to make the famfs uapi additions to fuse as generic as possible
> > since that will be irreversible.
> >
> > If we expedited the alternative approach in terms of reviewing and
> > merging, would that suffice? Is the main pushback the timing of it, eg
> > that it would take too long to get reviewed, merged, and shipped?
> >
> > > >
> > > > - I don't know how to do it - so it'll be slow (kinda like my fuse learning
> > > > curve cost about a year because this is not that similar to anything
> > > > else that was already in fuse.
> > >
> > > ...and per above, BPF isn't some magic savior that avoids the expansion
> > > of the UABI.
> >
> > It doesn't avoid the expansion of the UABI but it makes the UABI
> > generic (eg plenty of future servers can/will use the generic iomap
> > layer).
>
> Um, advertised capabilities allow contraction of the UABI-handling code with
> only some small cruft. Code that is only reachable in the presence of dead
> capability can totally be removed.
>
> >
> > >
> > > > - Those of us who are involved don't fully understand either the security
> > > > or performance implications of this. It
> > >
> > > Correct. I sure think it's swell that people can inject IR programs
> > > that jit/link into the kernel. Don't ask which secondary connotation of
> > > "swell" I'm talking about.
> >
> > bpf is used elsewhere in the kernel (eg networking, scheduling). If it
> > is the case that it is unsafe (which maybe it is, I don't know), then
> > wouldn't those other areas have the same issues?
>
> See my long comment to Darrick's prior email.
>
> I suspect that this would be the only place BPF has been tried for a vma
> fault handler. That is a special, performance critical path - especially
> for famfs. In discussion with the right people we can probably reason
> through whether this is a non-starter or not.

Yes, I think the bpf overhead is the main uncertainty about whether
this suffices or not for famfs. I understand (from the all caps in
your previous messages :)) that performance is critical. The bpf
overhead could indeed be too much for the special famfs performance
critical path. I'll try to get some time next week to benchmark this.

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/0100019d43e79794-0eadcf5e-b659-43f7-8fdc-dec9f4ccce14-000000@xxxxxxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/linux-fsdevel/177188733084.3935219.10400570136529869673.stgit@frogsfrogsfrogs/T/#t
[3] https://lore.kernel.org/linux-fsdevel/176169810371.1424854.3010195280915622081.stgit@frogsfrogsfrogs/
[4] https://lore.kernel.org/linux-fsdevel/177188735954.3937557.841478048197856035.stgit@frogsfrogsfrogs/
[5] https://lore.kernel.org/linux-fsdevel/176169813786.1427432.414564085463311156.stgit@frogsfrogsfrogs/

>
> >
> > >
> > > > - Famfs is enabling access to memory and mapping fault handling must be
> > > > at "memory speed". We know that BPF walks some data structures when a
> > > > program executes. That exposes us to additional serialized L3 cache
> > > > misses each time we service a mapping fault (any TLB & page table miss).
> > > > This should be studied side-by-side with the existing approach under
> > > > multiple loads before being adopted for production.
> > >
> > > Yes, it should. AFAICT if one switched to a per-inode bpf program, then
> > > you could do per-inode bpf programs. Then you don't even need the bpf
> > > map, and the ->iomap_begin becomes an indirect call into JITted x86_64
> > > math code.
> > >
> > > (The downside is that dyn code can't be meaningfully signed, requires
> > > clang on the system, and you have to deal with inode eviction issues.)
> > >
> > > > - This has never been done in production, and we're throwing it in the way
> > > > of a project that has been soaking for years and needs to support early
> > > > shipments of products.
> > >
> > > Correct. I haven't even implemented BPF-iomap for fuse4fs. This BPF
> > > integration stuff is *highly* experimental code.
> >
> > I think what fuse4fs needs for bpf is significantly more complicated
> > and intensive than what famfs needs. For famfs, the extent mapping
> > logic is straightforward computation.
> >
> > >
> > > > If this is the only path, I'd like to revive famfs as a standalone file
> > > > system. I'm still maintaining that and it's still in use.
> > >
> > > Honestly, you should probably just ship that to your users. As long as
> > > the ondisk format doesn't change much, switching the implementation at a
> > > later date is at least still possible.
> >
> > I recognize this is an unfair situation John as you've already spent
> > years working on this and did what the community asked with rewriting
> > it. What I'm hoping to convey is that the approach where the extent
> > computing/finding logic gets moved to bpf is not radically different
> > from the famfs logic already in this patchset. In my view, moving this
> > logic to bpf is more advantageous for both fuse *and* famfs
> > (decoupling famfs releases from kernel releases) - it would be great
> > to consider this on technical merits if expediting the timeline of the
> > alternative approach would suffice.
> >
> > Thanks,
> > Joanne
> >
> > [1] https://github.com/joannekoong/libfuse/blob/444fa27fa9fd2118a0dc332933197faf9bbf25aa/example/famfs.bpf.c
> > [2] https://lore.kernel.org/linux-fsdevel/0100019d43e79794-0eadcf5e-b659-43f7-8fdc-dec9f4ccce14-000000@xxxxxxxxxxxxxxxxxxx/
> >
> > >
> > > --D
>
> Regards,
> John
>