Re: [PATCH V10 00/10] famfs: port into fuse

From: Joanne Koong

Date: Mon Apr 20 2026 - 23:16:25 EST


On Sun, Apr 19, 2026 at 5:27 PM Gregory Price <gourry@xxxxxxxxxx> wrote:
>
> On Sun, Apr 19, 2026 at 03:36:30PM -0500, John Groves wrote:
> > On 26/04/15 10:16AM, David Hildenbrand (Arm) wrote:
> > > On 4/15/26 00:20, Gregory Price wrote:
> > > > On Tue, Apr 14, 2026 at 11:57:40AM -0700, Darrick J. Wong wrote:
> >
> > Gregory's code, in the current form, still uses two new fuse messages,
> > GET_FMAP and GET_DAXDEV, but it makes the fmap message format opaque by
> > removing fmap format structs from the uapi. It also uses two BPF programs.
> > One BPF program parses and validates the GET_FMAP payload for every file,
> > and hangs it from a 'void *' in each fuse_inode (just like the current famfs
> > code). The other BPF program is called during vma faults and reads the
> > fuse_inode->'void *' in order to handle faults the same way famfs-fuse does
> > today, but via BPF instead.
> >
>

Thanks John for running the benchmarks on your hardware. And thanks
Gregory for your work on this too.

> I'll just lay out what i've done and why.
>
> For John's sanity, if there are NACKs, knowing sooner rather than later
> would be a kindness.
>
> === Problem: Any lookup() in iomap_begin() is too much overhead.
>
> No dax-backed server will want to eat the cost of a lookup() that
> could be multiple microseconds on what should be a 1-5us soft-fault.
>
> Joanne's prototype had this:
>
> meta = bpf_map_lookup_elem(&inode_map, &nodeid);
>
> But it was offsetting a single pointer dereference:
>
> struct fuse_inode *fi = get_fuse_inode(inode);
> struct famfs_file_meta *meta = fi->famfs_meta;
>
> Not all O(1) are created equal here.
>
> A single L3 LLC miss plus page table walk can cost you ~100ns.
> If that pointer was cache-hot, it's almost free.
>
> A pointer chase through any structure is N x ~100ns.
> This is unlikely to ever be sufficiently cache hot by comparison.
>
> So, let's just avoid this problem altogether.
>
>
> === Requirements
>
> 1) No hard-coded OMF structures in the FUSE API.
>
> While RAID0 style interleaving isn't exactly fancy or novel,
> folks think this should not be in the kernel headers.
>
> (I'm not going to argue, I think the argument is pointless)
>
>
> 2) iomap_begin() needs metadata accessible on the order of a single
> pointer dereference - which is what John has implemented.
>
>
> 3) open() needs to validate the metadata and identify DAX devices
>
> a) it needs to validate the DAX devices are available and
> acquire them / set them up / etc. This is a kernel-side op.
>
> b) it needs to validate the addressing information is valid for
> the relevant dax devices
>
> Both GET_FMAP and GET_DAXDEV are avoided if the metadata is
> already cached or the DAXDEV is already setup. So keeping these
> separate is actually important.
>
>
> Joanne's code deals with #1 - but it doesn't handle #2 or #3.
> (It also doesn't handle GET_DAXDEV at all).

It handles #3 by removing GET_DAXDEV as a fuse op and having the
daxdev initialization / setup routed through FUSE_IOMAP_CONFIG at
iomap initialization time instead, which integrates with the generic
iomap infrastructure/uapi additions Darrick added in his fuse-iomap
series [1].

In this series the GET_DAXDEV op gets sent lazily on file opens but
it's still not clear to me why this is necessary. imo device setup
should happen logically as part of iomap configuration and it seems
more efficient to have devices validated/acquired before any files are
opened. I think that makes things a lot simpler on the kernel side in
other ways (eg we can get rid of famfs_dax_devlist / famfs_daxdev /
famfs_devlist_sem / famfs_update_daxdev_table() /
famfs_fuse_get_daxdev() altogether). It also saves the famfs server
the roundtrip context switching cost if we get rid of GET_DAXDEV and
move it to iomap initialization time, which will improve FUSE_OPEN
performance for famfs. Maybe there's something I'm missing here as to
why the daxdev initialization has to be done lazily on open?
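
As a rough illustration of the difference, here is a userspace model (not
kernel code) of config-time device setup. Every name in it is hypothetical and
it does not reflect the actual FUSE_IOMAP_CONFIG payload; it only shows that
once devices are registered up front, the open path reduces to a table check
with no round trip to the server.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DEVS 8 /* hypothetical limit, for illustration only */

/* Hypothetical per-connection device table, filled once at config time. */
struct sketch_dev_table {
	bool     ready[SKETCH_MAX_DEVS];
	uint64_t size[SKETCH_MAX_DEVS];
};

/* Config-time path: validate and record a device before any file is opened. */
static int sketch_config_register_dev(struct sketch_dev_table *t,
				      unsigned int idx, uint64_t size)
{
	if (idx >= SKETCH_MAX_DEVS || size == 0 || t->ready[idx])
		return -1;
	t->ready[idx] = true;
	t->size[idx] = size;
	return 0;
}

/* Open-time path: check an extent against the table, no server round trip. */
static int sketch_open_check_extent(const struct sketch_dev_table *t,
				    unsigned int idx, uint64_t offset,
				    uint64_t len)
{
	if (idx >= SKETCH_MAX_DEVS || !t->ready[idx])
		return -1;
	/* Written to avoid overflow in offset + len. */
	if (len == 0 || offset > t->size[idx] || len > t->size[idx] - offset)
		return -1;
	return 0;
}
```

In this model an extent that names an unregistered device simply fails the
open; whether the real code should fail or fall back is a separate question.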

I think Darrick had also mentioned something earlier about how he
thinks GET_DAXDEV should be another application of backing files [2] -
I like this idea too, as it gets rid of the GET_DAXDEV op and reuses
fuse's existing infrastructure.

[1] https://lore.kernel.org/linux-fsdevel/177188734695.3935739.8198854011004837207.stgit@frogsfrogsfrogs/
[2] https://lore.kernel.org/linux-fsdevel/20260416224331.GD114184@frogsfrogsfrogs/

>
> John's code manages #2 and #3 by having the fuse-server pass meta data
> on open() via GET_FMAP and GET_DAXDEV.
>
> GET_FMAP acquires the meta data on how dax devices are used
>
> GET_DAXDEV just translates an ID to specific dax device.
> iomap_begin() then uses the OMF to do the mapping.
>
> But it does this by hard-coding the format into kernel headers.
>
>
> === Observation: Add a BPF dax_fmap_parse() on open()
>
> Pair Joanne's suggestion with John's GET_FMAP/GET_DAXDEV operations.
>
> struct fuse_dax_fmap_ops {
> char name[FUSE_DAX_FMAP_OPS_NAME_LEN]; // 16 bytes
> int (*dax_fmap_parse)(struct fuse_dax_fmap_parse_ctx *ctx);

Just a note for later, if the bpf approach gets pursued further:
instead of making this a dax specific ops, I think this needs to be
integrated interface-wise with Darrick's fuse-iomap work since he does
the same thing. I think dax_fmap_parse() could be renamed to something
like iomap_setup(), where userspace can use this to do any sort of
generic setup, whether that's mapping related or dax related or not.
In my mind, the dax vs non dax distinction is handled by the fuse
iomap plumbing that chooses which iomap entry points to call, but
beyond that, the callbacks and struct ops themselves should be
generic enough to be shared between the two.
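
To sketch what "generic enough" could look like, here is a small userspace
model. The struct and function names are made up for illustration and are not
the real fuse-iomap or struct_ops definitions; the point is only that neither
callback signature mentions dax, and the layout knowledge lives entirely in
the implementation behind the ops.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping result; not the real struct fuse_iomap_io. */
struct sketch_iomap {
	uint32_t dev;
	uint64_t addr;
	uint64_t length;
};

/* Hypothetical generic ops: nothing in the signatures is dax specific. */
struct sketch_iomap_ops {
	/* Validate the server's opaque per-file blob once, at open time. */
	int (*iomap_setup)(const void *blob, size_t blob_size);
	/* Resolve a file offset using the already validated blob. */
	int (*iomap_begin)(const void *blob, uint64_t pos, uint64_t len,
			   struct sketch_iomap *out);
};

/* Toy layout used by this example: one contiguous extent on one device. */
struct linear_meta {
	uint32_t dev;
	uint64_t base;
	uint64_t size;
};

static int linear_setup(const void *blob, size_t blob_size)
{
	const struct linear_meta *m = blob;

	if (!blob || blob_size != sizeof(*m) || m->size == 0)
		return -1;
	return 0;
}

static int linear_begin(const void *blob, uint64_t pos, uint64_t len,
			struct sketch_iomap *out)
{
	const struct linear_meta *m = blob;

	if (pos >= m->size)
		return -1;
	out->dev = m->dev;
	out->addr = m->base + pos;
	/* Clamp the mapping to the end of the extent. */
	out->length = len < m->size - pos ? len : m->size - pos;
	return 0;
}

static const struct sketch_iomap_ops linear_ops = {
	.iomap_setup = linear_setup,
	.iomap_begin = linear_begin,
};
```

A striped or dax-backed layout would plug in a different pair of functions
behind the same two pointers.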

> int (*iomap_begin)(struct fuse_dax_fmap_resolve_ctx *ctx,
> struct fuse_iomap_io *io);
> };
>
> This parse function is used to do filesystem-specific setup (such as
> populating the dax bitmap) based on filesystem-specific per-file metadata.
>
> In John's case, essentially all it does is populate the dax bitmap and
> toss the data onto fi->dax_fmap.meta.
>
> Pseudo code:
>
> fuse_dax_fmap_open(inode):
> fmap_size = send_GET_FMAP(inode, fmap_buf)
>
> /* Make space to store the metadata */
> meta_buf = kzalloc(meta_size)
> ctx = { ... }
> kern = { .ctx, .blob = blob, .meta_buf = meta_buf }
>
> /* Parse the metadata: i.e. fill out the daxdev bitmap */
> fc->dax_fmap_ops->dax_fmap_parse(&ctx)
>
> /* Call GET_DAXDEV for any new dax devices */
> resolve_dev_bitmap(ctx.dev_bitmap)
>
> /* cache the meta data on the inode */
> inode_lock()
> fi->dax_fmap.meta = meta_buf
> ... etc etc ...
> inode_unlock()
>
> And otherwise, iomap_begin() works exactly as Joanne proposed, but with
> in-kernel cached data instead of the bpfmap.
>
> const struct dax_simple_meta *meta = (const struct dax_simple_meta *)
> bpf_fuse_dax_resolve_get_meta(ctx, 0, sizeof(*meta));

Another note for later, if the benchmarks prove promising and after
the LSF discussions we decide to go with this approach: imo we
could/should repurpose this into a generic
bpf_fuse_iomap_get_inode_meta() that returns a bounded pointer into
whatever opaque blob was cached on the inode during iomap_setup(),
where it'd be a generic kfunc serving both the dax and non-dax case
for any kind of mapping layout.
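
The "bounded pointer" part is the piece that matters for safety. A minimal
userspace sketch of that check follows; the struct and function names are
hypothetical, and a real kfunc would additionally need the verifier
annotations that tell BPF how large the returned region is.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the opaque blob cached on the inode. */
struct sketch_inode_meta {
	const void *meta;
	uint32_t meta_size;
};

/*
 * Return a pointer to [offset, offset + size) inside the cached blob, or
 * NULL if the request is empty or does not fit. The comparison is written
 * so that offset + size cannot overflow.
 */
static const void *sketch_get_inode_meta(const struct sketch_inode_meta *im,
					 uint32_t offset, uint32_t size)
{
	if (!im->meta || size == 0)
		return NULL;
	if (offset > im->meta_size || size > im->meta_size - offset)
		return NULL;
	return (const uint8_t *)im->meta + offset;
}
```

Because the helper never interprets the bytes, the same accessor would work
for any layout the setup callback chose to cache.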

>
> And since both parse() and iomap_begin() are bpf programs - and they're
> the only consumers of the metadata - FUSE itself no longer needs to know
> anything about the server's particular strategy to use the dax devices.
>
> struct fuse_inode {
> ...
> #if IS_ENABLED(CONFIG_FUSE_DAX_FMAP)
> struct {
> void *meta;
> u32 meta_size;
> u64 file_size;

I don't think file_size is needed here? seems like we could just
derive this from i_size_read(inode)?

> } dax_fmap;

s/dax_fmap/iomap/

> #endif
> };
>
> Just a big ol' honkin' void* that otherwise gets ignored.
>
> (Note: while i'm not a BPF wizard, this pattern seems well established in
> existing BPF code, i found code in the network stack that caches
> data on kernel objects this way as well)
>
> ==== Caveats
>
> 1) We don't know the overhead BPF introduces in the fault path.
>
> My napkin math (and best understanding of BPF) suggests:
>
> 1) trampoline / vtable for bpf ops (iomap_begin func)
> 2) retpoline cost of BPF (assuming this is on, safe assumption)
> 3) bpf_fuse_dax_resolve_get_meta() overhead (extra pointer deref)
>
> This *should* (i think) amount to an extra pointer dereference, a longjump,
> and a retpoline, which hopefully is <100ns since any extra pointer
> derefs here SHOULD be cache-hot (hard to know).
>
> It's not 0 overhead, and if the average fault time is 1us then every
> additional 10ns is not an insignificant cost.
>
> But this is napkin math. John will collect data.
>
>
> 2) FUSE needs to be ok with the BPF-driven changes:
>
> https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/
>
>
> 3) FUSE needs to be ok with GET_FMAP/GET_DAXDEV as opaque meta-data
> handlers for DAX devices.

I think we could kill GET_DAXDEV and for GET_FMAP, we could make this
a generic FUSE_IOMAP_GETMAP where the server can set a flag on open to
indicate whether the mapping blob should be fetched or not.
>
> That means there is no default parser or format. If you don't
> register ops, these functions are functionally dead.
>
> (probably fine to enforce during init, which is what i did)
>
>
> 4) As John said: MM needs to be good with it.
>
> Any server using DAX like this already essentially has CAP_SYS_RAWIO
> for DAX, and most likely some form of CAP_SYS_ADMIN.
>
> Additionally, as folks have pointed out, the resolution to PTE is
> bounded by dax device extents, so it's not entirely arbitrary.
>
> ===
>
> As mentioned at the start - you'd be doing John a kindness if there are
> clear and obvious NACK's to be had here.

I don't have a NACK on what you wrote above, thank you for your work
on this and bridging it into John's famfs server.

As I understand it, Amir also scheduled a cross-track FS+MM+IO session
at LSF to discuss famfs and dax iomap. Christoph had posted a
suggestion in another message about solving this problem with adding
generic stride/offset multi-device support to fs/iomap, and I'm hoping
the LSF session will shed more light on this, as that to me seems the
cleanest solution and would pretty much give everyone what they want
(including getting famfs unblocked, as I think with this approach we
would just need to figure out the generic stride/offset format for the
fuse iomap uapi, and could have the interleaving logic living in fuse
initially with fs/iomap migrations done post-merge). In the meantime,
I think it's really helpful getting the data points on how bpf
performs, thank you for running the benchmarks on your setup, John.
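
For reference, the kind of arithmetic a generic stride/offset format would
have to describe is small. The sketch below is a userspace model of plain
RAID0-style striping; the field names and the layout itself are assumptions
made for illustration, not a proposed uapi and not what fs/iomap or famfs
implement today.

```c
#include <stdint.h>

/* Hypothetical stride/offset description of one file's layout. */
struct stripe_layout {
	uint32_t ndevs;      /* devices in the stripe set */
	uint64_t chunk_size; /* bytes placed on one device before moving on */
	uint64_t dev_base;   /* where the file's data starts on each device */
};

struct stripe_target {
	uint32_t dev;        /* index of the device holding this offset */
	uint64_t dev_offset; /* byte offset within that device */
	uint64_t contig_len; /* bytes left before the chunk boundary */
};

/* Map a file offset to (device, device offset) under the layout above. */
static int stripe_resolve(const struct stripe_layout *l, uint64_t pos,
			  struct stripe_target *out)
{
	uint64_t chunk, within;

	if (l->ndevs == 0 || l->chunk_size == 0)
		return -1;

	chunk = pos / l->chunk_size;  /* which chunk of the file */
	within = pos % l->chunk_size; /* offset inside that chunk */

	out->dev = (uint32_t)(chunk % l->ndevs);
	out->dev_offset = l->dev_base + (chunk / l->ndevs) * l->chunk_size + within;
	out->contig_len = l->chunk_size - within;
	return 0;
}
```

With 4 devices and 2 MiB chunks, file offset 0x901000 falls in chunk 4, which
wraps back to device 0 on its second row, so the result is device 0 at
dev_base + 0x301000 with 0xFF000 bytes left in the chunk.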

Thanks,
Joanne
>
> ~Gregory