Re: [PATCH V10 00/10] famfs: port into fuse

From: Gregory Price

Date: Sun Apr 19 2026 - 20:27:28 EST

On Sun, Apr 19, 2026 at 03:36:30PM -0500, John Groves wrote:
> On 26/04/15 10:16AM, David Hildenbrand (Arm) wrote:
> > On 4/15/26 00:20, Gregory Price wrote:
> > > On Tue, Apr 14, 2026 at 11:57:40AM -0700, Darrick J. Wong wrote:
>
> Gregory's code, in the current form, still uses two new fuse messages,
> GET_FMAP and GET_DAXDEV, but it makes the fmap message format opaque by
> removing fmap format structs from the uapi. It also uses two BPF programs.
> One BPF program parses and validates the GET_FMAP payload for every file,
> and hangs it from a 'void *' in each fuse_inode (just like the current famfs
> code). The other BPF program is called during vma faults and reads the
> fuse_inode->'void *' in order to handle faults the same way famfs-fuse does
> today, but via BPF instead.
>

I'll just lay out what I've done and why.

For John's sanity, if there are NACKs, knowing sooner rather than later
would be a kindness.

=== Problem: Any lookup() in iomap_begin() is too much overhead.

No dax-backed server will want to eat the cost of a lookup() that
could be multiple microseconds on what should be a 1-5us soft-fault.

Joanne's prototype had this:

meta = bpf_map_lookup_elem(&inode_map, &nodeid);

But it was replacing a single pointer dereference:

struct fuse_inode *fi = get_fuse_inode(inode);
struct famfs_file_meta *meta = fi->famfs_meta;

Not all O(1) are created equal here.

A single L3 LLC miss plus page table walk can cost you ~100ns.
If that pointer was cache-hot, it's almost free.

A pointer chase through any structure is N x ~100ns.
This is unlikely to ever be sufficiently cache hot by comparison.

So, let's just avoid this problem altogether.

=== Requirements

1) No hard-coded OMF structures in the FUSE API.

While RAID0 style interleaving isn't exactly fancy or novel,
folks think this should not be in the kernel headers.

(I'm not going to argue, I think the argument is pointless)

2) iomap_begin() needs metadata accessible on the order of a single
pointer dereference - which is what John has implemented.

3) open() needs to validate the metadata and identify DAX devices

a) it needs to validate the DAX devices are available and
acquire them / set them up / etc. This is a kernel-side op.

b) it needs to validate the addressing information is valid for
the relevant dax devices

Both GET_FMAP and GET_DAXDEV are avoided if the metadata is
already cached or the DAXDEV is already set up. So keeping these
separate is actually important.

Joanne's code deals with #1 - but it doesn't handle #2 or #3.
(It also doesn't handle GET_DAXDEV at all).

John's code manages #2 and #3 by having the fuse-server pass metadata
on open() via GET_FMAP and GET_DAXDEV.

GET_FMAP acquires the metadata on how dax devices are used.

GET_DAXDEV just translates an ID to a specific dax device.
iomap_begin() then uses the OMF to do the mapping.

But it does this by hard-coding the format into kernel headers.
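
For readers who haven't looked at the mapping itself, here is a rough
userspace sketch of the kind of RAID0-style resolution being discussed.
The struct names, fields, and stripe_resolve() are made up for
illustration; this is not the famfs on-media format or code from the
series.

```c
#include <stdint.h>

/* Hypothetical layout description; NOT the famfs format. */
struct stripe_layout {
	uint32_t nr_devs;    /* number of dax devices in the stripe set */
	uint64_t chunk_size; /* bytes placed on one device before moving on */
	uint64_t dev_base;   /* starting byte offset on every device */
};

struct stripe_target {
	uint32_t dev_index;  /* which device in the set */
	uint64_t dev_offset; /* byte offset within that device */
	uint64_t contig_len; /* bytes remaining in the current chunk */
};

/*
 * Map a file offset to (device, device offset), RAID0 style.
 * Returns 0 on success, -1 on a malformed layout.
 */
static int stripe_resolve(const struct stripe_layout *l, uint64_t file_off,
			  struct stripe_target *out)
{
	uint64_t chunk, within;

	if (l->nr_devs == 0 || l->chunk_size == 0)
		return -1;

	chunk = file_off / l->chunk_size;  /* global chunk number */
	within = file_off % l->chunk_size; /* offset inside that chunk */

	out->dev_index = (uint32_t)(chunk % l->nr_devs);
	out->dev_offset = l->dev_base +
			  (chunk / l->nr_devs) * l->chunk_size + within;
	out->contig_len = l->chunk_size - within;
	return 0;
}
```

With 4 devices and 2MiB chunks, file offset 5 * 2MiB + 4096 lands in
chunk 5, which is device 1 at device offset 2MiB + 4096.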

=== Observation: Add a BPF dax_fmap_parse() on open()

Pair Joanne's suggestion with John's GET_FMAP/GET_DAXDEV operations.

	struct fuse_dax_fmap_ops {
		char name[FUSE_DAX_FMAP_OPS_NAME_LEN]; // 16 bytes
		int (*dax_fmap_parse)(struct fuse_dax_fmap_parse_ctx *ctx);
		int (*iomap_begin)(struct fuse_dax_fmap_resolve_ctx *ctx,
				   struct fuse_iomap_io *io);
	};

This parse function does the filesystem-specific setup (such as
populating the dax bitmap) based on filesystem-specific per-file metadata.

In John's case, essentially all it does is populate the dax bitmap and
toss the data onto fi->dax_fmap.meta.

Pseudo code:

    fuse_dax_fmap_open(inode):
        fmap_size = send_GET_FMAP(inode, fmap_buf)

        /* Make space to store the metadata */
        meta_buf = kzalloc(meta_size)
        ctx = { ... }
        kern = { .ctx, .blob = fmap_buf, .meta_buf = meta_buf }

        /* Parse the metadata: i.e. fill out the daxdev bitmap */
        fc->dax_fmap_ops->dax_fmap_parse(&ctx)

        /* Call GET_DAXDEV for any new dax devices */
        resolve_dev_bitmap(ctx.dev_bitmap)

        /* Cache the metadata on the inode */
        inode_lock()
        fi->dax_fmap.meta = meta_buf
        ... etc etc ...
        inode_unlock()
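
To make the parse step a bit more concrete, below is a minimal
userspace sketch of what a parser could look like. Everything in it is
an assumption for illustration: the toy_extent record, the
toy_parse_ctx fields, and the 64-device limit are invented here, and
the real payload format is whatever the server and its BPF program
agree on.

```c
#include <stdint.h>
#include <string.h>

#define TOY_MAX_DEVS 64 /* one bit per device in a 64-bit bitmap */

/* Hypothetical extent record; the real fmap payload is opaque to FUSE. */
struct toy_extent {
	uint32_t dev_id;   /* server-chosen dax device index */
	uint32_t reserved;
	uint64_t dev_off;  /* byte offset on that device */
	uint64_t len;      /* extent length in bytes */
};

/* Hypothetical stand-in for the parse context. */
struct toy_parse_ctx {
	const void *blob;    /* raw GET_FMAP reply */
	uint32_t blob_size;
	void *meta_buf;      /* destination that gets cached on the inode */
	uint32_t meta_size;
	uint64_t dev_bitmap; /* out: devices referenced by this file */
};

/* Validate the blob, record which devices it uses, and cache a copy. */
static int toy_fmap_parse(struct toy_parse_ctx *ctx)
{
	uint64_t bitmap = 0;
	uint32_t nr, i;

	if (!ctx->blob_size || ctx->blob_size % sizeof(struct toy_extent))
		return -1;
	if (ctx->meta_size < ctx->blob_size)
		return -1;

	nr = ctx->blob_size / sizeof(struct toy_extent);
	for (i = 0; i < nr; i++) {
		struct toy_extent e;

		/* memcpy avoids assuming the blob is suitably aligned */
		memcpy(&e, (const uint8_t *)ctx->blob + i * sizeof(e),
		       sizeof(e));
		if (e.dev_id >= TOY_MAX_DEVS || e.len == 0)
			return -1;
		bitmap |= 1ULL << e.dev_id;
	}

	/* Only publish results once the whole blob has been validated. */
	memcpy(ctx->meta_buf, ctx->blob, ctx->blob_size);
	ctx->dev_bitmap = bitmap;
	return 0;
}
```

The caller would then walk dev_bitmap and issue GET_DAXDEV only for
bits it has not already resolved.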

And otherwise, iomap_begin() works exactly as Joanne proposed, but with
in-kernel cached data instead of the BPF map.

	const struct dax_simple_meta *meta = (const struct dax_simple_meta *)
		bpf_fuse_dax_resolve_get_meta(ctx, 0, sizeof(*meta));

And since both parse() and iomap_begin() are bpf programs - and they're
the only consumers of the metadata - FUSE itself no longer needs to know
anything about the server's particular strategy to use the dax devices.

	struct fuse_inode {
		...
	#if IS_ENABLED(CONFIG_FUSE_DAX_FMAP)
		struct {
			void *meta;
			u32 meta_size;
			u64 file_size;
		} dax_fmap;
	#endif
	};

Just a big ol' honkin' void* that otherwise gets ignored.
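
As a sketch of why an opaque pointer plus a size is enough, here is a
userspace mock of a bounds-checked accessor in the spirit of the
bpf_fuse_dax_resolve_get_meta(ctx, off, size) call above. The names
dax_fmap_cache and dax_fmap_get_meta() are hypothetical, and the
(offset, length) reading of the arguments is my assumption, not
something taken from the actual helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the dax_fmap fields; illustrative only. */
struct dax_fmap_cache {
	void *meta;         /* opaque, owned by the parse program */
	uint32_t meta_size; /* bytes valid at meta */
	uint64_t file_size;
};

/*
 * Return a pointer to [off, off + len) inside the cached metadata,
 * or NULL if the range does not fit. The check is written so that
 * off + len cannot overflow.
 */
static const void *dax_fmap_get_meta(const struct dax_fmap_cache *c,
				     uint32_t off, uint32_t len)
{
	if (!c->meta || off > c->meta_size || len > c->meta_size - off)
		return NULL;
	return (const uint8_t *)c->meta + off;
}
```

The point is that the generic side only ever checks a range; it never
interprets the bytes.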

(Note: while I'm not a BPF wizard, this pattern seems well established in
existing BPF code. I found code in the network stack that caches
data on kernel objects this way as well.)

=== Caveats

1) We don't know the overhead BPF introduces in the fault path.

My napkin math (and best understanding of BPF) suggests:

a) trampoline / vtable for bpf ops (iomap_begin func)
b) retpoline cost of BPF (assuming this is on, safe assumption)
c) bpf_fuse_dax_resolve_get_meta() overhead (extra pointer deref)

This *should* (I think) amount to an extra pointer dereference, a long jump,
and a retpoline, which hopefully is <100ns since any extra pointer
derefs here SHOULD be cache-hot (hard to know).

It's not 0 overhead, and if the average fault time is 1us then every
additional 10ns is not an insignificant cost.

But this is napkin math. John will collect data.
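
The napkin math written out, using only the figures quoted in this
mail (~100ns per cache miss, 1us fault). The helper names are made up,
and these are assumed inputs, not measurements.

```c
#include <stdint.h>

/*
 * Worst-case cost of walking `hops` dependent pointers when every hop
 * misses the cache. miss_ns is an assumed per-miss latency.
 */
static uint64_t chase_cost_ns(uint32_t hops, uint64_t miss_ns)
{
	return (uint64_t)hops * miss_ns;
}

/* Added cost as a percentage of the baseline fault time. */
static double overhead_pct(uint64_t base_ns, uint64_t extra_ns)
{
	return base_ns ? (double)extra_ns * 100.0 / (double)base_ns : 0.0;
}
```

So on a 1us fault, 10ns extra is 1% and a full 100ns is 10%, while a
3-hop cold pointer chase would be 300ns, i.e. 30%.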

2) FUSE needs to be ok with the BPF-driven changes:

https://github.com/joannekoong/linux/commits/prototype_generic_iomap_dax/

3) FUSE needs to be ok with GET_FMAP/GET_DAXDEV as opaque meta-data
handlers for DAX devices.

That means there is no default parser or format. If you don't
register ops, these functions are functionally dead.

(probably fine to enforce during init, which is what I did)

4) As John said: MM needs to be good with it.

Any server using DAX like this already essentially has CAP_SYS_RAWIO
for DAX, and most likely some form of CAP_SYS_ADMIN.

Additionally, as folks have pointed out, the resolution to PTE is
bounded by dax device extents, so it's not entirely arbitrary.

===

As mentioned at the start - you'd be doing John a kindness if there are
clear and obvious NACKs to be had here.

~Gregory