Re: Question about the "EXPERIMENTAL" tag for dax in XFS

From: Dan Williams
Date: Tue Mar 02 2021 - 02:10:53 EST


On Sun, Feb 28, 2021 at 11:27 PM Yasunori Goto <y-goto@xxxxxxxxxxx> wrote:
>
> Hello, Dan-san,
>
> On 2021/02/27 4:24, Dan Williams wrote:
> > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> >>
> >> On Fri, Feb 26, 2021 at 09:45:45AM +0000, ruansy.fnst@xxxxxxxxxxx wrote:
> >>> Hi, guys
> >>>
> >>> Beside this patchset, I'd like to confirm something about the
> >>> "EXPERIMENTAL" tag for dax in XFS.
> >>>
> >>> In XFS, the "EXPERIMENTAL" tag, which is reported in waring message
> >>> when we mount a pmem device with dax option, has been existed for a
> >>> while. It's a bit annoying when using fsdax feature. So, my initial
> >>> intention was to remove this tag. And I started to find out and solve
> >>> the problems which prevent it from being removed.
> >>>
> >>> As is talked before, there are 3 main problems. The first one is "dax
> >>> semantics", which has been resolved. The rest two are "RMAP for
> >>> fsdax" and "support dax reflink for filesystem", which I have been
> >>> working on.
> >>
> >> <nod>
> >>
> >>> So, what I want to confirm is: does it means that we can remove the
> >>> "EXPERIMENTAL" tag when the rest two problem are solved?
> >>
> >> Yes. I'd keep the experimental tag for a cycle or two to make sure that
> >> nothing new pops up, but otherwise the two patchsets you've sent close
> >> those two big remaining gaps. Thank you for working on this!
> >>
> >>> Or maybe there are other important problems need to be fixed before
> >>> removing it? If there are, could you please show me that?
> >>
> >> That remains to be seen through QA/validation, but I think that's it.
> >>
> >> Granted, I still have to read through the two patchsets...
> >
> > I've been meaning to circle back here as well.
> >
> > My immediate concern is the issue Jason recently highlighted [1] with
> > respect to invalidating all dax mappings when / if the device is
> > ripped out from underneath the fs. I don't think that will collide
> > with Ruan's implementation, but it does need new communication from
> > driver to fs about removal events.
> >
> > [1]: http://lore.kernel.org/r/CAPcyv4i+PZhYZiePf2PaH0dT5jDfkmkDX-3usQy1fAhf6LPyfw@xxxxxxxxxxxxxx
> >
>
> I'm not sure why there is a race condition between unbinding operation
> and accessing mmaped file on filesystem dax yet.
>
> May be silly question, but could you tell me why the "unbinding"
> operation of the namespace which is mounted by filesystem dax must be
> allowed?

The unbind operation is used to switch the mode of a namespace between
fsdax and devdax. There is no way to fail unbind. At most it can be
delayed for a short while to perform cleanup, but it can't be blocked
indefinitely. There is the option to specify 'suppress_bind_attrs' to
the driver to preclude software triggered device removal, but that
would disable mode changes for the device.

> If "unbinding" is rejected when the filesystem is mounted with dax
> enabled, what is inconvenience?

It would be interesting (read difficult) to introduce the concept of
dynamic 'suppress_bind_attrs'. Today the decision is static at driver
registration time, not in response to how the device is being used.

I think global invalidation of all inodes that might be affected by a
dax-capable device being ripped away from the filesystem is sufficient
and avoids per-fs enabling, but I'm willing to be convinced that
->corrupted_range() is the proper vehicle for this.

>
> I can imagine if a device like usb memory stick is removed surprisingly,
> kernel/filesystem need to reject writeback at the time, and discard page
> cache. Then, I can understand that unbinding operation is essential for
> such case.

For usb the system is protected by the fact that all future block-i/o
submissions to the old block-device will fail, and a new usb-device
being plugged in will get a new block-device. I.e. the old security
model is invalidated / all holes are closed by blk_cleanup_queue().

> But I don't know why PMEM device/namespace allows unbinding operation
> like surprising removal event.

DAX hands direct mappings to physical pages, if the security model
fronting those physical pages changes the mappings attained via the
old security model need to be invalidated. blk_cleanup_queue() does
not invalidate DAX mappings.

The practical value of fixing that hole is small given that physical
unplug is not implemented for NVDIMMs today, but the get_user_pages()
path can be optimized if this invalidation is implemented, and future
hotplug-capable NVDIMM buses like CXL will need this.