Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

From: Dan Williams
Date: Wed Feb 06 2019 - 18:30:42 EST

On Wed, Feb 6, 2019 at 3:21 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> On Wed, Feb 06, 2019 at 02:44:45PM -0800, Dan Williams wrote:
> > > Do they need to stick with xfs?
> >
> > Can you clarify the motivation for that question? This problem exists
> > for any filesystem that implements an mmap where the physical
> > page backing the mapping is identical to the physical storage location
> > for the file data.
> .. and needs to dynamically change that mapping. Which is not really
> something inherent to the general idea of a filesystem. A file system
> that had *strictly static* block assignments would work fine.
> Not all filesystems even implement hole punch.
> Not all filesystems implement reflink.
> ftruncate doesn't *have* to instantly return the free blocks to
> allocation pool.
> ie this is not a DAX & RDMA issue but an XFS & RDMA issue.
> Replacing XFS is probably not reasonable, but I wonder if an XFS--
> operating mode could exist that had enough features removed to be
> safe?

You're describing the current situation, i.e. Linux already implements
this, it's called Device-DAX and some users of RDMA find it
insufficient. The choices are to continue to tell them "no", or say
"yes, but you need to submit to lease coordination".

> Ie turn off REFLINK. Change the semantic of ftruncate to be more like
> ETXTBSY. Turn off hole punch.
> > > Are they really trying to do COW backed mappings for the RDMA
> > > targets? Or do they want a COW backed FS but are perfectly happy
> > > if the specific RDMA targets are *not* COW and are statically
> > > allocated?
> >
> > I would expect the COW to be broken at registration time. Only ODP
> > could possibly support reflink + RDMA. So I think this devolves the
> > problem back to just the "what to do about truncate/punch-hole"
> > problem in the specific case of non-ODP hardware combined with the
> > Filesystem-DAX facility.
> Usually the problem with COW is that you make a READ RDMA MR on a
> COW'd file, and some other thread breaks the COW..
> This probably becomes a problem if the same process that has the MR
> triggers a COW break (ie by writing to the CPU mmap). This would cause
> the page to be reassigned but the MR would not be updated, which is
> not what the app expects.
> WRITE is simpler, once the COW is broken during GUP, the pages cannot
> be COW'd again until the DMA pin is released. So new reflinks would be
> blocked during the DMA pin period.
> To fix READ you'd have to treat it like WRITE and break the COW at GUP.

Right, that's what I'm proposing: that any longterm-GUP break COW as if
it were a write.
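
To make the READ vs. WRITE difference concrete, here is a toy userspace
model of the behavior discussed above. It is only a sketch, not kernel
code: every name in it (toy_file, toy_longterm_pin, toy_reflink, and so
on) is made up for illustration, each file is backed by a single block,
and "pinning" is just a counter. It assumes that a CPU write to a shared
block moves the file to a private block, and that reflink is refused
while a block is pinned.

```c
#include <stdbool.h>

/* Toy model only: none of these names are kernel APIs. */
#define NBLOCKS 8

struct block {
    bool used;  /* block is allocated */
    int refs;   /* files sharing this block; > 1 means COW-shared */
    int pins;   /* outstanding long-term DMA pins */
};

static struct block blocks[NBLOCKS];

struct toy_file {
    int blk;    /* index of the single block backing this file */
};

/* Allocate a fresh, unshared, unpinned block; -1 if none are left. */
static int alloc_block(void)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (!blocks[i].used) {
            blocks[i].used = true;
            blocks[i].refs = 1;
            blocks[i].pins = 0;
            return i;
        }
    }
    return -1;
}

/* Reflink: dst (which must not own a block yet) shares src's block.
 * Refused while the block is pinned for DMA. */
static int toy_reflink(struct toy_file *src, struct toy_file *dst)
{
    if (blocks[src->blk].pins > 0)
        return -1;
    blocks[src->blk].refs++;
    dst->blk = src->blk;
    return 0;
}

/* Break COW: if the block is shared, move the file to a private block. */
static int toy_break_cow(struct toy_file *f)
{
    struct block *old = &blocks[f->blk];
    int nb;

    if (old->refs <= 1)
        return 0;           /* already private, nothing to do */
    nb = alloc_block();
    if (nb < 0)
        return -1;
    old->refs--;
    f->blk = nb;
    return 0;
}

/* A CPU write through the mmap breaks COW before touching the data. */
static int toy_cpu_write(struct toy_file *f)
{
    return toy_break_cow(f);
}

/* Long-term pin. Returns the block the "MR" will DMA to, or -1.
 * break_cow=true models "treat every longterm-GUP as a write". */
static int toy_longterm_pin(struct toy_file *f, bool break_cow)
{
    if (break_cow && toy_break_cow(f) < 0)
        return -1;
    blocks[f->blk].pins++;
    return f->blk;
}

static void toy_unpin(int blk)
{
    blocks[blk].pins--;
}
```

In this model, pinning a reflinked file with break_cow=false and then
calling toy_cpu_write() on it leaves the returned block index different
from the file's current blk, i.e. the MR is stale. With break_cow=true
the file is moved to a private block at pin time, a later
toy_cpu_write() does not move it again, and toy_reflink() fails until
toy_unpin() is called.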