Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

From: Jan Kara
Date: Fri Jun 07 2019 - 06:50:51 EST


On Thu 06-06-19 15:22:28, Ira Weiny wrote:
> On Thu, Jun 06, 2019 at 04:51:15PM -0300, Jason Gunthorpe wrote:
> > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> >
> > > So I'd like to actually mandate that you *must* hold the file lease until
> > > you unpin all pages in the given range (not just that you have an option to
> > > hold a lease). And I believe the kernel should actually enforce this. That
> > > way we maintain a sane state that if someone uses a physical location of
> > > logical file offset on disk, he has a layout lease. Also once this is done,
> > > sysadmin has a reasonably easy way to discover run-away RDMA application
> > > and kill it if he wishes so.
> > >
> > > The question is on how to exactly enforce that lease is taken until all
> > > pages are unpinned. I belive it could be done by tracking number of
> > > long-term pinned pages within a lease. Gup_longterm could easily increment
> > > the count when verifying the lease exists, gup_longterm users will somehow
> > > need to propagate corresponding 'filp' (struct file pointer) to
> > > put_user_pages_longterm() callsites so that they can look up appropriate
> > > lease to drop reference - probably I'd just transition all gup_longterm()
> > > users to a saner API similar to the one we have in mm/frame_vector.c where
> > > we don't hand out page pointers but an encapsulating structure that does
> > > all the necessary tracking. Removing a lease would need to block until all
> > > pins are released - this is probably the most hairy part since we need to
> > > handle a case if application just closes the file descriptor which
> > > would
> >
> > I think if you are going to do this then the 'struct filp' that
> > represents the lease should be held in the kernel (ie inside the RDMA
> > umem) until the kernel is done with it.
>
> Yea there seems merit to this. I'm still not resolving how this helps track
> who has the pin across a fork.

Yes, my thought was that gup_longterm() would return a structure that would
be tracking filp (or whatever is needed) and that would be embedded inside
RDMA umem.

> > Actually does someone have a pointer to this userspace lease API, I'm
> > not at all familiar with it, thanks
>
> man fcntl
> search for SETLEASE
>
> But I had to add the F_LAYOUT lease type. (Personally I'm for calling it
> F_LONGTERM at this point. I don't think LAYOUT is compatible with what we are
> proposing here.)

I think F_LAYOUT still expresses it pretty well. The lease is pinning
logical->physical file offset mapping, i.e. the file layout.

> >
> > And yes, a better output format from GUP would be great..
> >
> > > Maybe we could block only on explicit lease unlock and just drop the layout
> > > lease on file close and if there are still pinned pages, send SIGKILL to an
> > > application as a reminder it did something stupid...
> >
> > Which process would you SIGKILL? At least for the rdma case a FD is
> > holding the GUP, so to do the put_user_pages() the kernel needs to
> > close the FD. I guess it would have to kill every process that has the
> > FD open? Seems complicated...
>
> Tending to agree... But I'm still not opposed to killing bad actors... ;-)
>
> NOTE: Jason I think you need to be more clear about the FD you are speaking of.
> I believe you mean the FD which refers to the RMDA context. That is what I
> called it in my other email.

I keep forgetting that the file with RDMA context may be held by multiple
processes so thanks for correcting me. My proposal with SIGKILL was jumping
to conclusion too quickly :) We have two struct files here: A file with RDMA
context that effectively is the owner of the page pins (let's call it
"context file") and a file which is mapped and on which we hold the lease and
whose blocks (pages) we are pinning (let's call it "buffer file"). Now once
buffer file is closed (and this means that all file descriptors pointing to
this struct file are closed - so just one child closing the file descriptor
won't trigger this) we need to release the lease and I want to have a way
of safely releasing remaining pins associated with this lease as well.
Because the pins would be invisible to sysadmin from that point on. Now if
the context file would be open only by the process closing the buffer file,
SIGKILL would work as that would close the buffer file as a side effect.
But as you properly pointed out, that's not necessarily the case. Walking
processes that have the context file open is technically complex and too
ugly to live so we have to come up with something better. The best I can
currently come up with is to have a method associated with the lease that
would invalidate the RDMA context that holds the pins in the same way that
a file close would do it.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR