Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

From: Doug Ledford
Date: Thu Feb 07 2019 - 11:25:43 EST


I think I've finally wrapped my head around all of this. Let's see if I
have this right:

* People are using filesystem DAX to expose byte addressable persistent
memory because putting a filesystem on the memory makes an easy way to
organize the data in the memory and share it between various processes.
It's worth noting that this is not the only way to share this memory,
and arguably not even the best way, but it's what people are doing.
However, to get byte level addressability on the remote side, we must
create files on the server side, mmap those files, then give a handle to
the memory region to the client side that the client then addresses on a
byte by byte basis. This is because all of the normal kernel based
device sharing mechanisms are block based and don't provide byte level
addressability.

* People are asking for thin allocations, reflinks, deduplication,
whatever else because persistent memory is still small in terms of size
compared to the amount of data people want to put on it, so these
techniques stretch its usefulness.

* Because there is no kernel level mechanism for sharing byte
addressable memory, this only works with specific applications that have
been written to create files on byte addressable memory, mmap them, then
share them out via RDMA. I bring this up because in the video linked in
this email, Oracle is gushing about how great this feature is. But it's
important to understand that this only works because the Oracle
processes themselves are the filesystem sharing entity. That means at
other points in this conversation where we've talked about the need for
forward progress, and non-ODP hardware, and the talk has come down to
sending SIGKILL to a process in order to free memory reservations, I
feel confident in saying that Oracle would *never* agree to this. If
you kill an Oracle process to make forward progress, you are probably
also killing the very process that needed you to make progress in the
first place. I'm pretty confident that Oracle will have no problem
whatsoever saying that ODP capable hardware is a hard requirement for
using their software with DAX.

* So if Oracle is likely to demand ODP hardware, period, are there other
scenarios that might be more accepting of a more limited FS on top of
DAX that doesn't support reflinks and deduplication? I can think of
one rather easily. Message broker servers (AMQP, qpid) have strict
requirements about receiving a message and then making sure that it is
delivered once, and only once, to all subscribed receivers. A natural
way of organizing this sort of thing is to create a persistent ring
buffer for incoming messages, one for each connecting client that is
sending messages, plus a log file for each client you are sending
messages back out to. Putting these files on persistent
memory and then mapping the ring buffer to the clients, and writing your
own transmission journals to the persistent memory, would allow the
program to be very robust in the face of a program or system crash.
This sort of usage would not require any thin allocations, reflinks, or
other such features, and yet would still find the filesystem
organization useful. Therefore I think the answer is yes, there are at
least some use cases that would find a less featureful filesystem that
works with persistent memory and RDMA but without ODP to be of value.

* Really though, as I said in my email to Tom Talpey, this entire
situation is simply screaming that we are doing DAX networking wrong.
We shouldn't be writing the networking code once in every single
application that wants to do this. If we had a memory segment that we
shared from server to client(s), and in that memory segment we
implemented a clustered filesystem, then applications would simply mmap
local files and be done with it. If the file needed to move, the kernel
would update the mmap in the application, done. If you ask me, it is
the attempt to do this the wrong way that is resulting in all this
heartache. That said, for today, my recommendation would be to require
ODP hardware for XFS filesystems mounted with the DAX option, but allow
ext2 filesystems to be mounted with DAX on non-ODP hardware, and modify
ext2 so that on DAX mounts it disables hole punch and ftruncate any time
they would result in the forced removal of an established mmap.
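
One way to picture that last recommendation is the toy Python model
below. It is only a sketch and not ext2 code: the PinnedFile class and
its pin/unpin/truncate/punch_hole methods are names made up for
illustration, pins are tracked as plain byte ranges rather than pages,
and returning EBUSY is an assumption borrowed from the truncate
discussion quoted below, not something the kernel does today.

```python
import errno


class PinnedFile:
    """Toy model of a DAX file whose ranges may hold long-term pins.

    Hypothetical illustration only; nothing here mirrors real ext2 code.
    """

    def __init__(self, size):
        self.size = size
        self.pins = []  # (offset, length) byte ranges pinned for RDMA

    def pin(self, offset, length):
        """Record a long-term pin, e.g. at memory registration time."""
        if offset < 0 or length <= 0 or offset + length > self.size:
            raise OSError(errno.EINVAL, "pin outside file")
        self.pins.append((offset, length))

    def unpin(self, offset, length):
        """Drop a pin previously recorded with the same range."""
        self.pins.remove((offset, length))

    def _overlaps_pin(self, start, end):
        # True if [start, end) intersects any pinned range.
        return any(o < end and start < o + n for o, n in self.pins)

    def truncate(self, new_size):
        # Shrinking is refused if it would free storage under a pin.
        # EBUSY is an assumed errno choice for this sketch.
        if new_size < self.size and self._overlaps_pin(new_size, self.size):
            raise OSError(errno.EBUSY, "range has pinned pages")
        self.size = new_size

    def punch_hole(self, offset, length):
        # Hole punch is refused on the same condition as truncate.
        if self._overlaps_pin(offset, offset + length):
            raise OSError(errno.EBUSY, "range has pinned pages")
        # A real filesystem would free the blocks here; the model
        # tracks nothing beyond size and pins.


if __name__ == "__main__":
    f = PinnedFile(size=1 << 20)
    f.pin(offset=0, length=8192)
    f.truncate(8192)          # fine: nothing pinned past the new EOF
    try:
        f.truncate(4096)      # would free storage that is still pinned
    except OSError as exc:
        print("refused:", errno.errorcode[exc.errno])
```

The point of the model is that the operation fails up front instead of
revoking the mapping or killing the process that holds the pin.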

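For the message broker case above, a persistent ring buffer on an
mmap'd file could look roughly like the Python sketch below. Everything
in it is an assumption made for illustration: the PersistentRing class,
the header layout, and the demo() helper are invented names, an ordinary
file stands in for a file on a DAX filesystem, mmap.flush() stands in
for whatever persistence barrier real persistent memory would need, and
there is no RDMA here at all. The 16-byte header update is also not
crash-atomic, so a real broker would need more care than this.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<QQ")  # head, tail: ever-increasing byte counters
LEN = struct.Struct("<I")      # length prefix stored before each message


class PersistentRing:
    """Hypothetical fixed-size message ring kept in an mmap'd file."""

    def __init__(self, path, capacity=4096):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        if os.fstat(self.fd).st_size == 0:
            # New file: header plus data area, zero-filled (head = tail = 0).
            os.ftruncate(self.fd, HEADER.size + capacity)
        self.mm = mmap.mmap(self.fd, 0)
        self.capacity = len(self.mm) - HEADER.size

    def _copy_in(self, pos, data):
        # Write data at logical position pos, wrapping at the end.
        off = pos % self.capacity
        first = min(len(data), self.capacity - off)
        base = HEADER.size
        self.mm[base + off:base + off + first] = data[:first]
        if first < len(data):
            self.mm[base:base + len(data) - first] = data[first:]

    def _copy_out(self, pos, n):
        # Read n bytes from logical position pos, wrapping at the end.
        off = pos % self.capacity
        first = min(n, self.capacity - off)
        base = HEADER.size
        return (self.mm[base + off:base + off + first]
                + self.mm[base:base + n - first])

    def push(self, msg):
        head, tail = HEADER.unpack_from(self.mm, 0)
        need = LEN.size + len(msg)
        if need > self.capacity - (head - tail):
            raise BufferError("ring full")
        self._copy_in(head, LEN.pack(len(msg)) + msg)
        self.mm.flush()  # payload written back before it is published
        HEADER.pack_into(self.mm, 0, head + need, tail)
        self.mm.flush()

    def pop(self):
        head, tail = HEADER.unpack_from(self.mm, 0)
        if head == tail:
            return None  # ring is empty
        (n,) = LEN.unpack(self._copy_out(tail, LEN.size))
        msg = self._copy_out(tail + LEN.size, n)
        HEADER.pack_into(self.mm, 0, head, tail + LEN.size + n)
        self.mm.flush()
        return msg

    def close(self):
        self.mm.flush()
        self.mm.close()
        os.close(self.fd)


def demo():
    """Push two messages, reopen the file, and read them back."""
    path = os.path.join(tempfile.mkdtemp(), "client0.ring")
    ring = PersistentRing(path, capacity=64)
    ring.push(b"hello")
    ring.push(b"world")
    ring.close()                 # stands in for a program restart
    ring = PersistentRing(path)  # capacity is recovered from the file size
    out = [ring.pop(), ring.pop()]
    ring.close()
    return out
```

Nothing in this needs reflinks, thin allocation, or deduplication: the
file is sized once at creation and never grows, shrinks, or has holes
punched in it afterwards.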

On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford <dledford@xxxxxxxxxx> wrote:
> > On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > > >
> > > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > > a different file.
> > > > > >
> > > > > > Why is the solution revoke then? Is there something besides truncate
> > > > > > that we have to worry about? I ask because EBUSY is not currently
> > > > > > listed as a return value of truncate, so extending the API to include
> > > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > > (or should not be) totally out of the question.
> > > > > >
> > > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > > portion where that alternative was ruled out?
> > > > >
> > > > > > Coming in late here too but isn't the only DAX case that we are concerned
> > > > > about where there was an mmap with the O_DAX option to do direct write
> > > > > though? If we only allow this use case then we may not have to worry about
> > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > regardless.
> > > >
> > > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > > writes will physically move the data to a new physical location.
> > > > > This is non-negotiable, and cannot be blocked forever by a gup
> > > > pin.
> > > >
> > > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > > the filesystem can move data physically on write access, and b)
> > > > revokable file leases so that the filesystem can kick userspace out
> > > > of the way when it needs to.
> > >
> > > Why do we need both? You want to have leases for normal CPU mmaps too?
> > >
> > > > Truncate is a red herring. It's definitely a case for revokable
> > > > leases, but it's the rare case rather than the one we actually care
> > > > about. We really care about making copy-on-write capable filesystems like
> > > > XFS work with DAX (we've got people asking for it to be supported
> > > > yesterday!), and that means DAX+RDMA needs to work with storage that
> > > > can change physical location at any time.
> > >
> > > Then we must continue to ban longterm pin with DAX..
> > >
> > > Nobody is going to want to deploy a system where revoke can happen at
> > > any time and if you don't respond fast enough your system either locks
> > > with some kind of FS meltdown or your process gets SIGKILL.
> > >
> > > I don't really see a reason to invest so much design work into
> > > something that isn't production worthy.
> > >
> > > It *almost* made sense with ftruncate, because you could architect to
> > > avoid ftruncate.. But just any FS op might reallocate? Naw.
> > >
> > > Dave, you said the FS is responsible to arbitrate access to the
> > > physical pages..
> > >
> > > Is it possible to have a filesystem for DAX that is more suited to
> > > this environment? Ie designed to not require block reallocation (no
> > > COW, no reflinks, different approach to ftruncate, etc)
> >
> > Can someone give me a real world scenario that someone is *actually*
> > asking for with this?
>
> I'll point to this example. At the 6:35 mark Kodi talks about the
> Oracle use case for DAX + RDMA.
>
> https://youtu.be/ywKPPIE8JfQ?t=395
>
> Currently the only way to get this to work is to use ODP capable
> hardware, or Device-DAX. Device-DAX is a facility to map persistent
> memory statically through device-file. It's great for statically
> allocated use cases, but loses all the nice things (provisioning,
> permissions, naming) that a filesystem gives you. This debate is what
> to do about non-ODP capable hardware and Filesystem-DAX facility. The
> current answer is "no RDMA for you".
>
> > Are DAX users demanding xfs, or is it just the
> > filesystem of convenience?
>
> xfs is the only Linux filesystem that supports DAX and reflink.
>
> > Do they need to stick with xfs?
>
> Can you clarify the motivation for that question? This problem exists
> for any filesystem that implements an mmap where the physical
> page backing the mapping is identical to the physical storage location
> for the file data. I don't see it as an xfs specific problem. Rather,
> xfs is taking the lead in this space because it has already deployed
> and demonstrated that leases work for the pnfs4 block-server case, so
> it seems logical to attempt to extend that case for non-ODP-RDMA.
>
> > Are they
> > really trying to do COW backed mappings for the RDMA targets? Or do
> > they want a COW backed FS but are perfectly happy if the specific RDMA
> > targets are *not* COW and are statically allocated?
>
> I would expect the COW to be broken at registration time. Only ODP
> could possibly support reflink + RDMA. So I think this devolves the
> problem back to just the "what to do about truncate/punch-hole"
> problem in the specific case of non-ODP hardware combined with the
> Filesystem-DAX facility.



--
Doug Ledford <dledford@xxxxxxxxxx>
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD