Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap

From: Christoph Hellwig
Date: Sun Aug 13 2017 - 05:24:58 EST


On Sat, Aug 12, 2017 at 12:19:50PM -0700, Dan Williams wrote:
> The application does not need to know the storage address, it needs to
> know that the storage address to file offset is fixed. With this
> information it can make assumptions about the permanence of results it
> gets from the kernel.

Only if we clearly document that fact - and documenting the permanence
is different from saying the block map won't change.

> For example get_user_pages() today makes no guarantees outside of
> "page will not be freed",

It also makes the extremely important gurantee that the page won't
_move_ - e.g. that we won't do a memory migration for compaction or
other reasons. That's why for example RDMA can use to register
memory and then we can later set up memory windows that point to this
registration from userspace and implement userspace RDMA.

> but with immutable files and dax you now
> have a mechanism for userspace to coordinate direct access to storage
> addresses. Those raw storage addresses need not be exposed to the
> application, as you say it doesn't need to know that detail. MAP_SYNC
> does not fully satisfy this case because it requires agents that can
> generate MMU faults to coordinate with the filesystem.

The file system is always in the fault path, can you explain what other
agents you are talking about?

> All I know is that SMB Direct for persistent memory seems like a
> potential consumer. I know they're not going to use a userspace
> filesystem or put an SMB server in the kernel.

Last I talked to the Samba folks they didn't expect a userspace
SMB direct implementation to work anyway due to the fact that
libibverbs memory registrations interact badly with their fork()ing
daemon model. That being said during the recent submission of the
RDMA client code some comments were made about userspace versions of
it, so I'm not sure if that opinion has changed in one way or another.

Thay being said I think we absolutely should support RDMA memory
registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE
helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure
all the blocks are polulated and all ptes are set up. Second we need
to make sure get_user_page works, which for now means we'll need a
struct page mapping for the region (which will be really annoying
for PCIe mappings, like the upcoming NVMe persistent memory region),
and we need to gurantee that the extent mapping won't change while
the get_user_pages holds the pages inside it. I think that is true
due to side effects even with the current DAX code, but we'll need to
make it explicit. And maybe that's where we need to converge -
"sealing" the extent map makes sense as such a temporary measure
that is not persisted on disk, which automatically gets released
when the holding process exits, because we sort of already do this
implicitly. It might also make sense to have explicitl breakable
seals similar to what I do for the pNFS blocks kernel server, as
any userspace RDMA file server would also need those semantics.

Last but not least we have any interesting additional case for modern
Mellanox hardware - On Demand Paging where we don't actually do a
get_user_pages but the hardware implements SVM and thus gets fed
virtual addresses directly. My head spins when talking about the
implications for DAX mappings on that, so I'm just throwing that in
for now instead of trying to come up with a solution.