[LSF/MM TOPIC] Page Cache Flexibility for NVM

From: Adam Manzanares
Date: Thu Feb 21 2019 - 18:11:58 EST


Hello,

I would like to attend the LSF/MM Summit 2019. I'm interested in
several MM topics that are mentioned below, as well as Zoned Block
Devices and any I/O determinism topics that come up in the storage
track.

I have been working on a caching layer, hmmap (heterogeneous memory
map) [1], for emerging NVM; in spirit it is close to the page cache.
The key difference is that the backend device and the caching layer of
hmmap are pluggable. In addition, hmmap supports DAX and write
protection, which I believe are key features for emerging NVMs that may
have write/read asymmetry as well as write endurance constraints.
Lastly, we can leverage hardware, such as a DMA engine, when moving
pages between the cache and the backend device, while also allowing
direct access if the device is capable.

I am proposing that, as an alternative to using NVMs as a NUMA node,
we expose the NVM through the page cache or a viable alternative and
have userspace applications mmap the NVM and hand out memory with
their favorite userspace memory allocator.
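
As a rough userspace illustration of the proposal above, here is a
minimal sketch, not hmmap code: it maps a file or device node and hands
out memory from the mapping with a trivial bump allocator. The names
nvm_arena, nvm_arena_open, nvm_alloc and nvm_arena_close are
hypothetical, the path passed in is whatever node exposes the NVM on a
given system, and a real application would plug the mapping into its
own allocator instead.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical bump allocator over one mmap'ed region. */
struct nvm_arena {
	uint8_t *base;	/* start of the mapping */
	size_t size;	/* length of the mapping in bytes */
	size_t used;	/* bytes handed out so far */
};

/*
 * Map `size` bytes of the file or device at `path` (caller-supplied,
 * hypothetical). Returns 0 on success, -1 on failure.
 */
int nvm_arena_open(struct nvm_arena *a, const char *path, size_t size)
{
	int fd = open(path, O_RDWR);
	void *p;

	if (fd < 0)
		return -1;
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the fd is closed */
	if (p == MAP_FAILED)
		return -1;
	a->base = p;
	a->size = size;
	a->used = 0;
	return 0;
}

/* Hand out `n` bytes rounded up to 16; NULL when the arena is full. */
void *nvm_alloc(struct nvm_arena *a, size_t n)
{
	size_t aligned = (n + 15) & ~(size_t)15;
	void *p;

	assert(a != NULL && a->used <= a->size);
	if (aligned < n || aligned > a->size - a->used)
		return NULL;	/* rounding overflowed or no space left */
	p = a->base + a->used;
	a->used += aligned;
	return p;
}

/* Unmap the region and reset the arena. */
void nvm_arena_close(struct nvm_arena *a)
{
	munmap(a->base, a->size);
	a->base = NULL;
	a->size = 0;
	a->used = 0;
}
```

Only applications that explicitly open and map the device would ever
touch NVM-backed memory in this model, which is the isolation argued
for here.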

This would isolate the NVMs to only those applications that are well
aware of the performance implications of accessing NVM. I believe that
all of this could also be achieved with the NUMA node approach, but the
two approaches seem to blur together.

The main points I would like to discuss are:

* Is the page cache model a viable alternative to NVM as a NUMA node?
* Can we add more flexibility to the page cache?
* Should we force separation of NVM through an explicit mmap?

I believe this discussion could be merged with "NUMA, memory hierarchy
and device memory", "Use NVDIMM as NUMA node and NUMA API", or "memory
reclaim with NUMA balancing".

Here are some performance numbers of hmmap (in development):

All numbers are collected on a 4GiB hmmap device with a 128MiB cache.
For the mmap tests I used cgroups to limit the page cache usage to
128MiB. All results are an average of 10 runs. W (write) and R (read)
access the entire device with all threads segregated in the address
space. RR (random read) reads the device at random offsets 8 bytes at a
time and is limited to 8MiB of data accessed.
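
To make the access patterns concrete, here is a minimal sketch of how
such a test could be laid out. It is not the benchmark used for the
numbers below. The equal contiguous per-thread slices, the xorshift64
generator, and the names thread_slice and random_read are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Contiguous region [start, start + len) owned by one thread. */
struct slice {
	size_t start;
	size_t len;
};

/* Split the device into equal slices; the last thread takes the tail. */
struct slice thread_slice(size_t dev_size, unsigned nthreads, unsigned tid)
{
	size_t per;
	struct slice s;

	assert(nthreads > 0 && tid < nthreads);
	per = dev_size / nthreads;
	s.start = (size_t)tid * per;
	s.len = (tid == nthreads - 1) ? dev_size - s.start : per;
	return s;
}

/* Small PRNG (xorshift64); the state must be nonzero. */
static uint64_t xorshift64(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/*
 * Read 8 bytes at random 8-byte-aligned offsets within [base, base + len)
 * until `budget` bytes have been read. `base` must be 8-byte aligned
 * (an mmap'ed region is). Returns the XOR of all values read so the
 * loads cannot be optimized away.
 */
uint64_t random_read(const volatile uint8_t *base, size_t len,
		     size_t budget, uint64_t seed)
{
	const volatile uint64_t *words = (const volatile uint64_t *)base;
	size_t nwords = len / sizeof(uint64_t);
	uint64_t state = seed ? seed : 1;
	uint64_t sum = 0;
	size_t done;

	if (nwords == 0)
		return 0;
	for (done = 0; done + sizeof(uint64_t) <= budget;
	     done += sizeof(uint64_t))
		sum ^= words[xorshift64(&state) % nwords];
	return sum;
}
```

For the RR case described above, the budget would be 8MiB, for example
random_read(base, len, (size_t)8 << 20, seed), with base pointing at
the mapped device.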

hmmap brd vs. mmap of brd

                  hmmap                  mmap
Threads      W     R     RR        W     R     RR

1          7.21  5.39  5.04      6.80  5.63  5.23
2          5.19  3.87  3.74      4.66  3.33  3.20
4          3.65  2.95  3.07      3.53  2.26  2.18
8          4.52  3.43  3.59      4.30  1.98  1.88
16         5.00  3.85  3.98      4.92  2.00  1.99



Memory Backend Test (DAX capable)

                  hmmap                hmmap-dax           hmmap-wrprotect
Threads      W     R     RR        W     R     RR        W     R     RR

1          6.29  4.94  4.37      2.54  1.36  0.16      7.12  2.13  0.73
2          4.62  3.63  3.57      1.41  0.69  0.08      5.06  1.14  0.41
4          3.45  2.97  3.11      0.77  0.36  0.04      3.66  0.63  0.25
8          4.10  3.53  3.71      0.44  0.19  0.02      4.03  0.35  0.17
16         4.60  3.98  4.04      0.34  0.16  0.02      4.52  0.27  0.14


Thanks,
Adam