Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges

From: David Hildenbrand
Date: Mon Aug 03 2020 - 03:48:22 EST


[...]

> Well, no v5.8-rc8 to line this up for v5.9, so next best is early
> integration into -mm before other collisions develop.
>
> Chatted with Justin offline and it currently appears that the missing
> NUMA information is due to the platform firmware failing to populate all
> the necessary NUMA data in the NFIT.

I'm planning on looking at some bits of this series this week, but some
questions upfront ...

>
> ---
> Cover:
>
> The device-dax facility allows an address range to be directly mapped
> through a chardev, or optionally hotplugged to the core kernel page
> allocator as System-RAM. It is the mechanism for converting persistent
> memory (pmem) to be used as another volatile memory pool i.e. the
> current Memory Tiering hot topic on linux-mm.
>
> In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
> it, but that labeling mechanism is not available / applicable to
> soft-reserved ("EFI specific purpose") memory [3]. This series provides
> a sysfs-mechanism for the daxctl utility to enable provisioning of
> volatile-soft-reserved memory ranges.
>
> The motivations for this facility are:
>
> 1/ Allow performance differentiated memory ranges to be split between
> kernel-managed and directly-accessed use cases.
>
> 2/ Allow physical memory to be provisioned along performance relevant
> address boundaries. For example, divide a memory-side cache [4] along
> cache-color boundaries.
>
> 3/ Parcel out soft-reserved memory to VMs using device-dax as a security
> / permissions boundary [5]. Specifically I have seen people (ab)using
> memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
> device-dax interface on custom address ranges. A follow-on for the VM
> use case is to teach device-dax to dynamically allocate 'struct page' at
> runtime to reduce the duplication of 'struct page' space in both the
> guest and the host kernel for the same physical pages.


I think I am missing some important pieces. Bear with me.

1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
automatically used in the buddy during boot, but remains untouched
(similar to pmem). But as it involves ACPI as well, it could also be
used on arm64 (without e820), correct?

2. Soft-reserved memory is volatile RAM with differing performance
characteristics ("performance differentiated memory"). What would be
examples of such memory? Like, memory that is faster than RAM (scratch
pad), or slower (pmem)? Or both? :) Is it a valid use case to use pmem
in a hypervisor to back this memory?

3. There seem to be use cases where "soft-reserved" memory is used via
DAX. What is an example use case? I assume it's *not* to treat it like
PMEM but instead e.g., use it as a fast buffer inside applications or
similar.

4. There seem to be use cases where some part of "soft-reserved" memory
is used via DAX, and another part is given to the buddy. What is an example
use case? Is this really necessary or only some theoretical use case?

5. The "provisioned along performance relevant address boundaries." part
is unclear to me. Can you give an example of what this would look like
from user space? Like, split that memory in blocks of size X with
alignment Y and give them to separate applications?

6. If you add such memory to the buddy, is there any way the system can
differentiate it from other memory? E.g., via fake/other NUMA nodes?
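
To make question 5 a bit more concrete, here is a minimal sketch of the
kind of split meant there: carve a physical range into blocks of size X,
each starting on an alignment boundary Y. The function name, the
parameters and the example addresses are all made up for illustration;
this is not how the series or daxctl performs the split, only the
arithmetic behind the question.

```python
def split_aligned(start, size, block_size, align):
    """Carve [start, start + size) into blocks of block_size bytes.

    Every block starts on an 'align' boundary. Space lost to alignment,
    or a tail too small for a full block, is simply left unused.
    All names and semantics here are hypothetical.
    """
    if align <= 0 or align & (align - 1):
        raise ValueError("align must be a power of two")
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    end = start + size
    # Round the first candidate address up to the next alignment boundary.
    cur = (start + align - 1) & ~(align - 1)
    blocks = []
    while cur + block_size <= end:
        blocks.append((cur, cur + block_size))
        # The next block begins at the first aligned address past this one.
        cur = (cur + block_size + align - 1) & ~(align - 1)
    return blocks


if __name__ == "__main__":
    # Hypothetical 1 GiB range at 4 GiB, split into 256 MiB blocks
    # with 2 MiB alignment.
    for lo, hi in split_aligned(0x100000000, 1 << 30, 256 << 20, 2 << 20):
        print(f"{lo:#x}-{hi - 1:#x}")
```

Each resulting block could then, in principle, be handed to a separate
application or VM; whether that matches the intended usage is exactly
what the question asks.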


Also, can you give examples of how kmem-added memory is represented in
/proc/iomem for a) pmem and b) soft-reserved memory after this series
(skimming over the patches, I think there is a change for pmem, right?)?
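
For comparing such listings, a small sketch that parses text in the
/proc/iomem format (hex "start-end : name", nested by two spaces per
level) and reports what sits below a given resource. The sample text is
entirely made up to exercise the parser: the addresses, the "dax0.0"
name and the nesting are assumptions, not the output of this series.

```python
import re

# One /proc/iomem line: optional indentation, "start-end : name" in hex.
LINE = re.compile(r"^(?P<indent> *)(?P<start>[0-9a-f]+)-(?P<end>[0-9a-f]+) : (?P<name>.*)$")


def parse_iomem(text):
    """Return a list of (depth, start, end, name) tuples."""
    entries = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip anything that is not a resource line
        depth = len(m.group("indent")) // 2  # two spaces per nesting level
        entries.append((depth, int(m.group("start"), 16),
                        int(m.group("end"), 16), m.group("name").strip()))
    return entries


def children_of(entries, parent_name):
    """Return all entries nested below any resource named parent_name."""
    result = []
    parent_depth = None
    for depth, start, end, name in entries:
        if parent_depth is not None and depth > parent_depth:
            result.append((depth, start, end, name))
            continue
        # Left the previous parent (if any); check whether a new one starts.
        parent_depth = depth if name == parent_name else None
    return result


# Hypothetical listing, for illustration only.
SAMPLE = """\
00001000-0009ffff : System RAM
100000000-1ffffffff : Soft Reserved
  100000000-1ffffffff : dax0.0
    100000000-13fffffff : System RAM
200000000-2ffffffff : System RAM
"""

if __name__ == "__main__":
    for depth, start, end, name in children_of(parse_iomem(SAMPLE), "Soft Reserved"):
        print(f"{'  ' * depth}{start:x}-{end:x} : {name}")
```

On a real system the text would come from reading /proc/iomem instead of
SAMPLE; the names to look for would be whatever the series actually
registers.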

I am really wondering if it's the right approach to squeeze this into
our pmem/nvdimm infrastructure just because it's easy to do. E.g., man
"ndctl" - "ndctl - Manage "libnvdimm" subsystem devices (Non-volatile
Memory)" speaks explicitly about non-volatile memory.


>
> [2]: http://lore.kernel.org/r/20200713160837.13774-11-joao.m.martins@xxxxxxxxxx
> [3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> [4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> [5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@xxxxxxxxxx
>
> ---
>
> Dan Williams (19):
> x86/numa: Cleanup configuration dependent command-line options
> x86/numa: Add 'nohmat' option
> efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
> ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
> resource: Report parent to walk_iomem_res_desc() callback
> mm/memory_hotplug: Introduce default phys_to_target_node() implementation
> ACPI: HMAT: Attach a device for each soft-reserved range
> device-dax: Drop the dax_region.pfn_flags attribute
> device-dax: Move instance creation parameters to 'struct dev_dax_data'
> device-dax: Make pgmap optional for instance creation
> device-dax: Kill dax_kmem_res
> device-dax: Add an allocation interface for device-dax instances
> device-dax: Introduce 'seed' devices
> drivers/base: Make device_find_child_by_name() compatible with sysfs inputs
> device-dax: Add resize support
> mm/memremap_pages: Convert to 'struct range'
> mm/memremap_pages: Support multiple ranges per invocation
> device-dax: Add dis-contiguous resource support
> device-dax: Introduce 'mapping' devices
>
> Joao Martins (4):
> device-dax: Make align a per-device property
> device-dax: Add an 'align' attribute
> dax/hmem: Introduce dax_hmem.region_idle parameter
> device-dax: Add a range mapping allocation attribute
>
>
> Documentation/x86/x86_64/boot-options.rst | 4
> arch/powerpc/kvm/book3s_hv_uvmem.c | 14
> arch/x86/include/asm/numa.h | 8
> arch/x86/kernel/e820.c | 16
> arch/x86/mm/numa.c | 11
> arch/x86/mm/numa_emulation.c | 3
> arch/x86/xen/enlighten_pv.c | 2
> drivers/acpi/numa/hmat.c | 76 --
> drivers/acpi/numa/srat.c | 9
> drivers/base/core.c | 2
> drivers/dax/Kconfig | 4
> drivers/dax/Makefile | 3
> drivers/dax/bus.c | 1046 +++++++++++++++++++++++++++--
> drivers/dax/bus.h | 28 -
> drivers/dax/dax-private.h | 60 +-
> drivers/dax/device.c | 134 ++--
> drivers/dax/hmem.c | 56 --
> drivers/dax/hmem/Makefile | 6
> drivers/dax/hmem/device.c | 100 +++
> drivers/dax/hmem/hmem.c | 65 ++
> drivers/dax/kmem.c | 199 +++---
> drivers/dax/pmem/compat.c | 2
> drivers/dax/pmem/core.c | 22 -
> drivers/firmware/efi/x86_fake_mem.c | 12
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 15
> drivers/nvdimm/badrange.c | 26 -
> drivers/nvdimm/claim.c | 13
> drivers/nvdimm/nd.h | 3
> drivers/nvdimm/pfn_devs.c | 13
> drivers/nvdimm/pmem.c | 27 -
> drivers/nvdimm/region.c | 21 -
> drivers/pci/p2pdma.c | 12
> include/acpi/acpi_numa.h | 14
> include/linux/dax.h | 8
> include/linux/memory_hotplug.h | 5
> include/linux/memremap.h | 11
> include/linux/numa.h | 11
> include/linux/range.h | 6
> kernel/resource.c | 11
> lib/test_hmm.c | 15
> mm/memory_hotplug.c | 10
> mm/memremap.c | 299 +++++---
> tools/testing/nvdimm/dax-dev.c | 22 -
> tools/testing/nvdimm/test/iomap.c | 2
> 44 files changed, 1825 insertions(+), 601 deletions(-)
> delete mode 100644 drivers/dax/hmem.c
> create mode 100644 drivers/dax/hmem/Makefile
> create mode 100644 drivers/dax/hmem/device.c
> create mode 100644 drivers/dax/hmem/hmem.c
>
> base-commit: 01830e6c042e8eb6eb202e05d7df8057135b4c26
>


--
Thanks,

David / dhildenb