Re: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM

From: Yanmin Zhang
Date: Thu Jan 17 2019 - 03:18:45 EST


On 2019/1/17 äå2:19, Dave Hansen wrote:
From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>

Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries.

However, this limits persistent memory use to applications which
*have* been modified. To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed ad used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then bind it to this new driver:

echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.

Note: this inherits any existing NUMA information for the newly-
added memory from the persistent memory device that came from the
firmware. On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node. That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches. But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.

Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
Cc: Dave Jiang <dave.jiang@xxxxxxxxx>
Cc: Ross Zwisler <zwisler@xxxxxxxxxx>
Cc: Vishal Verma <vishal.l.verma@xxxxxxxxx>
Cc: Tom Lendacky <thomas.lendacky@xxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: linux-nvdimm@xxxxxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: linux-mm@xxxxxxxxx
Cc: Huang Ying <ying.huang@xxxxxxxxx>
Cc: Fengguang Wu <fengguang.wu@xxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxx>
Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
Cc: Yaowei Bai <baiyaowei@xxxxxxxxxxxxxxxxxxxx>
Cc: Takashi Iwai <tiwai@xxxxxxx>
---

b/drivers/dax/Kconfig | 5 ++
b/drivers/dax/Makefile | 1
b/drivers/dax/kmem.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 99 insertions(+)

diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
--- a/drivers/dax/Kconfig~dax-kmem-try-4 2019-01-08 09:54:44.051694874 -0800
+++ b/drivers/dax/Kconfig 2019-01-08 09:54:44.056694874 -0800
@@ -32,6 +32,11 @@ config DEV_DAX_PMEM
Say M if unsure
+config DEV_DAX_KMEM
+ def_bool y
+ depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure
+ depends on MEMORY_HOTPLUG # for add_memory() and friends
+
config DEV_DAX_PMEM_COMPAT
tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
depends on DEV_DAX_PMEM
diff -puN /dev/null drivers/dax/kmem.c
--- /dev/null 2018-12-03 08:41:47.355756491 -0800
+++ b/drivers/dax/kmem.c 2019-01-08 09:54:44.056694874 -0800
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */
+#include <linux/memremap.h>
+#include <linux/pagemap.h>
+#include <linux/memory.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pfn_t.h>
+#include <linux/slab.h>
+#include <linux/dax.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include "dax-private.h"
+#include "bus.h"
+
+int dev_dax_kmem_probe(struct device *dev)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct resource *res = &dev_dax->region->res;
+ resource_size_t kmem_start;
+ resource_size_t kmem_size;
+ struct resource *new_res;
+ int numa_node;
+ int rc;
+
+ /* Hotplug starting at the beginning of the next block: */
+ kmem_start = ALIGN(res->start, memory_block_size_bytes());
+
+ kmem_size = resource_size(res);
+ /* Adjust the size down to compensate for moving up kmem_start: */
+ kmem_size -= kmem_start - res->start;
+ /* Align the size down to cover only complete blocks: */
+ kmem_size &= ~(memory_block_size_bytes() - 1);
+
+ new_res = devm_request_mem_region(dev, kmem_start, kmem_size,
+ dev_name(dev));
+
+ if (!new_res) {
+ printk("could not reserve region %016llx -> %016llx\n",
+ kmem_start, kmem_start+kmem_size);
+ return -EBUSY;
+ }
+
+ /*
+ * Set flags appropriate for System RAM. Leave ..._BUSY clear
+ * so that add_memory() can add a child resource.
+ */
+ new_res->flags = IORESOURCE_SYSTEM_RAM;
+ new_res->name = dev_name(dev);
+
+ numa_node = dev_dax->target_node;
+ if (numa_node < 0) {
+ pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node);
+ numa_node = 0;
+ }
+
+ rc = add_memory(numa_node, new_res->start, resource_size(new_res));
I didn't try pmem and I am wondering it's slower than DRAM.
Should a flag, such like _GFP_PMEM, be added to distinguish it from
DRAM?

If it's used for DMA, perhaps it might not satisfy device DMA request on time?