Re: [3/7, v9] NUMA Hotplug Emulator: Add node hotplug emulation

From: Andrew Morton
Date: Wed Dec 22 2010 - 19:29:15 EST


On Fri, 10 Dec 2010 15:31:22 +0800
shaohui.zheng@xxxxxxxxx wrote:

> From: David Rientjes <rientjes@xxxxxxxxxx>
>
> Add an interface to allow new nodes to be added when performing memory
> hot-add. This provides a convenient interface to test memory hotplug
> notifier callbacks and surrounding hotplug code when new nodes are
> onlined without actually having a machine with such hotpluggable SRAT
> entries.
>
> This adds a new debugfs interface at /sys/kernel/debug/mem_hotplug/add_node
> that behaves in a similar way to the memory hot-add "probe" interface.
> Its format is size@start, where "size" is the size of the new node to be
> added and "start" is the physical address of the new memory.
>
> The new node id is a currently offline, but possible, node. The bit must
> be set in node_possible_map so that nr_node_ids is sized appropriately.
>
> For emulation on x86, for example, it would be possible to set aside
> memory for hotplugged nodes (say, anything above 2G) and to add an
> additional four nodes as being possible on boot with
>
> mem=2G numa=possible=4
>
> and then creating a new 128M node at runtime:
>
> # echo 128M@0x80000000 > /sys/kernel/debug/mem_hotplug/add_node
> On node 1 totalpages: 0
> init_memory_mapping: 0000000080000000-0000000088000000
> 0080000000 - 0088000000 page 2M
>
> Once the new node has been added, its memory can be onlined. If this
> memory represents memory section 16, for example:
>
> # echo online > /sys/devices/system/memory/memory16/state
> Built 2 zonelists in Node order, mobility grouping on. Total pages: 514846
> Policy zone: Normal
>
> [ The memory section(s) mapped to a particular node are visible via
> /sys/kernel/debug/mem_hotplug/node1, in this example. ]
>
> The new node is now hotplugged and ready for testing.
>
> CC: Haicheng Li <haicheng.li@xxxxxxxxx>
> CC: Greg KH <gregkh@xxxxxxx>
> Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
> Signed-off-by: Shaohui Zheng <shaohui.zheng@xxxxxxxxx>
> ---
> Documentation/memory-hotplug.txt | 24 +++++++++++++++
> mm/memory_hotplug.c | 59 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 83 insertions(+), 0 deletions(-)
> Index: linux-hpe4/Documentation/memory-hotplug.txt
> ===================================================================
> --- linux-hpe4.orig/Documentation/memory-hotplug.txt 2010-11-30 12:40:43.527622001 +0800
> +++ linux-hpe4/Documentation/memory-hotplug.txt 2010-11-30 14:11:11.827622000 +0800
> @@ -18,6 +18,7 @@
> 4. Physical memory hot-add phase
> 4.1 Hardware(Firmware) Support
> 4.2 Notify memory hot-add event by hand
> + 4.3 Node hotplug emulation
> 5. Logical Memory hot-add phase
> 5.1. State of memory
> 5.2. How to online memory
> @@ -215,6 +216,29 @@
> Please see "How to online memory" in this text.
>
>
> +4.3 Node hotplug emulation
> +------------
> +With debugfs, it is possible to test node hotplug by assigning the newly
> +added memory to a new node id when using a different interface with a similar
> +behavior to "probe" described in section 4.2. If a node id is possible
> +(there are bits in /sys/devices/system/memory/possible that are not online),
> +then it may be used to emulate a newly added node as the result of memory
> +hotplug by using the debugfs "add_node" interface.
> +
> +The add_node interface is located at "mem_hotplug/add_node" at the debugfs
> +mount point.
> +
> +You can create a new node of a specified size starting at the physical
> +address of new memory by
> +
> +% echo size@start_address_of_new_memory > /sys/kernel/debug/mem_hotplug/add_node
> +
> +Where "size" can be represented in megabytes or gigabytes (for example,
> +"128M" or "1G"). The minimum size is that of a memory section.
> +
> +Once the new node has been added, it is possible to online the memory by
> +toggling the "state" of its memory section(s) as described in section 5.1.
> +
>
> ------------------------------
> 5. Logical Memory hot-add phase
> Index: linux-hpe4/mm/memory_hotplug.c
> ===================================================================
> --- linux-hpe4.orig/mm/memory_hotplug.c 2010-11-30 12:40:43.757622001 +0800
> +++ linux-hpe4/mm/memory_hotplug.c 2010-11-30 14:02:33.877622002 +0800
> @@ -924,3 +924,63 @@
> }
> #endif /* CONFIG_MEMORY_HOTREMOVE */
> EXPORT_SYMBOL_GPL(remove_memory);
> +
> +#ifdef CONFIG_DEBUG_FS
> +#include <linux/debugfs.h>
> +
> +static struct dentry *memhp_debug_root;
> +
> +static ssize_t add_node_store(struct file *file, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + nodemask_t mask;

NODEMASK_ALLOC()?

> + u64 start, size;
> + char buffer[64];
> + char *p;
> + int nid;
> + int ret;
> +
> + memset(buffer, 0, sizeof(buffer));
> + if (count > sizeof(buffer) - 1)
> + count = sizeof(buffer) - 1;

This will cause the write to return a smaller number than `count': a
short write. Some userspace code may then decide to write the
remainder of the data (which is the correct way to use the write()
syscall).

Could be a bit dangerous, and perhaps simply declaring an error if too
much data was written would be a better approach.
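
A rough userspace sketch of that alternative follows. copy_request() and
BUF_LEN are made-up names standing in for the copy step of
add_node_store(), memcpy() stands in for copy_from_user(), and -EINVAL is
only one plausible choice of error code:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define BUF_LEN 64	/* hypothetical, mirrors "char buffer[64]" in the patch */

/*
 * Hypothetical stand-in for the copy step of add_node_store().
 * Oversized input is rejected outright rather than truncated, so the
 * caller never sees a short write.
 */
static ssize_t copy_request(char *dst, const char *src, size_t count)
{
	assert(dst && src);

	/* Leave room for the terminating NUL. */
	if (count > BUF_LEN - 1)
		return -EINVAL;

	memcpy(dst, src, count);
	dst[count] = '\0';
	return (ssize_t)count;
}
```

With this shape the handler either consumes the whole write or fails it,
so a retry loop in userspace cannot feed the tail of an overlong string
back in as a second request.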

> + if (copy_from_user(buffer, buf, count))
> + return -EFAULT;
> +
> + size = memparse(buffer, &p);
> + if (size < (PAGES_PER_SECTION << PAGE_SHIFT))

PAGES_PER_SECTION has type unsigned long, so the rhs of this comparison
might overflow on 32-bit, should anyone ever try to use this code on
32-bit.

otoh the compiler might do it as 64-bit because the lhs is 64-bit. Not
sure.
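
For what it's worth, in C the result of a shift has the type of its
(promoted) left operand, so the rhs is evaluated in unsigned long first
and only widened for the comparison afterwards. A small userspace sketch,
using uint32_t to model a 32-bit unsigned long; the section size below is
deliberately large and hypothetical, and whether a real 32-bit
configuration can overflow depends on its section size:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values for illustration only, not taken from any arch. */
#define DEMO_PAGES_PER_SECTION ((uint32_t)1 << 20)
#define DEMO_PAGE_SHIFT 12

/* Shift performed in 32 bits, then widened: high bits are already lost. */
static uint64_t section_bytes_narrow(uint32_t pages, unsigned int shift)
{
	assert(shift < 32);
	return pages << shift;
}

/* Operand widened first, so the shift is performed in 64 bits. */
static uint64_t section_bytes_wide(uint32_t pages, unsigned int shift)
{
	assert(shift < 32);
	return (uint64_t)pages << shift;
}
```

Casting the left operand to u64 before shifting sidesteps the question
regardless of the width of unsigned long.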

> + return -EINVAL;
> + if (*p != '@')
> + return -EINVAL;
> +
> + start = simple_strtoull(p + 1, NULL, 0);

You disagreed with checkpatch?
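
(Presumably checkpatch flagged simple_strtoull() as obsolete in favour of
a strict variant. simple_strtoull() silently ignores trailing garbage and
has no way to report a parse error.) A userspace sketch of what strict
parsing of the start address could look like, built on strtoull();
parse_u64_strict() is a made-up name, not a kernel API:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the whole string as an unsigned number.
 * Returns 0 on success or a negative errno value on failure.
 */
static int parse_u64_strict(const char *s, unsigned long long *out)
{
	char *end;

	assert(s && out);

	errno = 0;
	*out = strtoull(s, &end, 0);	/* base 0 accepts 0x... and decimal */
	if (errno == ERANGE)
		return -ERANGE;

	/* No digits were consumed at all. */
	if (end == s)
		return -EINVAL;

	/* Tolerate the single trailing newline that "echo" appends. */
	if (*end == '\n')
		end++;

	/* Anything else left over is an error. */
	if (*end != '\0')
		return -EINVAL;

	return 0;
}
```

A string such as "0x8000zz" is rejected here, whereas a lenient parser
would quietly take 0x8000 and hand it to add_memory().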

> + nodes_andnot(mask, node_possible_map, node_online_map);
> + nid = first_node(mask);
> + if (nid == MAX_NUMNODES)
> + return -ENOMEM;
> +
> + ret = add_memory(nid, start, size);
> + return ret ? ret : count;
> +}
> +
> +static const struct file_operations add_node_file_ops = {
> + .write = add_node_store,
> + .llseek = generic_file_llseek,
> +};
> +
> +static int __init node_debug_init(void)
> +{
> + if (!memhp_debug_root)
> + memhp_debug_root = debugfs_create_dir("mem_hotplug", NULL);
> + if (!memhp_debug_root)
> + return -ENOMEM;
> +
> + if (!debugfs_create_file("add_node", S_IWUSR, memhp_debug_root,
> + NULL, &add_node_file_ops))
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +module_init(node_debug_init);
> +#endif /* CONFIG_DEBUG_FS */
