Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

From: Tang Chen
Date: Tue Jul 07 2015 - 04:56:50 EST



On 07/07/2015 12:42 AM, Yasuaki Ishimatsu wrote:
On Fri, 3 Jul 2015 09:26:05 +0800
Tang Chen <tangchen@xxxxxxxxxxxxxx> wrote:

On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
Hi Tang,

On my box, if I run lscpu, the output looks like this:

NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220

Nodes 2 and 3 do not exist, but they are online.
According to your description of the patch, nodes 4 and 5 are mistakenly
Not nodes 4 and 5; it is nodes 2 and 3 which are mistakenly set online.
Please add the results of lscpu before/after applying the patch to the
description of your patch.

Feel free to add my
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@xxxxxxxxxxxxxx>

Thanks for reviewing. Will update the patch soon.

Thanks.


Thanks,
Yasuaki Ishimatsu

set to online. Why does lscpu show the above result?
Well, actually it is not only lscpu that gives the strange result: under
/sys/devices/system/node, interfaces for nodes 2 and 3 are also created.

I haven't read the lscpu code, so I'm not sure how lscpu handles nodes.
But obviously, nodes 2 and 3 are set online, which is incorrect.
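
One way to cross-check this from user space, independent of lscpu, is to
read the node list the kernel exports in /sys/devices/system/node/online
(a range list such as "0-5"). The following is only a minimal sketch: the
helper names nodelist_contains() and node_is_online() are made up for
illustration and are not part of any library.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return true if nid appears in a kernel-style
 * range list such as "0-1,4-5\n". Parsing stops at the first character
 * that is not part of the list.
 */
static bool nodelist_contains(const char *list, int nid)
{
	const char *p = list;

	while (*p) {
		char *end;
		long lo = strtol(p, &end, 10);
		long hi;

		if (end == p)
			break;		/* no number here, stop parsing */
		hi = lo;
		p = end;
		if (*p == '-') {
			hi = strtol(p + 1, &end, 10);
			if (end == p + 1)
				break;	/* malformed range */
			p = end;
		}
		if (nid >= lo && nid <= hi)
			return true;
		if (*p != ',')
			break;
		p++;
	}
	return false;
}

/* Hypothetical helper: 1 if online, 0 if not, -1 if sysfs is unreadable. */
static int node_is_online(int nid)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/node/online", "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = nodelist_contains(buf, nid) ? 1 : 0;
	fclose(f);
	return ret;
}
```

On a box showing the problem, node_is_online(2) would be expected to
return 1 even though node 2 has no CPUs and no memory.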

For now, I have only found that in numa_cleanup_meminfo(), memory above
max_pfn is removed, but holes between nodes are not removed.

I think libraries are not able to handle this problem, since the nodes are
set online in the kernel. Seen from user space, there is no hole.

Thanks.

Thanks,
Yasuaki Ishimatsu

On Wed, 1 Jul 2015 15:55:30 +0800
Tang Chen <tangchen@xxxxxxxxxxxxxx> wrote:

On 07/01/2015 02:25 PM, Xishi Qiu wrote:
On 2015/7/1 11:16, Tang Chen wrote:

When parsing SRAT, all memory ranges are added into numa_meminfo.
In numa_init(), before entering numa_cleanup_meminfo(), all possible
memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
all ranges that are above max_pfn or empty.

But this only works if the nodes are contiguous. Let's have a look
at the following example:

We have an SRAT like this:
SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug

On boot, only nodes 0, 1, 2 and 3 exist.

And the numa_meminfo will look like this:
numa_meminfo.nr_blks = 9
1. on node 0: [0, 60000000]
2. on node 0: [100000000, 20000000000]
3. on node 1: [20000000000, 40000000000]
4. on node 4: [40000000000, 60000000000]
5. on node 5: [60000000000, 80000000000]
6. on node 2: [80000000000, a0000000000]
7. on node 3: [a0000000000, a0800000000]
8. on node 6: [c0000000000, a0800000000]
9. on node 7: [e0000000000, a0800000000]

And numa_cleanup_meminfo() will merge 1 and 2, and remove 8 and 9 because
the end address is over max_pfn, which is a0800000000. But 4 and 5
are not removed because their end addresses are less than max_pfn.
But in fact, nodes 4 and 5 don't exist.

In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
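
To make the arithmetic concrete, here is a small user-space sketch of the
clamp-and-drop step applied to the SRAT ranges above. It is not the kernel
code: struct blk, clamp_and_drop_empty(), has_node() and run_srat_example()
are illustrative names, ranges are treated as half-open [start, end), and
the merge of blocks 1 and 2 is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct blk {
	int nid;
	uint64_t start, end;	/* half-open: [start, end) */
};

/* Clamp every block to [low, high) and drop the ones that become empty. */
static size_t clamp_and_drop_empty(struct blk *b, size_t n,
				   uint64_t low, uint64_t high)
{
	size_t kept = 0;

	assert(low <= high);
	for (size_t i = 0; i < n; i++) {
		uint64_t s = b[i].start > low ? b[i].start : low;
		uint64_t e = b[i].end < high ? b[i].end : high;

		if (s >= e)
			continue;	/* empty after clamping */
		b[kept].nid = b[i].nid;
		b[kept].start = s;
		b[kept].end = e;
		kept++;
	}
	return kept;
}

static bool has_node(const struct blk *b, size_t n, int nid)
{
	for (size_t i = 0; i < n; i++)
		if (b[i].nid == nid)
			return true;
	return false;
}

/* Fill out[] with the nine SRAT ranges and clamp to the example limit. */
static size_t run_srat_example(struct blk out[9])
{
	static const struct blk srat[9] = {
		{ 0, 0x0ULL,           0x60000000ULL },
		{ 0, 0x100000000ULL,   0x20000000000ULL },
		{ 1, 0x20000000000ULL, 0x40000000000ULL },
		{ 4, 0x40000000000ULL, 0x60000000000ULL },
		{ 5, 0x60000000000ULL, 0x80000000000ULL },
		{ 2, 0x80000000000ULL, 0xa0000000000ULL },
		{ 3, 0xa0000000000ULL, 0xc0000000000ULL },
		{ 6, 0xc0000000000ULL, 0xe0000000000ULL },
		{ 7, 0xe0000000000ULL, 0x100000000000ULL },
	};

	for (size_t i = 0; i < 9; i++)
		out[i] = srat[i];
	return clamp_and_drop_empty(out, 9, 0, 0xa0800000000ULL);
}
```

In this sketch the blocks for nodes 6 and 7 are dropped, while the blocks
for nodes 4 and 5 survive because they lie entirely below the limit.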

Since memory ranges in node 4 and 5 are in numa_meminfo, in numa_register_memblks(),
node 4 and 5 will be mistakenly set to online.

In this patch, we use memblock_overlaps_region() to check if ranges in
numa_meminfo overlap with ranges in memblock.memory. Since memblock.memory
contains all available memory at boot time, if they overlap, it means the
ranges exist. If not, then remove them from numa_meminfo.
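
The idea can be sketched in user space as follows. This is only an
illustration under stated assumptions, not the kernel implementation:
the types and the helpers ranges_overlap(), present_at_boot(),
drop_absent_blocks() and run_overlap_example() are hypothetical, the
overlap test returns a bool rather than the index or -1 that the patch
compares against, and the boot-time memory list is assumed to cover
exactly the ranges of nodes 0, 1, 2 and 3 from the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t start, end;	/* half-open: [start, end) */
};

struct numa_blk {
	int nid;
	struct range r;
};

/* Two half-open ranges overlap if each one starts before the other ends. */
static bool ranges_overlap(struct range a, struct range b)
{
	return a.start < b.end && b.start < a.end;
}

/* True if r overlaps any range of memory known at boot. */
static bool present_at_boot(struct range r, const struct range *mem,
			    size_t nr_mem)
{
	for (size_t i = 0; i < nr_mem; i++)
		if (ranges_overlap(r, mem[i]))
			return true;
	return false;
}

/* Compact blk[] in place, keeping only blocks backed by boot memory. */
static size_t drop_absent_blocks(struct numa_blk *blk, size_t nr,
				 const struct range *mem, size_t nr_mem)
{
	size_t kept = 0;

	for (size_t i = 0; i < nr; i++)
		if (present_at_boot(blk[i].r, mem, nr_mem))
			blk[kept++] = blk[i];
	return kept;
}

/* Blocks that survive clamping in the example, filtered by boot memory. */
static size_t run_overlap_example(struct numa_blk out[7])
{
	static const struct numa_blk after_clamp[7] = {
		{ 0, { 0x0ULL,           0x60000000ULL } },
		{ 0, { 0x100000000ULL,   0x20000000000ULL } },
		{ 1, { 0x20000000000ULL, 0x40000000000ULL } },
		{ 4, { 0x40000000000ULL, 0x60000000000ULL } },
		{ 5, { 0x60000000000ULL, 0x80000000000ULL } },
		{ 2, { 0x80000000000ULL, 0xa0000000000ULL } },
		{ 3, { 0xa0000000000ULL, 0xa0800000000ULL } },
	};
	/* Assumption: memory present at boot belongs to nodes 0-3 only. */
	static const struct range boot_mem[3] = {
		{ 0x0ULL,           0x60000000ULL },
		{ 0x100000000ULL,   0x40000000000ULL },
		{ 0x80000000000ULL, 0xa0800000000ULL },
	};

	for (size_t i = 0; i < 7; i++)
		out[i] = after_clamp[i];
	return drop_absent_blocks(out, 7, boot_mem, 3);
}
```

With these inputs the two blocks for nodes 4 and 5 are removed, so those
nodes would no longer be set online later in numa_register_memblks().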

Hi Tang Chen,

What's the impact of this problem?

The command "numactl --hardware" will show an empty node (no CPU and no
memory, but a pgdat is created), right?
On my box, if I run lscpu, the output looks like this:

NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220

Nodes 2 and 3 do not exist, but they are online.

Thanks.

Thanks,
Xishi Qiu

Signed-off-by: Tang Chen <tangchen@xxxxxxxxxxxxxx>
---
arch/x86/mm/numa.c | 6 ++++--
include/linux/memblock.h | 2 ++
mm/memblock.c | 2 +-
3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..0c55cc5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 		bi->start = max(bi->start, low);
 		bi->end = min(bi->end, high);
 
-		/* and there's no empty block */
-		if (bi->start >= bi->end)
+		/* and there's no empty or non-exist block */
+		if (bi->start >= bi->end ||
+		    memblock_overlaps_region(&memblock.memory,
+			bi->start, bi->end - bi->start) == -1)
 			numa_remove_memblk_from(i--, mi);
 	}
 
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 0215ffd..3bf6cc1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
 int memblock_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
+long memblock_overlaps_region(struct memblock_type *type,
+			      phys_addr_t base, phys_addr_t size);
 int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 1b444c7..55b5f9f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
 	return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
 }
 
-static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
+long __init_memblock memblock_overlaps_region(struct memblock_type *type,
 					phys_addr_t base, phys_addr_t size)
 {
 	unsigned long i;