Re: [PATCH v2 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

From: Srikar Dronamraju
Date: Fri May 08 2020 - 09:03:56 EST


* Michal Hocko <mhocko@xxxxxxxxxx> [2020-05-04 11:37:12]:

> > >
> > > Have you tested on something else than ppc? Each arch does the NUMA
> > > setup separately and this is a big mess. E.g. x86 marks even memory less
> > > nodes (see init_memory_less_node) as online.
> > >
> >
> > while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
> > enabled/disabled on both single node and multi node machines.
> > However, I dont have a cpuless/memoryless x86 system.
>
> This should be able to emulate inside kvm, I believe.
>

I did try, but somehow I was not able to get a cpuless/memoryless node in an
x86 KVM guest.

Also, I am unable to see how to enable HAVE_MEMORYLESS_NODES on an x86 system.
# git grep -w HAVE_MEMORYLESS_NODES | cat
arch/ia64/Kconfig:config HAVE_MEMORYLESS_NODES
arch/powerpc/Kconfig:config HAVE_MEMORYLESS_NODES
#
I force-enabled it, but it got disabled during the kernel build.
Maybe I am missing something.

> >
> > So we have a redundant page hinting numa faults which we can avoid.
>
> interesting. Does this lead to any observable differences? Btw. it would
> be really great to describe how the online state influences the numa
> balancing.
>

If numa_balancing is enabled, there is a check on the number of online
nodes. If only one node is online, numa_balancing is disabled; otherwise
numa_balancing stays as is. In this case, without the patch, both the actual
node (node nr > 0) and node 0 were marked online, so numa_balancing stayed
enabled.
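
To make the effect of the online state concrete, here is a small user-space
model of the check described above. It is only a sketch of the behaviour as
stated in this mail, not the kernel code; the function name and the
`configured` parameter are made up for illustration.

```python
def numa_balancing_active(online_nodes, configured=True):
    """Model of the check: with a single online node, balancing is switched off.

    online_nodes: set of online node ids (hypothetical input).
    configured:   whether numa_balancing would otherwise be enabled.
    """
    if len(online_nodes) == 1:
        return False
    # More than one online node: the configured state is left untouched.
    return configured


# Without the patch: dummy node 0 and the real node 7 are both online.
print(numa_balancing_active({0, 7}))  # True
# With the patch: only the real node is online.
print(numa_balancing_active({7}))     # False
```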

Here are 2 sample numa programs.

numa01.sh is a set of 2 processes, each running as many threads as there are
cpus; each thread does 50 loops of operations on 3GB of process-shared memory.

numa02.sh is a single process with as many threads as there are cpus;
each thread does 800 loops of operations on 32MB of thread-local memory.

Testcase     Time:      Min      Max      Avg  StdDev
./numa01.sh  Real:   149.62   149.66   149.64    0.02
./numa01.sh  Sys:      3.21     3.71     3.46    0.25
./numa01.sh  User:  4755.13  4758.15  4756.64    1.51
./numa02.sh  Real:    24.98    25.02    25.00    0.02
./numa02.sh  Sys:      0.51     0.59     0.55    0.04
./numa02.sh  User:   790.28   790.88   790.58    0.30

Testcase     Time:      Min      Max      Avg  StdDev  %Change
./numa01.sh  Real:   149.44   149.46   149.45    0.01  0.127133%
./numa01.sh  Sys:      0.71     0.89     0.80    0.09  332.5%
./numa01.sh  User:  4754.19  4754.48  4754.33    0.15  0.0485873%
./numa02.sh  Real:    24.97    24.98    24.98    0.00  0.0800641%
./numa02.sh  Sys:      0.26     0.41     0.33    0.08  66.6667%
./numa02.sh  User:   789.75   790.28   790.01    0.27  0.072151%

numa01.sh
param                   no_patch  with_patch  %Change
-----                   --------  ----------  -------
numa_hint_faults         1131164           0  -100%
numa_hint_faults_local   1131164           0  -100%
numa_hit                  213696      214244  0.256439%
numa_local                213696      214244  0.256439%
numa_pte_updates         1131294           0  -100%
pgfault                  1380845      241424  -82.5162%
pgmajfault                    75          60  -20%

numa02.sh
param                   no_patch  with_patch  %Change
-----                   --------  ----------  -------
numa_hint_faults          111878           0  -100%
numa_hint_faults_local    111878           0  -100%
numa_hit                   41854       43220  3.26373%
numa_local                 41854       43220  3.26373%
numa_pte_updates          113926           0  -100%
pgfault                   163662       51210  -68.7099%
pgmajfault                    56          52  -7.14286%
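
For anyone re-deriving the %Change columns: judging from the numbers alone
(this is an inference, it is not stated anywhere above), the timing table
and the counter tables use two different conventions. The sketch below
reproduces both; the function names are made up for illustration.

```python
def pct_change_time(no_patch, with_patch):
    # Timing tables: difference taken relative to the with-patch value.
    return (no_patch - with_patch) / with_patch * 100.0


def pct_change_counter(no_patch, with_patch):
    # Counter tables: difference taken relative to the no-patch value.
    return (with_patch - no_patch) / no_patch * 100.0


# numa01.sh Sys average: 3.46 vs 0.80
print(round(pct_change_time(3.46, 0.80), 1))
# numa01.sh pgfault: 1380845 vs 241424
print(round(pct_change_counter(1380845, 241424), 4))
# numa01.sh numa_hint_faults: 1131164 vs 0
print(pct_change_counter(1131164, 0))
```

So the 332.5% in the Sys row means the system time without the patch is
about 4.3x the system time with the patch, not that system time went up.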

Observations:
The real time and user time don't actually change much. However, the system
time drops noticeably. The reason is the number of numa hinting faults:
with the patch, we no longer see any numa hinting faults.

> > 2. Few people have complained about existence of this dummy node when
> > parsing lscpu and numactl o/p. They somehow start to think that the tools
> > are reporting incorrectly or the kernel is not able to recognize resources
> > connected to the node.
>
> Please be more specific.

Take the numactl output below as an example:
available: 2 nodes (0,7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 7 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 7 size: 16238 MB
node 7 free: 15449 MB
node distances:
node   0   7
  0:  10  20
  7:  20  10

We know node 0 can be special, but users may not feel the same.

When users parse numactl/lscpu output or the /sys directory, they find 2
online nodes. They find that one node (node 0) has no resources available
but is still online. However, they find that the other nodes (nodes 1-6),
which also have no resources, are not online. So they tend to think the
kernel has been unable to online some of the resources, or that the
resources have gone bad. Please do note that on hypervisors like PowerVM,
the admins don't have control over which nodes the resources are allocated
from.
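
To illustrate what such a user-side parser ends up seeing, here is a minimal
sketch that reads numactl output in the format shown above and lists nodes
that are reported but have neither cpus nor memory. The sample text is the
output from this mail; the function names are made up, and the parsing
assumes exactly this line format.

```python
import re

# numactl output as quoted above (cpu list generated instead of typed out).
SAMPLE = (
    "available: 2 nodes (0,7)\n"
    "node 0 cpus:\n"
    "node 0 size: 0 MB\n"
    "node 0 free: 0 MB\n"
    "node 7 cpus: " + " ".join(str(c) for c in range(32)) + "\n"
    "node 7 size: 16238 MB\n"
    "node 7 free: 15449 MB\n"
)


def parse_nodes(text):
    """Return {node_id: {"cpus": [...], "size_mb": int, "free_mb": int}}."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"node (\d+) (cpus|size|free):\s*(.*)$", line)
        if not m:
            continue  # skips the "available:" line and the distance table
        node, key, value = int(m.group(1)), m.group(2), m.group(3)
        info = nodes.setdefault(node, {})
        if key == "cpus":
            info["cpus"] = [int(c) for c in value.split()]
        else:
            info[key + "_mb"] = int(value.split()[0])
    return nodes


def empty_online_nodes(nodes):
    """Nodes that are listed but have no cpus and 0 MB of memory."""
    return sorted(
        n for n, info in nodes.items()
        if not info.get("cpus") and info.get("size_mb", 0) == 0
    )


print(empty_online_nodes(parse_nodes(SAMPLE)))  # [0]
```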

--
Thanks and Regards
Srikar Dronamraju