Re: [PATCH RFC 0/5] support NUMA emulation for arm64

From: Rongwei Wang
Date: Thu Oct 12 2023 - 09:31:10 EST



On 2023/10/12 20:37, Pierre Gondois wrote:
Hello Rongwei,

On 10/12/23 04:48, Rongwei Wang wrote:
A brief introduction
====================

The NUMA emulation can fake more node base on a single
node system, e.g.

one node system:

[root@localhost ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 31788 MB
node 0 free: 31446 MB
node distances:
node   0
   0:  10

add numa=fake=2 (fake 2 node on each origin node):

[root@localhost ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 15806 MB
node 0 free: 15451 MB
node 1 cpus: 0 1 2 3 4 5 6 7
node 1 size: 16029 MB
node 1 free: 15989 MB
node distances:
node   0   1
   0:  10  10
   1:  10  10

As above shown, a new node has been faked. As cpus, the realization
of x86 NUMA emulation is kept. Maybe each node should has 4 cores is
better (not sure, next to do if so).

Why do this
===========

It seems has following reasons:
   (1) In x86 host, apply NUMA emulation can fake more nodes environment
       to test or verify some performance stuff, but arm64 only has
       one method that modify ACPI table to do this. It's troublesome
       more or less.
   (2) Reduce competition for some locks. Here an example we found:
       will-it-scale/tlb_flush1_processes -t 96 -s 10, it shows obvious
       hotspot on lruvec->lock when test in single environment. What's
       more, The performance improved greatly if test in two more nodes
       system. The data shows below (more is better):

---------------------------------------------------------------------
       threads/process |   1     |     12   |     24   | 48     |   96
---------------------------------------------------------------------
       one node        | 14 1122 | 110 5372 | 111 2615 | 79 7084  | 72 4516
---------------------------------------------------------------------
       numa=fake=2     | 14 1168 | 144 4848 | 215 9070 | 157 0412 | 142 3968
---------------------------------------------------------------------
                       | For concurrency 12, no lruvec->lock hotspot. For 24,
       hotspot         | one node has 24% hotspot on lruvec->lock, but
                       | two nodes env hasn't.
---------------------------------------------------------------------

As for risks (e.g. numa balance...), they need to be discussed here.

Lastly, this just is a draft, I can improve next if it's acceptable.

I'm not engaging on the utility/relevance of the patch-set, but I tried
them on an arm64 system with the 'numa=fake=2' parameter and could not

Sorry, my fault.

I should mention this in previous brief introduction: acpi=on numa=fake=2.

The default patch of arm64 numa initialize is numa_init() -> dummy_numa_init() if turn off acpi (this path has not been taken into account yet in this patch, next will to do).

What's more, if you test these patchset in qemu-kvm, you should add below parameters in the script.

object memory-backend-ram,id=mem0,size=32G \
numa node,memdev=mem0,cpus=0-7,nodeid=0 \

(Above parameters just make sure SRAT table has NUMA configure, avoiding path of numa_init() -> dummy_numa_init())

see 2 nodes being created under:
  /sys/devices/system/node/
Indeed it seems that even though numa_emulation() is moved to a generic
mm/numa.c file, the function is only called from:
  arch/x86/mm/numa.c:numa_init()
(or maybe I'm misinterpreting the intent of the patches).

Here drivers/base/arch_numa.c:numa_init() has called numa_emulation() (I guess it works if you add acpi=on :-)).



Also I had the following errors when building (still for arm64):
mm/numa.c:862:8: error: implicit declaration of function 'early_cpu_to_node' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
        nid = early_cpu_to_node(cpu);

It seems CONFIG_DEBUG_PER_CPU_MAPS enabled in your environment? You can disable CONFIG_DEBUG_PER_CPU_MAPS and test it again.

I have not test it with CONFIG_DEBUG_PER_CPU_MAPS enabled. It's very helpful, I will fix it next time.

If you have any questions, please let me know.

Regards,

-wrw

^
mm/numa.c:862:8: note: did you mean 'early_map_cpu_to_node'?
./include/asm-generic/numa.h:37:13: note: 'early_map_cpu_to_node' declared here
void __init early_map_cpu_to_node(unsigned int cpu, int nid);
            ^
mm/numa.c:874:3: error: implicit declaration of function 'debug_cpumask_set_cpu' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
                debug_cpumask_set_cpu(cpu, nid, enable);
                ^
mm/numa.c:874:3: note: did you mean '__cpumask_set_cpu'?
./include/linux/cpumask.h:474:29: note: '__cpumask_set_cpu' declared here
static __always_inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
                            ^
2 errors generated.

Regards,
Pierre


Thanks!

Rongwei Wang (5):
   mm/numa: move numa emulation APIs into generic files
   mm: percpu: fix variable type of cpu
   arch_numa: remove __init in early_cpu_to_node()
   mm/numa: support CONFIG_NUMA_EMU for arm64
   mm/numa: migrate leftover numa emulation into mm/numa.c

  arch/x86/Kconfig                          |   8 -
  arch/x86/include/asm/numa.h               |   3 -
  arch/x86/mm/Makefile                      |   1 -
  arch/x86/mm/numa.c                        | 216 +-------------
  arch/x86/mm/numa_internal.h               |  14 +-
  drivers/base/arch_numa.c                  |   7 +-
  include/asm-generic/numa.h                |  33 +++
  include/linux/percpu.h                    |   2 +-
  mm/Kconfig                                |   8 +
  mm/Makefile                               |   1 +
  arch/x86/mm/numa_emulation.c => mm/numa.c | 333 +++++++++++++++++++++-
  11 files changed, 373 insertions(+), 253 deletions(-)
  rename arch/x86/mm/numa_emulation.c => mm/numa.c (63%)