Re: [RFC 0/2] How effective is numa_preferred_nid w.r.t. NUMA performance?

From: Chris Hyser
Date: Mon Feb 26 2024 - 19:48:39 EST


Included is additional micro-benchmark data from an AMD 128-CPU machine
(EPYC 7551 processor) concerning the effectiveness of setting a task's
numa_preferred_nid with respect to improving the NUMA awareness of the
scheduler. The test procedure is identical to that described in the
original RFC. While this and the original RFC are answers to a specific
question asked by Peter, feedback on the experimental setup as well as
the data would be appreciated.

The original RFC can be found at:
[https://lore.kernel.org/lkml/20231216001801.3015832-1-chris.hyser@xxxxxxxxxx/]

Key:
-----------------
NB   - auto-numa-balancing (0 - off, 1 - on)
PNID - the prctl() "forced" numa_preferred_nid, i.e. 'Preferred Node
       Affinity'
       (given 8 nodes: 0, 1, 2, 3, 4, 5, 6, 7, and -1 for not set)
Mem  - the node the memory is bound to, else 'F' (floating, i.e. not
       bound)
CPU  - the node whose CPUs the probe is hard-affined to, else 'F'
       (floating, i.e. not affined)
Avg  - the average time of the probe's measurements, in secs
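
For reference, a minimal sketch of how one test cell might be set up is
below. The PR_SET_PREFERRED_NID name and value are purely placeholders
for the prctl() interface proposed in the original RFC (not a mainline
command), and the libnuma calls stand in for whatever binding mechanism
the probe actually uses; setup_probe() itself is an illustrative name.

    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <numa.h>      /* libnuma: numa_alloc_onnode(), numa_run_on_node() */

    /* Placeholder for the prctl command proposed in the RFC -- not a
     * mainline value. */
    #define PR_SET_PREFERRED_NID  1000

    /* Configure one test cell: mem_node/cpu_node of -1 mean 'F'
     * (floating), pnid of -1 means not set. */
    static void *setup_probe(size_t len, int mem_node, int cpu_node, int pnid)
    {
            void *buf;

            /* Mem: N -> bind the buffer to node N; Mem: F -> plain malloc */
            buf = (mem_node >= 0) ? numa_alloc_onnode(len, mem_node)
                                  : malloc(len);

            /* CPU: N -> hard-affine the task to node N's CPUs;
             * CPU: F -> leave it floating */
            if (cpu_node >= 0)
                    numa_run_on_node(cpu_node);

            /* PNID: N -> force numa_preferred_nid; PNID: -1 -> default */
            if (pnid >= 0 && prctl(PR_SET_PREFERRED_NID, pnid, 0, 0, 0))
                    perror("prctl(PR_SET_PREFERRED_NID)");

            return buf;
    }

Row [20] below would then correspond to setup_probe(len, 0, -1, 0),
while row [02] would pass -1 for pnid and skip the prctl() entirely.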

NumSamples: 36
Load: 60
CPU_Model: AMD EPYC 7551 32-Core Processor
NUM_CPUS: 128
Migration Cost: 500000

      Avg     max     min     stdv        Test Parameters
-----------------------------------------------------------------
[00] 215.78  223.77  195.02   7.60  |  PNID: -1 NB: 0 Mem: 0 CPU 0
[01] 299.77  307.21  282.93   6.60  |  PNID: -1 NB: 0 Mem: 0 CPU 1
[02] 418.78  449.45  387.53  15.64  |  PNID: -1 NB: 0 Mem: 0 CPU F
[03] 301.27  311.84  280.22   8.98  |  PNID: -1 NB: 0 Mem: 1 CPU 0
[04] 213.60  221.36  190.10   6.53  |  PNID: -1 NB: 0 Mem: 1 CPU 1
[05] 396.37  418.58  376.10  10.15  |  PNID: -1 NB: 0 Mem: 1 CPU F
[06] 402.04  411.85  378.71   8.97  |  PNID: -1 NB: 0 Mem: F CPU 0
[07] 401.28  410.06  384.80   6.41  |  PNID: -1 NB: 0 Mem: F CPU 1
[08] 439.86  459.61  392.28  19.09  |  PNID: -1 NB: 0 Mem: F CPU F

[09] 214.81  225.35  199.34   5.38  |  PNID: -1 NB: 1 Mem: 0 CPU 0
[10] 299.15  314.84  274.00   8.18  |  PNID: -1 NB: 1 Mem: 0 CPU 1
[11] 395.70  425.22  340.33  21.54  |  PNID: -1 NB: 1 Mem: 0 CPU F
[12] 300.43  310.93  281.67   7.40  |  PNID: -1 NB: 1 Mem: 1 CPU 0
[13] 210.86  222.80  189.54   7.55  |  PNID: -1 NB: 1 Mem: 1 CPU 1
[14] 402.57  433.72  299.73  32.96  |  PNID: -1 NB: 1 Mem: 1 CPU F
[15] 390.04  410.10  370.63  10.72  |  PNID: -1 NB: 1 Mem: F CPU 0
[16] 393.32  418.43  370.52  10.71  |  PNID: -1 NB: 1 Mem: F CPU 1
[17] 370.07  424.58  255.16  43.26  |  PNID: -1 NB: 1 Mem: F CPU F

[18] 216.26  224.95  198.62   5.86  |  PNID:  0 NB: 1 Mem: 0 CPU 0
[19] 303.60  314.29  275.32   7.99  |  PNID:  0 NB: 1 Mem: 0 CPU 1
[20] 280.36  316.40  242.15  18.25  |  PNID:  0 NB: 1 Mem: 0 CPU F
[21] 301.17  315.03  283.77   8.07  |  PNID:  0 NB: 1 Mem: 1 CPU 0
[22] 209.34  218.63  187.69   9.11  |  PNID:  0 NB: 1 Mem: 1 CPU 1
[23] 342.34  369.42  311.99  12.79  |  PNID:  0 NB: 1 Mem: 1 CPU F
[24] 399.23  409.19  375.73   8.15  |  PNID:  0 NB: 1 Mem: F CPU 0
[25] 391.67  410.01  372.27  10.88  |  PNID:  0 NB: 1 Mem: F CPU 1
[26] 363.19  396.58  254.56  32.02  |  PNID:  0 NB: 1 Mem: F CPU F

[27] 215.29  224.59  193.76   8.16  |  PNID:  1 NB: 1 Mem: 0 CPU 0
[28] 300.19  312.95  280.26   9.32  |  PNID:  1 NB: 1 Mem: 0 CPU 1
[29] 340.97  362.79  323.94  10.69  |  PNID:  1 NB: 1 Mem: 0 CPU F
[30] 304.41  312.14  283.69   6.59  |  PNID:  1 NB: 1 Mem: 1 CPU 0
[31] 213.58  224.24  191.11   6.98  |  PNID:  1 NB: 1 Mem: 1 CPU 1
[32] 299.73  337.17  266.98  17.04  |  PNID:  1 NB: 1 Mem: 1 CPU F
[33] 395.56  411.33  359.70  12.24  |  PNID:  1 NB: 1 Mem: F CPU 0
[34] 398.52  409.42  377.33   7.28  |  PNID:  1 NB: 1 Mem: F CPU 1
[35] 355.64  377.61  279.13  26.71  |  PNID:  1 NB: 1 Mem: F CPU F

All data is present for completeness; however, the analysis can be
limited to comparing {00,01,02} (PNID=-1, NB=0), {09,10,11} (PNID=-1,
NB=1) and {18,19,20} (PNID=0, NB=1), all with Mem=0 and CPU = 0, 1 and
F respectively.

{00,09,18} are all basically the same when memory and CPU are both
pinned to the same node, as expected, since neither PNID nor NB should
affect scheduling in this case. We see the same pattern (near-equal
values) when memory and CPU are pinned to different nodes {01,10,19}.
The interesting analysis, in terms of the original problem (pinned RDMA
buffers, tasks floating), is how NB and PNID affect the case where
memory is pinned and the CPU is allowed to float. The base value {02}
(PNID=-1, NB=0) is quite a bit worse than when the CPU and memory are
pinned to different nodes. This is similar to the Intel case, where
allowing the load balancer to balance freely is worse than pinning
tasks and memory on different nodes. While this may simply be an
artifact of the micro-benchmark, given that the benchmark is really
just a sum of a large number of memory access times by the task, it is
representative of the NUMA awareness of the scheduler/load-balancer.
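
(For context, the measured quantity is essentially the kind of loop
sketched below. This is only an illustration of the access-timing idea,
not the actual probe code; the buffer, stride and function names are
made up.)

    #include <stdint.h>
    #include <stddef.h>
    #include <time.h>

    /* Illustrative only: time a large number of loads from 'buf' and
     * return the elapsed seconds for the pass. */
    static double probe_pass(volatile uint64_t *buf, size_t nwords)
    {
            struct timespec t0, t1;
            uint64_t sink = 0;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t i = 0; i < nwords; i++)
                    sink += buf[(i * 64) % nwords];  /* 512 B apart, so
                                                        successive loads hit
                                                        different cache lines */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            (void)sink;
            return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

With memory pinned to node 0, every such pass accumulates remote-access
latency whenever the scheduler has the task running off node 0.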

We do see that enabling NB (with the default values) provides some help
({11} 395.70 versus {02} 418.78) and that setting PNID to the node
where the memory resides provides a significant benefit ({20} 280.36
versus {11} 395.70 versus {02} 418.78). Unlike the prior Intel results,
where PNID=0, NB=1, Mem=0, CPU=F was generally faster than pinning to
the same node ({20} 129.20 versus {00} 136.5), on the AMD platform we
don't see the same level of improvement ({20} 280.36 versus {00} 215.78).

This can be explained by the relatively small number of CPUs per node
(16) and the fact that each node contains two 8-CPU LLCs.

Analysis:

As mentioned in the RFC, the entire micro-benchmark can be traced and
all migrations of the benchmark task tabulated. Obviously, a same-core
migration is also a same-llc migration, which in turn is also a
same-node migration. Cross-node migrations are, however, further broken
down into 'to node 0' and 'from node 0'.
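
A rough sketch of that classification follows, assuming each migration
event from the trace yields the source and destination CPU, and that
the cpu_core[]/cpu_llc[]/cpu_node[] maps (globally unique ids, filled
in from sysfs topology beforehand) are available. All names here are
illustrative, not the actual post-processing tooling.

    enum mig_class { MIG_SAMECORE, MIG_SAME_LLC, MIG_SAMENODE, MIG_CROSSNODE };

    extern int cpu_core[];  /* e.g. from cpuN/topology/core_id (made unique) */
    extern int cpu_llc[];   /* e.g. from cpuN/cache/index3/id */
    extern int cpu_node[];  /* CPU -> NUMA node */

    static enum mig_class classify(int src_cpu, int dst_cpu)
    {
            if (cpu_core[src_cpu] == cpu_core[dst_cpu])
                    return MIG_SAMECORE;   /* SMT sibling: same core/LLC/node */
            if (cpu_llc[src_cpu] == cpu_llc[dst_cpu])
                    return MIG_SAME_LLC;
            if (cpu_node[src_cpu] == cpu_node[dst_cpu])
                    return MIG_SAMENODE;
            return MIG_CROSSNODE;
    }

    /* Cross-node migrations are additionally counted as to/from node 0. */
    static void count_node0(int src_cpu, int dst_cpu,
                            unsigned long *to_0, unsigned long *from_0)
    {
            if (cpu_node[dst_cpu] == 0 && cpu_node[src_cpu] != 0)
                    (*to_0)++;
            if (cpu_node[src_cpu] == 0 && cpu_node[dst_cpu] != 0)
                    (*from_0)++;
    }

Each migration event bumps exactly one of the four counters, which is
where the per-category totals in the tables below come from.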


    {00}            CPU: 0, Mem: 0, NB=0, PNID=-1
--------------------------------------------------------------------
    num_migrations_samecore : 1823      num_migrations_samecore : 1683
    num_migrations_same_llc : 3455      num_migrations_same_llc : 3277
    num_migrations_samenode :  914      num_migrations_samenode : 1016
    num_migrations_crossnode:    1      num_migrations_crossnode:    1
      num_migrations_to_0   :    1        num_migrations_to_0   :    1
      num_migrations_from_0 :    0        num_migrations_from_0 :    0
    num_migrations: 6193                  num_migrations: 5977

    {01}            CPU: 1, Mem: 0, NB=0, PNID=-1
---------------------------------------------------------------------
    num_migrations_samecore : 2453      num_migrations_samecore : 2579
    num_migrations_same_llc : 4693      num_migrations_same_llc : 4735
    num_migrations_samenode : 1429      num_migrations_samenode : 1466
    num_migrations_crossnode:    1      num_migrations_crossnode:    1
      num_migrations_to_0   :    0        num_migrations_to_0   :    0
      num_migrations_from_0 :    1        num_migrations_from_0 :    1
    num_migrations: 8576                  num_migrations: 8781

In the two cases where both the task's CPU and the memory buffer are
pinned, we see no cross-node migrations (ignoring the first one, which
is needed to get onto the correct node in the first place because the
benchmark starts the task on a different node). Why pinning CPU and
memory on different nodes results in more migrations overall needs more
investigation, as this seems fairly consistent.

    {02}            CPU: F, Mem: 0, NB=0, PNID=-1
---------------------------------------------------------------------
    num_migrations_samecore : 1620      num_migrations_samecore : 1744
    num_migrations_same_llc : 3142      num_migrations_same_llc : 2818
    num_migrations_samenode :  853      num_migrations_samenode :  625
    num_migrations_crossnode: 6344      num_migrations_crossnode: 6778
      num_migrations_to_0   :  769        num_migrations_to_0   :  776
      num_migrations_from_0 :  769        num_migrations_from_0 :  777
    num_migrations: 11959                 num_migrations: 11965

    {11}            CPU: F, Mem: 0, NB=1, PNID=-1
---------------------------------------------------------------------
    num_migrations_samecore : 1966      num_migrations_samecore : 1963
    num_migrations_same_llc : 2803      num_migrations_same_llc : 3314
    num_migrations_samenode :  514      num_migrations_samenode :  721
    num_migrations_crossnode: 6833      num_migrations_crossnode: 6618
      num_migrations_to_0   :  818        num_migrations_to_0   :  630
      num_migrations_from_0 :  818        num_migrations_from_0 :  630
    num_migrations: 12116                 num_migrations: 12616

From the data table, we see that {02} is slightly slower than {11} even
though {11} shows more total migrations. Ultimately, what matters to
the total time is how much time the task spent running on node 0.
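
The same trace can, in principle, give that number directly: walking
the events in time order and attributing each interval to the node of
the CPU the task was on yields a node-0 residency figure. A sketch
(illustrative names again, not the actual tooling):

    /* Accumulate how long the task ran on node-0 CPUs vs. elsewhere
     * between two consecutive trace events. */
    extern int cpu_node[];          /* CPU -> NUMA node */

    struct residency {
            double on_node0;        /* seconds spent on node 0 */
            double off_node0;       /* seconds spent on other nodes */
    };

    static void account(struct residency *r, int cur_cpu,
                        double prev_ts, double now_ts)
    {
            double delta = now_ts - prev_ts;

            if (cpu_node[cur_cpu] == 0)
                    r->on_node0 += delta;
            else
                    r->off_node0 += delta;
    }

That residency split, rather than the raw migration counts, should
track the Avg column more closely.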

    {20}            CPU: F, Mem: 0, NB=1, PNID=0
---------------------------------------------------------------------
    num_migrations_samecore : 1706      num_migrations_samecore : 1663
    num_migrations_same_llc : 2185      num_migrations_same_llc : 2816
    num_migrations_samenode :  591      num_migrations_samenode :  980
    num_migrations_crossnode: 4621      num_migrations_crossnode: 4243
      num_migrations_to_0   :  480        num_migrations_to_0   :  419
      num_migrations_from_0 :  480        num_migrations_from_0 :  418
    num_migrations: 9103                  num_migrations: 9702

The trace results here are more representative of the observed
performance improvement. The number of cross-node migrations is
significantly lower, and the number of migrations away from node 0 is
much smaller.

In summary, the data (relevant rows copied below) shows that setting a
task's numa_preferred_nid results in a sizable improvement in
completion times.

[00] 215.78  223.77  195.02   7.60  |  PNID: -1 NB: 0 Mem: 0 CPU 0
[01] 299.77  307.21  282.93   6.60  |  PNID: -1 NB: 0 Mem: 0 CPU 1
[02] 418.78  449.45  387.53  15.64  |  PNID: -1 NB: 0 Mem: 0 CPU F
[11] 395.70  425.22  340.33  21.54  |  PNID: -1 NB: 1 Mem: 0 CPU F
[20] 280.36  316.40  242.15  18.25  |  PNID:  0 NB: 1 Mem: 0 CPU F