Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing

From: Bharata B Rao
Date: Wed Mar 01 2023 - 06:21:48 EST


On 27-Feb-23 1:24 PM, Huang, Ying wrote:
> Thank you very much for detailed data. Can you provide some analysis
> for your data?

The overhead numbers I shared earlier weren't correct: I realized
that while obtaining those numbers from function_graph tracing,
the trace buffer was silently getting overrun. I had to reduce
the number of memory access iterations to ensure that the
complete trace fits in the buffer. I will be summarizing the
findings based on these new numbers below.

Just to recap - the microbenchmark is run on an AMD Genoa
two-node system. The benchmark has two sets of threads
(one set affined to each node) accessing two different chunks
of memory (chunk size 8G) which are initially allocated
on the first node. The benchmark touches each page in the
chunk iteratively for a fixed number of iterations (384
for the run reported below). The benchmark score is the
amount of time it takes to complete the specified number
of accesses.

Here is the data for the benchmark run:

Time taken or overhead (us) for fault, task_work and sched_switch
handling

                        Default       IBS
Fault handling          2875354862    2602455
Task work handling      139023        24008121
Sched switch handling                 37712
Total overhead          2875493885    26648288

Default
-------
                  Total         Min     Max       Avg
do_numa_page      2875354862    0.08    392.13    22.11
task_numa_work    139023        0.14    5365.77   532.66
Total             2875493885

IBS
---
                        Total       Min     Max       Avg
ibs_overflow_handler    2602455     0.14    103.91    1.29
task_ibs_access_work    24008121    0.17    485.09    37.65
hw_access_sched_in      37712       0.15    287.55    1.35
Total                   26648288


                                 Default        IBS
Benchmark score (us)             160171762.0    40323293.0
numa_pages_migrated              2097220        511791
Overhead per page (us)           1371           52
Pages migrated per sec           13094          12692
numa_hint_faults_local           2820311        140856
numa_hint_faults                 38589520       652647
hint_faults_local/hint_faults    7%             22%
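
The derived rows in the table above (overhead per page, pages
migrated per sec, local hint-fault ratio) follow from the raw
counters. The short Python sketch below recomputes them as a
cross-check. The dictionary keys and the helper name are made up
for this illustration; only the numbers come from the tables in
this mail.

```python
# Raw figures copied from the tables in this mail (times in microseconds).
runs = {
    "Default": {
        "score_us": 160171762.0,
        "overhead_us": 2875493885,
        "pages_migrated": 2097220,
        "hint_faults": 38589520,
        "hint_faults_local": 2820311,
    },
    "IBS": {
        "score_us": 40323293.0,
        "overhead_us": 26648288,
        "pages_migrated": 511791,
        "hint_faults": 652647,
        "hint_faults_local": 140856,
    },
}


def derived(run):
    """Return the derived table rows for one run (hypothetical helper)."""
    return {
        # Total NUMA balancing overhead divided by pages migrated.
        "overhead_per_page_us": run["overhead_us"] / run["pages_migrated"],
        # Pages migrated divided by benchmark run time in seconds.
        "pages_per_sec": run["pages_migrated"] / (run["score_us"] / 1e6),
        # Share of hint faults that were node-local, in percent.
        "local_ratio_pct": 100.0 * run["hint_faults_local"] / run["hint_faults"],
    }


results = {name: derived(run) for name, run in runs.items()}

# Reduction in benchmark run time with IBS relative to the default case.
time_reduction_pct = 100.0 * (
    1 - runs["IBS"]["score_us"] / runs["Default"]["score_us"]
)

for name, row in results.items():
    print(name, {key: round(value) for key, value in row.items()})
print("run time reduction (%):", round(time_reduction_pct))
```

Rounded to whole numbers, the computed values agree with the
rows reported in the table.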

Here is the summary:

- In the IBS case, the benchmark completes in about 75% less time
than in the default case (40323293 us vs 160171762 us). The gain
varies based on how many iterations of memory accesses we run as
part of the benchmark. For 2048 iterations of accesses, I have
seen a gain of around 50%.
- The overhead of NUMA balancing (as measured by the time taken in
fault handling, task_work handling and sched_switch handling) in
the default case is roughly 108 times that of the IBS case
(2875493885 us vs 26648288 us).
- The number of hint-faults in the default case is about 59 times
higher than in the IBS case (38589520 vs 652647).
- The local hint-faults percentage is much better in the IBS
case (22%) compared to the default case (7%).
- As shown in the graphs (posted elsewhere in this mail thread), in
the default case the page migrations start a bit slowly, while the
IBS case shows steady migrations right from the start.
- I have also shown (via graphs posted elsewhere in this mail
thread) that in the IBS case the benchmark is able to steadily
increase the access iterations over time, while in the default
case the benchmark makes no forward progress for a long time after
an initial increase.
- Early migrations, driven by relevant access sampling from IBS,
are most probably the main reason for the uplift that the IBS
case gets.
- It is consistently seen that the benchmark in the IBS case manages
to complete the specified number of accesses even before the entire
chunk of memory gets migrated. The early migrations are offsetting
the cost of remote accesses too.
- In the IBS case, we re-program the IBS counters for the incoming
task in the sched_switch path. It is seen that this overhead isn't
significant enough to slow down the benchmark.
- One of the differences between the default case and the IBS case
is when the faults-since-last-scan count is folded into the
historical faults stats and when the scan period is subsequently
updated. Since we don't have the notion of scanning in IBS, I have
a threshold (number of access faults) to determine when to update
the historical faults and the IBS sample period. I need to check
if quicker migrations could result from this change.
- Finally, all this is for the above-mentioned microbenchmark. The
gains on other benchmarks are yet to be evaluated.
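
The threshold-based folding mentioned in the list above can be
modelled roughly as follows. This is only an illustrative sketch
and not code from the patchset: the class and field names, the
threshold value used in the example, and the halve-then-add decay
of the historical counts are all assumptions made for this
example.

```python
from dataclasses import dataclass


@dataclass
class AccessStats:
    """Hypothetical per-task bookkeeping; all names are illustrative."""

    fold_threshold: int        # assumed: samples to collect before folding
    recent: int = 0            # access samples since the last fold
    recent_local: int = 0      # how many of those were node-local
    hist_total: float = 0.0    # decayed historical sample count
    hist_local: float = 0.0    # decayed historical node-local count
    folds: int = 0             # number of folds performed so far

    def record(self, is_local: bool) -> bool:
        """Account one access sample; return True if a fold happened."""
        self.recent += 1
        self.recent_local += int(is_local)
        if self.recent < self.fold_threshold:
            return False
        # Assumed decay policy: halve the history, then add the window.
        self.hist_total = self.hist_total / 2 + self.recent
        self.hist_local = self.hist_local / 2 + self.recent_local
        self.recent = 0
        self.recent_local = 0
        self.folds += 1
        # A real implementation would also revisit the sample period here.
        return True


# Example: four remote samples followed by four local samples.
stats = AccessStats(fold_threshold=4)
for is_local in [False] * 4 + [True] * 4:
    stats.record(is_local)
print(stats.folds, stats.hist_total, stats.hist_local)
```

The point of the sketch is only that the fold is triggered by a
sample count rather than by the completion of a scan pass, so a
smaller threshold makes the history (and anything derived from
it) react sooner.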

Regards,
Bharata.