Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing

From: Huang, Ying
Date: Thu Mar 02 2023 - 03:11:17 EST


Bharata B Rao <bharata@xxxxxxx> writes:

> On 27-Feb-23 1:24 PM, Huang, Ying wrote:
>> Thank you very much for detailed data. Can you provide some analysis
>> for your data?
>
> The overhead numbers I shared earlier weren't correct: I realized
> that while obtaining those numbers from function_graph tracing,
> the trace buffer was silently being overrun. I had to reduce the
> number of memory access iterations to ensure that I captured the
> full trace. I will summarize the findings based on these new
> numbers below.
>
> Just to recap - the microbenchmark is run on an AMD Genoa
> two-node system. The benchmark has two sets of threads
> (one set affined to each node) accessing two different chunks
> of memory (chunk size 8G), both initially allocated
> on the first node. The benchmark touches each page in its
> chunk iteratively for a fixed number of iterations (384
> in the case shown below). The benchmark score is the
> amount of time it takes to complete the specified number
> of accesses.
>
> Here is the data for the benchmark run:
>
> Time taken or overhead (us) for fault, task_work and sched_switch
> handling:
>
>                        Default      IBS
> Fault handling         2875354862   2602455
> Task work handling     139023       24008121
> Sched switch handling  -            37712
> Total overhead         2875493885   26648288
>
> Default
> -------
>                       Total       Min   Max      Avg
> do_numa_page          2875354862  0.08  392.13   22.11
> task_numa_work        139023      0.14  5365.77  532.66
> Total                 2875493885
>
> IBS
> ---
>                       Total       Min   Max      Avg
> ibs_overflow_handler  2602455     0.14  103.91   1.29
> task_ibs_access_work  24008121    0.17  485.09   37.65
> hw_access_sched_in    37712       0.15  287.55   1.35
> Total                 26648288
>
>
>                         Default      IBS
> Benchmark score(us)     160171762.0  40323293.0
> numa_pages_migrated     2097220      511791
> Overhead per page       1371         52
> Pages migrated per sec  13094        12692
> numa_hint_faults_local  2820311      140856
> numa_hint_faults        38589520     652647

For default, numa_hint_faults >> numa_pages_migrated. That's hard to
understand. I guess there aren't many shared pages in the benchmark?
And I guess there are also enough free pages on the target node?

> hint_faults_local/hint_faults 7% 22%
>
> Here is the summary:
>
> - In case of IBS, the benchmark completes 75% faster compared to
> the default case. The gain varies based on how many iterations of
> memory accesses we run as part of the benchmark. For 2048 iterations
> of accesses, I have seen a gain of around 50%.
> - The overhead of NUMA balancing (as measured by the time taken in
> the fault handling, task_work time handling and sched_switch time
> handling) in the default case is seen to be pretty high compared to
> the IBS case.
> - The number of hint-faults in the default case is significantly
> higher than the IBS case.
> - The local hint-faults percentage is much better in the IBS
> case compared to the default case.
> - As shown in the graphs (in other threads of this mail thread), in
> the default case, the page migrations start a bit slowly while IBS
> case shows steady migrations right from the start.
> - I have also shown (via graphs in other threads of this mail thread)
> that in the IBS case the benchmark is able to steadily increase
> the access iterations over time, while in the default case, the
> benchmark doesn't make forward progress for a long time after
> an initial increase.

This is hard to understand too. Pages are migrated to the local node,
but performance doesn't improve.

> - Early migrations due to relevant access sampling from IBS,
> is most probably the significant reason for the uplift that IBS
> case gets.

In the original kernel, NUMA page table scanning is delayed for a
while. Please check the below comment in task_tick_numa():

        /*
         * Using runtime rather than walltime has the dual advantage that
         * we (mostly) drive the selection from busy threads and that the
         * task needs to have done some actual work before we bother with
         * NUMA placement.
         */

I think this is generally reasonable, though it's not optimal for this
micro-benchmark.

Best Regards,
Huang, Ying

> - It is consistently seen that the benchmark in the IBS case manages
> to complete the specified number of accesses even before the entire
> chunk of memory gets migrated. The early migrations are offsetting
> the cost of remote accesses too.
> - In the IBS case, we re-program the IBS counters for the incoming
> task in the sched_switch path. It is seen that this overhead isn't
> that significant to slow down the benchmark.
> - One of the differences between the default case and the IBS case
> is about when the faults-since-last-scan is updated/folded into the
> historical faults stats and subsequent scan period update. Since we
> don't have the notion of scanning in IBS, I have a threshold (number
> of access faults) to determine when to update the historical faults
> and the IBS sample period. I need to check if quicker migrations
> could result from this change.
> - Finally, all this is for the above-mentioned microbenchmark. The
> gains on other benchmarks are yet to be evaluated.
>
> Regards,
> Bharata.