Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing

From: Huang, Ying
Date: Fri Mar 03 2023 - 00:56:16 EST


Bharata B Rao <bharata@xxxxxxx> writes:

> On 02-Mar-23 1:40 PM, Huang, Ying wrote:
>> Bharata B Rao <bharata@xxxxxxx> writes:
>>>
>>> Here is the data for the benchmark run:
>>>
>>> Time taken or overhead (us) for fault, task_work and sched_switch
>>> handling:
>>>
>>>                        Default     IBS
>>> Fault handling         2875354862  2602455
>>> Task work handling     139023      24008121
>>> Sched switch handling  -           37712
>>> Total overhead         2875493885  26648288
>>>
>>> Default
>>> -------
>>>                 Total       Min   Max      Avg
>>> do_numa_page    2875354862  0.08  392.13   22.11
>>> task_numa_work  139023      0.14  5365.77  532.66
>>> Total           2875493885
>>>
>>> IBS
>>> ---
>>>                       Total     Min   Max     Avg
>>> ibs_overflow_handler  2602455   0.14  103.91  1.29
>>> task_ibs_access_work  24008121  0.17  485.09  37.65
>>> hw_access_sched_in    37712     0.15  287.55  1.35
>>> Total                 26648288
>>>
>>>
>>>                         Default      IBS
>>> Benchmark score (us)    160171762.0  40323293.0
>>> numa_pages_migrated     2097220      511791
>>> Overhead per page (us)  1371         52
>>> Pages migrated per sec  13094        12692
>>> numa_hint_faults_local  2820311      140856
>>> numa_hint_faults        38589520     652647
>>
>> For the default case, numa_hint_faults >> numa_pages_migrated. That
>> is hard to understand.
>
> Most of the migration requests from the numa hint page fault path
> are failing due to failure to isolate the pages.
>
> This is the check in migrate_misplaced_page() from where it returns
> without even trying to do the subsequent migrate_pages() call:
>
>     isolated = numamigrate_isolate_page(pgdat, page);
>     if (!isolated)
>             goto out;
>
> I will further investigate this.
>
>> I guess that there aren't many shared pages in the
>> benchmark?
>
> I have a version of the benchmark which has a fraction of shared
> memory between sets of threads, in addition to the per-set exclusive
> memory. The same performance difference is seen there too.
>
>> And I guess that the free pages in the target node is enough
>> too?
>
> The benchmark uses 16G in total, with 8G being accessed by threads
> on each node. There is enough memory on the target node to accept
> the incoming page migration requests.
>
>>
>>> hint_faults_local/hint_faults 7% 22%
>>>
>>> Here is the summary:
>>>
>>> - In case of IBS, the benchmark completes 75% faster compared to
>>> the default case. The gain varies based on how many iterations of
>>> memory accesses we run as part of the benchmark. For 2048 iterations
>>> of accesses, I have seen a gain of around 50%.
>>> - The overhead of NUMA balancing (as measured by the time spent in
>>> fault handling, task_work handling and sched_switch handling) in the
>>> default case is much higher than in the IBS case.
>>> - The number of hint-faults in the default case is significantly
>>> higher than the IBS case.
>>> - The local hint-faults percentage is much better in the IBS
>>> case compared to the default case.
>>> - As shown in the graphs (in other threads of this mail thread), in
>>> the default case the page migrations start slowly, while the IBS
>>> case shows steady migrations right from the start.
>>> - I have also shown (via graphs in other threads of this mail thread)
>>> that in the IBS case the benchmark is able to steadily increase the
>>> access iterations over time, while in the default case the benchmark
>>> doesn't make forward progress for a long time after an initial
>>> increase.
>>
>> This is hard to understand too. Pages are migrated to the local
>> node, but performance doesn't improve.
>
> Migrations start a bit late, and too much time is spent later in the
> run on hint faults and on migration attempts that fail (due to the
> failure to isolate pages); that is probably the reason.
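As a quick sanity check, the derived rows of the comparison table can be
recomputed from the raw counters quoted earlier in this thread (a small
sketch; all figures are taken from the tables above):

```python
# Raw per-handler overheads (us) and migration counts from the tables above.
default_overhead_us = 2875354862 + 139023       # do_numa_page + task_numa_work
ibs_overhead_us = 2602455 + 24008121 + 37712    # IBS handlers incl. sched_in

default_migrated, ibs_migrated = 2097220, 511791

print(default_overhead_us // default_migrated)  # -> 1371 us per migrated page
print(ibs_overhead_us // ibs_migrated)          # -> 52 us per migrated page

# Local hint fault percentage, default vs IBS.
print(round(100 * 2820311 / 38589520))          # -> 7
print(round(100 * 140856 / 652647))             # -> 22
```

The recomputed values match the "Overhead per page" and
"hint_faults_local/hint_faults" rows reported above.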
>>
>>> - Early migrations due to relevant access sampling from IBS,
>>> is most probably the significant reason for the uplift that IBS
>>> case gets.
>>
>> In the original kernel, NUMA page table scanning is delayed for a
>> while. Please check the below comments in task_tick_numa():
>>
>> /*
>>  * Using runtime rather than walltime has the dual advantage that
>>  * we (mostly) drive the selection from busy threads and that the
>>  * task needs to have done some actual work before we bother with
>>  * NUMA placement.
>>  */
>>
>> I think this is generally reasonable, though it's not ideal for this
>> micro-benchmark.
>
> This is in addition to the initial scan delay imposed via
> sysctl_numa_balancing_scan_delay. I have an equivalent of that delay
> in the IBS case: access sampling is not started for a task until a
> similar initial delay has passed.

What is the memory access pattern of the workload? Uniform random, or
something like a Gaussian distribution?

Anyway, it may take some time for the original method to scan enough of
the memory space to trigger enough hint page faults. We can check
numa_pte_updates to see whether enough virtual address space has been
scanned.
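A small helper for watching those counters; this is only a sketch,
assuming a Linux system with CONFIG_NUMA_BALANCING, where counters such
as numa_pte_updates and numa_pages_migrated appear in /proc/vmstat:

```python
# Snapshot and diff the NUMA balancing counters from /proc/vmstat.
def read_numa_counters(path="/proc/vmstat"):
    """Return a dict of the numa_* counters found in a vmstat-style file."""
    counters = {}
    with open(path) as f:
        for line in f:
            name, value = line.split()
            if name.startswith("numa_"):
                counters[name] = int(value)
    return counters

def delta(before, after):
    """Per-counter difference between two snapshots."""
    return {name: after[name] - before[name]
            for name in before if name in after}
```

Taking two snapshots a few seconds apart and printing delta(before,
after) shows whether numa_pte_updates keeps growing while
numa_pages_migrated stays flat, i.e. whether the scanner has covered
enough virtual space without producing successful migrations.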

Best Regards,
Huang, Ying