Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

From: K Prateek Nayak
Date: Thu Sep 14 2023 - 00:14:12 EST


Hello Chenyu,

On 9/13/2023 8:27 AM, Chen Yu wrote:
> On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 9/12/2023 6:02 PM, Chen Yu wrote:
>>> [..snip..]
>>>
>>>>> If I understand correctly, WF_SYNC is to let the wakee to woken up
>>>>> on the waker's CPU, rather than the wakee's previous CPU, because
>>>>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
>>>>> wakee's previous CPU. We can only restrict that other wakee does not
>>>>> occupy the previous CPU, but do not enhance the possibility that
>>>>> wake_affine_idle() chooses the previous CPU.
>>>>
>>>> Correct me if I'm wrong here,
>>>>
>>>> Say a short sleeper, is always woken up using WF_SYNC flag. When the
>>>> task is dequeued, we mark the previous CPU where it ran as "cache-hot"
>>>> and restrict any wakeup happening until the "cache_hot_timeout" is
>>>> crossed. Let us assume a perfect world where the task wakes up before
>>>> the "cache_hot_timeout" expires. Logically this CPU was reserved all
>>>> this while for the short sleeper but since the wakeup bears WF_SYNC
>>>> flag, the whole reservation is ignored and waker's LLC is explored.
>>>>
>>>
>>> Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle()
>>> forces the short sleeping wakee to be woken up on waker's CPU rather the
>>> wakee's previous CPU, but wakee's previous has been marked as cache hot
>>> for nothing?
>>
>> Precisely :)
>>
>>>
>>>> Should the timeout be cleared if the wakeup decides to not target the
>>>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
>>>> small enough to curb any side effect that could possibly show here but
>>>> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
>>>> a larger value, the wakeup path might be affected where lot of idle
>>>> targets are overlooked since the CPUs are marked cache-hot forr longer
>>>> duration)
>>>>
>>>> Let me know what you think.
>>>>
>>>
>>> This makes sense. In theory the above logic can be added in
>>> select_idle_sibling(), if target CPU is chosen rather than
>>> the previous CPU, the previous CPU's cache hot flag should be
>>> cleared.
>>>
>>> But this might bring overhead. Because we need to grab the rq
>>> lock and write to other CPU's rq, which could be costly. It
>>> seems to be a trade-off of current implementation.
>>
>> I agree, it will not be pretty. Maybe the other way is to have a
>> history of the type of wakeup the task experiences (similar to
>> wakee_flips but for sync and non-syn wakeups) and only reserve
>> the CPU if the task wakes up more via non-sync wakeups? Thinking
>> out loud here.
>>
>
> This looks good to consider the task's attribute, or maybe something
> like this:
>
> new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> if (new_cpu != prev_cpu)
> p->burst_sleep_avg >>= 1;
> So the duration of reservation could be shrinked.

That seems like a good approach.

Meanwhile, here is result for the current series without any
modifications:

tl;dr

- There seems to be a noticeable increase in hackbench runtime with a
single group but big gains beyond that. The regression could possibly
be because of added searching but let me do some digging to confirm
that.

- Small regressions (~2%) noticed in ycsb-mongodb (medium utilization)
and DeathStarBench (High Utilization)

- Other benchmarks are more of less perf neutral with the changes.

More information below:

o System information

- Dual socket 3rd Generation EPYC System (2 x 64C/128T)
- NPS1 mode (each socket is a NUMA node)
- Boost Enabled
- C2 disabled (MWAIT based C1 is still enabled)


o Kernel information

base : tip:sched/core at commit b41bbb33cf75 ("Merge branch
'sched/eevdf' into sched/core")
+ cheery-pick commit 63304558ba5d ("sched/eevdf: Curb
wakeup-preemption")

SIS_CACHE : base
+ this series as is


o Benchmark results

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: base[pct imp](CV) SIS_CACHE[pct imp](CV)
1-groups 1.00 [ -0.00]( 1.89) 1.10 [-10.28]( 2.03)
2-groups 1.00 [ -0.00]( 2.04) 0.98 [ 1.57]( 2.04)
4-groups 1.00 [ -0.00]( 2.38) 0.95 [ 4.70]( 0.88)
8-groups 1.00 [ -0.00]( 1.52) 0.93 [ 7.18]( 0.76)
16-groups 1.00 [ -0.00]( 3.44) 0.90 [ 9.76]( 1.04)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) SIS_CACHE[pct imp](CV)
1 1.00 [ 0.00]( 0.18) 0.98 [ -1.61]( 0.27)
2 1.00 [ 0.00]( 0.63) 0.98 [ -1.58]( 0.09)
4 1.00 [ 0.00]( 0.86) 0.99 [ -0.52]( 0.42)
8 1.00 [ 0.00]( 0.22) 0.98 [ -1.77]( 0.65)
16 1.00 [ 0.00]( 1.99) 1.00 [ -0.10]( 1.55)
32 1.00 [ 0.00]( 4.29) 0.98 [ -1.73]( 1.55)
64 1.00 [ 0.00]( 1.71) 0.97 [ -2.77]( 3.74)
128 1.00 [ 0.00]( 0.65) 1.00 [ -0.14]( 0.88)
256 1.00 [ 0.00]( 0.19) 0.97 [ -2.65]( 0.49)
512 1.00 [ 0.00]( 0.20) 0.99 [ -1.10]( 0.33)
1024 1.00 [ 0.00]( 0.29) 0.99 [ -0.70]( 0.16)


==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: base[pct imp](CV) SIS_CACHE[pct imp](CV)
Copy 1.00 [ 0.00]( 4.32) 0.90 [ -9.82](10.72)
Scale 1.00 [ 0.00]( 5.21) 1.01 [ 0.59]( 1.83)
Add 1.00 [ 0.00]( 6.25) 0.99 [ -0.91]( 4.49)
Triad 1.00 [ 0.00](10.74) 1.02 [ 2.28]( 6.07)


==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: base[pct imp](CV) SIS_CACHE[pct imp](CV)
Copy 1.00 [ 0.00]( 0.70) 0.98 [ -1.79]( 2.26)
Scale 1.00 [ 0.00]( 6.55) 1.03 [ 2.80]( 0.74)
Add 1.00 [ 0.00]( 6.53) 1.02 [ 2.05]( 1.82)
Triad 1.00 [ 0.00]( 6.66) 1.04 [ 3.54]( 1.04)


==================================================================
Test : netperf
Units : Normalized Througput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) SIS_CACHE[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.46) 0.99 [ -0.55]( 0.49)
2-clients 1.00 [ 0.00]( 0.38) 0.99 [ -1.23]( 1.19)
4-clients 1.00 [ 0.00]( 0.72) 0.98 [ -1.91]( 1.21)
8-clients 1.00 [ 0.00]( 0.98) 0.98 [ -1.61]( 1.08)
16-clients 1.00 [ 0.00]( 0.70) 0.98 [ -1.80]( 1.04)
32-clients 1.00 [ 0.00]( 0.74) 0.98 [ -1.55]( 1.20)
64-clients 1.00 [ 0.00]( 2.24) 1.00 [ -0.04]( 2.77)
128-clients 1.00 [ 0.00]( 1.72) 1.03 [ 3.22]( 1.99)
256-clients 1.00 [ 0.00]( 4.44) 0.99 [ -1.33]( 4.71)
512-clients 1.00 [ 0.00](52.42) 0.98 [ -1.61](52.72)


==================================================================
Test : schbench (old)
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: base[pct imp](CV) SIS_CACHE[pct imp](CV)
1 1.00 [ -0.00]( 2.28) 0.96 [ 4.00](15.68)
2 1.00 [ -0.00]( 6.42) 1.00 [ -0.00](10.96)
4 1.00 [ -0.00]( 3.77) 0.97 [ 3.33]( 7.61)
8 1.00 [ -0.00](13.83) 1.08 [ -7.89]( 2.86)
16 1.00 [ -0.00]( 4.37) 1.00 [ -0.00]( 2.13)
32 1.00 [ -0.00]( 8.69) 0.95 [ 4.94]( 2.73)
64 1.00 [ -0.00]( 2.30) 1.05 [ -5.13]( 1.26)
128 1.00 [ -0.00](12.12) 1.03 [ -3.41]( 5.08)
256 1.00 [ -0.00](26.04) 0.91 [ 8.88]( 2.59)
512 1.00 [ -0.00]( 5.62) 0.97 [ 3.32]( 0.37)


==================================================================
Test : Unixbench
Units : Various, Throughput
Interpretation: Higher is better
Statistic : AMean, Hmean (Specified)
==================================================================
Metric variant base SIS_CACHE
Hmean unixbench-dhry2reg-1 41248390.97 ( 0.00%) 41485503.82 ( 0.57%)
Hmean unixbench-dhry2reg-512 6239969914.15 ( 0.00%) 6233919689.40 ( -0.10%)
Amean unixbench-syscall-1 2968518.27 ( 0.00%) 2841236.43 * 4.29%*
Amean unixbench-syscall-512 7790656.20 ( 0.00%) 7631558.00 * 2.04%*
Hmean unixbench-pipe-1 2535689.01 ( 0.00%) 2598208.16 * 2.47%*
Hmean unixbench-pipe-512 361385055.25 ( 0.00%) 368566373.76 * 1.99%*
Hmean unixbench-spawn-1 4506.26 ( 0.00%) 4551.67 ( 1.01%)
Hmean unixbench-spawn-512 69380.09 ( 0.00%) 69264.30 ( -0.17%)
Hmean unixbench-execl-1 3824.57 ( 0.00%) 3822.67 ( -0.05%)
Hmean unixbench-execl-512 12288.64 ( 0.00%) 11728.12 ( -4.56%)


==================================================================
Test : ycsb-mongodb
Units : Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
base : 309589.33 (var: 1.41%)
SIS_CACHE : 304931.33 (var: 1.29%) [diff: -1.50%]


==================================================================
Test : DeathStarBench
Units : Normalized Throughput, relative to base
Interpretation: Higher is better
Statistic : AMean
==================================================================
Pinning base SIS_CACHE
1 CCD 100% 99.18% [%diff: -0.82%]
2 CCD 100% 97.46% [%diff: -2.54%]
4 CCD 100% 97.22% [%diff: -2.78%]
8 CCD 100% 99.01% [%diff: -0.99%]

--

Regression observed could either be because of the larger search time to
find a non cache-hot idle CPU, or perhaps just the larger search time in
general adding to utilization and curbing the SIS_UTIL limits further.
I'll go gather some stats to back my suspicion (particularly for
hackbench).

>
> [..snip..]

--
Thanks and Regards,
Prateek