Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

From: Chen Yu
Date: Thu Sep 14 2023 - 07:01:49 EST


Hi Prateek,

Thanks for the test.

On 2023-09-14 at 09:43:52 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 9/13/2023 8:27 AM, Chen Yu wrote:
> > On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote:
> >> Hello Chenyu,
> >>
> >> On 9/12/2023 6:02 PM, Chen Yu wrote:
> >>> [..snip..]
> >>>
> >>>>> If I understand correctly, WF_SYNC is to let the wakee be woken up
> >>>>> on the waker's CPU, rather than the wakee's previous CPU, because
> >>>>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> >>>>> wakee's previous CPU. We can only restrict that other wakee does not
> >>>>> occupy the previous CPU, but do not enhance the possibility that
> >>>>> wake_affine_idle() chooses the previous CPU.
> >>>>
> >>>> Correct me if I'm wrong here,
> >>>>
> >>>> Say a short sleeper, is always woken up using WF_SYNC flag. When the
> >>>> task is dequeued, we mark the previous CPU where it ran as "cache-hot"
> >>>> and restrict any wakeup happening until the "cache_hot_timeout" is
> >>>> crossed. Let us assume a perfect world where the task wakes up before
> >>>> the "cache_hot_timeout" expires. Logically this CPU was reserved all
> >>>> this while for the short sleeper but since the wakeup bears WF_SYNC
> >>>> flag, the whole reservation is ignored and waker's LLC is explored.
> >>>>
> >>>
> >>> Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle()
> >>> forces the short sleeping wakee to be woken up on waker's CPU rather than the
> >>> wakee's previous CPU, but wakee's previous CPU has been marked as cache hot
> >>> for nothing?
> >>
> >> Precisely :)
> >>
> >>>
> >>>> Should the timeout be cleared if the wakeup decides to not target the
> >>>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
> >>>> small enough to curb any side effect that could possibly show here but
> >>>> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
> >>>> a larger value, the wakeup path might be affected where a lot of idle
> >>>> targets are overlooked since the CPUs are marked cache-hot for a longer
> >>>> duration)
> >>>>
> >>>> Let me know what you think.
> >>>>
> >>>
> >>> This makes sense. In theory the above logic can be added in
> >>> select_idle_sibling(), if target CPU is chosen rather than
> >>> the previous CPU, the previous CPU's cache hot flag should be
> >>> cleared.
> >>>
> >>> But this might bring overhead. Because we need to grab the rq
> >>> lock and write to other CPU's rq, which could be costly. It
> >>> seems to be a trade-off of current implementation.
> >>
> >> I agree, it will not be pretty. Maybe the other way is to have a
> >> history of the type of wakeup the task experiences (similar to
> >> wakee_flips but for sync and non-sync wakeups) and only reserve
> >> the CPU if the task wakes up more via non-sync wakeups? Thinking
> >> out loud here.
> >>
> >
> > This looks good to consider the task's attribute, or maybe something
> > like this:
> >
> > new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > if (new_cpu != prev_cpu)
> > 	p->burst_sleep_avg >>= 1;
> >
> > So the duration of the reservation could be shrunk.
>
> That seems like a good approach.
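
For reference, a rough user-space sketch of that decay, not kernel code.
struct task_model and account_wakeup_cpu() are made-up names for
illustration, and treating burst_sleep_avg as nanoseconds is only an
assumption. Each wakeup that lands away from prev_cpu halves the average,
so a task that keeps being pulled to the waker's CPU quickly ends up with a
very short reservation:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the task fields used in this sketch. */
struct task_model {
	uint64_t burst_sleep_avg;	/* average short-sleep duration (assumed ns) */
};

/*
 * Run after CPU selection: if the task was not placed on prev_cpu, the
 * reservation on prev_cpu was wasted, so halve the average behind it.
 */
static void account_wakeup_cpu(struct task_model *p, int prev_cpu, int new_cpu)
{
	assert(p);
	if (new_cpu != prev_cpu)
		p->burst_sleep_avg >>= 1;
}
```
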
>
> Meanwhile, here is result for the current series without any
> modifications:
>
> tl;dr
>
> - There seems to be a noticeable increase in hackbench runtime with a
> single group but big gains beyond that. The regression could possibly
> be because of added searching but let me do some digging to confirm
> that.

Ah, OK. May I have the command you used to run the 1-group hackbench test?

>
> - Small regressions (~2%) noticed in ycsb-mongodb (medium utilization)
> and DeathStarBench (High Utilization)
>
> - Other benchmarks are more or less perf neutral with the changes.
>
> More information below:
>
> o System information
>
> - Dual socket 3rd Generation EPYC System (2 x 64C/128T)
> - NPS1 mode (each socket is a NUMA node)
> - Boost Enabled
> - C2 disabled (MWAIT based C1 is still enabled)
>
>
> o Kernel information
>
> base : tip:sched/core at commit b41bbb33cf75 ("Merge branch
> 'sched/eevdf' into sched/core")
> + cherry-pick commit 63304558ba5d ("sched/eevdf: Curb
> wakeup-preemption")
>
> SIS_CACHE : base
> + this series as is
>
>
> o Benchmark results
>
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> Case: base[pct imp](CV) SIS_CACHE[pct imp](CV)
> 1-groups 1.00 [ -0.00]( 1.89) 1.10 [-10.28]( 2.03)
> 2-groups 1.00 [ -0.00]( 2.04) 0.98 [ 1.57]( 2.04)
> 4-groups 1.00 [ -0.00]( 2.38) 0.95 [ 4.70]( 0.88)
> 8-groups 1.00 [ -0.00]( 1.52) 0.93 [ 7.18]( 0.76)
> 16-groups 1.00 [ -0.00]( 3.44) 0.90 [ 9.76]( 1.04)
>
>
> ==================================================================
> Test : tbench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: base[pct imp](CV) SIS_CACHE[pct imp](CV)
> 1 1.00 [ 0.00]( 0.18) 0.98 [ -1.61]( 0.27)
> 2 1.00 [ 0.00]( 0.63) 0.98 [ -1.58]( 0.09)
> 4 1.00 [ 0.00]( 0.86) 0.99 [ -0.52]( 0.42)
> 8 1.00 [ 0.00]( 0.22) 0.98 [ -1.77]( 0.65)
> 16 1.00 [ 0.00]( 1.99) 1.00 [ -0.10]( 1.55)
> 32 1.00 [ 0.00]( 4.29) 0.98 [ -1.73]( 1.55)
> 64 1.00 [ 0.00]( 1.71) 0.97 [ -2.77]( 3.74)
> 128 1.00 [ 0.00]( 0.65) 1.00 [ -0.14]( 0.88)
> 256 1.00 [ 0.00]( 0.19) 0.97 [ -2.65]( 0.49)
> 512 1.00 [ 0.00]( 0.20) 0.99 [ -1.10]( 0.33)
> 1024 1.00 [ 0.00]( 0.29) 0.99 [ -0.70]( 0.16)
>
>
> ==================================================================
> Test : stream-10
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: base[pct imp](CV) SIS_CACHE[pct imp](CV)
> Copy 1.00 [ 0.00]( 4.32) 0.90 [ -9.82](10.72)
> Scale 1.00 [ 0.00]( 5.21) 1.01 [ 0.59]( 1.83)
> Add 1.00 [ 0.00]( 6.25) 0.99 [ -0.91]( 4.49)
> Triad 1.00 [ 0.00](10.74) 1.02 [ 2.28]( 6.07)
>
>
> ==================================================================
> Test : stream-100
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: base[pct imp](CV) SIS_CACHE[pct imp](CV)
> Copy 1.00 [ 0.00]( 0.70) 0.98 [ -1.79]( 2.26)
> Scale 1.00 [ 0.00]( 6.55) 1.03 [ 2.80]( 0.74)
> Add 1.00 [ 0.00]( 6.53) 1.02 [ 2.05]( 1.82)
> Triad 1.00 [ 0.00]( 6.66) 1.04 [ 3.54]( 1.04)
>
>
> ==================================================================
> Test : netperf
> Units : Normalized Throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: base[pct imp](CV) SIS_CACHE[pct imp](CV)
> 1-clients 1.00 [ 0.00]( 0.46) 0.99 [ -0.55]( 0.49)
> 2-clients 1.00 [ 0.00]( 0.38) 0.99 [ -1.23]( 1.19)
> 4-clients 1.00 [ 0.00]( 0.72) 0.98 [ -1.91]( 1.21)
> 8-clients 1.00 [ 0.00]( 0.98) 0.98 [ -1.61]( 1.08)
> 16-clients 1.00 [ 0.00]( 0.70) 0.98 [ -1.80]( 1.04)
> 32-clients 1.00 [ 0.00]( 0.74) 0.98 [ -1.55]( 1.20)
> 64-clients 1.00 [ 0.00]( 2.24) 1.00 [ -0.04]( 2.77)
> 128-clients 1.00 [ 0.00]( 1.72) 1.03 [ 3.22]( 1.99)
> 256-clients 1.00 [ 0.00]( 4.44) 0.99 [ -1.33]( 4.71)
> 512-clients 1.00 [ 0.00](52.42) 0.98 [ -1.61](52.72)
>
>
> ==================================================================
> Test : schbench (old)
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: base[pct imp](CV) SIS_CACHE[pct imp](CV)
> 1 1.00 [ -0.00]( 2.28) 0.96 [ 4.00](15.68)
> 2 1.00 [ -0.00]( 6.42) 1.00 [ -0.00](10.96)
> 4 1.00 [ -0.00]( 3.77) 0.97 [ 3.33]( 7.61)
> 8 1.00 [ -0.00](13.83) 1.08 [ -7.89]( 2.86)
> 16 1.00 [ -0.00]( 4.37) 1.00 [ -0.00]( 2.13)
> 32 1.00 [ -0.00]( 8.69) 0.95 [ 4.94]( 2.73)
> 64 1.00 [ -0.00]( 2.30) 1.05 [ -5.13]( 1.26)
> 128 1.00 [ -0.00](12.12) 1.03 [ -3.41]( 5.08)
> 256 1.00 [ -0.00](26.04) 0.91 [ 8.88]( 2.59)
> 512 1.00 [ -0.00]( 5.62) 0.97 [ 3.32]( 0.37)
>
>
> ==================================================================
> Test : Unixbench
> Units : Various, Throughput
> Interpretation: Higher is better
> Statistic : AMean, Hmean (Specified)
> ==================================================================
> Metric variant base SIS_CACHE
> Hmean unixbench-dhry2reg-1 41248390.97 ( 0.00%) 41485503.82 ( 0.57%)
> Hmean unixbench-dhry2reg-512 6239969914.15 ( 0.00%) 6233919689.40 ( -0.10%)
> Amean unixbench-syscall-1 2968518.27 ( 0.00%) 2841236.43 * 4.29%*
> Amean unixbench-syscall-512 7790656.20 ( 0.00%) 7631558.00 * 2.04%*
> Hmean unixbench-pipe-1 2535689.01 ( 0.00%) 2598208.16 * 2.47%*
> Hmean unixbench-pipe-512 361385055.25 ( 0.00%) 368566373.76 * 1.99%*
> Hmean unixbench-spawn-1 4506.26 ( 0.00%) 4551.67 ( 1.01%)
> Hmean unixbench-spawn-512 69380.09 ( 0.00%) 69264.30 ( -0.17%)
> Hmean unixbench-execl-1 3824.57 ( 0.00%) 3822.67 ( -0.05%)
> Hmean unixbench-execl-512 12288.64 ( 0.00%) 11728.12 ( -4.56%)
>
>
> ==================================================================
> Test : ycsb-mongodb
> Units : Throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> base : 309589.33 (var: 1.41%)
> SIS_CACHE : 304931.33 (var: 1.29%) [diff: -1.50%]
>
>
> ==================================================================
> Test : DeathStarBench
> Units : Normalized Throughput, relative to base
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Pinning base SIS_CACHE
> 1 CCD 100% 99.18% [%diff: -0.82%]
> 2 CCD 100% 97.46% [%diff: -2.54%]
> 4 CCD 100% 97.22% [%diff: -2.78%]
> 8 CCD 100% 99.01% [%diff: -0.99%]
>
> --
>
> Regression observed could either be because of the larger search time to
> find a non cache-hot idle CPU, or perhaps just the larger search time in
> general adding to utilization and curbing the SIS_UTIL limits further.

Yeah, that is possible. You also mentioned that we should consider a
cache-hot idle CPU if we cannot find any cache-cold idle CPU; that
might be a better choice than forcibly putting the wakee on the current
CPU, which causes task stacking.
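
To make that fallback concrete, here is a rough user-space sketch, not the
patch itself. struct cpu_state and pick_idle_cpu() are hypothetical names,
and a CPU is treated as cache-hot while 'now' is still before its
cache_hot_timeout. The scan returns the first cache-cold idle CPU, but
remembers the first cache-hot idle one and uses it if nothing better turns
up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU state for this sketch; not the kernel's struct rq. */
struct cpu_state {
	bool idle;
	uint64_t cache_hot_timeout;	/* reservation expiry, same clock as 'now' */
};

/*
 * Return the first idle CPU whose reservation has expired. If every idle
 * CPU is still cache-hot, fall back to the first cache-hot idle CPU.
 * Return -1 only when no CPU is idle at all.
 */
static int pick_idle_cpu(const struct cpu_state *cpus, size_t nr, uint64_t now)
{
	int hot_fallback = -1;

	assert(cpus || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (!cpus[i].idle)
			continue;
		if (now >= cpus[i].cache_hot_timeout)
			return (int)i;		/* cache-cold idle CPU: take it */
		if (hot_fallback < 0)
			hot_fallback = (int)i;	/* remember it, keep searching */
	}

	return hot_fallback;
}
```

The cost is that the scan no longer stops early when only cache-hot idle
CPUs exist, so this would still need to respect whatever search limit
SIS_UTIL imposes.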

> I'll go gather some stats to back my suspicion (particularly for
> hackbench).
>

Thanks!
Chenyu