Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup

From: Li, Aubrey
Date: Thu Dec 10 2020 - 07:26:40 EST


On 2020/12/10 19:34, Mel Gorman wrote:
> On Thu, Dec 10, 2020 at 04:23:47PM +0800, Li, Aubrey wrote:
>>> I ran this patch with tbench on top of of the schedstat patches that
>>> track SIS efficiency. The tracking adds overhead so it's not a perfect
>>> performance comparison but the expectation would be that the patch reduces
>>> the number of runqueues that are scanned
>>
>> Thanks for the measurement! I don't play with tbench so may need a while
>> to digest the data.
>>
>
> They key point is that it appears the idle mask was mostly equivalent to
> the full domain mask, at least for this test.

I'm more interested in how tbench with heavy load behaves.
If the load is heavy enough and idle thread has no chance to switch in,
idle cpumask will be empty at the first scheduler tick and remain empty
before the load comes down, during this period of heavy load:
- default select_idle_cpu still scan the entire sched domain(or throttled to
4) everytime
- the patched select_idle_cpu does not scan at all
>
>>>
>>> tbench4
>>> 5.10.0-rc6 5.10.0-rc6
>>> schedstat-v1r1 idlemask-v7r1
>>> Hmean 1 504.76 ( 0.00%) 500.14 * -0.91%*
>>> Hmean 2 1001.22 ( 0.00%) 970.37 * -3.08%*
>>> Hmean 4 1930.56 ( 0.00%) 1880.96 * -2.57%*
>>> Hmean 8 3688.05 ( 0.00%) 3537.72 * -4.08%*
>>> Hmean 16 6352.71 ( 0.00%) 6439.53 * 1.37%*
>>> Hmean 32 10066.37 ( 0.00%) 10124.65 * 0.58%*
>>> Hmean 64 12846.32 ( 0.00%) 11627.27 * -9.49%*
>>> Hmean 128 22278.41 ( 0.00%) 22304.33 * 0.12%*
>>> Hmean 256 21455.52 ( 0.00%) 20900.13 * -2.59%*
>>> Hmean 320 21802.38 ( 0.00%) 21928.81 * 0.58%*
>>>
>>> Not very optimistic result. The schedstats indicate;
>>
>> How many client threads was the following schedstats collected?
>>
>
> That's the overall summary for all client counts. While proc-schedstat
> was measured every few seconds over all client counts, presenting that
> in text format is not easy to parse. However, looking at the graphs over
> time, it did not appear that scan rates were consistently lower for any
> client count for tbench.
>
>>>
>>> 5.10.0-rc6 5.10.0-rc6
>>> schedstat-v1r1 idlemask-v7r1
>>> Ops TTWU Count 5599714302.00 5589495123.00
>>> Ops TTWU Local 2687713250.00 2563662550.00
>>> Ops SIS Search 5596677950.00 5586381168.00
>>> Ops SIS Domain Search 3268344934.00 3229088045.00
>>> Ops SIS Scanned 15909069113.00 16568899405.00
>>> Ops SIS Domain Scanned 13580736097.00 14211606282.00
>>> Ops SIS Failures 2944874939.00 2843113421.00
>>> Ops SIS Core Search 262853975.00 311781774.00
>>> Ops SIS Core Hit 185189656.00 216097102.00
>>> Ops SIS Core Miss 77664319.00 95684672.00
>>> Ops SIS Recent Used Hit 124265515.00 146021086.00
>>> Ops SIS Recent Used Miss 338142547.00 403547579.00
>>> Ops SIS Recent Attempts 462408062.00 549568665.00
>>> Ops SIS Search Efficiency 35.18 33.72
>>> Ops SIS Domain Search Eff 24.07 22.72
>>> Ops SIS Fast Success Rate 41.60 42.20
>>> Ops SIS Success Rate 47.38 49.11
>>> Ops SIS Recent Success Rate 26.87 26.57
>>>
>>> The field I would expect to decrease is SIS Domain Scanned -- the number
>>> of runqueues that were examined but it's actually worse and graphing over
>>> time shows it's worse for the client thread counts. select_idle_cpu()
>>> is definitely being called because "Domain Search" is 10 times higher than
>>> "Core Search" and there "Core Miss" is non-zero.
>>
>> Why SIS Domain Scanned can be decreased?
>>
>
> Because if idle CPUs are being targetted and its a subset of the entire
> domain then it follows that fewer runqueues should be examined when
> scanning the domain.

Sorry, I probably messed up "SIS Domain Scanned" and "SIS Domain Search".
How is "SIS Domain Scanned" calculated?

>
>> I thought SIS Scanned was supposed to be decreased but it seems not on your side.
>>
>
> It *should* have been decreased but it's indicating that more runqueues
> were scanned with the patch. It should be noted that the scan count is
> naturally variable because the depth of each individual search is variable.

I'm confused here, I saw 4 places to increase sis_scanned, so what's the difference
between "SIS Domain Scanned" and "SIS scanned"?

select_idle_sibling(struct task_struct *p, int prev, int target)
---->schedstat_inc(this_rq()->sis_scanned);

select_idle_core(struct task_struct *p, struct sched_domain *sd, int
---->schedstat_inc(this_rq()->sis_scanned);

static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
---->schedstat_inc(this_rq()->sis_scanned);

select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
---->schedstat_inc(this_rq()->sis_scanned);

>
>> I printed some trace log on my side by uperf workload, and it looks properly.
>> To make the log easy to read, I started a 4 VCPU VM to run 2-second uperf 8 threads.
>>
>> stage 1: system idle, update_idle_cpumask is called from idle thread, set cpumask to 0-3
>> ========================================================================================
>> <idle>-0 [002] d..1 137.408681: update_idle_cpumask: set_idle-1, cpumask: 2
>> <idle>-0 [000] d..1 137.408713: update_idle_cpumask: set_idle-1, cpumask: 0,2
>> <idle>-0 [003] d..1 137.408924: update_idle_cpumask: set_idle-1, cpumask: 0,2-3
>> <idle>-0 [001] d..1 137.409035: update_idle_cpumask: set_idle-1, cpumask: 0-3
>>
>
> What's the size of the LLC domain on this machine? If it's 4 then this
> is indicating that there is little difference between scanning the full
> domain and targetting idle CPUs via the idle cpumask.

But idle cpumask saves 4 loops of scanning runqueue everytime select_idle_cpu().

>
>> stage 3: uperf running, select_idle_cpu scan all the CPUs in the scheduler domain at the beginning
>> ===================================================================================================
>> uperf-560 [000] d..3 138.418494: select_task_rq_fair: scanning: 0-3
>> uperf-560 [000] d..3 138.418506: select_task_rq_fair: scanning: 0-3
>> uperf-560 [000] d..3 138.418514: select_task_rq_fair: scanning: 0-3
>> uperf-560 [000] dN.3 138.418534: select_task_rq_fair: scanning: 0-3
>> uperf-560 [000] dN.3 138.418543: select_task_rq_fair: scanning: 0-3
>> uperf-560 [000] dN.3 138.418551: select_task_rq_fair: scanning: 0-3
>> uperf-561 [003] d..3 138.418577: select_task_rq_fair: scanning: 0-3
>> uperf-561 [003] d..3 138.418600: select_task_rq_fair: scanning: 0-3
>> uperf-561 [003] d..3 138.418617: select_task_rq_fair: scanning: 0-3
>> uperf-561 [003] d..3 138.418640: select_task_rq_fair: scanning: 0-3
>> uperf-561 [003] d..3 138.418652: select_task_rq_fair: scanning: 0-3
>> uperf-561 [003] d..3 138.418662: select_task_rq_fair: scanning: 0-3
>> uperf-561 [003] d..3 138.418672: select_task_rq_fair: scanning: 0-3
>> uperf-560 [000] d..5 138.418676: select_task_rq_fair: scanning: 0-3
>> uperf-561 [003] d..3 138.418693: select_task_rq_fair: scanning: 0-3
>> kworker/u8:3-110 [002] d..4 138.418746: select_task_rq_fair: scanning: 0-3
>>
>> stage 4: scheduler tick comes, update idle cpumask to EMPTY
>> ============================================================
>> uperf-572 [002] d.h. 139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
>> uperf-574 [000] d.H2 139.420568: update_idle_cpumask: set_idle-0, cpumask: 1,3
>> uperf-565 [003] d.H6 139.420569: update_idle_cpumask: set_idle-0, cpumask: 1
>> tmux: server-528 [001] d.h2 139.420572: update_idle_cpumask: set_idle-0, cpumask:
>>
>
> It's the timing of the clear that may be problematic. For the
> configurations I use, the tick is every 4 milliseconds which is a very
> long time in the scheduler for communicating workloads. A simple
> round-trip of perf pipe where tasks will be entering/exiting rapidly is
> just 0.004 milliseconds on my desktop.

Agreed, but that's not the case this patch targeted. The target of this
patch is the system load is heavy enough so that idle thread has no chance
to be scheduled.

>
>>>
>>> I suspect the issue is that the mask is only marked busy from the tick
>>> context which is a very wide window. If select_idle_cpu() picks an idle
>>> CPU from the mask, it's still marked as idle in the mask.
>>>
>> That should be fine because we still check available_idle_cpu() and sched_idle_cpu for the selected
>> CPU. And even if that CPU is marked as idle in the mask, that mask should not be worse(bigger) than
>> the default sched_domain_span(sd).
>>
>
> I agree that it should never be worse/bigger but if the idle cpu mask is
> often the same bits as the sched_domain_span then it's not going to be
> any better either.
>

At least on my side, v5 is easier to see the benefit, as rapidly entering
and exiting idle will make idle governor to retain the tick, v5 uses stop_tick
signal from idle governor to decide if this CPU is added to idle cpumask.
https://lore.kernel.org/lkml/20201118043113.53128-1-aubrey.li@xxxxxxxxxxxxxxx/

But as Vincent and Peter concerned, v5 may break the latency sensitive workload.
Also the short idle definition need more work.

Thanks,
-Aubrey