Re: [PATCH v2 3/2] sched/deadline: Check bandwidth overflow earlier for hotplug
From: Dietmar Eggemann
Date: Tue Feb 18 2025 - 09:13:51 EST
On 18/02/2025 10:58, Juri Lelli wrote:
> Hi!
>
> On 17/02/25 17:08, Juri Lelli wrote:
>> On 14/02/25 10:05, Jon Hunter wrote:
>
> ...
>
>> At this point I believe you triggered suspend.
>>
>>> [ 57.290150] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
>>> [ 57.335619] tegra-xusb 3530000.usb: Firmware timestamp: 2020-07-06 13:39:28 UTC
>>> [ 57.353364] dwc-eth-dwmac 2490000.ethernet eth0: Link is Down
>>> [ 57.397022] Disabling non-boot CPUs ...
>>
>> Offlining CPU5.
>>
>>> [ 57.400904] dl_bw_manage: cpu=5 cap=3072 fair_server_bw=52428 total_bw=209712 dl_bw_cpus=4 type=DYN span=0,3-5
>>> [ 57.400949] CPU0 attaching NULL sched-domain.
>>> [ 57.415298] span=1-2
>>> [ 57.417483] __dl_sub: cpus=3 tsk_bw=52428 total_bw=157284 span=0,3-5 type=DYN
>>> [ 57.417487] __dl_server_detach_root: cpu=0 rd_span=0,3-5 total_bw=157284
>>> [ 57.417496] rq_attach_root: cpu=0 old_span=NULL new_span=1-2
>>> [ 57.417501] __dl_add: cpus=3 tsk_bw=52428 total_bw=157284 span=0-2 type=DEF
>>> [ 57.417504] __dl_server_attach_root: cpu=0 rd_span=0-2 total_bw=157284
>>> [ 57.417507] CPU3 attaching NULL sched-domain.
>>> [ 57.454804] span=0-2
>>> [ 57.456987] __dl_sub: cpus=2 tsk_bw=52428 total_bw=104856 span=3-5 type=DYN
>>> [ 57.456990] __dl_server_detach_root: cpu=3 rd_span=3-5 total_bw=104856
>>> [ 57.456998] rq_attach_root: cpu=3 old_span=NULL new_span=0-2
>>> [ 57.457000] __dl_add: cpus=4 tsk_bw=52428 total_bw=209712 span=0-3 type=DEF
>>> [ 57.457003] __dl_server_attach_root: cpu=3 rd_span=0-3 total_bw=209712
>>> [ 57.457006] CPU4 attaching NULL sched-domain.
>>> [ 57.493964] span=0-3
>>> [ 57.496152] __dl_sub: cpus=1 tsk_bw=52428 total_bw=52428 span=4-5 type=DYN
>>> [ 57.496156] __dl_server_detach_root: cpu=4 rd_span=4-5 total_bw=52428
>>> [ 57.496162] rq_attach_root: cpu=4 old_span=NULL new_span=0-3
>>> [ 57.496165] __dl_add: cpus=5 tsk_bw=52428 total_bw=262140 span=0-4 type=DEF
>>> [ 57.496168] __dl_server_attach_root: cpu=4 rd_span=0-4 total_bw=262140
>>> [ 57.496171] CPU5 attaching NULL sched-domain.
>>> [ 57.532952] span=0-4
>>> [ 57.535143] rq_attach_root: cpu=5 old_span= new_span=0-4
>>> [ 57.535147] __dl_add: cpus=5 tsk_bw=52428 total_bw=314568 span=0-5 type=DEF
>>
>> Maybe we shouldn't add the dl_server contribution of a CPU that is going
>> to be offline.
>
> I tried to implement this idea and ended up with the following. As usual
> also pushed it to the branch on github. Could you please update and
> re-test?
>
> Another thing that I noticed is that in my case a hotplug operation
> generating a sched/root domain rebuild ends up calling
> dl_rebuild_rd_accounting() (from partition_and_rebuild_sched_domains()),
> which resets accounting for def and dyn domains. In your case (looking
> again at the last dmesg you shared) I don't see this call, so I wonder if,
> for some reason related to your setup, we do the rebuild by calling
> partition_sched_domains() (instead of partition_and_rebuild_sched_domains())
> and this doesn't call dl_rebuild_rd_accounting() after
> partition_sched_domains_locked() - maybe it should? Dietmar, Christian,
> Peter, what do you think?

Yeah, looks like suspend/resume behaves differently compared to CPU hotplug.

On my Juno [L b b L L L]
                ^^^
                isolcpus=[2,3]

# ps2 | grep DLN
98 98 S 140 0 - DLN sugov:0
99 99 S 140 0 - DLN sugov:1
# taskset -p 98; taskset -p 99
pid 98's current affinity mask: 39
pid 99's current affinity mask: 6
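
For reference, taskset prints the affinity as a hex bitmask with bit N
standing for CPU N. A minimal Python sketch for decoding the two masks
above (the helper name cpus_in_mask is made up for illustration):

```python
def cpus_in_mask(hex_mask: str) -> list[int]:
    """Return the CPU indices whose bits are set in a taskset-style hex mask."""
    mask = int(hex_mask, 16)
    # Bit N set means the task may run on CPU N.
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


print(cpus_in_mask("39"))  # mask reported for pid 98 (sugov:0)
print(cpus_in_mask("6"))   # mask reported for pid 99 (sugov:1)
```

So 0x39 covers CPUs 0,3-5 and 0x6 covers CPUs 1-2, which lines up with the
little and big CPUs in the [L b b L L L] layout.
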
[ 87.679282] partition_sched_domains() called
...
[ 87.684013] partition_sched_domains() called
...
[ 87.687961] partition_sched_domains() called
...
[ 87.689419] psci: CPU3 killed (polled 0 ms)
[ 87.689715] __dl_bw_capacity() mask=2-5 cap=1024
[ 87.689739] dl_bw_cpus() cpu=6 rd->span=2-5 cpu_active_mask=0-2 cpus=1
[ 87.689757] dl_bw_manage: cpu=2 cap=0 fair_server_bw=52428 total_bw=209712 dl_bw_cpus=1 type=DEF span=2-5
[ 87.689775] dl_bw_cpus() cpu=6 rd->span=2-5 cpu_active_mask=0-2 cpus=1
[ 87.689789] dl_bw_manage() cpu=2 cap=0 overflow=1 return=-16
[ 87.689864] Error taking CPU2 down: -16 <-- !!!
...
[ 87.690674] partition_sched_domains() called
...
[ 87.691496] partition_sched_domains() called
...
[ 87.693702] partition_sched_domains() called
...
[ 87.695819] partition_and_rebuild_sched_domains() called
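
To make the numbers in the log easier to follow, here is a rough Python
model of the bandwidth arithmetic. It is only a sketch, not the kernel code:
it assumes BW_SHIFT=20, SCHED_CAPACITY_SHIFT=10, a fair server of 50ms
runtime per 1s period and the default 95% RT limit, and the function names
(overflows, try_offline) are made up for illustration. The comparison is
loosely modelled on the __dl_overflow() style check.

```python
BW_SHIFT = 20              # assumed fixed-point shift for bandwidth values
SCHED_CAPACITY_SHIFT = 10  # assumed: capacity 1024 == one full CPU
EBUSY = 16


def to_ratio(period, runtime):
    """Bandwidth as a fixed-point fraction runtime/period."""
    return (runtime << BW_SHIFT) // period


def cap_scale(bw, cap):
    """Scale the per-CPU bandwidth limit by the remaining capacity."""
    return (bw * cap) >> SCHED_CAPACITY_SHIFT


def overflows(max_bw, cap, total_bw, old_bw=0):
    """True if what stays allocated no longer fits in the scaled limit."""
    return cap_scale(max_bw, cap) < total_bw - old_bw


def try_offline(max_bw, cap, total_bw, fair_server_bw):
    """Hypothetical wrapper: 0 on success, -EBUSY on overflow."""
    return -EBUSY if overflows(max_bw, cap, total_bw, fair_server_bw) else 0


fair_server_bw = to_ratio(1_000_000_000, 50_000_000)  # assumed 50ms / 1s
max_bw = to_ratio(1_000_000, 950_000)                 # assumed 95% limit
total_bw = 4 * fair_server_bw                         # four fair servers

print(fair_server_bw, total_bw)
print(try_offline(max_bw, 0, total_bw, fair_server_bw))     # cap=0 as logged
print(try_offline(max_bw, 1024, total_bw, fair_server_bw))  # one CPU left
```

With these assumptions fair_server_bw comes out as 52428 and four of them
add up to the total_bw=209712 seen above. Once cap is 0, any non-zero
total_bw left on the root domain makes the check fail, which matches the
return=-16 in the log; with cap=1024 the same total_bw would fit.
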