[RFC] sched/deadline: only mark active cpu as free
From: Doug Berger
Date: Fri Jan 10 2025 - 18:30:51 EST
There is a hazard in the deadline scheduler where an offlined CPU
can have its free_cpus bit left set in the def_root_domain when
the schedutil cpufreq governor is used. This can allow a deadline
thread to be pushed to the runqueue of a powered down CPU which
breaks scheduling.
This commit works around the issue by only setting the free_cpus
bit for a CPU when it is "active". It is likely that the ordering
of sched_set_rq_online() and set_cpu_active() at the end of the
sched_cpu_deactivate() function should be revisited if this
approach has merit.
Signed-off-by: Doug Berger <opendmb@xxxxxxxxx>
---
Coffee is recommended before proceeding.
While stress testing CPU hotplug on a quad-core arm64 architecture
system I encountered a deadlock. My specific deadlock appears to be
dependent on the system having three or more cores and using the
sched-util cpufreq governor which uses a deadline scheduled thread
named "sugov:n" where n is the CPU number.
The scenario I observe is as follows:
Initially, CPU0 and CPU1 are active and CPU2 and CPU3 have been
previously offlined so their runqueues are attached to the
def_root_domain.
1) A hot plug is initiated on CPU2.
2) The cpuhp/2 thread invokes the cpufreq governor driver during
the CPUHP_AP_ONLINE_DYN step.
3) The sched util cpufreq governor creates the "sugov:2" thread to
execute on CPU2 with the deadline scheduler.
4) The deadline scheduler clears the free_cpus mask for CPU2 within
the def_root_domain when "sugov:2" is scheduled.
5) When the "sugov:2" thread blocks, cpudl_clear() gets called to
clear the deadline which sets the free_cpus mask for CPU2 within
the def_root_domain.
6) When cpuhp/2 reaches the CPUHP_AP_ACTIVE step a new scheduling
domain is created to include CPU0, CPU1, and CPU2.
o detach_destroy_domains() invokes rq_attach_root() for CPU0 and
CPU1 which offlines their runqueues and detaches their current
dynamic scheduling domain (clearing their deadline free_cpus
bits there) and attaches the def_root_domain and onlines their
runqueus (setting their deadline free_cpus bits there).
o build_sched_domains() invokes rq_attach_root() for CPU0, CPU1,
and CPU2.
- Since only CPU0 and CPU1 are online in the def_root_domain
set_rq_offline() is only called for them to offline their
runqueues and detach the def_root_domain (clearing their
deadline free_cpus bits there).
- The free_cpus bit for CPU2 in def_root_domain is allowed to
remain set.
- The newly created dynamic scheduling domain is attached to
CPU0, CPU1, and CPU2 runqueues and set_rq_online() is used
to online their runqueues (setting their deadline free_cpus
bits there).
7) The cpuhp/2 thread also invokes sched_set_rq_online() in the
CPUHP_AP_ACTIVE step, but since the runqueues are already online
essentially nothing happens.
8) Some time later CPU2 is hot unplugged.
9) At the CPUHP_AP_ACTIVE step, cpuhp/2 marks CPU2 not active and
invokes balance_push_set() for CPU2 which migrates "sugov:2" to
a different CPU through fallback.
10) Also at this step, cpuhp/2 invokes sched_set_rq_offline() for
CPU2 which takes its runqueue offline and clears its deadline
free_cpus bit in the current dynamic scheduling domain.
11) Also at this step, cpuhp/2 updates the scheduling domain to
remove CPU2.
o detach_destroy_domains() invokes rq_attach_root() for CPU0,
CPU1, and CPU2 to move them back to the def_root_domain.
- Since only CPU0 and CPU1 are online in the current dynamic
scheduling domain (CPU2 was removed at 10 above),
set_rq_offline() is only called for them to clear their
deadline free_cpus bits.
- The def_root_domain is attached to CPU0, CPU1, and CPU2
runqueues and since only CPU0 and CPU1 are marked active
set_rq_online() is used to online their runqueues (setting
their deadline free_cpus bits there).
- The free_cpus bit for CPU2 in def_root_domain is allowed
to remain set.
o build_sched_domains() invokes rq_attach_root() for CPU0 and
CPU1 which offlines their runqueues (clearing their deadline
free_cpus bits in def_root_domain), attaches a new dynamic
scheduling domain, and onlines their runqueus (setting their
deadline free_cpus bits there).
12) The cpuhp/2 thread invokes the cpufreq governor driver during
the CPUHP_AP_ONLINE_DYN step which attempts to stop the
"sugov:2" kthread by calling kthread_flush_worker() followed by
kthread_stop().
13) The "sugov:2" thread can likely be successfully deadline
scheduled on CPU0 or CPU1 to allow the cpuhp/2 thread to
complete offlining CPU2 and power it off.
14) Some time later CPU1 is hot unplugged.
15) At the CPUHP_AP_ACTIVE step, cpuhp/1 marks CPU1 not active and
invokes balance_push_set() for CPU1 which migrates "sugov:1" to
CPU0 through fallback.
16) Also at this step, cpuhp/1 invokes sched_set_rq_offline() for
CPU1 which takes its runqueue offline and clears its deadline
free_cpus bit in the current dynamic scheduling domain.
17) Also at this step, cpuhp/1 updates the scheduling domain to
remove CPU1.
o detach_destroy_domains() invokes rq_attach_root() for CPU0
and CPU1 to move them back to the def_root_domain.
- Since only CPU0 is online in the current dynamic scheduling
domain (CPU1 was removed at 16 above), set_rq_offline() is
only called for it to clear its deadline free_cpus bit.
- The def_root_domain is attached to CPU0 and CPU1 runqueues
and since only CPU0 is marked active set_rq_online() is
used to online its runqueue (setting its deadline free_cpus
bits there).
- The free_cpus bit for CPU1 is untouched in def_root_domain.
- The free_cpus bit for CPU2 in def_root_domain remains set
from the preceeding sequence.
18) If CPU0 executes the "sugov:0" deadline thread at this time it
may see that the "sugov:1" deadline thread is also on its
runqueue and may call push_dl_task() to attempt to push it to a
different CPU.
19) The effort to find a later runqueue will find the stale
free_cpus bit of CPU2 in the currently attached def_root_domain
and will migrate the "sugov:1" thread to the runqueue of the
powered down CPU2 where it can never get scheduled.
20) The cpuhp/1 thread invokes the cpufreq governor driver during
the CPUHP_AP_ONLINE_DYN step which attempts to stop the
"sugov:1" kthread by calling kthread_flush_worker() followed by
kthread_stop(). Since "sugov:1" never gets scheduled cpuhp/1
remains blocked on completion events.
Steps 1-13 amount to setting a trap by allowing a free_cpus bit in
the deadline scheduler def_root_domain to remain set for a CPU that
is powered off. The trap can be sprung during the narrow timing
hazard when the def_root_domain is transitionally attached while
changing scheduling domains if the deadline scheduler pushes a
queued task to the powered off CPU.
This problem appears to have been initially introduced by commit
120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control") which
moved the set_rq_offline() handling from sched_cpu_dying() to
sched_cpu_deactivate(). The original sequence allowed the free_cpus
bit to be forcibly cleared in the def_root_domains after all of the
scheduler dust settled. The new location makes the
sched_set_rq_offline() essentially meaningless for the deadline
scheduler since the managing of changed scheduling domains happens
later.
There are likely many different approaches to address this issue
and I'm hopeful that somone more familiar with the scheduler than
I can propose a better solution than the one suggested here.
Thank you for reading this far. Any advice is appreciated.
-Doug
kernel/sched/cpudeadline.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 95baa12a1029..6896bbe0e9ae 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -195,7 +195,8 @@ void cpudl_clear(struct cpudl *cp, int cpu)
cp->elements[cpu].idx = IDX_INVALID;
cpudl_heapify(cp, old_idx);
- cpumask_set_cpu(cpu, cp->free_cpus);
+ if (cpu_active(cpu))
+ cpumask_set_cpu(cpu, cp->free_cpus);
}
raw_spin_unlock_irqrestore(&cp->lock, flags);
}
--
2.34.1