Re: [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
From: Waiman Long
Date: Tue Mar 03 2026 - 11:06:35 EST
On 2/26/26 11:06 AM, Frederic Weisbecker wrote:
> On Sat, Feb 21, 2026 at 01:54:17PM -0500, Waiman Long wrote:
>> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
>> for instance, when an isolated partition is invalidated because its
>> last active CPU has been put offline.
>>
>> As we are going to enable dynamic update to the nohz_full housekeeping
>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
>> allowing the CPU hotplug path to call into housekeeping_update() directly
>> from update_isolation_cpumasks() will likely cause deadlock.
>
> I am a bit confused here. Why would the CPU hotplug path need to call
> update_isolation_cpumasks() -> housekeeping_update() for
> HK_TYPE_KERNEL_NOISE?

Oh, this is not the current behavior. However, to make nohz_full fully
dynamically changeable in the near future, we will have to do that
eventually.

>> So we have to defer any call to housekeeping_update() until after the
>> CPU hotplug operation has finished. This is now done via the workqueue
>> where the update_hk_sched_domains() function will be invoked via the
>> hk_sd_workfn().
>>
>> A concurrent cpuset control file write may have executed the required
>> update_hk_sched_domains() function before the work function is called. So
>> the work function call may become a no-op when it is invoked.
>>
>> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
>> ---
>>  kernel/cgroup/cpuset.c                        | 31 ++++++++++++++++---
>>  .../selftests/cgroup/test_cpuset_prs.sh       | 11 ++++++-
>>  2 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 3d0d18bf182f..2c80bfc30bbc 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
>>  		rebuild_sched_domains_locked();
>>  }
>>
>> +/*
>> + * Work function to invoke update_hk_sched_domains()
>> + */
>> +static void hk_sd_workfn(struct work_struct *work)
>> +{
>> +	cpuset_full_lock();
>> +	update_hk_sched_domains();
>> +	cpuset_full_unlock();
>> +}
>> +
>>  /**
>>   * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
>>   * @parent: Parent cpuset containing all siblings
>> @@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>>   */
>>  static void cpuset_handle_hotplug(void)
>>  {
>> +	static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
>>  	static cpumask_t new_cpus;
>>  	static nodemask_t new_mems;
>>  	bool cpus_updated, mems_updated;
>> @@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
>>  	}
>> -	if (update_housekeeping || force_sd_rebuild) {
>> -		mutex_lock(&cpuset_mutex);
>> -		update_hk_sched_domains();
>> -		mutex_unlock(&cpuset_mutex);
>> -	}
>> +	/*
>> +	 * Queue a work to call housekeeping_update() & rebuild_sched_domains()
>> +	 * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
>> +	 * cpumask can correctly reflect what is in isolated_cpus.
>> +	 *
>> +	 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
>> +	 * is still pending. Before the pending bit is cleared, the work data
>> +	 * is copied out and work item dequeued. So it is possible to queue
>> +	 * the work again before the hk_sd_workfn() is invoked to process the
>> +	 * previously queued work. Since hk_sd_workfn() doesn't use the work
>> +	 * item at all, this is not a problem.
>> +	 */
>> +	if (update_housekeeping || force_sd_rebuild)
>> +		queue_work(system_unbound_wq, &hk_sd_work);
>
> Nit about recent wq renames:
>
> s/system_unbound_wq/system_dfl_wq

Good point. Will send an additional patch to do the rename.

> But what makes sure this work is executed by the end of the hotplug operations?
> Is there a risk for a stale hierarchy to be observed when it shouldn't? Or a
> stale housekeeping cpumask?

If you look at the work function, it will make a copy of the HK_TYPE_DOMAIN
cpumask while holding rcu_read_lock(). So the current hotplug operation
must have finished at that point. Of course, if there is another
hot-add/remove operation right after the rcu_read_lock() is released, the
cpumask passed down to housekeeping_update() may not be the latest one. In
this case, another work item will be scheduled to call
housekeeping_update() again with the new cpumask.
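
To illustrate the queueing behavior discussed above, here is a minimal
single-threaded userspace sketch. It is not kernel code and does not use the
real workqueue API. The names sim_work, sim_queue_work(), sim_run_pending(),
update_needed and updates_done are hypothetical stand-ins made up for this
example. It only models two points: a work item that is still pending is not
queued a second time, and a work function that looks at shared state rather
than at the work item can safely turn into a no-op.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a work item: a pending flag plus a callback. */
struct sim_work {
	atomic_bool pending;
	void (*fn)(struct sim_work *w);
};

/* Shared state: set by the "hotplug" side, consumed by the work function. */
static bool update_needed;
static int updates_done;

/*
 * Modeled on the idea behind hk_sd_workfn(): the work item itself is not
 * used, only the shared state decides whether there is anything to do.
 */
static void sim_workfn(struct sim_work *w)
{
	(void)w;
	if (!update_needed)
		return;		/* somebody else already did the update */
	update_needed = false;
	updates_done++;
}

/* Returns false if the item was already pending, so it is not requeued. */
static bool sim_queue_work(struct sim_work *w)
{
	return !atomic_exchange(&w->pending, true);
}

/*
 * Worker side: the pending flag is cleared before the callback runs, so
 * the item may be queued again while the callback is still executing.
 */
static void sim_run_pending(struct sim_work *w)
{
	if (atomic_exchange(&w->pending, false))
		w->fn(w);
}
```

In this model, two back-to-back sim_queue_work() calls result in a single
callback run, and a callback that runs after the update was already done
elsewhere simply returns without touching anything.
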
Cheers,
Longman