Re: [PATCH] cgroup/cpuset: make DL attach bandwidth reservation root-domain aware
From: Waiman Long
Date: Thu Apr 30 2026 - 08:41:09 EST
On 4/30/26 4:53 AM, Guopeng Zhang wrote:
在 2026/4/24 22:15, Waiman Long 写道:
On 4/21/26 4:34 AM, Guopeng Zhang wrote:RCU comment at first.
cpuset_can_attach() currently sums the bandwidth of all migratingYou should make sure that cs->nr_migrate_dl_tasks is cleared too. Alternatively, you can move the dl task increment under the dl_task_needs_bw_move() check. It doesn't seem to do any harm if nr_migrate_dl_tasks is non-zero, but dl_bw is 0, but it still doesn't look right.😅 Sorry, I missed this part of your review earlier and only noticed the
SCHED_DEADLINE tasks and reserves destination bandwidth whenever the
old and new cpuset effective CPU masks do not overlap.
That condition is stronger than what the scheduler uses when migrating
a deadline task. set_cpus_allowed_dl() only subtracts bandwidth from
the source side when moving the task requires a DL bandwidth move
between root domains.
As a result, moving a deadline task between disjoint member cpusets that
still belong to the same root domain can reserve destination bandwidth
even though no matching source-side subtraction happens. Successful
back-and-forth migrations between such cpusets can monotonically
increase dl_bw->total_bw.
Fix this by extracting the source root-domain test already used by
set_cpus_allowed_dl() into a shared helper and make cpuset DL bandwidth
preallocation use that same condition. Count all migrating deadline
tasks for cpuset task accounting, but only accumulate sum_migrate_dl_bw
for tasks that actually need a DL bandwidth move. Reserve and rollback
bandwidth only for that subset.
This keeps successful attach accounting aligned with
set_cpus_allowed_dl() and avoids double-accounting within a single
root domain.
Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Signed-off-by: Guopeng Zhang <zhangguopeng@xxxxxxxxxx>
---
include/linux/sched/deadline.h | 9 +++++++++
kernel/cgroup/cpuset-internal.h | 1 +
kernel/cgroup/cpuset.c | 34 ++++++++++++++++-----------------
kernel/sched/deadline.c | 14 +++++++++++---
4 files changed, 38 insertions(+), 20 deletions(-)
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 1198138cb839..273538200a44 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -33,6 +33,15 @@ struct root_domain;
extern void dl_add_task_root_domain(struct task_struct *p);
extern void dl_clear_root_domain(struct root_domain *rd);
extern void dl_clear_root_domain_cpu(int cpu);
+/*
+ * Return whether moving DL task @p to @new_mask requires moving DL
+ * bandwidth accounting between root domains. This helper is specific to
+ * DL bandwidth move accounting semantics and is shared by
+ * cpuset_can_attach() and set_cpus_allowed_dl() so both paths use the
+ * same source root-domain test.
+ */
+extern bool dl_task_needs_bw_move(struct task_struct *p,
+ const struct cpumask *new_mask);
extern u64 dl_cookie;
extern bool dl_bw_visited(int cpu, u64 cookie);
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index bb4e692bea30..f7aaf01f7cd5 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -167,6 +167,7 @@ struct cpuset {
*/
int nr_deadline_tasks;
int nr_migrate_dl_tasks;
+ /* DL bandwidth that needs destination reservation for this attach. */
u64 sum_migrate_dl_bw;
/*
* CPU used for temporary DL bandwidth allocation during attach;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e3a081a07c6d..761098b45f23 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2993,7 +2993,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
struct cpuset *cs, *oldcs;
struct task_struct *task;
bool setsched_check;
- int ret;
+ int cpu, ret;
/* used later by cpuset_attach() */
cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
@@ -3039,31 +3039,31 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
if (dl_task(task)) {
cs->nr_migrate_dl_tasks++;
- cs->sum_migrate_dl_bw += task->dl.dl_bw;
+
+ if (dl_task_needs_bw_move(task, cs->effective_cpus))
+ cs->sum_migrate_dl_bw += task->dl.dl_bw;
}
}
- if (!cs->nr_migrate_dl_tasks)
+ if (!cs->sum_migrate_dl_bw)
goto out_success;
For nr_migrate_dl_tasks, my understanding is that it should remain
separate from sum_migrate_dl_bw.
nr_migrate_dl_tasks is used for cpuset-level DL task accounting, while
sum_migrate_dl_bw is used for destination root-domain bandwidth
reservation. A DL task may move between cpusets without requiring a
root-domain bandwidth move, but cpuset_attach() still needs to update
the cpuset DL task counts for that task.
So, unless I am missing another use of nr_migrate_dl_tasks, moving
nr_migrate_dl_tasks++ under the dl_task_needs_bw_move() check does not
look like the right fix to me. It could miss DL tasks which still need
cpuset DL task accounting updates, even though no DL bandwidth
reservation is needed.
Please let me know if I am missing something here, or if there is a
better way to represent this state in the code.
You are right. Cpuset structure has a nr_deadline_tasks count that requires proper accounting of nr_migrate_dl_tasks to set this field correctly.
Cheers,
Longman
In v2, I plan to keep nr_migrate_dl_tasks counting all migrating DL
tasks, and only restrict sum_migrate_dl_bw to tasks that actually need
destination root-domain bandwidth reservation. I will also add a separate
first patch to reset the pending DL migration state
(nr_migrate_dl_tasks, sum_migrate_dl_bw and dl_bw_cpu) on
cpuset_can_attach() internal failure paths, so these temporary states
will not leak.
I still need to do some testing after my leave, and will send v2 after
that.
Thanks,
Guopeng