Re: [PATCH] cgroup/cpuset: Support multiple source/destination cpusets using pids pattern

From: Waiman Long

Date: Fri Jun 05 2026 - 13:23:03 EST


On 6/5/26 3:35 AM, Ridong Chen wrote:

On 6/4/2026 2:47 AM, Waiman Long wrote:
On 6/3/26 6:26 AM, Ridong Chen wrote:
The current cpuset_can_attach() and cpuset_attach() functions assume task
migration is from one source cpuset to one destination cpuset. This can be
wrong in several scenarios:
  - Moving a multi-threaded process with threads in different cpusets
  - Disabling the cpuset controller (many children to one parent)
  - Enabling the cpuset controller (one parent to many children)

Fix this by adopting the pids subsystem's per-task accounting pattern.
In cpuset_can_attach(), use task_cs(task) to get the correct source cpuset
for each task (like pids_can_attach uses task_css), adjust
nr_deadline_tasks and reserve DL bandwidth per-task, and increment
attach_in_progress per-task on the destination cpuset. In cpuset_attach(),
handle destination cpuset changes within the task iteration loop.

A shared helper cpuset_undo_attach() reverses the per-task operations for
both partial rollback in cpuset_can_attach() and full reversal in
cpuset_cancel_attach().

When multiple source cpusets are detected in can_attach(), set
attach_many_sources so that cpuset_attach() forces cpus_updated and
mems_updated to true, ensuring all tasks get properly updated regardless
of which source cpuset cpuset_attach_old_cs points to.

This eliminates the need for nr_migrate_dl_tasks, sum_migrate_dl_bw, and
dl_bw_cpu fields in struct cpuset.

Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
Signed-off-by: Ridong Chen <ridong.chen@xxxxxxxxx>
It is not a problem doing per-task DL BW allocation and eliminating the
*dl_bw* fields. However, updating nr_deadline_tasks before it is
committed can be problematic.

Good to hear that.

nr_deadline_tasks is used in dl_rebuild_rd_accounting(), which is called
by partition_sched_domains_locked(). After cpuset_mutex is released at the
end of cpuset_can_attach() and before cpuset_attach() or
cpuset_cancel_attach() is called, it is possible that
partition_sched_domains_locked() gets called and
dl_rebuild_rd_accounting() does not get the right DL BW accounting
information. So unless there is a way to confirm that this situation
cannot happen, we can't change nr_deadline_tasks before the attach is
committed.
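
To make the window described above concrete, here is a minimal userspace
sketch, not kernel code. struct toy_cpuset and every toy_* function are
hypothetical stand-ins; toy_rebuild_view() only models the fact that a
rebuild reads nr_deadline_tasks. It contrasts updating the counter in the
can_attach step with staging the change and applying it at commit time.

```c
#include <assert.h>

/* Hypothetical stand-in for struct cpuset; only the two counters matter. */
struct toy_cpuset {
	int nr_deadline_tasks;   /* committed count, read by a rebuild */
	int nr_migrate_dl_tasks; /* staged count, pending commit or cancel */
};

/* Models what a rebuild running inside the window would read. */
int toy_rebuild_view(const struct toy_cpuset *cs)
{
	return cs->nr_deadline_tasks;
}

/* Early update: the can_attach step moves the count right away. */
void toy_can_attach_early(struct toy_cpuset *src, struct toy_cpuset *dst)
{
	src->nr_deadline_tasks--;
	dst->nr_deadline_tasks++;
}

/* Staged update: the can_attach step only records the pending move. */
void toy_can_attach_staged(struct toy_cpuset *dst)
{
	dst->nr_migrate_dl_tasks++;
}

/* Commit step: apply the staged count to both sides, then clear it. */
void toy_attach_commit(struct toy_cpuset *src, struct toy_cpuset *dst)
{
	src->nr_deadline_tasks -= dst->nr_migrate_dl_tasks;
	dst->nr_deadline_tasks += dst->nr_migrate_dl_tasks;
	dst->nr_migrate_dl_tasks = 0;
}

/* Cancel step: drop the staged count; committed counts were never touched. */
void toy_attach_cancel(struct toy_cpuset *dst)
{
	dst->nr_migrate_dl_tasks = 0;
}
```

With the early variant, a rebuild inside the window already sees the task
on the destination even though the attach may still be cancelled. With the
staged variant, the committed counts stay unchanged until the commit step.
Note that this toy commit step assumes a single source cpuset.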

We can keep the nr_migrate_dl_tasks field and update nr_deadline_tasks
once migration is complete. I think this will be much simpler than
fixing the issue using lists.

But we still need to track the set of source and destination cpusets in
order to commit or cancel the change. Doing it task-by-task will add code
to cpuset_attach() and cpuset_cancel_attach() to check whether a task is a
DL task and act accordingly. So we are just trading task-by-task code for
code that handles the lists.
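
As a rough illustration of the list-based alternative, the userspace
sketch below tracks every cpuset touched by a migration and applies or
drops the staged DL task deltas in one pass. It is not kernel code: struct
toy_cs, struct toy_attach_set, the fixed-size array, and all toy_*
functions are hypothetical, and a real implementation would presumably use
kernel list primitives under cpuset_mutex.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CS 8 /* arbitrary capacity for this sketch */

/* Hypothetical stand-in for struct cpuset. */
struct toy_cs {
	int nr_deadline_tasks; /* committed count */
	int pending_dl_delta;  /* staged change: < 0 on sources, > 0 on destinations */
	int tracked;           /* nonzero once the cpuset is in the set */
};

/* The set of source and destination cpusets touched by one migration. */
struct toy_attach_set {
	struct toy_cs *cs[TOY_MAX_CS];
	size_t nr;
};

/* Add cs to the set once; returns -1 if the set is full. */
int toy_track(struct toy_attach_set *set, struct toy_cs *cs)
{
	if (cs->tracked)
		return 0;
	if (set->nr == TOY_MAX_CS)
		return -1;
	cs->tracked = 1;
	set->cs[set->nr++] = cs;
	return 0;
}

/* Stage the move of one DL task from src to dst without committing it. */
int toy_stage_dl_task(struct toy_attach_set *set,
		      struct toy_cs *src, struct toy_cs *dst)
{
	/* On failure a cpuset may stay tracked with a zero delta; harmless. */
	if (toy_track(set, src) || toy_track(set, dst))
		return -1;
	src->pending_dl_delta--;
	dst->pending_dl_delta++;
	return 0;
}

/* Walk the set once: apply the deltas on commit, drop them on cancel. */
void toy_finish(struct toy_attach_set *set, int commit)
{
	for (size_t i = 0; i < set->nr; i++) {
		struct toy_cs *cs = set->cs[i];

		if (commit)
			cs->nr_deadline_tasks += cs->pending_dl_delta;
		cs->pending_dl_delta = 0;
		cs->tracked = 0;
	}
	set->nr = 0;
}
```

In this sketch the commit and cancel paths never look at individual tasks;
the per-task work happens only while staging, which is the trade-off
described above.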

Cheers,
Longman