Re: [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback

From: Guopeng Zhang

Date: Tue Apr 21 2026 - 04:56:16 EST

On 2026/4/20 10:31, Waiman Long wrote:
> On 4/19/26 10:21 PM, Guopeng Zhang wrote:
>>
>> On 2026/4/18 2:51, Waiman Long wrote:
>>> On 4/16/26 11:37 PM, Guopeng Zhang wrote:
>>>> cpuset_can_attach() allocates DL bandwidth only when migrating
>>>> deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach()
>>>> rolls back based only on nr_migrate_dl_tasks. This makes the DL
>>>> bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free()
>>>> even when no dl_bw_alloc() was done.
>>>>
>>>> Rollback also needs to undo the reservation against the same CPU/root
>>>> domain that was charged. Record the CPU used by dl_bw_alloc() and use
>>>> that state in cpuset_cancel_attach(). If no allocation happened,
>>>> dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation
>>>> did happen, bandwidth is returned to the same CPU/root domain.
>>>>
>>>> Successful attach paths are unchanged. This only fixes failed attach
>>>> rollback accounting.
>>>>
>>>> Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
>>>> Signed-off-by: Guopeng Zhang <zhangguopeng@xxxxxxxxxx>
>> ...
>>> The patch looks correct to me.
>>>
>>> Reviewed-by: Waiman Long <longman@xxxxxxxxxx>
>> Hi Waiman,
>>
>> Thank you for the review and for the Reviewed-by.
>>> However, I have a DL bandwidth accounting question unrelated to this patch that I would like the scheduler people to clarify. The allocation of additional DL BW is based on the condition
>>>
>>>          if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)) {
>>>
>>> IOW, additional DL BW will need to be allocated when the old and new cpusets don't overlap. However, they could still be in the same root domain. Does that mean we will be double counting it?
>> I think you are right to call this out. Looking at the
>> current logic, !cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)
>> does not obviously guarantee that the migration is crossing into a different
>> root domain. If the old and new cpusets are disjoint but still belong to the
>> same root domain, it does look possible that we reserve bandwidth on the
>> destination side without a corresponding subtraction from the source side.
>> I will try to reproduce that configuration and follow up with results.
Hi Waiman,

I reproduced the issue you pointed out, and the result does support
your concern.

I also tested the follow-up fix here:
https://lore.kernel.org/all/20260421083449.95750-1-zhangguopeng@xxxxxxxxxx/

I tested two cases:
1. disjoint member cpusets that still belong to the same root domain
2. disjoint partition-root cpusets that genuinely cross root domains

The results are consistent with both the bug and the fix.

Case 1: disjoint member cpusets
Setup:
src: cpuset.cpus = 1-15
dst: cpuset.cpus = 0
both remained "member"

Without the fix, successful back-and-forth migration of the same
SCHED_DEADLINE task caused dl_bw->total_bw on CPU0 to increase
monotonically:
BW0 = 2027221
BW1 = 2376746
BW2 = 2726271

So:
BW1 - BW0 = 349525
BW2 - BW0 = 699050

That is, after src -> dst, dl_bw->total_bw increased by 349525, and after
dst -> src it increased by exactly the same amount again instead of
returning to the original value. Each delta equals the task's DL bandwidth
(its utilization in the kernel's <<BW_SHIFT, i.e. <<20, fixed-point units),
so every such migration leaks one task's worth of bandwidth.

With the fix applied, the same reproducer no longer shows any net
increase, while the attach path still succeeds:

BW0 = 2027221
BW1 = 2027221
BW2 = 2027221

So:
BW1 - BW0 = 0
BW2 - BW0 = 0

Case 2: disjoint partition-root cpusets (true cross-root-domain move)
I also tested a configuration where src and dst are separate partition
roots:
src: cpuset.cpus = 0-6, cpuset.cpus.partition = root
dst: cpuset.cpus = 8-14, cpuset.cpus.partition = root

Then I started the DL task in src and migrated it to dst.

The accounting moved as expected:
Before src -> dst:
SRC0 = 1083517
DST0 = 733992

After src -> dst:
SRC1 = 733992
DST1 = 1083517

Deltas:
SRC delta after src -> dst = -349525
DST delta after src -> dst = +349525

After moving the same task back to src:
SRC2 = 1083517
DST2 = 733992

So both sides returned to baseline:
SRC2 - SRC0 = 0
DST2 - DST0 = 0

So with the fix applied, the same-root-domain case no longer leaves stale
DL bandwidth behind, while the true cross-root-domain case still transfers
the accounting as expected.

Shortened reproducers and observed values are below.

Case 1: disjoint member cpusets

echo "+cpu +cpuset" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/dl-rd-test
echo 0-15 > /sys/fs/cgroup/dl-rd-test/cpuset.cpus
echo 0 > /sys/fs/cgroup/dl-rd-test/cpuset.mems
echo "+cpu +cpuset" > /sys/fs/cgroup/dl-rd-test/cgroup.subtree_control

mkdir -p /sys/fs/cgroup/dl-rd-test/src
mkdir -p /sys/fs/cgroup/dl-rd-test/dst
echo 1-15 > /sys/fs/cgroup/dl-rd-test/src/cpuset.cpus
echo 0 > /sys/fs/cgroup/dl-rd-test/src/cpuset.mems
echo 0 > /sys/fs/cgroup/dl-rd-test/dst/cpuset.cpus
echo 0 > /sys/fs/cgroup/dl-rd-test/dst/cpuset.mems
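
# Sanity check (my addition, not part of the original steps): the two
# effective masks should come back disjoint
cat /sys/fs/cgroup/dl-rd-test/src/cpuset.cpus.effective   # expect 1-15
cat /sys/fs/cgroup/dl-rd-test/dst/cpuset.cpus.effective   # expect 0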

/tmp/dl_test &
PID=$!
echo $PID > /sys/fs/cgroup/dl-rd-test/src/cgroup.procs
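
/tmp/dl_test is just a long-running SCHED_DEADLINE task. If you do not
have a custom binary at hand, an equivalent task can be started with
util-linux chrt; the 10ms/30ms parameters below are my assumption, chosen
because (10ms << 20) / 30ms = 349525, which matches the per-migration
delta reported above:

# hypothetical stand-in for /tmp/dl_test (~1/3 utilization DL task)
chrt -d --sched-runtime 10000000 --sched-deadline 30000000 \
     --sched-period 30000000 0 sleep infinity &
PID=$!
echo $PID > /sys/fs/cgroup/dl-rd-test/src/cgroup.procs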

# read BW0 from cpu0 dl_bw->total_bw
# move src -> dst
# read BW1
# move dst -> src
# read BW2
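
For completeness, here is one way to script the read/move sequence. The
read_bw() helper is my sketch, not part of the original test; it assumes
CONFIG_SCHED_DEBUG and debugfs mounted at /sys/kernel/debug (older kernels
expose the same data through /proc/sched_debug). Under SMP the dl_rq
section prints the root domain's dl_bw, so reading CPU 0's entry gives the
root domain covering CPU 0:

read_bw() {
    awk -v hdr="dl_rq[$1]:" '$1 == hdr { in_rq = 1 }
        in_rq && /dl_bw->total_bw/ { print $3; exit }' \
        /sys/kernel/debug/sched/debug
}

BW0=$(read_bw 0)
echo $PID > /sys/fs/cgroup/dl-rd-test/dst/cgroup.procs
BW1=$(read_bw 0)
echo $PID > /sys/fs/cgroup/dl-rd-test/src/cgroup.procs
BW2=$(read_bw 0)
echo "BW0=$BW0 BW1=$BW1 BW2=$BW2"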

Observed without fix:
BW0=2027221
BW1=2376746
BW2=2726271

Observed with fix:
BW0=2027221
BW1=2027221
BW2=2027221

Case 2: disjoint partition-root cpusets

echo "+cpu +cpuset" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/dl-rd-part-test
echo 0-15 > /sys/fs/cgroup/dl-rd-part-test/cpuset.cpus
echo 0 > /sys/fs/cgroup/dl-rd-part-test/cpuset.mems
echo 0-15 > /sys/fs/cgroup/dl-rd-part-test/cpuset.cpus.exclusive
echo "+cpu +cpuset" > /sys/fs/cgroup/dl-rd-part-test/cgroup.subtree_control

mkdir -p /sys/fs/cgroup/dl-rd-part-test/src
echo 0-6 > /sys/fs/cgroup/dl-rd-part-test/src/cpuset.cpus
echo 0 > /sys/fs/cgroup/dl-rd-part-test/src/cpuset.mems
echo 0-6 > /sys/fs/cgroup/dl-rd-part-test/src/cpuset.cpus.exclusive
echo root > /sys/fs/cgroup/dl-rd-part-test/src/cpuset.cpus.partition

mkdir -p /sys/fs/cgroup/dl-rd-part-test/dst
echo 8-14 > /sys/fs/cgroup/dl-rd-part-test/dst/cpuset.cpus
echo 0 > /sys/fs/cgroup/dl-rd-part-test/dst/cpuset.mems
echo 8-14 > /sys/fs/cgroup/dl-rd-part-test/dst/cpuset.cpus.exclusive
echo root > /sys/fs/cgroup/dl-rd-part-test/dst/cpuset.cpus.partition
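
# Sanity check (my addition): both partitions should be valid
cat /sys/fs/cgroup/dl-rd-part-test/src/cpuset.cpus.partition   # expect "root"
cat /sys/fs/cgroup/dl-rd-part-test/dst/cpuset.cpus.partition   # expect "root"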

sh -c 'echo $$ > /sys/fs/cgroup/dl-rd-part-test/src/cgroup.procs; exec /tmp/dl_test' &
PID=$!

# read source-side and destination-side dl_bw->total_bw
# move src -> dst
# read both again
# move dst -> src
# read both again
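
Scripted with the same read_bw() helper sketched in Case 1 (CPU 0 lies in
src's root domain, CPU 8 in dst's):

sleep 1   # give the task time to land in src
SRC0=$(read_bw 0); DST0=$(read_bw 8)
echo $PID > /sys/fs/cgroup/dl-rd-part-test/dst/cgroup.procs
SRC1=$(read_bw 0); DST1=$(read_bw 8)
echo $PID > /sys/fs/cgroup/dl-rd-part-test/src/cgroup.procs
SRC2=$(read_bw 0); DST2=$(read_bw 8)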

Observed with fix:
SRC0=1083517
DST0=733992
SRC1=733992
DST1=1083517
SRC2=1083517
DST2=733992

This matches the intended behavior: no persistent increase within one
root domain, and correct bandwidth transfer across root domains.

Cheers,
Guopeng

>>> Looking from the other side, the root domain may have enough DL BW for the task migration, but the subset of CPUs in the cpuset itself may not have enough total DL BW to host all the DL tasks to be migrated. Is that a problem?
>> My current understanding is that the DL bandwidth accounting is done at
>> root-domain granularity, not at arbitrary cpuset-subset granularity.
> That is my understanding too.
>> That also seems consistent with
>> Documentation/scheduler/sched-deadline.rst, which says that deadline tasks
>> cannot have a CPU affinity mask smaller than the root domain they are created
>> on, and that a restricted CPU set should be achieved by creating a restricted
>> root domain with cpuset.
>
> A root domain should be created by creating a cpuset root partition for v2 or by using the cpuset.cpu_exclusive flag in v1.
>
> What is listed in the documentation is the ideal case, but users may not strictly follow the rule.
>
> Cheers,
> Longman
>
>>
>> So if a cpuset is only a subset inside a larger root domain, it does not seem
>> to get an independent DL bandwidth limit of its own. If that understanding is
>> correct, then the smaller cpuset not having enough bandwidth by itself would
>> be a limitation of that model rather than something this code checks
>> separately. I'd appreciate confirmation from the scheduler folks on that
>> point.
>>
>> Thanks,
>> Guopeng
>>> Cheers,
>>> Longman