Re: [PATCH] dl_server: Reset DL server params when rd changes

From: Waiman Long
Date: Wed Nov 06 2024 - 13:06:06 EST


On 11/6/24 11:08 AM, Juri Lelli wrote:
> On 04/11/24 17:41, Joel Fernandes wrote:
> > On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote:
> ...
>
> > > I added a printk in __dl_server_attach_root which is called after the
> > > dynamic rd is built to transfer bandwidth to it.
> > >
> > > __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair
> > > server interface"), do you have this change in your backport?
> >
> > You nailed it! Our 5.15 backport appears to be slightly older and is missing
> > this from topology.c as you mentioned. Thanks for clarifying!
> >
> >         /*
> >          * Because the rq is not a task, dl_add_task_root_domain() did not
> >          * move the fair server bw to the rd if it already started.
> >          * Add it now.
> >          */
> >         if (rq->fair_server.dl_server)
> >                 __dl_server_attach_root(&rq->fair_server, rq);
> >
> > So if rd changes during boot initialization, the correct dl_bw has to be
> > updated AFAICS. Also if cpusets are used, the rd for a CPU may change.
> >
> > > cpusets changes are something that I still need to double check. Will
> > > do.
> >
> > Sounds good, that would be good to verify.
> So, I played a little bit with it and came up with a simple set of ops
> that point out an issue (default fedora server install):
>
> echo Y >/sys/kernel/debug/sched/verbose
>
> echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control
>
> echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus
> echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive
> echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition
>
> The domains are rebuilt correctly, but we end up with a null total_bw.
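
A quick way to see that symptom directly, as a rough sketch assuming
CONFIG_SCHED_DEBUG is enabled and debugfs is mounted at /sys/kernel/debug:
the dl_rq section of the sched debug file prints dl_bw->bw and
dl_bw->total_bw for each CPU's root domain, so something like

# dump the dl_rq entries for all CPUs; dl_bw->total_bw staying at 0
# after the writes above matches the null total_bw described here
grep -A 3 'dl_rq\[' /sys/kernel/debug/sched/debug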

> The conditional call above correctly takes care of adding back the
> dl_server per-rq bandwidth when we pass from the single domain to the 2
> exclusive ones, but I noticed that we go through
> partition_sched_domains_locked() twice for a single write of 'root' and
> the second one, since it's not actually destroying/rebuilding anything,
> is resetting total_bw without adding the dl_server contribution back.

> Now, not completely sure why we need to go through
> partition_sched_domains_locked() twice, as we have (it also looked like
> a pattern from other call paths)
>
> update_prstate()
> -> update_cpumasks_hier()
>    -> rebuild_sched_domains_locked()    <- right at the end
> -> update_partition_sd_lb()
>    -> rebuild_sched_domains_locked()    <- right after the above call
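
One way to confirm the double traversal from userspace, as a rough sketch
assuming CONFIG_FUNCTION_TRACER is enabled and partition_sched_domains_locked
shows up in available_filter_functions (i.e. it has not been inlined away):

cd /sys/kernel/tracing
echo member > /sys/fs/cgroup/user.slice/cpuset.cpus.partition  # back to a non-partition state
echo partition_sched_domains_locked > set_ftrace_filter
echo function > current_tracer
echo 1 > tracing_on
echo root > /sys/fs/cgroup/user.slice/cpuset.cpus.partition
echo 0 > tracing_on
grep -c partition_sched_domains_locked trace  # two hits for the single write would confirm it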

> Removing the first call does indeed fix the issue and the domains look
> OK, but I'm pretty sure I'm missing all sorts of details and corner
> cases.
>
> Waiman (now Cc-ed), maybe you can help here understanding why the two
> back-to-back calls are needed?

Thanks for letting me know about this case.

I am aware that rebuild_sched_domains_locked() can be called more than
once. I have addressed the hotplug case, but it can still happen in some
other corner cases. The problem with multiple rebuild_sched_domains_locked()
calls is that the intermediate ones may run while the internal state is
not yet consistent. I am going to work on a fix by making sure that
rebuild_sched_domains_locked() will only be called once.

Cheers,
Longman