Re: [PATCH] dl_server: Reset DL server params when rd changes

From: Juri Lelli
Date: Mon Nov 11 2024 - 07:24:32 EST


On 11/11/24 09:37, Juri Lelli wrote:
> On 09/11/24 13:18, Waiman Long wrote:
> > On 11/8/24 10:30 PM, Waiman Long wrote:
> > > I have the patchset to enforce that rebuild_sched_domains_locked() will
> > > only be called at most once per cpuset operation.
> > >
> > > By adding some debug code to further study the null total_bw issue when
> > > cpuset.cpus.partition is being changed, I found that eliminating the
> > > redundant rebuild_sched_domains_locked() calls reduced the chance of
> > > hitting a null total_bw, but it did not eliminate it. Running my cpuset
> > > test script, I hit 250 cases of null total_bw with the v6.12-rc6 kernel.
> > > With my new cpuset patch applied, that goes down to 120 cases of null
> > > total_bw.
> > >
> > > I will try to look further for the exact condition that triggers null
> > > total_bw generation.
> >
> > After further testing, the 120 cases of null total_bw can be classified into
> > the following 3 categories.
> >
> > 1) 51 cases when an isolated partition with isolated CPUs is created.
> > Isolated CPUs are not subject to scheduling, so a total_bw of 0 is fine
> > and not really a problem.
> >
> > 2) 67 cases when nested partitions are being removed (A1 - A2). This is
> > probably caused by some kind of race condition. If I insert an artificial
> > delay between the removal of A2 and A1, total_bw is fine. If there is no
> > delay, I can see a null total_bw. That shouldn't really be a problem in
> > practice, though we may still need to figure out why.
> >
> > 3) Two cases where a null total_bw is seen when a new partition is
> > created by moving all the CPUs of the parent cgroup into the new
> > partition, so that the parent becomes a null partition with no CPUs. The
> > following example illustrates the steps.
> >
> > #!/bin/bash
> > CGRP=/sys/fs/cgroup
> > cd $CGRP
> > echo +cpuset > cgroup.subtree_control
> > mkdir A1
> > cd A1
> > echo 0-1 > cpuset.cpus
> > echo root > cpuset.cpus.partition
> > echo "A1 partition"
> > echo +cpuset > cgroup.subtree_control
> > mkdir A2
> > cd A2
> > echo 0-1 > cpuset.cpus
> > echo root > cpuset.cpus.partition
> > echo "A2 partition"
> > cd ..
> > echo "Remove A2"
> > rmdir A2
> > cd ..
> > echo "Remove A1"
> > rmdir A1
> >
> > In this corner case, there is actually no change in the set of sched
> > domains: the sched domain for CPUs 0-1 is simply moved from partition A1
> > to A2, and back again when A2 is removed. This is similar to calling
> > rebuild_sched_domains_locked() twice with the same input. I believe that
> > is the condition that causes the null total_bw.
> >
> > Now the question is why the deadline code behaves this way. It is probably a
> > bug that needs to be addressed.
>
> Thanks for the analysis (and the patches). Will take a look, but I
> suspect it might be because, in case domains are not destroyed, we clear
> them up (total_bw set to 0), but we don't add the fair server
> contribution back, since rqs are not re-attached to the domains (as they
> have already been attached when such domains were created).
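
To expand a bit on that: when a domain survives the rebuild, the cpuset
side only re-adds the tasks' contributions via dl_add_task_root_domain();
I don't see anything on that path re-accounting the fair servers. Roughly,
from current mainline (so modulo renames and details):

/* kernel/cgroup/cpuset.c */
static void dl_update_tasks_root_domain(struct cpuset *cs)
{
	struct css_task_iter it;
	struct task_struct *task;

	if (cs->nr_deadline_tasks == 0)
		return;

	css_task_iter_start(&cs->css, 0, &it);

	/* tasks only: the fair servers' bandwidth is never re-added here */
	while ((task = css_task_iter_next(&it)))
		dl_add_task_root_domain(task);

	css_task_iter_end(&it);
}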

OK, I'm thinking maybe something like the below might help. It seems to
do well with your test above; does it maybe fix the other cases you
mentioned as well?
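
For context, this is the path in partition_sched_domains_locked() that
your A1/A2 case ends up in when the resulting set of domains is unchanged
(trimmed from current mainline): the domain is kept, dl_clear_root_domain()
zeroes total_bw, and nothing on this path re-adds the fair servers'
bandwidth.

	/* Destroy deleted domains: */
	for (i = 0; i < ndoms_cur; i++) {
		for (j = 0; j < n && !new_topology; j++) {
			if (cpumask_equal(doms_cur[i], doms_new[j]) &&
			    dattrs_equal(dattr_cur, i, dattr_new, j)) {
				struct root_domain *rd;

				/* domain is kept: only clear the accounting */
				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
				dl_clear_root_domain(rd);
				goto match1;
			}
		}
		/* No match - a current sched domain not in new doms_new[] */
		detach_destroy_domains(doms_cur[i]);
match1:
		;
	}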

---
 include/linux/sched/deadline.h |  1 +
 kernel/sched/deadline.c        | 12 ++++++++++++
 kernel/sched/topology.c        |  9 ++++++---
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 3a912ab42bb5..a74fb0da0fa3 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -34,6 +34,7 @@ static inline bool dl_time_before(u64 a, u64 b)
 struct root_domain;
 extern void dl_add_task_root_domain(struct task_struct *p);
 extern void dl_clear_root_domain(struct root_domain *rd);
+extern void dl_restore_server_root_domain(struct root_domain *rd);
 
 #endif /* CONFIG_SMP */

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7884e566557c..acbd3039215f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2981,6 +2981,18 @@ void dl_add_task_root_domain(struct task_struct *p)
 	task_rq_unlock(rq, p, &rf);
 }
 
+void dl_restore_server_root_domain(struct root_domain *rd) {
+	int i;
+
+	guard(raw_spinlock_irqsave)(&rd->dl_bw.lock);
+	for_each_cpu(i, rd->span) {
+		struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
+
+		if (dl_server(dl_se))
+			rd->dl_bw.total_bw += dl_se->dl_bw;
+	}
+}
+
 void dl_clear_root_domain(struct root_domain *rd)
 {
 	unsigned long flags;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..a0fcae07585e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2721,12 +2721,15 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],

 				/*
 				 * This domain won't be destroyed and as such
-				 * its dl_bw->total_bw needs to be cleared. It
-				 * will be recomputed in function
-				 * update_tasks_root_domain().
+				 * its dl_bw->total_bw needs to be cleared. The
+				 * tasks' contribution will then be recomputed in
+				 * dl_update_tasks_root_domain(), and the
+				 * dl_servers' contribution in
+				 * dl_restore_server_root_domain().
 				 */
 				rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
 				dl_clear_root_domain(rd);
+				dl_restore_server_root_domain(rd);
 				goto match1;
 			}
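
FWIW, a roughly equivalent alternative (untested, just a sketch) would be
folding the restore directly into dl_clear_root_domain(), so callers cannot
forget to pair the two:

void dl_clear_root_domain(struct root_domain *rd)
{
	int i;

	guard(raw_spinlock_irqsave)(&rd->dl_bw.lock);
	rd->dl_bw.total_bw = 0;

	/*
	 * dl_servers are not tasks, so the cpuset task walk does not
	 * restore their contribution; re-add it here whenever the
	 * accounting of a surviving root domain is reset.
	 */
	for_each_cpu(i, rd->span) {
		struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;

		if (dl_server(dl_se))
			rd->dl_bw.total_bw += dl_se->dl_bw;
	}
}

But the explicit helper above is probably clearer for now.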
}